Information Integration (academic year 2013/2014)

This is one of the sections of the course Elective in Software and Services. The lectures of this section were held in February-May 2014.

Who this course is for. This 3-credit course is one of the sections of the course Elective in Software and Services for the students of the Master in Computer Engineering (School of Engineering) of Sapienza Università di Roma.
Prerequisites. A good knowledge of the fundamentals of Programming Structures, Programming Languages, Databases (SQL, relational data model, Entity-Relationship data model, conceptual and logical database design) and Database Systems, as well as a basic knowledge of Mathematical Logic, is required.
Course goals. Information integration is the problem of combining data residing at different sources and providing the user with a unified view of these data. The problem of designing information integration systems is important in current real-world applications, and is characterized by a number of issues that are interesting from both a theoretical and a practical point of view. In recent years, there has been a large amount of research on data integration, and a precise, clear picture of a systematic approach to this problem is now available. This section will present an overview of the research carried out in the area of data integration, with emphasis on the theoretical results that are relevant for the development of information integration solutions. Special attention will be devoted to the following aspects: architectures for information integration, modeling an information integration application, ontology-based data access and integration, processing queries in information integration, data exchange, and reasoning on queries.

  • News
    • March 20, 2015: In April, the exam will be held on April 21, at 4pm, in room B217 in via Ariosto 25. Students who want to take the exam on that date should send a message to Prof. Lenzerini.
    • March 12, 2014: Students are invited to download the slides from the course page in Moodle.
  • Topics covered
    • Architectures for information integration
    • Distributed data management
    • Data federation
    • Data exchange and data warehousing
    • ETL (Extraction, Transformation and Loading), data cleaning and data reconciliation
    • Data integration
    • Ontology-based data integration
  • Teaching material
    • Please visit the course page in Moodle to download all the course material and to participate in the Forum. The password for accessing the Moodle pages of this course will be communicated to the students during the lectures.
    • Before the beginning of the lectures, students are invited to (re)study the basic notions of propositional and first-order logic. For this purpose, students may use the material they used in previous courses, or have a look at:
  • Slides

    The lecture notes can be downloaded from the course page in Moodle.

  • Exams

    For the exam, each student is expected to do one of the following things:

    • Study a tool for data integration, data federation, or data exchange, and then give a presentation (in English) that describes the characteristics of the tool, discusses the position of the tool in the spectrum of information integration principles illustrated in the course, and includes a demo of the tool. For an overview of the available tools for data integration, the student should search the web. Here is an incomplete list of possible tools: Karma, IBM InfoSphere, Oracle Data Integrator, CloverETL, Pentaho, Teiid, Talend, Jitterbit, Adeptia, OpenRefine, etc.
    • Choose a set of data sources with data relevant to a certain phenomenon (for example, open data published online, or data taken from a database or an xls file known to the student), and develop a data integration or data exchange application using these data sources (with any tool selected by the student). This work can be carried out in a group of at most two students.
    • Study a paper on information integration, and then discuss the paper in a 15-minute presentation (in English), again including a part that positions the work within the spectrum of the principles illustrated in the course. Here is a (non-exhaustive) list of papers that can be considered (use Google to find and download the papers):
      • 1. Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Moshe Y. Vardi. Rewriting of Regular Expressions and Regular Path Queries. In J. Comput. Syst. Sci. 64(3):443-465, 2002
      • 2. Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, Divesh Srivastava. Answering Queries Using Views. PODS 1995: 95-104
      • 3. Rachel Pottinger, Alon Halevy. MiniCon: A scalable algorithm for answering queries using views. The VLDB Journal, Volume 10, Issue 2-3 (September 2001)
      • 4. Oliver M. Duschka, Michael R. Genesereth, Alon Y. Levy. Recursive Query Plans for Data Integration. J. Log. Program. 43(1): 49-73 (2000)
      • 5. Philippe Adjiman, Philippe Chatalic, François Goasdoué, Marie-Christine Rousset, Laurent Simon. Distributed Reasoning in a Peer-to-Peer Setting: Application to the Semantic Web. Journal of Artificial Intelligence Research (JAIR) 25: 269-314 (2006)
      • 6. Xin Luna Dong, Alon Y. Halevy, Cong Yu. Data integration with uncertainty. VLDB J. 18(2): 469-500 (2009)
      • 7. Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, Wang Chiew Tan. Composing schema mappings: Second-order dependencies to the rescue. ACM Trans. Database Syst. 30(4): 994-1055 (2005)
      • 8. Andrea Calì, Domenico Lembo, Riccardo Rosati. On the decidability and complexity of query answering over inconsistent and incomplete databases. PODS 2003: 260-271
      • 9. Jens Bleiholder, Felix Naumann. Data fusion. ACM Comput. Surv. 41(1): (2008)
      • 10. Marcelo Arenas, Leopoldo E. Bertossi, Jan Chomicki. Consistent Query Answers in Inconsistent Databases. PODS 1999: 68-79
      • 11. George Konstantinidis, José Luis Ambite. Scalable query rewriting: a graph-based approach. SIGMOD Conference 2011
      • 12. Hector Gonzalez, Alon Y. Halevy, Christian S. Jensen, Anno Langen, Jayant Madhavan, Rebecca Shapley, Warren Shen, Jonathan Goldberg-Kidon. Google fusion tables: web-centered data management and collaboration. SIGMOD Conference 2010: 1061-1066
      • 13. Mary Roth, Wang-Chiew Tan: Data Integration and Data Exchange: It's Really About Time. CIDR 2013
      • 14. Bogdan Alexe, Balder ten Cate, Phokion G. Kolaitis, Wang Chiew Tan: Characterizing schema mappings via data examples. ACM Trans. Database Syst. 36(4): 23 (2011)
      • 15. Anastasios Kementsietsidis, Marcelo Arenas, Renée J. Miller: Mapping Data in Peer-to-Peer Systems: Semantics and Algorithmic Issues. SIGMOD Conference 2003: 325-336

    In all three cases above, once the student has chosen the topic, (s)he should send an email message to Prof. Lenzerini describing the topic, and wait for either a confirmation or a request to change the topic, in case the topic (tool, paper, or use case) is already taken.

  • Schedule of exams:

    The exam will be held in the following months:

    • First exam: June 2014
    • Second exam: July 2014
    • Third exam: September 2014
    • First special session: October 2014
    • Fourth exam: January 20, 2015, at 4pm, in via Ariosto 25
    • Fifth exam: February 24, 2015, at 4pm, in via Ariosto 25
    • Second special session: April 21, 2015, at 4pm, in via Ariosto 25

    Once the student is ready for the exam, (s)he should send an email message to Prof. Lenzerini indicating the date on which (s)he wants to take the exam.

  • Lectures
    • When: from February 24 to May 30, 2014, every Thursday, 8:30am - 10:00am, and sometimes on Wednesday, 8:30am - 10:00am (check the schedule below)
    • Where: via Ariosto 25, Roma - classroom A2
    • Schedule:

    Lectures are held on Wednesday (8:30am - 10:00am) and Thursday (8:30am - 10:00am), in classroom A2.

    Week 01 (Feb 24)
      Wednesday, Lecture 1
        - Introduction to information integration
    Week 02 (Mar 3)
      Thursday, Lectures 2, 3
        - First-order logic: syntax and semantics
    Week 03 (Mar 10)
      Wednesday, Lectures 4, 5
        - First-order logic: syntax and semantics
      Thursday, Lectures 6, 7
        - Architectures for data integration
    Week 04 (Mar 17)
      Thursday, Lectures 8, 9
        - Logical formalization of data integration
    Week 05 (Mar 24)
      Thursday, Lectures 10, 11
        - The notion of certain answers to a query
    Week 06 (Mar 31)
      Thursday, Lectures 12, 13
        - Mapping languages: GAV, LAV and GLAV
    Week 07 (Apr 7)
      Thursday, Lectures 14, 15
        - Algorithms for query answering in GAV without constraints
    Week 08 (Apr 14)
      Wednesday, Lectures 16, 17
        - Materialization-based algorithms in LAV without constraints
    Week 09 (Apr 21)
      Thursday, Lectures 18, 19
        - Materialization-based algorithms in LAV without constraints
    Week 10 (Apr 28)
      (no lectures listed)
    Week 11 (May 5)
      Thursday, Lectures 20, 21
        - Rewriting-based algorithms in LAV without constraints
    Week 12 (May 12)
      Thursday, Lectures 22, 23
        - Ontology-based data integration
    Week 13 (May 19)
      Thursday, Lectures 24, 25
        - Ontology-based data integration
        - Tools for data integration and data federation: Teiid (Hoang Trinh Nguyen)
        - Tools for data integration and data federation: the ETL system Talend (Luca Vallarelli)
    Week 14 (May 26)
      Thursday, Lectures 26, 27
        - Tools for data integration and data federation: interfacing Teiid with MongoDB (Antonio Gallo)
        - Tools for data integration and data federation: the ETL system Pentaho (Marco Console)

  • Past editions
  • Office hours. Tuesday, 5:00 pm, at the Dipartimento di Informatica e Sistemistica "Antonio Ruberti",
    via Ariosto 25, Roma, second floor, room B203 (if available) or room B217 (otherwise). Please check the
    latest news for the next office hours.