Information Integration (academic year 2019/2020)

This is one of the two sections of the course Large Scale Data Management. The lectures of this section will be held in March-May 2020.

Who this course is for. This 3-credit course is one of the two sections of the course Large Scale Data Management for the students of the Master in Engineering of Computer Science (School of Engineering) of Sapienza Università di Roma.
Prerequisites. A good knowledge of the fundamentals of Programming Structures, Programming Languages, Databases (SQL, relational data model, Entity-Relationship data model, conceptual and logical database design) and Database systems, as well as a basic knowledge of Mathematical Logic is required.
Course goals. Information integration is the problem of combining data residing at different sources and providing the user with a unified view of these data. Designing information integration systems is important in current real-world applications, and raises a number of issues that are interesting from both a theoretical and a practical point of view. In recent years there has been a huge amount of research work on data integration, and a precise, clear picture of a systematic approach to this problem is now available. This section will present an overview of the research work carried out in the area of data integration, with emphasis on the theoretical results that are relevant for the development of information integration solutions. Special attention will be devoted to the following aspects: architectures for information integration, modeling an information integration application, ontology-based data access and integration, processing queries in information integration, data exchange, and reasoning on queries.

  • News
    • February 20, 2021. Starting from the Academic Year 2020/2021, the coordinator of the course is Prof. Domenico Lembo. Please refer to him to register the exam.
    • December 27, 2020. Students who want their exam of Large-Scale Data Integration to be registered in the March-April session should book the exam in the Infostud system.
    • March 9, 2020. To allow Prof. Lenzerini to record the on-line lectures, every student who wishes to attend them should download this form, fill it in, sign it, and then send it to lenzerini[AT]
    • March 9, 2020. Starting from Thursday, March 12, 2020, at 10:00, the lectures will be streamed on the "Google Meet" platform. I invite all students to use Chrome as their browser and to make sure they have access to the "Google Meet" application. For the moment, the lectures will be held during the usual time schedule of the course. The link for joining the lectures has been posted in the MOODLE system, and the post has been sent to all students registered for the course. All the on-line lectures will be recorded, and the recordings will be available on the MOODLE page of the course.
    • March 4, 2020. Lectures in all universities in Italy are suspended. Please stay tuned for new information in the coming days.
    • February 20, 2020. The lectures will start on Thursday, February 27, 2020.
  • Topics covered (tentative)
    • Architectures for information integration
    • Distributed data management
    • Data federation
    • Data exchange and data warehousing
    • ETL (Extraction, Transformation and Loading), data cleaning and data reconciliation
    • Data integration
    • Ontology-based data integration
  • Teaching material
    • Before the beginning of the lectures, students are invited to (re)study the basic notions of propositional and first-order logic. For this purpose, students may use the material they used in previous courses, or have a look at:
    • Slides
      The lecture notes can be downloaded from the course page in Moodle

    • Book
      A good book on information integration is: Principles of data integration, by AnHai Doan, Alon Halevy, Zachary Ives.

    • Lectures

      • When: Thursday, 10:00am - 1:00pm, starting from February 27, 2020 (sometimes also Tuesday, from 4:00pm to 7:00pm, or Thursday, 8:00am - 10:00am).
      • Where: Classroom A3 (Tuesday), and A7 (Thursday), via Ariosto 25, Roma
      • Schedule
        All the on-line lectures will be recorded and the recording will be available in the MOODLE page of the course.

        Time slots: Tuesday (4:00pm - 7:00pm), classroom A3; Thursday (8:00am - 10:00am), classroom A7; Thursday (10:00am - 1:00pm), classroom A7.

        Week         Lectures  Topics                                                                              Delivery
        01 (Feb 24)  1,2,3     Introduction to information integration; Propositional logic: syntax and semantics
        02 (Mar 02)
        03 (Mar 09)  4,5,6     First-order logic; Relationship between logic and data management                   on-line using Meet
        04 (Mar 16)  7,8,9     Relationship between FOL and Relational Algebra                                     on-line using Meet
        05 (Mar 23)  10,11,12  The notion of incomplete information; Complexity of FOL                             on-line using Meet
        06 (Mar 30)  13,14,15  History of information integration; Data integration: formal framework              on-line using Meet
        07 (Apr 06)
        08 (Apr 13)  16,17,18  Mapping languages; Methods for mapping specification                                on-line using Meet
        09 (Apr 20)  19,20,21  Exercise on mapping specification; Query answering in GAV                           on-line using Meet
        10 (Apr 27)  22,23,24  Query answering in LAV: materialization                                             on-line using Meet
        11 (May 04)  25,26,27  Query answering in LAV: virtualization                                              on-line using Meet
        12 (May 11)  28,29,30  Query answering with axioms in the global schema                                    on-line using Meet
        13 (May 18)  31,32     Discussion of student projects                                                      on-line using Meet
        14 (May 25)  33,34,35  Discussion of student projects

  • Papers and tools

    This is a list of papers that students can read if they are interested in specific topics:

    • Reasoning about schema mapping
      • Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, Wang Chiew Tan. Composing schema mappings: Second-order dependencies to the rescue. ACM Trans. Database Syst. 30(4): 994-1055 (2005)
      • Bogdan Alexe, Balder ten Cate, Phokion G. Kolaitis, Wang Chiew Tan: Characterizing schema mappings via data examples. ACM Trans. Database Syst. 36(4): 23 (2011)
    • Query answering
      • Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, Divesh Srivastava. Answering Queries Using Views. PODS 1995: 95-104
      • Rachel Pottinger, Alon Halevy. MiniCon: A scalable algorithm for answering queries using views. The VLDB Journal: The International Journal on Very Large Data Bases, Volume 10, Issue 2-3 (September 2001)
      • Oliver M. Duschka, Michael R. Genesereth, Alon Y. Levy. Recursive Query Plans for Data Integration. J. Log. Program. 43(1): 49-73 (2000)
      • George Konstantinidis, José Luis Ambite. Scalable query rewriting: a graph-based approach. SIGMOD '11: Proceedings of the 2011 International Conference on Management of Data.
    • Probabilistic data integration
      • Xin Luna Dong, Alon Y. Halevy, Cong Yu. Data integration with uncertainty. VLDB J. 18(2): 469-500 (2009)
    • Query answering under inconsistencies
      • Andrea Calì, Domenico Lembo, Riccardo Rosati. On the decidability and complexity of query answering over inconsistent and incomplete databases. PODS 2003: 260-271
      • Marcelo Arenas, Leopoldo E. Bertossi, Jan Chomicki. Consistent Query Answers in Inconsistent Databases. PODS 1999: 68-79
      • Balder ten Cate, Gaëlle Fontaine, Phokion G. Kolaitis: On the Data Complexity of Consistent Query Answering. Theory Comput. Syst. 57(4): 843-891 (2015)
    • Data cleaning and reconciliation
      • Douglas Burdick, Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, Wang Chiew Tan: Expressive Power of Entity-Linking Frameworks. ICDT 2017: 10:1-10:18
      • Anja Gruenheid, Xin Luna Dong, Divesh Srivastava: Incremental Record Linkage. PVLDB 7(9): 697-708 (2014)
    • Ontology-based data integration
      • Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Antonella Poggi, Mariano Rodriguez-Muro, Riccardo Rosati: Ontologies and Databases: The DL-Lite Approach. Reasoning Web 2009: 255-356

    Here is an incomplete list of possible tools to study or experiment with in a small project: Karma, IBM InfoSphere, Oracle Data Integrator, CloverETL, Pentaho, Jedox, Teiid, Apatar, Talend, Jitterbit, Myddleware, Adeptia, OpenRefine, Denodo, Dremio, Trifacta

  • Exams

    The following are the rules for the exam. There are three possibilities for the exam:

    • Study a paper on, or a tool for, data integration, data federation, ETL or data exchange, and then give a presentation (in English) that describes the characteristics of the paper or tool and discusses its position in the spectrum of information integration principles illustrated in the course.
    • Choose a set of data sources with data relevant to a certain phenomenon (for example, open data published on-line, or data taken from a database, CSV files, or XLS files known to the student), and develop a data integration or data exchange application using these data sources (and any tool selected by the student). This work can be carried out in a group of at most two students.
    • Choose a coordinated project with the section on Big Data Management taught by Prof. Lembo. To give an idea, here is an (incomplete) list of possible coordinated projects: (1) The part related to Information Integration can be an ETL system integrating interesting data sources into a data warehouse using a chosen tool, and the part on Big Data Management can be an application of OLAP operations over the integrated data in order to carry out interesting analyses. (2) The part related to Big Data Management can be the setup of heterogeneous data sources, including NoSQL data sources, and the part on Information Integration can be a system integrating such data sources using a chosen tool. (3) The part related to Information Integration can be a system integrating interesting data sources into a NoSQL source (such as a graph database), and the part on Big Data Management can be the implementation of queries over the integrated data.

    In all the above cases, once the student has chosen the topic, (s)he should send an email message to Prof. Lenzerini (and, in the case of a joint project with Big Data Management, to Prof. Lembo too) with the description of the topic, and wait for confirmation or a request to change the topic. Also, once the student has decided on the date for the exam, (s)he should send an email message to Prof. Lenzerini indicating the date. The current dates for the presentations of the students are:

  • Schedule of exams
    • First exam: June 2020
    • Second exam: July 2020
    • Third exam: September 2020
    • First special session: October 2020
    • Fourth exam: January 2021
    • Fifth exam: February 2021
    • Second special session: April 2021
  • Past editions
  • Office hours. Tuesday, 5:00 pm, at the Dipartimento di Informatica e Sistemistica "Antonio Ruberti",
    via Ariosto 25, Roma, second floor, room B203 (if available), or room B217 (otherwise). Please check the latest
    news for the next office hours.