Information Integration (academic year 2016/2017)

This is one of the sections of the course Large Scale Data Management. The lectures of this section were held in March-May 2017.

Who this course is for. This 3-credit course is one of the two sections of the course Large Scale Data Management for the students of the Master in Computer Engineering (School of Engineering) of Sapienza Università di Roma.
Prerequisites. A good knowledge of the fundamentals of Programming Structures, Programming Languages, Databases (SQL, relational data model, Entity-Relationship data model, conceptual and logical database design) and Database systems, as well as a basic knowledge of Mathematical Logic is required.
Course goals. Information integration is the problem of combining data residing at different sources and providing the user with a unified view of these data. Designing information integration systems is important in current real-world applications, and raises a number of issues that are interesting from both a theoretical and a practical point of view. In recent years, there has been a huge amount of research on data integration, and a precise, clear picture of a systematic approach to this problem is now available. This section will present an overview of the research carried out in the area of data integration, with emphasis on the theoretical results that are relevant for the development of information integration solutions. Special attention will be devoted to the following aspects: architectures for information integration, modeling an information integration application, ontology-based data access and integration, processing queries in information integration, data exchange, and reasoning on queries.
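As a purely illustrative sketch of what "a unified view" over different sources can mean, the Python fragment below defines two small in-memory sources with different schemas and exposes them through global relations that the user queries directly. All names and data (`source_a`, `source_b`, `global_person`, `global_works_in`, `query_people_in`) are hypothetical and are not taken from the course material.

```python
# Two hypothetical sources with different schemas (illustrative data only).
source_a = [  # rows are tuples: (employee_id, full_name)
    (1, "Ada"),
    (2, "Bob"),
]
source_b = [  # rows are dicts: {"name": ..., "dept": ...}
    {"name": "Bob", "dept": "R&D"},
    {"name": "Eve", "dept": "Sales"},
]


def global_person():
    """Unified view Person(name): union of the names found in both sources."""
    names = {name for _, name in source_a}
    names |= {row["name"] for row in source_b}
    return sorted(names)


def global_works_in():
    """Unified view WorksIn(name, dept), populated from source_b only."""
    return sorted((row["name"], row["dept"]) for row in source_b)


def query_people_in(dept):
    """A user query posed over the unified view, not over the sources."""
    return [name for name, d in global_works_in() if d == dept]


if __name__ == "__main__":
    print(global_person())
    print(query_people_in("R&D"))
```

Here each global relation is defined as a query over the sources, so the user never needs to know how the sources are structured.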

  • News
    • February 20, 2018: For the dates of the exam in April 2018, please see below. Students who need to register the whole exam of Large Scale Data Management in the April 2018 session can now book the exam through the INFOSTUD system until April 10, 2018.
    • May 22, 2017: Today's lecture was the last lecture of the course. Prof. Lenzerini thanks all the students who attended the course this year.
  • Topics covered
    • Architectures for information integration
    • Distributed data management
    • Data federation
    • Data exchange and data warehousing
    • ETL (Extraction, Transformation and Loading), data cleaning and data reconciliation
    • Data integration
    • Ontology-based data integration
  • Teaching material
  • Slides

    The lecture notes can be downloaded from the course page in Moodle

  • Exams

    The following are the rules for the exam. The exam can be taken in one of the following ways:

    • Study a tool for data integration, data federation, ETL or data exchange, and then give a presentation (in English) that describes the characteristics of the tool, discusses the position of the tool in the spectrum of information integration principles illustrated in the course, and includes a demo of the tool. For a complete picture of the available tools for data integration, the student should search the web. Here is an incomplete list of possible tools: Karma, IBM InfoSphere, Oracle Data Integrator, CloverETL, Pentaho, Teiid, Talend, Jitterbit, Adeptia, OpenRefine, etc.
    • Choose a set of data sources with data relevant for a certain phenomenon (for example, data taken from open data published on-line, or data taken from a database or from csv files, or from xls files known by the student), and develop a data integration or data exchange application using such data sources (and using any tool selected by the student). This work can be carried out in a group of at most two students.
    • Choose a coordinated project with the section on Big Data Management taught by Prof. Lembo. Just to give an idea, here is an (incomplete) list of possible coordinated projects: (1) The part related to Information Integration can be an ETL system integrating interesting data sources into a data warehouse using a chosen tool, and the part on Big Data Management can be an application of OLAP operations over the integrated data in order to carry out interesting analyses. (2) The part related to Big Data Management can be the setup of heterogeneous data sources, including NoSQL data sources, and the part on Information Integration can be a system integrating such data sources using a chosen tool. (3) The part related to Information Integration can be a system integrating interesting data sources into a NoSQL source (such as a graph database), and the part on Big Data Management can be the implementation of queries over the integrated data.
    • Study a paper on information integration, and then discuss the paper in a 15-minute presentation (in English), again including a part that positions the work within the spectrum of the principles illustrated in the course. Here is a (non-exhaustive) list of papers that can be considered (use Google to find the papers and download them):
      • 1. Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Moshe Y. Vardi. Rewriting of Regular Expressions and Regular Path Queries. In J. Comput. Syst. Sci. 64(3):443-465, 2002
      • 2. Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, Divesh Srivastava. Answering Queries Using Views. PODS 1995: 95-104
      • 3. Rachel Pottinger, Alon Halevy. MiniCon: A scalable algorithm for answering queries using views. The VLDB Journal, Volume 10, Issue 2-3 (September 2001)
      • 4. Oliver M. Duschka, Michael R. Genesereth, Alon Y. Levy. Recursive Query Plans for Data Integration. J. Log. Program. 43(1): 49-73 (2000)
      • 5. Philippe Adjiman, Philippe Chatalic, François Goasdoué, Marie-Christine Rousset, Laurent Simon. Distributed Reasoning in a Peer-to-Peer Setting: Application to the Semantic Web. Journal of Artificial Intelligence Research (JAIR) 25: 269-314 (2006)
      • 6. Xin Luna Dong, Alon Y. Halevy, Cong Yu. Data integration with uncertainty. VLDB J. 18(2): 469-500 (2009)
      • 7. Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, Wang Chiew Tan. Composing schema mappings: Second-order dependencies to the rescue. ACM Trans. Database Syst. 30(4): 994-1055 (2005)
      • 8. Andrea Calì, Domenico Lembo, Riccardo Rosati. On the decidability and complexity of query answering over inconsistent and incomplete databases. PODS 2003: 260-271
      • 9. Jens Bleiholder, Felix Naumann. Data fusion. ACM Comput. Surv. 41(1): (2008)
      • 10. Marcelo Arenas, Leopoldo E. Bertossi, Jan Chomicki. Consistent Query Answers in Inconsistent Databases. PODS 1999: 68-79
      • 11. George Konstantinidis, José Luis Ambite. Scalable query rewriting: a graph-based approach, SIGMOD '11 Proceedings of the 2011 international conference on Management of data.
      • 12. Hector Gonzalez, Alon Y. Halevy, Christian S. Jensen, Anno Langen, Jayant Madhavan, Rebecca Shapley, Warren Shen, Jonathan Goldberg-Kidon. Google fusion tables: web-centered data management and collaboration. SIGMOD Conference 2010: 1061-1066
      • 13. Mary Roth, Wang-Chiew Tan: Data Integration and Data Exchange: It's Really About Time. CIDR 2013
      • 14. Bogdan Alexe, Balder ten Cate, Phokion G. Kolaitis, Wang Chiew Tan: Characterizing schema mappings via data examples. ACM Trans. Database Syst. 36(4): 23 (2011)
      • 15. Anastasios Kementsietsidis, Marcelo Arenas, Renée J. Miller: Mapping Data in Peer-to-Peer Systems: Semantics and Algorithmic Issues. SIGMOD Conference 2003: 325-336
      • 16. Douglas Burdick, Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, Wang Chiew Tan: Expressive Power of Entity-Linking Frameworks. ICDT 2017: 10:1-10:18
      • 17. Balder ten Cate, Gaëlle Fontaine, Phokion G. Kolaitis: On the Data Complexity of Consistent Query Answering. Theory Comput. Syst. 57(4): 843-891 (2015)
      • 18. Anja Gruenheid, Xin Luna Dong, Divesh Srivastava: Incremental Record Linkage. PVLDB 7(9): 697-708 (2014)

    In all the above cases, once the student has chosen the topic, (s)he should send an email message to Prof. Lenzerini (and, in the case of a joint project with Big Data Management, to Prof. Lembo too) with a description of the topic, and wait either for confirmation or for a request to change the topic if it (tool, paper, or use case) is already taken. Likewise, once the student has decided on the date for the exam, (s)he should send an email message to Prof. Lenzerini indicating the date. Here are the dates for March-April 2018 (the exam will be held in the office of Prof. Lenzerini):

    • March 27, 2018 at 5pm
    • April 10, 2018 at 5pm
    • April 17, 2018 at 5pm

  • Schedule of exams:
    • First exam: June 2017
    • Second exam: July 2017
    • Third exam: September 2017
    • First special session: October 2017
    • Fourth exam: January 2018
    • Fifth exam: February 2018
    • Second special session: April 2018

  • Lectures
    • When: Monday, 8:00am - 10:00am, starting in February 2017. Occasionally, lectures will also take place on Wednesday, 8:00am - 9:00am (this will be announced in advance)
    • Where: Classroom A3 on Monday and A2 on Wednesday, via Ariosto 25, Roma
    • Schedule (lectures are on Monday, with the addition of Wednesday only in some of the weeks):

      Week | Monday (8:00am - 10:00am), classroom A3 | Wednesday (8:00am - 9:00am), classroom A2
      01 (Feb 20) | Lectures 1,2: Introduction to information integration; Propositional logic: syntax and semantics |
      02 (Feb 27) | |
      03 (Mar 6) | Lectures 3,4: First-order logic: syntax and semantics; First-order logic and databases |
      04 (Mar 13) | Lectures 5,6: Architectures for information integration |
      05 (Mar 20) | Lectures 7,8: Semantics of information integration systems |
      06 (Mar 27) | Lectures 9,10: Semantics of query answering in information integration |
      07 (Apr 3) | Lectures 11,12: Types of mappings |
      08 (Apr 10) | Lectures 13,14: GAV and LAV mappings |
      09 (Apr 17) | Lectures 15,16: Computing the certain answers in GAV: the bottom-up approach |
      10 (Apr 24) | Lectures 17,18: Computing the certain answers in GAV: the top-down approach |
      11 (May 1) | Lectures 19,20: Computing the certain answers in (G)LAV: the bottom-up approach |
      12 (May 8) | Lectures 21,22: Computing the certain answers in (G)LAV: the top-down approach |
      13 (May 15) | Lectures 23,24: Ontology-based data integration |
      14 (May 22) | Lectures 25,26: Ontology-based data integration |

  • Past editions
  • Office hours. Tuesday, 5:00 pm, at the Dipartimento di Informatica e Sistemistica "Antonio Ruberti",
    via Ariosto 25, Roma, second floor, room B203 (if available) or room B217 (otherwise). Please check the
    latest news for the next office hours.