Data Management for Data Science - Sapienza Università di Roma

Master's Degree Program in Data Science
Facoltà di Ingegneria dell'Informazione, Informatica e Statistica, Sapienza Università di Roma

Data Management for Data Science - Homework Assignments

2020/2021

Prof. Domenico Lembo
Prof. Riccardo Rosati

Students can present their homework during the lectures. There will be three homework assignments, which will be announced on this web page.

For every homework, every student will get a score ranging from -4 to +1. The final exam score will be computed as follows:

final_score = hw_1 + hw_2 + hw_3 + 30

where hw_n is the score of homework n (if final_score > 30, then the final score is 30 cum laude).
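
As a quick check of the arithmetic, the rule above can be sketched in a few lines of Python. The function name `final_score` and the returned `(score, cum_laude)` pair are illustrative choices, not part of the course rules.

```python
def final_score(hw_scores, base=30):
    """Return (score, cum_laude) following the rule stated above."""
    if len(hw_scores) != 3:
        raise ValueError("expected exactly three homework scores")
    for s in hw_scores:
        if not -4 <= s <= 1:
            raise ValueError(f"homework score {s} is outside the range -4..+1")
    total = base + sum(hw_scores)
    # Any total above 30 is reported as 30 cum laude.
    if total > 30:
        return 30, True
    return total, False


print(final_score([1, 1, 1]))     # (30, True)
print(final_score([1, 0, -1]))    # (30, False)
print(final_score([-4, -4, -4]))  # (18, False)
```

With these bounds the final score always falls between 18 and 30 cum laude.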

Assignment 1 - SQL

Choose an application domain and, using a relational DBMS, build a database. This can be done in two ways:

(recommended) use an interesting existing dataset, i.e.:
   1. get interesting data from the Web or other sources (e.g., use the Web to look for a whole database, or data that can be easily imported into a relational DBMS) and build a relational database using such data
   2. formulate a set of SQL queries (about 8-10) over the relational schema
   3. execute such queries over the database and analyze the results
   NOTICE: all datasets are fine EXCEPT MOVIE DATASETS (too many projects used movie DBs in the previous years)
create the schema and the dataset from scratch, i.e.:
   1. define the relational schema (i.e., write SQL statements to create tables defining attributes, domains, and possibly integrity constraints)
   2. insert tuples into tables (through SQL statements)
   3. formulate a set of SQL queries (about 10) over the relational schema
   4. execute such queries over the database and analyze the results
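
The "from scratch" route can be sketched as follows. This is only a minimal illustration: it uses Python's built-in `sqlite3` module with an in-memory database instead of MySQL or PostgreSQL, and the `student` / `exam` schema and its data are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Step 1: schema with attributes, domains and integrity constraints.
# Step 2: tuples inserted through SQL statements.
conn.executescript("""
CREATE TABLE student (
    id   INTEGER PRIMARY KEY,
    name TEXT    NOT NULL,
    year INTEGER NOT NULL CHECK (year BETWEEN 1 AND 2)
);
CREATE TABLE exam (
    student_id INTEGER NOT NULL REFERENCES student(id),
    course     TEXT    NOT NULL,
    grade      INTEGER NOT NULL CHECK (grade BETWEEN 18 AND 30),
    PRIMARY KEY (student_id, course)
);
INSERT INTO student VALUES (1, 'Ada', 1), (2, 'Bruno', 2);
INSERT INTO exam VALUES
    (1, 'Databases', 30), (1, 'Statistics', 27), (2, 'Databases', 24);
""")

# The constraints reject inconsistent tuples: student 99 does not exist.
try:
    conn.execute("INSERT INTO exam VALUES (99, 'Databases', 25)")
except sqlite3.IntegrityError as err:
    print("rejected:", err)

# Steps 3-4: formulate a query and look at its result.
avg_by_student = conn.execute("""
    SELECT s.name, AVG(e.grade)
    FROM student s JOIN exam e ON e.student_id = s.id
    GROUP BY s.id, s.name
    ORDER BY s.name
""").fetchall()
print(avg_by_student)
```

The same `CREATE TABLE` and `INSERT` statements should carry over to MySQL or PostgreSQL with small dialect adjustments (e.g., type names).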

The work must be done in groups of two students.

Students can use publicly available DBMSs like MySQL or PostgreSQL (see below), or other, commercial DBMSs.

The queries defined by the students should cover all the aspects of SQL queries analyzed during the course lectures and exercises (joins, aggregations, nested queries, queries with negated subqueries). The complexity of the queries produced should be (at least) comparable to the specification appearing in the following exercise on SQL.
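
To make the four query categories concrete, here is a small sketch with one query per category. It again assumes Python's `sqlite3` module and an invented `customer` / `orders` schema. The queries are deliberately simpler than what the homework expects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(id),
    amount      REAL
);
INSERT INTO customer VALUES (1, 'Ada', 'Rome'), (2, 'Bruno', 'Milan'), (3, 'Carla', 'Rome');
INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 70.0), (12, 2, 20.0);
""")

queries = {
    # Join: every order together with the name of its customer.
    "join": """
        SELECT c.name, o.amount
        FROM customer c JOIN orders o ON o.customer_id = c.id
        ORDER BY o.id
    """,
    # Aggregation: number of orders and total amount per city.
    "aggregation": """
        SELECT c.city, COUNT(*), SUM(o.amount)
        FROM customer c JOIN orders o ON o.customer_id = c.id
        GROUP BY c.city
        ORDER BY c.city
    """,
    # Nested query: customers with at least one order above the average amount.
    "nested": """
        SELECT name FROM customer
        WHERE id IN (SELECT customer_id FROM orders
                     WHERE amount > (SELECT AVG(amount) FROM orders))
    """,
    # Negated subquery: customers who never placed an order.
    "negated": """
        SELECT name FROM customer c
        WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    """,
}

results = {label: conn.execute(sql).fetchall() for label, sql in queries.items()}
for label, rows in results.items():
    print(label, rows)
```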

Assignment 2 - SQL evaluation and optimization

Starting from the database developed in the first homework, every group has to identify at least 4 SQL queries that pose performance problems to the DBMS. The students have to show both the "slow" and the "fast" execution of the queries, where the fast version is obtained by:

adding integrity constraints to one or more tables
rewriting the SQL query (without changing its meaning)
adding indices to one or more tables
modifying the schema of the database
adding views or new (materialized) tables derived from the existing database tables

Ideally, these queries should be picked from the queries created for the first homework; however, new queries can be considered if none of the previous queries poses performance problems to the DBMS.
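
One of the listed techniques, adding an index, can be sketched as below. The example assumes SQLite through Python's `sqlite3` module and an invented `orders` table filled with synthetic rows. It compares the plan reported by SQLite's `EXPLAIN QUERY PLAN` before and after the index is created. MySQL and PostgreSQL both provide an `EXPLAIN` statement for the same purpose, with their own output formats.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
# Synthetic data: 50,000 orders spread over 1,000 customers.
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 1000, float(i % 97)) for i in range(50_000)],
)

QUERY = "SELECT SUM(amount) FROM orders WHERE customer_id = 42"


def plan(sql):
    """Return the plan steps SQLite reports for a query (last column of each row)."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan_before = plan(QUERY)  # no usable index: the whole table is scanned
result_before = conn.execute(QUERY).fetchone()[0]

conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

plan_after = plan(QUERY)   # the new index is used to locate matching rows
result_after = conn.execute(QUERY).fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
# The optimization must not change the meaning of the query.
assert result_before == result_after
```

Showing the plan next to the measured execution time makes it easier to explain why the "fast" version is faster.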

The first homework will be presented, together with the second one, in the week of April 12 - April 16. The presentation will consist of a short (10-15 minutes) session in which the students will show the work done by directly interacting with the relational DBMS on their own laptop.

Useful links:

Link to download the MySQL Server and Workbench
There are many tutorials on how to install MySQL Server and Workbench, see e.g. this one.
Link to download the PostgreSQL DBMS

Assignment 3 - NoSQL

Use a NoSQL tool (property-graph database, RDF triple store, document-based database, key-value database, column-family database) to manage and query a dataset. Ideally, the groups should use the same dataset used in the first and second homework. Examples of such systems include (but are not limited to) Neo4j, GraphDB, MongoDB, Redis, Cassandra (see the course material on aggregate databases, graph and RDF databases for more details).
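
As a rough illustration of how a document-based (aggregate-oriented) model differs from a relational one, the sketch below reshapes two flat "tables" into one nested document per customer. It uses only plain Python data structures and the standard `json` module, not the API of any specific NoSQL system. The `customers` / `orders` data and the `to_documents` helper are invented for the example.

```python
import json

# Relational view: two flat tables linked by a foreign key (customer_id).
customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bruno"}]
orders = [
    {"id": 10, "customer_id": 1, "amount": 50.0},
    {"id": 11, "customer_id": 1, "amount": 70.0},
    {"id": 12, "customer_id": 2, "amount": 20.0},
]


def to_documents(customers, orders):
    """Embed each customer's orders in the customer record (one aggregate each)."""
    docs = []
    for c in customers:
        embedded = [
            {"id": o["id"], "amount": o["amount"]}
            for o in orders
            if o["customer_id"] == c["id"]
        ]
        docs.append({"_id": c["id"], "name": c["name"], "orders": embedded})
    return docs


documents = to_documents(customers, orders)
print(json.dumps(documents[0], indent=2))

# Reading one customer's total needs no join: the data is inside the aggregate.
total_for_ada = sum(o["amount"] for o in documents[0]["orders"])
print(total_for_ada)
```

The trade-off is that queries across aggregates (e.g., all orders above a given amount) have to visit every document, whereas in the relational schema they touch a single table.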

The work must be done by the same student groups who presented the first and second homework assignments (i.e., current project groups cannot be modified).

The presentation will consist of a short session (15 minutes at most) in which the students will show the work done by directly interacting with the NoSQL system on their own laptop, highlighting the differences with respect to a standard (SQL) relational database system.

The presentations of the third homework will be held during the lectures of May 24 and May 26, 2021. The order of presentations will be the same as for homeworks 1 and 2: the groups who presented HW 1 and 2 on April 12 will present HW 3 on May 24, and the groups who presented HW 1 and 2 on April 14 will present HW 3 on May 26. A more detailed presentation schedule will be published on Google Classroom.
