MURST'2000 - Modello B (Form B)

D2I: Integration, Warehousing, and Mining of Heterogeneous Data Sources

LENZERINI	MAURIZIO
(surname)	(first name)

Full professor	14/12/1954	LNZMRZ54T14G388K
(position)	(date of birth)	(personal identification code)

Università degli Studi di ROMA "La Sapienza"	Facoltà di INGEGNERIA
(university)	(faculty)
K05A	Dipartimento di INFORMATICA E SISTEMISTICA
(scientific-disciplinary sector)	(department/institute)

06/8841954	06/85300849	lenzerini@dis.uniroma1.it
(area code and telephone)	(fax number)	(e-mail)

Maurizio Lenzerini was born in Pavia on December 14, 1954. He has been a full professor of Computer Science and Engineering since 1990. He is the author of several academic books on the fundamentals of Computer Science, Software Engineering, and Database design. Since 1983, he has carried out his research at the Università di Roma "La Sapienza", where he leads a research group on Databases and Artificial Intelligence. His main research interests are conceptual and semantic data modeling, data integration, data warehousing, semistructured data management, knowledge representation and reasoning, and object-oriented methodologies. He is currently involved in national and international research projects on data integration, data warehousing, and semistructured data. He is the author of more than 200 publications in international conferences and journals, including the most prestigious ones in the above-mentioned areas, such as Journal of Computer and System Sciences, Information and Computation, Artificial Intelligence, Information Systems, IEEE Data and Knowledge Engineering, ACM-PODS, ACM-SIGMOD, IEEE-ICDE, VLDB, ICDT, IJCAI, AAAI, KR, CoopIS. He has edited several international books, including a recent one on "Data Warehouse Quality". He regularly serves on the Program Committee of the most important international conferences in the above areas, including IJCAI, AAAI, EDBT, PODS, KR, CoopIS, ER, ICDT. He has organized several international conferences and workshops. He is a member of the Editorial Board of various international journals, and he is the editor of Information Systems: An International Journal for the area of Data Modeling, Knowledge Representation and Reasoning. He was Program co-Chair of the 4th International Conference on Cooperative Information Systems, held in Edinburgh in 1999, and Conference Chair of the International Conference on Conceptual Modeling, held in Paris in 1999.

No.	Surname	Name	Department/Institute	Position	Scientific sector	Person-months 2000	Person-months 2001
1	LENZERINI	MAURIZIO	INFORMATICA E SISTEMISTICA	Full professor	K05A	5	5
2	CADOLI	MARCO	INFORMATICA E SISTEMISTICA	Associate professor	K05A	1	3
3	CATARCI	TIZIANA	INFORMATICA E SISTEMISTICA	Associate professor	K05A	4	4
4	DE GIACOMO	GIUSEPPE	INFORMATICA E SISTEMISTICA	Researcher	K05A	5	5
5	SALZA	SILVIO	INFORMATICA E SISTEMISTICA	Associate professor	K05A	4	4

No.	Surname	Name	Department/Institute	Year of degree	Person-months
1.	CALI'	ANDREA	INFORMATICA E SISTEMISTICA	2002	16
2.	CALVANESE	DIEGO	INFORMATICA E SISTEMISTICA	1997	16

No.	Position	Expected cost	Person-months
1.	Graduate	24	12

Integrated views of data coming from heterogeneous sources: Methods and tools for modeling, query processing, and visualization.

The scientific background is described separately for the two project topics in which the unit is involved, namely data integration and data mining.
Data integration has been studied for more than two decades. Initially, integration problems were addressed in the context of database design, where the need arises to compare, restructure, and merge schemata produced in intermediate steps of the design process [Batini86]. With the growing importance of distributed, federated, and cooperative information systems, more and more attention has been devoted to integrating data rather than schemata, with the goal of providing access to heterogeneous and autonomous sources in a way that is independent of their heterogeneity and physical characteristics. New notions and techniques, such as inter-schema properties, mediators, wrappers, meta-data, and others, have been proposed and investigated in order to handle integration within a structured and principled framework. In particular, the design space for integration has been successfully characterized [Calvanese98] by distinguishing between read-only integration (when the sources are autonomous and cannot be updated by the integrated system) and read-and-update integration (when the integration system also has update capabilities), between virtual integration (when the integrated views are not materialized) and materialized integration, and between global-as-view (when the integrated views are defined by queries over the sources) and local-as-view (when the sources are specified in terms of queries over the global views) [Ullman97]. Nevertheless, the methods and techniques proposed in recent years are specifically devoted to structured sources (e.g., databases), and have only marginally considered semistructured sources. The importance of semistructured sources stems from the fact that, with the growing diffusion of networks and WWW-based information systems, more and more information sources contain data stored in Web pages and/or documents.
Typically, the interrelationships among such objects have a structure that on the one hand is not completely free, and on the other hand cannot be rigidly determined by a schema. Hence, data in such sources can be represented correctly only as a collection of semistructured objects [Buneman97]. Among the typical problems of integration, answering queries posed to integrated views is the one that requires new approaches and solutions. It is well known that this calls for addressing the problems of query rewriting using views [Ullman97] and query answering using views [Levy95]. The former concerns reformulating a query expressed on the integrated view in terms of suitable queries on the sources. The latter concerns how to use the data materialized by the integration system in order to answer the queries. Both problems have been studied extensively for heterogeneous sources [Levy00] and for semistructured sources under simplifying assumptions [Calvanese99a, Calvanese00, Calvanese00b], but for semistructured sources they still have to be studied and solved in their full generality. A second problem concerns the reconciliation of data coming from heterogeneous sources, in order to eliminate possible inconsistencies and redundancies. In particular, it is necessary to solve the problems related to the heterogeneity of the sources, such as differences in the representation of the same object, possible errors in the coding of object properties, and possible discrepancies between the properties assigned to objects in the various sources. The problems related to data reconciliation have been addressed with formal methods only recently [Galhardas00, Calvanese99d]. The methodology proposed within the European project DWQ (Foundations of Datawarehouse Quality) [Jarke00b, Calvanese99d] is based on the declarative specification of reconciliation correspondences between data in different sources. It allows for the automatic synthesis of the mediators that access the heterogeneous sources and integrate the data extracted from them, by means of query rewriting algorithms that also take the reconciliation correspondences into account.
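The global-as-view case with virtual integration described above can be illustrated with a minimal sketch. It assumes one global relation defined as the union of two sources; the relation names (employee, s1_staff, s2_people), the sample tuples, and the helper functions are hypothetical and are not taken from the project. Under these assumptions, rewriting a query on the global view reduces to unfolding the view definition into source scans, which is only the simplest form of query rewriting.

```python
# Hypothetical source relations (names and tuples invented for illustration).
SOURCES = {
    "s1_staff": [("Rossi", "Rome"), ("Bianchi", "Milan")],
    "s2_people": [("Verdi", "Rome"), ("Rossi", "Rome")],
}

# Global-as-view mapping: each global relation is defined by queries over
# the sources. Here the "query" is simply a union of full source scans.
GAV_MAPPING = {
    "employee": ["s1_staff", "s2_people"],
}


def unfold(global_relation):
    """Rewrite a scan of a global relation into the source scans defining it."""
    return list(GAV_MAPPING[global_relation])


def answer(global_relation, predicate):
    """Answer a selection on a virtual global relation by querying the sources."""
    result = set()
    for source_name in unfold(global_relation):
        for row in SOURCES[source_name]:
            if predicate(row):
                result.add(row)  # the set removes duplicates across sources
    return sorted(result)


in_rome = answer("employee", lambda row: row[1] == "Rome")
print(in_rome)
```

The global relation is never materialized: each query is answered by reading the sources at query time. A local-as-view setting would instead describe each source as a query over the global schema, and unfolding alone would no longer be enough to compute the rewriting.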
In principle, data mining is only one of many activities in the longer process of knowledge discovery, i.e., the discovery of non-trivial information through careful analysis of the data with ad-hoc techniques [Fayyad96, Agrawal98]. However, the term "data mining" is very often used as a synonym of "knowledge discovery", since it denotes the most important step in the overall discovery process. Whereas knowledge discovery is a highly interactive, user-driven process, most existing data mining tools operate as stand-alone black boxes that are completely closed to interaction with the user. They offer the user no support in understanding the results they produce, and they accept no user guidance on how to focus their search [Brachman96]. On the other hand, the importance and effectiveness of visual representations are well known (see, e.g., [Catarci96, Catarci99]), as witnessed by the growing number of "information visualization" products (recent proposals include Origami [Louie99] and Structure Explorer [Lin00]). Adequate data visualization can be a key tool for supporting various activities related to data mining, such as understanding the data domain, discovering correlations, trends, and anomalies, and analyzing the results of other mining techniques [Keim96]. In order to fully exploit the capabilities of visualization in the knowledge discovery process, it is necessary to work in an integrated environment [Klemettinen97, Gunopolus97, Mannila97, Sarawagi98], where user interaction, by means of visual representations and graphical manipulation primitives, drives the tool in the discovery process and also helps in analyzing the data produced by the different techniques. None of the existing systems successfully integrates different techniques and tools in a single interactive environment, in which effective data visualization is the common denominator that allows the user to extract, understand, and exploit the information hidden in the data [Catarci96, Catarci97, Catarci98].

[Ullman97] J.D. Ullman, "Information Integration Using Logical Views", International Conference on Database Theory (ICDT), 1997.
[Batini86] C. Batini, M. Lenzerini, S. Navathe, "A Comparative Analysis of Methodologies for Database Schema Integration", ACM Computing Surveys, 18(4), 1986.
[Buneman97] P. Buneman, "Semistructured Data", 16th ACM Symposium on Principles of Database Systems (PODS'97), 1997.
[Calvanese98] D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati, "Information Integration: Conceptual Modeling and Reasoning Support", Int. Conference on Cooperative Information Systems (CoopIS'98), New York, 1998.
[Calvanese99a] D. Calvanese, G. De Giacomo, M. Lenzerini, M.Y. Vardi, "Rewriting of Regular Expressions and Regular Path Queries", PODS, 1999.
[Jarke00b] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, "Fundamentals of Data Warehouses", Springer-Verlag, 2000.
[Calvanese00] D. Calvanese, G. De Giacomo, M. Lenzerini, M.Y. Vardi, "Answering Regular Path Queries Using Views", IEEE-ICDE, 2000.
[Calvanese00b] D. Calvanese, G. De Giacomo, M. Lenzerini, M.Y. Vardi, "Query Processing Using Views for Regular Path Queries with Inverse", PODS, 2000.
[Galhardas00] H. Galhardas, D. Florescu, D. Shasha, E. Simon, "An Extensible Framework for Data Cleaning", IEEE-ICDE, 2000.
[Calvanese99d] D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati, "A Principled Approach to Data Integration and Reconciliation in Data Warehousing", Workshop on Design and Management of Data Warehouses, 1999.
[Catarci96] T. Catarci, S.K. Chang, M.F. Costabile, S. Levialdi, G. Santucci, "A Graph-based Framework for Multiparadigmatic Visual Access to Databases", IEEE TKDE, 8(3), 1996.
[Catarci97] T. Catarci, G. Santucci, J. Cardiff, "Graphical Interaction with Heterogeneous Databases", VLDB Journal, 6(2).
[Catarci98] G. Sciscio, T. Catarci, "Data Mining: Tecnologie e Strumenti", Rivista di Informatica, AICA, 28(3), 1998.
[Catarci99] T. Catarci, G. Santucci, L. Tarantino, "Emerging Issues in Visual Interfaces", Knowledge Engineering Review, 14(1), 1999.
[Lin00] T. Lin, "Visualising Relationships for Real World Applications", Workshop on New Paradigms in Information Visualization and Manipulation, 2000.
[Louie99] J.Q. Louie, T. Kraay, "Origami: A New Data Visualization Tool", ACM Knowledge Discovery and Data Mining, 1999.
[Levy95] A.Y. Levy, A.O. Mendelzon, Y. Sagiv, D. Srivastava, "Answering Queries Using Views", PODS, 1995.
[Levy00] A.Y. Levy, "Answering Queries Using Views: A Survey", Technical Report, University of Washington, 2000.
[Brachman96] R.J. Brachman, T. Anand, "The Process of Knowledge Discovery in Databases", in: U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, The MIT Press, 1996.
[Fayyad96] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (eds.), "Advances in Knowledge Discovery and Data Mining", The MIT Press, 1996.
[Agrawal98] R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan, "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications", Proc. SIGMOD Conference, 1998.
[Keim96] D.A. Keim, H.-P. Kriegel, "Visualization Techniques for Mining Large Databases: A Comparison", IEEE TKDE, 8(6), 1996, pp. 923-938.
[Klemettinen97] M. Klemettinen, H. Mannila, H. Toivonen, "A Data-Mining Methodology and Its Application to Semi-Automatic Knowledge Acquisition", Proc. of the 8th International Conference and Workshop on Database and Expert Systems Applications (DEXA'97), France, 1997.
[Mannila97] H. Mannila, "Methods and Problems in Data Mining", Proc. ICDT, 1997.
[Sarawagi98] S. Sarawagi, S. Thomas, R. Agrawal, "Integrating Mining with Relational Database Systems: Alternatives and Implications", Proc. SIGMOD Conference, 1998.

The description is organized around the two project themes in which the Rome unit is involved, namely integration and data mining, and around the construction of the meta-data repository. Within each theme, reference is made to the four phases of the project.
As for the activities concerning the meta-data repository, the Rome unit will coordinate the work, which is carried out in close collaboration among all units. The first phase covers the definition of the methods for representing and managing the meta-data that are needed to produce the specifications for the repository, which will provide the common basis for the methodologies and tools developed within the project. The product of this first phase will be a technical report, produced in collaboration with the other units, containing the specifications for the meta-data repository (D0.R1). In the second phase, the structure of the meta-data repository will be precisely defined, and the set of services that the repository must offer will be specified. The product of this second phase will be a technical report, produced in collaboration with the other units, containing the specification of the functional architecture of the meta-data repository (D0.R2). The third phase covers the implementation of the various functionalities of the meta-data repository, under the guidance of the Rome unit. The product of this phase is the meta-data repository (D0.P1), built jointly by all units. The fourth phase involves using the meta-data repository in the experimentation with the tools developed in the previous phases. No specific product is planned in this phase for the repository activity.
As for the integration theme, in the first phase the Rome unit plans to study and analyze the new requirements that arise for data integration when strongly heterogeneous sources are considered, i.e., both structured sources (e.g., databases) and semistructured sources (e.g., HTML and XML documents). The typical integration problems will be investigated in this new context. The requirements for new data representation methods will be studied, taking into account the presence of semistructured sources. The models for semistructured data proposed in the literature will be compared in order to characterize their expressive power. Methods for defining and specifying quality parameters of the sources (reliability, completeness, redundancy, accuracy, etc.) will be investigated, together with methods for reconciling data coming from heterogeneous sources. Existing methods for query rewriting and query answering using views will be analyzed. The products of this first phase will be a technical report on methods and techniques for the extraction, representation, and integration of structured and semistructured sources, produced in collaboration with the other units (D1.R1), and a technical report containing a survey of methods for query rewriting and query answering using views (D1.R5).
In the second phase, semi-automatic intelligent techniques will be developed for identifying and reconciling heterogeneities based on the properties of the data. These techniques will be an integral part of the methodology for building reconciled views of semistructured data coming from heterogeneous sources, which is developed in collaboration with the other units. In addition, algorithms will be defined for rewriting queries with respect to a set of views (query rewriting and query answering using views), by extending, modifying, and adapting current approaches to account for the existence of semistructured sources. The product of this phase will be a technical report on methodology and tools for data reconciliation (D1.R11).
The third phase covers the implementation of prototypes that realize the functions identified from the scientific results of the previous phase. A prototype will be built for the query rewriting and query answering using views algorithms developed in phase 2 and for data reconciliation (D1.P3). Particular care will be devoted to making the prototype integrable with the environments and tools defined and designed by the other units, in particular the environment supporting the construction of a global view (Modena unit), the tool for extracting inter-schema properties (Calabria unit), and the Query Manager for handling global queries (Modena unit). In this respect, the common meta-data repository is a fundamental element of the overall architecture of the integration system, since it allows easy exchange and reuse of all the meta-data produced and used by the different tools.
In the fourth phase, the implementation of the prototype for the query rewriting and query answering using views algorithms and for data reconciliation will be completed, together with its integration with the other tools. The methodologies and tools for data reconciliation developed in the previous phases will be tested and validated using the Telecom sources. The product of the fourth phase will be a technical report on the results of the experimentation with the methodologies and prototypes for integration, produced in collaboration with the other units (D1.R12).
As for the data mining theme, the goal of the Rome unit is to build a prototype of a new-generation data mining system that is "user-centered" and in which existing data mining techniques and tools are integrated with new components aimed at providing substantial support to the user in all phases of information discovery. Distinctive features of the system will be not only the possibility of integrating different systems in a single user-oriented environment, but also the ability to provide new solutions both to some of the open problems in the individual data mining techniques and to those arising from the integration of different techniques in a single environment. The definition of the system will be preceded by a theoretical study that systematizes and formalizes the relationship between the various data visualization modalities and the various information discovery activities. The system must have the following fundamental characteristics:
1. provide various visualization modalities suited to conveying relevant properties of the data effectively;
2. provide various assistance strategies that let the user easily formulate a plan for discovering hidden information;
3. provide primitives for the semi-automatic creation of new ad-hoc visualizations, depending on the type of data and the user's goals;
4. be adaptable to the various types of users, automatically offering the most appropriate visualization modalities and search strategies;
5. be flexible and capable of integrating existing systems and techniques into a homogeneous environment.
During its development, the system will be validated through close interaction with groups of potential users, such as Telecom Italia. The architecture of the system will contain three fundamental types of components, organized in different "layers":
1. components for information visualization (implementing different visual metaphors and techniques for visual mining);
2. components for "knowledge discovery" (implementing several different techniques for information discovery); and
3. components for data management (providing the multidimensional structures needed to store and manipulate the data).
The first two sets of components will work in close connection, as agents cooperating toward a common goal, namely information discovery. For example, an initial visual analysis of the data could highlight a particular region to be studied with specific knowledge discovery techniques; the result could then be visualized again, and so on. The third layer of the system (the data management layer) will instead act as a server for the other two. In any case, the user will be able to access all components through a friendly and adaptive interface (i.e., one that adapts to the various types of users) and to guide the whole process. Specific results of the project will concern:
a) a model for formalizing visual representations and their relationship (in terms of effectiveness) with data, users, and tasks;
b) techniques for projecting multidimensional spaces onto two- or three-dimensional spaces, in order to highlight repetitive patterns in the data;
c) new data structures that can speed up the changes in the visualization resulting from user actions (thus achieving "near-real-time" interactivity);
d) new visualizations suited to representing large data sets.
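Result (b) in the list above, the projection of a multidimensional space onto two dimensions, can be illustrated with a small sketch. The proposal does not name a specific projection technique, so principal component analysis computed by power iteration is used here purely as a stand-in; the function names and the sample data are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(rows):
    """Subtract the per-column mean so the data is centered at the origin."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
            for i in range(dims)]


def dominant_axis(matrix, iterations=500):
    """Power iteration: approximates the eigenvector of the largest eigenvalue."""
    dims = len(matrix)
    v = [float(i + 1) for i in range(dims)]  # arbitrary non-zero start vector
    start_norm = math.sqrt(dot(v, v))
    v = [x / start_norm for x in v]
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # start vector lies in the null space; give up
            break
        v = [x / norm for x in w]
    return v


def project_2d(rows):
    """Project rows onto the two directions of largest variance."""
    centered = center(rows)
    cov = covariance(centered)
    axis1 = dominant_axis(cov)
    eigenvalue1 = dot(axis1, [dot(row, axis1) for row in cov])
    dims = len(cov)
    # Deflation: remove the first component so that the next power
    # iteration converges to the second principal direction.
    deflated = [[cov[i][j] - eigenvalue1 * axis1[i] * axis1[j]
                 for j in range(dims)] for i in range(dims)]
    axis2 = dominant_axis(deflated)
    points = [(dot(r, axis1), dot(r, axis2)) for r in centered]
    return points, (axis1, axis2)


# Hypothetical four-dimensional records.
DATA = [
    [1.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 0.0, 1.0],
    [3.0, 6.0, 1.0, 0.0],
    [4.0, 8.0, 1.0, 1.0],
    [5.0, 10.0, 0.0, 0.0],
]
points, (axis1, axis2) = project_2d(DATA)
for x, y in points:
    print(round(x, 3), round(y, 3))
```

The two-dimensional points can be passed to any plotting component. Power iteration can fail when the start vector happens to be orthogonal to the dominant direction, so a production system would rely on a numerical library rather than on this sketch.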
In the first phase, existing systems and approaches will be compared on a set of real application cases, in order to identify both the shortcomings to be overcome and the positive features to be retained. The results of these comparisons will be reported in a technical report produced in collaboration with the Bologna and Calabria units, which will be the product of the first phase (D3.R1).
This phase will also provide the input to the second phase, which covers both the definition of the system architecture and a set of theoretical results on some of the central problems identified. The Rome unit will focus in particular on the visualization-related aspects of the system. The product of the second phase will be a technical report, produced in collaboration with the Bologna and Calabria units, on the architecture of the integrated data mining and visualization system (D3.R2).
The third and fourth phases will focus on the development (D3.P4) and testing of the system, following the iterative design model typical of user-centered methodologies.
In parallel with the technical verification that the developed software modules work correctly, a well-defined set of usability tests will be prepared and carried out, focusing mainly on the interaction mechanisms offered to the end user and on the visualization modalities available for the data mining module. The user interface will therefore be implemented following a spiral life-cycle model in which at least two versions of the interface are produced: the first to be used for the usability tests, the second obtained by refining the first according to the indications emerging from the tests. Indeed, while in the early phases of the development cycle the exchange between designer and user should aim to produce feedback useful for designing the best solution, later on, when a sufficiently realistic prototype of the final result is available, it becomes possible to assess to what extent the goals of the user and of the organization have been achieved. There are many evaluation techniques, and the choice among them depends on the time and budget constraints of the project. In this respect, it is clear that the time and resources of the project do not allow a complete approach in this phase, i.e., one comprising both empirical and analytical evaluations. The product of this fourth phase will be a technical report on the validation and usability study of the prototypes for clustering, metaquerying, approximate search, and visualization, produced in collaboration with the Bologna and Calabria units (D3.R4).

The research plan will be described in terms of the two main topics of the project in which the research group of Rome is involved, namely, data integration and data mining, and in terms of the construction of the meta-data repository. Within these topics, we will distinguish among the four phases of the project.
With regard to the development of the meta-data repository, the research group of Rome will coordinate the work, which will be carried out in close cooperation among all research groups involved in the project. In the first phase, the methods for representing and manipulating the meta-data needed to produce the specification of the repository will be defined. These will provide the common basis for the methodologies and tools developed within the project. The result of the first phase will be a technical report, developed in collaboration with the other research groups, containing the specification of the meta-data repository (D0.R1). In the second phase, the structure of the meta-data repository will be precisely defined, and the services that it must provide will be determined. The result of the second phase will be a technical report, developed in collaboration with the other research groups, containing the functional architecture of the meta-data repository (D0.R2). In the third phase, the various functionalities of the meta-data repository will be implemented under the supervision of the research group of Rome. The result of this phase will be the meta-data repository itself (D0.P1), developed by all research groups. The fourth phase involves the deployment of the meta-data repository in the experimentation with the tools developed in the preceding phases. No specific result concerning the meta-data repository is foreseen in this phase.
With regard to data integration, the first phase will be concerned with the study and analysis of the new requirements on data integration that become relevant when considering strongly heterogeneous sources, i.e., sources that are both structured (e.g., databases) and semi-structured (e.g., HTML or XML documents). The typical integration aspects will be investigated in this new context. The requirements for the new data representation methods will be studied, taking into account the presence of semi-structured and unstructured sources. The data models for semi-structured data proposed in the literature will be compared with each other in order to characterize their expressive power. Methods for defining and specifying quality parameters for sources (reliability, completeness, redundancy, accuracy, etc.) and for reconciling data coming from different sources will be considered. The existing methods for query rewriting and query answering using views will be analyzed. The results of the first phase will be a research report, developed in collaboration with the other research groups, describing methods and techniques for the automatic extraction, representation and integration of structured and semi-structured data sources (D1.R1), and a survey of methods for query rewriting and query answering using views (D1.R5).
In the second phase, intelligent semi-automatic techniques will be developed for identifying and reconciling the heterogeneity due to the properties of semi-structured data. Such techniques will become an integral part of the methodology for the construction of reconciled views of semi-structured data coming from heterogeneous sources, developed in collaboration with the other research groups. Algorithms will be defined for rewriting queries with respect to a set of materialized views (query rewriting and query answering using views), by extending, modifying and adapting the current approaches to account for the existence of semi-structured and unstructured sources. The result of this phase will be a technical report on the methodology and tools to reconcile data (D1.R11).
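As a rough illustration of what answering a query using a materialized view means, the following minimal sketch evaluates the same query once on the base relations and once on a view only, and checks that the two answers coincide. The relations `customer` and `call`, the view, and all data are hypothetical; the sketch does not reflect the algorithms to be developed in the project, which must also deal with semi-structured sources.

```python
# Toy illustration of query answering using a materialized view.
# All relation names and data are hypothetical.

# Base relations, represented as sets of tuples.
customer = {(1, "Rome"), (2, "Bologna"), (3, "Rome")}  # (cust_id, city)
call = {(1, "2000-01-10"), (1, "2000-02-03"), (3, "2000-01-15")}  # (cust_id, date)


def query_on_base(city):
    """Q(date) :- customer(c, city), call(c, date), evaluated on base relations."""
    return {d for (c1, cty) in customer for (c2, d) in call
            if c1 == c2 and cty == city}


# Materialized view: V(c, city, date) :- customer(c, city), call(c, date).
view = {(c1, cty, d) for (c1, cty) in customer for (c2, d) in call if c1 == c2}


def query_on_view(city):
    """Rewriting Q'(date) :- V(c, city, date); it accesses only the view."""
    return {d for (_, cty, d) in view if cty == city}


rome_dates = query_on_view("Rome")
# The rewriting is equivalent to the original query on this data.
assert rome_dates == query_on_base("Rome")
print(sorted(rome_dates))
```

Here the rewriting is exact because the view retains every attribute the query needs. When views lose information, only a contained (partial) rewriting may exist, which is one reason the problem requires dedicated algorithms.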
The goal of the third phase is the development and implementation of a set of prototypes that realize the functionalities identified by the scientific results of the previous phase. In particular, the research group of Rome will develop a prototype realizing the algorithms for query rewriting and query answering using views and for data reconciliation developed in Phase 2 (D1.P3). Particular attention will be devoted to the integration of the prototype with the tools designed and developed by the other research groups, namely the design tool for the construction of a global view (group of Modena), the tool for the extraction of inter-schema properties (group of Calabria), and the Query Manager for dealing with global queries (group of Modena). To this end, the common meta-data repository represents an essential element of the global architecture of the integration system, since it allows for a simple exchange and reuse of all meta-data produced and used by the various tools.
In the fourth phase, the development of the prototype for the algorithms for query rewriting and query answering using views and for data reconciliation, and its integration with the other tools, will be completed. Using the data sources of Telecom, experiments will be conducted to evaluate the methodologies and tools for data reconciliation developed in the previous phases. The result of the fourth phase will be a technical report on the evaluation and usability study of the prototypes for integration, produced in collaboration with the other research groups (D1.R12).
With regard to data mining, the aim of the Rome research group is the design and development of a prototype of a "user-centered" data mining environment, in which existing tools and techniques will be integrated with new components to support the user in every phase of the knowledge discovery process. The distinguishing features of the environment will be not only its capability to offer the user a homogeneous scenario to interact with, but also its ability to propose new solutions both for existing open problems in the area and for new problems arising from the integration of different techniques and tools. The system design will be preceded by a theoretical study aimed at understanding the fundamental relationship between different information visualizations and knowledge discovery tasks. The system will exhibit the following main functionalities:
1. providing effective visualizations of relevant data to convey specific data properties;
2. offering strategies that help the user form a more precise idea of how to proceed;
3. providing interactive or automatic creation of new visualizations depending upon the nature of the tasks, data, and media;
4. adapting automatically to the type of user, offering the most appropriate visualizations and search strategies;
5. providing smooth integration of existing tools and techniques.
The overall environment and the resulting prototypes will be backed up by the analysis of a suite of existing visualization and data mining techniques and by usability studies with representative user groups, drawn, e.g., from Telecom Italia. The environment will have a multi-layered architecture, bringing together tools with three types of functionality:
1. information visualization (providing multiple visual data mining techniques and multiple visual metaphors);
2. knowledge discovery (providing multiple knowledge discovery techniques); and
3. data management (providing the necessary multi-dimensional structures to store the data).
The first two components will act as two interoperating agents working toward a common goal. For example, the visualization tool could drive the knowledge discovery tool, whose results could in turn be displayed by the visualization tool. The third component will act as a server to the other two. Users will have direct access to all three components through a friendly adaptive interface and will drive the entire interaction.
Main results of the project will be:
a) a model of the visualization domain that formalizes the basic features of the various visual components (e.g., diagrams, icons, pictures); these features will be related to the data domain, in order to determine which visualizations are appropriate for conveying specific data properties;
b) techniques to project the high-dimensional space onto a 2D (or 3D) representation, using additional visual features (e.g., color, orientation) to capture some of the remaining dimensions; the projection should make the patterns of interest in the data emerge;
c) a careful study of data structures and algorithms that can make the process (and the resulting display) as close to real time as possible, either by taking advantage of existing structures on the data or by developing new ones;
d) new visualizations for large data sets.
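To make the idea of projecting many dimensions onto two axes plus one extra visual feature concrete, here is a deliberately simple sketch. It is an assumption for illustration only, not a technique the project commits to: it ranks the dimensions by variance, uses the top two as x and y, and encodes the third as a color value in [0, 1]. The function name `project_with_color` and the sample records are hypothetical.

```python
from statistics import pvariance


def project_with_color(rows):
    """Map each row to (x, y, color).

    x and y are the two highest-variance dimensions; color is the third
    highest-variance dimension, rescaled to the interval [0, 1].
    Requires at least three dimensions per row.
    """
    n_dims = len(rows[0])
    columns = [[row[i] for row in rows] for i in range(n_dims)]
    # Rank dimensions by population variance, largest first.
    ranked = sorted(range(n_dims), key=lambda i: pvariance(columns[i]),
                    reverse=True)
    x_dim, y_dim, c_dim = ranked[:3]
    lo, hi = min(columns[c_dim]), max(columns[c_dim])
    span = hi - lo
    points = []
    for row in rows:
        # A constant dimension carries no information: use a single color.
        color = (row[c_dim] - lo) / span if span > 0 else 0.0
        points.append((row[x_dim], row[y_dim], color))
    return (x_dim, y_dim, c_dim), points


# Hypothetical 4-dimensional records.
records = [
    (1.0, 10.0, 0.5, 100.0),
    (1.1, 20.0, 0.7, 300.0),
    (0.9, 30.0, 0.9, 200.0),
    (1.0, 40.0, 0.1, 400.0),
]
dims, points = project_with_color(records)
print(dims)
```

Choosing axes by variance ignores correlations between dimensions and the scale of each attribute; a real system would more likely use a projection tuned to the patterns of interest, which is precisely what result b) sets out to investigate.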
In the first phase of the project, existing systems and approaches for visualization will be compared on the basis of real test beds, in order to identify their strengths and weaknesses. The results of this comparison will be reported in a technical report produced in collaboration with the research groups of Bologna and of Calabria, which constitutes the result of the first phase (D3.R1).
This phase will also provide the input to the second phase, which will involve developing the architecture of the global environment and carrying out theoretical studies on specific problems identified during the first phase. The research group of Rome will concentrate in particular on the aspects related to visualization. The product of the second phase will be a technical report, developed together with the groups of Bologna and of Calabria, on the architecture of the integrated system for data mining and visualization (D3.R2).
The third and fourth phases will concentrate on the implementation (D3.P4) and testing of the system, following the iterative model typical of user-centered methodologies.
In parallel with the functional testing of the system modules, a set of usability tests will be designed and executed from the very beginning of system development. These tests will concentrate mainly on the interaction mechanisms and the visualization modalities provided by the data mining module. In accordance with the iterative model of software development, we plan to develop at least two versions of the user interface. The first one will be tested with real users, and their feedback will be used to produce a new, improved version. In the early phases of the development cycle, the exchange between designers and users should produce useful feedback for the design of the best solution; later, when a sufficiently realistic prototype of the final result is available, it becomes possible to assess to what extent the objectives of the users and of the organization are met. Many evaluation techniques are available: their choice depends on the time and financial resources of the project. In this respect, the time schedule and the financial resources of the project clearly do not allow for a complete evaluation procedure that includes both analytical and empirical techniques. The result of the fourth phase will be a technical report on the evaluation and usability study of the prototypes for clustering, meta-querying, approximate search and visualization, developed in collaboration with the groups of Bologna and Calabria (D3.R4).

No.	Year of acquisition	Description (Italian text)	Description (English text)
1.	1998	3 MacIntosh Workstation	3 Macintosh workstations
2.	1998	3 PC Pentium	3 Pentium PCs
3.	1999	SUN Sparc station	Sun SPARCstation

Expense item	Cost (M£)	Cost (Euro)	Description (Italian text)	Description (English text)
Inventoriable material	40	20.658	Libri, pubblicazioni, personal computer, software	Books, publications, personal computers, software
Large equipment
Consumables and running costs	20	10.329	Carta, cancelleria, fotocopie, supporti magnetici	Paper, stationery, photocopies, magnetic media
Computing and data processing	10	5.165	Uso di macchine per il trattamento di grandi quantita' di dati	Data processing for very large data sets
Contract personnel	24	12.395	Progetto e sviluppo di tool	Tool design and development
External services	25	12.911	Linee telefoniche per trasmissione dati, supporto e assistenza hardware e software	Phone lines for data communication, hardware and software support and assistance
Travel	58	29.955	Incontri, riunioni, congressi	Project meetings, conferences, etc.
Other	15	7.747	Spese da definire nel corso del progetto, anche in relazione al coordinamento	To be detailed during the project, also in connection with coordination activities

	M£	Euro
Total cost of the Research Unit's programme	192	99.160
Minimum cost needed to ensure that the results can be verified	153	79.018
Available funds (RD)	58	29.955
Obtainable funds (RA)	0
Co-funding requested from MURST	134	69.205
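
The figures above can be cross-checked with a short script. It assumes the fixed conversion rate of 1936.27 lire per euro and reads the dot in the euro column as a thousands separator; the English item labels are renderings of the form's Italian labels, and the helper `to_euro` is introduced here only for illustration.

```python
LIRE_PER_EURO = 1936.27  # fixed lira/euro conversion rate

# Expense items: (millions of lire, euro), as listed in the expense table.
# The English labels are translations of the form's Italian labels.
items = {
    "Inventoriable material": (40, 20658),
    "Consumables and running costs": (20, 10329),
    "Computing and data processing": (10, 5165),
    "Contract personnel": (24, 12395),
    "External services": (25, 12911),
    "Travel": (58, 29955),
    "Other": (15, 7747),
}


def to_euro(million_lire):
    """Convert an amount in millions of lire to whole euro."""
    return round(million_lire * 1_000_000 / LIRE_PER_EURO)


total_ml = sum(ml for ml, _ in items.values())
total_eur = sum(eur for _, eur in items.values())

# Funding side: available funds (RD) plus co-funding requested from MURST.
available_ml, requested_ml = 58, 134

print(total_ml, total_eur)
print(available_ml + requested_ml == total_ml)
```

Under these assumptions, the item costs add up to the stated total, each euro amount matches the converted lire amount, and the available funds plus the requested co-funding cover the total cost.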

Source	Year	Amount available (M£)	Amount available (Euro)	National coordinator	Notes
University
Department	2000	30	15.494		Research contracts
MURST (ex 40%)
CNR
European Union	1999	28	14.461		DWQ, Laurin
Other
TOTAL		58	29.955

Source	Year of application or contract signature	Approval status	Share available for the programme (M£)	Share available for the programme (Euro)	Notes
University
Department
CNR
European Union
Other
TOTAL			0