About me.

Currently, I am a postdoctoral researcher at the Research Center of Cyber Intelligence and Information Security.

I received my Ph.D. in Engineering in Computer Science at Sapienza University of Rome, working in the Web Algorithmics and Data Mining Lab (WADAM) under the supervision of Prof. Aris Anagnostopoulos.

My Ph.D. studies revolved around modeling and mining dynamic processes on large-scale social and information networks. My research lies at the interface of computational methods and social theories, combining them to analyze social media data effectively and to study the human behavioural processes that affect the structure, evolution and lifecycle of social networks.

Main research topics: dynamic data mining, distributed big-data algorithms, influence analysis
Other: spatio-temporal urban mining, information security, blockchains

Previous positions:

  • Research Fellow at the Department of Statistics, Sapienza University of Rome
  • Data Scientist Intern at Adobe Systems, San Jose (CA), USA


Contact information

Mara Sorella
Department of Computer, Control
and Management Engineering
Sapienza University of Rome
Via Ariosto, 25
00185, Rome (Italy)
Room: B113
Phone: +39 06.77274120

Teaching

Here you can find a list of courses and lectures.

Projects

Here you can find a list of my recent projects:

Similarity Search for Dynamic Data Streams

Collaborative filtering is a widely adopted approach in recommender systems: it produces a set of item recommendations for a user based on the users most similar to them. Applying this approach typically requires two things: (1) a measure of similarity (or dissimilarity) between users and (2) scalable algorithms for evaluating these similarities. Though many algorithms assume that fast, black-box access to similar profiles is available, this assumption is not realistic in web-scale applications. Efficient approaches exist for insertion-only data streams, such as locality-sensitive hashing (LSH), which works by storing succinct similarity-preserving representations of sets (fingerprints) and generating candidate pairs by exploiting their locality properties. However, in web applications, user profile information can vary with time: if certain attributes get deleted, a full recomputation of the fingerprints is required. In this project, we initiate the study of scalable locality-sensitive hashing (LSH) for fully dynamic data streams. Our algorithms have little overhead in terms of running time compared to previous LSH approaches for the insertion-only case, and drastically outperform them when deletions occur. We have reason to believe that the algorithm can be successfully applied in real-world applications, as evidenced by its performance in finding Last.fm users with similar musical tastes.
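To make the fingerprint idea concrete, here is a minimal sketch of MinHash, a classic similarity-preserving fingerprint for sets in the insertion-only setting. It is not the algorithm developed in this project; the function names, the hash construction, the fingerprint length and the toy profiles are all assumptions made for illustration.

```python
import hashlib

NUM_HASHES = 128  # fingerprint length (illustrative choice)


def _hash(item: str, seed: int) -> int:
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(profile: set) -> list:
    """Fingerprint of a non-empty set: per seed, the minimum hash over its items."""
    return [min(_hash(x, s) for x in profile) for s in range(NUM_HASHES)]


def insert(fingerprint: list, item: str) -> None:
    """Insertions are cheap: each slot only takes a min with the new hash."""
    for s in range(NUM_HASHES):
        fingerprint[s] = min(fingerprint[s], _hash(item, s))


def estimate_jaccard(fp_a: list, fp_b: list) -> float:
    """The fraction of matching slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(fp_a, fp_b)) / len(fp_a)


# Hypothetical user profiles (sets of artists).
alice = {"radiohead", "portishead", "massive attack", "bjork"}
bob = {"radiohead", "portishead", "massive attack", "metallica"}
fp_alice, fp_bob = minhash(alice), minhash(bob)
# The true Jaccard similarity is 3/5; the estimate fluctuates around it.
print(estimate_jaccard(fp_alice, fp_bob))
```

The sketch also shows why deletions are hard for this kind of fingerprint: if a deleted item held the minimum in some slot, the new minimum cannot be recovered from the fingerprint alone, so the profile has to be rescanned.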

Cylab: Emulation Environment for Cyber Security Analyses

Computer networks are undergoing phenomenal growth, driven by the rapidly increasing number of nodes that constitute them. At the same time, the number of security threats to Internet and intranet networks is constantly growing, and testing and experimenting with cyber defense solutions requires separate test environments that best emulate the complexity of a real system. Such environments support the deployment and monitoring of complex mission-driven network scenarios, as well as cyber security training activities, thus enabling the study of cyber defense strategies. We designed an application built on top of OpenNebula IaaS and deployed on dedicated hardware. It automatically deploys a target networked physical environment via infrastructure-as-code abstractions and supports various cyber experimentation, data collection and analysis tasks.

Fast Scalable k-mer Counting for Large-scale Genomic Sequences using Spark

The introduction of new generation sequencing technologies revolutionized the life sciences by providing an unprecedented amount of data, requiring new algorithmic approaches able to analyze it efficiently. In this paper we present FastKmer, a Spark-based distributed algorithm for the solution of a paradigmatic Bioinformatics problem: the extraction of k-mer statistics from large collections of genomic sequences, with arbitrary values of k. One of the most relevant contributions of FastKmer is a custom Spark data partitioning module that balances the statistics aggregation workload over the nodes of a computing cluster, overcoming the data skew problems that typically arise in genomic sequence mining. We also present the results of a comparative experimental analysis showing that our approach is the fastest currently available among those based on distributed data processing engines, while also exhibiting very good scalability.
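For readers unfamiliar with the problem, the following is a minimal single-machine sketch of k-mer counting in plain Python. It only illustrates what a k-mer statistic is; it is not FastKmer and does not use Spark, and the function name and the toy reads are assumptions.

```python
from collections import Counter
from typing import Iterable


def count_kmers(sequences: Iterable[str], k: int) -> Counter:
    """Count every length-k substring (k-mer) across all sequences."""
    if k <= 0:
        raise ValueError("k must be positive")
    counts: Counter = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k over the sequence, one position at a time.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Hypothetical toy input: two short reads.
reads = ["ACGTAC", "GTACGT"]
stats = count_kmers(reads, k=3)
print(stats.most_common(3))
```

In a distributed setting, each node would count k-mers on its own share of the input and the partial counts would then be aggregated per k-mer. That aggregation step is where an uneven distribution of k-mers can overload some nodes.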

Cultural dynamics

A fundamental open question is understanding the role that influence and selection play in shaping the evolution of socio-cultural systems. Quantifying these forces in real settings is still a big challenge, especially in the large-scale case in which the entire social network between the users may not be known, and only longitudinal data in terms of sizes of cultural groups (e.g., political affiliation, market share, cultural tastes) may be available. We propose an influence and selection model encompassing an explicit characterization of the cultural feature space in the form of a natural equation-based macroscopic model, following an approach by Kempe et al. Our main goal is to estimate edge influence strengths and selection parameters from an observed time series. We perform learning and prediction on real datasets from Last.fm and Wikipedia, achieving good prediction accuracy.

Targeted Interest-Driven Advertising in Cities using Twitter

Targeted advertising is a key characteristic of online as well as traditional-media marketing. However, it is very limited in outdoor advertising, that is, campaigns run by means of billboards in public places. The reason is the lack of information about the interests of the particular passersby, beyond very imprecise, aggregate demographic or traffic estimates. In this work we propose a methodology for performing targeted outdoor advertising by leveraging social media. In particular, we use the Twitter social network to gather information about users' degree of interest in given advertising categories and about the common routes that they follow, characterizing in this way each zone in a given city. Then we use our characterization to recommend physical locations for advertising. Given an advertisement category, we estimate the most promising areas for placing an ad so as to maximize its targeted effectiveness. We show that our approach selects better advertising locations than a baseline reflecting a current ad placement policy.

Entweety Graph

Dynamic visualization of a time-evolving co-mentioning graph of entities in a Twitter stream using Python and d3.js.
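As a rough illustration of the underlying data structure, the sketch below maintains a weighted co-mentioning graph as tweets arrive. It assumes entities have already been extracted from each tweet; the function name and the sample stream are hypothetical and not taken from the project's code.

```python
from collections import Counter
from itertools import combinations


def update_comention_graph(edges: Counter, entities: list) -> None:
    """Add one tweet: every unordered pair of its entities gains one unit of weight."""
    unique = sorted(set(entities))  # sorting makes (a, b) and (b, a) share one key
    for pair in combinations(unique, 2):
        edges[pair] += 1


# Hypothetical stream of per-tweet entity lists.
edges: Counter = Counter()
stream = [
    ["@nasa", "@esa"],
    ["@nasa", "@esa", "@spacex"],
    ["@spacex"],
]
for tweet_entities in stream:
    update_comention_graph(edges, tweet_entities)

for (a, b), weight in edges.items():
    print(a, b, weight)
```

A visualization layer could redraw the graph from the edge weights after each update, or after each batch of tweets.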

FollowThings

This project aims at developing a full-stack application for tracking a (possibly event-related) query on Twitter in real time and mining the stream of matched tweets for top keywords live, so that the end user can grasp how interest in the query evolves over time. The implementation is based on a Storm cluster for stream processing, equipped with a Redis queuing system, plus other technologies such as a Node.js web server with a RESTful interface that dynamically builds a web page for the end user, displaying the aggregated rankings provided by the cluster in real time.
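The core idea of ranking keywords over a live stream can be sketched in a single process as follows. This is not the Storm/Redis implementation described above; the class name, the tokenizer, the stopword list and the window size are assumptions chosen for brevity.

```python
import re
from collections import Counter, deque

STOPWORDS = {"the", "a", "an", "is", "in", "of", "to", "and", "for", "on"}


class TopKeywords:
    """Keep keyword counts over the most recent `window` tweets."""

    def __init__(self, window: int = 1000) -> None:
        self.window = window
        self.recent = deque()    # tokenized tweets, oldest first
        self.counts = Counter()  # keyword -> occurrences within the window

    def add(self, text: str) -> None:
        words = [w for w in re.findall(r"[a-z0-9#@_]+", text.lower())
                 if w not in STOPWORDS]
        self.recent.append(words)
        self.counts.update(words)
        if len(self.recent) > self.window:
            # Evict the oldest tweet and forget its words.
            self.counts.subtract(self.recent.popleft())
            self.counts += Counter()  # drops keys whose count fell to zero

    def top(self, k: int = 3) -> list:
        return self.counts.most_common(k)


# Hypothetical matched tweets for a query about a marathon.
tracker = TopKeywords(window=2)
for tweet in ["Rome marathon starts", "marathon in Rome", "marathon results"]:
    tracker.add(tweet)
print(tracker.top(3))
```

With a window of two tweets, the first tweet has already been evicted by the time the ranking is printed, so its words no longer contribute to the counts.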

Wadam Dataset Repository

An online search engine for the WADAM dataset repository, curated by the Web Algorithmics and Data Mining research group. Based on the awesome Node.js and MongoDB.

remote-task-manager

A Java-based application to deploy, run, monitor and gather the results of remote tasks (experiments!) executing on cloud or privately owned machines.

Publications

Journal

  • M. Bury, C. Schwiegelshohn and M. Sorella
    "Similarity Search for Dynamic Data Streams"
    IEEE Transactions on Knowledge and Data Engineering (link), May 2019.

  • U. F. Petrillo, M. Sorella, G. Cattaneo, R. Giancarlo and S. E. Rombo
    "Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics"
    BMC Bioinformatics (link), April 2019.

  • A. Anagnostopoulos, F. Petroni and M. Sorella
    "Targeted Interest-Driven Advertising in Cities using Twitter"
    Knowledge Discovery and Data Mining (link), July 2017.

Conference

  • Florin Dragos Tanasache, Mara Sorella, Silvia Bonomi, Raniero Rapone, and Davide Meacci
    "Building an Emulation Environment for Cyber Security Analyses of Complex Networked Systems"
    Proc. of the 20th International Conference on Distributed Computing and Networking (ICDCN 2019), Bangalore, India, January 2019.

  • M. Bury, C. Schwiegelshohn and M. Sorella
    "Sketch 'Em All: Fast Approximate Similarity Search for Dynamic Data Streams"
    Proc. of the 11th ACM International Conference on Web Search and Data Mining (WSDM 2018), Los Angeles (CA), USA, February 2018.

  • A. Anagnostopoulos, F. Petroni and M. Sorella
    "Targeted Interest-Driven Advertising in Cities using Twitter"
    Proc. of the 10th International AAAI Conference on Web and Social Media (ICWSM 2016) (pdf), Cologne, Germany. May 2016.

  • A. Anagnostopoulos and M. Sorella
    "Learning a Macroscopic Model of Cultural Dynamics"
    Proc. of the 2015 IEEE International Conference on Data Mining (ICDM 2015) (pdf, extended version), Atlantic City (NJ), USA. November 2015.

  • R. Baldoni, S. Bonomi, G.A. Di Luna, L. Montanari, and M. Sorella
    "Understanding (Mis)Information Spreading for Improving Corporate Network Trustworthiness"
    Proc. of the 14th European Workshop on Dependable Computing (EWDC 2013), Coimbra, Portugal. May 2013.

Selected Presentations

  • Mining Dynamics of User Preferences in Complex Networks (slides)
    Final defense session (Ph.D. Program in Computer Engineering, XXIX cycle), DIAG, Via Ariosto 25, Rome, Italy. February 2018.

  • Learning a Macroscopic Model of Cultural Dynamics (slides)
    2015 IEEE International Conference on Data Mining (ICDM 2015), Atlantic City, NJ, USA. November 2015.

  • COLITA: Collaborative Interest-driven targeted advertising (slides)
    International Conference on Computational Social Science (ICCSS 2015), Helsinki, Finland. June 2015.

  • Understanding (Mis)Information Spreading for Improving Corporate Network Trustworthiness (slides)
    M.Sc. dissertation @ Faculty of Engineering (Sapienza University of Rome), Rome, Italy. July 2013.

Personal

While offline, I enjoy playing piano and guitar.
I also like making my own 100% organic, ecological detergents and make-up, and maintaining my freshwater aquarium (and the wild beasts living therein).
I am a proud member of the Ninux Wireless Community Network.
At Ninux, we enjoy building a large wireless network by putting directional antennas like this on our rooftops.


