Software Transactional Memory (STM) has emerged as a powerful programming paradigm for concurrent applications. It allows encapsulating accesses to data shared across concurrent threads within transactions, thus relieving the programmer from explicitly coding synchronization mechanisms. On the other hand, synchronization transparency must not come at the expense of performance. Hence, STM-based systems must be enriched with mechanisms providing optimized run-time efficiency. A core issue to be tackled is determining the optimal level of concurrency (number of threads) to be employed for running the application on top of the STM layer. Too low a level of concurrency can hamper parallelism, while overdimensioning the concurrency level may give rise to thrashing phenomena caused by excessive data contention and the consequent transaction aborts.
In this thesis we propose a set of techniques to build “application-specific” performance models that dynamically tune the level of concurrency to the value best suited to the specific execution phase of the application. We present three different approaches: a) one based on a pure Machine Learning (ML) model, which does not require detailed knowledge of the application internals to predict the optimal concurrency level; b) one based on a parametric analytical performance model customized for a specific application/platform through regression analysis, which, compared to the previous one, requires a lighter training phase; and c) one based on a combination of analytical and Machine Learning techniques, which combines the strengths of the previous two approaches: it reduces the training time of pure machine learning methods while avoiding the approximation errors typically affecting pure analytical approaches. Hence it allows very fast construction of highly reliable performance models, which can be promptly and effectively exploited for optimizing actual application runs.
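The hybrid approach c) can be illustrated, in spirit, by a minimal sketch: a toy parametric analytical throughput model is customized via regression over a few profiled samples, and a simple ML corrector (here, a nearest-neighbour lookup) is trained on the analytical model's residuals. The model form, the parameter `alpha`, and all function names are hypothetical illustrations, not the thesis' actual equations.

```python
# Hypothetical sketch: analytical model fit by regression + ML residual corrector.

def analytical_throughput(threads, contention, alpha):
    # Toy analytical model: parallel speedup damped by contention-induced
    # aborts; alpha is the application-specific parameter fit by regression.
    return threads / (1.0 + alpha * contention * (threads - 1))

def fit_alpha(samples):
    # One-dimensional least-squares fit of alpha via coarse grid search
    # (stdlib only); samples are (threads, contention, observed_throughput).
    best, best_err = 0.0, float("inf")
    for i in range(1, 1000):
        a = i / 1000.0
        err = sum((analytical_throughput(t, c, a) - y) ** 2
                  for t, c, y in samples)
        if err < best_err:
            best, best_err = a, err
    return best

def residual_corrector(samples, alpha):
    # Nearest-neighbour "ML" corrector over the analytical model's residuals:
    # the correction shrinks the approximation error of the analytical part.
    table = [((t, c), y - analytical_throughput(t, c, alpha))
             for t, c, y in samples]

    def correct(t, c):
        _, r = min(table, key=lambda e: (e[0][0] - t) ** 2 + (e[0][1] - c) ** 2)
        return analytical_throughput(t, c, alpha) + r

    return correct
```

The division of labour is the point: the regression-fitted analytical part needs only a handful of samples, while the residual learner only has to capture what the analytical form misses, which is why the combined model can train faster than a pure ML predictor.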
We also present real implementations of concurrency regulation architectures, based on our performance prediction approaches, which have been integrated within the open source TinySTM package, together with experimental data related to runs of application profiles taken from the STAMP benchmark suite demonstrating the effectiveness of our proposals. The experimental data confirm that our self-adjusting concurrency schemes consistently provide optimal performance, thus avoiding the performance loss phases caused by an ill-suited selection of the number of concurrent threads and associated with the phenomena depicted above.
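The core control loop of such a concurrency regulation architecture can be sketched as follows. At each sampling period the predictor is evaluated over the admissible thread counts and the best one is activated. `predict_throughput` and `set_active_threads` are assumed hooks standing in for the predictor and the thread-activation machinery; they are illustrative names, not TinySTM APIs.

```python
# Hypothetical sketch of a concurrency-regulation step: pick the thread
# count whose predicted throughput is highest for the current workload.

def regulate(predict_throughput, set_active_threads, workload_features,
             max_threads):
    # Evaluate the performance predictor across all candidate levels of
    # concurrency and activate the argmax.
    best_k = max(range(1, max_threads + 1),
                 key=lambda k: predict_throughput(k, workload_features))
    set_active_threads(best_k)
    return best_k
```

In a real deployment this step would run periodically, re-sampling the workload features so the selected concurrency level tracks the application's execution phases.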
Moreover, we present a mechanism that dynamically shrinks or enlarges the set of input features to be exploited by the performance predictors. This allows for tuning the concurrency level while also minimizing the overhead of input-feature sampling, given that the cardinality of the input-feature set is always tuned to the minimum value that still guarantees a reliable workload characterization. We also present a fully fledged implementation of this solution, again within the TinySTM open source framework, and we provide the results of an experimental study relying on the STAMP benchmark suite, which shows a significant reduction of the application execution time with respect to proposals based on static feature selection.