Modern Internet services are increasingly structured according to a three-tier, and more generally multi-tier, system organization, which allows the logical decomposition of applications to be reflected at both the software and the hardware level. Although partitioning an application into multiple tiers offers the potential for high modularity and flexibility, the multiplicity and diversity of the components involved, together with their interdependencies, make reliability a complex issue to tackle. For example, in classical client-server environments, database systems were the reliability backbone of mission-critical services, ensuring through the notion of atomic transactions that the state of business applications evolved consistently. However, the fault-tolerance capabilities provided by transactional components and, in a broader sense, by traditional approaches to reliability address only issues confined to specific subsystems involved in the end-to-end interaction. Hence, they cannot handle the wide spectrum of failure scenarios that can arise along the whole chain of components constituting a multi-tier system.
The design of reliability solutions for Internet services is made even more challenging by the open, heterogeneous and inherently asynchronous nature of the Internet itself, which dramatically reduces the ability to monitor and control the distributed components involved in a multi-tier application. Further, the global access enabled by the Internet and the widespread diffusion of complex services have increased the pressure to achieve high scalability and to minimize response times. This has imposed stringent performance requirements on the underlying reliability mechanisms.
Providing end-to-end reliability under these constraints is precisely the focus of this thesis. Specifically, we introduce innovative protocols ensuring the e-Transaction (exactly-once Transaction) guarantees, a recent formalization of desirable end-to-end reliability properties for multi-tier systems in the presence of crash failures. These protocols advance the state of the art in two directions. From a practical perspective, they achieve unparalleled scalability levels and exhibit very limited overhead, which makes them particularly attractive for emerging large-scale service delivery platforms. From a theoretical standpoint, our solutions can cope with purely asynchronous systems, where no assumption can be made about the accuracy of the failure detection mechanism.
As we will show, some of the building blocks underlying these fault-tolerant protocols can also be used to construct distributed protocols that handle a more general class of failures, which we refer to as “performance failures”. These failures model situations of reduced system responsiveness caused by crashes as well as by overload or congestion at some component.