Simple computing architectures are a thing of the past. Moore’s law is the empirical rule that has predicted the development of electronics for over 40 years, so accurately that it has laid the foundations of the "roadmap" followed by most semiconductor manufacturers. Until a few decades ago, the evolution of computer architecture was driven mainly by growth in operating frequency, namely the speed of the processor. However, in 2003 a gap opened between the performance achieved by processors and the performance expected from Moore’s law, diverting computer progress onto other paths. Moore’s law itself is still sound: the ever-increasing availability of transistors has been exploited to implement more sophisticated architectures, able to take advantage of refined capabilities rather than simple brute force.
Nowadays, processors are based on superscalar architectures and out-of-order execution engines. These not only achieve more instructions per clock (IPC) than scalar solutions, but also optimize code execution through further mechanisms such as speculative execution.
Besides these core improvements, companies have taken the road toward parallel processing. This led to multi-core processors, which include several computing cores on the same chip. However, the computing power of processors and the speed of memory did not grow hand in hand. Accessing memory still represents a bottleneck during computation because CPU cores process data far faster than memory can deliver it. To overcome this speed limitation, memory elements have been implemented directly on chip, so that communication with the processing units is subject to lower latency. These memories are known as cache memories, and they are so pervasive that their structure has been further enhanced with several levels, resulting, along with the main memory, in a sophisticated hierarchy.
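The benefit of such a hierarchy can be illustrated with the usual average-access-time estimate, in which each level's miss penalty is the average access time of the level behind it. The sketch below is only an illustration: the function name and all latencies and hit rates are hypothetical values chosen for the example, not measurements of any real processor.

```python
def average_access_time(levels, memory_latency):
    """Estimate the average memory access time of a cache hierarchy.

    levels: list of (hit_latency_cycles, hit_rate) tuples, ordered from the
            cache closest to the core to the one farthest from it.
    memory_latency: main-memory latency in cycles.
    """
    # Work backwards from main memory: the miss penalty of each level is
    # the average access time of everything behind it.
    time_below = memory_latency
    for hit_latency, hit_rate in reversed(levels):
        time_below = hit_latency + (1.0 - hit_rate) * time_below
    return time_below


if __name__ == "__main__":
    # Hypothetical numbers: 4-cycle L1 with 90% hits, 12-cycle L2 with
    # 80% hits, 200-cycle main memory.
    hierarchy = [(4, 0.9), (12, 0.8)]
    print(f"with caches:    {average_access_time(hierarchy, 200):.1f} cycles")
    print(f"without caches: {average_access_time([], 200):.1f} cycles")
```

Under these assumed figures the average access costs roughly 9 cycles instead of 200, which is why even modest hit rates at each level hide most of the main-memory latency.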
The high number of cores sharing memory in a single system ran into another memory issue. The memory cannot easily handle concurrent requests from all the cores, and thus becomes the main performance bottleneck in several scenarios. Non-Uniform Memory Access (NUMA) systems were born from the need to cope with this limitation. Such systems are formed by a set of cooperating nodes that share computational power and memory resources to carry out advanced system management and problem resolution.
Heterogeneous computing refers to systems that take advantage of dedicated cores to carry out specific tasks. An example of such a solution is the combined work of CPUs and GPUs. They are suited to different kinds of calculations, and mixing their capabilities allows achieving several goals such as efficiency, higher performance and lower power consumption. In such a complex world, modern software, for its part, tries to benefit as much as possible from the underlying hardware facilities without exposing too many hardware-level details to developers. Yet how is it possible to find out the reason for a program's behaviour when it does not act as expected? Simple: use a profiler!
Profilers are tools specially designed to observe the execution of an application, or of the entire system, in order to create a profile. Such a profile holds the information gathered during the investigation and can be fed to external tools for further analysis. Most profilers are based on software techniques which, though capable of revealing a lot of execution information, can only observe high-level events and also incur a significant overhead.
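As a small illustration of a software-based profiler, the sketch below uses Python's standard `cProfile` and `pstats` modules to collect a profile of a function call and render it as text. The workload `slow_sum` and the helper `profile_call` are hypothetical names introduced only for this example; they are not part of any tool discussed here.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical workload: sum of squares computed in a plain loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)

    # Render the collected profile, sorted by cumulative time,
    # limited to the five most expensive entries.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


if __name__ == "__main__":
    value, report = profile_call(slow_sum, 100_000)
    print(value)
    print(report)
```

The profile could instead be written to a file with `profiler.dump_stats(path)` and handed to an external viewer. Note that this kind of profiler only sees function-level events and adds overhead to every call it records, which is the limitation described above.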
Most modern processors include in their architecture specialized elements used to gather information about what is going on at the hardware level during code execution. Such elements are known as Performance Monitor Units, and they make it possible to understand the reasons behind several issues that may not be directly recognizable at a higher level. Enhancing profiling tools with this support may lead to low-overhead and more transparent software analysis.