In the mid-2000s, computer architecture underwent a fundamental change of course because techniques such as frequency scaling and instruction-level parallelism were delivering rapidly diminishing returns. Since then, scaling up thread-level parallelism through increasingly parallel multicore processors has become the primary driver of performance gains, exacerbating the pre-existing problem of the Memory Wall.
In response, cache and memory architectures have grown more complex while still presenting a shared view of memory to software. As trends such as increasing parallelism and heterogeneity continue apace, the memory hierarchy will account for a growing share of overall system performance.
Since the mid-2000s, thread-level parallelism has increased across almost all processor designs, bringing the problem of programmability into sharper focus. One of the most promising developments of the past fifteen years has been task-based programming models. Such models ease programming for the user at a level of abstraction that allows the runtime system layer to optimise execution expertly for the underlying hardware.
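As a minimal illustration, not drawn from the thesis itself, the sketch below uses standard OpenMP task syntax (the specific model and runtime targeted by the thesis are not named here): the programmer declares tasks and the data each task touches, and the runtime system is then free to decide when and where each task runs.

    #include <stdio.h>

    #define N      (4 * 1024)
    #define BLOCK  1024

    /* Illustrative sketch only: a blocked kernel expressed as OpenMP tasks.
       The programmer declares the data each task reads and writes; the
       runtime builds the dependence graph and chooses when, and on which
       core, each task executes. */
    static void scale_block(double *block, int len)
    {
        for (int i = 0; i < len; i++)
            block[i] *= 2.0;
    }

    int main(void)
    {
        static double data[N];

        #pragma omp parallel
        #pragma omp single
        for (int b = 0; b < N; b += BLOCK) {
            /* depend() exposes the data this task touches, so the runtime
               can order tasks correctly and schedule them close to the
               caches that already hold that data. */
            #pragma omp task depend(inout: data[b:BLOCK])
            scale_block(&data[b], BLOCK);
        }

        printf("data[0] = %f\n", data[0]);
        return 0;
    }

Because the runtime sees this dependence information, it can, for example, place a consumer task on the core whose private cache already holds the producer's output, which is the kind of locality-aware decision the thesis exploits.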
The main goal of this thesis is to exploit information available in task-based programming models to drive optimisations in the memory hierarchy through a hardware/software co-design approach. As shared-memory system architectures scale up in core count, and therefore in network diameter, data movement becomes the primary factor affecting power and performance.
The first contribution of this thesis studies the ability of a task-based programming model to constrain data movement in a real, very large shared memory system. It characterises directly and in detail the effectiveness of the programming model’s runtime system at minimising data traffic in the hardware. The analysis demonstrates that the runtime system can maximise locality between tasks and the data they use, thus minimising the traffic in the cache coherent interconnect.
The second and third contributions of this thesis investigate hardware/software co-design proposals to increase efficiency within the on-chip memory hierarchy. Both contributions exploit information already captured in existing task-based programming models, communicate it from the runtime system to the hardware, and use it there to drive power, performance and area improvements in the memory hierarchy. A simulator-based approach is used to model and analyse both proposals.
Scaling cache coherence among a growing number of private caches is a crucial issue in computer architecture as core counts continue to increase, and improving that scalability is the topic of the second contribution. It demonstrates that a runtime-system and hardware co-design approach can dramatically reduce capacity demand on the coherence directory, a central obstacle to scaling coherence among private caches.
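For background only, and independent of the mechanism proposed in the thesis, the toy structure below shows why directory capacity is a concern: a textbook full-map directory keeps one sharer bit per core for every tracked block, so its storage grows with both the core count and the private-cache capacity it must cover.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative background only: a textbook full-map (bit-vector)
       directory entry. For every tracked cache block the directory stores
       one presence bit per core plus a state field. */
    #define NUM_CORES   64
    #define BLOCK_BYTES 64

    struct dir_entry {
        uint64_t sharers;   /* one presence bit per core (NUM_CORES <= 64) */
        uint8_t  state;     /* e.g. Invalid / Shared / Modified            */
    };

    int main(void)
    {
        /* Rough per-block storage overhead of this naive directory relative
           to the 64-byte data block it tracks; schemes that avoid tracking
           blocks used privately shrink this cost. */
        double overhead = (double)sizeof(struct dir_entry) / BLOCK_BYTES;
        printf("directory overhead per %dB block: %.1f%%\n",
               BLOCK_BYTES, overhead * 100.0);
        return 0;
    }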
Non-uniform cache access (NUCA) shared caches are also growing in size and in their share of on-chip resources, as they are the last line of defence against costly off-chip memory accesses. The third contribution focuses on optimising such NUCA caches to make them more effective at mitigating the bottleneck between computation and memory. It shows that a runtime-system and hardware co-design approach can successfully reduce the network distance costs in a shared NUCA cache.
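Again purely as background, and not a description of the thesis's proposal, the sketch below shows where the distance cost arises in a simple static NUCA organisation: a block's home bank is derived from its address, and an access pays the mesh hops between the requesting tile and that bank.

    #include <stdio.h>

    /* Illustrative background only: in a static NUCA organisation the shared
       cache is split into banks spread across a mesh of tiles, and a block's
       home bank is derived from its address. Access latency then includes the
       network hops between the requesting tile and that home bank. */
    #define MESH_DIM    4                  /* hypothetical 4x4 mesh of tiles */
    #define NUM_BANKS   (MESH_DIM * MESH_DIM)
    #define BLOCK_BITS  6                  /* 64-byte cache blocks           */

    static int home_bank(unsigned long addr)
    {
        return (int)((addr >> BLOCK_BITS) % NUM_BANKS);  /* address-interleaved */
    }

    static int hops(int tile_a, int tile_b)
    {
        /* Manhattan distance between two tiles on the mesh. */
        int ax = tile_a % MESH_DIM, ay = tile_a / MESH_DIM;
        int bx = tile_b % MESH_DIM, by = tile_b / MESH_DIM;
        return (ax > bx ? ax - bx : bx - ax) + (ay > by ? ay - by : by - ay);
    }

    int main(void)
    {
        unsigned long addr = 0x12345678UL;
        int requester = 0;                 /* core on tile 0 */
        int bank = home_bank(addr);
        printf("block 0x%lx maps to bank %d, %d hops from tile %d\n",
               addr, bank, hops(requester, bank), requester);
        return 0;
    }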
Together, the three contributions of this thesis demonstrate the potential of task-based programming models to address key elements of scalability in the memory hierarchy of future systems, at the level of private caches, shared caches and main memory.