Combining Recency of Information with Selective Random and a Victim Cache in Last-Level Caches

¿Quieres contarnos tu reto? Pincha aquí y te ayudamos a encontrar una solución

Autores UPV

Valero Bresó Alejandro, Sahuquillo Borrás Julio, Petit Martí Salvador Vicente, López Rodríguez Pedro Juan, Duato Marín José Francisco

Año

2012

Revista

ACM Transactions on Architecture and Code Optimization

Abstract

Memory latency has become an important performance bottleneck in current microprocessors. This problem aggravates as the number of cores sharing the same memory controller increases. To palliate this problem, a common solution is to implement cache hierarchies with large or huge Last-Level Cache (LLC) organizations. LLC memories are implemented with a high number of ways (e.g., 16) to reduce conflict misses. Typically, caches have implemented the LRU algorithm to exploit temporal locality, but its performance goes away from the optimal as the number of ways increases. In addition, the implementation of a strict LRU algorithm is costly in terms of area and power. This article focuses on a family of low-cost replacement strategies, whose implementation scales with the number of ways while maintaining the performance. The proposed strategies track the accessing order for just a few blocks, which cannot be replaced. The victim is randomly selected among those blocks exhibiting poor locality. Although, in general, the random policy helps improving the performance, in some applications the scheme fails with respect to the LRU policy leading to performance degradation. This drawback can be overcome by the addition of a small victim cache of the large LLC. Experimental results show that, using the best version of the family without victim cache, MPKI reduction falls in between 10% and 11% compared to a set of the most representative state-of-the-art algorithms, whereas the reduction grows up to 22% with respect to LRU. The proposal with victim cache achieves speedup improvements, on average, by 4% compared to LRU. In addition, it reduces dynamic energy, on average, up to 8%. Finally, compared to the studied algorithms, hardware complexity is largely reduced by the baseline algorithm of the family.

Más Información