A Low-latency On-chip Cache Hierarchy for Load-to-use Stall Reduction in GPUs