Abstract
Events that take 10s to 100s of ns like cache misses increasingly cause CPU stalls. However, hiding the latency of these events is challenging: hardware mechanisms suffer from the lack of flexibility, whereas prior software mechanisms fall short due to large overhead and limited event visibility. In this paper, we argue that with a combination of two emerging techniques - light-weight coroutines and sample-based profiling, hiding these events in software is within reach.