|[ APLAS 2017 ]||
Black-box equivalence checking across compiler optimizations
The fundamental problem of checking equivalence of two computer programs is as old as computing itself. The problem is undecidable. This paper presents an algorithm that attempts to compute equivalence across end-to-end optimizations produced by GCC, LLVM, ICC and CompCert compilers. The experiments reported in this paper are exhaustive and first of its kind and provide evidence on the robustness and limitations of our algorithm. The paper also provides insight into the power of modern-day compilers.
|[ HVC 2017 ]||
Modeling undefined behaviour semantics for checking equivalence across compiler optimizations
Undefined behaviour semantics in programming languages are typically introduced for better optimization opportunity. Indeed we find that C compilers make extensive use of undefined behaviour semantics to produce faster code. Yet, previous work on translation validation has ignored such optimizations that rely on undefined behaviour. Section 2 of this paper contains a nice example explaining the subtlety of the problem. Our solution which involves extending the simulation relation with assumptions that remove undefined-behaviour seems straight-forward in retrospect. Our alias analysis algorithm for LLVM and assembly code is another important contribution of this work. The experimental results show the importance of different types of undefined behaviour on compiler optimization.
|[ TPDS 2017 ]||
The Unicorn Runtime: Efficient distributed shared memory programming for hybrid CPU-GPU clusters
How can a programmer harness the combined computing power of CPUs and GPUs in a computing cluster without dealing with all the nitty gritties of the underlying system and its optimizations? The paper presents an intuitive programming model (which is perhaps not entirely new). The contribution of the paper is in algorithms to automatically map a program written in this programming model on a cluster of CPUs and GPUs through compile-time and run-time support in a system called Unicorn.
|[ SOSP 2013 ]||
Fast Dynamic Binary Translation for the Kernel [ talk video ]
Dynamic binary translation (DBT) is a technique with a wide variety of applications; innovative ways to employ DBT appear almost regularly in the research literature. This paper explains how to perform DBT for an OS kernel efficiently even in the presence of high rates of interrupt and exception activity. The capability of turning DBT on and off during system execution is quite cool, and did not exist in any previous DBT system. The orchestration of translated instructions in the processor's instruction cache to improve the performance of the system (i.e., the translated system can be faster than the native system) is quite interesting. We use our DBT implementation to monitor the sharing behaviour of the Linux kernel.
This was also the first SOSP paper ever from India.
|[ ASPLOS 2013 ]||
Efficient Virtualization on Embedded Power Architecture Platforms
I quote one of our ASPLOS 2013 reviewers:
---begin quote--- It is almost like asking, how would the universe differ if the fine
structure constant had a value of 0.00829 instead of 0.00729.
Answer: a lot!
For this reason, I think the paper is a very good fit for ASPLOS. Even
if architects do their best to think carefully about ramifications of
their choices, this paper is an eye opener into a world of, probably,
unforeseen consequences. As such, like it very much. And I think it should
be a must-read for current and future architects (and not just narrowly
from a virtualization point of view).
It is almost like asking, how would the universe differ if the fine structure constant had a value of 0.00829 instead of 0.00729. Answer: a lot!
For this reason, I think the paper is a very good fit for ASPLOS. Even if architects do their best to think carefully about ramifications of their choices, this paper is an eye opener into a world of, probably, unforeseen consequences. As such, like it very much. And I think it should be a must-read for current and future architects (and not just narrowly from a virtualization point of view).
Thank you dear (anonymous) reviewer for liking our work!
|[ OSDI 2008 ]||
Binary Translation Using Peephole Superoptimizers
Can we automatically infer translations from one architecture to another using synthesis and superoptimization techniques? This paper demonstrates such capability for PowerPC to x86 translation. The performance of the translated code is better than the widely-used commercial Rosetta translator (shipped with Apple's OS X) at that time. The paper provides techniques to dynamically map the state of one machine to the state of another machine in tandem with synthesis-based code generation, for better overall quality of the generated code.
|[ ASPLOS 2006 ]||
Automatic Generation of Peephole Superoptimizers
How to automatically generate a peephole optimizer for x86 using brute-force superoptimization? This paper has received significant attention since it was published. Our current research is targetted towards generalizing and scaling the ideas presented in this paper.
|[ WINET 2006 ]||
Energy Efficiency and Capacity of TCP Traffic in Multi-Hop Wireless Networks
How does the TCP protocol interact with the underlying congestion-control algorithms in a multi-hop wireless network? What are the implications on throughput (capacity) and energy efficiency?
|[ FAST 2004 ]||
CAR: Clock with Adaptive Replacement
This paper extends the well-known CLOCK algorithm for cache replacement (used in most operating systems) with ideas presented in previous work on Adaptive Replacement Cache (ARC). As mentioned on ARC's wikipedia page, variants of this approach have been implemented and deployed in several systems. The primary advantage of CAR (CLOCK with Adaptive Replacement) over ARC is that it gets rid of the LRU lock, while retaining all the benefits of ARC.