Gigahorse: Thorough, Declarative Decompilation of Smart Contracts
Neville Grech
University of Athens and University of Malta Greece and Malta
me@nevillegrech.com
Lexi Brent
The University of Sydney Australia
lexi.brent@sydney.edu.au
Bernhard Scholz
The University of Sydney Australia
bernhard.scholz@sydney.edu.au
Yannis Smaragdakis
University of Athens Greece
yannis@smaragd.org
Abstract—The rise of smart contracts—autonomous applications running on blockchains— as led to a growing number of threats, necessitating sophisticated program analysis. However, smart contracts, which transact valuable tokens and cryptocurrencies, are compiled to very low-level bytecode. This bytecode is the ultimate semantics and means of enforcement of the contract. We present the Gigahorse toolchain. At its core is a reverse compiler (i.e., a decompiler) that decompiles smart contracts from Ethereum Virtual Machine (EVM) bytecode into a highlevel 3-address code representation. The new intermediate representation of smart contracts makes implicit data- and control flow dependencies of the EVM bytecode explicit. Decompilation obviates the need for a contract’s source and allows the analysis of both new and deployed contracts.
Gigahorse advances the state of the art on several fronts. It gives the highest analysis precision and completeness among decompilers for Ethereum smart contracts—e.g., Gigahorse can decompile over 99.98% of deployed contracts, compared to 88% for the recently-published Vandal decompiler and under 50% for the state-of-the-practice Porosity decompiler. Importantly, Gigahorse offers a full-featured toolchain for further analyses (and a “batteries included” approach, with multiple clients already implemented), together with the highest performance and scalability. Key to these improvements is Gigahorse’s use of a declarative, logic-based specification, which allows high-level insights to inform low-level decompilation. Index Terms—Ethereum, Blockchain, Decompilation, Program
Analysis, Security
INTRODUCTION
Distributed blockchain platforms have captured the imagination of scientists and the public alike. Blockchain technology offers decentralized consensus mechanisms for any transactions that, in the past, would have required a trusted centralized authority. One of the most evident embodiments of this vision is the development of smart contracts: Turing-complete autonomous agents that run on distributed blockchains, such as Ethereum or Cardano. A smart contract may, for instance, implement a lending policy, a charging scheme for digital goods, an auction, the full set of operations of a bank, and virtually any other logic governing multi-party transactions. Ethereum is the best-known, most popular blockchain platform that supports full-featured smart contracts. (As of this writing, the Ethereum cryptocurrency market capitalization is $13B.) Ethereum offers an excellent demonstration of the potential for smart contracts, as well as their technical challenges. Developers typically write smart contracts in a high-level language called Solidity, which is compiled into immutable low-level Ethereum VM (EVM) bytecode for the blockchain’s distributed virtual machine.
The open nature of smart contracts, as well as their role in handling high-value currency, raise the need for thorough contract analysis and validation. This task is hindered, however, by the low-level stack-based design of the EVM bytecode that has hardly any abstractions as found as in other languages, such as Java’s virtual machine. For example, there is no notion of functions or calls—a compiler that translates to EVM bytecode needs to invent its own conventions for implementing local calls over the stack. It is telling that recent research [1], [9], [22], [34] has focused on decompiling smart contracts into a higher-level representation, before applying any further (usually securityoriented) analysis. Past decompilation efforts have been, at best, incomplete. The best-known decompiler (largely defining the state-of-the-practice) is Porosity [33], which in our study fails to yield results for 50% of deployed contracts of all smart contracts on the block chain. Upcoming research tools including the Vandal decompiler [35] still fail to decompile a significant portion of real contracts (around 12%) due to the complex task of converting EVM’s stack-based operations to a register-based intermediate representation.
Such difficulties are much more than technicalities of the platform or idiosyncrasies of existing tools. Any current or future smart contract platform is likely to employ virtual machines that are low-level. The designs of these virtual machines are optimized for massively replicated execution of smart contracts. The bytecode effectively represents an assembly language designed for efficient execution and compact program representation, since the bytecode must be stored on the blockchain. Hence, the bytecode for smart contracts will never be optimal for human readability or reverse compilation. For effective decompilation, significant program comprehension of the bytecode is necessary. A decompiler requires deep program understanding before it can reconstruct the bytecode to a high-level representation. For instance, to recognize which low-level jumps correspond to high-level function calls, a decompiler must deduce possible addresses for jump instructions to be able to reconstruct the control-flow of a smart contract. Such understanding is compounding: once calls are recognized as calls (and not just as a mere intra-procedural
jump), its decompilation precision can further improve, by pruning impossible targets.
In this paper, we introduce the Gigahorse toolchain for analysis of Ethereum smart contracts. Gigahorse addresses the above challenges, making the following contributions:
- It offers a highly-effective decompiler, yielding higher precision and completeness than the state-of-the-art (e.g., decompiling virtually all existing contracts on the Ethereum blockchain).
- Gigahorse anchors around it a full-featured tool suite, offering libraries for building analyses, as well as readymade clients for existing analyses.
- The Gigahorse approach provides new decompilation insights (possibly of value for many more platforms): As higher level program features are discovered, they feed back to lower level analyses. E.g., by discovering functions, stack analyses are performed locally (more precisely), and the effects of function calls on the stack are also summarized, enabling both more precise and more scalable analysis.
- Gigahorse showcases an unconventional decompilation approach, which enables the above benefits: the decompiler is specified declaratively, using logic-based (Datalog) rules.
- Gigahorse is evaluated and its features illustrated on the full set of smart contracts of the Ethereum blockchain. Gigahorse is powering the ongoing free Contract Library service (https://contract-library.com), which offers decompiled versions of all contracts on the Ethereum blockchain. Contract Library is a valuable service for Ethereum security analysts, and is currently receiving several tens of unique visitors and over a thousand page views per day.
BACKGROUND
We next present some background on the EVM bytecode language and declarative static analysis. A. Low-Level Bytecode in the Ethereum Block