You Pay For What You Touch: Locality as Ethereum's Next Cost Model

This post is based on a talk given by Neville Grech at the Stateless Summit 2026, drawing on a 2021 study Dedaub conducted for the Ethereum Foundation on the impact of Vitalik Buterin’s Verkle tree gas metering proposal — the direct precursor of EIP-7864.

TL;DR

Real-world Ethereum contracts are written to fight today’s gas model — not to respect locality. Under EIP-7864, the unified binary tree replacing the hexary Merkle-Patricia trie, access locality becomes a first-class cost dimension: you pay extra every time execution touches a code chunk or a storage slot that lives outside the small co-located region around the contract account. In a 2021 Dedaub study commissioned by the Ethereum Foundation, we measured a net ~26% gas cost increase on existing contracts under the precursor Verkle proposal, with ~96% of internal transactions worse off. That number sounds alarming. It is also a one-time, upper-bound repricing of contracts written with zero awareness of locality. Most of the gap is closable by compilers, standard libraries, and a handful of opt-in primitives. This post walks through the cost model, the systematic patterns that misbehave under it, and concrete optimization opportunities.

Why Stateless Ethereum needs a new cost model

Stateless clients are one of the most consequential scalability and decentralization unlocks Ethereum has on its roadmap. A stateless client should be able to verify any individual block from just the header and a small “witness” — the slice of state the block actually accesses, accompanied by a proof of inclusion. State-holding nodes produce these witnesses; verifiers no longer need a full copy of the world state.

The catch is witness size. In Ethereum’s current hexary Merkle-Patricia trie, an average account witness is close to 3 kB and the worst case is several times that. With an upper bound of roughly 6,000 state accesses per block under current gas pricing, naive witnesses balloon to ~18 MB per block — too large to safely fan out across the p2p network inside a 12-second slot. Contract code makes this dramatically worse: a 2,600-gas CALL (per EIP-2929) against a 24 kB contract drags the entire bytecode into the witness, pushing the worst case past 100 MB. These figures come from Vitalik’s original Verkle tree EIP notes, which framed the witness-size problem that the binary tree now solves by different means.
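The arithmetic behind those figures is easy to reproduce. A minimal sketch, assuming a 30M block gas limit (an assumption of this sketch, not a number from the text):

```python
# Back-of-the-envelope witness sizing for the current hexary trie,
# reproducing the figures quoted above.

AVG_ACCOUNT_PROOF_BYTES = 3_000   # ~3 kB average account witness
MAX_ACCESSES_PER_BLOCK = 6_000    # upper bound under current gas pricing
MAX_CODE_SIZE = 24_576            # EIP-170 contract size cap
BLOCK_GAS_LIMIT = 30_000_000      # assumed, for illustration
CALL_COST = 2_600                 # cold CALL per EIP-2929

naive_witness = AVG_ACCOUNT_PROOF_BYTES * MAX_ACCESSES_PER_BLOCK
print(f"naive witness: ~{naive_witness / 1e6:.0f} MB")      # ~18 MB

# Worst case: every CALL drags a full 24 kB of bytecode into the witness.
calls_per_block = BLOCK_GAS_LIMIT // CALL_COST
code_witness = calls_per_block * MAX_CODE_SIZE
print(f"code worst case: ~{code_witness / 1e6:.0f} MB")     # well past 100 MB
```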

Two structural changes are needed to make stateless verification work:

  1. A tree structure with smaller proofs. EIP-7864 replaces the hexary trie with a unified binary tree where account headers, contract code, and storage all live in a single 32-byte key-value space. As the EIP itself notes in its “Rationale” section, branch length for an equivalent-size tree drops from ~2,880 bytes (k=16) to ~768 bytes (k=2). The earlier Verkle proposal achieved similar savings via polynomial commitments; EIP-7864 instead leans on hash functions only, which simplifies the cryptography and is positioned for post-quantum security as ECC deprecation approaches.

  2. Per-chunk metering of code and state access. Once witnesses are bounded by what was actually touched, gas must reflect that. Bytecode is split into 31-byte chunks and every chunk that the EVM executes or accesses must be paid for, otherwise an attacker can construct a transaction whose witness blows up the network without compensating its bandwidth cost.

    Why 31 and not 32? Because each chunk needs a leading byte that says how many bytes at its start are PUSH immediates from a PUSH instruction that began in the previous chunk. Without that, a verifier handed a single chunk can’t tell whether its first byte is an opcode or a leftover immediate. There is a cleaner alternative — disallow PUSH32 (which already needs 33 bytes anyway, so it can never live entirely inside a 32-byte chunk) and disallow any PUSH immediate from crossing a chunk boundary. Then chunks can be a clean 32 bytes with no metadata byte. That’s a deeper EVM change than EIP-7864 wants to make today, but it’s worth knowing the design space.
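The metadata-byte computation is mechanical enough to sketch. The following is a simplified model of the chunking scheme described above (modeled on the EIP's chunkify routine, not the normative implementation):

```python
# Split bytecode into 31-byte chunks, each prefixed with one metadata
# byte: how many of the chunk's leading bytes are PUSH immediates that
# began in a previous chunk. Simplified sketch, not the normative code.

PUSH1, PUSH32 = 0x60, 0x7F

def chunkify(code: bytes) -> list[bytes]:
    chunks = []
    pushdata_left = 0  # immediate bytes spilling over from the previous chunk
    for pos in range(0, len(code), 31):
        chunk = code[pos:pos + 31]
        lead = min(pushdata_left, len(chunk))   # value of the metadata byte
        pushdata_left -= lead
        j = lead
        while j < len(chunk):                   # scan opcodes after the lead
            op = chunk[j]
            j += 1
            if PUSH1 <= op <= PUSH32:
                j += op - PUSH1 + 1             # skip the immediate bytes
        if j > len(chunk):                      # a PUSH ran off the end
            pushdata_left = j - len(chunk)
        chunks.append(bytes([lead]) + chunk)
    return chunks

chunks = chunkify(bytes([0x60, 0xAA]) * 20)     # PUSH1 0xAA, repeated
print(chunks[1][0])  # 1: the second chunk starts with one immediate byte
```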

This is where locality enters the picture.

Background: EIP-7864

EIP-7864 organises the state as a binary tree where the first 31 bytes of any key form a stem and the final byte indexes one of 256 slots inside that stem. Things that share a stem share a subtree — and therefore amortize their proof. The EIP defines a fixed layout for each contract account:

EIP-7864 binary tree layout: account header stem (left, in green) with basic_data, code_hash, 64 storage slots, and 128 code chunks co-located in one subtree; overflow stems (right, in blue) for storage slots and code chunks beyond the hot region, grouped in 256 leaves per stem.

Figure: Account-data layout in EIP-7864’s unified binary tree. Source: EIP-7864 — “Tree embedding” (CC0). The green panel shows the account header stem — a single subtree containing basic_data, code_hash, the first 64 storage slots, and the first 128 code chunks, all co-located. The blue panel shows the overflow stems that hold storage and code beyond the account header, grouped in batches of 256 leaves per stem. Witness costs are paid per stem entered and per leaf touched, so this layout is the entire reason “locality” becomes a first-class cost dimension.

  • Basic data (nonce, balance, code size, code hash) is packed into a small region at the beginning of the account’s stem (BASIC_DATA_LEAF_KEY = 0, CODE_HASH_LEAF_KEY = 1). This is a notable change from the 2021 Verkle draft, where the basic-data fields were laid out separately.
  • The first 64 storage slots (HEADER_STORAGE_OFFSET = 64) live in the same stem as the basic data.
  • The first 128 code chunks (CODE_OFFSET = 128) live in the same stem. 128 × 31 bytes ≈ 4 kB — that is the size of the cheap region of bytecode for any contract.
  • The remainder of storage (MAIN_STORAGE_OFFSET = 256^31) and any code beyond the first 128 chunks spread across the tree in groups of 256 within their own stems.

In other words, each contract has a small hot region — the first ~4 kB of bytecode and the first 64 storage slots — that you can touch cheaply because their proofs piggy-back on each other. Anything outside that region costs more, both because crossing into a new stem is metered as a “cold” access and because you re-pay for every additional chunk you load.
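A small sketch of how a storage slot maps onto this layout, using the constants above (simplified: the real tree key also mixes in the account address via a hash, which this model omits):

```python
# Map a Solidity storage slot to its (stem, sub-index) position under
# the EIP-7864 layout. Simplified: real keys also hash in the address.

HEADER_STORAGE_OFFSET = 64
MAIN_STORAGE_OFFSET = 256 ** 31
STEM_SIZE = 256  # leaves per stem

def storage_slot_position(slot: int) -> tuple[str, int]:
    if slot < 64:
        # co-located with basic_data / code_hash in the account header stem
        return ("header stem", HEADER_STORAGE_OFFSET + slot)
    key = MAIN_STORAGE_OFFSET + slot
    return (f"overflow stem {key // STEM_SIZE}", key % STEM_SIZE)

print(storage_slot_position(3))    # ('header stem', 67): hot region
stem, sub = storage_slot_position(500)
print(sub)                         # lands in an overflow stem, cold
```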

It’s worth being precise about what a long jump actually wastes. The cost model is two-tier:

  • At the chunk level, jumping mid-chunk discards the bytes you’d already paid to load — anything past the jump in the current 31-byte chunk is dead weight.
  • At the subtree level, when you’ve already paid to descend a stem (you’ve taken the proof hit, you’ve materialised the path-to-root), every chunk in that stem you don’t end up touching is also wasted — and a long jump that lands in a different stem walks away from all of it.

So compiler heuristics that share short instruction sequences across unrelated functions should be used sparingly: the code-size bytes they save can cost far more in witness charges than they are worth.

The actual pricing of access events is currently in EIP-4762: Statelessness gas cost changes (also Draft), which defines the witness-side constants:

Constant               Value   When charged
WITNESS_BRANCH_COST    1,900   First touch of a stem in this tx
WITNESS_CHUNK_COST       200   First touch of a leaf in an (already-touched) stem
SUBTREE_EDIT_COST      3,000   First write to a stem
CHUNK_EDIT_COST          500   First write to a leaf
CHUNK_FILL_COST        6,200   Writing a previously-empty leaf

Composing the witness-side numbers, the read-side cost looks like:

  • First read of a leaf in a brand-new stem: 1,900 + 200 = 2,100 gas (witness side), comparable to today’s COLD_SLOAD_COST.
  • First read of a different leaf in a stem you have already touched: 200 gas (witness side).
  • Repeat read of a leaf you have already touched: 0 gas at the witness layer. (The opcode-level SLOAD_GAS is still paid on top, so a re-read still costs the base SLOAD price; it just doesn’t accrue any further witness charge.)
  • Code chunks follow the same shape: the first execution byte in a fresh chunk pays the chunk’s witness cost (200, plus the 1,900 stem cost if the stem itself is new), and re-traversing an already-paid chunk later in the same transaction is free. This has no equivalent today — under EIP-2929, pure code execution is free once CALL has been paid.
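The read-side rules above can be captured in a few lines. A minimal sketch of the witness-side accounting (opcode-level base costs such as SLOAD_GAS are deliberately omitted):

```python
# Witness-side gas accounting in the EIP-4762 style described above:
# charge per stem on first entry, per leaf on first touch, nothing on
# re-traversal within the same transaction.

WITNESS_BRANCH_COST = 1_900
WITNESS_CHUNK_COST = 200

class WitnessMeter:
    def __init__(self):
        self.stems = set()     # stems already proven in this tx
        self.leaves = set()    # (stem, leaf) pairs already touched

    def read(self, stem, leaf) -> int:
        gas = 0
        if stem not in self.stems:
            self.stems.add(stem)
            gas += WITNESS_BRANCH_COST
        if (stem, leaf) not in self.leaves:
            self.leaves.add((stem, leaf))
            gas += WITNESS_CHUNK_COST
        return gas

m = WitnessMeter()
assert m.read("acct", 64) == 2_100   # fresh stem + fresh leaf
assert m.read("acct", 65) == 200     # same stem, new leaf
assert m.read("acct", 64) == 0       # re-read: no further witness charge
```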

These numbers are almost certainly subject to change. EIP-4762 was originally calibrated against the Verkle witness shape — polynomial commitments, where branch proofs are essentially constant-size and the per-leaf chunk cost dominates. Binary trees invert that: branch proofs are O(log n) hashes and substantively larger per branch, so the relative weight of WITNESS_BRANCH_COST vs WITNESS_CHUNK_COST is structurally different from what 4762 was tuned for. EIP-7864’s author Ignacio Hagopian has acknowledged this directly — in the EIP-7864 Ethereum Magicians thread he writes that “Gas remodeling is EIP-4762. It might need constant adjustments, but the overall approach would be the same.” Whatever the final constants, the shape of the model is likely to persist: payment per stem you enter, payment per leaf you touch, free re-traversal.

What we found in 2021 — and what still holds in 2026

In 2021, before EIP-2929’s “Berlin” access-list semantics had even fully bedded in, the Ethereum Foundation commissioned Dedaub to measure the impact of the Verkle gas changes on real contracts (full report here). The simulation already had two competing forces baked in: cold SLOAD/SSTORE (per EIP-2929 and EIP-2200) would get cheaper whenever the touched slot lived in a previously visited subtree, but code chunking would introduce a per-chunk cost that has no analogue today. The 26% headline is the net of those two effects. We approached the question two ways:

  • A path-sensitive static analysis built on the gigahorse-toolchain binary lifter. For every public function of every mainnet contract that had transacted in the prior two weeks, we computed an upper bound on additional cost from code chunking, considering only paths that successfully reach STOP/RETURN.
  • A modified Erigon client that re-executed historical transactions under the new gas semantics, recording per-internal-transaction gas at every cold/warm boundary and comparing it against EIP-2929’s cold costs.

The headline numbers from that study:

  • ~26% average gas cost increase across replayed internal transactions.
  • ~96% of internal transactions were worse off.
  • The distribution was approximately normal — there were no isolated cliffs, just a broad systemic shift.
  • Modest gains from heuristics like indexing the tree by code hash for chunks above 127 (only a ~3% reduction in code chunking costs in our experiments).

A quick check of some Ethereum blocks from March 2026 (to see whether five years of language and library evolution had moved the needle) suggests that software-engineering patterns have happily expanded into the storage address space (mappings of mappings of structs, account-abstraction wallets, registry contracts, deeply nested upgradeable proxies) precisely because the address space is free under today’s pricing. None of that activity has any reason to land in the first 64 slots, so:

  • Storage locality is weak in practice. ~85% of smart contract calls touched at least one storage slot beyond the first 64. Only ~56% touched the first 64 slots at all. Real workloads do not concentrate state in the co-located region — and not because they couldn’t.
  • Code locality is somewhat better, but still leaks. 100% of executions of course start within the first 128 chunks (the entry point lives there), but ~60% of executions extend beyond the first 128 chunks.
  • The locality-shaped optimization that is widely deployed today is packing more things into the same slot, which is an orthogonal concern.

Pathological code: how today’s compilers waste chunks

Two examples make the cost shape vivid.

Uniswap V2 getter — an order of magnitude too expensive

The WETH() and factory() getters on the Uniswap V2 router are two of the simplest functions on Ethereum — they do almost nothing but return an immutable. Yet under the EIP-7864 semantics, the gas attributable to code chunking is an order of magnitude larger than the gas to actually execute the function. The reason becomes obvious from the compacted execution trace:

1116: 0x4:    MSTORE(0x40, 0x80)
1118: 0x7:    4 = CALLDATASIZE()
...
1150: 0xc2:   1 = EQ(0xad5c4648, 0xad5c4648)
1152: 0xc6:   JUMPI(0x954, 1)
// Function selector ends here

1153: 0x954:  JUMPDEST()
1154: 0x955:  0 = CALLVALUE()
1156: 0x957:  1 = ISZERO(0)
1158: 0x95b:  JUMPI(0x960, 1)
1159: 0x960:  JUMPDEST()
1163: 0x968:  JUMP(0x2671)
// The Solidity compiler reused the !payable check, which ends here

1164: 0x2671: JUMPDEST()
1167: 0x2694: JUMP(0x969)
// The jump to the previous basic block was a superfluous artifact
// of the Solidity compiler (it reuses common instruction sequences)

1168: 0x969:  JUMPDEST()
1171: 0x96d:  0x80 = MLOAD(0x40)
...
1186: 0x991:  RETURN(0x80, 0x20)

For one tiny getter, control flow ricochets across at least four distinct regions of the bytecode: the selector dispatch, a deduplicated !payable check, a deduplicated common-tail, and finally the function body. The mechanism is worth pausing on. Solidity (solc) in this case hoists the implicit require(msg.value == 0) that every non-payable function needs into a single shared block and routes every non-payable function through it. Under today’s pricing, that’s a free code-size win. Under EIP-7864, it means that calling non-payable functions in this contract triggers a long jump into the shared block, and a long jump back out to whichever function actually wanted to run. Every one of those jumps lands in a fresh code chunk.

There’s a side observation worth noting for anyone who’s tried to decompile or statically analyse Solidity bytecode: this same cross-function instruction sharing is also what makes solc output unusually painful to reverse-engineer. Control flow and data flow from completely unrelated public functions get interleaved in shared private blocks, so a “function” in the source no longer corresponds to a single connected region in the bytecode. The locality story and the analysability story are the same story.

Curve gauge selector — Vyper’s linear dispatch

Some versions of the Vyper compiler compile function selectors as a chain of nested conditions:

if (function_selector != 0xdeadbabe)
    JUMP <FAR>   // incur 200 or 2100 gas
else
    balanceOf(...)

Functions called from deeper in the chain pay every selector check that came before them, and each “no, not this one” branch is a JUMP that abandons the current chunk before its bytes are exhausted. The deeper the function is in the dispatch, the more chunks the call has loaded and discarded by the time it finally enters the body. Public functions end up scattered across the contract’s code space, and the dispatcher’s own logic becomes the dominant cost on cheap calls.
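The cost shape is easy to model. A toy calculation of how many code chunks the k-th function's dispatch path loads, assuming a hypothetical ~10 bytes per selector check (an illustration, not measured Vyper output):

```python
# Toy model of linear selector dispatch under per-chunk metering:
# the k-th function pays to load every selector check before it.
# CHECK_BYTES is an assumed size, not measured compiler output.

CHUNK = 31
CHECK_BYTES = 10            # assumed bytes per "is it this selector?" check
WITNESS_CHUNK_COST = 200

def dispatch_chunks(k: int) -> int:
    """Distinct code chunks loaded before reaching function k."""
    bytes_walked = (k + 1) * CHECK_BYTES
    return -(-bytes_walked // CHUNK)   # ceiling division

for k in (0, 5, 20):
    print(f"function #{k}: {dispatch_chunks(k)} chunks, "
          f"~{dispatch_chunks(k) * WITNESS_CHUNK_COST} gas in chunk loads")
```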

Systematic patterns that fight locality

The Uniswap and Curve cases above are compiler-side. The more uncomfortable observation is that mainstream Solidity language and library conventions burn the exact storage region that EIP-7864 makes cheapest.

1. Mappings burn a slot and hashes adjacency away

contract C {
    mapping(address => uint256) public balances;
    // ...
}

Solidity assigns the mapping a slot p — but p itself stores nothing. The actual value for key k lives at keccak256(h(k) . p), scattered uniformly across the address space. Two adjacent users have value slots in completely unrelated stems.

0:                       [empty; just a root for the mapping]
1:                       [next variable...]
...
keccak256(alice . 0):    alice's balance   ← far outside hot region
keccak256(bob   . 0):    bob's balance     ← different random location

Nested mappings compound the indirection. Today this is fine — the extra hop is a constant-gas concern. Under EIP-7864 it means every mapping read is a cold cross-stem access, and the slot reserved for the mapping in the hot region is wasted.
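The scatter is visible in the slot formula itself. Solidity stores `balances[k]` at `keccak256(pad32(k) ++ pad32(p))`; Python's standard library has no Keccak-256 (`sha3_256` below is NIST SHA-3, which pads differently), so this sketch shows the shape of the computation, not the exact slots Solidity would produce:

```python
# Illustrative model of Solidity mapping-slot derivation. NOTE:
# hashlib.sha3_256 is NIST SHA-3, NOT Ethereum's Keccak-256; it is a
# stand-in to show how hash-derived slots scatter across the key space.

import hashlib

def mapping_value_slot(key: int, mapping_slot: int) -> int:
    data = key.to_bytes(32, "big") + mapping_slot.to_bytes(32, "big")
    return int.from_bytes(hashlib.sha3_256(data).digest(), "big")

alice, bob = 0xA11CE, 0xB0B
slot_a = mapping_value_slot(alice, 0)
slot_b = mapping_value_slot(bob, 0)

# Both land far outside the 64-slot hot region, in unrelated stems:
assert slot_a >= 64 and slot_b >= 64
assert slot_a // 256 != slot_b // 256   # different stems
```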

2. Arrays — even small ones — are out-of-line

contract C {
    uint256[] public reserves;   // slot p = 0
    // ...
}

For dynamic arrays, slot p holds only the length; elements start at keccak256(p):

0:                  length = 3      ← hot
keccak256(0):       reserves[0]     ← cold
keccak256(0) + 1:   reserves[1]     ← cold (adjacent to reserves[0])
keccak256(0) + 2:   reserves[2]     ← cold

bytes and string have a short-inline mode for values ≤31 bytes, where the value is stored entirely in slot p. The generic T[] does not: even a one-element uint256[] pays the full indirection cost. Every “small list of things” written as a Solidity dynamic array silently leaves the hot region.

3. Upgradeability spends the hot region on gaps

OpenZeppelin’s upgradeable contracts standardise storage gaps: a base contract reserves an array of unused slots so that subclasses can add fields later without colliding with the parent’s layout. ERC-1967 proxies go further and place implementation and admin pointers at hash-derived slot indices chosen specifically to never collide with compiler layout — the constants below are taken directly from the EIP.

contract ERC20Upgradeable {
  mapping(address => uint256) private _balances;        // slot 0 (cold — mapping)
  mapping(address => mapping(...)) _allowances;         // slot 1 (cold — mapping)
  uint256 private _totalSupply;                         // slot 2
  string  private _name;                                // slot 3
  string  private _symbol;                              // slot 4
  uint256[45] private __gap;   // slots 5–49 reserved, intentionally wasted
}
bytes32 constant IMPLEMENTATION_SLOT =
    0x360894a13ba1a3210667c828492db98dca3e2076cc3735a920a3ca505d382bbc;
bytes32 constant ADMIN_SLOT =
    0xb53127684a568b3173ae13b9f8a6016e243e63b6e8ee1178d6a717850b5d6103;

Composing the picture for a typical upgradeable ERC-20:

Slot 0–4:        ERC-20 fields (but 0–1 are mappings, so values are cold)
Slot 5–49:       __gap (50 slots, all empty, all wasted)
Slot 50+:        subclass fields
Slot 0x3608...:  implementation address  ← random cold location
Slot 0xb531...:  admin address           ← random cold location

These are rational, well-motivated choices under today’s gas model — the gap exists to make upgradeability safe, and the proxy slots exist to be unguessable. Under EIP-7864 they happen to allocate exactly the bytes that would be cheapest to access into the things least likely to be touched on the hot path.

The ERC-1967 case is quantifiably the worst of the bunch. The implementation slot isn’t read occasionally — it is read on every single proxy invocation, because every call has to look up where to delegatecall to. Under EIP-7864 that’s a guaranteed cold cross-stem read on every call to the proxy, before any of the implementation’s code has even been touched.

What about EIP-7864-aware compilers?

The good news is that almost every pathological case above has a known, mechanical fix at the compiler or standard-library level.

Replace unrolled selector dispatch with a tight loop

Early Solidity selector dispatch was an unrolled linear search over the selector list. Modern Solidity is an unrolled binary search. Both are unrolled for the same reason: jumps were free, so why pay for a loop counter when you can hard-code every branch? On a per-chunk-metered VM, that calculus inverts. A loop-based dispatcher reads the selector once, then walks a sorted in-bytecode table of selectors with CODECOPY, accessing only ~log2(N_selectors) chunks total:

start := 0
end := N_SELECTORS + 1
selector := CALLDATALOAD 0x0
selector := selector >> (8 * 28)   // keep the top 4 bytes
<loop>:
  JUMPDEST
  EQ start end
  JUMPI <selector_eq>
  index := (start + end) / 2
  CODECOPY 0, SELECTOR_START + (index * 4), 4
  s := MLOAD 0x0
  s := s >> (8 * 28)
  EQ selector, s
  JUMPI <selector_eq>
  GT selector, s
  JUMPI <selector_gt>
  end := index
  JUMP <loop>
<selector_gt>:
  JUMPDEST
  start := index
  JUMP <loop>
<selector_eq>:
  JUMPDEST
  CODECOPY 0, <FUNCTION_PTR_TABLE_START> + (index * 2), 0x4
  POP ...   // pop all stack items
  s := MLOAD 0x0
  s := s >> (8 * 30)
  JUMP s
<SELECTOR_START>:
  // 4-byte function selectors, sorted
  0x00000000      // fallback function
  ...
<FUNCTION_PTR_TABLE_START>:
  // 2-byte jump targets, one per selector
  <receive()>
  ...
<receive()>:
  JUMPDEST

The loop body executes once. The comparison body executes ~log2(N_selectors) times. All of it lives in roughly one chunk. The selector table and the jump-target table are static data stored between code sections, contiguous with the dispatch loop itself — a single chunk gets touched per dispatch, and the table reads come from the same stem the dispatcher is already paying for. Even a naive loop, with no further optimisation, beats Solidity’s unrolled binary search under EIP-7864 because the cost of loading a fresh code chunk is orders of magnitude higher than the optimisation return from sharing instruction sequences across functions. The intuition that “unrolling is always faster” is a faithful inheritance from CPUs with branch predictors and i-caches; on a per-chunk-metered VM, unrolling is wrong.
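A toy model makes the comparison concrete (chunk counts here are assumptions for illustration, not measured solc output):

```python
# Toy comparison of code chunks touched per dispatch: unrolled binary
# search vs. the loop dispatcher sketched above. Sizes are assumptions.

import math

CHUNK_COST = 200   # witness cost per fresh code chunk
N = 64             # hypothetical number of public functions

# Unrolled binary search: ~log2(N) comparison blocks on the dispatch
# path, each emitted in a different region of the bytecode, so each
# typically lands in its own 31-byte chunk.
unrolled_chunks = math.ceil(math.log2(N))

# Loop dispatcher: loop body plus selector-table reads fit in roughly
# two chunks regardless of N (the table shares the dispatcher's stem).
loop_chunks = 2

print(f"unrolled: ~{unrolled_chunks} chunks, ~{unrolled_chunks * CHUNK_COST} gas")
print(f"loop:     ~{loop_chunks} chunks, ~{loop_chunks * CHUNK_COST} gas")
```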

Superinstructions and hot-path layout

Superinstructions are a well-known technique from fast bytecode interpreters: instead of dispatching one EVM opcode at a time through the interpreter loop, you identify short, frequently occurring sequences and synthesise a single opcode that does all of them. The motivation in interpreter land is branch-prediction friendliness — fewer trips through the central dispatch switch. The motivation here is compacting the code: a superinstruction is one chunk-resident helper that absorbs work that would otherwise be spread across different chunks.

Compilers can mine bytecode for these patterns (4x POP, ADD_STORE, FILTER_MAP, …) and emit them as superinstruction helpers placed inside the first 128 chunks. Calling such a helper costs two jumps (~20 gas) but could avoid loading another full chunk (~200 gas) — an order-of-magnitude win.

Larger functions that don’t fit inside the hot region should still be partitioned into one subtree, so that the body of a single function is paid for once rather than spread across many cold stems. Error-handling branches (anything that ends in revert) should be relegated out of the hot region to maximise the value of every chunk that does get loaded.

A natural extension is to admit layout hints in the code, or to go further with profile-guided layout (similar to PGO): instrument runtime execution, learn which paths actually execute, and place those bytes adjacently in code.

Inline data sections

Mixing static data tables (selector tables, jump tables, constant arrays) directly into the code section, between instructions, lets the loaded chunk do double duty. This is a well-trodden technique on resource-constrained hardware. However, the EVM Object Format (EOF) bytecode specification will likely make it harder or impossible to interleave static data and code this way — EOF formalises a clean separation between code and data sections that, while desirable for verification and analysis, is not ideal if you need data and the code that consumes it to share a stem.

Standard-library primitives

This is where the language community has the most upside:

  • InlineArray<T, N> (like a Rust “SmallVec”) — keep the first N elements inline at the array’s slot in the hot region, spill only the tail to hashed slots. A natural extension is an inline hash table that keeps a small fixed bucket count inline before falling back.
  • Shared library infrastructure: delegatecall already lets multiple contracts share the same physical code. Lowering the cost of delegatecall (or making compiled output reuse a canonical library deployment) means a project’s common routines pay their chunk cost once and amortise across every call. A large fraction of bytecode currently deployed on chain is the same code over and over again.
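To make the InlineArray idea concrete, here is a toy model of which region each index hits (the name, API, and layout are hypothetical, not an existing library):

```python
# Hypothetical InlineArray layout model: the first `inline_capacity`
# elements sit at consecutive slots from `base_slot` (hot if within the
# first 64 slots); the tail spills to hash-derived cold slots.

class InlineArrayModel:
    def __init__(self, base_slot: int, inline_capacity: int):
        self.base = base_slot
        self.cap = inline_capacity

    def region(self, index: int) -> str:
        slot = self.base + index
        if index < self.cap and slot < 64:
            return "hot"    # header stem, proof amortised with the account
        return "cold"       # hash-derived overflow slot, own stem

arr = InlineArrayModel(base_slot=1, inline_capacity=8)
print([arr.region(i) for i in (0, 7, 8)])   # ['hot', 'hot', 'cold']
```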

Conclusion: 26% is the headline; the steady state is much smaller

In summary, the cost increase EIP-7864 imposes on today’s contracts is real, broadly distributed, and dominated by code chunking, not by storage repricing. It is a one-time repricing of bytecode written with zero awareness of locality, not a steady state. The least locality-friendly patterns (selector dispatch, upgradeable contracts, proxies, suboptimal code placement) are also the most optimisable. There is no fundamental algorithmic problem to solve, only existing compiler heuristics to update and a small number of new standard-library primitives to introduce. A handful of language-level affordances (inline data structures, shared libraries, profile-guided code layout) would let developers opt in to the new cost model without manual slot engineering.


Dedaub builds program analysis tools and audits software systems for the most demanding teams in crypto.
