NEVILLE GRECH
17 December 2023

Arbitrum Sequencer Outage | Root Cause Analysis

The Arbitrum network experienced significant downtime on December 15, 2023 due to problems with its sequencer and feed. The network was down for almost three hours. The outage began at 10:29 a.m. ET amid a substantial increase in a type of network traffic called Inscriptions. Arbitrum’s layer-2 network had processed over 22.29 million transactions and had a total value locked of $2.3 billion. Despite the success of the network, the current design suffers from a significant chokepoint when posting transactions to L1, which caused the sequencer to stall. Advancements such as Arbitrum Nova and Proto-Danksharding might alleviate these design issues, but this is not the first time Arbitrum has experienced such problems: a bug in the sequencer also halted the network in June 2023.

Arbitrum Sequencer Outage | Background

Arbitrum is a Layer-2 (L2) solution which settles transactions off the Ethereum mainnet. L2s provide lower gas fees and reduce congestion on the primary blockchain (the L1, in this case Ethereum). The current incarnation of Arbitrum is called Nitro. Arbitrum Nitro processes transactions in two stages: sequencing, where transactions are ordered and this order is committed to, and deterministic execution, where a state transition function is applied to each transaction. Nitro combines Ethereum emulation software with extensions for cross-chain functionality and uses an optimistic rollup protocol based on interactive fraud proofs.

The Sequencer is a key component in the Nitro architecture. Its primary role is to order incoming transactions honestly, typically following a first-come, first-served policy. This is a centralized component operated by Offchain Labs. The Sequencer publishes its transaction order both as a real-time feed and to Ethereum, in the calldata of an “Inbox” smart contract. Publication to Ethereum makes the transaction ordering final and authoritative. Additionally, a Delayed Inbox mechanism exists for L1 Ethereum contracts to submit transactions, and as a backup for direct submission in case of Sequencer failure or censorship.

Arbitrum Sequencer Outage | Root cause

In the two hours prior to the outage, more than 90% of Arbitrum traffic consisted of Ethscriptions. Ethscriptions are digital artifacts on EVM chains created using Ethereum calldata. Unlike traditional NFTs managed by smart contracts, Ethscriptions make the blockchain data itself a unique NFT. They are inspired by Bitcoin inscriptions (Ordinals) but function differently. Creating an Ethscription involves selecting an image, converting it to data URI format, then to hexadecimal format, and finally embedding it into the hex data field of a 0 ETH transaction. Each Ethscription must be unique; duplicate data submissions are ignored. Owners can use Ethscription IDs for proof or transfer of ownership. In practice, the calldata of an Ethscription looks like this:

data:,{"p":"fair-20","op":"mint","tick":"fair","amt":"1000"}

Calldata example of an Ethscription. This represents a token mint.
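The data-URI-to-hex step described above can be sketched in a few lines of Go. This is only an illustration of the encoding: the function names EthscriptionCalldata and DecodeCalldata are hypothetical, and building, signing and sending the 0 ETH transaction is left out.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// EthscriptionCalldata hex-encodes a data URI into the 0x-prefixed string
// that would be placed in the data field of a 0 ETH transaction.
// (Hypothetical helper name, for illustration only.)
func EthscriptionCalldata(dataURI string) string {
	return "0x" + hex.EncodeToString([]byte(dataURI))
}

// DecodeCalldata reverses EthscriptionCalldata and recovers the data URI.
func DecodeCalldata(calldata string) (string, error) {
	raw, err := hex.DecodeString(strings.TrimPrefix(calldata, "0x"))
	if err != nil {
		return "", fmt.Errorf("decoding calldata: %w", err)
	}
	return string(raw), nil
}

func main() {
	uri := `data:,{"p":"fair-20","op":"mint","tick":"fair","amt":"1000"}`
	calldata := EthscriptionCalldata(uri)
	fmt.Println("calldata:", calldata)
	fmt.Println("payload size in bytes:", len(uri))
}
```

Each such mint carries only a few dozen bytes of calldata, which is part of why so many of them fit into the same budget.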

Since Ethscriptions are very cheap, a large number of them can be created for the same cost. Indeed, a staggering 90% of transactions posted on-chain were Ethscriptions. As a result, for a relatively low cost, the amount of transaction entropy that needed to be committed to L1 increased to 80MB/hr, compared to the 3MB/hr that was typical before the traffic spike (roughly a 27-fold increase). We calculated this by looking at average on-chain transaction postings for the sequencer.

Now, look at the architecture diagram of Arbitrum below. Note that in order to commit transaction sequences to L1, the data poster needs to post the increased amount of data over a larger number of transactions. Just before the outage, the number of transactions posted per hour was around 10 to 20 times higher than the December mean.

However, the code responsible for posting these transactions has a built-in limit on the rate at which L1 batches are posted. At the time of the outage, if 10 batches were still in the L1 mempool, no more batches were sent to L1, which stalled the sequencer. This limit was raised to 20 batches after the outage. This is probably not a good long-term solution, however, as it increases the chances of batches needing to be reposted due to transaction nonce issues.

// Check that posting a new transaction won't exceed maximum pending
// transactions in mempool.
if cfg.MaxMempoolTransactions > 0 {
  unconfirmedNonce, err := p.client.NonceAt(ctx, p.Sender(), nil)
  if err != nil {
    return fmt.Errorf("getting nonce of a dataposter sender: %w", err)
  }
  if nextNonce >= cfg.MaxMempoolTransactions+unconfirmedNonce {
    return fmt.Errorf(
      "... transaction with nonce: %d will exceed max mempool size ...",
      nextNonce, cfg.MaxMempoolTransactions, unconfirmedNonce,
    )
  }
}
return nil

The batch poster is responsible for posting the sequenced transactions to L1 as Ethereum calldata.
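To see why raising the limit from 10 to 20 only buys a fixed amount of headroom, here is a minimal, self-contained sketch of the same nonce comparison, driven by a toy simulation. Everything in it is an assumption made for illustration: canPost and simulate are hypothetical names, the error text is not the real one, and the per-tick production and confirmation rates are made up rather than measured.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMempoolFull is a hypothetical sentinel error for this sketch.
var ErrMempoolFull = errors.New("posting would exceed max mempool transactions")

// canPost applies the same comparison as the excerpt: the next batch may be
// posted only if its nonce is fewer than maxMempool ahead of the first
// unconfirmed nonce. A maxMempool of 0 disables the check.
func canPost(nextNonce, unconfirmedNonce, maxMempool uint64) error {
	if maxMempool > 0 && nextNonce >= maxMempool+unconfirmedNonce {
		return fmt.Errorf("%w: nonce %d, limit %d, unconfirmed nonce %d",
			ErrMempoolFull, nextNonce, maxMempool, unconfirmedNonce)
	}
	return nil
}

// simulate runs a toy model. On every tick, `produced` new batches become
// ready and L1 confirms at most `confirmed` pending batches. It returns the
// number of batches still waiting to be posted after `ticks` ticks.
func simulate(ticks int, produced, confirmed, maxMempool uint64) uint64 {
	var nextNonce, unconfirmedNonce, backlog uint64
	for t := 0; t < ticks; t++ {
		backlog += produced

		// L1 confirms some of the batches currently in the mempool.
		pending := nextNonce - unconfirmedNonce
		c := confirmed
		if c > pending {
			c = pending
		}
		unconfirmedNonce += c

		// Post queued batches until the mempool limit is hit.
		for backlog > 0 && canPost(nextNonce, unconfirmedNonce, maxMempool) == nil {
			nextNonce++
			backlog--
		}
	}
	return backlog
}

func main() {
	// Illustrative rates only: "normal" load vs. a spike, with limits 10 and 20.
	fmt.Println("normal load, limit 10:", simulate(10, 1, 2, 10))
	fmt.Println("spike, limit 10:      ", simulate(10, 20, 2, 10))
	fmt.Println("spike, limit 20:      ", simulate(10, 20, 2, 20))
}
```

In this toy model the backlog keeps growing under a sustained spike with either limit. The higher limit only shifts it by a constant, because the posting rate is ultimately bounded by how quickly L1 confirms batches.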

Arbitrum Sequencer Outage | Recommendations

There are several indications that the sequencer, and thus the network, has not been tested enough in a realistic setting or in an adversarial environment. Fortunately, the upcoming Proto-Danksharding upgrade to Ethereum should help reduce L1-induced congestion. Irrespective of this, the Arbitrum engineers can consider the following:

  • Whether the Arbitrum gas price of L2 calldata is set too low compared to other kinds of operations. Gas is an anti-DoS mechanism, which is intimately tied to the L1 characteristics. If an increase in L2 calldata causes a proportionally large increase in batch size, then attackers can craft L2 transactions with large calldata that result in batches that don’t compress well under Brotli compression, causing a DoS attack on the sequencer. Note that Arbitrum Nova should not suffer as much from this issue, as the transaction data is not stored on L1; only a hash is.
  • Whether there is a tight feedback loop between the size of the L1 batches currently in the mempool and the L2 gas price. There is an indirect feedback loop, via the gas price on L1 and backlog sizes, but this may not be tight enough. In addition, since the sequencer is centralized anyway, anti-DoS measures might be encoded directly into it to reject transactions. (Note: a more decentralized sequencer is being considered for the future, in which case this last measure wouldn’t work.)
  • Long-term, the engineers could invest in more research into making the rollups more efficient, to decrease the size of batches committed to L1. This may include ZKP rollups at some point.
  • Additionally, security audits of the sequencer should consider DoS situations, both through simulation/fuzzing and by having auditors reason adversarially about hostile situations, based on their deep knowledge of the chains involved.
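The compressibility concern in the first recommendation can be illustrated with a small experiment. The sketch below compares how well a repetitive payload and a pseudo-random payload of the same size compress. Note the assumptions: it uses DEFLATE from Go's standard library (compress/flate) purely as a stand-in for Brotli, which is not in the standard library, and the helper names and payloads are hypothetical. It says nothing about how Arbitrum actually builds or compresses its batches.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"math/rand"
)

// compressionRatio returns compressed size divided by original size, using
// DEFLATE at its best compression level (a stand-in for Brotli in this sketch).
func compressionRatio(data []byte) (float64, error) {
	if len(data) == 0 {
		return 0, fmt.Errorf("empty payload")
	}
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, flate.BestCompression)
	if err != nil {
		return 0, err
	}
	if _, err := w.Write(data); err != nil {
		return 0, err
	}
	if err := w.Close(); err != nil {
		return 0, err
	}
	return float64(buf.Len()) / float64(len(data)), nil
}

// repetitivePayload repeats the same mint-style calldata `count` times.
func repetitivePayload(count int) []byte {
	mint := []byte(`data:,{"p":"fair-20","op":"mint","tick":"fair","amt":"1000"}`)
	return bytes.Repeat(mint, count)
}

// randomPayload returns `size` pseudo-random bytes, standing in for calldata
// crafted to be incompressible.
func randomPayload(size int, seed int64) []byte {
	out := make([]byte, size)
	rand.New(rand.NewSource(seed)).Read(out)
	return out
}

func main() {
	repetitive := repetitivePayload(1000)
	noise := randomPayload(len(repetitive), 1)

	for _, p := range []struct {
		name string
		data []byte
	}{{"repetitive", repetitive}, {"pseudo-random", noise}} {
		ratio, err := compressionRatio(p.data)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("%-14s %d bytes, compression ratio %.3f\n", p.name, len(p.data), ratio)
	}
}
```

A batch dominated by the second kind of payload costs close to its full size in L1 calldata, which is the scenario the recommendation warns about.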

Finally, the Arbitrum team made a small change to the way transactions are soft-committed. In this change the feed backlog is populated irrespective of whether the sequencer coordinator is running, which carries its own risks but enables dApps running on Arbitrum to be more responsive during certain periods.

Disclaimer: The Arbitrum sequencer is solely operated by Offchain Labs. Thus, most of the information regarding its operational issues (such as logs) is not publicly available, so it’s hard to get a complete picture of the issue. Dedaub has not audited Arbitrum or Offchain Labs software. Dedaub has, however, audited other (non-Arbitrum) software and projects running on Arbitrum, such as GMX, Chainlink, Rysk & Stella.