
Arbitrum Sequencer Major Outage - Root Cause Analysis

By Neville Grech
17.12.2023

The Arbitrum network experienced significant downtime on December 15 due to problems with its sequencer and feed. The network was down for almost three hours. The outage began at 10:29 a.m. ET amid a substantial increase in a type of network traffic called Inscriptions. By that point, Arbitrum’s layer-2 network had processed over 22.29 million transactions and had a total value locked of $2.3 billion. Despite the success of the network, the current design suffers from a significant chokepoint when posting transactions to L1, which caused the sequencer to stall. Advancements such as Arbitrum Nova and Proto-Danksharding might alleviate these design issues, but this is not the first time Arbitrum has experienced such problems: a bug in the sequencer also halted the network in June 2023.

Background

Arbitrum is a Layer-2 (L2) solution which settles transactions off the Ethereum mainnet. L2s provide lower gas fees and reduce congestion on the primary blockchain (in this case Ethereum, the L1). The current incarnation of Arbitrum is called Nitro. Arbitrum Nitro processes transactions in two stages: sequencing, where transactions are ordered and that order is committed to, and deterministic execution, where a state transition function is applied to each transaction in turn. Nitro combines Ethereum emulation software with extensions for cross-chain functionality and uses an optimistic rollup protocol based on interactive fraud proofs.

The Sequencer is a key component in the Nitro architecture. Its primary role is to order incoming transactions honestly, typically following a first-come, first-served policy. It is a centralized component operated by Offchain Labs. The Sequencer publishes its transaction order both as a real-time feed and to Ethereum, in the calldata of an "Inbox" smart contract. The publication to Ethereum makes the transaction ordering final and authoritative. Additionally, a Delayed Inbox mechanism exists for L1 Ethereum contracts to submit transactions, and as a backup for direct submission in case of Sequencer failure or censorship.

Root cause

In the two hours prior to the outage, more than 90% of Arbitrum traffic consisted of Ethscriptions. Ethscriptions are digital artifacts on EVM chains created using Ethereum calldata. Unlike traditional NFTs, which are managed by smart contracts, Ethscriptions make the blockchain data itself a unique NFT. They are inspired by Bitcoin inscriptions (Ordinals) but function differently. Creating an Ethscription involves selecting an image, converting it to data URI format, then to hexadecimal format, and finally embedding it into the hex data field of a 0 ETH transaction. Each Ethscription must be unique; duplicate data submissions are ignored. Owners can use Ethscription IDs for proof or transfer of ownership. In practice, the calldata of an Ethscription looks like this:

data:,{"p":"fair-20","op":"mint","tick":"fair","amt":"1000"}

Calldata example of an Ethscription. This represents a token mint.
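The encoding step described above (data URI to hexadecimal) can be sketched in a few lines of Go. This is a minimal illustration, not the Ethscriptions reference tooling: the helper name EthscriptionCalldata is hypothetical, and the only validation it performs is checking for the "data:" prefix.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// EthscriptionCalldata hex-encodes a data URI so that it can be placed in
// the data field of a 0 ETH transaction. The function name is an
// illustrative assumption, not part of any Ethscriptions library.
func EthscriptionCalldata(dataURI string) (string, error) {
	if !strings.HasPrefix(dataURI, "data:") {
		return "", fmt.Errorf("not a data URI: %q", dataURI)
	}
	return "0x" + hex.EncodeToString([]byte(dataURI)), nil
}

func main() {
	// The token mint payload shown in the article.
	uri := `data:,{"p":"fair-20","op":"mint","tick":"fair","amt":"1000"}`
	calldata, err := EthscriptionCalldata(uri)
	if err != nil {
		panic(err)
	}
	fmt.Println(calldata)
	fmt.Println(len(uri), "bytes of calldata")
}
```

The payload is only a few dozen bytes, which is part of why a single mint is so cheap to submit on an L2.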

Since Ethscriptions are very cheap, one can create a large number of them for the same cost. Indeed, a staggering 90% of transactions posted on-chain were Ethscriptions. As a result, for a relatively low cost, the amount of transaction entropy that needed to be committed to L1 increased to 80MB/hr, compared with the 3MB/hr that was typical before the traffic spike. We calculated this by looking at the average on-chain transaction postings of the sequencer.
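To put the two rates in perspective, the short Go sketch below converts them into a relative increase and into data per L1 block. It assumes 1 MB = 10^6 bytes and one Ethereum block every 12 seconds with no missed slots; both are simplifications for illustration rather than figures from the measurements above.

```go
package main

import "fmt"

// Assumption: one L1 block per 12-second slot, so 300 blocks per hour.
const l1BlocksPerHour = 3600 / 12

// BytesPerL1Block converts an hourly data rate in MB (10^6 bytes) into the
// average number of bytes that must land in each L1 block.
func BytesPerL1Block(mbPerHour float64) float64 {
	return mbPerHour * 1e6 / l1BlocksPerHour
}

func main() {
	typical, spike := 3.0, 80.0 // MB/hr, taken from the article
	fmt.Printf("increase: %.1fx\n", spike/typical)
	fmt.Printf("typical: %.0f KB per L1 block\n", BytesPerL1Block(typical)/1e3)
	fmt.Printf("spike:   %.0f KB per L1 block\n", BytesPerL1Block(spike)/1e3)
}
```

Under these assumptions the spike corresponds to roughly a 27-fold increase in data that the sequencer has to commit to L1.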

Over 90% of Arbitrum transactions prior to outage were Ethscriptions

Now, look at the architecture diagram of Arbitrum below. Note that in order to commit transaction sequences to L1, the data poster needs to spread the increased amount of data over a larger number of transactions. Prior to the outage, the number of transactions posted per hour was around 10 to 20 times higher than the December mean.

Arbitrum Architecture Diagram (annotated with comments in red)

However, the code responsible for posting these transactions has a built-in limit on the rate at which L1 batches are posted. At the time of the outage, if 10 batches were still waiting in the L1 mempool, no more batches were sent to L1, which stalled the sequencer. This limit was raised to 20 batches after the outage. This is probably not a good long-term solution, however, as it increases the chances of batches needing to be reposted due to transaction nonce issues.

// Check that posting a new transaction won't exceed maximum pending
// transactions in mempool.
if cfg.MaxMempoolTransactions > 0 {
  unconfirmedNonce, err := p.client.NonceAt(ctx, p.Sender(), nil)
  if err != nil {
    return fmt.Errorf("getting nonce of a dataposter sender: %w", err)
  }
  if nextNonce >= cfg.MaxMempoolTransactions+unconfirmedNonce {
    return fmt.Errorf(
      "... transaction with nonce: %d will exceed max mempool size ...",
      nextNonce, cfg.MaxMempoolTransactions, unconfirmedNonce,
    )
  }
}
return nil

The batch poster is responsible for posting the sequenced transactions to L1 as Ethereum calldata.
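To see why such a cap stalls posting under sustained load, here is a self-contained toy model in Go. It is a sketch under stated assumptions, not Nitro code: CanPost only mirrors the nonce comparison from the snippet above, while Simulate, its parameters, and the per-block rates are hypothetical and chosen purely for illustration.

```go
package main

import "fmt"

// CanPost mirrors the comparison in the snippet above: a batch with nonce
// nextNonce may be sent only while fewer than maxMempool transactions are
// still unconfirmed. A limit of 0 disables the check.
func CanPost(nextNonce, unconfirmedNonce, maxMempool uint64) bool {
	if maxMempool == 0 {
		return true
	}
	return nextNonce < maxMempool+unconfirmedNonce
}

// Simulate runs a toy model for the given number of L1 blocks. In every
// block, `produced` new batches become ready and L1 confirms up to
// `confirmed` pending batches. It returns the number of batches that are
// ready but could not be posted. All rates are hypothetical.
func Simulate(blocks int, produced, confirmed, maxMempool uint64) uint64 {
	var nextNonce, unconfirmedNonce, backlog uint64
	for i := 0; i < blocks; i++ {
		backlog += produced
		// Post as many waiting batches as the mempool limit allows.
		for backlog > 0 && CanPost(nextNonce, unconfirmedNonce, maxMempool) {
			nextNonce++
			backlog--
		}
		// L1 includes some of the pending batches in this block.
		pending := nextNonce - unconfirmedNonce
		if pending < confirmed {
			unconfirmedNonce = nextNonce
		} else {
			unconfirmedNonce += confirmed
		}
	}
	return backlog
}

func main() {
	// Batches are produced faster (3 per block) than L1 confirms them (2).
	for _, limit := range []uint64{10, 20} {
		fmt.Printf("limit %d: backlog after 100 blocks = %d\n",
			limit, Simulate(100, 3, 2, limit))
	}
}
```

In this model, whenever batches are produced faster than L1 confirms them, the backlog keeps growing regardless of the limit; raising the limit from 10 to 20 only delays the point at which posting stalls.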

Recommendations

There are several indications that the sequencer, and thus the network, has not been tested enough in a realistic setting or in an adversarial environment. Fortunately, the upcoming Proto-Danksharding upgrade to Ethereum should also help reduce L1-induced congestion. Irrespective of this, the Arbitrum engineers can consider the following recommendations:

Finally, the Arbitrum team made a small change to the way transactions are soft-committed. In this change the feed backlog is populated irrespective of whether the sequencer coordinator is running, which carries its own risks but enables dApps running on Arbitrum to be more responsive during certain periods.

Disclaimer: The Arbitrum sequencer is solely operated by Offchain Labs. Thus, most of the information regarding its operational issues (such as logs) is not publicly available, so it is hard to get a complete picture of the issue. Dedaub has not audited Arbitrum or Offchain Labs software. Dedaub has, however, audited other (non-Arbitrum) software and projects running on Arbitrum such as GMX, Chainlink, Rysk & Stella.