Termina enables teams to deploy Solana network extensions (NEs) to scale their applications without moving workloads completely off-chain. One common type of NE that the platform supports is an optimistic rollup, which relies on a fraud-proving system that allows anyone to confirm or challenge the correctness of a block. It’s ideal for this system to leverage zero-knowledge proofs (ZKPs) because they offer many advantages over traditional bisection proofs. One main benefit is that they’re non-interactive, so disputes can be settled in a few hours instead of being drawn out over a week-long challenge window. In addition, advanced ZKPs (e.g. Groth16, PLONK) may be used in ZK rollups in the future to provide privacy guarantees for sensitive transactions.
To achieve this, we needed the ability to generate a ZKP over any Solana transaction (e.g. account creation, SPL token transfer, arbitrary program execution). It was clear that we should build on a general-purpose ZK virtual machine (zkVM) rather than handwrite custom circuits, so we chose Succinct’s SP1 for its performance and ease of use. While SP1 supports Rust code, we couldn’t simply pass in raw Solana or Anchor programs, because they’re compiled to extended Berkeley Packet Filter (eBPF) bytecode before being deployed to the blockchain. Instead of proving the program’s original Rust code directly, we had to prove the execution of the Solana eBPF interpreter over the program’s compiled bytecode.
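As a rough illustration of what this looks like, the sketch below shows the general shape of an SP1 guest program built around the interpreter. The sp1_zkvm entrypoint and I/O calls are real SP1 APIs, but execute_svm is a hypothetical stand-in for invoking Agave’s SVM crate, not our actual implementation.

#![no_main]
sp1_zkvm::entrypoint!(main);

pub fn main() {
    // Committed inputs: the serialized transaction and the accounts it touches.
    let tx_bytes = sp1_zkvm::io::read_vec();
    let account_bytes = sp1_zkvm::io::read_vec();

    // The real guest runs the transaction through the Solana eBPF
    // interpreter here; this stand-in just echoes the inputs.
    let post_state = execute_svm(&tx_bytes, &account_bytes);

    // Public output: a commitment to the resulting state.
    sp1_zkvm::io::commit_slice(&post_state);
}

// Hypothetical placeholder for executing the SVM over the inputs.
fn execute_svm(tx: &[u8], accounts: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(tx.len() + accounts.len());
    out.extend_from_slice(tx);
    out.extend_from_slice(accounts);
    out
}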
This integration presented a number of unexpected technical challenges, but we were able to resolve these issues and implement key changes that allow the Solana Virtual Machine (SVM) to run within SP1. We’re excited to share these with the community and have categorized them into three broad areas: i) randomness and time, ii) threads and files, and finally iii) bit depth discrepancies.
There are several Rust crates that don’t work with a ZK system off-the-shelf due to their nondeterministic behavior. One such instance is getrandom, which we replaced with a repeatable pseudo-random number generator. Since Agave relies on an older version of getrandom, we had to fork the library and manually hook in SP1’s source of randomness:
// Forked getrandom backend: fill the buffer via SP1's sys_rand
// syscall instead of asking the host operating system.
pub fn getrandom_inner(s: &mut [u8]) -> Result<(), Error> {
    unsafe { sp1_zkvm::syscalls::sys_rand(s.as_mut_ptr(), s.len()) };
    Ok(())
}
In addition, the Rust standard library’s std::time module (the Instant and Duration structs, in particular) can cause inconsistent results across multiple executions. In fact, these calls panic and fail to build under SP1. We patched these references with stubs that compile and run properly but always give a constant output.
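A minimal sketch of the constant-output stubs is below (illustrative only; the real patch replaces the internals of the standard library rather than defining new types):

use std::time::Duration;

#[derive(Clone, Copy)]
pub struct Instant;

impl Instant {
    pub fn now() -> Self {
        // There is no clock inside the zkVM; every call returns the
        // same value, keeping execution deterministic.
        Instant
    }

    pub fn elapsed(&self) -> Duration {
        Duration::from_secs(0) // elapsed time is always a constant zero
    }

    pub fn duration_since(&self, _earlier: Instant) -> Duration {
        Duration::from_secs(0)
    }
}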
It's worth noting that Solana must remain deterministic for validators to reach a quorum on every block, which is why randomness and time appear in the codebase but never in consensus-critical logic. Agave uses randomness only for local salts and cluster diversification to enhance resistance against exploits, while time calculations are limited to absolute or elapsed times for record-keeping and performance metrics. As a result, the changes we introduced preserve the SVM's functionality and correctness.
Threading and filesystem access are high-level capabilities that SP1 does not support. Multithreading can inherently lead to slightly different results across runs with the same inputs. Agave uses it to maintain parallel transaction-processing threads as well as local disk caches that pull recently accessed account information in and out of memory. Those disk caches point to the other issue: file access. To prove a result, all inputs (and any persistent or meaningful outputs) must be accounted for, and SP1 currently disallows file access entirely to keep files out of that I/O accounting.
To handle these restrictions, we run Agave’s SVM crate in a single execution thread. Because Termina is not designed for ongoing communication between hundreds of validators to verify blocks, it doesn’t have the same caching and threading requirements. This simplification allows more streamlined code, a faster development and testing cycle, and smaller ZKPs.
Finally, the key issue we had to resolve was a fundamental disconnect between SP1’s requirements and Agave’s codebase. SP1 compiles and proves applications within a virtual 32-bit RISC-V core, but Agave was designed to run only in 64 bits. Its execution in 32 bits was mostly untested, and we quickly ran into invalid memory access errors. In particular, the SVM runs 64-bit eBPF code and uses a 64-bit virtual memory space regardless of the compiled bit depth. In other words, it’s 64 bits regardless of the size of the host machine.
In several areas of the interpreter, such as the code handling cross-program invocation (CPI), the existing logic took a few shortcuts: it assumed that pointers within the interpreter’s 64-bit virtual memory would remain 64-bit when treated as Rust pointers. That shortcut is safe in a 64-bit environment, but it breaks down in a 32-bit build of the SVM like the one SP1 produces.
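The failure mode is easy to demonstrate in isolation. The snippet below (our illustration, not Agave code) round-trips a 64-bit virtual address through a Rust pointer:

fn main() {
    // A typical address inside the SVM's 64-bit virtual memory space.
    let vm_addr: u64 = 0x4000_0000_1000;

    // Casting through usize is the implicit assumption the interpreter made.
    let ptr = vm_addr as usize as *const u8;

    // On a 64-bit host this round-trips exactly; on a 32-bit target such
    // as SP1's RISC-V core, the upper 32 bits are silently dropped, and
    // the later translation back to a physical address fails.
    println!("{:#x} -> {:p}", vm_addr, ptr);
}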
To resolve the memory issues, we made several changes to our fork of the Agave codebase. Data structures used within parts of the eBPF interpreter, most notably the AccountInfo and slice structures, were updated to store pointers explicitly as 64-bit integers rather than as pointers. The same applied to any values stored as usize or isize, whose storage size changes with the compiled bit depth and therefore changes the overall structure size. By changing how the Rust code interacts with data in the virtual memory space, we achieved full compatibility for SVM programs within SP1, and we will contribute the improved code to the upstream Agave repository, which also enables 32-bit targets for Solana validator builds.
As a concrete example, consider Rust’s slice object. While it seems like it’s just “a piece of a vector” with its own special notation, it is, of course, a defined object, and that object is:
{ ptr: *T, len: usize }
When code is strictly within the eBPF interpreter, everything is in 64 bits, and you can use a Rust slice to represent data in the virtual memory space (as long as you translate the pointer into a physical address before you follow it). You can even do pointer math on it, like finding the (virtual) address of the i-th element of the slice.
However, in a 32-bit build, when code outside the interpreter needs to look at data within that space, both ptr and len are now 32 bits. So a Rust slice that itself is located in physical memory has a 32-bit pointer trying to hold a 64-bit virtual address; that address gets truncated, and the translation of the pointer to a physical address fails. At the same time, a Rust slice located in virtual memory, such as a slice-of-slices holding a list of Pubkeys, or one inside an AccountInfo structure, is the wrong size. It was stored (within the interpreter) as a 64-bit-based data structure, and now Rust thinks it’s a 32-bit-based data structure. As a result, the mappings of the struct members aren’t right, and the values no longer line up properly.
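The mismatch is easy to see directly; this one-off snippet (our illustration, not Agave code) prints the target-dependent layout:

fn main() {
    // Both fields of a slice reference are pointer-sized, so the layout
    // depends on the target: 16 bytes on a 64-bit build, 8 bytes on a
    // 32-bit build such as SP1's RISC-V core.
    println!("size of &[u8]: {}", std::mem::size_of::<&[u8]>());
    println!("size of usize: {}", std::mem::size_of::<usize>());
}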
To get around the issue in this particular example, we created a new VmSlice object, which is instead defined as:
{ addr: u64, len: u64 }
This structure maps correctly onto slices that were built in a 64-bit system, and it can also hold a full virtual address, regardless of whether the app was built with 32-bit or 64-bit pointers. Note that the first field is now addr, to make it clear that this is a virtual address, not a true pointer. We will be working with the team at Anza to update the Agave source to use these VmSlices both within code looking at data in the virtual space and in a new VmAccountInfo data structure (as used within the CPI code), which has all of its pointer- and size-based data converted to u64 virtual addresses and translated as needed.
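In Rust terms, the idea looks roughly like the following (a minimal sketch: the field names match the definition above, but the generic wrapper and method are our illustration, not necessarily Agave’s final API):

use std::marker::PhantomData;

// Fixed-layout handle to a slice living in the interpreter's 64-bit
// virtual memory; identical in size on 32-bit and 64-bit targets.
#[repr(C)]
pub struct VmSlice<T> {
    addr: u64, // virtual address of the first element, not a host pointer
    len: u64,  // element count, always stored as 64 bits
    _marker: PhantomData<T>,
}

impl<T> VmSlice<T> {
    // Virtual address of the i-th element. The arithmetic stays in u64,
    // so it behaves the same regardless of the build's pointer width; the
    // result must still be translated to a physical address before use.
    pub fn element_addr(&self, i: u64) -> u64 {
        self.addr + i * std::mem::size_of::<T>() as u64
    }
}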
With this full integration, we have been able to benchmark some simple multi-instruction transaction scenarios, such as creating wallet accounts and transferring SOL, or creating associated token accounts (ATAs) and using them to mint Solana Program Library (SPL) tokens, which are the equivalent of ERC20 tokens on Ethereum. A 3-instruction SOL transaction referencing three accounts, such as the following, would use around 20 million cycles to execute:
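For concreteness, here is a representative way to assemble such a transaction with the solana-sdk (a sketch, not necessarily the exact benchmarked transaction; lamport amounts are illustrative):

use solana_sdk::{
    hash::Hash,
    signature::{Keypair, Signer},
    system_instruction, system_program,
    transaction::Transaction,
};

fn build_sol_example(payer: &Keypair, recent_blockhash: Hash) -> Transaction {
    let a = Keypair::new();
    let b = Keypair::new();
    let instructions = vec![
        // Fund two fresh wallet accounts owned by the system program.
        system_instruction::create_account(
            &payer.pubkey(), &a.pubkey(), 1_000_000, 0, &system_program::id(),
        ),
        system_instruction::create_account(
            &payer.pubkey(), &b.pubkey(), 1_000_000, 0, &system_program::id(),
        ),
        // Move SOL between them.
        system_instruction::transfer(&a.pubkey(), &b.pubkey(), 500_000),
    ];
    Transaction::new_signed_with_payer(
        &instructions,
        Some(&payer.pubkey()),
        &[payer, &a, &b],
        recent_blockhash,
    )
}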
An 8-instruction SPL transaction referencing five accounts, such as the following example, could use around 65 million cycles to execute:
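Again for illustration, an abbreviated sketch of this kind of transaction (only four of the instructions shown, using the spl-token and spl-associated-token-account crates; the benchmarked transaction differs in its exact mix):

use solana_sdk::{
    hash::Hash,
    signature::{Keypair, Signer},
    system_instruction,
    transaction::Transaction,
};
use spl_associated_token_account::{
    get_associated_token_address, instruction::create_associated_token_account,
};
use spl_token::instruction::{initialize_mint, mint_to};

fn build_spl_example(payer: &Keypair, recent_blockhash: Hash) -> Transaction {
    let mint = Keypair::new();
    let recipient = Keypair::new();
    let ata = get_associated_token_address(&recipient.pubkey(), &mint.pubkey());
    let instructions = vec![
        // Allocate the mint account (82 bytes = spl_token::state::Mint::LEN).
        system_instruction::create_account(
            &payer.pubkey(),
            &mint.pubkey(),
            2_000_000, // lamports for rent exemption (illustrative value)
            82,
            &spl_token::id(),
        ),
        initialize_mint(&spl_token::id(), &mint.pubkey(), &payer.pubkey(), None, 0).unwrap(),
        // Create the recipient's ATA, then mint tokens into it.
        create_associated_token_account(
            &payer.pubkey(), &recipient.pubkey(), &mint.pubkey(), &spl_token::id(),
        ),
        mint_to(&spl_token::id(), &mint.pubkey(), &ata, &payer.pubkey(), &[], 1_000).unwrap(),
    ];
    Transaction::new_signed_with_payer(
        &instructions,
        Some(&payer.pubkey()),
        &[payer, &mint],
        recent_blockhash,
    )
}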
As a broad generalization, the key factors in how expensive a transaction is to prove are the code being executed and the total number of account Pubkeys being used. The SPL example is a fairly complex transaction, well above the typical number of instructions per Solana transaction, which averages out to 4 or 5.
We have provided some charts below with benchmarks for the above transactions. The proving times presented were measured on Succinct’s public Prover Network with the V3 release of their software. An upcoming V4 release is expected to reduce these times significantly.
The important takeaways from this are:
In the context of an optimistic rollup, the network does not bear the cost of proving every transaction; a proof is generated only when potential fraud is detected. However, it may still be informative to compare costs to similar projects to contextualize this work.
The stats for Base and OP Mainnet were taken from here. Note that these stats are changing rapidly and were collected with the V3 release of SP1; the upcoming V4 is expected to bring large performance improvements.
The chart above compares the cycle counts for the SVM example to representative Ethereum-based blocks using Optimism. OP Succinct is a project that uses SP1 to ZK-prove Optimism transactions on top of Ethereum. For their mainnet, they report proving costs of 18M cycles, or 1.34 cents, per transaction, with an average of 16 transactions per block. Termina’s complex example transactions shown here require roughly three times as many cycles per transaction, but due to the optimistic design, the amortized proving cost per transaction trends toward zero. When fraud is detected, there is a one-time proving cost ($5.32 in this example of 100 transactions) that is paid for by slashing the fraudulent client and rewarding the entity that issued the fraud proof. There is no ongoing, per-transaction proving cost to Termina clients. By leveraging ZK fraud proofs, the network only pays for proof generation when there is a potential dispute, making the chain affordable for all to use.
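As a back-of-envelope illustration of that amortization (the $5.32 and 100-transaction figures are from the example above; the dispute rate is a hypothetical parameter):

fn main() {
    let proof_cost_usd = 5.32; // one-time cost to prove the 100-transaction example
    let txs = 100.0;
    // Expected proving cost per transaction: proofs are generated only
    // for the fraction of batches that are actually disputed.
    for dispute_rate in [1.0, 0.01, 0.0001] {
        let amortized = dispute_rate * proof_cost_usd / txs;
        println!("dispute rate {dispute_rate}: ${amortized:.6} per transaction");
    }
}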
This work wouldn’t have been possible without the support of our collaborators. We’d like to thank Uma and the Succinct team for their contributions to OSS zkVMs through SP1 and their help in making patches throughout our integration. We’d also like to thank Alessandro, Joe, and the rest of the Anza team for their work with the modular SVM and eBPF VM, answering all questions about 64-bit pointers, providing feedback on our proposed changes, and reviewing our pull requests.