By AUJay
Enterprise Blockchain Indexing and Indexed Blockchain Data: Why APIs Need a Query Layer
Decision-Makers Summary
Two protocol changes in 2024-2025 reshape enterprise data access: Ethereum's EIP‑4844 introduced blobs (ephemeral L2 batch data), and EIP‑4444 enables partial history expiry (client-side pruning). Together they make it clear that leaning on raw node RPCs alone won't meet enterprise-grade data demands.
What you need is a specialized query layer that ingests, normalizes, and serves reliable multi-chain data backed by service level agreements (SLAs). That layer is what makes robust products, risk management, and cost control possible. For the underlying specs, see eips.ethereum.org.
Why this matters now
Over the past eighteen months, Ethereum shipped Dencun (EIP‑4844), which moved rollup batch data into "blobs." Blobs live in consensus clients and are retained for only about 18 days (4096 epochs). If you don't capture blob contents promptly, for example via a beacon node's blob sidecar API, you permanently lose the raw L2 data that settlement analytics, fraud monitoring, and reconciliation depend on. Once pruned, it's gone. (eips.ethereum.org)
Ethereum clients have also begun supporting partial history expiry in line with EIP‑4444: execution clients can now drop a chunk of pre‑Merge history and shift the burden of serving it to specialized providers. Apps that rely on "the network" for deep historical data should expect declining availability over P2P connections, so plan to connect to external archives or run your own. Details are in the EF blog post.
With these protocol-level insights in mind, it's pretty obvious that your API strategy shouldn’t just stop at basic JSON-RPC endpoints. You might want to think about adding an indexing and query layer to the mix, too.
Raw node RPCs weren’t designed for business queries
Node RPCs provide a solid set of basic tools for your needs. They do a great job of helping you send transactions and keep up with the latest state. However, when you dive into more advanced analytics or queries that involve multiple entities, they can start to show their limitations.
- eth_getLogs is constrained by provider limits (commonly 10,000-block windows or 10k-log caps), so you must chunk requests and handle pagination; large ranges risk timeouts or overloaded nodes.
- "Trace" data is non-standard across clients. You typically need extra modules (debug/trace), sometimes archive-class hardware, and a client-specific approach: Nethermind's debug_traceTransaction versus Erigon's trace_filter, for example.
- Historical state (balances, storage, or code at older blocks) requires archive data. The ~128-block recent-state window of a full node won't cut it; full nodes may have to re-execute requests, or simply lack the data. That's where archive nodes or third-party archives come in.
- Mind finality. Queries against "latest" can be hit by reorgs; "safe/finalized" is more stable, advancing each epoch (~6.4 minutes), with finalization after two epochs (~12-13 minutes). Set deliberate policies for query tags and client behavior.
- For rollups, post-Dencun batches live in blobs fetched from beacon nodes, not the execution-layer JSON-RPC, so your ingestion must speak to both EL and CL APIs.
- On Solana, high throughput makes direct RPC analytics impractical; use Geyser plugins or provider-managed archival and streaming APIs instead.
Bottom line: when you ask for things like, “can you show me all swaps for this user across chains from last quarter?” or “let me know if a bridge vault moved funds in a pre-finalized L2 batch,” these kinds of queries really don’t work with standard RPC calls.
What a query layer is (and isn’t)
A query layer is the productized interface that sits on top of raw nodes, keeping data indexed and integrity-checked:
- It ingests from multiple sources (execution clients, beacon clients for blobs, L2 nodes), decodes data (ABIs, traces), normalizes schemas, and serves consistent APIs (GraphQL/REST/SQL) with SLAs.
- It understands chain semantics, such as finality windows, reorgs, blob retention, and L2 challenge periods, and reflects them in query results and freshness flags.
- It is observable, with SLOs and error budgets, so teams can ship features without breaking data contracts. (sre.google)
It's not just "a quicker node." We're talking about a complete data system here! It comes loaded with its own ETL, a one-of-a-kind storage layout, governance features, and APIs tailored for products.
2025 realities that force the upgrade
1) Ethereum’s blob world is ephemeral by design
- Blobs are kept in beacon nodes rather than execution clients and are pruned after about 18 days. If you index rollup data, such as the Optimism OP Stack or Arbitrum, fetch blobs promptly from the beacon APIs and verify them against the KZG commitments referenced by the L1 batch transactions.
- The OP Stack’s Ecotone derivation pipeline handles type-3 (blob) transactions a little differently. It pulls blob contents via beacon endpoints. This means your indexer has to take care of blob retrieval and verification, or you’ll need to rely on some specialized archivers to manage that for you. You can dive into the details here.
- Over in Arbitrum’s Nitro, you can now post batches as blobs, and it even offers tuning flags for your blob posting strategy. From there, it’s up to the L2 indexer to parse everything and keep tabs on those batches. If you want to dive deeper into this, check it out here.
2) History pruning is here
- The EF has announced client support for partial history expiry: you can reclaim roughly 300-500 GB by dropping pre‑Merge history, with full EIP‑4444 rolling expiry on the P2P layer still to come. Apps should prepare for historical data gradually disappearing from random peers and line up dedicated history endpoints. Details are on the Ethereum blog.
Taken together, these two changes mean "I can always fetch it later" is no longer a safe assumption.
Reference architecture: a modern blockchain query layer
Here is a proven reference design worth building (or buying) in 2025:
Ingestion (multi-protocol, reorg-safe)
- EVM EL: when fetching blocks, receipts, and logs, keep request windows bounded (≤2k blocks or ≤10k logs), auto-chunk by time or count, and de-duplicate on (blockHash, logIndex).
- EVM traces: for call trees and state diffs, Erigon's trace_filter and trace_block are far faster than Geth-style debug tracing, and they preserve parent-child call relationships and error flags. (docs.erigon.tech)
- Beacon CL (for blobs): subscribe to finalized headers and pull blob sidecars for L2 batch transactions (type 0x03), verifying them against the versioned hashes; fail over to a backup beacon or archiver if retrieval fails.
- Solana: to lighten RPC load while keeping fast access, use validator Geyser plugins or stream data into external storage such as Kafka, Postgres, or ClickHouse.
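The bounded-window and dedupe rules for EVM log ingestion can be sketched in a few lines. This is a minimal illustration, not any provider's SDK; the function names and the dict keys (`blockHash`, `logIndex`) mirror standard JSON-RPC log fields:

```python
from typing import Dict, Iterator, List, Tuple

def chunk_block_ranges(start: int, end: int, max_span: int = 2_000) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (from_block, to_block) windows no wider than max_span,
    so each eth_getLogs request stays under typical provider limits."""
    lo = start
    while lo <= end:
        hi = min(lo + max_span - 1, end)
        yield (lo, hi)
        lo = hi + 1

def dedupe_logs(logs: List[Dict]) -> List[Dict]:
    """Drop duplicate logs using (blockHash, logIndex) as the identity key,
    which survives overlapping chunk boundaries and retried requests."""
    seen = set()
    out = []
    for log in logs:
        key = (log["blockHash"], log["logIndex"])
        if key not in seen:
            seen.add(key)
            out.append(log)
    return out
```

Keying on blockHash rather than blockNumber matters: after a reorg the same height can carry different logs, and the hash disambiguates them.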
Transformation (parallel, idempotent)
- Use Substreams/Firehose for parallel processing of chains: cache module outputs and stream them into sinks. This can cut backfill times from weeks to hours, and cursors make reorg healing straightforward. (thegraph.com)
- Define your schemas: entity tables for accounts, contracts, tokens, and NFTs; event fact tables; trace tables; and L2 batch metadata tables for blob commitments and frame indices.
- Make sure to calculate your invariants (like balances and TVL) as materialized views that update incrementally.
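An incrementally maintained invariant like a token-balance view can be as simple as folding each decoded Transfer into a running map. A hypothetical sketch (the dict keys are illustrative, not a real schema):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def update_balances(balances: Dict[Tuple[str, str], int],
                    transfers: List[dict]) -> Dict[Tuple[str, str], int]:
    """Incrementally maintain a (token, holder) -> balance materialized view
    from decoded ERC-20 Transfer events (value as int, in base units)."""
    for t in transfers:
        balances[(t["token"], t["from"])] -= t["value"]
        balances[(t["token"], t["to"])] += t["value"]
    return balances
```

In production the same fold runs inside the warehouse as an incremental materialization, but the invariant (debits equal credits per token) is identical.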
Storage (cheap, queryable, durable)
- Keep your raw and refined data organized in slick columnar formats like Parquet stored in cloud object storage. It makes running analytics a breeze with tools like BigQuery. By the way, BigQuery just added public datasets for a bunch of blockchains, including BTC, ETH, Polygon, Arbitrum, Optimism, Tron, and more. This is a fantastic way to mix your private data lake with public reference data! You can check it out here.
- Don't forget to partition your data by chain_id/date/hour and cluster it by address/topic. This approach really helps in keeping those scanning costs low!
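The partitioning scheme above maps naturally onto Hive-style object-storage paths, which most query engines can prune on. A small sketch of deriving such a path from a unix block timestamp (the `refined/logs` prefix is a made-up example):

```python
from datetime import datetime, timezone

def partition_path(chain_id: int, block_time: int, prefix: str = "refined/logs") -> str:
    """Build a Hive-style partition path (chain_id/date/hour) from a unix
    block timestamp, so engines can prune scans to the partitions queried."""
    ts = datetime.fromtimestamp(block_time, tz=timezone.utc)
    return (f"{prefix}/chain_id={chain_id}/"
            f"date={ts:%Y-%m-%d}/hour={ts:%H}/")
```

Clustering within each partition (by address or topic0) is then handled by the writer or warehouse, not the path layout.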
Serving (APIs built for product)
- When you're building transactional apps, GraphQL really shines. It makes things like entity joins, filtering, and pagination a breeze.
- If you're diving into quantitative research, stick with SQL endpoints. They pull data from warehouses, which is super reliable.
- And hey, don’t overlook webhooks and streams! They're fantastic for real-time triggers, whether it’s for order fills or vault movements.
Observability and SLOs
- Publish your SLOs: for example, 99.9% availability, P95 latency under 250 ms on hot endpoints, and data freshness under one minute for finalized data. Track error budgets, and per standard SRE policy, freeze feature releases when the budget is exhausted. (sre.google)
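Translating an availability SLO into a concrete error budget is simple arithmetic, but worth making explicit in dashboards. A minimal helper:

```python
def monthly_error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime per month implied by an availability SLO.
    e.g. a 99.9% SLO leaves 0.1% of a 30-day month as error budget."""
    return (1.0 - slo) * days * 24 * 60
```

A 99.9% SLO leaves about 43 minutes per month; once monitoring shows the budget spent, the SRE policy above says changes pause until it refills.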
Example 1: L2 settlement-risk monitor with blob awareness
Goal: alert when a rollup sequencer makes outlier transfers to a bridge vault before finalization, so unusual activity is caught while there is still time to react.
- Ingest: Watch out for L1 blocks that have type-3 transactions from familiar L2 batchers. For every one of those, snag the blob sidecars via the beacon API, decode the batches, and assemble those L2 transactions. (specs.optimism.io)
- Verify: Double-check that the KZG commitments match the versioned hashes in the L1 header. It’s a good idea to steer clear of any unverified third-party mirrors floating around. (eip4844.com)
- Retention: blob data is pruned after roughly 18 days, so persist decoded frames and derived L2 transactions to your lake immediately; you cannot count on fetching them later. (eip4844.com)
- Policy: Create an API that highlights any anomalies and includes freshness tags such as: latest, safe, and finalized. Just so you know, “finalized” means it corresponds to two epochs on Ethereum, which is roughly 12-13 minutes. (alchemy.com)
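The verification step in this pipeline reduces to the EIP‑4844 versioned-hash rule: a blob sidecar's KZG commitment must hash to one of the `blob_versioned_hashes` carried by the type‑3 transaction (sidecars themselves come from the Beacon API's `/eth/v1/beacon/blob_sidecars/{block_id}` endpoint). A sketch of just the check, using only the standard library:

```python
import hashlib
from typing import List

VERSIONED_HASH_VERSION_KZG = b"\x01"

def kzg_to_versioned_hash(kzg_commitment: bytes) -> bytes:
    """EIP-4844: versioned hash = 0x01 || sha256(commitment)[1:32]."""
    return VERSIONED_HASH_VERSION_KZG + hashlib.sha256(kzg_commitment).digest()[1:]

def sidecar_matches_tx(sidecar_commitment: bytes,
                       tx_versioned_hashes: List[bytes]) -> bool:
    """Check a fetched blob sidecar's commitment against the
    blob_versioned_hashes listed in the type-3 (blob) transaction."""
    return kzg_to_versioned_hash(sidecar_commitment) in tx_versioned_hashes
```

Full KZG proof verification (that the blob matches the commitment) additionally needs a KZG library such as c-kzg; the hash check above is the cheap first gate against mislabeled or tampered sidecars.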
Example 2: Cross-chain NFT holder snapshot (ETH + Base) without blowing RPC limits
- To backfill logs for ERC‑721 Transfer events, use bounded windows--think of 2k-block chunks on Base and ETH. This way, you can avoid hitting those annoying provider caps (yeah, the 10k logs per request limit). Remember to merge with token lists and decode any proxies as well. For more info, check this out here.
- Incrementally calculate the current owners from the logs. Make sure to double-check against trace-based mints and burns to catch any potential edge cases that could crop up.
- Serve everything via GraphQL: this setup allows you to query holders using pagination and filters (like trait or mint windows), and also keep track of the finality status for each response.
Example 3: Solana high-throughput indexing without hammering RPC
- With a Geyser plugin, stream accounts and transactions into RabbitMQ or Kafka, then land the data in ClickHouse for fast analytics. Alternatively, Helius's managed archival exposes getTransactionsForAddress, which bundles signatures and transactions together, handy for wallet timelines and compliance backfills. (docs.solanalabs.com)
- For real-time needs, use provider streaming over gRPC or WebSockets with cursor-based reconnection, and keep an index linking slots to program changes for low-latency lookups. (helius.mintlify.app)
Example 4: Audit-at-scale with public data warehousing
- Link your internal indices with Google's public crypto datasets to see how various chains compare or to verify your own ETL results. BigQuery is now on board with several chains, not just BTC and ETH, but also Avalanche, Optimism, Polygon, and Tron. Take a look here: (cloud.google.com)
Emerging practices we see working best
- Consider using “finality” as a query parameter. This lets clients decide how up-to-date they want their data to be: they can pick from latest, safe, or finalized. For analytics that users interact with, it’s a good idea to set safe as the default option. (alchemy.com)
- Prioritize retrieving blobs. Get at least one trusted beacon node up and running, and start caching blobs to object storage as soon as you notice them. Also, make sure to have a backup ready with a secondary beacon endpoint, just in case! (specs.optimism.io)
- If you're diving into high-volume traces, Erigon is the way to go. The trace_* module really shines when it comes to efficiently filtering through call graphs and state diffs. Plus, combining it with columnar storage makes scanning super budget-friendly. Check it out here: (docs.erigon.tech)
- Boost your efficiency by using Substreams/Firehose to parallelize backfills. This method can seriously slash the time it takes to prepare historical indexes and helps save on your RPC costs when you compare it to the older sequential pull approach. Check it out here: (thegraph.com)
- Get ready for some client-side pruning! It's wise to think ahead and assume that P2P historical serving will slow down eventually, all thanks to EIP‑4444. It might be a good idea to team up with history providers or keep your own archival records. Check out more details here.
- Make sure to optimize your getLogs calls. Keep an eye on those provider limits, like the 10k-block windows or the 10k-log caps. It’s a good idea to incorporate a retry/backoff mechanism, and don't forget to monitor topic selectivity so you can automatically tweak your chunk sizes. Check out more details here.
- Differentiate between “backfill” and “live.” Create distinct pipelines and compute pools for each type. With Substreams modules, you can seamlessly handle both your historical data and live updates using the same codebase. (thegraph.com)
- Transform SLOs into a product. Ensure your data freshness and latency SLOs are accessible for each endpoint, and set up an error-budget policy to strike the right balance between speed and reliability when rolling out new features. (sre.google)
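The getLogs optimization bullet above (retry/backoff plus adaptive chunk sizing) is a small, reusable pattern. A hypothetical sketch, where `fetch` stands in for whatever RPC call you wrap; on failure it backs off and halves the window, returning the high block actually covered so the caller can resume:

```python
import time
from typing import Callable, List, Tuple

def fetch_with_backoff(fetch: Callable[[int, int], List[dict]],
                       from_block: int, to_block: int,
                       max_retries: int = 5,
                       base_delay: float = 1.0) -> Tuple[List[dict], int]:
    """Try fetch(from_block, hi); on failure, sleep with exponential backoff
    and halve the window. Returns (logs, hi) so the caller resumes at hi + 1."""
    hi = to_block
    for attempt in range(max_retries):
        try:
            return fetch(from_block, hi), hi
        except Exception:
            time.sleep(min(base_delay * 2 ** attempt, 30.0))
            hi = from_block + (hi - from_block) // 2  # shrink the window
    raise RuntimeError("getLogs window failed after retries")
```

Tracking which (topic, window-size) pairs succeed lets you start future chunks near the right size instead of rediscovering the limit each run.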
Build vs. buy in 2025: a pragmatic take
You've got some great options to choose from across the stack. The ideal combo really hinges on how much control you want, your deadlines, and, of course, what you’re looking to spend.
- The Graph (Subgraphs + Substreams). With Subgraphs, you can access indexed on-chain entities using GraphQL, and the best part? It's fully decentralized now! You’ll love the pay-as-you-go setup--your first 100k queries are on the house, and after that, you just pay based on what you use. Substreams take it even further by offering parallel backfills and real-time streams across over 90 networks. It’s a great option if you want a smooth experience without dealing with all that heavy infrastructure. (thegraph.com)
- Goldsky. This platform provides backwards-compatible subgraph hosting and “Mirror” pipelines that let you stream blocks, logs, and traces directly into your database or data lake with reliable service level agreements (SLAs). It’s a lifesaver if you're looking for dedicated performance or multi-chain replication without the hassle of building your own ETL processes. Check it out here: (docs.goldsky.com)
- Solana specialists. Helius offers independent archival, convenient historical endpoints such as getTransactionsForAddress, and low-latency streams, a strong fit for Solana's throughput and enterprise-scale backfills. (helius.dev)
- Public data warehouses. BigQuery's public crypto datasets keep growing; they're handy for validating your indices, cross-chain analytics, or fast prototyping without running nodes. (cloud.google.com)
Just a heads up: vendor roadmaps can shift pretty quickly! Take Flipside, for example--they totally revamped their API/SDK offerings back in 2025. Instead of sticking with their original programmatic API, they decided to pivot and focus on Snowflake data sharing. If your product is closely tied to a third-party API, it might be smart to set aside some budget for possible migration paths. Another option is to think about building a lightweight internal query layer, which would make it easier to switch out upstream providers when needed. You can find all the details here.
Deep-dive: handling finality and reorgs in your API
- Let’s make it super clear what we mean by “commitment level” in our responses. Think about it in terms of latest, safe, and finalized. For the dashboards, we should definitely stick with safe as the default, but we'll want finalized levels when it comes to financial statements. (alchemy.com)
- For Solana, it’d be awesome to have the slot/confirmation depth as something we can configure. And hey, make sure to jot down those commitment levels you pick in the API docs!
- Don’t forget about the reorg queue: if a block gets reorged, we need to upsert the affected entities and fire off those correction webhooks. That cursor model from Substreams could be super handy for this--consider bringing it on board, even if you’re set on creating your own processors. (thegraph.com)
Deep-dive: making EVM traces usable
For custody, MEV, and compliance analytics, storing and reparsing raw trace trees gets expensive fast. A solid approach is to use Erigon's trace_filter to select only the call types and addresses you need up front, then land a "call_edge" table with the essentials: tx_hash, parent_id, call_id, type, from, to, value, and error.
Optionally add a "state_diff" table for auditing. Pre-flattening like this can speed up downstream queries by 10-100x versus reparsing JSON on every request. For more detail, see the Erigon documentation.
Governance and integrity: proofs and checkpoints
- The Proof of Indexing (POI) from The Graph is a really useful concept. It creates digests of changes in the entity store, enabling indexers to prove they’ve indexed the same data. Even if Subgraphs aren't your thing, you might want to think about using a similar checkpointing method. It can be a great way to catch any inconsistencies between different clusters or regions. (thegraph.com)
- For L2 blobs, it’s super important to keep an eye on the versioned hash, KZG commitment, and verification status for each batch frame. This will make it a breeze for you (or any auditors) to verify the provenance down the road. (specs.optimism.io)
Cost control: fewer RPCs, more parallelization
Data becomes a lot more budget-friendly when you:
- Backfill with Substreams/Firehose (parallel, cursored) instead of looping over RPC, which is slow and hard on providers. See firehose.streamingfast.io.
- Use provider endpoints built for bulk operations, like Solana's getTransactionsForAddress, rather than stitching together thousands of client-side calls. (helius.dev)
- Cache decoded events and traces in columnar storage and serve from there.
Decision checklist for leaders
- Will your product keep working once blobs are pruned after ~18 days and execution clients stop serving deep history by default? Do you have beacon and archive access? (eip4844.com)
- Can your team manage archive/trace infrastructure and ensure reorg-safe ETL, or would it be better to go with a managed indexer like Subgraphs, Mirror, or Helius?
- Are your SLOs (like availability, latency, and freshness) clearly laid out, published, and supported by error budgets? (sre.google)
- Do your APIs indicate commitment levels (like latest/safe/finalized) and provide clear info on Ethereum epochs and rollup challenge windows (you know, usually around a week by default on Arbitrum)? (docs.arbitrum.io)
- Is your schema ready for cross-chain use with things like chain_id, block_time, address canonicalization, topic decoding, and L2 batch metadata?
The takeaway
Indexing has really become essential these days. Thanks to EIP‑4844, L2 data is now more transient, and EIP‑4444 has complicated things a bit for regular nodes trying to reliably serve historical data over time. If you're aiming to roll out trustworthy products, you'll definitely want a robust query layer that can tap into both execution and consensus sources, handle backfills in parallel, stick to finality semantics, and offer user-friendly APIs with service level objectives (SLOs). Whether you decide to whip this up in-house or buy it, just ensure it’s blob-aware, resistant to reorgs, and ready for a future where not everything from the past is easily accessible. (specs.optimism.io)
Further reading and tooling mentioned
- Check out what EIP‑4844 is all about, including blob retention, KZG commitments, and beacon-side storage. You can find all the details here.
- Dive into the OP Stack Ecotone derivation for blob retrieval. Get the scoop here.
- Take a peek at the blob posting setup for Arbitrum and ArbOS 20 by visiting this link: Arbitrum Config.
- Don’t overlook the announcement about the partial history expiry for EIP‑4444. You can read more about it here.
- Explore what Alchemy and QuickNode have to say about the practical limits of getLogs. Check it out here.
- Familiarize yourself with the Erigon trace module and Nethermind debug features by checking out the details here.
- Discover The Graph Substreams and the related documentation, along with Firehose. Learn more here.
- Look into the Solana Geyser docs and find out about managed historical and streaming options with Helius at this link: Solana Geyser.
- Check out BigQuery’s multi-chain public datasets for some interesting data insights. Find it here.
- Finally, learn about SRE SLOs and error budgets to enhance your understanding. You can dive into the details here.
7Block Labs specializes in guiding teams through the setup and rollout of this stack--covering everything from blob-aware ingestion pipelines to Substreams-based backfills and reliable GraphQL/SQL APIs that adhere to strict SLOs. If you're considering rolling out a new product or enhancing your data layer, we’re here to help! We can share tailored reference architectures and migration playbooks designed to meet your specific chains and compliance needs.
Like what you're reading? Let's build together.
Get a free 30-minute consultation with our engineering team.