By AUJay
Blockchain Indexing vs Indexing Blockchain Data vs Blockchain Indexer: Core Concepts for Data Teams
Summary: “Blockchain indexing” is the strategy; “indexing blockchain data” is the ETL you actually run; a “blockchain indexer” is the component (or service) that turns raw blocks into queryable tables. This guide clarifies the terms, shows what’s changed as of 2026, and gives concrete designs, limits, and best practices you can apply today.
Decision-makers ask us variations of the same question: “Do we need an indexer, a data lake, or an API like The Graph/Covalent—what’s the difference?” The terms are often used interchangeably, but they are not the same. In fast-moving chains and L2s, getting this wrong creates brittle pipelines, reorg bugs, unpredictable latency, and runaway costs.
This post clarifies definitions, maps technology choices to outcomes, and provides precise engineering details you can plug into your roadmap.
The three terms—precise definitions for 2026
- Blockchain indexing
- The overall methodology for transforming raw chain data into models your apps, analytics, and ML can query quickly (think architecture and SLAs).
- Indexing blockchain data
- The ETL/ELT implementation: how you extract blocks/receipts/events, handle reorgs/finality, transform to schemas, and load into storage (OLTP/OLAP).
- Blockchain indexer
- The component or service that performs extraction and transformation of chain data (e.g., a subgraph, Substreams pipeline, custom Kafka + workers, or commercial APIs like Covalent). (thegraph.com)
Why it matters: choosing “an indexer” without a strategy creates single points of failure and long-term lock-in. Choosing “indexing blockchain data” without a reorg/finality model leads to data corruption.
Finality, safety, and reorgs—what your SLAs must reflect
Different chains expose different safety signals. Your ingestion pipeline must align to these, or you will surface false positives to users.
- Ethereum (PoS)
- Slot = ~12s; epoch = 32 slots (~6.4 min). Economic finality typically after two justified epochs, so ~12–15 minutes in normal conditions. Clients and RPCs expose block tags “safe” and “finalized” to differentiate risk profiles. (ethereum.github.io)
- OP Stack L2s (e.g., OP Mainnet, Base)
- Three stages for L2 blocks: unsafe (sequencer-confirmed), safe (data posted to L1), finalized (L1 block finalized). Typical latencies: unsafe in seconds; safe in minutes; finalized roughly 15–30 minutes depending on L1 finality. Your indexer can query “safe”/“finalized” via standard JSON-RPC tags. (docs.optimism.io)
- Solana
- Commitment levels: processed, confirmed (≥66% stake voted), finalized (31+ confirmed descendants). Typical finality ~10–20s in steady state. Design for commitment-aware indexing and make the level configurable per dataset. (docs.solanalabs.com)
- ZK/fault-proof L2 nuance
- Even validity rollups can see reorganizations at the sequencing layer during upgrades or multi-sequencer transitions—budget for rare but real reorgs in your pipeline. (starknet.io)
Practical impact: instrument your pipeline with two ingestion modes per chain: “fast path” (unsafe/processed) for product UX with later reconciliation, and “final path” that rewrites or promotes rows on safe/finalized signals.
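The fast-path/final-path split above can be sketched as a pure promotion step. This is a minimal illustration, assuming rows carry a `safety_level` column and that `safeHead`/`finalizedHead` come from your chain's "safe"/"finalized" RPC tags; the function names are hypothetical:

```javascript
// Promote a row's safety level based on chain-head signals.
// Levels are ordered; a row only ever moves forward on the promotion path.
const LEVELS = ['unsafe', 'safe', 'finalized'];

function promoteSafety(row, safeHead, finalizedHead) {
  let level = row.safety_level;
  if (row.block_number <= finalizedHead) level = 'finalized';
  else if (row.block_number <= safeHead) level = 'safe';
  // Never demote: keep the higher of the old and computed levels.
  if (LEVELS.indexOf(level) < LEVELS.indexOf(row.safety_level)) {
    level = row.safety_level;
  }
  return { ...row, safety_level: level };
}
```

A background job can run this over all non-finalized rows each time the safe/finalized heads advance, rewriting or promoting rows in place.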
Data sources you can build on in 2026
- JSON-RPC (pull)
- Standard for EVM data (blocks, txs, receipts, logs). Most providers limit eth_getLogs block range per call; design pagination and backoff. Examples: 10k–20k block caps or response-size caps across providers; some chains enforce 100–1000 block caps. (alchemy.com)
- Tracing RPC
- To capture internal transfers and call trees, use the OpenEthereum-style trace_* APIs (trace_block, trace_filter, trace_transaction) on clients that support them, such as Erigon. Plan for archive/pruning constraints and enable trace modules explicitly. (docs.erigon.tech)
- Streaming-first extraction (Firehose/Substreams)
- Firehose writes chain data to flat files and streams, while Substreams lets you transform in parallel and sink to DBs, queues, or subgraphs; it’s now integrated with The Graph and used in production. (streamingfast.io)
- Managed indexing APIs and platforms
- The Graph Network (40+ chains), Substreams-powered subgraphs for 100x+ faster sync; Goldsky managed subgraphs/webhooks/data pipelines; SubQuery for Cosmos/Polkadot and more; Covalent for normalized multi-chain data. (thegraph.com)
- Public warehouse datasets
- BigQuery public crypto datasets are useful for analytics and prototyping but not guaranteed real-time; check freshness per table and be aware of occasional ingestion delays (e.g., Solana dataset incidents). Google also offers a managed Ethereum dataset (goog_blockchain_ethereum_mainnet_us). (docs.cloud.google.com)
Architectures that work (and ones that don’t)
A. RPC-polling indexer (good for focused datasets)
- Ingest source: receipts/logs via eth_getLogs filtered by address/topic.
- Core loop requirements:
- Split block ranges aggressively based on provider caps and log volume; cap response size. (alchemy.com)
- Use “safe” for low false-positives and “finalized” for settlement-grade workflows when supported. (docs.chainstack.com)
- Handle reorgs with the logs.removed flag and idempotent upserts. (docs.chainrpc.io)
- When it fails: attempting to backfill large chains with single long-range calls will time out or be rate-limited; you’ll miss events during spikes unless you checkpoint cursors per topic/address and implement exponential backoff. (alchemy.com)
Code scaffold (Node.js, paginated eth_getLogs):
async function fetchLogsPaginated(provider, baseFilter, maxBlocks = 2000) {
  // Clamp the requested range to the current head.
  const latest = parseInt(await provider.send('eth_blockNumber', []), 16);
  const from = parseInt(baseFilter.fromBlock, 16);
  const to = Math.min(parseInt(baseFilter.toBlock, 16) || latest, latest);
  const all = [];
  // Walk the range in provider-safe windows of maxBlocks each.
  for (let start = from; start <= to; start += maxBlocks) {
    const end = Math.min(start + maxBlocks - 1, to);
    const filter = {
      ...baseFilter,
      fromBlock: '0x' + start.toString(16),
      toBlock: '0x' + end.toString(16),
    };
    const page = await provider.send('eth_getLogs', [filter]);
    all.push(...page);
  }
  return all;
}
Set maxBlocks per chain/provider (e.g., 2,000–10,000, sometimes 100–1000 on newer L2s like Monad). (alchemy.com)
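Pagination covers extraction; the companion concern is reorg handling. A minimal sketch of the removed-flag plus idempotent-upsert pattern, with a Map standing in for the database table (the key layout mirrors the composite keys discussed above):

```javascript
// Idempotent, reorg-aware log application. `store` is a Map standing in
// for a table keyed on (chainId, txHash, logIndex).
function applyLog(store, chainId, log) {
  const key = `${chainId}:${log.transactionHash}:${parseInt(log.logIndex, 16)}`;
  if (log.removed) {
    // Reorg: this log is no longer on the canonical chain.
    store.delete(key);
    return;
  }
  // Upsert: a re-delivered log overwrites instead of duplicating.
  store.set(key, {
    block_number: parseInt(log.blockNumber, 16),
    contract: log.address,
    topics: log.topics,
    data: log.data,
  });
}
```

Because the key is derived entirely from the log itself, replaying any block range is safe: duplicates collapse and removed logs disappear.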
B. Node + trace (deep EVM semantics)
- Run Erigon or Nethermind with trace APIs for internal calls; plan storage.
- Erigon archive footprint ≈1.77 TB as of Sep 2025; full ≈920 GB. Geth/Nethermind archives are considerably larger (10–14 TB+). Use NVMe and high IOPS. (docs.erigon.tech)
- Pros: complete call trees, accurate DEX routing, internal ETH flows. Cons: ops-heavy, higher storage and maintenance.
C. Streaming-first (Firehose/Substreams)
- For backfills and low-latency pipelines, Substreams processes blocks in parallel and sinks to Postgres, Kafka, S3, or powers subgraphs; teams report order-of-magnitude faster syncs and lower infra costs. (streamingfast.io)
- With The Graph’s Substreams-powered subgraphs, published results are queryable via GraphQL and can sync 100x+ faster vs legacy subgraphs on some workloads (e.g., Uniswap v3 case). (thegraph.com)
D. Managed indexers and APIs
- Goldsky: managed subgraphs, webhooks, and “Mirror” pipelines; transparent usage-based pricing with included free tiers; production SLAs; suitable for teams that want push + GraphQL without ops. (goldsky.com)
- Covalent: normalized multi-chain historical/current data, including balances, txs, logs, NFT metadata—useful for rapid prototyping and multi-chain products. (docs.arbitrum.io)
- SubQuery: strong for Cosmos, Polkadot, and EVM; SDK-first with GraphQL and parallelized indexing. (subquery.network)
- The Graph Network: decentralized indexing across 40+ chains, with the migration away from the hosted service complete; free query tiers and Substreams support. (thegraph.com)
Concrete, chain-aware pipeline examples
1) EVM “token transfers” table that won’t bite you later
- Scope: ERC-20 Transfer(address,address,uint256)
- Keys: (chain_id, tx_hash, log_index) as the primary key; keep block_number alongside for reorg-driven deletes.
- Safety: ingest at tag “safe”, promote to “finalized” after L1 finality.
- Backfill: paginate 2k–10k blocks per request depending on provider/chain; enforce payload-size caps. (alchemy.com)
- Reorgs: if logs.removed=true, delete by composite key and reinsert from the canonical block. (docs.chainrpc.io)
- Internal transfers: if you must reconcile value movements beyond logs, run trace_block/trace_transaction on an archive Erigon node or a provider that exposes trace_*. (docs.erigon.tech)
Minimal schema:
transfers_evm (
  chain_id      int,
  block_time    timestamp,
  block_number  bigint,
  tx_hash       bytea,
  log_index     int,
  contract      bytea,
  from_addr     bytea,
  to_addr       bytea,
  amount        numeric(78,0),
  safety_level  text,  -- unsafe | safe | finalized
  primary key (chain_id, tx_hash, log_index)
)
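Decoding a raw Transfer log into a row for that schema follows directly from the standard ERC-20 event layout: topics[0] is the event signature hash, the indexed from/to addresses sit 32-byte-padded in topics[1] and topics[2], and the uint256 amount is the data word. A sketch (the helper name is ours):

```javascript
// Decode a raw ERC-20 Transfer log into a transfers_evm row.
function toTransferRow(chainId, log, safetyLevel = 'unsafe') {
  // Addresses are the last 20 bytes of a 32-byte topic.
  const addr = (topic) => '0x' + topic.slice(-40);
  return {
    chain_id: chainId,
    block_number: parseInt(log.blockNumber, 16),
    tx_hash: log.transactionHash,
    log_index: parseInt(log.logIndex, 16),
    contract: log.address,
    from_addr: addr(log.topics[1]),
    to_addr: addr(log.topics[2]),
    amount: BigInt(log.data).toString(), // uint256 as a decimal string for numeric(78,0)
    safety_level: safetyLevel,
  };
}
```

Note the amount is kept as a decimal string: JavaScript numbers cannot represent a full uint256, which is exactly why the schema uses numeric(78,0).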
2) OP Stack “bridge messages” with L1-aware safety
- Use L2 RPC “safe” to cut false positives once data is posted to L1; promote when L1 block is finalized. Typical windows: unsafe in seconds; safe in minutes; finalized ~15–30 minutes. (docs.optimism.io)
- For cross-chain safety (multi-rollup apps), consider OP-Supervisor’s model of unsafe/local-safe/cross-safe/finalized when building multi-chain invariants. (docs.optimism.io)
3) Solana program events with commitment control
- Query at commitment=confirmed for UX, but reconcile at finalized. Expect ~10–20s to finality; store slot and confirmationStatus. (docs.solanalabs.com)
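The confirmed-to-finalized reconciliation can be sketched as a pure step over a getSignatureStatuses-style response, whose entries carry a confirmationStatus of processed | confirmed | finalized (or null when the signature is unknown). The function and field names on our side are illustrative:

```javascript
// Rank Solana commitment levels and decide whether a stored row promotes.
const SOLANA_RANK = { processed: 0, confirmed: 1, finalized: 2 };

function reconcileSolanaRow(row, status) {
  if (!status) {
    // Unknown signature (e.g., outside the retention window): flag for
    // review rather than silently keeping a possibly dropped transaction.
    return { ...row, needs_review: true };
  }
  const current = SOLANA_RANK[row.confirmation_status] ?? 0;
  const incoming = SOLANA_RANK[status.confirmationStatus] ?? 0;
  return incoming > current
    ? { ...row, confirmation_status: status.confirmationStatus, slot: status.slot }
    : row;
}
```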
Storage choices for analytics-grade indexing
- Start with columnar files (Parquet) for backfills; layer an open table format for evolution and ACID.
- Apache Iceberg supports schema/partition evolution and hidden partitioning—crucial as event schemas grow. (iceberg.apache.org)
- Delta Lake offers Change Data Feed (CDF) to stream updates/promotions from unsafe→safe→finalized into downstream tables. (docs.delta.io)
- Partitioning
- For OLAP: partition by date (block_time day) plus chain_id; avoid high-cardinality address-based partitions; use clustering/sort by contract or topic for hot queries.
- Warehouse data sources
- Use BigQuery’s managed Ethereum dataset for rapid prototyping; always measure freshness (newest block) and don’t assume real-time; keep an eye on dataset lag incidents for other chains. (docs.cloud.google.com)
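The partitioning guidance above (day-granularity block_time plus chain_id, never raw addresses) reduces to a small key-derivation step. A sketch, assuming block_time is a Unix-epoch seconds value:

```javascript
// Derive OLAP partition values: day of block_time plus chain_id.
// Addresses are deliberately NOT partition keys (too high cardinality);
// they belong in clustering/sort order instead.
function partitionKey(row) {
  const day = new Date(row.block_time * 1000).toISOString().slice(0, 10);
  return { dt: day, chain_id: row.chain_id };
}
```

With Iceberg's hidden partitioning you would express the same idea declaratively (a day() transform on block_time), so writers never need to compute dt by hand.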
Ops details engineering teams often miss
- Cursoring and exactly-once
- Anchor consumer offsets to (block_number, tx_index, log_index) or (slot, index) per chain. Store per-source cursors in a durable KV.
- Idempotency
- Upsert by composite keys; treat removed=true as delete; re-emit derived aggregates idempotently.
- Provider limits and timeouts
- Respect eth_getLogs caps; split ranges; apply topic filters; expect 150MB response caps on some providers. (alchemy.com)
- Trace availability
- trace_* requires specific clients/config and often archive data or “recent N blocks” only—plan fallbacks or scheduled “deep trace” jobs. (docs.erigon.tech)
- Finality-aware retries
- Ethereum and OP Stack expose JSON-RPC tags “safe”/“finalized”; use them instead of hard-coded “12 blocks” rules. (docs.chainstack.com)
- Node sizing if you self-host
- Erigon archive ~1.77 TB as of Sep 2025; still use ≥4 TB NVMe for headroom. Nethermind/Geth archives can be 10–14 TB+. Don’t run on HDDs; target 10k+ IOPS. (docs.erigon.tech)
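The cursoring rule above (anchor offsets to (block_number, tx_index, log_index) and only ever move forward) is easy to get wrong with string cursors; a numeric lexicographic comparison keeps it exact. A minimal sketch:

```javascript
// Durable per-source cursor anchored to (block_number, tx_index, log_index).
// Lexicographic comparison lets a consumer resume exactly where it stopped.
function compareCursors(a, b) {
  return (
    a.block_number - b.block_number ||
    a.tx_index - b.tx_index ||
    a.log_index - b.log_index
  );
}

function advanceCursor(stored, seen) {
  // Only move forward; replays of older events leave the cursor untouched.
  return compareCursors(seen, stored) > 0 ? seen : stored;
}
```

Persist the stored cursor to a durable KV only after the corresponding rows have been committed, so a crash replays at-least-once into the idempotent upserts rather than skipping data.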
Build vs buy in 2026—clear decision guardrails
- Choose Substreams + The Graph when:
- You need high throughput, low-latency backfills, modular Rust transforms, and GraphQL endpoints with decentralized serving across 40+ chains. (thegraph.com)
- Choose Goldsky when:
- You want managed subgraphs, push webhooks, and sinks with usage-based pricing and no infra to run. Evaluate free included worker-hours and entity storage, then scale elastically. (goldsky.com)
- Choose Covalent when:
- You need a normalized multi-chain data model (balances, logs, NFTs) fast, without building ETL from scratch. (docs.arbitrum.io)
- Choose SubQuery when:
- You’re in Cosmos/Polkadot or need SDK-driven custom indexing across many networks. (subquery.network)
- Choose self-hosted Erigon/Nethermind when:
- You need full internal traces, custom archive queries, or strict data sovereignty.
Emerging best practices we recommend to clients
- Two-phase ingestion per dataset
- Fast path: consume unsafe/processed or confirmed commitments for UX.
- Final path: promote rows to finalized using L1/L2 safety signals and CDF-style updates to downstream tables. (docs.delta.io)
- Event-first modeling
- Primary keys: (chain_id, tx_hash, log_index) for EVM; (slot, signature, index) for Solana; include safety_level enum.
- Cross-domain safety for rollups
- Use OP Stack’s safe/finalized and, for cross-chain messages, verify L1 inclusion before marking final in destination chain. (specs.optimism.io)
- Substreams for heavy backfills
- For protocols with millions of events, Substreams can compress multi-week backfills into hours and stream to both DBs and subgraphs. (thegraph.com)
- Provider-aware pagination and filters
- Enforce per-chain block-range caps and topic filters; monitor response size limits; for new chains (e.g., Monad), caps may be 100–1000 blocks. (docs.monad.xyz)
- Alerting and SLOs
- Track head lag for unsafe/safe/finalized independently; fire alerts when safe lag > N minutes or finalized lag > 2 epochs on Ethereum. (ethereum.github.io)
- Warehouse hygiene
- Adopt Iceberg for schema/partition evolution without rewrites; keep Parquet as interchange; use Delta CDF for downstream consumers that must see ordered changes. (iceberg.apache.org)
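The per-level lag alerting described above can be sketched as a simple check over head timestamps versus last-indexed timestamps; the thresholds here are illustrative and should be tuned per chain:

```javascript
// Compute head lag per safety level and flag SLO breaches.
// chainHeadTs/indexedTs/thresholds are maps keyed by safety level,
// with values in Unix-epoch seconds (thresholds in seconds).
function checkLag(chainHeadTs, indexedTs, thresholds) {
  const alerts = {};
  for (const level of Object.keys(thresholds)) {
    const lagSeconds = chainHeadTs[level] - indexedTs[level];
    alerts[level] = { lagSeconds, breach: lagSeconds > thresholds[level] };
  }
  return alerts;
}
```

Tracking the three levels independently matters: a finalized-lag alert with healthy unsafe lag points at L1 finality or promotion jobs, not at your extraction loop.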
“Indexing blockchain data” playbook: a concrete blueprint
- Source selection per chain
- EVM L1/L2: RPC logs + “safe/finalized” tags; add trace_* for internal flows on archive Erigon. (docs.chainstack.com)
- Solana: WebSocket/HTTP at commitment=confirmed→finalized promotion; store slot/confirmationStatus. (docs.solanalabs.com)
- Ingestion mechanics
- Paginate eth_getLogs respecting block-range and payload caps; store cursors per address/topic. (alchemy.com)
- Deduplicate by composite keys; handle logs.removed=true. (docs.chainrpc.io)
- Transform and serve
- Normalize entities (transfers/trades/positions) and enrich with token metadata and prices in separate dimension tables; if you need market data or NFT metadata, consider managed platforms (Goldsky Mirror, Covalent add-ons). (goldsky.com)
- Storage
- Write to Parquet; register Iceberg tables for evolution; expose through Trino/Spark/BigQuery; for change propagation, materialize to Delta with CDF for simple downstream consumption. (iceberg.apache.org)
- Safety promotion
- Maintain per-table safety_level; background jobs promote unsafe→safe→finalized using chain-native tags. On OP Stack, never mark final until the L1 origin is finalized. (docs.optimism.io)
- Monitoring
- Track head lag by safety level, reorg counts, ingestion error rates, and provider throttling.
Common pitfalls we fix most often
- Assuming “12 confirmations” on Ethereum equals finality. It doesn’t under PoS; use safe/finalized RPC tags for correctness. (docs.chainstack.com)
- Treating L2 “unsafe” as final. Sequencers can reorg unsafe blocks; promote only after L1 posting and finalization. (docs.optimism.io)
- Fetching eth_getLogs over multi-million-block ranges. Providers cap block ranges or payload sizes; without pagination you’ll time out or get partial results. (alchemy.com)
- Ignoring trace availability. trace_* requires specific clients and often archive mode or a recent-window constraint. (docs.erigon.tech)
- Relying on public warehouse datasets for real-time UX. Great for analysis; check freshness and plan fallbacks. (docs.cloud.google.com)
Quick buyer’s guide for decision-makers
- Need a production GraphQL API fast with low ops? Use Goldsky or publish a Substreams-powered subgraph on The Graph Network. (goldsky.com)
- Multi-chain analytics with normalized schemas and time-to-value? Evaluate Covalent and/or BigQuery managed datasets. (docs.arbitrum.io)
- You require internal call traces, MEV analysis, or bespoke data sovereignty? Budget for self-hosted Erigon archive + trace pipeline. (docs.erigon.tech)
- You’re indexing Cosmos/Polkadot and want SDK-first controls? SubQuery is built for that. (subquery.network)
TL;DR checklist you can apply this quarter
- Define per-chain safety SLAs (unsafe/safe/finalized) and make them visible in your tables. (docs.chainstack.com)
- Implement paginated log ingestion with provider-specific caps and topic filters. (alchemy.com)
- Make upserts idempotent and reorg-aware (use removed=true). (docs.chainrpc.io)
- Use Iceberg/Delta CDF to support schema evolution and state promotions without table rewrites. (iceberg.apache.org)
- For heavy backfills or multi-sink needs, adopt Substreams or a managed equivalent. (thegraph.com)
- If you rely on public datasets, add a freshness check to every job. (docs.cloud.google.com)
If you want a neutral architecture review or a sprint plan to move from brittle RPC scrapers to safety-aware, streaming-first indexing, 7Block Labs can help you model costs, pick the right stack per chain, and ship production pipelines that survive real-world reorgs and growth.
Like what you're reading? Let's build together.
Get a free 30‑minute consultation with our engineering team.

