By AUJay
Blockchain Indexing vs Indexing Blockchain Data vs Blockchain Indexer: Core Concepts for Data Teams
Summary: "Blockchain indexing" is the overall strategy; "indexing blockchain data" is the process itself; a "blockchain indexer" is the tool or service that turns raw blocks into queryable tables. This guide pins down the terminology, surveys the state of the art as of 2026, and shares practical designs, limitations, and best practices you can apply immediately.
A question we hear constantly from decision-makers: "Do we actually need an indexer, a data lake, or an API like The Graph or Covalent? Aren't they the same thing?" The terms get used interchangeably, but they name different things, and in a fast-moving landscape of chains and L2s, conflating them leads to fragile pipelines, reorg bugs, surprise latency, and climbing costs.
This post pins down the definitions, maps technology choices to outcomes, and gives concrete engineering details you can put directly on a roadmap.
The three terms--precise definitions for 2026
- Blockchain Indexing: The end-to-end discipline of transforming raw chain data into a form that apps, analytics, and ML can query, including the architecture and the SLAs attached to it.
- Indexing Blockchain Data: The ETL/ELT process itself: extracting blocks, receipts, and events; handling reorgs and finality; transforming into schemas; and loading into OLTP or OLAP storage.
- Blockchain Indexer: The component or service that performs the extraction and transformation. It can be a subgraph, a Substreams pipeline, a custom Kafka setup with workers, or a commercial API such as Covalent (thegraph.com).
Why It Matters
Choosing "an indexer" without a strategy exposes you to single points of failure and long-term lock-in; "indexing blockchain data" without a sound reorg and finality model invites data corruption.
Finality, safety, and reorgs--what your SLAs must reflect
Different chains emit different safety signals, and your ingestion pipeline must stay aligned with them; otherwise you will show users data that can later disappear.
- Ethereum (PoS): Slots are 12 seconds, and an epoch is 32 slots (about 6.4 minutes). Economic finality arrives after two justified epochs, roughly 12-15 minutes under normal conditions. Clients and RPCs expose the block tags "safe" and "finalized" so you can encode these risk levels directly.
- OP Stack L2s (OP Mainnet, Base): L2 blocks pass through three stages: "unsafe" (accepted by the sequencer), "safe" (data posted to L1), and "finalized" (the containing L1 block is finalized). Typical latencies are seconds for unsafe, a few minutes for safe, and about 15-30 minutes for finalized, depending on L1 finality. Your indexer can query these states through the standard JSON-RPC "safe" and "finalized" tags.
- Solana: Commitment levels are "processed" (seen by the node), "confirmed" (voted on by 66%+ of stake), and "finalized" (at least 31 confirmed blocks built on top). Steady-state finality is typically 10-20 seconds; index with commitment awareness and tune the level to each dataset's requirements.
- ZK/fault-proof L2 nuance: Even validity rollups can reorg at the sequencing layer, especially during upgrades or sequencer switchover. Account for these rare but real reorgs in your pipeline.
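To keep these per-chain signals uniform in a pipeline, it helps to map a desired guarantee onto each chain's native block tag or commitment level. A minimal sketch, with illustrative function and map names (the tag strings themselves are the real JSON-RPC tags and Solana commitment values):

```javascript
// Illustrative mapping from chain family + desired guarantee to the native
// block tag (EVM) or commitment level (Solana).
const SAFETY_TAGS = {
  evm:    { fast: 'latest',    safe: 'safe',      final: 'finalized' },
  solana: { fast: 'processed', safe: 'confirmed', final: 'finalized' },
};

// Returns the tag to pass to the RPC for the requested guarantee.
function safetyTag(chainFamily, guarantee) {
  const tags = SAFETY_TAGS[chainFamily];
  if (!tags || !tags[guarantee]) {
    throw new Error(`No safety tag for ${chainFamily}/${guarantee}`);
  }
  return tags[guarantee];
}
```

A lookup like this keeps "which tag means settled?" out of per-chain ingestion code.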
Practical Impact
Run two ingestion modes per chain:
- Fast path: Ingests unconfirmed ("unsafe") data so the product feels responsive, accepting that rows may be corrected later.
- Final path: Upserts or promotes rows once safe/finalized signals arrive.
Together they deliver low latency without sacrificing correctness.
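The promotion rule behind this dual-mode design can be sketched in a few lines: a row's safety level only moves forward, and re-delivering an old signal is a no-op. This is a minimal sketch with illustrative names, not a full implementation:

```javascript
// Safety levels ordered from weakest to strongest guarantee.
const RANK = { unsafe: 0, safe: 1, finalized: 2 };

// Given the level already stored for a row (or undefined for a new row) and
// an incoming observation, return the level to store. Promotion is monotonic:
// a stale "unsafe" signal never demotes a row already marked "safe".
function promote(existingLevel, incomingLevel) {
  if (existingLevel === undefined) return incomingLevel;
  return RANK[incomingLevel] > RANK[existingLevel] ? incomingLevel : existingLevel;
}
```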
Data sources you can build on in 2026
- JSON-RPC (pull): The default for EVM data: blocks, transactions, receipts, and logs. Many providers cap the block range of an eth_getLogs call, so plan for pagination and backoff. Caps of 10k-20k blocks, or response-size limits, are common and vary by provider; some chains are stricter, with limits of 100-1,000 blocks (alchemy.com).
- Tracing RPC: For internal transfers and call trees, Erigon and OpenEthereum expose trace_* methods (trace_block, trace_filter, trace_transaction) on nodes that support them. Account for archive and pruning constraints, and enable the trace modules explicitly (docs.erigon.tech).
- Streaming-first extraction (Firehose/Substreams): Firehose writes chain data out to flat files and streams; Substreams applies parallelized transformations on top, with sinks for databases, queues, and subgraphs. It is integrated with The Graph and running in production (streamingfast.io).
- Managed indexing APIs and platforms: The Graph Network supports 40+ chains, and Substreams-powered subgraphs can sync up to roughly 100x faster. Goldsky offers managed subgraphs, webhooks, and data pipelines; SubQuery targets Cosmos and Polkadot; Covalent provides normalized multi-chain data (thegraph.com).
- Public warehouse datasets: BigQuery hosts public crypto datasets that are useful for analytics and prototyping but are not guaranteed fresh; monitor per-table freshness, since ingestion delays have occurred (for example with the Solana datasets). Google also maintains a managed Ethereum dataset, goog_blockchain_ethereum_mainnet_us (docs.cloud.google.com).
Architectures that work (and ones that don’t)
A. RPC-polling indexer (good for focused datasets)
- Ingest source: Pull receipts and logs with eth_getLogs, filtered by address and topic.
- Core loop requirements:
- Chunk block ranges to stay within provider limits and keep log volume and response size manageable (alchemy.com).
- Query at "safe" to limit false positives; use "finalized" where settlement-grade reliability is required and available (docs.chainstack.com).
- Handle reorgs by honoring the logs.removed flag and keeping upserts idempotent (docs.chainrpc.io).
- Failure modes: Backfilling large chains with a single long-range call hits timeouts or rate limits; under load, events are silently missed unless you track cursors per topic and address; use exponential backoff (alchemy.com).
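The backoff advice can be captured in a small generic wrapper around any flaky RPC call. This is a sketch; the retry count and base delay are illustrative defaults, not provider guidance:

```javascript
// Retry an async function with exponential backoff plus jitter.
// retries = max additional attempts after the first; baseMs = initial delay.
async function withBackoff(fn, { retries = 5, baseMs = 250 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // give up after the last retry
      const delay = baseMs * 2 ** attempt + Math.random() * baseMs; // jitter
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Wrap each paginated page fetch, e.g. withBackoff(() => rpc('eth_getLogs', [filter])), so transient rate limits do not abort a backfill.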
Code Scaffold (Node.js, Paginated eth_getLogs)
The scaffold below fetches logs from an Ethereum node in paginated batches.
Setting Up the Project
Create a Node.js project and install axios for HTTP requests:
mkdir eth-logs-fetcher
cd eth-logs-fetcher
npm init -y
npm install axios
Code Example
const axios = require('axios');

const INFURA_URL = 'https://mainnet.infura.io/v3/YOUR_INFURA_PROJECT_ID';
const BATCH_SIZE = 1000; // blocks per eth_getLogs request; tune per provider

const LOG_FILTER = {
  fromBlock: '0x1',
  toBlock: 'latest',
  address: '0xADDRESS',
  topics: ['0xTOPIC'],
};

async function rpc(method, params) {
  const response = await axios.post(INFURA_URL, { jsonrpc: '2.0', method, params, id: 1 });
  if (response.data.error) throw new Error(response.data.error.message);
  return response.data.result;
}

async function fetchLogs(filter) {
  let allLogs = [];
  let fromBlock = parseInt(filter.fromBlock, 16);
  // 'latest' is not a hex number: resolve it to the current head first.
  const toBlock =
    filter.toBlock === 'latest'
      ? parseInt(await rpc('eth_blockNumber', []), 16)
      : parseInt(filter.toBlock, 16);
  while (fromBlock <= toBlock) {
    const endBlock = Math.min(fromBlock + BATCH_SIZE - 1, toBlock);
    const currentFilter = {
      ...filter,
      fromBlock: `0x${fromBlock.toString(16)}`,
      toBlock: `0x${endBlock.toString(16)}`,
    };
    allLogs = allLogs.concat(await rpc('eth_getLogs', [currentFilter]));
    fromBlock = endBlock + 1;
  }
  return allLogs;
}

async function main() {
  try {
    const logs = await fetchLogs(LOG_FILTER);
    console.log(`Fetched ${logs.length} logs`);
    // Process logs as needed
  } catch (error) {
    console.error(`Error in main: ${error}`);
  }
}

main();
How It Works
1. Filter: LOG_FILTER defines the block range plus the address and topics of interest.
2. Fetch: fetchLogs walks the range in BATCH_SIZE chunks, concatenating each page of results.
3. Errors: fetch failures are logged to the console.
4. Main: main fetches the logs and hands them off for processing.
Adjust LOG_FILTER, BATCH_SIZE, and the log handling to fit your application. Below is a more compact variant, written against an ethers-style provider that exposes a send method:
async function fetchLogsPaginated(provider, baseFilter, maxBlocks = 2000) {
const latest = parseInt(await provider.send('eth_blockNumber', []), 16);
const from = parseInt(baseFilter.fromBlock, 16);
const to = Math.min(parseInt(baseFilter.toBlock, 16) || latest, latest);
let all = [];
for (let start = from; start <= to; start += maxBlocks) {
const end = Math.min(start + maxBlocks - 1, to);
const filter = {
...baseFilter,
fromBlock: '0x' + start.toString(16),
toBlock: '0x' + end.toString(16)
};
const page = await provider.send('eth_getLogs', [filter]);
all.push(...page);
}
return all;
}
Tune maxBlocks per chain and provider; 2,000-10,000 is typical, though some newer chains such as Monad cap ranges as low as 100-1,000 blocks.
B. Node + trace (deep EVM semantics)
Run Erigon or Nethermind with the trace APIs enabled for internal calls, and plan storage up front.
- As of September 2025, an Erigon archive node needs roughly 1.77 TB (a full node about 920 GB), while Geth or Nethermind archives require 10-14+ TB, so budget for high-IOPS NVMe (docs.erigon.tech).
- Pros: Full call trees, precise DEX routing, and direct visibility into internal ETH flows. Cons: Operationally heavy, with more storage and maintenance than most teams expect.
C. Streaming-first (Firehose/Substreams)
Substreams excels at backfills and low-latency pipelines because it processes blocks in parallel, with sinks for Postgres, Kafka, S3, and subgraphs (streamingfast.io). Teams report order-of-magnitude faster syncs and meaningfully lower infrastructure costs.
- The Graph's Substreams-powered subgraphs expose published results over GraphQL and can sync over 100x faster than legacy subgraphs for some workloads, such as Uniswap v3 (thegraph.com).
D. Managed indexers and APIs
- Goldsky: Managed subgraphs, webhooks, and "Mirror" pipelines, with transparent usage-based pricing, free tiers, and production SLAs. A good fit for teams that want GraphQL plus push delivery without the operational overhead (goldsky.com).
- Covalent: Normalized multi-chain historical and current data: balances, transactions, logs, and NFT metadata. Well suited to rapid prototyping and multi-chain products.
- SubQuery: Strongest for Cosmos, Polkadot, and EVM projects, with SDK-first development, GraphQL support, and parallel indexing in the works (subquery.network).
- The Graph Network: Decentralized indexing across 40+ chains, with the hosted service fully sunset, free query tiers, and Substreams support (thegraph.com).
Concrete, chain-aware pipeline examples
1) EVM “token transfers” table that won’t bite you later
- Scope: ERC-20 Transfer(address,address,uint256) events.
- Keys: (chain_id, block_number, tx_hash, log_index).
- Safety: Ingest at "safe"; promote to "finalized" once L1 finality lands.
- Backfill: Paginate 2,000-10,000 blocks per request, depending on provider and chain, and watch payload-size limits.
- Reorgs: On logs.removed=true, delete by the composite key and reinsert from the canonical block (docs.chainrpc.io).
- Internal transfers: For value movement beyond logs, run trace_block or trace_transaction against an archive Erigon node, or use a provider that supports trace_* (docs.erigon.tech).
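The reorg rule above reduces to one idempotent apply step per log. A minimal sketch, where a Map stands in for a real store and the field names (transactionHash, logIndex, removed) follow eth_getLogs output:

```javascript
// Reorg-safe, idempotent application of an eth_getLogs entry to a table
// keyed by the composite key (chain_id, tx_hash, log_index).
function applyLog(table, chainId, log) {
  const key = `${chainId}:${log.transactionHash}:${log.logIndex}`;
  if (log.removed) {
    table.delete(key); // reorged-out log: delete by composite key
    return;
  }
  table.set(key, log); // upsert: replaying the same log is a no-op
}
```

Because the same delivery can arrive twice (restart, retry, reorg replay), both branches are safe to repeat.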
Minimal schema:
transfers_evm(
chain_id int,
block_time timestamp,
block_number bigint,
tx_hash bytea,
log_index int,
contract bytea,
from_addr bytea,
to_addr bytea,
amount numeric(78,0),
safety_level text, -- unsafe|safe|finalized
primary key (chain_id, tx_hash, log_index)
)
2) OP Stack “bridge messages” with L1-aware safety
Ingest at the L2 RPC "safe" tag to reduce false positives once data is posted to L1, and promote to "finalized" when the L1 block finalizes. The usual progression: unsafe for a few seconds, safe within a few minutes, finalized after roughly 15-30 minutes.
For cross-chain safety in multi-rollup apps, see OP-Supervisor's model, which distinguishes unsafe, local-safe, cross-safe, and finalized states; it is useful when composing multi-chain invariants.
3) Solana program events with commitment control
Poll with commitment=confirmed for UX, then reconcile against finalized; expect roughly 10-20 seconds to finality. Track both the slot and the confirmationStatus (docs.solanalabs.com).
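The confirmed-to-finalized reconciliation can be sketched as a pure promotion pass: once the cluster's finalized slot catches up, every row ingested at "confirmed" in an earlier-or-equal slot is promoted. Row shapes here are illustrative:

```javascript
// Promote rows ingested at commitment "confirmed" once the finalized slot
// reaches them. Rows newer than finalizedSlot stay at "confirmed".
function reconcile(rows, finalizedSlot) {
  return rows.map((row) =>
    row.slot <= finalizedSlot && row.safety_level === 'confirmed'
      ? { ...row, safety_level: 'finalized' }
      : row
  );
}
```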
Storage choices for analytics-grade indexing
Start with columnar files (Parquet) for backfills, then layer an open table format on top for ongoing writes and ACID guarantees. Apache Iceberg handles schema and partition evolution, and its hidden partitioning pays off as event schemas grow (iceberg.apache.org). Delta Lake's Change Data Feed (CDF) streams updates as rows are promoted from "unsafe" to "safe" to "finalized" in downstream tables (docs.delta.io).
- Partitioning: For OLAP, partition by date (block_time day) and by chain_id. Avoid high-cardinality address-based partitions; instead cluster or sort hot queries by contract or topic.
- Warehouse data sources
- BigQuery's managed Ethereum dataset is useful for rapid prototyping, but monitor freshness (latest block and real-time lag), and watch for ingestion lag in other chains' datasets (docs.cloud.google.com).
Ops details engineering teams often miss
- Cursoring and exactly-once: Tie offsets to (block_number, tx_index, log_index) or (slot, index) per chain, and persist per-source cursors in a durable key-value store.
- Idempotency: Upsert on composite keys, treat removed=true as a delete, and re-emit derived aggregates idempotently.
- Provider limits and timeouts: Respect eth_getLogs limits by splitting ranges and using topic filters; some providers cap responses at around 150 MB (alchemy.com).
- Trace availability: trace_* methods require the right clients and configuration, and typically work only against archive data or the most recent N blocks. Plan fallbacks or dedicated "deep trace" jobs (docs.erigon.tech).
- Finality-aware retries: Ethereum and the OP Stack expose the JSON-RPC tags "safe" and "finalized"; drive retries off these rather than hard-coded rules like "12 blocks" (docs.chainstack.com).
- Node sizing for self-hosting: An Erigon archive needs about 1.77 TB as of September 2025; provision at least 4 TB of NVMe for headroom. Nethermind and Geth archives run 10 to 14+ TB. Avoid HDDs; target 10,000+ IOPS (docs.erigon.tech).
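The cursoring and idempotency rules above hinge on an ordered composite cursor: restarts must resume strictly after the last event applied. A minimal sketch, with illustrative field names:

```javascript
// Compare two EVM-style cursors (block_number, tx_index, log_index)
// lexicographically; negative/zero/positive like a standard comparator.
function cmpCursor(a, b) {
  return (a.block - b.block) || (a.txIndex - b.txIndex) || (a.logIndex - b.logIndex);
}

// An event is new iff its cursor is strictly greater than the stored one
// (stored === null means nothing has been applied yet).
function isNew(cursor, stored) {
  return stored === null || cmpCursor(cursor, stored) > 0;
}
```

On restart, load the stored cursor, skip anything where isNew is false, and advance the cursor only after the event's effects are durably applied.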
Build vs buy in 2026--clear decision guardrails
- Opt for Substreams + The Graph when: You need high-throughput backfills, modular transformations in Rust, and GraphQL endpoints across 40+ chains.
- Choose Goldsky when: You want managed subgraphs, push webhooks, and sinks with usage-based pricing and no infrastructure to run; note the free worker hours and storage, and scale up as needed.
- Choose Covalent when: You need a normalized multi-chain data model (balances, logs, NFTs) quickly, without building ETL from scratch.
- Choose SubQuery when: You work with Cosmos or Polkadot, or want SDK-driven custom indexing across networks.
- Pick self-hosted Erigon or Nethermind when: You need full internal traces, custom archive queries, or strict data sovereignty.
Emerging best practices we recommend to clients
- Two-step ingestion per dataset:
- Fast path: Land unsafe/processed and confirmed commitments quickly for UX.
- Final path: Promote rows to finalized using L1/L2 safety signals and CDF-style updates into downstream tables (docs.delta.io).
- Event-first modeling: Primary keys are (chain_id, tx_hash, log_index) for EVM and (slot, signature, index) for Solana, always alongside a safety_level enum.
- Cross-domain safety for rollups: Use OP Stack's safe and finalized signals, and for cross-chain messages verify L1 inclusion before treating anything as final on the destination chain (specs.optimism.io).
- Substreams for heavy backfills: For protocols emitting millions of events, Substreams turns weeks of backfill into hours by processing in parallel and streaming directly to databases and subgraphs (thegraph.com).
- Provider-aware pagination and filters: Respect per-chain block-range limits, use topic filters, and watch response size; newer chains such as Monad cap ranges at 100-1,000 blocks (docs.monad.xyz).
- Alerting and SLOs
- Track head lag separately for unsafe, safe, and finalized, and alert when safe lag exceeds N minutes or finalized lag exceeds 2 epochs on Ethereum (ethereum.github.io).
- Warehouse hygiene: Use Iceberg for schema and partition changes without rewrites, Parquet for data interchange, and Delta CDF for downstream consumers that need ordered change feeds (iceberg.apache.org).
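The head-lag alerting described above can be sketched as a small evaluation function. Thresholds here are illustrative examples; the finalized threshold approximates 2 Ethereum epochs (2 x 32 slots x 12 seconds):

```javascript
// Example per-safety-level lag thresholds, in seconds.
const THRESHOLDS_SEC = { unsafe: 60, safe: 300, finalized: 2 * 32 * 12 };

// headTimes maps safety level -> unix timestamp of the newest block ingested
// at that level; returns one alert per level whose lag exceeds its threshold.
function lagAlerts(headTimes, now) {
  const alerts = [];
  for (const [level, blockTime] of Object.entries(headTimes)) {
    const lag = now - blockTime;
    if (lag > THRESHOLDS_SEC[level]) alerts.push({ level, lagSec: lag });
  }
  return alerts;
}
```

Keeping the three levels separate matters: a stalled finalized feed with a healthy unsafe feed points at L1 posting or finality, not at your extractor.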
“Indexing blockchain data” playbook: a concrete blueprint
1) Source selection per chain
- EVM L1/L2: read logs over RPC and rely on the safe/finalized block tags. For internal call flows, run an Erigon archive node and use the trace_* APIs.
- Solana: consume over WebSocket or HTTP, promote commitment from confirmed to finalized, and track the slot and confirmationStatus fields.
2) Ingestion mechanics
- Paginate eth_getLogs requests, respecting each provider's block-range and payload limits.
- Persist a cursor per address/topic set so ingestion can resume exactly where it stopped.
- Use composite keys for idempotent upserts, and handle reorgs by honoring logs.removed=true.
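The ingestion loop above can be sketched in Python. Assume a hypothetical get_logs(from_block, to_block) callable standing in for a real eth_getLogs client, so the chunking and cursor logic stays provider-agnostic:

```python
from typing import Callable, Iterator, List

def paginated_logs(
    get_logs: Callable[[int, int], List[dict]],
    start_block: int,
    end_block: int,
    max_range: int = 1000,  # provider-specific cap, e.g. 100-1000 on newer chains
) -> Iterator[dict]:
    """Walk [start_block, end_block] in provider-sized chunks.

    `cursor` is the value you would persist per address/topic set so a
    restarted job resumes exactly where it stopped.
    """
    cursor = start_block
    while cursor <= end_block:
        chunk_end = min(cursor + max_range - 1, end_block)
        for log in get_logs(cursor, chunk_end):
            yield log
        cursor = chunk_end + 1  # persist this checkpoint before moving on

# Usage with a fake provider that returns one "log" per block:
fake = lambda a, b: [{"block": n} for n in range(a, b + 1)]
logs = list(paginated_logs(fake, 0, 2500, max_range=1000))
assert len(logs) == 2501
```

The same loop shape works for any range-limited endpoint; only max_range and the client change per provider.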
3) Transform and serve
- Normalize entities such as transfers, trades, and positions, then enrich them with token metadata and current prices via separate dimension tables. For market data or NFT metadata, managed platforms such as Goldsky Mirror or Covalent add-ons can fill the gaps (goldsky.com).
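As a rough illustration of the dimension-table enrichment, here is a sketch in plain Python with an in-memory dict standing in for the token metadata table (addresses, symbols, and decimals are illustrative):

```python
# Dimension table: token metadata keyed by (chain_id, token_address).
token_dim = {
    (1, "0xa0b8..."): {"symbol": "USDC", "decimals": 6},
    (1, "0xc02a..."): {"symbol": "WETH", "decimals": 18},
}

def enrich_transfer(row: dict) -> dict:
    """Join a raw transfer row against the token dimension table."""
    meta = token_dim.get((row["chain_id"], row["token"]), {})
    out = dict(row)
    out["symbol"] = meta.get("symbol", "UNKNOWN")
    decimals = meta.get("decimals")
    if decimals is not None:
        # Scale the raw integer amount into a human-readable value.
        out["amount"] = row["raw_amount"] / 10 ** decimals
    return out

t = enrich_transfer({"chain_id": 1, "token": "0xa0b8...", "raw_amount": 2_500_000})
assert t["symbol"] == "USDC" and t["amount"] == 2.5
```

In production this join lives in SQL against a real dimension table, but the shape is the same: fact rows stay raw and immutable, enrichment is a lookup.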
4) Storage
- Land data as Parquet and register Iceberg tables so schemas can evolve; query them through Trino, Spark, or BigQuery. For downstream consumers, propagate changes via Delta with Change Data Feed (CDF) (iceberg.apache.org).
5) Safety promotion
- Track a safety level per table. Background jobs promote rows from unsafe to safe to finalized using chain-native tags; on the OP Stack, never mark anything final until the L1 origin is itself finalized (docs.optimism.io).
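A minimal sketch of that promotion job, assuming each row carries a block_number and a safety_level and that the safe/finalized head heights come from chain-native tags (field names are illustrative):

```python
SAFETY_ORDER = ["unsafe", "safe", "finalized"]

def promote(row: dict, safe_head: int, finalized_head: int) -> dict:
    """Promote a row's safety_level based on chain-native head positions.

    Promotion is monotonic: a row never moves backward. A reorg is handled
    by deleting/rewriting unsafe rows, not by demoting finalized ones.
    """
    level = row["safety_level"]
    if row["block_number"] <= finalized_head:
        level = "finalized"
    elif row["block_number"] <= safe_head:
        # max() by position in SAFETY_ORDER keeps promotion one-way.
        level = max(level, "safe", key=SAFETY_ORDER.index)
    return {**row, "safety_level": level}

row = {"block_number": 100, "safety_level": "unsafe"}
assert promote(row, safe_head=120, finalized_head=90)["safety_level"] == "safe"
assert promote(row, safe_head=120, finalized_head=150)["safety_level"] == "finalized"
```

In a warehouse this is a periodic UPDATE ... WHERE block_number <= head per level; the in-memory version just makes the state machine explicit.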
6) Monitoring
- Track head lag per safety level, reorg counts, ingestion error rates, and provider throttling.
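One way to sketch the head-lag check, assuming Ethereum mainnet timing (12-second slots, 32-slot epochs) and an illustrative 5-minute threshold for safe lag:

```python
import time

SECONDS_PER_BLOCK = 12   # Ethereum mainnet slot time
EPOCH_SECONDS = 32 * 12  # one epoch = 32 slots

def lag_alerts(heads: dict, wall_clock: float, safe_max_s: float = 300) -> list:
    """Return alert strings when safe/finalized heads fall behind.

    `heads` maps safety level -> timestamp of the newest indexed block
    at that level, e.g. {"safe": ..., "finalized": ...}.
    """
    alerts = []
    if wall_clock - heads["safe"] > safe_max_s:
        alerts.append("safe head lagging > %ds" % safe_max_s)
    if wall_clock - heads["finalized"] > 2 * EPOCH_SECONDS:
        alerts.append("finalized head lagging > 2 epochs")
    return alerts

now = time.time()
assert lag_alerts({"safe": now, "finalized": now}, now) == []
assert len(lag_alerts({"safe": now - 9999, "finalized": now - 9999}, now)) == 2
```

Keeping one threshold per safety level is the point: a growing gap between safe and finalized lag often signals a chain-side issue rather than a pipeline bug.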
Common pitfalls we fix most often
- Don't treat "12 confirmations" on Ethereum as finality. Under Proof of Stake, finality is signaled by the safe and finalized RPC tags, not a confirmation count (docs.chainstack.com).
- Don't treat L2 "unsafe" blocks as settled. Sequencers can reorganize unsafe blocks, so wait until batches are posted and finalized on L1 before promoting (docs.optimism.io).
- Don't pull eth_getLogs over very large block ranges without care. Many providers cap block ranges or payload sizes, and skipping pagination risks timeouts or silently incomplete results (alchemy.com).
- Don't forget trace availability. The trace_* APIs depend on specific clients and generally require archive mode or a recent data window (docs.erigon.tech).
- Don't rely on public warehouse datasets for real-time UX. They're great for analysis, but verify freshness and keep a fallback (docs.cloud.google.com).
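To make the reorg pitfall concrete, here is a minimal sketch of fork-point detection by hash comparison (block shapes are illustrative; a real indexer would rewind its cursor to the returned height and re-ingest from there):

```python
def find_fork_point(indexed: list, canonical: list) -> int:
    """Compare our indexed chain tip against the node's canonical view.

    Returns the first height where hashes diverge (the rewind target),
    or len(indexed) if everything still matches.
    """
    for i, (ours, theirs) in enumerate(zip(indexed, canonical)):
        if ours["hash"] != theirs["hash"]:
            return i
    return len(indexed)

indexed = [{"hash": "a"}, {"hash": "b"}, {"hash": "c"}]
canonical = [{"hash": "a"}, {"hash": "b'"}, {"hash": "c'"}]
assert find_fork_point(indexed, canonical) == 1  # rewind to height 1
```

Pairing this check with idempotent upserts means a rewind is just "delete rows above the fork point and replay" with no duplicate risk.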
Quick buyer’s guide for decision-makers
- Want a production GraphQL API without heavy lifting? Consider Goldsky, or launch a Substreams-powered subgraph on The Graph Network (goldsky.com).
- Need multi-chain analytics with ready-made schemas? Look at Covalent, plus the managed datasets accessible through BigQuery (docs.arbitrum.io).
- Need internal call traces, MEV analysis, or strict data sovereignty? Budget for a self-hosted Erigon archive and trace pipeline (docs.erigon.tech).
- Indexing Cosmos or Polkadot with an SDK-first workflow? SubQuery is the natural fit (subquery.network).
TL;DR checklist you can apply this quarter
- Define per-chain safety SLAs: pin down what unsafe, safe, and finalized mean and make those levels explicit in your tables.
- Set up paginated log ingestion with provider-specific limits and topic filters.
- Make upserts idempotent and reorg-aware, honoring removed=true.
- Adopt Iceberg or Delta CDF for schema changes and state promotion without table rewrites.
- For heavy backfills or multiple sinks, evaluate Substreams or a comparable managed option.
- If you depend on public datasets, add a freshness check to every job.
If you want an honest review of your architecture, or a sprint plan for moving from fragile RPC scrapers to a reliable, streaming-first indexing system, 7Block Labs can help: scoping costs, choosing the right stack per chain, and standing up production pipelines that handle real-world reorgs and scale with demand.
Like what you're reading? Let's build together.
Get a free 30-minute consultation with our engineering team.
Related Posts
ByAUJay
Smart Tokenomics: Building for Stability, Not Just Buzz
### Strategic Tokenomics That Will Survive 2026 Forget about jumping on the next hype train--it's all about building a token that’s rooted in solid, provable unit economics. In this post, we’ll dive into how you can leverage rollup margins, ZK costs, cross-chain security, and MiCA constraints to create a token system that’s not just stable but also brings in a positive return on investment.
ByAUJay
Why Going Remote-First is a Game Changer for Blockchain Development
**Summary:** Remote-first blockchain engineering goes beyond just hopping on Zoom calls across different time zones. It’s a game-changing operating model that speeds up lead times, strengthens chain operations, and cuts down overall delivery costs by bringing together global talent with real-world protocols.
ByAUJay
M&A in Crypto: Tips for Successfully Integrating a Blockchain Acquisition
**M&A in Crypto: A Playbook for Seamless Blockchain Integration** Looking to navigate a blockchain acquisition without running into deadline delays or losing value? This handy playbook dives deep into where the risks lurk--think keys, circuits, bridges, and AA migrations. Plus, it outlines effective strategies to tackle those challenges head-on, all while speeding up the licensing process.

