Building an AI operations agent for blockchain infrastructure
In partnership with LoremLabs, we worked with a leading cryptocurrency exchange to improve the reliability of a large and varied blockchain infrastructure estate.
The Site Reliability Engineering team was responsible for more than 90 blockchain protocols running on Kubernetes. The platform consisted of thousands of nodes that needed constant monitoring, security maintenance, backups, upgrades and operational care. These protocols were used daily by engineers to test and develop new features before those changes moved towards production.
At that scale, reliability was not just a matter of responding quickly when something broke. The harder problem was consistency. Many incidents followed familiar patterns, but each still required an engineer to gather context, check the right systems and remember protocol-specific details.
We developed an AI operations agent based on Claude Code to make that operational knowledge reusable, observable and increasingly autonomous.
The challenge
The node team had deep expertise, but much of it existed as tribal knowledge. Engineers knew which dashboards mattered, which internal APIs held the right state, which upgrades were scheduled and which recovery steps were safe. That knowledge was valuable, but it was also expensive to rediscover during every incident.
Early experiments with AI assistance showed promise, but also exposed an obvious weakness. Without strong operational context, the model would spend too much time exploring the same paths, asking the same questions or occasionally heading in the wrong direction.
The goal was to build something closer to an experienced SRE. It needed to understand common operational workflows, reason through infrastructure health in a predictable order, remember what had happened before and give us enough visibility to trust its behaviour over time.
Encoding operational knowledge with Claude Skills
The first step was to develop a set of Claude Skills that extended Claude Code with environment-specific operational instructions.
Each skill captured a repetitive task that engineers had previously handled manually. Instead of relying on the model to infer the correct process from scratch, the skill gave it a proven path to follow.
Examples included:
- Querying internal APIs to find scheduled protocol updates.
- Applying scheduled updates safely across managed infrastructure.
- Creating snapshots from EBS volumes.
- Restoring environments from EBS snapshots.
- Performing common health checks and reconciliation steps.
This cut both wasted tokens and avoidable mistakes. More importantly, it turned undocumented operational habits into reusable procedures. The agent could start from the right context rather than burning time discovering what the team already knew.
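To make the first example in the list concrete, here is a minimal sketch of the kind of step such a skill might direct the agent to run: given update records already fetched from an internal API, pick out the ones due soon. The record shape, the `ScheduledUpdate` and `updates_due` names, the 24 hour window and the sample data are all assumptions for illustration. The source does not describe the real API or its fields.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ScheduledUpdate:
    # Hypothetical record shape; the real internal API is not described.
    protocol: str
    target_version: str
    scheduled_for: datetime


def updates_due(updates, now, window=timedelta(hours=24)):
    """Return updates scheduled between `now` and `now + window`, soonest first."""
    due = [u for u in updates if now <= u.scheduled_for <= now + window]
    return sorted(due, key=lambda u: u.scheduled_for)


# Made-up sample data standing in for an API response.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sample = [
    ScheduledUpdate("protocol-a", "v1.2.0", now + timedelta(hours=30)),
    ScheduledUpdate("protocol-b", "v0.9.1", now + timedelta(hours=2)),
    ScheduledUpdate("protocol-c", "v3.0.0", now - timedelta(hours=1)),
]

due = updates_due(sample, now)
print([u.protocol for u in due])
```

Only `protocol-b` falls inside the window here: `protocol-a` is too far out and `protocol-c` is already in the past.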
Moving from assistance to autonomous operation
After extensive testing, we started moving beyond a human-in-the-loop workflow towards a more autonomous agentic tool.
We did not want an unconstrained agent that freely explored every possible diagnosis. That sounds clever until it spends half its budget enthusiastically investigating a red herring. Very futuristic. Very expensive.
Instead, we built a directed acyclic graph, or DAG, that guided the agent through three levels of decision making.
1. Infrastructure reconciliation
The agent first checked low-level reconciliation concerns. It looked for obvious mismatches between expected and actual infrastructure state, including missing resources, unhealthy workloads or configuration drift.
This gave the agent a grounded starting point before it attempted any higher-level reasoning.
2. Scheduling and planned activity
The second layer considered scheduled operational activity. The agent checked whether upgrades, maintenance tasks or other planned changes were already in progress.
This mattered because not every unhealthy signal represents a new incident. Sometimes the correct response is to recognise that work has already started and avoid interfering with it.
3. Container and protocol health
The final layer focused on container-level health and protocol behaviour. At this point, the agent had already ruled out the most basic infrastructure and scheduling explanations, so it could spend its reasoning budget on more meaningful diagnosis.
This structure helped the agent handle day-to-day issues methodically. It avoided repeatedly analysing the same basic failure modes and saved deeper reasoning for situations that deserved it.
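The three levels above can be sketched as an ordered set of checks with an early exit. This is a simplified illustration, not the graph we ran: it models the DAG as a linear chain (the simplest possible DAG), and the state keys, the restart threshold and every function name are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    stage: str
    summary: str


def check_reconciliation(state: dict) -> Optional[Finding]:
    # Level 1: compare expected and actual infrastructure state.
    expected = set(state.get("expected_resources", []))
    actual = set(state.get("actual_resources", []))
    missing = expected - actual
    if missing:
        return Finding("reconciliation", f"missing resources: {sorted(missing)}")
    return None


def check_schedule(state: dict) -> Optional[Finding]:
    # Level 2: planned work already under way is not a new incident.
    if state.get("upgrade_in_progress"):
        return Finding("scheduling", "planned upgrade in progress; do not interfere")
    return None


def check_container_health(state: dict) -> Optional[Finding]:
    # Level 3: container and protocol health (threshold is illustrative).
    if state.get("restart_count", 0) > 5:
        return Finding("container", "container is restarting repeatedly")
    return None


STAGES = [check_reconciliation, check_schedule, check_container_health]


def triage(state: dict):
    """Walk the stages in order and stop at the first one that explains the issue."""
    path = []
    for check in STAGES:
        path.append(check.__name__)
        finding = check(state)
        if finding is not None:
            return path, finding
    return path, None


state = {
    "expected_resources": ["pvc", "statefulset"],
    "actual_resources": ["pvc", "statefulset"],
    "upgrade_in_progress": True,
    "restart_count": 9,
}
path, finding = triage(state)
print(path, finding.stage)
```

In this example the container is restarting, but the walk stops at the scheduling level because an upgrade is already in progress. That is the behaviour the ordering is meant to produce: cheap, grounded explanations are ruled in or out before any deeper diagnosis is attempted.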
Persistent memory across executions
The next problem was continuity. A stateless agent can be useful once. An operational agent needs to remember what happened last time.
We developed a memory persistence layer backed by MongoDB. Each execution recorded the route the agent took through the DAG, what it observed, what action it took, what the outcome was and whether the previous context was likely to be useful again.
That memory made the agent more practical in real operations.
If it had started an upgrade thirty minutes earlier, it could recognise that context and avoid repeating the same process. If the same protocol had failed two weeks earlier, it could review what fixed the issue last time. If a previous fix appeared to work but the issue returned, that history could inform the next diagnosis.
MongoDB gave us fast retrieval for execution history and node lookups from the DAG. The design also allowed for future retrieval-augmented generation, where richer output notes could be searched and brought back into the agent’s reasoning context.
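A minimal sketch of what one execution record and a "have we done this recently?" lookup might look like. The field names and the `MemoryStore` class are assumptions for illustration, and a plain Python list stands in for the MongoDB collection so the example runs on its own. It is not the production schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ExecutionRecord:
    # Hypothetical fields mirroring what each execution recorded.
    protocol: str
    dag_path: list
    observation: str
    action: str
    outcome: str
    recorded_at: datetime


class MemoryStore:
    """In-memory stand-in for the MongoDB-backed persistence layer."""

    def __init__(self):
        self._records = []

    def save(self, record):
        self._records.append(record)

    def history(self, protocol):
        """All records for a protocol, most recent first."""
        rows = [r for r in self._records if r.protocol == protocol]
        return sorted(rows, key=lambda r: r.recorded_at, reverse=True)

    def recent_action(self, protocol, action, now, within):
        """Return the latest matching action taken within `within` of `now`, if any."""
        for record in self.history(protocol):
            if record.action == action and now - record.recorded_at <= within:
                return record
        return None


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
store = MemoryStore()
store.save(ExecutionRecord(
    protocol="protocol-a",
    dag_path=["reconciliation", "scheduling"],
    observation="node version behind scheduled target",
    action="upgrade",
    outcome="started",
    recorded_at=now - timedelta(minutes=30),
))

hit = store.recent_action("protocol-a", "upgrade", now, timedelta(hours=1))
print(hit.outcome if hit else "no recent upgrade")
```

With a record like this available, the agent can see that an upgrade began thirty minutes ago and skip starting a second one. The same `history` call supports the longer-range case of reviewing what fixed a protocol weeks earlier.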
Observability and evaluation
Autonomy is only useful when you can understand what the system is doing.
We combined high-level prompts, persistent memories and tracing through Langfuse to give full observability into the agent’s behaviour over time. Every execution could be inspected, including the path through the graph, the memories retrieved, the decisions made and the actions proposed or taken.
We also used an evals-based approach to improve the agent. We ran concurrent versions with different memory modes, graph shapes and prompting strategies, then compared how those changes affected diagnosis.
This gave us a controlled way to test whether the agent was actually improving. We could evaluate behaviour across real operational scenarios rather than relying on a few convincing demos. The demos are always convincing. That is the trap.
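The comparison step can be sketched as scoring each agent variant against the same labelled scenarios. The variant names, scenario labels and rows below are invented purely to show the mechanics; they are not results from the project, and exact-match accuracy is only one assumed way of scoring a diagnosis.

```python
from collections import defaultdict


def diagnosis_accuracy(results):
    """Fraction of scenarios each variant diagnosed correctly.

    `results` is an iterable of (variant, expected_diagnosis, actual_diagnosis).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for variant, expected, actual in results:
        totals[variant] += 1
        if expected == actual:
            correct[variant] += 1
    return {variant: correct[variant] / totals[variant] for variant in totals}


# Illustrative rows only: two hypothetical variants, two scenarios each.
results = [
    ("memory-off", "disk-full", "disk-full"),
    ("memory-off", "upgrade-in-progress", "peer-loss"),
    ("memory-on", "disk-full", "disk-full"),
    ("memory-on", "upgrade-in-progress", "upgrade-in-progress"),
]

scores = diagnosis_accuracy(results)
best = max(scores, key=scores.get)
print(scores, best)
```

Running every variant over the same scenario set is what makes the comparison controlled: a change in score can be attributed to the memory mode, graph shape or prompt that differed, rather than to an easier batch of incidents.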
The outcome
The result was a structured AI operations system that captured operational knowledge, guided agent reasoning, persisted useful memory and exposed behaviour through tracing and evaluation.
It helped turn blockchain protocol reliability from a spare-capacity activity into a measurable operational capability.
Average testnet protocol healthiness rose from 65% to 96%, a gain of 31 percentage points. Nearly all of the testnets were now available for developers to work with, which meant higher productivity and a better return on compute investment.
That improvement came from a combination of faster diagnosis, safer repetition of known workflows, better use of historical context and a clearer view into how the agent behaved over time.
For infrastructure teams, the lesson was straightforward. The value was not in asking a model to be generically clever. The value came from giving it structure, memory, observability and access to the operational knowledge that experienced engineers already used every day.
