bet · 2025 – Present

LLM + RAG Automation for Backend Operations

Built internal AI-assisted tools (Python + LLM APIs, LangChain, RAG) for log analysis, anomaly detection, validation, and incident summarization — cutting manual operational toil across backend systems.

Role: Designed and built the automation tools and the retrieval/prompt-orchestration pipelines
Team: Backend engineering
Scope: Operational workflows across backend services at Bidgely
Level: Software Engineer

Operational toil

fully manual log triage & incident write-ups → AI-assisted, human-reviewed

Method: Qualitative outcome. The tools reduced manual effort for log analysis, anomaly detection, validation, and incident summarization.
Window: during tenure

Output grounding

ungrounded prompting (hallucination-prone) → RAG over real logs/runbooks (source-cited)

Method: Architectural change from raw prompting to retrieval-augmented generation grounded in internal operational context.
Window: by design

Context

A lot of backend on-call work is repetitive cognition: reading through logs, noticing the anomaly, writing up what happened. It's exactly the kind of toil an LLM handles well, provided you can stop it from making things up. At Bidgely I built internal Python tooling on top of LLM APIs to take that toil off engineers' plates.

Why RAG, not raw prompting

A model prompted with no context will confidently invent service names, error semantics, and runbook steps. In an operational setting, confidently-wrong is worse than no tool at all.

So the tools are built on retrieval-augmented generation: embeddings and vector search pull the genuinely relevant logs and runbook context at query time, and the model reasons over that real evidence instead of its priors. Answers stay current as the systems change, and they trace back to their sources. The retrieval layer — chunking, embedding choice, relevance ranking — is what actually determines quality; the prompt is the easy bit.
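The retrieve-then-reason flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the production pipeline: the document ids, the sample texts, and the `embed`, `retrieve`, and `build_prompt` helpers are all hypothetical, and the bag-of-words "embedding" stands in for a real embedding model and vector store.

```python
import math
import re
from collections import Counter

# Hypothetical corpus: ids and texts are invented for illustration only.
DOCS = {
    "runbook/queue-backlog": "Queue backlog runbook: if consumer lag grows, "
                             "check consumer health and restart stuck workers.",
    "log/ingest-0412": "ERROR ingest worker timeout while reading from queue, "
                       "consumer lag rising",
    "runbook/cert-rotation": "Certificate rotation runbook: renew TLS "
                             "certificates before expiry.",
}


def embed(text: str) -> Counter:
    """Toy embedding: term counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: dict, k: int = 2) -> list:
    """Return the k (doc_id, score) pairs most similar to the query."""
    q = embed(query)
    scored = [(doc_id, cosine(q, embed(text))) for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_prompt(query: str, docs: dict, k: int = 2) -> str:
    """Assemble a prompt that carries the retrieved text and its source ids."""
    hits = retrieve(query, docs, k)
    context = "\n".join(f"[{doc_id}] {docs[doc_id]}" for doc_id, _ in hits)
    return (
        "Answer using only the sources below and cite source ids in brackets.\n"
        f"Sources:\n{context}\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    print(build_prompt("why is consumer lag rising on the ingest queue", DOCS))
```

The point of the sketch is that the source ids travel with the retrieved text into the prompt, which is what lets an answer be traced back to a specific log or runbook.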

Assistive, not autonomous

There's a spectrum from "the AI suggests" to "the AI acts." I deliberately kept these tools on the assistive end: they summarize, flag anomalies, and draft incident write-ups, and a human makes the call. Autonomous remediation would have raised the blast radius of a wrong answer enormously, and trust has to be earned first. That scoping is exactly why the tools got adopted quickly — they removed toil without asking anyone to hand a production decision to a model.
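One way to make the assistive boundary concrete is to give the tool no code path that acts on production, only one that produces a draft a person must approve. The sketch below is an assumption about how that could look, not the actual tooling: `Draft`, `Status`, `review`, and `publish` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    NEEDS_REVIEW = "needs_review"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Draft:
    """A model-produced suggestion: text plus sources, never an executable action."""
    summary: str
    sources: List[str]
    status: Status = Status.NEEDS_REVIEW
    reviewer: Optional[str] = None


def draft_incident_summary(model_output: str, sources: List[str]) -> Draft:
    """Wrap model output as a draft; refuse output that cites no sources."""
    if not sources:
        raise ValueError("refusing to draft without cited sources")
    return Draft(summary=model_output, sources=sources)


def review(draft: Draft, reviewer: str, approve: bool) -> Draft:
    """Record the human decision on the draft."""
    draft.status = Status.APPROVED if approve else Status.REJECTED
    draft.reviewer = reviewer
    return draft


def publish(draft: Draft) -> str:
    """Render the write-up, but only once a human has approved it."""
    if draft.status is not Status.APPROVED:
        raise PermissionError("draft has not been approved by a human reviewer")
    return (
        f"{draft.summary}\n"
        f"Sources: {', '.join(draft.sources)}\n"
        f"Reviewed by: {draft.reviewer}"
    )
```

In this shape the worst outcome of a wrong model answer is a rejected draft, which keeps the blast radius at the size of a text document.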

What broke

The honest part: an early version over-weighted recent logs in retrieval and would miss relevant older context, so summaries came out confident but incomplete. That's the failure mode that quietly kills trust in an internal tool. Tuning the chunking and relevance ranking fixed it — and it's why I now treat retrieval quality as the thing to evaluate directly, not just the final answer.
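The recency failure mode is easy to reproduce in miniature. The sketch below is illustrative only: the scoring formula, the `recency_weight` knob, the half-life, and the sample chunks are assumptions, not the ranking the tools actually used. It shows how a heavy recency weight pushes a highly relevant but old chunk out of the top results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    similarity: float  # semantic similarity to the query, in [0, 1]
    age_hours: float   # age of the underlying log or document


def recency(age_hours: float, half_life_hours: float = 24.0) -> float:
    """Exponential decay: 1.0 for brand-new entries, 0.5 after one half-life."""
    return 0.5 ** (age_hours / half_life_hours)


def rank(chunks, recency_weight: float, k: int = 2):
    """Blend similarity and recency; recency_weight is a hypothetical knob in [0, 1]."""
    def score(c: Chunk) -> float:
        return (1 - recency_weight) * c.similarity + recency_weight * recency(c.age_hours)

    return [c.chunk_id for c in sorted(chunks, key=score, reverse=True)[:k]]


# Invented sample data: one noisy recent log, one relevant recent log,
# and one very relevant but old write-up.
CHUNKS = [
    Chunk("log/today-noise", similarity=0.30, age_hours=1),
    Chunk("log/today-error", similarity=0.70, age_hours=2),
    Chunk("postmortem/same-bug-last-quarter", similarity=0.90, age_hours=2000),
]

if __name__ == "__main__":
    print(rank(CHUNKS, recency_weight=0.7))  # the old write-up is crowded out
    print(rank(CHUNKS, recency_weight=0.2))  # the old write-up is retrieved
```

With the heavier weight, the model only ever sees today's logs, so its summary can read as complete while missing the one chunk that explains the incident.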

Outcomes

Engineers get grounded, source-cited help with the repetitive parts of operations — log analysis, anomaly detection, incident summaries — with a human still owning every decision. The pattern it established: internal AI tooling should be grounded, traceable, and assistive by default.

Decision Records

DR-001 · bet

Retrieval-augmented generation over our own operational context is what makes LLM output trustworthy enough to act on — raw prompting hallucinates too much for ops

Context

Backend on-call work involves a lot of repetitive cognition: reading logs, spotting anomalies, summarizing what happened in an incident. An LLM suits that well — but a model prompted with no context invents plausible-sounding nonsense, which is worse than no tool at all in an operational setting.

Options considered

✗
Prompt a general LLM directly with the raw question
No grounding in our systems. It hallucinates service names, error semantics, and runbook steps — confidently wrong, which is dangerous for ops decisions.
✗
Fine-tune a model on our data
Heavy to build and maintain, and stale the moment the systems change. Overkill for what is fundamentally a retrieval problem.
✓
RAG: retrieve relevant logs/runbooks/context, then reason over that
Embeddings + vector search pull the actually-relevant context at query time; the model reasons over real evidence instead of its priors. Stays current as the data changes, and answers are traceable to sources.

Tradeoffs

RAG quality is bounded by retrieval quality — bad chunking or embeddings mean bad answers — and it adds a retrieval and vector-store component to operate. Worth it because grounding is non-negotiable for operational trust.
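To show why chunking matters, here is a minimal sketch of line-based chunking with overlap. The `chunk_lines` helper and its `size` and `overlap` parameters are hypothetical, chosen for illustration; the chunking strategy actually used is not specified here. Overlap keeps a boundary line, such as an error and the first line of its stack trace, together in at least one chunk.

```python
from typing import List


def chunk_lines(lines: List[str], size: int = 4, overlap: int = 1) -> List[List[str]]:
    """Split lines into windows of `size` lines, each sharing `overlap` lines
    with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append(lines[start:start + size])
        if start + size >= len(lines):
            break  # the last window already reaches the end of the input
    return chunks


if __name__ == "__main__":
    log = [f"line {i}" for i in range(6)]
    for chunk in chunk_lines(log, size=4, overlap=1):
        print(chunk)
```

With no overlap, a fact that straddles a boundary is split across two chunks and neither half may score well enough to be retrieved.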

Outcome

Engineers get grounded log analysis, anomaly flags, and incident summaries that trace back to their source context, reducing manual operational effort.

DR-002 · bet

Scoping the tools as assistive (summarize, flag, draft) rather than autonomous (act, remediate) is what made them safe to adopt quickly

Context

There's a spectrum from 'the AI suggests' to 'the AI acts.' In production operations, autonomous action raises the blast radius of a wrong answer enormously.

Options considered

✗
Autonomous remediation — let the tool take action
A hallucinated or wrong action in production ops has a large blast radius. Trust has to be earned before autonomy is on the table.
✓
Assistive — summarize, flag, draft, and let a human decide
Captures most of the time savings at a fraction of the risk. The engineer stays accountable for the decision; the tool removes the toil of getting there.

Tradeoffs

Leaves some efficiency on the table versus full automation. A deliberate choice — the right amount of automation for the current level of trust.

Outcome

Tools were adopted into real workflows because they reduced toil without asking anyone to hand a production decision to a model.

Where I Got It Wrong

Failure 1

An early version weighted retrieval toward recent logs and missed relevant older context, so summaries were confident but incomplete until the retrieval and chunking strategy was tuned

Cost: Some early outputs needed more sanity-checking than they should have, which costs trust in an internal tool

Lesson: For RAG, the retrieval strategy is the product — chunking, embedding choice, and relevance ranking matter more than the prompt. Evaluate retrieval quality directly, not just the final answer
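Evaluating retrieval directly can be as simple as measuring recall@k against a small hand-labeled set of queries. The sketch below assumes such a set exists; the cases, document ids, and `fake_retriever` are invented for illustration, and recall@k is one common metric rather than the one necessarily used here.

```python
from typing import Callable, List, Set, Tuple


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        raise ValueError("each eval case needs at least one relevant id")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    cases: List[Tuple[str, Set[str]]],
    retriever: Callable[[str], List[str]],
    k: int,
) -> float:
    """Average recall@k over labeled (query, relevant ids) cases."""
    return sum(recall_at_k(retriever(q), rel, k) for q, rel in cases) / len(cases)


# Hypothetical labeled cases: for each query, the ids a human marked as relevant.
EVAL_CASES = [
    ("why did ingest time out", {"log/ingest-0412", "postmortem/same-bug-last-quarter"}),
    ("how do I rotate certs", {"runbook/cert-rotation"}),
]


def fake_retriever(query: str) -> List[str]:
    """Stand-in for the real retriever; returns canned rankings."""
    canned = {
        "why did ingest time out": ["log/ingest-0412", "log/today-noise"],
        "how do I rotate certs": ["runbook/cert-rotation", "runbook/queue-backlog"],
    }
    return canned[query]


if __name__ == "__main__":
    print(mean_recall_at_k(EVAL_CASES, fake_retriever, k=2))
```

A check like this catches a missing older document before any prompt is involved, which a review of final answers alone would not.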

Long-Term Impact

Established a pattern for safe, grounded internal AI tooling — assistive, source-cited, human-in-the-loop.