bet · 2025 – Present

LLM + RAG Automation for Backend Operations

Built internal AI-assisted tools (Python + LLM APIs, LangChain, RAG) for log analysis, anomaly detection, validation, and incident summarization — cutting manual operational toil across backend systems.

Role: Designed and built the automation tools and the retrieval/prompt-orchestration pipelines
Team: Backend engineering
Scope: Operational workflows across backend services at Bidgely
Level: Software Engineer

Operational toil

fully manual log triage & incident write-ups → AI-assisted, human-reviewed

Method: Qualitative outcome. The tools reduced manual effort for log analysis, anomaly detection, validation, and incident summarization.
Window: during tenure

Output grounding

ungrounded prompting (hallucination-prone) → RAG over real logs/runbooks (source-cited)

Method: Architectural change from raw prompting to retrieval-augmented generation grounded in internal operational context.
Window: by design

Context

A lot of backend on-call work is repetitive cognition: reading through logs, noticing the anomaly, writing up what happened. It's exactly the kind of toil an LLM is good at — if you can stop it from making things up. At Bidgely I built internal Python tooling on top of LLM APIs to take that toil off engineers' plates.

Why RAG, not raw prompting

A model prompted with no context will confidently invent service names, error semantics, and runbook steps. In an operational setting, confidently-wrong is worse than no tool at all.

So the tools are built on retrieval-augmented generation: embeddings and vector search pull the genuinely relevant logs and runbook context at query time, and the model reasons over that real evidence instead of its priors. Answers stay current as the systems change, and they trace back to their sources. The retrieval layer — chunking, embedding choice, relevance ranking — is what actually determines quality; the prompt is the easy bit.
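
To make the retrieval step concrete, here is a minimal sketch of the chunk, embed, rank, and cite flow described above. It is an illustration under stated assumptions, not the actual tooling: the token-count `embed` function stands in for a real embedding model, the in-memory list stands in for a vector store, and the file names and log lines are hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # hypothetical identifier, e.g. a log file or runbook name
    text: str


def chunk_lines(source: str, text: str, size: int = 2) -> list[Chunk]:
    """Split a document into chunks of `size` non-empty lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return [Chunk(source, " ".join(lines[i:i + size])) for i in range(0, len(lines), size)]


def embed(text: str) -> Counter:
    """Toy embedding (token counts); assumption: a real system calls an embedding model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[Chunk], k: int = 2) -> list[Chunk]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c.text)), reverse=True)[:k]


def build_prompt(query: str, evidence: list[Chunk]) -> str:
    """Number each piece of evidence so the model's answer can cite it."""
    cited = "\n".join(f"[{i}] ({c.source}) {c.text}" for i, c in enumerate(evidence, 1))
    return (
        "Answer using only the evidence below and cite sources as [n].\n"
        f"Evidence:\n{cited}\n\nQuestion: {query}"
    )


# Hypothetical corpus: two log files and one runbook.
corpus = (
    chunk_lines("billing.log", "ERROR timeout calling ledger service\nretry 3 of 3 failed for ledger")
    + chunk_lines("runbook.md", "If ledger timeout errors spike, check the connection pool\nRestart is a last resort")
    + chunk_lines("auth.log", "INFO user login ok\nINFO token refreshed")
)
question = "why is ledger timeout failing"
top = retrieve(question, corpus)
prompt = build_prompt(question, top)
```

The prompt that reaches the model contains only the retrieved chunks, each tagged with its source, which is what lets an answer trace back to the evidence. The unrelated `auth.log` chunk never reaches the model.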

Assistive, not autonomous

There's a spectrum from "the AI suggests" to "the AI acts." I deliberately kept these tools on the assistive end: they summarize, flag anomalies, and draft incident write-ups, and a human makes the call. Autonomous remediation would have raised the blast radius of a wrong answer enormously, and trust has to be earned first. That scoping is exactly why the tools got adopted quickly — they removed toil without asking anyone to hand a production decision to a model.
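
One way to enforce the assistive boundary in code is a review gate: the model can only produce drafts, and nothing is published without a recorded human decision. The sketch below is hypothetical. `IncidentDraft`, `review`, and `publish` are illustrative names, not the real tooling's API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING_REVIEW = "pending_review"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class IncidentDraft:
    summary: str        # text drafted by the model
    sources: list[str]  # retrieved evidence the summary cites
    status: Status = Status.PENDING_REVIEW
    reviewer: Optional[str] = None


def review(draft: IncidentDraft, reviewer: str, approve: bool) -> IncidentDraft:
    """Record a human decision; only this function changes the status."""
    draft.status = Status.APPROVED if approve else Status.REJECTED
    draft.reviewer = reviewer
    return draft


def publish(draft: IncidentDraft) -> str:
    """Refuse to publish anything a human has not approved."""
    if draft.status is not Status.APPROVED:
        raise PermissionError("incident write-up requires human approval")
    return f"{draft.summary} (reviewed by {draft.reviewer}; sources: {', '.join(draft.sources)})"


# Hypothetical draft produced by the summarization step.
draft = IncidentDraft(
    summary="Ledger timeouts caused checkout failures.",
    sources=["billing.log", "runbook.md"],
)
```

Calling `publish(draft)` before `review(...)` raises `PermissionError`, so a wrong summary can cost a reviewer's time but cannot reach anyone as an approved write-up.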

What broke

The honest part: an early version over-weighted recent logs in retrieval and would miss relevant older context, so summaries came out confident but incomplete. That's the failure mode that quietly kills trust in an internal tool. Tuning the chunking and relevance ranking fixed it — and it's why I now treat retrieval quality as the thing to evaluate directly, not just the final answer.
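
The failure is easy to reproduce in miniature. The sketch below assumes a ranking score that blends semantic similarity with an exponential recency decay. The formula, the weights, and the sample hits are hypothetical, chosen only to show how a heavy recency weight lets fresh but irrelevant logs outrank older, relevant context.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    source: str
    similarity: float  # semantic relevance in [0, 1], e.g. from vector search
    age_hours: float   # how old the log chunk is


def recency(age_hours: float, half_life_hours: float = 24.0) -> float:
    """Exponential decay: 1.0 for brand-new chunks, 0.5 after one half-life."""
    return 0.5 ** (age_hours / half_life_hours)


def score(hit: Hit, recency_weight: float) -> float:
    """Blend relevance and freshness; recency_weight=0 ignores age entirely."""
    return (1 - recency_weight) * hit.similarity + recency_weight * recency(hit.age_hours)


def rank(hits: list[Hit], recency_weight: float) -> list[str]:
    """Return source names ordered from highest to lowest blended score."""
    ordered = sorted(hits, key=lambda h: score(h, recency_weight), reverse=True)
    return [h.source for h in ordered]


# Hypothetical hits: an older chunk that explains the incident, and fresh noise.
hits = [
    Hit("last-week-root-cause", similarity=0.90, age_hours=168),
    Hit("todays-noise", similarity=0.40, age_hours=1),
]

heavy = rank(hits, recency_weight=0.7)  # freshness dominates the score
light = rank(hits, recency_weight=0.2)  # relevance dominates the score
```

With the heavy weight, the week-old root-cause chunk drops below today's noise. With the lighter weight, it ranks first again. With a small top-k, that difference decides whether the older context reaches the model at all.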

Outcomes

Engineers get grounded, source-cited help with the repetitive parts of operations — log analysis, anomaly detection, incident summaries — with a human still owning every decision. The pattern it established: internal AI tooling should be grounded, traceable, and assistive by default.

Decision Records

Where I Got It Wrong

Failure 1

An early version weighted retrieval toward recent logs and would miss relevant older context, producing summaries that were confident but incomplete until the retrieval and chunking strategy was tuned.

Cost: Some early outputs needed more sanity-checking than they should have, which costs trust in an internal tool.

Lesson: For RAG, the retrieval strategy is the product: chunking, embedding choice, and relevance ranking matter more than the prompt. Evaluate retrieval quality directly, not just the final answer.
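
Evaluating retrieval directly can be as simple as scoring the ranked chunk ids against a small hand-labelled set, before any model sees them. The sketch below computes recall@k and mean reciprocal rank, two standard retrieval metrics. The labelled cases and chunk ids are hypothetical, and the source does not say which metrics were actually used.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / position of the first relevant chunk, or 0.0 if none was retrieved."""
    for position, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / position
    return 0.0


# Hypothetical labelled cases: the ranked chunk ids the retriever returned,
# and the ids a human marked as needed to answer the query.
cases = [
    {"retrieved": ["c7", "c2", "c9"], "relevant": {"c2", "c4"}},
    {"retrieved": ["c1", "c3", "c5"], "relevant": {"c1"}},
]

mean_recall = sum(recall_at_k(c["retrieved"], c["relevant"], k=3) for c in cases) / len(cases)
mrr = sum(reciprocal_rank(c["retrieved"], c["relevant"]) for c in cases) / len(cases)
```

In the first case the retriever never surfaces `c4`, so recall is 0.5 no matter how fluent the final summary reads. That gap is the "confident but incomplete" failure, and it only shows up when retrieval is measured separately.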

Long-Term Impact

Established a pattern for safe, grounded internal AI tooling — assistive, source-cited, human-in-the-loop.