bet · 2025 – Present
LLM + RAG Automation for Backend Operations
Built internal AI-assisted tools (Python + LLM APIs, LangChain, RAG) for log analysis, anomaly detection, validation, and incident summarization — cutting manual operational toil across backend systems.
- Role: Designed and built the automation tools and the retrieval/prompt-orchestration pipelines
- Team: Backend engineering
- Scope: Operational workflows across backend services at Bidgely
- Level: Software Engineer
Operational toil
Method: Qualitative outcome. The tools reduced manual effort for log analysis, anomaly detection, validation, and incident summarization.
Window: during tenure
Output grounding
Method: Architectural change from raw prompting to retrieval-augmented generation grounded in internal operational context.
Window: by design
Context
A lot of backend on-call work is repetitive cognition: reading through logs, noticing the anomaly, writing up what happened. It's exactly the kind of toil an LLM is good at — if you can stop it from making things up. At Bidgely I built internal Python tooling on top of LLM APIs to take that toil off engineers' plates.
Why RAG, not raw prompting
A model prompted with no context will confidently invent service names, error semantics, and runbook steps. In an operational setting, confidently-wrong is worse than no tool at all.
So the tools are built on retrieval-augmented generation: embeddings and vector search pull the genuinely relevant logs and runbook context at query time, and the model reasons over that real evidence instead of its priors. Answers stay current as the systems change, and they trace back to their sources. The retrieval layer — chunking, embedding choice, relevance ranking — is what actually determines quality; the prompt is the easy bit.
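A minimal sketch of the retrieve-then-prompt flow described above, not the production pipeline. The bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory list stands in for a vector store, and every chunk id and log line is hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': token counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]

def build_prompt(query, hits):
    """Place retrieved evidence, tagged with source ids, ahead of the question."""
    evidence = "\n".join(f"[{h['id']}] {h['text']}" for h in hits)
    return (
        "Answer using only the evidence below and cite source ids.\n"
        f"Evidence:\n{evidence}\n"
        f"Question: {query}"
    )

# Hypothetical chunks: one runbook excerpt and two log lines.
CHUNKS = [
    {"id": "runbook-12", "text": "If the ingest queue backs up, restart the ingest worker and check disk usage."},
    {"id": "log-881", "text": "ERROR ingest worker timeout while writing to queue"},
    {"id": "log-402", "text": "INFO nightly billing report completed"},
]

QUESTION = "why is the ingest worker timing out on the queue"
hits = retrieve(QUESTION, CHUNKS, k=2)
prompt = build_prompt(QUESTION, hits)
print(prompt)
```

Only the retrieved chunks reach the model, each tagged with its source id, which is what lets an answer be traced back to its evidence. The unrelated billing log never enters the prompt.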
Assistive, not autonomous
There's a spectrum from "the AI suggests" to "the AI acts." I deliberately kept these tools on the assistive end: they summarize, flag anomalies, and draft incident write-ups, and a human makes the call. Autonomous remediation would have raised the blast radius of a wrong answer enormously, and trust has to be earned first. That scoping is exactly why the tools got adopted quickly — they removed toil without asking anyone to hand a production decision to a model.
What broke
The honest part: an early version over-weighted recent logs in retrieval and would miss relevant older context, so summaries came out confident but incomplete. That's the failure mode that quietly kills trust in an internal tool. Tuning the chunking and relevance ranking fixed it — and it's why I now treat retrieval quality as the thing to evaluate directly, not just the final answer.
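To make that failure mode concrete, here is a hedged sketch of how a recency bonus can crowd out older evidence. The scoring formula, the weights, the half-life, and the candidate data are all assumptions chosen for illustration; the original ranking function is not shown here.

```python
import math

def score(similarity, age_hours, recency_weight, half_life_hours=24.0):
    """Blend semantic similarity with an exponentially decaying recency bonus."""
    recency = math.exp(-math.log(2) * age_hours / half_life_hours)
    return (1 - recency_weight) * similarity + recency_weight * recency

def rank(candidates, recency_weight, k=2):
    """Return the ids of the top-k candidates under the blended score."""
    ordered = sorted(
        candidates,
        key=lambda c: score(c["similarity"], c["age_hours"], recency_weight),
        reverse=True,
    )
    return [c["id"] for c in ordered[:k]]

# Hypothetical candidates: the most relevant chunk is ten days old.
CANDIDATES = [
    {"id": "old-root-cause", "similarity": 0.90, "age_hours": 240},
    {"id": "recent-symptom", "similarity": 0.60, "age_hours": 1},
    {"id": "recent-noise", "similarity": 0.20, "age_hours": 2},
]

over_weighted = rank(CANDIDATES, recency_weight=0.7)
rebalanced = rank(CANDIDATES, recency_weight=0.2)
print(over_weighted)
print(rebalanced)
```

With the heavier recency weight, the old but highly relevant chunk falls out of the top two and a low-relevance recent line takes its place. Lowering the weight brings it back to the top. The summary built from the first ranking would read as confident while missing the root cause.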
Outcomes
Engineers get grounded, source-cited help with the repetitive parts of operations — log analysis, anomaly detection, incident summaries — with a human still owning every decision. The pattern it established: internal AI tooling should be grounded, traceable, and assistive by default.
Decision Records
Where I Got It Wrong
Failure 1
An early version weighted retrieval toward recent logs and would miss relevant older context. Summaries came out confident but incomplete until the retrieval and chunking strategy was tuned.
Cost: Some early outputs needed more sanity-checking than they should have, which costs trust in an internal tool.
Lesson: For RAG, the retrieval strategy is the product: chunking, embedding choice, and relevance ranking matter more than the prompt. Evaluate retrieval quality directly, not just the final answer.
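One common way to evaluate retrieval directly is recall@k over a small labeled set of queries. The sketch below assumes that approach; the metric choice, the labeled cases, and the canned `fake_retriever` are hypothetical and are not taken from the original tooling.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

def mean_recall_at_k(cases, retriever, k):
    """Average recall@k across labeled cases for a given retriever function."""
    return sum(recall_at_k(retriever(c["query"]), c["relevant"], k) for c in cases) / len(cases)

# Hypothetical labeled cases: each query lists the chunks a good retriever should surface.
CASES = [
    {"query": "ingest timeouts", "relevant": ["runbook-12", "log-881"]},
    {"query": "billing report late", "relevant": ["log-402"]},
]

def fake_retriever(query):
    """Canned rankings standing in for a real retriever under test."""
    canned = {
        "ingest timeouts": ["log-881", "log-402", "runbook-12"],
        "billing report late": ["log-402", "log-881", "runbook-12"],
    }
    return canned[query]

recall_2 = mean_recall_at_k(CASES, fake_retriever, k=2)
recall_3 = mean_recall_at_k(CASES, fake_retriever, k=3)
print(recall_2, recall_3)
```

Tracking a number like this per retrieval change surfaces a regression, such as a runbook dropping out of the top k, before it shows up as an incomplete summary.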
Long-Term Impact
Established a pattern for safe, grounded internal AI tooling — assistive, source-cited, human-in-the-loop.