<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://blog.aarshpandey.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://blog.aarshpandey.com/" rel="alternate" type="text/html" /><updated>2026-04-06T11:34:51+00:00</updated><id>https://blog.aarshpandey.com/feed.xml</id><title type="html">Aarsh’s Blog</title><subtitle>Thoughts on code, design, and everything in between.</subtitle><author><name>Aarsh</name></author><entry><title type="html">How RAG Works</title><link href="https://blog.aarshpandey.com/posts/how-rag-works/" rel="alternate" type="text/html" title="How RAG Works" /><published>2026-04-06T00:00:00+00:00</published><updated>2026-04-06T00:00:00+00:00</updated><id>https://blog.aarshpandey.com/posts/how-rag-works</id><content type="html" xml:base="https://blog.aarshpandey.com/posts/how-rag-works/"><![CDATA[<p>Large language models are impressive, but they have a fundamental constraint: they only know what was in their training data, frozen at a point in time. Ask GPT-4 about something that happened last week, or about your company’s internal docs, and it either hallucinates or admits ignorance.</p>

<p>Retrieval-Augmented Generation — RAG — is the standard solution.</p>

<h2 id="the-core-idea">The core idea</h2>

<p>RAG is two steps:</p>

<ol>
  <li><strong>Retrieve</strong> relevant documents for the user’s query</li>
  <li><strong>Generate</strong> a response using those documents as context</li>
</ol>

<p>Instead of relying on what the model memorized during training, you hand it fresh information at inference time. The model becomes a reasoning engine over your data, not a knowledge store.</p>

<h2 id="step-1-indexing">Step 1: Indexing</h2>

<p><img src="/assets/images/rag-indexing.svg" alt="Indexing pipeline — documents are chunked, embedded, and stored in a vector database" /></p>

<p>Before you can retrieve anything, you need to index your documents. This happens offline.</p>

<p>Each document (or chunk of a document) gets converted into a vector — a list of numbers that encodes its semantic meaning. This is done by an <em>embedding model</em>, a neural net trained to map text into a high-dimensional space where similar meanings land close together.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"The cat sat on the mat."  →  [0.12, -0.84, 0.33, ...]
"A feline rested on a rug." →  [0.11, -0.81, 0.35, ...]  # close!
"Quarterly revenue is up."  →  [0.92,  0.14, -0.67, ...]  # far
</code></pre></div></div>

<p>You store these vectors in a <em>vector database</em> — Pinecone, Weaviate, pgvector, Chroma, and others. The database is optimized for one operation: finding the N vectors closest to a query vector.</p>

<h2 id="step-2-retrieval">Step 2: Retrieval</h2>

<p><img src="/assets/images/rag-retrieval.svg" alt="Retrieval and generation — query is embedded, matched against the vector DB, top chunks are passed to the LLM" /></p>

<p>When a user asks a question, you embed their query using the same embedding model, then run a nearest-neighbor search against your vector database.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>query: "what is our refund policy?"
  → embed query → [0.43, -0.21, ...]
  → search vector DB → top 3 matching chunks
  → return: [chunk_42, chunk_7, chunk_91]
</code></pre></div></div>

<p>The chunks that come back are semantically similar to the question — they might not share a single keyword, but they’re conceptually related.</p>

<h2 id="step-3-generation">Step 3: Generation</h2>

<p>Now you build a prompt that includes both the user’s question and the retrieved chunks, and send it to the LLM:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>System: You are a helpful assistant. Answer based only on the context below.

Context:
[chunk_42] Our refund policy allows returns within 30 days...
[chunk_7]  Refunds are processed within 5-7 business days...
[chunk_91] Digital products are non-refundable unless...

User: What is your refund policy?
</code></pre></div></div>

<p>The model reads the context and generates a grounded answer. If the context doesn’t contain the answer, a well-prompted model should say so rather than invent one.</p>

<h2 id="why-it-works">Why it works</h2>

<p>The key insight is that LLMs are exceptional at reading comprehension and synthesis. Give them a passage and a question, and they’ll reliably extract the relevant information and reason over it. RAG exploits this — instead of asking the model to <em>remember</em>, you ask it to <em>read</em>.</p>

<p>This largely sidesteps the hallucination problem for factual questions: the model is constrained to what you gave it.</p>

<h2 id="the-moving-parts">The moving parts</h2>

<p>A production RAG system has a few components worth knowing:</p>

<p><strong>Chunking strategy</strong>: You split documents into chunks before embedding. Too large and the chunk adds noise; too small and you lose context. Typical chunks are 200–500 tokens, sometimes with overlap so context doesn’t get cut at boundaries.</p>

<p><strong>Embedding model</strong>: Separate from the LLM. OpenAI’s <code class="language-plaintext highlighter-rouge">text-embedding-3-small</code>, Cohere’s <code class="language-plaintext highlighter-rouge">embed-v3</code>, or open models like <code class="language-plaintext highlighter-rouge">bge-m3</code> are common choices. The embedding model determines retrieval quality.</p>

<p><strong>Retrieval depth (top-k)</strong>: How many chunks to retrieve. More context = more coverage but longer prompts and more noise. 3–5 is a common default.</p>

<p><strong>Reranking</strong>: A lightweight model that reorders retrieved chunks by relevance before they go into the prompt. Catches cases where vector similarity diverges from actual usefulness.</p>

<h2 id="what-rag-doesnt-solve">What RAG doesn’t solve</h2>

<p>RAG is not a silver bullet.</p>

<p>If the right chunk isn’t in your index, retrieval fails silently — the model answers from a bad or empty context. Garbage in, garbage out.</p>

<p>Multi-hop questions are hard: “Who manages the team that owns the billing service?” requires connecting two facts that might live in different documents. Standard RAG retrieves one neighborhood; it doesn’t traverse graphs.</p>

<p>And RAG adds latency. Every query now involves an embedding call, a vector search, and a larger prompt.</p>

<hr />

<p>RAG is fundamentally simple: look something up, then answer with it in hand. The complexity lives in the details — chunking, embedding quality, retrieval tuning, prompt design. But the architecture is just a smart use of the LLM’s best skill: reading.</p>]]></content><author><name>Aarsh</name></author><category term="ai" /><category term="llm" /><summary type="html"><![CDATA[Large language models are impressive, but they have a fundamental constraint: they only know what was in their training data, frozen at a point in time. Ask GPT-4 about something that happened last week, or about your company’s internal docs, and it either hallucinates or admits ignorance.]]></summary></entry><entry><title type="html">The Art of Building in Public</title><link href="https://blog.aarshpandey.com/posts/the-art-of-building-in-public/" rel="alternate" type="text/html" title="The Art of Building in Public" /><published>2026-04-01T00:00:00+00:00</published><updated>2026-04-01T00:00:00+00:00</updated><id>https://blog.aarshpandey.com/posts/the-art-of-building-in-public</id><content type="html" xml:base="https://blog.aarshpandey.com/posts/the-art-of-building-in-public/"><![CDATA[<p>There’s a temptation when building anything to wait until it’s <em>done</em> before showing it. To polish every edge, handle every edge case, and only then let people in.</p>

<p>That instinct is wrong.</p>

<h2 id="the-myth-of-the-finished-thing">The myth of the finished thing</h2>

<p>Nothing is ever really finished. Software gets deprecated. Designs go stale. Essays accrue footnotes. The gap between “done enough to ship” and “actually done” is infinite—and filling it is often the thing that stops work from ever reaching anyone.</p>

<p>When you build in public, you opt out of that trap.</p>

<h2 id="what-building-in-public-actually-means">What building in public actually means</h2>

<p>It doesn’t mean tweeting every keystroke or posting daily updates that nobody asked for. It means:</p>

<ul>
  <li><strong>Sharing work at the 70% mark</strong> — when it’s useful but still shapeable.</li>
  <li><strong>Writing about problems before you’ve solved them</strong> — which forces clearer thinking.</li>
  <li><strong>Being honest about what you don’t know</strong> — which attracts collaborators who do.</li>
</ul>

<p>The thing you’re afraid to share is usually exactly the thing someone else needs to see.</p>

<h2 id="the-feedback-loop-you-cant-manufacture">The feedback loop you can’t manufacture</h2>

<p>The only way to know if an idea is good is to expose it to reality. Imagined feedback is worse than useless — it’s noise that feels like signal.</p>

<p>Real feedback, even uncomfortable feedback, grounds you. It shortens the distance between what you think you’re building and what you’re actually building.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>idea → build → share → feedback → better idea
</code></pre></div></div>

<p>The only dangerous step is the one where “share” gets replaced by “polish forever.”</p>

<h2 id="start-with-one-post">Start with one post</h2>

<p>You don’t need an audience to start. You need a habit. Write the post you wished existed when you were stuck. Publish it somewhere—anywhere. See what happens.</p>

<p>Most of the time: nothing. Occasionally: someone finds it years later exactly when they needed it, and it saves them three hours.</p>

<p>That’s enough.</p>]]></content><author><name>Aarsh</name></author><category term="thinking" /><category term="craft" /><summary type="html"><![CDATA[There’s a temptation when building anything to wait until it’s done before showing it. To polish every edge, handle every edge case, and only then let people in.]]></summary></entry><entry><title type="html">Understanding Go Interfaces from First Principles</title><link href="https://blog.aarshpandey.com/posts/understanding-go-interfaces/" rel="alternate" type="text/html" title="Understanding Go Interfaces from First Principles" /><published>2026-03-22T00:00:00+00:00</published><updated>2026-03-22T00:00:00+00:00</updated><id>https://blog.aarshpandey.com/posts/understanding-go-interfaces</id><content type="html" xml:base="https://blog.aarshpandey.com/posts/understanding-go-interfaces/"><![CDATA[<p>Most explanations of Go interfaces start with the syntax. This one won’t.</p>

<p>Instead, let’s start with the problem interfaces solve, and work backwards to why Go’s approach is surprisingly elegant.</p>

<h2 id="the-problem-coupling">The problem: coupling</h2>

<p>When module A depends directly on module B, you’ve created coupling. A can’t exist without B. Testing A requires B. Changing B risks breaking A.</p>

<p>The classic solution is <em>indirection</em>: instead of depending on a concrete thing, depend on a description of what you need.</p>

<p>That description is an interface.</p>

<h2 id="what-makes-gos-interfaces-different">What makes Go’s interfaces different</h2>

<p>In most languages (Java, C#), interfaces are <em>declared</em>. You say “this type implements that interface” explicitly.</p>

<p>Go’s interfaces are <em>inferred</em>. A type satisfies an interface if it has the right methods — no declaration required.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">type</span> <span class="n">Stringer</span> <span class="k">interface</span> <span class="p">{</span>
    <span class="n">String</span><span class="p">()</span> <span class="kt">string</span>
<span class="p">}</span>

<span class="k">type</span> <span class="n">Point</span> <span class="k">struct</span> <span class="p">{</span>
    <span class="n">X</span><span class="p">,</span> <span class="n">Y</span> <span class="kt">float64</span>
<span class="p">}</span>

<span class="c">// Point implicitly satisfies Stringer</span>
<span class="k">func</span> <span class="p">(</span><span class="n">p</span> <span class="n">Point</span><span class="p">)</span> <span class="n">String</span><span class="p">()</span> <span class="kt">string</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Sprintf</span><span class="p">(</span><span class="s">"(%v, %v)"</span><span class="p">,</span> <span class="n">p</span><span class="o">.</span><span class="n">X</span><span class="p">,</span> <span class="n">p</span><span class="o">.</span><span class="n">Y</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">Point</code> never mentions <code class="language-plaintext highlighter-rouge">Stringer</code>. It just has a <code class="language-plaintext highlighter-rouge">String()</code> method, and that’s enough.</p>

<h2 id="why-this-matters">Why this matters</h2>

<p>This design decision has a profound consequence: <strong>interfaces are defined by the consumer, not the producer.</strong></p>

<p>The package that <em>needs</em> a logger defines <code class="language-plaintext highlighter-rouge">Logger</code>. The package that <em>provides</em> logging just has methods. If they happen to match, they fit together — even if they were written years apart by different people.</p>

<p>This is the opposite of most OOP languages, where the producer decides the contract.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// In your package — you define this</span>
<span class="k">type</span> <span class="n">Logger</span> <span class="k">interface</span> <span class="p">{</span>
    <span class="n">Log</span><span class="p">(</span><span class="n">msg</span> <span class="kt">string</span><span class="p">,</span> <span class="n">level</span> <span class="kt">int</span><span class="p">)</span>
<span class="p">}</span>

<span class="c">// In some other package — they don't know you exist</span>
<span class="k">type</span> <span class="n">ZapLogger</span> <span class="k">struct</span> <span class="p">{</span> <span class="o">...</span> <span class="p">}</span>
<span class="k">func</span> <span class="p">(</span><span class="n">z</span> <span class="n">ZapLogger</span><span class="p">)</span> <span class="n">Log</span><span class="p">(</span><span class="n">msg</span> <span class="kt">string</span><span class="p">,</span> <span class="n">level</span> <span class="kt">int</span><span class="p">)</span> <span class="p">{</span> <span class="o">...</span> <span class="p">}</span>

<span class="c">// They fit anyway</span>
<span class="k">var</span> <span class="n">l</span> <span class="n">Logger</span> <span class="o">=</span> <span class="n">ZapLogger</span><span class="p">{}</span>
</code></pre></div></div>

<h2 id="the-empty-interface">The empty interface</h2>

<p><code class="language-plaintext highlighter-rouge">interface{}</code> (or <code class="language-plaintext highlighter-rouge">any</code> in modern Go) is the interface with no methods. Every type satisfies it, so it can hold any value.</p>

<p>It’s Go’s escape hatch. Use it sparingly — it trades away type safety for flexibility.</p>

<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">PrintAnything</span><span class="p">(</span><span class="n">v</span> <span class="n">any</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">fmt</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="n">v</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="a-practical-heuristic">A practical heuristic</h2>

<p>Keep interfaces small. One or two methods is ideal. <code class="language-plaintext highlighter-rouge">io.Reader</code> has one. <code class="language-plaintext highlighter-rouge">io.Writer</code> has one. These small interfaces compose beautifully — <code class="language-plaintext highlighter-rouge">io.ReadWriter</code> is just both combined.</p>

<p>The moment an interface has seven methods, it’s describing a class, not a capability.</p>

<hr />

<p>The next time you’re reaching for a concrete type in a function signature, ask: what’s the minimum capability I actually need here? Name that capability. Make it an interface. The code will be better for it.</p>]]></content><author><name>Aarsh</name></author><category term="go" /><category term="programming" /><summary type="html"><![CDATA[Most explanations of Go interfaces start with the syntax. This one won’t.]]></summary></entry><entry><title type="html">Notes on Local-First Software</title><link href="https://blog.aarshpandey.com/posts/notes-on-local-first-software/" rel="alternate" type="text/html" title="Notes on Local-First Software" /><published>2026-03-10T00:00:00+00:00</published><updated>2026-03-10T00:00:00+00:00</updated><id>https://blog.aarshpandey.com/posts/notes-on-local-first-software</id><content type="html" xml:base="https://blog.aarshpandey.com/posts/notes-on-local-first-software/"><![CDATA[<p>The cloud made software better in a lot of ways. Sync across devices. Collaboration. No data loss when your laptop dies.</p>

<p>But it also made software <em>worse</em> in a way we’ve mostly accepted: <strong>your data lives somewhere else</strong>, and access to it depends on someone else’s uptime, business model, and goodwill.</p>

<p>Local-first software is an attempt to get the benefits of the cloud without giving up ownership.</p>

<h2 id="the-seven-ideals">The seven ideals</h2>

<p>The original <a href="https://www.inkandswitch.com/local-first/">local-first paper</a> from Ink &amp; Switch laid out seven properties:</p>

<ol>
  <li><strong>Fast</strong> — no round-trip latency for reads</li>
  <li><strong>Multi-device</strong> — works across your machines</li>
  <li><strong>Offline</strong> — works without internet</li>
  <li><strong>Collaboration</strong> — still supports real-time editing</li>
  <li><strong>Longevity</strong> — data survives the vendor</li>
  <li><strong>Privacy</strong> — you control your data</li>
  <li><strong>User control</strong> — you can export, modify, extend</li>
</ol>

<p>Most cloud software hits 2 and 4. Most desktop software hits 1, 3, and 6. Local-first tries to hit all seven.</p>

<h2 id="the-hard-part-sync">The hard part: sync</h2>

<p>Offline + multi-device + collaboration creates a distributed systems problem. When two devices edit the same document without talking to each other, how do you merge the results?</p>

<p>The promising answer is <strong>CRDTs</strong> — Conflict-free Replicated Data Types. They’re data structures designed to merge in a way that’s always consistent, without requiring coordination.</p>

<p>The tradeoff: CRDTs work well for text and counters, but get complicated for structured data with complex relationships.</p>
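<p>The simplest CRDT shows why coordination-free merging is possible at all: a grow-only counter where each replica increments only its own slot, and merging takes the per-replica maximum. A minimal sketch:</p>

```go
package main

import "fmt"

// GCounter is a grow-only counter CRDT: each replica increments only its own
// slot, and merge takes the per-replica maximum. Because merge is commutative,
// associative, and idempotent, replicas converge in any exchange order.
type GCounter map[string]int

func (g GCounter) Inc(replica string) { g[replica]++ }

func (g GCounter) Value() int {
    total := 0
    for _, v := range g {
        total += v
    }
    return total
}

// Merge folds other into g, keeping the larger count per replica.
func (g GCounter) Merge(other GCounter) {
    for id, v := range other {
        if v > g[id] {
            g[id] = v
        }
    }
}

func main() {
    a := GCounter{} // replica A, edited offline
    b := GCounter{} // replica B, edited offline
    a.Inc("A")
    a.Inc("A")
    b.Inc("B")
    a.Merge(b)
    b.Merge(a)
    fmt.Println(a.Value(), b.Value()) // both converge to 3
}
```

<p>Counters are the easy case; text CRDTs apply the same merge discipline to sequences of characters, which is where the real difficulty lives.</p>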

<h2 id="where-its-going">Where it’s going</h2>

<p>Tools like Obsidian, Logseq, and Linear (in parts) lean local-first. The <a href="https://automerge.org/">Automerge</a> library makes CRDTs accessible for app developers. SQLite-based sync tools like libSQL are exploring the same space for databases.</p>

<p>It’s not solved. But the gap between “cloud app” and “local app” is narrowing in the right direction.</p>

<hr />

<p>I don’t think local-first will replace everything. Some things genuinely need a central server. But for the apps where you’d be upset if the company shut down — your notes, your writing, your finances — it’s worth asking whether the data should live on <em>your</em> machine first.</p>]]></content><author><name>Aarsh</name></author><category term="software" /><category term="design" /><summary type="html"><![CDATA[The cloud made software better in a lot of ways. Sync across devices. Collaboration. No data loss when your laptop dies.]]></summary></entry></feed>