Retrieval-Augmented Generation (RAG) is the pattern behind almost every serious LLM application in production today. Instead of asking a model to answer from memory (which leads to hallucinations), you retrieve relevant context from your own data and inject it into the prompt. The model answers from that context, not from training data.
This post walks through the exact implementation I used for the AI chat on this portfolio.
The Stack
- PostgreSQL + pgvector – stores embeddings and handles ANN search via an HNSW index
- Voyage AI – generates 1024-dimensional text embeddings
- Claude Sonnet – the generation step
- Laravel – ties it all together
Step 1: Schema
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE embeddings (
    id bigserial PRIMARY KEY,
    embeddable_type text NOT NULL,
    embeddable_id bigint NOT NULL,
    chunk_index int NOT NULL DEFAULT 0,
    chunk_text text NOT NULL,
    embedding vector(1024),
    created_at timestamptz DEFAULT now()
);

CREATE INDEX embeddings_hnsw_idx
    ON embeddings USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
The HNSW index gives sub-millisecond ANN search even with hundreds of thousands of vectors.
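Recall is tunable at query time: pgvector exposes hnsw.ef_search (default 40) as an ordinary Postgres setting, so you can trade latency for recall per connection. From Laravel, a one-liner before the query does it; the value 80 here is purely illustrative:

use Illuminate\Support\Facades\DB;

// Widen the HNSW candidate list for this connection only; higher values
// improve recall at the cost of latency (pgvector defaults to 40).
DB::statement('SET hnsw.ef_search = 80');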
Step 2: Ingestion
For each piece of content (post, project description), I split the text into ~400-token chunks with a 50-token overlap, then embed each chunk:
$chunks = $this->splitter->split($text, chunkSize: 400, overlap: 50);

foreach ($chunks as $i => $chunk) {
    $vector = $this->voyageClient->embed($chunk);

    // pgvector accepts a JSON-style array literal ('[0.1, 0.2, ...]')
    // as input for the vector column, so json_encode is enough here.
    Embedding::updateOrCreate(
        ['embeddable_type' => $type, 'embeddable_id' => $id, 'chunk_index' => $i],
        ['chunk_text' => $chunk, 'embedding' => json_encode($vector)]
    );
}
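The splitter itself isn't anything clever. Here is a minimal sketch, assuming whitespace-delimited words are a close-enough stand-in for real token counts (the class name and internals are illustrative, not the exact code behind $this->splitter):

final class TextSplitter
{
    /** @return string[] overlapping word-window chunks */
    public function split(string $text, int $chunkSize = 400, int $overlap = 50): array
    {
        $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
        $step = $chunkSize - $overlap;
        $chunks = [];

        // Slide a window of $chunkSize words, advancing by $step so each
        // chunk shares $overlap words with the previous one.
        for ($start = 0; $start < count($words); $start += $step) {
            $chunks[] = implode(' ', array_slice($words, $start, $chunkSize));
        }

        return $chunks;
    }
}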
Step 3: Retrieval
When a question arrives, embed it and do a cosine similarity search:
$queryVector = $this->voyageClient->embed($question);
$vector = json_encode($queryVector);

// <=> is pgvector's cosine distance operator. The vector is bound once
// per placeholder: PDO won't reuse a single named parameter when
// prepare emulation is off, so positional bindings are safer here.
$chunks = DB::select("
    SELECT chunk_text,
           1 - (embedding <=> ?::vector) AS similarity
    FROM embeddings
    ORDER BY embedding <=> ?::vector
    LIMIT 8
", [$vector, $vector]);
Step 4: Generation
Build a system prompt with the retrieved chunks, then call Claude:
$context = collect($chunks)->pluck('chunk_text')->implode("\n\n---\n\n");

$response = $this->claude->messages()->create([
    'model' => 'claude-sonnet-4-6',
    'max_tokens' => 1024,
    'system' => "You are an assistant answering questions about Al Amin Ahamed's work.\n\nContext:\n{$context}",
    'messages' => [['role' => 'user', 'content' => $question]],
]);
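Pulling the answer text out depends on which client wrapper you use; with most Anthropic SDK wrappers the shape is roughly the following, but treat the property path as an assumption rather than a fixed API:

// Messages API responses carry an array of content blocks; the first
// block usually holds the answer text (property names vary by SDK).
$answer = $response->content[0]->text ?? '';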
Performance Notes
- HNSW ef_search = 40 gives 99%+ recall at ~2ms p99
- Cache the query embedding for identical questions (sketched after this list)
- Stream the response via SSE – users see tokens as they arrive
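The query-embedding cache is only a few lines. A minimal sketch, assuming Laravel's cache facade and keying on a hash of the normalized question (the key prefix and TTL are arbitrary choices; embeddings for a fixed model are stable, so you could cache far longer):

use Illuminate\Support\Facades\Cache;

// Identical questions hit the cache instead of the embeddings API.
$queryVector = Cache::remember(
    'query-embedding:' . sha1(mb_strtolower(trim($question))),
    now()->addHour(),
    fn () => $this->voyageClient->embed($question),
);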
The full source is on GitHub if you want to dig deeper.