Building a 12-Source Job Discovery Pipeline with RAG and pgvector

Job boards are fragmented. LinkedIn, Indeed, Greenhouse, Lever, RemoteOK, Hacker News — each has a different API surface, rate-limit profile, and data schema. JobPulse started as a personal itch: I wanted a single interface that would ingest all of them, score listings against my resume, and generate cover letters grounded in my actual experience rather than hallucinated bullet points.

This post covers the pipeline architecture: the three-tier source model, the resume embedding strategy, the composite scoring function, and the retrieval failures that forced me to redesign the chunking layer halfway through.

The Three-Tier Source Model

Twelve sources is a lot of surface area to maintain, so the first design decision was a tiered adapter protocol that isolates rate-limit concerns from retrieval logic.

Tier 1 — Search engines. SerpAPI (Google Jobs), JSearch (LinkedIn + Indeed aggregation), Bing Jobs. These give the broadest coverage but are the most expensive per call and the least structured. Results arrive as free-text snippets that need normalisation before they can be scored.

Tier 2 — Free job boards. RemoteOK, We Work Remotely, Remotive, and the Hacker News "Who is hiring?" monthly thread. These have public RSS feeds or simple JSON endpoints, no auth required, and structured enough data to skip the normalisation step. The HN thread is the outlier — it's a comment thread that needs its own parser to extract role, company, location, and salary signal from unstructured text.

Tier 3 — ATS public APIs. Greenhouse, Lever, Ashby, Workable, SmartRecruiters all expose public job board endpoints that return clean JSON. No API key required for read access. This tier is the most reliable and the most consistent in schema.

Each tier implements the same async adapter interface:

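from typing import AsyncIterator, Protocol

class JobSourceAdapter(Protocol):
    # Plain `def`: calling an async generator returns the iterator directly.
    def fetch(self, query: JobQuery) -> AsyncIterator[RawListing]: ...
    async def health(self) -> AdapterHealth: ...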

The fetch method yields RawListing objects as they arrive — the pipeline never waits for a full batch before beginning normalisation. health() is called on a schedule by a Celery beat task; adapters that return AdapterHealth.DEGRADED are deprioritised in the source selection step without taking the whole pipeline down.
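To make the streaming and deprioritisation behaviour concrete, here is a minimal sketch of one adapter plus a consumer. Everything except the JobSourceAdapter shape and AdapterHealth.DEGRADED is an assumption: the HEALTHY member, the fields on JobQuery and RawListing, StaticAdapter, and the two helper functions are hypothetical stand-ins. The protocol declares fetch with a plain def because an async generator function returns an async iterator when called, rather than a coroutine that resolves to one.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum
from typing import AsyncIterator, Protocol


class AdapterHealth(Enum):
    HEALTHY = 0   # hypothetical member; only DEGRADED is named in the post
    DEGRADED = 1


@dataclass
class JobQuery:
    keywords: str  # hypothetical field


@dataclass
class RawListing:
    source: str    # hypothetical fields
    title: str


class JobSourceAdapter(Protocol):
    def fetch(self, query: JobQuery) -> AsyncIterator[RawListing]: ...
    async def health(self) -> AdapterHealth: ...


class StaticAdapter:
    """In-memory adapter standing in for a real HTTP-backed source."""

    def __init__(self, name: str, titles: list[str],
                 status: AdapterHealth = AdapterHealth.HEALTHY) -> None:
        self.name, self.titles, self.status = name, titles, status

    async def fetch(self, query: JobQuery) -> AsyncIterator[RawListing]:
        for title in self.titles:
            await asyncio.sleep(0)  # yield control, as a network read would
            if query.keywords.lower() in title.lower():
                yield RawListing(self.name, title)

    async def health(self) -> AdapterHealth:
        return self.status


async def order_by_health(adapters: list[JobSourceAdapter]) -> list[JobSourceAdapter]:
    """Healthy adapters first; degraded ones are kept but moved to the back."""
    statuses = await asyncio.gather(*(a.health() for a in adapters))
    ranked = sorted(zip(adapters, statuses), key=lambda pair: pair[1].value)
    return [adapter for adapter, _ in ranked]


async def collect(adapters: list[JobSourceAdapter], query: JobQuery) -> list[RawListing]:
    listings: list[RawListing] = []
    for adapter in await order_by_health(adapters):
        async for listing in adapter.fetch(query):  # handled one at a time
            listings.append(listing)
    return listings
```

In this sketch a degraded adapter is still queried, only later, which matches "deprioritised" rather than "disabled".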

Normalisation and Deduplication

Raw listings from different sources often describe the same job. A listing scraped from LinkedIn via JSearch and the same listing from the company's Greenhouse endpoint will collide. The deduplication strategy has two layers.

Layer 1: exact match on a composite key of (company_slug, role_title_normalised, location_hash). This catches identical postings with minor formatting differences.

Layer 2: fuzzy match via MinHash LSH on the job description text. Any two descriptions with an estimated Jaccard similarity above 0.82 are treated as duplicates; the one from the higher-trust tier wins. The 0.82 threshold came from manual review of 200 collision candidates: with a lower threshold, legitimately distinct roles from the same company started getting merged and dropped.
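Both deduplication layers can be illustrated with the standard library alone. The sketch below is an assumption-heavy stand-in, not the JobPulse implementation: the normalisation rules, the 3-word shingles, the 64 hash functions, and all function names are hypothetical, and it compares signatures pairwise instead of bucketing them with LSH (a library such as datasketch would handle the bucketing at scale).

```python
import hashlib
import re


def _slug(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics to one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def composite_key(company: str, title: str, location: str) -> tuple[str, str, str]:
    """Exact-match key: (company_slug, role_title_normalised, location_hash)."""
    location_hash = hashlib.sha1(_slug(location).encode()).hexdigest()[:12]
    return _slug(company), _slug(title), location_hash


def shingles(text: str, k: int = 3) -> set[str]:
    """Set of overlapping k-word shingles from the description text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """One minimum per salted hash function; the salt plays the permutation role."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def is_fuzzy_duplicate(desc_a: str, desc_b: str, threshold: float = 0.82) -> bool:
    sig_a = minhash_signature(shingles(desc_a))
    sig_b = minhash_signature(shingles(desc_b))
    return estimated_jaccard(sig_a, sig_b) > threshold
```

With only 64 hash functions the estimate is coarse, so a threshold like 0.82 would need re-validating against whatever signature size is actually used.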

After deduplication, listings are written to Postgres via SQLAlchemy async with a RETURNING id clause that feeds the downstream embedding queue.

Resume Embedding Pipeline

Cover letter generation is only as good as the retrieval layer beneath it. The failure mode to avoid is the model generating plausible-sounding bullets that don't map to anything you've actually done. The fix is grounding: every generation call must retrieve specific resume chunks before it writes.

The resume ingestion flow:

Parse. python-docx for .docx, pdfminer.six for PDF. Both paths produce a list of (section_title, text_block) tuples.
Chunk. 512-token chunks with a 64-token overlap, respecting section boundaries. A chunk never crosses a section boundary — "Work Experience" and "Technical Skills" content never ends up in the same chunk. This matters because the retrieval query at generation time is a job description, and you want the similarity score to reflect role-relevant experience, not diluted by skills-section noise.
Embed. text-embedding-3-large at 3072 dimensions, reduced to 1536 via the API's dimensions parameter. The reduction halves storage and speeds up cosine search with negligible accuracy loss on this domain.
Store. Vectors are written to pgvector with an ivfflat index (lists=100) and queried with ivfflat.probes=10. For a single user's resume (typically 40–80 chunks) an exact sequential scan with no index would be faster, but ivfflat keeps the query path consistent if the service ever moves to multi-user.
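The chunking step is the easiest of the four to get subtly wrong, so here is a minimal sketch of section-bounded windows with overlap. It assumes whitespace-separated words as a stand-in for model tokens (a real pipeline would count tokens with the embedding model's tokenizer), and the function name and dict shape are hypothetical.

```python
def chunk_sections(
    sections: list[tuple[str, str]],
    max_tokens: int = 512,
    overlap: int = 64,
) -> list[dict[str, str]]:
    """Split each (section_title, text_block) into overlapping windows.

    Windows restart at every section, so no chunk mixes two sections.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")

    step = max_tokens - overlap
    chunks: list[dict[str, str]] = []
    for title, text in sections:
        tokens = text.split()  # assumption: whitespace words approximate tokens
        start = 0
        while start < len(tokens):
            window = tokens[start:start + max_tokens]
            chunks.append({"section": title, "text": " ".join(window)})
            if start + max_tokens >= len(tokens):
                break  # this window reached the end of the section
            start += step
    return chunks
```

Each window after the first repeats the last `overlap` tokens of the previous one, but only within the same section; the first chunk of "Technical Skills" never inherits the tail of "Work Experience".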

Composite Scoring

Each job listing gets a score before it reaches the UI. The score is a weighted sum of four signals:

score = (
    w_semantic  * cosine_similarity(job_embedding, resume_centroid) +
    w_keyword   * bm25_overlap(job_text, resume_keywords) +
    w_salary    * salary_band_match(job_salary, target_range) +
    w_geo       * geo_match(job_location, preferred_locations)
)

The weights are configurable per search profile. The default values (0.45, 0.30, 0.15, 0.10) came from manually ranking 50 listings and running a grid search against my own preferences. The resume_centroid is the mean of all resume chunk embeddings — a single vector that represents "this person's experience" as a whole, used as the query vector for semantic matching.
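A runnable version of the weighted sum and the centroid might look like the sketch below. It assumes each of the four signals has already been normalised to the range 0 to 1, which the post does not state; the Weights dataclass and function signatures are hypothetical, while the default weight values are the ones quoted above.

```python
import math
from dataclasses import dataclass


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of the chunk embeddings."""
    if not vectors:
        raise ValueError("need at least one vector")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


@dataclass(frozen=True)
class Weights:
    semantic: float = 0.45
    keyword: float = 0.30
    salary: float = 0.15
    geo: float = 0.10


def composite_score(
    semantic: float,
    keyword: float,
    salary: float,
    geo: float,
    weights: Weights = Weights(),
) -> float:
    """Weighted sum of four signals, each assumed to lie in [0, 1]."""
    return (
        weights.semantic * semantic
        + weights.keyword * keyword
        + weights.salary * salary
        + weights.geo * geo
    )
```

Because the default weights sum to 1.0, a listing that maxes out every signal scores 1.0, which keeps scores comparable across search profiles as long as custom weights are normalised the same way.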

BM25 overlap runs against a keyword list extracted from the resume at ingestion time: job titles, technology names, tool names, and proper nouns. This catches cases where semantic similarity is high (the job and resume talk about "building scalable APIs") but the specific technology signal is missing ("FastAPI" in the job, nowhere in the resume).
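For reference, a minimal Okapi BM25 scorer fits in a few lines. This sketch treats the resume keyword list as the query and a batch of job texts as the corpus; the tokenizer, the k1 and b defaults, and the function names are assumptions, and the raw scores are unnormalised, so they would still need scaling before entering a weighted sum.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Keeps "c++" and "c#" intact; splits "node.js" into two tokens.
    return re.findall(r"[a-z0-9][a-z0-9+#]*", text.lower())


def bm25_scores(
    query_terms: list[str],
    docs: list[str],
    k1: float = 1.5,
    b: float = 0.75,
) -> list[float]:
    """Okapi BM25 score of each document against the query terms."""
    tokenised = [tokenize(doc) for doc in docs]
    n_docs = len(tokenised)
    avgdl = sum(len(toks) for toks in tokenised) / n_docs if n_docs else 0.0
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokenised for term in set(toks))
    query = {term.lower() for term in query_terms}

    scores = []
    for toks in tokenised:
        tf = Counter(toks)
        score = 0.0
        for term in query:
            freq = tf.get(term, 0)
            if freq == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = 1.0 - b + b * len(toks) / avgdl
            score += idf * freq * (k1 + 1.0) / (freq + k1 * length_norm)
        scores.append(score)
    return scores
```

A job text that shares the general vocabulary but none of the specific keywords scores exactly zero here, which is the signal the semantic score alone misses.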

Cover Letter Generation

The generation prompt receives three inputs: the job description, the top five resume chunks retrieved by cosine similarity against the job embedding, and a system instruction that prohibits the model from referencing experience not present in the retrieved chunks.

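from anthropic import AsyncAnthropic

# Reads ANTHROPIC_API_KEY from the environment.
anthropic_client = AsyncAnthropic()

async def generate_cover_letter(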
    job: NormalisedListing,
    resume_chunks: list[ResumeChunk],
    model: str = "claude-sonnet-4-5",
) -> str:
    context = "\n\n".join(chunk.text for chunk in resume_chunks)
    response = await anthropic_client.messages.create(
        model=model,
        max_tokens=1000,
        system=(
            "You are writing a cover letter. "
            "Use ONLY the experience described in the provided resume excerpts. "
            "Do not invent, extrapolate, or embellish. "
            "If the job requires something not in the excerpts, acknowledge the gap honestly."
        ),
        messages=[
            {
                "role": "user",
                "content": f"Job:\n{job.description}\n\nResume excerpts:\n{context}",
            }
        ],
    )
    return response.content[0].text

The gap acknowledgement instruction turned out to be important. Early versions without it produced confident claims about experience that wasn't in the resume. With it, the model flags mismatches — which is more useful to a job seeker than a polished lie.
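The retrieval step that selects the top five chunks for generate_cover_letter is not shown above. In the real pipeline it would presumably run inside Postgres using pgvector's cosine distance operator (<=>); the in-memory sketch below shows the same ranking logic with hypothetical names and a plain (text, embedding) tuple in place of ResumeChunk.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(
    job_embedding: list[float],
    chunks: list[tuple[str, list[float]]],
    k: int = 5,
) -> list[tuple[float, str]]:
    """Return up to k (similarity, chunk_text) pairs, most similar first."""
    scored = (
        (cosine_similarity(job_embedding, embedding), text)
        for text, embedding in chunks
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])
```

Returning the similarity alongside the text makes it easy to drop chunks below some floor before they reach the prompt, so that a weak match is not presented to the model as grounding.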

Three Retrieval Failures That Shaped the Architecture

Failure 1: Section boundary bleed. Initial chunks ignored section structure. A chunk that straddled "Professional Summary" and "Work Experience" retrieved well for generic queries but poorly for specific role queries. The fix — section-boundary-aware chunking — improved retrieval precision on role-specific queries by around 30% in informal testing.

Failure 2: Centroid drift on long resumes. The resume centroid is a mean of all chunk embeddings. For a resume with a long skills section, the centroid drifts toward the skills-section embedding space, making semantic matching favour jobs that match the skills list rather than the experience narrative. The fix was to exclude skills-section chunks from the centroid calculation. The centroid now represents experience only; skills are captured by the BM25 signal.
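The centroid fix amounts to a filter before the mean. A minimal sketch, assuming chunks carry their section title and that skills sections can be recognised by name (the excluded titles and the function name here are hypothetical):

```python
def experience_centroid(
    chunks: list[tuple[str, list[float]]],
    excluded_sections: frozenset[str] = frozenset({"skills", "technical skills"}),
) -> list[float]:
    """Mean embedding over all chunks whose section is not excluded.

    Each chunk is a (section_title, embedding) pair.
    """
    kept = [
        embedding
        for section, embedding in chunks
        if section.strip().lower() not in excluded_sections
    ]
    if not kept:
        raise ValueError("no chunks left after excluding skills sections")
    return [sum(column) / len(kept) for column in zip(*kept)]
```

Matching on section titles is brittle if resumes use unusual headings, so a real implementation would likely tag section types during parsing instead.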

Failure 3: ATS tier latency spikes. Some ATS endpoints (Workable in particular) respond in 4–8 seconds under load. Awaiting each adapter in turn blocked the normalisation queue behind the slowest one. The fix was a two-phase fetch: fire all adapter requests concurrently with asyncio.gather, collect whatever responds within a 3-second deadline, and schedule a retry job for the remainder via Celery. The UI shows partial results immediately and updates via WebSocket as retries complete.
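One way to get the "collect whatever responds within the deadline" behaviour is sketched below. Note the swap: it uses asyncio.wait with a timeout rather than asyncio.gather, because wait hands back the finished and unfinished tasks as two separate sets. The function name, the dict-of-callables shape, and the returned retry list (which a Celery task would consume in the real system) are all hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def fetch_with_deadline(
    adapters: dict[str, Callable[[], Awaitable[list[Any]]]],
    deadline_s: float = 3.0,
) -> tuple[dict[str, list[Any]], list[str]]:
    """Run all adapters concurrently; return (results, names_to_retry)."""
    if not adapters:
        return {}, []

    tasks = {asyncio.create_task(fn()): name for name, fn in adapters.items()}
    done, pending = await asyncio.wait(tasks.keys(), timeout=deadline_s)

    results: dict[str, list[Any]] = {}
    retry: list[str] = []
    for task in done:
        if task.cancelled() or task.exception() is not None:
            retry.append(tasks[task])  # failed fast: retry later
        else:
            results[tasks[task]] = task.result()

    for task in pending:
        task.cancel()  # missed the deadline: stop it and retry later
        retry.append(tasks[task])
    # Let the cancellations finish so no task is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)

    return results, sorted(retry)
```

Cancelling the stragglers and re-running them in a background job trades some duplicated work for a bounded response time; letting them keep running and streaming their results later would be the alternative.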

Stack Summary

Python 3.12, FastAPI, SQLAlchemy async (asyncpg driver), pgvector, Redis (result cache + Celery broker), Celery with beat scheduler, OpenAI embeddings API, Anthropic Messages API, React 18 + Vite + Tailwind CSS for the dashboard, Docker Compose for local development, GitHub Actions for CI.

The full source is at github.com/mralaminahamed/jobpulse.


Al Amin Ahamed

Senior software engineer & AI practitioner. 5+ years shipping Laravel platforms, WordPress plugins, WooCommerce extensions, and AI-driven products.
