Designing a ReAct-Style Codebase Research Agent with Tool Use

Most RAG systems are pipelines: query arrives, retrieval runs, context is stuffed into a prompt, model responds. That pattern works well for document Q&A where the answer lives in a single contiguous passage. It breaks down for codebase research, where answering "why does this function behave this way?" might require tracing a call chain across five files, checking git blame to understand when behaviour changed, and cross-referencing a symbol definition from a dependency.

codebase-research-agent is built around a different design: the retrieval layer is exposed as a set of callable tools, and the model decides at runtime which tools to call, in what order, and when it has enough grounded context to answer. This post covers the agent loop design, the hybrid retrieval substrate, the tool schema, and the failure modes that come with giving a model control over retrieval.

Why a Pipeline Isn't Enough for Codebases

A retrieval pipeline makes one decision: what to retrieve. It does that once, before generation begins, and the model works with whatever comes back. For code, that single retrieval step fails in predictable ways.

Chains. A function calls another function in a different file, which calls a third. Retrieving the five chunks most semantically similar to the query rarely returns all three. The model needs to follow the chain: retrieve, read, identify the next hop, retrieve again.

Definitions vs usages. Understanding a bug often requires seeing both where a symbol is defined and where it is called. Semantic search returns the most similar passages, not necessarily both. A dedicated symbol lookup tool that returns all usages of a name is more reliable than hoping similarity search surfaces them.

Temporal context. "When was this introduced and why?" is unanswerable from embeddings alone. Git blame returns that directly. Wiring it as a tool makes it callable on demand rather than pre-computed for every chunk.

The agent loop handles all three by letting the model compose tool calls iteratively until the answer is grounded.

The ReAct Loop

ReAct (Reason + Act) is a prompting pattern where the model alternates between a reasoning step ("I need to find where process_order is defined") and an action step (calling symbol_lookup("process_order")), then observes the result before reasoning again. The loop continues until the model emits a final answer rather than another tool call.

The implementation uses Anthropic's tool use API. The agent receives an initial query and a tool manifest; it returns either a tool call or a text response. If it returns a tool call, the orchestrator executes it, appends the result to the conversation, and calls the model again. If it returns text, the loop ends.

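from pathlib import Path

import anthropic

anthropic_client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
MAX_ITERATIONS = 12  # hard cap on model round trips


async def run_agent(query: str, repo_path: Path) -> str:
    messages = [{"role": "user", "content": query}]
    tools = build_tool_manifest(repo_path)

    for _ in range(MAX_ITERATIONS):
        response = await anthropic_client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )

        # Anything other than a tool request (end_turn, max_tokens, ...) ends the loop
        if response.stop_reason != "tool_use":
            return extract_text(response)

        # Model requested one or more tool calls: run them and feed the results back
        tool_results = await execute_tool_calls(response.content, repo_path)

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

    # Cap reached: return the text of the last response, flagged as partial
    note = f"[Stopped at the {MAX_ITERATIONS}-iteration limit; this answer may be incomplete.]"
    return extract_text(response) + "\n\n" + note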

The loop has a hard cap of 12 iterations. Hitting the cap means either the agent is stuck in a retrieval cycle or the query is unanswerable from the codebase. Both cases return a partial answer with an explicit note about the iteration limit.
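
The execute_tool_calls helper is not shown in this post. A minimal sketch of what it could look like follows, assuming a hypothetical TOOL_HANDLERS registry that maps tool names to async functions. The tool_use block fields (id, name, input) and the tool_result shape follow the Anthropic Messages API; the registry, the stub tool, and the error formatting are illustrative only.

```python
import asyncio
from pathlib import Path
from types import SimpleNamespace
from typing import Any, Awaitable, Callable

# Hypothetical registry: tool name -> async handler(repo_path, **tool_input).
ToolHandler = Callable[..., Awaitable[str]]
TOOL_HANDLERS: dict[str, ToolHandler] = {}


async def execute_tool_calls(content: list[Any], repo_path: Path) -> list[dict]:
    """Run every tool_use block in a response and build tool_result blocks."""
    results = []
    for block in content:
        if getattr(block, "type", None) != "tool_use":
            continue  # skip text blocks (the model's reasoning)
        handler = TOOL_HANDLERS.get(block.name)
        try:
            if handler is None:
                raise KeyError(f"unknown tool: {block.name}")
            output = await handler(repo_path, **block.input)
            is_error = False
        except Exception as exc:  # report failures to the model instead of crashing
            output, is_error = f"{type(exc).__name__}: {exc}", True
        results.append({
            "type": "tool_result",
            "tool_use_id": block.id,  # must echo the id of the request it answers
            "content": output,
            "is_error": is_error,
        })
    return results


# Usage with a stub tool and hand-built blocks standing in for SDK objects.
async def fake_grep(repo_path: Path, pattern: str, path_glob: str) -> str:
    return f"0 matches for {pattern!r} in {path_glob}"


TOOL_HANDLERS["grep"] = fake_grep
blocks = [
    SimpleNamespace(type="text", text="I should search for the error string."),
    SimpleNamespace(type="tool_use", id="toolu_01", name="grep",
                    input={"pattern": "E1001", "path_glob": "*.py"}),
    SimpleNamespace(type="tool_use", id="toolu_02", name="missing", input={}),
]
demo_results = asyncio.run(execute_tool_calls(blocks, Path(".")))
```

Returning failures as tool results with is_error set, rather than raising, lets the model see the error and choose a different tool on the next iteration.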

The Tool Manifest

Five tools are registered. Each is designed to be fast enough that the model can call several of them per turn without noticeable latency.

semantic_search(query, top_k) — cosine similarity over pgvector against the indexed codebase chunks. Returns file path, line range, and chunk text. This is the broad-net tool; the model uses it first when it doesn't know where to look.

symbol_lookup(name) — scans the AST index (built with tree-sitter at index time) for all definitions and usages of a symbol name. Returns file, line, and context window for each hit. More precise than semantic search for "where is X defined / called" queries.

ast_navigate(file_path, node_type) — returns all AST nodes of a given type from a file. Useful for "list all classes in this file" or "show me all function signatures in this module" without reading the full file into context.

grep(pattern, path_glob) — literal and regex search over file contents. The escape hatch when semantic search and symbol lookup both miss — searching for a specific error string, a config key, or a magic constant.

git_blame(file_path, line_start, line_end) — returns commit hash, author, date, and message for each line in the range. The model uses this after locating a suspicious code block to understand when it was introduced and why.
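
One plausible way to back the git_blame tool is to shell out to git blame --line-porcelain and parse the output. The sketch below assumes that approach; the function names and the record layout are hypothetical, and the repository may do it differently. The parser is exercised on a hand-written sample so it runs without a repository.

```python
import re
import subprocess
from pathlib import Path

# Porcelain header: <40-hex sha> <original line> <final line> [<group size>]
HEADER = re.compile(r"^([0-9a-f]{40}) \d+ (\d+)")


def parse_line_porcelain(text: str) -> list[dict]:
    """Parse `git blame --line-porcelain` output into one record per line."""
    records, current = [], {}
    for raw in text.split("\n"):
        if raw.startswith("\t"):
            # A tab-prefixed line is the source line and closes the record.
            current["code"] = raw[1:]
            records.append(current)
            current = {}
        elif (m := HEADER.match(raw)):
            current = {"commit": m.group(1), "line": int(m.group(2))}
        else:
            key, _, value = raw.partition(" ")
            if key in ("author", "author-time", "summary"):
                current[key] = value
    return records


def git_blame(repo_path: Path, file_path: str, line_start: int, line_end: int) -> list[dict]:
    """Hypothetical tool body: run git in the repo and parse the result."""
    out = subprocess.run(
        ["git", "blame", "--line-porcelain",
         "-L", f"{line_start},{line_end}", "--", file_path],
        cwd=repo_path, capture_output=True, text=True, check=True,
    ).stdout
    return parse_line_porcelain(out)


SAMPLE = (
    "a" * 40 + " 10 12 1\n"
    "author Jane Doe\n"
    "author-mail <jane@example.com>\n"
    "author-time 1700000000\n"
    "author-tz +0000\n"
    "summary Retry failed webhook deliveries\n"
    "filename app/hooks.py\n"
    "\tretry(delivery)\n"
)
sample_records = parse_line_porcelain(SAMPLE)
```

The --line-porcelain form repeats the full commit header for every line, which keeps the parser stateless across records at the cost of more output to read.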

The tool schemas are defined as TypedDict classes and serialised to JSON Schema at startup. The manifest contains no hand-written schema literals, so if a tool's parameter type changes, the generated schema changes with it.
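
As a rough illustration of the TypedDict-to-schema step, the sketch below derives an Anthropic-style tool definition from a parameter class. The class names, the tool_schema helper, and the small type mapping are assumptions made for the example, not the project's actual code; only the name / description / input_schema layout comes from the Messages API.

```python
from typing import TypedDict, get_type_hints

# Only primitive types are mapped here; a fuller mapping is left out.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


class _SemanticSearchRequired(TypedDict):
    query: str


class SemanticSearchInput(_SemanticSearchRequired, total=False):
    """Hypothetical parameter class: query is required, top_k is optional."""
    top_k: int


def tool_schema(name: str, description: str, params: type) -> dict:
    """Build a tool definition whose input_schema mirrors a TypedDict."""
    hints = get_type_hints(params)
    return {
        "name": name,
        "description": description,
        "input_schema": {
            "type": "object",
            "properties": {key: {"type": JSON_TYPES[tp]} for key, tp in hints.items()},
            # TypedDict tracks which keys are required for us.
            "required": sorted(params.__required_keys__),
        },
    }


semantic_search_tool = tool_schema(
    "semantic_search",
    "Search indexed code chunks by meaning.",
    SemanticSearchInput,
)
```

Changing top_k from int to float in the class would change the generated property type with no edit to the manifest, which is the property the paragraph above describes.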

The Hybrid Retrieval Substrate

The semantic search tool sits on top of a hybrid retrieval layer that runs both dense and sparse retrieval before returning results.

Dense retrieval. tree-sitter parses the repository at index time, splitting files into chunks at function and class boundaries rather than fixed token counts. A function is always a complete chunk; it is never split mid-body. Chunks are embedded with text-embedding-3-large and stored in pgvector.
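
To make boundary-based chunking concrete, here is a small sketch that splits Python source at top-level function and class boundaries. It deliberately swaps in Python's standard-library ast module for tree-sitter so the example runs without extra dependencies; the Chunk type and chunk_python_source name are hypothetical, and this version ignores module-level code that sits outside any function or class.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    name: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str


def chunk_python_source(source: str) -> list[Chunk]:
    """Split a module into one chunk per top-level function or class."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so it stays with its function.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            text = "\n".join(lines[start - 1 : node.end_lineno])
            chunks.append(Chunk(node.name, start, node.end_lineno, text))
    return chunks


SOURCE = '''import os

def load(path):
    with open(path) as fh:
        return fh.read()

class OrderProcessor:
    def run(self):
        return load("orders.json")
'''
demo_chunks = chunk_python_source(SOURCE)
```

Because the chunk boundaries come from the syntax tree, each chunk also carries an exact line range, which is what lets a search result point back at a file path and line range.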

Sparse retrieval. A BM25 index over the same chunks, built with rank_bm25. This catches exact identifier matches that semantic similarity misses: a query that mentions OrderProcessor returns that class directly, even if the rest of the query talks about an "order handler" or a "purchase workflow."

Reciprocal Rank Fusion. Results from both retrievers are merged using RRF with k=60. The fused ranking consistently outperforms either retriever alone on the evaluation set I assembled from real debugging sessions — 40 queries with manually verified answer locations.

def reciprocal_rank_fusion(
    dense_results: list[ScoredChunk],
    sparse_results: list[ScoredChunk],
    k: int = 60,
) -> list[ScoredChunk]:
    scores: dict[str, float] = {}
    chunks: dict[str, ScoredChunk] = {}  # first-seen chunk object for each id
    for results in (dense_results, sparse_results):
        # enumerate() is 0-based, so rank + 1 is the 1-based rank RRF expects
        for rank, chunk in enumerate(results):
            scores[chunk.id] = scores.get(chunk.id, 0.0) + 1 / (k + rank + 1)
            chunks.setdefault(chunk.id, chunk)
    return sorted(chunks.values(), key=lambda c: scores[c.id], reverse=True)
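
A worked example helps show why fusion rewards agreement between retrievers. The sketch below is a simplified, ids-only version of the same formula (the chunk ids and the rrf_scores name are made up for illustration): a chunk ranked third by both retrievers ends up ahead of chunks ranked first by only one.

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) over every ranking, with rank starting at 1."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1 / (k + rank)
    return scores


dense = ["orders.py:process_order", "cart.py:checkout", "hooks.py:on_complete"]
sparse = ["billing.py:charge", "tax.py:apply", "hooks.py:on_complete"]

fused = rrf_scores([dense, sparse])
ranked_ids = sorted(fused, key=fused.get, reverse=True)

# hooks.py:on_complete is third in both lists:  1/63 + 1/63 = 2/63, about 0.0317
# orders.py:process_order tops one list only:   1/61, about 0.0164
```

With k=60 the gap between adjacent ranks is small, so appearing in both lists matters more than the exact position in either one.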

Exposed as a Claude Code MCP Server

The agent is also packaged as a Model Context Protocol server, which means Claude Code can invoke it as a tool during coding sessions. The MCP server wraps the agent loop: Claude Code sends a natural-language query about the codebase, the server runs the full ReAct loop against the indexed repo, and returns the grounded answer.

@mcp_server.tool()
async def research_codebase(query: str) -> str:
    """
    Research a question about the current codebase.
    Use for architecture questions, tracing call chains,
    understanding why code behaves a certain way, or finding
    where a symbol is defined or used.
    """
    return await run_agent(query, repo_path=current_repo_path())

The practical effect is that "why does checkout_complete fire twice on mobile?" becomes answerable inside Claude Code without manually grepping the repository or reading five files to find the relevant hook registration.

Failure Modes

Context window exhaustion. Long call chains produce large tool result sets. After several iterations the conversation history grows large enough to approach the model's context limit. The current mitigation is a summarisation step that compresses older tool results after turn 6, keeping only the file paths and key findings rather than full chunk text.
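
The compression step could look roughly like the sketch below. It is an assumption-heavy illustration: keep_recent stands in for the "after turn 6" threshold, paths_only is a regex stand-in for whatever summariser the project actually uses, and the history is built from plain dicts rather than SDK objects.

```python
import re
from typing import Callable

PATH_RE = re.compile(r"[\w./-]+\.(?:py|js|ts|tsx|php)\b")


def paths_only(text: str) -> str:
    """Stand-in summariser: keep just the file paths mentioned in a result."""
    paths = sorted(set(PATH_RE.findall(text)))
    return "[compressed] files seen: " + (", ".join(paths) or "none")


def compact_history(
    messages: list[dict],
    keep_recent: int = 6,
    summarise: Callable[[str], str] = paths_only,
) -> list[dict]:
    """Return a copy of messages with older tool results summarised."""
    # Indices of user messages that carry tool results, oldest first.
    tool_turns = [
        i for i, m in enumerate(messages)
        if m["role"] == "user" and isinstance(m["content"], list)
    ]
    old = set(tool_turns[:-keep_recent]) if keep_recent else set(tool_turns)
    compacted = []
    for i, message in enumerate(messages):
        if i in old:
            blocks = [
                {**b, "content": summarise(b["content"])}
                if b.get("type") == "tool_result" and isinstance(b.get("content"), str)
                else b
                for b in message["content"]
            ]
            message = {**message, "content": blocks}
        compacted.append(message)
    return compacted


history = [{"role": "user", "content": "Why does checkout_complete fire twice?"}]
for n in range(3):
    history.append({"role": "assistant", "content": f"calling tool {n}"})
    history.append({"role": "user", "content": [{
        "type": "tool_result",
        "tool_use_id": f"toolu_{n}",
        "content": f"src/hooks_{n}.py lines 1-40\ndef handler(): ...",
    }]})

compacted = compact_history(history, keep_recent=1)
```

The tool_use_id on each compressed block is left untouched, since every tool result still has to pair with the tool call that requested it.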

Symbol ambiguity. symbol_lookup("render") in a large React codebase returns hundreds of hits. Faced with that many results, the model either picks the wrong hit or spends another tool call narrowing down. The fix is a scope parameter on symbol_lookup that restricts the search to a specific directory or file.
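
A sketch of how such a scope filter might behave, using an in-memory list in place of the real AST index. The SymbolHit type, the index layout, and the path-matching rule are assumptions for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class SymbolHit:
    name: str
    file: str  # repo-relative POSIX path
    line: int
    kind: str  # "definition" or "usage"


def symbol_lookup(
    index: list[SymbolHit], name: str, scope: Optional[str] = None
) -> list[SymbolHit]:
    """Return hits for name, optionally limited to one file or directory."""
    hits = [h for h in index if h.name == name]
    if scope is None:
        return hits
    scope_path = PurePosixPath(scope)
    return [
        h for h in hits
        # Match the file itself, or any file below the scope directory.
        if PurePosixPath(h.file) == scope_path
        or scope_path in PurePosixPath(h.file).parents
    ]


INDEX = [
    SymbolHit("render", "src/components/Button.tsx", 12, "definition"),
    SymbolHit("render", "src/components/Modal.tsx", 30, "definition"),
    SymbolHit("render", "src/pages/Checkout.tsx", 88, "usage"),
    SymbolHit("render", "src/components-legacy/Old.tsx", 5, "definition"),
]
in_components = symbol_lookup(INDEX, "render", scope="src/components")
in_one_file = symbol_lookup(INDEX, "render", scope="src/pages/Checkout.tsx")
```

Comparing path components rather than string prefixes keeps src/components from accidentally matching src/components-legacy.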

Git blame on generated files. Calling git_blame on a minified bundle or auto-generated migration returns useless commit messages ("build output", "auto-generated"). The tool now filters paths matching a configurable generated_paths glob list and returns an explicit "generated file — blame not meaningful" message instead of the raw blame output.
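
The guard can be as small as a glob check in front of the blame call. In the sketch below the pattern list, the function names, and the injected blame callable are all hypothetical; only the returned message text is taken from the description above.

```python
from fnmatch import fnmatchcase
from typing import Callable

# Hypothetical defaults; the real list is configurable.
GENERATED_PATHS = ["dist/*", "*.min.js", "*/migrations/*_auto_*.py"]
GENERATED_MESSAGE = "generated file — blame not meaningful"


def is_generated(file_path: str, patterns: list[str] = GENERATED_PATHS) -> bool:
    """True if the path matches any generated-file glob (case-sensitive)."""
    return any(fnmatchcase(file_path, pattern) for pattern in patterns)


def guarded_blame(file_path: str, blame: Callable[[str], str]) -> str:
    """Call blame(file_path) unless the path looks generated."""
    if is_generated(file_path):
        return GENERATED_MESSAGE
    return blame(file_path)


result_generated = guarded_blame("dist/app.bundle.js", lambda path: "raw blame")
result_source = guarded_blame("src/hooks.py", lambda path: "raw blame")
```

Note that fnmatch-style globs let * match across directory separators, so "dist/*" also covers nested build output such as dist/assets/app.js.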

Stack

Python 3.12, Anthropic Messages API (tool use), OpenAI embeddings API (text-embedding-3-large), tree-sitter (AST parsing and chunking), pgvector (dense retrieval), rank_bm25 (sparse retrieval), FastAPI (MCP server transport), asyncio throughout.

Source: github.com/mralaminahamed/codebase-research-agent

Al Amin Ahamed

Senior software engineer & AI practitioner. 5+ years shipping Laravel platforms, WordPress plugins, WooCommerce extensions, and AI-driven products.
