
Generative AI Series

Hands on — Agentic RAG (2/3) — Agentic ReRanking RAG

This blog is the second part of a three-part series on Agentic RAG. In this blog, we will build an agentic RAG system that implements ReRanking RAG to pick the right answer.



Courtesy: ChatGPT

This blog is the second part of a three-part series on Agentic RAG. In this blog, we will build an agentic RAG system that implements ReRanking RAG to pick the right answer. Before we proceed with the code, please take the time to read the other blogs I have published on traditional RAG,

and don’t forget to read Part 1 of the blog, Hands on — Agentic RAG (1/2). In that blog, I shared my thoughts about Agentic RAG and how it is different from Traditional RAG. Now let's hit the road and get our hands dirty.

In this blog, we'll dive into building an agentic AI application in Python that manages multiple agents to answer questions about PDF documents. We will implement the ReRanking RAG approach to demonstrate the power of Agentic AI. The Orchestrator agent will dynamically decide the pipeline to rerank candidate responses and pick the right one for the user's query. To keep it simple, let's ingest a single PDF document so that we can analyze the answers.

Application Architecture

Traditional Sequential RAG

In the first part of this blog series, I walked through in detail how Agentic RAG is different from traditional RAG. The picture below shows the high-level flow of a typical traditional RAG.

The user’s query enters the Retriever, which fetches the most relevant chunk(s) from the index/embeddings. That chunk goes to the LLM (QA), which generates a final answer.
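To make that concrete, a traditional sequential query boils down to a single retrieve-then-generate call. Here is a minimal sketch that reuses the EmbeddingAgent and QAAgent classes we build later in this blog (the function itself is illustrative, not part of the repo):

def traditional_rag(question, index, texts, k=5):
    # One retrieval, one generation: no orchestration, no reranking
    q_emb = EmbeddingAgent().embed([question])[0]
    D, I = index.search(np.array([q_emb], dtype="float32"), k)
    context = [texts[j] for j in I[0] if j < len(texts)]
    return QAAgent().answer(question, context)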

Agentic RAG with Orchestration and Reranking

In this blog, we will build an Agentic RAG system to demonstrate how agents work together, specifically implementing the ReRanking RAG approach. The following picture shows the typical flow of a ReRanking RAG.

Here, the Orchestrator allows parallelism. From the user’s query, it launches several Retrieval agents in parallel, each finding relevant (possibly diverse) context. Each context goes to a separate QA step, producing multiple candidate answers.

At the end, an “LLM Reranker” reviews and consolidates all candidate answers, selects the best one, and outputs a final answer. This approach is robust to retrieval noise and aggregates information.

Let's double-click on the ReRanking system

Each ‘QA’ node is a candidate answer based on a different retrieval context. All these answers feed into the Reranker (usually an LLM), which reads them and outputs both the best answer and its reason for selection. The ReRanking agent works like a “meta-evaluator”: reranking lets the pipeline pick the answer that is best supported, most relevant, or least likely to hallucinate. You can read more about the ReRanking RAG approach.

Let's look at the full pipeline that we will implement in this blog.

This diagram captures the entire agentic pipeline, from ingesting a PDF to delivering an answer to the user.

  • The PDF is processed by the PDFLoaderAgent, chunked, and embedded into vectors (EmbeddingAgent).
  • The FAISS index stores vectors for fast search.
  • For each query, multiple RetrievalAgents fetch distinct context sets (thanks to embedding perturbations).
  • QAAgents answer for each context set in parallel.
  • The RankingAgent evaluates all candidate answers with context and picks the best, returning it to the user.

To implement this pipeline, we will code the following agents.

  • PDFLoaderAgent: Loads and splits large PDF documents into small, overlapping textual chunks suitable for semantic retrieval.
  • EmbeddingAgent: Turns those chunks into vector embeddings and builds a similarity index for rapid context retrieval.
  • RetrievalAgent: Given a query, finds multiple diverse “top-k” sets of relevant chunks from the embedding index, increasing answer coverage and robustness.
  • QAAgent: Generates answers to the initial question, using each candidate retrieved context, and does this in parallel for efficiency.
  • RankingAgent: Receives all candidate answers and their contexts. Prompts the LLM to select the best candidate and explain why. This step performs reranking (“meta-evaluation”).
  • RAGOrchestrator: The central coordinator. Handles the ingestion of new documents and orchestrates the retrieval, answer generation, and reranking pipelines in a dynamic, agentic fashion.

Detailed Code Walkthrough

Let's now walk through the complete code. You can also find the code on my GitHub; I have provided the link at the end of the blog.
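All the snippets below share a handful of imports and two model-name constants. Here is a minimal setup sketch; the model names (and whether PdfReader comes from pypdf or PyPDF2) are assumptions on my part, so check the repo for the exact values:

from typing import List

import faiss
import numpy as np
import openai
import tiktoken
from pypdf import PdfReader  # the repo may use PyPDF2 instead

# Placeholder model names; substitute the ones used in the repo
EMBED_MODEL = "text-embedding-3-small"  # 1536-dimensional embeddings
CHAT_MODEL = "gpt-4o-mini"

# Assumes OPENAI_API_KEY is set in the environment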

Token Counting Utility

ENC = tiktoken.get_encoding("cl100k_base")

def num_tokens(text: str) -> int:
    return len(ENC.encode(text))

Initializes the correct tokenizer for OpenAI models (tiktoken). num_tokens lets you count how many tokens a string will take up—crucial for deciding chunk sizes, since LLMs and embedding APIs have token limits.
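A quick sanity check of the helper (the sample string is arbitrary):

print(num_tokens("How is Agentic RAG different from traditional RAG?"))  # counts tokens, not words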

PDFLoaderAgent

class PDFLoaderAgent:
    def __init__(self, chunk_size=500, chunk_overlap=50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def load_and_split(self, path: str) -> List[str]:
        reader = PdfReader(path)
        full_text = "\n".join(page.extract_text() or "" for page in reader.pages)
        tokens = ENC.encode(full_text)
        chunks = []
        start = 0
        while start < len(tokens):
            end = min(start + self.chunk_size, len(tokens))
            chunk = ENC.decode(tokens[start:end])
            chunks.append(chunk)
            start += self.chunk_size - self.chunk_overlap
        return chunks

Loads all of a PDF's text, turns it into tokens, and splits them into overlapping chunks (for context continuity). Returns a list of all text chunks for downstream embedding.
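A quick usage sketch (the file name is a placeholder):

loader = PDFLoaderAgent(chunk_size=500, chunk_overlap=50)
chunks = loader.load_and_split("sample.pdf")  # placeholder path
print(f"Split the PDF into {len(chunks)} chunks")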

EmbeddingAgent

class EmbeddingAgent:
    def __init__(self, dim=1536):
        self.dim = dim
        self.index = faiss.IndexFlatL2(dim)

    def embed(self, texts: List[str]) -> List[List[float]]:
        response = openai.embeddings.create(model=EMBED_MODEL, input=texts)
        return [item.embedding for item in response.data]

    def add_to_index(self, texts: List[str]):
        embs = self.embed(texts)
        vecs = np.array(embs, dtype="float32")
        self.index.add(vecs)

  • Uses OpenAI to embed all chunks in semantic (vector) space.
  • Maintains an in-memory FAISS index for fast nearest-neighbor search.
  • This enables rapid and accurate retrieval of context for future questions.
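Continuing the sketch, indexing the chunks we just produced:

embedder = EmbeddingAgent()
embedder.add_to_index(chunks)  # chunks from PDFLoaderAgent
print(f"Indexed {embedder.index.ntotal} vectors")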

RetrievalAgent

class RetrievalAgent:
    def __init__(self, index):
        self.index = index

    def retrieve_candidates(self, query, texts, n_candidates=3, k=5):
        base_emb = EmbeddingAgent().embed([query])[0]
        candidates = []
        for i in range(n_candidates):
            perturbed = np.array(base_emb) + np.random.normal(0, 0.01, len(base_emb))
            D, I = self.index.search(np.array([perturbed], dtype="float32"), k)
            retrieved = [texts[j] for j in I[0] if j < len(texts)]
            candidates.append(retrieved)
        return candidates

  • For each candidate (n_candidates, e.g., 3), slightly perturbs the query embedding.
  • This produces several diverse context sets (so the LLM sees variety when reranking).
  • Each retrieval is a top-k (e.g., 5) chunk list. The function returns a list of those context lists, as shown in the sketch below.
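For example (the question is a placeholder):

retriever = RetrievalAgent(embedder.index)
question = "What is the main topic of the document?"  # placeholder query
candidate_contexts = retriever.retrieve_candidates(question, chunks, n_candidates=3, k=5)
# candidate_contexts is a list of 3 lists, each holding up to 5 chunks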

QAAgent

class QAAgent:
    def __init__(self, model=CHAT_MODEL):
        self.model = model

    def answer(self, question, context):
        context_str = '---\n'.join(context)
        prompt = ("You are an expert assistant. Use the following context to answer the question.\n\n"
                  f"Context:\n{context_str}\n\n"
                  f"Question: {question}\nAnswer:")
        resp = openai.chat.completions.create(
            model=self.model,
            messages=[{"role": "system", "content": prompt}],
            temperature=0.2,
            max_tokens=500,
        )
        return resp.choices[0].message.content.strip()

    def answer_parallel(self, question, candidate_contexts):
        from concurrent.futures import ThreadPoolExecutor
        with ThreadPoolExecutor() as executor:
            return list(executor.map(lambda ctx: self.answer(question, ctx), candidate_contexts))

  • Given a context, generates an answer using the LLM and a system prompt.
  • Uses threading via ThreadPoolExecutor to answer all candidate contexts in parallel for speed.
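Fanning the question out over every candidate context from the previous step:

qa = QAAgent()
candidate_answers = qa.answer_parallel(question, candidate_contexts)  # one answer per context set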

RankingAgent

class RankingAgent:
    def __init__(self, model=CHAT_MODEL):
        self.model = model

    def rank(self, question, candidate_answers, candidate_contexts):
        print("[RankingAgent] All candidate contexts and answers:")
        for idx, (ctx, ans) in enumerate(zip(candidate_contexts, candidate_answers), 1):
            print(f"\nCandidate #{idx} Context:")
            for chunk in ctx:
                print(chunk)
            print(f"Candidate #{idx} Answer: {ans}\n---")
        ranking_prompt = f"..."  # detailed LLM prompt
        summary = "..."  # aggregate summaries
        full_prompt = ranking_prompt + summary
        resp = openai.chat.completions.create(
            model=self.model,
            messages=[{"role": "system", "content": full_prompt}],
            temperature=0.2,
            max_tokens=350
        )
        response_text = resp.choices[0].message.content.strip()
        print("\n[RankingAgent] LLM Decision and Reason:\n" + response_text)
        import re
        m = re.search(r"Candidate #(\d+)\s*\nReason:([^\n]*)\n+Best Answer:\n(.+)", response_text, re.DOTALL)
        if m:
            cand_idx = int(m.group(1)) - 1
            reason = m.group(2).strip()
            answer = m.group(3).strip()
            print(f"Selected candidate #{cand_idx+1}. Reason: {reason}")
        else:
            cand_idx = 0
            answer = candidate_answers[0]
            print("Could not parse ranking output, returning first candidate.")
        return answer, cand_idx

  • Prints all candidates and their contexts for review/debugging
  • Composes a system prompt that asks the LLM to pick the best answer and explain its choice.
  • Parses the response to find both the chosen answer and the LLM’s stated rationale.
  • Falls back to first candidate if the reasoning cannot be parsed.
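The ranking_prompt and summary strings are elided ("...") in the snippet above; the full text is in the repo. As a purely hypothetical sketch of what they could look like, given the output format the regex expects (my own wording, not the author's actual prompt):

ranking_prompt = (
    "You are a strict evaluator. Given a question and several candidate answers, "
    "each produced from a different retrieved context, pick the single best answer.\n"
    "Respond exactly in this format:\n"
    "Candidate #<number>\nReason: <one line>\n\nBest Answer:\n<answer text>\n\n"
    f"Question: {question}\n\n"
)
summary = "".join(
    f"Candidate #{i}:\nContext (truncated): {' '.join(ctx)[:500]}\nAnswer: {ans}\n\n"
    for i, (ctx, ans) in enumerate(zip(candidate_contexts, candidate_answers), 1)
)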

RAGOrchestrator

class RAGOrchestrator:
    def __init__(self, n_candidates=3, k=5):
        self.loader = PDFLoaderAgent()
        self.embedder = EmbeddingAgent()
        self.text_chunks = []
        self.retriever = None
        self.qa = QAAgent()
        self.ranker = RankingAgent()
        self.n_candidates = n_candidates
        self.k = k

    def ingest(self, pdf_path):
        self.text_chunks = self.loader.load_and_split(pdf_path)
        self.embedder.add_to_index(self.text_chunks)
        self.retriever = RetrievalAgent(self.embedder.index)

    def query(self, question):
        candidates = self.retriever.retrieve_candidates(
            question, self.text_chunks, n_candidates=self.n_candidates, k=self.k)
        candidate_answers = self.qa.answer_parallel(question, candidates)
        final_answer, chosen_idx = self.ranker.rank(question, candidate_answers, candidates)
        return final_answer

  • Central orchestrator for the whole agentic RAG process.
  • Handles ingestion (loading/chunking PDFs and building index) as well as full agentic query flow.
  • When query() is called, runs parallel retrievals, parallel QA, then reranking—then returns the final chosen answer.
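Putting it all together, a minimal driver could look like this (the PDF path and question are placeholders):

if __name__ == "__main__":
    orchestrator = RAGOrchestrator(n_candidates=3, k=5)
    orchestrator.ingest("sample.pdf")  # placeholder PDF path
    final = orchestrator.query("What is the main topic of the document?")
    print("\nFinal answer:\n" + final)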

Output

Here is a screenshot of the results.

Here is a video demo of the application running. You can download the complete code from my GitHub to play around with it.

In the video, you can see how the agents work together to fetch various candidate responses and pick the best response.

Summary

There you go. I hope it is now clear how agentic RAG is really different. In this tutorial, we explored the agentic RAG pipeline with reranking, highlighting how it extends classic RAG by introducing orchestration and LLM-based meta-evaluation to improve answer quality and robustness.

Key takeaways:

  • Agentic Design: The process is broken into modular agents — PDFLoader, Embedding, Retrieval, QA, and Ranking — coordinated by the RAGOrchestrator, making it easy to swap, tune, or extend individual components.
  • Parallelism & Diversity: Multiple perturbed retrievals generate diverse context sets, and parallel QA agents produce several candidate answers, reducing single-point failures in retrieval or generation.
  • Reranking for Quality: A final LLM-based RankingAgent reviews all candidates and selects the best answer with a clear rationale, increasing accuracy and transparency.
  • Customization & Extensibility: You can adjust hyperparameters (e.g., chunk size, n_candidates, embedding models), introduce new loaders or ranking strategies, and build user interfaces to inspect intermediate steps.
  • Practical Benefits: By combining retrieval diversity with a meta-evaluator, agentic RAG produces more reliable, grounded, and transparent answers — ideal for applications requiring high accuracy and auditability.

Hope this was useful. I am already working on Part 3 of this blog, where I am trying to integrate this with MCP servers to make it even richer. In the meantime, you can download the code and play around with it here. :-) See you soon.


Written by A B Vijay Kumar

IBM Fellow, Master Inventor, Agentic AI, GenAI, Hybrid Cloud, Mobile, RPi Full-Stack Programmer
