Generative AI Series

Retrieval Augmented Generation (RAG) — Chatbot for documents with LlamaIndex

Implement the RAG technique using LangChain and LlamaIndex for a conversational chatbot on a document.

A B Vijay Kumar


This blog is part of an ongoing series on Generative AI and is a continuation of the previous post, which covered the RAG pattern and how RAG is used to augment prompts and enhance the content and context of an LLM with specific data.


I covered the RAG pattern in my previous blog, “Prompt Engineering: Retrieval Augmented Generation (RAG)”. Though it's not strictly a prompt engineering technique, RAG is used to enhance prompts with use-case-specific data/context: relevant content is retrieved using vector embeddings and passed as context to the LLM, so that the LLM can generate use-case-specific responses. Please read that blog for a better understanding.

RAG implementations have advanced since my previous blog, so I thought it would be a good idea to talk about LlamaIndex, which has become one of the most popular frameworks for connecting custom data sources to LLMs and querying that custom data.


LlamaIndex is a framework that allows us to ingest custom content/data and query that content. This is a very common requirement in enterprise use cases, where the LLM is not necessarily trained or fine-tuned on the use-case-specific content/data/documents.

For example, I work with various clients who have custom documents, such as SOPs (Standard Operating Procedures) that they apply during production incidents/tickets/events. In such cases, it is easy to ingest terabytes of Word and PDF documents and give the engineer a bot that can be used to query those documents, and even to automate that with LLM agents that retrieve the appropriate content based on the incident and context, as part of ChatOps. I will soon author a few blogs on how we are using LLMs in app/platform dev/ops.

In short, LlamaIndex provides the libraries needed to create advanced RAG applications.


The following picture shows a high-level view of how LlamaIndex works.

Let's try to understand this diagram better.

  • LlamaIndex provides a framework to ingest various types of content, such as documents, databases, and APIs. This makes it a powerful framework for building LLM applications that combine multiple types of content and need to produce an integrated response to your queries.
  • LlamaIndex has two major phases: loading/indexing and querying.
  • In the loading and indexing phase, the ingested documents are broken down into chunks of content. These chunks are converted to embeddings using an embedding model. This creates a vector representation of the content, with similar content mapped closer together in a multi-dimensional space. These vectors are stored in a vector DB (we can also plug in custom vector databases such as Pinecone), and LlamaIndex also stores the index for faster semantic search.
  • When a query is issued, the query is converted to an embedding vector, and a semantic search is performed on the vector database to retrieve the most similar content, which serves as the context for the query. This context is then passed to the large language model to generate a response.
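The query phase above can be illustrated with a toy in-memory vector store. The bag-of-words embed() function below is only a stand-in for a real embedding model (which would come from OpenAI or a similar provider), but the retrieval logic, embedding the query and ranking stored chunks by cosine similarity, is the same idea:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words frequency vector.
    A real system would call an embedding model instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Loading/indexing" phase: chunk the content and store each chunk's vector.
chunks = [
    "To format the memory card open the setup menu",
    "The shutter speed can be changed with the main dial",
    "Battery life depends on temperature and usage",
]
vector_store = [(chunk, embed(chunk)) for chunk in chunks]

# "Query" phase: embed the query and retrieve the closest chunk.
query = "how do I format the memory card"
q_vec = embed(query)
best_chunk, _ = max(vector_store, key=lambda item: cosine(q_vec, item[1]))
print(best_chunk)  # → "To format the memory card open the setup menu"
```

The retrieved chunk is what gets passed to the LLM as context alongside the original question.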

Please read my blog on RAG for more details.


Let's implement a Q&A chatbot using LlamaIndex to query a document

Setup Environment

I normally create a separate virtual environment for each project, to make sure I have the correct configuration and environment to run the application and to avoid any version conflicts.

The following commands set up the environment:

python3 -m venv ./llamaindex-venv    # set up a virtual environment
source ./llamaindex-venv/bin/activate    # activate the environment
pip install -r requirements.txt    # install the required Python libraries

The following shows my requirements.txt, which lists all the libraries we will be using to implement this application.
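The original listing is an image; a minimal requirements.txt for this application would plausibly look like the following (package names are the standard pip names; this assumes the pre-0.10 llama-index package, and you may want to pin versions):

```text
streamlit
python-dotenv
llama-index
openai
pypdf
```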


We need to create a folder where we can put all our documents. I am calling it “documents”. You can name your folder anything, but make sure you update the folder name/path in the code accordingly.

Copy the files into the folder. You can use any PDF files you want, but I am using the Canon EOS R manual for this example, which I downloaded from the Canon user manual site.

Application code

In the following code, we are importing all the libraries we need.

  • streamlit: The Streamlit library, which we will use to build our Streamlit application
  • os: We need this library to interact with the operating system, to check whether the folders/files exist
  • dotenv: We will store our OpenAI API key in a .env file and will use this library to load the environment variables
  • VectorStoreIndex: The most important object, which we will use to build and query the index over our documents
  • SimpleDirectoryReader: Automatically picks the right document reader based on the document type. We will use its load_data() method, which extracts the content and converts it into a list of Document objects
  • load_index_from_storage: Used to load a previously saved index from storage
  • StorageContext: The container for nodes, indices, and vectors, used across the framework
  • ServiceContext: A utility class that holds contextual configuration such as the LLM, prompt helper, etc.
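Taken together (and assuming a pre-0.10 llama-index release, where these classes are exported from the package top level), the import section would look roughly like this:

```python
import os

import streamlit as st
from dotenv import load_dotenv
from llama_index import (
    ServiceContext,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
```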

Please refer to the official documentation for more details

Let's go through the following code

Line 15: We load the .env file, which contains the OPENAI_API_KEY environment variable set to my OpenAI key.

Line 16: We set the location of the folder where we want to store our vector DB. In this case, we are implementing a simple vector DB that is stored on local disk. A more scalable solution is to use a SaaS offering such as Pinecone.

Line 17: We set the location of the documents folder (“./documents”) where we will store all the documents to be ingested.

Lines 18–30: We initialize the vector store. If we have already ingested the documents into the index, there is no need to perform the indexing and storing again. The initialize() method checks whether the index already exists; if it does, it is loaded from the existing store, otherwise a new index is created and persisted.
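The code listing itself is an image, but based on the description above, initialize() would look roughly like the following sketch. It assumes a pre-0.10 llama-index release and an OPENAI_API_KEY in .env; the folder names PERSIST_DIR and DOCUMENTS_DIR are illustrative:

```python
import os

from dotenv import load_dotenv
from llama_index import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

load_dotenv()  # loads OPENAI_API_KEY from .env

PERSIST_DIR = "./vector-db"     # where the local vector store is persisted
DOCUMENTS_DIR = "./documents"   # where the source documents live

def initialize() -> VectorStoreIndex:
    if os.path.exists(PERSIST_DIR):
        # Index already persisted: load it instead of re-ingesting.
        storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        return load_index_from_storage(storage_context)
    # First run: read the documents, build the index, and persist it.
    documents = SimpleDirectoryReader(DOCUMENTS_DIR).load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
    return index
```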

Index: One of the key features of LlamaIndex is the way it organizes the ingested content into indexes; it uses these indexes to answer queries. The ingested content is broken into chunks, sometimes referred to as nodes. These nodes can be indexed as lists, trees, keywords, or vectors, and all the content in the nodes (chunks) is stored as vector embeddings. The chunking is very important, as all LLMs have a token limit during inference. To work within it, LlamaIndex queries the LLM with these various chunks in a particular sequence and refines the response, as it is practically impossible to pass all the document content in one inference request.
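The chunking step can be sketched as follows. The fixed-size, overlapping character split below is only illustrative; the real splitters in LlamaIndex are token- and sentence-aware, but the idea of overlap (so content cut at a boundary still appears whole in at least one chunk) is the same:

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into fixed-size character chunks with overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "a" * 120
chunks = chunk_text(doc, chunk_size=50, overlap=10)
print(len(chunks))      # → 3 nodes produced
print(len(chunks[0]))   # → 50, each chunk is at most chunk_size characters
```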

This method will create the following files

  • default__vector_store.json: Stores all the vector embeddings, as a dictionary of embeddings
  • docstore.json: Stores the document metadata and the document chunks
  • index_store.json: Stores all the index metadata
  • graph_store.json: Used for storing graphs; we will discuss this in later blogs, where we will ingest a graph DB
  • image__vector_store.json: Used to store the embeddings of image content; we will discuss how to ingest images in later blogs

Lines 34–56: We create a Streamlit application with a chat-style interface. We store the messages in st.session_state (lines 36, 43), and these are printed in the main window as a chat (lines 45–46). We capture the prompt given in st.chat_input() (lines 42–44) and call the chat_engine (line 52) that we created on line 40. LlamaIndex provides a convenient function to create a chat engine from the index; it takes care of all the complexity of doing the RAG and calling the appropriate LLM to get the response.
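Again, the original listing is an image; a minimal version of this chat loop, assuming an index built as in the initialization step described above and the legacy as_chat_engine() API, might look like this:

```python
import streamlit as st

# `index` is assumed to come from the initialization step described
# earlier, e.g. index = initialize().

st.title("Document Q&A")  # illustrative title

# Create the chat engine once and keep it in the session state.
if "chat_engine" not in st.session_state:
    st.session_state.chat_engine = index.as_chat_engine()
if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far.
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])

# Capture a new prompt, answer it with RAG, and record both turns.
if prompt := st.chat_input("Ask a question about the document"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)
    response = st.session_state.chat_engine.chat(prompt)
    with st.chat_message("assistant"):
        st.write(response.response)
    st.session_state.messages.append(
        {"role": "assistant", "content": response.response}
    )
```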

Running the Application

Since it's a Streamlit application, we run it with the following command, passing the application's file name:

streamlit run

This will launch the browser with the chat application. The following screenshot shows the output.

On the console, you should be able to see the predictions and the similarity scores for the top two options. This is printed on the console by pprint_response(response, show_source=True) (line 54).


There you go… As you can see, it is super easy to implement complex RAG patterns using LlamaIndex. Please leave your feedback and comments. I always learn from hearing from you all…

You can find the complete code in my GitHub here.

We are just scratching the surface. I will be blogging about more features in future blogs. Until then, stay safe and have fun… ;-)




A B Vijay Kumar

IBM Fellow, Master Inventor, Mobile, RPi & Cloud Architect & Full-Stack Programmer