Saturday, March 1, 2025

How to Build a Simple Retrieval-Augmented Generation (RAG) System with LangChain


Retrieval-Augmented Generation (RAG) is a powerful technique that combines retrieval-based search with a generative AI model. It allows an AI assistant to fetch relevant information from a knowledge base before generating a response, making it more accurate and informative. In this post, we'll walk through building a simple RAG pipeline using LangChain and Hugging Face models.


What is RAG?

RAG improves upon standard AI models by retrieving relevant documents from a database before generating a response. This helps the model provide more contextually accurate answers, especially when dealing with specialized knowledge.

Key Components of a RAG System

  1. Document Storage: A collection of text documents containing useful information.

  2. Text Splitting: Divides large documents into smaller, manageable chunks.

  3. Embedding Model: Converts text chunks into numerical vectors for efficient retrieval.

  4. Vector Database: Stores embeddings and allows similarity searches.

  5. Retriever: Fetches the most relevant document chunk for a query.

  6. LLM (Large Language Model): Generates answers using the retrieved context.
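
To see how these pieces fit together before we wire them up with LangChain, here is a minimal conceptual sketch of the retrieve-augment-generate loop. It assumes a vector_db and llm like the ones we build in the steps below; in practice the LangChain chain handles this plumbing for us.

def answer_with_rag(question, vector_db, llm):
    # 1. Retrieve: find the chunk(s) most similar to the question
    relevant_chunks = vector_db.similarity_search(question, k=1)
    # 2. Augment: stuff the retrieved text into the prompt
    context = "\n".join(chunk.page_content for chunk in relevant_chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    # 3. Generate: let the LLM answer using that context
    return llm.invoke(prompt)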





Step-by-Step Implementation

Step 1: Install Required Libraries

Before we begin, ensure you have the required Python packages installed:

pip install langchain langchain-community langchain-huggingface sentence-transformers faiss-cpu

Step 2: Load and Prepare Documents

We start by defining a set of documents containing useful information:

# Mostly filler text with one useful fact buried in each document,
# to simulate long real-world documents.
filler = "blah blah blah blah blah blah. blah blah blah blah blah blah. "

documents = [
    filler * 2 + "Vitamin C helps boost immunity. " + filler * 2,
    filler * 3 + "Exercise improves mental and physical health. " + filler * 4,
    filler * 5 + "Drinking enough water keeps you hydrated and improves focus. " + filler * 9,
]

Since documents are usually long, we need to split them into smaller chunks.

Step 3: Convert Documents into Chunks

from langchain.text_splitter import CharacterTextSplitter

# Split on the "." separator into chunks of roughly 60 characters, with a 10-character overlap
text_splitter = CharacterTextSplitter(chunk_size=60, chunk_overlap=10, separator=".")
chunks = text_splitter.create_documents(documents)

This ensures that each chunk is of a manageable size while maintaining some overlap for context.
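
If you want to check how the splitting turned out, you can print the resulting chunks (each one is a LangChain Document with a page_content attribute):

print(f"Number of chunks: {len(chunks)}")
for chunk in chunks[:3]:
    print(repr(chunk.page_content))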

Step 4: Create Embeddings and Store in Vector Database

We now convert these text chunks into embeddings and store them in FAISS (Facebook AI Similarity Search), a library for efficient vector similarity search.

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_db = FAISS.from_documents(chunks, embeddings)

The Hugging Face embedding model converts each chunk into a vector representation, making it searchable.
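
As a quick sanity check, you can query the vector store directly before wiring in the LLM. This should return the chunk containing the exercise fact:

results = vector_db.similarity_search("How does exercise affect health?", k=1)
print(results[0].page_content)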

Step 5: Set Up the Chat Model and Retriever

from langchain_huggingface import HuggingFaceEndpoint

# Note: HuggingFaceEndpoint calls the Hugging Face Inference API, so you need a
# Hugging Face API token (e.g. via the HUGGINGFACEHUB_API_TOKEN environment variable).
llm = HuggingFaceEndpoint(repo_id="HuggingFaceH4/zephyr-7b-alpha")
retriever = vector_db.as_retriever(search_kwargs={"k": 1})  # Retrieve only the most relevant chunk

The retriever will find the most relevant document chunk for each query.
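
You can also try the retriever on its own to see which chunk it returns for a sample query (on recent LangChain versions retrievers are runnables, so invoke works; older versions use get_relevant_documents instead):

docs = retriever.invoke("How does exercise affect health?")
for doc in docs:
    print(doc.page_content)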

Step 6: Define the RAG Chain

We define a prompt template to instruct the LLM on how to use the retrieved context.

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate.from_template(
    "You are a helpful AI assistant. Based on the following retrieved context, answer the question concisely.\n\n"
    "Context:\n{context}\n\n"
    "Question: {input}\n"
    "Answer:"
)

stuff_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, stuff_chain)

Here, we use create_stuff_documents_chain to insert ("stuff") the retrieved documents into the prompt before passing it to the LLM.

Step 7: Query and Get a Response

query = "How does exercise affect health?"
response = rag_chain.invoke({"input": query})
print("Final Answer:", response["answer"])

Now, when you ask a question, the retriever fetches the most relevant document chunk, and the LLM generates an informed answer based on that context.
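
The response from create_retrieval_chain is a dictionary that also carries the retrieved documents under the "context" key, so you can check which chunk the answer was grounded in:

for doc in response["context"]:
    print("Retrieved chunk:", doc.page_content)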



Debugging: What is Being Sent to the LLM?

To inspect what is being sent to the LLM, we can print the final prompt:

def debug_rag_chain(input_query):
    retrieved_docs = retriever.get_relevant_documents(input_query)
    retrieved_text = "\n".join([doc.page_content for doc in retrieved_docs])
    formatted_prompt = prompt.format(context=retrieved_text, input=input_query)
    print("\n===== DEBUG: FINAL PROMPT SENT TO LLM =====\n")
    print(formatted_prompt)
    print("\n==========================================\n")

# Test Debugging
debug_rag_chain("How does exercise affect health?")

This helps us understand how the retrieved documents influence the final answer.


Conclusion

By following these steps, we've successfully built a simple RAG (Retrieval-Augmented Generation) system using LangChain and Hugging Face. This approach allows us to retrieve relevant information before generating an answer, leading to more accurate and informed responses.

I put this code on GitHub here.

Key Takeaways:

RAG improves AI-generated responses by retrieving relevant documents.
We used FAISS as our vector store for efficient document retrieval.
We limited retrieval to the most relevant document chunk for better accuracy.
Debugging helps understand what the LLM is processing.

AI Course | Bundle Offer (including RAG ebook) | RAG Kindle Book | Master RAG

Contact me (rajamanickam.a@gmail.com) for AI development or one-on-one coaching to learn AI (RAG, Computer Vision, etc.) at affordable hourly rates.


