Monday, November 3, 2025

Building a Web-Crawling RAG Chatbot Using LangChain, Supabase, and Gemini


Have you ever wanted to create an AI chatbot that answers questions using your own website or documentation?

In this guide, we’ll build exactly that — a Retrieval-Augmented Generation (RAG) chatbot that crawls a website, stores the extracted text in a Supabase vector database, and uses Google’s Gemini LLM to answer user questions based on that knowledge.

We’ll use LangChain for orchestration, Sentence Transformers for embeddings, and Supabase as our persistent vector store. The result is a flexible, production-ready setup for any custom knowledge chatbot.


🧠 What You’ll Learn

By the end of this post, you’ll understand:

  • What RAG (Retrieval-Augmented Generation) means and why it’s important.

  • How to crawl a website and extract meaningful text.

  • How to create and store embeddings in a Supabase vector database.

  • How to use LangChain and Gemini to build a question-answering chatbot.


⚙️ What is RAG?

Retrieval-Augmented Generation (RAG) is an AI technique that combines two powerful components:

  1. Retrieval – Fetch relevant information from an external knowledge base (like your documents or website).

  2. Generation – Use an LLM to generate an answer using that retrieved context.

This solves one major limitation of language models: they don’t know your private or recent data.
RAG bridges that gap by giving the model real-world context at query time.
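To make this concrete, here's a tiny, library-free sketch of the two steps. The keyword-overlap retriever below is only a stand-in for the vector similarity search we build later in this post:

# Minimal RAG sketch: retrieve relevant documents, then generate from them.
def retrieve(query, knowledge_base, k=2):
    # Real systems rank by vector similarity; word overlap is a stand-in here.
    query_words = set(query.lower().split())
    return sorted(knowledge_base,
                  key=lambda doc: len(query_words & set(doc.lower().split())),
                  reverse=True)[:k]

def generate(query, context):
    # A real system would send this prompt to an LLM instead of returning it.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

kb = ["n8n is a workflow automation tool.", "Supabase is a Postgres platform."]
print(generate("What is n8n?", retrieve("What is n8n?", kb)))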


🧩 Tech Stack Overview

Let’s look at the core tools we’re using:

🦜 LangChain

LangChain is a framework that simplifies building LLM-powered applications.
It handles:

  • Document loading and text splitting

  • Embedding and vector store management

  • Chain creation for RAG workflows

You can think of LangChain as the “glue” that connects all the components.
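For example, chunking a long document takes only a few lines (the chunk sizes here are illustrative; tune them for your content):

from langchain_text_splitters import RecursiveCharacterTextSplitter

long_text = " ".join(["n8n lets you automate workflows."] * 200)  # any large string

# Splits on paragraph and sentence boundaries where possible; the overlap
# preserves context that would otherwise be lost at chunk edges.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
print(len(text_splitter.split_text(long_text)))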


🧱 Supabase

Supabase is an open-source alternative to Firebase — with a PostgreSQL backend and a built-in vector database extension (pgvector).
We’ll use Supabase to store and retrieve text embeddings, turning it into our chatbot’s memory.
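The code later in this post assumes a supabase client object. With the official supabase-py package, it is created like this (using the credentials we export in the setup section below):

import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])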


🧩 Sentence Transformers (Embeddings)

Before an LLM can understand and retrieve text efficiently, we need to convert text into embeddings — numerical representations of meaning.
We’ll use the Hugging Face model sentence-transformers/all-MiniLM-L6-v2, a lightweight and accurate embedding model that produces 384-dimensional vectors.

These embeddings are stored in Supabase and later retrieved using similarity search.
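A quick sanity check shows what an embedding looks like in practice:

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector = embeddings.embed_query("How do I create a workflow in n8n?")
print(len(vector))  # 384, matching the vector(384) column we create in Supabase
print(vector[:3])   # the first few of the 384 floats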


🤖 Gemini LLM

Gemini is Google’s flagship family of large language models, with strong reasoning and instruction-following capabilities.
We’ll use the Gemini API via LangChain’s Google Generative AI integration to answer questions based on the retrieved knowledge.
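Here's a minimal connection test. The model name is an assumption; substitute any Gemini model available to your API key:

import os
from langchain_google_genai import ChatGoogleGenerativeAI

# We pass the key explicitly because this guide exports it as GEMINI_API_KEY.
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash",
                             google_api_key=os.environ["GEMINI_API_KEY"])
print(llm.invoke("Reply with one short sentence.").content)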


🗂️ Project Structure

We’ll split the code into two scripts for clarity and scalability:

project/
├── crawl_and_store.py   # Crawl a website and store embeddings in Supabase
└── chatbot_rag.py       # Query the Supabase vector store using Gemini



🕷️ Part 1: Crawling and Storing Data

The first script crawls a documentation website (like https://docs.n8n.io/), extracts readable content, splits it into chunks, creates embeddings, and stores them in Supabase.

Here’s the key flow:

  1. Crawl and parse pages using BeautifulSoup

  2. Split long text into smaller chunks (using LangChain’s text splitter)

  3. Convert chunks into embeddings using Sentence Transformers

  4. Upload to Supabase as a persistent vector database

Code: crawl_and_store.py

# Crawl -> Split -> Embed -> Store
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import SupabaseVectorStore

docs = crawl_website(BASE_URL, max_pages=100)  # custom crawler, sketched below
chunks = text_splitter.split_documents(docs)   # LangChain text splitter from earlier
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
SupabaseVectorStore.from_documents(chunks, embeddings, client=supabase, table_name="documents")
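The snippet assumes a crawl_website helper defined earlier in the script. A minimal version (a breadth-first, same-domain walk; the full script on GitHub may differ) could look like this:

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from langchain_core.documents import Document

def crawl_website(base_url, max_pages=100):
    """Breadth-first crawl of a single domain, returning LangChain Documents."""
    seen, queue, docs = {base_url}, deque([base_url]), []
    domain = urlparse(base_url).netloc
    while queue and len(docs) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable pages
        soup = BeautifulSoup(html, "html.parser")
        text = soup.get_text(separator=" ", strip=True)
        docs.append(Document(page_content=text, metadata={"source": url}))
        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"]).split("#")[0]
            if urlparse(next_url).netloc == domain and next_url not in seen:
                seen.add(next_url)
                queue.append(next_url)
    return docs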

After running this, your website’s content is now searchable via vector similarity — perfect for question answering!


💬 Part 2: Building the Chatbot

Once data is in Supabase, the chatbot script retrieves the most relevant chunks based on user questions and uses Gemini to generate accurate answers.

Key steps:

  1. Connect to Supabase and load the stored vector data

  2. Retrieve relevant chunks for each user query

  3. Pass the retrieved context + query to Gemini LLM

  4. Generate and display the final answer

Code: chatbot_rag.py

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import PromptTemplate
from langchain_community.vectorstores import SupabaseVectorStore

# supabase, embeddings, and llm are created as in the earlier sections.
vector_db = SupabaseVectorStore(client=supabase, embedding=embeddings, table_name="documents")
retriever = vector_db.as_retriever()

prompt = PromptTemplate.from_template(
    "You are a helpful AI assistant. Use the following context to answer:\n\n"
    "Context:\n{context}\n\nQuestion: {input}\nAnswer:"
)

# Stuff the retrieved chunks into the prompt, then let Gemini answer.
rag_chain = create_retrieval_chain(retriever, create_stuff_documents_chain(llm, prompt))
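The chain is invoked with a dict containing an "input" key, and its result dict exposes the generated text under "answer". A simple interactive loop ties it together:

print("🤖 Chatbot is ready! Type 'exit' to quit.")
while True:
    question = input("Enter your question: ")
    if question.strip().lower() == "exit":
        break
    result = rag_chain.invoke({"input": question})
    print("Answer:", result["answer"])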


When you run it:

python chatbot_rag.py


You’ll get:

🤖 Chatbot is ready! Type 'exit' to quit.

Enter your question: What is n8n used for?

Answer: n8n is an open-source workflow automation tool that lets you connect apps and services using visual workflows.



🧰 Setting Up the Environment

Before running the code, install dependencies:

pip install requests beautifulsoup4 supabase langchain langchain-community langchain-huggingface langchain-google-genai


Then set your environment variables:

export SUPABASE_URL="https://YOUR_PROJECT.supabase.co"
export SUPABASE_SERVICE_KEY="YOUR_SERVICE_ROLE_KEY"
export GEMINI_API_KEY="YOUR_GEMINI_API_KEY"


Create a table and a similarity-search function in Supabase (the pgvector extension must be enabled first):

create extension if not exists vector;

create table documents (
  id uuid primary key default gen_random_uuid(),
  content text,
  metadata jsonb,
  embedding vector(384)
);

create or replace function match_documents(
  query_embedding vector(384),
  match_count int default 5
)
returns table (
  id uuid,
  content text,
  metadata jsonb,
  embedding vector(384),
  similarity float
)
language plpgsql
as $$
begin
  return query
  select
    documents.id,
    documents.content,
    documents.metadata,
    documents.embedding,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  order by documents.embedding <=> query_embedding
  limit match_count;
end;
$$;


🧩 How It All Fits Together

Here’s a quick overview of the workflow:

┌────────────────────────────┐
│   Crawl website content    │
└────────────┬───────────────┘
             │
             ▼
┌────────────────────────────┐
│   Split text into chunks   │
└────────────┬───────────────┘
             │
             ▼
┌────────────────────────────┐
│  Convert text → embeddings │
└────────────┬───────────────┘
             │
             ▼
┌────────────────────────────┐
│   Store embeddings in DB   │
└────────────┬───────────────┘
             │
             ▼
┌────────────────────────────┐
│  Query with Gemini via RAG │
└────────────────────────────┘



🌟 Why This Setup Is Powerful

  ✅ Crawl any website — works for docs, blogs, or internal knowledge bases

  ✅ Store and query securely — Supabase handles authentication and persistence

  ✅ Re-use embeddings — no need to reprocess every time

  ✅ Plug in any LLM — Gemini today, GPT or Claude tomorrow

  ✅ Scalable architecture — separate crawling and chatbot stages


🧠 Possible Improvements

  • Add automatic table creation (if it doesn’t exist).

  • Store and update embeddings incrementally.

  • Add a web UI using Streamlit or Gradio.

  • Implement multi-website knowledge sources.


🏁 Final Thoughts

This project shows how easily you can combine LangChain, Supabase, and Gemini to build a real-world AI assistant for your own content.

With just a few hundred lines of Python, you’ve created:

  • A custom knowledge crawler

  • A vector search system

  • An intelligent chatbot that understands your data

Whether you’re building internal support bots, documentation assistants, or smart data interfaces — this architecture is a solid starting point.

💻 Source Code:
You can find the complete code on GitHub:
👉 https://github.com/rajamanickam/rag-langchain

Check out the LangChain Tutorial book here.

For one-on-one coaching on RAG and LangChain, read the details here.
