1. Why do LLMs “hallucinate” without RAG?
LLMs generate answers based on patterns learned during training, not from live or verified sources. When knowledge is missing or ambiguous, the model guesses. RAG grounds the model by injecting real documents at inference time, reducing hallucinations.
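To make that concrete, here is a minimal sketch of context injection. The `retrieve` call and `llm.generate` client are hypothetical placeholders, not a specific library:

```python
# Minimal RAG flow: retrieve relevant chunks first, then let the model
# answer from that text instead of from training data alone.

def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Inject retrieved documents into the prompt at inference time."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# chunks = retrieve(question)            # vector search (hypothetical helper)
# answer = llm.generate(build_grounded_prompt(question, chunks))
```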
2. Is RAG a replacement for fine-tuning?
No. RAG and fine-tuning solve different problems.
- Fine-tuning changes how the model behaves
- RAG changes what the model knows at runtime
In practice, the best systems use both together.
3. What exactly is retrieved in a RAG system?
Not full documents. RAG retrieves small text chunks (usually 200–1,000 tokens) that are semantically closest to the user’s query, based on vector similarity.
4. Why can’t we just search with keywords instead of embeddings?
Keyword search matches words.
Embedding search matches meaning.
For example, “heart attack” can retrieve documents mentioning “myocardial infarction” even if the exact words don’t match.
5. How do embeddings “understand” meaning?
Embeddings convert text into high-dimensional vectors where:
- Similar meanings → closer vectors
- Different meanings → distant vectors
This geometry allows semantic retrieval instead of literal matching.
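A small NumPy sketch of that geometry, using made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity in [-1, 1]; closer to 1 means closer in meaning."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" (illustrative values only).
heart_attack = np.array([0.9, 0.1, 0.3])
myocardial_infarction = np.array([0.85, 0.15, 0.35])  # near-synonym
stock_market = np.array([-0.2, 0.8, 0.1])             # unrelated topic

print(cosine_similarity(heart_attack, myocardial_infarction))  # high (~0.99)
print(cosine_similarity(heart_attack, stock_market))           # low
```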
6. What happens if the retrieved context is wrong?
The LLM will confidently generate a wrong answer.
This is why retrieval quality is more important than model size in RAG systems.
7. How many documents should RAG retrieve per query?
Typical values:
- 3–5 chunks for precise QA
- 5–10 chunks for complex reasoning
Too many chunks increase noise and token cost.
8. Why does chunk size matter so much?
- Small chunks → better precision, less context
- Large chunks → more context, lower precision
There is no universal size—chunking must match document structure and use case.
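As a baseline, a fixed-size chunker with overlap looks roughly like this (token counts are approximated by word counts here for simplicity; a real pipeline would use the embedding model's tokenizer):

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into fixed-size word windows with overlap, so facts that
    straddle a boundary still appear intact in at least one chunk.
    Assumes chunk_size > overlap."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]
```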
9. Can RAG work with structured data like tables or CSVs?
Yes, but with preprocessing:
- Convert rows to readable text
- Add metadata (columns, source, timestamps)
- Use hybrid retrieval (vector + filters)
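A rough sketch of the row-to-text step, assuming a plain CSV file (the output dict shape is illustrative, not a specific database schema):

```python
import csv

def csv_rows_to_chunks(path: str) -> list[dict]:
    """Turn each CSV row into a readable line of text plus metadata,
    so both vector search and filters can use it."""
    chunks = []
    with open(path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            text = "; ".join(f"{col}: {val}" for col, val in row.items())
            chunks.append({
                "text": text,  # this string is what gets embedded
                "metadata": {"source": path, "row": i, "columns": list(row)},
            })
    return chunks
```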
10. What is “metadata filtering” in RAG?
Metadata allows you to restrict retrieval by:
- Date
- Document type
- User role
- Language
This dramatically improves relevance and security.
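A minimal sketch of pre-filtering, assuming each chunk dict carries a `metadata` dict and a normalized embedding under `vector` (both field names are illustrative):

```python
import numpy as np

def filter_then_search(chunks: list[dict], query_vec: np.ndarray,
                       allowed: dict, top_k: int = 5) -> list[dict]:
    """Apply metadata filters BEFORE similarity ranking, so retrieval never
    considers documents outside the allowed date/type/role/language scope."""
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in allowed.items())
    ]
    candidates.sort(
        key=lambda c: float(np.dot(query_vec, c["vector"])),  # normalized vectors
        reverse=True,
    )
    return candidates[:top_k]

# e.g. filter_then_search(chunks, qv, {"language": "en", "doc_type": "policy"})
```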
11. How does RAG help with data privacy?
Your private data:
- Is not used to train the LLM
- Stays inside your vector database
- Is retrieved only when needed
This makes RAG ideal for enterprise and internal knowledge systems.
12. Can RAG provide citations or sources?
Yes. If you store document IDs or URLs as metadata, the system can return answer + source references, increasing trust and auditability.
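A sketch of how that wiring might look, assuming each chunk stores `url` or `doc_id` in its metadata and a hypothetical `llm.generate` client:

```python
def answer_with_sources(question: str, retrieved: list[dict]) -> dict:
    """Build a prompt from retrieved chunks and collect their source
    references from metadata, so the answer can be audited."""
    context = "\n\n".join(f"[{i + 1}] {c['text']}" for i, c in enumerate(retrieved))
    sources = [c["metadata"].get("url") or c["metadata"].get("doc_id")
               for c in retrieved]
    prompt = (
        "Answer from the context and cite supporting passages as [n].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # answer = llm.generate(prompt)   # hypothetical client
    return {"prompt": prompt, "sources": sources}
```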
13. Why do some RAG systems still hallucinate?
Common reasons:
- Poor chunking
- Irrelevant retrieval
- Missing documents
- Prompt not enforcing “answer only from context”
RAG reduces hallucinations—it doesn’t eliminate them automatically.
14. What is “context window” and why does it limit RAG?
LLMs can only process a limited number of tokens per request. Retrieved content must fit inside this window, forcing trade-offs between depth and breadth.
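One common workaround is greedy packing: keep the highest-ranked chunks until the budget is spent. A rough sketch (the words-to-tokens ratio is a crude heuristic; use a real tokenizer in production):

```python
def pack_to_budget(ranked_chunks: list[str], budget_tokens: int) -> list[str]:
    """Greedily keep the highest-ranked chunks that still fit the window.
    Assumes chunks arrive sorted by relevance, best first."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = int(len(chunk.split()) * 1.33)  # ~1.33 tokens per word (rough)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```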
15. How is RAG different from search + LLM?
Search + LLM:
- Search returns links
- LLM answers separately

RAG:
- Retrieval and generation are tightly integrated
- The model reasons directly over retrieved text
16. Does RAG require a vector database?
Practically, yes. While embeddings can be stored elsewhere, vector databases are optimized for:
- Fast similarity search
- Filtering
- Scalability
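For illustration, a minimal example with FAISS, one popular open-source similarity index (embeddings are faked with random vectors here):

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 384                                                # depends on the embedding model
vectors = np.random.rand(10_000, dim).astype("float32")  # stand-ins for real embeddings

index = faiss.IndexFlatL2(dim)   # exact search; ANN indexes trade accuracy for speed
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)  # 5 nearest chunks
print(ids[0])                            # indices back into the chunk store
```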
17. How often should embeddings be updated?
Whenever:
- Documents change
- New knowledge is added
- Meaningful corrections occur
Stale embeddings lead to outdated answers.
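A simple way to detect “documents change” is content hashing; a minimal sketch:

```python
import hashlib

def needs_reembedding(doc_text: str, stored_hash: str | None) -> bool:
    """Re-embed only when a document's content actually changed, by
    comparing its current hash to the one stored at indexing time."""
    return hashlib.sha256(doc_text.encode("utf-8")).hexdigest() != stored_hash
```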
18. Can RAG handle multilingual documents?
Yes, if:
- You use multilingual embedding models
- Language metadata is stored and filtered
Otherwise, retrieval quality drops significantly.
19. What is hybrid RAG?
Hybrid RAG combines:
- Vector search (semantic)
- Keyword or BM25 search (exact match)
This improves performance for technical terms, IDs, and numbers.
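One common way to merge the two result lists is Reciprocal Rank Fusion (RRF); a minimal sketch operating on ranked chunk IDs:

```python
def reciprocal_rank_fusion(vector_ranked: list[str], keyword_ranked: list[str],
                           k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk IDs (semantic + BM25/keyword) by
    letting each list vote 1 / (k + rank) for its items."""
    scores: dict[str, float] = {}
    for ranking in (vector_ranked, keyword_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# reciprocal_rank_fusion(["a", "b", "c"], ["c", "a", "d"])  # -> ['a', 'c', 'b', 'd']
```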
20. Is RAG expensive to run?
Costs come from:
- Embedding generation (one-time or periodic)
- Vector storage
- LLM inference
However, RAG is far cheaper than retraining models and scales efficiently.
21. When should you NOT use RAG?
Avoid RAG if:
- The knowledge is small and static
- The model already knows everything needed
- Latency must be ultra-low with zero retrieval
22. How do you evaluate a RAG system?
Key metrics:
- Retrieval relevance
- Answer faithfulness
- Coverage of knowledge
- Latency and cost
Human evaluation is still critical.
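Retrieval relevance is the easiest to automate if you have a small hand-labeled set of queries and their relevant chunks; recall@k is a common starting point:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of the known-relevant chunks that appear in the top-k
    retrieved set for a query."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids) & relevant_ids) / len(relevant_ids)

# recall_at_k(["c1", "c7", "c9"], {"c1", "c9", "c4"})  # -> 0.67
```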
23. What is “RAG grounding”?
Grounding ensures the model:
- Uses only retrieved context
- Avoids injecting prior knowledge
This is enforced through careful prompting and system design.
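A typical grounding prompt looks something like this (the exact wording is illustrative; teams tune it heavily):

```python
GROUNDED_SYSTEM_PROMPT = """\
You are a question-answering assistant.
Rules:
- Answer ONLY from the context provided below.
- If the context does not contain the answer, reply "I don't know."
- Do not use prior knowledge, even if you are confident in it.
"""

def grounded_prompt(context: str, question: str) -> str:
    """Wrap retrieved context in instructions that forbid outside knowledge."""
    return f"{GROUNDED_SYSTEM_PROMPT}\nContext:\n{context}\n\nQuestion: {question}"
```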
24. Can RAG systems reason across multiple documents?
Yes, but only if:
- Retrieved chunks cover all required facts
- The prompt encourages synthesis, not summarization
25. What’s the biggest mistake people make with RAG?
Focusing on LLM choice instead of:
- Data quality
- Chunking strategy
- Retrieval accuracy
In RAG, data architecture beats model size.
If you want to learn RAG through personal coaching, read the details here.