Building RAG Systems with SPADE-LLM¶
This guide shows you how to build Retrieval-Augmented Generation (RAG) systems that enhance your LLM agents with external knowledge. You'll learn when to use RAG, how to design effective retrieval pipelines, and how to integrate them into multi-agent workflows.
When Should You Use RAG?¶
Before diving into implementation, consider whether RAG is the right solution for your use case:
✅ Use RAG When:¶
- Your agents need current or frequently updated information that wasn't in the LLM's training data
- You're building domain-specific assistants (e.g., company documentation, legal docs, technical manuals)
- You need verifiable, source-based answers rather than model-generated responses
- Your knowledge base is too large to fit in the context window
- You want to reduce hallucinations by grounding responses in actual documents
❌ Skip RAG When:¶
- Your task requires only general knowledge already in the LLM
- You need creative generation without factual constraints
- Your use case is simple Q&A that doesn't require external context
- Real-time performance is critical and retrieval latency is unacceptable
Understanding the RAG Pipeline¶
Building a RAG system involves three phases that work together:
Phase 1: Indexing (Build Your Knowledge Base)¶
This happens once (or periodically) to prepare your documents:
- Load documents from files, databases, or APIs
- Chunk text into semantic pieces
- Embed chunks into vectors using an embedding model
- Store vectors in a database for fast similarity search
Think of this as creating a searchable library catalog for your documents.
Phase 2: Retrieval (Find Relevant Information)¶
This happens at query time:
- User asks a question
- Convert question to a vector (using the same embedding model)
- Search vector database for similar document chunks
- Return the most relevant chunks (typically top 3-5)
This is like asking a librarian to find books on a specific topic.
Phase 3: Generation (Create the Answer)¶
The LLM uses retrieved context:
- Retrieved chunks are added to the LLM's prompt as context
- LLM generates an answer based on this context
- Response includes information from actual documents, not just model memory
This combines the librarian's retrieved books with the LLM's reading comprehension.
Building Your First RAG System¶
Let's build a practical RAG system step by step. We'll create a documentation assistant that answers questions about your codebase.
Step 1: Load Your Documents¶
Choose the loader that matches your document source:
from spade_llm.rag import DirectoryLoader
# Load all markdown files from documentation
loader = DirectoryLoader(
path="./docs",
glob_pattern="**/*.md"
)
documents = await loader.load()
print(f"Loaded {len(documents)} documents")
Tip: Use metadata to organize documents by category, version, or author. This helps with filtering during retrieval.
Step 2: Chunk Your Documents¶
Chunking is critical, too small and you lose context, too large and retrieval becomes imprecise.
from spade_llm.rag import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=800,
chunk_overlap=150
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
Why overlap? Overlap prevents important information from being split across chunk boundaries.
Step 3: Choose Your Embedding Model¶
Embeddings convert text into vectors that capture semantic meaning. Two main options:
Option A: OpenAI Embeddings (Cloud)¶
from spade_llm.providers import LLMProvider
embedding_provider = LLMProvider(
model="text-embedding-3-small",
api_key="your-api-key",
)
When to use: You need highest quality, have budget, don't mind cloud dependency.
Option B: Ollama Embeddings (Local)¶
# First: ollama pull nomic-embed-text
embedding_provider = LLMProvider(
model="ollama/nomic-embed-text",
)
When to use: Privacy matters, offline operation needed, high-volume use case.
Critical: Always use the same embedding model for indexing and querying. Mixing models breaks semantic similarity.
Step 4: Create and Populate Vector Store¶
from spade_llm.rag import Chroma
# Initialize vector store with persistent storage
vector_store = Chroma(
collection_name="docs_kb",
embedding_fn=embedding_provider.get_embeddings,
persist_directory="./vector_db"
)
await vector_store.initialize()
# Index all chunks (this may take a few minutes)
await vector_store.add_documents(chunks)
print(f"Indexed {await vector_store.get_document_count()} chunks")
Step 5: Set Up Retrieval¶
Create a retriever to query your indexed documents:
from spade_llm.rag import VectorStoreRetriever
retriever = VectorStoreRetriever(
vector_store=vector_store,
search_type="similarity", # or "mmr" for diverse results
k=5 # Return top 5 chunks
)
results = await retriever.retrieve("How do I configure agents?")
for doc in results:
print(f"📄 {doc.metadata['source']}: {doc.content[:100]}...")
Integrating RAG with your agents¶
Pattern 1: Direct Agent Integration¶
For simple use cases, integrate retrieval directly into an agent workflow (this works with traditional SPADE agents):
from spade_llm import RetrievalAgent
from spade_llm.rag import VectorStoreRetriever
# Create a retrieval agent that handles document queries
retrieval_agent = RetrievalAgent(
jid="retrieval@localhost",
password="password",
retriever=VectorStoreRetriever(vector_store=vector_store)
)
await retrieval_agent.start()
When to use: Single knowledge base, simple architecture, getting started.
Pattern 2: Multi-Agent RAG (Recommended)¶
In this case, LLM agents can query retrieval agents via tools:
from spade_llm import LLMAgent, RetrievalAgent
from spade_llm.tools import RetrievalTool
from spade_llm.providers import LLMProvider
# 1. Start retrieval agent (manages the knowledge base)
retrieval_agent = RetrievalAgent(
jid="retrieval@localhost",
password="retrieval_pass",
retriever=retriever
)
await retrieval_agent.start()
# 2. Create LLM agent with retrieval tool
llm_provider = LLMProvider(model="gpt-5-nano")
retrieval_tool = RetrievalTool(
name="docs_search",
description="Search technical documentation for code examples and explanations",
retrieval_agent_jid="retrieval@localhost",
k=5
)
llm_agent = LLMAgent(
jid="assistant@localhost",
password="assistant_pass",
provider=llm_provider,
tools=[retrieval_tool]
)
await llm_agent.start()
# 3. Query the agent - it will automatically use retrieval when needed
response = await llm_agent.query("How do I create custom behaviours?")
How it works:
- User asks the LLM agent a question
- LLM decides whether to search the knowledge base
- If needed, it calls the
docs_searchtool - Tool sends XMPP message to
RetrievalAgent - Retrieved documents are returned and used as context
- LLM generates grounded response with sources
When to use: Production systems, multiple knowledge bases, need agent autonomy.
Pattern 3: Distributed Knowledge Bases¶
Deploy multiple specialized retrieval agents for different domains:
# Technical documentation retrieval agent
tech_retrieval = RetrievalAgent(
jid="tech_docs@localhost",
password="pass",
retriever=tech_retriever
)
# HR policies retrieval agent
hr_retrieval = RetrievalAgent(
jid="hr_docs@localhost",
password="pass",
retriever=hr_retriever
)
# LLM agent with access to both
llm_agent = LLMAgent(
jid="assistant@localhost",
password="pass",
provider=llm_provider,
tools=[
RetrievalTool(
name="tech_docs",
description="Search technical and API documentation",
retrieval_agent_jid="tech_docs@localhost"
),
RetrievalTool(
name="hr_policies",
description="Search HR policies and employee handbook",
retrieval_agent_jid="hr_docs@localhost"
)
]
)
When to use: Access control needs, domain separation.
Next Steps¶
- RAG API Reference - Detailed API documentation
- Examples - Complete working examples
- Tools System - Integrate RAG with LLM agents
- Providers - Embedding model configuration