Building a Production RAG Pipeline with LangChain4j + Spring Boot
A complete guide to building retrieval-augmented generation (RAG) systems in Java using LangChain4j. Learn chunking strategies, embedding pipelines, vector store integration, and how to ship RAG to production.
Retrieval-augmented generation (RAG) is the most practical way to add domain-specific knowledge to LLMs without fine-tuning. But building production RAG systems is harder than the hype suggests.
I've shipped a RAG assistant in production — ingesting 100+ pages of engineering docs, handling 500+ queries/day, and reducing support escalations by ~40%. Here's what actually works, and what doesn't.
Why RAG, and Why Now?
RAG solves a real problem: LLMs are trained on static data and hallucinate when they don't know the answer. Instead of retraining, you retrieve relevant documents, pass them to the LLM, and let it answer from that context.
The pipeline is simple in theory:
- Chunk documents into semantic units
- Embed chunks into vectors
- Store vectors in a vector database
- Retrieve the k most similar chunks at query time
- Generate answer with context
The devil is in the implementation details. Let me walk through a production-ready setup.
Architecture Overview
Part 1: Document Ingestion & Chunking
Why Chunking Matters
Shoving an entire PDF into the embedding model won't work:
- Embedding models have limited context windows (typically 512 to roughly 8K tokens, depending on the model)
- Chunks that are too large dilute semantic meaning (mixing unrelated concepts)
- Chunks that are too small miss context (a sentence alone is often ambiguous)
Chunking Strategy
I recommend recursive character splitting with overlap:
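// Spring Boot config for the LangChain4j document splitter
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class DocumentProcessingConfig {

    @Bean
    public DocumentSplitter documentSplitter() {
        // Recursive splitter: tries paragraphs first, then smaller units.
        // This two-argument overload measures size in characters; to measure
        // in tokens (as the numbers below assume), use the overload that
        // also takes a tokenizer for your embedding model.
        return DocumentSplitters.recursive(
            1000, // max chunk size
            200   // overlap to preserve context
        );
    }
}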
Why these settings?
- 1000 tokens: ~4 paragraphs. Large enough to be semantically complete, small enough to fit embedding models.
- 200 token overlap: Prevents losing context at chunk boundaries. A query might match the tail of one chunk + the head of the next.
- Recursive splitting: Tries to split on paragraphs first, then sentences, then words. Keeps natural boundaries intact.
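To make the overlap mechanics concrete, here is a minimal sliding-window chunker. It is a sketch rather than LangChain4j's splitter: it counts whitespace-separated words as a stand-in for tokens and ignores paragraph boundaries. `OverlapChunker` and `chunk` are illustrative names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OverlapChunker {

    /** Splits text into windows of at most chunkSize words; neighbours share `overlap` words. */
    static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("require 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] words = trimmed.split("\\s+");
        int step = chunkSize - overlap; // how far the window advances each time
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + chunkSize, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) {
                break; // this window already reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Window of 4 words, 2 of them shared with the next window
        System.out.println(chunk("a b c d e f g h", 4, 2));
        // prints [a b c d, c d e f, e f g h]
    }
}
```

With the settings discussed above, the equivalent call would be `chunk(text, 1000, 200)`, so each chunk repeats the last 200 units of its predecessor.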
Handling Different Document Types
PDFs, Markdown, HTML — they're all different:
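import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.segment.TextSegment;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Locale;

@Service
public class DocumentIngestService {
    private final DocumentSplitter documentSplitter;

    public DocumentIngestService(DocumentSplitter documentSplitter) {
        this.documentSplitter = documentSplitter;
    }

    public List<TextSegment> ingestAndChunk(MultipartFile file) throws IOException {
        byte[] content = file.getBytes();
        String filename = file.getOriginalFilename();
        if (filename == null) {
            throw new IllegalArgumentException("Missing filename");
        }
        String lower = filename.toLowerCase(Locale.ROOT);
        // Route by file type
        List<Document> documents;
        if (lower.endsWith(".pdf")) {
            documents = processPdf(content);
        } else if (lower.endsWith(".md")) {
            documents = processMarkdown(new String(content, StandardCharsets.UTF_8));
        } else {
            throw new IllegalArgumentException("Unsupported format: " + filename);
        }
        // Chunk each document; LangChain4j splitters return TextSegments
        return documentSplitter.splitAll(documents);
    }

    private List<Document> processPdf(byte[] content) {
        // Extract text with Apache PDFBox and return List<Document>
        // with the extracted text + metadata (page number, etc.)
        throw new UnsupportedOperationException("PDF extraction not shown");
    }

    private List<Document> processMarkdown(String content) {
        // Split by # headers to preserve document structure;
        // each section title becomes metadata for better retrieval
        throw new UnsupportedOperationException("Markdown parsing not shown");
    }
}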
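One way to approach the Markdown branch is to cut the text at ATX headers (`#`, `##`, and so on) and keep each heading as the section title. The following is a minimal sketch under stated assumptions: headers start at the beginning of a line, and fenced code blocks containing `#` are not handled. `MarkdownSections` and `Section` are hypothetical names, and mapping each section onto a LangChain4j `Document` with metadata is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MarkdownSections {

    /** A heading (empty for text before the first header) plus the text under it. */
    record Section(String title, String body) {}

    // Matches ATX headers such as "# Title" or "### Title"
    private static final Pattern HEADER = Pattern.compile("^#{1,6}\\s+(.+?)\\s*$");

    static List<Section> split(String markdown) {
        List<Section> sections = new ArrayList<>();
        String title = "";
        StringBuilder body = new StringBuilder();
        for (String line : markdown.split("\\R")) {
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                // A new header closes the section collected so far
                addIfNotEmpty(sections, title, body);
                title = m.group(1);
                body.setLength(0);
            } else {
                body.append(line).append('\n');
            }
        }
        addIfNotEmpty(sections, title, body);
        return sections;
    }

    private static void addIfNotEmpty(List<Section> out, String title, StringBuilder body) {
        String text = body.toString().strip();
        if (!title.isEmpty() || !text.isEmpty()) {
            out.add(new Section(title, text));
        }
    }
}
```

Keeping the title separate from the body makes it easy to attach it as metadata later, so a retrieved chunk can be traced back to its section.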
Part 2: Embedding & Vector Storage
Choosing an Embedding Model
LangChain4j supports multiple providers. Here's what I recommend:
| Model | Cost | Speed | Quality | Use Case |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | $0.02/1M tokens | Fast | Very good | Prod if budget allows |
| text-embedding-3-large | $0.13/1M tokens | Fast | Excellent | High-accuracy needs |
| Ollama (local, free) | $0 | Slower | Good | Dev/on-prem |
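For production, I use OpenAI's text-embedding-3-small. Query-time embedding is cheap: only the query itself is embedded (the 5 retrieved chunks were embedded once at ingestion), and even ~2.5M embedded tokens/month costs about $0.05 at $0.02 per 1M tokens. Expect the chat model, which reads those 5 chunks on every request, to dominate the API bill.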
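To sanity-check embedding spend for your own workload, multiply monthly tokens by the per-million price from the table. A back-of-the-envelope sketch: `EmbeddingCost` is an illustrative name and the token volume is an assumption you should replace with your own measurement.

```java
import java.util.Locale;

final class EmbeddingCost {

    /** Cost in USD for embedding tokensPerMonth tokens at a given price per one million tokens. */
    static double monthlyCostUsd(long tokensPerMonth, double usdPerMillionTokens) {
        return tokensPerMonth / 1_000_000.0 * usdPerMillionTokens;
    }

    public static void main(String[] args) {
        long tokens = 2_500_000L; // assumed monthly embedding volume
        double small = monthlyCostUsd(tokens, 0.02); // text-embedding-3-small price from the table
        double large = monthlyCostUsd(tokens, 0.13); // text-embedding-3-large price from the table
        System.out.println(String.format(Locale.ROOT, "small: $%.3f, large: $%.3f", small, large));
        // prints small: $0.050, large: $0.325
    }
}
```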
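import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;
import dev.langchain4j.model.openai.OpenAiEmbeddingModelName;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class EmbeddingConfig {

    @Bean
    public EmbeddingModel embeddingModel(
            @Value("${openai.api.key}") String apiKey) {
        // LangChain4j models are created through builders
        return OpenAiEmbeddingModel.builder()
            .apiKey(apiKey)
            .modelName(OpenAiEmbeddingModelName.TEXT_EMBEDDING_3_SMALL)
            .build();
    }
}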
Vector Store Integration
LangChain4j has embedding-store connectors for Pinecone, Milvus, Weaviate, and others. I use Pinecone because it is fully managed, which keeps operational overhead close to zero.
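import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.pinecone.PineconeEmbeddingStore;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class VectorStoreConfig {

    @Bean
    public EmbeddingStore<TextSegment> vectorStore(
            @Value("${pinecone.api.key}") String apiKey) {
        // LangChain4j's Pinecone connector is PineconeEmbeddingStore
        // (langchain4j-pinecone module). It stores vectors only; embedding
        // happens separately through the EmbeddingModel bean.
        // Builder options beyond these three (environment or region,
        // index creation) vary by version, so check the release you use.
        return PineconeEmbeddingStore.builder()
            .apiKey(apiKey)
            .index("documentation")
            .nameSpace("prod")
            .build();
    }
}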
Storing Documents
Once you have a vector store, ingestion is straightforward:
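// Types come from dev.langchain4j.* (imports omitted for brevity)
@Service
public class DocumentIndexingService {
    private final EmbeddingStore<TextSegment> vectorStore;
    private final EmbeddingModel embeddingModel;
    private final DocumentSplitter splitter;

    public DocumentIndexingService(EmbeddingStore<TextSegment> vectorStore,
                                   EmbeddingModel embeddingModel,
                                   DocumentSplitter splitter) {
        this.vectorStore = vectorStore;
        this.embeddingModel = embeddingModel;
        this.splitter = splitter;
    }

    public void indexDocuments(List<Document> docs) {
        // Split into chunks (LangChain4j calls them TextSegments)
        List<TextSegment> chunks = splitter.splitAll(docs);
        // Embed all chunks in one batched call
        List<Embedding> embeddings = embeddingModel.embedAll(chunks).content();
        // Index now has (document text, embedding vector, metadata)
        vectorStore.addAll(embeddings, chunks);
    }
}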
Part 3: Retrieval & Prompt Engineering
Semantic Search
Retrieval is the heart of RAG. You embed the user's query and find k nearest neighbors in the vector store.
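Under the hood, "k nearest neighbors" usually means ranking stored vectors by cosine similarity to the query vector. The sketch below does this by brute force over an in-memory map, which is only practical for small collections; a vector database does the same ranking with an approximate index. `BruteForceSearch`, `Match`, and `topK` are illustrative names, not LangChain4j API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class BruteForceSearch {

    record Match(String id, double score) {}

    /** Cosine similarity: dot product divided by the product of the vector lengths. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0; // treat similarity with a zero vector as 0
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns up to k matches with score >= minScore, best first. */
    static List<Match> topK(float[] query, Map<String, float[]> index, int k, double minScore) {
        List<Match> matches = new ArrayList<>();
        for (Map.Entry<String, float[]> entry : index.entrySet()) {
            double score = cosine(query, entry.getValue());
            if (score >= minScore) {
                matches.add(new Match(entry.getKey(), score));
            }
        }
        matches.sort(Comparator.comparingDouble(Match::score).reversed());
        return List.copyOf(matches.subList(0, Math.min(k, matches.size())));
    }
}
```

The `minScore` cut-off matters as much as `k`: without it, a query with no good match still returns k chunks, and the LLM is handed irrelevant context.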
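// Types come from dev.langchain4j.* (imports omitted for brevity)
@Service
public class RagService {
    private final EmbeddingStore<TextSegment> vectorStore;
    private final EmbeddingModel embeddingModel;
    private final ChatModel chatModel;

    public RagService(EmbeddingStore<TextSegment> vectorStore,
                      EmbeddingModel embeddingModel,
                      ChatModel chatModel) {
        this.vectorStore = vectorStore;
        this.embeddingModel = embeddingModel;
        this.chatModel = chatModel;
    }

    public String queryAndGenerate(String userQuery) {
        // Step 1: Embed the query and retrieve relevant chunks
        Embedding queryEmbedding = embeddingModel.embed(userQuery).content();
        EmbeddingSearchRequest request = EmbeddingSearchRequest.builder()
            .queryEmbedding(queryEmbedding)
            .maxResults(5)   // Get top 5 chunks
            .minScore(0.75)  // Filter by relevance
            .build();
        List<EmbeddingMatch<TextSegment>> matches = vectorStore.search(request).matches();

        // Step 2: Build context
        String context = matches.stream()
            .map(match -> match.embedded().text())
            .collect(Collectors.joining("\n\n---\n\n"));

        // Step 3: Generate answer with context
        // (ChatModel.chat is the LangChain4j 1.x name; older releases
        // expose ChatLanguageModel.generate instead)
        String prompt = buildPrompt(userQuery, context);
        return chatModel.chat(prompt);
    }

    private String buildPrompt(String query, String context) {
        // Java text blocks do not interpolate ${...}; fill the
        // placeholders with formatted() instead
        return """
            You are a helpful technical assistant.
            Based on the following documentation, answer the user's question.
            If the documentation doesn't contain enough information, say so.

            Documentation:
            %s

            User question: %s

            Answer:
            """.formatted(context, query);
    }
}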
Improving Retrieval Quality
If your RAG is returning irrelevant chunks, try:
- Re-rank — Use a second model (e.g., cross-encoder) to re-rank the top k results. Expensive but accurate.
- Query expansion — Transform the user's query into 3–5 variations and search for all of them.
- Metadata filtering — If your docs have metadata (source, date, category), filter before search.
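// Query expansion: search with several phrasings, keep the best score per chunk
public List<TextSegment> retrieveWithExpansion(String query) {
    List<String> queries = new ArrayList<>();
    queries.add(query); // always search the original phrasing too
    queries.addAll(generateQueryVariations(query));

    Map<String, EmbeddingMatch<TextSegment>> bestById = new HashMap<>();
    for (String q : queries) {
        EmbeddingSearchRequest request = EmbeddingSearchRequest.builder()
            .queryEmbedding(embeddingModel.embed(q).content())
            .maxResults(5)
            .build();
        for (EmbeddingMatch<TextSegment> match : vectorStore.search(request).matches()) {
            // Deduplicate by embedding id, keeping the highest-scoring hit
            bestById.merge(match.embeddingId(), match,
                (a, b) -> a.score() >= b.score() ? a : b);
        }
    }

    // Sort before limiting; an unordered set would return 5 arbitrary chunks
    return bestById.values().stream()
        .sorted(Comparator.comparingDouble((EmbeddingMatch<TextSegment> m) -> m.score()).reversed())
        .limit(5)
        .map(EmbeddingMatch::embedded)
        .collect(Collectors.toList());
}

private List<String> generateQueryVariations(String query) {
    // Use LLM to generate variations
    String prompt = "Generate 3 alternative phrasings of this question, one per line: " + query;
    String variations = chatModel.chat(prompt);
    return parseVariations(variations); // e.g. split on newlines and trim
}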
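When merging result lists from several query variations, one option that does not depend on raw similarity scores is reciprocal rank fusion (RRF). This is a different technique from the code above, shown here as an alternative: each chunk earns `1 / (k + rank)` from every list it appears in, and the sums decide the final order. The sketch works on chunk ids only; `RankFusion` and `fuse` are illustrative names, and `k = 60` is a commonly used constant rather than a tuned value.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class RankFusion {

    /** Fuses ranked id lists with reciprocal rank fusion and returns the top `limit` ids. */
    static List<String> fuse(List<List<String>> rankedLists, int k, int limit) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranked : rankedLists) {
            for (int i = 0; i < ranked.size(); i++) {
                // i is 0-based, so the 1-based rank is i + 1
                scores.merge(ranked.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        return scores.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed()
                .thenComparing(Map.Entry.comparingByKey())) // stable order on ties
            .limit(limit)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> lists = List.of(
            List.of("a", "b", "c"),
            List.of("b", "a", "d"),
            List.of("b", "e"));
        System.out.println(fuse(lists, 60, 3));
        // prints [b, a, e]
    }
}
```

Chunks that show up in several lists rise to the top even if no single query ranked them first, which is usually the behaviour you want from query expansion.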
Part 4: Agentic Tool-Calling (Optional, Advanced)
Simple RAG retrieves static docs. But what if the user needs live data?
With agentic tool-calling, the LLM can request calls to your APIs to fetch current information, and your code executes them:
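// API names follow LangChain4j 1.x (imports omitted for brevity)
@Service
public class AgenticRagService {
    private static final int MAX_TOOL_ROUNDS = 5; // guard against endless tool loops

    private final ChatModel chatModel;
    private final EmbeddingStore<TextSegment> vectorStore;
    private final OrderApiClient orderApi; // live data source

    public AgenticRagService(ChatModel chatModel,
                             EmbeddingStore<TextSegment> vectorStore,
                             OrderApiClient orderApi) {
        this.chatModel = chatModel;
        this.vectorStore = vectorStore;
        this.orderApi = orderApi;
    }

    public String generateWithTools(String userQuery) {
        // Define tools the AI can call
        List<ToolSpecification> tools = List.of(
            searchDocumentation(),
            queryLiveOrders(),
            checkInventory()
        );
        List<ChatMessage> messages = new ArrayList<>();
        messages.add(UserMessage.from(userQuery));

        for (int round = 0; round < MAX_TOOL_ROUNDS; round++) {
            // Let the AI decide which tools to use
            ChatResponse response = chatModel.chat(ChatRequest.builder()
                .messages(messages)
                .toolSpecifications(tools)
                .build());
            AiMessage aiMessage = response.aiMessage();
            if (!aiMessage.hasToolExecutionRequests()) {
                return aiMessage.text(); // final answer, no more tools needed
            }
            // If AI called tools, execute them and re-prompt
            messages.add(aiMessage);
            for (ToolExecutionRequest call : aiMessage.toolExecutionRequests()) {
                String result = executeTool(call);
                // Feed result back to LLM
                messages.add(ToolExecutionResultMessage.from(call, result));
            }
        }
        throw new IllegalStateException("Tool loop did not finish in " + MAX_TOOL_ROUNDS + " rounds");
    }

    private ToolSpecification queryLiveOrders() {
        return ToolSpecification.builder()
            .name("queryLiveOrders")
            .description("Fetch current orders and their status")
            .parameters(/* JSON schema as a JsonObjectSchema */)
            .build();
    }
    // searchDocumentation(), checkInventory(), and executeTool(...) are not shown
}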
This is how I reduced support escalations by 40% — the RAG could answer "What's the status of my order?" by calling the live API, not just the docs.
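The `executeTool` step is left undefined in the code above. One simple way to structure it is a registry that maps tool names to handlers and returns errors as text, so the LLM can read the failure and recover instead of the request crashing. This is a library-free sketch: `ToolDispatcher` is a hypothetical name, handlers take and return plain JSON strings, and the stubbed order status is made-up sample data.

```java
import java.util.Map;
import java.util.function.Function;

final class ToolDispatcher {

    private final Map<String, Function<String, String>> tools;

    ToolDispatcher(Map<String, Function<String, String>> tools) {
        this.tools = Map.copyOf(tools);
    }

    /** Runs the named tool; unknown names and failures come back as error strings. */
    String execute(String toolName, String argumentsJson) {
        Function<String, String> tool = tools.get(toolName);
        if (tool == null) {
            return "ERROR: unknown tool '" + toolName + "'";
        }
        try {
            return tool.apply(argumentsJson);
        } catch (RuntimeException e) {
            return "ERROR: " + toolName + " failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        Function<String, String> stubOrders = argsJson -> "{\"status\":\"SHIPPED\"}"; // sample data
        ToolDispatcher dispatcher = new ToolDispatcher(Map.of("queryLiveOrders", stubOrders));
        System.out.println(dispatcher.execute("queryLiveOrders", "{}"));
        System.out.println(dispatcher.execute("deleteEverything", "{}"));
    }
}
```

Only tools registered in the map can ever run, which also gives you a single place to add authorization checks and logging for every call the model makes.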
Part 5: Production Considerations
Caching
Embedding the same query twice wastes API spend. Cache embeddings:
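@Service
@CacheConfig(cacheNames = "queryEmbeddings")
public class CachedRagService {
    private final EmbeddingModel embeddingModel;

    public CachedRagService(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    // Cached by query string. Requires @EnableCaching on a @Configuration
    // class, and only applies when called from another bean.
    @Cacheable
    public Embedding embedQuery(String query) {
        return embeddingModel.embed(query).content();
    }
}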
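If you are not using Spring's cache abstraction, the same idea fits in a few lines: a bounded in-memory LRU cache in front of the embedding call. This is a minimal sketch with hypothetical names (`LruEmbeddingCache`, `embed`); the embedder is passed in as a plain function, so nothing here depends on a particular provider.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class LruEmbeddingCache {

    private final Function<String, float[]> embedder;
    private final Map<String, float[]> cache;

    LruEmbeddingCache(Function<String, float[]> embedder, int maxEntries) {
        this.embedder = embedder;
        // accessOrder = true keeps the least recently used entry first,
        // and removeEldestEntry evicts it once the cache is over capacity
        this.cache = new LinkedHashMap<String, float[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, float[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached embedding, calling the embedder only on a miss. */
    synchronized float[] embed(String query) {
        String key = query.strip(); // ignore leading and trailing whitespace
        float[] cached = cache.get(key);
        if (cached == null) {
            cached = embedder.apply(key);
            cache.put(key, cached);
        }
        return cached;
    }
}
```

The bound matters in production: an unbounded map keyed by user input grows without limit, while an LRU keeps the frequently repeated questions and drops the long tail.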
Monitoring & Observability
Track:
- Retrieval quality: Did the retrieved chunks actually answer the question?
- Embedding costs: Especially if using paid models
- Latency: Is the round-trip P99 < 500ms?
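@Service
public class ObservableRagService {
    private final MeterRegistry meterRegistry;
    private final Timer latencyTimer;

    public ObservableRagService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        // Register the timer once instead of on every request
        this.latencyTimer = Timer.builder("rag.latency")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(meterRegistry);
    }

    public String queryAndGenerate(String query) {
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            List<TextSegment> docs = retrieve(query);
            // Record the result count as a distribution rather than a tag
            // value, which would create one time series per distinct count
            meterRegistry.summary("rag.retrieval.results").record(docs.size());
            return generate(query, docs);
        } finally {
            sample.stop(latencyTimer);
        }
    }
    // retrieve(...) and generate(...) wrap the retrieval and chat calls shown earlier
}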
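For readers less familiar with percentile metrics, this is what a P99 figure means when computed directly from raw samples with the nearest-rank method. It is purely illustrative: in the service above the metrics library does this for you, typically with an approximation. `LatencyStats` is a hypothetical name and the sample values are made up.

```java
import java.util.Arrays;

final class LatencyStats {

    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMs.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        long[] latenciesMs = {120, 80, 450, 95, 300};
        System.out.println(percentile(latenciesMs, 50)); // prints 120
        System.out.println(percentile(latenciesMs, 99)); // prints 450
    }
}
```

Note how one slow request (450 ms) defines the P99 while the median stays at 120 ms, which is why tail percentiles are the numbers to alert on.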
Cost Optimization
- Batch embedding: Embed multiple docs per request (fewer round-trips and less rate-limit pressure than one-by-one)
- Use smaller models: OpenAI's text-embedding-3-small costs $0.02 per 1M tokens, which keeps most workloads around $10/month or less
- Cache aggressively: Reduce redundant API calls
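Batching comes down to partitioning the chunk list before calling the embedding API. Here is a small generic helper as a sketch; `Batcher` and `partition` are illustrative names, and the right batch size depends on your provider's request limits.

```java
import java.util.ArrayList;
import java.util.List;

final class Batcher {

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch stays valid if the source list changes
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        System.out.println(partition(List.of(1, 2, 3, 4, 5), 2));
        // prints [[1, 2], [3, 4], [5]]
    }
}
```

Each batch then becomes one embedding request instead of one request per chunk.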
Common Pitfalls
- Chunk size too large: Unrelated topics end up in one vector, so the embedding captures a blur of context rather than one clear meaning.
- No overlap: Answers that straddle a chunk boundary get split across two chunks. Use 15–20% overlap.
- Stale docs: RAG is only as good as your source docs. Update them regularly.
- Ignoring retrieval quality: Don't blindly return the top-k chunks. Measure actual accuracy (does the retrieved doc answer the question?).
- No monitoring: You can't improve what you don't measure.
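A simple way to start measuring retrieval quality is hit rate at k: for a small set of questions where you know which chunk contains the answer, check how often that chunk appears in the top k results. A minimal sketch, assuming you maintain such a labeled set by hand; `RetrievalEval` is a hypothetical name and the retriever is passed in as a function that returns ranked chunk ids.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class RetrievalEval {

    /** Fraction of questions whose expected chunk id appears among the top k retrieved ids. */
    static double hitRateAtK(Map<String, String> expectedIdByQuestion,
                             Function<String, List<String>> retriever,
                             int k) {
        if (expectedIdByQuestion.isEmpty()) {
            throw new IllegalArgumentException("empty evaluation set");
        }
        int hits = 0;
        for (Map.Entry<String, String> entry : expectedIdByQuestion.entrySet()) {
            List<String> retrieved = retriever.apply(entry.getKey());
            List<String> topK = retrieved.subList(0, Math.min(k, retrieved.size()));
            if (topK.contains(entry.getValue())) {
                hits++;
            }
        }
        return (double) hits / expectedIdByQuestion.size();
    }
}
```

Run it whenever you change chunk size, overlap, or the embedding model; a drop in hit rate tells you the change hurt retrieval before users notice.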
Wrapping Up
RAG is powerful but requires discipline. Start simple:
- Ingest docs, chunk them
- Embed with OpenAI (or local Ollama)
- Store in Pinecone (or PostgreSQL with pgvector)
- Retrieve top-5, pass to LLM
- Measure retrieval quality
Once this works, add tools, caching, monitoring. Don't over-engineer upfront.
The PCB assistant I built processes 500+ queries/day on roughly $50/month of API spend plus $50/month for Pinecone. It's not cheap-cheap, but it's production-viable.
Next steps: Try LangChain4j's quickstart, swap in your docs, measure retrieval quality. You'll quickly find the gaps.
Ravi Kant Shukla
Senior Java + AI engineer. 9+ years in system design, Kafka, microservices, and LLM/RAG pipelines.