Building a Production RAG Pipeline with LangChain4j + Spring Boot

Retrieval-augmented generation (RAG) is the most practical way to add domain-specific knowledge to LLMs without fine-tuning. But building production RAG systems is harder than the hype suggests.

I've shipped a RAG assistant in production — ingesting 100+ pages of engineering docs, handling 500+ queries/day, and reducing support escalations by ~40%. Here's what actually works, and what doesn't.

Why RAG, and Why Now?

RAG solves a real problem: LLMs are trained on static data and hallucinate when they don't know the answer. Instead of retraining, you retrieve relevant documents, pass them to the LLM, and let it answer from that context.

The pipeline is simple in theory:

Chunk documents into semantic units
Embed chunks into vectors
Store vectors in a vector database
Retrieve the k most similar chunks at query time
Generate answer with context

The devil is in the implementation details. Let me walk through a production-ready setup.

Architecture Overview

Part 1: Document Ingestion & Chunking

Why Chunking Matters

Shoving an entire PDF into the embedding model won't work:

Embedding models have limited context windows (typically 512 to 8,192 tokens, depending on the model)
Chunks that are too large dilute semantic meaning (mixing unrelated concepts)
Chunks that are too small miss context (a sentence alone is often ambiguous)

Chunking Strategy

I recommend recursive character splitting with overlap:

Java

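import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.model.TokenCountEstimator;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Spring Boot config for the LangChain4j document splitter.
// Type names follow LangChain4j 1.x (TokenCountEstimator was Tokenizer in 0.x).
@Configuration
public class DocumentProcessingConfig {

  @Bean
  public DocumentSplitter documentSplitter(TokenCountEstimator tokenCounter) {
    // Recursive splitter: tries paragraphs first, then lines, sentences, words
    return DocumentSplitters.recursive(
      1000,          // max tokens per chunk
      200,           // tokens of overlap to preserve context
      tokenCounter   // counts tokens the same way your embedding model does
    );
  }
}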

Why these settings?

1000 tokens: ~4 paragraphs. Large enough to be semantically complete, small enough to fit embedding models.
200 token overlap: Prevents losing context at chunk boundaries. A query might match the tail of one chunk + the head of the next.
Recursive splitting: Tries to split on paragraphs first, then sentences, then words. Keeps natural boundaries intact.
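
To see what the overlap actually does at a chunk boundary, here is a minimal sketch in plain Java. It is not LangChain4j's recursive splitter: it assumes a fixed sliding window and uses whitespace-separated words as a stand-in for tokens, and the class name OverlapChunker is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OverlapChunker {

  // Splits text into windows of `size` whitespace-separated words,
  // where consecutive windows share `overlap` words.
  static List<String> chunk(String text, int size, int overlap) {
    if (size <= 0 || overlap < 0 || overlap >= size) {
      throw new IllegalArgumentException("need 0 <= overlap < size");
    }
    List<String> chunks = new ArrayList<>();
    String trimmed = text.trim();
    if (trimmed.isEmpty()) {
      return chunks;
    }
    String[] words = trimmed.split("\\s+");
    int step = size - overlap;  // how far the window advances each time
    for (int start = 0; start < words.length; start += step) {
      int end = Math.min(start + size, words.length);
      chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
      if (end == words.length) {
        break;  // the last window already reached the end of the text
      }
    }
    return chunks;
  }

  public static void main(String[] args) {
    // Ten words, windows of 5 words, 2 words shared between neighbours
    chunk("a b c d e f g h i j", 5, 2).forEach(System.out::println);
  }
}
```

Each window starts `size - overlap` words after the previous one, so the last `overlap` words of a chunk reappear at the head of the next. That repetition is what lets a query match text that would otherwise be cut in half.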

Handling Different Document Types

PDFs, Markdown, HTML — they're all different:

Java

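import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.parser.apache.pdfbox.ApachePdfBoxDocumentParser;
import dev.langchain4j.data.segment.TextSegment;

// Assumes the langchain4j-document-parser-apache-pdfbox module is on the classpath
@Service
public class DocumentIngestService {

  private final DocumentSplitter documentSplitter;

  public DocumentIngestService(DocumentSplitter documentSplitter) {
    this.documentSplitter = documentSplitter;
  }

  public List<TextSegment> ingestAndChunk(MultipartFile file)
      throws IOException {

    byte[] content = file.getBytes();
    String filename = file.getOriginalFilename();
    if (filename == null) {
      throw new IllegalArgumentException("Missing filename");
    }

    // Route by file type (case-insensitive extension match)
    String lower = filename.toLowerCase(Locale.ROOT);
    List<Document> documents;
    if (lower.endsWith(".pdf")) {
      documents = processPdf(content);
    } else if (lower.endsWith(".md")) {
      documents = processMarkdown(new String(content, StandardCharsets.UTF_8));
    } else {
      throw new IllegalArgumentException("Unsupported format: " + filename);
    }

    // Chunk each document into text segments
    return documentSplitter.splitAll(documents);
  }

  private List<Document> processPdf(byte[] content) {
    // Apache PDFBox-backed parser; returns the whole PDF's text as one Document.
    // Per-page metadata (page number, etc.) needs extra work with PDFBox itself.
    Document document = new ApachePdfBoxDocumentParser()
      .parse(new ByteArrayInputStream(content));
    return List.of(document);
  }

  private List<Document> processMarkdown(String content) {
    // Split before each # heading so every section becomes its own Document
    return Arrays.stream(content.split("(?m)^(?=#{1,6}\\s)"))
      .filter(section -> !section.isBlank())
      .map(Document::from)
      .collect(Collectors.toList());
  }
}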
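
If you want each Markdown section to carry its heading as metadata, the split can be done by hand. The following is a minimal sketch in plain Java under a few assumptions: sections are returned as simple maps with hypothetical "heading" and "body" keys, headings use the `#` style, and lines starting with `#` inside fenced code are not treated specially.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MarkdownSections {

  // Returns one map per non-empty section with "heading" and "body" entries.
  static List<Map<String, String>> split(String markdown) {
    List<Map<String, String>> sections = new ArrayList<>();
    String heading = "";  // text before the first heading has no title
    StringBuilder body = new StringBuilder();
    for (String line : markdown.split("\\R", -1)) {
      if (line.matches("#{1,6}\\s+.*")) {
        flush(sections, heading, body);  // close the previous section
        heading = line.replaceFirst("^#{1,6}\\s+", "").trim();
        body.setLength(0);
      } else {
        body.append(line).append('\n');
      }
    }
    flush(sections, heading, body);
    return sections;
  }

  private static void flush(List<Map<String, String>> out,
                            String heading, StringBuilder body) {
    String text = body.toString().trim();
    if (text.isEmpty()) {
      return;  // skip sections that have a heading but no content
    }
    Map<String, String> section = new LinkedHashMap<>();
    section.put("heading", heading);
    section.put("body", text);
    out.add(section);
  }

  public static void main(String[] args) {
    System.out.println(split("# Intro\nHello\n\n## Setup\nRun it\n"));
  }
}
```

Storing the heading next to the body means a retrieved chunk can be shown with its section title, and the title can later serve as a metadata filter.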

Part 2: Embedding & Vector Storage

Choosing an Embedding Model

LangChain4j supports multiple providers. Here's what I recommend:

Model	Cost	Speed	Quality	Use Case
OpenAI text-embedding-3-small	$0.02/1M tokens	Fast	Very good	Prod if budget allows
OpenAI text-embedding-3-large	$0.13/1M tokens	Fast	Excellent	High-accuracy needs
Ollama (local, free)	$0	Slower	Good	Dev/on-prem

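For production, I use OpenAI's text-embedding-3-small. Embedding cost scales with the tokens you embed (documents at ingestion, queries at search time), not with the chunks you retrieve: at $0.02 per 1M tokens, 2.5M tokens/month works out to about $0.05. The 5 retrieved chunks per query are billed as input tokens on the chat model, which is usually the larger part of the bill.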
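
Per-token pricing is easy to sanity-check before you commit to a model. A minimal sketch in plain Java, assuming the $0.02 per 1M token price from the table above; the 50-token average query length is a made-up figure for illustration, not a measurement.

```java
final class EmbeddingCost {

  // Cost in US dollars for embedding `tokens` tokens at a per-million-token price.
  static double costUsd(long tokens, double usdPerMillionTokens) {
    return tokens / 1_000_000.0 * usdPerMillionTokens;
  }

  public static void main(String[] args) {
    long queriesPerMonth = 500L * 30;  // 500 queries/day
    long tokensPerQuery = 50;          // assumed average query length
    long queryTokens = queriesPerMonth * tokensPerQuery;

    System.out.printf("query embeddings: $%.4f%n", costUsd(queryTokens, 0.02));
    System.out.printf("2.5M tokens:      $%.4f%n", costUsd(2_500_000L, 0.02));
  }
}
```

Running the same function with the chat model's input price and the size of your retrieved context gives a rough estimate for the generation side as well.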

Java

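import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class EmbeddingConfig {

  @Bean
  public EmbeddingModel embeddingModel(
      @Value("${openai.api.key}") String apiKey) {
    // The model is configured through its builder rather than a constructor
    return OpenAiEmbeddingModel.builder()
      .apiKey(apiKey)
      .modelName("text-embedding-3-small")
      .build();
  }
}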

Vector Store Integration

LangChain4j has connectors for Pinecone, Milvus, Weaviate, and others. I use Pinecone because it is fully managed, so there is almost no operational overhead.

Java

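import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.pinecone.PineconeEmbeddingStore;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class VectorStoreConfig {

  @Bean
  public EmbeddingStore<TextSegment> embeddingStore(
      @Value("${pinecone.api.key}") String apiKey) {
    // Builder options differ between langchain4j-pinecone releases (older ones
    // also take an environment such as "gcp-starter" and a project id, newer
    // ones an index-creation config), so check the version you depend on.
    return PineconeEmbeddingStore.builder()
      .apiKey(apiKey)
      .index("documentation")
      .nameSpace("prod")
      .build();
  }
}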

Storing Documents

Once you have a vector store, ingestion is straightforward:

Java

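@Service
public class DocumentIndexingService {

  private final EmbeddingStore<TextSegment> embeddingStore;
  private final EmbeddingModel embeddingModel;
  private final DocumentSplitter splitter;

  public DocumentIndexingService(EmbeddingStore<TextSegment> embeddingStore,
                                 EmbeddingModel embeddingModel,
                                 DocumentSplitter splitter) {
    this.embeddingStore = embeddingStore;
    this.embeddingModel = embeddingModel;
    this.splitter = splitter;
  }

  public void indexDocuments(List<Document> docs) {
    // Split into chunks
    List<TextSegment> chunks = splitter.splitAll(docs);

    // Embed all chunks in one batched call
    List<Embedding> embeddings = embeddingModel.embedAll(chunks).content();

    // Add to the store: each entry holds (embedding vector, chunk text, metadata)
    embeddingStore.addAll(embeddings, chunks);
  }
}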

Part 3: Retrieval & Prompt Engineering

Semantic Search

Retrieval is the heart of RAG. You embed the user's query and find k nearest neighbors in the vector store.

Java

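import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.EmbeddingSearchRequest;
import dev.langchain4j.store.embedding.EmbeddingStore;

@Service
public class RagService {

  private final EmbeddingStore<TextSegment> embeddingStore;
  private final EmbeddingModel embeddingModel;
  private final ChatModel chatModel;  // ChatLanguageModel in LangChain4j 0.x

  public RagService(EmbeddingStore<TextSegment> embeddingStore,
                    EmbeddingModel embeddingModel,
                    ChatModel chatModel) {
    this.embeddingStore = embeddingStore;
    this.embeddingModel = embeddingModel;
    this.chatModel = chatModel;
  }

  public String queryAndGenerate(String userQuery) {
    // Step 1: Embed the query and retrieve relevant chunks
    EmbeddingSearchRequest request = EmbeddingSearchRequest.builder()
      .queryEmbedding(embeddingModel.embed(userQuery).content())
      .maxResults(5)  // Get top 5 chunks
      .minScore(0.75) // Filter by relevance
      .build();

    List<EmbeddingMatch<TextSegment>> matches =
      embeddingStore.search(request).matches();

    // Step 2: Build context
    String context = matches.stream()
      .map(match -> match.embedded().text())
      .collect(Collectors.joining("\n\n---\n\n"));

    // Step 3: Generate answer with context
    String prompt = buildPrompt(userQuery, context);
    return chatModel.chat(prompt);  // generate(prompt) in LangChain4j 0.x
  }

  private String buildPrompt(String query, String context) {
    // Java text blocks do not interpolate ${...}; use %s with formatted()
    return """
      You are a helpful technical assistant.

      Based on the following documentation, answer the user's question.
      If the documentation doesn't contain enough information, say so.

      Documentation:
      %s

      User question: %s

      Answer:
      """.formatted(context, query);
  }
}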
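
Under the hood, the search step ranks stored vectors by their similarity to the query vector. Here is a minimal sketch in plain Java, assuming cosine similarity and a brute-force scan; real vector stores use approximate indexes and may report a rescaled score, so a 0.75 threshold here is not directly comparable to the one in the service above. The class name NearestNeighbors is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class NearestNeighbors {

  // Cosine similarity: 1.0 for the same direction, 0.0 for orthogonal vectors.
  static double cosine(double[] a, double[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vectors must have the same length");
    }
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    if (normA == 0 || normB == 0) {
      return 0;  // a zero vector has no direction
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  // Returns the indices of up to k stored vectors scoring >= minScore, best first.
  static List<Integer> topK(double[] query, List<double[]> stored,
                            int k, double minScore) {
    double[] scores = new double[stored.size()];
    List<Integer> candidates = new ArrayList<>();
    for (int i = 0; i < stored.size(); i++) {
      scores[i] = cosine(query, stored.get(i));
      if (scores[i] >= minScore) {
        candidates.add(i);
      }
    }
    candidates.sort((x, y) -> Double.compare(scores[y], scores[x]));
    return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
  }

  public static void main(String[] args) {
    List<double[]> stored = List.of(
      new double[] {1, 0}, new double[] {0, 1}, new double[] {1, 1});
    System.out.println(topK(new double[] {1, 0}, stored, 2, 0.5));
  }
}
```

The threshold is applied before the top-k cut, so a query with no good match returns fewer than k chunks or none at all. That is the case where the prompt's "say so" instruction matters.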

Improving Retrieval Quality

If your RAG is returning irrelevant chunks, try:

Re-rank — Use a second model (e.g., cross-encoder) to re-rank the top k results. Expensive but accurate.
Query expansion — Transform the user's query into 3–5 variations and search for all of them.
Metadata filtering — If your docs have metadata (source, date, category), filter before search.

Java

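// Query expansion: search with several phrasings, keep the best unique chunks
public List<TextSegment> retrieveWithExpansion(String query) {
  List<String> queries = new ArrayList<>(generateQueryVariations(query));
  queries.add(0, query);  // always search the original phrasing too

  // Deduplicate by chunk text, keeping the highest score seen for each chunk
  Map<String, EmbeddingMatch<TextSegment>> best = new HashMap<>();
  for (String q : queries) {
    EmbeddingSearchRequest request = EmbeddingSearchRequest.builder()
      .queryEmbedding(embeddingModel.embed(q).content())
      .maxResults(5)
      .build();
    for (EmbeddingMatch<TextSegment> match : embeddingStore.search(request).matches()) {
      best.merge(match.embedded().text(), match,
        (a, b) -> a.score() >= b.score() ? a : b);
    }
  }

  // Rank by score before limiting, so the 5 returned are the 5 best
  return best.values().stream()
    .sorted(Comparator.comparingDouble(
      (EmbeddingMatch<TextSegment> m) -> m.score()).reversed())
    .limit(5)
    .map(EmbeddingMatch::embedded)
    .collect(Collectors.toList());
}

private List<String> generateQueryVariations(String query) {
  // Use the LLM to generate variations, one per line
  String prompt = "Generate 3 alternative phrasings of this question, "
    + "one per line: " + query;
  String variations = chatModel.chat(prompt);
  return variations.lines()
    .map(String::strip)
    .filter(line -> !line.isEmpty())
    .collect(Collectors.toList());
}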
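
Metadata filtering, the third option in the list above, needs no second model. Here is a minimal sketch in plain Java, assuming each chunk is represented as a map holding its text plus metadata fields (the "category" key and the sample values are hypothetical). With a real vector store you would normally push the filter into the search request rather than filter in memory.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class MetadataFilter {

  // Keeps only the chunks whose fields contain every required key/value pair.
  static List<Map<String, String>> filter(List<Map<String, String>> chunks,
                                          Map<String, String> required) {
    return chunks.stream()
      .filter(chunk -> required.entrySet().stream()
        .allMatch(e -> e.getValue().equals(chunk.get(e.getKey()))))
      .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Map<String, String>> chunks = List.of(
      Map.of("text", "Rotate API keys every 90 days", "category", "security"),
      Map.of("text", "Invoices are issued monthly", "category", "billing"));

    // Restrict the candidate set before any similarity search runs
    System.out.println(filter(chunks, Map.of("category", "security")));
  }
}
```

Filtering first shrinks the candidate set, so a chunk from the wrong category cannot take a top-k slot from a relevant one.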

Part 4: Agentic Tool-Calling (Optional, Advanced)

Simple RAG retrieves static docs. But what if the user needs live data?

With agentic tool-calling, the RAG can call your APIs to fetch current information:

Java

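import dev.langchain4j.agent.tool.ToolExecutionRequest;
import dev.langchain4j.agent.tool.ToolSpecification;
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.data.message.ToolExecutionResultMessage;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.chat.request.ChatRequest;
import dev.langchain4j.model.chat.response.ChatResponse;

// Low-level tool-calling loop (type names follow LangChain4j 1.x)
@Service
public class AgenticRagService {

  private static final int MAX_TOOL_ROUNDS = 5;  // guard against endless tool loops

  private final ChatModel chatModel;
  private final EmbeddingStore<TextSegment> embeddingStore;  // backs searchDocumentation()
  private final OrderApiClient orderApi;  // live data source

  public AgenticRagService(ChatModel chatModel,
                           EmbeddingStore<TextSegment> embeddingStore,
                           OrderApiClient orderApi) {
    this.chatModel = chatModel;
    this.embeddingStore = embeddingStore;
    this.orderApi = orderApi;
  }

  public String generateWithTools(String userQuery) {
    // Define tools the AI can call
    List<ToolSpecification> tools = List.of(
      searchDocumentation(),
      queryLiveOrders(),
      checkInventory()
    );

    List<ChatMessage> messages = new ArrayList<>();
    messages.add(UserMessage.from(userQuery));

    for (int round = 0; round < MAX_TOOL_ROUNDS; round++) {
      // Let the AI decide which tools to use
      ChatResponse response = chatModel.chat(ChatRequest.builder()
        .messages(messages)
        .toolSpecifications(tools)
        .build());
      AiMessage aiMessage = response.aiMessage();
      messages.add(aiMessage);

      // No tool calls means the model has produced its final answer
      if (!aiMessage.hasToolExecutionRequests()) {
        return aiMessage.text();
      }

      // Execute each requested tool and feed the result back to the LLM
      for (ToolExecutionRequest call : aiMessage.toolExecutionRequests()) {
        String result = executeTool(call);
        messages.add(ToolExecutionResultMessage.from(call, result));
      }
    }
    throw new IllegalStateException("Tool-calling did not converge");
  }

  private ToolSpecification queryLiveOrders() {
    return ToolSpecification.builder()
      .name("queryLiveOrders")
      .description("Fetch current orders and their status")
      // .parameters(...) takes the JSON schema for the tool's arguments
      .build();
  }

  // searchDocumentation(), checkInventory() and executeTool(...) omitted for brevity
}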

This is how I reduced support escalations by 40% — the RAG could answer "What's the status of my order?" by calling the live API, not just the docs.
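
The piece that turns a model's tool call into a real method invocation can be a small registry. Here is a minimal sketch in plain Java, assuming tools take and return strings (in practice the arguments arrive as JSON); the ToolRegistry class and the handler shown are hypothetical stand-ins for calls such as the live order API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class ToolRegistry {

  private final Map<String, Function<String, String>> tools = new LinkedHashMap<>();

  void register(String name, Function<String, String> handler) {
    tools.put(name, handler);
  }

  // Runs the named tool. Failures come back as text so the model can react
  // to them instead of the whole request crashing.
  String execute(String name, String arguments) {
    Function<String, String> handler = tools.get(name);
    if (handler == null) {
      return "ERROR: unknown tool '" + name + "'";
    }
    try {
      return handler.apply(arguments);
    } catch (RuntimeException e) {
      return "ERROR: " + name + " failed: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    ToolRegistry registry = new ToolRegistry();
    // Hypothetical handler standing in for a call to the live order API
    registry.register("queryLiveOrders", orderId -> "order " + orderId + ": shipped");
    System.out.println(registry.execute("queryLiveOrders", "A-17"));
  }
}
```

Returning errors as strings is a design choice: the model gets a chance to apologize or try another tool, while an exception would end the conversation turn.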

Part 5: Production Considerations

Caching

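Embedding the same query twice wastes API calls. Cache query embeddings:

Java

@Service
@CacheConfig(cacheNames = "queryEmbeddings")
public class CachedRagService {

  private final EmbeddingModel embeddingModel;

  public CachedRagService(EmbeddingModel embeddingModel) {
    this.embeddingModel = embeddingModel;
  }

  // Keyed by the query string; requires @EnableCaching on a configuration class
  @Cacheable
  public Embedding embedQuery(String query) {
    return embeddingModel.embed(query).content();
  }
}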
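
If you would rather not depend on Spring's cache abstraction, a bounded in-memory cache does the same job. Here is a minimal sketch in plain Java, assuming least-recently-used eviction and a key that ignores extra whitespace; the QueryEmbeddingCache class and the embedder function passed to it are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class QueryEmbeddingCache {

  private final int capacity;
  private final Function<String, float[]> embedder;
  private final Map<String, float[]> cache;
  private int misses = 0;

  QueryEmbeddingCache(int capacity, Function<String, float[]> embedder) {
    this.capacity = capacity;
    this.embedder = embedder;
    // accessOrder=true keeps entries ordered from least to most recently used
    this.cache = new LinkedHashMap<String, float[]>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, float[]> eldest) {
        return size() > QueryEmbeddingCache.this.capacity;
      }
    };
  }

  // Returns the cached embedding, calling the embedder only on a miss.
  synchronized float[] embed(String query) {
    String key = query.trim().replaceAll("\\s+", " ");
    float[] cached = cache.get(key);
    if (cached == null) {
      misses++;
      cached = embedder.apply(key);
      cache.put(key, cached);
    }
    return cached;
  }

  synchronized int misses() {
    return misses;
  }

  public static void main(String[] args) {
    // Fake embedder: a real one would call the embedding API
    QueryEmbeddingCache cache = new QueryEmbeddingCache(100, q -> new float[] {q.length()});
    cache.embed("reset password");
    cache.embed("  reset   password ");
    System.out.println("embedding calls: " + cache.misses());
  }
}
```

The miss counter doubles as a cheap metric: the ratio of misses to total calls tells you whether the cache is earning its memory.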

Monitoring & Observability

Track:

Retrieval quality: Did the retrieved chunks actually answer the question?
Embedding costs: Especially if using paid models
Latency: Is the round-trip P99 < 500ms?

Java

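import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

@Service
public class ObservableRagService {

  private final Timer latencyTimer;
  private final DistributionSummary resultsPerQuery;

  public ObservableRagService(MeterRegistry meterRegistry) {
    // Register meters once instead of on every request
    this.latencyTimer = Timer.builder("rag.latency")
      .publishPercentiles(0.5, 0.95, 0.99)
      .register(meterRegistry);
    this.resultsPerQuery = DistributionSummary.builder("rag.retrieval.results")
      .register(meterRegistry);
  }

  public String queryAndGenerate(String query) {
    Timer.Sample sample = Timer.start();

    try {
      List<TextSegment> docs = retrieve(query);

      // Record how many chunks came back. A summary tracks the distribution
      // without creating a separate tagged counter for every possible count.
      resultsPerQuery.record(docs.size());

      return generate(query, docs);

    } finally {
      sample.stop(latencyTimer);
    }
  }

  // retrieve(...) and generate(...) omitted: same logic as RagService
}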
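
Retrieval quality is the hardest of the three to track because it needs labeled data. Here is a minimal sketch of a hit-rate-at-k check in plain Java, assuming you maintain a small hand-labeled set that maps each test question to the id of the chunk that answers it; the ids and the retriever function are hypothetical stand-ins for your own pipeline.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class RetrievalEval {

  // Fraction of questions whose expected chunk id appears in the top-k results.
  static double hitRateAtK(Map<String, String> expectedChunkByQuestion,
                           Function<String, List<String>> retriever,
                           int k) {
    if (expectedChunkByQuestion.isEmpty()) {
      return 0.0;
    }
    int hits = 0;
    for (Map.Entry<String, String> e : expectedChunkByQuestion.entrySet()) {
      List<String> retrieved = retriever.apply(e.getKey());
      List<String> topK = retrieved.subList(0, Math.min(k, retrieved.size()));
      if (topK.contains(e.getValue())) {
        hits++;
      }
    }
    return (double) hits / expectedChunkByQuestion.size();
  }

  public static void main(String[] args) {
    // Canned results standing in for real vector-store searches
    Map<String, List<String>> canned = Map.of(
      "How do I reset my password?", List.of("c3", "c1"),
      "What is the refund policy?", List.of("c2", "c4"));
    Map<String, String> expected = Map.of(
      "How do I reset my password?", "c1",
      "What is the refund policy?", "c9");
    System.out.println(hitRateAtK(expected, canned::get, 2));
  }
}
```

Run it against the same question set after every change to chunk size, overlap, or embedding model, so you can tell whether the change helped.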

Cost Optimization

Batch embedding: Embed multiple docs at once (cheaper than one-by-one)
Use smaller models: OpenAI's text-embedding-3-small costs $0.02 per 1M tokens, which is pennies for most workloads
Cache aggressively: Reduce redundant API calls
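
Batching can be as simple as partitioning the chunk list before calling the embedding API. A minimal sketch in plain Java; the batch size in the example is arbitrary, so check your provider's per-request limits before picking one.

```java
import java.util.ArrayList;
import java.util.List;

final class Batching {

  // Splits items into consecutive batches of at most batchSize elements.
  static <T> List<List<T>> partition(List<T> items, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    List<List<T>> batches = new ArrayList<>();
    for (int start = 0; start < items.size(); start += batchSize) {
      int end = Math.min(start + batchSize, items.size());
      batches.add(new ArrayList<>(items.subList(start, end)));
    }
    return batches;
  }

  public static void main(String[] args) {
    List<String> chunks = List.of("chunk-1", "chunk-2", "chunk-3", "chunk-4", "chunk-5");
    // Each inner list would become one embedding request
    partition(chunks, 2).forEach(System.out::println);
  }
}
```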

Common Pitfalls

Chunk size too large: each chunk mixes unrelated topics, so its embedding blurs them together and matches no query well.
No overlap: an answer that straddles a chunk boundary gets split in two. Use 15–20% overlap.
Stale docs: RAG is only as good as your source docs. Re-index whenever they change.
Ignoring retrieval quality: blindly returning top-k chunks. Measure actual accuracy (does the retrieved chunk answer the question?).
No monitoring: you can't improve what you don't measure.

Wrapping Up

RAG is powerful but requires discipline. Start simple:

Ingest docs, chunk them
Embed with OpenAI (or local Ollama)
Store in Pinecone (or PostgreSQL with pgvector)
Retrieve top-5, pass to LLM
Measure retrieval quality

Once this works, add tools, caching, monitoring. Don't over-engineer upfront.

The PCB assistant I built processes 500+ queries/day on about $50/month of OpenAI API usage plus $50/month for Pinecone. It's not cheap-cheap, but it's production-viable.

Next steps: Try LangChain4j's quickstart, swap in your docs, measure retrieval quality. You'll quickly find the gaps.

Questions? Hit me up on LinkedIn or Twitter.

Ravi Kant Shukla
