Ravi Shukla

Senior Java + AI engineer. Kafka, RAG, distributed systems.

ai-ml

Building a Production RAG Pipeline with LangChain4j + Spring Boot

A complete guide to building retrieval-augmented generation (RAG) systems in Java using LangChain4j. Learn chunking strategies, embedding pipelines, vector store integration, and how to ship RAG to production.

May 5, 2026 · 18 min read
LangChain4j · RAG · Spring Boot · Vector Stores · AI/LLM

Retrieval-augmented generation (RAG) is the most practical way to add domain-specific knowledge to LLMs without fine-tuning. But building production RAG systems is harder than the hype suggests.

I've shipped a RAG assistant in production — ingesting 100+ pages of engineering docs, handling 500+ queries/day, and reducing support escalations by ~40%. Here's what actually works, and what doesn't.


Why RAG, and Why Now?

RAG solves a real problem: LLMs are trained on static data and hallucinate when they don't know the answer. Instead of retraining, you retrieve relevant documents, pass them to the LLM, and let it answer from that context.

The pipeline is simple in theory:

  1. Chunk documents into semantic units
  2. Embed chunks into vectors
  3. Store vectors in a vector database
  4. Retrieve the k most similar chunks at query time
  5. Generate answer with context

The devil is in the implementation details. Let me walk through a production-ready setup.


Architecture Overview


Part 1: Document Ingestion & Chunking

Why Chunking Matters

Shoving an entire PDF into the embedding model won't work:

  • Embedding models have input limits (512 tokens for many open-source models, 8,191 tokens for OpenAI's text-embedding-3 family)
  • Chunks that are too large dilute semantic meaning (mixing unrelated concepts)
  • Chunks that are too small miss context (a sentence alone is often ambiguous)

Chunking Strategy

I recommend recursive character splitting with overlap:

Java
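// Spring Boot config for the LangChain4j document splitter
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.model.TokenCountEstimator;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class DocumentProcessingConfig {

  // Expects a TokenCountEstimator bean (the type is named Tokenizer in
  // LangChain4j releases before 1.0) so that sizes are measured in tokens.
  @Bean
  public DocumentSplitter documentSplitter(TokenCountEstimator tokenCounter) {
    // The recursive splitter tries paragraphs first, then lines, sentences, and words
    return DocumentSplitters.recursive(
      1000,          // max tokens per chunk
      200,           // overlap tokens, to preserve context across boundaries
      tokenCounter
    );
  }
}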

Why these settings?

  • 1000 tokens: ~4 paragraphs. Large enough to be semantically complete, small enough to fit embedding models.
  • 200 token overlap: Prevents losing context at chunk boundaries. A query might match the tail of one chunk + the head of the next.
  • Recursive splitting: Tries to split on paragraphs first, then sentences, then words. Keeps natural boundaries intact.
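
To make the overlap setting concrete, here is a minimal plain-Java sketch of fixed-window chunking with overlap. It is not LangChain4j's recursive splitter (it ignores paragraph and sentence boundaries), and the class name `OverlapChunker` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class OverlapChunker {

  // Splits tokens into windows of chunkSize; each window repeats the last
  // `overlap` tokens of the previous one.
  static List<List<String>> chunk(List<String> tokens, int chunkSize, int overlap) {
    if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
      throw new IllegalArgumentException("need 0 <= overlap < chunkSize");
    }
    List<List<String>> chunks = new ArrayList<>();
    int step = chunkSize - overlap;
    for (int start = 0; start < tokens.size(); start += step) {
      int end = Math.min(start + chunkSize, tokens.size());
      chunks.add(new ArrayList<>(tokens.subList(start, end)));
      if (end == tokens.size()) {
        break;  // this window reached the end, so a further one would be redundant
      }
    }
    return chunks;
  }

  public static void main(String[] args) {
    List<String> tokens = List.of("a", "b", "c", "d", "e", "f", "g");
    // Prints [[a, b, c, d], [c, d, e, f], [e, f, g]]
    System.out.println(chunk(tokens, 4, 2));
  }
}
```

With a window of 4 and an overlap of 2, the pair "c d" appears in both the first and second chunk, so a query about text that straddles that boundary can still match a single chunk.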

Handling Different Document Types

PDFs, Markdown, HTML — they're all different:

Java
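import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.parser.apache.pdfbox.ApachePdfBoxDocumentParser;
import dev.langchain4j.data.segment.TextSegment;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Locale;

@Service
public class DocumentIngestService {

  private final DocumentSplitter documentSplitter;

  public DocumentIngestService(DocumentSplitter documentSplitter) {
    this.documentSplitter = documentSplitter;
  }

  public List<TextSegment> ingestAndChunk(MultipartFile file) throws IOException {
    byte[] content = file.getBytes();
    String filename = file.getOriginalFilename();
    if (filename == null) {
      throw new IllegalArgumentException("Missing filename");
    }
    String name = filename.toLowerCase(Locale.ROOT);

    // Route by file type
    List<Document> documents;
    if (name.endsWith(".pdf")) {
      documents = processPdf(content);
    } else if (name.endsWith(".md")) {
      documents = processMarkdown(new String(content, StandardCharsets.UTF_8));
    } else {
      throw new IllegalArgumentException("Unsupported format: " + filename);
    }

    // Chunk every document; LangChain4j calls the chunks TextSegments
    return documentSplitter.splitAll(documents);
  }

  private List<Document> processPdf(byte[] content) {
    // Requires the langchain4j-document-parser-apache-pdfbox module.
    // This returns the whole PDF as one Document; add page numbers to the
    // metadata yourself if you need them for citations.
    Document pdf = new ApachePdfBoxDocumentParser().parse(new ByteArrayInputStream(content));
    return List.of(pdf);
  }

  private List<Document> processMarkdown(String content) {
    // Placeholder: keeps the file as a single Document. To preserve structure,
    // split on # headers and store each section title as metadata.
    return List.of(Document.from(content));
  }
}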
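
The header-based Markdown split mentioned in the comments could look roughly like this. It is a sketch in plain Java, the names `MarkdownSections` and `Section` are hypothetical, and it assumes the file has no `#` lines inside fenced code blocks (those would be misread as headings).

```java
import java.util.ArrayList;
import java.util.List;

class MarkdownSections {

  // One section of a Markdown file: its heading text and the body under it.
  static final class Section {
    final String title;
    final String body;

    Section(String title, String body) {
      this.title = title;
      this.body = body;
    }
  }

  // Splits on ATX headings (1 to 6 '#' characters followed by a space).
  // Text before the first heading is returned under an empty title.
  static List<Section> split(String markdown) {
    List<Section> sections = new ArrayList<>();
    String title = "";
    StringBuilder body = new StringBuilder();
    for (String line : markdown.split("\\R", -1)) {
      if (line.matches("#{1,6} .*")) {
        addIfNotEmpty(sections, title, body);
        title = line.replaceFirst("^#{1,6} +", "").trim();
        body = new StringBuilder();
      } else {
        body.append(line).append('\n');
      }
    }
    addIfNotEmpty(sections, title, body);
    return sections;
  }

  private static void addIfNotEmpty(List<Section> out, String title, StringBuilder body) {
    String text = body.toString().trim();
    if (!title.isEmpty() || !text.isEmpty()) {
      out.add(new Section(title, text));
    }
  }
}
```

Each `Section.title` can then be stored as metadata on the resulting document, so a retrieved chunk carries the heading it came from.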

Part 2: Embedding & Vector Storage

Choosing an Embedding Model

LangChain4j supports multiple providers. Here's what I recommend:

Model                          | Cost             | Speed  | Quality   | Use Case
OpenAI text-embedding-3-small  | $0.02/1M tokens  | Fast   | Very good | Prod if budget allows
text-embedding-3-large         | $0.13/1M tokens  | Fast   | Excellent | High-accuracy needs
Ollama (local, free)           | $0               | Slower | Good      | Dev/on-prem

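For production, I use OpenAI's text-embedding-3-small and budget about $50/month for the API. Two caveats on the arithmetic: 500 queries/day with 5 retrieved chunks of ~1000 tokens each is ~2.5M tokens per day, not per month, and those retrieved chunks are billed as LLM input, since only the query text is embedded at request time. At $0.02 per 1M tokens, embedding 2.5M tokens costs about $0.05, so the embedding calls are a small share of that budget.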

Java
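import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class EmbeddingConfig {

  @Bean
  public EmbeddingModel embeddingModel(
      @Value("${openai.api.key}") String apiKey) {
    // LangChain4j model classes are created through builders
    return OpenAiEmbeddingModel.builder()
      .apiKey(apiKey)
      .modelName("text-embedding-3-small")
      .build();
  }
}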
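
As a quick sanity check on embedding spend, the arithmetic is tokens divided by one million, times the per-million price from the table above. The sketch below is plain Java; the per-page and per-query token counts are assumptions for illustration, not measurements from my system.

```java
class EmbeddingCost {

  // Cost in USD for embedding `tokens` tokens at `usdPerMillionTokens`.
  static double usd(long tokens, double usdPerMillionTokens) {
    return tokens / 1_000_000.0 * usdPerMillionTokens;
  }

  public static void main(String[] args) {
    double smallPrice = 0.02;  // text-embedding-3-small, USD per 1M tokens

    // One-off ingestion: 100 pages at an assumed ~500 tokens per page
    System.out.printf("ingestion:     $%.4f%n", usd(100L * 500, smallPrice));

    // Query side: 500 queries/day for 30 days at an assumed ~50 tokens per query
    System.out.printf("queries/month: $%.4f%n", usd(500L * 30 * 50, smallPrice));

    // 2.5M tokens, the volume quoted in the text
    System.out.printf("2.5M tokens:   $%.4f%n", usd(2_500_000L, smallPrice));
  }
}
```

Running the numbers this way makes it easy to see which line item matters: with these assumptions the embedding side stays in the cents range, and generation tokens are the ones to watch.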

Vector Store Integration

LangChain4j has connectors for Pinecone, Milvus, Weaviate, and others, all behind its EmbeddingStore interface (its name for a vector store). I use Pinecone for its managed infrastructure and near-zero ops overhead.

Java
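import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.pinecone.PineconeEmbeddingStore;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class VectorStoreConfig {

  // Requires the langchain4j-pinecone module. Builder options vary between
  // releases (older ones also need environment and projectId), so check the
  // version you are on.
  @Bean
  public EmbeddingStore<TextSegment> embeddingStore(
      @Value("${pinecone.api.key}") String apiKey) {
    return PineconeEmbeddingStore.builder()
      .apiKey(apiKey)
      .index("documentation")
      .nameSpace("prod")
      .build();
  }
}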

Storing Documents

Once you have a vector store, ingestion is straightforward:

Java
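import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import org.springframework.stereotype.Service;

import java.util.List;

@Service
public class DocumentIndexingService {

  private final EmbeddingStoreIngestor ingestor;

  public DocumentIndexingService(DocumentSplitter splitter,
                                 EmbeddingModel embeddingModel,
                                 EmbeddingStore<TextSegment> embeddingStore) {
    this.ingestor = EmbeddingStoreIngestor.builder()
      .documentSplitter(splitter)
      .embeddingModel(embeddingModel)
      .embeddingStore(embeddingStore)
      .build();
  }

  public void indexDocuments(List<Document> docs) {
    // Splits into chunks, embeds them in batch, and adds them to the store
    ingestor.ingest(docs);

    // Index now has (segment text, embedding vector, metadata)
  }
}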

Part 3: Retrieval & Prompt Engineering

Semantic Search

Retrieval is the heart of RAG. You embed the user's query and find k nearest neighbors in the vector store.

Java
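import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.EmbeddingSearchRequest;
import dev.langchain4j.store.embedding.EmbeddingStore;
import org.springframework.stereotype.Service;

import java.util.List;
import java.util.stream.Collectors;

@Service
public class RagService {

  private final EmbeddingModel embeddingModel;
  private final EmbeddingStore<TextSegment> embeddingStore;
  private final ChatModel chatModel;  // named ChatLanguageModel before LangChain4j 1.0

  public RagService(EmbeddingModel embeddingModel,
                    EmbeddingStore<TextSegment> embeddingStore,
                    ChatModel chatModel) {
    this.embeddingModel = embeddingModel;
    this.embeddingStore = embeddingStore;
    this.chatModel = chatModel;
  }

  public String queryAndGenerate(String userQuery) {
    // Step 1: Embed the query and retrieve relevant chunks
    Embedding queryEmbedding = embeddingModel.embed(userQuery).content();
    EmbeddingSearchRequest request = EmbeddingSearchRequest.builder()
      .queryEmbedding(queryEmbedding)
      .maxResults(5)   // Get top 5 chunks
      .minScore(0.75)  // Filter by relevance
      .build();

    List<EmbeddingMatch<TextSegment>> matches = embeddingStore.search(request).matches();

    // Step 2: Build context
    String context = matches.stream()
      .map(match -> match.embedded().text())
      .collect(Collectors.joining("\n\n---\n\n"));

    // Step 3: Generate answer with context
    // (chat(...) is the 1.x method name; earlier releases use generate(...))
    return chatModel.chat(buildPrompt(userQuery, context));
  }

  private String buildPrompt(String query, String context) {
    // Java text blocks do not interpolate ${...}, so fill the slots with formatted()
    return """
      You are a helpful technical assistant.

      Based on the following documentation, answer the user's question.
      If the documentation doesn't contain enough information, say so.

      Documentation:
      %s

      User question: %s

      Answer:
      """.formatted(context, query);
  }
}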
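
Under the hood, "k nearest neighbors" usually means ranking stored vectors by cosine similarity to the query vector. The plain-Java sketch below shows that ranking with a brute-force scan; real vector databases use approximate indexes instead, and the class name `NearestNeighbors` is made up for illustration. It assumes all vectors have the same length and are non-zero.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class NearestNeighbors {

  // Cosine similarity: dot product divided by the product of the vector lengths.
  static double cosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  // Returns the indexes of the k stored vectors most similar to the query,
  // best first, dropping any whose similarity is below minScore.
  static List<Integer> topK(double[] query, List<double[]> stored, int k, double minScore) {
    List<Integer> candidates = new ArrayList<>();
    for (int i = 0; i < stored.size(); i++) {
      if (cosine(query, stored.get(i)) >= minScore) {
        candidates.add(i);
      }
    }
    candidates.sort(
      Comparator.comparingDouble((Integer i) -> cosine(query, stored.get(i))).reversed());
    return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
  }
}
```

Some stores report a rescaled relevance score rather than raw cosine similarity, so it is worth checking what a threshold like 0.75 means for the store you use before tuning it.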

Improving Retrieval Quality

If your RAG is returning irrelevant chunks, try:

  1. Re-rank — Use a second model (e.g., cross-encoder) to re-rank the top k results. Expensive but accurate.
  2. Query expansion — Transform the user's query into 3–5 variations and search for all of them.
  3. Metadata filtering — If your docs have metadata (source, date, category), filter before search.
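
Java
// Query expansion: methods for a service that holds embeddingModel,
// embeddingStore, and chatModel fields
public List<TextSegment> retrieveWithExpansion(String query) {
  List<String> expandedQueries = generateQueryVariations(query);

  // Deduplicate by chunk text, keeping the best score seen for each chunk
  Map<String, EmbeddingMatch<TextSegment>> best = new LinkedHashMap<>();
  for (String q : expandedQueries) {
    EmbeddingSearchRequest request = EmbeddingSearchRequest.builder()
      .queryEmbedding(embeddingModel.embed(q).content())
      .maxResults(5)
      .build();
    for (EmbeddingMatch<TextSegment> match : embeddingStore.search(request).matches()) {
      best.merge(match.embedded().text(), match,
        (a, b) -> a.score() >= b.score() ? a : b);
    }
  }

  // Highest-scoring chunks first, then limit
  return best.values().stream()
    .sorted(Comparator.comparingDouble(
      (EmbeddingMatch<TextSegment> m) -> m.score()).reversed())
    .limit(5)
    .map(EmbeddingMatch::embedded)
    .collect(Collectors.toList());
}

private List<String> generateQueryVariations(String query) {
  // Use the LLM to generate variations, one per line
  String prompt = "Generate 3 alternative phrasings of this question, "
    + "one per line, with no numbering:\n" + query;
  return chatModel.chat(prompt).lines()
    .map(String::strip)
    .filter(line -> !line.isEmpty())
    .collect(Collectors.toList());
}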
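
When the expanded queries return overlapping result lists, one common way to merge them is reciprocal rank fusion (RRF). This is an alternative to the snippet above, not what it does: each list contributes 1 / (k + rank) to a chunk's score, so chunks that rank well across several phrasings rise to the top. The sketch is plain Java, works on chunk ids, and uses a hypothetical class name `RankFusion`; k = 60 is a commonly used constant, not something I tuned.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RankFusion {

  // Reciprocal rank fusion over ranked lists of ids. Rank starts at 1.
  static List<String> fuse(List<List<String>> rankedLists, int k, int limit) {
    Map<String, Double> scores = new HashMap<>();
    for (List<String> ranked : rankedLists) {
      for (int i = 0; i < ranked.size(); i++) {
        scores.merge(ranked.get(i), 1.0 / (k + i + 1), Double::sum);
      }
    }
    List<String> ids = new ArrayList<>(scores.keySet());
    // Sort by fused score, highest first; break ties by id for a stable result
    ids.sort((a, b) -> {
      int byScore = Double.compare(scores.get(b), scores.get(a));
      return byScore != 0 ? byScore : a.compareTo(b);
    });
    return new ArrayList<>(ids.subList(0, Math.min(limit, ids.size())));
  }

  public static void main(String[] args) {
    List<List<String>> results = List.of(
      List.of("A", "B", "C"),
      List.of("B", "A", "D"),
      List.of("B", "E"));
    // Prints [B, A, E]
    System.out.println(fuse(results, 60, 3));
  }
}
```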

Part 4: Agentic Tool-Calling (Optional, Advanced)

Simple RAG retrieves static docs. But what if the user needs live data?

With agentic tool-calling, the RAG can call your APIs to fetch current information:

Java
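import dev.langchain4j.agent.tool.Tool;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.service.AiServices;
import org.springframework.stereotype.Service;

@Service
public class AgenticRagService {

  // AiServices generates the implementation, including the loop that runs
  // tool calls and feeds their results back to the LLM
  interface Assistant {
    String chat(String userMessage);
  }

  private final Assistant assistant;

  public AgenticRagService(ChatModel chatModel,
                           RagService ragService,
                           OrderApiClient orderApi) {
    this.assistant = AiServices.builder(Assistant.class)
      .chatModel(chatModel)  // chatLanguageModel(...) before LangChain4j 1.0
      .tools(new SupportTools(ragService, orderApi))
      .build();
  }

  public String generateWithTools(String userQuery) {
    // The AI decides which tools to use; LangChain4j executes them and
    // re-prompts until the model produces a final answer
    return assistant.chat(userQuery);
  }

  // Tools the AI can call. OrderApiClient is the application's own client for
  // the live order API; getOrderStatus(...) is an illustrative method name.
  public static class SupportTools {

    private final RagService ragService;
    private final OrderApiClient orderApi;  // live data source

    SupportTools(RagService ragService, OrderApiClient orderApi) {
      this.ragService = ragService;
      this.orderApi = orderApi;
    }

    @Tool("Search the engineering documentation")
    public String searchDocumentation(String question) {
      return ragService.queryAndGenerate(question);
    }

    @Tool("Fetch the current status of an order")
    public String queryLiveOrders(String orderId) {
      return orderApi.getOrderStatus(orderId);
    }
  }
}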

This is how I reduced support escalations by 40% — the RAG could answer "What's the status of my order?" by calling the live API, not just the docs.
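
The dispatch step behind tool-calling is simple: the model returns a tool name plus arguments, and your code looks the tool up and runs it. The plain-Java sketch below shows only that step, with a hypothetical `ToolRegistry` class and single-string arguments as simplifying assumptions; frameworks handle argument schemas and the re-prompt loop for you.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class ToolRegistry {

  private final Map<String, Function<String, String>> tools = new LinkedHashMap<>();

  void register(String name, Function<String, String> tool) {
    tools.put(name, tool);
  }

  // Runs the tool the model asked for. Unknown names and tool failures come
  // back as error strings, so they can be fed to the model instead of
  // failing the whole request.
  String execute(String toolName, String argument) {
    Function<String, String> tool = tools.get(toolName);
    if (tool == null) {
      return "ERROR: unknown tool '" + toolName + "'";
    }
    try {
      return tool.apply(argument);
    } catch (RuntimeException e) {
      return "ERROR: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    ToolRegistry registry = new ToolRegistry();
    // Stand-in for a live API call
    registry.register("queryLiveOrders", orderId -> "order " + orderId + ": SHIPPED");
    System.out.println(registry.execute("queryLiveOrders", "42"));
    System.out.println(registry.execute("checkWeather", "Mumbai"));
  }
}
```

Returning errors as text matters in practice: models sometimes ask for tools that do not exist, and a readable error gives them a chance to recover on the next turn.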


Part 5: Production Considerations

Caching

Embedding the same query twice wastes API costs. Cache embeddings:

Java
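@Service
@CacheConfig(cacheNames = "queryEmbeddings")  // needs @EnableCaching on a @Configuration class
public class CachedRagService {
  private final EmbeddingModel embeddingModel;
  private final EmbeddingStore<TextSegment> embeddingStore;

  public CachedRagService(EmbeddingModel model, EmbeddingStore<TextSegment> store) {
    this.embeddingModel = model;
    this.embeddingStore = store;
  }

  @Cacheable  // keyed on the query string, so a repeated query skips the embedding call
  public List<EmbeddingMatch<TextSegment>> retrieveDocuments(String query) {
    Embedding queryEmbedding = embeddingModel.embed(query).content();
    return embeddingStore.search(EmbeddingSearchRequest.builder()
      .queryEmbedding(queryEmbedding).maxResults(5).build()).matches();
  }
}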
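
A cache keyed on the raw query string misses trivial variants ("How do I reset?" vs "how do i  reset?"). One option is to normalize the key before lookup. The sketch below is a small plain-Java LRU cache that does this; `QueryCache` is a hypothetical name, and in a Spring app you would more likely apply the same normalization in a cache key expression.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

class QueryCache<V> {

  private final int maxEntries;
  private final Map<String, V> cache;

  QueryCache(int maxEntries) {
    this.maxEntries = maxEntries;
    // accessOrder = true keeps the least recently used entry first
    this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        return size() > QueryCache.this.maxEntries;
      }
    };
  }

  // Trims, lower-cases, and collapses whitespace so trivial variants share a key.
  static String normalize(String query) {
    return query.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
  }

  // Returns the cached value for the normalized query, loading it on a miss.
  synchronized V get(String query, Function<String, V> loader) {
    String key = normalize(query);
    V value = cache.get(key);
    if (value == null) {
      value = loader.apply(query);
      cache.put(key, value);
    }
    return value;
  }
}
```

Normalization trades a little precision for hit rate; if case or spacing can change the meaning of a query in your domain, keep the raw string as the key.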

Monitoring & Observability

Track:

  • Retrieval quality: Did the retrieved chunks actually answer the question?
  • Embedding costs: Especially if using paid models
  • Latency: Is the round-trip P99 < 500ms?
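
Java
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

@Service
public class ObservableRagService {

  private final Timer latency;
  private final DistributionSummary resultsPerQuery;

  public ObservableRagService(MeterRegistry meterRegistry) {
    // Register meters once, not on every request
    this.latency = Timer.builder("rag.latency")
      .publishPercentiles(0.5, 0.95, 0.99)
      .register(meterRegistry);
    // A distribution summary records how many chunks came back per query.
    // Putting the count in a counter tag would create one time series per value.
    this.resultsPerQuery = DistributionSummary.builder("rag.retrieval.results")
      .register(meterRegistry);
  }

  public String queryAndGenerate(String query) {
    // retrieve(...) and generate(...) stand for the retrieval and generation
    // steps shown earlier
    return latency.record(() -> {
      var docs = retrieve(query);
      resultsPerQuery.record(docs.size());
      return generate(query, docs);
    });
  }
}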
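
If you want to check a latency target such as P99 offline, for example from a batch of logged timings, the nearest-rank definition is easy to compute by hand. This is a plain-Java sketch with a made-up class name `Percentiles`; monitoring libraries typically publish approximations of the same quantity.

```java
import java.util.Arrays;

class Percentiles {

  // Nearest-rank percentile: the smallest sample such that at least p percent
  // of all samples are less than or equal to it. p must be in (0, 100].
  static long percentile(long[] samples, double p) {
    if (samples.length == 0 || p <= 0 || p > 100) {
      throw new IllegalArgumentException("need samples and 0 < p <= 100");
    }
    long[] sorted = samples.clone();
    Arrays.sort(sorted);
    int rank = (int) Math.ceil(p * sorted.length / 100.0);
    return sorted[rank - 1];
  }

  public static void main(String[] args) {
    long[] latenciesMs = {120, 80, 450, 95, 300};
    System.out.println("p50 = " + percentile(latenciesMs, 50) + " ms");  // p50 = 120 ms
    System.out.println("p99 = " + percentile(latenciesMs, 99) + " ms");  // p99 = 450 ms
  }
}
```

With only a handful of samples, P99 is just the slowest request, which is why tail percentiles need a reasonable sample size before they mean much.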

Cost Optimization

  • Batch embedding: Embed multiple chunks per request (fewer round trips and less rate-limit pressure than one request per chunk)
  • Use smaller models: OpenAI's text-embedding-3-small costs $0.02 per 1M tokens, so even 100M tokens a month is about $2
  • Cache aggressively: Reduce redundant API calls
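
Batching mostly comes down to partitioning the chunk list before calling the embedding API. Here is a minimal plain-Java sketch; `Batches` is a hypothetical helper, and the right batch size depends on your provider's request limits. LangChain4j's `EmbeddingModel.embedAll(...)` accepts a list of segments, so each batch could be passed to it as one call.

```java
import java.util.ArrayList;
import java.util.List;

class Batches {

  // Splits items into consecutive batches of at most batchSize elements.
  static <T> List<List<T>> partition(List<T> items, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    List<List<T>> batches = new ArrayList<>();
    for (int start = 0; start < items.size(); start += batchSize) {
      int end = Math.min(start + batchSize, items.size());
      batches.add(new ArrayList<>(items.subList(start, end)));
    }
    return batches;
  }

  public static void main(String[] args) {
    List<String> chunks = List.of("c1", "c2", "c3", "c4", "c5");
    // Prints [[c1, c2], [c3, c4], [c5]]: three requests instead of five
    System.out.println(partition(chunks, 2));
  }
}
```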

Common Pitfalls

  1. Chunk size too large: A chunk that mixes unrelated topics produces an embedding that represents none of them well.
  2. No overlap: An answer that straddles a chunk boundary gets split across two chunks and matches neither well. Use 15–20% overlap.
  3. Stale docs: RAG is only as good as your source docs. Update them regularly.
  4. Ignoring retrieval quality: Blindly returning top-k chunks. Measure actual accuracy (does the retrieved doc answer the question?).
  5. No monitoring: You can't improve what you don't measure.
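
One simple way to measure retrieval quality is hit rate at k: for a small labeled set of questions, check how often the chunk that actually answers the question appears in the top k results. The sketch below is plain Java and assumes you have a hand-labeled map from query to expected chunk id; the class name `RetrievalEval` and the ids are illustrative.

```java
import java.util.List;
import java.util.Map;

class RetrievalEval {

  // Fraction of queries whose expected chunk id appears in the top k retrieved ids.
  static double hitRateAtK(Map<String, String> expectedByQuery,
                           Map<String, List<String>> retrievedByQuery,
                           int k) {
    if (expectedByQuery.isEmpty()) {
      return 0.0;
    }
    int hits = 0;
    for (Map.Entry<String, String> entry : expectedByQuery.entrySet()) {
      List<String> retrieved = retrievedByQuery.getOrDefault(entry.getKey(), List.of());
      List<String> topK = retrieved.subList(0, Math.min(k, retrieved.size()));
      if (topK.contains(entry.getValue())) {
        hits++;
      }
    }
    return (double) hits / expectedByQuery.size();
  }

  public static void main(String[] args) {
    Map<String, String> expected = Map.of("q1", "c1", "q2", "c7", "q3", "c3", "q4", "c9");
    Map<String, List<String>> retrieved = Map.of(
      "q1", List.of("c1", "c2"),
      "q2", List.of("c4", "c7", "c5"),
      "q3", List.of("c8"));
    System.out.println(hitRateAtK(expected, retrieved, 2));  // 0.5
  }
}
```

Even 30 to 50 labeled questions are enough to tell whether a change to chunk size or overlap helped or hurt.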

Wrapping Up

RAG is powerful but requires discipline. Start simple:

  1. Ingest docs, chunk them
  2. Embed with OpenAI (or local Ollama)
  3. Store in Pinecone (or PostgreSQL with pgvector)
  4. Retrieve top-5, pass to LLM
  5. Measure retrieval quality

Once this works, add tools, caching, monitoring. Don't over-engineer upfront.

The PCB assistant I built processes 500+ queries/day on a $50/month embedding API + $50/month Pinecone. It's not cheap-cheap, but it's production-viable.

Next steps: Try LangChain4j's quickstart, swap in your docs, measure retrieval quality. You'll quickly find the gaps.

Questions? Hit me up on LinkedIn or Twitter.

Ravi Kant Shukla

Senior Java + AI engineer. 9+ years in system design, Kafka, microservices, and LLM/RAG pipelines.

