Ravi Shukla
ai-ml

Deploy Your ML Model on AWS Lambda: The Complete Production Guide

Step-by-step guide to packaging a scikit-learn or PyTorch model as a Lambda function — covering cold starts, container images, model versioning, and A/B testing on AWS.

February 20, 2024 · 14 min read
AWS Lambda · ML Deployment · Docker · Python · SageMaker

Deploying ML models to production is where most data scientists hit a wall. Jupyter notebooks don't scale. Flask apps on EC2 are expensive to keep running 24/7. In this guide, we'll deploy a real ML model to AWS Lambda — serverless, pay-per-call, and scales to zero when idle.

By the end, you'll have:

  • A packaged ML model serving predictions via Lambda
  • Cold start mitigation strategies
  • Blue/green deployments for zero-downtime model updates
  • A/B testing between model versions

Why Lambda for ML Inference?

Lambda works best for:

  • Request-driven inference (handle requests as they arrive, not continuous streams)
  • Low-to-medium traffic (< 1000 requests/second)
  • Cost-sensitive workloads (you pay only when running)

Lambda is not ideal for:

  • Real-time video/audio processing (timeout limits)
  • Very large models (>10GB) — use SageMaker Endpoint instead
  • Sub-10ms latency requirements (cold starts kill this)

For our example, we'll deploy a customer churn prediction model (scikit-learn + tabular data).


Step 1: Train and Save the Model

Python
# train.py
import os  # needed for os.path.getsize below

import joblib
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Load data
df = pd.read_csv('customer_churn.csv')
X = df.drop('churn', axis=1)
y = df['churn']

# Train pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingClassifier(
        n_estimators=200,
        max_depth=4,
        learning_rate=0.1,
        random_state=42
    ))
])

pipeline.fit(X, y)

# Save
joblib.dump(pipeline, 'model.joblib')
print(f"Model size: {os.path.getsize('model.joblib') / 1024:.1f} KB")

Our model is ~2MB — well within Lambda's 250MB unzipped limit.


Step 2: Build the Lambda Handler

Python
# lambda_function.py
import json
import os
import boto3
import joblib
import numpy as np
from io import BytesIO

# ─── Model loading (module-level for warm reuse) ──────────────────────────────

_model = None

def get_model():
    """Load model once and cache in Lambda container memory."""
    global _model
    if _model is not None:
        return _model

    # Load from S3 (allows updates without redeployment)
    s3 = boto3.client('s3')
    model_bucket = os.environ['MODEL_BUCKET']
    model_key = os.environ['MODEL_KEY']  # e.g. "models/churn/v3/model.joblib"

    print(f"Loading model from s3://{model_bucket}/{model_key}")
    buffer = BytesIO()
    s3.download_fileobj(model_bucket, model_key, buffer)
    buffer.seek(0)

    _model = joblib.load(buffer)
    print("Model loaded and cached in memory")
    return _model


# ─── Feature engineering ──────────────────────────────────────────────────────

FEATURE_COLUMNS = [
    'tenure_months', 'monthly_charges', 'total_charges',
    'num_products', 'has_tech_support', 'contract_type_encoded',
    'payment_method_encoded',
]

def extract_features(payload: dict) -> np.ndarray:
    """Extract and validate features from the API payload."""
    missing = [f for f in FEATURE_COLUMNS if f not in payload]
    if missing:
        raise ValueError(f"Missing required features: {missing}")

    return np.array([[payload[f] for f in FEATURE_COLUMNS]])


# ─── Main handler ─────────────────────────────────────────────────────────────

def lambda_handler(event, context):
    try:
        # Parse body (API Gateway sends JSON as string)
        if isinstance(event.get('body'), str):
            body = json.loads(event['body'])
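        else:
            # Direct invocations carry the payload at the top level;
            # fall back to the event itself when 'body' is absent or None
            body = event.get('body') or event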

        # Extract features
        features = extract_features(body)

        # Predict
        model = get_model()
        probability = model.predict_proba(features)[0][1]  # P(churn=1)
        prediction = int(probability >= 0.5)

        return {
            'statusCode': 200,
            'headers': {'Content-Type': 'application/json'},
            'body': json.dumps({
                'prediction': prediction,
                'probability': round(float(probability), 4),
                'model_version': os.environ.get('MODEL_VERSION', 'unknown'),
            })
        }

    except ValueError as e:
        return {
            'statusCode': 400,
            'body': json.dumps({'error': str(e)})
        }
    except Exception as e:
        print(f"Unexpected error: {e}")
        return {
            'statusCode': 500,
            'body': json.dumps({'error': 'Internal server error'})
        }

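Key pattern: caching the model in a module-level global (outside the handler) means it is loaded only once per execution environment, on the first invocation after a cold start, not on every invocation. Warm Lambda containers reuse the global _model. Because get_model() is called lazily, that first request pays the S3 download and deserialization cost; calling get_model() at import time instead moves the cost into Lambda's init phase.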
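
The handler's extract_features checks only for missing keys, so a string or null value would surface as a 500 from inside scikit-learn rather than a 400. Below is a minimal sketch of stricter validation, assuming all seven features are numeric. The helper name validate_features is hypothetical, and FEATURE_COLUMNS is copied from the handler above.

```python
import math

FEATURE_COLUMNS = [
    'tenure_months', 'monthly_charges', 'total_charges',
    'num_products', 'has_tech_support', 'contract_type_encoded',
    'payment_method_encoded',
]

def validate_features(payload: dict) -> list:
    """Return feature values in model order, or raise ValueError (mapped to HTTP 400)."""
    missing = [f for f in FEATURE_COLUMNS if f not in payload]
    if missing:
        raise ValueError(f"Missing required features: {missing}")

    row = []
    for name in FEATURE_COLUMNS:
        value = payload[name]
        # bool is a subclass of int, so JSON true/false is accepted as 1/0
        if not isinstance(value, (int, float)):
            raise ValueError(
                f"Feature '{name}' must be numeric, got {type(value).__name__}"
            )
        try:
            number = float(value)
        except OverflowError:
            raise ValueError(f"Feature '{name}' is out of range") from None
        # json.loads accepts NaN and Infinity by default, so reject them here
        if not math.isfinite(number):
            raise ValueError(f"Feature '{name}' must be finite")
        row.append(number)
    return row
```

In the handler this would slot in as np.array([validate_features(body)]), so bad input still reaches the existing ValueError branch.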


Step 3: Dockerise for Large Dependencies

scikit-learn + numpy + pandas typically exceed Lambda's 50MB limit for zipped deployment packages uploaded directly. Use a container image (up to 10GB) instead:

Dockerfile
# Dockerfile
FROM public.ecr.aws/lambda/python:3.11

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy handler
COPY lambda_function.py .

CMD ["lambda_function.lambda_handler"]

# requirements.txt
scikit-learn==1.4.2
numpy==1.26.4
pandas==2.2.2
joblib==1.4.2
boto3==1.34.0

Build and push to ECR:

Shell
# Build
docker build -t churn-model .

# Authenticate with ECR
# (123456789012 is a placeholder; use your own 12-digit AWS account ID)
aws ecr get-login-password --region ap-south-1 | \
  docker login --username AWS \
  --password-stdin 123456789012.dkr.ecr.ap-south-1.amazonaws.com

# Tag and push
docker tag churn-model:latest \
  123456789012.dkr.ecr.ap-south-1.amazonaws.com/churn-model:v3
docker push 123456789012.dkr.ecr.ap-south-1.amazonaws.com/churn-model:v3
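
Before pushing, the image can be smoke-tested locally. AWS's Lambda base images ship with the Runtime Interface Emulator, so a container started with docker run -p 9000:8080 churn-model accepts invocations over HTTP. The sketch below assumes that port mapping; the helper names are hypothetical. The container also needs MODEL_BUCKET, MODEL_KEY, and AWS credentials passed in (for example with -e flags) for the model download to succeed.

```python
import json
import urllib.request

# Invocation path exposed by the Runtime Interface Emulator,
# assuming the container's port 8080 is mapped to localhost:9000
RIE_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"

def build_invoke_request(features: dict, url: str = RIE_URL) -> urllib.request.Request:
    """Wrap features in an API Gateway-style event (JSON string under 'body')."""
    event = {"body": json.dumps(features)}
    return urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def invoke_local(features: dict) -> dict:
    """Send the event to the local container and decode the handler's response."""
    with urllib.request.urlopen(build_invoke_request(features), timeout=30) as resp:
        return json.loads(resp.read())
```

Calling invoke_local with a full feature dict should return the same statusCode/body structure the handler produces in Lambda.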

Step 4: Lambda Configuration via Terraform

Terraform
# lambda.tf

resource "aws_lambda_function" "churn_model" {
  function_name = "churn-prediction"
  package_type  = "Image"
  image_uri     = "${aws_ecr_repository.model.repository_url}:v3"
  role          = aws_iam_role.lambda_exec.arn

  # Memory: more memory = more CPU = faster inference
  memory_size = 1024  # 1GB — tune based on model size
  timeout     = 30    # seconds

  environment {
    variables = {
      MODEL_BUCKET  = aws_s3_bucket.models.bucket
      MODEL_KEY     = "models/churn/v3/model.joblib"
      MODEL_VERSION = "v3"
    }
  }

  # Publish a version for blue/green deployments
  publish = true
}

# Lambda Alias for stable API endpoint
resource "aws_lambda_alias" "production" {
  name             = "production"
  function_name    = aws_lambda_function.churn_model.function_name
  function_version = aws_lambda_function.churn_model.version
}

Step 5: Cold Start Mitigation

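Lambda cold starts for ML models can be 3–8 seconds: the container image must be initialized, heavy libraries such as scikit-learn imported, and the model downloaded from S3 and deserialized. Strategies: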

1. Provisioned Concurrency

Terraform
resource "aws_lambda_provisioned_concurrency_config" "warm" {
  function_name                  = aws_lambda_function.churn_model.function_name
  qualifier                      = aws_lambda_alias.production.name
  provisioned_concurrent_executions = 5  # Keep 5 containers warm
}

Cost: ~₹2,500/month for 5 warm containers, 24/7. Justified for high-traffic APIs.

2. Scheduled Ping (Budget Option)

Python
# ping.py — runs every 5 minutes via EventBridge
import boto3
import json

def ping_lambda(event, context):
    lambda_client = boto3.client('lambda')
    lambda_client.invoke(
        FunctionName='churn-prediction:production',
        InvocationType='RequestResponse',
        Payload=json.dumps({'_warmup': True})
    )

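In the main handler, right after parsing the body and before extract_features, add:

Python
if body.get('_warmup'):
    get_model()  # load the model now so the next real request finds it cached
    return {'statusCode': 200, 'body': json.dumps({'status': 'warm'})}

A single ping keeps only one container warm; concurrent traffic beyond that still hits cold starts.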
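
To judge whether either strategy is paying off, it helps to know how often requests actually land on a cold container. Below is a minimal sketch, assuming a module-level flag that is flipped on the first invocation. The names mark_invocation, _is_cold, and _INIT_STARTED are hypothetical, and the prediction path is omitted.

```python
import json
import time

_INIT_STARTED = time.time()  # evaluated once per execution environment, at import
_is_cold = True              # flipped to False after the first invocation

def mark_invocation() -> dict:
    """Return cold-start info for the current invocation and flip the flag."""
    global _is_cold
    info = {
        'cold_start': _is_cold,
        'seconds_since_init': round(time.time() - _INIT_STARTED, 3),
    }
    _is_cold = False
    return info

def lambda_handler(event, context):
    info = mark_invocation()
    print(json.dumps(info))  # one structured log line per invocation
    # ... parse body, load model, and predict as in Step 2 ...
    return {'statusCode': 200, 'body': json.dumps(info)}
```

Counting log lines with cold_start set to true against the total gives a rough cold-start rate to weigh against the cost of provisioned concurrency.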

Step 6: A/B Testing Between Model Versions

Lambda aliases support weighted traffic routing — perfect for gradual model rollouts:

Terraform
# Route 10% to new v4 model, 90% to stable v3
# (this replaces the "production" alias definition from Step 4)
resource "aws_lambda_alias" "production" {
  name             = "production"
  function_name    = aws_lambda_function.churn_model.function_name
  function_version = "3"  # stable v3

  routing_config {
    additional_version_weights = {
      "4" = 0.10  # 10% to v4
    }
  }
}

Track which version handled each request via the model_version field in the response payload and correlate with downstream business outcomes (did the customer actually churn?).
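
The correlation step can be sketched as a simple aggregation over logged predictions joined with observed outcomes. The record layout here (model_version, prediction, actual_churn) is an assumption for illustration, and the sample rows are toy values, not real results.

```python
from collections import defaultdict

def accuracy_by_version(records):
    """Group joined prediction/outcome records by model version and score each."""
    totals = defaultdict(lambda: {'n': 0, 'correct': 0})
    for r in records:
        stats = totals[r['model_version']]
        stats['n'] += 1
        stats['correct'] += int(r['prediction'] == r['actual_churn'])
    return {
        version: {'n': s['n'], 'accuracy': s['correct'] / s['n']}
        for version, s in totals.items()
    }

# Toy records for illustration only
sample = [
    {'model_version': 'v3', 'prediction': 1, 'actual_churn': 1},
    {'model_version': 'v3', 'prediction': 0, 'actual_churn': 1},
    {'model_version': 'v4', 'prediction': 1, 'actual_churn': 1},
    {'model_version': 'v4', 'prediction': 0, 'actual_churn': 0},
]
report = accuracy_by_version(sample)
```

With only 10% of traffic on the new version, its sample size grows slowly, so the n field matters as much as the accuracy when deciding whether to shift more weight.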


Step 7: Monitoring with CloudWatch

Python
# Added to lambda_function.py: emit custom metrics from the handler
import os  # used for MODEL_VERSION below
import time

import boto3

cloudwatch = boto3.client('cloudwatch')

def record_metric(name: str, value: float, unit: str = 'None'):
    cloudwatch.put_metric_data(
        Namespace='ChurnModel',
        MetricData=[{
            'MetricName': name,
            'Value': value,
            'Unit': unit,
            'Dimensions': [
                {'Name': 'ModelVersion', 'Value': os.environ.get('MODEL_VERSION', 'unknown')}
            ]
        }]
    )

# Usage inside handler:
start = time.time()
probability = model.predict_proba(features)[0][1]
inference_ms = (time.time() - start) * 1000

record_metric('InferenceLatencyMs', inference_ms, 'Milliseconds')
record_metric('PredictionProbability', float(probability), 'None')

Set alarms on:

  • InferenceLatencyMs P99 > 500ms → something's wrong
  • Lambda Errors > 5 in 5 minutes → alert on-call
  • Lambda Throttles > 0 → scale up reserved concurrency
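
One caveat with put_metric_data is that each call is a synchronous API request made during the invocation. An alternative, swapped in here rather than taken from the code above, is CloudWatch's Embedded Metric Format: the function prints a specially structured JSON log line and CloudWatch extracts the metrics from the logs. A minimal sketch, with emf_record as a hypothetical helper name:

```python
import json
import os
import time

def emf_record(metrics: dict, units: dict, namespace: str = 'ChurnModel') -> str:
    """Build one Embedded Metric Format log line from metric name/value pairs."""
    record = {
        '_aws': {
            'Timestamp': int(time.time() * 1000),  # milliseconds since epoch
            'CloudWatchMetrics': [{
                'Namespace': namespace,
                'Dimensions': [['ModelVersion']],
                'Metrics': [
                    {'Name': name, 'Unit': units.get(name, 'None')}
                    for name in metrics
                ],
            }],
        },
        # Dimension and metric values live at the top level of the record
        'ModelVersion': os.environ.get('MODEL_VERSION', 'unknown'),
    }
    record.update(metrics)
    return json.dumps(record)

# Usage inside the handler, in place of the two record_metric calls
# (the numbers are example values)
print(emf_record(
    {'InferenceLatencyMs': 12.4, 'PredictionProbability': 0.8312},
    {'InferenceLatencyMs': 'Milliseconds'},
))
```

The metric names and namespace match the ones used above, so the same alarms would apply.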

Cost Comparison (1M predictions/month)

Approach                            Monthly Cost (₹)   Cold Start   Notes
Lambda (on-demand)                  ₹900               Yes (3–8s)   Cheapest
Lambda + Provisioned                ₹4,500             No           Best UX
ECS Fargate (always-on)             ₹6,300             No           More control
SageMaker Endpoint (ml.t3.medium)   ₹5,400             No           Managed MLOps

For most Indian startups doing <1M predictions/month, Lambda on-demand is the right choice. Add provisioned concurrency only when p99 latency starts hurting conversion.
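
To rerun the on-demand estimate for your own traffic, the Lambda bill splits into a per-request charge and a per-GB-second compute charge. The sketch below uses assumed list prices in USD as defaults, and an assumed 200 ms average duration at 1GB. It is not an attempt to reproduce the table's figures, which rest on assumptions the comparison doesn't spell out, so check current pricing for your region.

```python
def lambda_on_demand_cost_usd(
    requests: int,
    avg_duration_s: float,
    memory_gb: float,
    price_per_million_requests: float = 0.20,   # assumed list price, USD
    price_per_gb_second: float = 0.0000166667,  # assumed x86 list price, USD
) -> float:
    """Request charge plus compute charge, ignoring the free tier."""
    request_cost = requests / 1_000_000 * price_per_million_requests
    compute_cost = requests * avg_duration_s * memory_gb * price_per_gb_second
    return request_cost + compute_cost

# 1M predictions/month at an assumed 200 ms each on a 1GB function
usd = lambda_on_demand_cost_usd(1_000_000, avg_duration_s=0.2, memory_gb=1.0)
print(f"Estimated Lambda charge: ${usd:.2f}/month")
```

Supporting services such as API Gateway, ECR storage, S3, and CloudWatch are billed separately and are not covered by this function.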


Summary

Training   → Save model.joblib to S3
Packaging  → Docker container with scikit-learn deps
Deploy     → Lambda + ECR container image
Cold start → Module-level model loading + optional provisioned concurrency
A/B test   → Lambda alias weighted routing
Monitor    → CloudWatch custom metrics + alarms

