Deploy Your ML Model on AWS Lambda: The Complete Production Guide

Deploying ML models to production is where most data scientists hit a wall. Jupyter notebooks don't scale. Flask apps on EC2 are expensive to keep running 24/7. In this guide, we'll deploy a real ML model to AWS Lambda — serverless, pay-per-call, and scales to zero when idle.

By the end, you'll have:

A packaged ML model serving predictions via Lambda
Cold start mitigation strategies
Blue/green deployments for zero-downtime model updates
A/B testing between model versions

Why Lambda for ML Inference?

Lambda works best for:

Request-driven inference (score requests as they arrive, not continuous streams)
Low-to-medium traffic (< 1000 requests/second)
Cost-sensitive workloads (you pay only when running)

Lambda is not ideal for:

Real-time video/audio processing (timeout limits)
Very large models (>10GB) — use SageMaker Endpoint instead
Sub-10ms latency requirements (cold starts kill this)

For our example, we'll deploy a customer churn prediction model (scikit-learn + tabular data).

Step 1: Train and Save the Model

Python

# train.py
import os
import joblib
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Load data
df = pd.read_csv('customer_churn.csv')
X = df.drop('churn', axis=1)
y = df['churn']

# Train pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingClassifier(
        n_estimators=200,
        max_depth=4,
        learning_rate=0.1,
        random_state=42
    ))
])

pipeline.fit(X, y)

# Save
joblib.dump(pipeline, 'model.joblib')
print(f"Model size: {os.path.getsize('model.joblib') / 1024:.1f} KB")

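Our model is ~2MB, so it downloads quickly from S3 and adds little to the function's memory footprint. (Lambda's 250MB unzipped limit applies only to .zip deployment packages; a container image like the one built in Step 3 can be up to 10GB.)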
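
The handler in Step 2 reads the model from S3, so the saved file has to be uploaded first. A minimal sketch of that upload, assuming a boto3 S3 client is passed in. The helper names `model_key` and `upload_model` and the bucket name are illustrative; the key layout mirrors the `models/churn/v3/model.joblib` example used in this guide.

```python
def model_key(model_name: str, version: str, filename: str = "model.joblib") -> str:
    """Build a versioned S3 key such as models/churn/v3/model.joblib."""
    return f"models/{model_name}/{version}/{filename}"


def upload_model(s3_client, local_path: str, bucket: str,
                 model_name: str, version: str) -> str:
    """Upload a saved model file and return its s3:// URI.

    s3_client is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    key = model_key(model_name, version)
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Example usage (the bucket name is a placeholder):
#   import boto3
#   upload_model(boto3.client("s3"), "model.joblib", "my-model-bucket", "churn", "v3")
```

Keeping the version in the key means older models stay in place, which makes rolling back a matter of pointing MODEL_KEY at the previous path.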

Step 2: Build the Lambda Handler

Python

# lambda_function.py
import json
import os
import boto3
import joblib
import numpy as np
from io import BytesIO

# ─── Model loading (module-level for warm reuse) ──────────────────────────────

_model = None

def get_model():
    """Load model once and cache in Lambda container memory."""
    global _model
    if _model is not None:
        return _model

    # Load from S3 (allows updates without redeployment)
    s3 = boto3.client('s3')
    model_bucket = os.environ['MODEL_BUCKET']
    model_key = os.environ['MODEL_KEY']  # e.g. "models/churn/v3/model.joblib"

    print(f"Loading model from s3://{model_bucket}/{model_key}")
    buffer = BytesIO()
    s3.download_fileobj(model_bucket, model_key, buffer)
    buffer.seek(0)

    _model = joblib.load(buffer)
    print("Model loaded and cached in memory")
    return _model


# ─── Feature engineering ──────────────────────────────────────────────────────

FEATURE_COLUMNS = [
    'tenure_months', 'monthly_charges', 'total_charges',
    'num_products', 'has_tech_support', 'contract_type_encoded',
    'payment_method_encoded',
]

def extract_features(payload: dict) -> np.ndarray:
    """Extract and validate features from the API payload."""
    missing = [f for f in FEATURE_COLUMNS if f not in payload]
    if missing:
        raise ValueError(f"Missing required features: {missing}")

    return np.array([[payload[f] for f in FEATURE_COLUMNS]])


# ─── Main handler ─────────────────────────────────────────────────────────────

def lambda_handler(event, context):
    try:
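        # Parse body (API Gateway sends JSON as a string; direct invokes pass a dict)
        raw_body = event.get('body')
        if isinstance(raw_body, str):
            body = json.loads(raw_body)
        else:
            # Fall back to the event itself when 'body' is missing or None
            body = raw_body or event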

        # Extract features
        features = extract_features(body)

        # Predict
        model = get_model()
        probability = model.predict_proba(features)[0][1]  # P(churn=1)
        prediction = int(probability >= 0.5)

        return {
            'statusCode': 200,
            'headers': {'Content-Type': 'application/json'},
            'body': json.dumps({
                'prediction': prediction,
                'probability': round(float(probability), 4),
                'model_version': os.environ.get('MODEL_VERSION', 'unknown'),
            })
        }

    except ValueError as e:
        return {
            'statusCode': 400,
            'body': json.dumps({'error': str(e)})
        }
    except Exception as e:
        print(f"Unexpected error: {e}")
        return {
            'statusCode': 500,
            'body': json.dumps({'error': 'Internal server error'})
        }

Key pattern: the model is cached in a module-level variable (outside the handler), so it is downloaded and deserialized only on the first invocation after a cold start, not on every request. Warm Lambda containers reuse the global _model.
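
The caching behaviour can be checked locally without AWS. The sketch below reproduces the same load-once-and-cache pattern with a stand-in loader (`fake_load` is hypothetical and replaces the S3 download) and counts how often it runs across several invocations in one process.

```python
_model = None
load_count = 0  # how many times the expensive load actually ran


def fake_load():
    """Stand-in for the S3 download + joblib.load step (hypothetical)."""
    global load_count
    load_count += 1
    return {"name": "churn-model"}


def get_model():
    """Load on first use, then reuse the module-level cache."""
    global _model
    if _model is None:
        _model = fake_load()
    return _model


def handler(event, context):
    model = get_model()
    return {"statusCode": 200, "model": model["name"]}


# Three invocations in the same process, as in a warm container
for _ in range(3):
    handler({}, None)

print(load_count)  # 1: only the first invocation paid the load cost
```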

Step 3: Dockerise for Large Dependencies

scikit-learn, numpy, and pandas together can exceed Lambda's .zip deployment package limits (50MB zipped for direct upload, 250MB unzipped). Use a container image instead, which can be up to 10GB:

Dockerfile

# Dockerfile
FROM public.ecr.aws/lambda/python:3.11

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy handler
COPY lambda_function.py .

CMD ["lambda_function.lambda_handler"]

# requirements.txt
scikit-learn==1.4.2
numpy==1.26.4
pandas==2.2.2
joblib==1.4.2
boto3==1.34.0

Build and push to ECR:

Shell

# Build
docker build -t churn-model .

# Authenticate with ECR
aws ecr get-login-password --region ap-south-1 | \
  docker login --username AWS \
  --password-stdin 123456789.dkr.ecr.ap-south-1.amazonaws.com

# Tag and push
docker tag churn-model:latest \
  123456789.dkr.ecr.ap-south-1.amazonaws.com/churn-model:v3
docker push 123456789.dkr.ecr.ap-south-1.amazonaws.com/churn-model:v3
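
The image can also be exercised locally before it is pushed. The AWS Lambda Python base images include the Runtime Interface Emulator, so running the container with `docker run -p 9000:8080 churn-model` exposes an invocation endpoint on the host. A hedged sketch of a client for that endpoint, using only the standard library; the helper names and the sample feature values are made up for illustration.

```python
import json
import urllib.request

# Default invocation path served by the Runtime Interface Emulator
RIE_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"


def build_invocation(features: dict, url: str = RIE_URL) -> urllib.request.Request:
    """Wrap features in an API Gateway-style event and build the POST request."""
    event = {"body": json.dumps(features)}
    return urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke_local(features: dict) -> dict:
    """Send the event to the local container and decode the handler's response."""
    with urllib.request.urlopen(build_invocation(features), timeout=30) as resp:
        return json.loads(resp.read())


sample = {
    "tenure_months": 12, "monthly_charges": 70.5, "total_charges": 846.0,
    "num_products": 2, "has_tech_support": 0, "contract_type_encoded": 1,
    "payment_method_encoded": 2,
}
request = build_invocation(sample)
# invoke_local(sample) needs the container running, plus AWS credentials and
# MODEL_BUCKET / MODEL_KEY passed to `docker run` so the handler can reach S3.
```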

Step 4: Lambda Configuration via Terraform

Terraform

# lambda.tf

resource "aws_lambda_function" "churn_model" {
  function_name = "churn-prediction"
  package_type  = "Image"
  image_uri     = "${aws_ecr_repository.model.repository_url}:v3"
  role          = aws_iam_role.lambda_exec.arn

  # Memory: more memory = more CPU = faster inference
  memory_size = 1024  # 1GB — tune based on model size
  timeout     = 30    # seconds

  environment {
    variables = {
      MODEL_BUCKET  = aws_s3_bucket.models.bucket
      MODEL_KEY     = "models/churn/v3/model.joblib"
      MODEL_VERSION = "v3"
    }
  }

  # Publish a version for blue/green deployments
  publish = true
}

# Lambda Alias for stable API endpoint
resource "aws_lambda_alias" "production" {
  name             = "production"
  function_name    = aws_lambda_function.churn_model.function_name
  function_version = aws_lambda_function.churn_model.version
}

Step 5: Cold Start Mitigation

Lambda cold starts for ML models can take 3–8 seconds: the container has to start, import heavy libraries such as scikit-learn, and download and deserialize the model. Strategies:

1. Provisioned Concurrency

Terraform

resource "aws_lambda_provisioned_concurrency_config" "warm" {
  function_name                     = aws_lambda_function.churn_model.function_name
  qualifier                         = aws_lambda_alias.production.name
  provisioned_concurrent_executions = 5  # Keep 5 containers warm
}

Cost: ~₹2,500/month for 5 warm containers, 24/7. Justified for high-traffic APIs.

2. Scheduled Ping (Budget Option)

Python

# ping.py — runs every 5 minutes via EventBridge
import boto3
import json

def ping_lambda(event, context):
    lambda_client = boto3.client('lambda')
    lambda_client.invoke(
        FunctionName='churn-prediction:production',
        InvocationType='RequestResponse',
        Payload=json.dumps({'_warmup': True})
    )

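In the main handler, add this check right after the body is parsed, before feature extraction. A single scheduled ping keeps roughly one container warm, so this option suits low, steady traffic:

Python

if body.get('_warmup'):
    get_model()  # load the model now so the next real request finds it cached
    return {'statusCode': 200, 'body': json.dumps({'status': 'warm'})}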

Step 6: A/B Testing Between Model Versions

Lambda aliases support weighted traffic routing — perfect for gradual model rollouts:

Terraform

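# Route 10% to new v4 model, 90% to stable v3
# This replaces the alias definition from Step 4; Terraform allows only one
# resource with the address aws_lambda_alias.production.
# "3" and "4" are published Lambda version numbers, assumed here to line up
# with model v3 and v4.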
resource "aws_lambda_alias" "production" {
  name             = "production"
  function_name    = aws_lambda_function.churn_model.function_name
  function_version = "3"  # stable v3

  routing_config {
    additional_version_weights = {
      "4" = 0.10  # 10% to v4
    }
  }
}

Track which version handled each request via the model_version field in the response payload and correlate with downstream business outcomes (did the customer actually churn?).
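
To confirm the split behaves as configured, the `ExecutedVersion` field that Lambda's Invoke API returns can be tallied over a batch of calls to the alias. A minimal sketch, assuming a boto3 Lambda client is passed in; the helper name `sample_versions` is illustrative, and the function name, alias, and `_warmup` payload come from the examples in this guide.

```python
import json
from collections import Counter


def sample_versions(lambda_client, payload: dict, n: int = 100,
                    function_name: str = "churn-prediction",
                    alias: str = "production") -> Counter:
    """Invoke the alias n times and count which function version served each call.

    lambda_client is expected to be a boto3 Lambda client, e.g. boto3.client("lambda").
    """
    counts = Counter()
    for _ in range(n):
        response = lambda_client.invoke(
            FunctionName=function_name,
            Qualifier=alias,
            InvocationType="RequestResponse",
            Payload=json.dumps(payload).encode("utf-8"),
        )
        counts[response.get("ExecutedVersion", "unknown")] += 1
    return counts


# Example usage:
#   import boto3
#   print(sample_versions(boto3.client("lambda"), {"_warmup": True}, n=200))
```

With a 10% weight on version 4, roughly one call in ten should report `4`; small samples will deviate from that.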

Step 7: Monitoring with CloudWatch

Python

# In lambda_function.py, emit custom metrics
import os
import time

import boto3

cloudwatch = boto3.client('cloudwatch')

def record_metric(name: str, value: float, unit: str = 'None'):
    cloudwatch.put_metric_data(
        Namespace='ChurnModel',
        MetricData=[{
            'MetricName': name,
            'Value': value,
            'Unit': unit,
            'Dimensions': [
                {'Name': 'ModelVersion', 'Value': os.environ.get('MODEL_VERSION', 'unknown')}
            ]
        }]
    )

# Usage inside handler:
start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
probability = model.predict_proba(features)[0][1]
inference_ms = (time.perf_counter() - start) * 1000

record_metric('InferenceLatencyMs', inference_ms, 'Milliseconds')
record_metric('PredictionProbability', float(probability), 'None')

Set alarms on:

InferenceLatencyMs P99 > 500ms → something's wrong
Lambda Errors > 5 in 5 minutes → alert on-call
Lambda Throttles > 0 → raise the function's reserved concurrency or request a higher account concurrency limit
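
The first alarm in the list can be expressed as arguments to CloudWatch's `put_metric_alarm`. A sketch under stated assumptions: the alarm name, the 5-minute period, the single evaluation period, and the missing-data handling are illustrative choices, while the namespace, metric name, and dimension match `record_metric` above.

```python
def p99_latency_alarm(model_version: str, threshold_ms: float = 500.0,
                      alarm_actions: tuple = ()) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"churn-model-{model_version}-p99-latency",  # assumed naming
        "Namespace": "ChurnModel",
        "MetricName": "InferenceLatencyMs",
        "Dimensions": [{"Name": "ModelVersion", "Value": model_version}],
        "ExtendedStatistic": "p99",
        "Period": 300,              # assumed: evaluate over 5-minute windows
        "EvaluationPeriods": 1,     # assumed: alarm on the first breaching window
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumed: idle periods are not failures
        "AlarmActions": list(alarm_actions),  # e.g. an SNS topic ARN for on-call
    }


params = p99_latency_alarm("v3")
# Example usage:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```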

Cost Comparison (1M predictions/month)

Approach	Monthly Cost (₹)	Cold Start	Notes
Lambda (on-demand)	₹900	Yes (3–8s)	Cheapest
Lambda + Provisioned	₹4,500	No	Best UX
ECS Fargate (always-on)	₹6,300	No	More control
SageMaker Endpoint (ml.t3.medium)	₹5,400	No	Managed MLOps

For most Indian startups doing <1M predictions/month, Lambda on-demand is the right choice. Add provisioned concurrency only when p99 latency starts hurting conversion.
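
On-demand Lambda cost has two components: a per-request charge and a charge per GB-second of compute (configured memory multiplied by billed duration). A rough estimator under those assumptions follows. The rates in the example are placeholders, not real AWS prices, and free-tier allowances are ignored.

```python
def lambda_on_demand_cost(requests: int, avg_duration_ms: float, memory_mb: int,
                          price_per_million_requests: float,
                          price_per_gb_second: float) -> float:
    """Estimate monthly on-demand cost in whatever currency the rates use."""
    gb_seconds = requests * (avg_duration_ms / 1000) * (memory_mb / 1024)
    request_cost = (requests / 1_000_000) * price_per_million_requests
    return request_cost + gb_seconds * price_per_gb_second


# Placeholder rates for illustration only; look up current pricing for your region.
estimate = lambda_on_demand_cost(
    requests=1_000_000, avg_duration_ms=100, memory_mb=1024,
    price_per_million_requests=1.0, price_per_gb_second=0.00001,
)
print(round(estimate, 2))  # 2.0 with these placeholder rates
```

Because the compute charge scales with both memory and duration, raising memory_size only increases cost if the extra CPU does not shorten inference time proportionally.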

Summary

Training   → Save model.joblib to S3
Packaging  → Docker container with scikit-learn deps
Deploy     → Lambda + ECR container image
Cold start → Module-level model loading + optional provisioned concurrency
A/B test   → Lambda alias weighted routing
Monitor    → CloudWatch custom metrics + alarms



Ravi Kant Shukla
