Deploy Your ML Model on AWS Lambda: The Complete Production Guide
Step-by-step guide to packaging a scikit-learn or PyTorch model as a Lambda function — covering cold starts, container images, model versioning, and A/B testing on AWS.
Deploying ML models to production is where most data scientists hit a wall. Jupyter notebooks don't scale. Flask apps on EC2 are expensive to keep running 24/7. In this guide, we'll deploy a real ML model to AWS Lambda — serverless, pay-per-call, and scales to zero when idle.
By the end, you'll have:
- A packaged ML model serving predictions via Lambda
- Cold start mitigation strategies
- Blue/green deployments for zero-downtime model updates
- A/B testing between model versions
Why Lambda for ML Inference?
Lambda works best for:
- On-demand inference (requests handled as they arrive, not continuous streams)
- Low-to-medium traffic (< 1000 requests/second)
- Cost-sensitive workloads (you pay only when running)
Lambda is not ideal for:
- Real-time video/audio processing (timeout limits)
- Very large models (>10GB) — use SageMaker Endpoint instead
- Sub-10ms latency requirements (cold starts kill this)
For our example, we'll deploy a customer churn prediction model (scikit-learn + tabular data).
Step 1: Train and Save the Model
# train.py
import os
import joblib
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Load data
df = pd.read_csv('customer_churn.csv')
# Column order here must match FEATURE_COLUMNS in the Lambda handler
X = df.drop('churn', axis=1)
y = df['churn']
# Train pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingClassifier(
        n_estimators=200,
        max_depth=4,
        learning_rate=0.1,
        random_state=42
    ))
])
pipeline.fit(X, y)
# Save
joblib.dump(pipeline, 'model.joblib')
print(f"Model size: {os.path.getsize('model.joblib') / 1024:.1f} KB")
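Our model is ~2MB, well within Lambda's 250MB unzipped limit. Upload model.joblib to the S3 bucket and key that the Lambda function reads (MODEL_BUCKET and MODEL_KEY in Step 2).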
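To make the size check concrete, here is a minimal sketch that compares a total unzipped size (dependencies plus any bundled model) against Lambda's packaging limits. The helper name `packaging_option` is hypothetical, and the limits used (250 MB unzipped for zip packages, 10 GB for container images) are quotas that should be verified against current AWS documentation.

```python
MB = 1024 * 1024

# Lambda packaging quotas (assumed current; verify against AWS docs).
ZIP_UNZIPPED_LIMIT = 250 * MB   # zip package, unzipped, including layers
IMAGE_LIMIT = 10 * 1024 * MB    # container image


def packaging_option(unzipped_bytes: int) -> str:
    """Return the simplest packaging option that fits the given size."""
    if unzipped_bytes <= ZIP_UNZIPPED_LIMIT:
        return "zip"
    if unzipped_bytes <= IMAGE_LIMIT:
        return "container image"
    return "too large for Lambda"


print(packaging_option(2 * MB))    # a ~2 MB model on its own
print(packaging_option(400 * MB))  # model plus a heavy dependency tree
```

The model file alone is rarely the problem for tabular models; the dependency tree usually decides which option applies.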
Step 2: Build the Lambda Handler
# lambda_function.py
import json
import os
import boto3
import joblib
import numpy as np
from io import BytesIO
# ─── Model loading (module-level for warm reuse) ──────────────────────────────
_model = None
def get_model():
    """Load model once and cache in Lambda container memory."""
    global _model
    if _model is not None:
        return _model
    # Load from S3 (allows updates without redeployment)
    s3 = boto3.client('s3')
    model_bucket = os.environ['MODEL_BUCKET']
    model_key = os.environ['MODEL_KEY']  # e.g. "models/churn/v3/model.joblib"
    print(f"Loading model from s3://{model_bucket}/{model_key}")
    buffer = BytesIO()
    s3.download_fileobj(model_bucket, model_key, buffer)
    buffer.seek(0)
    _model = joblib.load(buffer)
    print("Model loaded and cached in memory")
    return _model
# ─── Feature engineering ──────────────────────────────────────────────────────
FEATURE_COLUMNS = [
    'tenure_months', 'monthly_charges', 'total_charges',
    'num_products', 'has_tech_support', 'contract_type_encoded',
    'payment_method_encoded',
]
def extract_features(payload: dict) -> np.ndarray:
    """Extract and validate features from the API payload."""
    missing = [f for f in FEATURE_COLUMNS if f not in payload]
    if missing:
        raise ValueError(f"Missing required features: {missing}")
    return np.array([[payload[f] for f in FEATURE_COLUMNS]])
# ─── Main handler ─────────────────────────────────────────────────────────────
def lambda_handler(event, context):
    try:
        # Parse body (API Gateway sends JSON as string)
        if isinstance(event.get('body'), str):
            body = json.loads(event['body'])
        else:
            # Direct invokes carry the payload in the event itself; an
            # empty API Gateway body (None) also falls back to the event
            body = event.get('body') or event
        # Extract features
        features = extract_features(body)
        # Predict
        model = get_model()
        probability = model.predict_proba(features)[0][1]  # P(churn=1)
        prediction = int(probability >= 0.5)
        return {
            'statusCode': 200,
            'headers': {'Content-Type': 'application/json'},
            'body': json.dumps({
                'prediction': prediction,
                'probability': round(float(probability), 4),
                'model_version': os.environ.get('MODEL_VERSION', 'unknown'),
            })
        }
    except ValueError as e:
        # Missing features or malformed JSON are client errors
        return {
            'statusCode': 400,
            'body': json.dumps({'error': str(e)})
        }
    except Exception as e:
        print(f"Unexpected error: {e}")
        return {
            'statusCode': 500,
            'body': json.dumps({'error': 'Internal server error'})
        }
Key pattern: the model is cached in a module-level variable (outside the handler), so it is loaded once per execution environment, on the first invocation after a cold start, rather than on every invocation. Warm Lambda containers reuse the global _model.
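The caching behaviour can be checked locally without AWS. The following is a minimal sketch in which `load_model_from_storage` is a hypothetical stand-in for the S3 download and `joblib.load` step, and `load_count` exists only to show how often the expensive load runs.

```python
_model = None    # module-level cache, survives across warm invocations
load_count = 0   # counts how often the expensive load actually runs


def load_model_from_storage():
    """Hypothetical stand-in for the S3 download + joblib.load step."""
    global load_count
    load_count += 1
    return {"name": "churn-model"}


def get_model():
    """Return the cached model, loading it on first use only."""
    global _model
    if _model is None:
        _model = load_model_from_storage()
    return _model


def handler(event, context=None):
    model = get_model()
    return {"model": model["name"]}


# Three invocations in the same (warm) process trigger a single load.
for _ in range(3):
    handler({})
print(load_count)  # 1
```

A new execution environment starts with `_model = None` again, which is why the first request after a cold start pays the full loading cost.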
Step 3: Dockerise for Large Dependencies
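scikit-learn, numpy, and pandas together typically exceed Lambda's 50MB limit for a zipped deployment package uploaded directly, and can push against the 250MB unzipped limit. Container images can be up to 10GB, so use one instead: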
# Dockerfile
FROM public.ecr.aws/lambda/python:3.11
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy handler
COPY lambda_function.py .
CMD ["lambda_function.lambda_handler"]
# requirements.txt
scikit-learn==1.4.2
numpy==1.26.4
pandas==2.2.2
joblib==1.4.2
boto3==1.34.0
Build and push to ECR:
# Build
docker build -t churn-model .
# Authenticate with ECR
aws ecr get-login-password --region ap-south-1 | \
docker login --username AWS \
--password-stdin 123456789.dkr.ecr.ap-south-1.amazonaws.com
# Tag and push
docker tag churn-model:latest \
123456789.dkr.ecr.ap-south-1.amazonaws.com/churn-model:v3
docker push 123456789.dkr.ecr.ap-south-1.amazonaws.com/churn-model:v3
Step 4: Lambda Configuration via Terraform
# lambda.tf
resource "aws_lambda_function" "churn_model" {
  function_name = "churn-prediction"
  package_type  = "Image"
  image_uri     = "${aws_ecr_repository.model.repository_url}:v3"
  role          = aws_iam_role.lambda_exec.arn
  # Memory: more memory = more CPU = faster inference
  memory_size = 1024 # 1GB, tune based on model size
  timeout     = 30   # seconds
  environment {
    variables = {
      MODEL_BUCKET  = aws_s3_bucket.models.bucket
      MODEL_KEY     = "models/churn/v3/model.joblib"
      MODEL_VERSION = "v3"
    }
  }
  # Publish a version for blue/green deployments
  publish = true
}
# Lambda Alias for stable API endpoint
resource "aws_lambda_alias" "production" {
  name             = "production"
  function_name    = aws_lambda_function.churn_model.function_name
  function_version = aws_lambda_function.churn_model.version
}
Step 5: Cold Start Mitigation
Lambda cold starts for ML models can be 3–8 seconds: the container image has to be initialised, the Python dependencies imported, and the model downloaded from S3 and deserialized. Strategies:
1. Provisioned Concurrency
resource "aws_lambda_provisioned_concurrency_config" "warm" {
  function_name                     = aws_lambda_function.churn_model.function_name
  qualifier                         = aws_lambda_alias.production.name
  provisioned_concurrent_executions = 5 # Keep 5 containers warm
}
Cost: ~₹2,500/month for 5 warm containers, 24/7. Justified for high-traffic APIs.
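The actual figure depends on region, architecture, and configured memory, so it is worth redoing the arithmetic for your own setup. The sketch below computes the billed GB-seconds for the configuration above (5 environments at 1024 MB); the function names are hypothetical, and `price_per_gb_second` is an input you would look up on the AWS pricing page rather than a value taken from this article.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def provisioned_gb_seconds(instances: int, memory_mb: int,
                           seconds: int = SECONDS_PER_MONTH) -> float:
    """GB-seconds billed for keeping `instances` environments provisioned."""
    return instances * (memory_mb / 1024) * seconds


def provisioned_cost(instances: int, memory_mb: int,
                     price_per_gb_second: float) -> float:
    """Monthly provisioned-concurrency charge only.

    Request and duration charges for actual invocations come on top.
    """
    return provisioned_gb_seconds(instances, memory_mb) * price_per_gb_second


gb_seconds = provisioned_gb_seconds(instances=5, memory_mb=1024)
print(gb_seconds)  # 12960000.0
```

Multiplying the GB-seconds by the current regional rate, and converting to rupees, gives a number you can compare against the estimate above.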
2. Scheduled Ping (Budget Option)
# ping.py: runs every 5 minutes via EventBridge
import boto3
import json
def ping_lambda(event, context):
    lambda_client = boto3.client('lambda')
    lambda_client.invoke(
        FunctionName='churn-prediction:production',
        InvocationType='RequestResponse',
        Payload=json.dumps({'_warmup': True})
    )
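A single scheduled ping keeps only one execution environment warm, so this option suits low, steady traffic. In the main handler, add this check right after the body is parsed and before feature extraction:
if body.get('_warmup'):
    return {'statusCode': 200, 'body': json.dumps({'status': 'warm'})}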
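The position of the warm-up check matters: it has to run after the body is parsed but before validation, otherwise the ping is rejected for missing features. Here is a minimal, self-contained sketch of that ordering. `parse_body`, the shortened `REQUIRED` list, and the placeholder success response are illustrative assumptions, not the full handler from Step 2.

```python
import json

REQUIRED = ["tenure_months", "monthly_charges"]  # shortened for illustration


def parse_body(event: dict) -> dict:
    """Handle API Gateway events (JSON string body) and direct invokes."""
    body = event.get("body")
    if body is None:
        return event  # direct invoke: the payload is the event itself
    if isinstance(body, str):
        return json.loads(body)
    return body


def handler(event, context=None):
    body = parse_body(event)
    # Warm-up pings return immediately, before validation or inference.
    if body.get("_warmup"):
        return {"statusCode": 200, "body": json.dumps({"status": "warm"})}
    missing = [f for f in REQUIRED if f not in body]
    if missing:
        return {"statusCode": 400,
                "body": json.dumps({"error": f"Missing required features: {missing}"})}
    # Placeholder for the real prediction step.
    return {"statusCode": 200, "body": json.dumps({"status": "would predict"})}


print(handler({"_warmup": True})["body"])
print(handler({"body": json.dumps({"tenure_months": 12})})["statusCode"])
```

Because the ping invokes the function directly, its payload arrives as the event itself rather than inside a `body` string, which is why the parser handles both shapes.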
Step 6: A/B Testing Between Model Versions
Lambda aliases support weighted traffic routing — perfect for gradual model rollouts:
# Route 10% to new v4 model, 90% to stable v3
# This replaces the alias definition from Step 4; Terraform allows
# only one "aws_lambda_alias" "production" resource per configuration.
resource "aws_lambda_alias" "production" {
  name             = "production"
  function_name    = aws_lambda_function.churn_model.function_name
  function_version = "3" # stable v3
  routing_config {
    additional_version_weights = {
      "4" = 0.10 # 10% to v4
    }
  }
}
Track which version handled each request via the model_version field in the response payload and correlate with downstream business outcomes (did the customer actually churn?).
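Once outcomes are known, the comparison itself is a simple group-by on the logged version. The sketch below assumes you have already joined each logged prediction with the observed outcome; the record layout and the function name `accuracy_by_version` are hypothetical, and the sample records are made-up illustration data.

```python
from collections import defaultdict


def accuracy_by_version(records):
    """Compute accuracy per model version.

    Each record is a dict with 'model_version', 'prediction' (0/1) and
    'actual' (0/1, the observed outcome joined in later).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        version = r["model_version"]
        total[version] += 1
        correct[version] += int(r["prediction"] == r["actual"])
    return {v: correct[v] / total[v] for v in total}


# Hypothetical logged predictions joined with observed outcomes.
records = [
    {"model_version": "v3", "prediction": 1, "actual": 1},
    {"model_version": "v3", "prediction": 0, "actual": 1},
    {"model_version": "v4", "prediction": 1, "actual": 1},
    {"model_version": "v4", "prediction": 0, "actual": 0},
]
print(accuracy_by_version(records))  # {'v3': 0.5, 'v4': 1.0}
```

With a 10% split the new version accumulates data slowly, so it is worth waiting for a reasonable sample before shifting more traffic.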
Step 7: Monitoring with CloudWatch
# In lambda_handler, emit custom metrics
import boto3
import time
cloudwatch = boto3.client('cloudwatch')
def record_metric(name: str, value: float, unit: str = 'None'):
    cloudwatch.put_metric_data(
        Namespace='ChurnModel',
        MetricData=[{
            'MetricName': name,
            'Value': value,
            'Unit': unit,
            'Dimensions': [
                {'Name': 'ModelVersion', 'Value': os.environ.get('MODEL_VERSION', 'unknown')}
            ]
        }]
    )
# Usage inside handler:
start = time.time()
probability = model.predict_proba(features)[0][1]
inference_ms = (time.time() - start) * 1000
record_metric('InferenceLatencyMs', inference_ms, 'Milliseconds')
record_metric('PredictionProbability', float(probability), 'None')
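Each `put_metric_data` call is a synchronous API request, so it adds latency to every invocation. One alternative, not used in the code above, is the CloudWatch Embedded Metric Format: the function prints a structured JSON log line and CloudWatch extracts the metric from the logs. The sketch below builds such a record; the helper name `emf_record` is hypothetical, while the namespace and dimension match the ones used above.

```python
import json
import time


def emf_record(name, value, unit, model_version, namespace="ChurnModel"):
    """Build a CloudWatch Embedded Metric Format log record as a dict."""
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["ModelVersion"]],
                "Metrics": [{"Name": name, "Unit": unit}],
            }],
        },
        # Dimension and metric values live at the top level of the record.
        "ModelVersion": model_version,
        name: value,
    }


# Inside the handler, one print per metric would replace record_metric().
print(json.dumps(emf_record("InferenceLatencyMs", 12.5, "Milliseconds", "v3")))
```

The trade-off is that metrics arrive via log ingestion rather than a direct API call, so the handler does no network I/O for monitoring.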
Set alarms on:
- InferenceLatencyMs P99 > 500ms → something's wrong
- Lambda Errors > 5 in 5 minutes → alert on-call
- Lambda Throttles > 0 → scale up reserved concurrency
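As an illustration of the first alarm, the sketch below builds the arguments for CloudWatch's `put_metric_alarm` call as a plain dictionary, so it can be inspected without AWS credentials. The alarm name, the 5-minute period, and the missing-data policy are assumptions chosen for the example; the namespace, metric name, and dimension match the custom metric emitted above.

```python
def latency_alarm_kwargs(model_version: str, threshold_ms: float = 500.0) -> dict:
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"churn-{model_version}-latency-p99",  # hypothetical name
        "Namespace": "ChurnModel",
        "MetricName": "InferenceLatencyMs",
        "Dimensions": [{"Name": "ModelVersion", "Value": model_version}],
        "ExtendedStatistic": "p99",   # percentile statistics use this field
        "Period": 300,                # evaluate over 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # idle periods do not alarm
    }


kwargs = latency_alarm_kwargs("v3")
# With AWS credentials configured, this would be applied with:
#   boto3.client("cloudwatch").put_metric_alarm(**kwargs)
print(kwargs["AlarmName"])
```

Keeping the model version as a dimension means each version gets its own alarm, which is useful while two versions share traffic during an A/B test.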
Cost Comparison (1M predictions/month)
| Approach | Monthly Cost (₹) | Cold Start | Notes |
|---|---|---|---|
| Lambda (on-demand) | ₹900 | Yes (3–8s) | Cheapest |
| Lambda + Provisioned | ₹4,500 | No | Best UX |
| ECS Fargate (always-on) | ₹6,300 | No | More control |
| SageMaker Endpoint (ml.t3.medium) | ₹5,400 | No | Managed MLOps |
For most Indian startups doing <1M predictions/month, Lambda on-demand is the right choice. Add provisioned concurrency only when p99 latency starts hurting conversion.
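To see where that advice stops holding, you can estimate the volume at which an always-on option costs the same as on-demand Lambda. The sketch below uses the figures from the table and assumes on-demand cost grows linearly with request volume and that the always-on options have a flat monthly cost; both are simplifications, and the function name `break_even_volume` is hypothetical.

```python
def break_even_volume(on_demand_cost, on_demand_volume, flat_monthly_cost):
    """Monthly request volume at which a flat-cost option matches on-demand.

    Assumes on-demand cost scales linearly with request volume.
    """
    return flat_monthly_cost * on_demand_volume / on_demand_cost


# Figures from the table: ₹900 per 1M on-demand predictions.
for name, flat_cost in [("SageMaker Endpoint", 5400), ("ECS Fargate", 6300)]:
    print(name, break_even_volume(900, 1_000_000, flat_cost))
```

Under these assumptions the always-on options only become cheaper at several million predictions per month, which is consistent with the recommendation above.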
Summary
Training → Save model.joblib to S3
Packaging → Docker container with scikit-learn deps
Deploy → Lambda + ECR container image
Cold start → Module-level model loading + optional provisioned concurrency
A/B test → Lambda alias weighted routing
Monitor → CloudWatch custom metrics + alarms
Ravi Kant Shukla
Senior Java + AI engineer. 9+ years in system design, Kafka, microservices, and LLM/RAG pipelines.