Deploy Your ML Model on AWS Lambda: The Complete Production Guide
Step-by-step guide to packaging a scikit-learn or PyTorch model as a Lambda function — covering cold starts, container images, model versioning, and A/B testing on AWS.
Deploying ML models to production is where most data scientists hit a wall. Jupyter notebooks don't scale. Flask apps on EC2 are expensive to keep running 24/7. In this guide, we'll deploy a real ML model to AWS Lambda — serverless, pay-per-call, and scales to zero when idle.
By the end, you'll have:
- A packaged ML model serving predictions via Lambda
- Cold start mitigation strategies
- Blue/green deployments for zero-downtime model updates
- A/B testing between model versions
Why Lambda for ML Inference?
Lambda works best for:
- Request/response inference (handle individual requests as they arrive, not continuous streams)
- Low-to-medium traffic (< 1000 requests/second)
- Cost-sensitive workloads (you pay only when running)
Lambda is not ideal for:
- Real-time video/audio processing (timeout limits)
- Very large models (>10GB, the container image limit) — use a SageMaker endpoint instead
- Sub-10ms latency requirements (cold starts kill this)
For our example, we'll deploy a customer churn prediction model (scikit-learn + tabular data).
Step 1: Train and Save the Model
```python
# train.py
import os

import joblib
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load data
df = pd.read_csv('customer_churn.csv')
X = df.drop('churn', axis=1)
y = df['churn']

# Train pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingClassifier(
        n_estimators=200,
        max_depth=4,
        learning_rate=0.1,
        random_state=42,
    ))
])
pipeline.fit(X, y)

# Save
joblib.dump(pipeline, 'model.joblib')
print(f"Model size: {os.path.getsize('model.joblib') / 1024:.1f} KB")
```
Our model is ~2MB — well within Lambda's 250MB unzipped limit.
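Before uploading the artifact to S3, it's worth a quick round-trip check that the saved pipeline loads cleanly and produces probabilities. A minimal sketch using synthetic data, since `customer_churn.csv` isn't reproduced here:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn dataset: 7 features, binary target.
X = np.random.default_rng(42).normal(size=(200, 7))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingClassifier(n_estimators=10, random_state=42)),
])
pipeline.fit(X, y)

# Round-trip through joblib, exactly as the Lambda handler will consume it.
path = os.path.join(tempfile.mkdtemp(), 'model.joblib')
joblib.dump(pipeline, path)
reloaded = joblib.load(path)

proba = reloaded.predict_proba(X[:1])[0]
assert proba.shape == (2,) and abs(proba.sum() - 1.0) < 1e-9
```

If this passes locally with the same scikit-learn version pinned in requirements.txt, the artifact should deserialize identically inside Lambda.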
Step 2: Build the Lambda Handler
```python
# lambda_function.py
import json
import os
from io import BytesIO

import boto3
import joblib
import numpy as np

# ─── Model loading (module-level for warm reuse) ──────────────────────────────
_model = None

def get_model():
    """Load model once and cache in Lambda container memory."""
    global _model
    if _model is not None:
        return _model

    # Load from S3 (allows updates without redeployment)
    s3 = boto3.client('s3')
    model_bucket = os.environ['MODEL_BUCKET']
    model_key = os.environ['MODEL_KEY']  # e.g. "models/churn/v3/model.joblib"

    print(f"Loading model from s3://{model_bucket}/{model_key}")
    buffer = BytesIO()
    s3.download_fileobj(model_bucket, model_key, buffer)
    buffer.seek(0)
    _model = joblib.load(buffer)
    print("Model loaded and cached in memory")
    return _model

# ─── Feature engineering ──────────────────────────────────────────────────────
FEATURE_COLUMNS = [
    'tenure_months', 'monthly_charges', 'total_charges',
    'num_products', 'has_tech_support', 'contract_type_encoded',
    'payment_method_encoded',
]

def extract_features(payload: dict) -> np.ndarray:
    """Extract and validate features from the API payload."""
    missing = [f for f in FEATURE_COLUMNS if f not in payload]
    if missing:
        raise ValueError(f"Missing required features: {missing}")
    return np.array([[payload[f] for f in FEATURE_COLUMNS]])

# ─── Main handler ─────────────────────────────────────────────────────────────
def lambda_handler(event, context):
    try:
        # Parse body (API Gateway sends JSON as string)
        if isinstance(event.get('body'), str):
            body = json.loads(event['body'])
        else:
            body = event.get('body', event)

        # Extract features
        features = extract_features(body)

        # Predict
        model = get_model()
        probability = model.predict_proba(features)[0][1]  # P(churn=1)
        prediction = int(probability >= 0.5)

        return {
            'statusCode': 200,
            'headers': {'Content-Type': 'application/json'},
            'body': json.dumps({
                'prediction': prediction,
                'probability': round(float(probability), 4),
                'model_version': os.environ.get('MODEL_VERSION', 'unknown'),
            })
        }
    except ValueError as e:
        return {
            'statusCode': 400,
            'body': json.dumps({'error': str(e)})
        }
    except Exception as e:
        print(f"Unexpected error: {e}")
        return {
            'statusCode': 500,
            'body': json.dumps({'error': 'Internal server error'})
        }
```
Key pattern: loading the model at module level (outside the handler) means the model is loaded only once per cold start, not on every invocation. Warm Lambda containers reuse the global _model.
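The caching behaviour is easy to demonstrate in isolation. A toy sketch of the same pattern, with the S3 download replaced by a stub that counts how many times it runs:

```python
_model = None
load_count = []

def download_and_deserialize():
    """Stub for the expensive S3 download + joblib.load step."""
    load_count.append(1)
    return "model-object"

def get_model():
    """Same lazy-load-and-cache pattern as the Lambda handler."""
    global _model
    if _model is None:
        _model = download_and_deserialize()
    return _model

# Three "invocations" of a warm container: only the first one pays the cost.
for _ in range(3):
    get_model()

assert len(load_count) == 1
```

Each cold start creates a fresh Python process, so `_model` starts as `None` and the first invocation pays the load cost; every subsequent invocation in that container hits the cache.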
Step 3: Dockerise for Large Dependencies
scikit-learn + numpy together exceed Lambda's 50MB zipped deployment package limit (250MB unzipped). Use a container image instead, which raises the ceiling to 10GB:
```dockerfile
# Dockerfile
FROM public.ecr.aws/lambda/python:3.11

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy handler
COPY lambda_function.py .

CMD ["lambda_function.lambda_handler"]
```
```text
# requirements.txt
scikit-learn==1.4.2
numpy==1.26.4
pandas==2.2.2
joblib==1.4.2
boto3==1.34.0
```
Build and push to ECR:
```bash
# Build
docker build -t churn-model .

# Authenticate with ECR
aws ecr get-login-password --region ap-south-1 | \
  docker login --username AWS \
  --password-stdin 123456789.dkr.ecr.ap-south-1.amazonaws.com

# Tag and push
docker tag churn-model:latest \
  123456789.dkr.ecr.ap-south-1.amazonaws.com/churn-model:v3
docker push 123456789.dkr.ecr.ap-south-1.amazonaws.com/churn-model:v3
```
Step 4: Lambda Configuration via Terraform
```hcl
# lambda.tf
resource "aws_lambda_function" "churn_model" {
  function_name = "churn-prediction"
  package_type  = "Image"
  image_uri     = "${aws_ecr_repository.model.repository_url}:v3"
  role          = aws_iam_role.lambda_exec.arn

  # Memory: more memory = more CPU = faster inference
  memory_size = 1024  # 1GB — tune based on model size
  timeout     = 30    # seconds

  environment {
    variables = {
      MODEL_BUCKET  = aws_s3_bucket.models.bucket
      MODEL_KEY     = "models/churn/v3/model.joblib"
      MODEL_VERSION = "v3"
    }
  }

  # Publish a version for blue/green deployments
  publish = true
}

# Lambda alias for a stable API endpoint
resource "aws_lambda_alias" "production" {
  name             = "production"
  function_name    = aws_lambda_function.churn_model.function_name
  function_version = aws_lambda_function.churn_model.version
}
```
Step 5: Cold Start Mitigation
Lambda cold starts for ML models can be 3–8 seconds due to model deserialization. Strategies:
1. Provisioned Concurrency
```hcl
resource "aws_lambda_provisioned_concurrency_config" "warm" {
  function_name                     = aws_lambda_function.churn_model.function_name
  qualifier                         = aws_lambda_alias.production.name
  provisioned_concurrent_executions = 5  # Keep 5 containers warm
}
```
Cost: ~₹2,500/month for 5 warm containers, 24/7. Justified for high-traffic APIs.
2. Scheduled Ping (Budget Option)
```python
# ping.py — runs every 5 minutes via EventBridge
import json

import boto3

def ping_lambda(event, context):
    lambda_client = boto3.client('lambda')
    lambda_client.invoke(
        FunctionName='churn-prediction:production',
        InvocationType='RequestResponse',
        Payload=json.dumps({'_warmup': True})
    )
```
In the main handler, add:
```python
if body.get('_warmup'):
    return {'statusCode': 200, 'body': json.dumps({'status': 'warm'})}
```
Step 6: A/B Testing Between Model Versions
Lambda aliases support weighted traffic routing — perfect for gradual model rollouts:
```hcl
# Route 10% to new v4 model, 90% to stable v3
resource "aws_lambda_alias" "production" {
  name             = "production"
  function_name    = aws_lambda_function.churn_model.function_name
  function_version = "3"  # stable v3

  routing_config {
    additional_version_weights = {
      "4" = 0.10  # 10% to v4
    }
  }
}
```
Track which version handled each request via the model_version field in the response payload and correlate with downstream business outcomes (did the customer actually churn?).
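A sketch of that correlation step with pandas, using a hypothetical prediction log (the column names are illustrative, not part of the deployment above):

```python
import pandas as pd

# Hypothetical log: each row is one request, tagged with the model_version
# returned in the response, later joined with the observed outcome.
log = pd.DataFrame({
    'model_version':    ['v3', 'v3', 'v4', 'v3', 'v4', 'v4'],
    'predicted_churn':  [1, 0, 1, 0, 1, 0],
    'actually_churned': [1, 0, 0, 0, 1, 0],
})

# Per-version accuracy: fraction of predictions matching the observed outcome.
accuracy = (
    log.assign(correct=log['predicted_churn'] == log['actually_churned'])
       .groupby('model_version')['correct']
       .mean()
)
print(accuracy)
```

If v4's downstream accuracy (or a business metric like retained revenue) holds up at 10% traffic, shift the `routing_config` weights gradually until v4 takes 100%.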
Step 7: Monitoring with CloudWatch
```python
# In lambda_handler, emit custom metrics
import os
import time

import boto3

cloudwatch = boto3.client('cloudwatch')

def record_metric(name: str, value: float, unit: str = 'None'):
    cloudwatch.put_metric_data(
        Namespace='ChurnModel',
        MetricData=[{
            'MetricName': name,
            'Value': value,
            'Unit': unit,
            'Dimensions': [
                {'Name': 'ModelVersion', 'Value': os.environ.get('MODEL_VERSION', 'unknown')}
            ]
        }]
    )

# Usage inside handler:
start = time.time()
probability = model.predict_proba(features)[0][1]
inference_ms = (time.time() - start) * 1000

record_metric('InferenceLatencyMs', inference_ms, 'Milliseconds')
record_metric('PredictionProbability', float(probability), 'None')
```
Set alarms on:
- `InferenceLatencyMs` p99 > 500ms → something's wrong
- Lambda `Errors` > 5 in 5 minutes → alert on-call
- Lambda `Throttles` > 0 → scale up reserved concurrency
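If you manage alarms from Python rather than Terraform, the latency alarm maps to `cloudwatch.put_metric_alarm`; note that percentile statistics go in `ExtendedStatistic`, not `Statistic`. A sketch that just builds the kwargs — the SNS topic ARN is a placeholder:

```python
def latency_alarm_params(threshold_ms: float = 500.0,
                         sns_topic_arn: str = 'arn:aws:sns:ap-south-1:123456789:oncall-alerts') -> dict:
    """Build kwargs for cloudwatch.put_metric_alarm: fire when the p99 of
    the custom InferenceLatencyMs metric exceeds threshold_ms over 5 minutes."""
    return {
        'AlarmName': 'churn-model-inference-latency-p99',
        'Namespace': 'ChurnModel',
        'MetricName': 'InferenceLatencyMs',
        'ExtendedStatistic': 'p99',  # percentile stats use ExtendedStatistic
        'Period': 300,
        'EvaluationPeriods': 1,
        'Threshold': threshold_ms,
        'ComparisonOperator': 'GreaterThanThreshold',
        'AlarmActions': [sns_topic_arn],
    }

# Apply with a real client:
# boto3.client('cloudwatch').put_metric_alarm(**latency_alarm_params())
```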
Cost Comparison (1M predictions/month)
| Approach | Monthly Cost (₹) | Cold Start | Notes |
|---|---|---|---|
| Lambda (on-demand) | ₹900 | Yes (3–8s) | Cheapest |
| Lambda + Provisioned | ₹4,500 | No | Best UX |
| ECS Fargate (always-on) | ₹6,300 | No | More control |
| SageMaker Endpoint (ml.t3.medium) | ₹5,400 | No | Managed MLOps |
For most Indian startups doing <1M predictions/month, Lambda on-demand is the right choice. Add provisioned concurrency only when p99 latency starts hurting conversion.
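For a back-of-envelope version of the on-demand estimate: the per-request and per-GB-second rates below are AWS's published us-east-1 list prices at the time of writing, while the average billed duration and the exchange rate are pure assumptions — plug in your own numbers:

```python
# Assumptions (tune for your workload):
requests = 1_000_000   # predictions per month
memory_gb = 1.0        # matches the 1024MB Terraform config
duration_s = 0.5       # assumed average billed duration per invocation
usd_to_inr = 84        # assumed exchange rate

# Published on-demand rates (us-east-1 list prices):
gb_second_rate = 0.0000166667   # USD per GB-second
request_rate = 0.20             # USD per 1M requests

gb_seconds = requests * memory_gb * duration_s
compute_cost_usd = gb_seconds * gb_second_rate
request_cost_usd = requests / 1_000_000 * request_rate

total_usd = compute_cost_usd + request_cost_usd
total_inr = total_usd * usd_to_inr
print(f"~${total_usd:.2f}/month (~₹{total_inr:.0f})")
```

With these assumptions the estimate lands in the same ballpark as the table's on-demand figure; the dominant term is GB-seconds, which is why billed duration and memory size matter far more than request count at this scale.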
Summary
- Training → Save `model.joblib` to S3
- Packaging → Docker container with scikit-learn deps
- Deploy → Lambda + ECR container image
- Cold start → Module-level model loading + optional provisioned concurrency
- A/B test → Lambda alias weighted routing
- Monitor → CloudWatch custom metrics + alarms
Want me to review your model deployment setup or help architect an ML inference pipeline at scale? Book an Architecture Review.
Ravi Kant Shukla
Backend architect helping developers and startups build production-grade systems on AWS. 8+ years of experience in system design, microservices, and AI/ML deployment.