Production Deployment Guide
Complete guide for deploying ModelKnife ML pipelines to production environments
Production Deployment Flow
- Development: local testing and feature development
- Staging: integration testing and performance validation
- Production: live deployment and business operations
Production Environment Setup
Prepare AWS infrastructure for production workloads
AWS Account Strategy
For production deployments, follow AWS best practices for account separation:
💡 Recommended Approach: Use separate AWS accounts for development, staging, and production to ensure complete isolation and security.
# Account Structure
Development Account: 111111111111
Staging Account: 222222222222
Production Account: 333333333333
# AWS Profile Configuration
aws configure --profile dev-account
aws configure --profile staging-account
aws configure --profile prod-account
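The three profiles above map to entries in ~/.aws/config. One common pattern, sketched below under the assumption of a shared identity account with cross-account role assumption, pairs role_arn with source_profile; the DeploymentRole role name and the identity profile are illustrative, not ModelKnife requirements.
# ~/.aws/config (illustrative cross-account role assumption)
[profile dev-account]
role_arn = arn:aws:iam::111111111111:role/DeploymentRole
source_profile = identity

[profile staging-account]
role_arn = arn:aws:iam::222222222222:role/DeploymentRole
source_profile = identity

[profile prod-account]
role_arn = arn:aws:iam::333333333333:role/DeploymentRole
source_profile = identity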
Production-Specific Setup
# Initialize production environment
AWS_PROFILE=prod-account mk setup init
# Verify production setup
AWS_PROFILE=prod-account mk setup status
# Check team access
AWS_PROFILE=prod-account mk team status
Pre-Production Checklist
- Separate AWS account for production
- IAM roles configured with least privilege
- VPC and security groups properly configured
- CloudTrail enabled for audit logging
- Cost monitoring and budget alerts set up
- Backup and disaster recovery plan in place
Production Configuration Management
Set up environment-specific configurations
Profile-Based Configuration
Use profile-specific YAML files for environment configuration:
# File structure
mlknife-compose.yaml # Base configuration
mlknife-compose_staging.yaml # Staging overrides
mlknife-compose_prod.yaml # Production overrides
Base Configuration (mlknife-compose.yaml)
name: ml-recommendation-system
author: ml-team
version: v2.1.0

parameters:
  environment: dev
  instance_size: small
  data_retention_days: 30

executors:
  python_processor:
    class: sagemaker.sklearn.processing.SKLearnProcessor
    instance_type: ml.c5.xlarge
    instance_count: 1

services:
  feature_store:
    type: dynamodb_table
    configuration:
      table_name: "features-${parameters.environment}"
      billing_mode: "PAY_PER_REQUEST"

modules:
  data_processing:
    executor: ${executors.python_processor}
    entry_point: process_data.py
Production Overrides (mlknife-compose_prod.yaml)
parameters:
  environment: prod
  instance_size: large
  data_retention_days: 365

executors:
  python_processor:
    instance_type: ml.c5.4xlarge   # Larger instances for production
    instance_count: 3              # Multi-instance processing

services:
  feature_store:
    configuration:
      billing_mode: "PROVISIONED"  # Predictable billing
      read_capacity: 100
      write_capacity: 50
      # Enable point-in-time recovery
      point_in_time_recovery: true
      # Production backup settings
      backup_policy:
        backup_enabled: true
        backup_retention_days: 30
      # Enable encryption at rest
      encryption:
        enabled: true
        kms_key_id: "alias/ml-prod-key"
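Assuming ModelKnife deep-merges the profile file over the base configuration (verify the merge semantics for your version), the effective production settings for the executor and feature store resolve roughly as follows:
# Effective configuration with -p prod (illustrative merge result)
executors:
  python_processor:
    class: sagemaker.sklearn.processing.SKLearnProcessor
    instance_type: ml.c5.4xlarge   # from prod override
    instance_count: 3              # from prod override

services:
  feature_store:
    type: dynamodb_table
    configuration:
      table_name: "features-prod"  # ${parameters.environment} resolved
      billing_mode: "PROVISIONED"  # from prod override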
✅ Best Practice: Keep sensitive production parameters (like API keys) in AWS Parameter Store or Secrets Manager, not in configuration files.
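For example, the database host and API key referenced in the Lambda configuration later in this guide can be stored with the standard AWS CLI; the names below match those references and the values are placeholders:
# Encrypted SSM parameter for the database host
aws ssm put-parameter --name /prod/db/host --type SecureString \
  --value "prod-db.internal.example.com"

# API key in Secrets Manager
aws secretsmanager create-secret --name prod/api-key \
  --secret-string '{"key": "REPLACE_ME"}'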
Security Hardening
Implement production-grade security measures
IAM Role Hardening
# Update IAM roles with stricter policies
mk conf update-roles --environment prod --restrict-access
# Verify security configuration
mk conf show --security-report
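What least privilege means depends on your pipeline, but as a rough sketch, a role policy scoped to the production feature store table (rather than dynamodb:* on all resources) could look like this; the table name follows from the configuration above, while the region and action list are illustrative assumptions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FeatureStoreReadWrite",
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:BatchGetItem",
        "dynamodb:Query",
        "dynamodb:PutItem",
        "dynamodb:BatchWriteItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:333333333333:table/features-prod"
    }
  ]
}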
Network Security
Configure VPC and security group settings for production:
services:
  secure_lambda:
    type: lambda_function
    configuration:
      function_name: "secure-api-prod"
      # VPC configuration for network isolation
      vpc_config:
        subnet_ids:
          - "subnet-12345678"
          - "subnet-87654321"
        security_group_ids:
          - "sg-restrictive-access"
      # Environment variables from Parameter Store
      environment:
        DB_HOST: "${aws:ssm:parameter:/prod/db/host:1}"
        API_KEY: "${aws:secretsmanager:prod/api-key:SecretString:key}"
      # Dead letter queue for error handling
      dead_letter_config:
        target_arn: "${aws:sqs:arn:dead-letter-queue}"
Security Checklist
- All resources deployed in private subnets
- Security groups follow least privilege principle
- Encryption at rest enabled for all data stores
- Encryption in transit for all communications
- No hardcoded secrets in configuration files
- CloudTrail logging enabled with log file validation (see the sketch after this checklist)
- AWS Config rules for compliance monitoring
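A minimal sketch of the CloudTrail item, using the standard AWS CLI; the trail and bucket names are placeholders, and the S3 bucket must already exist with a bucket policy that allows CloudTrail to write to it:
# Create a trail with log file validation and start logging
aws cloudtrail create-trail --name ml-prod-audit \
  --s3-bucket-name ml-prod-cloudtrail-logs \
  --enable-log-file-validation
aws cloudtrail start-logging --name ml-prod-audit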
Production Deployment Process
Execute controlled deployment to production
Pre-Deployment Validation
# 1. Validate configuration
mk s validate -p prod
mk p validate -p prod
# 2. Run dry-run deployment
mk s deploy -p prod --dry-run
mk p deploy -p prod --dry-run
# 3. Check staging environment first
AWS_PROFILE=staging-account mk p status
Deployment Execution
# 1. Deploy infrastructure services first
AWS_PROFILE=prod-account mk s deploy -p prod
# 2. Verify service deployment
AWS_PROFILE=prod-account mk s status
# 3. Deploy pipeline modules
AWS_PROFILE=prod-account mk p deploy -p prod
# 4. Initial validation run
AWS_PROFILE=prod-account mk p run --modules validation_module
⚠️ Deployment Order: Always deploy services before pipelines, as pipelines depend on service outputs like table names and ARNs.
Rollback Preparation
# Create deployment snapshot before changes
mk p version --create-snapshot "pre-v2.1.0-deployment"
# Document current stable version
mk p version --current > deployment-log.txt
# Test rollback procedure in staging first
AWS_PROFILE=staging-account mk p rollback --version v2.0.5
Production Monitoring & Alerting
Set up comprehensive monitoring and alerting
Pipeline Health Monitoring
# Check production deployment status
mk p status
# View detailed service information
mk p schedule set --cron "0 */6 * * *" --healthcheck
mk p schedule set --alert-on-failure --sns-topic "ml-prod-alerts"
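The --sns-topic flag above assumes the ml-prod-alerts topic already exists. If it does not, it can be created and subscribed with the standard AWS CLI (the region and e-mail address are placeholders):
# Create the alerting topic and subscribe the on-call mailbox
aws sns create-topic --name ml-prod-alerts
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:333333333333:ml-prod-alerts \
  --protocol email --notification-endpoint ml-oncall@example.com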
Cost Monitoring
# Set up cost alerts
aws budgets create-budget --account-id 333333333333 \
--budget file://budget-config.json \
--notifications-with-subscribers file://budget-notifications.json
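Both files are plain AWS Budgets JSON. A minimal sketch, assuming a 1,000 USD monthly cost budget with a notification at 80% of actual spend (amounts, threshold, and e-mail address are placeholders):
# budget-config.json
{
  "BudgetName": "ml-prod-monthly",
  "BudgetType": "COST",
  "TimeUnit": "MONTHLY",
  "BudgetLimit": { "Amount": "1000", "Unit": "USD" }
}

# budget-notifications.json
[
  {
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [
      { "SubscriptionType": "EMAIL", "Address": "ml-oncall@example.com" }
    ]
  }
]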
Monitoring Checklist
- CloudWatch dashboards for key metrics
- Automated alerts for pipeline failures
- Cost monitoring and budget alerts
- Performance metrics tracking
- Data quality monitoring
- SLA/SLO monitoring
- Log aggregation and analysis
Operational Procedures
Establish production operations and maintenance
Regular Maintenance
# Daily health checks
mk p status --detailed > daily-status-$(date +%Y%m%d).log
# Weekly performance review
mk p runs --last-week --performance-report
# Monthly cost analysis
mk show costs --month $(date +%Y-%m) --breakdown-by-service
Incident Response
# Pipeline failure response
mk p status --module FAILED_MODULE --debug
mk p logs --module FAILED_MODULE --last 1h
mk p restart --module FAILED_MODULE --safe-mode
# Emergency rollback
mk p rollback --version LAST_STABLE_VERSION --force
Backup and Recovery
# Create configuration backup
mk conf export --include-secrets > backup-$(date +%Y%m%d).json
# Model backup
aws s3 sync s3://prod-models-bucket s3://backup-models-bucket --delete
# Data backup validation
mk p run --modules backup_validation --parameters '{"backup_date": "2024-01-15"}'
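Because the production feature store enables point-in-time recovery, backup validation can also confirm it directly against DynamoDB; the table name follows from the production configuration above:
# Confirm point-in-time recovery is active on the feature store table
aws dynamodb describe-continuous-backups --table-name features-prod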
💡 Documentation: Maintain runbooks for common operational procedures and keep incident response playbooks up to date.
Production Deployment Best Practices Summary
Security First
- Use separate AWS accounts per environment
- Implement least privilege IAM policies
- Enable encryption at rest and in transit
- Store secrets in AWS Parameter Store/Secrets Manager
Configuration Management
- Use profile-based configuration files
- Version control all configurations
- Validate before deployment
- Test configuration changes in staging first
Monitoring & Operations
- Set up comprehensive monitoring
- Implement automated alerting
- Establish incident response procedures
- Regular backup and disaster recovery testing
Deployment Process
- Always deploy services before pipelines
- Use dry-run validation before deployment
- Maintain rollback procedures
- Document all changes and deployments
Following these practices ensures reliable, secure, and maintainable production ML deployments.