Production Deployment Guide

Complete guide for deploying ModelKnife ML pipelines to production environments

Production Deployment Flow

Development (local testing, feature development) → Staging (integration testing, performance validation) → Production (live deployment, business operations)

Production Environment Setup

Prepare AWS infrastructure for production workloads

AWS Account Strategy

For production deployments, follow AWS best practices for account separation:

💡 Recommended Approach: Use separate AWS accounts for development, staging, and production so that each environment's resources, permissions, and billing are isolated from the others.

# Account Structure
Development Account:   111111111111
Staging Account:      222222222222  
Production Account:   333333333333

# AWS Profile Configuration
aws configure --profile dev-account
aws configure --profile staging-account  
aws configure --profile prod-account
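
Because every command in this guide picks its target account from the active profile, it can help to confirm that a profile resolves to the account you expect before running anything against it. The following is a minimal sketch, assuming the example account IDs listed above. The helper name require_account is hypothetical; aws sts get-caller-identity is the standard AWS CLI call that reports the caller's account.

```shell
# Hypothetical guard: fail unless the given profile resolves to the expected account.
require_account() {
  profile="$1"
  expected="$2"
  # Ask STS which account the profile's credentials belong to.
  actual="$(aws sts get-caller-identity --profile "$profile" --query Account --output text)" || return 1
  if [ "$actual" != "$expected" ]; then
    echo "Profile $profile resolves to account $actual, expected $expected" >&2
    return 1
  fi
  echo "Profile $profile -> account $actual"
}
```

One way to use it is to chain it in front of a production command, for example: require_account prod-account 333333333333 && AWS_PROFILE=prod-account mk setup status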

Production-Specific Setup

# Initialize production environment
AWS_PROFILE=prod-account mk setup init

# Verify production setup
AWS_PROFILE=prod-account mk setup status

# Check team access
AWS_PROFILE=prod-account mk team status

Pre-Production Checklist

Separate AWS account for production
IAM roles configured with least privilege
VPC and security groups properly configured
CloudTrail enabled for audit logging
Cost monitoring and budget alerts set up
Backup and disaster recovery plan in place

Production Configuration Management

Set up environment-specific configurations

Profile-Based Configuration

Use profile-specific YAML files for environment configuration:

# File structure
mlknife-compose.yaml           # Base configuration
mlknife-compose_staging.yaml   # Staging overrides
mlknife-compose_prod.yaml      # Production overrides
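
Before deploying with a profile, it is worth checking that both the base file and the matching override file are present. The sketch below assumes only the mlknife-compose_<profile>.yaml naming pattern shown above; check_profile_files is a hypothetical helper and not part of the mk CLI.

```shell
# Hypothetical helper: confirm the base and profile-specific files exist in a directory.
check_profile_files() {
  profile="$1"
  dir="${2:-.}"
  status=0
  for f in "mlknife-compose.yaml" "mlknife-compose_${profile}.yaml"; do
    if [ ! -f "$dir/$f" ]; then
      echo "Missing configuration file: $dir/$f" >&2
      status=1
    fi
  done
  return "$status"
}
```

Calling check_profile_files prod in the project directory returns a non-zero status and names the missing file if either one is absent.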

Base Configuration (mlknife-compose.yaml)

name: ml-recommendation-system
author: ml-team
version: v2.1.0

parameters:
  environment: dev
  instance_size: small
  data_retention_days: 30
  
executors:
  python_processor:
    class: sagemaker.sklearn.processing.SKLearnProcessor
    instance_type: ml.c5.xlarge
    instance_count: 1
    
services:
  feature_store:
    type: dynamodb_table
    configuration:
      table_name: "features-${parameters.environment}"
      billing_mode: "PAY_PER_REQUEST"
      
modules:
  data_processing:
    executor: ${executors.python_processor}
    entry_point: process_data.py

Production Overrides (mlknife-compose_prod.yaml)

parameters:
  environment: prod
  instance_size: large
  data_retention_days: 365
  
executors:
  python_processor:
    instance_type: ml.c5.4xlarge    # Larger instances for production
    instance_count: 3               # Multi-instance processing
    
services:
  feature_store:
    configuration:
      billing_mode: "PROVISIONED"   # Predictable billing
      read_capacity: 100
      write_capacity: 50
      
      # Enable point-in-time recovery
      point_in_time_recovery: true
      
      # Production backup settings
      backup_policy:
        backup_enabled: true
        backup_retention_days: 30
        
      # Enable encryption at rest
      encryption:
        enabled: true
        kms_key_id: "alias/ml-prod-key"

✅ Best Practice: Keep sensitive production parameters (like API keys) in AWS Parameter Store or Secrets Manager, not in configuration files.

Security Hardening

Implement production-grade security measures

IAM Role Hardening

# Update IAM roles with stricter policies
mk conf update-roles --environment prod --restrict-access

# Verify security configuration
mk conf show --security-report

Network Security

Configure VPC and security group settings for production:

Production Security Configuration

services:
  secure_lambda:
    type: lambda_function
    configuration:
      function_name: "secure-api-prod"
      
      # VPC configuration for network isolation
      vpc_config:
        subnet_ids:
          - "subnet-12345678"
          - "subnet-87654321"
        security_group_ids:
          - "sg-restrictive-access"
          
      # Environment variables from Parameter Store
      environment:
        DB_HOST: "${aws:ssm:parameter:/prod/db/host:1}"
        API_KEY: "${aws:secretsmanager:prod/api-key:SecretString:key}"
        
      # Dead letter queue for error handling
      dead_letter_config:
        target_arn: "${aws:sqs:arn:dead-letter-queue}"
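
The references above only resolve if the parameter and the secret already exist in the production account. As a rough sketch, they could be created with the AWS CLI as shown below. The helper name create_prod_secrets and the DB_HOST and API_KEY environment variables are hypothetical, and the JSON shape of the secret (a single "key" field) is an assumption inferred from the ":SecretString:key" suffix in the reference.

```shell
# Hypothetical helper: create the SSM parameter and the secret referenced above.
# Values come from the environment so they never appear in a configuration file.
create_prod_secrets() {
  if [ -z "${DB_HOST:-}" ] || [ -z "${API_KEY:-}" ]; then
    echo "DB_HOST and API_KEY must be set in the environment" >&2
    return 1
  fi
  # Plain parameter holding the database host.
  aws ssm put-parameter --name "/prod/db/host" --type String --value "$DB_HOST" || return 1
  # JSON secret with one "key" field (assumed shape; the value must not contain
  # double quotes or backslashes for this simple quoting to stay valid JSON).
  aws secretsmanager create-secret --name "prod/api-key" \
    --secret-string "{\"key\":\"$API_KEY\"}"
}
```

Run it with the production profile active, for example AWS_PROFILE=prod-account create_prod_secrets. Command-line arguments can be visible to other local users, so on shared machines passing the secret through a file:// argument may be preferable.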

Security Checklist

All resources deployed in private subnets
Security groups follow least privilege principle
Encryption at rest enabled for all data stores
Encryption in transit for all communications
No hardcoded secrets in configuration files
CloudTrail logging enabled with log file validation
AWS Config rules for compliance monitoring
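
The "no hardcoded secrets" item can be partly automated. The sketch below is a simple heuristic rather than a complete scanner: it flags AWS access key IDs and literal values assigned to secret-like keys, while allowing "${...}" references such as the ones used in the configuration above. The function name scan_for_secrets and the list of key names are assumptions.

```shell
# Hypothetical check: flag likely hardcoded credentials in configuration files.
# Prints matching lines and returns non-zero when something suspicious is found.
scan_for_secrets() {
  found=0
  for file in "$@"; do
    # AWS access key IDs start with "AKIA" followed by 16 uppercase letters or digits.
    if grep -nE 'AKIA[0-9A-Z]{16}' "$file"; then
      found=1
    fi
    # Literal values on secret-like keys; values starting with "$" are treated as references.
    if grep -niE '^[[:space:]]*[a-z_]*(password|secret|api_key|token)[a-z_]*:[[:space:]]*"?[^"$[:space:]]' "$file"; then
      found=1
    fi
  done
  return "$found"
}
```

For example, scan_for_secrets mlknife-compose.yaml mlknife-compose_prod.yaml could run as a pre-commit or CI step so that a flagged line blocks the change.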

Production Deployment Process

Execute controlled deployment to production

Pre-Deployment Validation

# 1. Validate configuration
mk s validate -p prod
mk p validate -p prod

# 2. Run dry-run deployment
mk s deploy -p prod --dry-run
mk p deploy -p prod --dry-run

# 3. Confirm staging is healthy before deploying to production
AWS_PROFILE=staging-account mk p status

Deployment Execution

# 1. Deploy infrastructure services first
AWS_PROFILE=prod-account mk s deploy -p prod

# 2. Verify service deployment
AWS_PROFILE=prod-account mk s status

# 3. Deploy pipeline modules
AWS_PROFILE=prod-account mk p deploy -p prod

# 4. Initial validation run
AWS_PROFILE=prod-account mk p run --modules validation_module

⚠️ Deployment Order: Always deploy services before pipelines, as pipelines depend on service outputs like table names and ARNs.
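
The ordering rule is easier to keep when the steps live in one script that stops at the first failure. The sketch below chains only the commands already shown in this section; the wrapper name deploy_prod is hypothetical. The body runs in a subshell so that the exported profile does not leak into the calling shell.

```shell
# Hypothetical wrapper: run the documented steps in order and stop at the first failure.
# The parentheses make the body a subshell, so "exit" and "export" stay local to it.
deploy_prod() (
  export AWS_PROFILE=prod-account
  mk s validate -p prod          || exit 1   # validate services
  mk p validate -p prod          || exit 1   # validate pipelines
  mk s deploy -p prod --dry-run  || exit 1   # preview service changes
  mk p deploy -p prod --dry-run  || exit 1   # preview pipeline changes
  mk s deploy -p prod            || exit 1   # services first
  mk s status                    || exit 1   # verify services
  mk p deploy -p prod            || exit 1   # pipelines depend on service outputs
)
```

If the service deployment fails, the pipeline deployment is never attempted, which matches the dependency described above.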

Rollback Preparation

# Create deployment snapshot before changes
mk p version --create-snapshot "pre-v2.1.0-deployment"

# Document current stable version
mk p version --current > deployment-log.txt

# Test rollback procedure in staging first
AWS_PROFILE=staging-account mk p rollback --version v2.0.5

Production Monitoring & Alerting

Set up comprehensive monitoring and alerting

Pipeline Health Monitoring

# Check production deployment status
mk p status

# Schedule a health check every six hours and alert on failure
mk p schedule set --cron "0 */6 * * *" --healthcheck
mk p schedule set --alert-on-failure --sns-topic "ml-prod-alerts"

Cost Monitoring

# Set up cost alerts
aws budgets create-budget --account-id 333333333333 \
  --budget file://budget-config.json \
  --notifications-with-subscribers file://budget-notifications.json
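
The command above expects two JSON input files. As an illustration, they might look like the ones written by the sketch below, which follows the AWS Budgets input shape for a monthly cost budget with an e-mail alert at 80% of the limit. The budget name, amount, and e-mail address are placeholders, and write_budget_files is a hypothetical helper.

```shell
# Hypothetical helper: write the two input files used by the create-budget command.
# Budget name, amount, and address are placeholders to adapt.
write_budget_files() {
  dir="${1:-.}"
  cat > "$dir/budget-config.json" <<'EOF'
{
  "BudgetName": "ml-prod-monthly",
  "BudgetLimit": { "Amount": "5000", "Unit": "USD" },
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST"
}
EOF
  cat > "$dir/budget-notifications.json" <<'EOF'
[
  {
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [
      { "SubscriptionType": "EMAIL", "Address": "ml-ops@example.com" }
    ]
  }
]
EOF
}
```

Running write_budget_files in the working directory produces both files next to where the aws budgets command is invoked.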

Monitoring Checklist

CloudWatch dashboards for key metrics
Automated alerts for pipeline failures
Cost monitoring and budget alerts
Performance metrics tracking
Data quality monitoring
SLA/SLO monitoring
Log aggregation and analysis

Operational Procedures

Establish production operations and maintenance

Regular Maintenance

# Daily health checks
mk p status --detailed > daily-status-$(date +%Y%m%d).log

# Weekly performance review
mk p runs --last-week --performance-report

# Monthly cost analysis
mk show costs --month $(date +%Y-%m) --breakdown-by-service
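
Daily checks are more useful when a failed check is visible and old logs do not pile up. The sketch below wraps the daily status command shown above; the function name daily_status and the 30-day retention period are assumptions.

```shell
# Hypothetical wrapper: write the daily status log and remove logs older than 30 days.
daily_status() {
  log_dir="${1:-.}"
  log_file="$log_dir/daily-status-$(date +%Y%m%d).log"
  # Capture both output streams so error messages end up in the log as well.
  if ! mk p status --detailed > "$log_file" 2>&1; then
    echo "Status check failed, see $log_file" >&2
    return 1
  fi
  # Prune old logs so the directory does not grow without bound.
  find "$log_dir" -name 'daily-status-*.log' -mtime +30 -exec rm -f {} +
}
```

Putting the logic in a script also avoids a common scheduling pitfall: in a crontab entry the % character is special, so an inline $(date +%Y%m%d) would need each % escaped.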

Incident Response

# Pipeline failure response
mk p status --module FAILED_MODULE --debug
mk p logs --module FAILED_MODULE --last 1h
mk p restart --module FAILED_MODULE --safe-mode

# Emergency rollback
mk p rollback --version LAST_STABLE_VERSION --force

Backup and Recovery

# Create configuration backup (the export includes secrets: encrypt the file and restrict access to it)
mk conf export --include-secrets > backup-$(date +%Y%m%d).json

# Model backup (note: --delete also removes backup objects that no longer exist in the source)
aws s3 sync s3://prod-models-bucket s3://backup-models-bucket --delete

# Data backup validation
mk p run --modules backup_validation --parameters '{"backup_date": "2024-01-15"}'
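
Because a sync with --delete mirrors deletions into the backup bucket, an accidental deletion in production would also remove the backup copy. One possible safeguard, sketched below, is to preview the sync with the AWS CLI --dryrun flag and refuse to continue if it would delete anything. The function name safe_model_backup and the ALLOW_DELETE override variable are hypothetical, and the check assumes that dry-run output marks deletions with a "(dryrun) delete:" prefix.

```shell
# Hypothetical wrapper: preview the sync and refuse to run it if it would delete backups.
safe_model_backup() {
  src="s3://prod-models-bucket"
  dst="s3://backup-models-bucket"
  # --dryrun prints the planned operations without performing them.
  plan="$(aws s3 sync "$src" "$dst" --delete --dryrun)" || return 1
  # Count planned deletions (assumes the "(dryrun) delete:" line prefix).
  deletions="$(printf '%s\n' "$plan" | grep -c '^(dryrun) delete:' || true)"
  if [ "$deletions" -gt 0 ] && [ "${ALLOW_DELETE:-no}" != "yes" ]; then
    echo "Sync would delete $deletions object(s) from $dst; set ALLOW_DELETE=yes to proceed" >&2
    return 1
  fi
  aws s3 sync "$src" "$dst" --delete
}
```

Enabling versioning on the backup bucket is a complementary measure, since deleted objects then remain recoverable as earlier versions.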

💡 Documentation: Maintain runbooks for common operational procedures and keep incident response playbooks up to date.

Production Deployment Best Practices Summary

Security First

Use separate AWS accounts per environment
Implement least privilege IAM policies
Enable encryption at rest and in transit
Store secrets in AWS Parameter Store/Secrets Manager

Configuration Management

Use profile-based configuration files
Version control all configurations
Validate before deployment
Test configuration changes in staging first

Monitoring & Operations

Set up comprehensive monitoring
Implement automated alerting
Establish incident response procedures
Regular backup and disaster recovery testing

Deployment Process

Always deploy services before pipelines
Use dry-run validation before deployment
Maintain rollback procedures
Document all changes and deployments

Following these practices helps keep production ML deployments reliable, secure, and maintainable.
