Production Deployment Guide

Complete guide for deploying ModelKnife ML pipelines to production environments

Production Deployment Flow

  • Development: local testing, feature development
  • Staging: integration testing, performance validation
  • Production: live deployment, business operations

1. Production Environment Setup

Prepare AWS infrastructure for production workloads

AWS Account Strategy

For production deployments, follow AWS best practices for account separation:

💡 Recommended Approach: Use separate AWS accounts for development, staging, and production so that resources, permissions, and billing in each environment stay isolated from the others.

# Account structure (example IDs)
# Development account:  111111111111
# Staging account:      222222222222
# Production account:   333333333333

# AWS profile configuration (one named profile per account)
aws configure --profile dev-account
aws configure --profile staging-account
aws configure --profile prod-account
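
Because each command in this guide selects its target account through a named profile, it can help to confirm that a profile really resolves to the intended account before running anything against production. The following is a minimal sketch using `aws sts get-caller-identity`; the function name `require_account` is a hypothetical helper, not a ModelKnife command, and the account IDs are the example values listed above.

```shell
# Hypothetical helper (not part of ModelKnife): succeed only when the
# AWS CLI profile in $1 resolves to the account ID in $2.
require_account() {
  ra_profile="$1"
  ra_expected="$2"
  # Ask STS which account the profile's credentials belong to.
  ra_actual="$(aws sts get-caller-identity --profile "$ra_profile" \
    --query Account --output text)" || return 2
  if [ "$ra_actual" != "$ra_expected" ]; then
    echo "profile $ra_profile resolves to account $ra_actual, expected $ra_expected" >&2
    return 1
  fi
}
```

For example, `require_account prod-account 333333333333 && AWS_PROFILE=prod-account mk setup status` would run the status check only when the profile points at the production account.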

Production-Specific Setup

# Initialize production environment
AWS_PROFILE=prod-account mk setup init

# Verify production setup
AWS_PROFILE=prod-account mk setup status

# Check team access
AWS_PROFILE=prod-account mk team status

Pre-Production Checklist

  • Separate AWS account for production
  • IAM roles configured with least privilege
  • VPC and security groups properly configured
  • CloudTrail enabled for audit logging
  • Cost monitoring and budget alerts set up
  • Backup and disaster recovery plan in place

2. Production Configuration Management

Set up environment-specific configurations

Profile-Based Configuration

Use profile-specific YAML files for environment configuration:

# File structure
mlknife-compose.yaml           # Base configuration
mlknife-compose_staging.yaml   # Staging overrides
mlknife-compose_prod.yaml      # Production overrides
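
A misspelled override file name is easy to miss, so a small pre-flight check can list the files a profile is expected to use before anything is validated or deployed. The sketch below only encodes the naming convention shown above (base file plus `mlknife-compose_<profile>.yaml`); `config_files_for` is a hypothetical helper, and how ModelKnife itself locates and merges these files is not assumed here.

```shell
# Hypothetical helper (not part of ModelKnife): print the configuration files
# expected for the profile in $1, looking in the directory $2 (default: ".").
# Fails if either the base file or the profile override is missing.
config_files_for() {
  cfg_profile="$1"
  cfg_dir="${2:-.}"
  cfg_base="$cfg_dir/mlknife-compose.yaml"
  cfg_override="$cfg_dir/mlknife-compose_${cfg_profile}.yaml"
  for cfg_file in "$cfg_base" "$cfg_override"; do
    if [ ! -f "$cfg_file" ]; then
      echo "missing configuration file: $cfg_file" >&2
      return 1
    fi
  done
  printf '%s\n' "$cfg_base" "$cfg_override"
}
```

Running `config_files_for prod` from the project directory prints both paths, or reports the first missing file and returns a non-zero status.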

Base Configuration (mlknife-compose.yaml)

# mlknife-compose.yaml
name: ml-recommendation-system
author: ml-team
version: v2.1.0

parameters:
  environment: dev
  instance_size: small
  data_retention_days: 30
  
executors:
  python_processor:
    class: sagemaker.sklearn.processing.SKLearnProcessor
    instance_type: ml.c5.xlarge
    instance_count: 1
    
services:
  feature_store:
    type: dynamodb_table
    configuration:
      table_name: "features-${parameters.environment}"
      billing_mode: "PAY_PER_REQUEST"
      
modules:
  data_processing:
    executor: ${executors.python_processor}
    entry_point: process_data.py

Production Overrides (mlknife-compose_prod.yaml)

# mlknife-compose_prod.yaml
parameters:
  environment: prod
  instance_size: large
  data_retention_days: 365
  
executors:
  python_processor:
    instance_type: ml.c5.4xlarge    # Larger instances for production
    instance_count: 3               # Multi-instance processing
    
services:
  feature_store:
    configuration:
      billing_mode: "PROVISIONED"   # Predictable billing
      read_capacity: 100
      write_capacity: 50
      
      # Enable point-in-time recovery
      point_in_time_recovery: true
      
      # Production backup settings
      backup_policy:
        backup_enabled: true
        backup_retention_days: 30
        
      # Enable encryption at rest
      encryption:
        enabled: true
        kms_key_id: "alias/ml-prod-key"

✅ Best Practice: Keep sensitive production parameters (such as API keys) in AWS Systems Manager Parameter Store or AWS Secrets Manager, not in configuration files.
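
As a rough illustration of this practice, the values can be written with the standard AWS CLI rather than committed to a YAML file. This is a sketch under assumptions: `store_prod_secrets` is a hypothetical helper, the names `/prod/db/host` and `prod/api-key` are taken from the Lambda example later in this guide, and `DB_HOST_VALUE` and `API_KEY_VALUE` are assumed to be supplied by the operator through the environment.

```shell
# Hypothetical helper (not part of ModelKnife): store one plain parameter and
# one secret in the production account using the AWS CLI.
store_prod_secrets() {
  if [ -z "${DB_HOST_VALUE:-}" ] || [ -z "${API_KEY_VALUE:-}" ]; then
    echo "set DB_HOST_VALUE and API_KEY_VALUE before calling" >&2
    return 1
  fi
  # Non-sensitive value: a plain String parameter in Parameter Store.
  aws ssm put-parameter \
    --profile prod-account \
    --name "/prod/db/host" \
    --type String \
    --value "$DB_HOST_VALUE" \
    --overwrite &&
  # Sensitive value: a JSON secret with a single "key" field in Secrets Manager.
  # Assumes the API key contains no characters that need JSON escaping.
  aws secretsmanager create-secret \
    --profile prod-account \
    --name "prod/api-key" \
    --secret-string "{\"key\": \"$API_KEY_VALUE\"}"
}
```

Note that `create-secret` fails when the secret already exists; `aws secretsmanager put-secret-value` is the call for updating an existing secret.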

3. Security Hardening

Implement production-grade security measures

IAM Role Hardening

# Update IAM roles with stricter policies
mk conf update-roles --environment prod --restrict-access

# Verify security configuration
mk conf show --security-report

Network Security

Configure VPC and security group settings for production:

# Production security configuration
services:
  secure_lambda:
    type: lambda_function
    configuration:
      function_name: "secure-api-prod"
      
      # VPC configuration for network isolation
      vpc_config:
        subnet_ids:
          - "subnet-12345678"
          - "subnet-87654321"
        security_group_ids:
          - "sg-restrictive-access"
          
      # Environment variables resolved from Parameter Store and Secrets Manager
      environment:
        DB_HOST: "${aws:ssm:parameter:/prod/db/host:1}"
        API_KEY: "${aws:secretsmanager:prod/api-key:SecretString:key}"
        
      # Dead letter queue for error handling
      dead_letter_config:
        target_arn: "${aws:sqs:arn:dead-letter-queue}"

Security Checklist

  • All resources deployed in private subnets
  • Security groups follow least privilege principle
  • Encryption at rest enabled for all data stores
  • Encryption in transit for all communications
  • No hardcoded secrets in configuration files
  • CloudTrail logging enabled with log file validation
  • AWS Config rules for compliance monitoring
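
The "no hardcoded secrets" item in the checklist above can be partly automated with a text scan of the configuration files. The sketch below is a heuristic, not a complete secret scanner: `check_no_hardcoded_secrets` is a hypothetical helper, and the key names it looks for (`password`, `secret`, `api_key`, `token`) are assumptions to adapt. It flags such keys when their value is a literal rather than a `${...}` reference.

```shell
# Hypothetical helper (not part of ModelKnife): scan the given files for keys
# that look secret-related and carry a literal value instead of a ${...} reference.
# Returns 0 when nothing is found, 1 on a finding, 2 if grep itself fails.
check_no_hardcoded_secrets() {
  # Key at the start of the line, then ":", then a value not starting with "$".
  hs_pattern='^[[:space:]]*[A-Za-z0-9_]*(password|secret|api_key|token)[A-Za-z0-9_]*[[:space:]]*:[[:space:]]*"?[^$"[:space:]]'
  grep -n -i -E -e "$hs_pattern" "$@" && hs_status=0 || hs_status=$?
  case "$hs_status" in
    0) echo "possible hardcoded secret found (see lines above)" >&2; return 1 ;;
    1) return 0 ;;
    *) return 2 ;;
  esac
}
```

A call such as `check_no_hardcoded_secrets mlknife-compose*.yaml` could run in CI before the validation step.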

4. Production Deployment Process

Execute controlled deployment to production

Pre-Deployment Validation

# 1. Validate configuration
mk s validate -p prod
mk p validate -p prod

# 2. Run dry-run deployment
mk s deploy -p prod --dry-run
mk p deploy -p prod --dry-run

# 3. Confirm the staging environment is healthy before deploying to production
AWS_PROFILE=staging-account mk p status

Deployment Execution

# 1. Deploy infrastructure services first
AWS_PROFILE=prod-account mk s deploy -p prod

# 2. Verify service deployment
AWS_PROFILE=prod-account mk s status

# 3. Deploy pipeline modules
AWS_PROFILE=prod-account mk p deploy -p prod

# 4. Initial validation run
AWS_PROFILE=prod-account mk p run --modules validation_module

⚠️ Deployment Order: Always deploy services before pipelines, as pipelines depend on service outputs like table names and ARNs.
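
One way to keep this order from being skipped is to wrap the four commands above in a single function that stops at the first failure. This is only a sketch: `deploy_prod` is a hypothetical wrapper, and it assumes the `mk` commands return a non-zero exit status when they fail.

```shell
# Hypothetical wrapper (not part of ModelKnife): run the production deployment
# steps in order and stop at the first command that fails.
# The body runs in a subshell so AWS_PROFILE does not leak into the caller.
deploy_prod() (
  export AWS_PROFILE=prod-account
  mk s deploy -p prod &&                  # 1. infrastructure services first
  mk s status &&                          # 2. verify service deployment
  mk p deploy -p prod &&                  # 3. pipeline modules
  mk p run --modules validation_module    # 4. initial validation run
)
```

Because the steps are chained with `&&`, a failed service deployment prevents the pipeline deployment from starting.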

Rollback Preparation

# Create deployment snapshot before changes
mk p version --create-snapshot "pre-v2.1.0-deployment"

# Document current stable version
mk p version --current > deployment-log.txt

# Test rollback procedure in staging first
AWS_PROFILE=staging-account mk p rollback --version v2.0.5

5. Production Monitoring & Alerting

Set up comprehensive monitoring and alerting

Pipeline Health Monitoring

# Check production deployment status
mk p status

# Schedule a health check every six hours and alert on failure via SNS
mk p schedule set --cron "0 */6 * * *" --healthcheck
mk p schedule set --alert-on-failure --sns-topic "ml-prod-alerts"

Cost Monitoring

# Set up cost alerts
aws budgets create-budget --account-id 333333333333 \
  --budget file://budget-config.json \
  --notifications-with-subscribers file://budget-notifications.json
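
The two JSON files referenced by this command are not shown in the guide. The sketch below writes one plausible pair, assuming a monthly cost budget with an email alert at 80% of the limit. The helper name `write_budget_files`, the budget name, the 1000 USD limit, and the email address are all placeholder assumptions.

```shell
# Hypothetical helper: write example input files for `aws budgets create-budget`
# into the directory given as $1 (default: current directory).
write_budget_files() {
  bf_dir="${1:-.}"
  # Monthly cost budget; the name and the 1000 USD limit are placeholders.
  cat > "$bf_dir/budget-config.json" <<'EOF' || return 1
{
  "BudgetName": "ml-prod-monthly",
  "BudgetLimit": { "Amount": "1000", "Unit": "USD" },
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST"
}
EOF
  # Email the placeholder address when actual spend exceeds 80% of the limit.
  cat > "$bf_dir/budget-notifications.json" <<'EOF'
[
  {
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [
      { "SubscriptionType": "EMAIL", "Address": "ml-oncall@example.com" }
    ]
  }
]
EOF
}
```

After calling `write_budget_files`, the `aws budgets create-budget` command above can be run from the same directory.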

Monitoring Checklist

  • CloudWatch dashboards for key metrics
  • Automated alerts for pipeline failures
  • Cost monitoring and budget alerts
  • Performance metrics tracking
  • Data quality monitoring
  • SLA/SLO monitoring
  • Log aggregation and analysis

6. Operational Procedures

Establish production operations and maintenance

Regular Maintenance

# Daily health checks
mk p status --detailed > daily-status-$(date +%Y%m%d).log

# Weekly performance review
mk p runs --last-week --performance-report

# Monthly cost analysis
mk show costs --month $(date +%Y-%m) --breakdown-by-service

Incident Response

# Pipeline failure response
mk p status --module FAILED_MODULE --debug
mk p logs --module FAILED_MODULE --last 1h
mk p restart --module FAILED_MODULE --safe-mode

# Emergency rollback
mk p rollback --version LAST_STABLE_VERSION --force

Backup and Recovery

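# Create configuration backup
# Note: the export includes secrets, so encrypt the file and restrict access to it
mk conf export --include-secrets > backup-$(date +%Y%m%d).json

# Model backup
# Note: --delete also removes objects from the backup bucket once they are gone from the source
aws s3 sync s3://prod-models-bucket s3://backup-models-bucket --delete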

# Data backup validation
mk p run --modules backup_validation --parameters '{"backup_date": "2024-01-15"}'
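
Since the model backup above runs `aws s3 sync` with `--delete`, an object removed from the source bucket is also removed from the backup on the next sync. Enabling versioning on the backup bucket is one way to keep earlier copies recoverable. The sketch below checks for that before syncing; `require_versioning` is a hypothetical helper, and the bucket name is the example one used above.

```shell
# Hypothetical helper (not part of ModelKnife): succeed only when S3 versioning
# is enabled on the bucket named in $1.
require_versioning() {
  # Status is "Enabled", "Suspended", or "None" when versioning was never set.
  rv_status="$(aws s3api get-bucket-versioning --bucket "$1" \
    --query Status --output text)" || return 2
  if [ "$rv_status" != "Enabled" ]; then
    echo "versioning is not enabled on bucket $1 (status: $rv_status)" >&2
    return 1
  fi
}
```

For example, `require_versioning backup-models-bucket && aws s3 sync s3://prod-models-bucket s3://backup-models-bucket --delete` would skip the sync when the backup bucket is not versioned.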

💡 Documentation: Maintain runbooks for common operational procedures and keep incident response playbooks up to date.

Production Deployment Best Practices Summary

Security First

  • Use separate AWS accounts per environment
  • Implement least privilege IAM policies
  • Enable encryption at rest and in transit
  • Store secrets in AWS Parameter Store/Secrets Manager

Configuration Management

  • Use profile-based configuration files
  • Version control all configurations
  • Validate before deployment
  • Test configuration changes in staging first

Monitoring & Operations

  • Set up comprehensive monitoring
  • Implement automated alerting
  • Establish incident response procedures
  • Test backups and disaster recovery regularly

Deployment Process

  • Always deploy services before pipelines
  • Use dry-run validation before deployment
  • Maintain rollback procedures
  • Document all changes and deployments

Following these practices ensures reliable, secure, and maintainable production ML deployments.
