Multi-Profile Configuration

Use profiles for both multi-environment deployments and multi-pipeline configurations to reduce duplication and enable flexible ML workflows.

Profile Strategy

ModelKnife profiles serve two key purposes: multi-environment deployment (dev, staging, prod) and multi-pipeline configuration (different ML workflows sharing common infrastructure). This allows you to reduce configuration duplication and maintain flexible, scalable ML systems.

Two Profile Use Cases

  • Multi-Environment: Same pipeline across dev/staging/prod environments
  • Multi-Pipeline: Different ML workflows sharing base configuration (e.g., user-interest, media-annotation, post-enrichment)

Multi-Environment

Same ML pipeline deployed across different environments with environment-specific configurations.

  • Dev, staging, production environments
  • Different resource sizes
  • Environment-specific parameters
  • Isolated infrastructure

Multi-Pipeline

Different ML workflows sharing common base configuration and infrastructure services.

  • Shared executors and services
  • Pipeline-specific modules
  • Reduced configuration duplication
  • Modular ML workflows

Hybrid Approach

Combine both strategies: multiple pipelines each deployed across multiple environments.

  • Pipeline-specific profiles
  • Environment-specific overrides
  • Maximum flexibility
  • Enterprise-scale ML systems

Profile Configuration Patterns

Step 1: Multi-Environment Pattern

Same pipeline across different environments

Create base mlknife-compose.yaml and environment-specific overrides:

mlknife-compose.yaml - Base Environment Configuration
# mlknife-compose.yaml - Base configuration
name: hermes-search-engine
author: david.liu
description: Search engine with user and post search capabilities

parameters:
  env: dev  # Will be overridden by profiles
  base_path: s3://naoo-ai/
  version: v1
  data_path: ${parameters.base_path}${parameters.env}/hermes-search/${parameters.version}

executors:
  glue_etl:
    type: glue_etl
    job_name: "ml-data-preprocessing"
    runtime: python3.9
    role: AWSGlueServiceRole
    number_of_workers: 2  # Will be overridden for prod
    worker_type: G.1X

services:
  search_backend_service:
    type: search_service
    configuration:
      service_name: "hermes-search-${parameters.env}"
      instance_type: "t3.medium"  # Will be overridden
      
  search_api_gateway:
    type: api_gateway
    configuration:
      api_name: "hermes-search-api-${parameters.env}"
      stage_name: ${parameters.env}

modules:
  build_search_index:
    executor: ${executors.glue_etl}
    entry_point: ./src/build_search_index.py
    job_parameters:
      output_path: ${parameters.data_path}/search_index/
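The `${parameters.*}` references above are composed from other parameters, so `data_path` expands to `s3://naoo-ai/dev/hermes-search/v1`. A toy sketch of how such interpolation could resolve (an illustration of the syntax, not ModelKnife's actual resolver):

```python
import re

def resolve(parameters):
    """Repeatedly expand ${parameters.key} references until none remain.

    Illustrative only: ModelKnife's real resolution order, error handling,
    and non-parameter scopes (executors, services) are not modeled here.
    """
    pattern = re.compile(r"\$\{parameters\.(\w+)\}")
    resolved = dict(parameters)
    for _ in range(len(resolved)):  # bounded: each pass expands one nesting level
        changed = False
        for key, value in resolved.items():
            if isinstance(value, str) and pattern.search(value):
                resolved[key] = pattern.sub(lambda m: str(resolved[m.group(1)]), value)
                changed = True
        if not changed:
            break
    return resolved

params = {
    "env": "dev",
    "base_path": "s3://naoo-ai/",
    "version": "v1",
    "data_path": "${parameters.base_path}${parameters.env}/hermes-search/${parameters.version}",
}
print(resolve(params)["data_path"])  # s3://naoo-ai/dev/hermes-search/v1
```

Because `env` is itself a parameter, a profile that overrides only `env` automatically shifts every derived path.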

Step 2: Environment-Specific Overrides

Create production profile with optimized settings

Create mlknife-compose_prod.yaml with production overrides:

mlknife-compose_prod.yaml - Production Overrides
# mlknife-compose_prod.yaml - Production overrides
name: hermes-search-engine-prod
author: david.liu

parameters:
  env: prod  # Override environment

# Disable development-only services
disabled_services:
  - search_api_gateway  # Use production API Gateway instead

# Production-specific services (if any)
services:
  search_backend_service:
    configuration:
      instance_type: "m5.large"  # Larger for production
      instance_count: 3  # High availability
      auto_scaling:
        min_capacity: 2
        max_capacity: 10

executors:
  glue_etl:
    number_of_workers: 10  # Scale up for production
    worker_type: G.2X  # More powerful workers

modules: {}  # Inherit all modules from base

Multi-Environment Benefits

  • Same pipeline logic across environments
  • Environment-specific resource sizing
  • Selective service enablement/disablement
  • Consistent deployment patterns
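Conceptually, the production profile is layered over the base file: nested mappings merge, profile scalars win, and anything listed in `disabled_services` is dropped. A simplified sketch of that layering (hypothetical merge semantics for illustration; the real merge rules are ModelKnife's):

```python
def deep_merge(base, override):
    """Recursively merge override into base; override scalars win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def apply_profile(base, profile):
    """Layer a profile over the base config, then drop disabled services."""
    merged = deep_merge(base, profile)
    merged.pop("disabled_services", None)
    services = dict(merged.get("services", {}))  # copy so the base dict is untouched
    for name in profile.get("disabled_services", []):
        services.pop(name, None)
    merged["services"] = services
    return merged

base = {
    "parameters": {"env": "dev"},
    "executors": {"glue_etl": {"number_of_workers": 2, "worker_type": "G.1X"}},
    "services": {"search_backend_service": {}, "search_api_gateway": {}},
}
prod = {
    "parameters": {"env": "prod"},
    "executors": {"glue_etl": {"number_of_workers": 10, "worker_type": "G.2X"}},
    "disabled_services": ["search_api_gateway"],
}
effective = apply_profile(base, prod)
print(effective["executors"]["glue_etl"])  # {'number_of_workers': 10, 'worker_type': 'G.2X'}
print("search_api_gateway" in effective["services"])  # False
```

Note how `modules: {}` in the prod file merges to "inherit everything": an empty mapping overrides nothing.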

Step 3: Multi-Pipeline Pattern

Different ML workflows sharing base configuration

Create pipeline-specific profiles for different ML workflows:

mlknife-compose_user-interest.yaml - User Interest Pipeline
# mlknife-compose_user-interest.yaml - User interest pipeline
name: user-interest
description: "Pipeline for user interest analysis and personalization"

parameters:
  base_data_path: s3://naoo-ai/prod/feed_pixel
  lookback_window: 3d
  model_version: "1.0"
  rec_result_base_path: ${parameters.base_data_path}/recommendation_results/${parameters.model_version}

# Inherit base executors and add pipeline-specific defaults
module_defaults:
  executor: ${executors.glue_etl}
  repository: ../modules/python_module

# Pipeline-specific modules
modules:
  build_user_post_interaction_behaviors:
    number_of_workers: 4
    worker_type: G.2X
    entry_point: ./src/glue_jobs/user_interest/build_user_post_interaction_behaviors.py
    job_parameters:
      lookback_window: ${parameters.lookback_window}
      raw_user_event_path: s3://gaia-naoo-data/prod/cleaned/user_event_raw/
      output_path: ${parameters.base_data_path}/user_post_interaction_behaviors/
    depends_on: []

  generate_user_interests_daily:
    entry_point: ./src/glue_jobs/user_interest/generate_user_interests_daily.py
    job_parameters:
      max_interests: 10
      long_term_lookback_window: 6m
      output_path: ${parameters.base_data_path}/user_interest_daily/
    depends_on:
      - build_user_post_interaction_behaviors
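The `module_defaults` block above is applied to every module unless the module sets its own value, which is how `build_user_post_interaction_behaviors` gets custom worker settings while `generate_user_interests_daily` inherits the defaults. A toy sketch of that resolution (assumed semantics, for illustration only):

```python
def effective_module(defaults, module):
    """Module-level settings win over module_defaults (assumed precedence)."""
    return {**defaults, **module}

defaults = {
    "executor": "${executors.glue_etl}",
    "repository": "../modules/python_module",
}
overriding = {
    "number_of_workers": 4,
    "worker_type": "G.2X",
    "entry_point": "./src/glue_jobs/user_interest/build_user_post_interaction_behaviors.py",
}
eff = effective_module(defaults, overriding)
print(eff["executor"])           # ${executors.glue_etl}  (inherited)
print(eff["number_of_workers"])  # 4  (module-specific)
```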

Step 4: Additional Pipeline Profiles

More specialized ML workflows

Create additional pipeline profiles for different ML use cases:

mlknife-compose_media-annotation.yaml - Media Processing Pipeline
# mlknife-compose_media-annotation.yaml - Media annotation pipeline
name: media-annotation
description: "Pipeline for media content annotation and metadata extraction"

parameters:
  base_data_path: s3://naoo-ai/prod/feed_pixel
  batch_size: 100
  model_endpoint: bedrock-claude-3

# Use different executor for media processing
module_defaults:
  executor: ${executors.bedrock_batch_infer}
  repository: ../modules/python_module

# Media-specific modules
modules:
  extract_media_metadata:
    entry_point: ./src/glue_jobs/media_annotation/extract_media_metadata.py
    job_parameters:
      input_path: ${parameters.base_data_path}/raw_media/
      output_path: ${parameters.base_data_path}/media_metadata/
      batch_size: ${parameters.batch_size}
    depends_on: []

  generate_media_annotations:
    executor: ${executors.bedrock_batch_infer}
    entry_point: ./src/bedrock_jobs/generate_media_annotations.py
    job_parameters:
      model_endpoint: ${parameters.model_endpoint}
      input_path: ${parameters.base_data_path}/media_metadata/
      output_path: ${parameters.base_data_path}/media_annotations/
    depends_on:
      - extract_media_metadata

Multi-Pipeline Benefits

  • Share common executors and services across pipelines
  • Each pipeline has focused, specific modules
  • Reduce configuration duplication significantly
  • Enable modular, composable ML architectures
  • Independent deployment and scaling per use case

Profile Deployment Workflows

Step 5: Multi-Environment Workflow

Deploy same pipeline across environments

Environment-Based Deployment
# Deploy base configuration (dev by default)
mk s deploy
mk p deploy

# Deploy to production with profile override
mk s deploy -p prod
mk p deploy -p prod

# Run in production
mk p run -p prod

# Compare configurations
mk show -p prod  # Show production config
mk show          # Show base config

Step 6: Multi-Pipeline Workflow

Deploy different pipeline variants

Pipeline-Specific Deployment
# Deploy base services (shared across pipelines)
mk s deploy

# Deploy user interest pipeline
mk p deploy -p user-interest
mk p run -p user-interest

# Deploy media annotation pipeline
mk p deploy -p media-annotation 
mk p run -p media-annotation

# Deploy post enrichment pipeline
mk p deploy -p post-enrichment
mk p run -p post-enrichment

# Run pipelines on schedule
mk p schedule set -p user-interest --cron "0 2 * * *"
mk p schedule set -p media-annotation --cron "0 4 * * *"

Step 7: Hybrid (Multiple Pipelines × Multiple Environments)

Ultimate flexibility with both patterns combined

Complex Multi-Profile Deployment
# Deploy base services to dev
mk s deploy

# Test user-interest pipeline in dev
mk p deploy -p user-interest
mk p run -p user-interest

# When ready, deploy to production with both profiles
# This would need: mlknife-compose_user-interest-prod.yaml
mk s deploy -p prod
mk p deploy -p user-interest-prod
mk p run -p user-interest-prod

# Monitor across environments and pipelines
mk p status -p user-interest      # Dev environment
mk p status -p user-interest-prod # Prod environment

Profile Naming Convention

For hybrid scenarios, use descriptive profile names like:

  • mlknife-compose_user-interest.yaml - Pipeline variant
  • mlknife-compose_prod.yaml - Environment variant
  • mlknife-compose_user-interest-prod.yaml - Both combined
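The convention above means the `-p` flag maps directly to a filename. A hypothetical helper making that mapping explicit (the function name and fallback behavior are illustrative, not part of the CLI):

```python
def profile_filename(profile=None):
    """Map a -p profile name to its compose file per the naming convention."""
    if profile is None:
        return "mlknife-compose.yaml"         # base configuration
    return f"mlknife-compose_{profile}.yaml"  # profile variant

print(profile_filename())                      # mlknife-compose.yaml
print(profile_filename("prod"))                # mlknife-compose_prod.yaml
print(profile_filename("user-interest-prod"))  # mlknife-compose_user-interest-prod.yaml
```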

Profile Management Commands

Profile Selection and Status

Profile Selection Commands
# Multi-environment profiles
mk s status -p prod            # Production environment
mk p status -p prod            # Check production pipeline status

# Multi-pipeline profiles  
mk p status -p user-interest   # User interest pipeline
mk p status -p media-annotation # Media annotation pipeline
mk show -p post-enrichment     # Show post enrichment config

# Profile discovery
mk show                        # Shows base configuration
ls mlknife-compose*.yaml       # List available profiles

# Set default profile for session
export MLKNIFE_PROFILE=user-interest

Profile Comparisons and Visualization

Profile Comparison Commands
# Compare different pipelines
mk show -p user-interest --detailed
mk show -p media-annotation --detailed

# Visualize pipeline differences
mk p visualize -p user-interest
mk p visualize -p post-enrichment

# Export configurations for comparison
mk show -p prod --json > prod-config.json
mk show -p user-interest --json > user-interest-config.json

# Check which modules differ between profiles
mk show -p user-interest | grep modules
mk show | grep modules  # Base configuration
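Once configurations are exported as JSON, differences can be computed programmatically instead of eyeballed with grep. A small sketch that walks two parsed configs and reports the dotted paths where they disagree (inline dicts stand in for the exported files; the diff logic is illustrative):

```python
def diff_keys(a, b, prefix=""):
    """Yield dotted paths where two parsed config mappings disagree."""
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        va, vb = a.get(key), b.get(key)
        if isinstance(va, dict) and isinstance(vb, dict):
            yield from diff_keys(va, vb, path + ".")
        elif va != vb:
            yield path

# In practice, load the exported files instead:
#   with open("prod-config.json") as f: prod_cfg = json.load(f)
base_cfg = {"parameters": {"env": "dev"},
            "executors": {"glue_etl": {"number_of_workers": 2}}}
prod_cfg = {"parameters": {"env": "prod"},
            "executors": {"glue_etl": {"number_of_workers": 10}}}

for path in diff_keys(base_cfg, prod_cfg):
    print(path)
# executors.glue_etl.number_of_workers
# parameters.env
```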

Pipeline Orchestration

Multi-Pipeline Orchestration
# Schedule multiple pipelines with dependencies
mk p schedule set -p user-interest --cron "0 1 * * *"
mk p schedule set -p post-enrichment --cron "0 3 * * *"  # After user-interest
mk p schedule set -p media-annotation --cron "0 5 * * *"

# Monitor pipeline runs across profiles
mk p runs -p user-interest --limit 5
mk p runs -p media-annotation --limit 5

# Run pipelines with shared data dependencies
# User interest generates data used by post enrichment
mk p run -p user-interest
# Wait for completion, then:
mk p run -p post-enrichment

# Check pipeline dependencies
mk p visualize -p post-enrichment  # Shows dependency graph

Profile Best Practices

Multi-Environment Best Practices

  • Environment Isolation: Use separate AWS accounts or strict IAM policies
  • Resource Naming: Include environment in resource names (e.g., hermes-search-prod)
  • Configuration Validation: Always run mk p deploy --dry-run before production
  • Graduated Deployment: Test in dev → staging → production progression
  • Environment Parity: Keep staging as close to production as possible

Multi-Pipeline Best Practices

  • Shared Infrastructure: Define common executors and services in base configuration
  • Pipeline Modularity: Each profile should represent a focused ML use case
  • Data Dependencies: Document and manage data flows between pipelines
  • Scheduling Coordination: Schedule dependent pipelines with appropriate delays
  • Resource Optimization: Use appropriate executor types for each pipeline's needs

Configuration Management

  • Version Control: Store all profile configurations in Git with descriptive names
  • Profile Documentation: Comment each profile's purpose and key differences
  • Parameter Consistency: Use consistent parameter naming across profiles
  • Secrets Management: Never hardcode credentials; use AWS Parameter Store
  • Regular Cleanup: Remove unused profile configurations to avoid confusion

Real-World Profile Examples

Common Profile Patterns

  • mlknife-compose.yaml - Base configuration with shared services
  • mlknife-compose_prod.yaml - Production environment overrides
  • mlknife-compose_user-interest.yaml - User behavior analysis pipeline
  • mlknife-compose_media-annotation.yaml - Media content processing pipeline
  • mlknife-compose_post-enrichment.yaml - Content enrichment pipeline
  • mlknife-compose_data-clean.yaml - Data cleaning and validation pipeline