Advanced Debugging Guide

Comprehensive debugging techniques for complex ModelKnife deployment issues

Pipeline Deployment Failures

Issues during pipeline deployment to AWS services

Symptoms

  • Deployment fails with "Resource not found" errors
  • Modules deploy but dependencies aren't resolved
  • Step Functions creation fails
  • Partial deployment success with some modules failing
[2024-01-15 10:30:45] ERROR Failed to deploy module 'data_processing'
[2024-01-15 10:30:45] ERROR SageMaker role 'arn:aws:iam::123456789012:role/SageMakerRole' does not exist
[2024-01-15 10:30:45] ERROR Deployment failed: InvalidParameterValueException
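
When a deployment log is long, it can help to filter it down to the ERROR entries before reading further. The following is a minimal sketch that assumes every entry follows the "[timestamp] LEVEL message" shape of the excerpt above; the names parse_entries and errors_only are illustrative and are not part of ModelKnife.

```python
import re
from typing import List, NamedTuple

# Assumed entry format, taken from the sample log: "[YYYY-MM-DD HH:MM:SS] LEVEL message"
TIMESTAMP = r"\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\]"
LOG_ENTRY = re.compile(r"\[(.+?)\]\s+([A-Z]+)\s+(.*)")


class LogEntry(NamedTuple):
    timestamp: str
    level: str
    message: str


def parse_entries(text: str) -> List[LogEntry]:
    """Split raw log text into entries, even when several entries share one line."""
    entries = []
    # Split just before every "[timestamp]" so run-together entries are separated.
    for chunk in re.split(rf"(?={TIMESTAMP})", text):
        match = LOG_ENTRY.match(chunk.strip())
        if match:
            entries.append(LogEntry(*match.groups()))
    return entries


def errors_only(entries: List[LogEntry]) -> List[LogEntry]:
    """Keep only the entries logged at ERROR level."""
    return [entry for entry in entries if entry.level == "ERROR"]
```

Reading the ERROR messages in order usually surfaces the root cause (here, the missing role) ahead of the generic "Deployment failed" line.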

Debugging Steps

  1. Check deployment order:
    mk s status  # Verify services deployed first
    mk p status --detailed  # Check module status
  2. Validate IAM roles exist:
    aws iam get-role --role-name SageMakerRole
    aws iam get-role --role-name GlueServiceRole
  3. Check dependency chain:
    mk p visualize  # Generate dependency graph
    grep -r "depends_on" mlknife-compose.yaml
  4. Review AWS service limits:
    aws service-quotas get-service-quota --service-code sagemaker --quota-code L-1194F43D

Solutions

  • Deploy services first: Always run mk s deploy before mk p deploy
  • Fix IAM roles: Run mk conf setup to recreate missing roles
  • Resolve dependency cycles: Remove circular dependencies in module configurations
  • Request limit increases: Contact AWS Support for service limit increases
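
Circular dependencies can be hard to spot by reading the YAML. As a rough illustration, the sketch below runs a depth-first search over a mapping of module name to the modules it depends on. It assumes you have already extracted that mapping from the depends_on entries in your configuration; the function name and the module names in the usage example are hypothetical.

```python
from typing import Dict, List, Optional


def find_cycle(depends_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of module names, or None if there is none."""
    UNVISITED, ON_PATH, DONE = 0, 1, 2
    state = {name: UNVISITED for name in depends_on}
    path: List[str] = []

    def visit(name: str) -> Optional[List[str]]:
        state[name] = ON_PATH
        path.append(name)
        for dep in depends_on.get(name, []):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == ON_PATH:
                # dep is already on the current path, so path[dep:] plus dep closes a cycle.
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[name] = DONE
        return None

    for name in list(depends_on):
        if state[name] == UNVISITED:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


# Hypothetical module names, for illustration only.
modules = {"train": ["features"], "features": ["ingest"], "ingest": ["train"]}
print(find_cycle(modules))
```

Breaking any one edge in the reported cycle is enough to make the graph deployable again.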

Module Execution Errors

Runtime failures in SageMaker, Glue, or other AWS services

Symptoms

  • SageMaker jobs fail with container exit codes
  • Glue jobs time out or fail with memory errors
  • Step Functions execution stops with errors
  • Modules succeed but produce incorrect outputs
[2024-01-15 11:45:22] ERROR SageMaker job failed: AlgorithmError
[2024-01-15 11:45:22] INFO Exit code: 1
[2024-01-15 11:45:22] ERROR Container error: ImportError: No module named 'custom_package'
[2024-01-15 11:45:22] WARN Job exceeded memory limit: 8192 MB

Debugging Steps

  1. Check execution logs:
    mk p status --module data_processing
    aws logs filter-log-events --log-group-name /aws/sagemaker/ProcessingJobs
  2. Validate input data:
    aws s3 ls s3://your-bucket/input/
    mk p show --detailed
  3. Test locally:
    python src/process_data.py --input local_test_data/
    docker run --rm -v "$(pwd)":/opt/ml sagemaker-scikit-learn:0.23-1-cpu-py3
  4. Check resource usage:
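    aws cloudwatch get-metric-statistics --namespace /aws/sagemaker/ProcessingJobs --metric-name MemoryUtilization --dimensions Name=Host,Value=JOB_NAME/algo-1 --start-time START_TIME --end-time END_TIME --period 60 --statistics Maximum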

Solutions

  • Fix dependencies: Add missing packages to requirements.txt or install scripts
  • Increase resources: Use larger instance types or increase memory allocation
  • Validate inputs: Add input validation modules to catch data issues early
  • Implement retries: Configure retry logic for transient failures
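
Retry logic for transient failures is commonly written as exponential backoff with jitter. The sketch below is one generic way to do that in module code. It is not a ModelKnife feature, and the function name, the default exception types, and the delay values are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying only on the exception types listed in retryable."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay cap doubles each attempt (1s, 2s, 4s, ...), bounded by max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random time up to the cap to avoid synchronized retries.
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")
```

Errors such as the ImportError in the log above are not transient, so they are deliberately left out of the retryable set: retrying would only delay the failure.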

IAM Permission Issues

Access denied errors and permission-related failures

Symptoms

  • AccessDenied errors during deployment
  • Services can't access other AWS resources
  • Team members can't access ModelKnife commands
  • Cross-account access failures
[2024-01-15 12:15:33] ERROR AccessDenied: User: arn:aws:iam::123456789012:user/ml-engineer
[2024-01-15 12:15:33] ERROR is not authorized to perform: sagemaker:CreateProcessingJob
[2024-01-15 12:15:33] ERROR on resource: arn:aws:sagemaker:us-east-1:123456789012:processing-job/*

Debugging Steps

  1. Check user permissions:
    aws iam get-user --user-name ml-engineer
    aws iam list-attached-user-policies --user-name ml-engineer
    aws iam list-groups-for-user --user-name ml-engineer
  2. Verify role trust relationships:
    aws iam get-role --role-name SageMakerRole
    mk team status
  3. Test specific permissions:
    aws iam simulate-principal-policy --policy-source-arn arn:aws:iam::123456789012:user/ml-engineer --action-names sagemaker:CreateProcessingJob
  4. Review CloudTrail logs:
    aws logs filter-log-events --log-group-name CloudTrail/AccessDenied
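
The simulate-principal-policy command in step 3 prints a JSON document whose EvaluationResults list carries one EvalDecision per action. When you simulate many actions at once, a small helper can pull out only the ones that were not allowed. This is a sketch that assumes you saved the command's JSON output as a string; the helper name is illustrative.

```python
import json
from typing import Dict


def denied_actions(simulation_json: str) -> Dict[str, str]:
    """Map each action that was not allowed to its decision string."""
    results = json.loads(simulation_json).get("EvaluationResults", [])
    return {
        result["EvalActionName"]: result["EvalDecision"]
        for result in results
        if result["EvalDecision"] != "allowed"
    }
```

A decision of implicitDeny means no policy grants the action, so the fix is to add the permission. A decision of explicitDeny means a Deny statement matched, and that statement has to be removed or narrowed before any Allow can take effect.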

Solutions

  • Add user to groups: mk team add-user --user USERNAME --group mlknife-developers
  • Update role policies: mk conf update-roles --fix-permissions
  • Fix trust relationships: Ensure roles can be assumed by correct services
  • Grant missing permissions: Add required actions to IAM policies

Performance Problems

Slow execution, timeouts, and resource inefficiency

Symptoms

  • Jobs take longer than expected to complete
  • Frequent timeout errors
  • High memory or CPU usage
  • Unexpectedly high AWS costs

Performance Analysis

  1. Analyze execution metrics:
    mk p runs --performance-report --last-month
    aws cloudwatch get-metric-statistics --namespace AWS/SageMaker --metric-name ProcessingJobRuntimeSeconds --start-time START_TIME --end-time END_TIME --period 300 --statistics Average
  2. Profile resource usage:
    mk show costs --breakdown-by-module --last-week
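    aws cloudwatch get-metric-statistics --namespace /aws/sagemaker/ProcessingJobs --metric-name CPUUtilization --dimensions Name=Host,Value=JOB_NAME/algo-1 --start-time START_TIME --end-time END_TIME --period 60 --statistics Average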
  3. Check data size and distribution:
    aws s3api head-object --bucket your-bucket --key data/file.parquet
    mk p run --modules data_profiling
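
The get-metric-statistics command returns JSON with a Datapoints list, which is not guaranteed to be in time order and is tedious to read by eye. The sketch below condenses one statistic into a few numbers. It assumes the command's output was saved as a JSON string, and the helper name is illustrative.

```python
import json
from statistics import mean
from typing import Dict


def summarize_datapoints(metric_json: str, statistic: str = "Average") -> Dict[str, float]:
    """Summarize one statistic (e.g. "Average" or "Maximum") across all datapoints."""
    datapoints = json.loads(metric_json).get("Datapoints", [])
    values = [point[statistic] for point in datapoints if statistic in point]
    if not values:
        raise ValueError(f"no '{statistic}' datapoints in response")
    return {
        "count": len(values),
        "min": min(values),
        "mean": mean(values),
        "max": max(values),
    }
```

A peak close to 100 percent suggests the job is constrained and may need a larger instance, while a peak far below that suggests the instance is oversized.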

Optimization Strategies

  • Right-size instances: Choose instance types that match each workload's CPU and memory needs
  • Optimize data formats: Use Parquet and tune partitioning
  • Implement caching: Cache intermediate results in S3
  • Parallelize processing: Increase instance count for distributed workloads
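
Caching intermediate results only helps if a cached object is reused exactly when the inputs and parameters are unchanged. One common approach is to derive the S3 key from a hash of both. The sketch below assumes you identify inputs by their S3 ETags (as returned by head-object) and keep module parameters in a JSON-serializable dict; the function name and key layout are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, Iterable


def cache_key(
    module: str,
    input_etags: Iterable[str],
    params: Dict[str, Any],
    prefix: str = "cache",
) -> str:
    """Build a deterministic S3 key from the module name, input ETags, and parameters."""
    payload = json.dumps(
        # Sorting inputs and keys makes the key independent of ordering.
        {"module": module, "inputs": sorted(input_etags), "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{prefix}/{module}/{digest}"
```

Before running a module, check whether an object already exists at this key; if it does, the earlier result can be reused instead of recomputed.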

Configuration Validation Errors

YAML syntax errors and configuration validation failures

Symptoms

  • YAML parsing errors
  • Invalid reference errors (${...})
  • Missing required fields
  • Configuration validation failures
[2024-01-15 13:20:15] ERROR YAML parsing failed: mlknife-compose.yaml:45
[2024-01-15 13:20:15] ERROR Invalid reference: ${services.nonexistent_service.outputs.table_name}
[2024-01-15 13:20:15] ERROR Required field missing: executors.python_processor.instance_type

Configuration Debugging

  1. Validate YAML syntax:
    mk s validate
    mk p validate
    yamllint mlknife-compose.yaml
  2. Check reference resolution:
    mk fmt show --resolve-references
    grep -r '\${' mlknife-compose.yaml
  3. Verify service outputs:
    mk s status --outputs
    mk conf show --services

Configuration Fixes

  • Fix YAML syntax: Use proper indentation and quoting
  • Correct references: Ensure referenced services exist
  • Add missing fields: Include all required configuration parameters
  • Use validation tools: Run mk s validate before deployment
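
To see which references point at services or outputs that do not exist, you can scan the configuration text yourself. The sketch below handles only the ${services.NAME.outputs.KEY} form shown in the error log above; other reference forms, if ModelKnife supports any, are not covered. The function name and the service_outputs mapping are assumptions, and you would fill the mapping from your own deployed services.

```python
import re
from typing import Dict, List, Set

# Matches references of the form ${services.NAME.outputs.KEY}
REFERENCE = re.compile(r"\$\{services\.([A-Za-z0-9_-]+)\.outputs\.([A-Za-z0-9_-]+)\}")


def unresolved_references(config_text: str, service_outputs: Dict[str, Set[str]]) -> List[str]:
    """Report references whose service or output is not in service_outputs."""
    problems = []
    for line_no, line in enumerate(config_text.splitlines(), start=1):
        for service, output in REFERENCE.findall(line):
            if service not in service_outputs:
                problems.append(f"line {line_no}: unknown service '{service}'")
            elif output not in service_outputs[service]:
                problems.append(f"line {line_no}: service '{service}' has no output '{output}'")
    return problems
```

Reporting line numbers makes it quicker to jump to the offending entry in mlknife-compose.yaml.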

Emergency Recovery Procedures

Pipeline Rollback

# Emergency rollback to last stable version
mk p rollback --version $(mk p version --last-stable) --force

# Verify rollback success
mk p status --detailed

Stop All Executions

# Stop running executions
mk p stop --all --force

# Check for any remaining running jobs
mk p runs --status running

Contact Support

  • Gather error logs and the relevant configuration
  • Document the steps leading to the issue
  • Create a GitHub issue with these details
  • For critical issues, contact the team immediately

Still Need Help?

If these troubleshooting steps don't resolve your issue, we're here to help.
