Advanced Debugging Guide
Comprehensive debugging techniques for complex ModelKnife deployment issues
Common Issues
Pipeline Deployment Failures
Issues during pipeline deployment to AWS services
Symptoms
- Deployment fails with "Resource not found" errors
- Modules deploy but dependencies aren't resolved
- Step Functions creation fails
- Partial deployment success with some modules failing
ERROR Failed to deploy module 'data_processing'
ERROR SageMaker role 'arn:aws:iam::123456789012:role/SageMakerRole' does not exist
ERROR Deployment failed: InvalidParameterValueException
Debugging Steps
- Check deployment order:
mk s status              # Verify services deployed first
mk p status --detailed   # Check module status
- Validate IAM roles exist:
aws iam get-role --role-name SageMakerRole
aws iam get-role --role-name GlueServiceRole
- Check dependency chain:
mk p visualize   # Generate dependency graph
grep -r "depends_on" mlknife-compose.yaml
- Review AWS service limits:
aws service-quotas get-service-quota --service-code sagemaker --quota-code L-1194F43D
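When a deployment log names several missing roles, it can help to pull the role names out of the log before running the checks above. The following is a minimal sketch, assuming the log lines look like the sample error output in this section; the helper name missing_roles is hypothetical and not part of ModelKnife.

```python
import re

# Matches an IAM role ARN such as arn:aws:iam::123456789012:role/SageMakerRole
ROLE_ARN = re.compile(r"arn:aws[\w-]*:iam::(\d{12}):role/([\w+=,.@/-]+)")


def missing_roles(log_lines):
    """Return (account_id, role) pairs for roles the log reports as missing."""
    found = []
    for line in log_lines:
        if "does not exist" not in line:
            continue
        for account, role in ROLE_ARN.findall(line):
            if (account, role) not in found:
                found.append((account, role))
    return found


log = [
    "ERROR Failed to deploy module 'data_processing'",
    "ERROR SageMaker role 'arn:aws:iam::123456789012:role/SageMakerRole' does not exist",
]

for account, role in missing_roles(log):
    # If the role has a path, the role name is the last path segment.
    print(f"aws iam get-role --role-name {role.rsplit('/', 1)[-1]}")
```

Each printed line is an aws iam get-role command that can be run as-is to confirm whether the role exists in the current account.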
Solutions
- Deploy services first: Always run mk s deploy before mk p deploy
- Fix IAM roles: Run mk conf setup to recreate missing roles
- Resolve dependency cycles: Remove circular dependencies in module configurations
- Request limit increases: Contact AWS Support for service limit increases
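Circular dependencies can be hard to spot by reading the configuration. The sketch below shows one common way to find them, a depth-first search over the depends_on graph. It assumes the dependencies have already been loaded into a plain dict mapping each module name to the modules it depends on; the module names other than data_processing are hypothetical, and this is not how ModelKnife itself necessarily resolves dependencies.

```python
def find_cycle(depends_on):
    """Return one dependency cycle as a list of module names, or None.

    depends_on maps each module name to the list of modules it depends on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for dep in depends_on.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:  # back edge: dep is already on the current path
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for module in depends_on:
        if color.get(module, WHITE) == WHITE:
            cycle = visit(module)
            if cycle:
                return cycle
    return None


# Hypothetical module graph containing a cycle
modules = {
    "data_processing": ["feature_engineering"],
    "feature_engineering": ["training"],
    "training": ["data_processing"],
}
print(find_cycle(modules))
```

The returned list starts and ends with the same module, so removing any one of the listed edges from the configuration breaks the cycle.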
Module Execution Errors
Runtime failures in SageMaker, Glue, or other AWS services
Symptoms
- SageMaker jobs fail with container exit codes
- Glue jobs time out or fail with memory errors
- Step Functions execution stops with errors
- Modules succeed but produce incorrect outputs
ERROR SageMaker job failed: AlgorithmError
INFO Exit code: 1
ERROR Container error: ImportError: No module named 'custom_package'
WARN Job exceeded memory limit: 8192 MB
Debugging Steps
- Check execution logs:
mk p status --module data_processing
aws logs filter-log-events --log-group-name /aws/sagemaker/ProcessingJobs
- Validate input data:
aws s3 ls s3://your-bucket/input/
mk p show --detailed
- Test locally:
python src/process_data.py --input local_test_data/
docker run --rm -v "$(pwd)":/opt/ml sagemaker-scikit-learn:0.23-1-cpu-py3
- Check resource usage:
aws cloudwatch get-metric-statistics --namespace AWS/SageMaker --metric-name MemoryUtilization
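Once the execution logs have been downloaded, import failures such as the one in the sample output can be collected automatically. This is a minimal sketch, assuming log lines in the format shown above; the helper name missing_packages is hypothetical.

```python
import re

# Python 3 quotes the module name; older messages did not, so the quotes are optional.
MISSING_MODULE = re.compile(r"No module named '?([A-Za-z_][\w.]*)'?")


def missing_packages(log_lines):
    """Return the top-level packages named in import errors, in first-seen order."""
    seen = []
    for line in log_lines:
        match = MISSING_MODULE.search(line)
        if match:
            top_level = match.group(1).split(".")[0]
            if top_level not in seen:
                seen.append(top_level)
    return seen


log = [
    "ERROR SageMaker job failed: AlgorithmError",
    "INFO Exit code: 1",
    "ERROR Container error: ImportError: No module named 'custom_package'",
]
print(missing_packages(log))
```

Note that the import name is not always the name of the package to install (for example, sklearn is installed as scikit-learn), so the result is a starting point for editing requirements.txt rather than a ready-made list.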
Solutions
- Fix dependencies: Add missing packages to requirements.txt or install scripts
- Increase resources: Use larger instance types or increase memory allocation
- Validate inputs: Add input validation modules to catch data issues early
- Implement retries: Configure retry logic for transient failures
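For retries inside module code, a common pattern is exponential backoff with jitter. The sketch below is illustrative only: TransientError is a hypothetical marker exception, and in practice you would catch whichever throttling or timeout errors your code actually raises. Step Functions states also support a native Retry field; whether ModelKnife exposes it in its configuration is not covered here.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (throttling, timeouts)."""


def retry(func, attempts=4, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call func(); on TransientError, wait and try again, up to `attempts` calls."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff with full jitter:
            # wait a random time between 0 and min(max_delay, base_delay * 2^(attempt-1)).
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

The sleep parameter exists so the function can be exercised in tests without waiting; in normal use the default time.sleep applies. Errors that are not transient, such as the ImportError above, should not be retried because they fail the same way every time.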
IAM Permission Issues
Access denied errors and permission-related failures
Symptoms
- AccessDenied errors during deployment
- Services can't access other AWS resources
- Team members can't access ModelKnife commands
- Cross-account access failures
ERROR AccessDenied: User: arn:aws:iam::123456789012:user/ml-engineer
ERROR is not authorized to perform: sagemaker:CreateProcessingJob
ERROR on resource: arn:aws:sagemaker:us-east-1:123456789012:processing-job/*
Debugging Steps
- Check user permissions:
aws iam get-user --user-name ml-engineer
aws iam list-attached-user-policies --user-name ml-engineer
aws iam list-groups-for-user --user-name ml-engineer
- Verify role trust relationships:
aws iam get-role --role-name SageMakerRole
mk team status
- Test specific permissions:
aws iam simulate-principal-policy --policy-source-arn arn:aws:iam::123456789012:user/ml-engineer --action-names sagemaker:CreateProcessingJob
- Review CloudTrail logs:
aws logs filter-log-events --log-group-name CloudTrail/AccessDenied
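An AccessDenied message already contains the principal, action, and resource needed for the policy simulation step. The following sketch extracts them and builds the corresponding command, assuming the message format shown in the sample output above (possibly wrapped across several log lines). The helper names are hypothetical; --resource-arns is an existing option of aws iam simulate-principal-policy.

```python
import re
import shlex

DENIED = re.compile(
    r"User: (?P<principal>arn:\S+)\s+"
    r"is not authorized to perform: (?P<action>[\w-]+:\w+)\s+"
    r"on resource: (?P<resource>\S+)"
)


def parse_access_denied(lines):
    """Extract principal, action, and resource from an AccessDenied message."""
    # The message may be wrapped over several log lines: drop the level prefix and join.
    text = " ".join(
        re.sub(r"^(ERROR|WARN|INFO)\s+", "", line.strip()) for line in lines
    )
    match = DENIED.search(text)
    return match.groupdict() if match else None


def simulate_command(denied):
    """Build the matching `aws iam simulate-principal-policy` invocation."""
    args = [
        "aws", "iam", "simulate-principal-policy",
        "--policy-source-arn", denied["principal"],
        "--action-names", denied["action"],
        "--resource-arns", denied["resource"],
    ]
    # Quote each argument so that wildcards such as * are not expanded by the shell.
    return " ".join(shlex.quote(arg) for arg in args)


log = [
    "ERROR AccessDenied: User: arn:aws:iam::123456789012:user/ml-engineer",
    "ERROR is not authorized to perform: sagemaker:CreateProcessingJob",
    "ERROR on resource: arn:aws:sagemaker:us-east-1:123456789012:processing-job/*",
]
denied = parse_access_denied(log)
if denied:
    print(simulate_command(denied))
```

Passing the resource ARN makes the simulation evaluate the same request that failed, rather than the action against all resources.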
Solutions
- Add user to groups:
mk team add-user --user USERNAME --group mlknife-developers
- Update role policies:
mk conf update-roles --fix-permissions
- Fix trust relationships: Ensure roles can be assumed by correct services
- Grant missing permissions: Add required actions to IAM policies
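To check a trust relationship, inspect the role's trust policy, which can be retrieved with aws iam get-role --role-name SageMakerRole --query 'Role.AssumeRolePolicyDocument'. The sketch below tests whether such a document lets a given service assume the role. It is deliberately simplified: it ignores Condition blocks and wildcard principals, and the helper name allows_service is hypothetical.

```python
import json


def allows_service(trust_policy, service):
    """Return True if the trust policy lets `service` call sts:AssumeRole."""
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        principals = stmt.get("Principal", {})
        services = principals.get("Service", []) if isinstance(principals, dict) else []
        if isinstance(services, str):
            services = [services]
        if "sts:AssumeRole" in actions and service in services:
            return True
    return False


# A typical trust policy for a role that SageMaker assumes
policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "sagemaker.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}
""")

print(allows_service(policy, "sagemaker.amazonaws.com"))
print(allows_service(policy, "glue.amazonaws.com"))
```

With this policy, the first check passes and the second does not, so a Glue job using the same role would fail to assume it until glue.amazonaws.com is added as a service principal.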
Performance Problems
Slow execution, timeouts, and resource inefficiency
Symptoms
- Jobs take longer than expected to complete
- Frequent timeout errors
- High memory or CPU usage
- Higher-than-expected AWS costs
Performance Analysis
- Analyze execution metrics:
mk p runs --performance-report --last-month
aws cloudwatch get-metric-statistics --namespace AWS/SageMaker --metric-name ProcessingJobRuntimeSeconds
- Profile resource usage:
mk show costs --breakdown-by-module --last-week
aws cloudwatch get-metric-statistics --metric-name CPUUtilization
- Check data size and distribution:
aws s3api head-object --bucket your-bucket --key data/file.parquet
mk p run --modules data_profiling
Optimization Strategies
- Right-size instances: Use appropriate instance types for workload
- Optimize data formats: Use Parquet, optimize partitioning
- Implement caching: Cache intermediate results in S3
- Parallel processing: Increase instance count for distributed workloads
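Caching intermediate results only pays off if a cached result is reused exactly when the inputs and parameters are unchanged. One way to get that property is to derive the S3 key from a hash of everything that affects the output. The sketch below illustrates the idea; the cache/ prefix, the function name cache_key, and the choice of S3 ETags as the input fingerprint are all assumptions, not ModelKnife behavior.

```python
import hashlib
import json


def cache_key(module, params, input_etags):
    """Derive a deterministic S3 key prefix for a module's intermediate output."""
    # Sort keys and inputs so that ordering differences do not change the key.
    payload = json.dumps(
        {"module": module, "params": params, "inputs": sorted(input_etags)},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"cache/{module}/{digest}/"


# Hypothetical parameters and input ETags
key = cache_key(
    "data_processing",
    {"sample_rate": 0.1, "format": "parquet"},
    ["etag-b", "etag-a"],
)
print(key)
```

Before running a module, check whether objects exist under the computed prefix (for example with aws s3 ls); if they do, the step can be skipped and its cached output reused.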
Configuration Validation Errors
YAML syntax errors and configuration validation failures
Symptoms
- YAML parsing errors
- Invalid reference errors (${...})
- Missing required fields
- Configuration validation failures
ERROR YAML parsing failed: mlknife-compose.yaml:45
ERROR Invalid reference: ${services.nonexistent_service.outputs.table_name}
ERROR Required field missing: executors.python_processor.instance_type
Configuration Debugging
- Validate YAML syntax:
mk s validate
mk p validate
yamllint mlknife-compose.yaml
- Check reference resolution:
mk fmt show --resolve-references
grep -r '\${' mlknife-compose.yaml
- Verify service outputs:
mk s status --outputs
mk conf show --services
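The grep above lists every reference, but not which ones are broken. The following sketch goes one step further and reports references whose service name is not in a known set, with line numbers. It assumes references take the form ${services.<name>...} as in the sample error output, works on the raw file text without parsing YAML, and uses hypothetical configuration keys in its example.

```python
import re

REFERENCE = re.compile(r"\$\{([^}]+)\}")


def unresolved_references(config_text, known_services):
    """Return (line_number, reference) pairs whose service is not in known_services."""
    problems = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        for ref in REFERENCE.findall(line):
            parts = ref.split(".")
            # Only references of the form services.<name>... are checked here.
            if len(parts) >= 2 and parts[0] == "services" and parts[1] not in known_services:
                problems.append((number, ref))
    return problems


# Hypothetical configuration fragment
sample = """\
modules:
  data_processing:
    table: ${services.feature_store.outputs.table_name}
    other: ${services.nonexistent_service.outputs.table_name}
"""

for number, ref in unresolved_references(sample, {"feature_store"}):
    print(f"line {number}: unknown service in ${{{ref}}}")
```

The set of known services would come from the deployed service list, for example the names reported by mk s status --outputs.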
Configuration Fixes
- Fix YAML syntax: Use proper indentation and quoting
- Correct references: Ensure referenced services exist
- Add missing fields: Include all required configuration parameters
- Use validation tools: Run mk s validate before deployment
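Missing required fields can also be caught with a small pre-deployment check of your own. This sketch assumes the configuration has already been loaded into nested dicts and that required fields are written as dotted paths, as in the sample error output; the function name missing_fields and the role field are hypothetical, and the actual list of required fields depends on your ModelKnife version.

```python
def missing_fields(config, required_paths):
    """Return the dotted paths from required_paths that are absent from config."""
    missing = []
    for path in required_paths:
        node = config
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing


# Hypothetical loaded configuration: instance_type has been left out
config = {"executors": {"python_processor": {"role": "SageMakerRole"}}}
required = [
    "executors.python_processor.instance_type",
    "executors.python_processor.role",
]

for path in missing_fields(config, required):
    print(f"Required field missing: {path}")
```

Running a check like this in CI alongside mk s validate surfaces missing fields before a deployment is attempted.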
Emergency Recovery Procedures
Pipeline Rollback
# Emergency rollback to last stable version
mk p rollback --version $(mk p version --last-stable) --force
# Verify rollback success
mk p status --detailed
Stop All Executions
# Stop running executions
mk p stop --all --force
# Check for any remaining running jobs
mk p runs --status running
Contact Support
- Gather error logs and configuration
- Document steps leading to the issue
- Create a GitHub issue with the details
- For critical issues: notify the team immediately
Still Need Help?
If these troubleshooting steps don't resolve your issue, we're here to help.