# YAML Configuration Reference

Complete reference for the ModelKnife YAML configuration schema.
## Root Configuration Schema

Top-level fields in `mlknife-compose.yaml`:

```yaml
# ModelKnife Configuration File
name: "my-ml-pipeline"
author: "ml-team"
version: "v1.0"
description: "Complete ML workflow for recommendation system"

parameters:
  environment: dev
  data_bucket: "ml-pipeline-data"

executors:
  # Compute environment templates

services:
  # Infrastructure services

modules:
  # ML processing modules
```
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Required | Unique identifier for the pipeline. Used for AWS resource naming. |
| `author` | string | Required | Author or team name for metadata and tagging. |
| `version` | string | Optional | Pipeline version for deployment tracking. |
| `description` | string | Optional | Human-readable description of the pipeline purpose. |
| `role` | string | Optional | Global IAM role ARN for pipeline execution. Can be overridden in individual executors. |
| `parameters` | object | Optional | Configuration parameters and variables. |
| `executors` | object | Optional | Compute environment templates for modules. |
| `services` | object | Optional | Infrastructure services (DynamoDB, Lambda, etc.). |
| `modules` | object | Optional | ML processing modules and workflows. |
## Services Configuration

Infrastructure services and AWS resources:

```yaml
services:
  feature_store:
    type: sagemaker_feature_store
    configuration:
      feature_group_name: "features-${parameters.environment}"
      record_identifier_name: "feature_id"
      event_time_feature_name: "timestamp"
    depends_on: []
    tags:
      service_type: "feature_store"
      environment: "${parameters.environment}"

  model_api:
    type: lambda_function
    repository: "../services/"
    configuration:
      function_name: "ml-api-${parameters.environment}"
      runtime: "python3.9"
      handler: "api.lambda_handler"
      code_path: "./lambda"
    depends_on: []
    tags:
      service_type: "lambda_function"
```
### Service Types

| Service Type | AWS Service | Description | Key Configuration |
|---|---|---|---|
| `dynamodb_table` | DynamoDB | NoSQL database tables | `table_name`, `partition_key`, `sort_key` |
| `lambda_function` | Lambda | Serverless functions | `function_name`, `runtime`, `handler` |
| `api_gateway` | API Gateway | REST API endpoints | `api_name`, `resources`, `methods` |
| `s3_bucket` | S3 | Object storage | `bucket_name`, `versioning` |
| `sagemaker_endpoint` | SageMaker | Model inference endpoints | `endpoint_name`, `model_name` |
| `sagemaker_feature_store` | SageMaker Feature Store | ML feature storage and retrieval | `feature_group_name`, `record_identifier_name` |
| `search_service` | OpenSearch | Search and analytics | `service_name`, `indices`, `fields` |
💡 **Service Outputs:** Services automatically generate outputs (such as table names and ARNs) that can be referenced in modules using `${services.SERVICE_NAME.outputs.FIELD}`.
## Executors Configuration

Compute environment templates for ML processing:

```yaml
executors:
  python_processor:
    type: sagemaker_processor
    class: sagemaker.sklearn.processing.SKLearnProcessor
    instance_type: "ml.c5.2xlarge"
    instance_count: 1

  glue_etl:
    type: glue_etl
    job_name: "ml-data-processing"
    runtime: "python3.9"
    glue_version: "5.0"
    worker_type: "G.1X"
    number_of_workers: 2
```
## Modules Configuration

ML processing steps and workflows:

```yaml
modules:
  data_cleaning:
    repository: "../modules"
    executor: "${executors.python_processor}"
    entry_point: "./jobs/clean_data.py"
    description: "Clean and validate raw data"
    depends_on: []

  model_training:
    repository: "../modules"
    executor: "${executors.python_processor}"
    entry_point: "./jobs/train_model.py"
    description: "Train ML model"
    depends_on: ["data_cleaning"]
```
## Parameters and Variables

ModelKnife supports multiple ways to define and resolve configuration variables, making your configurations flexible and environment-aware.

```yaml
parameters:
  environment: development
  version: v1.0.0
  aws_region: us-west-2

  # Nested parameters
  database:
    host: localhost
    port: 5432
    name: myapp_dev
```
### Variable Resolution Methods

#### Parameter References

Reference other parameters using the `${parameters.key}` syntax:

```yaml
parameters:
  environment: development
  db_name: myapp_${parameters.environment}

services:
  my-service:
    configuration:
      function_name: "api-${parameters.environment}"
      database_url: "postgresql://localhost/${parameters.db_name}"
```
#### Environment Variables

Access environment variables using the `${env.VARIABLE}` syntax:

```bash
# .env file
ENVIRONMENT=production
AWS_REGION=us-east-1
VERSION=v2.0.0
```

```yaml
# mlknife-compose.yaml
name: my-pipeline-${env.VERSION}

parameters:
  environment: ${env.ENVIRONMENT}
  region: ${env.AWS_REGION}

services:
  my-lambda:
    configuration:
      function_name: "app-${env.VERSION}-${env.ENVIRONMENT}"
      memory_size: ${env.LAMBDA_MEMORY}
```
#### Service Output References

Reference outputs from deployed services using the `${services.SERVICE_NAME.outputs.KEY}` syntax:

```yaml
services:
  database:
    type: dynamodb_table
    configuration:
      table_name: "users-${parameters.environment}"

  api:
    type: lambda_function
    depends_on: [database]
    configuration:
      function_name: "api-${parameters.environment}"
      environment:
        TABLE_NAME: "${services.database.outputs.table_name}"
        TABLE_ARN: "${services.database.outputs.table_arn}"
```
#### Pipeline Context

Access pipeline-level information using the `${pipeline.property}` syntax:

```yaml
name: recommendation-system

services:
  storage:
    configuration:
      bucket_name: "${pipeline.name}-data-${parameters.environment}"
```
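All four reference styles share the same `${dotted.path}` shape. ModelKnife's actual resolver is internal; the following is a minimal Python sketch of how such placeholders can be resolved against a context of parameters, pipeline metadata, and service outputs (the `CONTEXT` values and `resolve` function are illustrative assumptions, not ModelKnife's API):

```python
import re

# Hypothetical resolution context; in practice the values would come from
# the parsed YAML, the process environment, and deployed service outputs.
CONTEXT = {
    "parameters": {"environment": "dev"},
    "pipeline": {"name": "recommendation-system"},
    "services": {"database": {"outputs": {"table_name": "users-dev"}}},
}

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")

def resolve(value: str, context: dict = CONTEXT) -> str:
    """Replace each ${dotted.path} with its value from the context."""
    def lookup(match: re.Match) -> str:
        node = context
        for part in match.group(1).split("."):
            node = node[part]  # raises KeyError for an unknown reference
        return str(node)
    return PLACEHOLDER.sub(lookup, value)

print(resolve("${pipeline.name}-data-${parameters.environment}"))
# recommendation-system-data-dev
```

A real implementation would also need to handle non-string values (such as `memory_size`) and resolve chained references like `db_name: myapp_${parameters.environment}` before they are themselves referenced.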
### .env File Support

ModelKnife automatically loads `.env` files before processing your configuration. Place your `.env` file in the same directory as your `mlknife-compose.yaml` for automatic discovery.

```bash
# Environment Configuration
ENVIRONMENT=development
AWS_REGION=us-west-2
VERSION=v1.0.0

# Application Settings
LAMBDA_MEMORY_SIZE=512
API_GATEWAY_STAGE=dev

# Database Configuration
DB_HOST=localhost
DB_PORT=5432
DB_NAME=myapp_dev

# Feature flags
ENABLE_DEBUG=true
ENABLE_CACHING=false
```
### Environment Variable Loading

When resolving `${env.VARIABLE}` references, ModelKnife follows this loading priority:

1. **System environment variables**: variables already set in your system environment
2. **`.env` file variables**: loaded from the `.env` file only if not already set in the system

ModelKnife never overwrites existing environment variables; `.env` files only set variables that don't already exist in your system environment.
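The precedence above can be sketched in a few lines of Python. This is a simplified illustration of the described behavior, not ModelKnife's actual loader, which may handle quoting and other edge cases differently:

```python
import os

def load_dotenv_without_override(path: str = ".env") -> None:
    """Load KEY=VALUE pairs from a .env file, skipping keys that are
    already present in the process environment (system values win)."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blank lines, comments, and malformed lines
            key, _, value = line.partition("=")
            # setdefault only assigns when the key is absent, so an
            # exported system variable is never overwritten.
            os.environ.setdefault(key.strip(), value.strip())
```

For example, if `ENVIRONMENT=staging` is already exported in your shell, an `ENVIRONMENT=development` line in `.env` is ignored, while keys that exist only in the file are picked up normally.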
### Best Practices

- Use `.env` files for environment-specific values like API keys, database URLs, and feature flags
- Use parameters for reusable values that don't change between environments
- Use service outputs for dynamic values like generated resource names and ARNs
- Keep sensitive data secure: never commit `.env` files with secrets to version control
- Provide `.env.example` files as templates for other developers
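For instance, a committed `.env.example` can document the expected keys without exposing real values (an illustrative template, with key names matching the examples above):

```bash
# .env.example — copy to .env and fill in real values
ENVIRONMENT=
AWS_REGION=
VERSION=
LAMBDA_MEMORY=
```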
### Complete Example

```bash
# .env file
ENVIRONMENT=production
VERSION=v2.1.0
AWS_REGION=us-east-1
LAMBDA_MEMORY=1024
```

```yaml
# mlknife-compose.yaml
name: ecommerce-api-${env.VERSION}
author: platform-team

parameters:
  environment: ${env.ENVIRONMENT}
  region: ${env.AWS_REGION}
  api_stage: ${env.ENVIRONMENT}

services:
  user_database:
    type: dynamodb_table
    configuration:
      table_name: "users-${parameters.environment}"
      billing_mode: "PAY_PER_REQUEST"

  product_search:
    type: search_service
    configuration:
      service_name: "search-${env.VERSION}-${parameters.environment}"
      environment: ${parameters.environment}

  api_gateway:
    type: api_gateway_v2
    depends_on: [user_database, product_search]
    configuration:
      api_name: "ecommerce-api-${env.VERSION}"
      stage_name: ${parameters.api_stage}

  user_service:
    type: lambda_function
    depends_on: [user_database, api_gateway]
    configuration:
      function_name: "user-service-${env.VERSION}"
      memory_size: ${env.LAMBDA_MEMORY}
      environment:
        TABLE_NAME: "${services.user_database.outputs.table_name}"
        SEARCH_ENDPOINT: "${services.product_search.outputs.search_endpoint}"
        STAGE: ${parameters.api_stage}
```
## Ready to Configure Your Pipeline?

Use this reference to build powerful ML workflows with ModelKnife's flexible configuration system.