# YAML Configuration Reference

Complete reference for the ModelKnife YAML configuration schema.
## Root Configuration Schema

Top-level fields in `mlknife-compose.yaml`:

```yaml
# ModelKnife Configuration File
name: "my-ml-pipeline"
author: "ml-team"
version: "v1.0"
description: "Complete ML workflow for recommendation system"

parameters:
  environment: dev
  data_bucket: "ml-pipeline-data"

executors:
  # Compute environment templates

services:
  # Infrastructure services

modules:
  # ML processing modules
```
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Required | Unique identifier for the pipeline. Used for AWS resource naming. |
| `author` | string | Required | Author or team name for metadata and tagging. |
| `version` | string | Optional | Pipeline version for deployment tracking. |
| `description` | string | Optional | Human-readable description of the pipeline purpose. |
| `role` | string | Optional | Global IAM role ARN for pipeline execution. Can be overridden in individual executors. |
| `parameters` | object | Optional | Configuration parameters and variables. |
| `executors` | object | Optional | Compute environment templates for modules. |
| `services` | object | Optional | Infrastructure services (DynamoDB, Lambda, etc.). |
| `modules` | object | Optional | ML processing modules and workflows. |
## Services Configuration

Infrastructure services and AWS resources:

```yaml
services:
  feature_store:
    type: sagemaker_feature_store
    configuration:
      feature_group_name: "features-${parameters.environment}"
      record_identifier_name: "feature_id"
      event_time_feature_name: "timestamp"
    depends_on: []
    tags:
      service_type: "feature_store"
      environment: "${parameters.environment}"

  model_api:
    type: lambda_function
    repository: "../services/"
    configuration:
      function_name: "ml-api-${parameters.environment}"
      runtime: "python3.9"
      handler: "api.lambda_handler"
      code_path: "./lambda"
    depends_on: []
    tags:
      service_type: "lambda_function"
```
### Service Types

| Service Type | AWS Service | Description | Key Configuration |
|---|---|---|---|
| `dynamodb_table` | DynamoDB | NoSQL database tables | `table_name`, `partition_key`, `sort_key` |
| `lambda_function` | Lambda | Serverless functions | `function_name`, `runtime`, `handler` |
| `api_gateway` | API Gateway | REST API endpoints | `api_name`, `resources`, `methods` |
| `s3_bucket` | S3 | Object storage | `bucket_name`, `versioning` |
| `sagemaker_endpoint` | SageMaker | Model inference endpoints | `endpoint_name`, `model_name` |
| `sagemaker_feature_store` | SageMaker Feature Store | ML feature storage and retrieval | `feature_group_name`, `record_identifier_name` |
| `search_service` | OpenSearch | Search and analytics | `service_name`, `indices`, `fields` |
💡 **Service Outputs:** Services automatically generate outputs (such as table names and ARNs) that can be referenced in modules using `${services.SERVICE_NAME.outputs.FIELD}`.
## Executors Configuration

Compute environment templates for ML processing:

```yaml
executors:
  python_processor:
    type: sagemaker_processor
    class: sagemaker.sklearn.processing.SKLearnProcessor
    instance_type: "ml.c5.2xlarge"
    instance_count: 1

  glue_etl:
    type: glue_etl
    job_name: "ml-data-processing"
    runtime: "python3.9"
    glue_version: "5.0"
    worker_type: "G.1X"
    number_of_workers: 2
```
## Modules Configuration

ML processing steps and workflows:

```yaml
modules:
  data_cleaning:
    repository: "../modules"
    executor: "${executors.python_processor}"
    entry_point: "./jobs/clean_data.py"
    description: "Clean and validate raw data"
    depends_on: []

  model_training:
    repository: "../modules"
    executor: "${executors.python_processor}"
    entry_point: "./jobs/train_model.py"
    description: "Train ML model"
    depends_on: ["data_cleaning"]
```
## Parameters and Variables

ModelKnife supports multiple ways to define and resolve configuration variables, making your configurations flexible and environment-aware.

```yaml
parameters:
  environment: development
  version: v1.0.0
  aws_region: us-west-2

  # Nested parameters
  database:
    host: localhost
    port: 5432
    name: myapp_dev
```
### Variable Resolution Methods

#### Parameter References

Reference other parameters using the `${parameters.key}` syntax:

```yaml
parameters:
  environment: development
  db_name: myapp_${parameters.environment}

services:
  my-service:
    configuration:
      function_name: "api-${parameters.environment}"
      database_url: "postgresql://localhost/${parameters.db_name}"
```
#### Environment Variables

Access environment variables using the `${env.VARIABLE}` syntax:

```bash
# .env file
ENVIRONMENT=production
AWS_REGION=us-east-1
VERSION=v2.0.0
```

```yaml
# mlknife-compose.yaml
name: my-pipeline-${env.VERSION}

parameters:
  environment: ${env.ENVIRONMENT}
  region: ${env.AWS_REGION}

services:
  my-lambda:
    configuration:
      function_name: "app-${env.VERSION}-${env.ENVIRONMENT}"
      memory_size: ${env.LAMBDA_MEMORY}
```
#### Service Output References

Reference outputs from deployed services using the `${services.SERVICE_NAME.outputs.KEY}` syntax:

```yaml
services:
  database:
    type: dynamodb_table
    configuration:
      table_name: "users-${parameters.environment}"

  api:
    type: lambda_function
    depends_on: [database]
    configuration:
      function_name: "api-${parameters.environment}"
      environment:
        TABLE_NAME: "${services.database.outputs.table_name}"
        TABLE_ARN: "${services.database.outputs.table_arn}"
```
#### Pipeline Context

Access pipeline-level information using the `${pipeline.property}` syntax:

```yaml
name: recommendation-system

services:
  storage:
    configuration:
      bucket_name: "${pipeline.name}-data-${parameters.environment}"
```
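All four reference styles share the same `${dotted.path}` shape. ModelKnife's actual resolver is internal; the following is a minimal Python sketch of how such placeholders can be resolved against a context of parameters, pipeline metadata, and service outputs (the `CONTEXT` values and `resolve` function are illustrative assumptions, not ModelKnife's API):

```python
import re

# Hypothetical resolution context; in practice the values would come from
# the parsed YAML, the process environment, and deployed service outputs.
CONTEXT = {
    "parameters": {"environment": "dev"},
    "pipeline": {"name": "recommendation-system"},
    "services": {"database": {"outputs": {"table_name": "users-dev"}}},
}

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")

def resolve(value: str, context: dict = CONTEXT) -> str:
    """Replace each ${dotted.path} with its value from the context."""
    def lookup(match: re.Match) -> str:
        node = context
        for part in match.group(1).split("."):
            node = node[part]  # raises KeyError for an unknown reference
        return str(node)
    return PLACEHOLDER.sub(lookup, value)

print(resolve("${pipeline.name}-data-${parameters.environment}"))
# recommendation-system-data-dev
```

A real implementation would also need to handle non-string values (such as `memory_size`) and resolve chained references like `db_name: myapp_${parameters.environment}` before they are themselves referenced.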
### .env File Support

ModelKnife automatically loads `.env` files before processing your configuration. Place your `.env` file in the same directory as your `mlknife-compose.yaml` for automatic discovery.

```bash
# Environment Configuration
ENVIRONMENT=development
AWS_REGION=us-west-2
VERSION=v1.0.0

# Application Settings
LAMBDA_MEMORY_SIZE=512
API_GATEWAY_STAGE=dev

# Database Configuration
DB_HOST=localhost
DB_PORT=5432
DB_NAME=myapp_dev

# Feature flags
ENABLE_DEBUG=true
ENABLE_CACHING=false
```
### Environment Variable Loading

When resolving `${env.VARIABLE}` references, ModelKnife follows this loading priority:

1. **System environment variables**: variables already set in your system environment
2. **`.env` file variables**: loaded from the `.env` file only if not already set in the system

ModelKnife never overwrites existing environment variables; `.env` files only set variables that don't already exist in your system environment.
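The precedence above can be sketched in a few lines of Python. This is a simplified illustration of the described behavior, not ModelKnife's actual loader, which may handle quoting and other edge cases differently:

```python
import os

def load_dotenv_without_override(path: str = ".env") -> None:
    """Load KEY=VALUE pairs from a .env file, skipping keys that are
    already present in the process environment (system values win)."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blank lines, comments, and malformed lines
            key, _, value = line.partition("=")
            # setdefault only assigns when the key is absent, so an
            # exported system variable is never overwritten.
            os.environ.setdefault(key.strip(), value.strip())
```

For example, if `ENVIRONMENT=staging` is already exported in your shell, an `ENVIRONMENT=development` line in `.env` is ignored, while keys that exist only in the file are picked up normally.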
### Best Practices

- Use `.env` files for environment-specific values like API keys, database URLs, and feature flags
- Use parameters for reusable values that don't change between environments
- Use service outputs for dynamic values like generated resource names and ARNs
- Keep sensitive data secure: never commit `.env` files with secrets to version control
- Provide `.env.example` files as templates for other developers
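For instance, a committed `.env.example` can document the expected keys without exposing real values (an illustrative template, with key names matching the examples above):

```bash
# .env.example — copy to .env and fill in real values
ENVIRONMENT=
AWS_REGION=
VERSION=
LAMBDA_MEMORY=
```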
### Complete Example

```bash
# .env file
ENVIRONMENT=production
VERSION=v2.1.0
AWS_REGION=us-east-1
LAMBDA_MEMORY=1024
```

```yaml
# mlknife-compose.yaml
name: ecommerce-api-${env.VERSION}
author: platform-team

parameters:
  environment: ${env.ENVIRONMENT}
  region: ${env.AWS_REGION}
  api_stage: ${env.ENVIRONMENT}

services:
  user_database:
    type: dynamodb_table
    configuration:
      table_name: "users-${parameters.environment}"
      billing_mode: "PAY_PER_REQUEST"

  product_search:
    type: search_service
    configuration:
      service_name: "search-${env.VERSION}-${parameters.environment}"
      environment: ${parameters.environment}

  api_gateway:
    type: api_gateway_v2
    depends_on: [user_database, product_search]
    configuration:
      api_name: "ecommerce-api-${env.VERSION}"
      stage_name: ${parameters.api_stage}

  user_service:
    type: lambda_function
    depends_on: [user_database, api_gateway]
    configuration:
      function_name: "user-service-${env.VERSION}"
      memory_size: ${env.LAMBDA_MEMORY}
      environment:
        TABLE_NAME: "${services.user_database.outputs.table_name}"
        SEARCH_ENDPOINT: "${services.product_search.outputs.search_endpoint}"
        STAGE: ${parameters.api_stage}
```
## Ready to Configure Your Pipeline?

Use this reference to build powerful ML workflows with ModelKnife's flexible configuration system.