YAML Configuration Reference

Complete reference for the ModelKnife YAML configuration schema

Root Configuration Schema

Top-level fields in mlknife-compose.yaml

mlknife-compose.yaml
# ModelKnife Configuration File
name: "my-ml-pipeline"
author: "ml-team"
version: "v1.0"
description: "Complete ML workflow for recommendation system"

parameters:
  environment: dev
  data_bucket: "ml-pipeline-data"

executors:
  # Compute environment templates

services:
  # Infrastructure services

modules:
  # ML processing modules

  • name (string, required): Unique identifier for the pipeline. Used for AWS resource naming.
  • author (string, required): Author or team name for metadata and tagging.
  • version (string, optional): Pipeline version for deployment tracking.
  • description (string, optional): Human-readable description of the pipeline purpose.
  • role (string, optional): Global IAM role ARN for pipeline execution. Can be overridden in individual executors.
  • parameters (object, optional): Configuration parameters and variables.
  • executors (object, optional): Compute environment templates for modules.
  • services (object, optional): Infrastructure services (DynamoDB, Lambda, etc.).
  • modules (object, optional): ML processing modules and workflows.
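The optional role field is the only top-level field not shown in the example above. A minimal sketch of its usage follows; the account ID and role name in the ARN are placeholders, not values ModelKnife provides:

```yaml
name: "my-ml-pipeline"
author: "ml-team"
# Global execution role applied to all executors unless an executor
# overrides it. The ARN below is a placeholder; substitute your own
# account ID and role name.
role: "arn:aws:iam::123456789012:role/MLPipelineExecutionRole"
```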

Services Configuration

Infrastructure services and AWS resources

services:
  feature_store:
    type: sagemaker_feature_store
    configuration:
      feature_group_name: "features-${parameters.environment}"
      record_identifier_name: "feature_id"
      event_time_feature_name: "timestamp"
    depends_on: []
    tags:
      service_type: "feature_store"
      environment: "${parameters.environment}"

  model_api:
    type: lambda_function
    repository: "../services/"
    configuration:
      function_name: "ml-api-${parameters.environment}"
      runtime: "python3.9"
      handler: "api.lambda_handler"
      code_path: "./lambda"
    depends_on: []
    tags:
      service_type: "lambda_function"

Service Types

  • dynamodb_table (DynamoDB): NoSQL database tables. Key configuration: table_name, partition_key, sort_key.
  • lambda_function (Lambda): Serverless functions. Key configuration: function_name, runtime, handler.
  • api_gateway (API Gateway): REST API endpoints. Key configuration: api_name, resources, methods.
  • api_gateway_v2 (API Gateway): HTTP API (v2) endpoints. Key configuration: api_name, stage_name.
  • s3_bucket (S3): Object storage. Key configuration: bucket_name, versioning.
  • sagemaker_endpoint (SageMaker): Model inference endpoints. Key configuration: endpoint_name, model_name.
  • sagemaker_feature_store (SageMaker Feature Store): ML feature storage and retrieval. Key configuration: feature_group_name, record_identifier_name.
  • search_service (OpenSearch): Search and analytics. Key configuration: service_name, indices, fields.
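The key configuration fields above slot into the same configuration block used throughout this reference. A hedged sketch for two types not demonstrated elsewhere in this document; the exact set of accepted fields and value types is an assumption, so verify against your ModelKnife version:

```yaml
services:
  artifact_bucket:
    type: s3_bucket
    configuration:
      bucket_name: "ml-artifacts-${parameters.environment}"
      versioning: true   # assumed boolean flag for the "versioning" field

  recommender_endpoint:
    type: sagemaker_endpoint
    depends_on: []
    configuration:
      endpoint_name: "recommender-${parameters.environment}"
      model_name: "recommender-model"
```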

💡 Service Outputs: Services automatically generate outputs (such as table names and ARNs) that can be referenced in modules using ${services.SERVICE_NAME.outputs.FIELD}.

Executors Configuration

Compute environment templates for ML processing

executors:
  python_processor:
    type: sagemaker_processor
    class: sagemaker.sklearn.processing.SKLearnProcessor
    instance_type: "ml.c5.2xlarge"
    instance_count: 1

  glue_etl:
    type: glue_etl
    job_name: "ml-data-processing"
    runtime: "python3.9"
    glue_version: "5.0"
    worker_type: "G.1X"
    number_of_workers: 2

Modules Configuration

ML processing steps and workflows

modules:
  data_cleaning:
    repository: "../modules"
    executor: "${executors.python_processor}"
    entry_point: "./jobs/clean_data.py"
    description: "Clean and validate raw data"
    depends_on: []

  model_training:
    repository: "../modules"
    executor: "${executors.python_processor}"
    entry_point: "./jobs/train_model.py"
    description: "Train ML model"
    depends_on: ["data_cleaning"]

Parameters and Variables

Configuration parameters and variable resolution methods

ModelKnife supports multiple ways to define and resolve configuration variables, making your configurations flexible and environment-aware.

Basic Parameters
parameters:
  environment: development
  version: v1.0.0
  aws_region: us-west-2
  
  # Nested parameters
  database:
    host: localhost
    port: 5432
    name: myapp_dev
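Assuming nested keys resolve with dotted paths (an assumption; verify against your ModelKnife version), the nested database block above could be referenced like this:

```yaml
services:
  app_api:
    type: lambda_function
    configuration:
      function_name: "app-${parameters.environment}"
      # Dotted-path access into the nested "database" block is assumed here
      database_url: "postgresql://${parameters.database.host}:${parameters.database.port}/${parameters.database.name}"
```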

Variable Resolution Methods

Parameter References

Reference other parameters using ${parameters.key} syntax

parameters:
  environment: development
  db_name: myapp_${parameters.environment}

services:
  my-service:
    configuration:
      function_name: "api-${parameters.environment}"
      database_url: "postgresql://localhost/${parameters.db_name}"

Environment Variables

Access environment variables using ${env.VARIABLE} syntax

# .env file
ENVIRONMENT=production
AWS_REGION=us-east-1
VERSION=v2.0.0
LAMBDA_MEMORY=512

# mlknife-compose.yaml
name: my-pipeline-${env.VERSION}

parameters:
  environment: ${env.ENVIRONMENT}
  region: ${env.AWS_REGION}

services:
  my-lambda:
    configuration:
      function_name: "app-${env.VERSION}-${env.ENVIRONMENT}"
      memory_size: ${env.LAMBDA_MEMORY}

Service Output References

Reference outputs from deployed services using ${services.service.outputs.key}

services:
  database:
    type: dynamodb_table
    configuration:
      table_name: "users-${parameters.environment}"
  
  api:
    type: lambda_function
    depends_on: [database]
    configuration:
      function_name: "api-${parameters.environment}"
      environment:
        TABLE_NAME: "${services.database.outputs.table_name}"
        TABLE_ARN: "${services.database.outputs.table_arn}"

Pipeline Context

Access pipeline-level information using ${pipeline.property}

name: recommendation-system

services:
  storage:
    configuration:
      bucket_name: "${pipeline.name}-data-${parameters.environment}"

.env File Support

Automatic .env Loading

ModelKnife automatically loads .env files before processing your configuration. Place your .env file in the same directory as your mlknife-compose.yaml for automatic discovery.

.env
# Environment Configuration
ENVIRONMENT=development
AWS_REGION=us-west-2
VERSION=v1.0.0

# Application Settings
LAMBDA_MEMORY_SIZE=512
API_GATEWAY_STAGE=dev

# Database Configuration
DB_HOST=localhost
DB_PORT=5432
DB_NAME=myapp_dev

# Feature flags
ENABLE_DEBUG=true
ENABLE_CACHING=false

Environment Variable Loading

When using ${env.VARIABLE} references, ModelKnife follows this loading priority:

  1. System Environment Variables - Variables already set in your system environment
  2. .env File Variables - Loaded from .env file if not already set in system

Safe Loading

ModelKnife never overwrites existing environment variables. .env files only set variables that don't already exist in your system environment.
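For example, if your shell or CI system already exports ENVIRONMENT=staging, a conflicting value in .env is ignored while variables not yet set still load. A sketch of the resulting resolution:

```yaml
# System environment (set before ModelKnife runs): ENVIRONMENT=staging
# .env file contents:                              ENVIRONMENT=development
#                                                  DB_HOST=localhost
parameters:
  environment: ${env.ENVIRONMENT}   # resolves to "staging" (system value wins)
  db_host: ${env.DB_HOST}           # resolves to "localhost" (loaded from .env)
```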

Best Practices

  • Use .env files for environment-specific values like API keys, database URLs, and feature flags
  • Use parameters for reusable values that don't change between environments
  • Use service outputs for dynamic values like generated resource names and ARNs
  • Keep sensitive data secure - never commit .env files with secrets to version control
  • Provide .env.example files as templates for other developers

Complete Example

# .env file
ENVIRONMENT=production
VERSION=v2.1.0
AWS_REGION=us-east-1
LAMBDA_MEMORY=1024

# mlknife-compose.yaml
name: ecommerce-api-${env.VERSION}
author: platform-team

parameters:
  environment: ${env.ENVIRONMENT}
  region: ${env.AWS_REGION}
  api_stage: ${env.ENVIRONMENT}

services:
  user_database:
    type: dynamodb_table
    configuration:
      table_name: "users-${parameters.environment}"
      billing_mode: "PAY_PER_REQUEST"

  product_search:
    type: search_service
    configuration:
      service_name: "search-${env.VERSION}-${parameters.environment}"
      environment: ${parameters.environment}

  api_gateway:
    type: api_gateway_v2
    depends_on: [user_database, product_search]
    configuration:
      api_name: "ecommerce-api-${env.VERSION}"
      stage_name: ${parameters.api_stage}

  user_service:
    type: lambda_function
    depends_on: [user_database, api_gateway]
    configuration:
      function_name: "user-service-${env.VERSION}"
      memory_size: ${env.LAMBDA_MEMORY}
      environment:
        TABLE_NAME: "${services.user_database.outputs.table_name}"
        SEARCH_ENDPOINT: "${services.product_search.outputs.search_endpoint}"
        STAGE: ${parameters.api_stage}

Ready to Configure Your Pipeline?

Use this reference to build powerful ML workflows with ModelKnife's flexible configuration system.
