ModelKnife is built around four core concepts that separate batch processing workflows from real-time serving infrastructure. Understanding them will help you build production ML systems effectively.
## Key Distinction

- **Pipelines** = batch processing workflows (produce models, data, predictions)
- **Services** = real-time serving infrastructure (serve models to users)
## Pipelines

**Batch Processing Workflows**

- **What:** Complete batch workflows that transform raw data into models, datasets, or predictions
- **When:** You need scheduled processing jobs that train models or process large amounts of data
- **Examples:** Data ETL → feature engineering → model training → batch prediction workflows
```yaml
name: recommendation-pipeline
version: v1.0
description: End-to-end product recommendation system

modules:
  data_cleaning: { ... }
  feature_engineering: { ... }
  model_training: { ... }
```
## Modules

**Individual Processing Steps**

- **What:** Individual work units that define specific processing steps within a pipeline
- **When:** For each distinct task in your workflow (data cleaning, training, inference)
- **Examples:** Data cleaning, feature engineering, model training, batch inference
```yaml
modules:
  # Data cleaning module
  data_cleaning:
    executor: ${executors.glue_etl}
    entry_point: clean_data.py
    job_parameters:
      input_path: s3://bucket/raw/
      output_path: s3://bucket/cleaned/

  # Model training module
  model_training:
    executor: ${executors.python_processor}
    entry_point: train_model.py
    depends_on: [data_cleaning]  # Runs after data cleaning
```
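To see how a module's `entry_point` and `job_parameters` meet, here is a minimal sketch of what `clean_data.py` might look like when run by the Glue executor. It assumes ModelKnife forwards `job_parameters` as standard Glue job arguments (the exact delivery mechanism depends on the executor), and the column names are hypothetical.

```python
# clean_data.py - hypothetical sketch of the data_cleaning entry point.
# Assumes job_parameters arrive as standard Glue job arguments.
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Resolve the parameters declared under job_parameters in the module config
args = getResolvedOptions(sys.argv, ["input_path", "output_path"])

spark = GlueContext(SparkContext.getOrCreate()).spark_session

# Read raw records, drop duplicates and rows missing key fields
# (hypothetical columns), then write the cleaned dataset downstream
raw = spark.read.json(args["input_path"])
cleaned = raw.dropDuplicates().na.drop(subset=["user_id", "product_id"])
cleaned.write.mode("overwrite").parquet(args["output_path"])
```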
## Executors

**Compute Environment Templates**

- **What:** Reusable compute environment templates that define how modules run
- **When:** You need to specify compute resources (instance types, frameworks) for your modules
- **Examples:** SageMaker processors, Glue jobs, Spark clusters, Bedrock batch jobs
```yaml
executors:
  # For Python-based data processing
  python_processor:
    class: sagemaker.sklearn.processing.SKLearnProcessor
    role: ${pipeline.role}
    instance_type: ml.c5.2xlarge
    framework_version: 0.23-1

  # For big data processing
  glue_etl:
    type: glue_etl
    runtime: python3.9
    glue_version: "4.0"
    worker_type: G.2X
    number_of_workers: 5
```
## Services

**Real-Time Serving Infrastructure**

- **What:** Always-on infrastructure components that serve trained models to end users in real time
- **When:** Deployed once per environment; they consume the models and data produced by batch workflows
- **Examples:** SageMaker endpoints, Lambda APIs, feature stores (DynamoDB), API Gateway, search services, data streaming (Kinesis), event processing (EventBridge), data catalog (Glue)
```yaml
services:
  feature_store:
    type: dynamodb_table
    configuration:
      table_name: "ml-features-${parameters.environment}"
      partition_key: "feature_id"

  model_api:
    type: lambda_function
    configuration:
      function_name: "inference-api-${parameters.environment}"
      runtime: python3.9
      handler: "api.lambda_handler"
```
## How Concepts Work Together

ModelKnife separates batch workflows from real-time infrastructure:

- **Batch processing workflows:** pipelines (the overall workflow) are composed of modules (processing steps) that run on executors (compute infrastructure). They produce models, processed data, predictions, and other ML artifacts.
- **Real-time serving infrastructure:** services (API endpoints and databases) serve trained models to users: real-time predictions, feature lookups, API responses.
```yaml
name: ecommerce-recommendation
version: v1.0

# Services: stable infrastructure (mk s deploy)
services:
  feature_store:
    type: dynamodb_table
    configuration:
      table_name: "features-${parameters.environment}"

# Executors: compute environment templates
executors:
  python_processor:
    class: sagemaker.sklearn.processing.SKLearnProcessor
    instance_type: ml.c5.xlarge
  glue_etl:
    type: glue_etl
    runtime: python3.9

# Modules: ML processing steps (mk p deploy)
modules:
  data_cleaning:
    executor: ${executors.glue_etl}  # Uses executor template
    entry_point: clean_data.py
  feature_engineering:
    executor: ${executors.python_processor}
    entry_point: build_features.py
    depends_on: [data_cleaning]  # Module dependency
    job_parameters:
      feature_table: "${services.feature_store.outputs.table_name}"  # Uses service output
```
## Deployment Model

Understanding when and how to deploy each concept:

| Concept | Command | Frequency | Purpose |
|---|---|---|---|
| Services | `mk s deploy` | Once per environment | Stable infrastructure foundation |
| Pipelines | `mk p deploy` | Multiple times per day | ML workflow iteration |
| Modules | Part of pipeline | Deployed with pipeline | Individual processing steps |
| Executors | Configuration only | Never deployed directly | Compute templates |
## Mental Model: Bakery Operations

Think of ModelKnife like running a bakery that both bakes goods and serves customers:

- **Pipelines** = baking recipes (mix dough → let rise → bake → cool → package) - complete processes that create finished baked goods
- **Modules** = individual baking steps (mixing, kneading, rising, baking, decorating) - specific tasks within each recipe
- **Executors** = kitchen equipment (stand mixer, oven temperature, baking time) - the tools and settings each step needs
- **Services** = storefront operations (display cases, cash register, staff) - the always-open shop that serves fresh baked goods to customers

**Key insight:** Baking happens in batches early in the morning (scheduled production), while the storefront serves customers all day long (on-demand service).
## Common Patterns

### 1. Service-Pipeline Integration

Services provide stable interfaces that pipelines consume:
```yaml
# Service provides feature storage
services:
  feature_store: { type: dynamodb_table }

# Pipeline modules read from/write to services
modules:
  model_training:
    job_parameters:
      feature_table: "${services.feature_store.outputs.table_name}"
```
### 2. Module Dependencies

Modules form a dependency graph that ModelKnife orchestrates:
```yaml
modules:
  data_cleaning: { ... }
  feature_engineering:
    depends_on: [data_cleaning]
  model_training:
    depends_on: [feature_engineering]
```
### 3. Executor Reuse

Multiple modules can share executor configurations:
```yaml
executors:
  ml_processor: { ... }  # Defined once

modules:
  feature_engineering:
    executor: ${executors.ml_processor}  # Reused
  model_training:
    executor: ${executors.ml_processor}  # Reused
```
## Ready to Apply These Concepts?
Now that you understand ModelKnife's core concepts, try them out with real examples.