Configuration
YT Framework uses YAML files for configuration and environment files for secrets. Understanding the configuration system is essential for building effective pipelines.
Configuration Files
Configuration is organized into multiple files:
Pipeline config (configs/config.yaml): Pipeline-level settings
Stage configs (stages/<stage_name>/config.yaml): Stage-specific settings
Secrets (configs/secrets.env): Credentials and sensitive data
Warning
YT Cluster Requirements
In production mode, ytjobs code executes on YT cluster nodes. Ensure your cluster’s Docker image includes required dependencies or use custom Docker images. See Cluster Requirements for details.
Pipeline Configuration
The pipeline configuration file (configs/config.yaml) controls pipeline-level behavior:
    stages:
      enabled_stages:
        - stage1
        - stage2
        - stage3
    pipeline:
      mode: "dev"  # or "prod"
      build_folder: "//tmp/my_pipeline/build"  # Required for operations with code
Stages Section
enabled_stages (required): List of stage names to execute, in order.
    stages:
      enabled_stages:
        - create_input
        - process_data
        - validate_output
Only stages listed here will be executed. Stages are executed in the order specified.
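Because every enabled stage is expected to have its own stages/<stage_name>/config.yaml, a quick pre-flight check can catch typos in enabled_stages before a run. The following is a minimal sketch under that assumption; the helper name missing_stage_configs is hypothetical and is not part of YT Framework.

```python
from pathlib import Path
from typing import List


def missing_stage_configs(pipeline_dir: Path, enabled_stages: List[str]) -> List[str]:
    """Return enabled stages that lack stages/<name>/config.yaml.

    The result keeps the order given in enabled_stages.
    """
    return [
        name
        for name in enabled_stages
        if not (pipeline_dir / "stages" / name / "config.yaml").is_file()
    ]
```

Calling missing_stage_configs(Path("."), ["create_input", "process_data"]) from the pipeline directory returns an empty list when both stage configs exist.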
Pipeline Section
mode (optional, default: "dev"): Execution mode.
"dev": Local development mode (file system simulation)
"prod": Production mode (YT cluster execution)
build_folder (required for code execution): YT path where code will be uploaded.
    pipeline:
      build_folder: "//tmp/my_pipeline/build"
Required if any enabled stage has a src/ directory (used by map or vanilla operations).
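The rule above can be expressed as a small check. This sketch assumes the pipeline config has already been parsed into a plain dictionary and that stage code lives in stages/<stage_name>/src/; the function name stages_needing_build_folder is hypothetical and does not come from the framework.

```python
from pathlib import Path
from typing import Any, Dict, List


def stages_needing_build_folder(pipeline_dir: Path, config: Dict[str, Any]) -> List[str]:
    """List enabled stages with a src/ directory when build_folder is unset.

    An empty result means the configuration satisfies the rule.
    """
    pipeline_cfg = config.get("pipeline") or {}
    if pipeline_cfg.get("build_folder"):
        return []  # build_folder is set, nothing to report

    enabled = (config.get("stages") or {}).get("enabled_stages") or []
    return [
        name
        for name in enabled
        if (pipeline_dir / "stages" / name / "src").is_dir()
    ]
```

A non-empty result names the stages whose code would have nowhere to be uploaded.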
upload_modules (optional): List of Python module names to upload in addition to ytjobs.
    pipeline:
      upload_modules: [my_package, company_utils]
Each module must be importable. The ytjobs package is always uploaded implicitly.
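One way to confirm that the listed modules are importable before starting a run is to ask Python's import machinery for each module's spec. This is only a sketch using the standard library; unimportable_modules is a hypothetical helper, and the framework's own validation (if any) may differ.

```python
import importlib.util
from typing import List


def unimportable_modules(module_names: List[str]) -> List[str]:
    """Return the names that cannot be located by the import system."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for dotted names with a missing parent package,
            # or for modules whose __spec__ is unset.
            found = False
        if not found:
            missing.append(name)
    return missing
```

Running it on the upload_modules list from the example above reports any module that is not installed in the current environment.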
upload_paths (optional): List of local directories to upload by path.
    pipeline:
      upload_paths:
        - { source: "./lib/shared", target: "shared" }
        - { source: "./experiments/utils" }  # target defaults to "utils"
Paths are relative to the pipeline directory. Use target to set the directory name in the archive. See Code Upload for details.
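The example comment suggests that an omitted target falls back to the last component of source. The sketch below illustrates that assumed rule; resolve_upload_target is a hypothetical name, and the framework's actual defaulting logic is not shown in this document.

```python
from pathlib import PurePosixPath
from typing import Dict


def resolve_upload_target(entry: Dict[str, str]) -> str:
    """Return the archive directory name for one upload_paths entry.

    Assumption: when target is missing, the last path component of
    source is used, as in the "./experiments/utils" example.
    """
    target = entry.get("target")
    if target:
        return target
    return PurePosixPath(entry["source"]).name


print(resolve_upload_target({"source": "./lib/shared", "target": "shared"}))  # shared
print(resolve_upload_target({"source": "./experiments/utils"}))  # utils
```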
Stage Configuration
Each stage has its own configuration file at stages/<stage_name>/config.yaml:
    # stages/my_stage/config.yaml
    job:
      # Job-specific settings
      multiplier: 2
      prefix: "processed_"
    client:
      # Client settings
      input_table: //tmp/my_pipeline/input
      output_table: //tmp/my_pipeline/output
      # Operation configurations
      operations:
        map:
          input_table: //tmp/my_pipeline/input
          output_table: //tmp/my_pipeline/output
          resources:
            pool: default
            memory_limit_gb: 4
            cpu_limit: 2
            job_count: 2
Configuration Structure
Configuration is organized into sections:
job: Settings used by mapper.py or vanilla.py scripts
client: Settings used by the stage itself
operations: Operation-specific settings (map, vanilla, etc.), nested under client
Accessing Configuration
In stage code:
    class MyStage(BaseStage):
        def run(self, debug: DebugContext) -> DebugContext:
            # Access job config
            multiplier = self.config.job.multiplier
            # Access client config
            input_table = self.config.client.input_table
            # Access nested config
            memory = self.config.client.operations.map.resources.memory_limit_gb
            return debug
In mapper.py or vanilla.py:
from omegaconf import OmegaConf
from ytjobs.config import get_config_path
config = OmegaConf.load(get_config_path())
multiplier = config.job.multiplier
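The attribute chains used in these snippets mirror the nesting of the stage YAML file. As a rough illustration with a plain dictionary standing in for the loaded config, a dotted lookup with a fallback value could look like the sketch below. The helper get_path is hypothetical and belongs to neither ytjobs nor OmegaConf, and the values are copied from the stage configuration example.

```python
from typing import Any, Dict

# Plain-dict stand-in for a loaded stage config (illustration only).
stage_config: Dict[str, Any] = {
    "job": {"multiplier": 2, "prefix": "processed_"},
    "client": {
        "input_table": "//tmp/my_pipeline/input",
        "operations": {
            "map": {"resources": {"memory_limit_gb": 4, "cpu_limit": 2}},
        },
    },
}


def get_path(config: Dict[str, Any], dotted_path: str, default: Any = None) -> Any:
    """Walk a nested dict along a dotted path, returning default if a key is absent."""
    node: Any = config
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


print(get_path(stage_config, "job.multiplier"))  # 2
print(get_path(stage_config, "client.operations.map.resources.memory_limit_gb"))  # 4
print(get_path(stage_config, "job.batch_size", default=100))  # 100
```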
Configuration Examples
Simple Pipeline
    # configs/config.yaml
    stages:
      enabled_stages:
        - create_table
    pipeline:
      mode: "dev"
Pipeline with Code Execution
    # configs/config.yaml
    stages:
      enabled_stages:
        - process_data
    pipeline:
      mode: "prod"
      build_folder: "//tmp/my_pipeline/build"
Pipeline with Multiple Operations
    # configs/config.yaml
    stages:
      enabled_stages:
        - process_and_validate
    pipeline:
      mode: "prod"
      build_folder: "//tmp/my_pipeline/build"

    # stages/process_and_validate/config.yaml
    client:
      operations:
        process:
          input_table: //tmp/my_pipeline/input
          output_table: //tmp/my_pipeline/processed
          resources:
            memory_limit_gb: 8
            cpu_limit: 4
        validate:
          resources:
            memory_limit_gb: 4
            cpu_limit: 2
Next Steps
Understand Cluster Requirements for production mode dependencies
Learn about Secrets Management for credentials
Explore Advanced Configuration for multiple configs and merging
Check Dev vs Prod for mode-specific configuration
Review Operations for operation-specific configuration