YT Framework documentation#
This documentation explains how to build and run data processing pipelines on YTsaurus (YT) with the YT Framework Python package.
Introduction#
YT Framework is a Python library for defining pipelines as ordered stages and running them either against a YT cluster in production or against the local filesystem in development.
You get:
Pipelines built from stages under stages/, with YAML configuration per stage and for the pipeline.
A dev mode that mimics table and job behavior locally (no cluster required for basic work).
Operations such as map, vanilla, YQL (via the YT client), S3 helpers, and related utilities.
Packaging and upload of job code when you run on the cluster.
When it helps#
Less wiring for stage discovery and config than rolling everything by hand.
One codebase: flip pipeline.mode between dev and prod instead of maintaining two runners.
YQL and table helpers exposed on the same client you use for reads and writes.
Installation#
Prerequisites#
Python 3.11 or newer
For prod mode: network access to YT, valid credentials, and a cluster whose images match your job dependencies (see Cluster requirements)
YT cluster requirements#
Warning
Cluster Docker image
In prod mode, code from ytjobs runs inside jobs on the cluster. The default or custom Docker image for those jobs must include the Python packages your mappers, reducers, and vanilla scripts import.
Details: Cluster requirements.
Install from PyPI#
pip install yt-framework
Install from source#
pip install -e .
Verify installation#
python -c "import yt_framework; print(yt_framework.__version__)"
The PyPI distribution is named yt-framework. Import paths are yt_framework (driver) and ytjobs (job-side helpers).
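The same check can be scripted without importing the package, which helps in CI where an import failure would otherwise hide the real cause. The sketch below relies only on the standard library's importlib.metadata and the distribution name yt-framework given above; the helper name installed_version is illustrative and not part of YT Framework.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "yt-framework" is the PyPI distribution name, not the import name.
    version = installed_version("yt-framework")
    print(version if version else "yt-framework is not installed")
```

Note that the lookup uses the distribution name (yt-framework), not the import name (yt_framework).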
Credentials for prod#
Warning
Secrets for production
Prod mode expects YT (and optionally S3) credentials in configs/secrets.env. Without them, the client cannot talk to the cluster.
Create configs/secrets.env in your pipeline repo:
# configs/secrets.env
YT_PROXY=your-yt-proxy-url
YT_TOKEN=your-yt-token
For S3-backed operations, add the keys your stage uses (names vary by operation; see Secrets):
S3_ENDPOINT=https://your-s3-endpoint.com
S3_DOWNLOAD_ACCESS_KEY=your-download-access-key
S3_DOWNLOAD_SECRET_KEY=your-download-secret-key
S3_UPLOAD_ACCESS_KEY=your-upload-access-key
S3_UPLOAD_SECRET_KEY=your-upload-secret-key
More detail: Secrets management.
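Before a prod run it can be useful to confirm that the required keys are actually present. The following is a minimal pre-flight sketch, assuming the file holds plain KEY=VALUE lines as in the samples above. It is not the framework's own loader, which may treat quoting or other syntax differently; parse_env and missing_keys are hypothetical helper names.

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line; ignore it
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: dict, required=("YT_PROXY", "YT_TOKEN")) -> list:
    """List required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


# Inline sample so the sketch runs anywhere; in a real repo you would read
# the text of configs/secrets.env instead.
sample = """\
# configs/secrets.env
YT_PROXY=your-yt-proxy-url
YT_TOKEN=
"""
print(missing_keys(parse_env(sample)))  # ['YT_TOKEN']
```

An empty list means both YT keys are set; extend required with the S3 keys your stage uses.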
Quick start#
Minimal pipeline: one stage that writes a small table.
Step 1: Layout#
mkdir my_first_pipeline
cd my_first_pipeline
mkdir -p stages/create_data configs
Step 2: Entry point#
pipeline.py at the repo root:
from yt_framework.core.pipeline import DefaultPipeline

if __name__ == "__main__":
    DefaultPipeline.main()
Step 3: Pipeline config#
configs/config.yaml:
stages:
  enabled_stages:
    - create_data

pipeline:
  mode: "dev"  # use "prod" on the cluster
Step 4: Stage#
stages/create_data/stage.py:
from yt_framework.core.pipeline import DebugContext
from yt_framework.core.stage import BaseStage


class CreateDataStage(BaseStage):
    def run(self, debug: DebugContext) -> DebugContext:
        self.logger.info("Creating data table...")

        rows = [
            {"id": 1, "name": "Alice", "value": 100},
            {"id": 2, "name": "Bob", "value": 200},
            {"id": 3, "name": "Charlie", "value": 300},
        ]

        # Write to the table path configured in stages/create_data/config.yaml.
        self.deps.yt_client.write_table(
            table_path=self.config.client.output_table,
            rows=rows,
        )

        self.logger.info("Created table with %s rows", len(rows))
        return debug
stages/create_data/config.yaml:
client:
  output_table: //tmp/my_first_pipeline/data
Step 5: Run#
python pipeline.py
In dev mode, rows land under something like my_first_pipeline/.dev/data.jsonl. In prod mode, the same logical path is a YT table at //tmp/my_first_pipeline/data.
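To inspect a dev-mode table by hand, you can read the file directly. This sketch assumes the file is JSON Lines (one JSON object per line), as the .jsonl extension suggests; the exact location of the file depends on your pipeline, so the code writes a stand-in file to a temporary directory instead of guessing a path. read_jsonl is an illustrative helper, not a framework function.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path: Path) -> list:
    """Load one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Stand-in file so the sketch runs anywhere; point `path` at your real
# .dev/*.jsonl file to inspect actual pipeline output.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.jsonl"
    path.write_text(
        '{"id": 1, "name": "Alice", "value": 100}\n'
        '{"id": 2, "name": "Bob", "value": 200}\n',
        encoding="utf-8",
    )
    rows = read_jsonl(path)

print(len(rows), sum(row["value"] for row in rows))  # 2 300
```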
Core concepts#
Pipelines and stages#
A pipeline runs stages in order. Each stage is a class with a run method.
DefaultPipeline: discovers BaseStage subclasses under stages/.
BasePipeline: you register stages yourself.
BaseStage: base class for stage implementations.
More: Pipelines and stages.
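The idea of ordered stages sharing a context can be illustrated without the framework. The toy below is a conceptual sketch only: ToyStage, run_pipeline, and the plain dict context are hypothetical stand-ins, and the assumption that stages run in the order of the enabled list is an illustration rather than a statement about how DefaultPipeline is implemented.

```python
class ToyStage:
    """Stand-in for a stage: receives a context and returns it."""

    name = "toy"

    def run(self, debug: dict) -> dict:
        raise NotImplementedError


class CreateData(ToyStage):
    name = "create_data"

    def run(self, debug: dict) -> dict:
        debug["rows"] = [{"id": 1, "value": 100}, {"id": 2, "value": 200}]
        return debug


class Summarize(ToyStage):
    name = "summarize"

    def run(self, debug: dict) -> dict:
        # Depends on the context key written by the earlier stage.
        debug["total"] = sum(row["value"] for row in debug["rows"])
        return debug


def run_pipeline(stages, enabled):
    """Run the enabled stages in the listed order, threading one context through."""
    by_name = {stage.name: stage for stage in stages}
    context = {}
    for name in enabled:
        context = by_name[name].run(context)
    return context


# Registration order does not matter here; the enabled list decides the order.
result = run_pipeline([Summarize(), CreateData()], ["create_data", "summarize"])
print(result["total"])  # 300
```

Listing summarize before create_data would raise a KeyError, which is why the order of stages matters when later stages consume earlier results.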
Dev vs prod#
Tip
Start in dev
Use dev mode first: no cluster credentials, fast feedback, files under .dev/.
Dev: tables as .jsonl under .dev/, local subprocesses for map/vanilla-style work, YQL backed by DuckDB where applicable.
Prod: real YT operations, code upload to build_folder, jobs on the cluster.
Dev vs prod has a full comparison.
Configuration#
configs/config.yaml: pipeline mode, enabled stages, shared options.
stages/<name>/config.yaml: settings for that stage.
configs/secrets.env: credentials (not committed).
Operations#
Map#
Row-wise transforms with uploaded mapper code. See Map operations and example 04_map_operation.
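A row-wise transform can be prototyped as a plain function before it is wired into a map operation. The sketch below assumes a mapper that takes one row (a dict) and yields zero or more output rows; the actual mapper interface expected by ytjobs may differ, and double_value and apply_mapper are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def double_value(row: Dict) -> Iterator[Dict]:
    """Toy mapper: emit one output row per input row with `value` doubled."""
    yield {**row, "value": row["value"] * 2}


def apply_mapper(
    mapper: Callable[[Dict], Iterable[Dict]], rows: Iterable[Dict]
) -> List[Dict]:
    """Apply a mapper to each row independently and collect all outputs."""
    out: List[Dict] = []
    for row in rows:
        out.extend(mapper(row))
    return out


rows = [{"id": 1, "value": 100}, {"id": 2, "value": 200}]
mapped = apply_mapper(double_value, rows)
print(mapped)
```

Keeping the transform free of cluster calls makes it easy to unit-test the same logic that later runs inside a job.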
Vanilla#
Jobs without mandatory input/output tables (setup, maintenance, one-off scripts). See Vanilla and example 05_vanilla_operation.
YQL#
Table operations through YQL via the YT client (joins, filters, aggregates, etc.). See YQL and example 03_yql_operations.
S3#
List, download, and related patterns against S3-compatible storage. See S3 operations and example 06_s3_integration.
Advanced topics#
Code upload: how job bundles are built and sent to YT.
Docker: custom images for GPU or extra system dependencies (example 07_custom_docker).
Checkpoints: model artifacts for inference-style stages.
Multiple operations: more than one operation in a stage (example 09_multiple_operations).
Reference#
API reference: yt_framework (autodoc from docstrings)
YT jobs (ytjobs): mapper helpers, S3, logging, job config path, Cypress checkpoints
Examples#
Under examples/ on GitHub:
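| Example | What it shows |
|---|---|
| | Minimal pipeline |
| | Several stages and context |
| 03_yql_operations | YQL |
| 04_map_operation | Map |
| 05_vanilla_operation | Vanilla |
| 06_s3_integration | S3 |
| 07_custom_docker | Custom Docker image |
| | Multiple config files |
| 09_multiple_operations | Multiple operations in one stage |
| | Custom upload layout |
| | Environment logging |
| | GPU-oriented sample |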
Each example directory has its own README.