YT Framework documentation#

This documentation explains how to build and run data processing pipelines on YTsaurus (YT) with the YT Framework Python package.

Introduction#

YT Framework is a Python library for defining pipelines as ordered stages and running them either against a YT cluster in production or against the local filesystem in development.

You get:

Pipelines built from stages under stages/, with YAML configuration per stage and for the pipeline.
A dev mode that mimics table and job behavior locally (no cluster required for basic work).
Operations such as map, vanilla, YQL (via the YT client), S3 helpers, and related utilities.
Packaging and upload of job code when you run on the cluster.

When it helps#

Less wiring for stage discovery and configuration than writing everything by hand.
One codebase: flip pipeline.mode between dev and prod instead of maintaining two runners.
YQL and table helpers exposed on the same client you use for reads and writes.

Installation#

Prerequisites#

Python 3.11 or newer
For prod mode: network access to YT, valid credentials, and a cluster whose images match your job dependencies (see Cluster requirements)

YT cluster requirements#

Warning

Cluster Docker image

In prod mode, code from ytjobs runs inside jobs on the cluster. The default or custom Docker image for those jobs must include the Python packages your mappers, reducers, and vanilla scripts import.

Details: Cluster requirements.

Install from PyPI#

pip install yt-framework

Install from source#

pip install -e .

Verify installation#

python -c "import yt_framework; print(yt_framework.__version__)"

The PyPI distribution is named yt-framework. Import paths are yt_framework (driver) and ytjobs (job-side helpers).
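
The two checks above (Python version and installed package) can also be scripted. The following is a minimal sketch that uses only the standard library; the function names are illustrative and not part of YT Framework. It relies on the distribution name yt-framework stated above and on the Python 3.11 minimum from the prerequisites.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 11)  # minimum version listed under Prerequisites


def python_ok(version_info=sys.version_info):
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(dist_name="yt-framework"):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python OK:", python_ok())
    print("yt-framework version:", installed_version())
```

A result of None from installed_version() means pip does not see the distribution in the current environment, which usually points to a different virtualenv than the one used for installation.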

Credentials for prod#

Warning

Secrets for production

Prod mode expects YT (and optionally S3) credentials in configs/secrets.env. Without them, the client cannot talk to the cluster.

Create configs/secrets.env in your pipeline repo:

# configs/secrets.env
YT_PROXY=your-yt-proxy-url
YT_TOKEN=your-yt-token

For S3-backed operations, add the keys your stage uses (names vary by operation; see Secrets):

S3_ENDPOINT=https://your-s3-endpoint.com
S3_DOWNLOAD_ACCESS_KEY=your-download-access-key
S3_DOWNLOAD_SECRET_KEY=your-download-secret-key
S3_UPLOAD_ACCESS_KEY=your-upload-access-key
S3_UPLOAD_SECRET_KEY=your-upload-secret-key

More detail: Secrets management.
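
Before a prod run it can help to confirm that the required keys are present. The sketch below is a stand-alone check, not the framework's own loader: it parses simple KEY=VALUE lines and reports which of YT_PROXY and YT_TOKEN are missing or empty. The helper names are hypothetical, and the parser deliberately ignores quoting rules beyond stripping surrounding quotes.

```python
from pathlib import Path

REQUIRED_KEYS = ("YT_PROXY", "YT_TOKEN")  # keys named in this section


def parse_env_text(text):
    """Parse KEY=VALUE lines; skip blanks, comments, and malformed lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Simplification: only strip one layer of surrounding quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


def check_secrets_file(path="configs/secrets.env"):
    """Return missing keys for a secrets file (all keys if the file is absent)."""
    file_path = Path(path)
    if not file_path.is_file():
        return list(REQUIRED_KEYS)
    return missing_keys(parse_env_text(file_path.read_text(encoding="utf-8")))
```

Calling check_secrets_file() from the pipeline repo root returns an empty list when both YT keys are set.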

Quick start#

Minimal pipeline: one stage that writes a small table.

Step 1: Layout#

mkdir my_first_pipeline
cd my_first_pipeline
mkdir -p stages/create_data configs

Step 2: Entry point#

pipeline.py at the repo root:

from yt_framework.core.pipeline import DefaultPipeline

if __name__ == "__main__":
    DefaultPipeline.main()

Step 3: Pipeline config#

configs/config.yaml:

stages:
  enabled_stages:
    - create_data

pipeline:
  mode: "dev"  # use "prod" on the cluster

Step 4: Stage#

stages/create_data/stage.py:

from yt_framework.core.pipeline import DebugContext
from yt_framework.core.stage import BaseStage

class CreateDataStage(BaseStage):
    def run(self, debug: DebugContext) -> DebugContext:
        self.logger.info("Creating data table...")

        rows = [
            {"id": 1, "name": "Alice", "value": 100},
            {"id": 2, "name": "Bob", "value": 200},
            {"id": 3, "name": "Charlie", "value": 300},
        ]

        self.deps.yt_client.write_table(
            table_path=self.config.client.output_table,
            rows=rows,
        )

        self.logger.info("Created table with %s rows", len(rows))
        return debug

stages/create_data/config.yaml:

client:
  output_table: //tmp/my_first_pipeline/data

Step 5: Run#

python pipeline.py

In dev mode, rows land under something like my_first_pipeline/.dev/data.jsonl. In prod mode, the same logical path is a YT table at //tmp/my_first_pipeline/data.
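
To inspect the dev-mode output, the .jsonl file can be read with the standard library. This is a minimal sketch that assumes one JSON object per line and uses the .dev/data.jsonl location mentioned above as an example; the actual path may differ in your project, and the helper names are illustrative.

```python
import json
from pathlib import Path


def parse_jsonl(lines):
    """Turn an iterable of JSON Lines text into a list of dicts."""
    rows = []
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines
            rows.append(json.loads(line))
    return rows


def read_jsonl(path):
    """Read a .jsonl file from disk and return its rows."""
    with Path(path).open(encoding="utf-8") as handle:
        return parse_jsonl(handle)


if __name__ == "__main__":
    dev_table = Path(".dev/data.jsonl")  # assumed location, adjust as needed
    if dev_table.is_file():
        for row in read_jsonl(dev_table):
            print(row)
    else:
        print(f"{dev_table} not found; run the pipeline in dev mode first")
```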

Where to go next#

Core concepts#

Pipelines and stages#

A pipeline runs stages in order. Each stage is a class with a run method.

DefaultPipeline: discovers BaseStage subclasses under stages/.
BasePipeline: you register stages yourself.
BaseStage: base class for stage implementations.

More: Pipelines and stages.
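
As a mental model only, the relationship between a pipeline and its stages can be sketched without the framework. Everything below is a toy: ToyStage, run_pipeline, and the plain dict standing in for the context object are hypothetical and do not reflect how DefaultPipeline or BaseStage are implemented. The sketch also assumes that stages run in the order they are listed as enabled.

```python
class ToyStage:
    """Stand-in for a stage: a named class with a run method."""

    name = "toy"

    def run(self, debug):
        raise NotImplementedError


class CreateData(ToyStage):
    name = "create_data"

    def run(self, debug):
        # Store rows in the shared context for later stages.
        debug["rows"] = [{"id": 1}, {"id": 2}]
        return debug


class CountRows(ToyStage):
    name = "count_rows"

    def run(self, debug):
        # Consume what an earlier stage produced.
        debug["row_count"] = len(debug["rows"])
        return debug


def run_pipeline(stages, enabled):
    """Run the enabled stages in the listed order, threading the context."""
    by_name = {stage.name: stage for stage in stages}
    debug = {}
    for name in enabled:
        debug = by_name[name].run(debug)
    return debug


result = run_pipeline([CountRows(), CreateData()], ["create_data", "count_rows"])
print(result["row_count"])
```

The point of the toy is the shape: each stage receives the context, does its work, and returns the context for the next stage.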

Dev vs prod#

Tip

Start in dev

Use dev mode first: no cluster credentials, fast feedback, files under .dev/.

Dev: tables as .jsonl under .dev/, local subprocesses for map/vanilla-style work, YQL backed by DuckDB where applicable.
Prod: real YT operations, code upload to build_folder, jobs on the cluster.

More: Dev vs prod (full comparison).

Configuration#

configs/config.yaml: pipeline mode, enabled stages, shared options.
stages/<name>/config.yaml: settings for that stage.
configs/secrets.env: credentials (not committed).

More: Configuration.
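
The file layout described here lends itself to a quick sanity check before a run. The sketch below is a hypothetical helper, not a framework feature: given the repo root and the stage names you enabled, it lists which stage.py and config.yaml files are missing. It assumes every enabled stage has both files, as in the quick start; the framework itself may be more lenient.

```python
from pathlib import Path


def missing_stage_files(repo_root, enabled_stages):
    """Return relative paths of expected per-stage files that do not exist."""
    root = Path(repo_root)
    missing = []
    for name in enabled_stages:
        for filename in ("stage.py", "config.yaml"):
            path = root / "stages" / name / filename
            if not path.is_file():
                missing.append(path.relative_to(root).as_posix())
    return missing


if __name__ == "__main__":
    # Stage names are passed in by hand here to avoid a YAML dependency.
    for path in missing_stage_files(".", ["create_data"]):
        print("missing:", path)
```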

Operations#

Map#

Row-wise transforms with uploaded mapper code. More: Map operations (example: 04_map_operation).
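
To make "row-wise transform" concrete, here is a generic sketch of the pattern in plain Python. It does not use the ytjobs helpers or any YT API: map_row and run_mapper are hypothetical names, and the JSON-lines input is an assumption chosen to match the dev-mode table format. The "value" field comes from the quick-start rows.

```python
import json


def map_row(row):
    """Hypothetical transform: keep the row and add a doubled value."""
    return {**row, "value_doubled": row["value"] * 2}


def run_mapper(lines):
    """Apply map_row to each JSON line and yield JSON lines back."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.dumps(map_row(json.loads(line)))


if __name__ == "__main__":
    sample = ['{"id": 1, "name": "Alice", "value": 100}']
    for out_line in run_mapper(sample):
        print(out_line)
```

Each output row depends only on its own input row, which is what allows a map operation to be split across many jobs.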

Vanilla#

Jobs without mandatory input/output tables (setup, maintenance, one-off scripts). More: Vanilla (example: 05_vanilla_operation).

YQL#

Table operations through YQL via the YT client (joins, filters, aggregates, etc.). More: YQL (example: 03_yql_operations).

S3#

List, download, and related patterns against S3-compatible storage. More: S3 operations (example: 06_s3_integration).

Advanced topics#

Code upload: how job bundles are built and sent to YT.
Docker: custom images for GPU or extra system dependencies (example: 07_custom_docker).
Checkpoints: model artifacts for inference-style stages.
Multiple operations: more than one operation in a stage (example: 09_multiple_operations).

Reference#

API reference: yt_framework (autodoc from docstrings)
YT jobs (ytjobs): mapper helpers, S3, logging, job config path, Cypress checkpoints
Environment variables
Troubleshooting

Examples#

Under examples/ on GitHub:

Example	What it shows
01_hello_world	Minimal pipeline
02_multi_stage_pipeline	Several stages and context
03_yql_operations	YQL
04_map_operation	Map
05_vanilla_operation	Vanilla
06_s3_integration	S3
07_custom_docker	Custom Docker image
08_multiple_configs	Multiple config files
09_multiple_operations	Multiple operations in one stage
10_custom_upload	Custom upload layout
environment_log	Environment logging
video_gpu	GPU-oriented sample

Each example directory has its own README.