YT Framework Documentation#

Welcome to the YT Framework documentation! This guide will help you get started with building data processing pipelines on YTsaurus.

Introduction#

YT Framework is a Python framework designed to simplify the development and execution of data processing pipelines on YTsaurus (YT) clusters. It provides:

  • Simple Pipeline Architecture: Organize your workflows into stages

  • Seamless Development: Develop locally, deploy to production with minimal changes

  • Comprehensive Operations: Support for Map, Vanilla, YQL, and S3 operations

  • Automatic Code Management: Handles code upload, dependencies, and execution automatically

Why YT Framework?#

  • Fast Development: Automatic stage discovery means less boilerplate

  • Local Testing: Dev mode simulates YT operations locally using the file system

  • Production Ready: Same code runs in dev and prod modes

  • Flexible: Supports everything from simple table operations to complex ML inference pipelines

Installation#

Prerequisites#

  • Python 3.11 or higher

  • Access to a YTsaurus cluster (for production mode)

  • YT credentials (for production mode)

YT Cluster Requirements#

Warning

Cluster Docker Image Dependencies

Code from ytjobs runs on YT cluster nodes in production mode, so the cluster’s Docker image (default or custom) must include every dependency your operations need.

See Cluster Requirements for detailed information about cluster dependencies and how to verify compatibility.

Install from PyPI#

For most users, install the package from PyPI:

pip install yt-framework

Install from Source#

For local development or contributing, install the package in editable mode:

pip install -e .

Verify Installation#

python -c "import yt_framework; print(yt_framework.__version__)"
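If the one-liner above fails on the `__version__` attribute, the installed distribution's metadata can be queried instead. The following is a minimal sketch using only the standard library. The helper name `installed_version` is hypothetical, and the distribution name `yt-framework` is taken from the pip command earlier in this guide.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Hypothetical usage: report whether the package from this guide is installed.
found = installed_version("yt-framework")
print(found if found is not None else "yt-framework is not installed")
```

Returning None rather than raising keeps the check usable in setup scripts that need to decide whether to install the package.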

Configuration Setup#

Warning

Secrets Required for Production Mode

Production mode requires YT credentials. Make sure to set up secrets.env before running in prod mode.

After installation, create a secrets.env file in your pipeline’s configs/ directory to hold your YT credentials for production mode:

# configs/secrets.env
YT_PROXY=your-yt-proxy-url
YT_TOKEN=your-yt-token

For S3 integration, also add:

S3_ENDPOINT=https://your-s3-endpoint.com
S3_DOWNLOAD_ACCESS_KEY=your-download-access-key
S3_DOWNLOAD_SECRET_KEY=your-download-secret-key
S3_UPLOAD_ACCESS_KEY=your-upload-access-key
S3_UPLOAD_SECRET_KEY=your-upload-secret-key

See Secrets Management for more details.
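Before switching to prod mode, it can help to confirm that the secrets file contains the required keys. The sketch below is a standalone sanity check, not the framework's own loader. The function names are hypothetical, and it assumes the file uses plain KEY=VALUE lines as in the example above (no quoting or `export` prefixes).

```python
from __future__ import annotations

from pathlib import Path

# Keys this guide lists as required for production mode.
REQUIRED_KEYS = ("YT_PROXY", "YT_TOKEN")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: dict[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


def check_secrets_file(path: str | Path) -> list[str]:
    """Read a secrets file and report which required keys are missing."""
    text = Path(path).read_text(encoding="utf-8")
    return missing_keys(parse_env_text(text))
```

For example, check_secrets_file("configs/secrets.env") returns an empty list when both YT_PROXY and YT_TOKEN are set.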

Quick Start#

Let’s create a simple pipeline that creates a table with some data.

Step 1: Create Pipeline Structure#

mkdir my_first_pipeline
cd my_first_pipeline
mkdir -p stages/create_data configs

Step 2: Create Pipeline Entry Point#

Create pipeline.py in the root directory:

from yt_framework.core.pipeline import DefaultPipeline

if __name__ == "__main__":
    DefaultPipeline.main()

Step 3: Create Stage Configuration#

Create configs/config.yaml:

stages:
  enabled_stages:
    - create_data

pipeline:
  mode: "dev"  # Use "prod" for production

Step 4: Create Stage#

Create stages/create_data/stage.py:

from yt_framework.core.pipeline import DebugContext
from yt_framework.core.stage import BaseStage

class CreateDataStage(BaseStage):
    def run(self, debug: DebugContext) -> DebugContext:
        self.logger.info("Creating data table...")
        
        # Create some sample data
        rows = [
            {"id": 1, "name": "Alice", "value": 100},
            {"id": 2, "name": "Bob", "value": 200},
            {"id": 3, "name": "Charlie", "value": 300},
        ]
        
        # Write to YT table
        self.deps.yt_client.write_table(
            table_path=self.config.client.output_table,
            rows=rows,
        )
        
        self.logger.info(f"Created table with {len(rows)} rows")
        return debug

Create stages/create_data/config.yaml:

client:
  output_table: //tmp/my_first_pipeline/data

Step 5: Run the Pipeline#

python pipeline.py

In dev mode, the table will be created as my_first_pipeline/.dev/data.jsonl. In prod mode, it will be created on the YT cluster at //tmp/my_first_pipeline/data.
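To inspect the dev-mode output, the file can be read back with a few lines of Python. This sketch assumes the file follows the usual JSON Lines convention of one JSON object per line, which the .jsonl extension suggests but this guide does not spell out. The helper `read_jsonl` is hypothetical, and the demo writes a stand-in file so that it runs without the framework.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Return the rows of a JSON Lines file as a list of dicts."""
    rows = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # tolerate blank lines
                rows.append(json.loads(line))
    return rows


# Demo on a stand-in file that mimics the Quick Start data.
sample_rows = [
    {"id": 1, "name": "Alice", "value": 100},
    {"id": 2, "name": "Bob", "value": 200},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "data.jsonl"
    demo_path.write_text(
        "".join(json.dumps(row) + "\n" for row in sample_rows),
        encoding="utf-8",
    )
    loaded = read_jsonl(demo_path)

print(f"Read {len(loaded)} rows")
```

To look at the Quick Start table itself, call read_jsonl(".dev/data.jsonl") from inside the my_first_pipeline directory after a dev-mode run.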

Core Concepts#

Pipelines and Stages#

A pipeline is a collection of stages that execute in sequence. Each stage performs a specific task (e.g., create table, process data, upload results).

  • DefaultPipeline: Automatically discovers stages from the stages/ directory

  • BasePipeline: Manual stage registration (for advanced use cases)

  • BaseStage: Base class for all stages

See Pipelines and Stages for details.
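To make the execution model concrete, here is a toy model of "stages run in sequence" in plain Python. It does not use or reproduce yt_framework: `ToyStage`, `run_in_sequence`, and the dict-based context are hypothetical names chosen for illustration. Running stages in the order of the enabled list is also an assumption made for this sketch.

```python
from typing import Callable


class ToyStage:
    """A named unit of work that receives a context and returns it."""

    def __init__(self, name: str, action: Callable[[dict], None]):
        self.name = name
        self.action = action

    def run(self, context: dict) -> dict:
        self.action(context)
        # Record completion so later stages (or the caller) can see progress.
        context.setdefault("completed", []).append(self.name)
        return context


def run_in_sequence(stages: list, enabled: list) -> dict:
    """Run the enabled stages in the order listed, threading one context through."""
    by_name = {stage.name: stage for stage in stages}
    context: dict = {}
    for name in enabled:
        context = by_name[name].run(context)
    return context


stages = [
    ToyStage("create_data", lambda ctx: ctx.update(rows=[1, 2, 3])),
    ToyStage("process_data", lambda ctx: ctx.update(total=sum(ctx["rows"]))),
]
result = run_in_sequence(stages, enabled=["create_data", "process_data"])
print(result["completed"], result["total"])
```

The second stage depends on what the first one left in the context, which is why order matters in a sequential pipeline.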

Dev vs Prod Modes#

Tip

Start with Dev Mode

Always develop and test your pipelines in dev mode first. It’s faster, doesn’t require YT credentials, and makes debugging easier.

  • Dev Mode: Simulates YT operations locally using the file system. Tables are stored as .jsonl files in the .dev/ directory. Well suited to development and testing.

  • Prod Mode: Executes operations on an actual YT cluster. Requires YT credentials and cluster access.

See Dev vs Prod for a complete comparison.

Configuration System#

Configuration is managed through YAML files:

  • Pipeline config (configs/config.yaml): Pipeline-level settings (mode, build_folder)

  • Stage configs (stages/<stage_name>/config.yaml): Stage-specific settings

  • Secrets (configs/secrets.env): Credentials and sensitive data

See Configuration Guide for details.
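The file layout described above can be checked with a short script before the first run. This is a sketch that mirrors the Quick Start layout. The helper name `missing_layout_files` is hypothetical, and treating a per-stage config.yaml as mandatory is an assumption. secrets.env is left out of the check because it is only needed in prod mode.

```python
from __future__ import annotations

from pathlib import Path


def missing_layout_files(root, stage_names) -> list[str]:
    """List expected files (relative to root) that do not exist."""
    root = Path(root)
    expected = [Path("pipeline.py"), Path("configs") / "config.yaml"]
    for name in stage_names:
        expected.append(Path("stages") / name / "stage.py")
        expected.append(Path("stages") / name / "config.yaml")
    return [path.as_posix() for path in expected if not (root / path).is_file()]
```

For the Quick Start pipeline, missing_layout_files(".", ["create_data"]) returns an empty list once all four files are in place.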

Operations#

YT Framework supports several types of operations:

Map Operations#

Process each row of a table independently. Perfect for row-by-row transformations.
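The row-by-row semantics can be modeled in plain Python. The sketch below is not the framework's Map API: `map_rows` and `double_value` are hypothetical names. It only shows the contract that a mapper sees one row at a time and cannot depend on other rows, which is what allows a cluster to split the work across jobs.

```python
from typing import Callable, Iterable, Iterator


def map_rows(
    rows: Iterable[dict],
    mapper: Callable[[dict], Iterable[dict]],
) -> Iterator[dict]:
    """Apply mapper to each row on its own and yield whatever it emits."""
    for row in rows:
        yield from mapper(row)


def double_value(row: dict) -> Iterator[dict]:
    """Example mapper: emit one output row with the value doubled."""
    yield {**row, "value": row["value"] * 2}


rows = [
    {"id": 1, "name": "Alice", "value": 100},
    {"id": 2, "name": "Bob", "value": 200},
]
doubled = list(map_rows(rows, double_value))
print(doubled)
```

Writing the mapper as a generator lets it emit zero, one, or several output rows per input row, so filtering and row expansion fit the same shape.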

Vanilla Operations#

Run standalone jobs without input/output tables. Perfect for setup, cleanup, or validation tasks.

YQL Operations#

Perform table operations using YQL, the SQL-based query language supported on YTsaurus. Includes joins, filters, aggregations, and more.

S3 Operations#

Integrate with S3 for file listing, downloading, and processing.

Advanced Topics#

Code Upload#

Learn how the framework handles code packaging and deployment to the YT cluster.

Docker Support#

Use custom Docker images for GPU workloads or special dependencies.

Checkpoint Management#

Handle ML model checkpoints for inference pipelines.

Multiple Operations#

Run multiple operations in a single stage.

Examples#

The examples/ directory contains complete working examples. Each example includes a README explaining what it demonstrates.