YT Framework Documentation#
Welcome to the YT Framework documentation! This guide will help you get started with building data processing pipelines on YTsaurus.
Introduction#
YT Framework is a Python framework designed to simplify the development and execution of data processing pipelines on YTsaurus (YT) clusters. It provides:
Simple Pipeline Architecture: Organize your workflows into stages
Seamless Development: Develop locally, deploy to production with minimal changes
Comprehensive Operations: Support for Map, Vanilla, YQL, and S3 operations
Automatic Code Management: Handles code upload, dependencies, and execution automatically
Why YT Framework?#
Fast Development: Automatic stage discovery means less boilerplate
Local Testing: Dev mode simulates YT operations locally using the file system
Production Ready: Same code runs in dev and prod modes
Flexible: Supports everything from simple table operations to complex ML inference pipelines
Installation#
Prerequisites#
Python 3.11 or higher
Access to a YTsaurus cluster (for production mode)
YT credentials (for production mode)
YT Cluster Requirements#
Warning
Cluster Docker Image Dependencies
Code from ytjobs executes on YT cluster nodes in production mode. The cluster’s Docker image (default or custom) must include required dependencies for your operations to run successfully.
See Cluster Requirements for detailed information about cluster dependencies and how to verify compatibility.
Install from PyPI#
For most users, install the package from PyPI:
pip install yt-framework
Install from Source#
For local development or contributing, install the package in editable mode:
pip install -e .
Verify Installation#
python -c "import yt_framework; print(yt_framework.__version__)"
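The prerequisite and verification steps above can also be checked from a script. The following is a minimal sketch using only the standard library; the distribution name `yt-framework` is taken from the `pip install` command above, and the helper names are hypothetical, not part of the framework.

```python
import sys
from importlib import metadata
from typing import Optional

MIN_PYTHON = (3, 11)  # minimum version from the Prerequisites section


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the interpreter meets the documented minimum version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(dist_name: str = "yt-framework") -> Optional[str]:
    """Return the installed distribution version, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("yt-framework version:", installed_version() or "not installed")
```

Unlike the one-liner above, this reads package metadata without importing the package, so it still reports something useful when the import itself fails.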
Configuration Setup#
Warning
Secrets Required for Production Mode
Production mode requires YT credentials. Make sure to set up secrets.env before running in prod mode.
To set up your YT credentials, create a secrets.env file in your pipeline’s configs/ directory:
# configs/secrets.env
YT_PROXY=your-yt-proxy-url
YT_TOKEN=your-yt-token
For S3 integration, also add:
S3_ENDPOINT=https://your-s3-endpoint.com
S3_DOWNLOAD_ACCESS_KEY=your-download-access-key
S3_DOWNLOAD_SECRET_KEY=your-download-secret-key
S3_UPLOAD_ACCESS_KEY=your-upload-access-key
S3_UPLOAD_SECRET_KEY=your-upload-secret-key
See Secrets Management for more details.
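Before switching to prod mode, it can help to confirm that the required keys are actually present in secrets.env. The sketch below is a standalone check, not the framework's own loader: it assumes the simple KEY=VALUE format shown above, and the function names are hypothetical.

```python
from pathlib import Path

REQUIRED_KEYS = ("YT_PROXY", "YT_TOKEN")  # keys listed above for production mode


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


def check_secrets_file(path):
    """Return the missing required keys for the env file at `path`."""
    text = Path(path).read_text(encoding="utf-8")
    return missing_keys(parse_env_text(text))
```

For example, `check_secrets_file("configs/secrets.env")` returns an empty list when both YT keys are set. Pass the S3 key names as `required` to `missing_keys` if your pipeline uses S3.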
Quick Start#
Let’s create a simple pipeline that creates a table with some data.
Step 1: Create Pipeline Structure#
mkdir my_first_pipeline
cd my_first_pipeline
mkdir -p stages/create_data configs
Step 2: Create Pipeline Entry Point#
Create pipeline.py in the root directory:
from yt_framework.core.pipeline import DefaultPipeline

if __name__ == "__main__":
    DefaultPipeline.main()
Step 3: Create Stage Configuration#
Create configs/config.yaml:
stages:
  enabled_stages:
    - create_data
pipeline:
  mode: "dev"  # Use "prod" for production
Step 4: Create Stage#
Create stages/create_data/stage.py:
from yt_framework.core.pipeline import DebugContext
from yt_framework.core.stage import BaseStage


class CreateDataStage(BaseStage):
    def run(self, debug: DebugContext) -> DebugContext:
        self.logger.info("Creating data table...")

        # Create some sample data
        rows = [
            {"id": 1, "name": "Alice", "value": 100},
            {"id": 2, "name": "Bob", "value": 200},
            {"id": 3, "name": "Charlie", "value": 300},
        ]

        # Write to YT table
        self.deps.yt_client.write_table(
            table_path=self.config.client.output_table,
            rows=rows,
        )

        self.logger.info(f"Created table with {len(rows)} rows")
        return debug
Create stages/create_data/config.yaml:
client:
  output_table: //tmp/my_first_pipeline/data
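The stage reads this value as `self.config.client.output_table`, so the nested YAML keys become nested attributes. How the framework builds its config object is not shown in this guide; the sketch below only illustrates the idea with the standard library, using a plain dict in place of the parsed YAML. The helper `to_namespace` is hypothetical.

```python
from types import SimpleNamespace


def to_namespace(value):
    """Recursively convert dicts into objects that support attribute access."""
    if isinstance(value, dict):
        return SimpleNamespace(**{key: to_namespace(item) for key, item in value.items()})
    if isinstance(value, list):
        return [to_namespace(item) for item in value]
    return value


# The content of stages/create_data/config.yaml, written out as a dict
stage_config = to_namespace({"client": {"output_table": "//tmp/my_first_pipeline/data"}})

print(stage_config.client.output_table)  # prints //tmp/my_first_pipeline/data
```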
Step 5: Run the Pipeline#
python pipeline.py
In dev mode, the table will be created as my_first_pipeline/.dev/data.jsonl. In prod mode, it will be created on the YT cluster at //tmp/my_first_pipeline/data.
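Because the dev-mode table is a JSON Lines file, it can be inspected with a few lines of standard-library Python. This is a minimal sketch; `read_jsonl` is a hypothetical helper, and the path assumes you run it from inside my_first_pipeline/.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return the rows of a JSON Lines file as a list of dicts."""
    rows = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # tolerate blank lines
                rows.append(json.loads(line))
    return rows


if __name__ == "__main__":
    table = Path(".dev") / "data.jsonl"  # location described above; adjust if yours differs
    if table.exists():
        for row in read_jsonl(table):
            print(row)
    else:
        print(f"{table} not found; run the pipeline in dev mode first")
```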
Next Steps#
Learn about Pipelines and Stages
Explore Configuration options
Understand Dev vs Prod modes
Check out Examples for more complex scenarios
Core Concepts#
Pipelines and Stages#
A pipeline is a collection of stages that execute in sequence. Each stage performs a specific task (e.g., create table, process data, upload results).
DefaultPipeline: Automatically discovers stages from the stages/ directory
BasePipeline: Manual stage registration (for advanced use cases)
BaseStage: Base class for all stages
See Pipelines and Stages for details.
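To make the idea of stage discovery concrete, here is a rough sketch of how a directory scan could work. It is not the framework's implementation: it assumes a stage is any subdirectory of stages/ that contains a stage.py file (the layout used in the Quick Start), and both function names are hypothetical.

```python
from pathlib import Path


def discover_stage_names(pipeline_root):
    """Return names of subdirectories of stages/ that contain a stage.py file."""
    stages_dir = Path(pipeline_root) / "stages"
    if not stages_dir.is_dir():
        return []
    return sorted(
        entry.name
        for entry in stages_dir.iterdir()
        if entry.is_dir() and (entry / "stage.py").is_file()
    )


def select_enabled(discovered, enabled_stages):
    """Keep the configured order and fail early on unknown stage names."""
    unknown = [name for name in enabled_stages if name not in discovered]
    if unknown:
        raise ValueError(f"Enabled stages not found on disk: {unknown}")
    return list(enabled_stages)
```

In this sketch the execution order comes from `enabled_stages` in configs/config.yaml rather than from directory order, which keeps the sequence explicit.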
Dev vs Prod Modes#
Tip
Start with Dev Mode
Always develop and test your pipelines in dev mode first. It’s faster, doesn’t require YT credentials, and makes debugging easier.
Dev Mode: Simulates YT operations locally using the file system. Tables are stored as .jsonl files in the .dev/ directory. Perfect for development and testing.
Prod Mode: Executes operations on an actual YT cluster. Requires YT credentials and cluster access.
See Dev vs Prod for complete comparison.
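As an illustration of what "simulating a table on the file system" can mean, the sketch below writes rows to a .jsonl file under .dev/. This is an assumption-laden toy, not the framework's code: in particular, the rule that maps a YT path to a file name (last path component plus `.jsonl`) is inferred from the Quick Start example only, and both function names are hypothetical.

```python
import json
from pathlib import Path


def dev_table_path(yt_path, dev_dir=".dev"):
    """Map a YT-style path to a local .jsonl file (assumed rule: last path component)."""
    name = yt_path.rstrip("/").rsplit("/", 1)[-1]
    return Path(dev_dir) / f"{name}.jsonl"


def write_table_local(yt_path, rows, dev_dir=".dev"):
    """Write rows as JSON Lines, one object per line, and return the file path."""
    target = dev_table_path(yt_path, dev_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")
    return target
```

With this rule, `dev_table_path("//tmp/my_first_pipeline/data")` resolves to `.dev/data.jsonl`, matching the file name mentioned in the Quick Start.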
Configuration System#
Configuration is managed through YAML files:
Pipeline config (configs/config.yaml): Pipeline-level settings (mode, build_folder)
Stage configs (stages/<stage_name>/config.yaml): Stage-specific settings
Secrets (configs/secrets.env): Credentials and sensitive data
See Configuration Guide for details.
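The three file locations above lend themselves to a quick pre-flight check. The following sketch reports which of them are missing for a given pipeline directory. It is a hypothetical helper, not a framework feature, and it assumes every stage has its own config.yaml and that secrets.env matters only in prod mode, which may be stricter or looser than what the framework enforces.

```python
from pathlib import Path


def missing_pipeline_files(pipeline_root, stage_names, mode="dev"):
    """Return expected config files that are absent, as paths relative to the root."""
    root = Path(pipeline_root)
    expected = [Path("configs") / "config.yaml"]
    if mode == "prod":
        # Credentials are only required for production mode
        expected.append(Path("configs") / "secrets.env")
    for name in stage_names:
        expected.append(Path("stages") / name / "config.yaml")
    return [path.as_posix() for path in expected if not (root / path).is_file()]
```

For the Quick Start pipeline, `missing_pipeline_files(".", ["create_data"])` should return an empty list once Steps 3 and 4 are complete.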
Operations#
YT Framework supports several types of operations:
Map Operations#
Process each row of a table independently. Perfect for row-by-row transformations.
Example: 04_map_operation
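The row-by-row model can be tried without a cluster. The sketch below is plain Python and does not use the framework's mapper interface, which is not shown in this guide (see the 04_map_operation example for the real one); `double_value` and `run_map_locally` are hypothetical names. It assumes a mapper is a function that takes one row and yields zero or more output rows.

```python
def double_value(row):
    """Example mapper: emit one output row per input row with a doubled value."""
    yield {**row, "value": row["value"] * 2}


def run_map_locally(mapper, rows):
    """Apply a mapper to each row independently and collect everything it yields."""
    output = []
    for row in rows:
        output.extend(mapper(row))
    return output


rows = [{"id": 1, "value": 100}, {"id": 2, "value": 200}]
print(run_map_locally(double_value, rows))
# prints [{'id': 1, 'value': 200}, {'id': 2, 'value': 400}]
```

Because each row is processed on its own, a mapper should not rely on state shared between rows; that independence is what allows the work to be split across many jobs.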
Vanilla Operations#
Run standalone jobs without input/output tables. Perfect for setup, cleanup, or validation tasks.
YQL Operations#
Perform table operations using YQL, the SQL-based query language supported by YTsaurus. Supports joins, filters, aggregations, and more.
Example: 03_yql_operations
S3 Operations#
Integrate with S3 for file listing, downloading, and processing.
Example: 06_s3_integration
Advanced Topics#
Code Upload#
Learn how the framework packages your code and deploys it to the YT cluster.
Docker Support#
Use custom Docker images for GPU workloads or special dependencies.
Example: 07_custom_docker
Checkpoint Management#
Handle ML model checkpoints for inference pipelines.
Multiple Operations#
Run multiple operations in a single stage.
Reference#
API Reference - Complete API documentation
Troubleshooting - Common issues and solutions
Examples#
The examples/ directory contains complete working examples:
01_hello_world - Basic pipeline
02_multi_stage_pipeline - Multiple stages
03_yql_operations - YQL operations
04_map_operation - Map operation
05_vanilla_operation - Vanilla operation
06_s3_integration - S3 integration
07_custom_docker - Custom Docker
08_multiple_configs - Multiple configs
09_multiple_operations - Multiple operations
environment_log - Environment logging
video_gpu - GPU processing
Each example includes a README explaining what it demonstrates.