# YT Framework Documentation

Welcome to the YT Framework documentation! This guide will help you get started with building data processing pipelines on YTsaurus.

## Table of Contents

```{toctree}
:maxdepth: 3
:titlesonly:

configuration/index
operations/index
advanced/index
reference/api
troubleshooting/index
```

## Introduction

YT Framework is a Python framework designed to simplify the development and execution of data processing pipelines on YTsaurus (YT) clusters. It provides:

- **Simple Pipeline Architecture**: Organize your workflows into stages
- **Seamless Development**: Develop locally, deploy to production with minimal changes
- **Comprehensive Operations**: Support for Map, Vanilla, YQL, and S3 operations
- **Automatic Code Management**: Handles code upload, dependencies, and execution automatically

### Why YT Framework?

- **Fast Development**: Automatic stage discovery means less boilerplate
- **Local Testing**: Dev mode simulates YT operations locally using file system
- **Production Ready**: Same code runs in dev and prod modes
- **Flexible**: Supports everything from simple table operations to complex ML inference pipelines

## Installation

### Prerequisites

- Python 3.11 or higher
- Access to YTsaurus cluster (for production mode)
- YT credentials (for production mode)

#### YT Cluster Requirements

```{warning}
**Cluster Docker Image Dependencies**

Code from `ytjobs` executes on YT cluster nodes in production mode. The cluster's Docker image (default or custom) must include required dependencies for your operations to run successfully.
```

See [Cluster Requirements](configuration/cluster-requirements.md) for detailed information about cluster dependencies and how to verify compatibility.

### Install from PyPI

For most users, install the package from PyPI:

```bash
pip install yt-framework
```

### Install from Source

For local development or contributing, install the package in editable mode:

```bash
pip install -e .
```

### Verify Installation

```bash
python -c "import yt_framework; print(yt_framework.__version__)"
```

### Configuration Setup

```{warning}
**Secrets Required for Production Mode**

Production mode requires YT credentials. Make sure to set up `secrets.env` before running in prod mode.
```

After installation, you'll need to set up your YT credentials for production mode. Create a `secrets.env` file in your pipeline's `configs/` directory:

```bash
# configs/secrets.env
YT_PROXY=your-yt-proxy-url
YT_TOKEN=your-yt-token
```

For S3 integration, also add:

```bash
S3_ENDPOINT=https://your-s3-endpoint.com
S3_DOWNLOAD_ACCESS_KEY=your-download-access-key
S3_DOWNLOAD_SECRET_KEY=your-download-secret-key
S3_UPLOAD_ACCESS_KEY=your-upload-access-key
S3_UPLOAD_SECRET_KEY=your-upload-secret-key
```

See [Secrets Management](configuration/secrets.md) for more details.

## Quick Start

Let's create a simple pipeline that creates a table with some data.

### Step 1: Create Pipeline Structure

```bash
mkdir my_first_pipeline
cd my_first_pipeline
mkdir -p stages/create_data configs
```

### Step 2: Create Pipeline Entry Point

Create `pipeline.py` in the root directory:

```python
from yt_framework.core.pipeline import DefaultPipeline

if __name__ == "__main__":
    DefaultPipeline.main()
```

### Step 3: Create Stage Configuration

Create `configs/config.yaml`:

```yaml
stages:
  enabled_stages:
    - create_data

pipeline:
  mode: "dev"  # Use "prod" for production
```

### Step 4: Create Stage

Create `stages/create_data/stage.py`:

```python
from yt_framework.core.pipeline import DebugContext
from yt_framework.core.stage import BaseStage

class CreateDataStage(BaseStage):
    def run(self, debug: DebugContext) -> DebugContext:
        self.logger.info("Creating data table...")
        
        # Create some sample data
        rows = [
            {"id": 1, "name": "Alice", "value": 100},
            {"id": 2, "name": "Bob", "value": 200},
            {"id": 3, "name": "Charlie", "value": 300},
        ]
        
        # Write to YT table
        self.deps.yt_client.write_table(
            table_path=self.config.client.output_table,
            rows=rows,
        )
        
        self.logger.info(f"Created table with {len(rows)} rows")
        return debug
```

Create `stages/create_data/config.yaml`:

```yaml
client:
  output_table: //tmp/my_first_pipeline/data
```

### Step 5: Run the Pipeline

```bash
python pipeline.py
```

In dev mode, the table will be created as `my_first_pipeline/.dev/data.jsonl`. In prod mode, it will be created on the YT cluster at `//tmp/my_first_pipeline/data`.

### Next Steps

- Learn about [Pipelines and Stages](pipelines-and-stages.md)
- Explore [Configuration](configuration/index.md) options
- Understand [Dev vs Prod modes](dev-vs-prod.md)
- Check out [Examples](https://github.com/GregoryKogan/yt-framework/tree/main/examples/) for more complex scenarios

## Core Concepts

### Pipelines and Stages

A **pipeline** is a collection of **stages** that execute in sequence. Each stage performs a specific task (e.g., create table, process data, upload results).

- **DefaultPipeline**: Automatically discovers stages from `stages/` directory
- **BasePipeline**: Manual stage registration (for advanced use cases)
- **BaseStage**: Base class for all stages

See [Pipelines and Stages](pipelines-and-stages.md) for details.

### Dev vs Prod Modes

```{tip}
**Start with Dev Mode**

Always develop and test your pipelines in dev mode first. It's faster, doesn't require YT credentials, and makes debugging easier.
```

- **Dev Mode**: Simulates YT operations locally using file system. Tables are stored as `.jsonl` files in `.dev/` directory. Perfect for development and testing.
- **Prod Mode**: Executes operations on actual YT cluster. Requires YT credentials and cluster access.

See [Dev vs Prod](dev-vs-prod.md) for complete comparison.

### Configuration System

Configuration is managed through YAML files:

- **Pipeline config** (`configs/config.yaml`): Pipeline-level settings (mode, build_folder)
- **Stage configs** (`stages/<stage_name>/config.yaml`): Stage-specific settings
- **Secrets** (`configs/secrets.env`): Credentials and sensitive data

See [Configuration Guide](configuration/index.md) for details.

## Operations

YT Framework supports several types of operations:

### Map Operations

Process each row of a table independently. Perfect for row-by-row transformations.

- [Map Operations Guide](operations/map.md)
- Example: [04_map_operation](https://github.com/GregoryKogan/yt-framework/tree/main/examples/04_map_operation/)

### Vanilla Operations

Run standalone jobs without input/output tables. Perfect for setup, cleanup, or validation tasks.

- [Vanilla Operations Guide](operations/vanilla.md)
- Example: [05_vanilla_operation](https://github.com/GregoryKogan/yt-framework/tree/main/examples/05_vanilla_operation/)

### YQL Operations

Perform table operations using YQL (YTsaurus Query Language). Includes joins, filters, aggregations, and more.

- [YQL Operations Guide](operations/yql.md)
- Example: [03_yql_operations](https://github.com/GregoryKogan/yt-framework/tree/main/examples/03_yql_operations/)

### S3 Operations

Integrate with S3 for file listing, downloading, and processing.

- [S3 Operations Guide](operations/s3.md)
- Example: [06_s3_integration](https://github.com/GregoryKogan/yt-framework/tree/main/examples/06_s3_integration/)

## Advanced Topics

### Code Upload

Learn how the framework handles code packaging and deployment to YT cluster.

- [Code Upload Guide](advanced/code-upload.md)

### Docker Support

Use custom Docker images for GPU workloads or special dependencies.

- [Docker Guide](advanced/docker.md)
- Example: [07_custom_docker](https://github.com/GregoryKogan/yt-framework/tree/main/examples/07_custom_docker/)

### Checkpoint Management

Handle ML model checkpoints for inference pipelines.

- [Checkpoints Guide](advanced/checkpoints.md)

### Multiple Operations

Run multiple operations in a single stage.

- [Multiple Operations Guide](advanced/multiple-operations.md)
- Example: [09_multiple_operations](https://github.com/GregoryKogan/yt-framework/tree/main/examples/09_multiple_operations/)

## Reference

- [API Reference](reference/api.md) - Complete API documentation
- [Troubleshooting](troubleshooting/index.md) - Common issues and solutions

## Examples

The `examples/` directory contains complete working examples:

- **[01_hello_world](https://github.com/GregoryKogan/yt-framework/tree/main/examples/01_hello_world/)** - Basic pipeline
- **[02_multi_stage_pipeline](https://github.com/GregoryKogan/yt-framework/tree/main/examples/02_multi_stage_pipeline/)** - Multiple stages
- **[03_yql_operations](https://github.com/GregoryKogan/yt-framework/tree/main/examples/03_yql_operations/)** - YQL operations
- **[04_map_operation](https://github.com/GregoryKogan/yt-framework/tree/main/examples/04_map_operation/)** - Map operation
- **[05_vanilla_operation](https://github.com/GregoryKogan/yt-framework/tree/main/examples/05_vanilla_operation/)** - Vanilla operation
- **[06_s3_integration](https://github.com/GregoryKogan/yt-framework/tree/main/examples/06_s3_integration/)** - S3 integration
- **[07_custom_docker](https://github.com/GregoryKogan/yt-framework/tree/main/examples/07_custom_docker/)** - Custom Docker
- **[08_multiple_configs](https://github.com/GregoryKogan/yt-framework/tree/main/examples/08_multiple_configs/)** - Multiple configs
- **[09_multiple_operations](https://github.com/GregoryKogan/yt-framework/tree/main/examples/09_multiple_operations/)** - Multiple operations
- **[environment_log](https://github.com/GregoryKogan/yt-framework/tree/main/examples/environment_log/)** - Environment logging
- **[video_gpu](https://github.com/GregoryKogan/yt-framework/tree/main/examples/video_gpu/)** - GPU processing

Each example includes a README explaining what it demonstrates.