Dev vs Prod Modes#

YT Framework supports two execution modes: dev (development) and prod (production). Understanding the differences and when to use each mode is crucial for effective pipeline development.

Overview#

Tip

Start with Dev Mode

Always develop and test your pipelines in dev mode first. It’s faster, doesn’t require YT credentials, and makes debugging easier.

  • Dev Mode: Simulates YT operations locally using the file system. Perfect for development, testing, and debugging.

  • Prod Mode: Executes operations on the actual YT cluster. Used for production workloads.

Both modes use the same code and configuration, making it easy to develop locally and deploy to production.

Warning

Credentials Required for Prod Mode

Production mode requires YT credentials in configs/secrets.env. Make sure to set up credentials before running in prod mode.

Dev Mode#

How It Works (dev)#

Dev mode simulates YT operations using the local file system:

  • Tables: Stored as .jsonl files in .dev/ directory

  • Operations: Executed locally using subprocess

  • Code Upload: No-op (code runs directly from local filesystem)

  • YQL Operations: Executed using DuckDB for local simulation

Configuration (dev)#

Set mode in pipeline config:

# configs/config.yaml
pipeline:
  mode: "dev"

Directory Structure (dev)#

When running in dev mode, the framework creates a .dev/ directory:

my_pipeline/
├── .dev/
│   ├── table1.jsonl      # Simulated YT tables
│   ├── table2.jsonl
│   └── operation.log     # Operation logs
├── configs/
├── stages/
└── pipeline.py

Table Operations (dev)#

Writing tables:

# In dev mode, writes to .dev/table_name.jsonl
self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 1, "name": "Alice"}]
)
# Creates: .dev/data.jsonl

Reading tables:

# In dev mode, reads from .dev/table_name.jsonl
rows = list(self.deps.yt_client.read_table("//tmp/my_pipeline/data"))
# Reads from: .dev/data.jsonl
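
The path mapping above can be pictured with a small stand-alone sketch. It is not the framework's implementation: the helper names (dev_table_file, write_rows, read_rows) are hypothetical, and the rule "use the last path segment as the file name" is an assumption inferred from the //tmp/my_pipeline/data → .dev/data.jsonl example.

```python
import json
import tempfile
from pathlib import Path


def dev_table_file(table_path: str, dev_dir: Path) -> Path:
    """Map a YT-style path to a local .jsonl file (assumed rule: last path segment)."""
    name = table_path.rstrip("/").rsplit("/", 1)[-1]
    return dev_dir / f"{name}.jsonl"


def write_rows(table_path: str, rows, dev_dir: Path) -> Path:
    """Write rows as one JSON object per line and return the file that was created."""
    target = dev_table_file(table_path, dev_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")
    return target


def read_rows(table_path: str, dev_dir: Path):
    """Yield rows back from the local .jsonl file, skipping blank lines."""
    with dev_table_file(table_path, dev_dir).open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dev_dir = Path(tmp) / ".dev"
        created = write_rows("//tmp/my_pipeline/data", [{"id": 1, "name": "Alice"}], dev_dir)
        print(created.name)
        print(list(read_rows("//tmp/my_pipeline/data", dev_dir)))
```

Under this assumed rule, two tables that share a final path segment would land in the same file, so it is worth checking how your tables are actually named inside .dev/ before relying on it.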

Map Operations (dev)#

Map operations run locally using subprocess:

  1. Creates sandbox directory: .dev/sandbox_<input>-><output>/

  2. Copies input table to sandbox

  3. Executes mapper.py script

  4. Collects output to .dev/<output>.jsonl

Example:

# Dev mode execution
.dev/sandbox_input->output/
├── input.jsonl
├── code.tar.gz (extracted)
└── operation_wrapper_*.sh
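
The four steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the framework's code: run_map_locally is a hypothetical helper, and the mapper is assumed to read JSON rows from stdin and write JSON rows to stdout. The sandbox name mirrors the pattern shown above and needs a file system that allows ">" in directory names.

```python
import json
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical mapper: reads JSON rows on stdin, writes transformed rows on stdout.
MAPPER_SOURCE = """\
import json, sys
for line in sys.stdin:
    row = json.loads(line)
    row["value"] = row["value"] * 2
    print(json.dumps(row))
"""


def run_map_locally(dev_dir: Path, mapper: Path, input_name: str, output_name: str) -> Path:
    """Run a mapper script over a .jsonl table inside a sandbox directory."""
    # Step 1: create the sandbox directory.
    sandbox = dev_dir / f"sandbox_{input_name}->{output_name}"
    sandbox.mkdir(parents=True, exist_ok=True)
    # Step 2: copy the input table into the sandbox.
    local_input = sandbox / "input.jsonl"
    shutil.copy(dev_dir / f"{input_name}.jsonl", local_input)
    # Steps 3-4: run the mapper and collect its stdout as the output table.
    output = dev_dir / f"{output_name}.jsonl"
    with local_input.open("rb") as src, output.open("wb") as dst:
        subprocess.run(
            [sys.executable, str(mapper.resolve())],
            stdin=src, stdout=dst, cwd=sandbox, check=True,
        )
    return output


def demo() -> list:
    """Build a tiny input table, run the mapper, and return the output rows."""
    with tempfile.TemporaryDirectory() as tmp:
        dev_dir = Path(tmp) / ".dev"
        dev_dir.mkdir()
        mapper = Path(tmp) / "mapper.py"
        mapper.write_text(MAPPER_SOURCE, encoding="utf-8")
        (dev_dir / "input.jsonl").write_text('{"value": 1}\n{"value": 2}\n', encoding="utf-8")
        out = run_map_locally(dev_dir, mapper, "input", "output")
        return [json.loads(line) for line in out.read_text(encoding="utf-8").splitlines()]


if __name__ == "__main__":
    print(demo())
```

Passing check=True makes a failing mapper raise immediately, which matches the "errors show in terminal" behavior you would expect from a local run.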

Vanilla Operations (dev)#

Vanilla operations run locally using subprocess:

  1. Creates sandbox directory: .dev/<stage_name>_sandbox/

  2. Extracts code archive

  3. Executes vanilla.py script

  4. Logs output to .dev/<stage_name>.log
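
A comparable sketch for the vanilla case, again with hypothetical names (run_vanilla_locally, my_stage) and without the code-archive extraction step: the script runs in its own sandbox directory, and stdout and stderr are appended to a per-stage log file.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_vanilla_locally(dev_dir: Path, script: Path, stage_name: str) -> int:
    """Run a script in a stage sandbox and append stdout/stderr to <stage_name>.log."""
    sandbox = dev_dir / f"{stage_name}_sandbox"
    sandbox.mkdir(parents=True, exist_ok=True)
    log_path = dev_dir / f"{stage_name}.log"
    with log_path.open("ab") as log:
        result = subprocess.run(
            [sys.executable, str(script.resolve())],
            cwd=sandbox,
            stdout=log,
            stderr=subprocess.STDOUT,  # keep errors in the same log as normal output
        )
    return result.returncode


def demo() -> tuple:
    """Run a one-line script and return (exit_code, log_text)."""
    with tempfile.TemporaryDirectory() as tmp:
        dev_dir = Path(tmp) / ".dev"
        dev_dir.mkdir()
        script = Path(tmp) / "vanilla.py"
        script.write_text('print("hello from vanilla")\n', encoding="utf-8")
        code = run_vanilla_locally(dev_dir, script, "my_stage")
        return code, (dev_dir / "my_stage.log").read_text(encoding="utf-8")


if __name__ == "__main__":
    print(demo())
```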

YQL Operations (dev)#

YQL operations are simulated using DuckDB:

  • Joins, filters, aggregations run locally

  • Results written to .dev/ directory

  • Full YQL syntax supported

When to Use Dev Mode#

  • Development: Writing and testing new stages

  • Debugging: Investigating issues locally

  • Testing: Validating pipeline logic

  • CI/CD: Running tests without YT cluster access

  • Learning: Understanding framework behavior

Advantages (dev)#

  • ✅ Fast iteration (no network latency)

  • ✅ No YT cluster access required

  • ✅ Easy debugging (files are local)

  • ✅ Free (no cluster resources used)

  • ✅ Works offline

Limitations (dev)#

  • ❌ Not suitable for large datasets (limited by local disk)

  • ❌ Some YT-specific features may differ

  • ❌ Performance characteristics differ from production

Prod Mode#

How It Works (prod)#

Prod mode executes operations on the actual YT cluster:

  • Tables: Stored on YT cluster at specified paths

  • Operations: Executed on YT cluster nodes

  • Code Upload: Code is packaged and uploaded to YT

  • YQL Operations: Executed using YT’s YQL engine

Warning

Cluster Dependencies Required

In prod mode, ytjobs code executes on YT cluster nodes. The Docker image used on the cluster must include the required dependencies; otherwise, supply a custom Docker image. See Cluster Requirements for details.

Configuration (prod)#

Set mode in pipeline config:

# configs/config.yaml
pipeline:
  mode: "prod"
  build_folder: "//tmp/my_pipeline/build"

Required credentials (configs/secrets.env):

YT_PROXY=your-yt-proxy-url
YT_TOKEN=your-yt-token
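
Before a prod run it can help to confirm that both keys are present. The following is a minimal stand-alone check, not part of the framework: parse_env and missing_credentials are hypothetical helpers, and the file is assumed to use plain KEY=VALUE lines.

```python
REQUIRED_KEYS = ("YT_PROXY", "YT_TOKEN")


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_credentials(text: str) -> list:
    """Return the required keys that are absent or empty."""
    values = parse_env(text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    sample = "YT_PROXY=your-yt-proxy-url\n"
    print(missing_credentials(sample))
```

To check a real file, pass its contents, for example `missing_credentials(open("configs/secrets.env").read())`.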

Table Operations (prod)#

Writing tables:

# In prod mode, writes to YT cluster
self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 1, "name": "Alice"}]
)
# Creates: //tmp/my_pipeline/data on YT cluster

Reading tables:

# In prod mode, reads from YT cluster
rows = list(self.deps.yt_client.read_table("//tmp/my_pipeline/data"))
# Reads from: //tmp/my_pipeline/data on YT cluster

Map Operations (prod)#

Map operations run on YT cluster:

  1. Code is uploaded to build_folder

  2. YT creates jobs on cluster nodes

  3. Each job processes a portion of input table

  4. Results are written to output table on cluster

Vanilla Operations (prod)#

Vanilla operations run on YT cluster:

  1. Code is uploaded to build_folder

  2. YT creates job on cluster node

  3. Job executes vanilla.py script

  4. Logs available in YT web UI

YQL Operations (prod)#

YQL operations execute on YT cluster:

  • Uses YT’s distributed YQL engine

  • Handles large datasets efficiently

  • Full YT YQL syntax supported

When to Use Prod Mode#

  • Production: Running production workloads

  • Large Datasets: Processing data that doesn’t fit locally

  • Performance: Workloads that need cluster performance and parallelism

  • Integration: Integrating with other YT-based systems

Advantages (prod)#

  • ✅ Handles large datasets (distributed storage)

  • ✅ High performance (distributed processing)

  • ✅ Scalability (cluster resources)

  • ✅ Production-ready (real YT environment)

Limitations (prod)#

  • ❌ Requires YT cluster access

  • ❌ Slower iteration (network latency)

  • Consumes cluster resources

  • ❌ Harder to debug (remote execution)

Quick Comparison#

Configuration

Dev Mode:

pipeline:
  mode: "dev"

Prod Mode:

pipeline:
  mode: "prod"
  build_folder: "//tmp/my_pipeline/build"

Credentials

Dev Mode:

  • No credentials required

  • Works offline

Prod Mode:

  • Requires configs/secrets.env

  • Must have YT cluster access

Performance

Dev Mode:

  • Fast iteration

  • Limited by local resources

  • Sequential execution

Prod Mode:

  • Distributed processing

  • Scales with cluster size

  • Parallel execution

Debugging

Dev Mode:

  • Files in .dev/ directory

  • Immediate error feedback

  • Easy to inspect

Prod Mode:

  • YT web UI for logs

  • Remote debugging

  • Requires cluster access

Switching Between Modes#

To switch modes, change the mode setting in configs/config.yaml:

# Development
pipeline:
  mode: "dev"

# Production
pipeline:
  mode: "prod"
  build_folder: "//tmp/my_pipeline/build"  # required in prod mode

Note

Same Code, Different Execution

The same code and configuration work in both modes. The framework handles the differences automatically.

Important considerations:

  1. Table paths: Same paths work in both modes (dev mode maps them to .dev/)

  2. Credentials: Prod mode requires secrets.env with YT credentials

  3. Build folder: Prod mode requires build_folder, the YT location the packaged code is uploaded to

  4. Code changes: Dev mode uses local code, prod mode uploads code
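
The build_folder and mode requirements can be checked before a run with a few lines of Python. This sketch assumes the config has already been loaded into a dictionary shaped like configs/config.yaml; validate_pipeline_config is a hypothetical helper, not a framework function.

```python
VALID_MODES = {"dev", "prod"}


def validate_pipeline_config(config: dict) -> list:
    """Return a list of problems found in the `pipeline` section (empty if none)."""
    pipeline = config.get("pipeline", {})
    problems = []
    mode = pipeline.get("mode")
    if mode not in VALID_MODES:
        problems.append(f"pipeline.mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    # build_folder only matters in prod mode, where code is uploaded to the cluster.
    if mode == "prod" and not pipeline.get("build_folder"):
        problems.append("pipeline.build_folder is required in prod mode")
    return problems


if __name__ == "__main__":
    print(validate_pipeline_config({"pipeline": {"mode": "prod"}}))
```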

Leaky Abstractions#

While the framework tries to abstract away differences, some leak through:

File Paths#

Dev mode:

  • Tables stored as .jsonl files

  • Path //tmp/my_pipeline/data becomes .dev/data.jsonl

Prod mode:

  • Tables stored on YT cluster

  • Path //tmp/my_pipeline/data is actual YT path

What to know:

  • Same code works in both modes

  • Path format is the same (//tmp/...)

  • Dev mode automatically maps paths to local files

Operation Execution#

Dev mode:

  • Map operations run sequentially (one job)

  • Limited parallelism

  • Uses local resources

Prod mode:

  • Map operations run in parallel (multiple jobs)

  • Full cluster parallelism

  • Uses cluster resources

What to know:

  • Performance characteristics differ

  • Dev mode may not catch all concurrency issues

  • Validate in prod mode before running production workloads
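
A small simulation shows the kind of issue a single sequential job can hide. It is purely illustrative: how YT actually splits input between jobs is decided by the cluster, and number_rows and run_as_jobs are hypothetical names. The mapper keeps per-job state (a running row counter), so its output changes as soon as the input is split.

```python
def number_rows(rows):
    """A mapper with per-job state: it numbers rows from zero within one job."""
    for index, row in enumerate(rows):
        yield {**row, "row_number": index}


def run_as_jobs(rows, job_count):
    """Split the input into contiguous chunks and run the mapper once per chunk."""
    size = max(1, -(-len(rows) // job_count))  # ceiling division, at least 1
    output = []
    for start in range(0, len(rows), size):
        output.extend(number_rows(rows[start:start + size]))
    return output


rows = [{"id": i} for i in range(4)]
single_job = [r["row_number"] for r in run_as_jobs(rows, 1)]
two_jobs = [r["row_number"] for r in run_as_jobs(rows, 2)]
print(single_job)  # [0, 1, 2, 3]
print(two_jobs)    # [0, 1, 0, 1]
```

With one job the row numbers are unique; with two jobs they repeat. Mappers that treat each row independently avoid this class of difference.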

Code Execution#

Dev mode:

  • Code runs directly from local filesystem

  • No code upload needed

  • Changes are immediately available

Prod mode:

  • Code is packaged and uploaded

  • Must upload before execution

  • Changes require re-upload

What to know:

  • Dev mode is faster for iteration

  • Prod mode requires build_folder configuration

  • Code structure must be compatible with both modes

Error Handling#

Dev mode:

  • Errors show in terminal

  • Stack traces are immediate

  • Easy to debug

Prod mode:

  • Errors in YT web UI

  • Stack traces in operation logs

  • Requires YT access to debug

What to know:

  • Use dev mode for debugging

  • Check YT web UI for prod errors

  • Logs are crucial for prod debugging

Debugging Tips#

Dev Mode Debugging#

  1. Check .dev/ directory: See generated files and tables

  2. Check logs: Operation logs in .dev/ directory

  3. Inspect tables: Open .jsonl files directly

  4. Add print statements: Output appears immediately
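
Because dev tables are plain .jsonl files, a few lines of standard-library Python are enough to inspect them. The helper below (peek_table, a hypothetical name) returns the row count and the first few rows of a table file.

```python
import json
from itertools import islice
from pathlib import Path


def peek_table(path: Path, limit: int = 5) -> tuple:
    """Return (total_row_count, first `limit` rows) for a .jsonl table."""
    with path.open(encoding="utf-8") as fh:
        rows = (json.loads(line) for line in fh if line.strip())
        head = list(islice(rows, limit))
        # Count the remaining rows without keeping them in memory.
        total = len(head) + sum(1 for _ in rows)
    return total, head


if __name__ == "__main__":
    table = Path(".dev/data.jsonl")
    if table.exists():
        count, head = peek_table(table)
        print(f"{count} rows; first rows: {head}")
```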

Prod Mode Debugging#

  1. Check YT web UI: View operations and logs

  2. Use logging: self.logger output appears in YT logs

  3. Check operation status: Monitor in YT web UI

  4. Download results: Download tables for local inspection

Common Issues#

Issue: Tables not found in prod mode#

  • Check that the table paths exist on the YT cluster

  • Verify that the YT credentials are correct

  • Check that the YT proxy URL is reachable

Issue: Code not updating in prod mode#

  • Code is uploaded once per pipeline run

  • Re-run the pipeline to pick up local code changes

  • Check build_folder is correct

Issue: Different behavior in dev vs prod#

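  • Check whether the code relies on YT-specific features that dev mode simulates differently

  • Verify that jobs fit within the cluster resource limits

  • Test both modes with similar data sizes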
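
When tracking down such a difference, one option is to download the prod output and compare it with the dev output row by row. The sketch below compares two tables as multisets, ignoring row order, on the assumption that row order is not guaranteed to match between modes. diff_tables and row_key are hypothetical helpers, and both tables are assumed to fit in memory.

```python
import json
from collections import Counter


def row_key(row: dict) -> str:
    """Serialize a row with sorted keys so equal rows produce equal strings."""
    return json.dumps(row, sort_keys=True)


def diff_tables(dev_rows, prod_rows) -> dict:
    """Compare two tables as multisets of rows, ignoring row order."""
    dev_counts = Counter(row_key(r) for r in dev_rows)
    prod_counts = Counter(row_key(r) for r in prod_rows)
    return {
        "only_in_dev": sorted((dev_counts - prod_counts).elements()),
        "only_in_prod": sorted((prod_counts - dev_counts).elements()),
    }


if __name__ == "__main__":
    dev = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
    prod = [{"name": "Bob", "id": 2}, {"id": 1, "name": "Alice"}]
    print(diff_tables(dev, prod))
```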

Best Practices#

Tip

Development Workflow

  1. Develop and test in dev mode

  2. Validate in prod mode with small dataset

  3. Deploy to production with full dataset

  1. Develop in dev mode: Faster iteration and debugging

  2. Test in prod mode: Validate before production deployment

  3. Use same configs: Keep dev and prod configs similar

  4. Monitor resources: Check resource usage in prod mode

  5. Version control: Track config changes between modes

Warning

Test Before Production

Always test your pipeline in prod mode with a small dataset before running on production data. This helps catch mode-specific issues early.

Next Steps#