Dev vs Prod Modes#

YT Framework supports two execution modes: dev (development) and prod (production). Understanding the differences and when to use each mode is crucial for effective pipeline development.

Overview#

Tip

Start with Dev Mode

Always develop and test your pipelines in dev mode first. It’s faster, doesn’t require YT credentials, and makes debugging easier.

Dev Mode: Simulates YT operations locally using the file system. Perfect for development, testing, and debugging.
Prod Mode: Executes operations on the actual YT cluster. Used for production workloads.

Both modes use the same code and configuration, making it easy to develop locally and deploy to production.

Warning

Credentials Required for Prod Mode

Production mode requires YT credentials in configs/secrets.env. Make sure to set up credentials before running in prod mode.

Dev Mode#

How It Works (dev)#

Dev mode simulates YT operations using the local file system:

Tables: Stored as .jsonl files in .dev/ directory
Operations: Executed locally using subprocess
Code Upload: No-op (code runs directly from local filesystem)
YQL Operations: Executed using DuckDB for local simulation

Configuration (dev)#

Set mode in pipeline config:

# configs/config.yaml
pipeline:
  mode: "dev"

Directory Structure (dev)#

When running in dev mode, the framework creates a .dev/ directory:

my_pipeline/
├── .dev/
│   ├── table1.jsonl      # Simulated YT tables
│   ├── table2.jsonl
│   └── operation.log     # Operation logs
├── configs/
├── stages/
└── pipeline.py

Table Operations (dev)#

Writing tables:

# In dev mode, writes to .dev/table_name.jsonl
self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 1, "name": "Alice"}]
)
# Creates: .dev/data.jsonl

Reading tables:

# In dev mode, reads from .dev/table_name.jsonl
rows = list(self.deps.yt_client.read_table("//tmp/my_pipeline/data"))
# Reads from: .dev/data.jsonl
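
The path mapping and JSONL storage described above can be sketched as follows. This is a minimal illustration, not the framework's implementation: the class name LocalTableClient and the helper round_trip are hypothetical, and the rule that only the last path component names the local file is an assumption inferred from the //tmp/my_pipeline/data -> .dev/data.jsonl example.

```python
import json
import tempfile
from pathlib import Path


class LocalTableClient:
    """Hypothetical stand-in for a dev-mode table client (not the framework's class)."""

    def __init__(self, dev_dir):
        self.dev_dir = Path(dev_dir)

    def _local_path(self, table_path):
        # Assumption: only the last path component names the local file,
        # as in //tmp/my_pipeline/data -> .dev/data.jsonl.
        name = table_path.rstrip("/").rsplit("/", 1)[-1]
        return self.dev_dir / f"{name}.jsonl"

    def write_table(self, table_path, rows):
        # One JSON object per line (JSONL).
        path = self._local_path(table_path)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

    def read_table(self, table_path):
        # Yield rows lazily, skipping blank lines.
        with self._local_path(table_path).open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)


def round_trip(rows):
    """Write rows to a temporary .dev/ directory and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        client = LocalTableClient(Path(tmp) / ".dev")
        client.write_table("//tmp/my_pipeline/data", rows)
        return list(client.read_table("//tmp/my_pipeline/data"))
```

Under this assumed mapping, two tables that share a final path component would land in the same local file, so distinct table names are worth keeping in dev mode.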

Map Operations (dev)#

Map operations run locally using subprocess:

Creates sandbox directory: .dev/sandbox_<input>-><output>/
Copies input table to sandbox
Executes mapper.py script
Collects output to .dev/<output>.jsonl

Example:

# Dev mode execution
.dev/sandbox_input->output/
├── input.jsonl
├── code.tar.gz (extracted)
└── operation_wrapper_*.sh
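
The steps above can be approximated with a short sketch that feeds JSONL rows to a mapper running in a subprocess and collects its output. The function run_local_map and the inline MAPPER_SOURCE are hypothetical; the real framework runs mapper.py through a wrapper script inside the sandbox, and the stdin/stdout protocol used here is an assumption.

```python
import json
import subprocess
import sys

# Hypothetical mapper: reads JSONL rows from stdin, writes JSONL rows to stdout.
MAPPER_SOURCE = """
import json, sys
for line in sys.stdin:
    row = json.loads(line)
    row["name_upper"] = row["name"].upper()
    print(json.dumps(row))
"""


def run_local_map(rows):
    """Run the mapper in a child Python process and return its output rows."""
    stdin_data = "".join(json.dumps(row) + "\n" for row in rows)
    result = subprocess.run(
        [sys.executable, "-c", MAPPER_SOURCE],
        input=stdin_data,
        capture_output=True,
        text=True,
        check=True,  # raise if the mapper exits with a non-zero code
    )
    return [json.loads(line) for line in result.stdout.splitlines() if line.strip()]
```

Because the mapper runs as a single local process, all rows pass through one job, which matches the sequential behavior described for dev mode.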

Vanilla Operations (dev)#

Vanilla operations run locally using subprocess:

Creates sandbox directory: .dev/<stage_name>_sandbox/
Extracts code archive
Executes vanilla.py script
Logs output to .dev/<stage_name>.log
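
A rough sketch of the last two steps: run a script in a subprocess and capture its output in a per-stage log file. The function run_vanilla is hypothetical, and merging stderr into the same log is an assumption rather than documented framework behavior.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_vanilla(stage_name, script_source, dev_dir):
    """Run script_source in a child process, logging to <dev_dir>/<stage_name>.log."""
    dev_dir = Path(dev_dir)
    dev_dir.mkdir(parents=True, exist_ok=True)
    log_path = dev_dir / f"{stage_name}.log"
    with log_path.open("w", encoding="utf-8") as log:
        result = subprocess.run(
            [sys.executable, "-c", script_source],
            stdout=log,
            stderr=subprocess.STDOUT,  # assumption: stderr shares the log file
        )
    return result.returncode, log_path
```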

YQL Operations (dev)#

YQL operations are simulated using DuckDB:

Joins, filters, aggregations run locally
Results written to .dev/ directory
Full YQL syntax supported

When to Use Dev Mode#

Development: Writing and testing new stages
Debugging: Investigating issues locally
Testing: Validating pipeline logic
CI/CD: Running tests without YT cluster access
Learning: Understanding framework behavior

Advantages (dev)#

✅ Fast iteration (no network latency)
✅ No YT cluster access required
✅ Easy debugging (files are local)
✅ Free (no cluster resources used)
✅ Works offline

Limitations (dev)#

❌ Not suitable for large datasets (limited by local disk)
❌ Some YT-specific features may differ
❌ Performance characteristics differ from production

Prod Mode#

How It Works (prod)#

Prod mode executes operations on the actual YT cluster:

Tables: Stored on YT cluster at specified paths
Operations: Executed on YT cluster nodes
Code Upload: Code is packaged and uploaded to YT
YQL Operations: Executed using YT’s YQL engine

Warning

Cluster Dependencies Required

In prod mode, ytjobs code executes on YT cluster nodes. The cluster’s Docker image must include the required dependencies, or you must use a custom Docker image. See Cluster Requirements for details.

Configuration (prod)#

Set mode in pipeline config:

# configs/config.yaml
pipeline:
  mode: "prod"
  build_folder: "//tmp/my_pipeline/build"

Required credentials (configs/secrets.env):

YT_PROXY=your-yt-proxy-url
YT_TOKEN=your-yt-token
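
Before a prod run it can help to confirm that both variables are actually set. The sketch below parses a KEY=VALUE file and reports missing entries. The helpers load_secrets and missing_credentials are hypothetical, and the parsing rules (blank lines and # comments ignored, no quoting support) are assumptions, not the framework's loader.

```python
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("YT_PROXY", "YT_TOKEN")


def load_secrets(path):
    """Parse simple KEY=VALUE lines; ignore blanks, comments, and malformed lines."""
    secrets = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        secrets[key.strip()] = value.strip()
    return secrets


def missing_credentials(secrets):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not secrets.get(key)]
```

An empty list from missing_credentials(load_secrets("configs/secrets.env")) would indicate that both entries are present; it does not verify that the token is valid.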

Table Operations (prod)#

Writing tables:

# In prod mode, writes to YT cluster
self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 1, "name": "Alice"}]
)
# Creates: //tmp/my_pipeline/data on YT cluster

Reading tables:

# In prod mode, reads from YT cluster
rows = list(self.deps.yt_client.read_table("//tmp/my_pipeline/data"))
# Reads from: //tmp/my_pipeline/data on YT cluster

Map Operations (prod)#

Map operations run on YT cluster:

Code is uploaded to build_folder
YT creates jobs on cluster nodes
Each job processes a portion of input table
Results are written to output table on cluster

Vanilla Operations (prod)#

Vanilla operations run on YT cluster:

Code is uploaded to build_folder
YT creates job on cluster node
Job executes vanilla.py script
Logs available in YT web UI

YQL Operations (prod)#

YQL operations execute on YT cluster:

Uses YT’s distributed YQL engine
Handles large datasets efficiently
Full YT YQL syntax supported

When to Use Prod Mode#

Production: Running production workloads
Large Datasets: Processing data that doesn’t fit locally
Performance: Workloads that need cluster performance and parallelism
Integration: Integrating with other YT-based systems

Advantages (prod)#

✅ Handles large datasets (distributed storage)
✅ High performance (distributed processing)
✅ Scalability (cluster resources)
✅ Production-ready (real YT environment)

Limitations (prod)#

❌ Requires YT cluster access
❌ Slower iteration (network latency)
❌ Consumes cluster resources
❌ Harder to debug (remote execution)

Quick Comparison#

Configuration

Dev Mode:

pipeline:
  mode: "dev"

Prod Mode:

pipeline:
  mode: "prod"
  build_folder: "//tmp/my_pipeline/build"

Credentials

Dev Mode:

No credentials required
Works offline

Prod Mode:

Requires configs/secrets.env
Must have YT cluster access

Performance

Dev Mode:

Fast iteration
Limited by local resources
Sequential execution

Prod Mode:

Distributed processing
Scales with cluster size
Parallel execution

Debugging

Dev Mode:

Files in .dev/ directory
Immediate error feedback
Easy to inspect

Prod Mode:

YT web UI for logs
Remote debugging
Requires cluster access

Switching Between Modes#

To switch between modes, change the mode setting in the pipeline config:

# Development
pipeline:
  mode: "dev"

# Production (build_folder is required in prod mode)
pipeline:
  mode: "prod"
  build_folder: "//tmp/my_pipeline/build"

Note

Same Code, Different Execution

The same code and configuration work in both modes. The framework handles the differences automatically.

Important considerations:

Table paths: Same paths work in both modes (dev mode maps them to .dev/)
Credentials: Prod mode requires secrets.env with YT credentials
Build folder: Prod mode requires build_folder for code execution
Code changes: Dev mode uses local code, prod mode uploads code
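
The considerations above can be turned into a small pre-flight check. This sketch works on a config that has already been loaded into a dictionary. The function validate_pipeline_config is hypothetical, and the error messages are illustrative; only the mode and build_folder keys come from this page.

```python
VALID_MODES = {"dev", "prod"}


def validate_pipeline_config(config):
    """Return a list of human-readable problems; an empty list means the config looks usable."""
    pipeline = config.get("pipeline", {})
    problems = []
    mode = pipeline.get("mode")
    if mode not in VALID_MODES:
        problems.append(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    # Prod mode uploads code, so it needs a destination folder on the cluster.
    if mode == "prod" and not pipeline.get("build_folder"):
        problems.append("prod mode requires build_folder")
    return problems
```

Running such a check before switching to prod catches a missing build_folder locally instead of at upload time.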

Leaky Abstractions#

While the framework tries to abstract away differences, some leak through:

File Paths#

Dev mode:

Tables stored as .jsonl files
Path //tmp/my_pipeline/data becomes .dev/data.jsonl

Prod mode:

Tables stored on YT cluster
Path //tmp/my_pipeline/data is actual YT path

What to know:

Same code works in both modes
Path format is the same (//tmp/...)
Dev mode automatically maps paths to local files

Operation Execution#

Dev mode:

Map operations run sequentially (one job)
Limited parallelism
Uses local resources

Prod mode:

Map operations run in parallel (multiple jobs)
Full cluster parallelism
Uses cluster resources

What to know:

Performance characteristics differ
Dev mode may not catch all concurrency issues
Test in prod mode for production workloads
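
One way this difference can surface is a mapper that keeps state across rows. The sketch below simulates splitting the input across several jobs, each with fresh mapper state. The names make_counting_mapper and run_jobs are hypothetical, and the contiguous chunking is a simplification of how a cluster distributes input.

```python
def make_counting_mapper():
    """Build a mapper that numbers rows using state private to one job."""
    seen = 0

    def mapper(row):
        nonlocal seen
        seen += 1
        return {**row, "row_index": seen}

    return mapper


def run_jobs(rows, job_count):
    """Split rows into contiguous chunks and process each chunk with a fresh mapper."""
    chunk_size = max(1, -(-len(rows) // job_count))  # ceiling division
    output = []
    for start in range(0, len(rows), chunk_size):
        mapper = make_counting_mapper()  # each simulated job starts with clean state
        output.extend(mapper(row) for row in rows[start:start + chunk_size])
    return output


rows = [{"id": i} for i in range(4)]
single_job = [r["row_index"] for r in run_jobs(rows, job_count=1)]
two_jobs = [r["row_index"] for r in run_jobs(rows, job_count=2)]
```

With one job the indices are unique (1 through 4); with two jobs they repeat (1, 2, 1, 2). A mapper that treats each row independently avoids this class of dev/prod mismatch.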

Code Execution#

Dev mode:

Code runs directly from local filesystem
No code upload needed
Changes are immediately available

Prod mode:

Code is packaged and uploaded
Must upload before execution
Changes require re-upload

What to know:

Dev mode is faster for iteration
Prod mode requires build_folder configuration
Code structure must be compatible with both modes

Error Handling#

Dev mode:

Errors show in terminal
Stack traces are immediate
Easy to debug

Prod mode:

Errors in YT web UI
Stack traces in operation logs
Requires YT access to debug

What to know:

Use dev mode for debugging
Check YT web UI for prod errors
Logs are crucial for prod debugging

Debugging Tips#

Dev Mode Debugging#

Check .dev/ directory: See generated files and tables
Check logs: Operation logs in .dev/ directory
Inspect tables: Open .jsonl files directly
Add print statements: Output appears immediately
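
For a quick overview of what a dev run produced, a helper along these lines can list each table with its row count and first row. The function summarize_dev_tables is hypothetical; it assumes every *.jsonl file directly under .dev/ is a table with one JSON object per line.

```python
import json
import tempfile
from pathlib import Path


def summarize_dev_tables(dev_dir=".dev"):
    """Map each table name to its row count and first row (None if empty)."""
    summary = {}
    for path in sorted(Path(dev_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            rows = [json.loads(line) for line in f if line.strip()]
        summary[path.stem] = {"rows": len(rows), "first": rows[0] if rows else None}
    return summary
```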

Prod Mode Debugging#

Check YT web UI: View operations and logs
Use logging: self.logger output appears in YT logs
Check operation status: Monitor in YT web UI
Download results: Download tables for local inspection

Common Issues#

Issue: Tables not found in prod mode#

Check table paths exist on YT cluster
Verify YT credentials are correct
Check YT proxy URL is accessible

Issue: Code not updating in prod mode#

Code is uploaded once per pipeline run
Changes require re-running pipeline
Check build_folder is correct

Issue: Different behavior in dev vs prod#

Check for YT-specific features
Verify resource limits
Test with similar data sizes

Best Practices#

Tip

Development Workflow

Develop and test in dev mode
Validate in prod mode with small dataset
Deploy to production with full dataset

Develop in dev mode: Faster iteration and debugging
Test in prod mode: Validate before production deployment
Use same configs: Keep dev and prod configs similar
Monitor resources: Check resource usage in prod mode
Version control: Track config changes between modes

Warning

Test Before Production

Always test your pipeline in prod mode with a small dataset before running on production data. This helps catch mode-specific issues early.

Next Steps#

Understand Cluster Requirements for production mode dependencies
Learn about Configuration management
Explore Operations for different operation types
Check out Examples for mode-specific examples
Review Troubleshooting for mode-specific issues
