# Dev vs Prod Modes

YT Framework supports two execution modes: **dev** (development) and **prod** (production). Understanding the differences and when to use each mode is crucial for effective pipeline development.

## Overview

```{tip}
**Start with Dev Mode**

Always develop and test your pipelines in dev mode first. It's faster, doesn't require YT credentials, and makes debugging easier.
```

- **Dev Mode**: Simulates YT operations locally using the file system. Perfect for development, testing, and debugging.
- **Prod Mode**: Executes operations on the actual YT cluster. Used for production workloads.

Both modes use the same code and configuration, making it easy to develop locally and deploy to production.

```{warning}
**Credentials Required for Prod Mode**

Production mode requires YT credentials in `configs/secrets.env`. Make sure to set up credentials before running in prod mode.
```

## Dev Mode

### How It Works (dev)

Dev mode simulates YT operations using the local file system:

- **Tables**: Stored as `.jsonl` files in the `.dev/` directory
- **Operations**: Executed locally in a subprocess
- **Code Upload**: No-op (code runs directly from the local filesystem)
- **YQL Operations**: Executed using DuckDB for local simulation

### Configuration (dev)

Set the mode in the pipeline config:

```yaml
# configs/config.yaml
pipeline:
  mode: "dev"
```

### Directory Structure (dev)

When running in dev mode, the framework creates a `.dev/` directory:

```plaintext
my_pipeline/
├── .dev/
│   ├── table1.jsonl      # Simulated YT tables
│   ├── table2.jsonl
│   └── operation.log     # Operation logs
├── configs/
├── stages/
└── pipeline.py
```

### Table Operations (dev)

**Writing tables:**

```python
# In dev mode, writes to .dev/table_name.jsonl
self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 1, "name": "Alice"}]
)
# Creates: .dev/data.jsonl
```

**Reading tables:**

```python
# In dev mode, reads from .dev/table_name.jsonl
rows = list(self.deps.yt_client.read_table("//tmp/my_pipeline/data"))
# Reads from: .dev/data.jsonl
```

### Map Operations (dev)

Map operations run locally in a subprocess:

1. Creates a sandbox directory: `.dev/sandbox_<input>-><output>/`
2. Copies the input table to the sandbox
3. Executes the `mapper.py` script
4. Collects output to `.dev/<output>.jsonl`

**Example:**

```plaintext
# Dev mode execution
.dev/sandbox_input->output/
├── input.jsonl
├── source.tar.gz (extracted)
└── operation_wrapper_*.sh
```

### Vanilla Operations (dev)

Vanilla operations run locally in a subprocess:

1. Creates a sandbox directory: `.dev/_sandbox/`
2. Extracts the source archive
3. Executes the `vanilla.py` script
4. Logs output to `.dev/.log`

### YQL Operations (dev)

YQL operations are simulated using DuckDB:

- Joins, filters, aggregations run locally
- Results written to `.dev/` directory
- Full YQL syntax supported

### When to Use Dev Mode

- **Development**: Writing and testing new stages
- **Debugging**: Investigating issues locally
- **Testing**: Validating pipeline logic
- **CI/CD**: Running tests without YT cluster access
- **Learning**: Understanding framework behavior

### Advantages (dev)

- ✅ Fast iteration (no network latency)
- ✅ No YT cluster access required
- ✅ Easy debugging (files are local)
- ✅ Free (no cluster resources used)
- ✅ Works offline

### Limitations (dev)

- ❌ Not suitable for large datasets (limited by local disk)
- ❌ Some YT-specific features may differ
- ❌ Performance characteristics differ from production

## Prod Mode

### How It Works (prod)

Prod mode executes operations on the actual YT cluster:

- **Tables**: Stored on the YT cluster at the specified paths
- **Operations**: Executed on YT cluster nodes
- **Code Upload**: Code is packaged and uploaded to YT
- **YQL Operations**: Executed using YT's YQL engine

```{warning}
**Cluster Dependencies Required**

In prod mode, `ytjobs` code executes on YT cluster nodes. The cluster's Docker image must include the required dependencies, or you must use custom Docker images.
See [Cluster Requirements](configuration/cluster-requirements.md) for details.
```

### Configuration (prod)

Set the mode in the pipeline config:

```yaml
# configs/config.yaml
pipeline:
  mode: "prod"
  build_folder: "//tmp/my_pipeline/build"
```

**Required credentials** (`configs/secrets.env`):

```bash
YT_PROXY=your-yt-proxy-url
YT_TOKEN=your-yt-token
```

### Table Operations (prod)

**Writing tables:**

```python
# In prod mode, writes to YT cluster
self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 1, "name": "Alice"}]
)
# Creates: //tmp/my_pipeline/data on YT cluster
```

**Reading tables:**

```python
# In prod mode, reads from YT cluster
rows = list(self.deps.yt_client.read_table("//tmp/my_pipeline/data"))
# Reads from: //tmp/my_pipeline/data on YT cluster
```

### Map Operations (prod)

Map operations run on the YT cluster:

1. Code is uploaded to `build_folder`
2. YT creates jobs on cluster nodes
3. Each job processes a portion of the input table
4. Results are written to the output table on the cluster

### Vanilla Operations (prod)

Vanilla operations run on the YT cluster:

1. Code is uploaded to `build_folder`
2. YT creates a job on a cluster node
3. The job executes the `vanilla.py` script
4. Logs are available in the YT web UI

### YQL Operations (prod)

YQL operations execute on the YT cluster:

- Uses YT's distributed YQL engine
- Handles large datasets efficiently
- Full YT YQL syntax supported

### When to Use Prod Mode

- **Production**: Running production workloads
- **Large Datasets**: Processing data that doesn't fit locally
- **Performance**: Workloads that need cluster performance and parallelism
- **Integration**: Integrating with other YT-based systems

### Advantages (prod)

- ✅ Handles large datasets (distributed storage)
- ✅ High performance (distributed processing)
- ✅ Scalability (cluster resources)
- ✅ Production-ready (real YT environment)

### Limitations (prod)

- ❌ Requires YT cluster access
- ❌ Slower iteration (network latency)
- ❌ Costs cluster resources
- ❌ Harder to debug (remote execution)

## Quick Comparison

`````{tab-set}

````{tab-item} Configuration
**Dev Mode:**

```yaml
pipeline:
  mode: "dev"
```

**Prod Mode:**

```yaml
pipeline:
  mode: "prod"
  build_folder: "//tmp/my_pipeline/build"
```
````

````{tab-item} Credentials
**Dev Mode:**

- No credentials required
- Works offline

**Prod Mode:**

- Requires `configs/secrets.env`
- Must have YT cluster access
````

````{tab-item} Performance
**Dev Mode:**

- Fast iteration
- Limited by local resources
- Sequential execution

**Prod Mode:**

- Distributed processing
- Scales with cluster size
- Parallel execution
````

````{tab-item} Debugging
**Dev Mode:**

- Files in `.dev/` directory
- Immediate error feedback
- Easy to inspect

**Prod Mode:**

- YT web UI for logs
- Remote debugging
- Requires cluster access
````

`````

## Switching Between Modes

Switching between modes is simple: change the `mode` setting.

```yaml
# Development
pipeline:
  mode: "dev"

# Production
pipeline:
  mode: "prod"
```

```{note}
**Same Code, Different Execution**

The same code and configuration work in both modes. The framework handles the differences automatically.
```

**Important considerations:**

1. **Table paths**: Same paths work in both modes (dev mode maps them to `.dev/`)
2. **Credentials**: Prod mode requires `secrets.env` with YT credentials
3. **Build folder**: Prod mode requires `build_folder` for code execution
4. **Code changes**: Dev mode uses local code, prod mode uploads code

## Leaky Abstractions

While the framework tries to abstract away differences, some leak through:

### File Paths

**Dev mode:**

- Tables stored as `.jsonl` files
- Path `//tmp/my_pipeline/data` becomes `.dev/data.jsonl`

**Prod mode:**

- Tables stored on YT cluster
- Path `//tmp/my_pipeline/data` is the actual YT path

**What to know:**

- Same code works in both modes
- Path format is the same (`//tmp/...`)
- Dev mode automatically maps paths to local files

### Operation Execution

**Dev mode:**

- Map operations run sequentially (one job)
- Limited parallelism
- Uses local resources

**Prod mode:**

- Map operations run in parallel (multiple jobs)
- Full cluster parallelism
- Uses cluster resources

**What to know:**

- Performance characteristics differ
- Dev mode may not catch all concurrency issues
- Test in prod mode for production workloads

### Code Execution

**Dev mode:**

- Code runs directly from the local filesystem
- No code upload needed
- Changes are immediately available

**Prod mode:**

- Code is packaged and uploaded
- Must upload before execution
- Changes require re-upload

**What to know:**

- Dev mode is faster for iteration
- Prod mode requires `build_folder` configuration
- Code structure must be compatible with both modes

### Error Handling

**Dev mode:**

- Errors show in the terminal
- Stack traces are immediate
- Easy to debug

**Prod mode:**

- Errors appear in the YT web UI
- Stack traces are in operation logs
- Requires YT access to debug

**What to know:**

- Use dev mode for debugging
- Check the YT web UI for prod errors
- Logs are crucial for prod debugging

## Debugging Tips

### Dev Mode Debugging

1. **Check `.dev/` directory**: See generated files and tables
2. **Check logs**: Operation logs in `.dev/` directory
3. **Inspect tables**: Open `.jsonl` files directly
4. **Add print statements**: Output appears immediately

### Prod Mode Debugging

1. **Check YT web UI**: View operations and logs
2. **Use logging**: `self.logger` output appears in YT logs
3. **Check operation status**: Monitor in YT web UI
4. **Download results**: Download tables for local inspection

### Common Issues

#### Issue: Tables not found in prod mode

- Check that the table paths exist on the YT cluster
- Verify that the YT credentials are correct
- Check that the YT proxy URL is accessible

#### Issue: Code not updating in prod mode

- Code is uploaded once per pipeline run
- Changes require re-running the pipeline
- Check that `build_folder` is correct

#### Issue: Different behavior in dev vs prod

- Check for YT-specific features
- Verify resource limits
- Test with similar data sizes

## Best Practices

```{tip}
**Development Workflow**

1. Develop and test in dev mode
2. Validate in prod mode with a small dataset
3. Deploy to production with the full dataset
```

1. **Develop in dev mode**: Faster iteration and debugging
2. **Test in prod mode**: Validate before production deployment
3. **Use same configs**: Keep dev and prod configs similar
4. **Monitor resources**: Check resource usage in prod mode
5. **Version control**: Track config changes between modes

```{warning}
**Test Before Production**

Always test your pipeline in prod mode with a small dataset before running on production data. This helps catch mode-specific issues early.
```

## Next Steps

- Understand [Cluster Requirements](configuration/cluster-requirements.md) for production mode dependencies
- Learn about [Configuration](configuration/index.md) management
- Explore [Operations](operations/) for different operation types
- Check out [Examples](https://github.com/GregoryKogan/yt-framework/tree/main/examples/) for mode-specific examples
- Review [Troubleshooting](troubleshooting/configuration.md) for mode-specific issues
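
As a concrete picture of the dev-mode path mapping described under Leaky Abstractions (`//tmp/my_pipeline/data` becomes `.dev/data.jsonl`), here is a minimal sketch of how such a mapping could work. It is an illustration only, not the framework's implementation: `dev_table_file`, `write_rows`, and `read_rows` are hypothetical helpers, and the sketch assumes the local file name is simply the last component of the table path.

```python
import json
from pathlib import Path


def dev_table_file(table_path: str, dev_dir: str = ".dev") -> Path:
    """Map a YT-style path to a local JSONL file (hypothetical helper)."""
    # Assumption: only the last path component names the local file.
    name = table_path.rstrip("/").rsplit("/", 1)[-1]
    return Path(dev_dir) / f"{name}.jsonl"


def write_rows(table_path: str, rows: list, dev_dir: str = ".dev") -> Path:
    """Write rows as one JSON object per line, creating dev_dir if needed."""
    target = dev_table_file(table_path, dev_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return target


def read_rows(table_path: str, dev_dir: str = ".dev") -> list:
    """Read every non-empty line of the JSONL file back into a dict."""
    with dev_table_file(table_path, dev_dir).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Under this assumed mapping, two tables that share a final path component (for example `//tmp/a/data` and `//tmp/b/data`) would land in the same local file. This page does not say how the framework treats that case, so it is worth checking before relying on such paths in dev mode.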