# Dev vs prod modes

The pipeline `mode` field is either `dev` or `prod`. Same Python and YAML shape; execution differs (local files vs YT cluster).

## Overview

```{tip}
**Start in dev**

No `secrets.env` requirement, fast feedback, artifacts under `.dev/`.
```

- **Dev**: tables and many operations are simulated on disk; good for development and CI without a cluster.
- **Prod**: real YT operations and uploads; needs credentials and a compatible cluster image.

```{warning}
**Prod needs credentials**

Set `YT_PROXY` and `YT_TOKEN` in `configs/secrets.env` before switching to `prod`.
```

## Dev mode

### Behavior

- **Tables**: JSONL files under `.dev/`, keyed from logical YT-style paths.
- **Map / vanilla style jobs**: local subprocess plus a sandbox directory under `.dev/`.
- **Code upload**: skipped; Python runs from your working tree.
- **YQL**: translated through DuckDB where the dev client supports it (not identical to cluster YQL in every edge case).

### Config

```yaml
# configs/config.yaml
pipeline:
  mode: "dev"
```

### Layout after a run

```text
my_pipeline/
├── .dev/
│   ├── table1.jsonl
│   ├── table2.jsonl
│   └── operation.log
├── configs/
├── stages/
└── pipeline.py
```

### Tables

Write:

```python
# Writes .dev/data.jsonl for logical path //tmp/my_pipeline/data
self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 1, "name": "Alice"}],
)
```

Append without truncating (same keyword as prod when the target already exists):

```python
self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 2, "name": "Bob"}],
    append=True,
)
```

Read:

```python
rows = list(self.deps.yt_client.read_table("//tmp/my_pipeline/data"))
```

### Map (dev)

Typical flow:

1. Sandbox: `.dev/sandbox_->/` (exact name may vary by config).
2. Copy or link input JSONL into the sandbox.
3. Run the mapper entrypoint.
4. Mapper stdout becomes `.dev/.jsonl`, or appends when the map op sets `append: true`.
See [Map operations — Append output](operations/map.md) (`append: true` section).

Command-mode mappers only (string commands). TypedJob map legs run on the cluster in prod.

### MapReduce (dev)

Typical flow:

1. Sandbox: `.dev/sandbox_mr_->/` with `input.jsonl`, `intermediate.jsonl`, and `output.jsonl`.
2. Copy input JSONL into the sandbox and upload file dependencies (same as map).
3. Run the mapper command as a subprocess; stdout becomes `intermediate.jsonl`.
4. Sort intermediate rows by `sort_by` when set, otherwise by `reduce_by` (loads the JSONL into memory; for small local fixtures only).
5. Run the reducer command; stdout becomes the output table at `.dev/.jsonl`.
6. Stderr for each leg: `.dev/_mapper.log` and `_reducer.log`.

String commands only; TypedJob MapReduce legs are prod-only (same rule as map). Dev runs one mapper and one reducer process (no shuffle partitions). For command-mode reducers that expect sorted keys, dev sorting matches what the cluster provides after shuffle.

### Reduce (dev)

Typical flow:

1. Sandbox: `.dev/sandbox_reduce_->/`.
2. Copy input JSONL, upload dependencies, auto-sort rows by `sort_by` when set in config, otherwise `reduce_by` (in-memory; small fixtures only).
3. Run the reducer subprocess; stdout becomes `.dev/.jsonl`.
4. Stderr: `.dev/_reducer.log`.

String commands only. Dev auto-sorts before the reducer so you do not need a separate `run_sort` stage locally. In prod, the input table must already be sorted by `reduce_by` (or run sort first).

### Vanilla (dev)

1. Sandbox under `.dev/_sandbox/` (name depends on stage).
2. Extract the uploaded archive layout locally.
3. Run `vanilla.py` (or configured entry).
4. Stdout/stderr captured to `.dev/.log` (see operations docs for exact file names).

### YQL (dev)

Runs through the dev client’s DuckDB-backed path for supported statements. Treat results as representative, not a full YT SQL conformance suite.

### When dev mode is enough

- Writing stages and unit-style checks.
- Debugging mapper I/O with small fixtures.
- CI that should not depend on YT network.

### Tradeoffs (dev)

**Pros**

- Fast edit-run cycles.
- No cluster account required for basic flows.
- Easy to inspect `.jsonl` and logs on disk.
- Works offline for many pipelines.

**Cons**

- Dataset size bounded by your machine.
- Parallelism and timing differ from prod.
- Rare YT-only behavior may not appear until prod.

## Prod mode

### Behavior

- **Tables**: real Cypress paths on YT.
- **Operations**: cluster jobs with the resources you request.
- **Upload**: framework packages code to `build_folder` before starting jobs.
- **YQL**: cluster YQL engine.

```{warning}
**Image must match imports**

Job code imports `ytjobs` and your own modules. The Docker image for those jobs must ship matching Python deps. See [Cluster requirements](configuration/cluster-requirements.md).
```

### Config

```yaml
# configs/config.yaml
pipeline:
  mode: "prod"
  build_folder: "//tmp/my_pipeline/build"
```

`configs/secrets.env`:

```bash
YT_PROXY=your-yt-proxy-url
YT_TOKEN=your-yt-token
```

### Tables (prod)

```python
self.deps.yt_client.write_table(
    table_path="//tmp/my_pipeline/data",
    rows=[{"id": 1, "name": "Alice"}],
)
```

```python
rows = list(self.deps.yt_client.read_table("//tmp/my_pipeline/data"))
```

### Map (prod)

1. Upload bundle to `build_folder`.
2. YT schedules tasks over input chunks.
3. Reducers (if any) follow map semantics you configured.
4. Output lands in the configured output table.

### Vanilla (prod)

Upload, single or few cluster tasks, logs in YT UI.

### YQL (prod)

Distributed engine, cluster-sized inputs.

### When you need prod

- Production schedules.
- Data larger than fits comfortably on a laptop disk.
- Real concurrency and YT-native features.

### Tradeoffs (prod)

**Pros**

- Scales with cluster storage and CPU.
- Matches how batch jobs actually run in YT.

**Cons**

- Needs credentials and network.
- Slower iteration than dev.
- Debugging means YT logs and UI, not only local files.

## Quick comparison

| Topic | Dev | Prod |
|-------|-----|------|
| Config snippet | `pipeline.mode: "dev"` | `pipeline.mode: "prod"` plus `build_folder` when uploading |
| Credentials | No YT `secrets.env` for basic flows | `YT_PROXY` / `YT_TOKEN` required |
| Throughput | One machine, subprocess-style map/vanilla | Cluster scheduling and distributed tables |
| Debugging | `.dev/*.jsonl`, local stderr | YT operation UI, remote stderr |

## Switching modes

Change one field:

```yaml
pipeline:
  mode: "dev"  # or "prod"
```

```{note}
**Same repo, different backend**

The framework picks dev vs prod implementations from `mode`; your stage classes stay the same.
```

Checklist when going to prod:

1. Logical table paths stay the same string format; dev maps them to files.
2. `secrets.env` exists and points at the right cluster.
3. `build_folder` is set and writable for your service user.
4. Docker image includes everything imported inside uploaded job code.

## Where behavior diverges

### Paths

- Dev: `//tmp/.../name` maps to `.dev/name.jsonl` (see client implementation for exact mapping rules).
- Prod: the same string is a Cypress path.

### Parallelism

- Dev map runs are closer to “one local subprocess story” than thousands of tiny tasks.
- Prod uses YT scheduling; race conditions that never show up locally can appear under load.

### Code freshness

- Dev reads your tree directly.
- Prod needs a successful upload each run; if you change only local files, rerun the pipeline to refresh the bundle.

### Errors

- Dev: tracebacks in your terminal.
- Prod: fetch stderr and system logs from YT for the failing operation.

## Debugging

### Dev

1. List `.dev/` after the stage runs.
2. Open the JSONL you think should have changed.
3. Read `operation.log` and stage logs next to it.
4. Print debugging is fine; you see it immediately.

### Prod

1. Open the operation in the YT UI and read stderr.
2. Use `self.logger` consistently; it ends up in the same places operations already aggregate logs.
3. For stubborn issues, reproduce with a tiny input table, then widen.

### Common symptoms

**Table missing in prod**

- Path typo or table never created in that Cypress tree.
- Credentials or proxy pointing at the wrong cluster.

**Code changes ignored in prod**

- You did not re-run the pipeline after editing sources, or upload failed silently earlier (check logs).

**Different results dev vs prod**

- DuckDB vs cluster SQL differences for YQL-heavy stages.
- Resource limits killing tasks only on the cluster.

## Workflow suggestion

```{tip}
**Smoke in prod early**

After dev passes, run prod once on a small slice of real schema before full-scale backfill jobs.
```

1. Implement in dev.
2. Promote to prod with a narrow date range or row limit.
3. Only then open the floodgates.

## Next steps

- [Cluster requirements](configuration/cluster-requirements.md)
- [Configuration](configuration/index.md)
- [Operations](operations/index.md)
- [Examples](https://github.com/GregoryKogan/yt-framework/tree/main/examples/)
- [Troubleshooting: configuration](troubleshooting/configuration.md)
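
As a supplement to the Reduce (dev) and MapReduce (dev) flows described earlier, the in-memory sort that dev mode runs before a reducer can be pictured roughly as follows. This is a minimal sketch in plain Python, not the framework's implementation: `sort_rows_for_reducer` and its parameters are hypothetical names chosen for illustration, and it assumes every row carries all key columns with mutually comparable values.

```python
import json
from typing import Any, Dict, Iterable, List, Optional, Sequence


def sort_rows_for_reducer(
    jsonl_lines: Iterable[str],
    reduce_by: Sequence[str],
    sort_by: Optional[Sequence[str]] = None,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: load JSONL rows and sort them in memory.

    Uses sort_by when it is set, otherwise falls back to reduce_by,
    mirroring the rule stated in the dev-mode flow.
    """
    keys = sort_by if sort_by else reduce_by
    # All rows are materialized at once, hence "small fixtures only".
    rows = [json.loads(line) for line in jsonl_lines if line.strip()]
    # sorted() is stable: rows with equal keys keep their input order.
    return sorted(rows, key=lambda row: tuple(row[k] for k in keys))


lines = [
    '{"user": "bob", "ts": 2}',
    '{"user": "alice", "ts": 5}',
    '{"user": "bob", "ts": 1}',
]

# Only reduce_by is set: rows are grouped by user, input order kept within a group.
by_user = sort_rows_for_reducer(lines, reduce_by=["user"])

# sort_by takes precedence and also orders rows inside each user group.
by_user_ts = sort_rows_for_reducer(lines, reduce_by=["user"], sort_by=["user", "ts"])
```

Because the whole input is held in a list before sorting, a sketch like this only suits small local fixtures; in prod the input table has to arrive already sorted by `reduce_by`.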