Troubleshooting#

Symptoms grouped by area. Start from the section that matches where the failure shows up (driver config vs job logs).

Sections#

  • Pipeline — discovery, ordering, enabled_stages

  • Operations — map, vanilla, YQL, S3, checkpoints, Docker

  • Configuration — dev vs prod, secrets, paths

  • Debugging — logging, reproducing, asking for help

How failures cluster#

  1. YAML / OmegaConf — missing keys, wrong types, bad table paths

  2. Filesystem — missing stage.py, wrong working directory

  3. Operations — mapper stderr, YQL pragma limits, S3 403

  4. Environment — Python version, missing pip package in cluster image

  5. Mode mismatch — code that only works locally, or vice versa

Fast checks#

Symptom

First look

Pipeline exits before stages

configs/config.yaml, python pipeline.py cwd

Stage raises immediately

Stage config.yaml, imports in stage.py

Cluster job fails

Task stderr in YT UI, Docker image packages

Works in dev, fails in prod

Credentials, build_folder, upload size

If you are stuck#

  1. Reproduce in dev with the smallest table or stdin fixture.

  2. Capture full stderr from the failing task (prod) or .dev logs (dev).

  3. Compare cluster image package list against imports in uploaded code.

See also#