# Tokenizer artifacts (tarball uploads) **Tokenizer artifacts** are separate from [checkpoint management](checkpoints.md) **model files**. They package a **directory or `.tar.gz`** (tokenizers, small processor bundles, etc.), upload it to a Cypress **file** path, and mount it into map / map-reduce / vanilla jobs. The framework wires **file dependencies** and **environment variables** so `StageBootstrapTypedJob` (or command-mode wrappers) can extract the archive in the sandbox. ## When to use - Reducers or mappers need a **tarball** of assets that are **not** a single PyTorch/HF checkpoint file. - You want idempotent upload (skip if the YT path already exists) and the same pool/resource model as other operations. ## Configuration (`tokenizer_artifact`) Under `client.operations..tokenizer_artifact` (map, map_reduce, reduce, vanilla—where enabled): | Key | Required | Description | |-----|----------|-------------| | `artifact_base` | Yes (to enable) | Cypress **directory** for uploaded tarballs (e.g. `//tmp/pipeline/tokenizers`). | | `artifact_name` | No* | Logical name; defaults from `job.tokenizer_name`, `job.model_name` basename, or `local_artifact_path` filename. | | `local_artifact_path` | No | Local **directory** (packed to temp `.tar.gz`) or existing **`.tar.gz`** to upload if missing in YT. | \*If `artifact_base` is set, a resolvable `artifact_name` is required (directly or via `job` / local path). Example (map): ```yaml client: operations: map: input_table: //tmp/pipeline/in output_table: //tmp/pipeline/out tokenizer_artifact: artifact_base: //tmp/pipeline/tokenizer_artifacts local_artifact_path: /path/to/my_tokenizer_bundle # dir or .tar.gz resources: pool: default ``` ## Sandbox behavior 1. The uploaded file is named `.tar.gz` under `artifact_base`. 2. Workers receive `TOKENIZER_ARTIFACT_FILE`, `TOKENIZER_ARTIFACT_DIR`, and optionally `TOKENIZER_ARTIFACT_NAME`. 3. `StageBootstrapTypedJob` extracts the tarball once per sandbox (see [Environment variables](../reference/environment-variables.md)). ## Relation to checkpoints - **Checkpoints** (`checkpoint:` block): usually a **single model file** mounted as `CHECKPOINT_FILE`. - **Tokenizer artifacts**: **tarball** workflow, separate Cypress path and env vars. You can use both in one stage if needed. ## API - `init_tokenizer_artifact_directory` and helpers: **Tokenizer artifact** in [API Reference](../reference/api.md) (`yt_framework.operations._internal.tokenizer_artifact`). - Exported on `yt_framework.operations` for advanced callers. ## See also - [Map-reduce TypedJob](../operations/map-reduce-typed-jobs.md) - [Cluster requirements](../configuration/cluster-requirements.md)