Docker Support#

YT Framework supports custom Docker images for operations that require special dependencies, GPU support, or custom environments.

Note

When Custom Docker is Required

Custom Docker images are essential if your YT cluster’s default Docker image doesn’t include the dependencies required by ytjobs (Python 3.11+, ytsaurus-client, boto3, omegaconf). See Cluster Requirements for details about cluster dependencies and when to use custom Docker images.

Overview#

Custom Docker images allow you to:

Install custom dependencies
Use GPU-enabled environments
Customize the execution environment
Ensure consistent environments across operations
Ensure required ytjobs dependencies are available (if the default cluster image lacks them)

Key points:

Specify the Docker image in the operation config
The image must be compatible with the YT cluster (linux/amd64)
GPU support requires a GPU-enabled image
Docker authentication is supported for private registries
Custom images resolve cluster dependency issues when the default cluster image lacks required packages

When to Use Custom Docker#

Cluster Dependencies#

If your YT cluster’s default Docker image doesn’t include the required ytjobs dependencies (Python 3.11+, ytsaurus-client, boto3, omegaconf), you must use a custom Docker image. This is the most common reason for using one.

See Cluster Requirements for complete details about required dependencies.

GPU Workloads#

For GPU processing, you need a GPU-enabled Docker image:

client:
  operations:
    map:
      resources:
        docker_image: nvidia/cuda:11.8.0-runtime-ubuntu22.04
        gpu_limit: 1
        memory_limit_gb: 16

Custom Dependencies#

For operations requiring specific libraries or tools:

client:
  operations:
    vanilla:
      resources:
        docker_image: my-registry/my-custom-image:latest
        memory_limit_gb: 4

Consistent Environments#

For reproducible environments across teams:

client:
  operations:
    map:
      resources:
        docker_image: my-registry/standard-python:3.11
        memory_limit_gb: 4

Creating Docker Images#

Basic Dockerfile#

Create a Dockerfile in your pipeline or stage directory:

# Build for linux/amd64 platform (required for YT cluster compatibility)
FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies (quoted so the shell does not treat ">" as a redirect)
RUN pip install --no-cache-dir \
    "numpy>=1.20.0" \
    "pandas>=1.3.0"

WORKDIR /app

Platform Requirements#

Important: The YT cluster requires the linux/amd64 platform. To build an image for that platform and load it into the local Docker daemon:

# Build for correct platform
docker buildx build --platform linux/amd64 --tag my-image:latest --load .

To push the image to a registry instead of loading it locally, replace --load with --push:

docker buildx build --platform linux/amd64 --tag my-image:latest --push .
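To confirm that a locally built image targets the expected platform before pushing it, its OS and architecture can be read with `docker image inspect`. The sketch below is a minimal illustration and not part of YT Framework: the helper names `inspect_command`, `is_cluster_compatible`, and `image_platform` are hypothetical, and it assumes the Docker CLI is available on PATH.

```python
import subprocess

REQUIRED_PLATFORM = "linux/amd64"  # platform the YT cluster requires, per this section


def inspect_command(image: str) -> list[str]:
    """Build the docker CLI call that prints '<os>/<architecture>' for a local image."""
    return ["docker", "image", "inspect", "--format", "{{.Os}}/{{.Architecture}}", image]


def is_cluster_compatible(platform: str) -> bool:
    """Return True if the reported platform string matches the required one."""
    return platform.strip() == REQUIRED_PLATFORM


def image_platform(image: str) -> str:
    """Run docker and return the platform string (needs a running Docker daemon)."""
    result = subprocess.run(
        inspect_command(image), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

For an image built with the commands above, `is_cluster_compatible(image_platform("my-image:latest"))` is expected to be true.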

GPU Dockerfile#

For GPU workloads:

# Use NVIDIA CUDA base image
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

# Install Python and pip (on Ubuntu 22.04, pip3 belongs to the default python3, which is 3.10, not python3.11)
RUN apt-get update && apt-get install -y \
    python3.11 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Install GPU-enabled libraries (quoted so the shell does not treat ">" as a redirect)
RUN pip3 install --no-cache-dir \
    "torch>=2.0.0" \
    "torchvision>=0.15.0"

WORKDIR /app

Note: GPU images are larger and take longer to pull.

Minimal Dockerfile#

For simple operations:

FROM python:3.11-slim

# Install only what you need
RUN pip install --no-cache-dir omegaconf

WORKDIR /app

Configuration#

Basic Configuration#

Specify the Docker image in the operation config:

# stages/my_stage/config.yaml
client:
  operations:
    map:
      resources:
        docker_image: my-registry/my-image:latest
        pool: default
        memory_limit_gb: 4
        cpu_limit: 2

Docker Image Location#

Docker images can be:

Public registry: python:3.11-slim, nvidia/cuda:11.8.0-runtime-ubuntu22.04
Private registry: my-registry/my-image:latest
YT registry: //path/to/image (if using YT’s Docker registry)

GPU Configuration#

For GPU workloads:

client:
  operations:
    map:
      resources:
        docker_image: nvidia/cuda:11.8.0-runtime-ubuntu22.04
        gpu_limit: 1              # Request 1 GPU
        memory_limit_gb: 16       # More memory for GPU workloads
        cpu_limit: 4

GPU requirements:

GPU-enabled Docker image
gpu_limit set to 1 or higher
Sufficient memory (GPU workloads need more)
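These requirements can be checked before an operation is submitted. The following is a minimal sketch over a plain dictionary that mirrors the `resources` block; `gpu_config_problems` is a hypothetical helper, not part of YT Framework. It does not try to decide whether the image itself is GPU-enabled, since that cannot be inferred reliably from the image name.

```python
def gpu_config_problems(resources: dict) -> list[str]:
    """Return human-readable problems found in a GPU `resources` block."""
    problems = []
    if not resources.get("docker_image"):
        problems.append("docker_image is not set (a GPU-enabled image is required)")
    gpu_limit = resources.get("gpu_limit", 0)
    if not isinstance(gpu_limit, int) or gpu_limit < 1:
        problems.append("gpu_limit must be an integer of 1 or higher")
    if "memory_limit_gb" not in resources:
        problems.append("memory_limit_gb is not set (GPU workloads usually need more memory)")
    return problems


# Values taken from the GPU configuration example in this section.
resources = {
    "docker_image": "nvidia/cuda:11.8.0-runtime-ubuntu22.04",
    "gpu_limit": 1,
    "memory_limit_gb": 16,
    "cpu_limit": 4,
}
print(gpu_config_problems(resources))  # prints []
```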

Docker Authentication#

For private registries, configure Docker authentication through environment variables in secrets.env.

Authentication Configuration#

Add Docker credentials to configs/secrets.env:

# configs/secrets.env
DOCKER_AUTH_USERNAME=myuser
DOCKER_AUTH_PASSWORD=mypassword

The framework automatically uses these credentials when a Docker image is specified in the operation config:

client:
  operations:
    map:
      resources:
        docker_image: my-registry/private-image:latest
        # Docker auth is automatically loaded from secrets.env

Note: Docker authentication is used only when all three values are present: docker_image in the operation config, and both DOCKER_AUTH_USERNAME and DOCKER_AUTH_PASSWORD in secrets.env.
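The condition in this note can be expressed as a small predicate. The sketch only illustrates the rule and is not the framework's implementation: `should_use_docker_auth` is a hypothetical name, the `env` mapping stands in for the values loaded from secrets.env, and treating empty strings as absent is an assumption.

```python
from typing import Mapping, Optional


def should_use_docker_auth(docker_image: Optional[str], env: Mapping[str, str]) -> bool:
    """True only when the image and both credentials are present and non-empty."""
    return bool(
        docker_image
        and env.get("DOCKER_AUTH_USERNAME")
        and env.get("DOCKER_AUTH_PASSWORD")
    )


secrets = {"DOCKER_AUTH_USERNAME": "myuser", "DOCKER_AUTH_PASSWORD": "mypassword"}
print(should_use_docker_auth("my-registry/private-image:latest", secrets))  # True
print(should_use_docker_auth("my-registry/private-image:latest", {}))       # False
print(should_use_docker_auth(None, secrets))                                # False
```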

Complete Example#

Dockerfile#

# Build for linux/amd64 platform
FROM python:3.11-slim

# Install system tools
RUN apt-get update && apt-get install -y \
    cowsay \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
RUN pip install --no-cache-dir \
    omegaconf \
    botocore \
    boto3

# Make cowsay available
RUN ln -sf /usr/games/cowsay /usr/local/bin/cowsay

WORKDIR /app

Stage Configuration#

# stages/run_in_docker/config.yaml
client:
  operations:
    vanilla:
      resources:
        docker_image: my-registry/my-image:latest
        pool: default
        memory_limit_gb: 2
        cpu_limit: 1

Stage Code#

# stages/run_in_docker/stage.py
from yt_framework.core.pipeline import DebugContext
from yt_framework.core.stage import BaseStage
from yt_framework.operations.vanilla import run_vanilla

class RunInDockerStage(BaseStage):
    def run(self, debug: DebugContext) -> DebugContext:
        success = run_vanilla(
            context=self.context,
            operation_config=self.config.client.operations.vanilla,
        )
        
        if not success:
            raise RuntimeError("Vanilla operation failed")
        
        return debug

Vanilla Script#

#!/usr/bin/env python3
# stages/run_in_docker/src/vanilla.py
import subprocess
import logging
from ytjobs.logging.logger import get_logger

def main():
    logger = get_logger("docker-example", level=logging.INFO)
    
    # Use custom tool from Docker image
    result = subprocess.run(
        ["cowsay", "Hello from Docker!"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if cowsay exits with a non-zero status
    )
    
    logger.info(result.stdout)

if __name__ == "__main__":
    main()

See Example: 07_custom_docker for a complete example.

Best Practices#

Image Size#

Keep images small:

Use slim base images (python:3.11-slim)
Remove unnecessary packages
Use multi-stage builds if needed
Clean up apt cache

Example:

FROM python:3.11-slim

RUN apt-get update && apt-get install -y \
    build-essential \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

Dependency Management#

Install dependencies in image:

Pre-install common dependencies
Use requirements.txt for stage-specific deps
Pin versions for reproducibility

Example:

FROM python:3.11-slim

# Pre-install common dependencies (quote specifiers; use == instead of >= to pin exact versions)
RUN pip install --no-cache-dir \
    "numpy>=1.20.0" \
    "pandas>=1.3.0"

# Stage-specific deps installed at runtime via requirements.txt
WORKDIR /app

Version Tagging#

Tag images with versions:

# Build with version tag
docker buildx build --platform linux/amd64 \
    --tag my-registry/my-image:v1.2.3 \
    --push .

Use in config:

docker_image: my-registry/my-image:v1.2.3
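A configuration check can flag images that are not pinned to a specific version. The sketch below uses hypothetical helpers (`image_tag` and `is_pinned` are not framework functions) to extract the tag from an image reference. It treats a missing tag as `latest`, which is Docker's default, and counts a digest reference as pinned.

```python
def image_tag(image: str) -> str:
    """Return the tag (or digest) of an image reference, or 'latest' when none is given."""
    if "@" in image:
        # Digest references (name@sha256:...) are already immutable.
        return image.split("@", 1)[1]
    last_segment = image.rsplit("/", 1)[-1]  # skips a ':port' in the registry host
    if ":" in last_segment:
        return last_segment.rsplit(":", 1)[1]
    return "latest"


def is_pinned(image: str) -> bool:
    """True if the reference uses a digest or an explicit tag other than 'latest'."""
    return image_tag(image) != "latest"


print(is_pinned("my-registry/my-image:v1.2.3"))  # True
print(is_pinned("my-registry/my-image:latest"))  # False
print(is_pinned("localhost:5000/my-image"))      # False
```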

Testing Images#

Test images locally:

# Build image
docker buildx build --platform linux/amd64 --tag my-image:test --load .

# Test image
docker run --rm my-image:test python3 -c "import numpy; print(numpy.__version__)"
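The same kind of check can cover the interpreter version and the ytjobs dependencies listed at the top of this page. The script below is a sketch, not a framework tool: `missing_modules`, `python_is_recent_enough`, and `dependency_report` are hypothetical helpers, and the import name `yt` for the ytsaurus-client package is an assumption to verify against your installation.

```python
import importlib.util
import sys

# Import names to look for; "yt" as the import name of ytsaurus-client is an assumption.
REQUIRED_MODULES = ["yt", "boto3", "omegaconf"]
MIN_PYTHON = (3, 11)


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_recent_enough(version=None, minimum=MIN_PYTHON) -> bool:
    """True if the (major, minor) version meets the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= minimum


def dependency_report() -> str:
    """Summarize the interpreter version and any missing modules in one line."""
    missing = missing_modules(REQUIRED_MODULES)
    status = "OK" if python_is_recent_enough() and not missing else "FAILED"
    return f"{status}: Python {sys.version.split()[0]}, missing modules: {missing}"


print(dependency_report())
```

One way to run it against an image without copying it in is to pipe it to the interpreter: `docker run --rm -i my-image:test python3 - < check_deps.py`, where `check_deps.py` is a hypothetical file name for the script.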

Common Patterns#

Python with ML Libraries#

FROM python:3.11-slim

RUN pip install --no-cache-dir \
    "numpy>=1.20.0" \
    "pandas>=1.3.0" \
    "scikit-learn>=1.0.0" \
    "transformers>=4.20.0"

GPU with PyTorch#

FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y \
    python3.11 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

RUN pip3 install --no-cache-dir \
    "torch>=2.0.0" \
    "torchvision>=0.15.0"

Custom Tools#

FROM python:3.11-slim

RUN apt-get update && apt-get install -y \
    ffmpeg \
    imagemagick \
    && rm -rf /var/lib/apt/lists/*

# Install any custom Python dependencies your tools need
RUN pip install --no-cache-dir \
    "your-custom-package>=1.0.0"

Troubleshooting#

Issue: Image not found#

Check image name and tag
Verify image exists in registry
Check Docker authentication

Issue: Platform mismatch#

Build for linux/amd64 platform
Use docker buildx for cross-platform builds

Issue: GPU not available#

Verify GPU-enabled image
Check gpu_limit is set
Verify cluster has GPU nodes

Issue: Slow image pull#

Use smaller base images
Cache layers effectively
Use local registry if possible

Issue: Dependencies missing#

Check image includes required packages
Verify requirements.txt is correct
Review installation logs

Next Steps#

Understand Cluster Requirements for required dependencies
Learn about Checkpoints for model files
Explore Code Upload for code packaging
Check out Example: 07_custom_docker for a complete example