
Managing Training Artifacts

Learn how to handle training outputs, model files, and other artifacts in Trainwave.

Quick Start

# List artifacts from a job
wave storage list j-xyz789
 
# Download artifacts
wave storage download j-xyz789 --output ./results

Artifact Storage

Storage Structure

Trainwave automatically manages artifacts in the following directory structure within your job’s container:

/workspace/
├── artifacts/          # Main artifacts directory
│   ├── models/         # Trained models
│   ├── checkpoints/    # Training checkpoints
│   ├── logs/           # Training logs
│   └── results/        # Evaluation results
├── data/               # Input data
└── src/                # Your source code
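
The layout above does not say whether every subdirectory exists before your code runs, so a training script may want to create them defensively. The following is a minimal sketch, not a Trainwave API: `ensure_artifact_dirs` and `ARTIFACT_SUBDIRS` are hypothetical names, and the subdirectory names are copied from the tree above.

```python
from pathlib import Path

# Subdirectory names taken from the documented layout (assumed to be complete).
ARTIFACT_SUBDIRS = ("models", "checkpoints", "logs", "results")


def ensure_artifact_dirs(root="/workspace/artifacts"):
    """Create any missing artifact subdirectories and return their paths."""
    root = Path(root)
    paths = {}
    for name in ARTIFACT_SUBDIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        paths[name] = path
    return paths
```

Calling `ensure_artifact_dirs()` once at the start of a script makes later `torch.save` or `model.save` calls independent of whether the directories were pre-created.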

Saving Artifacts

Save your training outputs to the appropriate directories:

# PyTorch example
import torch
 
# Save model
torch.save(model.state_dict(), '/workspace/artifacts/models/model.pt')
 
# Save checkpoint
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, '/workspace/artifacts/checkpoints/checkpoint.pt')

# TensorFlow example
import tensorflow as tf
 
# Save model (recent Keras releases require a .keras or .h5 extension)
model.save('/workspace/artifacts/models/model.keras')
 
# Save checkpoint
checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
checkpoint.save('/workspace/artifacts/checkpoints/ckpt')

Artifact Management

CLI Commands

# List artifacts
wave storage list j-xyz789
 
# Download specific artifacts
wave storage download j-xyz789 \
  --include "*.pt" \
  --output ./models
 
# Download all artifacts
wave storage download j-xyz789 \
  --output ./results

Automatic Artifact Collection

Trainwave automatically collects:

  1. Training logs (/workspace/artifacts/logs/)
  2. Model files (/workspace/artifacts/models/)
  3. Metrics and results (/workspace/artifacts/results/)
  4. Environment information
  5. Resource usage statistics
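
Since files under /workspace/artifacts/results/ are collected, one way to make evaluation metrics part of a job's artifacts is to write them there as JSON. This is a sketch under assumptions: `write_metrics` is a hypothetical helper, and the file name `metrics.json` is an illustrative choice, not a name Trainwave requires.

```python
import json
from pathlib import Path


def write_metrics(metrics, results_dir="/workspace/artifacts/results",
                  filename="metrics.json"):
    """Write a dict of metrics as JSON into the results directory."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / filename
    with path.open("w", encoding="utf-8") as f:
        # sort_keys keeps the file stable across runs, which makes diffs readable
        json.dump(metrics, f, indent=2, sort_keys=True)
    return path
```

For example, `write_metrics({"accuracy": 0.91, "loss": 0.27})` at the end of evaluation would leave a single JSON file in the results directory.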

Integration with ML Frameworks

PyTorch

import torch
from pathlib import Path
 
class ModelCheckpoint:
    def __init__(self, model, optimizer, save_dir):
        self.model = model
        self.optimizer = optimizer
        self.save_dir = Path('/workspace/artifacts/checkpoints') / save_dir
        self.save_dir.mkdir(parents=True, exist_ok=True)
 
    def save(self, epoch, loss):
        checkpoint_path = self.save_dir / f'checkpoint_epoch_{epoch}.pt'
        torch.save({
            'epoch': epoch,
            'model_state_dict': self.model.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict(),
            'loss': loss,
        }, checkpoint_path)
 
    def load(self, checkpoint_path):
        checkpoint = torch.load(checkpoint_path)
        self.model.load_state_dict(checkpoint['model_state_dict'])
        self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        return checkpoint['epoch'], checkpoint['loss']
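
To resume training, a script needs to know which checkpoint is the most recent. The sketch below picks the highest-numbered file by parsing the `checkpoint_epoch_{epoch}.pt` names that the `ModelCheckpoint.save` method above produces. `latest_checkpoint` is a hypothetical helper, not part of PyTorch or Trainwave.

```python
import re
from pathlib import Path

# Matches the file names written by ModelCheckpoint.save above.
_CHECKPOINT_RE = re.compile(r"checkpoint_epoch_(\d+)\.pt$")


def latest_checkpoint(save_dir):
    """Return the checkpoint path with the highest epoch, or None if there is none."""
    best_epoch, best_path = -1, None
    for path in Path(save_dir).glob("checkpoint_epoch_*.pt"):
        match = _CHECKPOINT_RE.search(path.name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path
```

Epochs are compared as integers, so `checkpoint_epoch_10.pt` ranks above `checkpoint_epoch_9.pt`, which a plain string sort would get wrong. If the result is not `None`, it can be passed straight to `ModelCheckpoint.load`.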

TensorFlow

import tensorflow as tf
import os
 
class TrainingCallback(tf.keras.callbacks.Callback):
    def __init__(self, checkpoint_dir):
        super().__init__()
        self.checkpoint_dir = os.path.join('/workspace/artifacts/checkpoints', checkpoint_dir)
        os.makedirs(self.checkpoint_dir, exist_ok=True)
 
    def on_epoch_end(self, epoch, logs=None):
        # Recent Keras releases require weights filenames to end in .weights.h5
        checkpoint_path = os.path.join(self.checkpoint_dir, f'checkpoint_epoch_{epoch}.weights.h5')
        self.model.save_weights(checkpoint_path)

Best Practices

Organization

  • Use a consistent directory structure
  • Follow clear naming conventions
  • Separate different types of artifacts

Storage Efficiency

  • Compress large files where possible
  • Implement retention policies to clean up old artifacts
  • Use appropriate file formats (e.g., safetensors over raw pickle for models)
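
A retention policy can be as simple as keeping only the newest few checkpoints. The sketch below assumes the `checkpoint_epoch_{epoch}.pt` naming used in the PyTorch example and deletes everything except the `keep` highest epochs. `prune_checkpoints` is a hypothetical helper, and the default of three is arbitrary.

```python
import re
from pathlib import Path


def prune_checkpoints(checkpoint_dir, keep=3):
    """Delete all but the `keep` highest-epoch checkpoints; return the removed paths."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    pattern = re.compile(r"checkpoint_epoch_(\d+)\.pt$")
    found = []
    for path in Path(checkpoint_dir).glob("checkpoint_epoch_*.pt"):
        match = pattern.search(path.name)
        if match:
            found.append((int(match.group(1)), path))
    found.sort()  # ascending by epoch number
    stale = [path for _, path in found[:-keep]]  # everything except the last `keep`
    for path in stale:
        path.unlink()
    return stale
```

Calling this after each save keeps disk usage bounded. Files that do not match the naming pattern are left untouched.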

Versioning

  • Include version information in artifact filenames or metadata
  • Save the training configuration alongside model weights
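
Both points can be covered by writing a small JSON file next to the weights. In this sketch, `save_run_config`, the `model_{version}.config.json` naming scheme, and the payload layout are all illustrative assumptions rather than a Trainwave convention.

```python
import json
from pathlib import Path


def save_run_config(config, version, models_dir="/workspace/artifacts/models"):
    """Write the training config and a version tag next to the model weights."""
    models_dir = Path(models_dir)
    models_dir.mkdir(parents=True, exist_ok=True)
    path = models_dir / f"model_{version}.config.json"
    payload = {"version": version, "config": config}
    path.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
    return path
```

For example, `save_run_config({"lr": 3e-4, "batch_size": 64}, "v1")` writes `model_v1.config.json`, which can sit beside a weights file saved under a matching name such as `model_v1.pt`.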

Troubleshooting

Storage Space

# Check storage usage
wave storage list j-xyz789
 
# Ensure your hdd_size_mb in trainwave.toml is large enough for your artifacts
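
To estimate how much room artifacts need, a script can total the file sizes under the artifacts directory and compare the result with `hdd_size_mb`. This is a rough sketch: `artifact_size_mb` is a hypothetical helper, and it assumes 1 MB = 1024 × 1024 bytes, which may not match how Trainwave interprets `hdd_size_mb`.

```python
from pathlib import Path


def artifact_size_mb(root="/workspace/artifacts"):
    """Return the combined size of all files under `root`, in MB (1 MB = 1024**2 bytes)."""
    total_bytes = sum(
        path.stat().st_size
        for path in Path(root).rglob("*")
        if path.is_file()  # skip directories
    )
    return total_bytes / (1024 * 1024)
```

Logging this value at the end of each epoch shows how quickly checkpoints accumulate before the disk fills up.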

Missing Artifacts

# Verify artifact paths
wave storage list j-xyz789
 
# Check job logs for save errors
wave jobs logs j-xyz789
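
Missing artifacts are easier to diagnose if the training script checks for them before it exits, so that the problem shows up in the job logs. The following sketch uses a hypothetical helper, `missing_artifacts`, and the caller supplies the expected relative paths.

```python
from pathlib import Path


def missing_artifacts(expected, root="/workspace/artifacts"):
    """Return the expected relative paths that do not exist as files under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]
```

At the end of training, a call such as `missing_artifacts(["models/model.pt", "results/metrics.json"])` returns an empty list when everything was saved. Printing a non-empty result, or raising an error on it, makes the failure visible in the job logs.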

Support

support@trainwave.ai