Managing Training Artifacts
Learn how to handle training outputs, model files, and other artifacts in Trainwave.
Quick Start
# List artifacts from a job
wave storage list j-xyz789
# Download artifacts
wave storage download j-xyz789 --output ./results
Artifact Storage
Storage Structure
Trainwave automatically manages artifacts in the following directory structure within your job’s container:
/workspace/
├── artifacts/       # Main artifacts directory
│   ├── models/      # Trained models
│   ├── checkpoints/ # Training checkpoints
│   ├── logs/        # Training logs
│   └── results/     # Evaluation results
├── data/            # Input data
└── src/             # Your source code
Saving Artifacts
Save your training outputs to the appropriate directories:
# PyTorch example
import torch
# Save model
torch.save(model.state_dict(), '/workspace/artifacts/models/model.pt')
# Save checkpoint
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, '/workspace/artifacts/checkpoints/checkpoint.pt')

# TensorFlow example
import tensorflow as tf
# Save model
model.save('/workspace/artifacts/models/model')
# Save checkpoint
checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
checkpoint.save('/workspace/artifacts/checkpoints/ckpt')
Artifact Management
CLI Commands
# List artifacts
wave storage list j-xyz789
# Download specific artifacts
wave storage download j-xyz789 \
  --include "*.pt" \
  --output ./models
# Download all artifacts
wave storage download j-xyz789 \
  --output ./results
Automatic Artifact Collection
Trainwave automatically collects:
- Training logs (/workspace/artifacts/logs/)
- Model files (/workspace/artifacts/models/)
- Metrics and results (/workspace/artifacts/results/)
- Environment information
- Resource usage statistics
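Anything written under the results directory is picked up by this collection, so evaluation metrics can simply be dumped there as JSON. A small stdlib sketch (the metrics dict and filename are illustrative; the directory is parameterized so the snippet also runs outside a job container):

```python
import json
import tempfile
from pathlib import Path

def save_metrics(metrics, results_dir='/workspace/artifacts/results'):
    """Write a metrics dict as JSON into the collected results directory."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / 'metrics.json'
    path.write_text(json.dumps(metrics, indent=2))
    return path

# Written to a temp dir here; inside a job, use the default path.
path = save_metrics({'accuracy': 0.93, 'loss': 0.21}, tempfile.mkdtemp())
```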
Integration with ML Frameworks
PyTorch
import torch
from pathlib import Path
class ModelCheckpoint:
    def __init__(self, model, optimizer, save_dir):
        self.model = model
        self.optimizer = optimizer
        self.save_dir = Path('/workspace/artifacts/checkpoints') / save_dir
        self.save_dir.mkdir(parents=True, exist_ok=True)

    def save(self, epoch, loss):
        checkpoint_path = self.save_dir / f'checkpoint_epoch_{epoch}.pt'
        torch.save({
            'epoch': epoch,
            'model_state_dict': self.model.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict(),
            'loss': loss,
        }, checkpoint_path)

    def load(self, checkpoint_path):
        checkpoint = torch.load(checkpoint_path)
        self.model.load_state_dict(checkpoint['model_state_dict'])
        self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        return checkpoint['epoch'], checkpoint['loss']
TensorFlow
import tensorflow as tf
import os
class TrainingCallback(tf.keras.callbacks.Callback):
    def __init__(self, checkpoint_dir):
        super().__init__()
        self.checkpoint_dir = os.path.join('/workspace/artifacts/checkpoints', checkpoint_dir)
        os.makedirs(self.checkpoint_dir, exist_ok=True)

    def on_epoch_end(self, epoch, logs=None):
        checkpoint_path = os.path.join(self.checkpoint_dir, f'checkpoint_epoch_{epoch}')
        self.model.save_weights(checkpoint_path)
Best Practices
Organization
- Use consistent directory structure
- Follow clear naming conventions
- Separate different types of artifacts
Storage Efficiency
- Compress large files where possible
- Implement retention policies to clean up old artifacts
- Use appropriate file formats (e.g., safetensors over raw pickle for models)
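As an example of the compression point, older checkpoints can be bundled into a single gzip-compressed tarball with the standard library before retention cleanup (paths are parameterized so the sketch runs anywhere; the checkpoint filename is illustrative):

```python
import tarfile
import tempfile
from pathlib import Path

def archive_checkpoints(checkpoint_dir, archive_path):
    """Bundle a checkpoint directory into one gzip-compressed tarball."""
    checkpoint_dir = Path(checkpoint_dir)
    with tarfile.open(archive_path, 'w:gz') as tar:
        # arcname keeps paths inside the archive relative to the directory name.
        tar.add(checkpoint_dir, arcname=checkpoint_dir.name)
    return Path(archive_path)

# Demo on a throwaway directory with one fake checkpoint file:
ckpt_dir = Path(tempfile.mkdtemp()) / 'checkpoints'
ckpt_dir.mkdir()
(ckpt_dir / 'checkpoint_epoch_1.pt').write_bytes(b'\x00' * 1024)
archive = archive_checkpoints(ckpt_dir, ckpt_dir.parent / 'checkpoints.tar.gz')
```

After archiving, the original per-epoch files can be deleted to reclaim space.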
Versioning
- Include version information in artifact filenames or metadata
- Save the training configuration alongside model weights
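A small sketch of the versioning idea: write a version tag and the training configuration next to the weights so any downloaded model is self-describing (the field names and metadata filename are illustrative, not a Trainwave convention):

```python
import json
import tempfile
from pathlib import Path

def save_run_metadata(model_dir, version, config):
    """Write a version tag and the training config next to the weights."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    path = model_dir / 'metadata.json'
    path.write_text(json.dumps({'version': version, 'config': config}, indent=2))
    return path

# Example metadata for a hypothetical run (temp dir stands in for models/):
meta_path = save_run_metadata(
    tempfile.mkdtemp(), 'v1.2.0', {'lr': 3e-4, 'epochs': 10}
)
```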
Troubleshooting
Storage Space
# Check storage usage
wave storage list j-xyz789
# Ensure your hdd_size_mb in trainwave.toml is large enough for your artifacts
Missing Artifacts
# Verify artifact paths
wave storage list j-xyz789
# Check job logs for save errors
wave jobs logs j-xyz789
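When usage looks too large (or artifacts seem missing), it can also help to total sizes per subdirectory inside the container before relying on the CLI. A stdlib sketch (a hypothetical helper, not part of the wave tooling; the base path is parameterized for local runs):

```python
import tempfile
from pathlib import Path

def artifact_sizes(base='/workspace/artifacts'):
    """Return total bytes per immediate subdirectory of the artifacts root."""
    base = Path(base)
    return {
        sub.name: sum(f.stat().st_size for f in sub.rglob('*') if f.is_file())
        for sub in base.iterdir() if sub.is_dir()
    }

# Demo against a throwaway layout with one 2 KB model file:
root = Path(tempfile.mkdtemp())
(root / 'models').mkdir()
(root / 'models' / 'model.pt').write_bytes(b'\x00' * 2048)
sizes = artifact_sizes(root)
```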