Our current checkpoint system has several limitations; the main ones are detailed under "PyTorch Pickle (Our Current Solution)" below.

## Proposed Solution

I plan to implement a new inference checkpoint system using the safetensors format to address these issues.

### Features

**Secure & Fast Checkpoints**

**Traceability & Metadata**
- Store model metadata (version, training params, etc.)
- Enable inspection of model architecture without loading weights
**Modular Loading**

- Support for lazy loading of specific weights
- Allow partial model loading for distributed inference (see the sketch after this list)
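As a rough illustration, lazy and partial loading could look like the following sketch. The checkpoint path and tensor names are hypothetical; `safe_open` and `get_slice` are part of the safetensors Python API.

```python
from safetensors import safe_open

# Hypothetical checkpoint path and tensor names, for illustration only.
CKPT = "inference-checkpoint.safetensors"

with safe_open(CKPT, framework="pt", device="cpu") as f:
    # Lazy loading: only the requested tensor is read from disk.
    encoder_weight = f.get_tensor("encoder.weight")

    # Partial loading: fetch just the rows this rank owns, without
    # materializing the full (assumed 2D) tensor.
    sl = f.get_slice("decoder.weight")
    rows, _cols = sl.get_shape()
    local_rows = sl[: rows // 2]  # e.g. the first half of the rows
```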
**Backward Compatibility**

- Define "noop"/"identity" operations for new features
- Handle missing tensors in older checkpoints gracefully (see the sketch after this list)
- Maintain compatibility across (most) updates
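A minimal sketch of the missing-tensor case, assuming hypothetical parameter names: when an older checkpoint predates a new layer, fall back to an identity/no-op initialization so the new layer initially changes nothing.

```python
import torch
from safetensors.torch import load_file

def load_with_fallbacks(model: torch.nn.Module, path: str) -> None:
    """Load a safetensors checkpoint, filling tensors absent from older files."""
    state = load_file(path)  # plain dict[str, Tensor]; no code execution
    for name, param in model.state_dict().items():
        if name not in state:
            # Older checkpoint predates this parameter: use an identity/no-op
            # initialization so the new layer has no initial effect.
            state[name] = (
                torch.eye(*param.shape) if param.dim() == 2 else torch.zeros_like(param)
            )
    model.load_state_dict(state)
```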
## Consideration of Alternatives

When selecting a storage format for ML model weights, especially for inference, we need to balance multiple critical factors: security, performance, compatibility, and future-proofing. Let's analyze our options, starting with general-purpose formats before examining ML-specific solutions.
### General-Purpose Storage Formats

#### Arrow

**Strengths:** Industry standard for columnar data; excellent for tabular data processing.

**Limitations:**

- Not designed with neural network weights in mind
- Lacks native support for ML-specific datatypes such as BFloat16
- Optimized for data-analytics workloads rather than ML inference patterns

**For our use case:** Would require custom extensions to handle our ML-specific needs.
#### Cap'n Proto

**Strengths:** Zero-copy capabilities; strong security track record.

**Limitations:**

- No native support for key ML datatypes (Float16/BFloat16)
- Designed for RPC and general serialization rather than ML weight storage
- Implementation complexity exceeds our specific requirements

**For our use case:** Would require workarounds to support modern ML datatypes.
### Numerical Computing Formats

#### NumPy (npy/npz)

**Strengths:** Widespread in scientific computing; relatively simple format.

**Limitations:**

- NPZ archives are susceptible to zip bombs (a security concern)
- No zero-copy capabilities, which hurts loading performance
- Limited datatype support (no BFloat16)
- The internal layout was not designed for partial access patterns

**For our use case:** The performance and security issues make it inadequate for production.
### ML-Specific Formats

#### PyTorch Pickle (Our Current Solution)

**Critical issues:**

- Security vulnerabilities that allow arbitrary code execution
- Performance bottlenecks: 76x slower on CPU and 2x slower on GPU than alternatives
- No support for metadata inspection without loading the full model
- Backward compatibility is difficult to implement without loading the model

**For our use case:** The security risks alone justify migration to a safer format.

#### HDF5 (TensorFlow), Protobuf (ONNX), MsgPack (Flax)

These formats share similar drawbacks for our purposes:

- Lack the layout control needed for efficient tensor access patterns
- No metadata scheme for model inspection without loading
- Not optimized for large-model loading workflows

**For our use case:** Each would require significant extensions to meet our needs.
## Key Advantages of Safetensors for Anemoi

Safetensors uniquely addresses the specific requirements of Anemoi's inference checkpoint system:

**Security and Performance Balance**

- Guaranteed safety: loading a checkpoint cannot execute arbitrary code
- Zero-copy architecture delivers dramatic speedups (76x on CPU, 2x on GPU)
- Small codebase (~400 lines) minimizes the security surface area

**Weather-Specific Model Requirements**

- Supports the large tensor dimensions typical of weather models
- No file size limitations
- Efficient architecture for distributed inference patterns

**Practical Implementation Advantages**

- JSON header allows inspection without loading the entire model (see the sketch after this list)
- Standardized metadata system for version tracking and model lineage
- Layout control minimizes disk I/O during distributed loading
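Because the header is plain JSON at a fixed offset (an 8-byte little-endian length, then the JSON itself), inspection needs nothing beyond the standard library. A minimal sketch, assuming a hypothetical checkpoint path; in practice `safetensors.safe_open(...).metadata()` exposes the same information:

```python
import json
import struct

def read_header(path: str) -> dict:
    """Read the JSON header of a .safetensors file without touching the weights."""
    with open(path, "rb") as f:
        # The format starts with an 8-byte little-endian header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))

header = read_header("inference-checkpoint.safetensors")  # hypothetical path
print(header.get("__metadata__"))  # user metadata, a str -> str mapping
print({k: v["shape"] for k, v in header.items() if k != "__metadata__"})
```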
**Backward Compatibility Enabler**

- Format structure makes it easy to implement "identity operations" for backward compatibility
- Clear path for handling architecture changes between versions
- Metadata can store transformation information for version migration (see the sketch after this list)
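As an illustration of metadata-driven migration, here is a minimal sketch; the `renamed_tensors` metadata key is a hypothetical convention of ours, not part of the safetensors format:

```python
import json
from safetensors import safe_open

with safe_open("inference-checkpoint.safetensors", framework="pt") as f:
    meta = f.metadata() or {}
    # Hypothetical convention: a JSON-encoded map from old to new tensor names.
    renames = json.loads(meta.get("renamed_tensors", "{}"))
    state = {renames.get(name, name): f.get_tensor(name) for name in f.keys()}
```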
**Future-Proofing**

- Supports modern datatypes (FP16, BFloat16, FP8); the sketch below shows BFloat16
- Growing ecosystem adoption ensures continued maintenance
- Cross-framework support preserves our flexibility for future architecture changes
- Efficient lazy loading for distributed settings
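For instance, BFloat16 tensors round-trip natively, with no casting to float32. A small sketch using the safetensors PyTorch bindings; the file name is arbitrary:

```python
import torch
from safetensors.torch import load_file, save_file

# BFloat16 is stored natively, so the dtype survives the round trip.
save_file({"w": torch.randn(4, 4, dtype=torch.bfloat16)}, "bf16.safetensors")
w = load_file("bf16.safetensors")["w"]
assert w.dtype == torch.bfloat16
```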
To steal a summary from the safetensors docs:

| Format | Safe | Zero-copy | Lazy loading | No file size limit | Layout control | Flexibility | Bfloat16/Fp8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pickle (PyTorch) | ✗ | ✗ | ✗ | 🗸 | ✗ | 🗸 | 🗸 |
| H5 (Tensorflow) | 🗸 | ✗ | 🗸 | 🗸 | ~ | ~ | ✗ |
| SavedModel (Tensorflow) | 🗸 | ✗ | ✗ | 🗸 | 🗸 | ✗ | 🗸 |
| MsgPack (flax) | 🗸 | 🗸 | ✗ | 🗸 | ✗ | ✗ | 🗸 |
| Protobuf (ONNX) | 🗸 | ✗ | ✗ | ✗ | ✗ | ✗ | 🗸 |
| Cap'n'Proto | 🗸 | 🗸 | ~ | 🗸 | 🗸 | ~ | ✗ |
| Arrow | ? | ? | ? | ? | ? | ? | ✗ |
| Numpy (npy,npz) | 🗸 | ? | ? | ✗ | 🗸 | ✗ | ✗ |
| pdparams (Paddle) | ✗ | ✗ | ✗ | 🗸 | ✗ | 🗸 | 🗸 |
| SafeTensors | 🗸 | 🗸 | 🗸 | 🗸 | 🗸 | ✗ | 🗸 |
While several alternatives provide some of the benefits we need, safetensors is the only solution that comprehensively addresses all our requirements: security, performance, lazy loading, and framework agnosticism.
The implementation cost is low compared to the alternatives, and the return is high: security improvements, performance gains, and future architecture flexibility.
The format's adoption by major ML projects (Hugging Face, MLX, StabilityAI) demonstrates its viability for production environments. Its simplicity and focus on ML-specific requirements make it far more suitable than general-purpose formats or older ML solutions with known limitations.
For Anemoi's inference checkpoints, safetensors offers the most direct path to secure, efficient, and future-proof model storage with minimal implementation overhead.
## Practical Implementation Considerations

**Migration Path**

- Established conversion patterns from pickle to safetensors (see the sketch after this list)
- Conversion tools (HF Spaces, scripts) are available and adaptable
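A minimal conversion sketch, assuming the existing checkpoint is a plain PyTorch state dict; the paths are hypothetical:

```python
import torch
from safetensors.torch import save_file

# Load the legacy pickle checkpoint; weights_only=True refuses arbitrary objects.
state = torch.load("model.ckpt", map_location="cpu", weights_only=True)

# safetensors rejects tensors that share storage, so take contiguous copies.
state = {name: t.contiguous().clone() for name, t in state.items()}

save_file(state, "model.safetensors")
```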
**Metadata Handling**

- JSON-based header enables simple parsing and extension
- Can store the traceability information vital for MLOps (see the sketch after this list)
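A sketch of what attaching traceability metadata could look like. The metadata keys below are hypothetical conventions, and safetensors requires all metadata values to be strings, so nested structures are JSON-encoded:

```python
import json
import torch
from safetensors.torch import save_file

tensors = {"w": torch.zeros(2, 2)}  # stand-in for a real state dict

# Hypothetical traceability keys; values must be strings.
metadata = {
    "schema_version": "1.0.0",
    "training_config": json.dumps({"lr": 1e-4, "epochs": 100}),
    "git_commit": "abc1234",
}
save_file(tensors, "model.safetensors", metadata=metadata)
```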
**Weather-Specific Requirements**

- The format's simplicity leaves room for domain-specific metadata
- Handles large tensors efficiently, which is important for weather models
## Implementation Plan

### Phase 1: Checkpoint Layout

- Integrate safetensors into Anemoi training
- Implement checkpoint saving utilities
- Prepare model schema versioning

### Phase 2: Metadata Layout

- Implement the metadata layout in safetensors
- Ensure compatibility with the inference toolset
- Verify full traceability, including model versioning

### Phase 3: Model Loading

- Develop model loading for safetensors checkpoints
- Create an adapter layer for backward compatibility
- Implement tensor mapping for architecture changes

### Phase 4: Inference Integration and Interoperability
### Phase 5: Testing & Migration

## Technical Details

As for model loading, I would suggest implementing a loading capability similar to Hugging Face's.
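A rough sketch of what such an entry point could look like, modeled loosely on Hugging Face's `from_pretrained`; the class, method, and layer below are hypothetical stand-ins:

```python
import torch
from safetensors.torch import load_file

# Hypothetical model class, for illustration only.
class AnemoiModel(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.proj = torch.nn.Linear(8, 8)

    @classmethod
    def from_checkpoint(cls, path: str, device: str = "cpu") -> "AnemoiModel":
        """One-call construction plus weight loading, Hugging Face style."""
        model = cls()
        state = load_file(path, device=device)   # safe: no code execution
        model.load_state_dict(state, strict=False)  # tolerate version drift
        return model.to(device)
```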
## Questions

## Resources

## Additional Considerations
I believe this is in line with the design considerations and the discussions with some member states during the onboarding process.

I also believe that, if implemented correctly, this will solve multiple backward-compatibility issues, improving the overall health of the project and reducing tech debt. It could even serve as a blueprint for certain aspects of training checkpoints, but that is out of scope for this specific work.