SafeTensors Speed Up During Inference Inconsistency #596

Open
StarManAI opened this issue Apr 1, 2025 · 0 comments

StarManAI commented Apr 1, 2025

System Info

transformers-cli env result :

  • transformers version: 4.48.0
  • Platform: Linux-6.8.0-52-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.27.1
  • Safetensors version: 0.5.2
  • Accelerate version: 1.2.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1+cu124 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: Nope

Information

  • The official example scripts
  • My own modified scripts

Reproduction

The behavior occurs when running the following script, doing inference on CPU:

import os
import datetime
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

# Download model files
sf_filename = hf_hub_download("gpt2", filename="model.safetensors")
pt_filename = hf_hub_download("gpt2", filename="pytorch_model.bin")

# Load safetensors weights
start_st = datetime.datetime.now()
weights_st = load_file(sf_filename, device="cpu")
load_time_st = datetime.datetime.now() - start_st
print(f"Loaded safetensors {load_time_st}")

# Load pytorch weights
start_pt = datetime.datetime.now()
weights_pt = torch.load(pt_filename, map_location="cpu")
load_time_pt = datetime.datetime.now() - start_pt
print(f"Loaded pytorch {load_time_pt}")

print(f"on CPU, safetensors is faster than pytorch by: {load_time_pt/load_time_st:.1f}X")

# Initialize tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Benchmark inference with safetensors
model_st = GPT2LMHeadModel.from_pretrained("gpt2", state_dict=weights_st)
input_ids = tokenizer.encode("Hello, world!", return_tensors="pt")

start_infer_st = datetime.datetime.now()
output_st = model_st(input_ids)
infer_time_st = datetime.datetime.now() - start_infer_st
print(f"Inference time with safetensors: {infer_time_st}")

# Benchmark inference with pytorch
model_pt = GPT2LMHeadModel.from_pretrained("gpt2", state_dict=weights_pt)
start_infer_pt = datetime.datetime.now()
output_pt = model_pt(input_ids)
infer_time_pt = datetime.datetime.now() - start_infer_pt
print(f"Inference time with pytorch: {infer_time_pt}")

print(f"on CPU, safetensors inference is faster than pytorch by: {infer_time_pt/infer_time_st:.1f}X")

Expected behavior

I ran the script multiple times, but on the last two runs this is what happened:

python3 main.py
Loaded safetensors 0:00:00.021458
Loaded pytorch 0:00:00.256295
on CPU, safetensors is faster than pytorch by: 11.9X
Inference time with safetensors: 0:00:00.053771
Inference time with pytorch: 0:00:00.095637
on CPU, safetensors inference is faster than pytorch by: 1.8X

python3 main.py
Loaded safetensors 0:00:00.020366
Loaded pytorch 0:00:00.259216
on CPU, safetensors is faster than pytorch by: 12.7X
Inference time with safetensors: 0:00:00.099727
Inference time with pytorch: 0:00:00.094022
on CPU, safetensors inference is faster than pytorch by: 0.9X

The speed-up during inference seems inconsistent for some reason. I am also attaching a screenshot.

(screenshot of the benchmark output attached)
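
One way to check whether the inconsistency is timing noise rather than a real difference between the two checkpoints is to compare the loaded state dicts directly: once both sets of weights are in memory, the two models should be computing with the same values. A rough sketch, assuming the weights_st and weights_pt dictionaries loaded in the script above:

import torch  # already imported in the script above

# Compare the tensors loaded from model.safetensors and pytorch_model.bin;
# if they match, both models run with identical weight values and any
# remaining inference gap comes from measurement noise, not the format.
shared = set(weights_st) & set(weights_pt)
mismatched = [k for k in shared if not torch.equal(weights_st[k], weights_pt[k])]
print(f"{len(shared)} shared keys, {len(mismatched)} tensors differ")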
