SafeTensors Speed Up During Inference Inconsistency #596

Open
StarManAI opened this issue Apr 1, 2025 · 0 comments

StarManAI commented Apr 1, 2025

System Info

transformers-cli env result :

  • transformers version: 4.48.0
  • Platform: Linux-6.8.0-52-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.27.1
  • Safetensors version: 0.5.2
  • Accelerate version: 1.2.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1+cu124 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: Nope

Information

  • The official example scripts
  • My own modified scripts

Reproduction

The behavior occurs when running the following script, doing inference on CPU:

import os
import datetime
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

# Download model files
sf_filename = hf_hub_download("gpt2", filename="model.safetensors")
pt_filename = hf_hub_download("gpt2", filename="pytorch_model.bin")

# Load safetensors weights
start_st = datetime.datetime.now()
weights_st = load_file(sf_filename, device="cpu")
load_time_st = datetime.datetime.now() - start_st
print(f"Loaded safetensors {load_time_st}")

# Load pytorch weights
start_pt = datetime.datetime.now()
weights_pt = torch.load(pt_filename, map_location="cpu")
load_time_pt = datetime.datetime.now() - start_pt
print(f"Loaded pytorch {load_time_pt}")

print(f"on CPU, safetensors is faster than pytorch by: {load_time_pt/load_time_st:.1f}X")

# Initialize tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Benchmark inference with safetensors
model_st = GPT2LMHeadModel.from_pretrained("gpt2", state_dict=weights_st)
input_ids = tokenizer.encode("Hello, world!", return_tensors="pt")

start_infer_st = datetime.datetime.now()
output_st = model_st(input_ids)
infer_time_st = datetime.datetime.now() - start_infer_st
print(f"Inference time with safetensors: {infer_time_st}")

# Benchmark inference with pytorch
model_pt = GPT2LMHeadModel.from_pretrained("gpt2", state_dict=weights_pt)
start_infer_pt = datetime.datetime.now()
output_pt = model_pt(input_ids)
infer_time_pt = datetime.datetime.now() - start_infer_pt
print(f"Inference time with pytorch: {infer_time_pt}")

print(f"on CPU, safetensors inference is faster than pytorch by: {infer_time_pt/infer_time_st:.1f}X")

Expected behavior

I ran the script multiple times, but on the last two runs this is what happened:

python3 main.py
Loaded safetensors 0:00:00.021458
Loaded pytorch 0:00:00.256295
on CPU, safetensors is faster than pytorch by: 11.9X
Inference time with safetensors: 0:00:00.053771
Inference time with pytorch: 0:00:00.095637
on CPU, safetensors inference is faster than pytorch by: 1.8X

python3 main.py
Loaded safetensors 0:00:00.020366
Loaded pytorch 0:00:00.259216
on CPU, safetensors is faster than pytorch by: 12.7X
Inference time with safetensors: 0:00:00.099727
Inference time with pytorch: 0:00:00.094022
on CPU, safetensors inference is faster than pytorch by: 0.9X

The speed-up during inference seems inconsistent for some reason. I am also attaching a screenshot.

(screenshot of the benchmark output attached)
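
One way to check whether the inconsistency is timing noise rather than a real difference between the two checkpoints is to compare the loaded state dicts directly: once both sets of weights are in memory, the two models should be computing with the same values. A rough sketch, assuming the weights_st and weights_pt dictionaries loaded in the script above:

import torch  # already imported in the script above

# Compare the tensors loaded from model.safetensors and pytorch_model.bin;
# if they match, both models run with identical weight values and any
# remaining inference gap comes from measurement noise, not the format.
shared = set(weights_st) & set(weights_pt)
mismatched = [k for k in shared if not torch.equal(weights_st[k], weights_pt[k])]
print(f"{len(shared)} shared keys, {len(mismatched)} tensors differ")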
