[bnb] Minor modifications #18631
@@ -733,3 +733,57 @@ This feature involves 3 different libraries. To install them, please follow the
- [Torchdynamo installation](https://github.com/pytorch/torchdynamo#requirements-and-setup)
- [Functorch installation](https://github.com/pytorch/functorch#install)
- [Torch-TensorRT(FX) installation](https://github.com/pytorch/TensorRT/blob/master/docsrc/tutorials/getting_started_with_fx_path.rst#installation)
## `bitsandbytes` integration for Int8 mixed-precision matrix decomposition
From the paper `LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale`, we support Hugging Face integration for all models in the Hub with a few lines of code.
The method reduces the size of `nn.Linear` by 2 for `float16` or `bfloat16` weights and by 4 for `float32` weights, with close to no impact on quality, by operating on the outliers in half-precision.
To benefit from this technique you'd want to use it only for models larger than 2 billion parameters. This method has been validated to work well on models from 2 to 176 billion parameters. Only PyTorch models are supported.
 | ||
Int8 mixed-precision matrix decomposition works by separating a matrix multiplication into two streams: (1) a systematic feature outlier stream matrix multiplied in fp16 (0.01%), (2) a regular stream of int8 matrix multiplication (99.9%). With this method, int8 inference with no predictive degradation is possible for very large models (>=6B parameters).
Review comment: Thanks for pointing out the inconsistency. Actually I wrote this sentence a while ago when there were some bugs for small models on …
For more details regarding the method, check out the [paper](https://huggingface.co/blog/hf-bitsandbytes-integration) or our [blogpost about the integration](https://huggingface.co/blog/hf-bitsandbytes-integration).
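To build intuition for the decomposition described above, here is a purely illustrative toy sketch. It is not part of `transformers` or `bitsandbytes` (which use fused CUDA kernels); the threshold, shapes, and the absmax quantization scheme below are simplified assumptions:

```py
import torch

def toy_int8_decomposition(x, w, outlier_threshold=6.0):
    """Toy illustration of LLM.int8() mixed-precision decomposition (not the real kernels)."""
    # Feature dimensions with large-magnitude activations are treated as outliers
    # and kept in higher precision (fp16 in the real method).
    outlier_cols = (x.abs() > outlier_threshold).any(dim=0)

    # Outlier stream: the few outlier dimensions are multiplied without quantization.
    out_outliers = x[:, outlier_cols] @ w[outlier_cols, :]

    # Int8 stream: absmax-quantize the remaining dimensions and multiply them
    # (the int8 matmul is emulated in float here for simplicity).
    x_r, w_r = x[:, ~outlier_cols], w[~outlier_cols, :]
    x_scale = x_r.abs().amax(dim=1, keepdim=True) / 127 + 1e-8
    w_scale = w_r.abs().amax(dim=0, keepdim=True) / 127 + 1e-8
    x_q = (x_r / x_scale).round().clamp(-127, 127).to(torch.int8)
    w_q = (w_r / w_scale).round().clamp(-127, 127).to(torch.int8)
    out_int8 = (x_q.float() @ w_q.float()) * (x_scale * w_scale)

    # Recombine the two partial results.
    return out_outliers + out_int8

x = torch.randn(4, 16)
w = torch.randn(16, 8)
print(torch.allclose(toy_int8_decomposition(x, w), x @ w, atol=0.5))  # quantization error stays small
```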
 | ||
Note that you need a GPU to run mixed-8bit models, as the kernels have been compiled for GPUs only. Make sure you have enough GPU memory to store a quarter of the model (or half, if your model weights are in half precision) before using this feature.
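As a rough sizing sketch (the 7B figure below is just a hypothetical example, not a statement about any specific checkpoint), int8 weights take about one byte per parameter:

```py
params = 7_000_000_000            # hypothetical 7B-parameter model
fp32_gb = params * 4 / 1e9        # ~28 GB as float32
fp16_gb = params * 2 / 1e9        # ~14 GB as float16/bfloat16
int8_gb = params * 1 / 1e9        # ~7 GB in int8: a quarter of fp32, half of fp16
print(fp32_gb, fp16_gb, int8_gb)
```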
Below are some notes to help you use this module, or follow the demos on [Google Colab](#colab-demos).
### Requirements
- Make sure you run this on an NVIDIA GPU that supports 8-bit tensor cores (Turing or Ampere GPUs, e.g. T4, RTX 20s, RTX 30s, A40-A100). Note that previous generations of NVIDIA GPUs do not support 8-bit tensor cores.
- Install the correct version of `bitsandbytes` by running:
`pip install bitsandbytes>=0.31.5`
- Install `accelerate>=0.12.0`:
`pip install accelerate`
### Running mixed-int8 models
After carefully installing the required libraries, load your mixed 8-bit model as follows:
```py
from transformers import AutoModelForCausalLM

model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
```
The current implementation supports a multi-GPU setup when using `accelerate`. If you want to control how much GPU memory to allocate to each GPU, you can use the `max_memory` argument as follows:
```py
from transformers import AutoModelForCausalLM

max_memory_mapping = {0: "1GB", 1: "2GB"}
model_name = "bigscience/bloom-3b"
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping
)
```
In this example, the first GPU will use 1GB of memory and the second 2GB.
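Once loaded with either of the snippets above, the model can be used like any other `transformers` model. A minimal sketch (the prompt and generation settings are only illustrative):

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-2b5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

# Tokenize a prompt, move it to the first GPU, and generate as usual.
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)
outputs = model_8bit.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```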
### Colab demos
With this method you can run inference on models that could not previously fit on a Google Colab.
Check out the demo for running T5-11b (42GB in fp16) using 8-bit quantization on Google Colab:
[](https://colab.research.google.com/drive/1YORPWx4okIHXnjW7MSAidXN29mPVNT7F?usp=sharing) | ||
Or this demo for BLOOM-3B:
[](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing) |
@@ -1,37 +1,120 @@
# Testing mixed int8 quantization
 | ||
Hi there, this is a recipe on how to effectively debug the `bitsandbytes` integration in Hugging Face `transformers`.
## Library requirements
+ `transformers>=4.22.0`
+ `accelerate>=0.12.0`
+ `bitsandbytes>=0.31.5,<=0.31.8`
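A quick way to sanity-check the installed versions against the requirements above (a sketch using only the standard library, not part of this repository):

```py
from importlib.metadata import version

for package in ("transformers", "accelerate", "bitsandbytes"):
    # Print the installed version of each required package.
    print(package, version(package))
```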
## Hardware requirements
I am using a setup of 2 NVIDIA Tesla T4 15GB GPUs (`younes-testing-multi-gpu` on GCP). To run `bitsandbytes` successfully you need a GPU that supports 8-bit tensor cores. Note that Ampere architectures should be supported. Here is an exhaustive list of the supported GPU types at the time of this writing:
- RTX 20s & RTX 30s
- A40-A100
- T4 (e.g. Google Colab GPUs)
Review comment: are you sure it's exhaustive? e.g. I don't see A6000. Perhaps best to list the architectures supporting those and a few examples?

Reply: I proposed a change in 632a5a7!

Review comment: why even bother with the exhaustive list? This is not an ad for NVIDIA, and when you use this approach you set yourself up for a future hell of needing to list all the new cards that will come out later. Moreover, earlier you said the Turing arch works too, so why have you dropped it here? I'd just list the supported arch names and future-proof it with: …
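If you want to check programmatically whether your GPU falls into this category, one option is to query its compute capability; Turing corresponds to 7.5 and Ampere to 8.x (a sketch, assuming a CUDA-enabled PyTorch install):

```py
import torch

# 8-bit tensor cores are available on compute capability 7.5 (Turing) and newer.
major, minor = torch.cuda.get_device_capability(0)
print("int8 tensor cores supported:", (major, minor) >= (7, 5))
```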
## Virtual envs
```bash
conda create --name int8-testing python==3.8
git clone https://github.com/younesbelkada/transformers.git && git checkout integration-8bit
pip install -e ".[dev]"
pip install -i https://test.pypi.org/simple/ bitsandbytes
pip install git+https://github.com/huggingface/accelerate.git@e0212893ea6098cc0a7a3c7a6eb286a9104214c1
```
```bash
pip install bitsandbytes
pip install accelerate
pip install git+https://github.com/huggingface/transformers.git
```
## Troubleshooting
A list of common errors:
### Torch does not correctly perform operations on the GPU
First check that the following snippet runs without any error:
```py
import torch

vec = torch.randn(1, 2, 3).to(0)
```
If it does not, install torch using `conda`, for example:
```bash
conda create --name int8-testing python==3.8
pip install -i https://test.pypi.org/simple/ bitsandbytes
pip install bitsandbytes
conda install pytorch torchvision torchaudio -c pytorch
git clone https://github.com/younesbelkada/transformers.git && git checkout integration-8bit
pip install -e ".[dev]"
pip install git+https://github.com/huggingface/accelerate.git@b52b793ea8bac108ba61192eead3cf11ca02433d
pip install git+https://github.com/huggingface/transformers.git
pip install accelerate
```
The snippet above should then work.
### `bitsandbytes operations are not supported under CPU!`
This happens when some `Linear` weights are set to the CPU when using `accelerate`. Carefully check `model.hf_device_map` and make sure that no `Linear` module is assigned to the CPU. It is fine to have the last module (usually the lm_head) on the CPU.
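A minimal sketch for that check, assuming the model was loaded with `device_map="auto"` as shown earlier (the variable name `model_8bit` is just an example):

```py
# Print where each submodule was dispatched so you can spot Linear modules placed on "cpu".
for module_name, device in model_8bit.hf_device_map.items():
    print(module_name, device)
```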
### `To use the type as a Parameter, please correct the detach() semantics defined by __torch_dispatch__() implementation.`
Use the latest version of `accelerate` with a command such as `pip install --force-reinstall accelerate` and the problem should be solved.
### `Parameter has no attribute .CB`
Same fix as above: update `accelerate` to the latest version.
### `RuntimeError: CUDA error: an illegal memory access was encountered ... consider passing CUDA_LAUNCH_BLOCKING=1`
Run your script with `CUDA_LAUNCH_BLOCKING=1` prepended and you should observe an error as described below:
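For example (the script name here is a hypothetical placeholder):

```bash
CUDA_LAUNCH_BLOCKING=1 python my_int8_script.py
```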
### `CUDA illegal memory error: an illegal memory access at line...`
Check the CUDA versions with:
```bash
nvcc --version
```
And confirm it is the same version as the one detected by `bitsandbytes`. If not, run:
```bash
ls -l $CONDA_PREFIX/lib/libcudart.so
```
or
```bash
ls -l $LD_LIBRARY_PATH
```
And check whether `libcudart.so` has a correct symlink set. Sometimes `nvcc` detects the correct CUDA version but `bitsandbytes` doesn't. You have to make sure that the symlink set for the file `libcudart.so` points to the correct CUDA library.
Here is an example of a badly configured CUDA installation:
`nvcc --version` gives:
 | ||
Which means that the detected CUDA version is 11.3, but `bitsandbytes` outputs:
 | ||
Therefore check:
```bash
ls -l /opt/conda/envs/py37/lib/libcudart.so
```

Review comment: you're hardcoding your custom paths - 99% of users will have it located elsewhere. In general please try to use text for the traceback and not images. Images can't be searched; listing the traceback as text will help users find this page when they google the error.

Review comment: also it makes it very difficult to review since GitHub doesn't show me the image when I review md.

Reply: I agree this is confusing! Proposed a fix in 8e703b0
And you can see that:
 | ||
If you see that the file is linked to the wrong CUDA version (here 10.2), find the correct location for `libcudart.so` (for example with `find / -name ...`) and point the `LD_LIBRARY_PATH` environment variable at the directory containing the correct `libcudart.so` file.
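A minimal sketch of that fix; the paths below are hypothetical examples and will differ on your system:

```bash
# Locate the available CUDA runtimes on the machine.
find / -name 'libcudart.so*' 2>/dev/null
# Point LD_LIBRARY_PATH at the directory that contains the correct libcudart.so (example path).
export LD_LIBRARY_PATH=/usr/local/cuda-11.3/lib64:$LD_LIBRARY_PATH
```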
### If `bitsandbytes` installation breaks everything
Review comment: this is a bit of a vague statement - what do you mean it breaks everything? Why not just list the error as you did with all the previous sections? Overall this is again a maintenance hell to list all the possible errors if the solution is to have the correct minimal version of a package, which you have already stated is a requirement. Moreover, you can code to check the minimal required version and assert if it's not there. Then you can delete all the errors that are no longer relevant.

Reply: Thanks for the feedback! As you said below, I think that setting the correct version of bitsandbytes should be sufficient.

Reply: Removed in e16a56a
It happened in a previous version that, after installing `bitsandbytes` and running this script:
```py
import bitsandbytes as bnb
```
you would get an error:
```
major, minor, revision = ...
Too many values to unpack...
```
Re-install `bitsandbytes==0.31.8` or `bitsandbytes==0.31.5`, as everything worked fine on our Docker image with those versions. In the worst case, remove the [line that installs bitsandbytes in the Dockerfile](https://github.com/huggingface/transformers/blob/d6eeb871706db0d64ab9ffd79f9545d95286b536/docker/transformers-all-latest-gpu/Dockerfile#L49).

Note: sometimes you have to run a "dummy" inference pass when dealing with a multi-GPU setup. Check out the `test_multi_gpu_loading` and the `test_pipeline` functions.
Review comment: same as before, you're setting the user up for an inflexible env. Just say …

Reply: I have the same comments as above!
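Following the reviewer's suggestion above to assert the minimal required version in code, a sketch (the bound comes from the library requirements listed at the top; `packaging` ships with `transformers`' dependencies):

```py
from importlib.metadata import version
from packaging.version import Version

# Fail early with a clear message instead of an obscure runtime error later.
assert Version(version("bitsandbytes")) >= Version("0.31.5"), "Please install bitsandbytes>=0.31.5"
```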