[bnb] Minor modifications #18631

Merged
merged 20 commits on Aug 16, 2022
Changes from 3 commits
2 changes: 1 addition & 1 deletion docker/transformers-all-latest-gpu/Dockerfile
@@ -46,7 +46,7 @@ RUN python3 -m pip install -U "itsdangerous<2.1.0"
RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate

# Add bitsandbytes for mixed int8 testing
RUN python3 -m pip install -i https://test.pypi.org/simple/ bitsandbytes==0.31.5
RUN python3 -m pip install --no-cache-dir bitsandbytes

RUN python3 -m pip install --no-cache-dir decord

40 changes: 0 additions & 40 deletions docs/source/en/main_classes/model.mdx
@@ -133,46 +133,6 @@ model = AutoModel.from_config(config)

Due to Pytorch design, this functionality is only available for floating dtypes.

### `bitsandbytes` integration for Int8 mixed-precision matrix decomposition

From the paper `LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale`, we support HuggingFace 🤗 integration for all models in the Hub with a few lines of code.
This works for models trained in half precision (either `float16` or `bfloat16`) or in full precision. The method reduces the `nn.Linear` size by 2 (if trained in half precision) or by 4 (if trained in full precision), without affecting quality too much, by operating on the outliers in half precision.
This technique is useful and works well for billion-scale models (>1B parameters), therefore we advise you to use it only for models of that scale. The method has been tested on models from the 2-billion to the 176-billion parameter scale and supports only PyTorch models.

![HFxbitsandbytes.png](https://s3.amazonaws.com/moonup/production/uploads/1659861207959-62441d1d9fdefb55a0b7d12c.png)

Int8 mixed-precision matrix decomposition works by separating a matrix multiplication into two streams: (1) a systematic feature-outlier stream matrix-multiplied in fp16 (0.01% of the values), (2) a regular stream of int8 matrix multiplication (99.9% of the values). With this method, int8 inference with no predictive degradation is possible for very large models (>=176B parameters).
Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but there are some exceptional systematic outliers that are distributed very differently for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning).

Note also that you would require a GPU to run mixed-8bit models, as the kernels have been compiled for GPUs only. Make sure that you have enough GPU RAM to store a quarter of the model (or half, if your model is natively in half precision) before using this feature.

Below are some notes to help you use this module, or follow this demo on Google colab: [![Open In Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing)

#### Requirements

- Make sure you run this on an NVIDIA GPU that supports 8-bit tensor cores (Turing or Ampere GPUs - e.g. T4, RTX20s, RTX30s, A40-A100). Note that previous generations of NVIDIA GPUs do not support 8-bit tensor cores.
- Install the correct version of `bitsandbytes` by running:
`pip install -i https://test.pypi.org/simple/ bitsandbytes`
- Install `accelerate`:
`pip install accelerate`

#### Running mixed-int8 models

After carefully installing the required libraries, the way to load your mixed 8-bit model is as follows:
```py
model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
```
The implementation supports a multi-GPU setup thanks to `accelerate` as the backend. If you want to control the GPU memory allocated to each GPU, you can use the `max_memory` argument as follows:
(To allocate `1GB` to GPU 0 and `2GB` to GPU 1, use `max_memory={0:"1GB", 1:"2GB"}`.)
```py
max_memory_mapping = {0: "1GB", 1: "2GB"}
model_name = "bigscience/bloom-3b"
model_8bit = AutoModelForCausalLM.from_pretrained(
model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping
)
```


## ModuleUtilsMixin

54 changes: 54 additions & 0 deletions docs/source/en/perf_train_gpu_one.mdx
@@ -733,3 +733,57 @@ This feature involves 3 different libraries. To install them, please follow the
- [Torchdynamo installation](https://github.com/pytorch/torchdynamo#requirements-and-setup)
- [Functorch installation](https://github.com/pytorch/functorch#install)
- [Torch-TensorRT(FX) installation](https://github.com/pytorch/TensorRT/blob/master/docsrc/tutorials/getting_started_with_fx_path.rst#installation)

## `bitsandbytes` integration for Int8 mixed-precision matrix decomposition

From the paper `LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale`, we support HuggingFace integration for all models in the Hub with a few lines of code.
The method reduces the `nn.Linear` size by 2 for `float16` or `bfloat16` weights and by 4 for `float32` weights, with close to no impact on quality, by operating on the outliers in half precision.
To benefit from this technique you'd want to use it only for models that are larger than 2 billion parameters. The method has been validated to work well on models from 2 to 176 billion parameters. Only PyTorch models are supported.

![HFxbitsandbytes.png](https://s3.amazonaws.com/moonup/production/uploads/1659861207959-62441d1d9fdefb55a0b7d12c.png)

Int8 mixed-precision matrix decomposition works by separating a matrix multiplication into two streams: (1) a systematic feature outlier stream matrix multiplied in fp16 (0.01%), (2) a regular stream of int8 matrix multiplication (99.9%). With this method, int8 inference with no predictive degradation is possible for very large models (>=6B parameters).
Contributor:

2B two paras up and >=6B here - which is it then? Let's be consistent

Contributor Author:

Thanks for pointing out the inconsistency. Actually I wrote this sentence a while ago when there were some bugs for small models in bitsandbytes. Now it should work for all models (see Figure 1 of the paper: https://arxiv.org/abs/2208.07339). I have just removed any reference to large models in 65ec377.

For more details regarding the method, check out the [paper](https://arxiv.org/abs/2208.07339) or our [blogpost about the integration](https://huggingface.co/blog/hf-bitsandbytes-integration).

![MixedInt8.gif](https://s3.amazonaws.com/moonup/production/uploads/1660567469965-62441d1d9fdefb55a0b7d12c.gif)
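
To make the two streams concrete, here is a minimal, purely illustrative sketch of the decomposition (plain PyTorch with per-tensor absmax quantization; the actual `bitsandbytes` kernels use vector-wise quantization and true int8 GEMM):

```py
import torch

def int8_decomposition_sketch(x, w, threshold=6.0):
    # x: (batch, in_features) activations, w: (out_features, in_features) weights
    outlier_cols = (x.abs() > threshold).any(dim=0)  # feature dimensions holding outliers
    # stream 1: outlier features, kept and multiplied in higher precision (~0.01% of the values)
    out_fp16 = (x[:, outlier_cols] @ w[:, outlier_cols].t()).float()
    # stream 2: everything else, quantized to int8 with absmax scaling
    x_r, w_r = x[:, ~outlier_cols], w[:, ~outlier_cols]
    sx, sw = x_r.abs().max() / 127, w_r.abs().max() / 127
    x_q = (x_r / sx).round().clamp(-127, 127).to(torch.int8)
    w_q = (w_r / sw).round().clamp(-127, 127).to(torch.int8)
    # the int8 matmul is emulated in float here; the real kernels run it on int8 tensor cores
    out_int8 = (x_q.float() @ w_q.float().t()) * (sx * sw)
    return out_fp16 + out_int8
```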

Note that you would require a GPU to run mixed-8bit models, as the kernels have been compiled for GPUs only. Make sure that you have enough GPU memory to store a quarter of the model (or half, if your model weights are in half precision) before using this feature.
Below are some notes to help you use this module, or follow the demos on [Google colab](#colab-demos).

### Requirements

- Make sure you run this on an NVIDIA GPU that supports 8-bit tensor cores (Turing or Ampere GPUs - e.g. T4, RTX20s, RTX30s, A40-A100). Note that previous generations of NVIDIA GPUs do not support 8-bit tensor cores (a quick capability check is sketched after this list).
- Install the correct version of `bitsandbytes` by running:
`pip install bitsandbytes>=0.31.5`
- Install `accelerate>=0.12.0`:
`pip install accelerate`
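
If you are not sure whether your card qualifies, a quick check (a sketch, not an official support matrix) is to look at the CUDA compute capability - Turing reports 7.5 and Ampere 8.x:

```py
import torch

# Turing GPUs (e.g. T4, RTX20s) report compute capability 7.5, Ampere (e.g. A100, RTX30s) 8.x;
# 8-bit tensor cores require compute capability >= 7.5
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability {major}.{minor}:", "supported" if (major, minor) >= (7, 5) else "not supported")
```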

### Running mixed-int8 models

After carefully installing the required libraries, the way to load your mixed 8-bit model is as follows:
```py
model_name = "bigscience/bloom-2b5"
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
```
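
As a quick usage check (an illustrative sketch - the prompt and generation settings are arbitrary), the 8-bit model can be used for generation like any other model:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-2b5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

# tokenize a prompt, move it to the first GPU and generate a short continuation
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)
outputs = model_8bit.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
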
The current implementation supports a multi-GPU setup when using `accelerate`. If you want to control the GPU memory allocated to each GPU, you can use the `max_memory` argument as follows:

```py
max_memory_mapping = {0: "1GB", 1: "2GB"}
model_name = "bigscience/bloom-3b"
model_8bit = AutoModelForCausalLM.from_pretrained(
model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping
)
```

In this example, the first GPU will use 1GB of memory and the second 2GB.

### Colab demos

With this method you can run inference on models that it was previously not possible to run on a Google Colab.
Check out the demo for running T5-11b (42GB in fp16) with 8-bit quantization on Google Colab:

[![Open In Colab: T5-11b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1YORPWx4okIHXnjW7MSAidXN29mPVNT7F?usp=sharing)

Or this demo for BLOOM-3B:

[![Open In Colab: BLOOM-3b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing)
111 changes: 97 additions & 14 deletions tests/mixed_int8/README.md
@@ -1,37 +1,120 @@
# Testing mixed int8 quantization

![HFxbitsandbytes.png](https://s3.amazonaws.com/moonup/production/uploads/1660567705337-62441d1d9fdefb55a0b7d12c.png)

Hi there, this is a recipe on how to effectively debug `bitsandbytes` integration on HuggingFace `transformers`.

## Library requirements

+ `transformers>=4.22.0`
+ `accelerate>=0.12.0`
+ `bitsandbytes>=0.31.5,<=0.31.8`

## Hardware requirements

I am using a setup of 2 GPUs that are NVIDIA-Tesla T4 15GB
I am using a setup of 2 GPUs that are NVIDIA-Tesla T4 15GB - `younes-testing-multi-gpu` on GCP. To run `bitsandbytes` successfully you would need a GPU that supports 8-bit tensor cores. Note that Ampere architectures should be supported. Here is an exhaustive list of the supported GPU types at the time of this writing:

- RTX 20s & RTX 30s
- A40-A100
- T4 + (e.g. Google Colab GPUs)
Contributor:

are you sure it's exhaustive? e.g. I don't see A6000

Perhaps best to list the architectures supporting those and a few examples?

Contributor Author:

I proposed a change in 632a5a7 !

Contributor:

why even bother with the exhaustive list? this is not an ad for NVIDIA. and when you use this approach you set yourself up for a future hell of needing to list all the new cards that will come out later.

Moreover, earlier you said the Turing arch works too, so why have you dropped it here?

I'd just list the supported arch names and future-proof it with:

Turing, Ampere or newer architectures - e.g. T4, RTX20s, RTX30s, A40-A100

Contributor Author:

Great, thanks! I proposed a fix in 82a9c8e & e950226


## Virtual envs

```conda create --name int8-testing python==3.8```
```git clone https://github.com/younesbelkada/transformers.git && git checkout integration-8bit```
```pip install -e ".[dev]"```
```pip install -i https://test.pypi.org/simple/ bitsandbytes```
```pip install git+https://github.com/huggingface/accelerate.git@e0212893ea6098cc0a7a3c7a6eb286a9104214c1```

```pip install bitsandbytes```
```pip install accelerate```
```pip install git+https://github.com/huggingface/transformers.git```
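
As a quick sanity check after installing (this just verifies that the imports work and that a GPU is visible):

```python -c "import torch, transformers, accelerate, bitsandbytes; print(torch.cuda.is_available())"```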

## Troubleshooting

A list of common errors:

### Torch does not correctly perform the operations on the GPU

First check that:

```py
import torch

vec = torch.randn(1, 2, 3).to(0)
```

Works without any error. If not, install torch using `conda` as follows:

```conda create --name int8-testing python==3.8```
```pip install -i https://test.pypi.org/simple/ bitsandbytes```
```pip install bitsandbytes```
```conda install pytorch torchvision torchaudio -c pytorch```
```git clone https://github.com/younesbelkada/transformers.git && git checkout integration-8bit```
```pip install -e ".[dev]"```
```pip install git+https://github.com/huggingface/accelerate.git@b52b793ea8bac108ba61192eead3cf11ca02433d```
```pip install git+https://github.com/huggingface/transformers.git```
```pip install accelerate```

And the snippet above should work.

### `bitsandbytes operations are not supported under CPU!`

This happens when some `Linear` weights are set to the CPU when using `accelerate`. Please check `model.hf_device_map` carefully and make sure that no `Linear` module is assigned to the CPU. It is fine to have the last module (usually the `lm_head`) set on the CPU.
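
A quick way to check (a sketch - `model_8bit` stands for whatever model you loaded with `device_map="auto"`):

```py
# modules that accelerate placed on CPU or disk; ideally only the final head, if anything
offloaded = [name for name, device in model_8bit.hf_device_map.items() if device in ("cpu", "disk")]
print(offloaded)
```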

### `To use the type as a Parameter, please correct the detach() semantics defined by __torch_dispatch__() implementation.`

Use the latest version of `accelerate` with a command such as: `pip install --force-reinstall accelerate` and the problem should be solved.

### Check driver settings:
### `Parameter has no attribue .CB`

Same solution as above.

### `RuntimeError: CUDA error: an illegal memory access was encountered ... consider passing CUDA_LAUNCH_BLOCKING=1`

Run your script by prepending `CUDA_LAUNCH_BLOCKING=1` and you should observe the error described in the next section.
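
For example (the script name is just a placeholder):

```
CUDA_LAUNCH_BLOCKING=1 python your_script.py
```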

### `CUDA illegal memory error: an illegal memory access at line...`:

Check the CUDA versions with:
```
nvcc --version
```

And confirm it is the same version as the one detected by `bitsandbytes`. If not, run:
```
ls -l $CONDA_PREFIX/lib/libcudart.so
```
or
```
ls -l $LD_LIBRARY_PATH
```
And check that the symlink for `libcudart.so` is set correctly. Sometimes `nvcc` detects the correct CUDA version but `bitsandbytes` doesn't. You have to make sure that the `libcudart.so` symlink points to the correct CUDA library.

Here is an example of a badly configured CUDA installation:

`nvcc --version` gives:

![Screenshot 2022-08-15 at 15.12.23.png](https://s3.amazonaws.com/moonup/production/uploads/1660569220888-62441d1d9fdefb55a0b7d12c.png)

Which means that the detected CUDA version is 11.3 but `bitsandbytes` outputs:

![image.png](https://s3.amazonaws.com/moonup/production/uploads/1660569284243-62441d1d9fdefb55a0b7d12c.png)

Therefore check:

```
ls -l /opt/conda/envs/py37/lib/libcudart.so
```

Contributor:

you're hardcoding your custom paths - 99% of users will have it located elsewhere.

In general please try to use text for tracebacks and not images. Images can't be searched.

Listing the traceback as text will help users find this page when they google the error.

Contributor:

also it makes it very difficult to review since GitHub doesn't show me the image when I review md.

Contributor Author:

I agree this is confusing! Proposed a fix in 8e703b0

### Recurrent bugs
And you can see that:

![Screenshot 2022-08-15 at 15.12.33.png](https://s3.amazonaws.com/moonup/production/uploads/1660569176504-62441d1d9fdefb55a0b7d12c.png)

If you see that the file is linked to the wrong CUDA version (here 10.2), find the correct location for `libcudart.so` (`find -name ...`) and replace the environment variable `LD_LIBRARY_PATH` with the one containing the correct `libcudart.so` file.
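
For example (the paths below are placeholders - adjust them to wherever the correct `libcudart.so` actually lives on your system):

```
find / -name 'libcudart.so*' 2>/dev/null
export LD_LIBRARY_PATH=/usr/local/cuda-11.3/lib64:$LD_LIBRARY_PATH
```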

### If `bitsandbytes` installation breaks everything:
Contributor:

this is a bit of a vague statement - what do you mean it breaks everything? why not just list the error as you did with all the previous sections?

overall this is again a maintenance hell to list all the possible errors if the solution is to have the correct minimal version of a package, which you have already stated is a requirement.

Moreover, you can add code to check the minimal required version and assert if it's not there.

Then you can delete all the errors that are no longer relevant.

Contributor Author:

Thanks for the feedback! As you said below, I think that setting the correct version of bitsandbytes should be sufficient.

Contributor Author:

Removed in e16a56a


It happened in a previous version that after installing `bitsandbytes` and running this script:

```py
import bitsandbytes as bnb
```

You get an error:

```
major, minor, revision = ...
Too many values to unpack...
```

Sometimes you have to run a "dummy" inference pass when dealing with a multi-GPU setup. Check out the `test_multi_gpu_loading` and the `test_pipeline` functions.
Re-install `bitsandbytes==0.31.8` or `bitsandbytes==0.31.5`, as everything worked fine on our Docker image with those versions. In the worst case, remove the [line that installs bitsandbytes in the Dockerfile](https://github.com/huggingface/transformers/blob/d6eeb871706db0d64ab9ffd79f9545d95286b536/docker/transformers-all-latest-gpu/Dockerfile#L49).
Contributor:

same as before, you're setting the user up for an inflexible env. Just say bitsandbytes>=0.31.8 and drop all the other old versions that are no longer relevant.

Contributor Author:

I have the same comments as above!