diff --git a/docker/transformers-all-latest-gpu/Dockerfile b/docker/transformers-all-latest-gpu/Dockerfile
index b0a55ba8be94..18082d33901b 100644
--- a/docker/transformers-all-latest-gpu/Dockerfile
+++ b/docker/transformers-all-latest-gpu/Dockerfile
@@ -46,7 +46,7 @@ RUN python3 -m pip install -U "itsdangerous<2.1.0"
 RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate
 
 # Add bitsandbytes for mixed int8 testing
-RUN python3 -m pip install -i https://test.pypi.org/simple/ bitsandbytes==0.31.5
+RUN python3 -m pip install --no-cache-dir bitsandbytes
 
 RUN python3 -m pip install --no-cache-dir decord
diff --git a/docs/source/en/main_classes/model.mdx b/docs/source/en/main_classes/model.mdx
index 10f81e55d745..fd19b3db52b7 100644
--- a/docs/source/en/main_classes/model.mdx
+++ b/docs/source/en/main_classes/model.mdx
@@ -133,46 +133,6 @@ model = AutoModel.from_config(config)
 
 Due to Pytorch design, this functionality is only available for floating dtypes.
 
-### `bitsandbytes` integration for Int8 mixed-precision matrix decomposition
-
-From the paper `GPT3.int8() : 8-bit Matrix Multiplication for Transformers at Scale`, we suport HuggingFace 🤗 integration for all models in the Hub with few lines of code.
-For models trained in half-precision (aka, either `float16` or `bfloat16`) or full precision. This method aims to reduce `nn.Linear` size by 2 (if trained in half precision) or by 4 if trained in full precision, without affecting too much quality by operating on the outliers in half-precision.
-This technique is useful and works well for billion scale models (>1B parameters) therefore we advice you to use it only for models of that scale. This method has been tested for 2-billion to 176-billion scale models and supports only PyTorch models.
-
-![HFxbitsandbytes.png](https://s3.amazonaws.com/moonup/production/uploads/1659861207959-62441d1d9fdefb55a0b7d12c.png)
-
-Int8 mixed-precision matrix decomposition works by separating a matrix multiplication into two streams: (1) and systematic feature outlier stream matrix multiplied in fp16 (0.01%), (2) a regular stream of int8 matrix multiplication (99.9%). With this method, int8 inference with no predictive degradation is possible for very large models (>=176B parameters).
-Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning).
-
-Note also that you would require a GPU to run mixed-8bit models as the kernels has been compiled for GPUs only. Make sure that you have enough GPU RAM to store the quarter (or half if your model is natively in half precision) of the model before using this feature.
-
-Below are some notes to help you use this module, or follow this demo on Google colab: [![Open In Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing)
-
-#### Requirements
-
-- Make sure you run that on a NVIDIA GPU that supports 8-bit tensor cores (Turing or Ampere GPUs - e.g. T4, RTX20s RTX30s, A40-A100). Note that previous generations of NVIDIA GPUs do not support 8-bit tensor cores.
-- Install the correct version of `bitsandbytes` by running:
-`pip install -i https://test.pypi.org/simple/ bitsandbytes`
-- Install `accelerate`:
-`pip install accelerate`
-
-#### Running mixed-int8 models
-
-After carefully installing the required libraries, the way to load your mixed 8-bit model is as follows:
-```py
-model_name = "bigscience/bloom-2b5"
-model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
-```
-The implementation supports multi-GPU setup thanks to `accelerate` as backend. If you want to control the GPU memory you want to allocate for each GPU, you can use the `max_memory` argument as follows:
-(If allocating `1GB` into GPU-0 and `2GB` into GPU-1, you can use `max_memory={0:"1GB", 1:"2GB"}`)
-```py
-max_memory_mapping = {0: "1GB", 1: "2GB"}
-model_name = "bigscience/bloom-3b"
-model_8bit = AutoModelForCausalLM.from_pretrained(
-    model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping
-)
-```
-
 
 ## ModuleUtilsMixin
 
diff --git a/docs/source/en/perf_train_gpu_one.mdx b/docs/source/en/perf_train_gpu_one.mdx
index 56cd6c6f10e3..32748186a42f 100644
--- a/docs/source/en/perf_train_gpu_one.mdx
+++ b/docs/source/en/perf_train_gpu_one.mdx
@@ -733,3 +733,56 @@ This feature involves 3 different libraries. To install them, please follow the
 - [Torchdynamo installation](https://github.com/pytorch/torchdynamo#requirements-and-setup)
 - [Functorch installation](https://github.com/pytorch/functorch#install)
 - [Torch-TensorRT(FX) installation](https://github.com/pytorch/TensorRT/blob/master/docsrc/tutorials/getting_started_with_fx_path.rst#installation)
+
+## `bitsandbytes` integration for Int8 mixed-precision matrix decomposition
+
+From the paper [`LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale`](https://arxiv.org/abs/2208.07339), we support Hugging Face integration for all models in the Hub with a few lines of code.
+The method reduces the size of `nn.Linear` layers by a factor of 2 for `float16` and `bfloat16` weights and by a factor of 4 for `float32` weights, with close to no impact on quality, by operating on the outliers in half-precision.
+
+![HFxbitsandbytes.png](https://s3.amazonaws.com/moonup/production/uploads/1659861207959-62441d1d9fdefb55a0b7d12c.png)
+
+Int8 mixed-precision matrix decomposition works by separating a matrix multiplication into two streams: (1) a systematic feature outlier stream matrix multiplied in fp16 (0.01%), (2) a regular stream of int8 matrix multiplication (99.9%). With this method, int8 inference with no predictive degradation is possible for very large models.
+For more details on the method, check out the [paper](https://arxiv.org/abs/2208.07339) or our [blogpost about the integration](https://huggingface.co/blog/hf-bitsandbytes-integration).
+
+![MixedInt8.gif](https://s3.amazonaws.com/moonup/production/uploads/1660567469965-62441d1d9fdefb55a0b7d12c.gif)
+
+Note that you need a GPU to run mixed-8bit models, as the kernels have been compiled for GPUs only. Make sure that you have enough GPU memory to store a quarter of the model (or half, if the model weights are natively in half precision) before using this feature.
+Below are some notes to help you use this module, or follow the demos on [Google Colab](#colab-demos).
+
+### Requirements
+
+- Make sure you run this on NVIDIA GPUs that support 8-bit tensor cores (Turing, Ampere or newer architectures, e.g. T4, RTX20s, RTX30s, A40-A100).
+- Install the correct version of `bitsandbytes` by running:
+`pip install "bitsandbytes>=0.31.5"`
+- Install `accelerate` by running:
+`pip install "accelerate>=0.12.0"`
+
+### Running mixed-int8 models
+
+After installing the required libraries, the way to load your mixed 8-bit model is as follows:
+```py
+model_name = "bigscience/bloom-2b5"
+model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
+```
+The current implementation supports a multi-GPU setup through `accelerate`. If you want to control the amount of GPU memory allocated to each GPU, use the `max_memory` argument as follows:
+
+```py
+max_memory_mapping = {0: "1GB", 1: "2GB"}
+model_name = "bigscience/bloom-3b"
+model_8bit = AutoModelForCausalLM.from_pretrained(
+    model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping
+)
+```
+
+In this example, the first GPU will use 1GB of memory and the second 2GB.
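+
+As a quick sanity check, you can compare the memory footprint of the quantized model and run a short generation with it. The following is only a minimal sketch: the checkpoint and the prompt are examples, and `get_memory_footprint` requires a recent `transformers` release:
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_name = "bigscience/bloom-2b5"  # example checkpoint
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
+
+# Approximate memory footprint of the quantized model, in GB
+print(model_8bit.get_memory_footprint() / 1e9)
+
+# Short generation to check that the model works end to end
+inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)
+outputs = model_8bit.generate(**inputs, max_new_tokens=20)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```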
+
+### Colab demos
+
+With this method you can run inference on models that would not previously fit on a Google Colab.
+Check out the demo for running T5-11b (42GB in fp32) with 8-bit quantization on Google Colab:
+
+[![Open In Colab: T5-11b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1YORPWx4okIHXnjW7MSAidXN29mPVNT7F?usp=sharing)
+
+Or this demo for BLOOM-3B:
+
+[![Open In Colab: BLOOM-3b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing)
\ No newline at end of file
diff --git a/tests/mixed_int8/README.md b/tests/mixed_int8/README.md
index c0173bed7a6b..7a0f86dbb256 100644
--- a/tests/mixed_int8/README.md
+++ b/tests/mixed_int8/README.md
@@ -1,37 +1,120 @@
 # Testing mixed int8 quantization
 
+![HFxbitsandbytes.png](https://s3.amazonaws.com/moonup/production/uploads/1660567705337-62441d1d9fdefb55a0b7d12c.png)
+
+The following is a recipe for effectively debugging the `bitsandbytes` integration in Hugging Face `transformers`.
+
+## Library requirements
+
++ `transformers>=4.22.0`
++ `accelerate>=0.12.0`
++ `bitsandbytes>=0.31.5`
+
 ## Hardware requirements
 
-I am using a setup of 2 GPUs that are NVIDIA-Tesla T4 15GB
+The following instructions are tested with 2 NVIDIA Tesla T4 GPUs. To run `bitsandbytes` successfully you need a GPU that supports 8-bit tensor cores. Turing, Ampere or newer architectures (e.g. T4, RTX20s, RTX30s, A40-A100, A6000) should be supported.
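+
+A quick way to check whether your GPUs expose 8-bit tensor cores is to look at their CUDA compute capability (Turing reports 7.5, Ampere 8.0 or higher). This is only a small helper sketch using standard `torch` APIs, assuming `torch` is installed with CUDA support:
+
+```py
+import torch
+
+# Turing GPUs report compute capability 7.5, Ampere 8.0 or higher
+for index in range(torch.cuda.device_count()):
+    name = torch.cuda.get_device_name(index)
+    major, minor = torch.cuda.get_device_capability(index)
+    supported = (major, minor) >= (7, 5)
+    print(f"{name}: compute capability {major}.{minor} -> {'supported' if supported else 'not supported'}")
+```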
 
-## Virutal envs
-
-```conda create --name int8-testing python==3.8```
-```git clone https://github.com/younesbelkada/transformers.git && git checkout integration-8bit```
-```pip install -e ".[dev]"```
-```pip install -i https://test.pypi.org/simple/ bitsandbytes```
-```pip install git+https://github.com/huggingface/accelerate.git@e0212893ea6098cc0a7a3c7a6eb286a9104214c1```
+## Virtual envs
+
+```bash
+conda create --name int8-testing python==3.8
+pip install "bitsandbytes>=0.31.5"
+pip install "accelerate>=0.12.0"
+pip install "transformers>=4.23.0"
+```
+
+If `transformers>=4.23.0` is not released yet, install it from source instead:
+
+```bash
+pip install git+https://github.com/huggingface/transformers.git
+```
 
-## Trobleshooting
-
-```conda create --name int8-testing python==3.8```
-```pip install -i https://test.pypi.org/simple/ bitsandbytes```
-```conda install pytorch torchvision torchaudio -c pytorch```
-```git clone https://github.com/younesbelkada/transformers.git && git checkout integration-8bit```
-```pip install -e ".[dev]"```
-```pip install git+https://github.com/huggingface/accelerate.git@b52b793ea8bac108ba61192eead3cf11ca02433d```
-
-### Check driver settings:
-```
-nvcc --version
-```
-```
-ls -l $CONDA_PREFIX/lib/libcudart.so
-```
+## Troubleshooting
+
+A list of common errors:
+
+### Torch does not correctly perform operations on the GPU
+
+First check that the following snippet runs without any error:
+
+```py
+import torch
+vec = torch.randn(1, 2, 3).to(0)
+```
+
+If it fails, install torch with `conda`, for example:
+
+```bash
+conda create --name int8-testing python==3.8
+conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
+pip install "bitsandbytes>=0.31.5"
+pip install "accelerate>=0.12.0"
+pip install "transformers>=4.23.0"
+```
+
+For the latest PyTorch installation instructions, see [the official guide](https://pytorch.org/get-started/locally/). After that, the snippet above should work.
+
+### `bitsandbytes operations are not supported under CPU!`
+
+This happens when some `Linear` weights are offloaded to the CPU when using `accelerate`. Carefully check `model.hf_device_map` and make sure that no `Linear` module is assigned to the CPU. It is fine to have the last module (usually the `lm_head`) set on the CPU.
+
+### `To use the type as a Parameter, please correct the detach() semantics defined by __torch_dispatch__() implementation.`
+
+Use the latest version of `accelerate` with a command such as `pip install -U accelerate` and the problem should be solved.
+
+### `Parameter has no attribue .CB`
+
+Same solution as above.
+
+### `RuntimeError: CUDA error: an illegal memory access was encountered ... consider passing CUDA_LAUNCH_BLOCKING=1`
+
+Run your script prepended with `CUDA_LAUNCH_BLOCKING=1` and you should observe an error as described in the next section.
+
+### `CUDA illegal memory error: an illegal memory access at line...`
+
+Check the CUDA version with:
+
+```bash
+nvcc --version
+```
+
+and confirm that it is the same version as the one detected by `bitsandbytes`. If not, run:
+
+```bash
+ls -l $CONDA_PREFIX/lib/libcudart.so
+```
+
+or
+
+```bash
+ls -l $LD_LIBRARY_PATH
+```
+
+Check that the `libcudart.so` symlink is set correctly. Sometimes `nvcc` detects the correct CUDA version but `bitsandbytes` doesn't; in that case, make sure that the `libcudart.so` symlink points to the correct CUDA runtime library.
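+
+For example, assuming a conda-based install (adjust the path if your CUDA runtime lives elsewhere), you can print where the symlink actually resolves with:
+
+```bash
+# Print the real file the libcudart.so symlink points to
+readlink -f $CONDA_PREFIX/lib/libcudart.so
+```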
+
+Here is an example of a badly configured CUDA installation:
+
+`nvcc --version` gives:
+
+![Screenshot 2022-08-15 at 15.12.23.png](https://s3.amazonaws.com/moonup/production/uploads/1660569220888-62441d1d9fdefb55a0b7d12c.png)
+
+which means that the detected CUDA version is 11.3, but `bitsandbytes` outputs:
+
+![image.png](https://s3.amazonaws.com/moonup/production/uploads/1660569284243-62441d1d9fdefb55a0b7d12c.png)
+
+First check:
+
+```bash
+echo $LD_LIBRARY_PATH
+```
+
+If this contains multiple paths separated by `:`, then you have to make sure that the correct CUDA version is set by running:
+
+```bash
+ls -l $path/libcudart.so
+```
+
+on each path (`$path`) in the list. If it contains a single path, simply run:
+
+```bash
+ls -l $LD_LIBRARY_PATH/libcudart.so
+```
+
+and you will see something like:
+
-### Recurrent bugs
-
-Sometimes you have to run a "dummy" inference pass when dealing with a multi-GPU setup. Checkout the ```test_multi_gpu_loading``` and the ```test_pipeline``` functions.
\ No newline at end of file
+![Screenshot 2022-08-15 at 15.12.33.png](https://s3.amazonaws.com/moonup/production/uploads/1660569176504-62441d1d9fdefb55a0b7d12c.png)
+
+If you see that the file is linked to the wrong CUDA version (here 10.2), find the correct location for `libcudart.so` (for example with `find / -name libcudart.so`) and point the `LD_LIBRARY_PATH` environment variable to the directory containing the correct `libcudart.so` file.
\ No newline at end of file