How to quantize a 70B model so it fits on 2x 4090 GPUs:
I tried EXL2, AutoAWQ, and SqueezeLLM, and they all failed for different reasons (I opened issues for each).
HQQ worked:
I rented a 4x GPU, 1 TB RAM instance on RunPod ($19/hr) with 1024 GB of container disk and 1024 GB of workspace disk.
I think you only need 2x GPUs with 80 GB of VRAM each and 512 GB+ of system RAM, so I probably overpaid.
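As a rough sanity check on the title claim (my own back-of-the-envelope arithmetic, not from the HQQ docs), the packed 4-bit weights plus group-wise metadata come in well under the 48 GB of combined VRAM on two 4090s:
```python
# Rough VRAM estimate for a 4-bit 70B model (ignores KV cache and activations).
params = 70e9
weights_gb = params * 4 / 8 / 1e9        # ~35 GB of packed 4-bit weights
meta_gb = params / 64 * 2 * 1 / 1e9      # ~2 GB, assuming scales/zeros stored at ~8 bits per group of 64
print(f"~{weights_gb + meta_gb:.0f} GB vs 48 GB across two RTX 4090s")
```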
Note that you need to fill in Meta's access form on Hugging Face before you can download the 70B weights.
You can copy/paste the following into the console and it will set everything up automatically:
```bash
apt update
apt install git-lfs vim -y

# Install Miniconda and create a clean Python environment for HQQ
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
~/miniconda3/bin/conda init bash
source ~/.bashrc
conda create -n hqq python=3.10 -y && conda activate hqq

# Install HQQ from source
git lfs install
git clone https://github.com/mobiusml/hqq.git
cd hqq
pip install torch
pip install .

# Enable faster Hugging Face downloads, then log in with an account that has Llama 3 access
pip install "huggingface_hub[hf_transfer]"
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli login
```
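The last command prompts for a Hugging Face access token (create one at https://huggingface.co/settings/tokens on an account that has been granted Llama 3 access). If you prefer a non-interactive setup, you can pass the token directly; this sketch assumes you have exported `HF_TOKEN` yourself:
```bash
# Optional alternative to the interactive login; HF_TOKEN is assumed to be set by you.
huggingface-cli login --token "$HF_TOKEN"
```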
Create a `quantize.py` file by copy/pasting this into the console:
```bash
cat > quantize.py << 'EOF'
import torch

model_id = 'meta-llama/Meta-Llama-3-70B-Instruct'
save_dir = 'cat-llama-3-70b-hqq'
compute_dtype = torch.bfloat16

# 4-bit weights, group size 64; offload_meta keeps the quantization
# metadata (scales/zeros) on CPU to save VRAM.
from hqq.core.quantize import *
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)

# Quantize the scales and zero-points as well, with a larger group size.
zero_scale_group_size = 128
quant_config['scale_quant_params']['group_size'] = zero_scale_group_size
quant_config['zero_quant_params']['group_size'] = zero_scale_group_size

# Load the full-precision model, quantize it, and save the result.
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
model = HQQModelForCausalLM.from_pretrained(model_id)

from hqq.models.hf.base import AutoHQQHFModel
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=compute_dtype)
AutoHQQHFModel.save_quantized(model, save_dir)

# Reload the quantized model as a sanity check.
model = AutoHQQHFModel.from_quantized(save_dir)
model.eval()
EOF
```
Run the script:
```bash
python quantize.py
```
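The quantized weights end up in `cat-llama-3-70b-hqq/`. A quick way to confirm the on-disk footprint (it should land well under the ~140 GB of the bf16 checkpoint):
```bash
du -sh cat-llama-3-70b-hqq
```
To actually generate with the quantized model, here is a minimal sketch (my addition, not part of the original script; it assumes `AutoTokenizer` from `hqq.engine.hf` behaves like the standard transformers tokenizer, that `from_quantized` places the model on GPU, and that the usual `generate` API is available):
```python
import torch
from hqq.engine.hf import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-70B-Instruct')
model = AutoHQQHFModel.from_quantized('cat-llama-3-70b-hqq')
model.eval()

# Plain prompt as a smoke test; the proper Llama 3 Instruct chat template is omitted here.
inputs = tokenizer('Explain HQQ quantization in one sentence.', return_tensors='pt').to('cuda')
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```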