I used the following command to run the LLaMA2-13B model:
CUDA_VISIBLE_DEVICES=0 python llama.py /mnt/llama2-13b wikitext2 --wbits 4 --load sq-llama-13b-w4-s0.45.pt --include_sparse --eval
The --load option loads the packed (quantized) model. I followed the README instructions step by step, but when I ran this evaluation, the speed was extremely slow. I thought my own model might have been corrupted during quantization, so I also downloaded the pre-quantized model from Hugging Face referenced in the README (sq-llama-13b-w4-s45), but it is still just as slow.
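To put a number on "slow", a minimal timing check like the one below could be used (this is only a sketch; `model` is assumed to be the packed SqueezeLLM model already loaded on cuda:0 the way llama.py builds it, and the 32000 vocab size is the usual LLaMA-2 value):

```python
import time
import torch

@torch.no_grad()
def time_forward(model, input_ids, n_warmup=3, n_runs=10):
    """Average latency (seconds) of one forward pass over input_ids of shape [1, seq_len]."""
    input_ids = input_ids.to(next(model.parameters()).device)
    for _ in range(n_warmup):      # warm-up so CUDA kernels are loaded/cached
        model(input_ids)
    torch.cuda.synchronize()       # make sure all pending GPU work is done before timing
    start = time.time()
    for _ in range(n_runs):
        model(input_ids)
    torch.cuda.synchronize()       # wait for the GPU to finish before reading the clock
    return (time.time() - start) / n_runs

# Example usage with a dummy 2048-token input (assumes `model` is already loaded):
# dummy = torch.randint(0, 32000, (1, 2048))
# print(f"avg forward latency: {time_forward(model, dummy):.3f} s")
```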
I don't know why this slowdown persists. What can I do to resolve it?