I used the following command to run the LLaMA2-13B model:
CUDA_VISIBLE_DEVICES=0 python llama.py /mnt/llama2-13b wikitext2 --wbits 4 --load sq-llama-13b-w4-s0.45.pt --include_sparse --eval
The --load option loads the packed (quantized) model. I followed the README instructions step by step, but when I ran this evaluation, the speed was extremely slow. I thought my own model might have been corrupted during quantization, so I also downloaded the pre-quantized model from Hugging Face referenced in the README (sq-llama-13b-w4-s45), but it is still just as slow.
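To put a number on "slow", a minimal timing check like the one below could be used (this is only a sketch; `model` is assumed to be the packed SqueezeLLM model already loaded on cuda:0 the way llama.py builds it, and the 32000 vocab size is the usual LLaMA-2 value):

```python
import time
import torch

@torch.no_grad()
def time_forward(model, input_ids, n_warmup=3, n_runs=10):
    """Average latency (seconds) of one forward pass over input_ids of shape [1, seq_len]."""
    input_ids = input_ids.to(next(model.parameters()).device)
    for _ in range(n_warmup):      # warm-up so CUDA kernels are loaded/cached
        model(input_ids)
    torch.cuda.synchronize()       # make sure all pending GPU work is done before timing
    start = time.time()
    for _ in range(n_runs):
        model(input_ids)
    torch.cuda.synchronize()       # wait for the GPU to finish before reading the clock
    return (time.time() - start) / n_runs

# Example usage with a dummy 2048-token input (assumes `model` is already loaded):
# dummy = torch.randint(0, 32000, (1, 2048))
# print(f"avg forward latency: {time_forward(model, dummy):.3f} s")
```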
I don't know why this slowdown persists. What can I do to resolve it?