When I use SqueezeLLM to quantize the LLaMA2-13B model and test it, the speed is extremely slow. #71

@zhangfzR

Description

I used the following command to run the LLaMA2-13B model:
CUDA_VISIBLE_DEVICES=0 python llama.py /mnt/llama2-13b wikitext2 --wbits 4 --load sq-llama-13b-w4-s0.45.pt --include_sparse --eval
The --load option loads the model after packing. I followed the README instructions step by step, but when I ran this command, evaluation was extremely slow. I suspected the model had been corrupted during quantization, so I downloaded the checkpoint referenced in the README (sq-llama-13b-w4-s45) from Hugging Face instead, but it is still just as slow.
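
To quantify the slowdown, here is a minimal sketch of the kind of throughput check I ran; it times greedy generation and reports tokens per second. It loads the FP16 baseline through Hugging Face transformers for comparison; the prompt, token count, and model path are placeholders, and llama.py's own loading code would stand in for from_pretrained when timing the quantized checkpoint.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path (the FP16 baseline from my command above);
# llama.py's quantized loading would replace this for the 4-bit model.
model_path = "/mnt/llama2-13b"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16
).cuda().eval()

prompt = "The capital of France is"  # arbitrary short prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# Synchronize around generate() so the timing covers actual GPU work.
torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=32, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[1] - input_ids.shape[1]
print(f"{new_tokens} new tokens in {elapsed:.2f}s "
      f"({new_tokens / elapsed:.2f} tok/s)")
```

Running the same probe against the quantized model (loaded the way llama.py loads it) makes the gap versus FP16 concrete.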

I don’t know why this issue persists. What can I do to resolve it?
