Further speeding up the quantization process #67

@SyphonArch

Description

I previously contributed a pull request that reduced the runtime of the main clustering algorithm from over two hours to just six minutes for the Llama 2 7B model (#60). In the 'Further Suggestions' section of that PR, I mentioned potential optimizations that exploit the 1D nature of the task.

I'm excited to share that I've developed a Python package, flash1dkmeans, which implements a faster 1D K-means algorithm. The package is now used in the Any-Precision LLM project, a variable bit-rate quantization scheme that uses SqueezeLLM as its seed model. With this new implementation, we've reduced the execution time for SqueezeLLM to 38 seconds on an i9-13900K machine, roughly another tenfold speedup.

If you're interested in integrating this speed enhancement, the code in Any-Precision LLM, where we use the package to create the seed model, can serve as an example. For maximum performance gains, consider accelerating the caller function with @numba.njit(parallel=True). However, even a standard multiprocessing pool should yield significant improvements (see the sketch below).
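To illustrate the multiprocessing route, here is a minimal sketch of parallelizing the per-channel weighted 1D K-means that SqueezeLLM performs. The `cluster_row` helper and the array names are mine for illustration, not taken from either codebase; only the sklearn and multiprocessing calls are standard.

```python
import numpy as np
from multiprocessing import Pool
from sklearn.cluster import KMeans


def cluster_row(args):
    """Weighted 1D K-means for one weight row (illustrative helper, not from SqueezeLLM)."""
    row, sensitivity, n_clusters = args
    km = KMeans(n_clusters=n_clusters, n_init=1).fit(
        row.reshape(-1, 1),          # sklearn expects 2D input, so reshape the 1D row
        sample_weight=sensitivity,   # per-weight sensitivity as sample weights
    )
    return km.cluster_centers_.flatten(), km.labels_


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.normal(size=(64, 4096))        # stand-in for one layer's weight rows
    sensitivities = rng.random(size=(64, 4096))  # stand-in for the sensitivity values
    n_clusters = 8                               # e.g. 3-bit quantization

    tasks = [(w, s, n_clusters) for w, s in zip(weights, sensitivities)]
    with Pool() as pool:
        results = pool.map(cluster_row, tasks)   # one row per worker task
```

Replacing the KMeans call inside such a helper with the flash1dkmeans equivalent is where the additional speedup would come from.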

This package can serve as an almost drop-in replacement for sklearn's K-means if you're looking to speed up SqueezeLLM further (a rough sketch of the swap is below). Of course, sticking with sklearn for better transparency is perfectly fine too. I wanted to share these findings, as your work helped make ours possible 👍.
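For the drop-in idea, here is a rough sketch of what the swap could look like. The flash1dkmeans import, function name, and keyword arguments below are placeholders I'm using for illustration; please check the package's README or the Any-Precision LLM code for the actual entry point and signature.

```python
import numpy as np
# NOTE: function name and keyword arguments are assumptions for illustration only;
# consult the flash1dkmeans README for the real API.
from flash1dkmeans import kmeans_1d  # hypothetical import


def cluster_row_flash(row, sensitivity, n_clusters):
    """Same weighted 1D clustering as the sklearn helper above, sketched with flash1dkmeans."""
    centroids, labels = kmeans_1d(
        row,                          # data is already 1D, so no reshape is needed
        n_clusters,
        sample_weights=sensitivity,   # assumed keyword for per-sample weights
    )
    return centroids, labels
```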
