
Remove sentence transformers #152

Merged
jmsevin merged 8 commits into main from remove-sentence-transformers on Jun 24, 2026

@jmsevin jmsevin commented Jun 24, 2026

Contributor

This pull request migrates the embedding and keyword extraction code from sentence-transformers to HuggingFace Transformers, updating both the core implementation and the related tests. The main changes are refactored embedding model loading and inference, keyword extraction that uses HuggingFace pipelines, and tests adapted to mock the new interfaces.

Migration to HuggingFace Transformers:

  • The embedding_model_helpers module now uses HuggingFace's AutoModel and AutoTokenizer for loading embedding models instead of SentenceTransformer. The new _compute_embeddings helper performs embedding extraction using HuggingFace models, and create_content_slices has been updated to use this workflow. (welearn_datastack/modules/embedding_model_helpers.py) [1] [2] [3] [4] [5]
  • The load_embedding_model function now returns both the model and tokenizer, checks for local directory existence, and stores models in a new format. (welearn_datastack/modules/embedding_model_helpers.py)
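
The two bullets above could look roughly like the sketch below. It is not the code from this PR: the function names, the `org__model` directory naming, the use of `save_pretrained` for local storage, and attention-masked mean pooling are all assumptions made for illustration, since the description does not state how `_compute_embeddings` pools token outputs or what the new storage format is.

```python
from __future__ import annotations

from pathlib import Path


def resolve_model_source(model_name: str, models_dir: Path) -> tuple[str, bool]:
    """Return (source, is_local): a local directory if it exists, else the hub name.

    The "org__model" directory naming is a hypothetical convention.
    """
    local_dir = models_dir / model_name.replace("/", "__")
    if local_dir.is_dir():
        return str(local_dir), True
    return model_name, False


def load_model_and_tokenizer(model_name: str, models_dir: Path):
    """Load a model and tokenizer, preferring a local copy (assumed behaviour)."""
    # Imported lazily so the path logic above works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    source, is_local = resolve_model_source(model_name, models_dir)
    tokenizer = AutoTokenizer.from_pretrained(source)
    model = AutoModel.from_pretrained(source)
    if not is_local:
        # Keep a local copy so the next call finds the directory.
        target = models_dir / model_name.replace("/", "__")
        tokenizer.save_pretrained(target)
        model.save_pretrained(target)
    model.eval()
    return model, tokenizer


def mean_pool_embeddings(texts: list[str], model, tokenizer):
    """Embed texts by averaging token vectors, ignoring padding (assumed pooling)."""
    import torch

    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (batch, tokens, dim)
    mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return (summed / counts).cpu().numpy()
```

Mean pooling is chosen here only because it is a common default for sentence embeddings; a model trained with CLS pooling or with normalized outputs would need a different final step.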

Keyword extraction pipeline update:

  • The keywords_extractor module now uses HuggingFace's pipeline("feature-extraction") for KeyBERT, aligning with the new embedding backend. (welearn_datastack/modules/keywords_extractor.py) [1] [2]
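
As a rough illustration of the bullet above, KeyBERT accepts a HuggingFace `feature-extraction` pipeline as its embedding backend. The sketch below assumes that wiring; `build_keyword_model` and `top_keywords` are hypothetical helper names, and the model name is left as a parameter because the PR description does not say which model is used.

```python
from __future__ import annotations


def build_keyword_model(model_name: str):
    """Build a KeyBERT instance backed by a HuggingFace pipeline (illustrative)."""
    # Imported lazily so top_keywords below can be used without these packages.
    from keybert import KeyBERT
    from transformers import pipeline

    extractor = pipeline("feature-extraction", model=model_name)
    return KeyBERT(model=extractor)


def top_keywords(scored: list[tuple[str, float]], n: int) -> list[str]:
    """Return the n highest-scoring keywords from (keyword, score) pairs."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [keyword for keyword, _ in ranked[:n]]
```

For a single document, `build_keyword_model(name).extract_keywords(text)` returns a list of `(keyword, score)` pairs, which is the shape `top_keywords` expects.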

Dependency and test updates:

  • The transformers library is added as a dependency, and sentence-transformers is removed. (pyproject.toml)
  • Unit tests for embedding and keyword extraction have been refactored to mock the new HuggingFace-based interfaces and behaviors, ensuring compatibility with the updated implementation. (tests/document_vectorizer/test_embedding_model_helpers.py, tests/keywords_extractor/test_keywords_extractor.py) [1] [2] [3] [4] [5] [6]
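
A minimal sketch of what mocking the new interface can look like, assuming a simplified loader. The `load_embedding_model` below is a hypothetical stand-in that takes the model and tokenizer classes as arguments; the real tests in this PR may patch `AutoModel` and `AutoTokenizer` differently.

```python
import unittest
from unittest.mock import MagicMock


def load_embedding_model(name, model_cls, tokenizer_cls):
    """Hypothetical stand-in for the real loader; classes are injected for testing."""
    return model_cls.from_pretrained(name), tokenizer_cls.from_pretrained(name)


class TestLoadEmbeddingModel(unittest.TestCase):
    def test_returns_model_and_tokenizer(self):
        # MagicMock objects stand in for AutoModel and AutoTokenizer,
        # so no weights are downloaded during the test.
        fake_model_cls = MagicMock(name="AutoModel")
        fake_tokenizer_cls = MagicMock(name="AutoTokenizer")

        model, tokenizer = load_embedding_model(
            "some-model", fake_model_cls, fake_tokenizer_cls
        )

        fake_model_cls.from_pretrained.assert_called_once_with("some-model")
        fake_tokenizer_cls.from_pretrained.assert_called_once_with("some-model")
        self.assertIs(model, fake_model_cls.from_pretrained.return_value)
        self.assertIs(tokenizer, fake_tokenizer_cls.from_pretrained.return_value)
```

Mocking at the `from_pretrained` boundary keeps the test fast and offline while still checking that the loader returns both objects.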

Minor formatting and cleanup:

  • Minor code formatting improvements and cleanups in various test files for readability. [1] [2] [3] [4]

These changes modernize the embedding and keyword extraction stack, improve maintainability, and align the code with current best practices for NLP model usage.

@jmsevin jmsevin requested a review from lpi-tn June 24, 2026 13:39
Comment thread welearn_datastack/modules/embedding_model_helpers.py
Comment thread welearn_datastack/modules/embedding_model_helpers.py
Comment thread welearn_datastack/modules/embedding_model_helpers.py Outdated
Comment thread pyproject.toml
@jmsevin jmsevin merged commit 93cc4c2 into main Jun 24, 2026
7 checks passed
@lpi-tn lpi-tn deleted the remove-sentence-transformers branch June 24, 2026 15:20

2 participants