Remove sentence transformers by jmsevin · Pull Request #152 · CyberCRI/welearn-datastack

jmsevin · 2026-06-24T13:39:16Z

This pull request migrates the embedding and keyword extraction code from sentence-transformers to HuggingFace Transformers, updating both the core implementation and its tests. The main changes refactor embedding model loading and inference, switch keyword extraction to HuggingFace pipelines, and adapt the tests to mock the new interfaces.

Migration to HuggingFace Transformers:

The embedding_model_helpers module now uses HuggingFace's AutoModel and AutoTokenizer for loading embedding models instead of SentenceTransformer. The new _compute_embeddings helper performs embedding extraction using HuggingFace models, and create_content_slices has been updated to use this workflow. (welearn_datastack/modules/embedding_model_helpers.py)
The load_embedding_model function now returns both the model and tokenizer, checks for local directory existence, and stores models in a new format. (welearn_datastack/modules/embedding_model_helpers.py)
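
A minimal sketch of the AutoModel/AutoTokenizer workflow described above. This is not the code from this PR: the function names `masked_mean_pool` and `embed_texts`, the `model_name_or_path` parameter, and the choice of attention-mask mean pooling are assumptions made for illustration. The pooling is written in plain Python so the arithmetic is easy to follow; production code would more likely keep it in torch.

```python
from typing import List, Sequence


def masked_mean_pool(
    token_vectors: Sequence[Sequence[float]], attention_mask: Sequence[int]
) -> List[float]:
    """Average the token vectors whose attention-mask entry is non-zero."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed_texts(texts: Sequence[str], model_name_or_path: str) -> List[List[float]]:
    """Hypothetical helper: one pooled vector per input text."""
    # Imported lazily so masked_mean_pool stays usable without torch installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    model = AutoModel.from_pretrained(model_name_or_path)
    model.eval()  # inference only: disable dropout

    batch = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # shape: (batch, tokens, dim)

    # Assumption: mean pooling over non-padding tokens; the PR does not state
    # which pooling strategy _compute_embeddings actually uses.
    return [
        masked_mean_pool(vectors, mask)
        for vectors, mask in zip(hidden.tolist(), batch["attention_mask"].tolist())
    ]
```

Padding tokens are excluded through the attention mask so that short texts in a padded batch are not averaged together with padding vectors.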

Keyword extraction pipeline update:

The keywords_extractor module now uses HuggingFace's pipeline("feature-extraction") for KeyBERT, aligning with the new embedding backend. (welearn_datastack/modules/keywords_extractor.py)
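
The wiring between KeyBERT and a feature-extraction pipeline might look roughly like the sketch below. The function names `keyword_strings` and `extract_keywords`, the `model_name_or_path` parameter, and the `top_n` default are hypothetical and do not come from this PR; only the idea of passing a `pipeline("feature-extraction")` object to KeyBERT is taken from the description above.

```python
from typing import List, Sequence, Tuple


def keyword_strings(scored: Sequence[Tuple[str, float]]) -> List[str]:
    """Drop the similarity scores, keeping the keywords in ranked order."""
    return [keyword for keyword, _score in scored]


def extract_keywords(text: str, model_name_or_path: str, top_n: int = 5) -> List[str]:
    """Hypothetical helper: keywords for a single document."""
    # Imported lazily so keyword_strings stays usable without the heavy dependencies.
    from keybert import KeyBERT
    from transformers import pipeline

    # KeyBERT accepts a feature-extraction pipeline as its embedding backend.
    extractor = pipeline("feature-extraction", model=model_name_or_path)
    kw_model = KeyBERT(model=extractor)

    # For a single document, extract_keywords returns (keyword, score) pairs.
    scored = kw_model.extract_keywords(text, top_n=top_n)
    return keyword_strings(scored)
```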

Dependency and test updates:

The transformers library is added as a dependency, and sentence-transformers is removed. (pyproject.toml)
Unit tests for embedding and keyword extraction have been refactored to mock the new HuggingFace-based interfaces and behaviors, ensuring compatibility with the updated implementation. (tests/document_vectorizer/test_embedding_model_helpers.py, tests/keywords_extractor/test_keywords_extractor.py)
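
One way to test a loader without downloading any model is to replace the HuggingFace classes with mocks. The sketch below is an illustration, not the tests from this PR: `load_model_and_tokenizer` is a hypothetical stand-in that receives the classes as arguments, whereas the real tests may instead patch `AutoModel` and `AutoTokenizer` inside the module under test.

```python
from unittest.mock import MagicMock


def load_model_and_tokenizer(path, model_cls, tokenizer_cls):
    """Hypothetical loader: returns (model, tokenizer) built from the given classes."""
    tokenizer = tokenizer_cls.from_pretrained(path)
    model = model_cls.from_pretrained(path)
    model.eval()  # inference mode
    return model, tokenizer


def run_mocked_load() -> bool:
    # MagicMock objects stand in for transformers.AutoModel / AutoTokenizer.
    fake_model_cls = MagicMock(name="AutoModel")
    fake_tokenizer_cls = MagicMock(name="AutoTokenizer")

    model, tokenizer = load_model_and_tokenizer(
        "some/dir", fake_model_cls, fake_tokenizer_cls
    )

    # Each class was asked exactly once to load from the given path.
    fake_model_cls.from_pretrained.assert_called_once_with("some/dir")
    fake_tokenizer_cls.from_pretrained.assert_called_once_with("some/dir")
    model.eval.assert_called_once_with()

    # The loader hands back exactly what from_pretrained produced.
    return (
        model is fake_model_cls.from_pretrained.return_value
        and tokenizer is fake_tokenizer_cls.from_pretrained.return_value
    )
```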

Minor formatting and cleanup:

Minor code formatting improvements and cleanups in various test files for readability.

These changes modernize the embedding and keyword extraction stack, improve maintainability, and align the code with current best practices for NLP model usage.

jmsevin added 4 commits June 23, 2026 17:59

Update poetry requirements

32a9ffb

Remove sentence_transformers

8f9658f

Update tests

4b04df5

Fix linter issue

9182b0b

jmsevin requested a review from lpi-tn June 24, 2026 13:39

lpi-tn reviewed Jun 24, 2026

Comment thread welearn_datastack/modules/embedding_model_helpers.py

lpi-tn reviewed Jun 24, 2026

Comment thread welearn_datastack/modules/embedding_model_helpers.py

Add docstring

1cc5fc5

lpi-tn approved these changes Jun 24, 2026

Comment thread welearn_datastack/modules/embedding_model_helpers.py Outdated

Comment thread pyproject.toml

jmsevin added 3 commits June 24, 2026 16:43

Add comments

0b8841e

Add type hint

8e639b0

Update requirements

6f6afb5

jmsevin merged commit 93cc4c2 into main Jun 24, 2026
7 checks passed

lpi-tn deleted the remove-sentence-transformers branch June 24, 2026 15:20

jmsevin merged 8 commits into main from remove-sentence-transformers
