Samey measures the diversity, repetition, templating, and topic coverage of text datasets. Fast, and designed to run on CPU.
```bash
pip install samey
```

```python
from samey import Samey
import pandas as pd

data = pd.read_json("my_dataset.jsonl", lines=True)

model = Samey()
report = model.score(data, text="prompt", topic="category")
print(report.summary)

# OR as a single number:
print("Diversity:", report.diversity_score['score'])
```

```python
import samey as sl

report = sl.score(df, text="prompt", topic="topic")
print(report.summary)
```

```python
report = sl.score_dpo(df, prompt="prompt", chosen="chosen", rejected="rejected")
print(report.to_markdown())
```

```python
report = sl.score(df, text=["prompt", "response"])
report.to_json("diversity_report.json")
```

Samey computes 8 metrics:
| Metric | What it measures | Healthy range |
|---|---|---|
| Compression Ratio | Global repetition via gzip | 0.3-0.5 |
| Near-Duplicate Rate | MinHash/LSH duplicates | < 0.1 |
| Template Dominance | Skeleton detection | < 0.1 (top skeleton share) |
| N-gram Repetition | Boilerplate via repeated 6-10 grams | < 0.2 |
| Topic Coverage | Topic entropy | > 0.8 (1=uniform) |
| Style Diversity | Char n-gram clustering | < 0.2 (largest cluster) |
| Semantic Diversity | Embedding-based concept spread | > 0.5 (higher=more diverse) |
| Distinct-N | Lexical diversity | > 0.5 for distinct-1/2/3 |
```python
model = Samey(
    length_mode="truncate",  # "truncate", "window", or "none"
    max_chars=512,
    shingle_size=5,
    lsh_threshold=0.85,
    max_sample=50_000,
    ngram_min=6,
    ngram_max=10,
    style_n_clusters=20,
    # Semantic diversity settings
    semantic_method="tfidf",  # "tfidf" (fast) or "embedding" (better)
    semantic_model="paraphrase-MiniLM-L3-v2",  # only for method="embedding"
    semantic_max_sample=1000,
    enable_semantic=True,
    seed=42,
)
```

```python
report = model.score(df, text="prompt")

report.summary          # Key metrics dict
report.table            # pandas DataFrame
report.diversity_score  # Aggregated 0-100 score
report.print_score()    # Formatted score report
report.to_json("report.json")
report.to_markdown()
```

Get a single 0-100 score combining all metrics:
```python
report = model.score(df, text="prompt")
report.print_score()
```

Output:

```
DIVERSITY SCORE: 97.8/100 (A)

Metric Breakdown (1.0 = best):
  compression_ratio    ██████████████████░░ 0.92 ✓
  near_duplicate_rate  ████████████████████ 1.00 ✓
  distinct_2           ████████████████████ 1.00 ✓
  ...

✅ No significant issues detected!
```
Access programmatically:

```python
ds = report.diversity_score
print(ds['score'])      # 97.8
print(ds['issues'])     # List of detected problems
print(ds['breakdown'])  # Per-metric normalized scores
```

```python
model = Samey(max_chars=256, lsh_threshold=0.9)
model.save("my_config")
model = Samey.load("my_config")
```

Concatenates all texts and computes gzip_bytes / raw_bytes. Repetitive content compresses better.
Uses character 5-gram shingles, MinHash signatures (128 perms), and LSH to find texts with Jaccard similarity >= 0.85.
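For intuition, here is the shingling step with exact Jaccard similarity computed brute-force (Samey instead approximates this with MinHash signatures and LSH so it scales to large datasets; the example pair below is made up):

```python
def shingles(text, k=5):
    """Set of character k-grams ("shingles") of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the two shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

near_dup = jaccard(
    "Please summarize the following article for me.",
    "Please summarize the following articles for me.",
)
unrelated = jaccard(
    "Please summarize the following article for me.",
    "Translate this sentence into French.",
)
print(near_dup, unrelated)  # the near-duplicate pair scores far higher
```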
"Skeletonizes" texts by replacing URLs, numbers, emails, code blocks, quoted strings with tags. Then measures skeleton distribution.
Finds word 6-10 grams appearing in 2+ different rows.
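A brute-force sketch of that check, counting only 6-grams for simplicity rather than the full 6-10 range:

```python
from collections import defaultdict

def repeated_ngrams(rows, n=6):
    """Word n-grams that appear in at least two different rows."""
    seen = defaultdict(set)
    for i, row in enumerate(rows):
        words = row.split()
        for j in range(len(words) - n + 1):
            seen[tuple(words[j:j + n])].add(i)
    return {gram for gram, owners in seen.items() if len(owners) >= 2}

rows = [
    "As an AI language model I cannot browse the internet.",
    "As an AI language model I cannot give medical advice.",
    "The capital of France is Paris.",
]
for gram in repeated_ngrams(rows):
    print(" ".join(gram))  # shared boilerplate openings
```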
Normalized entropy of topic labels (0 = one topic, 1 = uniform).
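The computation is standard Shannon entropy over label frequencies, normalized by log(k). A minimal sketch:

```python
import math
from collections import Counter

def topic_coverage(labels):
    """Entropy of the label distribution, normalized by log(k) so that
    1.0 means perfectly uniform coverage of the k observed topics."""
    counts = Counter(labels)
    if len(counts) <= 1:
        return 0.0  # a single topic carries no entropy
    total = len(labels)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))

print(topic_coverage(["math", "code", "history", "science"]))  # 1.0: uniform
print(topic_coverage(["math"] * 97 + ["code", "history", "science"]))  # low: one topic dominates
```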
Character 3-5 gram TF-IDF + MiniBatchKMeans clustering.
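A rough sketch of that pipeline, assuming scikit-learn is installed (the cluster count and corpus here are illustrative; Samey uses `style_n_clusters` and its own sampling):

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

# Ten near-identical templated rows plus a few free-form ones.
templated = [f"Q: define term {i}. A: it means thing {i}." for i in range(10)]
freeform = [
    "Honestly, the weather today reminded me of childhood summers.",
    "Quantum tunnelling lets particles cross classically forbidden barriers.",
    "She packed three books, a flashlight, and far too many snacks.",
]
texts = templated + freeform

# Character 3-5 gram TF-IDF, then k-means; the largest cluster's share
# of the corpus is the style-diversity signal (lower is healthier).
vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 5))
X = vec.fit_transform(texts)
km = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=3).fit(X)
largest_share = Counter(km.labels_).most_common(1)[0][1] / len(texts)
print(largest_share)  # high here, since one style dominates
```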
unique_ngrams / total_ngrams for unigrams, bigrams, trigrams.
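The ratio is straightforward to compute directly; a minimal sketch with whitespace tokenization (Samey's tokenizer may differ):

```python
def distinct_n(texts, n):
    """unique n-grams / total n-grams across the corpus."""
    grams = []
    for t in texts:
        words = t.split()
        grams.extend(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return len(set(grams)) / len(grams) if grams else 0.0

texts = ["the cat sat on the mat", "the dog sat on the rug"]
print(distinct_n(texts, 1))  # 7 unique / 12 total ≈ 0.58
print(distinct_n(texts, 2))  # 8 unique / 10 total = 0.8
```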
Two methods are available:

- TF-IDF (default): fast, uses word/bigram TF-IDF vectors. No extra dependencies.
- Embedding: uses the `paraphrase-MiniLM-L3-v2` sentence transformer. Better at catching paraphrases and synonyms, but slower.

```python
# Fast TF-IDF (default)
model = Samey(semantic_method="tfidf")

# Embedding-based (needs sentence-transformers)
model = Samey(semantic_method="embedding")
```