tabularis-ai/samey

Samey

Dataset Diversity Scoring

for synthetic instruction data (SFT/DPO)

Samey measures the diversity, repetition, templating, and topic coverage of text datasets. It is fast and designed to run on CPU.

Installation

pip install samey

Quickstart

from samey import Samey
import pandas as pd

data = pd.read_json("my_dataset.jsonl", lines=True)

model = Samey()
report = model.score(data, text="prompt", topic="category")
print(report.summary)

# Or as a single number:
print("Diversity:", report.diversity_score['score'])

One-liner usage

import samey as sl

# df is a pandas DataFrame, e.g. loaded as in the Quickstart
report = sl.score(df, text="prompt", topic="topic")
print(report.summary)

DPO datasets

report = sl.score_dpo(df, prompt="prompt", chosen="chosen", rejected="rejected")
print(report.to_markdown())

Multiple text columns

report = sl.score(df, text=["prompt", "response"])
report.to_json("diversity_report.json")

Metrics

Samey computes 8 metrics:

| Metric | What it measures | Healthy range |
| --- | --- | --- |
| Compression Ratio | Global repetition via gzip | 0.3-0.5 |
| Near-Duplicate Rate | MinHash/LSH duplicates | < 0.1 |
| Template Dominance | Skeleton detection | < 0.1 (top skeleton share) |
| N-gram Repetition | Boilerplate via repeated 6-10 grams | < 0.2 |
| Topic Coverage | Topic entropy | > 0.8 (1 = uniform) |
| Style Diversity | Char n-gram clustering | < 0.2 (largest cluster) |
| Semantic Diversity | Embedding-based concept spread | > 0.5 (higher = more diverse) |
| Distinct-N | Lexical diversity | > 0.5 for distinct-1/2/3 |

Configuration

model = Samey(
    length_mode="truncate",  # "truncate", "window", or "none"
    max_chars=512,
    shingle_size=5,
    lsh_threshold=0.85,
    max_sample=50_000,
    ngram_min=6,
    ngram_max=10,
    style_n_clusters=20,
    # Semantic diversity settings
    semantic_method="tfidf",  # "tfidf" (fast) or "embedding" (better)
    semantic_model="paraphrase-MiniLM-L3-v2",  # Only used when semantic_method="embedding"
    semantic_max_sample=1000,
    enable_semantic=True,
    seed=42,
)

Report Object

report = model.score(df, text="prompt")

report.summary          # Key metrics dict
report.table            # pandas DataFrame
report.diversity_score  # Aggregated 0-100 score
report.print_score()    # Formatted score report
report.to_json("report.json")
report.to_markdown()

Aggregated Diversity Score

Get a single 0-100 score combining all metrics:

report = model.score(df, text="prompt")
report.print_score()

Output:

DIVERSITY SCORE: 97.8/100 (A)

Metric Breakdown (1.0 = best):
  compression_ratio              ██████████████████░░ 0.92 ✓
  near_duplicate_rate            ████████████████████ 1.00 ✓
  distinct_2                     ████████████████████ 1.00 ✓
  ...

✅ No significant issues detected!

Access programmatically:

ds = report.diversity_score
print(ds['score'])      # 97.8
print(ds['issues'])     # List of detected problems
print(ds['breakdown'])  # Per-metric normalized scores

Saving and Loading

model = Samey(max_chars=256, lsh_threshold=0.9)
model.save("my_config")

model = Samey.load("my_config")

How It Works

Compression Ratio

Concatenates all texts and computes gzip_bytes / raw_bytes. Repetitive content compresses better, so a lower ratio indicates more repetition.
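To make the ratio concrete, here is a minimal sketch using only the standard library. The function name `compression_ratio` and the newline separator are assumptions for illustration; Samey's internal implementation may differ.

```python
import gzip

def compression_ratio(texts):
    """Return gzip_bytes / raw_bytes for the concatenated texts."""
    # Assumption: rows are joined with a newline before compressing.
    raw = "\n".join(texts).encode("utf-8")
    if not raw:
        return 0.0
    return len(gzip.compress(raw)) / len(raw)

# A fully repeated dataset versus one where every row differs.
repetitive = ["Write a poem about cats."] * 200
varied = [f"Question {i}: explain topic number {i * 7919 % 1000}." for i in range(200)]

print(compression_ratio(repetitive))  # very small: almost everything is redundant
print(compression_ratio(varied))      # noticeably larger
```

Very short inputs can give ratios above 1 because of gzip header overhead, so the number is only meaningful on reasonably sized datasets.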

Near-Duplicate Rate

Uses character 5-gram shingles, MinHash signatures (128 perms), and LSH to find texts with Jaccard similarity >= 0.85.
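The similarity being approximated can be illustrated with exact Jaccard over character 5-gram shingles. Note that this sketch swaps MinHash/LSH for a brute-force pairwise comparison, which is only practical on small samples. The function names, and the choice to report the fraction of rows with at least one near-duplicate partner, are assumptions rather than Samey's actual definitions.

```python
from itertools import combinations

def shingles(text, k=5):
    """Set of character k-grams; texts shorter than k yield one shingle."""
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def near_duplicate_rate(texts, threshold=0.85, k=5):
    """Fraction of rows that have at least one partner with Jaccard >= threshold."""
    sets = [shingles(t, k) for t in texts]
    flagged = set()
    # Brute force O(n^2); MinHash + LSH exists to avoid exactly this loop.
    for i, j in combinations(range(len(sets)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            flagged.update((i, j))
    return len(flagged) / len(texts) if texts else 0.0

texts = [
    "Explain how photosynthesis works in plants.",
    "Explain how photosynthesis works in plants!",
    "List three facts about the Roman Empire.",
]
print(near_duplicate_rate(texts))  # the first two rows differ by one character
```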

Template Dominance

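"Skeletonizes" texts by replacing URLs, numbers, emails, code blocks, and quoted strings with tags. It then measures how rows are distributed across the resulting skeletons; the share of the most common skeleton is the reported value.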
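A rough sketch of skeletonization with regular expressions follows. The tag names (`<URL>`, `<NUM>`, and so on), the exact patterns, and the function names are hypothetical; the library's own rules are likely more thorough.

```python
import re
from collections import Counter

# Order matters: code blocks and URLs are replaced before bare numbers.
_RULES = [
    (re.compile(r"`{3}.*?`{3}", re.DOTALL), "<CODE>"),        # fenced code blocks
    (re.compile(r"https?://\S+"), "<URL>"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "<EMAIL>"),
    (re.compile(r'"[^"\n]*"'), "<STR>"),                      # double-quoted strings
    (re.compile(r"\d+(?:\.\d+)?"), "<NUM>"),
]

def skeletonize(text):
    """Replace variable parts of a text with placeholder tags."""
    for pattern, tag in _RULES:
        text = pattern.sub(tag, text)
    return text

def template_dominance(texts):
    """Share of rows that map to the single most common skeleton."""
    if not texts:
        return 0.0
    counts = Counter(skeletonize(t) for t in texts)
    return counts.most_common(1)[0][1] / len(texts)

texts = [
    'Translate "hello" to French.',
    'Translate "goodbye" to French.',
    "What is 12 plus 30?",
    "Visit https://example.com for details.",
]
print(skeletonize(texts[2]))
print(template_dominance(texts))
```

The two translation prompts collapse to the same skeleton, so they count as one template even though their raw text differs.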

N-gram Repetition

Finds word n-grams of length 6 to 10 that appear in two or more different rows.
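The idea can be sketched as follows. Whitespace tokenization and the way the result is turned into a single rate (the fraction of rows containing at least one shared n-gram) are assumptions; the function names are illustrative only.

```python
from collections import defaultdict

def word_ngrams(text, n):
    """Set of word n-grams in a text, using whitespace tokenization."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def repeated_ngrams(texts, n_min=6, n_max=10):
    """Map each n-gram found in 2+ different rows to the set of row indices."""
    rows_by_ngram = defaultdict(set)
    for row, text in enumerate(texts):
        for n in range(n_min, n_max + 1):
            for gram in word_ngrams(text, n):
                rows_by_ngram[gram].add(row)
    return {g: rows for g, rows in rows_by_ngram.items() if len(rows) >= 2}

def ngram_repetition_rate(texts, n_min=6, n_max=10):
    """Fraction of rows containing at least one n-gram shared with another row."""
    if not texts:
        return 0.0
    shared = repeated_ngrams(texts, n_min, n_max)
    affected = set().union(*shared.values()) if shared else set()
    return len(affected) / len(texts)

texts = [
    "As an AI language model I cannot help with that request.",
    "As an AI language model I cannot browse the internet.",
    "Tell me a joke.",
]
print(sorted(repeated_ngrams(texts)))
print(ngram_repetition_rate(texts))
```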

Topic Coverage

Normalized entropy of topic labels (0 = one topic, 1 = uniform).
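As an illustration, normalized entropy can be computed as Shannon entropy divided by its maximum, log(K). This sketch assumes K is the number of distinct topics observed in the data; whether Samey normalizes the same way is not stated here.

```python
import math
from collections import Counter

def topic_coverage(labels):
    """Shannon entropy of the label distribution divided by log(K)."""
    counts = Counter(labels)
    k = len(counts)
    if k <= 1:
        return 0.0  # zero or one topic: no spread at all
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(k)

print(topic_coverage(["math"] * 10))                    # one topic
print(topic_coverage(["math", "code", "law", "bio"]))   # uniform over four topics
print(topic_coverage(["math"] * 9 + ["code"]))          # heavily skewed
```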

Style Diversity

Builds TF-IDF vectors over character 3- to 5-grams and clusters them with MiniBatchKMeans. The share of the largest cluster is the reported value.

Distinct-N

unique_ngrams / total_ngrams for unigrams, bigrams, trigrams.
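A small sketch of the ratio, assuming whitespace tokenization and counts pooled over the whole dataset (both are assumptions; the function name is illustrative):

```python
def distinct_n(texts, n):
    """unique_ngrams / total_ngrams over all rows, for word n-grams of size n."""
    total = 0
    unique = set()
    for text in texts:
        words = text.split()
        for i in range(len(words) - n + 1):
            unique.add(tuple(words[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

texts = ["the cat sat", "the cat ran"]
for n in (1, 2, 3):
    print(f"distinct-{n}:", distinct_n(texts, n))
```

Here 4 of 6 unigrams, 3 of 4 bigrams, and 2 of 2 trigrams are unique, so the values rise with n.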

Semantic Diversity

Two methods available:

  • TF-IDF (default): Fast, uses word/bigram TF-IDF vectors. No extra dependencies.
  • Embedding: Uses the paraphrase-MiniLM-L3-v2 sentence transformer. Better at catching paraphrases/synonyms, but slower.

# Fast TF-IDF (default)
model = Samey(semantic_method="tfidf")

# Embedding-based (needs sentence-transformers)
model = Samey(semantic_method="embedding")
