feat: add bio-bait spam detection with profile bio scanning #10
Open
rezhajulio wants to merge 2 commits into main
Detects two related spam vectors common in Indonesian Telegram groups:
1. Bait phrases in messages (e.g. "cek bio aku", "liat byoh",
"open my bio"). Spammers obfuscate the word "bio" with
misspellings, separators (b.i.o, b1o), and Cyrillic look-alikes
(Ьіо). The handler normalizes (NFKC + lowercase + zero-width strip)
and canonicalizes obfuscated variants back to "bio" before matching
a small set of imperative + bio + possessive patterns.
2. Promo/scam links inside the user's Telegram profile bio. Some
spammers send innocuous group messages while their bio carries
t.me/+invite links, non-whitelisted t.me/{user} links, or multiple
non-whitelisted @mentions (sometimes paired with promo hint words
like VIP, BCL, ASP, open). The user's bio is fetched once per hour
via bot.get_chat() and cached in bot_data.
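The normalization and canonicalization described in item 1 can be sketched roughly as follows. This is a minimal illustration, not the code in bio_bait.py: the character classes, the word lists, and the names normalize, canonicalize, and is_bait are all assumptions, and only the imperative-led pattern is covered.

```python
import re
import unicodedata

# Zero-width characters that can be used to split a word invisibly.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# Hypothetical obfuscated-"bio" matcher: b / i-or-y / o with optional
# separators, Cyrillic look-alikes (U+044C, U+0432, U+0456, U+043E),
# digit substitutions, and trailing "h"s as in "byoh" or "bioohh".
SEP = r"[.\-_*]{0,2}"
BIO_VARIANT = re.compile(
    r"(?<![^\W\d_])"                 # not preceded by a letter
    r"[b\u044c\u0432]" + SEP +
    r"[iy\u04561!]" + SEP +
    r"[o\u043e0]+h*"
    r"(?![^\W\d_])"                  # not followed by a letter ("biology")
)

# Hypothetical word lists; the real handler's lists are not shown in this PR text.
IMPERATIVES = r"(?:cek|check|liat|lihat|open|buka)"
POSSESSIVES = r"(?:aku|ku|saya|gue|my)"
BAIT = re.compile(rf"\b{IMPERATIVES}\s+(?:{POSSESSIVES}\s+)?bio\b")


def normalize(text: str) -> str:
    """NFKC-fold, lowercase, and strip zero-width characters."""
    text = unicodedata.normalize("NFKC", text).lower()
    return text.translate(ZERO_WIDTH)


def canonicalize(text: str) -> str:
    """Rewrite obfuscated variants back to the literal word 'bio'."""
    return BIO_VARIANT.sub("bio", text)


def is_bait(message: str) -> bool:
    """True if the message matches the imperative + bio pattern."""
    return BAIT.search(canonicalize(normalize(message))) is not None
```

Anchoring the variant matcher on letter boundaries is what keeps words such as "biology" and "bioinformatics" from being rewritten to "bio" in this sketch.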
On match the handler deletes the message, restricts the user, clears
the cached bio, and posts a notification (separate templates for
message-bait vs bio-link cases) to the warning topic.
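The once-per-hour bio fetch and the cache clear on match could look roughly like the sketch below. The cache key, the tuple layout, and the function names are hypothetical. In the real handler the fetch would be an awaited bot.get_chat() call; a plain callable stands in here so the example runs on its own.

```python
import time
from typing import Callable, Dict, Optional, Tuple

BIO_CACHE_KEY = "bio_cache"   # hypothetical key inside bot_data
BIO_CACHE_TTL = 3600.0        # one hour, per the description above


def get_cached_bio(
    bot_data: dict,
    user_id: int,
    fetch: Callable[[int], Optional[str]],
    now: Optional[float] = None,
) -> Optional[str]:
    """Return the user's bio, calling fetch() at most once per TTL window."""
    now = time.monotonic() if now is None else now
    cache: Dict[int, Tuple[float, Optional[str]]] = bot_data.setdefault(BIO_CACHE_KEY, {})
    entry = cache.get(user_id)
    if entry is not None and now - entry[0] < BIO_CACHE_TTL:
        return entry[1]           # still fresh: reuse the cached bio
    bio = fetch(user_id)          # stale or missing: fetch and store
    cache[user_id] = (now, bio)
    return bio


def clear_cached_bio(bot_data: dict, user_id: int) -> None:
    """Drop the cached bio so the next message triggers a fresh fetch."""
    bot_data.get(BIO_CACHE_KEY, {}).pop(user_id, None)
```

Clearing the entry after a match means a user who later cleans up their bio is re-checked against the current bio rather than the cached spam one.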
- New handler: src/bot/handlers/bio_bait.py (registered at group=2,
shifts contact/new_user/duplicate/message handlers to 3/4/5/6).
- New config: bio_bait_enabled (Settings + GroupConfig, default True).
- New templates: BIO_BAIT_SPAM_NOTIFICATION (+ NO_RESTRICT) and
BIO_LINK_SPAM_NOTIFICATION (+ NO_RESTRICT) in constants.py.
- Tests: tests/test_bio_bait.py covers normalization, true positives
(incl. Cyrillic / obfuscated forms), false positives (biology,
bioinformatics, "bio aku ada di README"), bio-link detection,
per-user TTL cache, all handler branches.
626 tests pass, bio_bait.py at 100% coverage, ruff clean.
Replace real-looking Telegram invite hashes and @username from spam examples in code comments and tests with obvious placeholders so the repository does not propagate (or appear to endorse) actual scam links.
Summary
Detects two related spam vectors that have been showing up in the Indonesian Telegram community:
1. Bait phrases in messages, e.g. "cek bio aku", "liat byoh", "open my bio". Spammers obfuscate "bio" with misspellings, separators, and Cyrillic look-alikes (b.i.o, b1o, bioohh, Ьіо). The handler normalizes the text (NFKC + lowercase + zero-width strip), canonicalizes obfuscated variants back to "bio", then matches a small set of imperative + bio + possessive patterns.
2. Promo/scam links inside the user's Telegram profile bio, e.g. private t.me/+<invite-hash> invite links combined with promo hint words (VIP, promo, open, ready, …) and/or non-whitelisted @mentions. Some spammers send innocuous group messages while their bio carries the actual links. The user's bio is fetched once per hour via bot.get_chat() and cached in bot_data.
On match the handler deletes the message, restricts the user, clears the cached bio, and posts a notification (separate templates for message-bait vs bio-link cases) to the warning topic, then raises ApplicationHandlerStop.
Detection logic
Message bait
- Canonicalize bio/byo obfuscations to "bio" (handles Cyrillic look-alikes)
- Match an imperative + bio and/or a first-person possessive
Profile bio scan
- t.me/+... private invite links
- Non-whitelisted t.me/{username} links (reuses is_url_whitelisted)
- Multiple non-whitelisted @username mentions, OR 1 mention combined with a promo hint (vip, bcl, asp, open, ready, …)
- A single @mention alone is not enough (avoids false positives)
Changes
- New handler: src/bot/handlers/bio_bait.py (registered at group=2; contact_spam / new_user_spam / duplicate_spam / message_handler shifted to 3/4/5/6).
- New config: bio_bait_enabled (Settings + GroupConfig, default True).
- New templates: BIO_BAIT_SPAM_NOTIFICATION (+ _NO_RESTRICT) and BIO_LINK_SPAM_NOTIFICATION (+ _NO_RESTRICT).
- Tests: tests/test_bio_bait.py, 79 tests covering normalization, true positives ("cek bio kak", "lihat bio dong", "bio aku update", Cyrillic/obfuscated forms), false positives ("biology", "bioinformatics", "bio aku ada di README", "thank you my bro"), bio-link detection, per-user TTL cache, and all handler branches (admin/bot/disabled/no-text/delete-fail/restrict-fail/notify-fail).
Verification
- uv run pytest → 626 passed (was 547 → +79)
- bio_bait.py at 100% coverage
- ruff check clean
Notes
- ApplicationHandlerStop short-circuits downstream handlers when a match fires.
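For reviewers, the profile bio scan rules listed under Detection logic can be sketched as below. This is an illustration under stated assumptions, not the PR's code: the regexes, the threshold of two mentions for "multiple", the promo hint list beyond the words named above, and the WHITELIST set (standing in for is_url_whitelisted) are all hypothetical.

```python
import re
from typing import FrozenSet

INVITE_LINK = re.compile(r"t\.me/\+[\w-]+", re.IGNORECASE)
USER_LINK = re.compile(r"t\.me/(?!\+)([a-z][a-z0-9_]{4,31})", re.IGNORECASE)
MENTION = re.compile(r"(?<![\w/])@([a-z][a-z0-9_]{4,31})", re.IGNORECASE)
PROMO_HINT = re.compile(r"\b(?:vip|bcl|asp|open|ready|promo)\b", re.IGNORECASE)

# Hypothetical whitelist; the real handler reuses is_url_whitelisted instead.
WHITELIST: FrozenSet[str] = frozenset({"examplegroup"})


def bio_has_spam_links(bio: str, whitelist: FrozenSet[str] = WHITELIST) -> bool:
    """Apply the bio scan rules to a profile bio string."""
    # Rule 1: any private invite link is flagged outright.
    if INVITE_LINK.search(bio):
        return True
    allowed = {name.lower() for name in whitelist}
    # Rule 2: any t.me/{username} link that is not whitelisted.
    if any(name.lower() not in allowed for name in USER_LINK.findall(bio)):
        return True
    # Rule 3: two or more distinct non-whitelisted mentions (assumed threshold),
    # or exactly one mention combined with a promo hint word.
    mentions = {name.lower() for name in MENTION.findall(bio)} - allowed
    if len(mentions) >= 2:
        return True
    return len(mentions) == 1 and PROMO_HINT.search(bio) is not None
```

A lone @mention without a promo hint returns False here, which mirrors the false-positive guard described in the list above.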