
istat-methodology/semantic-auto-coding


Automatically Coding Economic Activities via Semantic Search

Code and data accompanying our paper currently under review at the Journal of Official Statistics.

Overview

When businesses respond to surveys, or use interactive coding tools to find the right classification for their activity, they describe that activity in plain language. National Statistical Institutes must then map these free-text descriptions to an official economic activity classification. In Italy this is ATECO, a hierarchical taxonomy with more than one thousand fine-grained categories.

The current standard approach relies on hand-crafted rule sets: synonym lists, spelling-variant mappings, and abbreviation tables accumulated over years of expert effort. These work well but are expensive to maintain, and they must be largely rebuilt every time the classification is updated.

This project evaluates a lower-maintenance alternative: embedding-based semantic search. Both the user query and the classification entries are mapped into a shared vector space, and the closest matches are retrieved by nearest-neighbour search. No task-specific training data is required; only the official taxonomy texts are needed.
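The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: it assumes cosine similarity as the distance measure, uses hand-written toy vectors in place of real sentence-transformer embeddings, and the function name `top_k_codes` and the codes shown are hypothetical placeholders.

```python
import numpy as np

def top_k_codes(query_vecs, kb_vecs, kb_codes, k=5):
    """Return, for each query, the k knowledge-base codes with the
    highest cosine similarity (assumed metric, for illustration)."""
    # L2-normalise so that the dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    kb = kb_vecs / np.linalg.norm(kb_vecs, axis=1, keepdims=True)
    sims = q @ kb.T                          # shape: (n_queries, n_entries)
    order = np.argsort(-sims, axis=1)[:, :k]  # best matches first
    return [[kb_codes[j] for j in row] for row in order]

# Toy 2-dimensional "embeddings"; real ones would come from an encoder model.
kb_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
kb_codes = ["01.11.00", "47.11.10", "62.01.00"]  # placeholder codes
query_vecs = np.array([[0.9, 0.1]])

ranked = top_k_codes(query_vecs, kb_vecs, kb_codes, k=2)
print(ranked)
```

For a taxonomy of roughly a thousand entries, an exhaustive similarity matrix like this is cheap, so no approximate nearest-neighbour index is needed for a sketch of this kind.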

We compare several sentence transformer models (118M to 8B parameters) and five strategies for representing the taxonomy as a searchable knowledge base, evaluated against 33,544 real-world queries. Compact models (300–700M parameters) prove competitive with much larger ones, achieving Hit@5 above 0.82 at the fine-grained 5-digit level.
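As a rough illustration of the metric, Hit@k at a given hierarchy level can be computed as the share of queries whose true code, truncated to that many digits, appears among the top k retrieved codes after the same truncation. The sketch below assumes this definition and dot-separated code strings; the paper's exact evaluation protocol may differ, and `hit_at_k` and the example codes are hypothetical.

```python
def hit_at_k(ranked_codes, true_codes, k=5, digits=5):
    """Fraction of queries whose true code matches one of the top-k
    retrieved codes when both are truncated to `digits` digits."""
    def truncate(code):
        # Drop the dots and keep the leading digits of the code.
        return code.replace(".", "")[:digits]

    hits = 0
    for ranked, truth in zip(ranked_codes, true_codes):
        candidates = {truncate(c) for c in ranked[:k]}
        if truncate(truth) in candidates:
            hits += 1
    return hits / len(true_codes)

# Placeholder rankings for two queries, best match first.
ranked_codes = [
    ["01.11.10", "01.12.00"],
    ["47.11.10", "47.19.20"],
]
true_codes = ["01.11.20", "62.01.00"]

score_5 = hit_at_k(ranked_codes, true_codes, k=5, digits=5)
score_4 = hit_at_k(ranked_codes, true_codes, k=5, digits=4)
print(score_5, score_4)
```

Evaluating the same rankings at coarser levels shows how often a miss at the 5-digit level still lands in the correct parent category.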

Repository Contents

| File | Description |
| --- | --- |
| main.ipynb | Python notebook containing the code we used to obtain our results |
| data/circe_embeddings_gemma.pt | Pre-computed EmbeddingGemma query embeddings + CIRCE labels (unfortunately we can't publish the raw texts!) |
| data/ateco22_descriptor.csv | Naïve knowledge base (one consolidated descriptor per ATECO code) |
| data/ateco22_disentangled.csv | Disentangled knowledge base (one row per sub-text element) |
| data/ateco22_synthetic.csv | Synthetic knowledge base (GPT-5-mini generated queries) |
| data/ateco22_classification.csv | Full ATECO 2007 (2022 revision) taxonomy |

The raw CIRCE-labelled queries (ateco22_circe.csv) are not published for privacy reasons. To reproduce query encoding from scratch, set USE_PRECOMPUTED_GEMMA_EMBEDDINGS = False in the notebook and supply your own query file.

Citation

The paper is currently under review.

About

Repository for automatic coding via semantic search approaches.
