Code and data accompanying our paper currently under review at the Journal of Official Statistics.
When businesses respond to surveys, or when they use interactive coding tools to find the right classification for their activity, they describe that activity in plain language. National Statistical Institutes must then map these free-text descriptions to an official economic activity classification. In Italy this is ATECO, a hierarchical taxonomy with more than one thousand fine-grained categories.
The current standard approach relies on hand-crafted rule sets (synonym lists, spelling-variant mappings, abbreviation tables) accumulated over years of expert effort. These work well but are expensive to maintain and need to be largely rebuilt every time the classification is updated.
This project evaluates a lower-maintenance alternative: embedding-based semantic search. Both the user query and the classification entries are mapped into a shared vector space, and the closest matches are retrieved by nearest-neighbour search. No task-specific training data is required; we only need the official taxonomy texts.
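The retrieval step can be illustrated with a minimal sketch. It assumes the query and knowledge-base embeddings have already been computed and uses cosine similarity as the distance measure, which is an assumption here rather than a statement about the notebook. The function name `top_k_codes`, the toy 2-dimensional vectors, and the placeholder labels `code_a`, `code_b`, `code_c` are hypothetical.

```python
import numpy as np


def top_k_codes(query_vecs, kb_vecs, kb_codes, k=5):
    """Return, for each query, the codes of the k most similar knowledge-base rows.

    Similarity is cosine similarity (an assumption for this sketch).
    """
    # L2-normalise rows so that the dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    kb = kb_vecs / np.linalg.norm(kb_vecs, axis=1, keepdims=True)
    sims = q @ kb.T  # shape: (n_queries, n_kb_rows)
    # Sort each row by descending similarity and keep the first k indices.
    order = np.argsort(-sims, axis=1)[:, :k]
    return [[kb_codes[j] for j in row] for row in order]


# Toy data standing in for real embeddings (hypothetical values).
kb_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
kb_codes = ["code_a", "code_b", "code_c"]
query_vecs = np.array([[0.9, 0.1]])

result = top_k_codes(query_vecs, kb_vecs, kb_codes, k=2)
print(result)
```

With a knowledge base that has several rows per code, the returned list may contain the same code more than once, so a deduplication step would likely be needed before scoring.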
We compare several sentence transformer models (118M to 8B parameters) and five strategies for representing the taxonomy as a searchable knowledge base, evaluated against 33,544 real-world queries. Compact models (300–700M parameters) prove competitive with much larger ones, achieving Hit@5 above 0.82 at the fine-grained 5-digit level.
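As a rough illustration of the metric, Hit@k is the share of queries whose correct code appears among the top k retrieved codes. The sketch below assumes codes are plain digit strings, so that a coarser hierarchical level can be obtained by keeping a prefix; that encoding, the function name `hit_at_k`, and the toy codes are hypothetical and may differ from the notebook.

```python
def hit_at_k(ranked, gold, k=5, digits=None):
    """Fraction of queries whose gold code is among the top-k predictions.

    ranked: list of ranked code lists, one per query (best match first).
    gold:   list of correct codes, one per query.
    digits: if given, compare only the first `digits` characters of each code
            (assumes codes are digit strings without separators).
    """
    hits = 0
    for preds, true in zip(ranked, gold):
        if digits is not None:
            preds = [p[:digits] for p in preds]
            true = true[:digits]
        hits += true in preds[:k]
    return hits / len(gold)


# Toy example with made-up 5-digit codes.
ranked = [["11111", "22222"], ["33333", "44444"]]
gold = ["22222", "33444"]

print(hit_at_k(ranked, gold, k=2))            # exact 5-digit match
print(hit_at_k(ranked, gold, k=2, digits=2))  # match at a coarser 2-digit level
```

Truncating before taking the top k means duplicate prefixes can occupy several of the k slots; deduplicating first is an equally reasonable variant.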
| File | Description |
|---|---|
| `main.ipynb` | Python notebook containing the code we used to obtain our results |
| `data/circe_embeddings_gemma.pt` | Pre-computed EmbeddingGemma query embeddings + CIRCE labels (unfortunately we can't publish the raw texts!) |
| `data/ateco22_descriptor.csv` | Naïve knowledge base (one consolidated descriptor per ATECO code) |
| `data/ateco22_disentangled.csv` | Disentangled knowledge base (one row per sub-text element) |
| `data/ateco22_synthetic.csv` | Synthetic knowledge base (GPT-5-mini generated queries) |
| `data/ateco22_classification.csv` | Full ATECO 2007 (2022 revision) taxonomy |
The raw CIRCE-labelled queries (`ateco22_circe.csv`) are not published for privacy reasons. To reproduce query encoding from scratch, set `USE_PRECOMPUTED_GEMMA_EMBEDDINGS = False` in the notebook and supply your own query file.
The paper is currently under review.