Automatically Coding Economic Activities via Semantic Search

Code and data accompanying our paper currently under review at the Journal of Official Statistics.

Overview

When businesses respond to surveys, or when they use interactive coding tools to find the right classification for their activity, they describe what they do in plain language. National Statistical Institutes must then map these free-text descriptions to an official economic activity classification. In Italy this is ATECO, a hierarchical taxonomy with more than one thousand fine-grained categories.

The current standard approach relies on hand-crafted rule sets: synonym lists, spelling-variant mappings, and abbreviation tables accumulated over years of expert effort. These work well but are expensive to maintain and must be largely rebuilt every time the classification is updated.

This project evaluates a lower-maintenance alternative: embedding-based semantic search. Both the user query and the classification entries are mapped into a shared vector space, and the closest matches are retrieved by nearest-neighbour search. No task-specific training data is required; we only need the official taxonomy texts.
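The retrieval step described above can be sketched in a few lines. This is a minimal illustration and not the notebook's code: `toy_embed` is a hypothetical stand-in that hashes character trigrams instead of calling a sentence transformer, and the three knowledge-base entries and their codes are invented placeholders, not taken from ATECO.

```python
import zlib

import numpy as np


def toy_embed(texts, dim=512):
    """Hypothetical stand-in encoder: hashed character trigrams, L2-normalised.

    A real run would replace this with a sentence transformer model.
    """
    vectors = np.zeros((len(texts), dim))
    for row, text in enumerate(texts):
        padded = f" {text.lower()} "
        for i in range(len(padded) - 2):
            trigram = padded[i:i + 3].encode("utf-8")
            # crc32 is deterministic across runs, unlike the built-in hash()
            vectors[row, zlib.crc32(trigram) % dim] += 1.0
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def search(query, kb_codes, kb_vectors, k=5):
    """Return the k knowledge-base codes closest to the query.

    All vectors are unit length, so the dot product is the cosine similarity.
    """
    query_vector = toy_embed([query])[0]
    scores = kb_vectors @ query_vector
    top = np.argsort(-scores)[:k]
    return [(kb_codes[i], float(scores[i])) for i in top]


# Invented placeholder entries (not real ATECO codes or descriptors)
kb_codes = ["X.01", "X.02", "X.03"]
kb_texts = [
    "growing of grapes",
    "restaurants with table service",
    "computer programming activities",
]
kb_vectors = toy_embed(kb_texts)

for code, score in search("grape growing", kb_codes, kb_vectors, k=2):
    print(code, round(score, 3))
```

With a knowledge base of only about a thousand entries, an exhaustive dot product like this is cheap, so an approximate nearest-neighbour index is unlikely to be necessary.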

We compare several sentence transformer models (118M to 8B parameters) and five strategies for representing the taxonomy as a searchable knowledge base, evaluated against 33,544 real-world queries. Compact models (300–700M parameters) prove competitive with much larger ones, achieving Hit@5 above 0.82 at the fine-grained 5-digit level.
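For readers unfamiliar with the metric, the sketch below shows one way Hit@k could be computed at a chosen level of the hierarchy. It rests on assumptions that are not confirmed by this README: codes are written in dotted form, a level is obtained by keeping the leading digits, and a query counts as a hit when any of its top-k candidates matches the gold code at that level. The function names and example codes are hypothetical; the paper's exact definition may differ.

```python
def truncate(code, digits):
    """Keep the first `digits` digits of a dotted code (assumed format)."""
    return code.replace(".", "")[:digits]


def hit_at_k(ranked_codes, gold_codes, k=5, digits=5):
    """Share of queries whose gold code appears among the top-k candidates,
    comparing codes only up to the given number of digits."""
    hits = 0
    for ranked, gold in zip(ranked_codes, gold_codes):
        target = truncate(gold, digits)
        if any(truncate(code, digits) == target for code in ranked[:k]):
            hits += 1
    return hits / len(gold_codes)


# Invented placeholder codes, one ranked candidate list per query
ranked = [
    ["11.11.10", "22.22.20"],
    ["33.33.30", "44.44.40"],
]
gold = ["22.22.21", "55.55.50"]

print(hit_at_k(ranked, gold, k=2, digits=5))
print(hit_at_k(ranked, gold, k=2, digits=6))
```

In this toy case the first query is a hit at five digits but a miss at six, which illustrates why scores are reported per hierarchy level.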

Repository Contents

| File | Description |
| --- | --- |
| `main.ipynb` | Python notebook containing the code we used to obtain our results |
| `data/circe_embeddings_gemma.pt` | Pre-computed EmbeddingGemma query embeddings + CIRCE labels (unfortunately we can't publish the raw texts!) |
| `data/ateco22_descriptor.csv` | Naïve knowledge base (one consolidated descriptor per ATECO code) |
| `data/ateco22_disentangled.csv` | Disentangled knowledge base (one row per sub-text element) |
| `data/ateco22_synthetic.csv` | Synthetic knowledge base (GPT-5-mini generated queries) |
| `data/ateco22_classification.csv` | Full ATECO 2007 (2022 revision) taxonomy |

The raw CIRCE-labelled queries (`ateco22_circe.csv`) are not published for privacy reasons. To reproduce query encoding from scratch, set `USE_PRECOMPUTED_GEMMA_EMBEDDINGS = False` in the notebook and supply your own query file.
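The internal layout of the pre-computed embeddings file is not documented in this README. Assuming it was written with `torch.save`, a small helper such as the hypothetical `describe` below can be used to inspect it before running the notebook, without guessing at key names or tensor shapes.

```python
from pathlib import Path


def describe(obj, depth=0, max_depth=2):
    """Summarise a loaded object without assuming its layout."""
    pad = "  " * depth
    if hasattr(obj, "shape"):
        # Tensors and arrays: report the type and shape only
        return f"{pad}{type(obj).__name__} shape={tuple(obj.shape)}"
    if isinstance(obj, dict) and depth < max_depth:
        lines = [f"{pad}dict ({len(obj)} keys)"]
        for key, value in obj.items():
            lines.append(f"{pad}  {key!r}:")
            lines.append(describe(value, depth + 2, max_depth))
        return "\n".join(lines)
    if isinstance(obj, (list, tuple)):
        return f"{pad}{type(obj).__name__} of length {len(obj)}"
    return f"{pad}{type(obj).__name__}"


path = Path("data/circe_embeddings_gemma.pt")
if path.exists():
    import torch  # imported lazily so the helper also works without PyTorch

    # Assumption: the file is a standard torch.save payload
    payload = torch.load(path, map_location="cpu")
    print(describe(payload))
```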

Citation

The paper is currently under review.
