Add .conll / .conllu dataset format loader (CoNLL-2003 / 2000 / U)#8219

Merged
lhoestq merged 1 commit into huggingface:main from CrypticCortex:add-conll-format-loader on May 27, 2026

Conversation

@CrypticCortex
Contributor

Closes #7757.

Summary

Adds a packaged_modules/conll/ builder so .conll and .conllu files load directly via load_dataset(...) without manual parsing scripts. Each row of the loaded dataset corresponds to one sentence, with each configured column produced as a list aligned with the token list.

Builder shape mirrors the text packaged module (line-based reading) — as suggested by @namesarnav in the issue thread.

```python
from datasets import load_dataset

ds = load_dataset(
    "conll",
    data_files="train.conll",
    column_names=["tokens", "pos_tags", "chunk_tags", "ner_tags"],
)
# Each example: {"tokens": [...], "pos_tags": [...], "chunk_tags": [...], "ner_tags": [...]}
```

Reads CoNLL-style files (one token per line, columns whitespace-separated,
blank lines = sentence boundaries) into one row per sentence with lists
aligned across configurable column names.

- Supports CoNLL-2003 NER, CoNLL-2000 chunking, CoNLL-U, and arbitrary
  custom column schemas via ConllConfig.column_names.
- Configurable delimiter (default: any whitespace), comment_prefix
  (CoNLL-U `#`), and skip_docstart (CoNLL-2003 -DOCSTART- markers).
- Registered for .conll and .conllu extensions in _EXTENSION_TO_MODULE
  (the .conllu mapping pre-sets comment_prefix="#").
- Pads short rows / truncates long rows to keep column alignment.
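
For readers unfamiliar with the format, the behavior listed above can be sketched in plain Python. This is an illustrative sketch only, not the code merged in this PR: the function name `parse_conll` is hypothetical, the parameter names simply echo the options described above, and padding short rows with an empty string is an assumption (the PR text does not say which pad value the builder uses).

```python
from typing import Iterable, Iterator, Optional


def parse_conll(
    lines: Iterable[str],
    column_names: list[str],
    delimiter: Optional[str] = None,       # None splits on any run of whitespace
    comment_prefix: Optional[str] = None,  # e.g. "#" for CoNLL-U comment lines
    skip_docstart: bool = True,            # drop CoNLL-2003 -DOCSTART- markers
    pad_value: str = "",                   # assumption: pad value is not stated in the PR
) -> Iterator[dict[str, list[str]]]:
    """Yield one dict per sentence, with one aligned list per column name."""
    n = len(column_names)
    sentence = {name: [] for name in column_names}
    has_tokens = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if it has any tokens.
            if has_tokens:
                yield sentence
                sentence = {name: [] for name in column_names}
                has_tokens = False
            continue
        if comment_prefix and line.startswith(comment_prefix):
            continue
        if skip_docstart and line.startswith("-DOCSTART-"):
            continue
        fields = line.split(delimiter)
        # Pad short rows and truncate long rows so every column stays aligned.
        fields = (fields + [pad_value] * n)[:n]
        for name, value in zip(column_names, fields):
            sentence[name].append(value)
        has_tokens = True
    if has_tokens:
        # The file may end without a trailing blank line.
        yield sentence


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC

Peter NNP B-NP B-PER
Blackburn NNP
"""

columns = ["tokens", "pos_tags", "chunk_tags", "ner_tags"]
rows = list(parse_conll(sample.splitlines(), columns))
```

In this sketch the `-DOCSTART-` line is dropped, the sample yields two sentences, and the incomplete `Blackburn NNP` row is padded so all four lists in the second sentence have length two.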

Closes huggingface#7757. Design discussed with namesarnav who suggested basing the
implementation on the text builder shape since CoNLL files are line-based.
Member

@lhoestq lhoestq left a comment


Sorry for the delay. It all looks good to me; I'm just running the CI, and if it's green, let's merge :)

@lhoestq lhoestq merged commit 0a81d51 into huggingface:main May 27, 2026
13 of 14 checks passed

Development

Successfully merging this pull request may close these issues.

Add support for .conll file format in datasets