Human Virus RefSeq Coding Sequences

A curated dataset of paired DNA and protein sequences for all annotated protein-coding genes across NCBI RefSeq reference genomes of viruses that infect Homo sapiens.

Each record links a CDS nucleotide sequence to its translated protein, with 5' and 3' UTR context extracted where the full genomic RNA sequence is available.

Dataset at a Glance

Metric	Value
Source	NCBI RefSeq
Taxon	Viruses (taxid 10239)
Host scope	Homo sapiens
Coverage	Reference genomes — one per virus species
Records	14,528 CDS–protein pairs
Unique virus species	869
Output file	`final/viruses_dna_protein.parquet` (zstd, ~9 MB)
Validation	10/10 checks pass

Note: The final Parquet file and intermediate downloads are excluded from this repository. See Reproducing the Dataset to rebuild.

Schema

Each row is one protein-coding CDS. Primary key: protein_accession.

Column	Type	Description
`genome_accession`	string	NCBI assembly accession (GCF_...)
`virus_name`	string	Virus scientific name
`taxon_id`	int	NCBI taxon ID
`genome_type`	string	Genome class (e.g. ssRNA(+), dsDNA); currently "unknown" for all records, see Known Limitations
`host`	string	Host organism (always "Homo sapiens")
`transcript_accession`	string	RefSeq transcript or genomic accession (NM_/NC_/etc.)
`protein_accession`	string	RefSeq protein accession (NP_/YP_/XP_/WP_) — unique key
`gene_name`	string	Standard gene symbol
`protein_name`	string	Product name from RefSeq annotation
`mrna_sequence`	string	Full mRNA / genomic RNA nucleotide sequence (null if unavailable)
`utr5_sequence`	string	5' UTR sequence (empty string if not annotated)
`cds_sequence`	string	Coding sequence (ATG through stop codon, inclusive)
`utr3_sequence`	string	3' UTR sequence (empty string if not annotated)
`protein_sequence`	string	Translated amino acid sequence
`cds_start_in_mrna`	int	0-based start of CDS within mRNA (null if no mRNA)
`cds_end_in_mrna`	int	Exclusive end of CDS within mRNA (null if no mRNA)
`has_utr`	bool	True if at least one UTR sequence is non-empty

Verified invariants (all 14,528 records):

len(cds_sequence) % 3 == 0
cds_sequence begins with ATG and ends with TAA, TAG, or TGA
protein_accession is globally unique
utr5_sequence + cds_sequence + utr3_sequence == mrna_sequence (for UTR-bearing records)
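
The invariants above can be re-checked on any record without the pipeline. The following is a minimal sketch in plain Python, assuming each row has been converted to a dict keyed by the schema's column names; the function names `check_invariants` and `duplicated_accessions` are illustrative and are not part of the pipeline.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_invariants(record: dict) -> list:
    """Return the names of the per-record invariants that this record violates."""
    failures = []
    cds = record["cds_sequence"]
    if len(cds) % 3 != 0:
        failures.append("cds_length_multiple_of_3")
    if not cds.startswith("ATG") or cds[-3:] not in STOP_CODONS:
        failures.append("start_stop_codons")
    # The reconstruction check only applies to UTR-bearing records.
    if record.get("has_utr"):
        rebuilt = record["utr5_sequence"] + cds + record["utr3_sequence"]
        if rebuilt != record["mrna_sequence"]:
            failures.append("mrna_reconstruction")
    return failures


def duplicated_accessions(records: list) -> set:
    """Return every protein_accession that appears more than once."""
    counts = Counter(r["protein_accession"] for r in records)
    return {acc for acc, n in counts.items() if n > 1}
```

With polars, `df.to_dicts()` yields rows in the shape this sketch expects.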

Quick Start

import polars as pl

df = pl.read_parquet("final/viruses_dna_protein.parquet")
print(df.shape)        # (14528, 17)
print(df.dtypes)

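# All CDS for SARS-CoV-2 (RefSeq lists it under its full scientific name,
# "Severe acute respiratory syndrome coronavirus 2", so match either form)
sarscov2 = df.filter(
    pl.col("virus_name").str.contains(
        "SARS-CoV-2|Severe acute respiratory syndrome coronavirus 2"
    )
)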

# Median CDS length and CDS count per virus
df.group_by("virus_name").agg(
    pl.col("cds_sequence").str.len_chars().median().alias("median_cds_nt"),
    pl.len().alias("n_cds"),
).sort("n_cds", descending=True)

# All CDS for a specific gene across all viruses
rdrp = df.filter(pl.col("gene_name") == "RdRp")
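
Because each record pairs a CDS with its protein, a natural follow-up to the queries above is to translate `cds_sequence` and compare the result with `protein_sequence`. The sketch below uses the standard genetic code (NCBI translation table 1) and only the standard library. It assumes that `protein_sequence` carries no trailing stop symbol, and it maps any codon containing an ambiguity code to `X`; neither choice is confirmed by the pipeline, and `translate` is an illustrative helper.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a CDS codon by codon and drop a single terminal stop."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X")  # X for codons with ambiguity codes
        for i in range(0, usable, 3)
    )
    return protein[:-1] if protein.endswith("*") else protein
```

Mismatches are not necessarily errors: annotated exceptions such as RNA editing or non-standard initiation can make the RefSeq protein differ from a naive translation.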

Reproducing the Dataset

Requirements: Python 3.10+, NCBI API key (optional but recommended), ~100 MB disk.

pip install -r requirements.txt

# 1. Enumerate human-infecting viral reference genomes from NCBI
python pipeline/phase1_enumerate_virus_genomes.py [--api-key YOUR_KEY]

# 2. Download genome packages from NCBI FTP (~38 MB)
python pipeline/phase2_download_genomes.py [--workers 4] [--api-key YOUR_KEY]

# 3. Parse feature tables + FASTAs, extract CDS/UTRs, write per-genome Parquets
python pipeline/phase3_parse_and_pair.py [--workers 8]

# 4. Deduplicate and merge into final Parquet
python pipeline/phase4_dedup_and_merge.py

# 5. Validate (10 quality checks)
python pipeline/phase5_validation.py

NCBI API key (free): raises rate limit from 3 to 10 req/sec. Register at https://www.ncbi.nlm.nih.gov/account/

The data/virus_genomes.jsonl file in this repo is the Phase 1 output — you can skip straight to Phase 2 if you use it as-is.

Repository Layout

.
├── pipeline/
│   ├── phase1_enumerate_virus_genomes.py   # NCBI Genome Assembly API enumeration
│   ├── phase2_download_genomes.py          # FTP + Entrez fallback download
│   ├── phase3_parse_and_pair.py            # CDS/UTR extraction, per-genome Parquet
│   ├── phase4_dedup_and_merge.py           # Dedup on protein_accession, final Parquet
│   └── phase5_validation.py               # 10 quality checks
├── data/
│   └── virus_genomes.jsonl                # Phase 1 output: 1,897 genome records
├── requirements.txt
└── README.md

Pipeline Design

Phase 1 — Enumerate viral reference genomes

The NCBI Genome Assembly API has no reliable host filter. Phase 1 works around this by querying a curated list of ~30 taxon IDs covering the major human-pathogenic viral families (Coronaviridae, Flaviviridae, Retroviridae, Herpesviridae, Orthomyxoviridae, Paramyxoviridae, Picornaviridae, Papillomaviridae, Poxviridae, Adenoviridae, Hepadnaviridae, Caliciviridae, Reoviridae, Filoviridae, Arenaviridae, Rhabdoviridae, and others).

For each family, the Genome Assembly API (/datasets/v2/genome/taxon/{taxid}/dataset_report) returns all RefSeq assemblies. Only assemblies with a GCF_ accession and a valid FTP path are retained. The result is 1,897 viral reference genome records.
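
As a rough sketch of this step, the snippet below builds the dataset report URL for one taxon ID and applies the retention rule to already-parsed report entries. The base URL and the `accession` key reflect the public NCBI Datasets v2 REST API as far as I know it; the `ftp_path` key is a hypothetical name for wherever the pipeline stores the FTP location, and both function names are illustrative.

```python
# Assumed base URL of the NCBI Datasets v2 REST API.
DATASETS_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2"


def dataset_report_url(taxid: int) -> str:
    """Build the genome dataset report URL for one taxon ID."""
    return f"{DATASETS_BASE}/genome/taxon/{taxid}/dataset_report"


def keep_refseq_assemblies(reports: list) -> list:
    """Keep entries that have a GCF_ accession and a non-empty FTP path."""
    kept = []
    for report in reports:
        accession = report.get("accession", "")
        ftp_path = report.get("ftp_path")  # hypothetical key name
        if accession.startswith("GCF_") and ftp_path:
            kept.append(report)
    return kept
```

Fetching and paging through the report is left out on purpose, since it depends on rate limits and on the API key handling described under Reproducing the Dataset.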

Phase 2 — Download

Four files per genome from the NCBI FTP: _cds_from_genomic.fna.gz, _feature_table.txt.gz, _rna.fna.gz, _translated_cds.faa.gz.

The RNA file (_rna.fna.gz) is treated as optional; most viral RefSeq assemblies do not include it (the CDS is annotated directly on the genomic sequence). An Entrez efetch fallback downloads a GenBank record for any NC_-only genomes lacking an FTP package. Downloads are idempotent.
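
The file naming and the idempotency rule can be illustrated as follows. This is a sketch under two assumptions: that each file name is the last component of the assembly's FTP directory followed by the suffix (the usual NCBI genomes layout), and that "idempotent" means skipping any target file that already exists and is non-empty. `package_urls` and `needs_download` are illustrative names.

```python
from pathlib import Path

REQUIRED_SUFFIXES = (
    "_cds_from_genomic.fna.gz",
    "_feature_table.txt.gz",
    "_translated_cds.faa.gz",
)
OPTIONAL_SUFFIXES = ("_rna.fna.gz",)  # absent for most viral assemblies


def package_urls(ftp_dir: str) -> dict:
    """Map each file suffix to its expected URL inside one assembly directory."""
    ftp_dir = ftp_dir.rstrip("/")
    prefix = ftp_dir.rsplit("/", 1)[-1]  # e.g. GCF_..._<assembly name>
    return {
        suffix: f"{ftp_dir}/{prefix}{suffix}"
        for suffix in REQUIRED_SUFFIXES + OPTIONAL_SUFFIXES
    }


def needs_download(target: Path) -> bool:
    """A file is fetched only if it is missing or empty (assumed rule)."""
    return not (target.exists() and target.stat().st_size > 0)
```

A missing optional file would then be logged and skipped, while a missing required file would trigger the Entrez fallback described above.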

Phase 3 — Parse and pair

The feature table maps protein_id → genomic_accession (NC_...) for viral CDS (unlike mammalian genomes where it maps to a separate mRNA transcript).
CDS FASTA supplies nucleotide sequences; protein FASTA supplies amino acid sequences.
UTR extraction: CDS is located by exact substring search within the full RNA/genomic sequence. Since most viral assemblies lack a separate RNA file, has_utr = False for the majority of records.
Non-CDS features (mat_peptide, misc RNA) excluded via [gbkey=CDS] filter.
Pseudogenes excluded via [pseudo=true] filter.
~84 non-human viruses (plant viruses and bovine/rodent-specific viruses) that entered through broad family-level enumeration are filtered out by name pattern.
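
The UTR extraction step described above amounts to locating the CDS inside the full sequence and slicing around it. A minimal sketch, assuming the first exact match is used and that a missing sequence or a failed match yields empty UTRs and null coordinates (consistent with the schema, but not confirmed against the pipeline code); `split_utrs` is an illustrative name.

```python
from typing import Optional


def split_utrs(full_sequence: Optional[str], cds: str) -> dict:
    """Locate cds inside full_sequence by exact substring search and split out UTRs."""
    result = {
        "utr5_sequence": "",
        "utr3_sequence": "",
        "cds_start_in_mrna": None,
        "cds_end_in_mrna": None,
        "has_utr": False,
    }
    if not full_sequence:
        return result
    start = full_sequence.find(cds)  # first exact occurrence, or -1
    if start == -1:
        return result
    end = start + len(cds)  # exclusive end, matching cds_end_in_mrna
    result["utr5_sequence"] = full_sequence[:start]
    result["utr3_sequence"] = full_sequence[end:]
    result["cds_start_in_mrna"] = start
    result["cds_end_in_mrna"] = end
    result["has_utr"] = bool(result["utr5_sequence"] or result["utr3_sequence"])
    return result
```

Exact substring search cannot place a spliced or otherwise discontinuous CDS, which is one reason a record may end up without coordinates.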

Phase 4 — Deduplicate and merge

Dedup key: protein_accession (globally unique per protein isoform).
Priority order when multiple assemblies share a protein: refseq_category = reference genome > reference > other, then by annotated gene count.
has_utr is recomputed post-merge from UTR sequence content.
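
The priority rule can be sketched as a single pass that keeps the best-ranked record per `protein_accession`. This is an illustration in plain Python, not the pipeline's code: the `gene_count` field name is hypothetical, and treating a higher gene count as the winner of a tie is an assumption, since the text above does not state the direction.

```python
# Lower rank wins; any category not listed here is treated as "other".
CATEGORY_RANK = {"reference genome": 0, "reference": 1}


def dedup_by_protein(records: list) -> list:
    """Keep one record per protein_accession according to the priority order."""
    best = {}
    for record in records:
        key = record["protein_accession"]
        rank = (
            CATEGORY_RANK.get(record.get("refseq_category"), 2),
            -record.get("gene_count", 0),  # assumed: more annotated genes wins
        )
        if key not in best or rank < best[key][0]:
            best[key] = (rank, record)
    return [record for _, record in best.values()]
```

Ties on both criteria keep the first record seen, so the output depends on input order in that case.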

Phase 5 — Validation

Ten checks: schema completeness, null audit on required fields, protein_accession uniqueness, accession format patterns (NP_/YP_/XP_/WP_ for proteins; NC_/NM_/XM_ for transcripts), sequences restricted to the full IUPAC nucleotide and amino acid alphabets, CDS length divisible by 3, ATG start and TAA/TAG/TGA stop codons, UTR consistency, mRNA reconstruction (UTR5 + CDS + UTR3 == mRNA), and CDS coordinate bounds.
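
Two of these checks, accession format and sequence alphabet, are easy to illustrate in isolation. The sketch below is an assumption-laden example rather than the contents of `phase5_validation.py`: the regular expressions, the exact alphabet sets (no stop symbol or gap character is allowed here) and the function names are all illustrative.

```python
import re

PROTEIN_ACCESSION = re.compile(r"^(NP|YP|XP|WP)_\d+(\.\d+)?$")
TRANSCRIPT_ACCESSION = re.compile(r"^(NC|NM|XM)_\d+(\.\d+)?$")

# IUPAC nucleotide codes, and the 20 standard residues plus B, Z, X, J, U, O.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")
IUPAC_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYBZXJUO")


def uses_alphabet(sequence: str, alphabet: set) -> bool:
    """True if every character of the sequence belongs to the alphabet."""
    return set(sequence.upper()) <= alphabet


def format_failures(record: dict) -> list:
    """Return the names of the format checks that this record fails."""
    failures = []
    if not PROTEIN_ACCESSION.match(record["protein_accession"]):
        failures.append("protein_accession_format")
    if not TRANSCRIPT_ACCESSION.match(record["transcript_accession"]):
        failures.append("transcript_accession_format")
    if not uses_alphabet(record["cds_sequence"], IUPAC_NUCLEOTIDES):
        failures.append("cds_alphabet")
    if not uses_alphabet(record["protein_sequence"], IUPAC_AMINO_ACIDS):
        failures.append("protein_alphabet")
    return failures
```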

Known Limitations

No UTRs for most records. Most NCBI viral RefSeq assemblies do not provide a separate RNA FASTA file (_rna.fna.gz is absent for nearly all genomes), so has_utr is False for those records. Extracting UTRs for them would require downloading the full genomic FASTA separately.
genome_type is "unknown" for all records. The NCBI Virus API does not reliably return molecule type in its current dataset report format.
Broad family enumeration. Querying at the family level captures some animal-specific viruses within the same families as human pathogens (e.g. equine and avian herpesviruses). These are not filtered out beyond obvious name-based exclusions.
Polyprotein mature peptides excluded. Only primary ORF CDS are included; mat_peptide features (e.g. individual SARS-CoV-2 nsps) are intentionally excluded.
Frameshifted CDS excluded. Programmed ribosomal frameshifts (e.g. ORF1ab) may fail the standard QC checks and are dropped.

Source & License

Database: NCBI RefSeq
Taxon: Viruses (taxid 10239)
Host scope: Homo sapiens (taxid 9606)

NCBI data is in the public domain. See NCBI disclaimer.

Reference

Adapted from the Primate RefSeq CDS pipeline, which built an equivalent dataset for 35 primate species (2.3 M records, 2.1 GB).
