A curated dataset of paired DNA and protein sequences for all annotated protein-coding genes across NCBI RefSeq reference genomes of viruses that infect Homo sapiens.
Each record links a CDS nucleotide sequence to its translated protein, with 5' and 3' UTR context extracted where the full genomic RNA sequence is available.
| Metric | Value |
|---|---|
| Source | NCBI RefSeq |
| Taxon | Viruses (taxid 10239) |
| Host scope | Homo sapiens |
| Coverage | Reference genomes — one per virus species |
| Records | 14,528 CDS–protein pairs |
| Unique virus species | 869 |
| Output file | final/viruses_dna_protein.parquet (zstd, ~9 MB) |
| Validation | 10/10 checks pass |
Note: The final Parquet file and intermediate downloads are excluded from this repository. See Reproducing the Dataset to rebuild.
Each row is one protein-coding CDS. Primary key: protein_accession.
| Column | Type | Description |
|---|---|---|
| `genome_accession` | string | NCBI assembly accession (`GCF_...`) |
| `virus_name` | string | Virus scientific name |
| `taxon_id` | int | NCBI taxon ID |
| `genome_type` | string | Genome class (e.g. ssRNA(+), dsDNA); currently "unknown" for all records |
| `host` | string | Host organism (always "Homo sapiens") |
| `transcript_accession` | string | RefSeq transcript or genomic accession (`NM_`/`NC_`/etc.) |
| `protein_accession` | string | RefSeq protein accession (`NP_`/`YP_`/`XP_`/`WP_`); unique key |
| `gene_name` | string | Standard gene symbol |
| `protein_name` | string | Product name from RefSeq annotation |
| `mrna_sequence` | string | Full mRNA / genomic RNA nucleotide sequence (null if unavailable) |
| `utr5_sequence` | string | 5' UTR sequence (empty string if not annotated) |
| `cds_sequence` | string | Coding sequence (ATG through stop codon, inclusive) |
| `utr3_sequence` | string | 3' UTR sequence (empty string if not annotated) |
| `protein_sequence` | string | Translated amino acid sequence |
| `cds_start_in_mrna` | int | 0-based start of CDS within mRNA (null if no mRNA) |
| `cds_end_in_mrna` | int | Exclusive end of CDS within mRNA (null if no mRNA) |
| `has_utr` | bool | True if at least one UTR sequence is non-empty |
Verified invariants (all 14,528 records):
- `len(cds_sequence) % 3 == 0`
- `cds_sequence` begins with `ATG` and ends with `TAA`, `TAG`, or `TGA`
- `protein_accession` is globally unique
- `utr5_sequence + cds_sequence + utr3_sequence == mrna_sequence` (for UTR-bearing records)
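The invariants above can be re-checked on any subset of rows with a few lines of plain Python. This is a minimal sketch, not the repository's validation script: the helper names `check_record` and `duplicate_accessions` are hypothetical, and the toy records are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_record(rec: dict) -> list:
    """Return the names of the invariants that a single record violates."""
    problems = []
    cds = rec["cds_sequence"]
    if len(cds) % 3 != 0:
        problems.append("cds_length_not_multiple_of_3")
    if not cds.startswith("ATG"):
        problems.append("missing_atg_start")
    if cds[-3:] not in STOP_CODONS:
        problems.append("missing_stop_codon")
    # The reconstruction check only applies to UTR-bearing records with an mRNA.
    mrna = rec.get("mrna_sequence")
    if rec.get("has_utr") and mrna is not None:
        rebuilt = rec["utr5_sequence"] + cds + rec["utr3_sequence"]
        if rebuilt != mrna:
            problems.append("mrna_reconstruction_mismatch")
    return problems


def duplicate_accessions(records: list) -> set:
    """Return every protein_accession that appears more than once."""
    seen, dupes = set(), set()
    for rec in records:
        acc = rec["protein_accession"]
        if acc in seen:
            dupes.add(acc)
        seen.add(acc)
    return dupes


# Toy records (made up, not taken from the dataset).
good = {
    "protein_accession": "YP_000000001.1",
    "utr5_sequence": "GG",
    "cds_sequence": "ATGAAATAA",
    "utr3_sequence": "CC",
    "mrna_sequence": "GGATGAAATAACC",
    "has_utr": True,
}
bad = {
    "protein_accession": "YP_000000002.1",
    "utr5_sequence": "",
    "cds_sequence": "ATGAAATA",  # 8 nt: not a multiple of 3, no stop codon
    "utr3_sequence": "",
    "mrna_sequence": None,
    "has_utr": False,
}

print(check_record(good))
print(check_record(bad))
print(duplicate_accessions([good, good, bad]))
```

Rows loaded with polars can be passed through the same functions via `df.iter_rows(named=True)`, which yields one dict per row.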
```python
import polars as pl

df = pl.read_parquet("final/viruses_dna_protein.parquet")
print(df.shape)  # (14528, 17)
print(df.dtypes)

# All CDS for SARS-CoV-2
sarscov2 = df.filter(pl.col("virus_name").str.contains("SARS-CoV-2"))

# CDS length distribution per virus (grouped on virus_name)
df.group_by("virus_name").agg(
    pl.col("cds_sequence").str.len_chars().median().alias("median_cds_nt"),
    pl.len().alias("n_cds"),
).sort("n_cds", descending=True)

# All CDS for a specific gene across all viruses
rdrp = df.filter(pl.col("gene_name") == "RdRp")
```

Requirements: Python 3.10+, NCBI API key (optional but recommended), ~100 MB disk.
```bash
pip install -r requirements.txt

# 1. Enumerate human-infecting viral reference genomes from NCBI
python pipeline/phase1_enumerate_virus_genomes.py [--api-key YOUR_KEY]

# 2. Download genome packages from NCBI FTP (~38 MB)
python pipeline/phase2_download_genomes.py [--workers 4] [--api-key YOUR_KEY]

# 3. Parse feature tables + FASTAs, extract CDS/UTRs, write per-genome Parquets
python pipeline/phase3_parse_and_pair.py [--workers 8]

# 4. Deduplicate and merge into final Parquet
python pipeline/phase4_dedup_and_merge.py

# 5. Validate (10 quality checks)
python pipeline/phase5_validation.py
```

NCBI API key (free): raises rate limit from 3 to 10 req/sec. Register at https://www.ncbi.nlm.nih.gov/account/
The data/virus_genomes.jsonl file in this repo is the Phase 1 output — you can skip straight to Phase 2 if you use it as-is.
.
├── pipeline/
│ ├── phase1_enumerate_virus_genomes.py # NCBI Genome Assembly API enumeration
│ ├── phase2_download_genomes.py # FTP + Entrez fallback download
│ ├── phase3_parse_and_pair.py # CDS/UTR extraction, per-genome Parquet
│ ├── phase4_dedup_and_merge.py # Dedup on protein_accession, final Parquet
│ └── phase5_validation.py # 10 quality checks
├── data/
│ └── virus_genomes.jsonl # Phase 1 output: 1,897 genome records
├── requirements.txt
└── README.md
The NCBI Genome Assembly API has no reliable host filter. Phase 1 works around this by querying a curated list of ~30 taxon IDs covering the major human-pathogenic viral families (Coronaviridae, Flaviviridae, Retroviridae, Herpesviridae, Orthomyxoviridae, Paramyxoviridae, Picornaviridae, Papillomaviridae, Poxviridae, Adenoviridae, Hepadnaviridae, Caliciviridae, Reoviridae, Filoviridae, Arenaviridae, Rhabdoviridae, and others).
For each family, the Genome Assembly API (/datasets/v2/genome/taxon/{taxid}/dataset_report) returns all RefSeq assemblies. Only assemblies with a GCF_ accession and a valid FTP path are retained. The result is 1,897 viral reference genome records.
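The enumeration and retention rule can be sketched as follows. This is an illustration under stated assumptions rather than the Phase 1 script itself: the API base URL is assumed (the text above gives only the path), the record field names `accession` and `ftp_path` are hypothetical, and "valid FTP path" is interpreted here as a non-empty `ftp://` or `https://` URL. No network request is made.

```python
# Assumed base URL for the NCBI Datasets v2 REST API.
API_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2"


def report_url(taxid: int) -> str:
    """Build the dataset_report URL for one taxon ID."""
    return f"{API_BASE}/genome/taxon/{taxid}/dataset_report"


def keep_assembly(rec: dict) -> bool:
    """Keep only RefSeq (GCF_) assemblies that carry a usable FTP path.

    `accession` and `ftp_path` are hypothetical field names for this sketch.
    """
    accession = rec.get("accession") or ""
    ftp_path = rec.get("ftp_path") or ""
    return accession.startswith("GCF_") and ftp_path.startswith(("ftp://", "https://"))


# Made-up records for illustration only.
candidates = [
    {"accession": "GCF_000000001.1", "ftp_path": "https://example.org/GCF_000000001.1_example"},
    {"accession": "GCA_000000002.1", "ftp_path": "https://example.org/GCA_000000002.1_example"},
    {"accession": "GCF_000000003.1", "ftp_path": None},
]
kept = [rec["accession"] for rec in candidates if keep_assembly(rec)]

print(report_url(11118))  # 11118 is the NCBI taxon ID for Coronaviridae
print(kept)
```

In this toy input only the first record survives: the second is a GenBank (`GCA_`) assembly and the third has no FTP path.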
Phase 2 requests four files per genome from the NCBI FTP: `_cds_from_genomic.fna.gz`, `_feature_table.txt.gz`, `_rna.fna.gz`, and `_translated_cds.faa.gz`.
The RNA file (_rna.fna.gz) is treated as optional; most viral RefSeq assemblies do not include it (the CDS is annotated directly on the genomic sequence). An Entrez efetch fallback downloads a GenBank record for any NC_-only genomes lacking an FTP package. Downloads are idempotent.
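A rough sketch of how the per-genome file list and the idempotency check might look. It assumes the usual NCBI FTP layout, in which each file name is prefixed with the name of the assembly directory; `package_urls` and `needs_download` are hypothetical helpers, not functions from the Phase 2 script, and the example URL is a placeholder.

```python
import tempfile
from pathlib import Path

# Suffix -> required flag. The RNA FASTA is optional for viral assemblies.
SUFFIXES = {
    "_cds_from_genomic.fna.gz": True,
    "_feature_table.txt.gz": True,
    "_rna.fna.gz": False,
    "_translated_cds.faa.gz": True,
}


def package_urls(ftp_path: str) -> list:
    """Return (url, required) pairs for the four files of one assembly."""
    base = ftp_path.rstrip("/")
    prefix = base.rsplit("/", 1)[-1]  # assembly directory name doubles as file prefix
    return [(f"{base}/{prefix}{suffix}", required) for suffix, required in SUFFIXES.items()]


def needs_download(dest: Path) -> bool:
    """Skip files that already exist and are non-empty, so reruns are idempotent."""
    return not (dest.exists() and dest.stat().st_size > 0)


urls = package_urls("https://example.org/GCF_000000001.1_example/")
for url, required in urls:
    print("required" if required else "optional", url)

# Demonstrate the idempotency check with a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "GCF_000000001.1_example_feature_table.txt.gz"
    before = needs_download(target)  # file does not exist yet
    target.write_bytes(b"placeholder")
    after = needs_download(target)  # file now exists and is non-empty
print(before, after)
```

Checking for a non-empty file is a deliberately simple criterion; a stricter version could compare sizes or checksums against the server.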
- The feature table maps `protein_id → genomic_accession (NC_...)` for viral CDS (unlike mammalian genomes where it maps to a separate mRNA transcript).
- CDS FASTA supplies nucleotide sequences; protein FASTA supplies amino acid sequences.
- UTR extraction: CDS is located by exact substring search within the full RNA/genomic sequence. Since most viral assemblies lack a separate RNA file, `has_utr = False` for the majority of records.
- Non-CDS features (`mat_peptide`, misc RNA) excluded via `[gbkey=CDS]` filter.
- Pseudogenes excluded via `[pseudo=true]` filter.
- ~84 non-human viruses (plant viruses and bovine/rodent-specific viruses) that entered through broad family-level enumeration are filtered out by name pattern.
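The UTR extraction step described above (exact substring search, then splitting the flanks) can be illustrated with a short function. This is a sketch under assumptions: `split_utrs` is a hypothetical name, it takes the first match if the CDS occurs more than once, and the sequences are toy data.

```python
from typing import Optional


def split_utrs(full_seq: Optional[str], cds: str) -> dict:
    """Locate the CDS inside the full sequence and derive the UTR columns."""
    result = {
        "utr5_sequence": "",
        "utr3_sequence": "",
        "cds_start_in_mrna": None,
        "cds_end_in_mrna": None,
        "has_utr": False,
    }
    if not full_seq:
        return result  # no RNA/genomic sequence available
    start = full_seq.find(cds)  # exact substring search, first occurrence
    if start == -1:
        return result  # CDS not found verbatim, leave UTRs empty
    end = start + len(cds)  # exclusive end, matching cds_end_in_mrna
    utr5, utr3 = full_seq[:start], full_seq[end:]
    result.update(
        utr5_sequence=utr5,
        utr3_sequence=utr3,
        cds_start_in_mrna=start,
        cds_end_in_mrna=end,
        has_utr=bool(utr5 or utr3),
    )
    return result


# Toy sequences, not taken from the dataset.
toy_cds = "ATGAAATAA"
found = split_utrs("GGCC" + toy_cds + "TTT", toy_cds)
missing = split_utrs(None, toy_cds)
print(found)
print(missing)
```

With these coordinates, `full_seq[start:end]` returns the CDS again, which is what makes the reconstruction check `utr5_sequence + cds_sequence + utr3_sequence == mrna_sequence` hold by construction.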
- Dedup key: `protein_accession` (globally unique per protein isoform).
- Priority order when multiple assemblies share a protein: `refseq_category = reference genome` > `reference` > other, then by annotated gene count.
- `has_utr` is recomputed post-merge from UTR sequence content.
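One way to express this priority rule is a sort key over candidate rows. The sketch below is illustrative only: the `gene_count` field is a hypothetical name, and it assumes that a higher annotated gene count wins the tie-break, which the text above does not state explicitly.

```python
# Lower rank wins; any other category falls through to rank 2.
CATEGORY_RANK = {"reference genome": 0, "reference": 1}


def priority(rec: dict) -> tuple:
    """Sort key: category rank first, then gene count (higher preferred, assumed)."""
    rank = CATEGORY_RANK.get(rec.get("refseq_category"), 2)
    return (rank, -rec.get("gene_count", 0))


def dedup(records: list) -> list:
    """Keep the best record per protein_accession and recompute has_utr."""
    best = {}
    for rec in records:
        acc = rec["protein_accession"]
        if acc not in best or priority(rec) < priority(best[acc]):
            best[acc] = rec
    merged = []
    for rec in best.values():
        rec = dict(rec)  # copy so the input rows are not mutated
        rec["has_utr"] = bool(rec.get("utr5_sequence") or rec.get("utr3_sequence"))
        merged.append(rec)
    return merged


# Made-up rows: the same protein seen in three assemblies.
rows = [
    {"protein_accession": "YP_000000001.1", "genome_accession": "GCF_A",
     "refseq_category": "other", "gene_count": 12, "utr5_sequence": "", "utr3_sequence": ""},
    {"protein_accession": "YP_000000001.1", "genome_accession": "GCF_B",
     "refseq_category": "reference genome", "gene_count": 10, "utr5_sequence": "GG", "utr3_sequence": ""},
    {"protein_accession": "YP_000000001.1", "genome_accession": "GCF_C",
     "refseq_category": "reference genome", "gene_count": 11, "utr5_sequence": "", "utr3_sequence": ""},
]
merged = dedup(rows)
print([(r["genome_accession"], r["has_utr"]) for r in merged])
```

Here `GCF_C` is kept: it ties with `GCF_B` on category and has the higher gene count, and its recomputed `has_utr` is `False` because both UTR strings are empty.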
Ten checks: schema completeness, null audit on required fields, protein_accession uniqueness, accession format patterns (NP_/YP_/XP_/WP_ proteins; NC_/NM_/XM_ transcripts), full IUPAC nucleotide and amino acid alphabets, CDS divisibility by 3, ATG start / TAA-TAG-TGA stop codons, UTR consistency, mRNA reconstruction (UTR5 + CDS + UTR3 == mRNA), and CDS coordinate bounds.
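Three of these checks (accession format, alphabet membership, and coordinate bounds) are easy to sketch in isolation. The regular expressions and character sets below are assumptions chosen for illustration; the validation script may use different patterns, for example a different treatment of version suffixes or of the `*` stop symbol in proteins.

```python
import re

# Assumed patterns: prefix, underscore, digits, optional ".version".
PROTEIN_ACC = re.compile(r"^(NP|YP|XP|WP)_\d+(\.\d+)?$")
TRANSCRIPT_ACC = re.compile(r"^(NC|NM|XM)_\d+(\.\d+)?$")

# IUPAC nucleotide codes (including ambiguity codes) and amino acid codes.
NUCLEOTIDES = set("ACGTURYSWKMBDHVN")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYBZXJUO*")


def valid_alphabet(seq: str, alphabet: set) -> bool:
    """True if every character of seq belongs to the given alphabet."""
    return set(seq.upper()) <= alphabet


def coords_in_bounds(start: int, end: int, mrna_len: int) -> bool:
    """True if [start, end) is a non-empty interval inside the mRNA."""
    return 0 <= start < end <= mrna_len


print(bool(PROTEIN_ACC.match("YP_009724390.1")))
print(bool(TRANSCRIPT_ACC.match("NC_045512.2")))
print(valid_alphabet("ATGNNNTAA", NUCLEOTIDES))
print(valid_alphabet("ATG-TAA", NUCLEOTIDES))
print(coords_in_bounds(4, 13, 16))
```

Gap characters such as `-` are rejected by this alphabet check, since they are not part of the nucleotide code set used here.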
- No UTRs for most records. NCBI viral RefSeq assemblies do not provide separate RNA FASTA files; `_rna.fna.gz` is absent for nearly all genomes. UTR extraction requires downloading the full genomic FASTA separately.
- `genome_type` is "unknown" for all records. The NCBI Virus API does not reliably return molecule type in its current dataset report format.
- Broad family enumeration. Querying at the family level captures some animal-specific viruses within the same families as human pathogens (e.g. equine and avian herpesviruses). These are not filtered out beyond obvious name-based exclusions.
- Polyprotein mature peptides excluded. Only primary ORF CDS are included; `mat_peptide` features (e.g. individual SARS-CoV-2 nsps) are intentionally excluded.
- Frameshifted CDS excluded. Programmed ribosomal frameshifts (e.g. ORF1ab) may fail the standard QC checks and are dropped.
- Database: NCBI RefSeq
- Taxon: Viruses (taxid 10239)
- Host scope: Homo sapiens (taxid 9606)

NCBI data is in the public domain. See NCBI disclaimer.
Adapted from the Primate RefSeq CDS pipeline, which built an equivalent dataset for 35 primate species (2.3 M records, 2.1 GB).