Developed at Los Alamos National Laboratory (O# O4644)
Download ATCC Genome Portal microbial reference genomes and align query sequences.
The American Type Culture Collection (ATCC) sells a wide variety of microbes with strain-level taxonomic classification and associated sequenced reference genomes. ATCCfinder uses the ATCC application programming interface (API) to build queryable databases from ATCC Genome Portal resources. The tool can generate databases for the four ATCC data types:
- Strain-specific genome assembly sequence data (reference)
- Information about how each strain was collected (meta, catalogue)
- Structural/functional information about genome assemblies (annotation)
ATCCfinder has two core functions that may be used together or independently:
- Download ATCC references (with a valid API key)
- Query reference sequences and report alignment results. The tool was built primarily for use with ATCC reference genomes, but custom sequence databases may also be searched.
Once ATCCfinder has retrieved the ATCC reference genome database, queries can be aligned against the reference genomes with the sequence alignment tool minimap2. The alignment results are then parsed into summary data describing which ATCC-available species and strain, if any, the query sequence matches.
If you plan to download databases from ATCC yourself, your environment needs ATCC's genome_portal_api package, which is installed with git and pip.
The following is an example environment created for downloading databases from ATCC:
## Download genome_portal_api package
git clone https://github.com/ATCC-Bioinformatics/genome_portal_api.git
## Define environment
mamba create -n ATCCfinder_download
mamba activate ATCCfinder_download
## Install genome_portal_api package
mamba install git pip
pip install /path/to/genome_portal_api
Note that you will need to follow ATCC's instructions for creating your own account and API key; see the Genome Portal API GitHub link above.
The following software is required for aligning queries to ATCC references: minimap2, samtools, and R (r-base) with the argparse package (r-argparse).
The following is an example environment created for searching databases from ATCC:
## Define environment
mamba create -n ATCCfinder_search
mamba activate ATCCfinder_search
## Install packages
mamba install -c bioconda minimap2
mamba install -c bioconda samtools
mamba install r-base r-argparse
download.py
| Parameter | Description |
|---|---|
| --help | Show the help message and exit |
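| --download | Which ATCC databases, if any, to download (for example, reference catalogue) |
| --api_key | Your ATCC Genome Portal API key, needed when downloading |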
| --overwrite, --no-overwrite | When downloading databases, whether existing downloads in the output folder are replaced (--overwrite) or kept (--no-overwrite) |
| --format | Used to specify path to references (if, for example, references were downloaded from my database repository), which will be combined for searching |
| --out | Output folder |
search.R
| Parameter | Description |
|---|---|
| --query | Query Sequence(s) fasta file |
| --target | Target database fasta file |
| --aligner | Specify the alignment method. Currently supports: minimap2 (fna:fna), tblastn (faa:fna), & blastp (faa:faa). Default = minimap2. |
| --target_meta | Required if using a custom target database. Provides metadata associated with the target sequences in a standardized format. Must contain the columns 'header_id', 'reference_id', and 'taxonomy'. Additional columns may be included. |
| --nhits | The maximum number of target hits to return per query |
| --note_nmatch_p | (0-100) Specify an alignment nmatch percentage at or above which results will be noted as potential matches despite their aggregate score not being the maximum. |
| --overwrite | (T/F) If alignment output already exists, should it be overwritten? |
| --outdir | Output directory |
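Because `--target_meta` must contain the columns 'header_id', 'reference_id', and 'taxonomy', it can help to check a custom metadata table before running a search. The following is a minimal sketch, not part of ATCCfinder: the function name `missing_meta_columns` is hypothetical, and a tab-delimited file with a header row is an assumption, since the expected file format is not stated here.

```python
import csv
import io

# Column names required by --target_meta, as listed in the parameter table.
REQUIRED_COLUMNS = {"header_id", "reference_id", "taxonomy"}


def missing_meta_columns(handle, delimiter="\t"):
    """Return the required columns that are absent from the table's header row.

    Assumption: the metadata table is delimited text with a header row.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, [])  # empty input yields an empty header
    present = {name.strip() for name in header}
    return sorted(REQUIRED_COLUMNS - present)


# Illustrative table only; the values are made up.
example = io.StringIO(
    "header_id\treference_id\ttaxonomy\tsource\n"
    "contig_1\tref_A\tEscherichia coli\tlab\n"
)
print(missing_meta_columns(example))  # []
```

An empty list means all three required columns are present; extra columns such as `source` are ignored, which matches the note that additional columns may be included.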
Download ATCC Reference & Catalogue Databases:
ATCCfinder/download.py \
--download reference catalogue \
--api_key <api_string> \
--overwrite
Note that the combined ATCC reference genome file for use as alignment target is named atcc_references.fa
Align & Report a Query Sequence against ATCC References:
ATCCfinder/search.R \
--target path/to/atcc_references.fa \
--query path/to/query.fa \
--nhits 100 \
--overwrite F \
--outdir path/for/output
Create database from existing .fa.gz files:
ATCCfinder/download.py \
--format "./db_multifile" \
--out "./db_singlefile"
Here, fasta files in folder 'db_multifile' are combined into a single multi-fasta file that is deposited in the folder 'db_singlefile'.
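The combining step described above can be pictured with a short Python sketch. This is an illustration under stated assumptions, not the code in download.py: the function name `combine_fasta`, the `*.fa.gz` glob pattern, and the uncompressed output are all assumptions made for the example.

```python
import gzip
from pathlib import Path


def combine_fasta(in_dir, out_path, pattern="*.fa.gz"):
    """Concatenate gzipped FASTA files from in_dir into one multi-FASTA file.

    Returns the number of input files that were combined.
    """
    in_dir, out_path = Path(in_dir), Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    files = sorted(in_dir.glob(pattern))  # sorted for a reproducible order
    with open(out_path, "w") as out:
        for fasta in files:
            with gzip.open(fasta, "rt") as src:
                for line in src:
                    # Guard against a file lacking a final newline, which would
                    # otherwise glue the next header onto a sequence line.
                    out.write(line if line.endswith("\n") else line + "\n")
    return len(files)


# Example call (paths are illustrative):
# combine_fasta("./db_multifile", "./db_singlefile/atcc_references.fa")
```

Reading line by line keeps memory use low for large genomes, at the cost of some speed compared with copying raw bytes.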
search.R returns a file named report.tsv containing the following columns, which summarize the alignment results:
| Column | Description |
|---|---|
| qname | Query sequence name |
| qlen | Query sequence length |
| n_total_hits | Total number of query alignments against reference |
| max_mapq | The maximum mapq score returned from any alignment hit |
| max_nmatch | The maximum number of matched basepairs from any alignment |
| max_nmatch_% | The maximum number of matched basepairs from any alignment, reported as percentage of qlen |
| n_best_refs | The number of alignment hits containing both max_mapq and max_nmatch scores |
| best_refs | The reference assembly(s) corresponding to n_best_refs, with bracketed values indicating the number of hits associated with this reference |
| best_refs_taxonomy | The taxonomy assignment corresponding to n_best_refs, with bracketed values indicating the number of hits associated with this classification |
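To make the column definitions concrete, the sketch below derives several of them from a list of alignment hits for one query. It is an illustration only and does not reproduce search.R: the function name `summarise_hits`, the `(reference_id, mapq, nmatch)` tuple layout, the rounding, and the exact `ref [count]` formatting are assumptions.

```python
from collections import Counter


def summarise_hits(qname, qlen, hits):
    """Summarise alignment hits for one query.

    hits: list of (reference_id, mapq, nmatch) tuples (hypothetical layout).
    """
    if not hits:
        return {"qname": qname, "qlen": qlen, "n_total_hits": 0}
    max_mapq = max(mapq for _, mapq, _ in hits)
    max_nmatch = max(nmatch for _, _, nmatch in hits)
    # A "best" hit carries both the maximum mapq and the maximum nmatch.
    best = [ref for ref, mapq, nmatch in hits
            if mapq == max_mapq and nmatch == max_nmatch]
    counts = Counter(best)
    return {
        "qname": qname,
        "qlen": qlen,
        "n_total_hits": len(hits),
        "max_mapq": max_mapq,
        "max_nmatch": max_nmatch,
        "max_nmatch_%": round(100 * max_nmatch / qlen, 2),
        "n_best_refs": len(best),
        # Bracketed values give the number of best hits per reference.
        "best_refs": "; ".join(f"{ref} [{n}]" for ref, n in sorted(counts.items())),
    }


# Made-up hits for a 1000 bp query.
hits = [("ref_A", 60, 980), ("ref_A", 60, 980), ("ref_B", 60, 950), ("ref_C", 12, 980)]
summary = summarise_hits("query_1", 1000, hits)
print(summary["n_best_refs"], summary["best_refs"])  # 2 ref_A [2]
```

In this example `ref_C` matches as many bases as `ref_A` but has a lower mapq, and `ref_B` has the top mapq but fewer matched bases, so only the two `ref_A` hits count toward `n_best_refs`.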