ATCCfinder

Developed at Los Alamos National Laboratory (O# O4644)

Download ATCC Genome Portal microbial reference genomes and align query sequences.

About

The American Type Culture Collection (ATCC) sells a wide variety of microbes with strain-level taxonomic classification and associated sequenced reference genomes. ATCCfinder uses the ATCC application programming interface (API) to generate queryable databases from ATCC Genome Portal resources. The tool can generate databases for the four ATCC data types (reference, meta, catalogue, and annotation):

Strain-specific genome assembly sequence data (reference)
Information about how each strain was collected (meta, catalogue)
Structural/functional information about genome assemblies (annotation)

ATCCfinder contains two core functionalities that may be used in conjunction or independently:

Download ATCC references (with a valid API key)
Query reference sequences and report alignment results. The tool was built primarily for use with ATCC reference genomes, but custom sequence databases may also be searched.

Once ATCCfinder has retrieved the ATCC reference genome database, queries can be compared against the ATCC reference genomes using the sequence alignment tool minimap2. The alignment results are then parsed to produce summary data describing which ATCC-available species and strain, if any, the query sequence matches.

Dependencies

Downloading Databases

If you plan to download databases from ATCC yourself, your environment must include the following:

ATCC-Bioinformatics, Genome Portal API

The following is an example environment created for downloading databases from ATCC:

## Download genome_portal_api package
git clone https://github.com/ATCC-Bioinformatics/genome_portal_api.git

## Define environment
mamba create -n ATCCfinder_download
mamba activate ATCCfinder_download

## Install genome_portal_api package
mamba install git pip
pip install /path/to/genome_portal_api

Note that you will need to follow ATCC's instructions for creating your own account and API key. See the Genome Portal API GitHub link above for details.

Searching References

The following software is required for aligning queries against ATCC references: minimap2, samtools, and R (r-base) with the argparse package.

The following is an example environment created for searching databases from ATCC:

## Define environment
mamba create -n ATCCfinder_search
mamba activate ATCCfinder_search

## Install packages
mamba install -c bioconda minimap2
mamba install -c bioconda samtools
mamba install r-base r-argparse

Parameters

download.py

Parameter	Description
--help	Show the help message and exit
--download	Specify which / any ATCC databases to download
--api_key	ATCC Genome Portal API key, used when downloading databases
--overwrite, --no-overwrite	When downloading databases, specifies whether existing downloads in the output folder are replaced (--overwrite) or kept (--no-overwrite)
--format	Path to a folder of existing reference files (for example, references obtained from a separate database repository rather than downloaded by this tool), which will be combined into a single file for searching
--out	Output folder

search.R

Parameter	Description
--query	Query Sequence(s) fasta file
--target	Target database fasta file
--aligner	Specify the alignment method. Currently supports: minimap2 (fna:fna), tblastn (faa:fna), & blastp (faa:faa). Default = minimap2.
--target_meta	Required if using a custom target database. Provides meta data associated with sequences in a standardized format. Must contain columns 'header_id', 'reference_id', and 'taxonomy'. Additional columns may be included.
--nhits	The maximum number of target hits to return per query
--note_nmatch_p	(0-100) Specify an alignment nmatch percentage at or above which results will be noted as potential matches despite their aggregate score not being the maximum.
--overwrite	(T/F) If alignment output already exists, should it be overwritten?
--outdir	Output directory
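
A custom --target_meta file can be checked for the three required columns before a search is started. The following is a minimal Python sketch, not part of ATCCfinder. It assumes the metadata file is tab-delimited with a header row, which this README does not state, and the function name missing_meta_columns is hypothetical.

```python
import csv
import io

# Column names required by --target_meta, as listed in the parameter table.
REQUIRED_COLUMNS = {"header_id", "reference_id", "taxonomy"}


def missing_meta_columns(handle, delimiter="\t"):
    """Return the required column names that are absent from the header row.

    Assumption: the file is delimited text with a header line; the tab
    delimiter is a guess and can be changed by the caller.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return sorted(REQUIRED_COLUMNS - present)


# In-memory example; the row values are made-up placeholders.
example = io.StringIO(
    "header_id\treference_id\ttaxonomy\tnotes\n"
    "contig_1\tref_A\tEscherichia coli\textra column is allowed\n"
)
missing = missing_meta_columns(example)
```

An empty list means all required columns are present; additional columns are ignored, in line with the parameter description.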

Example Usage

Download ATCC Reference & Catalogue Databases:

ATCCfinder/download.py \
--download reference catalogue \
--api_key <api_string> \
--overwrite

Note that the combined ATCC reference genome file for use as alignment target is named atcc_references.fa
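
To keep the API key out of shell history, the download command above can be assembled from a script. This is a hedged Python sketch that only uses flags shown in this README (--download, --api_key, --overwrite, --no-overwrite). The environment variable name ATCC_API_KEY and both function names are hypothetical.

```python
import os
import subprocess


def build_download_command(api_key, databases=("reference", "catalogue"), overwrite=True):
    """Build the argument list for ATCCfinder/download.py.

    Only flags documented in this README are used; the script path matches
    the example usage and may need adjusting for your checkout.
    """
    if not api_key:
        raise ValueError("an ATCC Genome Portal API key is required")
    cmd = ["ATCCfinder/download.py", "--download", *databases, "--api_key", api_key]
    cmd.append("--overwrite" if overwrite else "--no-overwrite")
    return cmd


def run_download(env_var="ATCC_API_KEY"):
    """Read the key from an environment variable (hypothetical name) and run the download."""
    key = os.environ.get(env_var, "")
    subprocess.run(build_download_command(key), check=True)
```

Calling run_download() after exporting the key in your shell would run the same command as the example above.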

Align & Report a Query Sequence against ATCC References:

ATCCfinder/search.R \
--target path/to/atcc_references.fa \
--query path/to/query.fa \
--nhits 100 \
--overwrite F \
--outdir path/for/output

Create database from existing .fa.gz files:

ATCCfinder/download.py \
--format "./db_multifile" \
--out "./db_singlefile"

Here, fasta files in folder 'db_multifile' are combined into a single multi-fasta file that is deposited in the folder 'db_singlefile'.
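
The combining step described above can be pictured with the short Python sketch below. It is an illustration under stated assumptions, not the code in download.py: it assumes inputs end in .fa.gz, writes an uncompressed output, and leaves the output file name to the caller. The function name combine_fasta is hypothetical.

```python
import gzip
from pathlib import Path


def combine_fasta(in_dir, out_file):
    """Concatenate every *.fa.gz file in in_dir into one multi-FASTA file.

    Returns the number of input files that were combined.
    """
    in_dir, out_file = Path(in_dir), Path(out_file)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    sources = sorted(in_dir.glob("*.fa.gz"))  # sorted for a reproducible order
    with open(out_file, "w") as out:
        for src in sources:
            with gzip.open(src, "rt") as fh:
                for line in fh:
                    # Guarantee a trailing newline so that the next file's
                    # header never fuses with the previous sequence line.
                    out.write(line if line.endswith("\n") else line + "\n")
    return len(sources)
```

For instance, combine_fasta("./db_multifile", "./db_singlefile/atcc_references.fa") would mirror the folders in the example; whether --format uses that exact output name is not stated here.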

Output

search.R writes a file named report.tsv containing the following columns, which summarize the alignment results:

Column	Description
qname	Query sequence name
qlen	Query sequence length
n_total_hits	Total number of query alignments against reference
max_mapq	The maximum mapq score returned from any alignment hit
max_nmatch	The maximum number of matched basepairs from any alignment
max_nmatch_%	The maximum number of matched basepairs from any alignment, reported as percentage of `qlen`
n_best_refs	The number of alignment hits containing both `max_mapq` and `max_nmatch` scores
best_refs	The reference assembly(s) corresponding to `n_best_refs`, with bracketed values indicating the number of hits associated with this reference
best_refs_taxonomy	The taxonomy assignment corresponding to `n_best_refs`, with bracketed values indicating the number of hits associated with this classification
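
How these columns relate to individual alignment hits can be illustrated with a small Python sketch. It is one reading of the column descriptions above, not the logic in search.R. The input layout (one dict per hit with keys ref, mapq, and nmatch), the "name [count]" formatting, and the function name summarize_hits are all assumptions.

```python
from collections import Counter


def summarize_hits(qname, qlen, hits):
    """Summarize the hits of one query into report-style fields.

    hits: list of dicts with keys 'ref', 'mapq', 'nmatch' (assumed layout).
    """
    if not hits:
        return {"qname": qname, "qlen": qlen, "n_total_hits": 0}
    max_mapq = max(h["mapq"] for h in hits)
    max_nmatch = max(h["nmatch"] for h in hits)
    # A "best" hit carries both maxima at once; this list can be empty when
    # the top mapq and the top nmatch come from different hits.
    best = [h for h in hits if h["mapq"] == max_mapq and h["nmatch"] == max_nmatch]
    ref_counts = Counter(h["ref"] for h in best)
    return {
        "qname": qname,
        "qlen": qlen,
        "n_total_hits": len(hits),
        "max_mapq": max_mapq,
        "max_nmatch": max_nmatch,
        "max_nmatch_%": round(100 * max_nmatch / qlen, 2),
        "n_best_refs": len(best),
        "best_refs": "; ".join(f"{ref} [{n}]" for ref, n in sorted(ref_counts.items())),
    }


# Made-up hits for a 1000 bp query.
example_hits = [
    {"ref": "A", "mapq": 60, "nmatch": 900},
    {"ref": "A", "mapq": 60, "nmatch": 900},
    {"ref": "B", "mapq": 60, "nmatch": 900},
    {"ref": "C", "mapq": 10, "nmatch": 500},
]
summary = summarize_hits("query_1", 1000, example_hits)
```

In this example three hits share the maximum mapq and nmatch, so n_best_refs is 3 and best_refs lists reference A with two hits and reference B with one.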
