Download ATCC Genome Portal microbial reference genomes, align query sequences, receive detailed result reports.

lanl/ATCCfinder

ATCCfinder

Developed at Los Alamos National Laboratory (O# O4644)

Download ATCC Genome Portal microbial reference genomes and align query sequences.



Table of Contents

  1. About
  2. Dependencies
  3. Parameters
  4. Example Usage
  5. Output



About

The American Type Culture Collection (ATCC) sells a wide variety of microbes with strain-level taxonomic classification and associated sequenced reference genomes. ATCCfinder uses the ATCC application programming interface (API) to generate queryable databases from ATCC Genome resources. The tool can generate databases of the four ATCC data types (reference, meta, catalogue, and annotation), which cover:

  • Strain-specific genome assembly sequence data (reference)
  • Information about how each strain was collected (meta, catalogue)
  • Structural/functional information about genome assemblies (annotation)



ATCCfinder has two core functions that may be used together or independently:

  1. Download ATCC references (with a valid API key)
  2. Query reference sequences and report alignment results. The tool was built primarily for use with ATCC reference genomes, but custom sequence databases may also be searched.

Once the ATCC reference genome database is retrieved by ATCCfinder, queries may be compared against ATCC reference genomes using the sequence alignment tool minimap2, whose results are then parsed to produce summary data describing what ATCC-available species and strain, if any, the query sequence matches.



Dependencies

Downloading Databases

If you plan to download databases from ATCC yourself, your environment needs the ATCC genome_portal_api package, which is installed with pip.

The following is an example environment created for downloading databases from ATCC:

## Download genome_portal_api package
git clone https://github.com/ATCC-Bioinformatics/genome_portal_api.git

## Define environment
mamba create -n ATCCfinder_download
mamba activate ATCCfinder_download

## Install genome_portal_api package
mamba install git pip
pip install /path/to/genome_portal_api

Note that you will need to follow ATCC's instructions for creating an account and generating your own API key; see the genome_portal_api GitHub repository linked above for details.



Searching References

The following software is required for aligning queries to ATCC references: minimap2, samtools, and R with the argparse package.

The following is an example environment created for searching databases from ATCC:

## Define environment
mamba create -n ATCCfinder_search
mamba activate ATCCfinder_search

## Install packages
mamba install -c bioconda minimap2
mamba install -c bioconda samtools
mamba install r-base r-argparse



Parameters

download.py

| Parameter | Description |
| --- | --- |
| --help | Show the help message and exit |
| --download | Specify which, if any, ATCC databases to download |
| --api_key | ATCC Genome Portal API key (needed for downloading) |
| --overwrite, --no-overwrite | When downloading databases, specifies whether existing downloads in the output folder are replaced (--overwrite) or kept (--no-overwrite) |
| --format | Path to a folder of existing reference files (for example, references obtained from a separate database repository), which will be combined into a single file for searching |
| --out | Output folder |



search.R

| Parameter | Description |
| --- | --- |
| --query | FASTA file of query sequence(s) |
| --target | FASTA file of the target database |
| --aligner | Alignment method. Currently supports minimap2 (fna:fna), tblastn (faa:fna), and blastp (faa:faa). Default: minimap2. |
| --target_meta | Required when using a custom target database. Provides metadata associated with the sequences in a standardized format. Must contain the columns 'header_id', 'reference_id', and 'taxonomy'; additional columns may be included. |
| --nhits | Maximum number of target hits to return per query |
| --note_nmatch_p | (0-100) Alignment nmatch percentage at or above which results are noted as potential matches even if their aggregate score is not the maximum |
| --overwrite | (T/F) Whether existing alignment output should be overwritten |
| --outdir | Output directory |
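
To illustrate the --target_meta requirement, the following Python sketch checks that a metadata file has the three required columns before it is passed to search.R. It is not part of ATCCfinder: the function name missing_meta_columns is hypothetical, and the tab-separated layout is an assumption, since the expected delimiter is not stated here.

```python
import csv
import io

# Column names required by --target_meta (taken from the parameter table).
REQUIRED_COLUMNS = {"header_id", "reference_id", "taxonomy"}


def missing_meta_columns(handle, delimiter="\t"):
    """Return the required columns absent from the header row, sorted.

    Assumption: the metadata file is delimited text with a header row.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, [])
    present = {name.strip() for name in header}
    return sorted(REQUIRED_COLUMNS - present)


# Hypothetical example rows; the extra 'source' column is allowed.
example = io.StringIO(
    "header_id\treference_id\ttaxonomy\tsource\n"
    "contig_1\tasm_A\tEscherichia coli\tlab\n"
)
print(missing_meta_columns(example))  # prints []
```

An empty list means all required columns are present; any names returned would need to be added to the file.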



Example Usage

Download ATCC Reference & Catalogue Databases:

ATCCfinder/download.py \
--download reference catalogue \
--api_key <api_string> \
--overwrite

Note that the combined ATCC reference genome file for use as an alignment target is named atcc_references.fa.

Align & Report a Query Sequence against ATCC References:

ATCCfinder/search.R \
--target path/to/atcc_references.fa \
--query path/to/query.fa \
--nhits 100 \
--overwrite F \
--outdir path/for/output
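
Once search.R has finished, the report can be inspected programmatically. The sketch below is a hypothetical helper, not part of ATCCfinder. It assumes that report.tsv is tab-separated with a header row using the column names listed in the Output section, and that the file is written to the --outdir folder.

```python
import csv


def read_report(path, min_nmatch_pct=0.0):
    """Yield report rows whose max_nmatch_% is at least min_nmatch_pct.

    Assumptions: tab-separated file with a header row; rows with a
    non-numeric max_nmatch_% (for example 'NA') are skipped.
    """
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                pct = float(row["max_nmatch_%"])
            except ValueError:
                continue
            if pct >= min_nmatch_pct:
                yield row


# Hypothetical usage, keeping queries with at least 95% matched bases:
# for row in read_report("path/for/output/report.tsv", min_nmatch_pct=95):
#     print(row["qname"], row["best_refs"])
```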

Create database from existing .fa.gz files:

ATCCfinder/download.py \
--format "./db_multifile" \
--out "./db_singlefile"

Here, fasta files in folder 'db_multifile' are combined into a single multi-fasta file that is deposited in the folder 'db_singlefile'.
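
The combining step can be pictured with the following Python sketch. It only illustrates the idea of merging gzipped FASTA files into one multi-FASTA file and is not the implementation used by download.py; the function name combine_fasta_gz and the output file name combined.fa are hypothetical.

```python
import gzip
from pathlib import Path


def combine_fasta_gz(in_dir, out_file):
    """Concatenate every *.fa.gz in in_dir into one uncompressed FASTA file.

    Returns the number of input files that were combined.
    """
    in_dir, out_file = Path(in_dir), Path(out_file)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    sources = sorted(in_dir.glob("*.fa.gz"))
    with open(out_file, "w") as out:
        for src in sources:
            with gzip.open(src, "rt") as handle:
                for line in handle:
                    # Guarantee a trailing newline so records from
                    # consecutive files never run together.
                    out.write(line if line.endswith("\n") else line + "\n")
    return len(sources)


# Hypothetical usage mirroring the folders in the example above:
# combine_fasta_gz("./db_multifile", "./db_singlefile/combined.fa")
```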



Output

search.R will return a file titled report.tsv containing the following columns summarizing alignment results:

| Column | Description |
| --- | --- |
| qname | Query sequence name |
| qlen | Query sequence length |
| n_total_hits | Total number of query alignments against the reference |
| max_mapq | The maximum mapq score returned from any alignment hit |
| max_nmatch | The maximum number of matched base pairs from any alignment |
| max_nmatch_% | The maximum number of matched base pairs from any alignment, reported as a percentage of qlen |
| n_best_refs | The number of alignment hits that have both the max_mapq and max_nmatch scores |
| best_refs | The reference assembly or assemblies corresponding to n_best_refs, with bracketed values indicating the number of hits associated with each reference |
| best_refs_taxonomy | The taxonomy assignment corresponding to n_best_refs, with bracketed values indicating the number of hits associated with each classification |
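
How these columns relate to raw alignments can be sketched in Python. This is an illustration under stated assumptions, not the logic of search.R: it assumes minimap2 output in PAF format (where column 10 is the number of matching bases and column 12 is the mapping quality), a single query per call, and a "name [count]" layout for best_refs. The function name summarise_hits and the example records are hypothetical.

```python
from collections import Counter


def summarise_hits(paf_lines):
    """Summarise PAF alignment lines for one query into report-style fields."""
    hits = []
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        # PAF columns used: 1 query name, 2 query length, 6 target name,
        # 10 number of matching bases, 12 mapping quality.
        hits.append({
            "qname": fields[0],
            "qlen": int(fields[1]),
            "ref": fields[5],
            "nmatch": int(fields[9]),
            "mapq": int(fields[11]),
        })
    if not hits:
        return None
    qlen = hits[0]["qlen"]
    max_mapq = max(h["mapq"] for h in hits)
    max_nmatch = max(h["nmatch"] for h in hits)
    # Best hits carry both maxima at once, as in the n_best_refs description.
    best = [h for h in hits
            if h["mapq"] == max_mapq and h["nmatch"] == max_nmatch]
    ref_counts = Counter(h["ref"] for h in best)
    return {
        "qname": hits[0]["qname"],
        "qlen": qlen,
        "n_total_hits": len(hits),
        "max_mapq": max_mapq,
        "max_nmatch": max_nmatch,
        "max_nmatch_%": round(100 * max_nmatch / qlen, 2),
        "n_best_refs": len(best),
        "best_refs": ", ".join(
            f"{ref} [{n}]" for ref, n in sorted(ref_counts.items())
        ),
    }


# Hypothetical PAF records for a single 1000 bp query.
example = [
    "q1\t1000\t0\t1000\t+\trefA\t5000\t100\t1100\t990\t1000\t60",
    "q1\t1000\t0\t1000\t+\trefB\t5000\t100\t1100\t990\t1000\t60",
    "q1\t1000\t0\t800\t+\trefC\t4000\t0\t800\t700\t800\t30",
]
summary = summarise_hits(example)
print(summary["n_best_refs"], summary["best_refs"])
```

In this example two hits share the top mapq and nmatch values, so both references are listed as best matches.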
