nf-core/seqsubmit


Introduction

nf-core/seqsubmit is a Nextflow pipeline for submitting sequence data to the European Nucleotide Archive (ENA). The pipeline currently supports three submission modes, each routed to a dedicated workflow and requiring its own input samplesheet structure:

  • mags: submission of metagenome-assembled genomes (MAGs) with the GENOMESUBMIT workflow
  • bins: submission of bins with the GENOMESUBMIT workflow
  • metagenomic_assemblies: submission of assemblies with the ASSEMBLYSUBMIT workflow

[Figure: seqsubmit workflow diagram]

Requirements

Set up your Nextflow secrets before running the pipeline:

nextflow secrets set WEBIN_ACCOUNT "Webin-XXX"

nextflow secrets set WEBIN_PASSWORD "XXX"

Replace the placeholder values in the commands above with your authorised Webin credentials.

Input samplesheets

mags and bins modes (GENOMESUBMIT)

The input must follow assets/schema_input_genome.json.

Required columns:

  • sample
  • fasta (must end with .fa.gz or .fasta.gz)
  • accession
  • assembly_software
  • binning_software
  • binning_parameters
  • stats_generation_software
  • metagenome
  • environmental_medium
  • broad_environment
  • local_environment
  • co-assembly

Columns that are currently required but will become optional in the near future:

  • completeness
  • contamination
  • genome_coverage
  • rRNA_presence
  • NCBI_lineage

These fields are metadata required by the genome_uploader package. They are described in the docs.

Example samplesheet_genome.csv:

sample,fasta,accession,assembly_software,binning_software,binning_parameters,stats_generation_software,completeness,contamination,genome_coverage,metagenome,co-assembly,broad_environment,local_environment,environmental_medium,rRNA_presence,NCBI_lineage
lachnospira_eligens,data/bin_lachnospira_eligens.fa.gz,SRR24458089,spades_v3.15.5,metabat2_v2.6,default,CheckM2_v1.0.1,61.0,0.21,32.07,sediment metagenome,false,marine,cable_bacteria,marine_sediment,false,d__Bacteria;p__Proteobacteria;s_unclassified_Proteobacteria
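As a rough pre-flight check, the column requirements listed above can be tested before launching the pipeline. The following is a minimal sketch and not part of the pipeline: the function name `check_genome_samplesheet` is hypothetical, it assumes that required columns must also be non-empty, and assets/schema_input_genome.json remains the authoritative definition.

```python
import csv
import io

# Columns this README lists as required (including the five that are
# planned to become optional). The JSON schema remains authoritative.
REQUIRED_COLUMNS = [
    "sample", "fasta", "accession", "assembly_software", "binning_software",
    "binning_parameters", "stats_generation_software", "metagenome",
    "environmental_medium", "broad_environment", "local_environment",
    "co-assembly", "completeness", "contamination", "genome_coverage",
    "rRNA_presence", "NCBI_lineage",
]
FASTA_SUFFIXES = (".fa.gz", ".fasta.gz")


def check_genome_samplesheet(handle):
    """Return a list of human-readable problems found in the samplesheet."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    # Data rows start on line 2, right after the header
    # (assumes no field spans multiple lines).
    for line_no, row in enumerate(reader, start=2):
        for col in REQUIRED_COLUMNS:
            # Assumption: a required column must also hold a value.
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty value in '{col}'")
        fasta = (row["fasta"] or "").strip()
        if fasta and not fasta.endswith(FASTA_SUFFIXES):
            problems.append(
                f"line {line_no}: fasta must end with .fa.gz or .fasta.gz"
            )
    return problems


# Hypothetical two-column sheet, used only to show the missing-column report.
demo = io.StringIO("sample,fasta\nbin_1,data/bin_1.fa.gz\n")
print(check_genome_samplesheet(demo))
```

To check a real file, open it with `open(path, newline="")` and pass the handle to the function. An empty list means none of these particular checks failed; it does not guarantee that the pipeline's own schema validation will pass.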

metagenomic_assemblies mode (ASSEMBLYSUBMIT)

The input must follow assets/schema_input_assembly.json.

Required columns:

  • sample
  • fasta (must end with .fa.gz or .fasta.gz)
  • run_accession
  • assembler
  • assembler_version

At least one of the following must be provided per row:

  • reads (fastq_1, optional fastq_2 for paired-end)
  • coverage

If coverage is missing and reads are provided, the workflow calculates average coverage with coverm.

Example samplesheet_assembly.csv:

sample,fasta,fastq_1,fastq_2,coverage,run_accession,assembler,assembler_version
assembly_1,data/contigs_1.fasta.gz,data/reads_1.fastq.gz,data/reads_2.fastq.gz,,ERR011322,SPAdes,3.15.5
assembly_2,data/contigs_2.fasta.gz,,,42.7,ERR011323,MEGAHIT,1.2.9
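The "reads or coverage" rule can be illustrated with a short script that sorts rows by how their coverage will be obtained. This is a sketch under stated assumptions, not the pipeline's own logic: the function name `classify_assembly_rows` is hypothetical, and it assumes that a supplied coverage value takes precedence when both reads and coverage are present.

```python
import csv
import io


def classify_assembly_rows(handle):
    """Group sample names by how their coverage value would be obtained."""
    result = {"has_coverage": [], "needs_coverm": [], "invalid": []}
    for row in csv.DictReader(handle):
        coverage = (row.get("coverage") or "").strip()
        fastq_1 = (row.get("fastq_1") or "").strip()
        if coverage:
            # Assumption: a supplied coverage value is used as-is.
            result["has_coverage"].append(row["sample"])
        elif fastq_1:
            # No coverage but reads present: coverage would be calculated.
            result["needs_coverm"].append(row["sample"])
        else:
            # Neither reads nor coverage: the row breaks the rule above.
            result["invalid"].append(row["sample"])
    return result


# The example samplesheet from this README.
EXAMPLE = """sample,fasta,fastq_1,fastq_2,coverage,run_accession,assembler,assembler_version
assembly_1,data/contigs_1.fasta.gz,data/reads_1.fastq.gz,data/reads_2.fastq.gz,,ERR011322,SPAdes,3.15.5
assembly_2,data/contigs_2.fasta.gz,,,42.7,ERR011323,MEGAHIT,1.2.9
"""

print(classify_assembly_rows(io.StringIO(EXAMPLE)))
```

For the example samplesheet, assembly_1 falls into the group whose coverage is calculated from reads, and assembly_2 into the group with a supplied coverage value.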

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

Required parameters:

Parameter            Description
--mode               Type of the data to be submitted. Options: [mags, bins, metagenomic_assemblies]
--input              Path to the samplesheet describing the data to be submitted
--outdir             Path to the output directory for pipeline results
--submission_study   ENA study accession (PRJ/ERP) to submit the data to
--centre_name        Name of the submitter's organisation

Optional parameters:

Parameter            Description
--upload_tpa         Flag to control the type of assembly study (third party assembly or not). Default: false
--test_upload        Upload to TEST ENA server instead of LIVE. Default: false
--webincli_submit    If set to false, submissions will be validated, but not submitted. Default: true

General command template:

nextflow run nf-core/seqsubmit \
   -profile <docker/singularity/...> \
   --mode <mags|bins|metagenomic_assemblies> \
   --input <samplesheet.csv> \
   --centre_name <your_centre> \
   --submission_study <your_study> \
   --outdir <outdir>
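When the command is assembled by a script, for example to launch several samplesheets in turn, the template above can be built programmatically. This is a minimal sketch: `build_command` is a hypothetical helper, and it deliberately defaults `test_upload` to true (unlike the pipeline's own default of false) so that an accidental call targets the TEST server.

```python
import shlex

VALID_MODES = ("mags", "bins", "metagenomic_assemblies")


def build_command(mode, samplesheet, study, centre_name, outdir,
                  profile="docker", test_upload=True, webincli_submit=True):
    """Return the nextflow invocation as an argument list."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return [
        "nextflow", "run", "nf-core/seqsubmit",
        "-profile", profile,
        "--mode", mode,
        "--input", samplesheet,
        "--submission_study", study,
        "--centre_name", centre_name,
        # Booleans are passed as lowercase strings, as in the README examples.
        "--test_upload", str(test_upload).lower(),
        "--webincli_submit", str(webincli_submit).lower(),
        "--outdir", outdir,
    ]


# Placeholder values only; substitute your own study and centre name.
cmd = build_command("mags", "samplesheet.csv", "<your_study>",
                    "<your_centre>", "results/mags")
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with paths that contain spaces.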

Test submission (to the ENA TEST server) in mags mode:

nextflow run nf-core/seqsubmit \
   -profile docker \
   --mode mags \
   --input assets/samplesheet_genomes.csv \
   --submission_study <your_study> \
   --centre_name TEST_CENTER \
   --webincli_submit true \
   --test_upload true \
   --outdir results/validate_mags

Test submission (to the ENA TEST server) in metagenomic_assemblies mode:

nextflow run nf-core/seqsubmit \
   -profile docker \
   --mode metagenomic_assemblies \
   --input assets/samplesheet_assembly.csv \
   --submission_study <your_study> \
   --centre_name TEST_CENTER \
   --webincli_submit true \
   --test_upload true \
   --outdir results/validate_assemblies

Live submission example:

nextflow run nf-core/seqsubmit \
   -profile docker \
   --mode metagenomic_assemblies \
   --input assets/samplesheet_assembly.csv \
   --submission_study PRJEB98843 \
   --centre_name <your_centre> \
   --test_upload false \
   --webincli_submit true \
   --outdir results/live_assembly

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
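A parameters file for the -params-file option can be written by hand or generated. The sketch below renders one as JSON, a format Nextflow accepts alongside YAML. The helper `render_params` is hypothetical, the values are placeholders, and the keys are assumed to match the CLI parameter names without the leading dashes.

```python
import json

# Parameters this README lists as required.
REQUIRED_PARAMS = ("mode", "input", "outdir", "submission_study", "centre_name")


def render_params(params):
    """Return the parameters as JSON text, refusing incomplete sets."""
    missing = [name for name in REQUIRED_PARAMS if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    return json.dumps(params, indent=2)


# Placeholder values only; substitute your own study and centre name.
example = {
    "mode": "mags",
    "input": "samplesheet_genome.csv",
    "outdir": "results/validate_mags",
    "submission_study": "<your_study>",
    "centre_name": "<your_centre>",
    "test_upload": True,
}

print(render_params(example))
```

Saving the printed text as, say, params.json allows the run to be started with `nextflow run nf-core/seqsubmit -profile docker -params-file params.json`.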

For more details and further functionality, please refer to the usage documentation and the parameter documentation.

Pipeline output

Key output locations in --outdir:

  • upload/manifests/: generated manifest files for submission
  • upload/webin_cli/: ENA Webin CLI reports
  • multiqc/: MultiQC summary report
  • pipeline_info/: execution reports, trace, DAG, and software versions
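After a run, a quick check can confirm which of these locations were actually created. This is a sketch only: `missing_outputs` is a hypothetical helper, and it assumes every listed directory is produced in every mode, which may not hold (see the output documentation).

```python
import tempfile
from pathlib import Path

# Key output locations listed in this README, relative to --outdir.
EXPECTED_SUBDIRS = (
    "upload/manifests",
    "upload/webin_cli",
    "multiqc",
    "pipeline_info",
)


def missing_outputs(outdir):
    """Return the expected subdirectories that do not exist under outdir."""
    root = Path(outdir)
    return [sub for sub in EXPECTED_SUBDIRS if not (root / sub).is_dir()]


# Demonstration on a throwaway directory that contains only multiqc/.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "multiqc").mkdir()
    print(missing_outputs(tmp))
```

On a real run, pass the same path that was given to --outdir.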

For full details, see the output documentation.

Credits

nf-core/seqsubmit was originally written by Martin Beracochea, Ekaterina Sakharova, Sofiia Ochkalova, and Evangelos Karatzas.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #seqsubmit channel (you can join with this invite).

Citations

If you use this pipeline, please cite all the software it uses. This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

MGnify: the microbiome sequence data analysis resource in 2023

Richardson L, Allen B, Baldi G, Beracochea M, Bileschi ML, Burdett T, et al.

Vol. 51, Nucleic Acids Research. Oxford University Press (OUP); 2022. p. D753–9. Available from: http://dx.doi.org/10.1093/nar/gkac1080

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
