nf-core/seqsubmit is a Nextflow pipeline for submitting sequence data to ENA. Currently, the pipeline supports three submission modes, each routed to a dedicated workflow and requiring its own input samplesheet structure:
- `mags` for Metagenome Assembled Genomes (MAGs) submission with the `GENOMESUBMIT` workflow
- `bins` for bins submission with the `GENOMESUBMIT` workflow
- `metagenomic_assemblies` for assembly submission with the `ASSEMBLYSUBMIT` workflow
- Nextflow >=25.04.0
- Webin account registered at https://www.ebi.ac.uk/ena/submit/webin/login
- Raw reads used to assemble contigs submitted to INSDC and associated accessions available
Set up your environment secrets before running the pipeline:
nextflow secrets set WEBIN_ACCOUNT "Webin-XXX"
nextflow secrets set WEBIN_PASSWORD "XXX"
Make sure you replace the placeholder values in the commands above with your authorised credentials.
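Before storing the secrets, it can help to confirm that the account name has the expected shape. The sketch below uses a hypothetical helper, `is_webin_account`, and assumes that Webin account names consist of `Webin-` followed by digits; that pattern is an assumption made for illustration, not a rule enforced by the pipeline. The account value shown is a placeholder.

```shell
# Hypothetical helper: succeeds if the argument looks like "Webin-<digits>".
# The pattern is an assumption about the usual account format.
is_webin_account() {
  case "$1" in
    Webin-*[!0-9]*|Webin-) return 1 ;;  # non-digit after the prefix, or nothing at all
    Webin-*) return 0 ;;                # prefix followed by digits only
    *) return 1 ;;                      # anything else
  esac
}

account="Webin-12345"  # placeholder, replace with your own account name
if is_webin_account "$account"; then
  echo "account name looks valid"
else
  echo "unexpected account name: $account" >&2
fi

# Once the secrets are set, list the stored secret names to confirm
# that WEBIN_ACCOUNT and WEBIN_PASSWORD are both present:
# nextflow secrets list
```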
The input must follow assets/schema_input_genome.json.
Required columns:
- `sample`
- `fasta` (must end with `.fa.gz` or `.fasta.gz`)
- `accession`
- `assembly_software`
- `binning_software`
- `binning_parameters`
- `stats_generation_software`
- `metagenome`
- `environmental_medium`
- `broad_environment`
- `local_environment`
- `co-assembly`
Columns that are currently required but will become optional in the near future:
- `completeness`
- `contamination`
- `genome_coverage`
- `rRNA_presence`
- `NCBI_lineage`
These fields are metadata required by the genome_uploader package. They are described in the docs.
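A quick header check can catch a missing column before the pipeline runs its own schema validation. The following is a minimal sketch, not part of the pipeline: `missing_columns` is a hypothetical helper that prints every requested column name absent from the first line of a CSV file. It assumes a plain comma-separated header without quoted fields.

```shell
# Hypothetical helper: report required columns missing from a samplesheet header.
# Usage: missing_columns <samplesheet.csv> <column>...
missing_columns() {
  sheet=$1
  shift
  # Read the header line and strip a possible Windows carriage return.
  header=$(head -n 1 "$sheet" | tr -d '\r')
  for col in "$@"; do
    case ",$header," in
      *",$col,"*) ;;               # column present
      *) printf '%s\n' "$col" ;;   # column missing
    esac
  done
}

# Required columns for the mags and bins modes, as listed above.
GENOME_COLUMNS="sample fasta accession assembly_software binning_software binning_parameters stats_generation_software metagenome environmental_medium broad_environment local_environment co-assembly"
```

Calling `missing_columns samplesheet_genome.csv $GENOME_COLUMNS` (unquoted, so the shell splits the list into separate arguments) prints nothing when the header is complete.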
Example samplesheet_genome.csv:
sample,fasta,accession,assembly_software,binning_software,binning_parameters,stats_generation_software,completeness,contamination,genome_coverage,metagenome,co-assembly,broad_environment,local_environment,environmental_medium,rRNA_presence,NCBI_lineage
lachnospira_eligens,data/bin_lachnospira_eligens.fa.gz,SRR24458089,spades_v3.15.5,metabat2_v2.6,default,CheckM2_v1.0.1,61.0,0.21,32.07,sediment metagenome,false,marine,cable_bacteria,marine_sediment,false,d__Bacteria;p__Proteobacteria;s_unclassified_Proteobacteria
The input must follow assets/schema_input_assembly.json.
Required columns:
- `sample`
- `fasta` (must end with `.fa.gz` or `.fasta.gz`)
- `run_accession`
- `assembler`
- `assembler_version`
At least one of the following must be provided per row:
- reads (`fastq_1`, optional `fastq_2` for paired-end)
- `coverage`
If coverage is missing and reads are provided, the workflow calculates average coverage with coverm.
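The "reads or coverage" rule can be checked per row before launching a run. The sketch below is illustrative only: `rows_without_coverage_source` is a hypothetical helper that prints the `sample` value of each row where both `fastq_1` and `coverage` are empty. It assumes a simple CSV with no quoted fields containing commas and locates columns by header name rather than by position.

```shell
# Hypothetical helper: print the sample name of every row that provides
# neither reads (fastq_1) nor a coverage value.
rows_without_coverage_source() {
  awk -F',' '
    # Return the value of a named column, or "" if the column does not exist.
    function field(name) { return (name in col) ? $(col[name]) : "" }

    # Header row: remember the position of each column name.
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        sub(/\r$/, "", $i)
        col[$i] = i
      }
      next
    }

    # Skip blank lines.
    NF == 0 { next }

    # Strip a possible carriage return from the last field.
    { sub(/\r$/, "", $NF) }

    field("fastq_1") == "" && field("coverage") == "" { print field("sample") }
  ' "$1"
}
```

Running `rows_without_coverage_source samplesheet_assembly.csv` should print nothing for a samplesheet that satisfies the rule.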
Example samplesheet_assembly.csv:
sample,fasta,fastq_1,fastq_2,coverage,run_accession,assembler,assembler_version
assembly_1,data/contigs_1.fasta.gz,data/reads_1.fastq.gz,data/reads_2.fastq.gz,,ERR011322,SPAdes,3.15.5
assembly_2,data/contigs_2.fasta.gz,,,42.7,ERR011323,MEGAHIT,1.2.9
Note: If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.
| Parameter | Description |
|---|---|
| `--mode` | Type of the data to be submitted. Options: [mags, bins, metagenomic_assemblies] |
| `--input` | Path to the samplesheet describing the data to be submitted |
| `--outdir` | Path to the output directory for pipeline results |
| `--submission_study` | ENA study accession (PRJ/ERP) to submit the data to |
| `--centre_name` | Name of the submitter's organisation |
| Parameter | Description |
|---|---|
| `--upload_tpa` | Flag to control the type of assembly study (third party assembly or not). Default: false |
| `--test_upload` | Upload to TEST ENA server instead of LIVE. Default: false |
| `--webincli_submit` | If set to false, submissions will be validated, but not submitted. Default: true |
General command template:
nextflow run nf-core/seqsubmit \
-profile <docker/singularity/...> \
--mode <mags|bins|metagenomic_assemblies> \
--input <samplesheet.csv> \
--centre_name <your_centre> \
--submission_study <your_study> \
--outdir <outdir>
Validation run (submission to the ENA TEST server) in mags mode:
nextflow run nf-core/seqsubmit \
-profile docker \
--mode mags \
--input assets/samplesheet_genomes.csv \
--submission_study <your_study> \
--centre_name TEST_CENTER \
--webincli_submit true \
--test_upload true \
--outdir results/validate_mags
Validation run (submission to the ENA TEST server) in metagenomic_assemblies mode:
nextflow run nf-core/seqsubmit \
-profile docker \
--mode metagenomic_assemblies \
--input assets/samplesheet_assembly.csv \
--submission_study <your_study> \
--centre_name TEST_CENTER \
--webincli_submit true \
--test_upload true \
--outdir results/validate_assemblies
Live submission example:
nextflow run nf-core/seqsubmit \
-profile docker \
--mode metagenomic_assemblies \
--input assets/samplesheet_assembly.csv \
--submission_study PRJEB98843 \
--test_upload false \
--webincli_submit true \
--outdir results/live_assembly
Warning: Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
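As an illustration of the `-params-file` route, the sketch below writes a small YAML parameter file and shows the matching run command. The file name `params.yaml` and all values are placeholders chosen for this example; the keys mirror the parameters listed in the tables above.

```shell
# Write an example parameter file; every value below is a placeholder.
cat > params.yaml <<'EOF'
mode: metagenomic_assemblies
input: samplesheet_assembly.csv
outdir: results/validate_assemblies
submission_study: "<your_study>"
centre_name: "<your_centre>"
test_upload: true
webincli_submit: true
EOF

# Then launch the pipeline with the parameter file instead of individual flags:
# nextflow run nf-core/seqsubmit -profile docker -params-file params.yaml
```

Keeping parameters in a file makes a submission easier to repeat, for example by switching `test_upload` from `true` to `false` once the test run succeeds.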
For more details and further functionality, please refer to the usage documentation and the parameter documentation.
Key output locations in --outdir:
- `upload/manifests/`: generated manifest files for submission
- `upload/webin_cli/`: ENA Webin CLI reports
- `multiqc/`: MultiQC summary report
- `pipeline_info/`: execution reports, trace, DAG, and software versions
For full details, see the output documentation.
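After a run, a short check can confirm that the key output locations listed above were created. This is a sketch with a hypothetical helper, `missing_outputs`, which takes the `--outdir` path and prints each expected subdirectory that is absent. Some directories may legitimately be missing for a given mode or configuration, so treat the result as a hint rather than a failure.

```shell
# Hypothetical helper: list expected output subdirectories missing from an --outdir.
# Usage: missing_outputs <outdir>
missing_outputs() {
  for sub in upload/manifests upload/webin_cli multiqc pipeline_info; do
    [ -d "$1/$sub" ] || printf '%s\n' "$sub"
  done
}
```

For example, `missing_outputs results/validate_assemblies` prints nothing when all four directories exist.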
nf-core/seqsubmit was originally written by Martin Beracochea, Ekaterina Sakharova, Sofiia Ochkalova, Evangelos Karatzas.
If you would like to contribute to this pipeline, please see the contributing guidelines.
For further information or help, don't hesitate to get in touch on the Slack #seqsubmit channel (you can join with this invite).
If you use this pipeline please make sure to cite all used software. This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
MGnify: the microbiome sequence data analysis resource in 2023
Richardson L, Allen B, Baldi G, Beracochea M, Bileschi ML, Burdett T, et al.
Vol. 51, Nucleic Acids Research. Oxford University Press (OUP); 2022. p. D753–9. Available from: http://dx.doi.org/10.1093/nar/gkac1080
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
You can cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
