Staff Publications

Record number 552499
Title Scalable workflows and reproducible data analysis for genomics
Author(s) Strozzi, Francesco; Janssen, Roel; Wurmus, Ricardo; Crusoe, Michael R.; Githinji, George; Tommaso, Paolo Di; Belhachemi, Dominique; Möller, Steffen; Smant, Geert; Ligt, Joep de; Prins, Pjotr
Source In: Evolutionary Genomics. Humana Press Inc. (Methods in Molecular Biology) - ISBN 9781493990733 - p. 723-745.
DOI https://doi.org/10.1007/978-1-4939-9074-0_24
Department(s) Laboratory of Nematology
Publication type Peer reviewed book chapter
Publication year 2019
Keyword(s) Big data - Bioconda - Bioinformatics - Cloud computing - Cluster computing - Common Workflow Language - CWL - Debian Linux - Evolutionary biology - GNU Guix - Guix Workflow Language - MPI - MrBayes - Nextflow - Parallelization - Snakemake - Virtual machine
Abstract

Biological, clinical, and pharmacological research now often involves analyses of genomes, transcriptomes, proteomes, and interactomes, within and between individuals and across species. Because of the large data volumes involved, the analysis and integration of data generated by such high-throughput technologies have become computationally intensive, and analysis can no longer happen on a typical desktop computer. In this chapter we show how to describe and execute the same analysis using a number of workflow systems and how these follow different approaches to tackle execution and reproducibility issues. We show how any researcher can create a reusable and reproducible bioinformatics pipeline that can be deployed and run anywhere. We show how to create a scalable, reusable, and shareable workflow using four different workflow engines: the Common Workflow Language (CWL), Guix Workflow Language (GWL), Snakemake, and Nextflow, each of which can be run in parallel. We show how to bundle a number of tools used in evolutionary biology by using the Debian, GNU Guix, and Bioconda software distributions, along with container systems such as Docker, GNU Guix, and Singularity. Together these distributions represent the overall majority of software packages relevant for biology, including PAML, Muscle, MAFFT, MrBayes, and BLAST. Software bundled in lightweight containers can be deployed on a desktop, in the cloud, and, increasingly, on compute clusters. By bundling software through these public software distributions, and by creating reproducible and shareable pipelines using these workflow engines, bioinformaticians not only spend less time reinventing the wheel but also get closer to the ideal of making science reproducible. The examples in this chapter allow a quick comparison of different solutions.
