Staff Publications

Record number: 532216
Title: Managing Variant Calling Files the Big Data Way: Using HDFS and Apache Parquet
Author(s): Boufea, Aikaterini; Finkers, H.J.; Kaauwen, M.P.W. van; Kramer, M.R.; Athanasiadis, I.N.
Source: In: BDCAT '17 Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies. - ACM - ISBN 9781450355490 - p. 219 - 226.
Event: Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, Austin, 2017-12-05/2017-12-08
Department(s): Information Technology; WUR PB Kwantitatieve Aspecten; WUR PB Non host en Insectenresistentie
Publication type: Peer reviewed book chapter
Publication year: 2017
Keyword(s): Big Data - bioinformatics - variant calling - Hadoop - HDFS - Apache Spark - Apache Parquet
Abstract: Big Data has been seen as a remedy for the efficient management of the ever-increasing genomic data. In this paper, we investigate the use of Apache Spark to store and process Variant Calling Files (VCF) on a Hadoop cluster. We demonstrate Tomatula, a software tool for converting VCF files to Apache Parquet storage format, and an application to query variant calling datasets. We evaluate how the wall time (i.e. time until the query answer is returned to the user) scales out on a Hadoop cluster storing VCF files, either in the original flat-file format, or using the Apache Parquet columnar storage format. Apache Parquet can compress the VCF data by around a factor of 10, and supports easier querying of VCF files as it exposes the field structure. We discuss advantages and disadvantages in terms of storage capacity and querying performance with both flat VCF files and Apache Parquet using an open plant breeding dataset. We conclude that Apache Parquet offers benefits for reducing storage size and wall time, and scales out with larger datasets.
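The abstract's point that a columnar layout "exposes the field structure" can be illustrated with a minimal sketch. This is not Tomatula's implementation and uses neither Spark nor Parquet. It is a plain-Python analogy, assuming only the eight fixed tab-separated columns of a VCF data line. The function names (vcf_to_columns, query_region) and the sample records are hypothetical and invented for illustration.

```python
# Fixed leading columns of a VCF data line (tab-separated).
VCF_FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def vcf_to_columns(lines):
    """Turn flat VCF text lines into a column-oriented dict of lists.

    Header lines (starting with '#') and blank lines are skipped.
    Only the eight fixed columns are kept; POS is parsed as an int.
    """
    columns = {name: [] for name in VCF_FIXED}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        for name, value in zip(VCF_FIXED, fields):
            columns[name].append(int(value) if name == "POS" else value)
    return columns


def query_region(columns, chrom, start, end):
    """Return (POS, REF, ALT) for variants on chrom with start <= POS <= end.

    Only the four columns the query needs are read, which is the access
    pattern a columnar format is designed to serve.
    """
    rows = zip(columns["CHROM"], columns["POS"], columns["REF"], columns["ALT"])
    return [(pos, ref, alt) for c, pos, ref, alt in rows
            if c == chrom and start <= pos <= end]


# Hypothetical sample records, invented for this sketch.
SAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "ch01\t1050\t.\tA\tG\t50\tPASS\tDP=12",
    "ch01\t2300\t.\tT\tC\t99\tPASS\tDP=30",
    "ch02\t1100\t.\tG\tA\t20\tPASS\tDP=8",
]

if __name__ == "__main__":
    cols = vcf_to_columns(SAMPLE_VCF)
    print(query_region(cols, "ch01", 1000, 2000))
```

With a flat file, every query has to split each line again before it can filter. Once the fields are stored as named columns, a region query touches only the columns it needs.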