Staff Publications


    'Staff publications' is the digital repository of Wageningen University & Research

    'Staff publications' contains references to publications authored by Wageningen University staff from 1976 onward.

    Publications authored by the staff of the Research Institutes are available from 1995 onwards.

    Full-text documents are added when available. The database is updated daily and currently holds about 240,000 items, of which 72,000 are open access.


Records 1 - 20 / 36

Query: keywords==bioinformatics
Profile hidden Markov models trained on aligned KEGG Orthology sequences for enzyme annotation
Rodenburg, Y.A. ; Ridder, D. de; Govers, F. ; Seidl, M.F. - 2019
Wageningen University & Research
annotation - bioinformatics - enzymes - hidden Markov models - HMM - homology - KEGG - Kyoto Encyclopedia of Genes and Genomes - proteins
Profile hidden Markov models trained on aligned KEGG Orthology sequences for enzyme annotation. These HMMs were used to reconstruct metabolic networks for the manuscript: The genome of Peronospora belbahrii reveals high heterozygosity, a low number of canonical effectors and CT-rich promoters
Diversity of cis-regulatory elements associated with auxin response in Arabidopsis thaliana
Cherenkov, Pavel ; Novikova, Daria ; Omelyanchuk, Nadya ; Levitsky, Victor ; Grosse, Ivo ; Weijers, Dolf ; Mironova, Victoria - 2018
Journal of Experimental Botany 69 (2018)2. - ISSN 0022-0957 - p. 329 - 339.
ARF - Auxin - AuxRE - bHLH - bioinformatics - bZIP - chromatin states - transcriptional regulation
The phytohormone auxin regulates virtually every developmental process in land plants. This regulation is mediated via de-repression of DNA-binding auxin response factors (ARFs). ARFs bind TGTC-containing auxin response cis-elements (AuxREs), but there is growing evidence that additional cis-elements occur in auxin-responsive regulatory regions. The repertoire of auxin-related cis-elements and their involvement in different modes of auxin response are not yet known. Here we analyze the enrichment of nucleotide hexamers in upstream regions of auxin-responsive genes associated with auxin up- or down-regulation, with early or late response, with ARF-binding domains, and with different chromatin states. Intriguingly, hexamers potentially bound by basic helix-loop-helix (bHLH) and basic leucine zipper (bZIP) factors as well as a family of A/T-rich hexamers are more highly enriched in auxin-responsive regions than canonical TGTC-containing AuxREs. We classify and annotate the whole spectrum of enriched hexamers and discuss their patterns of enrichment related to different modes of auxin response.
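The hexamer enrichment described in this abstract can be illustrated with a minimal sketch. The abstract does not state which enrichment statistic was used, so the pseudocount-smoothed frequency ratio below is an assumption, and the function names and toy sequences are hypothetical.

```python
from collections import Counter


def count_kmers(sequences, k=6):
    """Count every overlapping k-mer across a list of DNA sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enrichment(foreground, background, k=6, pseudocount=1.0):
    """Frequency ratio (foreground / background) for each foreground k-mer.

    Assumption: a smoothed frequency ratio stands in for whatever
    statistic the original study applied.
    """
    fg = count_kmers(foreground, k)
    bg = count_kmers(background, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    n_kmers = 4 ** k  # number of possible k-mers over A, C, G, T
    ratios = {}
    for kmer in fg:
        fg_freq = (fg[kmer] + pseudocount) / (fg_total + pseudocount * n_kmers)
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount * n_kmers)
        ratios[kmer] = fg_freq / bg_freq
    return ratios


# Hypothetical toy data: "responsive" upstream regions versus controls.
responsive = ["ATGTCTCAATGTCTCGG", "CCTGTCTCTTGTCTCA"]
control = ["ATATATATATATATATA", "GCGCGCGCATATGCGCG"]
ratios = enrichment(responsive, control)
top_hexamer = max(ratios, key=ratios.get)
```

In this toy input the hexamer TGTCTC occurs four times in the responsive set and never in the controls, so it receives the highest ratio.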
Managing Variant Calling Files the Big Data Way: Using HDFS and Apache Parquet
Boufea, Aikaterini ; Finkers, H.J. ; Kaauwen, M.P.W. van; Kramer, M.R. ; Athanasiadis, I.N. - 2017
In: BDCAT '17 Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies. - ACM - ISBN 9781450355490 - p. 219 - 226.
Big Data - bioinformatics - variant calling - Hadoop - HDFS - Apache Spark - Apache Parquet
Big Data has been seen as a remedy for the efficient management of the ever-increasing genomic data. In this paper, we investigate the use of Apache Spark to store and process Variant Calling Files (VCF) on a Hadoop cluster. We demonstrate Tomatula, a software tool for converting VCF files to Apache Parquet storage format, and an application to query variant calling datasets. We evaluate how the wall time (i.e. time until the query answer is returned to the user) scales out on a Hadoop cluster storing VCF files, either in the original flat-file format, or using the Apache Parquet columnar storage format. Apache Parquet can compress the VCF data by around a factor of 10, and supports easier querying of VCF files as it exposes the field structure. We discuss advantages and disadvantages in terms of storage capacity and querying performance with both flat VCF files and Apache Parquet using an open plant breeding dataset. We conclude that Apache Parquet offers benefits for reducing storage size and wall time, and scales out with larger datasets.
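To illustrate what "exposing the field structure" means in practice, the sketch below parses the eight fixed VCF columns into a column-oriented dictionary and filters on two of them. It does not reproduce Tomatula, Spark or Parquet; the function names and the example records are hypothetical, and only the Python standard library is used.

```python
# The eight fixed, tab-separated columns defined by the VCF format.
FIXED_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def vcf_to_columns(lines):
    """Parse VCF text lines into a dict of column name -> list of values."""
    columns = {name: [] for name in FIXED_FIELDS}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.split("\t")
        if len(fields) < len(FIXED_FIELDS):
            raise ValueError("record has fewer than 8 fixed fields")
        for name, value in zip(FIXED_FIELDS, fields):
            columns[name].append(int(value) if name == "POS" else value)
    return columns


def positions_in_region(columns, chrom, start, end):
    """Return variant positions on `chrom` within [start, end]."""
    return [
        pos
        for c, pos in zip(columns["CHROM"], columns["POS"])
        if c == chrom and start <= pos <= end
    ]


# Hypothetical example records.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "ch01\t100\t.\tA\tG\t50\tPASS\tDP=12",
    "ch01\t250\t.\tC\tT\t40\tPASS\tDP=9",
    "ch02\t75\t.\tG\tA\t60\tPASS\tDP=15",
]
columns = vcf_to_columns(example_vcf)
hits = positions_in_region(columns, "ch01", 90, 200)
```

A region query touches only the CHROM and POS columns, which is the access pattern a columnar format is designed to serve without reading the remaining fields.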
A Novel Chewing Detection System Based on PPG, Audio, and Accelerometry
Papapanagiotou, Vasileios ; Diou, Christos ; Zhou, Lingchuan ; Boer, Janet van den; Mars, Monica ; Delopoulos, Anastasios - 2017
IEEE Journal of Biomedical and Health Informatics 21 (2017)3. - ISSN 2168-2194 - p. 607 - 618.
Acoustic sensors - acoustic signal processing - bioinformatics - biomedical informatics - optical sensors - optical signal processing
In the context of dietary management, accurate monitoring of eating habits is receiving increased attention. Wearable sensors, combined with the connectivity and processing of modern smartphones, can be used to robustly extract objective and real-time measurements of human behavior. In particular, for the task of chewing detection, several approaches based on an in-ear microphone can be found in the literature, while other types of sensors have also been reported, such as strain sensors. In this paper, performed in the context of the SPLENDID project, we propose to combine an in-ear microphone with a photoplethysmography (PPG) sensor placed in the ear concha, in a new high accuracy and low sampling rate prototype chewing detection system. We propose a pipeline that initially processes each sensor signal separately, and then fuses both to perform the final detection. Features are extracted from each modality, and support vector machine (SVM) classifiers are used separately to perform snacking detection. Finally, we combine the SVM scores from both signals in a late-fusion scheme, which leads to increased eating detection accuracy. We evaluate the proposed eating monitoring system on a challenging, semifree living dataset of 14 subjects, which includes more than 60 h of audio and PPG signal recordings. Results show that fusing the audio and PPG signals significantly improves the effectiveness of eating event detection, achieving accuracy up to 0.938 and class-weighted accuracy up to 0.892.
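The late-fusion step can be sketched as follows. The abstract does not specify the fusion rule, so the weighted average of per-window SVM scores, the weight and the decision threshold are all assumptions, and the score values are invented for illustration.

```python
def late_fusion(audio_scores, ppg_scores, audio_weight=0.5, threshold=0.0):
    """Fuse two per-window classifier score lists and threshold the result.

    Assumption: scores are signed distances from each SVM's decision
    boundary, and fusion is a weighted average of the two.
    """
    if len(audio_scores) != len(ppg_scores):
        raise ValueError("both modalities must cover the same windows")
    fused = [
        audio_weight * a + (1.0 - audio_weight) * p
        for a, p in zip(audio_scores, ppg_scores)
    ]
    # True marks a window classified as chewing.
    return [score > threshold for score in fused]


# Hypothetical scores for four time windows.
audio = [1.2, -0.4, 0.3, -1.0]
ppg = [0.8, 0.9, -0.6, -0.2]
decisions = late_fusion(audio, ppg)
```

In the second window the audio classifier alone would reject the event, but the stronger PPG score tips the fused decision, which is the kind of complementarity late fusion is meant to exploit.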
The fish egg microbiome : diversity and activity against the oomycete pathogen Saprolegnia
Liu, Y. - 2016
Wageningen University. Promotor(s): Francine Govers; Jos Raaijmakers, co-promotor(s): Irene de Bruijn. - Wageningen : Wageningen University - ISBN 9789462577671 - 169
salmon - fish eggs - marine microorganisms - microbial diversity - bioinformatics - genomics - saprolegnia - oomycota - fish diseases - suppression - fungal antagonists

Y. Liu

Prof. dr. F. Govers (promotor); Prof. dr. J.M. Raaijmakers (promotor); Dr. I. de Bruijn (co-promotor); Wageningen University, 13 June 2016, 170 pp.

Emerging oomycete pathogens increasingly threaten biodiversity and food security. This thesis describes the study of the microbiome of Atlantic salmon (Salmo salar L.) eggs and analyses of the effects of infections by the oomycete pathogen Saprolegnia on the microbial architecture. A low incidence of Saprolegniosis was correlated with a relatively high abundance and richness of specific commensal Actinobacteria. Among the bacterial community, the isolates Frondihabitans sp. 762G35 (Microbacteriaceae) and Pseudomonas sp. H6 significantly inhibited hyphal attachment of Saprolegnia diclina to live salmon eggs. Chemical profiling showed that these two isolates produce furancarboxylic acid-derived metabolites and a lipopeptide viscosin-like biosurfactant, respectively, which inhibited hyphal growth of S. diclina in vitro. Among the fungal community, the fungal isolates obtained from salmon eggs were closely related to Microdochium lycopodinum/Microdochium phragmitis and Trichoderma viride. Both a quantitative and qualitative difference in the Trichoderma population between Saprolegnia-infected and healthy salmon eggs was observed, which suggested that mycoparasitic Trichoderma species could play a role in Saprolegnia suppression in aquaculture. This research provides a scientific framework for studying the diversity and dynamics of microbial communities to mitigate emerging diseases. The Frondihabitans, Pseudomonas and Trichoderma isolates, and/or their bioactive metabolites, are proposed as effective candidates to control Saprolegniosis.

Identification and functional characterization of putative (a)virulence factors in the fungal wheat pathogen Zymoseptoria tritici
Mirzadi Gohari, A. - 2015
Wageningen University. Promotor(s): Pierre de Wit, co-promotor(s): Gert Kema; Rahim Mehrabi. - Wageningen : Wageningen University - ISBN 9789462575912 - 159
triticum aestivum - wheat - plant pathogenic fungi - mycosphaerella graminicola - virulence factors - genetic analysis - pathogenesis - bioinformatics

Zymoseptoria tritici (Desm.) Quaedvlieg & Crous (previously known as Mycosphaerella graminicola) is the causal agent of septoria tritici blotch (STB), which is a devastating foliar wheat disease worldwide. It is responsible for significant yield losses occurring annually in all major wheat-growing areas and threatens global food security. Z. tritici is a hemi-biotrophic fungal pathogen that, after stomatal penetration, establishes a stealthy biotrophic and symptomless relation with its host plant that is followed by a sudden switch to a necrotrophic growth phase coinciding with chlorosis that eventually develops into large necrotic blotches containing many pycnidia producing asexual splash-borne conidia. Under natural conditions, once competent mating partners are present and conditions are conducive, pseudothecia are formed producing airborne ascospores. Disease management of STB is primarily achieved through fungicide applications and growing commercial cultivars carrying Stb resistance genes. However, the efficacy of both strategies is limited as strains resistant to fungicides frequently develop and progressively dominate natural populations, which hampers disease management; also the deployed Stb genes are often overcome by existing or newly developed isolates of the fungus. Hence, there is a need for discovery research to better understand the molecular basis of the host-pathogen interaction that enables breeders to identify and deploy new Stb genes, which will eventually contribute to more sustainable disease control.

Chapter 1 introduces the subject of the thesis and describes various aspects of the lifestyle of Z. tritici with emphasis on dissecting the various stages and physiological processes during pathogenesis on wheat. In addition, it includes a short summary and discussion of the current understanding of the role of (a)virulence factors in the Z. tritici–wheat pathosystem.

Chapter 2 describes new gateway technology-driven molecular tools comprising 22 entry constructs facilitating rapid construction of binary vectors for functional analyses of fungal genes. The entry vectors for single, double or triple gene deletion mutants were developed using hygromycin, geneticin and nourseothricin resistance genes as selection markers. Furthermore, these entry vectors contain the genes encoding green fluorescent (GFP) or red fluorescent (RFP) protein in combination with the three selection markers, which enables simultaneous tagging of gene deletion mutants for microscopic analyses. The functionality of these entry vectors was validated in Z. tritici and described in Chapters 3, 4 and 5.

Chapter 3 describes the functional characterization of ZtWor1, the orthologue of Wor1 in the fungal human pathogen Candida albicans. ZtWor1 is up-regulated during initiation of colonization and fructification, and regulates expression of candidate effector genes, including one that was discovered after comparative proteome analysis of Z. tritici wild-type and ΔZtWor1 strains. Cell fusion and anastomosis occurred frequently in ΔZtWor1 strains, which is reminiscent of mutants of MgGpb1, the β-subunit of the heterotrimeric G protein. Comparative expression profiling of ΔZtWor1, ΔMgGpb1 and ΔMgTpk2 (the catalytic subunit of protein kinase A) strains, suggests that ZtWor1 is downstream of the cyclic adenosine monophosphate (cAMP) pathway that is crucial for pathogenicity of many fungal plant pathogens.

Chapter 4 describes combined bioinformatics and expression profiling studies during pathogenesis in order to discover candidate effectors of Z. tritici important for virulence. In addition, a genetic approach was followed to map quantitative trait loci (QTLs) in Z. tritici carrying putative effectors. Functional analysis of two top effector candidates, small-secreted proteins SSP15 and SSP18, which were selected based on their expression profile in planta, showed that they are dispensable for virulence of Z. tritici. These analyses suggest that generally adopted criteria for effector discovery, such as protein size, number of cysteine residues and up-regulated expression during pathogenesis, should be taken with caution and cannot be applied to every pathosystem, as they likely represent only a subset of effector genes.

Chapter 5 describes the functional characterization of ZtCpx1 and ZtCpx2 encoding a secreted and a cytoplasmic catalase-peroxidase (CP) in Z. tritici, respectively. Gene replacement of ZtCpx1 resulted in mutant strains that were sensitive to exogenously added H2O2, and in planta phenotyping showed they are significantly less virulent than the wild-type. All mutant phenotypes could be restored to wild-type by complementation with the wild-type allele of ZtCpx1 driven by its native promoter. Additionally, functional analysis of ZtCpx2 confirmed that this gene encodes a secreted CP but is dispensable for virulence of Z. tritici on wheat. Nevertheless, we showed that both genes act synergistically, as the generated double knock-out strain showed a significantly stronger reduction in virulence than the individual single knock-out strains. Hence, both genes are required by Z. tritici for successful infection and colonization of wheat.

In Chapter 6 I discuss and summarize the genetic approaches used in this study, reflect on the major findings and bottlenecks encountered, and propose new strategies to identify effectors of Z. tritici in the future.

Lipoxygenase : a game-changing enzyme
Heshof, R. - 2015
Wageningen University. Promotor(s): Willem de Vos; Vitor Martins dos Santos, co-promotor(s): Leo de Graaff. - Wageningen : Wageningen University - ISBN 9789462571761 - 160
lipoxygenase - amino acid sequences - lipids - fungi - bioinformatics - potassium iodide - industrial applications - biobased economy

Many challenges lie ahead in using lipoxygenases (LOXs) as tools in industrial oleochemistry. One of these challenges is the supply of polyunsaturated fatty acids (PUFAs). Although we are moving towards a biobased economy where second and third generation biomass is taking a leading role, it is still faster and cheaper to use first generation biomass. Industrialization of microbial oils is a good alternative to meet the demand for PUFAs. Another challenge is the production of heterologous LOX in sufficient quantities. Over the last decade this problem has been tackled, and more research is being done on heterologous expression of LOXs. The LOX with the highest potential so far is the secreted Pseudomonas aeruginosa LOX produced in Escherichia coli. During this thesis research different lox genes were tried for heterologous production of LOX using different Aspergillus niger and Aspergillus nidulans strains as expression hosts. These LOXs were identified as discussed in Chapter 3 and Chapter 6. Unfortunately, heterologous production in sufficient quantities was unsuccessful using these expression hosts as discussed in Chapter 5 and Chapter 6. Since production of Gaeumannomyces graminis LOX was successful in Trichoderma reesei, as discussed in Chapter 4, the production of polymers used for bioplastics could be demonstrated in this ERA-NOEL project anyway. Therefore this thesis shifted its focus to resolving the question of the difficulties in the heterologous expression of LOX in different Aspergillus species. Chapter 5 is the result of a systematic approach to analyze different aspects of G. graminis LOX expression in A. nidulans. Chapter 2 shows that heterologous expression of extracellular fungal LOX can be performed using T. reesei and Pichia pastoris as production hosts, and E. coli can be used for the production of intracellular LOXs of plant, mammal, bacterial, and fungal origin. As shown in Chapter 2, E. coli is not very efficient in the production of heterologous LOX due to the formation of inclusion bodies and the low induction temperature necessary for production. The use of Aspergillus oryzae can be exploited further in the heterologous production of LOXs. Because A. niger and A. nidulans were chosen as expression hosts, A. oryzae was not exploited for its potential. The last challenge is to synthetically engineer LOX to broaden its use in industry. In this way more building blocks for chemicals can be synthetically produced and more products based on LOX origin can be made. Therefore, LOX can be a world-wide game-changing enzyme in a biobased economy as its use can decrease the demand for petroleum-based products.

Dick de Ridder on the biologist who is increasingly becoming a data scientist (original title: Dick de Ridder over de bioloog die steeds meer een datawetenschapper wordt)
Ridder, D. de - 2015
Wageningen UR
bioinformatics - data processing - informatics - occupations - molecular biology
Biology is rapidly taking on the character of a data science. Billions of data points on genomes, genes, proteins and other molecules are brought together in large files and systematically investigated. This should lead to more fundamental knowledge and understanding of living organisms, of which crops and livestock form the basis of the food supply for the world population. So says Dick de Ridder in his inaugural address on accepting the post of Professor of Bioinformatics at Wageningen University on 30 April.
Filling the gap between sequence and function: a bioinformatics approach
Bargsten, J.W. - 2014
Wageningen University. Promotor(s): Richard Visser, co-promotor(s): Jan-Peter (Jp) Nap. - Wageningen : Wageningen University - ISBN 9789462570764 - 170
bioinformatics - plants - genomics - nucleotide sequences - functional genomics - comparative genomics - comparative mapping - genomes - genetic mapping - plant breeding - methodology

The research presented in this thesis focuses on deriving function from sequence information, with the emphasis on plant sequence data. Unravelling the impact of genomic elements, in most cases genes, on the phenotype of an organism is a major challenge in biological research and modern plant breeding. An important part of this challenge is the (functional) annotation of such genomic elements. Currently, wet lab experiments may provide high-quality annotation, but they are laborious and costly. With the advent of next generation sequencing platforms, vast amounts of sequence data are generated. These data are used in connection with the available experimental data to derive function from a bioinformatics perspective.

The connection between sequence information and function was approached on the level of chromosome structure (chapter 2) and of gene families (chapter 3) using combinations of existing bioinformatics tools. The applicability of using interaction networks for function prediction was demonstrated by first markedly improving an existing method (chapter 4) and by exploring the role of network topology in function prediction (chapter 5). Taken together, the combination of methods and results presented indicate the potential as well as the current state-of-the-art of function prediction in (plant) bioinformatics.

Chapter 1 introduces the basis for the approaches used and developed in this thesis. This includes the concepts of genome annotation, comparative genomics, gene function prediction and the analysis of network topology for gene function prediction. A requirement for the study of any new organism is the sequencing and annotation of its genome. Current genome annotation is divided into structural identification and functional categorization of genomic elements. The de facto standard for categorizing functional annotation is provided by the Gene Ontology. The Gene Ontology is divided into three domains, molecular function, biological process and cellular component. Approaches to predict molecular function and biological process are outlined. Accurate function prediction generally relies on existing input data, often of experimental origin, that can be transferred to unannotated genomic elements. Plants often lack such input data, which poses a big challenge for current function prediction algorithms. In unravelling the function of genomic elements, comparative genomics is an important approach. Via the comparison of multiple genomes it gives insights into evolution, function as well as genomic structure and variation. Comparative genomics has become an essential toolkit for the analysis of newly sequenced organisms. Often bioinformatics methods need to be adapted to the specific needs of plant genome research. With a focus on the commercially important crop plants tomato and potato, specific requirements of plant bioinformatics, such as the high amount of repetitive elements and the lack of experimental data, are outlined.

In chapter 2, the structural homology of the long arm of chromosome 2 (2L) of tomato, potato and pepper is analyzed. Molecular organization and collinear junctions are delineated using multi-color BAC FISH analysis and comparative sequence alignment. We identify several large-scale rearrangements including inversions and segmental translocations that were not reported in previous comparative studies. Some of the structural rearrangements are specific for the tomato clade, and differentiate tomato from potato, pepper and other solanaceous species. There are many small-scale synteny perturbations, but local gene vicinity is largely preserved. The data suggest that long distance intra-chromosomal rearrangements and local gene rearrangements have evolved frequently during speciation in the Solanum genus, and that small changes are more prevalent than large-scale differences. The occurrence of transposable elements and other repeats near or at junction breaks may indicate repeat-mediated rearrangements. The ancestral 2L topology is reconstructed and the evolutionary events leading to the current topology are discussed.

In chapter 3, we analyze the Snf2 gene family. As part of large protein complexes, Snf2 family ATPases are responsible for energy supply during chromatin remodeling, but the precise mechanism of action of many of these proteins is largely unknown. They influence many processes in plants, such as the response to environmental stress. The analysis is the first comprehensive study of Snf2 family ATPases in plants. Some subfamilies of the Snf2 gene family are remarkably stable in number of genes per genome, whereas others show expansion and contraction in several plants. One of these subfamilies, the plant-specific DRD1 subfamily, is non-existent in lower eukaryote genomes, yet it developed into the largest Snf2 subfamily in plant genomes. It shows the occurrence of a complex series of evolutionary events. Its expansion, notably in tomato, suggests novel functionality in processes connected to chromatin remodeling. The results underpin and extend the Snf2 subfamily classification, which could help to determine the various functional roles of Snf2 ATPases and to target environmental stress tolerance and yield in future breeding with these genes.

In chapter 4, a new approach to improve the prediction of protein function in terms of biological processes is developed that is particularly attractive for sparsely annotated plant genomes. The combination of the network-based prediction method Bayesian Markov Random Field (BMRF) with the sequence-based prediction method Argot2 shows significantly improved performance compared to each of the methods separately, as well as compared to Blast2GO. The approach was applied to predict biological processes for the proteomes of rice, barrel clover, poplar, soybean and tomato. Analysis of the relationships between sequence similarity and predicted function similarity identifies numerous cases of divergence of biological processes in which proteins are involved, in spite of sequence similarity. Examples of potential divergence are identified for various biological processes, notably for processes related to cell development, regulation, and response to chemical stimulus. Such divergence in biological process annotation for proteins with similar sequences should be taken into account when analyzing plant gene and genome evolution. This way, the integration of network-based and sequence-based function prediction will strengthen the analysis of evolutionary relationships of plant genomes.

In chapter 5 the influence of network topology on network-based function prediction algorithms is investigated. The analysis of biological networks using algorithms such as Bayesian Markov Random Field (BMRF) is a valuable predictor of the biological processes that proteins are involved in. The topological properties and constraints that determine prediction performance in such networks are however largely unknown. This chapter presents analyses based on network centrality measures, such as node degree, to evaluate the performance of BMRF upon progressive removal of highly connected hub nodes (pruning). Three different protein-protein interaction networks with data from Arabidopsis, human and yeast were analyzed. All three show that the average prediction performance can improve significantly. The chapter paves the way for further improvement of network-based function prediction methods based on node pruning.
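The pruning step described for chapter 5 can be sketched as removing the highest-degree nodes from an interaction network before running a predictor. The BMRF method itself is not reproduced here; the function name, the edge-list representation and the toy network are hypothetical.

```python
def prune_hubs(edges, n_remove):
    """Remove the `n_remove` highest-degree nodes from an undirected graph.

    `edges` is a list of (node, node) pairs. Returns the set of removed
    hub nodes and the list of edges that survive the pruning.
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Rank by descending degree; break ties by node name for determinism.
    ranked = sorted(degree, key=lambda node: (-degree[node], node))
    hubs = set(ranked[:n_remove])
    kept = [(u, v) for u, v in edges if u not in hubs and v not in hubs]
    return hubs, kept


# Hypothetical toy network: one hub protein plus a single peripheral edge.
network = [
    ("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"),
    ("a", "b"),
]
hubs, pruned = prune_hubs(network, n_remove=1)
```

Repeating the call with increasing `n_remove` gives the progressive pruning series on which a prediction method could then be evaluated.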

Chapter 6 discusses the results and methods developed in this thesis in the context of the vast amount of generated sequencing data. Sequencing or re-sequencing a (plant) genome has become fairly straightforward and affordable, but the interpretation for subsequent use of this sequence data is far from trivial. The topics addressed in this thesis, annotation of function, analysis of genome structure and identifying genomic variation, focus on this main bottleneck of biological research. Issues discussed in connection with this work and its future are data accuracy, error propagation, possible improvements and future implications for biological research in crop plants. In particular the shift of costs from sequencing to downstream analyses, with functional genome annotation as essential step, is covered. One of the biggest challenges biology and bioinformatics will face is the integration of results from such downstream analyses and other sources into a complete picture. Only this will allow understanding of complex biological systems.

Elicitin-triggered apoplastic immunity against late blight in potato
Du, J. - 2014
Wageningen University. Promotor(s): Richard Visser; Evert Jacobsen, co-promotor(s): Vivianne Vleeshouwers. - Wageningen : Wageningen University - ISBN 9789462570092 - 140
solanum tuberosum - potatoes - plant pathogenic fungi - phytophthora infestans - disease resistance - genes - fungal protein - genetic markers - bioinformatics - plant breeding
SNPs & indels Schizophyllum commune
Nieuwenhuis, B.P.S. ; Aanen, D.K. - 2013
bioinformatics - genetics - ecology - evolutionary biology
This description accompanies four files containing SNPs and indels found in two sets of isolates of Schizophyllum commune. This dataset was created for and used in Nieuwenhuis, Nieuwhof and Aanen (2013) On the asymmetry of mating in natural populations of the mushroom fungus Schizophyllum commune. Fungal genetics and biology.
Applications in computer-assisted biology
Nijveen, H. - 2013
Wageningen University. Promotor(s): Ton Bisseling, co-promotor(s): P.E. van der Vet. - Wageningen : Wageningen UR - ISBN 9789461737816 - 106
bioinformatics - molecular biology - computers - databases - prokaryotes - computer analysis - information technology

Biology is becoming a data-rich science driven by the development of high-throughput technologies like next-generation DNA sequencing. This is fundamentally changing biological research. The genome sequences of many species are becoming available, as well as the genetic variation within a species, and the activity of the genes in a genome under various conditions. With the opportunities that these new technologies offer, comes the challenge to effectively deal with the large volumes of data that they produce. Bioinformaticians have an important role to play in organising and analysing this data to extract biological information and gain knowledge. Also for experimental biologists computers have become essential tools. This has created a strong need for software applications aimed at biological research. The chapters in this thesis detail my contributions to this area. Together with molecular biologists, plant breeders, immunologists, and microbiologists, I have developed several software tools and performed computational analyses to study biological questions.

Chapter 2 is about Primer3Plus, a web tool that helps biologists to design DNA primers for their experiments. These primers are typically short stretches of DNA (~20 nucleotides) that direct the DNA replication machinery to copy a selected region of a DNA molecule. The specificity of a primer is determined by several chemical and physical properties and therefore designing good primers is best done with the help of a computer program. Primer3Plus offers a user-friendly task-oriented web interface to the popular primer3 primer design program. Primer3Plus clearly fulfils a need in the biological research community as already over 400 scientific articles have cited the Primer3Plus publication.
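Two of the properties commonly checked when judging a primer of about 20 nucleotides are GC content and melting temperature. The sketch below uses the simple Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius, which is only a rough estimate for short oligonucleotides. It is not how primer3 or Primer3Plus compute melting temperature; the function names and the example primer are hypothetical.

```python
def gc_content(primer):
    """Fraction of G and C bases in a primer sequence."""
    primer = primer.upper()
    if not primer:
        raise ValueError("primer must not be empty")
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature (deg C) by the Wallace rule.

    Tm = 2 * (A + T) + 4 * (G + C); a rule of thumb for short oligos,
    used here only as an illustration.
    """
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


# Hypothetical 20-nucleotide primer.
primer = "ATGCATGCATGCATGCATGC"
gc = gc_content(primer)
tm = wallace_tm(primer)
```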

Single nucleotide differences or polymorphisms (SNPs) that are present within a species can be used as markers to link phenotypic observations to locations on the genome. Chapter 3 discusses QualitySNPng, which is a stand-alone software tool for finding SNPs in high-throughput sequencing data. QualitySNPng was inspired by the QualitySNP pipeline for SNP detection that was published in 2006 and it uses similar filtering criteria to distinguish SNPs from technical artefacts like sequence read errors. In addition, the SNPs are used to predict haplotypes. QualitySNPng has a graphical user interface that allows the user to run the SNP detection and evaluate the results. It has already been successfully used in several projects on marker detection for plant breeding.
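The idea of filtering candidate SNPs to separate true variation from sequence read errors can be sketched with a simple column-wise count over aligned reads. The actual QualitySNPng criteria are not given in this summary, so the minimum depth and minimum minor-allele count below are assumed thresholds, and the reads are invented.

```python
from collections import Counter


def call_snps(aligned_reads, min_depth=4, min_minor_count=2):
    """Report positions where a second allele is seen often enough.

    `aligned_reads` are equal-length strings; any character other than
    A, C, G or T (for example '-') counts as no coverage. Returns a list
    of (position, major_allele, minor_allele) tuples.
    """
    if not aligned_reads:
        return []
    snps = []
    for pos in range(len(aligned_reads[0])):
        bases = Counter(
            read[pos] for read in aligned_reads if read[pos] in "ACGT"
        )
        if sum(bases.values()) < min_depth:
            continue  # too little coverage to judge this position
        common = bases.most_common(2)
        # A lone mismatch is treated as a likely read error, not a SNP.
        if len(common) == 2 and common[1][1] >= min_minor_count:
            snps.append((pos, common[0][0], common[1][0]))
    return snps


# Hypothetical reads: position 2 varies in two reads, position 5 in one.
reads = [
    "ACGTAC",
    "ACGTAC",
    "ACGTAC",
    "ACTTAC",
    "ACTTAC",
    "ACGTAA",
]
snps = call_snps(reads)
```

Position 2 is reported because two reads support the alternative base, while the single mismatch at position 5 falls below the assumed threshold and is discarded.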

Single nucleotide polymorphisms can lead to single amino acid changes in protein sequences. These single amino acid polymorphisms (SAPs) play a key role in graft-versus-host (GVH) effects that often accompany tissue transplantations. A beneficial variant of GVH is the graft-versus-leukaemia (GVL) effect that is sometimes witnessed after bone marrow transplantation in leukaemia patients. When the GVL effect occurs, the donor’s immune cells actively destroy residual tumour cells in the patient. The GVL effect can already be elicited by a single amino acid difference between the patient and the donor. Currently, a small number of SAPs that can elicit a GVL effect are known and these are used to select the right bone marrow donor for a leukaemia patient. Together with researchers at the Leiden University Medical Center I developed a database to aid in the discovery of more such SAPs. We called this database the “Human Short Peptide Variation database” or HSPVdb. It is described in chapter 4.

The work described in chapter 5 is focused on the regions in bacterial genomes that are involved in gene regulation, the promoters. Intrigued by anecdotal evidence that duplication of bacterial promoters can activate or silence genes, we investigated how often promoter duplication occurs in bacterial genomes. Using the large number of bacterial genomes that are currently available, we looked for clusters of highly similar promoter regions. Since duplication assumes some sort of mobility, we termed the duplicated promoters: putative mobile promoters or PMPs. We found over 4,000 clusters of PMPs in 1,043 genomes. Most of the clusters consist of two members, indicating a single duplication event, but we also found much larger clusters of PMPs within some genomes. A number of PMPs are present in multiple species, even in very distantly related bacterial species, suggesting perhaps that these were subjected to horizontal gene transfer. The mobile promoters could play an important role in the rapid rewiring of gene regulatory networks.

Chapter 6 discusses how current biological research can adapt to make full use of the opportunities offered by the high-throughput technologies by following three different approaches. The first approach empowers the biologist with user-friendly software that allows them to analyse the large volumes of genome-scale data without requiring expert computer skills. In the second approach the biologist teams up with a bioinformatician to combine in-depth biological knowledge with expert computational skills. The third approach combines the biologist and the bioinformatician in one person by teaching the biologist computational skills. Each of these three approaches has its merits and shortcomings, so I do not expect any of them to become dominant in the near future. Looking further ahead, it seems inevitable that any biologist will have to learn at least the basics of computational methods and that this should be an integral part of biology education. Bioinformatics might in time cease to exist as a separate field and instead become an intrinsic aspect of most biological research disciplines.

The 3rd DBCLS BioHackathon: improving life science data integration with Semantic Web technologies
Katayama, T. ; Wilkinson, D.W. ; Micklem, G. ; Kawashima, S. ; Yamaguchi, A. ; Nakao, M. ; Yamamoto, T. ; Okamoto, S. ; Oouchida, K. ; Chung, H. ; Aerts, J. ; Afzal, H. ; Antezana, E. ; Arakawa, K. ; Aranda, B. ; Belleau, F. ; Bolleman, J. ; Bonnal, R.J.P. ; Chapman, B. ; Cock, P.J.A. ; Eriksson, T. ; Gordon, P.M.K. ; Goto, N. ; Hayashida, K. ; Horn, H. ; Ishiwata, R. ; Kaminuma, E. ; Kasprzyk, A. ; Kawaji, H. ; Kido, N. ; Kim, Y. ; Kinjo, A.R. ; Konishi, F. ; Kwon, K.H. ; Labarga, A. ; Lamprecht, A. ; Lin, Y. ; Lindenbaum, P. ; McCarthy, L. ; Morita, H. ; Murakami, K. ; Nagao, K. ; Nishida, K. ; Nishimura, K. ; Nishizawa, T. ; Ogishima, S. ; Ono, K. ; Oshita, K. ; Park, K. ; Prins, J.C.P. ; Saito, T. ; Samwald, M. ; Satagopam, V.P. ; Shigemoto, Y. ; Smith, R. ; Splendiani, A. ; Sugawara, H. ; Taylor, J. ; Vos, R.A. ; Withers, D. ; Yamasaki, C. ; Zmasek, C.M. ; Kawamoto, S. ; Okubo, K. ; Asai, K. ; Takagi, T. - \ 2013
Journal of Biomedical Semantics 4 (2013). - ISSN 2041-1480
protein-interaction database - systems biology - ontology - bioinformatics - tool - representation - services - language - framework - networks
Background: BioHackathon 2010 was the third in a series of meetings hosted by the Database Center for Life Sciences (DBCLS) in Tokyo, Japan. The overall goal of the BioHackathon series is to improve the quality and accessibility of life science research data on the Web by bringing together representatives from public databases, analytical tool providers, and cyber-infrastructure researchers to jointly tackle important challenges in the area of in silico biological research. Results: The theme of BioHackathon 2010 was the 'Semantic Web', and all attendees gathered with the shared goal of producing Semantic Web data from their respective resources, and/or consuming or interacting with those data using their tools and interfaces. We discussed topics including guidelines for designing semantic data and interoperability of resources. We consequently developed tools and clients for analysis and visualization. Conclusion: We provide a meeting report from BioHackathon 2010, in which we describe the discussions, decisions, and breakthroughs made as we moved towards compliance with Semantic Web technologies - from source provider, through middleware, to the end-consumer.
Bioinformatics assisted breeding, from QTL to candidate genes
Chibon, P.Y. - \ 2013
Wageningen University. Promotor(en): Richard Visser, co-promotor(en): Richard Finkers. - S.l. : s.n. - ISBN 9789461737366 - 149
plantenveredeling - bio-informatica - moleculaire veredeling - marker assisted breeding - loci voor kwantitatief kenmerk - genetische kartering - gegevensverwerking - ontologieën - plant breeding - bioinformatics - molecular breeding - marker assisted breeding - quantitative trait loci - genetic mapping - data processing - ontologies

Over the last decade, the amount of data generated by a single run of an NGS sequencer has come to exceed that of days of work with Sanger sequencing. Metabolomics, proteomics and transcriptomics technologies have likewise evolved to produce more and more information at an ever faster rate. In addition, the number of databases available to biologists and breeders is increasing every year. The challenge for them is two-fold: to cope with the increased amount of data produced by these new technologies and to cope with the distribution of the information across the Web. An example of a study with a lot of ~omics data is described in Chapter 2, where more than 600 peaks were measured using liquid chromatography mass-spectrometry (LCMS) in peel and flesh of a segregating F1 apple population. In total, 669 mQTL were identified in this study. The number of mQTL identified is vast and almost overwhelming. Extracting meaningful information from such an experiment requires appropriate data filtering and data visualization techniques. The visualization of the distribution of the mQTL on the genetic map led to the discovery of QTL hotspots on linkage groups 1, 8, 13 and 16. The mQTL hotspot on linkage group 16 was further investigated and mainly contained compounds involved in the phenylpropanoid pathway. The apple genome sequence and its annotation were used to gain insight into genes potentially regulating this QTL hotspot. This led to the identification of the structural gene leucoanthocyanidin reductase (LAR1) as well as seven genes encoding transcription factors as putative candidates regulating the phenylpropanoid pathway, and thus candidates for the biosynthesis of health-beneficial compounds. However, this study also revealed bottlenecks in the availability of biologist-friendly tools to visualize large-scale QTL mapping results and of smart ways to mine genes underlying QTL intervals.

In this thesis, we provide bioinformatics solutions to allow exploration of regions of interest on the genome more efficiently. In Chapter 3, we describe MQ2, a tool to visualize results of large-scale QTL mapping experiments. It allows biologists and breeders to use their favorite QTL mapping tool, such as MapQTL or R/qtl, and to visualize with MQ2 the distribution of these QTL across the genetic map used in the analysis. MQ2 provides the distribution of the QTL over the markers of the genetic map for a few hundred traits. MQ2 is accessible online via its web interface but can also be used locally via its command line interface. In Chapter 4, we describe Marker2sequence (M2S), a tool to filter out genes of interest from all the genes underlying a QTL. M2S returns the list of genes for a specific genome interval and provides a search function to filter out genes related to the provided keyword(s) by their annotation. Genome annotations often contain cross-references to resources such as the Gene Ontology (GO), or proteins of the UniProt database. Via these annotations, additional information can be gathered about each gene. By integrating information from different resources and offering a way to mine the list of genes present in a QTL interval, M2S provides a way to reduce a list of hundreds of genes to possibly tens or fewer genes potentially related to the trait of interest. Using semantic web technologies, M2S integrates multiple resources and has the flexibility to extend this integration to more resources as they become available to these technologies.
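
The interval-plus-keyword filtering described for M2S can be sketched in a few lines. The `Gene` record, its field names and the example data are hypothetical; the actual tool relies on semantic web technologies and external resources rather than an in-memory list.

```python
from dataclasses import dataclass

# Hypothetical data model for illustration; not the M2S implementation.

@dataclass
class Gene:
    gene_id: str
    chromosome: str
    start: int
    end: int
    annotation: str


def genes_in_interval(genes, chromosome, start, end, keyword=None):
    """Genes overlapping [start, end] on a chromosome, optionally
    restricted to those whose annotation mentions the keyword."""
    hits = [
        g for g in genes
        if g.chromosome == chromosome and g.start <= end and g.end >= start
    ]
    if keyword:
        hits = [g for g in hits if keyword.lower() in g.annotation.lower()]
    return hits


if __name__ == "__main__":
    catalogue = [
        Gene("g1", "chr16", 100, 900, "leucoanthocyanidin reductase"),
        Gene("g2", "chr16", 1200, 1800, "MYB transcription factor"),
        Gene("g3", "chr16", 5000, 5600, "unknown protein"),
        Gene("g4", "chr01", 150, 700, "transcription factor bHLH"),
    ]
    for gene in genes_in_interval(catalogue, "chr16", 0, 2000, "transcription"):
        print(gene.gene_id, gene.annotation)
```

The keyword step is what reduces a long interval gene list to a handful of candidates; richer filters would follow cross-references instead of matching annotation text.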

Besides the importance of efficient bioinformatics tools to analyze and visualize data, the work in Chapter 2 also revealed the importance of regulatory elements controlling key genes of pathways. The limitation of M2S is that it only considers genes within the interval. In genome annotations, transcription factors are not linked to the trait (keyword) or to the genes they control, and these relationships will therefore not be considered. By integrating information about the gene regulatory network of the organism into Marker2sequence, the tool should be able to include in its gene list genes that lie outside of the QTL interval but are regulated by elements present within it. In tomato, the genome annotation already lists a number of transcription factors; however, it does not provide any information about their targets. In Chapter 5, we describe how we combined transcriptomics information with six genotypes from an Introgression Line (IL) population to find genes that are differentially expressed while being in a similar genomic background (i.e. outside of any introgression segments) as the reference genotype (with no introgression). These genes may be differentially expressed as a result of a regulatory element present in an introgression. The promoter regions of these genes have been analyzed for DNA motifs, and putative transcription factor binding sites have been found.

The approaches taken in M2S (Chapter 4) are focused on a specific region of the genome, namely the QTL interval. In Chapter 6, we generalized this approach to develop Annotex. Annotex provides a simple way to browse the cross-references existing between biological databases (ChEBI, Rhea, UniProt, GO) and genome annotations. The main concept of Annotex is that, from any type of data present in the databases, one can navigate the cross-references to retrieve the desired type of information.

This thesis has resulted in the production of three tools that biologists and breeders can use to speed up their research and build new hypotheses on. This thesis also revealed the state of bioinformatics with regard to data integration. It also reveals the need to integrate more ontologies than just the currently used Gene Ontology (GO) into annotations (for example, genome annotations, protein annotations, and pathway annotations). Multiple platforms are arising to build these new ontologies, but the process of integrating them into existing resources remains to be done. It also confirms the state of the data in plants, where multiple resources may contain overlapping information. Finally, this thesis also shows what can be achieved when the data is made interoperable, which should be an incentive to the community to work together and build interoperable, non-overlapping resources, creating a bioinformatics Web for plant research.

Vergelijkende genoomanalyse geeft inzicht in de evolutie en biologie van pathogene oömyceten
Seidl, M.F. ; Govers, F. - \ 2013
Gewasbescherming 44 (2013)4. - ISSN 0166-6495 - p. 109 - 112.
genomica - oömyceten - pathogenen - biologie - evolutie - moleculaire genetica - plantenziekteverwekkende schimmels - bio-informatica - plantenziekten - phytophthora infestans - genomics - oomycetes - pathogens - biology - evolution - molecular genetics - plant pathogenic fungi - bioinformatics - plant diseases
Although oomycetes entered the genomics era only recently, the new '-omics' techniques have already produced an abundance of quantitative data. Comparative and integrative genomics is crucial for unlocking this treasure chest of data. The thesis 'Exploring Evolution and Biology of Oomycetes: Integrative and Comparative Genomics' successfully took the first steps in using these data to further unravel the evolution and biology of oomycetes, and this has already led to valuable new insights.
Prioritization of candidate genes for cattle reproductive traits, based on protein-protein interactions, gene expression, and text-mining
Hulsegge, B. ; Woelders, H. ; Smits, M.A. ; Schokker, D. ; Jiang, L. ; Sorensen, P. - \ 2013
Physiological genomics 45 (2013)10. - ISSN 1094-8341 - p. 400 - 406.
dairy-cows - quantitative measure - interaction networks - estrous behavior - disease genes - identification - patterns - amygdala - brain - bioinformatics
Reproduction is of significant economic importance in dairy cattle. Improved understanding of mechanisms that control estrous behavior and other reproduction traits could help in developing strategies to improve and/or monitor these traits. The objective of this study was to predict and rank genes and processes in brain areas and pituitary involved in reproductive traits in cattle using information derived from three different data sources: gene expression, protein-protein interactions, and literature. We identified 59, 89, 53, 23, and 71 genes in bovine amygdala, dorsal hypothalamus, hippocampus, pituitary, and ventral hypothalamus, respectively, potentially involved in processes underlying estrus and estrous behavior. Functional annotation of the candidate genes points to a number of tissue-specific processes of which the “neurotransmitter/ion channel/synapse” process in the amygdala, “steroid hormone receptor activity/ion binding” in the pituitary, “extracellular region” in the ventral hypothalamus, and “positive regulation of transcription/metabolic process” in the dorsal hypothalamus are most prominent. The regulation of the functional processes in the various tissues operates at different biological levels, including transcriptional, posttranscriptional, extracellular, and intercellular signaling levels.
From existing data to novel hypotheses : design and application of structure-based Molecular Class Specific Information Systems
Kuipers, R.K.P. - \ 2012
Wageningen University. Promotor(en): Vitor Martins dos Santos; G. Vriend, co-promotor(en): Peter Schaap. - S.l. : s.n. - ISBN 9789461733504 - 231
systeembiologie - bio-informatica - genomica - informatiesystemen - computerwetenschappen - databanken - datamining - eiwitten - eiwitexpressieanalyse - systems biology - bioinformatics - genomics - information systems - computer sciences - databases - data mining - proteins - proteomics

As the active component of many biological systems, proteins are of great interest to life scientists. Proteins are used in a large number of different applications such as the production of precursors and compounds, for bioremediation, as drug targets, to diagnose patients suffering from genetic disorders, etc. Many research projects have therefore focused on the characterization of proteins and on improving the understanding of the functional and mechanistic properties of proteins. Studies have examined folding mechanisms, reaction mechanisms, stability under stress, effects of mutations, etc. All these research projects have resulted in an enormous amount of available data in lots of different formats that are difficult to retrieve, combine, and use efficiently.

The main topic of this thesis is the 3DM platform that was developed to generate Molecular Class Specific Information Systems (3DM systems) for protein superfamilies. These superfamily systems can be used to collect and interlink heterogeneous data sets based on structure based multiple sequence alignments. 3DM systems can be used to integrate protein, structure, mutation, reaction, conservation, correlation, contact, and many other types of data. Data is visualized using websites, directly in protein structures using YASARA, and in literature using Utopia Documents. 3DM systems contain a number of modules that can be used to analyze superfamily characteristics namely Comulator for correlated mutation analyses, Mutator for mutation retrieval, and Validator for mutant pathogenicity prediction. To be able to determine the characteristics of subsets of proteins and to be able to compare the characteristics of different subsets a powerful filtering mechanism is available. 3DM systems can be used as a central knowledge base for projects in protein engineering, DNA diagnostics, and drug design.

The scientific and technical background of the 3DM platform is described in the first two chapters. Chapter 1 describes the scientific background, starting with an overview of the foundations of the 3DM platform. Alignment methods and tools for both structure and sequence alignments, and the techniques used in the 3DM modules are described in detail. Alternative methods are also described with the advantages and disadvantages of the various strategies. Chapter 2 contains a technical description of the implementation of the 3DM platform and the 3DM modules. A schematic overview of the database used to store the data is provided together with a description of the various tables and the steps required to create new 3DM systems. The techniques used in the Comulator, Mutator and Validator modules of the 3DM platforms are discussed in more detail.

Chapter 3 contains a concise overview of the 3DM platform, its capabilities, and the results of protein engineering projects using 3DM systems. Thirteen 3DM systems were generated for superfamilies such as the PEPM/ICL and Nuclear Receptors. These systems are available online for further examination. Protein engineering studies aimed at optimizing substrate specificity, enzyme activity, or thermostability were designed targeting proteins from these superfamilies. Preliminary results of drug design and DNA diagnostics projects are also included to highlight the diversity of projects 3DM systems can be applied to.

Project HOPE: a biomedical tool to predict the effect of a mutation on the structure of a protein is described in chapter 4. Project HOPE is developed at the Radboud University Nijmegen Medical Center under supervision of H. Venselaar. Project HOPE employs webservices to optimally reuse existing databases and computing facilities. After selection of a mutant in a protein, data is collected from various sources such as UniProt and PISA. A homology model is created to determine features such as contacts and side-chain accessibility directly in the structure. Using a decision tree, the available data is evaluated to predict the effects of the mutation on the protein.

Chapter 5 describes Comulator: the 3DM module for correlated mutation analyses. Two positions in an alignment correlate when they co-evolve, that is they mutate simultaneously or not at all. Comulator uses a statistical coupling algorithm to calculate correlated mutation analyses. Correlated mutations are visualized using heatmaps, or directly in protein structures using YASARA. Analyses of correlated mutations in various superfamilies showed that positions that correlate are often found in networks and that the positions in these networks often share a common function. Using these networks, mutants were predicted to increase the specificity or activity of proteins. Mutational studies confirmed that correlated mutation analyses are a valuable tool for rational design of proteins.
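
As a generic illustration of how co-variation between two alignment positions can be quantified, the sketch below computes the mutual information between two alignment columns. This is a swapped-in, textbook measure chosen for brevity; it is not the statistical coupling algorithm that Comulator uses, and the function name and toy columns are assumptions.

```python
import math
from collections import Counter

# Illustrative stand-in: mutual information between two alignment columns.
# Comulator itself uses a statistical coupling algorithm, not shown here.

def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two equally long columns."""
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in freq_ab.items():
        p_ab = n_ab / n
        mi += p_ab * math.log2(p_ab / ((freq_a[a] / n) * (freq_b[b] / n)))
    return mi


if __name__ == "__main__":
    # Each string holds the residues of one alignment column (toy data).
    print(mutual_information("AAGG", "LLFF"))  # columns that change together
    print(mutual_information("AAGG", "LFLF"))  # columns that vary independently
```

Positions that always change together score high, while positions that vary independently score near zero; a full analysis would compute this for every pair of columns and correct for conservation and phylogenetic bias.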

Mutator, the text mining tool used to incorporate mutations into 3DM systems, is described in chapter 6. Mutator was designed to automatically retrieve mutations from literature and store these mutations in a 3DM system. A PubMed search using keywords from the 3DM system is used to preselect articles of interest. These articles are retrieved from the internet, converted to text, and parsed for mutations. Mutations are then grounded to proteins and stored in a 3DM database. Mutation retrieval was tested on the alpha-amylase superfamily as this superfamily contains the enzyme involved in Fabry’s disease: an X-linked lysosomal storage disease. Compared to existing mutant databases, such as the HGMD and SwissProt, Mutator retrieved 30% more mutations from literature. A major problem in DNA diagnostics is the differentiation between natural variants and pathogenic mutations. To distinguish between pathogenic mutations and natural variation in proteins, the Validator module was added to 3DM. Validator uses the data available in a 3DM system to predict the pathogenicity of a mutant using, for example, the residue conservation of the mutant's alignment position, side-chain accessibility of the mutant in the structure, and the number of mutations found in literature for the alignment position. Mutator and Validator can be used to study mutants found in disorder-related genes. Although these tools are not the definitive solution for DNA diagnostics, they can hopefully be used to increase our understanding of the molecular basis of disorders.
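
The "parsed for mutations" step can be pictured with a small regular-expression sketch that picks up point mutations written in the common one-letter form (wild-type residue, position, mutant residue). The pattern and function name are assumptions for illustration only and do not reproduce Mutator's parser; in particular, such a pattern also matches unrelated tokens of the same shape, which is why a grounding step against protein sequences is needed afterwards.

```python
import re

# Hypothetical sketch, not Mutator's parser: matches tokens such as "R112H".
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
POINT_MUTATION = re.compile(
    r"\b([%s])(\d+)([%s])\b" % (AMINO_ACIDS, AMINO_ACIDS)
)


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples found in the text."""
    found = []
    for wild_type, position, mutant in POINT_MUTATION.findall(text):
        if wild_type != mutant:  # skip tokens that do not describe a change
            found.append((wild_type, int(position), mutant))
    return found


if __name__ == "__main__":
    sentence = "The R112H and N215S substitutions reduce enzyme activity."
    print(find_point_mutations(sentence))
```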

Chapters 7 and 8 describe applied research projects using 3DM systems containing proteins of potential commercial interest. A 3DM system for the α/β-hydrolase superfamily is described in chapter 7. This superfamily consists of almost 20,000 proteins with a diverse range of functions. Superfamily alignments were generated for the common beta-barrel fold shared by all superfamily members, and for five distinct subtypes within the superfamily. Due to the size and functional diversity of the superfamily, there is a lot of potential for industrial application of superfamily members. Chapter 8 describes a study focusing on a sucrose phosphorylase enzyme from the α-amylase superfamily. This enzyme can potentially be used in an industrial setting for the transfer of glucose to a wide variety of molecules. The aim of the study was to increase the stability of the protein at higher temperatures. A combination of rational design using a 3DM system and in-depth study of the protein structure led to a series of mutations that more than doubled the half-life of the protein at 60°C.

3DM systems have been successfully applied in a wide range of protein engineering and DNA diagnostics studies. Currently, 3DM systems are applied most successfully in projects studying a single protein family or monogenetic disorder. In the future, we hope to be able to apply 3DM to more complex scenarios such as enzyme factories and polygenetic disorders by combining multiple 3DM systems for interacting proteins.

Multiplex SSR analysis of Phytophthora infestans in different countries and the importance for potato breeding
Li, Y. - \ 2012
Wageningen University. Promotor(en): Evert Jacobsen, co-promotor(en): Theo van der Lee; D.E.L. Cooke. - S.l. : s.n. - ISBN 9789461732798 - 206
solanum tuberosum - aardappelen - plantenveredeling - plantenziekteverwekkende schimmels - phytophthora infestans - microsatellieten - populaties - ziekteresistentie - genetische merkers - moleculaire merkers - bio-informatica - genomica - plant-microbe interacties - solanum tuberosum - potatoes - plant breeding - plant pathogenic fungi - phytophthora infestans - microsatellites - populations - disease resistance - genetic markers - molecular markers - bioinformatics - genomics - plant-microbe interactions

Potato is the most important non-cereal crop in the world. Late blight, caused by the oomycete pathogen Phytophthora infestans, is the most devastating disease of potato. In the mid-19th century, P. infestans attacked the European potato fields and this resulted in a widespread famine in Ireland and other parts of Europe. Late blight remains the most important disease of potato and causes a yearly multi-billion US dollar loss globally. In Europe and North America, late blight control relies heavily on the use of chemicals, which are hardly affordable for farmers in developing countries and also raise considerable environmental concerns in the developed countries.
The structure of P. infestans populations can change quickly by migration, sexual recombination and sub-clonal variation. Migration and the reconvening of the two mating types considerably raised the level of genetic diversity in the global P. infestans population, leading to a more variable population with a presumed higher level of adaptability as compared to the previous, purely asexually reproducing population. How can the P. infestans population be monitored efficiently with such diverse genotypes? A high-throughput, high-resolution and easy-to-handle set of markers would be favorable for this purpose. Few genetic markers, if any, have found such widespread use as SSRs. Sequencing allows the identification of large numbers of microsatellites by bioinformatics. So far, however, only a limited number of informative microsatellite loci had been described for P. infestans and none had been mapped. This thesis first describes the development and mapping of SSR markers in P. infestans and their integration with other SSRs to generate a multiplex SSR set. The application of this multiplex SSR set in the population analysis of P. infestans from four countries is then described. Finally, the use of this knowledge in resistance breeding of potato is briefly discussed.
Chapter 1 describes the historic population changes of P. infestans at the global level and the current population trends. It summarizes the use of microsatellites as favored molecular markers for studying pathogen population diversity and assesses in more detail the monitoring of population dynamics for resistance breeding in potato.
The selection and identification of new SSR markers is presented in Chapter 2. From EST and genomic sequences from P. infestans we identified 300 non-redundant SSR loci by a bioinformatic screening pipeline. Based on the robustness, level of polymorphism and map position eight SSR markers were selected, which were assembled in two multiplex PCR sets and labeled with two different fluorescent dyes to allow scoring after single capillary electrophoresis.
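
The first step of such a screening pipeline, locating candidate SSR loci in sequence data, can be sketched with a regular expression that looks for a short motif repeated in tandem. The motif lengths (2 to 6 nucleotides), the minimum repeat count and the function name are illustrative assumptions; the thesis pipeline and its parameters are not reproduced here.

```python
import re

# Hypothetical sketch of SSR (microsatellite) detection; thresholds are
# illustrative assumptions, not the values used in the thesis.

def find_ssrs(sequence, min_repeats=4):
    """Return (start, motif, repeat_count) for tandem repeats of 2-6 nt motifs."""
    sequence = sequence.upper()
    # Lazy motif match so the shortest repeating unit is reported first.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    ssrs = []
    for match in pattern.finditer(sequence):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # ignore homopolymer runs such as AAAAAAAA
        repeats = (match.end() - match.start()) // len(motif)
        ssrs.append((match.start(), motif, repeats))
    return ssrs


if __name__ == "__main__":
    print(find_ssrs("GGCATATATATATCC"))
```

Loci found this way would still need to be de-duplicated and tested for robustness and polymorphism before being used as markers.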
This successful multiplex SSR approach encouraged the development of fast, accurate and high-throughput genotyping in a one-step multiplex PCR method to facilitate worldwide screening of P. infestans populations. Published SSRs and the 8 new SSRs were integrated. All these SSR markers were re-evaluated and the 12 most informative SSRs were selected to set up a standard set for global application (Chapter 3). The 12-plex SSRs are distributed over different chromosomes, significantly increasing the resolution of genotyping compared to the previous set of 8 SSRs. The 12-plex SSRs were integrated into a one-step fluorescence-based multiplex reaction, which plays a key role in facilitating highly parallel genotyping and efficient dissection of the more complex P. infestans populations. This multiplex PCR for P. infestans populations is (i) simple, as only one PCR is needed to perform multi-locus typing with twelve markers; (ii) rapid, as the genotyping results can be available in 1 day; and (iii) reproducible and adapted to different laboratories. The genotyping data from different geographic populations were submitted to the Euroblight database. With the same SSR set and the bin set, a comparable global database can easily be achieved.
As indicated earlier, more recent analyses of P. infestans populations highlight the appearance of many new genotypes via migration and/or sexual recombination. To apply the newly developed 12-plex SSR set and dissect the current population structure, several P. infestans populations from 4 different continents were selected for analysis. These include Chinese (Chapter 4), Dutch (Chapter 5), Ecuadorian (Chapter 6) and Tunisian (Chapter 7) populations.
China has become the largest potato-producing country, both in cultivation area and in production tonnage. Interprovincial trade of consumption and seed potatoes is very important and frequent in China. Although both the A1 and A2 mating types are found in China, to date no evidence of an active sexual cycle based on changes in allele frequency has been found. With the ten SSRs, a large genotypic survey of a nation-wide collection of 228 P. infestans isolates was performed (Chapter 4). One of the three dominant clonal lineages in this Chinese population, CN-04 (A2), was genetically similar to a major clonal lineage identified in Europe, called “Blue_13”, with A2 mating type. It was not possible to critically assess the origin of this clonal lineage. This study is the first report of “Blue_13” outside Europe. Virulence testing of selected Chinese P. infestans isolates showed seven different virulence spectra, varying from 3 to 10 differentials. The CN-04 genotypes were identified as more aggressive and more virulent genotypes, one of which showed the full virulence pattern on the potato differential set. Within the Chinese P. infestans population, the genotypes strongly clustered according to their six sampling provinces, a pattern that seems not to be influenced by the frequent interprovincial trading of seed potatoes. The mating type ratio and the SSR allele frequencies indicate that in China the contribution of the sexual cycle to P. infestans population dynamics is minimal. It was concluded that migration through asexual propagules and the generation of sub-clonal variation are the dominant driving factors behind the Chinese P. infestans population structure.
The Netherlands has a long history of population studies on local P. infestans isolates and a substantial number of commercial potato varieties growing in the field. One decade (2000-2009) of isolate sampling in 5 different regions provided the basis for a good understanding of the population dynamics in the Netherlands (Chapter 5). The surveyed population revealed the presence of several clonal lineages and a group of sexual progenies. The major clonal lineage with A2 mating type is known as “Blue_13”, but two distinct clonal lineages with A1 mating type were also identified in this study. This survey shows that the Dutch population was undergoing dramatic changes in the ten years under study. The most notable change was the emergence and spread of the A2 mating type strain “Blue_13”. The results emphasize the importance of the sexual cycle in generating genetic diversity and the importance of the asexual cycle as the propagation and dispersal mechanism for successful genotypes. In addition to the neutral SSR markers, a molecular marker for the virulence of isolates on potato lines that contain the Rpi-blb1 R-gene has been developed. Using this Avr-blb1 marker and the corresponding virulence assay we report, for the first time, the presence of Rpi-blb1 breaker isolates in the Netherlands even before an Rpi-blb1-containing resistant variety was introduced. The 12 breaker isolates only occurred in sexual progeny. So far the asexual spread of such virulent isolates has been limited because of the absence of Rpi-blb1-containing varieties in the field.
Remarkably, on the other side of the world in the Andes, the region of potato origin, the situation is far less complex as far as P. infestans is concerned. There are more than 400 potato landraces in Ecuador, and the traditional small-scale cultivation practised by local farmers in the highlands differs from potato cultivation in other potato-growing countries in North America or Europe (Chapter 6). Phytophthora isolates in Ecuador belong to two closely related species, P. infestans (on potato and tomato) and P. andina (on non-tuber-bearing hosts), but SSR analysis of 66 isolates indicated that the two species are separated into two clearly distinguished genetic groups. Two ancient clonal lineages of P. infestans appeared to be dominant in Ecuador: one is found only on tomato, the other only on potato. Within the potato isolates, but not in the tomato isolates, there is a large sub-clonal variation caused by (partial) polyploidization and loss of alleles.
In Tunisia, potato is cultivated in three to four partly overlapping seasons while tomato is grown either in greenhouses or as aerial crop in most potato producing regions. Chapter 7 revealed, among 165 isolates from five regions, the presence of a major clonal lineage (NA-01, A1 mating type, Ia mtDNA haplotype) that seems to consist of races that are relatively simple. Another, highly genetically diverse group of isolates was found containing more complex races and isolates with both mating types. Season clustering indicated that at least some of the new genotypes generated by sexual reproduction overlapped between seasons, and such sexual progeny may play an important role in the next season's epidemics. On tomato, mostly asexual progeny was identified, with two mtDNA haplotypes but fewer nuclear genotypes than on potato. This study shows that the P. infestans population is currently changing, and the old clonal lineage is being replaced by a more complex, genetically diverse and sexually propagating population in two sub-regions in Tunisia. Despite the massive import of potato seeds from Europe, the P. infestans population in Tunisia is still clearly distinct from the European population.
Chapter 8 discusses the application of microsatellites in monitoring the genetic diversity of late blight and their potential use in resistance breeding. Monitoring the local P. infestans population for new virulent genotypes with the differential potato set, in combination with screening for effector variation, allows early detection of the adaptation of certain genotypes within the P. infestans population to particular resistance genes in a specific region. This makes it possible to determine which broad spectrum R-genes are still useful, so that the control strategy by resistance breeding can be adapted to the new situation. One way of doing that is to replace the existing varieties by other varieties with stacked non-broken R-genes obtained by marker assisted selection, or to add additional R-genes to existing (R-gene containing) varieties by transformation. In a transgenic or cisgenic approach, additional broad spectrum R-genes could be added by re-transformation. As we have shown, the right R-gene management strategy in potato breeding, but also in potato production, should include the direct monitoring of local pathogen populations by using the differential set and the 12-plex SSR set.

Genome bioinformatics of tomato and potato
Datema, E. - \ 2011
Wageningen University. Promotor(s): W. Stiekema, co-promotor(s): Roeland van Ham. - [S.l.] : S.n. - ISBN 9789461730473 - 139
crops - solanum lycopersicum - solanum tuberosum - genomics - bioinformatics - nucleotide sequences - genomes - genes

In the past two decades genome sequencing has developed from a laborious and costly technology employed by large international consortia to a widely used, automated and affordable tool used worldwide by many individual research groups. Genome sequences of many food animals and crop plants have been deciphered and are being exploited for fundamental research and applied to improve their breeding programs. The developments in sequencing technologies have also impacted the associated bioinformatics strategies and tools, both those that are required for data processing, management, and quality control, and those used for interpretation of the data.

This thesis focuses on the application of genome sequencing, assembly and annotation to two members of the Solanaceae family, tomato and potato. Potato is the economically most important species within the Solanaceae, and its tubers contribute to dietary intake of starch, protein, antioxidants, and vitamins. Tomato fruits are the second most consumed vegetable after potato, and are a globally important dietary source of lycopene, beta-carotene, vitamin C, and fiber. The chapters in this thesis document the generation, exploitation and interpretation of genomic sequence resources for these two species and shed light on the contents, structure and evolution of their genomes.

Chapter 1 introduces the concepts of genome sequencing, assembly and annotation, and explains the novel genome sequencing technologies that have been developed in the past decade. These so-called Next Generation Sequencing platforms display considerable variation in chemistry and workflow, and as a consequence the throughput and data quality differ by orders of magnitude between the platforms. The currently available sequencing platforms produce a vast variety of read lengths and facilitate the generation of paired sequences with an approximately fixed distance between them. The choice of sequencing chemistry and platform combined with the type of sequencing template demands specifically adapted bioinformatics for data processing and interpretation. Irrespective of the sequencing and assembly strategy that is chosen, the resulting genome sequence, often represented by a collection of long linear strings of nucleotides, is of limited interest by itself. Interpretation of the genome can only be achieved through sequence annotation, that is, the identification and classification of all functional elements in a genome sequence. Once these elements have been annotated, sequence alignments between multiple genomes of related accessions or species can be utilized to reveal the genetic variation on both the nucleotide and the structural level that underlies the difference between these species or accessions.

Chapter 2 describes BlastIf, a novel software tool that exploits sequence similarity searches with BLAST to provide a straightforward annotation of long nucleotide sequences. Generally, two problems are associated with the alignment of a long nucleotide sequence to a database of short gene or protein sequences: (i) the large number of similar hits that can be generated due to database redundancy; and (ii) the relationships implied between aligned segments within a hit that in fact correspond to distinct elements on the sequence such as genes. BlastIf generates a comprehensible BLAST output for long nucleotide sequences by reducing the number of similar hits while revealing most of the variation present between hits. It is a valuable tool for molecular biologists who wish to get a quick overview of the genetic elements present in a newly sequenced segment of DNA, prior to more elaborate efforts of gene structure prediction and annotation.
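The abstract does not describe how BlastIf reduces redundant hits, so the following is only a generic sketch of the underlying problem (i), not BlastIf's algorithm. It assumes hits are given as intervals on the long query sequence with a score, and greedily drops any hit that mostly overlaps a better-scoring hit already kept. The `Hit` type, the `collapse_hits` function, the `min_overlap` threshold and the subject names are all hypothetical.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """A hypothetical, simplified BLAST hit on a long query sequence."""
    subject: str   # database sequence name (illustrative)
    q_start: int   # 1-based start on the query
    q_end: int     # 1-based inclusive end on the query
    score: float   # alignment score (higher is better)


def collapse_hits(hits: List[Hit], min_overlap: float = 0.5) -> List[Hit]:
    """Greedy redundancy filter (an assumption, not BlastIf's method).

    Hits are visited from best to worst score. A hit is dropped when at
    least `min_overlap` of its own query span is covered by a single hit
    that was already kept; otherwise it is kept.
    """
    kept: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        length = hit.q_end - hit.q_start + 1
        redundant = False
        for other in kept:
            overlap = min(hit.q_end, other.q_end) - max(hit.q_start, other.q_start) + 1
            if overlap > 0 and overlap / length >= min_overlap:
                redundant = True
                break
        if not redundant:
            kept.append(hit)
    # Report survivors in query order, like a map of elements on the sequence.
    return sorted(kept, key=lambda h: h.q_start)


example_hits = [
    Hit("geneA_copy1", 100, 500, 900.0),
    Hit("geneA_copy2", 120, 480, 850.0),  # near-duplicate of geneA_copy1
    Hit("geneB", 2000, 2600, 400.0),
]
collapsed = collapse_hits(example_hits)
```

A filter this simple discards the dropped hits entirely, whereas the abstract states that BlastIf still reveals most of the variation between hits, so a faithful implementation would need to retain more information than this sketch does.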

In Chapter 3 a first genome-wide comparison between the emerging genomic sequence resources of tomato and potato is presented. Large collections of BAC end sequences from both species were annotated through repeat searches, transcript alignments and protein domain identification. In-depth comparisons of the annotated sequences revealed remarkable differences in both gene and repeat content between these closely related genomes. The tomato genome was found to be more repetitive than the potato genome, and substantial differences in the distribution of Gypsy and Copia retrotransposable elements as well as microsatellites were observed between the two genomes. A higher gene content was identified in the potato sequences, and in particular several large gene families including cytochrome P450 mono-oxygenases and serine-threonine protein kinases were significantly overrepresented in potato compared to tomato. Moreover, the cytochrome P450 gene family was found to be expanded in both tomato and potato when compared to Arabidopsis thaliana, suggesting an expanded network of secondary metabolic pathways in the Solanaceae. Together these findings present a first glimpse into the evolution of Solanaceous genomes, both within the family and relative to other plant species.

Chapter 4 explores the physical and genetic organization of tomato chromosome 6 through integration of BAC sequence analysis, High Information Content Fingerprinting, genetic analysis, and BAC-FISH mapping data. A collection of BACs spanning substantial parts of the short and long arm euchromatin and several dispersed regions of the pericentromeric heterochromatin were sequenced and assembled into several tiling paths spanning approximately 11 Mb. Overall, the cytogenetic order of BACs was in agreement with the order of BACs anchored to the Tomato EXPEN 2000 genetic map, although a few striking discrepancies were observed. The integration of BAC-FISH, sequence and genetic mapping data furthermore provided a clear picture of the borders between eu- and heterochromatin on chromosome 6. Annotation of the BAC sequences revealed that, although the majority of protein-coding genes were located in the euchromatin, the highly repetitive pericentromeric heterochromatin displayed an unexpectedly high gene content. Moreover, the short arm euchromatin was relatively rich in repeats, but the ratio of Gypsy and Copia retrotransposons across the different domains of the chromosome clearly distinguished euchromatin from heterochromatin. The ongoing whole-genome sequencing effort will reveal if these properties are unique to tomato chromosome 6, or a more general property of the tomato genome.

Chapter 5 presents the potato genome, the first genome sequence of an Asterid. To overcome the problems associated with genome assembly due to the high level of heterozygosity that is observed in commercial tetraploid potato varieties, a homozygous doubled-monoploid potato clone was exploited to sequence and assemble 86% of the 844 Mb genome. This potato reference genome sequence was complemented with re-sequencing of a heterozygous diploid clone, revealing the form and extent of sequence polymorphism both between different genotypes and within a single heterozygous genotype. Gene presence/absence variants and other potentially deleterious mutations were found to occur frequently in potato and are a likely cause of inbreeding depression. Annotation of the genome was supported by deep transcriptome sequencing of both the doubled-monoploid and the heterozygous potato, resulting in the prediction of more than 39,000 protein coding genes. Transcriptome analysis provided evidence for the contribution of gene family expansion, tissue specific expression, and recruitment of genes to new pathways to the evolution of tuber development. The sequence of the potato genome has provided new insights into Eudicot genome evolution and has provided a solid basis for the elucidation of the evolution of tuberisation. Many traits of interest to plant breeders are quantitative in nature and the potato sequence will simplify both their characterization and deployment to generate novel cultivars.

The outstanding challenges in plant genome sequencing are addressed in Chapter 6. The high concentration of repetitive elements and the heterozygosity and polyploidy of many interesting crop plant species currently pose a barrier for the efficient reconstruction of their genome sequences. Nonetheless, the completion of a large number of new genome sequences in recent years and the ongoing advances in sequencing technology provide many exciting opportunities for plant breeding and genome research. Current sequencing platforms are being continuously updated and improved, and novel technologies are being developed and implemented in third-generation sequencing platforms that sequence individual molecules without need for amplification. While these technologies create exciting opportunities for new sequencing applications, they also require robust software tools to process the data produced through them efficiently. The ever increasing amount of available genome sequences creates the need for an intuitive platform for the automated and reproducible interrogation of these data in order to formulate new biologically relevant questions on datasets spanning hundreds or thousands of genome sequences.

Bayesian Markov random field analysis for integrated network-based protein function prediction
Kourmpetis, Y.I.A. - \ 2011
Wageningen University. Promotor(s): Cajo ter Braak, co-promotor(s): Roeland van Ham. - [S.l.] : S.n. - ISBN 9789085859598 - 113
statistics - bayesian theory - markov processes - network analysis - biostatistics - applied statistics - bioinformatics - proteins - genes - molecular biology

Unravelling the functions of proteins is one of the most important aims of modern biology. Experimental inference of protein function is expensive and not scalable to large datasets. In this thesis a probabilistic method for protein function prediction is presented that integrates different types of data such as sequences and networks. The method is based on Bayesian Markov Random Field (BMRF) analysis. BMRF was initially applied to genome-wide protein function prediction using network data in yeast, and also in Arabidopsis by integrating protein domains (i.e. InterPro signatures), expression data and protein-protein interactions. Several of the predictions were confirmed by experimental evidence. Further, an evolutionary discrete optimization algorithm is presented that integrates function predictions for different Gene Ontology (GO) terms into a single prediction that is consistent with the True Path Rule as imposed by the GO Directed Acyclic Graph. This integration leads to predictions that are easy to interpret. Evaluation of this algorithm using Arabidopsis data showed that the prediction performance is improved compared to single GO term predictions.
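To make the True Path Rule constraint concrete, here is a minimal sketch of what consistency means on a GO-like directed acyclic graph: a protein predicted to have a term must also be predicted to have every ancestor of that term. This is not the evolutionary optimization algorithm from the thesis; it only checks the constraint and shows the simplest possible repair (adding all ancestors). The toy graph `PARENTS` and its term names are invented for illustration and are not real GO identifiers.

```python
from typing import Dict, List, Set

# Hypothetical toy DAG: each term maps to its direct parents.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "kinase": ["catalysis", "binding"],  # a term may have several parents
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every term reachable by following parent links upward."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def is_consistent(predicted: Set[str], parents: Dict[str, List[str]]) -> bool:
    """True Path Rule check: every predicted term has all its ancestors predicted."""
    return all(ancestors(term, parents) <= predicted for term in predicted)


def propagate_up(predicted: Set[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Simplest repair (an assumption, not the thesis method): add all ancestors."""
    closed = set(predicted)
    for term in predicted:
        closed |= ancestors(term, parents)
    return closed


raw_prediction = {"kinase"}
repaired = propagate_up(raw_prediction, PARENTS)
```

Here `raw_prediction` violates the rule because `kinase` is predicted without its ancestors, while `repaired` satisfies it. Upward propagation always accepts the most specific prediction; an optimization approach such as the one the abstract describes can instead weigh evidence across terms before settling on a consistent set.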
