The fish egg microbiome: diversity and activity against the oomycete pathogen Saprolegnia
Liu, Y. - 2016
Wageningen University. Promotor(en): Francine Govers; Jos Raaijmakers, co-promotor(en): Irene de Bruijn. - Wageningen : Wageningen University - ISBN 9789462577671 - 169
salmon - fish eggs - marine microorganisms - microbial diversity - bioinformatics - genomics - saprolegnia - oomycota - fish diseases - suppression - fungal antagonists
Prof. dr. F. Govers (promotor); Prof. dr. J.M. Raaijmakers (promotor); Dr. I. de Bruijn (co-promotor); Wageningen University, 13 June 2016, 170 pp.
Emerging oomycete pathogens increasingly threaten biodiversity and food security. This thesis describes the study of the microbiome of Atlantic salmon (Salmo salar L.) eggs and analyses of the effects of infections by the oomycete pathogen Saprolegnia on the microbial architecture. A low incidence of Saprolegniosis was correlated with a relatively high abundance and richness of specific commensal Actinobacteria. Among the bacterial community, the isolates Frondihabitans sp. 762G35 (Microbacteriaceae) and Pseudomonas sp. H6 significantly inhibited hyphal attachment of Saprolegnia diclina to live salmon eggs. Chemical profiling showed that these two isolates produce furancarboxylic acid-derived metabolites and a lipopeptide viscosin-like biosurfactant, respectively, which inhibited hyphal growth of S. diclina in vitro. Among the fungal community, the fungal isolates obtained from salmon eggs were closely related to Microdochium lycopodinum/Microdochium phragmitis and Trichoderma viride. Both a quantitative and qualitative difference in the Trichoderma population between Saprolegnia-infected and healthy salmon eggs was observed, which suggested that mycoparasitic Trichoderma species could play a role in Saprolegnia suppression in aquaculture. This research provides a scientific framework for studying the diversity and dynamics of microbial communities to mitigate emerging diseases. The Frondihabitans, Pseudomonas and Trichoderma isolates, and/or their bioactive metabolites, are proposed as effective candidates to control Saprolegniosis.
Identification and functional characterization of putative (a)virulence factors in the fungal wheat pathogen Zymoseptoria tritici
Mirzadi Gohari, A. - 2015
Wageningen University. Promotor(en): Pierre de Wit, co-promotor(en): Gert Kema; Rahim Mehrabi. - Wageningen : Wageningen University - ISBN 9789462575912 - 159
triticum aestivum - wheat - plant pathogenic fungi - mycosphaerella graminicola - virulence factors - genetic analysis - pathogenesis - bioinformatics
Zymoseptoria tritici (Desm.) Quaedvlieg & Crous (previously known as Mycosphaerella graminicola) is the causal agent of septoria tritici blotch (STB), a devastating foliar wheat disease worldwide. It is responsible for significant annual yield losses in all major wheat-growing areas and threatens global food security. Z. tritici is a hemi-biotrophic fungal pathogen that, after stomatal penetration, establishes a stealthy, symptomless biotrophic relation with its host plant. This is followed by a sudden switch to a necrotrophic growth phase coinciding with chlorosis that eventually develops into large necrotic blotches containing many pycnidia producing asexual splash-borne conidia. Under natural conditions, once competent mating partners are present and conditions are conducive, pseudothecia are formed producing airborne ascospores. Disease management of STB is primarily achieved through fungicide applications and by growing commercial cultivars carrying Stb resistance genes. However, the efficacy of both strategies is limited: strains resistant to fungicides frequently develop and progressively dominate natural populations, which hampers disease management, and the deployed Stb genes are often overcome by existing or newly developed isolates of the fungus. Hence, there is a need for discovery research to better understand the molecular basis of the host-pathogen interaction, enabling breeders to identify and deploy new Stb genes, which will eventually contribute to more sustainable disease control.
Chapter 1 introduces the subject of the thesis and describes various aspects of the lifestyle of Z. tritici with emphasis on dissecting the various stages and physiological processes during pathogenesis on wheat. In addition, it includes a short summary and discussion of the current understanding of the role of (a)virulence factors in the Z. tritici–wheat pathosystem.
Chapter 2 describes new gateway technology-driven molecular tools comprising 22 entry constructs facilitating rapid construction of binary vectors for functional analyses of fungal genes. The entry vectors for single, double or triple gene deletion mutants were developed using hygromycin, geneticin and nourseothricin resistance genes as selection markers. Furthermore, these entry vectors contain the genes encoding green fluorescent (GFP) or red fluorescent (RFP) protein in combination with the three selection markers, which enables simultaneous tagging of gene deletion mutants for microscopic analyses. The functionality of these entry vectors was validated in Z. tritici and described in Chapters 3, 4 and 5.
Chapter 3 describes the functional characterization of ZtWor1, the orthologue of Wor1 in the fungal human pathogen Candida albicans. ZtWor1 is up-regulated during initiation of colonization and fructification, and regulates expression of candidate effector genes, including one that was discovered after comparative proteome analysis of Z. tritici wild-type and ΔZtWor1 strains. Cell fusion and anastomosis occurred frequently in ΔZtWor1 strains, which is reminiscent of mutants of MgGpb1, the β-subunit of the heterotrimeric G protein. Comparative expression profiling of ΔZtWor1, ΔMgGpb1 and ΔMgTpk2 (the catalytic subunit of protein kinase A) strains suggests that ZtWor1 is downstream of the cyclic adenosine monophosphate (cAMP) pathway that is crucial for pathogenicity of many fungal plant pathogens.
Chapter 4 describes combined bioinformatics and expression profiling studies during pathogenesis in order to discover candidate effectors of Z. tritici important for virulence. In addition, a genetic approach was followed to map quantitative trait loci (QTLs) in Z. tritici carrying putative effectors. Functional analysis of two top effector candidates, small-secreted proteins SSP15 and SSP18, which were selected based on their expression profile in planta, showed that they are dispensable for virulence of Z. tritici. These analyses suggest that generally adopted criteria for effector discovery, such as protein size, number of cysteine residues and up-regulated expression during pathogenesis, should be taken with caution and cannot be applied to every pathosystem, as they likely represent only a subset of effector genes.
Chapter 5 describes the functional characterization of ZtCpx1 and ZtCpx2 encoding a secreted and a cytoplasmic catalase-peroxidase (CP) in Z. tritici, respectively. Gene replacement of ZtCpx1 resulted in mutant strains that were sensitive to exogenously added H2O2, and in planta phenotyping showed that they are significantly less virulent than the wild-type. All mutant phenotypes could be restored to wild-type by complementation with the wild-type allele of ZtCpx1 driven by its native promoter. Additionally, functional analysis of ZtCpx2 confirmed that this gene encodes a secreted CP but is dispensable for virulence of Z. tritici on wheat. Nevertheless, we showed that both genes act synergistically, as the generated double knock-out strain showed a significantly stronger reduction in virulence than the individual single knock-out strains. Hence, both genes are required by Z. tritici for successful infection and colonization of wheat.
In Chapter 6 I discuss and summarize the genetic approaches used in this study, reflect on the major findings and bottlenecks encountered, and propose new strategies to identify effectors of Z. tritici in the future.
Lipoxygenase: a game-changing enzyme
Heshof, R. - 2015
Wageningen University. Promotor(en): Willem de Vos; Vitor Martins dos Santos, co-promotor(en): Leo de Graaff. - Wageningen : Wageningen University - ISBN 9789462571761 - 160
lipoxygenase - amino acid sequences - lipids - fungi - bioinformatics - potassium iodide - industrial applications - biobased economy
Many challenges lie ahead in using LOXs as tools in industrial oleochemistry. One of these challenges is the supply of PUFAs. Although we are moving towards a biobased economy in which second and third generation biomass takes a leading role, it is still faster and cheaper to use first generation biomass. Industrialization of microbial oils is a good alternative to meet the demand for PUFAs. Another challenge is the production of heterologous LOX in sufficient quantities. Over the last decade this problem has been tackled, and more research is being done on heterologous expression of LOXs. The LOX with the highest potential so far is the secreted Pseudomonas aeruginosa LOX produced in Escherichia coli. During this thesis research, different lox genes were tried for heterologous production of LOX using different Aspergillus niger and Aspergillus nidulans strains as expression hosts. These LOXs were identified as discussed in Chapter 3 and Chapter 6. Unfortunately, heterologous production in sufficient quantities was unsuccessful using these expression hosts, as discussed in Chapter 5 and Chapter 6. Since production of Gaeumannomyces graminis LOX was successful in Trichoderma reesei, as discussed in Chapter 4, the production of polymers used for bioplastics could be demonstrated in this ERA-NOEL project anyway. This thesis therefore shifted its focus to resolving the difficulties in the heterologous expression of LOX in different Aspergillus species. Chapter 5 is the result of a systematic approach to analyze different aspects of G. graminis LOX expression in A. nidulans. Chapter 2 shows that heterologous expression of extracellular fungal LOX can be performed using T. reesei and Pichia pastoris as production hosts, and that E. coli can be used for the production of intracellular LOXs of plant, mammal, bacterial, and fungal origin. As shown in Chapter 2, E. coli is not very efficient in the production of heterologous LOX due to the formation of inclusion bodies and the low induction temperature necessary for production. The use of Aspergillus oryzae can be exploited further in the heterologous production of LOXs. Because A. niger and A. nidulans were chosen as expression hosts, the potential of A. oryzae was not explored in this thesis. The last challenge is to synthetically engineer LOX to broaden its use in industry. In this way more building blocks for chemicals can be synthetically produced and more LOX-derived products can be made. Therefore, LOX can be a world-wide game-changing enzyme in a biobased economy, as its use can decrease the demand for petroleum-based products.
Dick de Ridder on the biologist who is increasingly becoming a data scientist
Ridder, D. de - 2015
bioinformatics - data processing - informatics - occupations - molecular biology
Biology is rapidly taking on the character of a data science. Billions of data points on genomes, genes, proteins and other molecules are brought together in large files and examined systematically. This should lead to more basic knowledge and understanding of living organisms, among which crops and livestock form the basis of the food supply for the world population. So says prof.dr.ir. Dick de Ridder in his inaugural address on accepting the post of Professor of Bioinformatics at Wageningen University on 30 April.
Filling the gap between sequence and function: a bioinformatics approach
Bargsten, J.W. - 2014
Wageningen University. Promotor(en): Richard Visser, co-promotor(en): Jan-Peter (Jp) Nap. - Wageningen : Wageningen University - ISBN 9789462570764 - 170
bioinformatics - plants - genomics - nucleotide sequences - functional genomics - comparative genomics - comparative mapping - genomes - genetic mapping - plant breeding - methodology
The research presented in this thesis focuses on deriving function from sequence information, with the emphasis on plant sequence data. Unravelling the impact of genomic elements, in most cases genes, on the phenotype of an organism is a major challenge in biological research and modern plant breeding. An important part of this challenge is the (functional) annotation of such genomic elements. Currently, wet lab experiments may provide high-quality annotation, but they are laborious and costly. With the advent of next generation sequencing platforms, vast amounts of sequence data are generated. These data are used in connection with the available experimental data to derive function from a bioinformatics perspective.
The connection between sequence information and function was approached on the level of chromosome structure (chapter 2) and of gene families (chapter 3) using combinations of existing bioinformatics tools. The applicability of using interaction networks for function prediction was demonstrated by first markedly improving an existing method (chapter 4) and by exploring the role of network topology in function prediction (chapter 5). Taken together, the combination of methods and results presented indicate the potential as well as the current state-of-the-art of function prediction in (plant) bioinformatics.
Chapter 1 introduces the basis for the approaches used and developed in this thesis. This includes the concepts of genome annotation, comparative genomics, gene function prediction and the analysis of network topology for gene function prediction. A requirement for the study of any new organism is the sequencing and annotation of its genome. Current genome annotation is divided into structural identification and functional categorization of genomic elements. The de facto standard for categorizing functional annotation is provided by the Gene Ontology. The Gene Ontology is divided into three domains, molecular function, biological process and cellular component. Approaches to predict molecular function and biological process are outlined. Accurate function prediction generally relies on existing input data, often of experimental origin, that can be transferred to unannotated genomic elements. Plants often lack such input data, which poses a big challenge for current function prediction algorithms. In unravelling the function of genomic elements, comparative genomics is an important approach. Via the comparison of multiple genomes it gives insights into evolution, function as well as genomic structure and variation. Comparative genomics has become an essential toolkit for the analysis of newly sequenced organisms. Often bioinformatics methods need to be adapted to the specific needs of plant genome research. With a focus on the commercially important crop plants tomato and potato, specific requirements of plant bioinformatics, such as the high amount of repetitive elements and the lack of experimental data, are outlined.
In chapter 2, the structural homology of the long arm of chromosome 2 (2L) of tomato, potato and pepper is analyzed. Molecular organization and collinear junctions are delineated using multi-color BAC FISH analysis and comparative sequence alignment. We identify several large-scale rearrangements including inversions and segmental translocations that were not reported in previous comparative studies. Some of the structural rearrangements are specific for the tomato clade, and differentiate tomato from potato, pepper and other solanaceous species. There are many small-scale synteny perturbations, but local gene vicinity is largely preserved. The data suggest that long distance intra-chromosomal rearrangements and local gene rearrangements have evolved frequently during speciation in the Solanum genus, and that small changes are more prevalent than large-scale differences. The occurrence of transposable elements and other repeats near or at junction breaks may indicate repeat-mediated rearrangements. The ancestral 2L topology is reconstructed and the evolutionary events leading to the current topology are discussed.
In chapter 3, we analyze the Snf2 gene family. As part of large protein complexes, Snf2 family ATPases are responsible for energy supply during chromatin remodeling, but the precise mechanism of action of many of these proteins is largely unknown. They influence many processes in plants, such as the response to environmental stress. The analysis is the first comprehensive study of Snf2 family ATPases in plants. Some subfamilies of the Snf2 gene family are remarkably stable in number of genes per genome, whereas others show expansion and contraction in several plants. One of these subfamilies, the plant-specific DRD1 subfamily, is non-existent in lower eukaryote genomes, yet it developed into the largest Snf2 subfamily in plant genomes. It shows the occurrence of a complex series of evolutionary events. Its expansion, notably in tomato, suggests novel functionality in processes connected to chromatin remodeling. The results underpin and extend the Snf2 subfamily classification, which could help to determine the various functional roles of Snf2 ATPases and to target environmental stress tolerance and yield in future breeding with these genes.
In chapter 4, a new approach to improve the prediction of protein function in terms of biological processes is developed that is particularly attractive for sparsely annotated plant genomes. The combination of the network-based prediction method Bayesian Markov Random Field (BMRF) with the sequence-based prediction method Argot2 shows significantly improved performance compared to each of the methods separately, as well as compared to Blast2GO. The approach was applied to predict biological processes for the proteomes of rice, barrel clover, poplar, soybean and tomato. Analysis of the relationships between sequence similarity and predicted function similarity identifies numerous cases of divergence of biological processes in which proteins are involved, in spite of sequence similarity. Examples of potential divergence are identified for various biological processes, notably for processes related to cell development, regulation, and response to chemical stimulus. Such divergence in biological process annotation for proteins with similar sequences should be taken into account when analyzing plant gene and genome evolution. This way, the integration of network-based and sequence-based function prediction will strengthen the analysis of evolutionary relationships of plant genomes.
In chapter 5 the influence of network topology on network-based function prediction algorithms is investigated. The analysis of biological networks using algorithms such as Bayesian Markov Random Field (BMRF) is a valuable predictor of the biological processes that proteins are involved in. The topological properties and constraints that determine prediction performance in such networks are however largely unknown. This chapter presents analyses based on network centrality measures, such as node degree, to evaluate the performance of BMRF upon progressive removal of highly connected hub nodes (pruning). Three different protein-protein interaction networks with data from Arabidopsis, human and yeast were analyzed. All three show that the average prediction performance can improve significantly. The chapter paves the way for further improvement of network-based function prediction methods based on node pruning.
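The pruning step described for chapter 5 can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `prune_hubs`, the toy protein identifiers and the choice to rank hubs once by their degree in the original network are all assumptions made for illustration, and the BMRF prediction step itself is not shown.

```python
def prune_hubs(edges, n_remove):
    """Remove the n_remove highest-degree nodes from an undirected edge list.

    Hypothetical helper: degrees are computed once on the input network,
    which is only one possible reading of "progressive removal".
    """
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    # Sort by degree (descending), then by name so ties break deterministically.
    hubs = sorted(degree, key=lambda n: (-degree[n], n))[:n_remove]
    hub_set = set(hubs)
    # Keep only interactions in which neither partner is a removed hub.
    kept = [(a, b) for a, b in edges if a not in hub_set and b not in hub_set]
    return hubs, kept


# Toy protein-protein interaction network (made-up identifiers).
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P4", "P5")]
hubs, kept = prune_hubs(toy_edges, 1)
print(hubs, kept)
```

Calling the function with increasing values of `n_remove` and re-running a function prediction method on each pruned network would give a performance-versus-pruning curve of the kind the chapter evaluates.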
Chapter 6 discusses the results and methods developed in this thesis in the context of the vast amount of generated sequencing data. Sequencing or re-sequencing a (plant) genome has become fairly straightforward and affordable, but the interpretation for subsequent use of this sequence data is far from trivial. The topics addressed in this thesis, annotation of function, analysis of genome structure and identifying genomic variation, focus on this main bottleneck of biological research. Issues discussed in connection with this work and its future are data accuracy, error propagation, possible improvements and future implications for biological research in crop plants. In particular the shift of costs from sequencing to downstream analyses, with functional genome annotation as essential step, is covered. One of the biggest challenges biology and bioinformatics will face is the integration of results from such downstream analyses and other sources into a complete picture. Only this will allow understanding of complex biological systems.
Elicitin-triggered apoplastic immunity against late blight in potato
Du, J. - 2014
Wageningen University. Promotor(en): Richard Visser; Evert Jacobsen, co-promotor(en): Vivianne Vleeshouwers. - Wageningen : Wageningen University - ISBN 9789462570092 - 140
solanum tuberosum - potatoes - plant pathogenic fungi - phytophthora infestans - disease resistance - genes - fungal protein - genetic markers - bioinformatics - plant breeding
Applications in computer-assisted biology
Nijveen, H. - 2013
Wageningen University. Promotor(en): Ton Bisseling, co-promotor(en): P.E. van der Vet. - Wageningen : Wageningen UR - ISBN 9789461737816 - 106
bioinformatics - molecular biology - computers - databases - prokaryotes - computer analysis - information technology
Biology is becoming a data-rich science driven by the development of high-throughput technologies like next-generation DNA sequencing. This is fundamentally changing biological research. The genome sequences of many species are becoming available, as well as the genetic variation within a species, and the activity of the genes in a genome under various conditions. With the opportunities that these new technologies offer, comes the challenge to effectively deal with the large volumes of data that they produce. Bioinformaticians have an important role to play in organising and analysing this data to extract biological information and gain knowledge. Also for experimental biologists computers have become essential tools. This has created a strong need for software applications aimed at biological research. The chapters in this thesis detail my contributions to this area. Together with molecular biologists, plant breeders, immunologists, and microbiologists, I have developed several software tools and performed computational analyses to study biological questions.
Chapter 2 is about Primer3Plus, a web tool that helps biologists to design DNA primers for their experiments. These primers are typically short stretches of DNA (~20 nucleotides) that direct the DNA replication machinery to copy a selected region of a DNA molecule. The specificity of a primer is determined by several chemical and physical properties and therefore designing good primers is best done with the help of a computer program. Primer3Plus offers a user-friendly task-oriented web interface to the popular primer3 primer design program. Primer3Plus clearly fulfils a need in the biological research community as already over 400 scientific articles have cited the Primer3Plus publication.
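To give a feel for the kind of sequence-derived properties a primer design program weighs, the sketch below computes GC content and a rule-of-thumb melting temperature for a 20-nucleotide primer. This is not how primer3 or Primer3Plus work internally; the Wallace rule (2 °C per A/T, 4 °C per G/C) is a simple approximation swapped in for illustration, and the function names and the example primer are hypothetical.

```python
def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees Celsius (Wallace rule).

    Illustrative approximation for short oligonucleotides only.
    """
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


# Made-up 20-nucleotide primer.
primer = "ATGCGTACGTTAGCCTAGGA"
print(len(primer), gc_content(primer), wallace_tm(primer))
```

A real design tool combines many more criteria, such as self-complementarity and the match between the two primers of a pair, which is why the abstract recommends leaving the task to a dedicated program.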
Single nucleotide differences or polymorphisms (SNPs) that are present within a species can be used as markers to link phenotypic observations to locations on the genome. Chapter 3 discusses QualitySNPng, which is a stand-alone software tool for finding SNPs in high-throughput sequencing data. QualitySNPng was inspired by the QualitySNP pipeline for SNP detection that was published in 2006 and it uses similar filtering criteria to distinguish SNPs from technical artefacts like sequence read errors. In addition, the SNPs are used to predict haplotypes. QualitySNPng has a graphical user interface that allows the user to run the SNP detection and evaluate the results. It has already been successfully used in several projects on marker detection for plant breeding.
Single nucleotide polymorphisms can lead to single amino acid changes in protein sequences. These single amino acid polymorphisms (SAPs) play a key role in graft-versus-host (GVH) effects that often accompany tissue transplantations. A beneficial variant of GVH is the graft-versus-leukaemia (GVL) effect that is sometimes witnessed after bone marrow transplantation in leukaemia patients. When the GVL effect occurs, the donor’s immune cells actively destroy residual tumour cells in the patient. The GVL effect can already be elicited by a single amino acid difference between the patient and the donor. Currently, a small number of SAPs that can elicit a GVL effect are known and these are used to select the right bone marrow donor for a leukaemia patient. Together with researchers at the Leiden University Medical Center I developed a database to aid in the discovery of more such SAPs. We called this database the “Human Short Peptide Variation database” or HSPVdb. It is described in chapter 4.
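The idea of a single amino acid difference between patient and donor can be sketched as follows. This is only an illustration under stated assumptions: the function names, the flank size and the two protein fragments are invented here, and nothing in the sketch reflects the actual design or contents of HSPVdb.

```python
def single_aa_differences(patient, donor):
    """List (position, patient_residue, donor_residue) for aligned proteins.

    Assumes both sequences are already aligned and of equal length.
    """
    if len(patient) != len(donor):
        raise ValueError("sequences must have the same length")
    return [(i, p, d) for i, (p, d) in enumerate(zip(patient, donor)) if p != d]


def peptide_window(sequence, position, flank=4):
    """Short peptide centred on a variant position (flank size is arbitrary)."""
    start = max(0, position - flank)
    return sequence[start:position + flank + 1]


# Made-up protein fragments differing at one residue.
patient = "MKTAYIAKQRQISFVK"
donor = "MKTAYIAKQHQISFVK"

differences = single_aa_differences(patient, donor)
for position, _, _ in differences:
    print(position, peptide_window(patient, position), peptide_window(donor, position))
```

The two short peptides printed for each difference are the kind of variant pair that a search for candidate SAPs would need to enumerate.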
The work described in chapter 5 is focused on the regions in bacterial genomes that are involved in gene regulation, the promoters. Intrigued by anecdotal evidence that duplication of bacterial promoters can activate or silence genes, we investigated how often promoter duplication occurs in bacterial genomes. Using the large number of bacterial genomes that are currently available, we looked for clusters of highly similar promoter regions. Since duplication assumes some sort of mobility, we termed the duplicated promoters: putative mobile promoters or PMPs. We found over 4,000 clusters of PMPs in 1,043 genomes. Most of the clusters consist of two members, indicating a single duplication event, but we also found much larger clusters of PMPs within some genomes. A number of PMPs are present in multiple species, even in very distantly related bacterial species, suggesting perhaps that these were subjected to horizontal gene transfer. The mobile promoters could play an important role in the rapid rewiring of gene regulatory networks.
Chapter 6 discusses how current biological research can adapt to make full use of the opportunities offered by the high-throughput technologies by following three different approaches. The first approach empowers biologists with user-friendly software that allows them to analyse the large volumes of genome-scale data without requiring expert computer skills. In the second approach the biologist teams up with a bioinformatician to combine in-depth biological knowledge with expert computational skills. The third approach combines the biologist and the bioinformatician in one person by teaching the biologist computational skills. Each of these three approaches has its merits and shortcomings, so I do not expect any of them to become dominant in the near future. Looking further ahead, it seems inevitable that any biologist will have to learn at least the basics of computational methods and that this should be an integral part of biology education. Bioinformatics might in time cease to exist as a separate field and instead become an intrinsic aspect of most biological research disciplines.
Bioinformatics assisted breeding, from QTL to candidate genes
Chibon, P.Y. - 2013
Wageningen University. Promotor(en): Richard Visser, co-promotor(en): Richard Finkers. - S.l. : s.n. - ISBN 9789461737366 - 149
plant breeding - bioinformatics - molecular breeding - marker assisted breeding - quantitative trait loci - genetic mapping - data processing - ontologies
Over the last decade, sequencing throughput has grown to the point where a single run of an NGS sequencer generates more data than days of work with Sanger sequencing. Metabolomics, proteomics and transcriptomics technologies have also evolved, producing more and more information at an ever faster rate. In addition, the number of databases available to biologists and breeders is increasing every year. The challenge for them is two-fold: to cope with the increased amount of data produced by these new technologies and to cope with the distribution of the information across the Web. An example of a study with a lot of ~omics data is described in Chapter 2, where more than 600 peaks were measured using liquid chromatography mass-spectrometry (LCMS) in peel and flesh of a segregating F1 apple population. In total, 669 mQTL were identified in this study. The number of mQTL identified is vast and almost overwhelming. Extracting meaningful information from such an experiment requires appropriate data filtering and data visualization techniques. The visualization of the distribution of the mQTL on the genetic map led to the discovery of QTL hotspots on linkage groups 1, 8, 13 and 16. The mQTL hotspot on linkage group 16 was further investigated and mainly contained compounds involved in the phenylpropanoid pathway. The apple genome sequence and its annotation were used to gain insight into genes potentially regulating this QTL hotspot. This led to the identification of the structural gene leucoanthocyanidin reductase (LAR1) as well as seven genes encoding transcription factors as putative candidates regulating the phenylpropanoid pathway, and thus candidates for the biosynthesis of health-beneficial compounds. However, this study also indicated bottlenecks in the availability of biologist-friendly tools to visualize large-scale QTL mapping results and of smart ways to mine genes underlying QTL intervals.
In this thesis, we provide bioinformatics solutions that allow regions of interest on the genome to be explored more efficiently. In Chapter 3, we describe MQ2, a tool to visualize results of large-scale QTL mapping experiments. It allows biologists and breeders to use their favorite QTL mapping tool, such as MapQTL or R/qtl, and then visualize with MQ2 how the resulting QTL are distributed over the genetic map used in the analysis. MQ2 provides the distribution of the QTL over the markers of the genetic map for a few hundred traits. MQ2 is accessible online via its web interface but can also be used locally via its command line interface. In Chapter 4, we describe Marker2sequence (M2S), a tool to filter out genes of interest from all the genes underlying a QTL. M2S returns the list of genes for a specific genome interval and provides a search function to filter out genes related to the provided keyword(s) by their annotation. Genome annotations often contain cross-references to resources such as the Gene Ontology (GO) or proteins of the UniProt database. Via these annotations, additional information can be gathered about each gene. By integrating information from different resources and offering a way to mine the list of genes present in a QTL interval, M2S provides a way to reduce a list of hundreds of genes to possibly tens of genes or fewer that are potentially related to the trait of interest. Using semantic web technologies, M2S integrates multiple resources and has the flexibility to extend this integration to more resources as they become available to these technologies.
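The two-step filtering described for M2S (first restrict to the genes inside a genome interval, then keep those whose annotation matches a keyword) can be sketched as follows. This is only an illustration in plain Python, not the M2S implementation, which the abstract says is built on semantic web technologies; the `Gene` class, the function names, the gene identifiers and the coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    gene_id: str      # hypothetical identifier
    chromosome: str
    start: int
    end: int
    annotation: str   # free-text functional annotation


def genes_in_interval(genes, chromosome, start, end):
    """Return the genes that overlap [start, end] on the given chromosome."""
    return [
        g for g in genes
        if g.chromosome == chromosome and g.start <= end and g.end >= start
    ]


def filter_by_keywords(genes, keywords):
    """Keep genes whose annotation contains any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [g for g in genes if any(k in g.annotation.lower() for k in lowered)]


# Toy data, invented for this example only.
GENES = [
    Gene("gene_a", "chr16", 1_000, 2_500, "leucoanthocyanidin reductase"),
    Gene("gene_b", "chr16", 3_000, 4_200, "MYB transcription factor"),
    Gene("gene_c", "chr16", 9_000, 9_800, "ribosomal protein"),
    Gene("gene_d", "chr01", 1_500, 2_000, "anthocyanidin synthase"),
]

in_qtl = genes_in_interval(GENES, "chr16", 500, 5_000)
candidates = filter_by_keywords(in_qtl, ["anthocyanidin"])
print([g.gene_id for g in candidates])
```

A plain substring match like this one misses genes that are linked to the trait only through cross-referenced resources such as GO or UniProt, which is the gap the integration described above is meant to close.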
Besides the importance of efficient bioinformatics tools to analyze and visualize data, the work in Chapter 2 also revealed the importance of regulatory elements controlling key genes of pathways. The limitation of M2S is that it only considers genes within the interval. In genome annotations, transcription factors are not linked to the trait (keyword) or to the genes they control, and these relationships will therefore not be considered. By integrating information about the gene regulatory network of the organism into Marker2sequence, it should be able to include in its list genes that lie outside the QTL interval but are regulated by elements present within it. In tomato, the genome annotation already lists a number of transcription factors; however, it does not provide any information about their targets. In Chapter 5, we describe how we combined transcriptomics information with six genotypes from an Introgression Line (IL) population to find genes that are differentially expressed while being in a genomic background similar to that of the reference genotype (with no introgression), i.e. outside of any introgression segments. These genes may be differentially expressed as a result of a regulatory element present in an introgression. The promoter regions of these genes have been analyzed for DNA motifs, and putative transcription factor binding sites have been found.
The approaches taken in M2S (Chapter 4) are focused on a specific region of the genome, namely the QTL interval. In Chapter 6, we generalized this approach to develop Annotex. Annotex provides a simple way to browse the cross-references existing between biological databases (ChEBI, Rhea, UniProt, GO) and genome annotations. The main concept of Annotex is that, from any type of data present in the databases, one can navigate the cross-references to retrieve the desired type of information.
This thesis has resulted in three tools that biologists and breeders can use to speed up their research and to build new hypotheses on. This thesis also revealed the state of bioinformatics with regard to data integration. It reveals the need to integrate into annotations (for example, genome annotations, protein annotations, and pathway annotations) more ontologies than just the Gene Ontology (GO) currently used. Multiple platforms are arising to build these new ontologies, but the process of integrating them into existing resources remains to be done. It also confirms the state of the data in plants, where multiple resources may contain overlapping information. Finally, this thesis shows what can be achieved when the data is made interoperable, which should be an incentive for the community to work together and build interoperable, non-overlapping resources, creating a bioinformatics Web for plant research.
Vergelijkende genoomanalyse geeft inzicht in de evolutie en biologie van pathogene oömyceten
Seidl, M.F. ; Govers, F. - \ 2013
Gewasbescherming 44 (2013)4. - ISSN 0166-6495 - p. 109 - 112.
genomica - oömyceten - pathogenen - biologie - evolutie - moleculaire genetica - plantenziekteverwekkende schimmels - bio-informatica - plantenziekten - phytophthora infestans - genomics - oomycetes - pathogens - biology - evolution - molecular genetics - plant pathogenic fungi - bioinformatics - plant diseases
Although oomycetes entered the genomics era only recently, the new ‘-omics’ techniques have already led to an abundance of quantitative data. Comparative and integrative genomics is crucial for unlocking this treasure trove of data. In the thesis ‘Exploring Evolution and Biology of Oomycetes: Integrative and Comparative Genomics’, the first steps were successfully taken to use these data to further unravel the evolution and biology of oomycetes, and this has already led to valuable new insights.
From existing data to novel hypotheses : design and application of structure-based Molecular Class Specific Information Systems
Kuipers, R.K.P. - \ 2012
Wageningen University. Promotor(en): Vitor Martins dos Santos; G. Vriend, co-promotor(en): Peter Schaap. - S.l. : s.n. - ISBN 9789461733504 - 231
systeembiologie - bio-informatica - genomica - informatiesystemen - computerwetenschappen - databanken - datamining - eiwitten - eiwitexpressieanalyse - systems biology - bioinformatics - genomics - information systems - computer sciences - databases - data mining - proteins - proteomics
As the active component of many biological systems, proteins are of great interest to life scientists. Proteins are used in a large number of different applications such as the production of precursors and compounds, for bioremediation, as drug targets, to diagnose patients suffering from genetic disorders, etc. Many research projects have therefore focused on the characterization of proteins and on improving the understanding of the functional and mechanistic properties of proteins. Studies have examined folding mechanisms, reaction mechanisms, stability under stress, effects of mutations, etc. All these research projects have resulted in an enormous amount of available data in lots of different formats that are difficult to retrieve, combine, and use efficiently.
The main topic of this thesis is the 3DM platform that was developed to generate Molecular Class Specific Information Systems (3DM systems) for protein superfamilies. These superfamily systems can be used to collect and interlink heterogeneous data sets based on structure-based multiple sequence alignments. 3DM systems can be used to integrate protein, structure, mutation, reaction, conservation, correlation, contact, and many other types of data. Data is visualized using websites, directly in protein structures using YASARA, and in literature using Utopia Documents. 3DM systems contain a number of modules that can be used to analyze superfamily characteristics, namely Comulator for correlated mutation analyses, Mutator for mutation retrieval, and Validator for mutant pathogenicity prediction. A powerful filtering mechanism is available to determine the characteristics of subsets of proteins and to compare the characteristics of different subsets. 3DM systems can be used as a central knowledge base for projects in protein engineering, DNA diagnostics, and drug design.
The scientific and technical background of the 3DM platform is described in the first two chapters. Chapter 1 describes the scientific background, starting with an overview of the foundations of the 3DM platform. Alignment methods and tools for both structure and sequence alignments, and the techniques used in the 3DM modules are described in detail. Alternative methods are also described with the advantages and disadvantages of the various strategies. Chapter 2 contains a technical description of the implementation of the 3DM platform and the 3DM modules. A schematic overview of the database used to store the data is provided together with a description of the various tables and the steps required to create new 3DM systems. The techniques used in the Comulator, Mutator and Validator modules of the 3DM platforms are discussed in more detail.
Chapter 3 contains a concise overview of the 3DM platform, its capabilities, and the results of protein engineering projects using 3DM systems. Thirteen 3DM systems were generated for superfamilies such as the PEPM/ICL and Nuclear Receptors. These systems are available online for further examination. Protein engineering studies aimed at optimizing substrate specificity, enzyme activity, or thermostability were designed targeting proteins from these superfamilies. Preliminary results of drug design and DNA diagnostics projects are also included to highlight the diversity of projects 3DM systems can be applied to.
Project HOPE, a biomedical tool to predict the effect of a mutation on the structure of a protein, is described in chapter 4. Project HOPE was developed at the Radboud University Nijmegen Medical Center under supervision of H. Venselaar. Project HOPE employs webservices to optimally reuse existing databases and computing facilities. After selection of a mutant in a protein, data is collected from various sources such as UniProt and PISA. A homology model is created to determine features such as contacts and side-chain accessibility directly in the structure. Using a decision tree, the available data is evaluated to predict the effects of the mutation on the protein.
Chapter 5 describes Comulator, the 3DM module for correlated mutation analyses. Two positions in an alignment correlate when they co-evolve, that is, when they mutate simultaneously or not at all. Comulator uses a statistical coupling algorithm to detect correlated mutations. Correlated mutations are visualized using heatmaps, or directly in protein structures using YASARA. Analyses of correlated mutations in various superfamilies showed that positions that correlate are often found in networks and that the positions in these networks often share a common function. Using these networks, mutants were predicted that should increase the specificity or activity of proteins. Mutational studies confirmed that correlated mutation analyses are a valuable tool for rational design of proteins.
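The idea that two alignment positions correlate when they "mutate simultaneously or not at all" can be illustrated with a simple co-variation score. The sketch below uses mutual information between two alignment columns as a stand-in; it is not the statistical coupling algorithm used by Comulator, whose details are not given here, and the toy alignment is invented.

```python
from collections import Counter
from math import log2


def column(alignment, index):
    """Extract one column (one alignment position) from equal-length sequences."""
    return [sequence[index] for sequence in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_joint = count / n
        p_independent = (freq_a[a] / n) * (freq_b[b] / n)
        mi += p_joint * log2(p_joint / p_independent)
    return mi


# Toy alignment: positions 0 and 1 always change together,
# position 2 varies independently of position 0.
ALIGNMENT = ["AKL", "AKV", "GRL", "GRV"]

mi_coupled = mutual_information(column(ALIGNMENT, 0), column(ALIGNMENT, 1))
mi_independent = mutual_information(column(ALIGNMENT, 0), column(ALIGNMENT, 2))
print(mi_coupled, mi_independent)
```

Scoring every pair of positions this way gives a matrix that can be drawn as a heatmap, comparable in spirit to the visualization mentioned above.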
Mutator, the text mining tool used to incorporate mutations into 3DM systems, is described in chapter 6. Mutator was designed to automatically retrieve mutations from literature and store these mutations in a 3DM system. A PubMed search using keywords from the 3DM system is used to preselect articles of interest. These articles are retrieved from the internet, converted to text, and parsed for mutations. Mutations are then grounded to proteins and stored in a 3DM database. Mutation retrieval was tested on the alpha-amylase superfamily as this superfamily contains the enzyme involved in Fabry’s disease, an X-linked lysosomal storage disease. Compared to existing mutant databases, such as the HGMD and SwissProt, Mutator retrieved 30% more mutations from literature. A major problem in DNA diagnostics is the differentiation between natural variants and pathogenic mutations. To distinguish between pathogenic mutations and natural variation in proteins, the Validator module was added to 3DM. Validator uses the data available in a 3DM system to predict the pathogenicity of a mutant using, for example, the residue conservation of the mutant’s alignment position, the side-chain accessibility of the mutant in the structure, and the number of mutations found in literature for the alignment position. Mutator and Validator can be used to study mutants found in disorder-related genes. Although these tools are not the definitive solution for DNA diagnostics, they can hopefully be used to increase our understanding of the molecular basis of disorders.
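The step in which article text is "parsed for mutations" can be sketched with regular expressions that recognize point mutations written as wild-type residue, position, mutant residue (for example `R112H` or `Ala143Thr`). This is a minimal illustration and not Mutator's actual parser; the patterns, the function name and the example sentence are assumptions made for this sketch, and grounding the mutations to a specific protein is not covered.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_LETTER_CODES = "ACDEFGHIKLMNPQRSTVWY"

# e.g. "R112H"
ONE_LETTER = re.compile(rf"\b([{ONE_LETTER_CODES}])(\d+)([{ONE_LETTER_CODES}])\b")
# e.g. "Ala143Thr"
THREE_LETTER = re.compile(r"\b([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\b")


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples in order of appearance."""
    found = []
    for match in ONE_LETTER.finditer(text):
        found.append((match.start(), match.group(1), int(match.group(2)), match.group(3)))
    for match in THREE_LETTER.finditer(text):
        wild_type = THREE_TO_ONE.get(match.group(1))
        mutant = THREE_TO_ONE.get(match.group(3))
        if wild_type and mutant:  # skip tokens that are not amino acid codes
            found.append((match.start(), wild_type, int(match.group(2)), mutant))
    return [(wt, pos, mut) for _, wt, pos, mut in sorted(found)]


# Invented example sentence.
SENTENCE = "The R112H and Ala143Thr substitutions reduced activity."
print(find_point_mutations(SENTENCE))
```

Patterns this permissive will also pick up tokens that merely look like mutations, which is one reason a later validation step is useful.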
Chapters 7 and 8 describe applied research projects using 3DM systems containing proteins of potential commercial interest. A 3DM system for the α/β-hydrolase superfamily is described in chapter 7. This superfamily consists of almost 20,000 proteins with a diverse range of functions. Superfamily alignments were generated for the common beta-barrel fold shared by all superfamily members, and for five distinct subtypes within the superfamily. Due to the size and functional diversity of the superfamily, there is a lot of potential for industrial application of superfamily members. Chapter 8 describes a study focusing on a sucrose phosphorylase enzyme from the α-amylase superfamily. This enzyme could potentially be used in an industrial setting for the transfer of glucose to a wide variety of molecules. The aim of the study was to increase the stability of the protein at higher temperatures. A combination of rational design using a 3DM system and in-depth study of the protein structure led to a series of mutations that more than doubled the half-life of the protein at 60°C.
3DM systems have been successfully applied in a wide range of protein engineering and DNA diagnostics studies. Currently, 3DM systems are applied most successfully in projects studying a single protein family or monogenetic disorder. In the future, we hope to be able to apply 3DM to more complex scenarios such as enzyme factories and polygenetic disorders by combining multiple 3DM systems for interacting proteins.
Multiplex SSR analysis of Phytophthora infestans in different countries and the importance for potato breeding
Li, Y. - \ 2012
Wageningen University. Promotor(en): Evert Jacobsen, co-promotor(en): Theo van der Lee; D.E.L. Cooke. - S.l. : s.n. - ISBN 9789461732798 - 206
solanum tuberosum - aardappelen - plantenveredeling - plantenziekteverwekkende schimmels - phytophthora infestans - microsatellieten - populaties - ziekteresistentie - genetische merkers - moleculaire merkers - bio-informatica - genomica - plant-microbe interacties - solanum tuberosum - potatoes - plant breeding - plant pathogenic fungi - phytophthora infestans - microsatellites - populations - disease resistance - genetic markers - molecular markers - bioinformatics - genomics - plant-microbe interactions
Potato is the most important non-cereal crop in the world. Late blight, caused by the oomycete pathogen Phytophthora infestans, is the most devastating disease of potato. In the mid-19th century, P. infestans attacked the European potato fields, which resulted in a widespread famine in Ireland and other parts of Europe. Late blight remains the most important disease of potato and causes a yearly multi-billion US dollar loss globally. In Europe and North America, late blight control relies heavily on the use of chemicals, which are hardly affordable to farmers in developing countries and also raise considerable environmental concerns in the developed countries.
Genome bioinformatics of tomato and potato
Datema, E. - \ 2011
Wageningen University. Promotor(en): W. Stiekema, co-promotor(en): Roeland van Ham. - [S.l.] : S.n. - ISBN 9789461730473 - 139
gewassen - solanum lycopersicum - solanum tuberosum - genomica - bio-informatica - nucleotidenvolgordes - genomen - genen - crops - solanum lycopersicum - solanum tuberosum - genomics - bioinformatics - nucleotide sequences - genomes - genes
In the past two decades genome sequencing has developed from a laborious and costly technology employed by large international consortia to a widely used, automated and affordable tool used worldwide by many individual research groups. Genome sequences of many food animals and crop plants have been deciphered and are being exploited for fundamental research and applied to improve their breeding programs. The developments in sequencing technologies have also impacted the associated bioinformatics strategies and tools, both those that are required for data processing, management, and quality control, and those used for interpretation of the data.
This thesis focuses on the application of genome sequencing, assembly and annotation to two members of the Solanaceae family, tomato and potato. Potato is the economically most important species within the Solanaceae, and its tubers contribute to dietary intake of starch, protein, antioxidants, and vitamins. Tomato fruits are the second most consumed vegetable after potato, and are a globally important dietary source of lycopene, beta-carotene, vitamin C, and fiber. The chapters in this thesis document the generation, exploitation and interpretation of genomic sequence resources for these two species and shed light on the contents, structure and evolution of their genomes.
Chapter 1 introduces the concepts of genome sequencing, assembly and annotation, and explains the novel genome sequencing technologies that have been developed in the past decade. These so-called Next Generation Sequencing platforms display considerable variation in chemistry and workflow, and as a consequence the throughput and data quality differ by orders of magnitude between the platforms. The currently available sequencing platforms produce a wide variety of read lengths and facilitate the generation of paired sequences with an approximately fixed distance between them. The choice of sequencing chemistry and platform, combined with the type of sequencing template, demands specifically adapted bioinformatics for data processing and interpretation. Irrespective of the sequencing and assembly strategy that is chosen, the resulting genome sequence, often represented by a collection of long linear strings of nucleotides, is of limited interest by itself. Interpretation of the genome can only be achieved through sequence annotation, that is, the identification and classification of all functional elements in a genome sequence. Once these elements have been annotated, sequence alignments between multiple genomes of related accessions or species can be utilized to reveal the genetic variation on both the nucleotide and the structural level that underlies the difference between these species or accessions.
Chapter 2 describes BlastIf, a novel software tool that exploits sequence similarity searches with BLAST to provide a straightforward annotation of long nucleotide sequences. Generally, two problems are associated with the alignment of a long nucleotide sequence to a database of short gene or protein sequences: (i) the large number of similar hits that can be generated due to database redundancy; and (ii) the relationships implied between aligned segments within a hit that in fact correspond to distinct elements on the sequence such as genes. BlastIf generates a comprehensible BLAST output for long nucleotide sequences by reducing the number of similar hits while revealing most of the variation present between hits. It is a valuable tool for molecular biologists who wish to get a quick overview of the genetic elements present in a newly sequenced segment of DNA, prior to more elaborate efforts of gene structure prediction and annotation.
In Chapter 3 a first genome-wide comparison between the emerging genomic sequence resources of tomato and potato is presented. Large collections of BAC end sequences from both species were annotated through repeat searches, transcript alignments and protein domain identification. In-depth comparisons of the annotated sequences revealed remarkable differences in both gene and repeat content between these closely related genomes. The tomato genome was found to be more repetitive than the potato genome, and substantial differences in the distribution of Gypsy and Copia retrotransposable elements as well as microsatellites were observed between the two genomes. A higher gene content was identified in the potato sequences, and in particular several large gene families including cytochrome P450 mono-oxygenases and serine-threonine protein kinases were significantly overrepresented in potato compared to tomato. Moreover, the cytochrome P450 gene family was found to be expanded in both tomato and potato when compared to Arabidopsis thaliana, suggesting an expanded network of secondary metabolic pathways in the Solanaceae. Together these findings present a first glimpse into the evolution of Solanaceous genomes, both within the family and relative to other plant species.
Chapter 4 explores the physical and genetic organization of tomato chromosome 6 through integration of BAC sequence analysis, High Information Content Fingerprinting, genetic analysis, and BAC-FISH mapping data. A collection of BACs spanning substantial parts of the short and long arm euchromatin and several dispersed regions of the pericentromeric heterochromatin were sequenced and assembled into several tiling paths spanning approximately 11 Mb. Overall, the cytogenetic order of BACs was in agreement with the order of BACs anchored to the Tomato EXPEN 2000 genetic map, although a few striking discrepancies were observed. The integration of BAC-FISH, sequence and genetic mapping data furthermore provided a clear picture of the borders between eu- and heterochromatin on chromosome 6. Annotation of the BAC sequences revealed that, although the majority of protein-coding genes were located in the euchromatin, the highly repetitive pericentromeric heterochromatin displayed an unexpectedly high gene content. Moreover, the short arm euchromatin was relatively rich in repeats, but the ratio of Gypsy and Copia retrotransposons across the different domains of the chromosome clearly distinguished euchromatin from heterochromatin. The ongoing whole-genome sequencing effort will reveal whether these properties are unique to tomato chromosome 6 or a more general property of the tomato genome.
Chapter 5 presents the potato genome, the first genome sequence of an Asterid. To overcome the problems associated with genome assembly due to the high level of heterozygosity that is observed in commercial tetraploid potato varieties, a homozygous doubled-monoploid potato clone was exploited to sequence and assemble 86% of the 844 Mb genome. This potato reference genome sequence was complemented with re-sequencing of a heterozygous diploid clone, revealing the form and extent of sequence polymorphism both between different genotypes and within a single heterozygous genotype. Gene presence/absence variants and other potentially deleterious mutations were found to occur frequently in potato and are a likely cause of inbreeding depression. Annotation of the genome was supported by deep transcriptome sequencing of both the doubled-monoploid and the heterozygous potato, resulting in the prediction of more than 39,000 protein-coding genes. Transcriptome analysis provided evidence for the contribution of gene family expansion, tissue-specific expression, and recruitment of genes to new pathways to the evolution of tuber development. The sequence of the potato genome has provided new insights into Eudicot genome evolution and a solid basis for the elucidation of the evolution of tuberisation. Many traits of interest to plant breeders are quantitative in nature, and the potato sequence will simplify both their characterization and deployment to generate novel cultivars.
The outstanding challenges in plant genome sequencing are addressed in Chapter 6. The high concentration of repetitive elements and the heterozygosity and polyploidy of many interesting crop plant species currently pose a barrier to the efficient reconstruction of their genome sequences. Nonetheless, the completion of a large number of new genome sequences in recent years and the ongoing advances in sequencing technology provide many exciting opportunities for plant breeding and genome research. Current sequencing platforms are being continuously updated and improved, and novel technologies are being developed and implemented in third-generation sequencing platforms that sequence individual molecules without need for amplification. While these technologies create exciting opportunities for new sequencing applications, they also require robust software tools to process the data they produce efficiently. The ever-increasing number of available genome sequences creates the need for an intuitive platform for the automated and reproducible interrogation of these data in order to formulate new biologically relevant questions on datasets spanning hundreds or thousands of genome sequences.
Bayesian Markov random field analysis for integrated network-based protein function prediction
Kourmpetis, Y.I.A. - \ 2011
Wageningen University. Promotor(en): Cajo ter Braak, co-promotor(en): Roeland van Ham. - [S.l.] : S.n. - ISBN 9789085859598 - 113
statistiek - bayesiaanse theorie - markov-processen - netwerkanalyse - biostatistiek - toegepaste statistiek - bio-informatica - eiwitten - genen - moleculaire biologie - statistics - bayesian theory - markov processes - network analysis - biostatistics - applied statistics - bioinformatics - proteins - genes - molecular biology
Unravelling the functions of proteins is one of the most important aims of modern biology. Experimental inference of protein function is expensive and not scalable to large datasets. In this thesis a probabilistic method for protein function prediction is presented that integrates different types of data such as sequences and networks. The method is based on Bayesian Markov Random Field (BMRF) analysis. BMRF was initially applied to genome-wide protein function prediction using network data in yeast, and also in Arabidopsis by integrating protein domains (i.e. InterPro signatures), expression data and protein-protein interactions. Several of the predictions were confirmed by experimental evidence. Further, an evolutionary discrete optimization algorithm is presented that integrates function predictions from different Gene Ontology (GO) terms into a single prediction that is consistent with the True Path Rule as imposed by the GO Directed Acyclic Graph. This integration leads to predictions that are easy to interpret. Evaluation of this algorithm using Arabidopsis data showed that the prediction performance is improved compared to single GO term predictions.
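The True Path Rule says that a protein annotated with a GO term is implicitly annotated with all ancestors of that term, so a consistent set of predictions never scores a term higher than any of its ancestors. The sketch below shows the simplest way to restore that consistency, by raising each ancestor's score to at least the score of its descendants. It is deliberately much simpler than the evolutionary discrete optimization algorithm described above and is not taken from the thesis; the term names, the toy graph and the scores are hypothetical.

```python
def enforce_true_path_rule(scores, parents):
    """Return scores in which every ancestor scores at least as high as its descendants.

    scores:  dict mapping term -> predicted score for one protein
    parents: dict mapping term -> list of direct parent terms (a DAG)
    """
    consistent = dict(scores)

    def raise_ancestors(term, score):
        for parent in parents.get(term, ()):
            if consistent.get(parent, 0.0) < score:
                consistent[parent] = score
                raise_ancestors(parent, score)

    for term, score in scores.items():
        raise_ancestors(term, score)
    return consistent


# Toy ontology: dna_binding -> binding -> root, catalysis -> root.
PARENTS = {
    "binding": ["root"],
    "dna_binding": ["binding"],
    "catalysis": ["root"],
}
# Independent per-term predictions that violate the rule:
# "dna_binding" scores higher than its ancestors.
RAW_SCORES = {"root": 0.6, "binding": 0.3, "dna_binding": 0.8, "catalysis": 0.2}

consistent_scores = enforce_true_path_rule(RAW_SCORES, PARENTS)
print(consistent_scores)
```

Upward propagation only ever raises scores; an optimization-based approach can instead weigh raising an ancestor against lowering a descendant.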
The biology of plant metabolomics
Hall, R.D. - \ 2011
Oxford : Blackwell/Wiley (Annual plant reviews vol. 43) - ISBN 9781405199544 - 420
planten - metabolomica - genexpressieanalyse - plantenfysiologie - genetica - statistiek - bio-informatica - plants - metabolomics - genomics - plant physiology - genetics - statistics - bioinformatics
Following a general introduction, this book includes details of metabolomics of model species including Arabidopsis and tomato. Further chapters provide in-depth coverage of abiotic stress, data integration, systems biology, genetics, genomics, chemometrics and biostatistics. Applications of plant metabolomics in food science, plant ecology and physiology are also comprehensively covered.
Slimme voedselproductie met ICT
Wolfert, J. - \ 2010
Kennis Online 8 (2010)dec. - p. 9 - 9.
informatietechnologie - bio-informatica - informatica - internet - landbouwtechniek - technische vooruitgang - information technology - bioinformatics - informatics - agricultural engineering - technical progress
Better applications of ICT give Europe an economic edge. That is the idea behind the EU project Future Internet. Wageningen UR coordinates the agrifood component. ‘There are major questions in areas such as online data protection,’ says Sjaak Wolfert of LEI, part of Wageningen UR.
Bioinformatics' approaches to detect genetic variation in whole genome sequencing data
Kerstens, H.H.D. - \ 2010
Wageningen University. Promotor(en): Martien Groenen; Mari Smits. - [S.l.] : S.n. - ISBN 9789085857808 - 182
bio-informatica - genomen - nucleotidenvolgordes - genetische variatie - varkens - kalkoenen - kippen - anas platyrhynchos - dierveredeling - genexpressieanalyse - single nucleotide polymorphism - marker assisted breeding - bioinformatics - genomes - nucleotide sequences - genetic variation - pigs - turkeys - fowls - anas platyrhynchos - animal breeding - genomics - single nucleotide polymorphism - marker assisted breeding
Current genetic marker repositories are insufficient or even completely lacking for most farm animals. However, genetic markers are essential for the development of a research tool facilitating discovery of genetic factors that contribute to resistance to disease and the overall welfare and performance in farm animals.
By large-scale identification of Single Nucleotide Polymorphisms (SNPs) and Structural Variants (SVs) we aimed to contribute to the development of a repository of genetic variants for farm animals. For this purpose, bioinformatics data pipelines were designed and validated to address the challenge of the cost-effective identification of genetic markers in DNA sequencing data, even in the absence of a fully sequenced reference genome.
To find SNPs in pig, we analysed publicly available whole genome shotgun sequencing datasets by sequence alignment and clustering. Sequence clusters were assigned to genomic locations using publicly available BAC sequencing and BAC mapping data. Within the sequence clusters thousands of SNPs were detected of which the genomic location is roughly known.
For turkey and duck, two species that lacked a sufficient sequence data repository for variant discovery, we applied next-generation sequencing (NGS) to a reduced genome representation of a pooled DNA sample. For turkey, a genome reference was reconstructed from our sequencing data and available public sequencing data, whereas for duck the reference genome constructed by an NGS project was used. SNPs obtained by our cost-effective SNP detection procedure still turned out to cover, at intervals, the whole turkey and duck genomes and are of sufficient quality to be used in genotyping studies. Allele frequencies, obtained by genotyping animal panels with a subset of our SNPs, correlated well with those observed during SNP detection. The availability of two external duck SNP datasets allowed for the construction of a subset of SNPs which we had in common with these sets. Genotyping showed that this subset was of outstanding quality and can be used for benchmarking other SNPs that we identified within duck.
Ongoing developments in NGS allowed for paired-end sequencing, an extension of sequencing analysis that provides information about which pairs of reads come from the outer ends of one sequenced DNA fragment. We applied this technique to a reduced genome representation of four chicken breeds to detect SVs. Paired-end reads were mapped to the chicken reference genome, and SVs were identified as abnormally aligned read pairs whose orientation or span size is discordant with the reference genome. SV detection parameters, to distinguish true structural variants from false positives, were designed and optimized by validation of a small representative sample of SVs using PCR and traditional capillary sequencing.
To conclude, we developed SNP repositories that fulfil the requirement for SNPs to perform linkage analysis, comparative genomics, QTL studies and ultimately GWA studies in a range of farm animals. We also took the first step in developing a repository for SVs in chicken, a relatively new genetic marker in animal sciences.
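The principle of flagging read pairs whose orientation or span is discordant with the reference can be sketched as below. This is an illustrative simplification rather than the pipeline used in the thesis: it assumes both reads map to the same chromosome, assumes a library in which a normal pair maps as forward read then reverse read, and approximates the span by the distance between the two mapping positions. The `ReadPair` class, the labels and the span thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    name: str
    pos1: int       # mapping position of the first read
    strand1: str    # "+" or "-"
    pos2: int       # mapping position of the second read
    strand2: str


def classify_pair(pair, min_span, max_span):
    """Label a mapped read pair as concordant or as one type of discordance."""
    # Order the two reads by mapping position so the check is symmetric.
    left, right = sorted([(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)])
    span = right[0] - left[0]
    if (left[1], right[1]) != ("+", "-"):
        return "discordant_orientation"   # candidate inversion or similar
    if span > max_span:
        return "span_too_large"           # candidate deletion relative to the reference
    if span < min_span:
        return "span_too_small"           # candidate insertion relative to the reference
    return "concordant"


# Invented read pairs; thresholds would normally come from the
# observed span distribution of the sequencing library.
PAIRS = [
    ReadPair("p1", 100, "+", 400, "-"),
    ReadPair("p2", 100, "+", 5100, "-"),
    ReadPair("p3", 100, "+", 400, "+"),
    ReadPair("p4", 400, "-", 100, "+"),
]

labels = {p.name: classify_pair(p, min_span=200, max_span=500) for p in PAIRS}
print(labels)
```

A single discordant pair is weak evidence, so a practical pipeline would require several pairs supporting the same event, in line with the parameter optimization and validation described above.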
Functional Analysis of Cladosporium fulvum Effector Catalog
Ökmen, B. ; Hollander, M. de; Stergiopoulos, I. ; Burg, H.A. van den; Wit, P.J.G.M. de - \ 2010
Gewasbescherming 41 (2010)3. - ISSN 0166-6495 - p. 149 - 150.
pathogenesis-related proteins - dna sequencing - genome analysis - passalora fulva - solanum lycopersicum - bioinformatics - genes - plant-microbe interactions
The DNA sequence of the Cladosporium fulvum genome was recently determined. The main aim of this effort is the identification and characterization of new effectors.
Graph-based methods for large-scale protein classification and orthology inference
Kuzniar, A. - \ 2009
Wageningen University. Promotor(en): Jack Leunissen, co-promotor(en): Roeland van Ham; S. Pongor. - [S.l.] : S.n. - ISBN 9789085855019 - 139
bioinformatics - proteins - classification - algorithms - graphs - evolution
The quest for understanding how proteins evolve and function has been a prominent and costly human endeavor. With advances in genomics and use of bioinformatics tools, the diversity of proteins in present day genomes can now be studied more efficiently than ever before. This thesis describes computational methods suitable for large-scale protein classification of many proteomes of diverse species. Specifically, we focus on methods that combine unsupervised learning (clustering) techniques with the knowledge of molecular phylogenetics, particularly that of orthology. In chapter 1 we introduce the biological context of protein structure, function and evolution, review the state-of-the-art sequence-based protein classification methods, and then describe methods used to validate the predictions. Finally, we present the outline and objectives of this thesis. Evolutionary (phylogenetic) concepts are instrumental in studying subjects as diverse as the diversity of genomes, cellular networks, protein structures and functions, and functional genome annotation. In particular, the detection of orthologous proteins (genes) across genomes provides reliable means to infer biological functions and processes from one organism to another. Chapter 2 evaluates the available computational tools, such as algorithms and databases, used to infer orthologous relationships between genes from fully sequenced genomes. We discuss the main caveats of large-scale orthology detection in general as well as the merits and pitfalls of each method in particular. We argue that establishing true orthologous relationships requires a phylogenetic approach which combines both trees and graphs (networks), reliable species phylogeny, genomic data for more than two species, and an insight into the processes of molecular evolution. Also proposed is a set of guidelines to aid researchers in selecting the correct tool. 
Moreover, this review motivates further research in developing reliable and scalable methods for functional and phylogenetic classification of large protein collections. Chapter 3 proposes a framework in which various protein knowledge bases are combined into a unique network of mappings (links), which allows comparisons to be made between expert-curated and fully automated protein classifications from a single entry point. We developed an integrated annotation resource for protein orthology, ProGMap (Protein Group Mappings, http://www.bioinformatics.nl/progmap), to help researchers and database annotators who often need to assess the coherence of proposed annotations and/or group assignments, as well as users of high-throughput methodologies (e.g., microarrays or proteomics) who deal with partially annotated genomic data. ProGMap is based on a non-redundant dataset of over 6.6 million protein sequences which is mapped to 240,000 protein group descriptions collected from UniProt, RefSeq, Ensembl, COG, KOG, OrthoMCL-DB, HomoloGene, TRIBES and PIRSF using a fast and fully automated sequence-based mapping approach. The ProGMap database is equipped with a web interface that enables queries to be made using synonymous sequence identifiers, gene symbols, protein functions, and amino acid or nucleotide sequences. It also incorporates services, namely BLAST similarity search and QuickMatch identity search, for finding sequences similar (or identical) to a query sequence, and tools for presenting the results in graphic form. Graphs (networks) have gained increasing attention in contemporary biology because they have enabled complex biological systems and processes to be modeled and better understood. For example, protein similarity networks constructed from all-versus-all sequence comparisons are frequently used to delineate similarity groups, such as protein families or orthologous groups in comparative genomics studies. Chapter 4.1 presents a benchmark study of freely available graph software used for this purpose. Specifically, the computational complexity of the programs is investigated using both simulated and biological networks. We show that most available software is not suitable for large networks, such as those encountered in large-scale proteome analyses, because of the high demands on computational resources.
To address this, we developed fast and memory-efficient graph software, netclust (http://www.bioinformatics.nl/netclust/), which can scale to large protein networks, such as those constructed from millions of proteins and sequence similarities, on a standard computer. An extended version of this program called Multi-netclust is presented in chapter 4.2. This tool can find connected clusters in data represented by different network data sets. It uses user-defined threshold values to combine the data sets in such a way that clusters connected in all or in either of the networks can be retrieved efficiently. Automated protein sequence clustering is an important task in genome annotation projects and phylogenomic studies. During the past years, several protein clustering programs have been developed for delineating protein families or orthologous groups from large sequence collections. However, most of these programs have not been benchmarked systematically, in particular with respect to the trade-off between computational complexity and biological soundness. In chapter 5 we evaluate the three best-known algorithms on different protein similarity networks and validation (or 'gold' standard) data sets to find out which one can scale to hundreds of proteomes and still delineate high-quality similarity groups at the minimum computational cost. For this, a reliable partition-based approach was used to assess the biological soundness of predicted groups using known protein functions, manually curated protein/domain families and orthologous groups available in expert-curated databases. Our benchmark results support the view that a simple and computationally cheap method such as netclust can perform similarly to, and in some cases even better than, more sophisticated yet much more costly methods. Moreover, we introduce an efficient graph-based method that can delineate protein orthologs of hundreds of proteomes into hierarchical similarity groups de novo.
The validity of this method is demonstrated on data obtained from 347 prokaryotic proteomes. The resulting hierarchical protein classification is not only in agreement with manually curated classifications but also provides an enriched framework in which the functional and evolutionary relationships between proteins can be studied at various levels of specificity. Finally, in chapter 6 we summarize the main findings and discuss the merits and shortcomings of the methods developed herein. We also propose directions for future research. The ever increasing flood of new sequence data makes it clear that we need improved tools to be able to handle and extract relevant (orthological) information from these protein data. This thesis summarizes these needs and how they can be addressed by the available tools, or be improved by the new tools that were developed in the course of this research.
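The idea of retrieving connected clusters from a thresholded similarity network, as mentioned above for netclust and Multi-netclust, can be sketched as follows. This is an illustrative union-find implementation under the assumption that edges are (protein, protein, score) triples; it is not the netclust code, and the function names and example scores are hypothetical.

```python
def find(parent, x):
    """Return the representative of x, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def connected_clusters(edges, threshold):
    """Group nodes linked by edges whose score is at least `threshold`.

    edges: iterable of (node_a, node_b, score) triples (assumed format).
    Returns a sorted list of sorted clusters.
    """
    parent = {}
    for a, b, score in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        if score >= threshold:
            root_a, root_b = find(parent, a), find(parent, b)
            if root_a != root_b:
                parent[root_a] = root_b   # merge the two clusters
    clusters = {}
    for node in parent:
        clusters.setdefault(find(parent, node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical similarity scores between five proteins.
edges = [("A", "B", 0.9), ("B", "C", 0.8), ("C", "D", 0.2), ("D", "E", 0.95)]
print(connected_clusters(edges, threshold=0.5))
```

Because each edge is visited once and only a parent map is kept in memory, this style of clustering stays cheap on large networks, which matches the trade-off between cost and biological soundness discussed in the abstract.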
Web services for transcriptomics
Neerincx, P. - \ 2009
Wageningen University. Promotor(en): Jack Leunissen. - [S.l.] : S.n. - ISBN 9789085854647 - 184
bioinformatics - internet - molecular biology - computers - data communication - data processing - transcriptomics - computer networks - microarrays - genomics - data mining
Transcriptomics is part of a family of disciplines focussing on high throughput molecular biology experiments. In the case of transcriptomics, scientists study the expression of genes resulting in transcripts. These transcripts can either perform a biological function themselves or function as messenger molecules containing a copy of the genetic code, which can be used by the ribosomes as templates to synthesise proteins. Over the past decade microarray technology has become the dominant technology for performing high throughput gene expression experiments.
A microarray contains short sequences (oligos or probes), which are the reverse complement of fragments of the targets (transcripts or sequences derived thereof). When genes are expressed, their transcripts (or sequences derived thereof) can hybridise to these probes. Many thousand copies of a probe are immobilised in a small region on a support. These regions are called spots and a typical microarray contains thousands or sometimes even more than a million spots. When the transcripts (or sequences derived thereof) are fluorescently labelled and it is known which spots are located where on the support, a fluorescent signal in a certain region represents expression of a certain gene. For interpretation of microarray data it is essential to make sure the oligos are specific for their targets. Hence for proper probe design one needs to know all transcripts that may be expressed and how well they can hybridise with candidate oligos. Therefore oligo design requires:
1. A complete reference genome assembly.
2. Complete annotation of the genome to know which parts may be transcribed.
3. Insight into the amount of natural variation in the genomes of different individuals.
4. Knowledge of how experimental conditions influence the ability of probes to hybridise with certain transcripts.
Unfortunately such complete information does not exist, but many microarrays were designed based on incomplete data nevertheless. This can lead to a variety of problems including cross-hybridisation (non-specific binding), erroneously annotated and therefore misleading probes, missing probes and orphan probes.
Fortunately the amount of information on genes and their transcripts increases rapidly. Therefore, it is possible to improve the reliability of microarray data analysis by regular updates of the probe annotation using updated databases for genomes and their annotation. Several tools have been developed for this purpose, but these either used simplistic annotation strategies or did not support our species and/or microarray platforms of interest. Therefore, we developed OligoRAP (Oligo Re-Annotation Pipeline), which is described in chapter 2. OligoRAP was designed to take advantage of, among other sources, the annotation provided by Ensembl, which is the largest genome annotation effort in the world. OligoRAP thereby supports most of the major animal model organisms including farm animals like chicken and cow. In addition to supporting our species and array platforms of interest, OligoRAP employs a new annotation strategy that combines information from genome and transcript databases in a non-redundant way to get the most complete annotation possible.
In chapter 3 we compared the annotation generated with 3 oligo annotation pipelines, including OligoRAP, and investigated the effect on functional analysis of a microarray experiment involving chickens infected with Eimeria parasites. As an example of functional analysis we investigated whether up- or downregulated genes were enriched for terms from the Gene Ontology (GO). We discovered that small differences in annotation strategy could lead to alarmingly large differences in enriched GO terms.
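To see why annotation changes can shift enrichment results, consider a minimal sketch of a GO term enrichment test. The abstract does not state which test was used; a one-sided hypergeometric test is assumed here because it is a common choice, and all gene counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, selected, overlap):
    """One-sided hypergeometric p-value for term enrichment.

    population: number of genes on the array
    annotated:  genes in the population carrying the GO term
    selected:   number of up- or downregulated genes
    overlap:    selected genes that carry the GO term
    Returns P(X >= overlap) under random sampling without replacement.
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Same regulated gene list, two hypothetical annotation versions:
# the term is attached to 50 genes in one and to 20 genes in the other.
p_old = hypergeom_enrichment_p(1000, 50, 100, 5)
p_new = hypergeom_enrichment_p(1000, 20, 100, 5)
print(p_old, p_new)
```

With the same five regulated genes carrying the term, the p-value depends strongly on how many genes the annotation assigns to that term overall, so a re-annotation can move a term across a significance cut-off.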
Therefore it is important to know which annotation strategy works best, but it was not possible to assess this due to the lack of a good reference or benchmark dataset. There are a few limited studies investigating the hybridisation potential of imperfect alignments of oligos with potential targets, but in general such data is scarce. In addition it is difficult to compare these studies due to differences in experimental setup, including different hybridisation temperatures and different probe lengths. As a result we cannot determine exact thresholds for the alignments of oligos with non-targets to prevent cross-hybridisation, but from these different studies we can get an idea of the range for the thresholds that would be required for optimal target specificity. Note that in these studies experimental conditions were first optimised for an optimal signal to noise ratio for hybridisation of oligos with targets. Then these conditions were used to determine the thresholds for alignments of oligos with non-targets to prevent cross-hybridisation.
Chapter 4 describes a parameter sweep using OligoRAP to explore hybridisation potential thresholds from a different perspective. Given the mouse genome, thresholds were determined that yield the largest number of gene-specific probes. Using those thresholds we then determined thresholds for optimal signal to noise ratios. Unfortunately the annotation-based thresholds we found did not fall within the range of experimentally determined thresholds; in fact they were not even close. Hence what was experimentally determined to be optimal for the technology was not in sync with what was determined to be optimal for the mouse genome. Further research will be required to determine whether microarray technology can be modified in such a way that it is better suited for gene expression experiments. The requirement of a priori information on possible targets and the lack of sufficient knowledge on how experimental conditions influence hybridisation potential can be considered the Achilles' heels of microarray technology.
Chapter 5 is a collection of 3 application notes describing other tools that can aid in the analysis of transcriptomics data. Firstly, RShell, which is a plugin for the Taverna workbench allowing users to execute statistical computations remotely on R-servers. Secondly, MADMAX services, which provide quality control and normalisation of microarray data for Affymetrix arrays. Finally, GeneIlluminator, which is a tool to disambiguate gene symbols, allowing researchers to specifically retrieve literature for their genes of interest even if the gene symbols for those genes have many synonyms and homonyms.
High throughput experiments like those performed in transcriptomics usually require subsequent analysis with many different tools to make biological sense of the data. Installing all these tools on a single, local computer and making them compatible so users can build analysis pipelines can be very cumbersome. Therefore distributed analysis strategies have been explored extensively over the past decades. In a distributed system providers offer remote access to tools and data via the Internet allowing users to create pipelines from modules from all over the globe.
Chapter 1 provides an overview of the evolution of web services, which represent the latest breed in technology for creating distributed systems. The major advantage of web services over older technology is that web services are programming language independent, Internet communication protocol independent and operating system independent. Therefore web services are very flexible and most of them are firewall-proof. Web services play a major role in the remaining chapters of this thesis: OligoRAP is a workflow entirely made from web services and the tools described in chapter 5 all provide remote programmatic access via web service interfaces. Although web services can be used to build relatively complex workflows like OligoRAP, a lack of mainly de facto standards and of user-friendly clients has limited the use of web services to bioinformaticians. A semantic web where biologists can easily link web services into complex workflows does n
Bayesian networks for omics data analysis
Gavai, A.K. - \ 2009
Wageningen University. Promotor(en): Jack Leunissen; Michael Muller, co-promotor(en): Guido Hooiveld; P.J.F. Lucas. - [S.l.] : S.n. - ISBN 9789085853909 - 98
bioinformatics - probabilistic models - bayesian theory - network analysis - gene expression - smoking - volatile compounds - biochemical pathways - human nutrition research - genomics - microarrays - networks - nutrigenomics
This thesis focuses on two aspects of high throughput technologies, i.e. data storage and data analysis, in particular in transcriptomics and metabolomics. Both technologies are part of a research field that is generally called ‘omics’ (or ‘-omics’, with a leading hyphen), which refers to genomics, transcriptomics, proteomics, or metabolomics. Although these techniques study different entities (genes, gene expression, proteins, or metabolites), they all have in common that they use high-throughput technologies such as microarrays and mass spectrometry, and thus generate huge amounts of data. Experiments conducted using these technologies allow one to compare different states of a living cell, for example a healthy cell versus a cancer cell or the effect of food on cell condition, and at different levels.
The tools needed to apply omics technologies, in particular microarrays, are often manufactured by different vendors and require separate storage and analysis software for the data generated by them. Moreover experiments conducted using different technologies cannot be analyzed simultaneously to answer a biological question. Chapter 3 presents MADMAX, our software system which supports storage and analysis of data from multiple microarray platforms. It consists of a vendor-independent database which is tightly coupled with vendor-specific analysis tools. Upcoming technologies like metabolomics, proteomics and high-throughput sequencing can easily be incorporated in this system.
Once the data are stored in this system, one obviously wants to deduce a biologically relevant meaning from these data, and here statistical and machine learning techniques play a key role. The aim of such analysis is to search for relationships between entities of interest, such as genes, metabolites or proteins. One of the major goals of these techniques is to search for causal relationships rather than mere correlations. It is often emphasized in the literature that "correlation is not causation" because people tend to jump to conclusions by making inferences about causal relationships when they actually only see correlations. Statistics are often good at finding these correlations; techniques called linear regression and analysis of variance form the core of applied multivariate statistics. However, these techniques cannot find causal relationships, nor are they able to incorporate prior knowledge of the biological domain. Graphical models, a machine learning technique, on the other hand do not suffer from these limitations.
Graphical models, a combination of graph theory, statistics and information science, are one of the most exciting things happening today in the field of machine learning applied to biological problems (see chapter 2 for a general introduction). This thesis deals with a special type of graphical model known as probabilistic graphical models, belief networks or Bayesian networks. The advantage of Bayesian networks over classical statistical techniques is that they allow the incorporation of background knowledge from a biological domain, and that analysis of data is intuitive as it is represented in the form of graphs (nodes and edges). Standard statistical techniques are good at describing the data but are not able to find non-linear relations, whereas Bayesian networks allow future prediction and the discovery of non-linear relations. Moreover, Bayesian networks allow hierarchical representation of data, which makes them particularly useful for representing biological data, since most biological processes are hierarchical by nature. Once we have such a causal graph, made either by a computer program or constructed manually, we can predict the effects of a certain entity by manipulating the state of other entities, or make backward inferences from effects to causes. Of course, if the graph is big, doing the necessary calculations can be very difficult and CPU-expensive, and in such cases approximate methods are used.
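The two directions of reasoning mentioned above, predicting an effect from a cause and inferring a cause from an observed effect, can be illustrated on the smallest possible network: one cause node with one effect node. The sketch below uses exact enumeration and Bayes' rule; the node meanings and all probabilities are hypothetical and are not results from the thesis.

```python
def predict_effect(p_cause, p_effect_given):
    """Forward inference: P(effect) marginalised over the cause.

    p_cause:        P(cause = True)
    p_effect_given: dict mapping cause state (True/False) to
                    P(effect = True | cause state)
    """
    return (p_cause * p_effect_given[True]
            + (1.0 - p_cause) * p_effect_given[False])


def infer_cause(p_cause, p_effect_given):
    """Backward inference: P(cause = True | effect = True) via Bayes' rule."""
    return p_cause * p_effect_given[True] / predict_effect(p_cause, p_effect_given)


# Hypothetical network: "high fat diet" -> "gene X upregulated".
prior = 0.3                              # P(diet = high fat)
cpt = {True: 0.8, False: 0.1}            # P(gene up | diet state)

print(round(predict_effect(prior, cpt), 4))   # forward: P(gene up)
print(round(infer_cause(prior, cpt), 4))      # backward: P(high fat | gene up)
```

Exact enumeration like this grows exponentially with the number of nodes, which is why the abstract notes that approximate methods are used for large graphs.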
Chapter 4 demonstrates the use of Bayesian networks to determine the metabolic state of feeding and fasting mice and the effect of a high fat diet on gene expression. This chapter also shows how selection of genes based on key biological processes generates more informative results than standard statistical tests. In chapter 5 the use of Bayesian networks is shown on the combination of gene expression data and clinical parameters, to determine the effect of smoking on gene expression and which genes are responsible for the DNA damage and the rise in plasma cotinine levels in a population of smokers. This study was conducted at Maastricht University where 22 twin smokers were profiled. Chapter 6 presents the reconstruction of a key metabolic pathway which plays an important role in the ripening of tomatoes, thus showing the versatility of Bayesian networks in metabolomics data analysis.
The general trend in research shows a flood of data emerging from sequencing and metabolomics experiments. This means that to perform data mining on these data one requires intelligent techniques that are computationally feasible and able to take the knowledge of experts into account to generate relevant results. Graphical models fit this paradigm well and we expect them to play a key role in mining the data generated from omics experiments.