The goal of this thesis is to improve the prediction of genotype to phenotypeassociations with a focus on metabolic phenotypes of prokaryotes. This goal isachieved through data integration, which in turn required the development ofsupporting solutions based on semantic web technologies. Chapter 1 providesan introduction to the challenges associated to data integration. Semantic webtechnologies provide solutions to some of these challenges and the basics ofthese technologies are explained in the Introduction. Furthermore, the ba-sics of constraint based metabolic modeling and construction of genome scalemodels (GEM) are also provided. The chapters in the thesis are separated inthree related topics: chapters 2, 3 and 4 focus on data integration based onheterogeneous networks and their application to the human pathogen M. tu-berculosis; chapters 5, 6, 7, 8 and 9 focus on the semantic web based solutionsto genome annotation and applications thereof; and chapter 10 focus on thefinal goal to associate genotypes to phenotypes using GEMs.
Chapter 2 provides the prototype of a workflow to efficiently analyze in-formation generated by different inference and prediction methods. This me-thod relies on providing the user the means to simultaneously visualize andanalyze the coexisting networks generated by different algorithms, heteroge-neous data sets, and a suite of analysis tools. As a show case, we have ana-lyzed the gene co-expression networks of M. tuberculosis generated using over600 expression experiments. Hereby we gained new knowledge about theregulation of the DNA repair, dormancy, iron uptake and zinc uptake sys-tems. Furthermore, it enabled us to develop a pipeline to integrate ChIP-seqdat and a tool to uncover multiple regulatory layers.
In chapter 3 the prototype presented in chapter 2 is further developedinto the Synchronous Network Data Integration (SyNDI) framework, whichis based on Cytoscape and Galaxy. The functionality and usability of theframework is highlighted with three biological examples. We analyzed thedistinct connectivity of plasma metabolites in networks associated with highor low latent cardiovascular disease risk. We obtained deeper insights froma few similar inflammatory response pathways in Staphylococcus aureus infec-tion common to human and mouse. We identified not yet reported regulatorymotifs associated with transcriptional adaptations of M. tuberculosis.In chapter 4 we present a review providing a systems level overview ofthe molecular and cellular components involved in divalent metal homeosta-sis and their role in regulating the three main virulence strategies of M. tu-berculosis: immune modulation, dormancy and phagosome escape. With theuse of the tools presented in chapter 2 and 3 we identified a single regulatorycascade for these three virulence strategies that respond to limited availabilityof divalent metals in the phagosome.
The tools presented in chapter 2 and 3 achieve data integration throughthe use of multiple similarity, coexistence, coexpression and interaction geneand protein networks. However, the presented tools cannot store additional(genome) annotations. Therefore, we applied semantic web technologies tostore and integrate heterogeneous annotation data sets. An increasing num-ber of widely used biological resources are already available in the RDF datamodel. There are however, no tools available that provide structural overviewsof these resources. Such structural overviews are essential to efficiently querythese resources and to assess their structural integrity and design. There-fore, in chapter 5, I present RDF2Graph, a tool that automatically recoversthe structure of an RDF resource. The generated overview enables users tocreate complex queries on these resources and to structurally validate newlycreated resources.
Direct functional comparison support genotype to phenotype predictions.A prerequisite for a direct functional comparison is consistent annotation ofthe genetic elements with evidence statements. However, the standard struc-tured formats used by the public sequence databases to present genome an-notations provide limited support for data mining, hampering comparativeanalyses at large scale. To enable interoperability of genome annotations fordata mining application, we have developed the Genome Biology OntologyLanguage (GBOL) and associated infrastructure (GBOL stack), which is pre-sented in chapter 6. GBOL is provenance aware and thus provides a consistentrepresentation of functional genome annotations linked to the provenance.The provenance of a genome annotation describes the contextual details andderivation history of the process that resulted in the annotation. GBOL is mod-ular in design, extensible and linked to existing ontologies. The GBOL stackof supporting tools enforces consistency within and between the GBOL defi-nitions in the ontology.
Based on GBOL, we developed the genome annotation pipeline SAPP (Se-mantic Annotation Platform with Provenance) presented in chapter 7. SAPPautomatically predicts, tracks and stores structural and functional annotationsand associated dataset- and element-wise provenance in a Linked Data for-mat, thereby enabling information mining and retrieval with Semantic Webtechnologies. This greatly reduces the administrative burden of handling mul-tiple analysis tools and versions thereof and facilitates multi-level large scalecomparative analysis. In turn this can be used to make genotype to phenotypepredictions.
The development of GBOL and SAPP was done simultaneously. Duringthe development we realized that we had to constantly validated the data ex-ported to RDF to ensure coherence with the ontology. This was an extremelytime consuming process and prone to error, therefore we developed the Em-pusa code generator. Empusa is presented in chapter 8.
SAPP has been successfully used to annotate 432 sequenced Pseudomonas strains and integrate the resulting annotation in a large scale functional com-parison using protein domains. This comparison is presented in chapter 9.Additionally, data from six metabolic models, nearly a thousand transcrip-tome measurements and four large scale transposon mutagenesis experimentswere integrated with the genome annotations. In this way, we linked gene es-sentiality, persistence and expression variability. This gave us insight into thediversity, versatility and evolutionary history of the Pseudomonas genus, whichcontains some important pathogens as well some useful species for bioengi-neering and bioremediation purposes.
Genome annotation can be used to create GEM, which can be used to betterlink genotypes to phenotypes. Bio-Growmatch, presented in chapter 10, istool that can automatically suggest modification to improve a GEM based onphenotype data. Thereby integrating growth data into the complete processof modelling the metabolism of an organism.
Chapter 11 presents a general discussion on how the chapters contributedthe central goal. After which I discuss provenance requirements for data reuseand integration. I further discuss how this can be used to further improveknowledge generation. The acquired knowledge could, in turn, be used to de-sign new experiments. The principles of the dry-lab cycle and how semantictechnologies can contribute to establish these cycles are discussed in chapter11. Finally a discussion is presented on how to apply these principles to im-prove the creation and usability of GEM’s.