From existing data to novel hypotheses: design and application of structure-based Molecular Class Specific Information Systems
Kuipers, R.K.P. - \ 2012
Wageningen University. Promotor(en): Vitor Martins dos Santos; G. Vriend, co-promotor(en): Peter Schaap. - S.l. : s.n. - ISBN 9789461733504 - 231
systems biology - bioinformatics - genomics - information systems - computer sciences - databases - data mining - proteins - proteomics
As the active components of many biological systems, proteins are of great interest to life scientists. Proteins are used in a wide range of applications, such as the production of precursors and compounds, bioremediation, drug targets, and the diagnosis of patients suffering from genetic disorders. Many research projects have therefore focused on the characterization of proteins and on improving our understanding of their functional and mechanistic properties, examining folding mechanisms, reaction mechanisms, stability under stress, effects of mutations, and more. These projects have produced an enormous amount of data in many different formats, which is difficult to retrieve, combine, and use efficiently.
The main topic of this thesis is the 3DM platform, which was developed to generate Molecular Class Specific Information Systems (3DM systems) for protein superfamilies. These superfamily systems collect and interlink heterogeneous data sets based on structure-based multiple sequence alignments. 3DM systems can integrate protein, structure, mutation, reaction, conservation, correlation, contact, and many other types of data. Data is visualized via websites, directly in protein structures using YASARA, and in literature using Utopia Documents. 3DM systems contain a number of modules for analyzing superfamily characteristics: Comulator for correlated mutation analyses, Mutator for mutation retrieval, and Validator for mutant pathogenicity prediction. A powerful filtering mechanism is available to determine the characteristics of subsets of proteins and to compare the characteristics of different subsets. 3DM systems can serve as a central knowledge base for projects in protein engineering, DNA diagnostics, and drug design.
The scientific and technical background of the 3DM platform is described in the first two chapters. Chapter 1 describes the scientific background, starting with an overview of the foundations of the 3DM platform. Alignment methods and tools for both structure and sequence alignments, and the techniques used in the 3DM modules are described in detail. Alternative methods are also described with the advantages and disadvantages of the various strategies. Chapter 2 contains a technical description of the implementation of the 3DM platform and the 3DM modules. A schematic overview of the database used to store the data is provided together with a description of the various tables and the steps required to create new 3DM systems. The techniques used in the Comulator, Mutator and Validator modules of the 3DM platforms are discussed in more detail.
Chapter 3 contains a concise overview of the 3DM platform, its capabilities, and the results of protein engineering projects using 3DM systems. Thirteen 3DM systems were generated for superfamilies such as the PEPM/ICL superfamily and the Nuclear Receptors. These systems are available online for further examination. Protein engineering studies targeting proteins from these superfamilies were designed to optimize substrate specificity, enzyme activity, or thermostability. Preliminary results of drug design and DNA diagnostics projects are also included to highlight the diversity of projects to which 3DM systems can be applied.
Project HOPE, a biomedical tool to predict the effect of a mutation on the structure of a protein, is described in chapter 4. Project HOPE was developed at the Radboud University Nijmegen Medical Center under the supervision of H. Venselaar. It employs web services to optimally reuse existing databases and computing facilities. After selection of a mutation in a protein, data is collected from various sources such as UniProt and PISA. A homology model is created to determine features such as contacts and side-chain accessibility directly in the structure. Using a decision tree, the available data is evaluated to predict the effects of the mutation on the protein.
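The decision-tree step can be illustrated with a toy example. The features and thresholds below are illustrative assumptions, not Project HOPE's actual rules:

```python
def predict_effect(conservation, buried, volume_change):
    """Toy decision tree for the structural effect of a point mutation.

    conservation:  fraction of homologs carrying the wild-type residue (0-1)
    buried:        True if the side chain is inaccessible to solvent
    volume_change: new minus old side-chain volume in cubic Angstroms
    """
    if conservation > 0.9:
        return "likely damaging (highly conserved position)"
    if buried and abs(volume_change) > 40:
        return "likely damaging (core packing disturbed)"
    return "probably tolerated"

print(predict_effect(0.95, False, 0))   # mutation at a conserved position
print(predict_effect(0.30, True, 75))   # large substitution in the core
print(predict_effect(0.30, False, 10))  # conservative surface change
```

The real tool evaluates many more structural features (contacts, hydrogen bonds, salt bridges) derived from the homology model, but the branching logic follows this same pattern.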
Chapter 5 describes Comulator, the 3DM module for correlated mutation analyses. Two positions in an alignment correlate when they co-evolve, that is, they mutate simultaneously or not at all. Comulator uses a statistical coupling algorithm to detect correlated mutations. Correlated mutations are visualized using heatmaps, or directly in protein structures using YASARA. Analyses of correlated mutations in various superfamilies showed that correlating positions are often found in networks, and that the positions in these networks often share a common function. Using these networks, mutants were predicted to increase the specificity or activity of proteins. Mutational studies confirmed that correlated mutation analysis is a valuable tool for the rational design of proteins.
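Comulator itself uses a statistical coupling algorithm; as a simpler stand-in for the same idea, co-variation between two alignment columns can be scored with mutual information:

```python
from collections import Counter
from math import log2

def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns,
    each given as a list of residues, one per sequence."""
    n = len(col_i)
    pi = Counter(col_i)
    pj = Counter(col_j)
    pij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in pij.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

# Perfectly coupled columns: K always pairs with D, R always with E.
print(mutual_information(list("DDDDEEEE"), list("KKKKRRRR")))  # 1.0
# Independent columns score zero.
print(mutual_information(list("DEDEDEDE"), list("KKRRKKRR")))  # 0.0
```

In practice such scores are corrected for phylogenetic bias and alignment gaps; the toy calculation only conveys what "mutate simultaneously or not at all" means numerically.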
Mutator, the text-mining tool used to incorporate mutations into 3DM systems, is described in chapter 6. Mutator was designed to automatically retrieve mutations from literature and store them in a 3DM system. A PubMed search using keywords from the 3DM system preselects articles of interest. These articles are retrieved from the internet, converted to text, and parsed for mutations. Mutations are then grounded to proteins and stored in a 3DM database. Mutation retrieval was tested on the alpha-amylase superfamily, as this superfamily contains the enzyme involved in Fabry's disease, an X-linked lysosomal storage disease. Compared to existing mutant databases, such as the HGMD and SwissProt, Mutator retrieved 30% more mutations from literature. A major problem in DNA diagnostics is the differentiation between natural variants and pathogenic mutations. To distinguish between the two, the Validator module was added to 3DM. Validator uses the data available in a 3DM system to predict the pathogenicity of a mutation using, for example, the residue conservation of the mutation's alignment position, the side-chain accessibility of the mutated residue in the structure, and the number of mutations found in literature for that alignment position. Mutator and Validator can be used to study mutations found in disorder-related genes. Although these tools are not the definitive solution for DNA diagnostics, they can hopefully be used to increase our understanding of the molecular basis of disorders.
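A minimal sketch of the mutation-parsing step (the real Mutator grammar is more elaborate): point mutations in text typically follow patterns such as "G328A" or "p.N215S", which a regular expression can capture.

```python
import re

# Three-letter and one-letter amino acid codes.
AA3 = "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
AA1 = "[ACDEFGHIKLMNPQRSTVWY]"

# Optional "p." prefix, wild-type residue, position, mutant residue.
MUTATION_RE = re.compile(rf"\b(?:p\.)?({AA3}|{AA1})(\d+)({AA3}|{AA1})\b")

def find_mutations(text):
    """Return wild-type/position/mutant mentions found in free text."""
    return ["".join(groups) for groups in MUTATION_RE.findall(text)]

sentence = "The G328A mutant lost all activity, whereas p.N215S is a known variant."
print(find_mutations(sentence))  # ['G328A', 'N215S']
```

A pattern this loose also over-matches (e.g. cell line names), which is exactly why Mutator grounds each candidate to a protein sequence before accepting it.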
Chapters 7 and 8 describe applied research projects using 3DM systems containing proteins of potential commercial interest. A 3DM system for the α/β-hydrolase superfamily is described in chapter 7. This superfamily consists of almost 20,000 proteins with a diverse range of functions. Superfamily alignments were generated for the common α/β-hydrolase fold shared by all superfamily members, and for five distinct subtypes within the superfamily. Due to the size and functional diversity of the superfamily, there is great potential for industrial application of its members. Chapter 8 describes a study focusing on a sucrose phosphorylase enzyme from the α-amylase superfamily. This enzyme could potentially be used in an industrial setting for the transfer of glucose to a wide variety of molecules. The aim of the study was to increase the stability of the protein at higher temperatures. A combination of rational design using a 3DM system and in-depth study of the protein structure led to a series of mutations that more than doubled the half-life of the protein at 60°C.
3DM systems have been successfully applied in a wide range of protein engineering and DNA diagnostics studies. Currently, 3DM systems are applied most successfully in projects studying a single protein family or monogenic disorder. In the future, we hope to apply 3DM to more complex scenarios, such as enzyme factories and polygenic disorders, by combining multiple 3DM systems for interacting proteins.
Application of data mining methods to establish systems for early warning and proactive control in food supply chain networks
Li, Y. - \ 2010
Wageningen University. Promotor(en): Adrie Beulens; Jack van der Vorst. - S.l. : s.n. - ISBN 9789085856382 - 164
control - food supply - management - food industry - data mining - supply chain management - agro-industrial chains - control systems - decision support systems - knowledge systems - business management
Food quality problems in Food Supply Chain Networks (FSCN) have not only caused losses to the food industry, but also posed risks to the health of consumers. Information systems are widely used in current FSCN. These systems contain data about various aspects of food production (e.g. primary inputs, operations) in different stages of the FSCN. By applying Data Mining (DM) methods to these data sets, managers can identify the causes of newly encountered problems, and also predict and prevent such problems. However, managers are often non-experts in the DM area. In this research, a framework for Early Warning and Proactive Control (EWPC) systems has been designed, and a prototype system according to this framework has been implemented. Such systems enable managers to employ the power of DM methods to predict and prevent problems. Moreover, they allow managers to accumulate the knowledge they obtain from data analysis in a Knowledge Base, so that other managers can use it when they encounter similar types of problems. This research has two major objectives:
To design a framework for EWPC systems to facilitate the following aspects:
• analyze relations between problems and causes
• predict upcoming problems
• suggest control actions to prevent upcoming problems
• use existing databases in FSCN
• support non-expert users in applying DM methods
• have an extendable knowledge base
The framework should describe the necessary components as well as the relations between those components in EWPC systems.
To build a prototype system based on the framework to enable managers in FSCN, as non-experts in DM, to use DM methods for Early Warning and Proactive Control on the supply chain level.
In order to realize those objectives, six research questions were formulated.
1. What are the requirements for EWPC system design considering current practice of FSCN management?
2. What components should be included in the EWPC systems, and how should those components cooperate to enable managers to achieve EWPC in FSCN?
3. What Data Mining methods are available and applicable for EWPC in FSCN?
4. What support needs to be provided to managers in order to enable them to use Data Mining methods for EWPC?
5. What kind of structure is suitable for the Knowledge Base in EWPC systems?
6. What is the validity of the designed framework and prototype system?
To answer these questions, we used both literature review and case analysis. We studied the literature from areas such as Decision Support Systems, Data Mining, Supply Chain Management, Ontology Engineering, and Knowledge Engineering. The cases we analyzed came from two food companies. From the cases in those companies we studied what kind of system would enable managers to realize EWPC in FSCN. During case analysis, we communicated with managers in those companies about the problems they encountered, the relevant data sets, and the objectives they wanted to achieve. The data sets obtained from those cases typically have more than ten fields and millions of records. By applying different DM methods to those cases, we accumulated knowledge on the applicability of those methods as well as on the generic processes of applying them for EWPC. Moreover, we categorized the types of knowledge obtained from problem investigation in order to design a proper structure for the Knowledge Base.
Regarding the first research question (what are the requirements for EWPC system design considering current practice of FSCN management?), our study distinguished three types of requirements: performance requirements concerning the time needed to use the system, specific quality requirements concerning the sufficiency and comprehensibility of the assistance the system can offer, and functional requirements. There are six functional requirements:
1) facilitating the quantitative formulation of problems;
2) guiding data joining and data preparation;
3) guiding managers in using DM methods for quantitative modeling;
4) predicting problems as early as possible;
5) supporting the evaluation of different control measures;
6) providing relevant knowledge for encountered problems, and accommodating new knowledge obtained during problem solving and decision making.
Regarding the second research question (what components should be included in EWPC systems, and how should those components cooperate to enable managers to achieve EWPC in FSCN?), our study defined the following major components for the framework:
• Task Classifier and Template Approaches: direct managers to follow the correct processes when dealing with an encountered problem. The Task Classifier helps users to quickly identify their task type: identifying a problem, finding relevant data, exploring potential causal factors for the problem, predicting upcoming problems, evaluating alternative control measures, and consulting the Knowledge Base. Each task is supported by a corresponding Template Approach.
• Knowledge Base: stores information (e.g. causal factors, causal relations) on previously encountered problems in FSCN for easy reference by other users.
• DM methods library and Expert System: the DM methods library stores information (function, model format, and requirements on data sets) about the DM methods that can be used for EWPC. The Expert System suggests which DM methods to use and explains its reasoning.
• Explorer and Predictor: the Explorer component allows users to explore potential causal factors for problems in FSCN. The Predictor warns about problems that are about to occur in FSCN. It is also used for decision evaluation: users can employ previously built models to compare the outcomes of different available decisions and choose the best one.
In addition to the specification of those components, we also defined the steps that are needed for using the system, as well as the correct sequence between those steps.
Regarding the third research question (what Data Mining methods are available and applicable for EWPC in FSCN?), our study identified six requirements at the DM method level: prediction, problem detection, finding determinant factors, representing complex structure, different representation forms, and extensibility with new knowledge. The first four requirements concern the functions of DM methods. In the DM area, functions are categorized differently (e.g. classification, regression); our study provided a mapping between these two kinds of functions. We then selected a list of widely used DM methods and identified which method can accomplish which DM function. The last two requirements relate to the representation form of DM methods. Our study provided a second mapping, between the representation forms of those DM methods and their extensibility with new knowledge.
Regarding the fourth research question (what support needs to be provided to managers in order to enable them to use Data Mining methods for EWPC?), our study found that two kinds of support are needed. The first is finding a proper DM method. This is supported with an Expert System for DM method selection and a DM methods library: managers receive suggestions on which DM method is appropriate after specifying their case situation and data set characteristics to the Expert System. The second is using the selected DM method for EWPC. This is supported with Template Approaches for data analysis, which tell users how to execute each step, which performance indicators to look at, and what to do if a particular situation occurs.
Regarding the fifth research question (what kind of structure is suitable for the Knowledge Base in EWPC systems?), our study defined a structure with two parts: a rule base and an inference structure. The rule base allows managers to store obtained knowledge, specifying which causal relations and/or remedies have been found. It should contain an ontology that guarantees consistent semantic meaning of the terms in each rule. The inference structure allows managers to quickly identify relevant knowledge: it communicates with users and uses an inference mechanism to find applicable knowledge.
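As a minimal sketch of such a rule base with forward-chaining inference (the facts and rules are invented for illustration, not taken from the thesis):

```python
# Each rule: (set of conditions, conclusion). Conclusions may be causal
# factors or suggested remedies, mirroring the Knowledge Base structure.
RULES = [
    ({"high_storage_temperature", "long_transport_time"}, "risk_microbial_growth"),
    ({"risk_microbial_growth"}, "advise_cold_chain_check"),
]

def infer(facts, rules):
    """Forward chaining: fire rules until no new conclusions appear."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

observed = {"high_storage_temperature", "long_transport_time"}
print(sorted(infer(observed, RULES)))
```

The ontology mentioned above would constrain the vocabulary of these condition and conclusion terms so that rules written by different managers remain mutually interpretable.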
Regarding the sixth research question (what is the validity of the designed framework and prototype system?), our study first assigned appropriate Key Performance Indicators (KPIs) to the different design aspects of the system (e.g. framework, design methodology), and then employed expert validation to evaluate the performance of each design aspect on each KPI. Our expectations were roughly met by the results of the expert validation, which also revealed that experts in FSCN management attach special importance to the potential of the developed system.
The main contribution of this research is that it integrated different scientific areas into one decision support architecture. This architecture presents a new means to monitor and control the processes in Supply Chain Management. With EWPC, managers can handle new problems using existing data resources; the scope of solvable problems is not restricted beforehand. For Data Mining, this study extends existing research on the applicability of DM methods by bridging real-world needs and the DM area. For Knowledge Engineering, this study creates a suitable Knowledge Base structure for sharing knowledge among non-expert users by linking Knowledge Management with Ontology Engineering.
As far as the impact of this research for supply chain managers is concerned, we advise them to ensure that the requirements for effective proactive control are fulfilled. The framework presented in this thesis supports supply chain managers by providing them with usable DM methods for obtaining new insights through modelling and the application of new data. For food quality managers in FSCN, the implication is that an EWPC system can be used to explore causal factors when a problem occurs. Food quality managers can also verify hypotheses about causes with the EWPC system. By using this system together with other problem investigation strategies, such as field investigation, managers can improve the efficiency and effectiveness of problem solving. We advise information technology managers to use such a system to enforce correct and continuous data collection mechanisms; the system's facilities for handling outliers and missing values enable managers to easily identify problems in collected data. FSCN managers from practice recognize the potential of the system and the knowledge stored in it for improving decision support by making Data Mining applicable for non-experts.
Web services for transcriptomics
Neerincx, P. - \ 2009
Wageningen University. Promotor(en): Jack Leunissen. - S.l. : s.n. - ISBN 9789085854647 - 184
bioinformatics - internet - molecular biology - computers - data communication - data processing - transcriptomics - computer networks - microarrays - genomics - data mining
Transcriptomics is part of a family of disciplines focussing on high throughput molecular biology experiments. In the case of transcriptomics, scientists study the expression of genes resulting in transcripts. These transcripts can either perform a biological function themselves or function as messenger molecules containing a copy of the genetic code, which can be used by the ribosomes as templates to synthesise proteins. Over the past decade microarray technology has become the dominant technology for performing high throughput gene expression experiments.
A microarray contains short sequences (oligos or probes), which are the reverse complement of fragments of the targets (transcripts or sequences derived thereof). When genes are expressed, their transcripts (or sequences derived thereof) can hybridise to these probes. Many thousands of copies of a probe are immobilised in a small region on a support. These regions are called spots, and a typical microarray contains thousands or sometimes even more than a million spots. When the transcripts (or sequences derived thereof) are fluorescently labelled and it is known which probe is located in which spot on the support, a fluorescent signal in a certain region represents the expression of a certain gene. For interpretation of microarray data it is essential to make sure the oligos are specific for their targets. Hence, for proper probe design one needs to know all transcripts that may be expressed and how well they can hybridise with candidate oligos. Oligo design therefore requires:
1. A complete reference genome assembly.
2. Complete annotation of the genome to know which parts may be transcribed.
3. Insight in the amount of natural variation in the genomes of different individuals.
4. Knowledge on how experimental conditions influence the ability of probes to hybridise with certain transcripts.
Unfortunately such complete information does not exist, but many microarrays were designed based on incomplete data nevertheless. This can lead to a variety of problems including cross-hybridisation (non-specific binding), erroneously annotated and therefore misleading probes, missing probes and orphan probes.
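The probe-target relationship described above can be sketched in a few lines. This is an idealised exact-match check with an invented sequence, ignoring the hybridisation thermodynamics that make real probe design hard:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def probe_matches(probe, transcript):
    """A probe can hybridise where the transcript contains its reverse complement."""
    return reverse_complement(probe) in transcript

fragment = "ATGCCGT"                  # target fragment on a transcript
probe = reverse_complement(fragment)  # the probe is its reverse complement
print(probe_matches(probe, "GGATGCCGTGG"))  # True
```

Real specificity checks must also score imperfect alignments against every other possible transcript, which is why the completeness requirements listed above matter so much.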
Fortunately, the amount of information on genes and their transcripts increases rapidly. It is therefore possible to improve the reliability of microarray data analysis by regularly updating the probe annotation using updated databases for genomes and their annotation. Several tools have been developed for this purpose, but these either used simplistic annotation strategies or did not support our species and/or microarray platforms of interest. Therefore, we developed OligoRAP (Oligo Re-Annotation Pipeline), which is described in chapter 2. OligoRAP was designed to take advantage of annotation provided by, among others, Ensembl, the largest genome annotation effort in the world. Thereby OligoRAP supports most of the major animal model organisms, including farm animals like chicken and cow. In addition to supporting our species and array platforms of interest, OligoRAP employs a new annotation strategy that combines information from genome and transcript databases in a non-redundant way to obtain the most complete annotation possible.
In chapter 3 we compared the annotation generated with three oligo annotation pipelines, including OligoRAP, and investigated the effect on functional analysis of a microarray experiment involving chickens infected with the Eimeria parasite. As an example of functional analysis, we investigated whether up- or downregulated genes were enriched for terms from the Gene Ontology (GO). We discovered that small differences in annotation strategy could lead to alarmingly large differences in enriched GO terms.
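The enrichment test behind such an analysis is typically a hypergeometric (Fisher-type) test; a self-contained sketch with invented numbers:

```python
from math import comb

def hypergeom_pval(k, n, K, N):
    """P(X >= k): probability of drawing at least k annotated genes when
    sampling n genes from a universe of N genes, K of which carry the GO term."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

# 12 of 40 differentially expressed genes carry a GO term that annotates
# 300 of the 15,000 genes on the array: far more than expected by chance.
print(hypergeom_pval(k=12, n=40, K=300, N=15000))
```

Since shifting the annotation moves genes in and out of both the sample (n, k) and the term counts (K), even small annotation differences can flip which terms cross the significance threshold, consistent with the observation above.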
Therefore it is important to know which annotation strategy works best, but it was not possible to assess this due to the lack of a good reference or benchmark dataset. A few limited studies have investigated the hybridisation potential of imperfect alignments of oligos with potential targets, but in general such data is scarce. In addition, it is difficult to compare these studies due to differences in experimental setup, including different hybridisation temperatures and different probe lengths. As a result, we cannot determine exact thresholds for the alignments of oligos with non-targets to prevent cross-hybridisation, but from these studies we can get an idea of the range of thresholds required for optimal target specificity. Note that in these studies experimental conditions were first optimised for an optimal signal-to-noise ratio for hybridisation of oligos with targets. These conditions were then used to determine the thresholds for alignments of oligos with non-targets to prevent cross-hybridisation.
Chapter 4 describes a parameter sweep using OligoRAP to explore hybridisation potential thresholds from a different perspective. For the mouse genome, thresholds were determined that yield the largest number of gene-specific probes. Using those thresholds, we then determined thresholds for optimal signal-to-noise ratios. Unfortunately, the annotation-based thresholds we found did not fall within the range of experimentally determined thresholds; in fact, they were not even close. Hence what was experimentally determined to be optimal for the technology was not in sync with what was determined to be optimal for the mouse genome. Further research will be required to determine whether microarray technology can be modified in such a way that it is better suited for gene expression experiments. The requirement of a priori information on possible targets and the lack of sufficient knowledge on how experimental conditions influence hybridisation potential can be considered the Achilles' heel of microarray technology.
Chapter 5 is a collection of three application notes describing other tools that can aid in the analysis of transcriptomics data. Firstly, RShell, a plugin for the Taverna workbench that allows users to execute statistical computations remotely on R servers. Secondly, MADMAX services, which provide quality control and normalisation of microarray data for Affymetrix arrays. Finally, GeneIlluminator, a tool to disambiguate gene symbols, allowing researchers to specifically retrieve literature for their genes of interest even when those gene symbols have many synonyms and homonyms.
High throughput experiments like those performed in transcriptomics usually require subsequent analysis with many different tools to make biological sense of the data. Installing all these tools on a single, local computer and making them compatible so users can build analysis pipelines can be very cumbersome. Therefore distributed analysis strategies have been explored extensively over the past decades. In a distributed system providers offer remote access to tools and data via the Internet allowing users to create pipelines from modules from all over the globe.
Chapter 1 provides an overview of the evolution of web services, which represent the latest generation of technology for creating distributed systems. The major advantage of web services over older technology is that they are independent of programming language, Internet communication protocol, and operating system. Web services are therefore very flexible, and most of them are firewall-proof. Web services play a major role in the remaining chapters of this thesis: OligoRAP is a workflow made entirely from web services, and the tools described in chapter 5 all provide remote programmatic access via web service interfaces. Although web services can be used to build relatively complex workflows like OligoRAP, a lack of (mainly de facto) standards and of user-friendly clients has limited the use of web services to bioinformaticians. A semantic web where biologists can easily link web services into complex workflows does not yet exist.
An integrative algorithmic approach towards knowledge discovery by bioinformatics
Alako Tadontsop, F.B. - \ 2008
Wageningen University. Promotor(en): Jack Leunissen. - S.l. : s.n. - ISBN 9789085048190 - 124
bioinformatics - nomenclature - computer analysis - nucleotide sequences - algorithms - phylogenetics - molecular biology - phylogeny - classification - genomics - data mining - microarrays - ontologies
In this thesis we describe different approaches that aid in the utilization of the exponentially growing amount of information available in the life sciences. Briefly, we address two issues in molecular biology: sequence analysis and text mining. The former addresses the problem of how to determine remote sequence homology, especially when sequence similarity is very low. For this, a visualisation tool is introduced that combines sequence alignment, domain prediction, and phylogeny. The second topic, text mining, centres on the question of how to unambiguously formulate queries for efficient information retrieval. It tackles the problem of gene nomenclature (one in two gene symbols is ambiguous) by introducing a new text-clustering- and taxonomy-based disambiguation methodology.
Applying Data Mining for Early Warning in food supply networks
Li, Y. ; Kramer, M.R. ; Beulens, A.J.M. ; Vorst, J.G.A.J. van der - \ 2006
Wageningen : Mansholt Graduate School (Working paper / Mansholt Graduate School : Discussion paper) - 18
food supply - food quality - control - data collection - data mining - networks
In food supply networks, quality of end products is a critical issue. The quality of food products depends in a complex way on many factors. In order to effectively control food quality, our research aims at implementing early warning and proactive control systems in food supply networks. To exploit the large amounts of operational data collected throughout such a network, we employ data mining in various settings. This paper investigates the requirements on data mining posed by early warning in food supply networks, and maps those requirements to available data mining methods. Results of a preliminary case study show that data mining is a promising approach as part of early warning systems in food supply networks.