WUR Journal browser

  • The Journal Browser provides a list of more than 30,000 journals. It can be consulted by authors who wish to select a journal for publishing their manuscript Open Access. The information in this list is aggregated from several sources on a regular basis:

    • A list of journals for which the Association of Universities in the Netherlands (VSNU) has made deals with publishers, to make articles Open Access. Under these deals, corresponding authors of Dutch universities can publish their articles Open Access in the participating journals with discounts on the article processing charges (APCs).
    • A list of journals covered by the Journal Citation Reports.
    • A list of journals covered by Scopus.
    • Journals indexed in the Directory of Open Access Journals (DOAJ).
    • Lists of journals for which specific Dutch universities have made deals with publishers, to make articles Open Access. Under these deals, corresponding authors of these universities can publish their articles Open Access in the participating journals with discounts on the article processing charges (APCs). Depending on the university from which the Journal Browser is consulted, this information is shown.
    • Additional data on citations made to journals, in articles published by staff from a specific Dutch university, that are made available by that university. Depending on the university from which the Journal Browser is consulted, this information is shown.

    In the Journal Browser, a search box can be used to look up journals on certain subjects. The terms entered in this box are used to search the journal titles and other metadata (e.g. keywords).

    After having selected journals by subject, it is possible to apply additional filters. These concern no/full costs and discounts for Open Access publishing, support on Open Access publishing in journals, and the quartile to which the journal’s impact factor belongs.

    When one selects a journal in the Journal Browser, the following information may be presented:

    • General information about the selected journal such as title and ISSNs, together with a link to the journal’s website.
    • APC discount that holds for the selected journal if it is part of an Open Access deal.
    • Impact measures for the selected journal from Journal Citation Reports or Scopus. The impact measures that are shown may vary, depending on the university from which the Journal Browser is consulted. For some universities, the number of citations made to the selected journal (in articles published by staff from that university) is also shown.
    • Information from Sherpa/Romeo on the conditions under which articles from the selected journal may be made available via Green Open Access.
    • A listing of articles recently published in the selected journal.
    • For some universities, information is available on what journals have been co-cited most frequently together with the selected journal (in articles published by staff from these universities). When available, this information is presented under ‘similar journals’.

Bioinformatics

Oxford University Press

1985-

ISSN: 1367-4803 (1367-4811, 1460-2059)
Biotechnology & Applied Microbiology - Mathematical & Computational Biology - Biochemical Research Methods - Statistics and Probability - Computational Mathematics - Biochemistry - Computer Science Applications - Molecular Biology - Computational Theory and Mathematics

Recent articles

1
1367-4803 * 1460-2059 * 33191360

Bioinformatics (2019) doi: 10.1093/bioinformatics/btz102

2
1367-4803 * 1460-2059 * 33191361

Abstract
Motivation: Analysis of differential expression of genes is often performed to understand how the metabolic activity of an organism is impacted by a perturbation. However, because the system of metabolic regulation is complex and not all changes are directly reflected in the expression levels, interpreting these data can be difficult.
Results: In this work, we present a new algorithm and computational tool that uses a genome-scale metabolic reconstruction to infer metabolic changes from differential expression data. Using the framework of constraint-based analysis, our method produces a qualitative hypothesis of a change in metabolic activity. In other words, each reaction of the network is inferred to have increased, decreased, or remained unchanged in flux. In contrast to similar previous approaches, our method does not require a biological objective function and does not assign on/off activity states to genes. An implementation is provided and it is available online. We apply the method to three published datasets to show that it successfully accomplishes its two main goals: confirming or rejecting metabolic changes suggested by differentially expressed genes based on how well they fit in as parts of a coordinated metabolic change, as well as inferring changes in reactions whose genes did not undergo differential expression.
Availability and implementation: github.com/htpusa/moomin.
Supplementary information: Supplementary data are available at Bioinformatics online.
3
1367-4803 * 1460-2059 * 33191362

Abstract
Motivation: Biomedical event extraction is fundamental for information extraction in molecular biology and biomedical research. The detected events form the central basis for comprehensive biomedical knowledge fusion, facilitating the digestion of massive information influx from the literature. Limited by the event context, the existing event detection models are mostly applicable for a single task. A general and scalable computational model is needed for biomedical knowledge management.
Results: We consider and propose a bottom-up detection framework to identify the events from recognized arguments. To capture the relations between the arguments, we trained a bidirectional long short-term memory network to model their context embedding. Leveraging the compositional attributes, we further derived the candidate samples for training event classifiers. We built our models on the datasets from BioNLP Shared Task for evaluations. Our method achieved the average F-scores of 0.81 and 0.92 on BioNLPST-BGI and BioNLPST-BB datasets, respectively. Compared with seven state-of-the-art methods, our method nearly doubled the existing F-score performance (0.92 versus 0.56) on the BioNLPST-BB dataset. Case studies were conducted to reveal the underlying reasons.
Availability and implementation: https://github.com/cskyan/evntextrc.
Supplementary information: Supplementary data are available at Bioinformatics online.
4
1367-4803 * 1460-2059 * 33191363

Abstract
Motivation: Protein structure alignment is one of the fundamental problems in computational structure biology. A variety of algorithms have been developed to address this important issue in the past decade. However, due to their heuristic nature, current structure alignment methods may suffer from suboptimal alignment and/or over-fragmentation and thus lead to a biologically wrong alignment in some cases. To overcome these limitations, we have developed an accurate topology-independent and global structure alignment method through an FFT-based exhaustive search algorithm, which is referred to as FTAlign.
Results: Our FTAlign algorithm was extensively tested on six commonly used datasets and compared with seven state-of-the-art structure alignment approaches, TMalign, DeepAlign, Kpax, 3DCOMB, MICAN, SPalignNS and CLICK. It was shown that FTAlign outperformed the other methods in reproducing manually curated alignments and obtained a high success rate of 96.7 and 90.0% on two gold-standard benchmarks, MALIDUP and MALISAM, respectively. Moreover, FTAlign also achieved the overall best performance in terms of biologically meaningful structure overlap (SO) and TMscore on both the sequential alignment test sets including MALIDUP, MALISAM and 64 difficult cases from HOMSTRAD, and the non-sequential sets including MALIDUP-NS, MALISAM-NS, 199 topology-different cases, where FTAlign especially showed more advantage for non-sequential alignment. Despite its global search feature, FTAlign is also computationally efficient and can normally complete a pairwise alignment within one second.
Availability and implementation: http://huanglab.phys.hust.edu.cn/ftalign/.
5
1367-4803 * 1460-2059 * 33191364

Abstract
Motivation: Gene fusions are an important class of transcriptional variants that can influence cancer development and can be predicted from RNA sequencing (RNA-seq) data by multiple existing tools. However, the real-world performance of these tools is unclear due to the lack of known positive and negative events, especially with regard to fusion genes in individual samples. Often simulated reads are used, but these cannot account for all technical biases in RNA-seq data generated from real samples.
Results: Here, we present ArtiFuse, a novel approach that simulates fusion genes by sequence modification to the genomic reference, and therefore can be applied to any RNA-seq dataset without the need for any simulated reads. We demonstrate our approach on eight RNA-seq datasets for three fusion gene prediction tools: average recall values peak for all three tools between 0.4 and 0.56 for high-quality and high-coverage datasets. As ArtiFuse affords total control over involved genes and breakpoint position, we also assessed performance with regard to gene-related properties, showing a drop in recall value for low-expressed genes in high-coverage samples and genes with co-expressed paralogues. Overall tool performance assessed from ArtiFusions is lower compared to previously reported estimates on simulated reads. Due to the use of real RNA-seq datasets, we believe that ArtiFuse provides a more realistic benchmark that can be used to develop more accurate fusion gene prediction tools for application in clinical settings.
Availability and implementation: ArtiFuse is implemented in Python. The source code and documentation are available at https://github.com/TRON-Bioinformatics/ArtiFusion.
Supplementary information: Supplementary data are available at Bioinformatics online.
6
1367-4803 * 1460-2059 * 33191365

Abstract
Summary: We launch a webserver for RNA structure prediction and design corresponding to tools developed using our RNA-As-Graphs (RAG) approach. RAG uses coarse-grained tree graphs to represent RNA secondary structure, allowing the application of graph theory to analyze and advance RNA structure discovery. Our webserver consists of three modules: (a) RAG Sampler: samples tree graph topologies from an RNA secondary structure to predict corresponding tertiary topologies, (b) RAG Builder: builds three-dimensional atomic models from candidate graphs generated by RAG Sampler, and (c) RAG Designer: designs sequences that fold onto novel RNA motifs (described by tree graph topologies). Results analyses are performed for further assessment/selection. The Results page provides links to download results and indicates possible errors encountered. RAG-Web offers a user-friendly interface to utilize our RAG software suite to predict and design RNA structures and sequences.
Availability and implementation: The webserver is freely available online at: http://www.biomath.nyu.edu/ragtop/.
Supplementary information: Supplementary data are available at Bioinformatics online.
7
1367-4803 * 1460-2059 * 33191366

Abstract
Motivation: Visualization of multiple genomic data generally requires the use of public or commercially hosted browsers. Flexible visualization of chromatin interaction data as genomic features and network components offers informative insights into gene expression. An open source application for visualizing HiC and chromatin conformation-based data as 2D-arcs accompanied by interactive network analyses is valuable.
Results: DNA Rchitect is a new tool created to visualize HiC and chromatin conformation-based contacts at high (Kb) and low (Mb) genomic resolutions. The user can upload their pre-filtered HiC experiment in bedpe format to the DNA Rchitect web app that we have hosted or to a version they themselves have deployed. With the uploaded data, DNA Rchitect allows the user to visualize different interactions of their sample and perform simple network analyses, while also offering visualization of other genomic data types. The user can then download their results for additional network functionality offered in network-based programs such as Cytoscape.
Availability and implementation: DNA Rchitect is freely available both as a web application written primarily in R, available at http://shiny.immgen.org/DNARchitect/, and as open source software released under an MIT license at: https://github.com/alosdiallo/DNA_Rchitect.
8
1367-4803 * 1460-2059 * 33191367

Abstract
Motivation: Genome-wide association studies have revealed that 88% of disease-associated single-nucleotide polymorphisms (SNPs) reside in noncoding regions. However, noncoding SNPs remain understudied, partly because they are challenging to prioritize for experimental validation. To address this deficiency, we developed the SNP effect matrix pipeline (SEMpl).
Results: SEMpl estimates transcription factor-binding affinity by observing differences in chromatin immunoprecipitation followed by deep sequencing signal intensity for SNPs within functional transcription factor-binding sites (TFBSs) genome-wide. By cataloging the effects of every possible mutation within the TFBS motif, SEMpl can predict the consequences of SNPs to transcription factor binding. This knowledge can be used to identify potential disease-causing regulatory loci.
Availability and implementation: SEMpl is available from https://github.com/Boyle-Lab/SEM_CPP.
Supplementary information: Supplementary data are available at Bioinformatics online.
9
1367-4803 * 1460-2059 * 33191368

Abstract
Motivation: The National Human Genome Research Institute Catalog of Published Genome-Wide Association Studies (GWAS) Catalog has collected, curated and made available data from over 7100 studies. The recently developed GWAS Catalog representational state transfer (REST) application programming interface (API) is the only method allowing programmatic access to this resource.
Results: Here, we describe gwasrapidd, an R package that provides the first client interface to the GWAS Catalog REST API, representing an important software counterpart to the server-side component. gwasrapidd enables users to quickly retrieve, filter and integrate data with comprehensive bioinformatics analysis tools, which is particularly critical for those looking into functional characterization of risk loci.
Availability and implementation: gwasrapidd is freely available under an MIT License, and can be accessed from https://github.com/ramiromagno/gwasrapidd.
Supplementary information: Supplementary data are available at Bioinformatics online.
10
1367-4803 * 1460-2059 * 33191369

Abstract
Motivation: Circular RNAs (circRNAs), a class of non-coding RNAs generated from non-canonical back-splicing events, have emerged to play key roles in many biological processes. Though numerous tools have been developed to detect circRNAs from rRNA-depleted RNA-seq data based on back-splicing junction-spanning reads, computational tools to identify critical genomic features regulating circRNA biogenesis are still lacking. In addition, rigorous statistical methods to perform differential expression (DE) analysis of circRNAs remain under-developed.
Results: We present circMeta, a unified computational framework for circRNA analyses. circMeta has three primary functional modules: (i) a pipeline for comprehensive genomic feature annotation related to circRNA biogenesis, including length of introns flanking circularized exons, repetitive elements such as Alu elements and SINEs, competition score for forming circulation and RNA editing in back-splicing flanking introns; (ii) a two-stage DE approach of circRNAs based on circular junction reads to quantitatively compare circRNA levels and (iii) a Bayesian hierarchical model for DE analysis of circRNAs based on the ratio of circular reads to linear reads in back-splicing sites to study spatial and temporal regulation of circRNA production. Both proposed DE methods without and with considering host genes outperform existing methods by obtaining better control of false discovery rate and comparable statistical power. Moreover, the identified DE circRNAs by the proposed two-stage DE approach display potential biological functions in Gene Ontology and circRNA-miRNA–mRNA networks that are not able to be detected using existing mRNA DE methods. Furthermore, top DE circRNAs have been further validated by RT-qPCR using divergent primers spanning back-splicing junctions.
Availability and implementation: The software circMeta is freely available at https://github.com/lichen-lab/circMeta.
Supplementary information: Supplementary data are available at Bioinformatics online.
11
1367-4803 * 1460-2059 * 33191370

Abstract
Motivation: The immune system has diverse types of cells that are differentiated or activated via various signaling pathways and transcriptional regulation upon challenging conditions. Immunophenotyping by flow and mass cytometry are the major approaches for identifying key signaling molecules and transcription factors directing the transition between the functional states of immune cells. However, few proteins can be evaluated by flow cytometry in a single experiment, preventing researchers from obtaining a comprehensive picture of the molecular programs involved in immune cell differentiation. Recent advances in single-cell RNA sequencing (scRNA-seq) have enabled unbiased genome-wide quantification of gene expression in individual cells on a large scale, providing a new and versatile analytical pipeline for studying immune cell differentiation.
Results: We present VirtualCytometry, a web-based computational pipeline for evaluating immune cell differentiation by exploiting cell-to-cell variation in gene expression with scRNA-seq data. Differentiating cells often show a continuous spectrum of cellular states rather than distinct populations. VirtualCytometry enables the identification of cellular subsets for different functional states of differentiation based on the expression of marker genes. Case studies have highlighted the usefulness of this subset analysis strategy for discovering signaling molecules and transcription factors for human T-cell exhaustion, a state of T-cell dysfunction, in tumor and mouse dendritic cells activated by pathogens. With more than 226 scRNA-seq datasets precompiled from public repositories covering diverse mouse and human immune cell types in normal and disease tissues, VirtualCytometry is a useful resource for the molecular dissection of immune cell differentiation.
Availability and implementation: www.grnpedia.org/cytometry
12
1367-4803 * 1460-2059 * 33191371

Abstract
Motivation: Personalized medicine often relies on accurate estimation of a treatment effect for specific subjects. This estimation can be based on the subject’s baseline covariates but additional complications arise for a time-to-event response subject to censoring. In this paper, the treatment effect is measured as the difference between the mean survival time of a treated subject and the mean survival time of a control subject. We propose a new random forest method for estimating the individual treatment effect with survival data. The random forest is formed by individual trees built with a splitting rule specifically designed to partition the data according to the individual treatment effect. For a new subject, the forest provides a set of similar subjects from the training dataset that can be used to compute an estimation of the individual treatment effect with any adequate method.
Results: The merits of the proposed method are investigated with a simulation study where it is compared to numerous competitors, including recent state-of-the-art methods. The results indicate that the proposed method has a very good and stable performance in estimating the individual treatment effects. Two application examples, with colon cancer and breast cancer data, show that the proposed method can detect a treatment effect in a sub-population even when the overall effect is small or nonexistent.
Availability and implementation: The authors are working on an R package implementing the proposed method and it will be available soon. In the meantime, the code can be obtained from the first author at sami.tabib@hec.ca.
Supplementary information: Supplementary data are available at Bioinformatics online.
13
1367-4803 * 1460-2059 * 33191372

Abstract
Motivation: Computational approaches for predicting drug–target interactions (DTIs) can provide valuable insights into the drug mechanism of action. DTI predictions can help to quickly identify new promising (on-target) or unintended (off-target) effects of drugs. However, existing models face several challenges. Many can only process a limited number of drugs and/or have poor proteome coverage. The current approaches also often suffer from high false positive prediction rates.
Results: We propose a novel computational approach for predicting drug target proteins. The approach is based on formulating the problem as a link prediction in knowledge graphs (robust, machine-readable representations of networked knowledge). We use biomedical knowledge bases to create a knowledge graph of entities connected to both drugs and their potential targets. We propose a specific knowledge graph embedding model, TriModel, to learn vector representations (i.e. embeddings) for all drugs and targets in the created knowledge graph. These representations are consequently used to infer candidate drug target interactions based on their scores computed by the trained TriModel model. We have experimentally evaluated our method using computer simulations and compared it to five existing models. This has shown that our approach outperforms all previous ones in terms of both area under ROC and precision–recall curves in standard benchmark tests.
Availability and implementation: The data, predictions and models are available at: drugtargets.insight-centre.org.
Supplementary information: Supplementary data are available at Bioinformatics online.
14
1367-4803 * 1460-2059 * 33191373

Abstract
Motivation: The identification of sub-populations of patients with similar characteristics, called patient subtyping, is important for realizing the goals of precision medicine. Accurate subtyping is crucial for tailoring therapeutic strategies that can potentially lead to reduced mortality and morbidity. Model-based clustering, such as Gaussian mixture models, provides a principled and interpretable methodology that is widely used to identify subtypes. However, such models impose identical marginal distributions on each variable; these assumptions restrict their modeling flexibility and degrade clustering performance.
Results: In this paper, we use the statistical framework of copulas to decouple the modeling of marginals from the dependencies between them. Current copula-based methods cannot scale to high dimensions due to challenges in parameter inference. We develop HD-GMCM, which addresses these challenges and, to our knowledge, is the first copula-based clustering method that can fit high-dimensional data. Our experiments on real high-dimensional gene-expression and clinical datasets show that HD-GMCM outperforms state-of-the-art model-based clustering methods, by virtue of modeling non-Gaussian data and being robust to outliers through the use of Gaussian mixture copulas. We present a case study on lung cancer data from TCGA. Clusters obtained from HD-GMCM can be interpreted based on the dependencies they model, which offers a new way of characterizing subtypes. Empirically, such modeling not only uncovers latent structure that leads to better clustering but also yields meaningful clinical subtypes in terms of survival rates of patients.
Availability and implementation: An implementation of HD-GMCM in R is available at: https://bitbucket.org/cdal/hdgmcm/.
Supplementary information: Supplementary data are available at Bioinformatics online.
15
1367-4803 * 1460-2059 * 33191374

Abstract
Motivation: High-throughput reporter assays dramatically improve our ability to assign function to noncoding genetic variants, by measuring allelic effects on gene expression in the controlled setting of a reporter gene. Unlike genetic association tests, such assays are not confounded by linkage disequilibrium when loci are independently assayed. These methods can thus improve the identification of causal disease mutations. While work continues on improving experimental aspects of these assays, less effort has gone into developing methods for assessing the statistical significance of assay results, particularly in the case of rare variants captured from patient DNA.
Results: We describe a Bayesian hierarchical model, called Bayesian Inference of Regulatory Differences, which integrates prior information and explicitly accounts for variability between experimental replicates. The model produces substantially more accurate predictions than existing methods when allele frequencies are low, which is of clear advantage in the search for disease-causing variants in DNA captured from patient cohorts. Using the model, we demonstrate a clear tradeoff between variant sequencing coverage and numbers of biological replicates, and we show that the use of additional biological replicates decreases variance in estimates of effect size, due to the properties of the Poisson-binomial distribution. We also provide a power and sample size calculator, which facilitates decision making in experimental design parameters.
Availability and implementation: The software is freely available from www.geneprediction.org/bird. The experimental design web tool can be accessed at http://67.159.92.22:8080.
Supplementary information: Supplementary data are available at Bioinformatics online.
16
1367-4803 * 1460-2059 * 33191375

Abstract
Motivation: Exciting new opportunities have arisen to solve the protein contact prediction problem from the progress in neural networks and the availability of a large number of homologous sequences through high-throughput sequencing. In this work, we study how deep convolutional neural networks (ConvNets) may be best designed and developed to solve this long-standing problem.
Results: With publicly available datasets, we designed and trained various ConvNet architectures. We tested several recent deep learning techniques including wide residual networks, dropouts and dilated convolutions. We studied the improvements in the precision of medium-range and long-range contacts, and compared the performance of our best architectures with the ones used in existing state-of-the-art methods. The proposed ConvNet architectures predict contacts with significantly more precision than the architectures used in several state-of-the-art methods. When trained using the DeepCov dataset consisting of 3456 proteins and tested on PSICOV dataset of 150 proteins, our architectures achieve up to 15% higher precision when L/2 long-range contacts are evaluated. Similarly, when trained using the DNCON2 dataset consisting of 1426 proteins and tested on 84 protein domains in the CASP12 dataset, our single network achieves 4.8% higher precision than the ensembled DNCON2 method when top L long-range contacts are evaluated.
Availability and implementation: DEEPCON is available at https://github.com/badriadhikari/DEEPCON/.
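
The "top L/2 long-range contacts" precision quoted in this abstract can be illustrated with a minimal sketch. It is not taken from DEEPCON: the function name, the input layout, and the minimum sequence separation of 24 residues for "long-range" (a common convention in contact prediction benchmarks) are assumptions made for illustration.

```python
def top_contact_precision(scores, true_contacts, seq_len, divisor=2, min_sep=24):
    """Precision of the top L/divisor predicted long-range contacts.

    scores        -- dict mapping residue pairs (i, j), i < j, to a predicted score
    true_contacts -- set of residue pairs (i, j), i < j, that are real contacts
    seq_len       -- protein length L
    min_sep       -- assumed minimum |i - j| for a pair to count as long-range
    """
    # Keep only pairs that are far enough apart in sequence.
    candidates = [(pair, s) for pair, s in scores.items()
                  if pair[1] - pair[0] >= min_sep]
    # Rank by predicted score, highest first, and keep the top L/divisor.
    candidates.sort(key=lambda item: item[1], reverse=True)
    top = candidates[: max(1, seq_len // divisor)]
    if not top:
        return 0.0
    hits = sum(1 for pair, _ in top if pair in true_contacts)
    return hits / len(top)
```

Calling the function with `divisor=1` gives the "top L" variant mentioned at the end of the abstract.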
17
1367-4803 * 1460-2059 * 33191376

Abstract
Motivation: Meta-analysis methods have been widely used to combine results from multiple clinical or genomic studies to increase statistical power and ensure robust and accurate conclusions. The adaptively weighted Fisher’s method (AW-Fisher), initially developed for omics applications but applicable for general meta-analysis, is an effective approach to combine P-values from K independent studies and to provide better biological interpretability by characterizing which studies contribute to the meta-analysis. Currently, AW-Fisher suffers from the lack of fast P-value computation and variability estimate of AW weights. When the number of studies K is large, the 3^K − 1 possible differential expression pattern categories generated by AW-Fisher can become intractable. In this paper, we develop an importance sampling scheme with spline interpolation to increase the accuracy and speed of the P-value calculation. We also apply bootstrapping to construct a variability index for the AW-Fisher weight estimator and a co-membership matrix to categorize (cluster) differentially expressed genes based on their meta-patterns for intuitive biological investigations.
Results: The superior performance of the proposed methods is shown in simulations as well as two real omics meta-analysis applications to demonstrate its insightful biological findings.
Availability and implementation: An R package AWFisher (calling C++) is available at Bioconductor and GitHub (https://github.com/Caleb-Huo/AWFisher), and all datasets and programming codes for this paper are available in the Supplementary Material.
Supplementary information: Supplementary data are available at Bioinformatics online.
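
As background for this abstract, the sketch below shows the classical Fisher's method for combining P-values from K independent studies, plus a brute-force search over subsets of studies. The subset search is only a simplified illustration of the adaptive-weighting idea. It does not reproduce the importance sampling or spline interpolation described above, and the helper names are hypothetical. The pattern count is read here as 3 to the power K minus 1, on the assumption that each study is up, down, or unchanged and the all-unchanged pattern is excluded.

```python
import math
from itertools import combinations


def fisher_combined_p(p_values):
    """Classical Fisher's method for K independent p-values.

    Under the null, X = -2 * sum(ln p_k) follows a chi-square
    distribution with 2K degrees of freedom.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # X / 2
    # For even degrees of freedom 2K the chi-square survival function is
    # exp(-x/2) * sum_{i=0}^{K-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def best_subset(p_values):
    """Hypothetical helper: try every non-empty subset of studies and
    return (indices, combined p) for the subset with the smallest p."""
    k = len(p_values)
    best = None
    for size in range(1, k + 1):
        for idx in combinations(range(k), size):
            p = fisher_combined_p([p_values[i] for i in idx])
            if best is None or p < best[1]:
                best = (idx, p)
    return best


def n_pattern_categories(k):
    """Assumed count of up/down/unchanged patterns over K studies."""
    return 3 ** k - 1
```

The smallest combined value found by `best_subset` is a test statistic, not a valid P-value, because it is a minimum over many subsets. Calibrating it is the expensive step that the abstract addresses, and `n_pattern_categories` shows how quickly the number of categories grows with K.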
18
1367-4803 * 1460-2059 * 33191377

Abstract
Motivation: Cell type identification is one of the major goals in single cell RNA sequencing (scRNA-seq). Current methods for assigning cell types typically involve the use of unsupervised clustering, the identification of signature genes in each cluster, followed by a manual lookup of these genes in the literature and databases to assign cell types. However, there are several limitations associated with these approaches, such as unwanted sources of variation that influence clustering and a lack of canonical markers for certain cell types. Here, we present ACTINN (Automated Cell Type Identification using Neural Networks), which employs a neural network with three hidden layers, trains on datasets with predefined cell types and predicts cell types for other datasets based on the trained parameters.
Results: We trained the neural network on a mouse cell type atlas (Tabula Muris Atlas) and a human immune cell dataset, and used it to predict cell types for mouse leukocytes, human PBMCs and human T cell subtypes. The results showed that our neural network is fast and accurate, and should therefore be a useful tool to complement existing scRNA-seq pipelines.
Availability and implementation: The codes and datasets are available at https://figshare.com/articles/ACTINN/8967116. Tutorial is available at https://github.com/mafeiyang/ACTINN. All codes are implemented in Python.
Supplementary information: Supplementary data are available at Bioinformatics online.
19 show abstract
1367-4803 * 1460-2059 * 33191378

Abstract
Motivation: Protein function prediction is one of the major tasks of bioinformatics that can help in a wide range of biological problems such as understanding disease mechanisms or finding drug targets. Many methods are available for predicting protein functions from sequence based features, protein–protein interaction networks, protein structure or literature. However, other than sequence, most of the features are difficult to obtain or not available for many proteins, thereby limiting their scope. Furthermore, the performance of sequence-based function prediction methods is often lower than that of methods that incorporate multiple features, and predicting protein functions may require a lot of time.
Results: We developed a novel method for predicting protein functions from sequence alone which combines a deep convolutional neural network (CNN) model with sequence similarity based predictions. Our CNN model scans the sequence for motifs which are predictive for protein functions and combines this with functions of similar proteins (if available). We evaluate the performance of DeepGOPlus using the CAFA3 evaluation measures and achieve an Fmax of 0.390, 0.557 and 0.614 for BPO, MFO and CCO evaluations, respectively. These results would have made DeepGOPlus one of the three best predictors in CCO and the second best performing method in the BPO and MFO evaluations. We also compare DeepGOPlus with state-of-the-art methods such as DeepText2GO and GOLabeler on another dataset. DeepGOPlus can annotate around 40 protein sequences per second on common hardware, thereby making fast and accurate function predictions available for a wide range of proteins.
Availability and implementation: http://deepgoplus.bio2vec.net/.
Supplementary information: Supplementary data are available at Bioinformatics online.
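To make the reported maximum F-measure (Fmax) concrete, the following is a minimal Python sketch of a protein-centric Fmax in the style commonly used for CAFA evaluations: at each score threshold, precision is averaged over proteins that have at least one prediction, recall is averaged over all proteins, and the best F-measure across thresholds is returned. The function name `fmax`, the dictionary layout and the toy terms are assumptions made for illustration; this is not the DeepGOPlus evaluation code.

```python
def fmax(truth, scores, thresholds=None):
    """Protein-centric F-max (illustrative sketch).

    truth:  dict protein -> set of true terms
    scores: dict protein -> dict term -> predicted score in [0, 1]
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            # Terms predicted for this protein at the current threshold.
            predicted = {term for term, s in scores.get(protein, {}).items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                # Precision only counts proteins with at least one prediction.
                precisions.append(hits / len(predicted))
            # Recall is averaged over every protein in the benchmark.
            recalls.append(hits / len(true_terms) if true_terms else 0.0)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = sum(recalls) / len(recalls)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Hypothetical toy data: two proteins with made-up term identifiers.
toy_truth = {"P1": {"GO:a", "GO:b"}, "P2": {"GO:c"}}
toy_scores = {
    "P1": {"GO:a": 0.9, "GO:b": 0.4, "GO:x": 0.2},
    "P2": {"GO:c": 0.8, "GO:y": 0.6},
}
print(round(fmax(toy_truth, toy_scores), 3))
```

Because the measure takes a maximum over thresholds, it rewards a predictor for its best operating point rather than for any single fixed cutoff.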
20 show abstract
1367-4803 * 1460-2059 * 33191379

Abstract
Motivation: Common small-effect genetic variants that contribute to human complex traits and disease are typically identified using traditional fixed-effect (FE) meta-analysis methods. However, the power to detect genetic associations under FE models deteriorates with increasing heterogeneity, so that some small-effect heterogeneous loci might go undetected. A modified random-effects meta-analysis approach (RE2) was previously developed that is more powerful than traditional fixed and random-effects methods at detecting small-effect heterogeneous genetic associations; the method was later updated (RE2C) to identify small-effect heterogeneous variants overlooked by traditional fixed-effect meta-analysis. Here, we re-appraise a large-scale meta-analysis of coronary disease with RE2C to search for small-effect genetic signals potentially masked by heterogeneity in a FE meta-analysis.
Results: Our application of RE2C suggests a high sensitivity but low specificity of this approach for discovering small-effect heterogeneous genetic associations. We recommend that reports of small-effect heterogeneous loci discovered with RE2C are accompanied by forest plots and standardized predicted random-effects statistics to reveal the distribution of genetic effect estimates across component studies of meta-analyses, highlighting overly influential outlier studies with the potential to inflate genetic signals.
Availability and implementation: Scripts to calculate standardized predicted random-effects statistics and generate forest plots are available in the getspres R package from https://magosil86.github.io/getspres/.
Supplementary information: Supplementary data are available at Bioinformatics online.
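For readers unfamiliar with the fixed-effect baseline that this abstract contrasts with RE2C, here is a minimal Python sketch of the standard inverse-variance fixed-effect estimate together with Cochran's Q and the I² heterogeneity statistic. It does not implement RE2, RE2C or the getspres statistics; the function name `fixed_effect_meta` and the example numbers are assumptions for illustration only.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance fixed-effect meta-analysis (illustrative sketch).

    betas: per-study effect estimates
    ses:   per-study standard errors (all > 0)
    Returns (pooled beta, pooled SE, Cochran's Q, I^2).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I^2: share of variability attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return beta, se, q, i2


# Hypothetical example: two studies with opposite-leaning estimates.
beta, se, q, i2 = fixed_effect_meta([0.0, 0.2], [0.1, 0.1])
print(beta, se, q, i2)
```

A large Q (or I²) relative to the number of studies signals the kind of between-study heterogeneity under which, according to the abstract, fixed-effect models lose power.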
21 show abstract
1367-4803 * 1460-2059 * 33191380

Abstract
Motivation: The folding dynamics of ribonucleic acids (RNAs) are typically studied via coarse-grained models of the underlying energy landscape to cope with the exponential growth of the RNA secondary structure space. Still, studies of exact folding kinetics based on gradient basin abstractions are currently limited to short sequence lengths due to vast memory requirements. In order to compute exact transition rates between gradient basins, state-of-the-art approaches apply global flooding schemes that require memorizing the whole structure space at once. pourRNA tackles this problem via local flooding techniques where memorization is limited to the structure ensembles of individual gradient basins.
Results: Compared to the only available tool for exact gradient basin-based macro-state transition rates (namely barriers), pourRNA computes the same exact transition rates up to 10 times faster and requires two orders of magnitude less memory for sequences that are still computationally accessible for exhaustive enumeration. Parallelized computation as well as additional heuristics further speed up computations while still producing high-quality transition model approximations. The introduced heuristics enable a guided trade-off between model quality and required computational resources. We introduce and evaluate a macroscopic direct path heuristic to efficiently compute refolding energy barrier estimations for the co-transcriptionally trapped RNA sv11 of length 115 nt. Finally, we also show how pourRNA can be used to identify folding funnels and their respective energetically lowest minima.
Availability and implementation: pourRNA is freely available at https://github.com/ViennaRNA/pourRNA.
Supplementary information: Supplementary data are available at Bioinformatics online.
22 show abstract
1367-4803 * 1460-2059 * 33191381

Abstract
Motivation: A biochemical reaction, or bio-event, depicts the relationships between participating entities. Current text mining research has focused on identifying bio-events from scientific literature. However, few efforts have been dedicated to normalizing bio-events extracted from scientific literature against the entries in curated reaction databases, which could disambiguate the events and further support interconnecting events into biologically meaningful and complete networks.
Results: In this paper, we propose BioNorm, a novel method of normalizing bio-events extracted from scientific literature to entries in a bio-molecular reaction database, e.g. IntAct. BioNorm considers event normalization as a paraphrase identification problem. It represents an entry as a natural language statement by combining multiple types of information contained in it. Then, it predicts the semantic similarity between the natural language statement and the statements mentioning events in scientific literature using a long short-term memory recurrent neural network (LSTM). An event is normalized to the entry if the two statements are paraphrases. To the best of our knowledge, this is the first attempt at event normalization in biomedical text mining. The experiments were conducted using the molecular interaction data from IntAct. The results demonstrate that the method can achieve an F-score of 0.87 in normalizing event-containing statements.
Availability and implementation: The source code is available at the GitLab repository https://gitlab.com/BioAI/leen and BioASQvec Plus is available on figshare at https://figshare.com/s/45896c31d10c3f6d857a.
23 show abstract
1367-4803 * 1460-2059 * 33191382

Abstract
Motivation: Allelic imbalance (AI), i.e. the unequal expression of the alleles of the same gene in a single cell, affects a subset of genes in diploid organisms. One prominent example of AI is parental genomic imprinting, which results in parent-of-origin-dependent, mono-allelic expression of a limited number of genes in metatherian and eutherian mammals and in angiosperms. Currently available methods for identifying AI rely on data modeling and come with the associated limitations.
Results: We have designed ISoLDE (Integrative Statistics of alleLe Dependent Expression), a novel nonparametric statistical method that takes into account both AI and the characteristics of RNA-seq data to infer allelic expression bias when at least two biological replicates are available for reciprocal crosses. ISoLDE learns the distribution of a specific test statistic from the data and calls genes ‘allelically imbalanced’, ‘bi-allelically expressed’ or ‘undetermined’. Depending on the number of replicates, predefined thresholds or permutations are used to make calls. We benchmarked ISoLDE against published methods, and showed that ISoLDE compared favorably with respect to sensitivity, specificity and robustness to the number of replicates. Using ISoLDE on different RNA-seq datasets generated from hybrid mouse tissues, we did not discover novel imprinted genes (IGs), confirming the most conservative estimations of IG number.
Availability and implementation: ISoLDE has been implemented as a Bioconductor package available at http://bioconductor.org/packages/ISoLDE/.
Supplementary information: Supplementary data are available at Bioinformatics online.
24 show abstract
1367-4803 * 1460-2059 * 33191383

Abstract
Motivation: The circulating recombinant form of HIV-1 CRF02-AG is the most frequent non-B subtype in Europe. Anti-HIV therapy and pathophysiological studies on the impact of HIV-1 tropism require genotypic determination of HIV-1 tropism for non-B subtypes. However, genotypic approaches based on analysis of the V3 envelope region perform poorly when used to determine the tropism of CRF02-AG. We therefore designed an algorithm based on information from the gp120 and gp41 ectodomain that better predicts the tropism of HIV-1 subtype CRF02-AG.
Results: We used a bio-statistical method to identify the genotypic determinants of CRF02-AG coreceptor use. The Toulouse HIV Extended Tropism Algorithm (THETA), based on a Least Absolute Shrinkage and Selection Operator method, uses HIV envelope sequences from phenotypically characterized clones. Prediction of R5X4/X4 viruses was 86% sensitive and that of R5 viruses was 89% specific with our model. The overall accuracy of THETA was 88%, making it sufficiently reliable for predicting the tropism of subtype CRF02-AG sequences.
Availability and implementation: Binaries are freely available for download at https://github.com/viro-tls/THETA. THETA was implemented in Matlab and is supported on the MS Windows platform. The sequence data used in this work are available from GenBank under the accession numbers MK618182-MK618417.
25 show abstract
1367-4803 * 1460-2059 * 33191384

Abstract
Motivation: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes.
Results: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows–Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.
Availability and implementation: Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2.
Supplementary information: Supplementary data are available at Bioinformatics online.
26 show abstract
1367-4803 * 1460-2059 * 33191385

Abstract
Motivation: Metagenomics studies microbial genomes in an ecosystem such as the gastrointestinal tract of a human. Identification of novel microbial species and quantification of their distributional variations among different samples that are sequenced using next-generation-sequencing technology hold the key to the success of most metagenomic studies. To achieve these goals, we propose a simple yet powerful metagenomic binning method, MetaBMF. The method does not require prior knowledge of reference genomes and produces highly accurate results, even at a strain level. Thus, it can be broadly used to identify disease-related microbial organisms that are not well-studied.
Results: Mathematically, we count the number of mapped reads on each assembled genomic fragment across different samples as our input matrix and propose a scalable stratified angle regression algorithm to factorize this count matrix into a product of a binary matrix and a nonnegative matrix. The binary matrix can be used to separate microbial species and the nonnegative matrix quantifies the species distributions in different samples. In simulation and empirical studies, we demonstrate that MetaBMF has a high binning accuracy. It can bin DNA fragments accurately not only at a species level but also at a strain level. As shown in our example, we can accurately identify the Shiga-toxigenic Escherichia coli O104:H4 strain which led to the 2011 German E. coli outbreak. Our efforts in these areas should lead to (i) fundamental advances in metagenomic binning, (ii) development and refinement of technology for the rapid identification and quantification of microbial distributions and (iii) finding of potential probiotics or reliable pathogenic bacterial strains.
Availability and implementation: The software is available at https://github.com/didi10384/MetaBMF.
27 show abstract
1367-4803 * 1460-2059 * 33191386

Abstract
Motivation: Advances in experimental and imaging techniques have allowed for unprecedented insights into the dynamical processes within individual cells. However, many facets of intracellular dynamics remain hidden, or can be measured only indirectly. This makes it challenging to reconstruct the regulatory networks that govern the biochemical processes underlying various cell functions. Current estimation techniques for inferring reaction rates frequently rely on marginalization over unobserved processes and states. Even in simple systems this approach can be computationally challenging, and can lead to large uncertainties and lack of robustness in parameter estimates. Therefore, we will require alternative approaches to efficiently uncover the interactions in complex biochemical networks.
Results: We propose a Bayesian inference framework based on replacing uninteresting or unobserved reactions with time delays. Although the resulting models are non-Markovian, recent results on stochastic systems with random delays allow us to rigorously obtain expressions for the likelihoods of model parameters. In turn, this allows us to extend MCMC methods to efficiently estimate reaction rates, and delay distribution parameters, from single-cell assays. We illustrate the advantages, and potential pitfalls, of the approach using a birth–death model with both synthetic and experimental data, and show that we can robustly infer model parameters using a relatively small number of measurements. We demonstrate how to do so even when only the relative molecule count within the cell is measured, as in the case of fluorescence microscopy.
Availability and implementation: Accompanying code in R is available at https://github.com/cbskust/DDE_BD.
Supplementary information: Supplementary data are available at Bioinformatics online.
28 show abstract
1367-4803 * 1460-2059 * 33191387

Abstract
Motivation: Quaternary structure determination for transmembrane/soluble proteins requires a reliable computational protocol that leverages observed distance restraints and/or cyclic symmetry (Cn symmetry) found in most homo-oligomeric transmembrane proteins.
Results: We survey 118 X-ray crystallographically solved structures of homo-oligomeric transmembrane proteins (HoTPs) and find that ∼97% are Cn symmetric. Given the prevalence of Cn symmetric HoTPs and the benefits of incorporating geometry restraints in aiding quaternary structure determination, we introduce two new filters, the distance-restraints (DR) and the Symmetry-Imposed Packing (SIP) filters. SIP relies on a new method that can rebuild the closest ideal Cn symmetric complex from docking poses containing a homo-dimer without prior knowledge of the number (n) of monomers. Using only the geometrical filter, SIP, near-native poses of 7 HoTPs in their monomeric states can be correctly identified in the top-10 for 71% of all cases, or 29% among 31 HoTP structures obtained through homology modeling, while ZDOCK alone returns 14 and 3%, respectively. When n is given, the optional n-mer filter is applied with SIP and returns the near-native poses for 76% of the test set within the top-10, outperforming M-ZDOCK’s 55% and Sam’s 47%. While applying only SIP to three HoTPs that come with distance restraints, we found the near-native poses were ranked 1st, 1st and 10th among 54 000 possible decoys. The results are further improved to 1st, 1st and 3rd when both DR and SIP filters are used. By applying only DR, a soluble system with distance restraints is recovered at the 1st-ranked pose.
Availability and implementation: https://github.com/capslockwizard/drsip.
Supplementary information: Supplementary data are available at Bioinformatics online.
29 show abstract
1367-4803 * 1460-2059 * 33191388

Abstract
Motivation: Mechanistic models of biochemical reaction networks facilitate the quantitative understanding of biological processes and the integration of heterogeneous datasets. However, some biological processes require the consideration of comprehensive reaction networks and therefore large-scale models. Parameter estimation for such models poses great challenges, in particular when the data are on a relative scale.
Results: Here, we propose a novel hierarchical approach combining (i) the efficient analytic evaluation of optimal scaling, offset and error model parameters with (ii) the scalable evaluation of objective function gradients using adjoint sensitivity analysis. We evaluate the properties of the methods by parameterizing a pan-cancer ordinary differential equation model (>1000 state variables, >4000 parameters) using relative protein, phosphoprotein and viability measurements. The hierarchical formulation improves optimizer performance considerably. Furthermore, we show that this approach allows estimating error model parameters with negligible computational overhead when no experimental estimates are available, providing an unbiased way to weight heterogeneous data. Overall, our hierarchical formulation is applicable to a wide range of models, and allows for the efficient parameterization of large-scale models based on heterogeneous relative measurements.
Availability and implementation: Supplementary code and data are available online at http://doi.org/10.5281/zenodo.3254429 and http://doi.org/10.5281/zenodo.3254441.
Supplementary information: Supplementary data are available at Bioinformatics online.
30 show abstract
1367-4803 * 1460-2059 * 33191389

Abstract
Motivation: Recent advances in biomedical research have made massive amounts of transcriptomic data available in public repositories from different sources. Due to the heterogeneity present in the individual experiments, identifying reproducible biomarkers for a given disease from multiple independent studies has become a major challenge. The widely used meta-analysis approaches, such as Fisher’s method, Stouffer’s method, minP and maxP, have at least two major limitations: (i) they are sensitive to outliers, and (ii) they perform only one statistical test for each individual study, and hence do not fully utilize the potential sample size to gain statistical power.
Results: Here, we propose a gene-level meta-analysis framework that overcomes these limitations and identifies a gene signature that is reliable and reproducible across multiple independent studies of a given disease. The approach provides a comprehensive global signature that can be used to understand the underlying biological phenomena, and a smaller test signature that can be used to classify future samples of a given disease. We demonstrate the utility of the framework by constructing disease signatures for influenza and Alzheimer’s disease using nine datasets including 1108 individuals. These signatures are then validated on 12 independent datasets including 912 individuals. The results indicate that the proposed approach performs better than the majority of the existing meta-analysis approaches in terms of both sensitivity and specificity. The proposed signatures could be further used in diagnosis, prognosis and identification of therapeutic targets.
Supplementary information: Supplementary data are available at Bioinformatics online.
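As background for the baseline methods named in this abstract, the sketch below shows Fisher's method for combining independent p-values, using only the Python standard library. The statistic is −2 Σ ln p, which follows a chi-square distribution with 2k degrees of freedom under the null; for even degrees of freedom the tail probability has a closed form. The function name `fisher_combined_p` and the example inputs are assumptions for illustration, and the sketch does not implement the framework proposed in the abstract.

```python
import math


def fisher_combined_p(p_values):
    """Fisher's method for k independent p-values in (0, 1] (sketch).

    Returns (chi-square statistic, combined p-value).
    """
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # Chi-square survival function with 2k degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total


# Hypothetical inputs: three unremarkable studies plus one extreme study.
print(fisher_combined_p([0.5, 0.5, 0.5]))
print(fisher_combined_p([1e-10, 0.5, 0.5, 0.5]))
```

The second call illustrates the outlier sensitivity mentioned in the abstract: a single very small p-value is enough to make the combined result appear highly significant, even though the other studies show no signal.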
31 show abstract
1367-4803 * 1460-2059 * 33191390

Abstract
Motivation: Recent microbiome association studies have revealed important associations between microbiome and disease/health status. Such findings encourage scientists to dive deeper to uncover the causal role of the microbiome in the underlying biological mechanism, and have led to applying statistical models to quantify causal microbiome effects and to identify the specific microbial agents. However, there are no existing causal mediation methods specifically designed to handle high dimensional and compositional microbiome data.
Results: We propose a rigorous Sparse Microbial Causal Mediation Model (SparseMCMM) specifically designed for the high dimensional and compositional microbiome data in a typical three-factor (treatment, microbiome and outcome) causal study design. In particular, a linear log-contrast regression model and a Dirichlet regression model are proposed to estimate the causal direct effect of treatment and the causal mediation effects of microbiome at both the community and individual taxon levels. Regularization techniques are used to perform the variable selection in the proposed model framework to identify signature causal microbes. Two hypothesis tests on the overall mediation effect are proposed and their statistical significance is estimated by permutation procedures. Extensive simulated scenarios show that SparseMCMM has excellent performance in estimation and hypothesis testing. Finally, we showcase the utility of the proposed SparseMCMM method in a study in which the murine microbiome was manipulated, providing a clear and sensible causal path among antibiotic treatment, microbiome composition and mouse weight.
Availability and implementation: https://sites.google.com/site/huilinli09/software and https://github.com/chanw0/SparseMCMM.
Supplementary information: Supplementary data are available at Bioinformatics online.
32 show abstract
1367-4803 * 1460-2059 * 33191391

Abstract
Motivation: Sequence alignment remains fundamental in bioinformatics. Pair-wise alignment is traditionally based on ad hoc scores for substitutions, insertions and deletions, but can also be based on probability models (pair hidden Markov models: PHMMs). PHMMs enable us to: fit the parameters to each kind of data, calculate the reliability of alignment parts and measure sequence similarity integrated over possible alignments.
Results: This study shows how multiple models correspond to one set of scores. Scores can be converted to probabilities by partition functions with a ‘temperature’ parameter: for any temperature, this corresponds to some PHMM. There is a special class of models with balanced length probability, i.e. no bias toward either longer or shorter alignments. The best way to score alignments and assess their significance depends on the aim: judging whether whole sequences are related versus finding related parts. This clarifies the statistical basis of sequence alignment.
Supplementary information: Supplementary data are available at Bioinformatics online.
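The idea of converting scores to probabilities with a partition function and a temperature can be illustrated with a toy Python sketch: given the scores of a small, explicitly enumerated set of candidate alignments, each alignment receives a probability proportional to exp(score / T). This is only a simplified illustration under that assumption; real tools sum over all alignments with dynamic programming rather than enumerating them, and the function name `alignment_probabilities` and the example scores are hypothetical.

```python
import math


def alignment_probabilities(scores, temperature):
    """Boltzmann-style probabilities for candidate alignment scores (sketch).

    scores:      scores of a few candidate alignments (toy enumeration)
    temperature: positive scaling parameter T
    """
    # Subtract the maximum score for numerical stability; this does not
    # change the resulting probabilities.
    top = max(scores)
    weights = [math.exp((s - top) / temperature) for s in scores]
    partition = sum(weights)  # partition function (up to a constant factor)
    return [w / partition for w in weights]


# Hypothetical scores for three alternative alignments of the same pair.
scores = [12.0, 10.0, 4.0]
print(alignment_probabilities(scores, temperature=1.0))
print(alignment_probabilities(scores, temperature=5.0))
```

Lower temperatures concentrate probability on the best-scoring alignment, while higher temperatures spread it across alternatives, which is why the same set of scores can correspond to more than one probability model.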
33 show abstract
1367-4803 * 1460-2059 * 33191392

Abstract
Motivation: G-quadruplexes (G4s) are non-canonical nucleic acid conformations that are widespread in all kingdoms of life and are emerging as important regulators both in RNA and DNA. Recently, two new higher-order architectures have been reported: adjacent interacting G4s and G4s with stable long loops forming stem-loop structures. As there are no specialized tools to identify these conformations, we developed QPARSE.
Results: QPARSE can exhaustively search for degenerate potential quadruplex-forming sequences (PQSs) containing bulges and/or mismatches at genomic level, as well as either multimeric or long-looped PQS (MPQS and LLPQS, respectively). While its assessment versus known reference datasets is comparable with the state-of-the-art, what is more interesting is its performance in the identification of MPQS and LLPQS that present algorithms are not designed to search for. We report a comprehensive analysis of MPQS in human gene promoters and the analysis of LLPQS on three experimentally validated case studies from HIV-1, BCL2 and hTERT.
Availability and implementation: QPARSE is freely accessible on the web at http://www.medcomp.medicina.unipd.it/qparse/index or downloadable from GitHub as a Python 2.7 program at https://github.com/B3rse/qparse.
Supplementary information: Supplementary data are available at Bioinformatics online.
34 show abstract
1367-4803 * 1460-2059 * 33191393

Abstract
Motivation: Interactions among cis-regulatory elements such as enhancers and promoters are main driving forces shaping context-specific chromatin structure and gene expression. Although there have been computational methods for predicting gene expression from genomic and epigenomic information, most of them neglect long-range enhancer–promoter interactions, due to the difficulty in precisely linking regulatory enhancers to target genes. Recently, HiChIP, a novel high-throughput experimental approach, has generated comprehensive data on high-resolution interactions between promoters and distal enhancers. Moreover, plenty of studies suggest that deep learning achieves state-of-the-art performance in epigenomic signal prediction, thus promoting the understanding of regulatory elements. In consideration of these two factors, we integrate proximal promoter sequences and HiChIP distal enhancer–promoter interactions to accurately predict gene expression.
Results: We propose DeepExpression, a densely connected convolutional neural network, to predict gene expression using both promoter sequences and enhancer–promoter interactions. We demonstrate that our model consistently outperforms baseline methods, not only in the classification of binary gene expression status but also in regression of continuous gene expression levels, in both cross-validation experiments and cross-cell line predictions. We show that the sequential promoter information is more informative than the experimental enhancer information; meanwhile, the enhancer–promoter interactions within ±100 kbp around the TSS of a gene are most beneficial. We finally visualize motifs in both promoter and enhancer regions and show the match of identified sequence signatures with known motifs. We expect to see a wide spectrum of applications using HiChIP data in deciphering the mechanism of gene regulation.
Availability and implementation: DeepExpression is freely available at https://github.com/wanwenzeng/DeepExpression.
Supplementary information: Supplementary data are available at Bioinformatics online.
35 show abstract
1367-4803 * 1460-2059 * 33191394

Abstract
Motivation: Inferring gene regulatory networks from gene expression time series data is important for gaining insights into the complex processes of cell life. A popular approach is to infer Boolean networks. However, it is still a pressing open problem to infer accurate Boolean networks from experimental data that are typically short and noisy.
Results: To address the problem, we propose a Boolean network inference algorithm which is able to infer accurate Boolean network topology and dynamics from short and noisy time series data. The main idea is that, for each target gene, we use an And/Or tree ensemble algorithm to select prime implicants of which each is a conjunction of a set of input genes. The selected prime implicants are important features for predicting the states of the target gene. Using these important features we then infer the Boolean function of the target gene. Finally, the Boolean functions of all target genes are combined as a Boolean network. Using the data generated from artificial and real-world gene regulatory networks, we show that our algorithm can infer more accurate Boolean network topology and dynamics from short and noisy time series data than other algorithms. Our algorithm enables us to gain better insights into complex regulatory mechanisms of cell life.
Availability and implementation: Package ATEN is freely available at https://github.com/ningshi/ATEN.
Supplementary information: Supplementary data are available at Bioinformatics online.
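To illustrate the representation described in this abstract, the Python sketch below encodes each target gene's Boolean function as a disjunction of implicants, where each implicant is a conjunction of input-gene literals, and then combines the functions into a network that is updated synchronously. The inference step itself (the And/Or tree ensemble) is not shown; the data layout, the function names and the three-gene toy network are assumptions for illustration and are not taken from the ATEN package.

```python
def evaluate_dnf(implicants, state):
    """Evaluate a Boolean function given as an OR of AND-terms.

    implicants: list of dicts mapping gene -> required value (True/False);
                each dict is one conjunction of literals.
    state:      dict mapping gene -> current Boolean value.
    """
    return any(
        all(state[gene] == value for gene, value in implicant.items())
        for implicant in implicants
    )


def step(network, state):
    """Synchronous update: compute every gene's next state from `state`."""
    return {gene: evaluate_dnf(implicants, state) for gene, implicants in network.items()}


# Hypothetical three-gene network.
toy_network = {
    "A": [{"B": True}],                # A(t+1) = B(t)
    "B": [{"A": True, "C": False}],    # B(t+1) = A(t) AND NOT C(t)
    "C": [{"A": True}, {"B": True}],   # C(t+1) = A(t) OR B(t)
}

state = {"A": True, "B": False, "C": False}
for _ in range(3):
    state = step(toy_network, state)
    print(state)
```

Keeping each function as a list of conjunctions makes the selected implicants directly readable as candidate regulatory rules for the target gene.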
36 show abstract
1367-4803 * 1460-2059 * 33191395

Abstract
Motivation: The nonsynonymous/synonymous substitution rate ratio (dN/dS) is a commonly used parameter to quantify molecular adaptation in protein-coding data. It is known that the estimation of dN/dS can be biased if some evolutionary processes are ignored. In this regard, common ML methods to estimate dN/dS assume invariable codon frequencies among sites, although this characteristic is rare in nature, and the assumption could bias the estimation of this parameter.
Results: Here we studied the influence of variable codon frequencies among genetic regions on the estimation of dN/dS. We explored scenarios varying the number of genetic regions that differ in codon frequencies, the amount of variability of codon frequencies among regions and the nucleotide frequencies at each codon position among regions. We found that ignoring heterogeneous codon frequencies among regions overall leads to underestimation of dN/dS and the bias increases with the level of heterogeneity of codon frequencies. Interestingly, we also found that varying nucleotide frequencies among regions at the first or second codon position leads to underestimation of dN/dS while variation at the third codon position leads to overestimation of dN/dS. Next, we present a methodology to reduce this bias based on the analysis of partitions presenting similar codon frequencies and we applied it to analyze four real datasets. We conclude that accounting for heterogeneous codon frequencies along sequences is required to obtain realistic estimates of molecular adaptation through this relevant evolutionary parameter.
Availability and implementation: The applied frameworks for the computer simulations of protein-coding data and estimation of molecular adaptation are SGWE and PAML, respectively. Both are publicly available and referenced in the study.
Supplementary information: Supplementary data are available at Bioinformatics online.
37 show abstract
1367-4803 * 1460-2059 * 33191396

Abstract
Motivation: The rapid improvement of phenotyping capability, accuracy and throughput has greatly increased the volume and diversity of phenomics data. A remaining challenge is an efficient way to identify phenotypic patterns to improve our understanding of the quantitative variation of complex phenotypes, and to attribute gene functions. To address this challenge, we developed a new algorithm to identify emerging phenomena from large-scale temporal plant phenotyping experiments. An emerging phenomenon is defined as a group of genotypes that exhibit a coherent phenotype pattern during a relatively short time. Emerging phenomena are highly transient and diverse, and are dependent in complex ways on both environmental conditions and development. Identifying emerging phenomena may help biologists to examine potential relationships among phenotypes and genotypes in a genetically diverse population and to associate such relationships with the change of environments or development.
Results: We present an emerging phenomenon identification tool called Temporal Emerging Phenomenon Finder (TEP-Finder). Using large-scale longitudinal phenomics data as input, TEP-Finder first encodes the complicated phenotypic patterns into a dynamic phenotype network. Then, emerging phenomena in different temporal scales are identified from the dynamic phenotype network using a maximal clique based approach. Meanwhile, a directed acyclic network of emerging phenomena is composed to model the relationships among the emerging phenomena. The experiment that compares TEP-Finder with two state-of-the-art algorithms shows that the emerging phenomena identified by TEP-Finder are more functionally specific, robust and biologically significant.
Availability and implementation: The source code, manual and sample data of TEP-Finder are all available at: http://phenomics.uky.edu/TEP-Finder/.
Supplementary information: Supplementary data are available at Bioinformatics online.
38 show abstract
1367-4803 * 1460-2059 * 33191397

Abstract
Motivation
Recent studies have shown that DNA N6-methyladenine (6mA) plays an important role in epigenetic modification of eukaryotic organisms. 6mA has been found to be closely related to embryonic development, stress response and other processes. Developing a new algorithm to quickly and accurately identify 6mA sites in genomes is important for exploring their biological functions.

Results
In this paper, we propose a new classification method called MM-6mAPred, based on a Markov model, which makes use of the transition probability between adjacent nucleotides to identify 6mA sites. The sensitivity and specificity of our method are 89.32% and 90.11%, respectively. The overall accuracy of our method is 89.72%, which is 6.59% higher than that of the previous method i6mA-Pred. This indicates that, compared with the 41 nucleotide chemical properties used by i6mA-Pred, the transition probability between adjacent nucleotides can capture more discriminant sequence information.

Availability and implementation
The web server of MM-6mAPred is freely accessible at http://www.insect-genome.com/MM-6mAPred/

Supplementary information
Supplementary data are available at Bioinformatics online.
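
The idea of classifying sequences by transition probabilities between adjacent nucleotides can be sketched as follows. This is a simplified illustration, not the MM-6mAPred method itself: it assumes a single homogeneous first-order Markov chain per class, add-one smoothing, sequences over A/C/G/T only, and a plain log-likelihood comparison. The function names and the toy training sequences are hypothetical.

```python
import math
from typing import Dict, List

ALPHABET = "ACGT"
Model = Dict[str, Dict[str, float]]


def train(seqs: List[str], pseudocount: float = 1.0) -> Model:
    """Estimate P(next nucleotide | current nucleotide) from training sequences."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    model: Model = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in ALPHABET}
    return model


def log_likelihood(seq: str, model: Model) -> float:
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(model[a][b]) for a, b in zip(seq, seq[1:]))


def predict(seq: str, pos_model: Model, neg_model: Model) -> bool:
    """Label a sequence positive if the positive-class model explains it better."""
    return log_likelihood(seq, pos_model) > log_likelihood(seq, neg_model)


# Hypothetical toy training data (assumption, not real 6mA sites).
pos_model = train(["AGAGAG", "GAGAGA"])
neg_model = train(["ACACAC", "CACACA"])

print(predict("AGAGA", pos_model, neg_model))
print(predict("CACAC", pos_model, neg_model))
```

Each row of a trained model sums to 1, so the two class models are directly comparable on the same sequence.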
1367-4803 * 1460-2059 * 33191398

Abstract
Motivation
Non-small-cell lung carcinoma (NSCLC) mainly consists of two subtypes: lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD). It has been reported that the genetic and epigenetic profiles vary strikingly between LUAD and LUSC in the process of tumorigenesis and development. Efficient and precise treatment is possible if the subtypes can be identified correctly. Identification of discriminative expression signatures has recently been explored to aid the classification of NSCLC subtypes.

Results
In this study, we designed a classification model integrating both mRNA and long non-coding RNA (lncRNA) expression data to effectively classify the subtypes of NSCLC. A gene selection algorithm, named WGRFE, was proposed to identify the most discriminative gene signatures within the recursive feature elimination (RFE) framework. GeneRank scores, which consider both expression level and correlation, and the importance generated by classifiers were all taken into account to improve the selection performance. Moreover, a module-based initial filtering of the genes was performed to reduce the computation cost of RFE. We validated the proposed algorithm on The Cancer Genome Atlas (TCGA) dataset. The results demonstrate that the developed approach identified a small number of expression signatures for accurate subtype classification. In particular, we show for the first time the potential role of lncRNA in building computational NSCLC subtype classification models.

Availability and implementation
The R implementation for the proposed approach is available at https://github.com/RanSuLab/NSCLC-subtype-classification.
1367-4803 * 1460-2059 * 33191399

Abstract
Motivation
Gene regulatory networks describe the regulatory relationships among genes, and developing methods for reverse engineering these networks is an ongoing challenge in computational biology. The majority of the initially proposed methods for gene regulatory network discovery create a network of genes and then mine it in order to uncover previously unknown regulatory processes. More recent approaches have focused on inferring modules of co-regulated genes, linking these modules with regulatory genes and then mining them to discover new molecular biology.

Results
In this work we analyze module-based network approaches to build gene regulatory networks, and compare their performance to single gene network approaches. In the process, we propose a novel approach to estimate gene regulatory networks drawing from the module-based methods. We show that generating modules of co-expressed genes which are predicted by a sparse set of regulators using a variational Bayes method, and then building a bipartite graph on the generated modules using sparse regression, yields more informative networks than previous single and module-based network approaches as measured by: (i) the rate of enriched gene sets, (ii) a network topology assessment, (iii) ChIP-Seq evidence and (iv) the KnowEnG Knowledge Network collection of previously characterized gene-gene interactions.

Availability and implementation
The code is written in R and can be downloaded from https://github.com/mikelhernaez/linker.

Supplementary information
Supplementary data are available at Bioinformatics online.
1367-4803 * 1460-2059 * 33191400

Abstract
Motivation
Simple tandem repeats, microsatellites in particular, have regulatory functions, links to several diseases and applications in biotechnology. There is an immediate need for an accurate tool for detecting microsatellites in newly sequenced genomes. The currently available tools are either sensitive or specific but not both; some tools require adjusting parameters manually.

Results
We propose Look4TRs, the first application of self-supervised hidden Markov models to discovering microsatellites. Look4TRs calibrates itself automatically to the input genomes, balancing high sensitivity and a low false positive rate. We evaluated Look4TRs on 26 eukaryotic genomes. Based on F measure, which combines sensitivity and false positive rate, Look4TRs outperformed TRF and MISA, the most widely used tools, by 78% and 84%, respectively. Look4TRs outperformed the second and the third best tools, MsDetector and Tantan, by 17% and 34%, respectively. On eight bacterial genomes, Look4TRs outperformed the second and the third best tools by 27% and 137%, respectively.

Availability and implementation
https://github.com/TulsaBioinformaticsToolsmith/Look4TRs.

Supplementary information
Supplementary data are available at Bioinformatics online.
1367-4803 * 1460-2059 * 33191401

Abstract
Motivation
Protein structure refinement is an important step of protein structure prediction. Existing approaches have generally used a single scoring function combined with a Monte Carlo method or a Molecular Dynamics algorithm. Without a constraint, the one-dimensional optimization of a single energy function may take the structure too far from its starting point. The basic motivation of our study is to reduce the bias caused by minimizing only a single energy function, given the great diversity of protein structures.

Results
We report a new Artificial Intelligence-based protein structure Refinement method called AIR. Its fundamental idea is to use multiple energy functions as multiple objectives in an effort to correct the potential inaccuracy of a single function. A structure refinement protocol based on a multi-objective particle swarm optimization algorithm is designed, in which each structure is considered a particle. The particles move around as the refinement iterations proceed. The quality of the particles in each iteration is evaluated by three energy functions, and the non-dominated particles are put into a set called the Pareto set. After enough iterations, particles from the Pareto set are screened and some of the top solutions are output as the final refined structures. The multi-objective energy function optimization strategy designed in the AIR protocol provides a different constraint view of the structure, by extending the one-dimensional optimization to a three-dimensional optimization space driven by the multi-objective particle swarm optimization engine. Experimental results on CASP11 and CASP12 refinement targets and blind tests in CASP13 turn out to be promising.

Availability and implementation
AIR is available online at: www.csbio.sjtu.edu.cn/bioinf/AIR/.

Supplementary information
Supplementary data are available at Bioinformatics online.
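
The selection of non-dominated particles into a Pareto set can be sketched in a few lines. This covers only the filtering step, not the particle swarm updates or the AIR protocol, and it assumes that each particle has already been scored by three energy functions where lower values are better. The score values and the names `dominates` and `pareto_set` are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_set(scores: List[Sequence[float]]) -> List[int]:
    """Return the indices of particles that no other particle dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical energies (three objectives per particle, lower is better).
scores = [
    (1.0, 5.0, 3.0),  # particle 0
    (2.0, 4.0, 3.0),  # particle 1: trades objective 1 for objective 2
    (2.0, 6.0, 4.0),  # particle 2: worse than particle 0 on every objective
]

print(pareto_set(scores))
```

Particles 0 and 1 are kept because neither is better on all three objectives, while particle 2 is discarded because particle 0 dominates it.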

Green Open Access

Sherpa/Romeo info

Author can archive pre-print (ie pre-refereeing)
Author can archive post-print (ie final draft post-refereeing)
Author cannot archive publisher's version/PDF
  • Pre-print can only be posted prior to acceptance
  • Pre-print must be accompanied by set statement (see link)
  • Pre-print must not be replaced with post-print, instead a link to published version with amended set statement should be made
  • Pre-print on author's personal website, employer website, free public server or pre-prints in subject area
  • Post-print on author's personal website immediately
  • Post-print in Institutional repositories or Central repositories after 12 months embargo
  • Publisher's version/PDF cannot be used
  • Published source must be acknowledged
  • Must link to publisher version
  • Set phrase to accompany archived copy (see policy)
  • The publisher will deposit in PubMed Central on behalf of NIH authors
  • Publisher last contacted on 19/02/2015



APC Discount

Researchers from EUR, OU, RU, RUG, UL, UM, UT, UU, UvA, TiU, VU and WUR will receive a 100% discount on the Article Processing Charges that need to be paid by a first or corresponding author to publish open access in this journal.

More information on this Oxford University Press deal.

This deal is valid until 2020-12-31.
NB: APC discount can only be claimed for articles submitted after 2019-01-01



Last updated: 2020-01-06

Impact


Scopus Journal Metrics (2017)

SJR: 6.140
SNIP: 2.520
Impact (Scopus CiteScore): 0.784
Quartile: Q1
CiteScore percentile: 98%
CiteScore rank: 3 out of 187
Cited by WUR staff: 1847 times. (2016-2018)

Similar journals  

  • Nucleic acids research
  • Proceedings of the national academy of sciences o...
  • Nature
  • Plos one
  • Science (new york, n.y.)
