ABSTRACT
MOTIVATION: Including ion mobility separation (IMS) in mass spectrometry proteomics experiments is useful to improve coverage and throughput. Many IMS devices enable linking the experimentally derived mobility of an ion to its collisional cross-section (CCS), a highly reproducible physicochemical property dependent on the ion's mass, charge and conformation in the gas phase. Thus, known peptide ion mobilities can be used to tailor acquisition methods or to refine database search results. The large space of potential peptide sequences, driven also by posttranslational modifications of amino acids, motivates an in silico predictor for peptide CCS. Recent studies explored the general performance of various machine-learning techniques; however, workflow engineering was of secondary importance. For the sake of applicability, such a tool should be generic, data driven, and easily adaptable to individual workflows for experimental design and data processing. RESULTS: We created ionmob, a Python-based framework for data preparation, training, and prediction of collisional cross-section values of peptides. It is easily customizable and includes a set of pretrained, ready-to-use models and preprocessing routines for training and inference. Using a set of ≈21 000 unique phosphorylated peptides and ≈17 000 MHC ligand sequences and charge state pairs, we expand upon the space of peptides that can be integrated into CCS prediction. Lastly, we investigate the applicability of in silico predicted CCS to increase confidence in identified peptides by applying methods of re-scoring and demonstrate that predicted CCS values complement existing predictors for that task. AVAILABILITY AND IMPLEMENTATION: The Python package is available on GitHub: https://github.com/theGreatHerrLebert/ionmob.
Subjects
Machine Learning; Peptides; Peptides/chemistry; Mass Spectrometry/methods; Amino Acid Sequence; Proteomics/methods; Ions
ABSTRACT
Methods for the detection of m6A by RNA-Seq technologies are increasingly sought after. We here present NOseq, a method to detect m6A residues in defined amplicons by virtue of their resistance to chemical deamination, effected by nitrous acid. Partial deamination in NOseq affects all exocyclic amino groups present in nucleobases and thus also changes sequence information. The method uses a mapping algorithm specifically adapted to the sequence degeneration caused by deamination events. Thus, m6A sites with partial modification levels of ~50% were detected in defined amplicons, and this threshold can be lowered to ~10% by combination with m6A immunoprecipitation. NOseq faithfully detected known m6A sites in human rRNA, and the long non-coding RNA MALAT1, and positively validated several m6A candidate sites, drawn from miCLIP data with an m6A antibody, in the transcriptome of Drosophila melanogaster. Conceptually related to bisulfite sequencing, NOseq presents a novel amplicon-based sequencing approach for the validation of m6A sites in defined sequences.
Subjects
Adenosine/analogs & derivatives; High-Throughput Nucleotide Sequencing/methods; RNA/chemistry; Sequence Analysis, RNA/methods; Adenosine/analysis; Algorithms; Animals; Chromatography, Liquid; Deamination; Drosophila melanogaster/genetics; HEK293 Cells; HeLa Cells; Humans; RNA, Long Noncoding/chemistry; RNA, Messenger/chemistry; RNA, Ribosomal, 18S/chemistry; Sequence Alignment; Tandem Mass Spectrometry
ABSTRACT
Cancer therapy with clinically established anticancer drugs is frequently hampered by the development of drug resistance of tumors and severe side effects in normal organs and tissues. The demand for powerful, but less toxic, drugs is high. Phytochemicals represent an important reservoir for drug development and frequently exert less toxicity than synthetic drugs. Bioinformatics can accelerate and simplify the highly complex, time-consuming, and expensive drug development process. Here, we analyzed 375 phytochemicals using virtual screenings, molecular docking, and in silico toxicity predictions. Based on these in silico studies, six candidate compounds were further investigated in vitro. Resazurin assays were performed to determine the growth-inhibitory effects towards wild-type CCRF-CEM leukemia cells and their multidrug-resistant, P-glycoprotein (P-gp)-overexpressing subline, CEM/ADR5000. Flow cytometry was used to measure the potential to inhibit P-gp-mediated doxorubicin transport. Bidwillon A, neobavaisoflavone, coptisine, and z-guggulsterone all showed growth-inhibitory effects and moderate P-gp inhibition, whereas miltirone and chamazulene strongly inhibited tumor cell growth and strongly increased intracellular doxorubicin uptake. Bidwillon A and miltirone were selected for molecular docking to wild-type and mutated P-gp forms in closed and open conformations. The P-gp homology models harbored clinically relevant mutations, i.e., six single missense mutations (F336Y, A718C, Q725A, F728A, M949C, Y953C), three double mutations (Y310A-F728A; F343C-V982C; Y953A-F978A), or one quadruple mutation (Y307C-F728A-Y953A-F978A). The mutants did not show major differences in binding energies compared to the wild types. Closed P-gp forms generally showed higher binding affinities than open ones. Closed conformations might stabilize the binding, thereby leading to higher binding affinities, while open conformations may favor the release of compounds into the extracellular space. In conclusion, this study described the capability of selected phytochemicals to overcome multidrug resistance.
Subjects
Drug Resistance, Neoplasm; Neoplasms; Humans; Molecular Docking Simulation; Doxorubicin/pharmacology; Phytochemicals/pharmacology; ATP Binding Cassette Transporter, Subfamily B/genetics; ATP Binding Cassette Transporter, Subfamily B/metabolism; Cell Line, Tumor
ABSTRACT
During the past three decades, humans have been confronted with different new coronavirus outbreaks. Since the end of 2019, COVID-19 has threatened the world as a rapidly spreading infectious disease. For this work, we targeted the non-structural protein 16 (nsp16) as a key protein of SARS-CoV-2, SARS-CoV-1 and MERS-CoV to develop broad-spectrum inhibitors of nsp16. Computational methods were used to filter candidates from a natural product-based library of 224,205 compounds obtained from the ZINC database. The binding of the candidates to nsp16 was assessed using virtual screening with VINA LC, and molecular docking with AutoDock 4.2.6. The top 9 compounds were bound to the nsp16 protein of SARS-CoV-2, SARS-CoV-1, and MERS-CoV with the lowest binding energies (LBEs) in the range of -9.0 to -13.0 kcal/mol with VINA LC. The AutoDock-based LBEs for nsp16 of SARS-CoV-2 ranged from -11.42 to -16.11 kcal/mol with predicted inhibition constants (pKi) from 0.002 to 4.51 nM; the natural substrate S-adenosyl methionine (SAM) was used as a control. In silico results were verified by microscale thermophoresis as an in vitro assay. The candidates were investigated further for their cytotoxicity in normal MRC-5 lung fibroblasts to determine their therapeutic indices. Here, the IC50 values of all three compounds were >10 µM. In summary, we identified three novel SARS-CoV-2 inhibitors, two of which showed broad-spectrum activity to nsp16 in SARS-CoV-2, SARS-CoV-1, and MERS-CoV. All three compounds are coumarin derivatives that contain chromen-2-one in their scaffolds.
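The link between a docking binding energy and a predicted inhibition constant can be illustrated with the standard thermodynamic relation Ki = exp(ΔG / (R·T)). The sketch below is only an illustration: the temperature of 298.15 K is an assumption (the abstract does not state one), and the function name `predicted_ki_nanomolar` is hypothetical rather than part of any docking package.

```python
import math

# Gas constant in kcal/(mol*K).
GAS_CONSTANT_KCAL = 1.98719e-3


def predicted_ki_nanomolar(delta_g_kcal_per_mol, temperature_k=298.15):
    """Convert a binding free energy (kcal/mol) into an inhibition constant in nM.

    Uses Ki = exp(dG / (R * T)); the default temperature is an assumption.
    """
    ki_molar = math.exp(delta_g_kcal_per_mol / (GAS_CONSTANT_KCAL * temperature_k))
    return ki_molar * 1e9  # mol/L -> nmol/L


# The two ends of the binding-energy range reported in the abstract.
for energy in (-11.42, -16.11):
    print(energy, "kcal/mol ->", predicted_ki_nanomolar(energy), "nM")
```

Under these assumptions, the two ends of the reported energy range map to values in the low-nanomolar and low-picomolar range, which is the same order of magnitude as the constants quoted above.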
Subjects
COVID-19; Middle East Respiratory Syndrome Coronavirus; Humans; SARS-CoV-2; Molecular Docking Simulation; S-Adenosylmethionine
ABSTRACT
BACKGROUND: Mass spectrometry is an important experimental technique in the field of proteomics. However, analysis of certain mass spectrometry data faces a combination of two challenges: first, even a single experiment produces a large amount of multi-dimensional raw data and, second, signals of interest are not single peaks but patterns of peaks that span the different dimensions. The rapidly growing amount of mass spectrometry data increases the demand for scalable solutions. Furthermore, existing approaches for signal detection usually rely on strong assumptions concerning the signals' properties. RESULTS: In this study, it is shown that locality-sensitive hashing enables signal classification in mass spectrometry raw data at scale. Through appropriate choice of algorithm parameters it is possible to balance false-positive and false-negative rates. On synthetic data, a superior performance compared to an intensity thresholding approach was achieved. Real data could be strongly reduced without losing relevant information. Our implementation scaled out up to 32 threads and supports acceleration by GPUs. CONCLUSIONS: Locality-sensitive hashing is a desirable approach for signal classification in mass spectrometry raw data. AVAILABILITY: Generated data and code are available at https://github.com/hildebrandtlab/mzBucket . Raw data is available at https://zenodo.org/record/5036526 .
Subjects
Algorithms; Software; Mass Spectrometry; Proteomics/methods
ABSTRACT
MOTIVATION: Error correction is a fundamental pre-processing step in many Next-Generation Sequencing (NGS) pipelines, in particular for de novo genome assembly. However, existing error correction methods either suffer from high false-positive rates since they break reads into independent k-mers or do not scale efficiently to large amounts of sequencing reads and complex genomes. RESULTS: We present CARE, an alignment-based scalable error correction algorithm for Illumina data using the concept of minhashing. Minhashing allows for efficient similarity search within large sequencing read collections, which enables fast computation of high-quality multiple alignments. Sequencing errors are corrected by detailed inspection of the corresponding alignments. Our performance evaluation shows that CARE generates significantly fewer false-positive corrections than state-of-the-art tools (Musket, SGA, BFC, Lighter, Bcool, Karect) while maintaining a competitive number of true positives. When used prior to assembly it can achieve superior de novo assembly results for a number of real datasets. CARE is also the first multiple sequence alignment-based error corrector that is able to process a human genome Illumina NGS dataset in only 4 h on a single workstation using GPU acceleration. AVAILABILITY AND IMPLEMENTATION: CARE is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CARE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
High-Throughput Nucleotide Sequencing; Software; Algorithms; Humans; Sequence Alignment; Sequence Analysis, DNA
ABSTRACT
Reverse transcription (RT) of RNA templates containing RNA modifications leads to synthesis of cDNA containing information on the modification in the form of misincorporation, arrest, or nucleotide skipping events. A compilation of such events from multiple cDNAs represents an RT-signature that is typical for a given modification, but, as we show here, depends also on the reverse transcriptase enzyme. A comparison of 13 different enzymes revealed a range of RT-signatures, with individual enzymes exhibiting average arrest rates between 20 and 75%, as well as average misincorporation rates between 30 and 75% in the read-through cDNA. Using RT-signatures from individual enzymes to train a random forest model as a machine learning regimen for prediction of modifications, we found strongly variegated success rates for the prediction of methylated purines, as exemplified with N1-methyladenosine (m1A). Among the 13 enzymes, a correlation was found between read length, misincorporation, and prediction success. Inversely, low average read length was correlated to high arrest rate and lower prediction success. The three most successful polymerases were then applied to the characterization of RT-signatures of other methylated purines. Guanosines featuring methyl groups on the Watson-Crick face were identified with high confidence, but discrimination between m1G and m22G was only partially successful. In summary, the results suggest that, given sufficient coverage and a set of specifically optimized reaction conditions for reverse transcription, all RNA modifications that impede Watson-Crick bonds can be distinguished by their RT-signature.
Subjects
RNA-Directed DNA Polymerase/metabolism; Reverse Transcription; Adenosine/analogs & derivatives; Guanosine/chemistry; Guanosine/metabolism; Machine Learning; Methylation; Oligoribonucleotides/chemistry; Transcriptome
ABSTRACT
BACKGROUND: All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires the comparison of sequence reads to large collections of reference genomes. The steadily increasing amount of available reference genomes establishes the need for efficient big data approaches. RESULTS: We introduce an alignment-free k-mer based method for detection and quantification of species composition in food and other complex biological matters. It is orders-of-magnitude faster than our previous alignment-based AFS pipeline. In comparison to the established tools CLARK, Kraken2, and Kraken2+Bracken it is superior in terms of false-positive rate and quantification accuracy. Furthermore, the usage of an efficient database partitioning scheme allows for the processing of massive collections of reference genomes with reduced memory requirements on a workstation (AFS-MetaCache) or on a Spark-based compute cluster (MetaCacheSpark). CONCLUSIONS: We present a fast yet accurate screening method for whole genome shotgun sequencing-based biosurveillance applications such as food testing. By relying on a big data approach it can scale efficiently towards large-scale collections of complex eukaryotic and bacterial reference genomes. AFS-MetaCache and MetaCacheSpark are suitable tools for broad-scale metagenomic screening applications. They are available at https://muellan.github.io/metacache/afs.html (C++ version for a workstation) and https://github.com/jmabuin/MetaCacheSpark (Spark version for big data clusters).
Subjects
Big Data; Food Analysis/methods; High-Throughput Nucleotide Sequencing/methods; Metagenomics/methods; Whole Genome Sequencing/methods; Biosurveillance; Genome, Bacterial; Metagenome; Microbiota/genetics; Software
ABSTRACT
MOTIVATION: Metagenomic shotgun sequencing studies are becoming increasingly popular, with prominent examples including the sequencing of human microbiomes and diverse environments. A fundamental computational problem in this context is read classification, i.e. the assignment of each read to a taxonomic label. Due to the large number of reads produced by modern high-throughput sequencing technologies and the rapidly increasing number of available reference genomes, corresponding software tools suffer from either long runtimes, large memory requirements or low accuracy. RESULTS: We introduce MetaCache, a novel software tool for read classification using the big data technique minhashing. Our approach performs context-aware classification of reads by computing representative subsamples of k-mers within both probed reads and locally constrained regions of the reference genomes. As a result, MetaCache consumes significantly less memory compared to the state-of-the-art read classifiers Kraken and CLARK while achieving highly competitive sensitivity and precision at comparable speed. For example, using NCBI RefSeq draft and completed genomes with a total length of around 140 billion bases as reference, MetaCache's database consumes only 62 GB of memory while both Kraken and CLARK fail to construct their respective databases on a workstation with 512 GB RAM. Our experimental results further show that classification accuracy continuously improves when increasing the amount of utilized reference genome data. AVAILABILITY AND IMPLEMENTATION: MetaCache is open source software written in C++ and can be downloaded at http://github.com/muellan/metacache. CONTACT: bertil.schmidt@uni-mainz.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
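The idea of subsampling k-mers by minhashing can be sketched in a few lines. The example below is a generic illustration, not MetaCache's implementation: it keeps the s smallest hash values of all k-mers in a sequence, and the hash function (BLAKE2b truncated to 64 bits), the parameter defaults and all function names are assumptions chosen for the example.

```python
import hashlib


def kmers(sequence, k):
    """Return the set of all substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(sequence, k=4, s=8):
    """Keep the s smallest k-mer hashes as a small, representative subsample."""
    return sorted(hash_kmer(kmer) for kmer in kmers(sequence, k))[:s]


def shared_features(sketch_a, sketch_b):
    """Count hash values present in both sketches."""
    return len(set(sketch_a) & set(sketch_b))


read = "ACGTTGCAACGTAGCT"
reference_window = "TTACGTTGCAACGTAGCTGG"
print(shared_features(sketch(read), sketch(reference_window)))
```

Because only s values per sequence or window are stored, the memory footprint is governed by the sketch size rather than by the number of k-mers, which is the property the abstract relies on.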
Subjects
Metagenomics/methods; Software; Algorithms; High-Throughput Nucleotide Sequencing; Humans; Sequence Analysis, DNA
ABSTRACT
Alternative splicing is an important mechanism in eukaryotes that expands the transcriptome and proteome significantly. It plays an important role in a number of biological processes. Understanding its regulation is hence an important challenge. Recently, increasing evidence has been collected that supports an involvement of intragenic DNA methylation in the regulation of alternative splicing. The exact mechanisms of regulation, however, are largely unknown, and speculated to be complex: different methylation profiles might exist, each of which could be associated with a different regulation mechanism. We present a computational technique that is able to determine such stable methylation patterns and allows these patterns to be correlated with the inclusion propensity of exons. Pattern detection is based on dynamic time warping (DTW) of methylation profiles, a sophisticated similarity measure for signals that can be non-trivially transformed. We design a flexible self-organizing map approach to pattern grouping. Exemplary application to available data sets indicates that stable patterns which correlate non-trivially with exon inclusion do indeed exist. To improve the reliability of these predictions, further studies on larger data sets will be required. We have thus taken great care that our software runs efficiently on modern hardware, so that it can support future studies on large-scale data sets.
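Dynamic time warping, the similarity measure named above, can be sketched with the textbook dynamic program. This is a minimal illustration on toy numbers, not the authors' software; the absolute difference as local cost and the example profiles are assumptions.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched to b[j-1] again
                cost[i][j - 1],      # b[j-1] is matched to a[i-1] again
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# Two toy methylation profiles with the same shape but different stretching.
profile_a = [0.1, 0.1, 0.8, 0.9, 0.2]
profile_b = [0.1, 0.8, 0.8, 0.9, 0.9, 0.2]
print(dtw_distance(profile_a, profile_b))
```

The two toy profiles differ only in how long each level persists, so the warping path can align them without any local mismatch, whereas a point-by-point comparison would not even be defined for sequences of different length.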
Subjects
Alternative Splicing; DNA Methylation; Epigenesis, Genetic; Software; Exons; Humans; Introns; RNA/genetics; RNA/metabolism; Transcriptome
ABSTRACT
The combination of Reverse Transcription (RT) and high-throughput sequencing has emerged as a powerful approach to detect modified nucleotides in RNA via analysis of either abortive RT-products or of the incorporation of mismatched dNTPs into cDNA. Here we simultaneously analyze both parameters in detail with respect to the occurrence of N1-methyladenosine (m1A) in the template RNA. This naturally occurring modification is associated with structural effects, but it is also known as a mediator of antibiotic resistance in ribosomal RNA. In structural probing experiments with dimethylsulfate, m1A is routinely detected by RT-arrest. A specifically developed RNA-Seq protocol was tailored to the simultaneous analysis of RT-arrest and misincorporation patterns. By application to a variety of native and synthetic RNA preparations, we found a characteristic signature of m1A, which, in addition to an arrest rate, features misincorporation as a significant component. Detailed analysis suggests that the signature depends on RNA structure and on the nature of the nucleotide 3' of m1A in the template RNA, meaning it is sequence dependent. The RT-signature of m1A was used for inspection and confirmation of suspected modification sites and resulted in the identification of hitherto unknown m1A residues in trypanosomal tRNA.
Subjects
Adenosine/analogs & derivatives; High-Throughput Nucleotide Sequencing; RNA/chemistry; Reverse Transcription; Sequence Analysis, RNA; Adenosine/analysis; Animals; Humans; Machine Learning; Mice; Sequence Homology, Nucleic Acid
ABSTRACT
BACKGROUND: Gene Set Enrichment Analysis (GSEA) is a popular method to reveal significant dependencies between predefined sets of gene symbols and observed phenotypes by evaluating the deviation of gene expression values between cases and controls. An established measure of inter-class deviation, the enrichment score, is usually computed using a weighted running sum statistic over the whole set of gene symbols. Due to the lack of analytic expressions the significance of enrichment scores is determined using a non-parametric estimation of their null distribution by permuting the phenotype labels of the probed patients. Accordingly, GSEA is a time-consuming task due to the large number of required permutations to accurately estimate the nominal p-value - a circumstance that is even more pronounced during multiple hypothesis testing since its estimate is lower-bounded by the inverse number of samples in permutation space. RESULTS: We present rapidGSEA - a software suite consisting of two tools for facilitating permutation-based GSEA: cudaGSEA and ompGSEA. cudaGSEA is a CUDA-accelerated tool using fine-grained parallelization schemes on massively parallel architectures while ompGSEA is a coarse-grained multi-threaded tool for multi-core CPUs. Nominal p-value estimation of 4,725 gene sets on a data set consisting of 20,639 unique gene symbols and 200 patients (183 cases + 17 controls) each probing one million permutations takes 19 hours on a Xeon CPU and less than one hour on a GeForce Titan X GPU while the established GSEA tool from the Broad Institute (broadGSEA) takes roughly 13 days. CONCLUSION: cudaGSEA outperforms broadGSEA by around two orders-of-magnitude on a single Tesla K40c or GeForce Titan X GPU. ompGSEA provides around one order-of-magnitude speedup to broadGSEA on a standard Xeon CPU. The rapidGSEA suite is open-source software and can be downloaded at https://github.com/gravitino/cudaGSEA as standalone application or package for the R framework.
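The weighted running sum statistic behind the enrichment score can be sketched as follows. This is a plain, single-threaded illustration of the general GSEA-style statistic, not code from rapidGSEA; the function name, the weighting exponent `p` and the toy data are assumptions.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Weighted running-sum enrichment score over a ranked gene list.

    ranked_genes: gene symbols sorted by decreasing ranking metric
    scores:       ranking metric of each gene, in the same order
    gene_set:     set of gene symbols to test
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_weight_total = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    if hit_weight_total == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for s, h in zip(scores, hits):
        if h:
            running += abs(s) ** p / hit_weight_total  # step up on a set member
        else:
            running -= 1.0 / n_miss                    # step down otherwise
        if abs(running) > abs(best):
            best = running                             # maximum deviation from zero
    return best


genes = ["a", "b", "c", "d"]
metric = [2.0, 1.0, -1.0, -2.0]
print(enrichment_score(genes, metric, {"a", "b"}))
print(enrichment_score(genes, metric, {"c", "d"}))
```

In a permutation test this function would be re-evaluated once per permuted labelling, which is why the cost grows linearly with the number of permutations and why parallel implementations pay off.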
ABSTRACT
Cancer is a large class of diseases that are characterized by a common set of features, known as the Hallmarks of cancer. One of these hallmarks is the acquisition of genome instability and mutations. This, combined with high proliferation rates and failure of repair mechanisms, leads to clonal evolution as well as a high genotypic and phenotypic diversity within the tumor. As a consequence, treatment and therapy of malignant tumors is still a grand challenge. Moreover, under selective pressure, e.g., caused by chemotherapy, resistant subpopulations can emerge that then may lead to relapse. In order to minimize the risk of developing multidrug-resistant tumor cell populations, optimal (combination) therapies have to be determined on the basis of an in-depth characterization of the tumor's genetic and phenotypic makeup, a process that is an important aspect of stratified medicine and precision medicine. We present DrugTargetInspector (DTI), an interactive assistance tool for treatment stratification. DTI analyzes genomic, transcriptomic, and proteomic datasets and provides information on deregulated drug targets, enriched biological pathways, and deregulated subnetworks, as well as mutations and their potential effects on putative drug targets and genes of interest. To demonstrate DTI's broad scope of applicability, we present case studies on several cancer types and different types of input -omics data. DTI's integrative approach allows users to characterize the tumor under investigation based on various -omics datasets and to elucidate putative treatment options based on clinical decision guidelines, but also proposing additional points of intervention that might be neglected otherwise. DTI can be freely accessed at http://dti.bioinf.uni-sb.de.
Subjects
Decision Making, Computer-Assisted; Neoplasms/drug therapy; Patient Selection; Genomics/methods; Humans; Neoplasms/genetics
ABSTRACT
MOTIVATION: Web-based workflow systems have gained considerable momentum in sequence-oriented bioinformatics. In structural bioinformatics, however, such systems are still relatively rare; while commercial stand-alone workflow applications are common in the pharmaceutical industry, academic researchers often still rely on command-line scripting to glue individual tools together. RESULTS: In this work, we address the problem of building a web-based system for workflows in structural bioinformatics. For the underlying molecular modelling engine, we opted for the BALL framework because of its extensive and well-tested functionality in the field of structural bioinformatics. The large number of molecular data structures and algorithms implemented in BALL allows for elegant and sophisticated development of new approaches in the field. We hence connected the versatile BALL library and its visualization and editing front end BALLView with the Galaxy workflow framework. The result, which we call ballaxy, enables the user to simply and intuitively create sophisticated pipelines for applications in structure-based computational biology, integrated into a standard tool for molecular modelling. AVAILABILITY AND IMPLEMENTATION: ballaxy consists of three parts: some minor modifications to the Galaxy system, a collection of tools and an integration into the BALL framework and the BALLView application for molecular modelling. Modifications to Galaxy will be submitted to the Galaxy project, and the BALL and BALLView integrations will be integrated in the next major BALL release. After acceptance of the modifications into the Galaxy project, we will publish all ballaxy tools via the Galaxy toolshed. In the meantime, all three components are available from http://www.ball-project.org/ballaxy. Also, docker images for ballaxy are available at https://registry.hub.docker.com/u/anhi/ballaxy/dockerfile/. ballaxy is licensed under the terms of the GPL.
Subjects
Algorithms; Computational Biology/methods; Sequence Analysis, DNA/methods; Software; Humans; Models, Molecular; Systems Integration; User-Computer Interface; Workflow
ABSTRACT
Macromolecular oligomeric assemblies are involved in many biochemical processes of living organisms. The benefits of such assemblies in crowded cellular environments include increased reaction rates, efficient feedback regulation, cooperativity and protective functions. However, an atom-level structural determination of large assemblies is challenging due to the size of the complex and the difference in binding affinities of the involved proteins. In this study, we propose a novel combinatorial greedy algorithm for assembling large oligomeric complexes from information on the approximate position of interaction interfaces of pairs of monomers in the complex. Prior information on complex symmetry is not required; rather, the symmetry is inferred during assembly. We implement an efficient geometric score, the transformation match score, that bypasses the model ranking problems of state-of-the-art scoring functions by scoring the similarity between the inferred dimers of the same monomer simultaneously with different binding partners in a (sub)complex with a set of pregenerated docking poses. We compiled a diverse benchmark set of 308 homo- and heteromeric complexes containing 6 to 60 monomers. To explore the applicability of the method, we considered 48 sets of parameters and selected the three sets for which the algorithm correctly reconstructs the maximum number of complexes, namely 252 (81.8%), in at least one of the respective three runs. The crossvalidation coverage, that is, the mean fraction of correctly reconstructed benchmark complexes during crossvalidation, was 78.1%, which demonstrates the ability of the presented method to correctly reconstruct the topology of a large variety of biological complexes.
Subjects
Computational Biology/methods; Macromolecular Substances/chemistry; Macromolecular Substances/metabolism; Models, Molecular; Proteins/chemistry; Proteins/metabolism; Algorithms; Protein Binding; Protein Conformation; Software
ABSTRACT
MOTIVATION: The reasons for distortions from optimal α-helical geometry are largely unknown, but their influence on structural changes of proteins is significant. Hence, their prediction is a crucial problem in structural bioinformatics. Here, we present a new web server, called SKINK, for string kernel based kink prediction. Extending our previous study, we also annotate the most probable kink position in a given α-helix sequence. AVAILABILITY AND IMPLEMENTATION: The SKINK web server is freely accessible at http://biows-inf.zdv.uni-mainz.de/skink. Moreover, SKINK is a module of the BALL software, also freely available at www.ballview.org.
Subjects
Protein Structure, Secondary; Software; Computational Biology/methods; Internet; Proteins/chemistry; Sequence Analysis, Protein
ABSTRACT
The CellLineNavigator database, freely available at http://www.medicalgenomics.org/celllinenavigator, is a web-based workbench for large-scale comparisons of a large collection of diverse cell lines. It aims to support experimental design in the fields of genomics, systems biology and translational biomedical research. Currently, this compendium holds genome-wide expression profiles of 317 different cancer cell lines, categorized into 57 different pathological states and 28 individual tissues. To enlarge the scope of CellLineNavigator, the database was furthermore closely linked to commonly used bioinformatics databases and knowledge repositories. To ensure easy data access and searchability, a simple data interface and an intuitive querying interface were implemented. It allows the user to explore and filter gene expression, focusing on pathological or physiological conditions. For a more complex search, the advanced query interface may be used to query for (i) differentially expressed genes; (ii) pathological or physiological conditions; or (iii) gene names or functional attributes, such as Kyoto Encyclopaedia of Genes and Genomes pathway maps. These queries may also be combined. Finally, CellLineNavigator allows additional advanced analysis of differentially regulated genes by a direct link to the Database for Annotation, Visualization and Integrated Discovery (DAVID) Bioinformatics Resources.
Subjects
Cell Line, Tumor; Databases, Genetic; Neoplasms/genetics; Transcriptome; Humans; Internet; Neoplasms/metabolism
ABSTRACT
The computation of root mean square deviations (RMSD) is an important step in many bioinformatics applications. If approached naively, each RMSD computation takes time linear in the number of atoms. In addition, a careful implementation is required to achieve numerical stability, which further increases runtimes. In practice, the structural variations under consideration are often induced by rigid transformations of the protein, or are at least dominated by a rigid component. In this work, we show how RMSD values resulting from rigid transformations can be computed in constant time from the protein's covariance matrix, which can be precomputed in linear time. As a typical application scenario is protein clustering, we also show how the Ward distance, which is popular in this field, can be reduced to RMSD evaluations, yielding a constant-time approach for its computation.
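The constant-time idea can be illustrated with a short derivation turned into code. For a rigid transformation x -> R x + t and A = R - I, the squared RMSD equals tr(AᵀA M) + 2 tᵀA μ + |t|², where μ is the mean coordinate and M the mean of x xᵀ over all atoms. The sketch below uses this second-moment formulation as an assumption for illustration; it is not necessarily the exact formulation of the paper, and all function names are hypothetical.

```python
import math


def precompute(coords):
    """Linear-time pass: mean coordinate and second-moment matrix of the atoms."""
    n = len(coords)
    mean = [sum(p[k] for p in coords) / n for k in range(3)]
    second = [[sum(p[j] * p[k] for p in coords) / n for k in range(3)]
              for j in range(3)]
    return mean, second


def rigid_rmsd(mean, second, rot, trans):
    """Constant-time RMSD between a structure and its image under x -> R x + t."""
    a = [[rot[j][k] - (1.0 if j == k else 0.0) for k in range(3)] for j in range(3)]
    # gram = A^T A
    gram = [[sum(a[i][j] * a[i][k] for i in range(3)) for k in range(3)]
            for j in range(3)]
    quad = sum(gram[j][k] * second[k][j] for j in range(3) for k in range(3))
    a_mean = [sum(a[j][k] * mean[k] for k in range(3)) for j in range(3)]
    cross = 2.0 * sum(trans[j] * a_mean[j] for j in range(3))
    shift = sum(t * t for t in trans)
    # Clamp tiny negative values caused by floating-point cancellation.
    return math.sqrt(max(quad + cross + shift, 0.0))


def naive_rmsd(coords, rot, trans):
    """Reference implementation: linear in the number of atoms."""
    total = 0.0
    for p in coords:
        q = [sum(rot[j][k] * p[k] for k in range(3)) + trans[j] for j in range(3)]
        total += sum((q[j] - p[j]) ** 2 for j in range(3))
    return math.sqrt(total / len(coords))


atoms = [(0.0, 0.0, 0.0), (1.5, 0.2, -0.3), (2.1, 1.9, 0.4), (-0.7, 1.1, 2.2)]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
mean, second = precompute(atoms)
print(rigid_rmsd(mean, second, rot_z_90, [1.0, 2.0, 3.0]))
print(naive_rmsd(atoms, rot_z_90, [1.0, 2.0, 3.0]))
```

After the single precomputation pass, each further transformation costs a fixed number of 3x3 operations regardless of the number of atoms. The closed form subtracts quantities of similar size, so for very small displacements it can lose precision, which is the numerical-stability concern the abstract mentions.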
Subjects
Computational Biology/methods; Computer Simulation; Protein Conformation; Proteins/chemistry
ABSTRACT
The enormous growth in the amount of data generated by the life sciences is continuously shifting the field from model-driven science towards data-driven science. The need for efficient processing has led to the adoption of massively parallel accelerators such as graphics processing units (GPUs). Consequently, the development of bioinformatics methods nowadays often heavily depends on the effective use of these powerful technologies. Furthermore, progress in computational techniques and architectures continues to be highly dynamic, involving novel deep neural network models and artificial intelligence (AI) accelerators, and potentially quantum processing units in the future. These are expected to be disruptive for the life sciences as a whole and for drug discovery in particular. Here, we identify three waves of acceleration and their applications in a bioinformatics context: (i) GPU computing, (ii) AI and (iii) next-generation quantum computers.
Subjects
Artificial Intelligence; Computational Biology; Computational Biology/methods; Computer Graphics; Quantum Theory; Humans; Neural Networks, Computer; Drug Discovery/methods
ABSTRACT
BACKGROUND: NMR chemical shift prediction plays an important role in various applications in computational biology. Among others, structure determination, structure optimization, and the scoring of docking results can profit from efficient and accurate chemical shift estimation from a three-dimensional model. A variety of NMR chemical shift prediction approaches have been presented in the past, but nearly all of these rely on laborious manual data set preparation, and the training itself is not automated, making retraining the model (e.g., if new data is made available) or testing new models a time-consuming manual chore. RESULTS: In this work, we present the framework NightShift (NMR Shift Inference by General Hybrid Model Training), which enables automated data set generation as well as model training and evaluation of protein NMR chemical shift prediction. In addition to this main result - the NightShift framework itself - we describe the resulting, automatically generated, data set and, as a proof-of-concept, a random forest model called Spinster that was built using the pipeline. CONCLUSION: By demonstrating that the performance of the automatically generated predictors is at least on par with the state of the art, we conclude that automated data set and predictor generation is well-suited for the design of NMR chemical shift estimators. The framework can be downloaded from https://bitbucket.org/akdehof/nightshift. It requires the open source Biochemical Algorithms Library (BALL), and is available under the conditions of the GNU Lesser General Public License (LGPL). We additionally offer a browser-based user interface to our NightShift instance employing the Galaxy framework via https://ballaxy.bioinf.uni-sb.de/.