ABSTRACT
Training algorithms to computationally plan multistep organic syntheses has been a challenge for more than 50 years1-7. However, the field has progressed greatly since the development of early programs such as LHASA1,7, for which reaction choices at each step were made by human operators. Multiple software platforms6,8-14 are now capable of completely autonomous planning. But these programs 'think' only one step at a time and have so far been limited to relatively simple targets, the syntheses of which could arguably be designed by human chemists within minutes, without the help of a computer. Furthermore, no algorithm has yet been able to design plausible routes to complex natural products, for which much more far-sighted, multistep planning is necessary15,16 and closely related literature precedents cannot be relied on. Here we demonstrate that such computational synthesis planning is possible, provided that the program's knowledge of organic chemistry and data-based artificial intelligence routines are augmented with causal relationships17,18, allowing it to 'strategize' over multiple synthetic steps. Using a Turing-like test administered to synthesis experts, we show that the routes designed by such a program are largely indistinguishable from those designed by humans. We also successfully validated three computer-designed syntheses of natural products in the laboratory. Taken together, these results indicate that expert-level automated synthetic planning is feasible, pending continued improvements to the reaction knowledge base and further code optimization.
Subjects
Artificial Intelligence, Biological Products/chemical synthesis, Chemistry Techniques, Synthetic/methods, Chemistry, Organic/methods, Software, Artificial Intelligence/standards, Automation/methods, Automation/standards, Benzylisoquinolines/chemical synthesis, Benzylisoquinolines/chemistry, Chemistry Techniques, Synthetic/standards, Chemistry, Organic/standards, Indans/chemical synthesis, Indans/chemistry, Indole Alkaloids/chemical synthesis, Indole Alkaloids/chemistry, Knowledge Bases, Lactones/chemical synthesis, Lactones/chemistry, Macrolides/chemical synthesis, Macrolides/chemistry, Reproducibility of Results, Sesquiterpenes/chemical synthesis, Sesquiterpenes/chemistry, Software/standards, Tetrahydroisoquinolines/chemical synthesis, Tetrahydroisoquinolines/chemistry
ABSTRACT
Top-down proteomics approaches are becoming ever more popular, owing to the advantages offered by knowledge of the intact protein mass in correctly identifying the various proteoforms that potentially arise from point mutation, alternative splicing, post-translational modifications, etc. Usually, the average mass is used in this context; however, it is known that this can fluctuate significantly due to both natural and technical causes. Ideally, one would prefer to use the monoisotopic precursor mass, but this falls below the detection limit for all but the smallest proteins. Methods that predict the monoisotopic mass based on the average mass are potentially affected by imprecisions associated with the average mass. To address this issue, we have developed a framework based on simple, linear models that allows prediction of the monoisotopic mass based on the exact mass of the most-abundant (aggregated) isotope peak, which is a robust measure of mass, insensitive to the aforementioned natural and technical causes. This linear model was tested experimentally, as well as in silico, and typically predicts monoisotopic masses with an accuracy of only a few parts per million. A confidence measure is associated with the predicted monoisotopic mass to handle the off-by-one-Da prediction error. Furthermore, we introduce a correction function to extract the "true" (i.e., theoretically) most-abundant isotope peak from a spectrum, even if the observed isotope distribution is distorted by noise or poor ion statistics. The method is available online as an R Shiny app: https://valkenborg-lab.shinyapps.io/mind/.
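A minimal sketch of the idea behind this kind of prediction is given below (in Python rather than R): a linear model maps the mass of the most-abundant aggregated isotope peak to a raw monoisotopic-mass estimate, which is then snapped onto the isotope-peak grid to resolve the off-by-one-Da ambiguity. The slope, intercept, and peak spacing are illustrative placeholders, not the fitted MIND coefficients.

    # Illustrative sketch: predict the monoisotopic mass from the mass of the
    # most-abundant aggregated isotope peak with a simple linear model.
    # The slope/intercept are placeholders, NOT the published MIND coefficients.
    SLOPE = 0.9994     # hypothetical fitted slope
    INTERCEPT = 0.05   # hypothetical fitted intercept (Da)
    DA = 1.00235       # approximate spacing between aggregated isotope peaks (Da)

    def predict_monoisotopic(most_abundant_mass: float) -> float:
        """Raw linear prediction of the monoisotopic mass (Da)."""
        return SLOPE * most_abundant_mass + INTERCEPT

    def snap_to_isotope_grid(raw_prediction: float, most_abundant_mass: float) -> float:
        """Resolve the off-by-one-Da ambiguity by snapping the prediction onto the
        isotope-peak grid anchored at the observed most-abundant peak."""
        n = round((most_abundant_mass - raw_prediction) / DA)
        return most_abundant_mass - n * DA

    if __name__ == "__main__":
        observed = 20000.45  # mass (Da) of the most-abundant aggregated isotope peak
        raw = predict_monoisotopic(observed)
        print(f"raw: {raw:.3f} Da, snapped: {snap_to_isotope_grid(raw, observed):.3f} Da")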
Subjects
Algorithms, Chromatography, Liquid/methods, Models, Statistical, Proteins/analysis, Proteome/analysis, Tandem Mass Spectrometry/methods, Humans, Protein Processing, Post-Translational, Proteins/metabolism
ABSTRACT
Analysis of the chemical-organic knowledge represented as a giant network reveals that it contains millions of reaction sequences closing into cycles. Without realizing it, independent chemists working at different times have jointly created examples of cyclic sequences that allow for the recovery of useful reagents and for the autoamplification of synthetically important molecules, sequences that mimic biological cycles, and sequences that can be operated one-pot.
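As a rough illustration of what "closing into cycles" means computationally, the sketch below (a toy example, not the analysis from the paper) treats reactions as directed molecule-to-molecule edges and enumerates simple cycles with networkx; the molecules and edges are made up.

    # Minimal sketch: represent reactions as a directed molecule-to-molecule graph
    # and enumerate simple cycles. Molecules/edges are invented for illustration.
    import networkx as nx

    reactions = [
        ("A", "B"),  # A can be converted into B by some known reaction
        ("B", "C"),
        ("C", "A"),  # closes the cycle A -> B -> C -> A
        ("C", "D"),
    ]

    G = nx.DiGraph(reactions)
    for cycle in nx.simple_cycles(G):
        print(" -> ".join(cycle + [cycle[0]]))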
ABSTRACT
We delineated and analyzed directly oriented paralogous low-copy repeats (DP-LCRs) in the most recent version of the human haploid reference genome. The computationally defined DP-LCRs were cross-referenced with our chromosomal microarray analysis (CMA) database of 25,144 patients subjected to genome-wide assays. This computationally guided approach to the empirically derived large data set allowed us to investigate the relative frequencies of genomic rearrangements and identify new loci for recurrent nonallelic homologous recombination (NAHR)-mediated copy-number variants (CNVs). The most commonly observed recurrent CNVs were NPHP1 duplications (233), CHRNA7 duplications (175), and 22q11.21 deletions (DiGeorge/velocardiofacial syndrome, 166). In the ~25% of CMA cases for which parental studies were available, we identified 190 de novo recurrent CNVs. In this group, the most frequently observed events were deletions of 22q11.21 (48), 16p11.2 (autism, 34), and 7q11.23 (Williams-Beuren syndrome, 11). Several features of DP-LCRs, including length, distance between NAHR substrate elements, DNA sequence identity (fraction matching), GC content, and concentration of the homologous recombination (HR) hot spot motif 5'-CCNCCNTNNCCNC-3', correlate with the frequencies of the recurrent CNV events. Four novel adjacent DP-LCR-flanked and NAHR-prone regions, involving 2q12.2q13, were elucidated in association with novel genomic disorders. Our study quantitates genome architectural features responsible for NAHR-mediated genomic instability and further elucidates the role of NAHR in human disease.
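As a small illustration of one of these features, the sketch below counts occurrences of the degenerate HR hot spot motif 5'-CCNCCNTNNCCNC-3' (N = any base) on both strands of a DNA sequence; the example sequence is made up.

    # Sketch: count occurrences of the degenerate HR hotspot motif CCNCCNTNNCCNC
    # (N = any base) on both strands of a DNA sequence. The sequence is made up.
    import re

    MOTIF = re.compile(r"CC[ACGT]CC[ACGT]T[ACGT][ACGT]CC[ACGT]C")

    def revcomp(seq: str) -> str:
        return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

    def motif_count(seq: str) -> int:
        seq = seq.upper()
        return len(MOTIF.findall(seq)) + len(MOTIF.findall(revcomp(seq)))

    print(motif_count("ACCTCCATAACCGCTTT" * 3))  # toy example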
Subjects
Alleles, Chromosome Disorders/genetics, DNA Copy Number Variations, Genetic Diseases, Inborn/genetics, Homologous Recombination, Adaptor Proteins, Signal Transducing/genetics, Base Composition, Chromosome Deletion, Chromosome Duplication, Cytoskeletal Proteins, Genome, Human, Humans, Membrane Proteins/genetics, Nucleotide Motifs, alpha7 Nicotinic Acetylcholine Receptor/genetics
ABSTRACT
An unanticipated and tremendous amount of the noncoding sequence of the human genome is transcribed. Long noncoding RNAs (lncRNAs) constitute a significant fraction of non-protein-coding transcripts; however, their functions remain enigmatic. We demonstrate that deletions of a small noncoding differentially methylated region at 16q24.1, including lncRNA genes, cause a lethal lung developmental disorder, alveolar capillary dysplasia with misalignment of pulmonary veins (ACD/MPV), with parent-of-origin effects. We identify overlapping deletions 250 kb upstream of FOXF1 in nine patients with ACD/MPV that arose de novo specifically on the maternally inherited chromosome and delete lung-specific lncRNA genes. These deletions define a distant cis-regulatory region that harbors, besides lncRNA genes, also a differentially methylated CpG island, binds GLI2 depending on the methylation status of this CpG island, and physically interacts with and up-regulates the FOXF1 promoter. We suggest that lung-transcribed 16q24.1 lncRNAs may contribute to long-range regulation of FOXF1 by GLI2 and other transcription factors. Perturbation of lncRNA-mediated chromatin interactions may, in general, be responsible for position effect phenomena and potentially cause many disorders of human development.
Subjects
DNA Copy Number Variations, DNA Methylation, Persistent Fetal Circulation Syndrome/genetics, RNA, Long Noncoding/genetics, Chromatin/metabolism, Chromosomes, Human, Pair 16/genetics, CpG Islands, Enhancer Elements, Genetic, Fatal Outcome, Forkhead Transcription Factors/genetics, Forkhead Transcription Factors/metabolism, Gene Expression Regulation, Genomic Imprinting, HEK293 Cells, Humans, Infant, Newborn, Kruppel-Like Transcription Factors/metabolism, Nuclear Proteins/metabolism, Persistent Fetal Circulation Syndrome/diagnosis, Promoter Regions, Genetic, RNA, Long Noncoding/metabolism, Sequence Deletion, Transcription, Genetic, Zinc Finger Protein Gli2
ABSTRACT
Exactly half a century has passed since the launch of the first documented research project (1965 Dendral) on computer-assisted organic synthesis. Many more programs were created in the 1970s and 1980s, but the enthusiasm of these pioneering days had largely dissipated by the 2000s, and the challenge of teaching the computer how to plan organic syntheses earned itself the reputation of a "mission impossible". This is quite curious given that, in the meantime, computers have "learned" many other skills that had been considered exclusive domains of human intellect and creativity; for example, machines can nowadays play chess better than human world champions and they can compose classical music pleasant to the human ear. Although there have been no similar feats in organic synthesis, this Review argues that to concede defeat would be premature. Indeed, by bringing together modern computational power, algorithms from graph/network theory, chemical rules (with full stereo- and regiochemistry) coded in appropriate formats, and elements of quantum mechanics, the machine can finally be "taught" how to plan syntheses of non-trivial organic molecules in a matter of seconds to minutes. The Review begins with an overview of some basic theoretical concepts essential for the big-data analysis of chemical syntheses. It progresses to the problem of optimizing pathways involving known reactions. It culminates with a discussion of algorithms that allow for a completely de novo and fully automated design of syntheses leading to relatively complex targets, including those that have not been made before. Of course, there are still things to be improved, but computers are finally becoming relevant and helpful to the practice of organic-synthetic planning. Paraphrasing Churchill's famous words after the Allies' first major victory over the Axis forces in Africa: it is not the end, it is not even the beginning of the end, but it is the end of the beginning for computer-assisted synthesis planning. The machine is here to stay.
ABSTRACT
BACKGROUND: Recurrent rearrangements of the human genome resulting in disease or variation are mainly mediated by non-allelic homologous recombination (NAHR) between low-copy repeats. However, other genomic structures, including AT-rich palindromes and retroviruses, have also been reported to underlie recurrent structural rearrangements. Notably, recurrent deletions of Yq12 conveying azoospermia, as well as non-pathogenic reciprocal duplications, are mediated by human endogenous retroviral elements (HERVs). We hypothesized that HERV elements throughout the genome can serve as substrates for genomic instability and result in human copy-number variation (CNV). RESULTS: We developed parameters to identify HERV elements similar to those that mediate Yq12 rearrangements as well as recurrent deletions of 3q13.2q13.31. We used these parameters to identify HERV pairs genome-wide that may cause instability. Our analysis highlighted 170 pairs, flanking 12.1% of the genome. We cross-referenced these predicted susceptibility regions with CNVs from our clinical databases for potentially HERV-mediated rearrangements and identified 78 CNVs. We subsequently molecularly confirmed recurrent deletion and duplication rearrangements at four loci in ten individuals, including reciprocal rearrangements at two loci. Breakpoint sequencing revealed clustering in regions of high sequence identity enriched in PRDM9-mediated recombination hotspot motifs. CONCLUSIONS: The presence of deletions and reciprocal duplications suggests NAHR as the causative mechanism of HERV-mediated CNV, even though the length and the sequence homology of the HERV elements are less than currently thought to be required for NAHR. We propose that in addition to HERVs, other repetitive elements, such as long interspersed elements, may also be responsible for the formation of recurrent CNVs via NAHR.
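The flavor of the genome-wide pair search can be sketched as follows; the element records and thresholds (minimum length, maximum separation) are illustrative assumptions rather than the parameters used in the study, and the sequence-identity requirement is only indicated in a comment.

    # Hedged sketch of nominating directly oriented repeat pairs as NAHR substrates.
    # Thresholds and element records are illustrative, not the study's parameters.
    from itertools import combinations

    # (chrom, start, end, strand, family) -- made-up HERV-like annotations
    elements = [
        ("chr3", 100_000, 105_000, "+", "HERV-H"),
        ("chr3", 3_450_000, 3_455_200, "+", "HERV-H"),
        ("chr3", 9_000_000, 9_004_800, "-", "HERV-H"),
    ]

    MIN_LEN = 4_000              # assumed minimum element length (bp)
    MAX_SEPARATION = 10_000_000  # assumed maximum distance between pair members (bp)

    def candidate_pairs(elems):
        for a, b in combinations(elems, 2):
            same_chrom = a[0] == b[0]
            direct = a[3] == b[3]                              # directly oriented
            long_enough = min(a[2] - a[1], b[2] - b[1]) >= MIN_LEN
            close_enough = abs(b[1] - a[2]) <= MAX_SEPARATION
            if same_chrom and direct and long_enough and close_enough:
                yield a, b  # in practice one would also require high sequence identity

    print(list(candidate_pairs(elements)))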
Subjects
DNA Copy Number Variations, DNA, Viral/genetics, Endogenous Retroviruses/genetics, Genome, Human, Genomic Instability, Base Sequence, Chromosome Breakpoints, DNA, Viral/metabolism, Endogenous Retroviruses/metabolism, Homologous Recombination, Humans, Molecular Sequence Data, Repetitive Sequences, Nucleic Acid, Sequence Deletion
ABSTRACT
A thermodynamically guided calculation of free energies of substrate and product molecules allows for the estimation of the yields of organic reactions. The non-ideality of the system and the solvent effects are taken into account through the activity coefficients calculated at the molecular level by perturbed-chain statistical associating fluid theory (PC-SAFT). The model is iteratively trained using a diverse set of reactions with yields that have been reported previously. This trained model can then estimate a priori the yields of reactions not included in the training set with an accuracy of ca. ±15%. This ability has the potential to translate into significant economic savings through the selection and then execution of only those reactions that can proceed in good yields.
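The underlying thermodynamic reasoning can be illustrated with a minimal worked example for an isomerization A <=> B: the equilibrium constant follows from the reaction free energy, and activity coefficients (placeholder numbers standing in for PC-SAFT output) correct the mole-fraction ratio for non-ideality.

    # Minimal worked example: equilibrium yield of an isomerization A <=> B from the
    # reaction free energy, with activity coefficients standing in for PC-SAFT output.
    import math

    R = 8.314          # J / (mol K)
    T = 298.15         # K
    dG = -5_000.0      # J/mol, assumed standard reaction free energy
    gamma_A, gamma_B = 1.2, 0.9   # placeholder activity coefficients (non-ideality)

    K = math.exp(-dG / (R * T))      # thermodynamic equilibrium constant
    ratio = K * gamma_A / gamma_B    # x_B / x_A, since K = (gamma_B x_B)/(gamma_A x_A)
    yield_B = ratio / (1.0 + ratio)  # equilibrium mole fraction of the product B

    print(f"K = {K:.2f}, predicted yield of B = {100 * yield_B:.0f}%")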
ABSTRACT
Inverse paralogous low-copy repeats (IP-LCRs) can cause genome instability by nonallelic homologous recombination (NAHR)-mediated balanced inversions. When disrupting a dosage-sensitive gene(s), balanced inversions can lead to abnormal phenotypes. We delineated the genome-wide distribution of IP-LCRs >1 kb in size with >95% sequence identity and mapped the genes, potentially intersected by an inversion, that overlap at least one of the IP-LCRs. Remarkably, our results show that 12.0% of the human genome is potentially susceptible to such inversions and 942 genes, 99 of which are on the X chromosome, are predicted to be disrupted secondary to such an inversion! In addition, IP-LCRs larger than 800 bp with at least 98% sequence identity (duplication/triplication-facilitating IP-LCRs, DTIP-LCRs) were recently implicated in the formation of complex genomic rearrangements with a duplication-inverted triplication-duplication (DUP-TRP/INV-DUP) structure by a replication-based mechanism involving a template switch between such inverted repeats. We identified 1,551 DTIP-LCRs that could facilitate DUP-TRP/INV-DUP formation. Notably, 1,445 disease-associated genes are at risk of undergoing copy-number gain as they map to genomic intervals susceptible to the formation of DUP-TRP/INV-DUP complex rearrangements. We implicate inverted LCRs as a human genome architectural feature that could potentially be responsible for genomic instability associated with many human disease traits.
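Genome-fraction figures of this kind follow from merging the genomic intervals spanned by qualifying repeat pairs and dividing the merged length by the genome size; a sketch with made-up intervals is shown below.

    # Sketch: estimate what fraction of a (toy) genome falls inside intervals spanned
    # by qualifying inverted-repeat pairs, by merging overlapping intervals first.
    GENOME_SIZE = 3_100_000_000  # bp, approximate haploid human genome size

    def merged_length(intervals):
        total, cur_start, cur_end = 0, None, None
        for start, end in sorted(intervals):
            if cur_end is None or start > cur_end:
                if cur_end is not None:
                    total += cur_end - cur_start
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        if cur_end is not None:
            total += cur_end - cur_start
        return total

    # made-up intervals spanned by IP-LCR pairs (start, end)
    spans = [(1_000_000, 2_500_000), (2_000_000, 3_200_000), (50_000_000, 50_400_000)]
    covered = merged_length(spans)
    print(f"{100 * covered / GENOME_SIZE:.3f}% of the genome covered")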
Subjects
Chromosome Inversion, Genome, Human/genetics, Genomic Instability, Segmental Duplications, Genomic/genetics, Chromosome Mapping, Gene Deletion, Gene Dosage, Gene Duplication, Gene Rearrangement, Genetic Predisposition to Disease/genetics, Humans, Models, Genetic, Recombination, Genetic
ABSTRACT
We describe the molecular and clinical characterization of nine individuals with recurrent, 3.4-Mb, de novo deletions of 3q13.2-q13.31 detected by chromosomal microarray analysis. All individuals have hypotonia and language and motor delays; they variably express mild to moderate cognitive delays (8/9), abnormal behavior (7/9), and autism spectrum disorders (3/9). Common facial features include downslanting palpebral fissures with epicanthal folds, a slightly bulbous nose, and relative macrocephaly. Twenty-eight genes map to the deleted region, including four strong candidate genes, DRD3, ZBTB20, GAP43, and BOC, with important roles in neural and/or muscular development. Analysis of the breakpoint regions based on array data revealed directly oriented human endogenous retrovirus (HERV-H) elements of ~5 kb in size and of >95% DNA sequence identity flanking the deletion. Subsequent DNA sequencing revealed different deletion breakpoints and suggested nonallelic homologous recombination (NAHR) between HERV-H elements as a mechanism of deletion formation, analogous to HERV-I-flanked and NAHR-mediated AZFa deletions. We propose that similar HERV elements may also mediate other recurrent deletion and duplication events on a genome-wide scale. Observation of rare recurrent chromosomal events such as these deletions helps to further the understanding of mechanisms behind naturally occurring variation in the human genome and its contribution to genetic disease.
Subjects
Chromosome Deletion, Chromosomes, Human, Pair 3/genetics, Cognition Disorders/genetics, Developmental Disabilities/genetics, Endogenous Retroviruses/genetics, Muscle Hypotonia/genetics, Adolescent, Adult, Base Sequence, Child, Child, Preschool, Chromosome Breakpoints, Cognition Disorders/diagnosis, Comparative Genomic Hybridization, Developmental Disabilities/diagnosis, Facies, Female, Gene Order, Humans, Infant, Male, Molecular Sequence Data, Muscle Hypotonia/diagnosis, Phenotype, Sequence Alignment, Syndrome, Young Adult
ABSTRACT
This Letter presents the R-package implementation of the recently introduced polynomial method for calculating the aggregated isotopic distribution, called BRAIN (Baffling Recursive Algorithm for Isotopic distributioN calculations). The algorithm is simple, easy to understand, highly accurate, fast, and memory-efficient. The method is based on the application of the Newton-Girard theorem and Viète's formulae to the polynomial coding of different aggregated isotopic variants. As a result, an elegant recursive equation is obtained for computing the occurrence probabilities of consecutive aggregated isotopic peaks. Additionally, the algorithm allows calculating the center-masses of the aggregated isotopic variants. We propose an implementation that is suitable for high-throughput processing and easily customizable for application in different areas of mass spectral data analysis. A case study demonstrates how the R-package can be applied in the context of protein research, but the software can also be used for calculating the isotopic distribution in the context of lipidomics, metabolomics, glycoscience, or even space exploration. More materials, i.e., the reference manual, vignette, and the package itself, are available at Bioconductor online (http://www.bioconductor.org/packages/release/bioc/html/BRAIN.html).
Subjects
Mass Spectrometry, Proteins/chemistry, Software, Algorithms, Internet, Isotopes/chemistry, Metabolomics, Proteomics, User-Computer Interface
ABSTRACT
BACKGROUND: In this paper we deal with modeling the serum proteolysis process from tandem mass spectrometry data. The parameters of the peptide degradation process inferred from LC-MS/MS data correspond directly to the activity of specific enzymes present in the serum samples of patients and healthy donors. Our approach integrates the existing knowledge about peptidase activity stored in the MEROPS database with an efficient procedure for estimating the model parameters. RESULTS: Taking into account the inherent stochasticity of the process, the proteolytic activity is modeled with the Chemical Master Equation (CME). Assuming stationarity of the Markov process, we calculate the expected values of digested peptides in the model. The parameters are fitted to minimize the discrepancy between those expected values and the peptide activities observed in the MS data. The constrained optimization problem is solved with the Levenberg-Marquardt algorithm. CONCLUSIONS: Our results demonstrate the feasibility and potential of high-level analysis for LC-MS proteomic data. The estimated enzyme activities give insights into the molecular pathology of colorectal cancer. Moreover, the developed framework is general and can be applied to study proteolytic activity in different systems.
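The CME-based model itself is beyond an abstract-sized example, but the fitting step can be illustrated with a deliberately simplified surrogate (not the paper's model): first-order degradation rates fitted to observed peptide abundances with SciPy's Levenberg-Marquardt least-squares solver.

    # Deliberately simplified surrogate for the fitting step (NOT the paper's CME model):
    # fit a first-order degradation model so that predicted peptide abundances match
    # observations, using SciPy's Levenberg-Marquardt least-squares solver.
    import numpy as np
    from scipy.optimize import least_squares

    t = np.array([0.0, 1.0, 2.0, 4.0, 8.0])              # incubation times (arbitrary units)
    observed = np.array([1.00, 0.62, 0.37, 0.14, 0.02])  # made-up normalized abundances

    def residuals(params):
        a0, k = params
        predicted = a0 * np.exp(-k * t)
        return predicted - observed

    fit = least_squares(residuals, x0=[1.0, 0.5], method="lm")
    print("estimated initial abundance and rate:", fit.x)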
Subjects
Colorectal Neoplasms/chemistry, Models, Statistical, Peptide Hydrolases/analysis, Serum/chemistry, Algorithms, Chromatography, Liquid/methods, Colorectal Neoplasms/enzymology, Humans, Markov Chains, Mass Spectrometry/methods, Proteolysis, Tandem Mass Spectrometry/methods
ABSTRACT
A computer program for retrosynthetic planning helps develop multiple "synthetic contingency" plans for hydroxychloroquine and also routes leading to remdesivir, both promising but as yet unproven medications against COVID-19. These plans are designed to navigate, as much as possible, around known and patented routes and to commence from inexpensive and diverse starting materials, so as to ensure supply in case of anticipated market shortages of commonly used substrates. Looking beyond the current COVID-19 pandemic, development of similar contingency syntheses is advocated for other already-approved medications, in case such medications become urgently needed in mass quantities to face other public-health emergencies.
ABSTRACT
Although computer programs for retrosynthetic planning have shown improved and in some cases quite satisfactory performance in designing routes leading to specific, individual targets, no algorithms capable of planning syntheses of entire target libraries - important in modern drug discovery - have yet been reported. This study describes how network-search routines underlying existing retrosynthetic programs can be adapted and extended to multi-target design operating on one common search graph, benefitting from the use of common intermediates and reducing the overall synthetic cost. Implementation in the Chematica platform illustrates the usefulness of such algorithms in the syntheses of either (i) all members of a user-defined library, or (ii) the most synthetically accessible members of this library. In the latter case, algorithms are also readily adapted to the identification of the most facile syntheses of isotopically labelled targets. These examples are industrially relevant in the context of hit-to-lead optimization and syntheses of isotopomers of various bioactive molecules.
ABSTRACT
Mass spectrometry enables the study of increasingly larger biomolecules with increasingly higher resolution, which is able to distinguish between fine isotopic variants having the same additional nucleon count but slightly different masses. Therefore, the analysis of the fine isotopic distribution becomes an interesting research topic with important practical applications. In this paper, we propose a comprehensive methodology for studying the basic characteristics of the fine isotopic distribution. Our approach uses a broad spectrum of methods, ranging from generating functions (which allow us to estimate the variance and the information-theoretic entropy of the distribution) to the theory of thermal energy fluctuations. Having characterized the variance, spread, shape, and size of the fine isotopic distribution, we are able to indicate limitations of high-resolution mass spectrometry. Moreover, the analysis of "thermorelativistic" effects (i.e., mass uncertainty attributable to relativistic effects coupled with the statistical mechanical uncertainty of the energy of an isolated ion) in turn gives us an estimate of impassable limits of isotopic resolution (understood as the ability to distinguish fine-structure peaks), which can be moved further only by cooling the ions. The presented approach highlights the potential of theoretical analysis of the fine isotopic distribution, which allows modeling the data more accurately, aiming to support successful experimental measurements.
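One of these characteristics, the variance of the isotopic mass distribution, can be computed directly: the molecular mass is a sum of independent per-atom contributions, so its variance is the single-atom variance of each element multiplied by the atom count and summed over elements. The sketch below uses standard isotope masses and abundances quoted to limited precision and an illustrative peptide composition.

    # The mass of a molecule is a sum of independent atomic masses, so the variance of
    # its isotopic (mass) distribution is n_e * Var(single atom) summed over elements.
    # Isotope masses (Da) and abundances are standard values, quoted to limited precision.
    ISOTOPES = {
        "C": [(12.000000, 0.9893), (13.003355, 0.0107)],
        "H": [(1.007825, 0.999885), (2.014102, 0.000115)],
        "N": [(14.003074, 0.99636), (15.000109, 0.00364)],
        "O": [(15.994915, 0.99757), (16.999132, 0.00038), (17.999160, 0.00205)],
        "S": [(31.972071, 0.9499), (32.971459, 0.0075), (33.967867, 0.0425), (35.967081, 0.0001)],
    }

    def single_atom_variance(isotopes):
        mean = sum(m * p for m, p in isotopes)
        return sum(p * (m - mean) ** 2 for m, p in isotopes)

    def molecule_mass_variance(formula):
        """formula: dict mapping element symbol to atom count."""
        return sum(n * single_atom_variance(ISOTOPES[el]) for el, n in formula.items())

    peptide = {"C": 34, "H": 53, "N": 7, "O": 15}  # illustrative composition
    sd = molecule_mass_variance(peptide) ** 0.5
    print(f"standard deviation of the mass distribution: {sd:.4f} Da")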
Subjects
Isotopes/analysis, Mass Spectrometry/standards, Models, Theoretical, Isotopes/chemistry, Isotopes/isolation & purification, Limit of Detection
ABSTRACT
Recently, an elegant iterative algorithm called BRAIN (Baffling Recursive Algorithm for Isotopic distributioN calculations) was presented. The algorithm is based on the classic polynomial method for calculating aggregated isotope distributions, and it introduces algebraic identities using Newton-Girard and Viète's formulae to solve the problem of polynomial expansion. Owing to its iterative nature, BRAIN requires the calculations to start from the lightest isotope variant. As such, the complexity of BRAIN scales quadratically with the mass of the putative molecule, since it depends on the number of aggregated peaks that need to be calculated. In this manuscript, we suggest two improvements to the algorithm that decrease both the time and memory complexity of obtaining the aggregated isotope distribution. We also illustrate a concept for representing the element isotope distribution in a generic manner. This representation allows the root calculation of the element polynomial required in the original BRAIN method to be omitted. A generic formulation for the roots is of special interest for higher-order element polynomials, since root-finding algorithms and their inaccuracies can be avoided.
Subjects
Algorithms, Isotopes/analysis, Isotopes/chemistry, Mass Spectrometry/methods, Proteomics/methods
ABSTRACT
Although physicochemical fractionation techniques play a crucial role in the analysis of complex mixtures, they are not necessarily the best solution to separate specific molecular classes, such as lipids and peptides. Any physical fractionation step, such as one based on liquid chromatography, will introduce its own variation and noise. In this paper we investigate to what extent the high sensitivity and resolution of contemporary mass spectrometers offer viable opportunities for computational separation of signals in full scan spectra. We introduce an automatic method that can discriminate peptide from lipid peaks in full scan mass spectra, based on their isotopic properties. We systematically evaluate which features contribute most to a peptide-versus-lipid classification. The selected features are subsequently used to build a random forest classifier that enables almost perfect separation between lipid and peptide signals without requiring ion fragmentation and classical tandem MS-based identification approaches. The classifier is trained on in silico data, but is also capable of discriminating signals in real-world experiments. We evaluate the influence of typical data inaccuracies of common classes of mass spectrometry instruments on the optimal set of discriminant features. Finally, the method is successfully extended towards the classification of individual lipid classes from full scan mass spectral features, based on input data defined by the Lipid Maps Consortium.
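A hedged sketch of the classification step is shown below: a random forest trained on synthetic feature vectors standing in for the isotope-derived features (the feature definitions and distributions are placeholders, not those selected in the paper).

    # Hedged sketch of the classification step: a random forest trained on synthetic
    # isotope-derived features. Feature values below are randomly generated placeholders.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 1000
    # toy features: e.g. monoisotopic m/z, ratio of first two isotope peaks, peak spacing
    X_peptide = rng.normal(loc=[800.0, 0.55, 1.0028], scale=[300.0, 0.10, 0.0004], size=(n, 3))
    X_lipid = rng.normal(loc=[700.0, 0.45, 1.0033], scale=[150.0, 0.10, 0.0004], size=(n, 3))
    X = np.vstack([X_peptide, X_lipid])
    y = np.array(["peptide"] * n + ["lipid"] * n)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    print("held-out accuracy:", clf.score(X_te, y_te))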
ABSTRACT
In this article, we present a computation- and memory-efficient method to calculate the probabilities of occurrence and the exact center-masses of the aggregated isotopic distribution of a molecule. The method uses fundamental mathematical properties of polynomials given by the Newton-Girard theorem and Viète's formulae. The calculation is based on the atomic composition of the molecule and the natural abundances of the elemental isotopes in normal terrestrial matter. To evaluate the performance of the proposed method, which we named BRAIN, we compare it with the results obtained from five existing software packages (IsoPro, Mercury, Emass, NeutronCluster, and IsoDalton) for 10 biomolecules. Additionally, we compare the computed mass centers with the results obtained by calculating, and subsequently aggregating, the fine isotopic distribution for two of the exemplary biomolecules. The algorithm will be made available as a Bioconductor package in R, and is also available upon request.
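For orientation, the sketch below shows not the BRAIN recursion itself but the classic polynomial-expansion view that BRAIN accelerates: the aggregated isotopic distribution of a molecule is the convolution, over all atoms, of single-atom distributions indexed by the number of extra nucleons; abundances are standard values and the composition is a human-insulin-like formula (C257H383N65O77S6).

    # Not the BRAIN recursion itself, but the classic polynomial-expansion view that
    # BRAIN accelerates: the aggregated isotopic distribution is the convolution of one
    # per-atom distribution (indexed by extra nucleon count) per atom in the molecule.
    import numpy as np

    # probability of 0, 1, 2, ... extra nucleons for a single atom (standard abundances)
    ELEMENT = {
        "C": [0.9893, 0.0107],
        "H": [0.999885, 0.000115],
        "N": [0.99636, 0.00364],
        "O": [0.99757, 0.00038, 0.00205],
        "S": [0.9499, 0.0075, 0.0425, 0.0, 0.0001],
    }

    def aggregated_distribution(formula, n_peaks=10):
        dist = np.array([1.0])
        for element, count in formula.items():
            single = np.array(ELEMENT[element])
            for _ in range(count):
                dist = np.convolve(dist, single)[:n_peaks]  # truncate the tail for speed
        return dist / dist.sum()

    # human-insulin-like composition
    print(aggregated_distribution({"C": 257, "H": 383, "N": 65, "O": 77, "S": 6})[:6])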