ABSTRACT
To enable personalized cancer treatment, machine learning models have been developed to predict drug response as a function of tumor and drug features. However, most algorithm development efforts have relied on cross-validation within a single study to assess model accuracy. While an essential first step, cross-validation within a biological data set typically provides an overly optimistic estimate of the prediction performance on independent test sets. To provide a more rigorous assessment of model generalizability between different studies, we use machine learning to analyze five publicly available cell line-based data sets: National Cancer Institute 60, Cancer Therapeutics Response Portal (CTRP), Genomics of Drug Sensitivity in Cancer, Cancer Cell Line Encyclopedia and Genentech Cell Line Screening Initiative (gCSI). Based on observed experimental variability across studies, we explore estimates of prediction upper bounds. We report performance results of a variety of machine learning models, with a multitasking deep neural network achieving the best cross-study generalizability. By multiple measures, models trained on CTRP yield the most accurate predictions on the remaining testing data, and gCSI is the most predictable among the cell line data sets included in this study. With these experiments and further simulations on partial data, two lessons emerge: (1) differences in viability assays can limit model generalizability across studies and (2) drug diversity, more than tumor diversity, is crucial for raising model generalizability in preclinical screening.
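The contrast between in-study and cross-study evaluation can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the study names, the toy one-feature data, and the hand-written least-squares model stand in for the real data sets and the deep networks described in the abstract. The diagonal of the matrix is an in-study fit, and the off-diagonal entries are the cross-study scores.

```python
from statistics import mean

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b on a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx

def r2(model, xs, ys):
    """Coefficient of determination of a fitted model on (xs, ys)."""
    a, b = model
    my = mean(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def cross_study_matrix(studies):
    """Train on each study, then score on every study (including itself)."""
    models = {name: fit_linear(*data) for name, data in studies.items()}
    return {
        src: {tgt: r2(models[src], *studies[tgt]) for tgt in studies}
        for src in studies
    }

# Hypothetical toy data: study_B measures the same trend with a shifted
# response, mimicking an assay difference between studies.
studies = {
    "study_A": ([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]),
    "study_B": ([0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0]),
}
scores = cross_study_matrix(studies)
for src, row in scores.items():
    print(src, {tgt: round(val, 2) for tgt, val in row.items()})
```

In this toy case each model fits its own study perfectly but scores far lower on the other one, which is the kind of gap that in-study validation alone would not reveal.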
Subjects
Neoplasms , Algorithms , Cell Line , Humans , Machine Learning , Neoplasms/drug therapy , Neoplasms/genetics , Neural Networks, Computer
ABSTRACT
The core part of the program system COLUMBUS allows highly efficient calculations using variational multireference (MR) methods in the framework of configuration interaction with single and double excitations (MR-CISD) and averaged quadratic coupled-cluster calculations (MR-AQCC), based on uncontracted sets of configurations and the graphical unitary group approach (GUGA). The availability of analytic MR-CISD and MR-AQCC energy gradients and analytic nonadiabatic couplings for MR-CISD enables exciting applications including, e.g., investigations of π-conjugated biradicaloid compounds, calculations of multitudes of excited states, development of diabatization procedures, and furnishing the electronic structure information for on-the-fly surface nonadiabatic dynamics. With fully variational uncontracted spin-orbit MRCI, COLUMBUS provides a unique possibility of performing high-level calculations on compounds containing heavy atoms up to lanthanides and actinides. Crucial for carrying out all of these calculations effectively is the availability of an efficient parallel code for the CI step. Configuration spaces of several billion in size now can be treated quite routinely on standard parallel computer clusters. Emerging developments in COLUMBUS, including the all configuration mean energy multiconfiguration self-consistent field method and the graphically contracted function method, promise to allow practically unlimited configuration space dimensions. Spin density based on the GUGA approach, analytic spin-orbit energy gradients, possibilities for local electron correlation MR calculations, development of general interfaces for nonadiabatic dynamics, and MRCI linear vibronic coupling models conclude this overview.
ABSTRACT
BACKGROUND: The National Cancer Institute drug pair screening effort against 60 well-characterized human tumor cell lines (NCI-60) presents an unprecedented resource for modeling combinational drug activity. RESULTS: We present a computational model for predicting cell line response to a subset of drug pairs in the NCI-ALMANAC database. Based on residual neural networks for encoding features as well as predicting tumor growth, our model explains 94% of the response variance. While our best result is achieved with a combination of molecular feature types (gene expression, microRNA and proteome), we show that most of the predictive power comes from drug descriptors. To further demonstrate value in detecting anticancer therapy, we rank the drug pairs for each cell line based on model predicted combination effect and recover 80% of the top pairs with enhanced activity. CONCLUSIONS: We present promising results in applying deep learning to predicting combinational drug response. Our feature analysis indicates screening data involving more cell lines are needed for the models to make better use of molecular features.
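The ranking step described above can be sketched as a top-k recovery calculation for a single cell line. This is a hedged illustration only: the drug names and scores are made up, "higher score means more enhanced activity" is an assumed convention, and `top_k_recovery` is a hypothetical helper rather than the authors' evaluation code.

```python
def top_k_recovery(predicted, observed, k):
    """Fraction of the k best observed pairs found among the k best predicted pairs."""
    top_pred = set(sorted(predicted, key=predicted.get, reverse=True)[:k])
    top_obs = set(sorted(observed, key=observed.get, reverse=True)[:k])
    return len(top_pred & top_obs) / k

# Hypothetical combination scores for one cell line, keyed by drug pair.
observed = {
    ("drugA", "drugB"): 0.9,
    ("drugA", "drugC"): 0.7,
    ("drugA", "drugD"): 0.5,
    ("drugB", "drugD"): 0.2,
    ("drugB", "drugC"): 0.1,
}
predicted = {
    ("drugA", "drugB"): 0.8,
    ("drugB", "drugD"): 0.7,
    ("drugA", "drugD"): 0.6,
    ("drugA", "drugC"): 0.3,
    ("drugB", "drugC"): 0.2,
}

recovery_at_2 = top_k_recovery(predicted, observed, k=2)
recovery_at_3 = top_k_recovery(predicted, observed, k=3)
print(recovery_at_2, round(recovery_at_3, 3))
```

Averaging this quantity over all cell lines would give a single recovery figure; the cutoff k and the definition of "enhanced activity" are choices the abstract does not specify.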
Subjects
Deep Learning/trends , Drug Evaluation, Preclinical/methods , Cell Line, Tumor , Humans , National Cancer Institute (U.S.) , Neural Networks, Computer , United States
ABSTRACT
Detailed understanding of the structure and function relationship of RNA requires knowledge about RNA three-dimensional (3D) topological folding. However, there are very few unique RNA entries in structure databases. This is due to challenges in determining 3D structures of RNA using conventional methods, such as X-ray crystallography and NMR spectroscopy, despite significant advances in both of these technologies. Computational methods have come a long way in accurately predicting the 3D structures of small (<50nt) RNAs to within a few angstroms compared to their native folds. However, lack of an apparent correlation between an RNA primary sequence and its 3D fold ultimately limits the success of purely computational approaches. In this context, small angle X-ray scattering (SAXS) serves as a valuable tool by providing global shape information of RNA. In this article, we review the progress in determining RNA 3D topological structures, including a new method that combines secondary structural information and SAXS data to sample conformations generated through hierarchical moves of commonly observed RNA motifs.
Subjects
Models, Molecular , RNA/chemistry , Base Sequence , Computer Simulation , Nucleic Acid Conformation , Scattering, Small Angle , X-Ray Diffraction
ABSTRACT
With a widely attended virtual kickoff event on January 29, 2021, the National Cancer Institute (NCI) and the Department of Energy (DOE) launched a series of 4 interactive, interdisciplinary workshops-and a final concluding "World Café" on March 29, 2021-focused on advancing computational approaches for predictive oncology in the clinical and research domains of radiation oncology. These events reflect 3,870 human hours of virtual engagement with representation from 8 DOE national laboratories and the Frederick National Laboratory for Cancer Research (FNL), 4 research institutes, 5 cancer centers, 17 medical schools and teaching hospitals, 5 companies, 5 federal agencies, 3 research centers, and 27 universities. Here we summarize the workshops by first describing the background for the workshops. Participants identified twelve key questions-and collaborative parallel ideas-as the focus of work going forward to advance the field. These were then used to define short-term and longer-term "Blue Sky" goals. In addition, the group determined key success factors for predictive oncology in the context of radiation oncology, if not the future of all of medicine. These are: cross-discipline collaboration, targeted talent development, development of mechanistic mathematical and computational models and tools, and access to high-quality multiscale data that bridges mechanisms to phenotype. The workshop participants reported feeling energized and highly motivated to pursue next steps together to address the unmet needs in radiation oncology specifically and in cancer research generally and that NCI and DOE project goals align at the convergence of radiation therapy and advanced computing.
Subjects
Radiation Oncology , Academies and Institutes , Humans , National Cancer Institute (U.S.) , Radiation Oncology/education , United States
ABSTRACT
We are rapidly approaching a future in which cancer patient digital twins will reach their potential to predict cancer prevention, diagnosis, and treatment in individual patients. This will be realized based on advances in high performance computing, computational modeling, and an expanding repertoire of observational data across multiple scales and modalities. In 2020, the US National Cancer Institute, and the US Department of Energy, through a trans-disciplinary research community at the intersection of advanced computing and cancer research, initiated team science collaborative projects to explore the development and implementation of predictive Cancer Patient Digital Twins. Several diverse pilot projects were launched to provide key insights into important features of this emerging landscape and to determine the requirements for the development and adoption of cancer patient digital twins. Projects included exploring approaches to using a large cohort of digital twins to perform deep phenotyping and plan treatments at the individual level, prototyping self-learning digital twin platforms, using adaptive digital twin approaches to monitor treatment response and resistance, developing methods to integrate and fuse data and observations across multiple scales, and personalizing treatment based on cancer type. Collectively these efforts have yielded increased insights into the opportunities and challenges facing cancer patient digital twin approaches and helped define a path forward. Given the rapidly growing interest in patient digital twins, this manuscript provides a valuable early progress report of several CPDT pilot projects commenced in common, their overall aims, early progress, lessons learned and future directions that will increasingly involve the broader research community.
ABSTRACT
BACKGROUND: The cancer genome is commonly altered with thousands of structural rearrangements including insertions, deletions, translocation, inversions, duplications, and copy number variations. Thus, structural variant (SV) characterization plays a paramount role in cancer target identification, oncology diagnostics, and personalized medicine. As part of the SEQC2 Consortium effort, the present study established and evaluated a consensus SV call set using a breast cancer reference cell line and matched normal control derived from the same donor, which were used in our companion benchmarking studies as reference samples. RESULTS: We systematically investigated somatic SVs in the reference cancer cell line by comparing to a matched normal cell line using multiple NGS platforms including Illumina short-read, 10X Genomics linked reads, PacBio long reads, Oxford Nanopore long reads, and high-throughput chromosome conformation capture (Hi-C). We established a consensus SV call set of a total of 1788 SVs including 717 deletions, 230 duplications, 551 insertions, 133 inversions, 146 translocations, and 11 breakends for the reference cancer cell line. To independently evaluate and cross-validate the accuracy of our consensus SV call set, we used orthogonal methods including PCR-based validation, Affymetrix arrays, Bionano optical mapping, and identification of fusion genes detected from RNA-seq. We evaluated the strengths and weaknesses of each NGS technology for SV determination, and our findings provide an actionable guide to improve cancer genome SV detection sensitivity and accuracy. CONCLUSIONS: A high-confidence consensus SV call set was established for the reference cancer cell line. A large subset of the variants identified was validated by multiple orthogonal methods.
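One common way to combine calls from several platforms into a consensus set is to cluster calls of the same type that lie close together and keep clusters supported by more than one platform. The sketch below illustrates that idea only; the abstract does not state the rule actually used, and the platform labels, the 100 bp tolerance, the two-platform threshold, and the function name `consensus_calls` are all assumptions.

```python
from collections import defaultdict

def consensus_calls(calls_by_platform, tol=100, min_support=2):
    """Cluster same-type calls within `tol` bp and keep multi-platform clusters."""
    grouped = defaultdict(list)
    for platform, calls in calls_by_platform.items():
        for chrom, pos, svtype in calls:
            grouped[(chrom, svtype)].append((pos, platform))

    consensus = []
    for (chrom, svtype), hits in sorted(grouped.items()):
        hits.sort()
        cluster = [hits[0]]
        # The trailing None flushes the final cluster.
        for hit in hits[1:] + [None]:
            if hit is not None and hit[0] - cluster[-1][0] <= tol:
                cluster.append(hit)
                continue
            platforms = {p for _, p in cluster}
            if len(platforms) >= min_support:
                positions = [pos for pos, _ in cluster]
                representative = positions[len(positions) // 2]
                consensus.append((chrom, representative, svtype, sorted(platforms)))
            if hit is not None:
                cluster = [hit]
    return consensus

# Hypothetical calls: (chromosome, breakpoint position, SV type).
calls = {
    "short_read": [("chr1", 1000, "DEL"), ("chr2", 5000, "INV")],
    "long_read": [("chr1", 1040, "DEL"), ("chr3", 700, "INS")],
    "linked_read": [("chr1", 990, "DEL"), ("chr2", 9000, "INV")],
}
result = consensus_calls(calls)
print(result)
```

Only the chr1 deletion survives here, because it is the only event seen by at least two platforms within the tolerance. A production pipeline would also compare event sizes and both breakpoints, which this sketch omits.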
Subjects
DNA Copy Number Variations , Neoplasms , Humans , Sequence Analysis, DNA/methods , Genomic Structural Variation , Technology , Cell Line , High-Throughput Nucleotide Sequencing , Genome, Human , Neoplasms/genetics
ABSTRACT
Conventional drug discovery is long and costly, and suffers from high attrition rates, often leaving patients with limited or expensive treatment options. Recognizing the overwhelming need to accelerate this process and increase success, the ATOM consortium was formed by government, industry, and academic partners in October 2017. ATOM applies a team science and open-source approach to foster a paradigm shift in drug discovery. ATOM is developing and validating a precompetitive, preclinical, small molecule drug discovery platform that simultaneously optimizes pharmacokinetics, toxicity, protein-ligand interactions, systems-level models, molecular design, and novel compound generation. To achieve this, the ATOM Modeling Pipeline (AMPL) has been developed to enable advanced and emerging machine learning (ML) approaches to build models from diverse historical drug discovery data. This modular pipeline has been designed to couple with a generative algorithm that optimizes multiple parameters necessary for drug discovery. ATOM's approach is to consider the full pharmacology and therapeutic window of the drug concurrently, through computationally-driven design, thereby reducing the number of molecules that are selected for experimental validation. Here, we discuss the role of collaborative efforts such as consortia and public-private partnerships in accelerating cross disciplinary innovation and the development of open-source tools for drug discovery.
ABSTRACT
The application of data science in cancer research has been boosted by major advances in three primary areas: (1) Data: diversity, amount, and availability of biomedical data; (2) Advances in Artificial Intelligence (AI) and Machine Learning (ML) algorithms that enable learning from complex, large-scale data; and (3) Advances in computer architectures allowing unprecedented acceleration of simulation and machine learning algorithms. These advances help build in silico ML models that can provide transformative insights from data including: molecular dynamics simulations, next-generation sequencing, omics, imaging, and unstructured clinical text documents. Unique challenges persist, however, in building ML models related to cancer, including: (1) access, sharing, labeling, and integration of multimodal and multi-institutional data across different cancer types; (2) developing AI models for cancer research capable of scaling on next generation high performance computers; and (3) assessing robustness and reliability in the AI models. In this paper, we review the National Cancer Institute (NCI) -Department of Energy (DOE) collaboration, Joint Design of Advanced Computing Solutions for Cancer (JDACS4C), a multi-institution collaborative effort focused on advancing computing and data technologies to accelerate cancer research on three levels: molecular, cellular, and population. This collaboration integrates various types of generated data, pre-exascale compute resources, and advances in ML models to increase understanding of basic cancer biology, identify promising new treatment options, predict outcomes, and eventually prescribe specialized treatments for patients with cancer.
ABSTRACT
Electronic medical record keeping has led to increased interest in analyzing historical patient data to improve care delivery. Such research use of patient data, however, raises concerns about confidentiality and institutional liability. Institutional review boards must balance patient data security with a researcher's ability to explore potentially important clinical relationships. We considered the issues involved when patient records from health care institutions are used in medical research. We also explored current regulations on patient confidentiality, the need for identifying information in research, and the effectiveness of deidentification and data security. We will present an algorithm for researchers to use to think about the data security needs of their research, and we will introduce a vocabulary for documenting these techniques in proposals and publications.
Subjects
Computer Security , Confidentiality , Databases, Factual/standards , Medical Records Systems, Computerized/standards , Algorithms , Databases, Factual/ethics , Health Services Research , Humans , Medical Records Systems, Computerized/ethics
ABSTRACT
Identification of important transcripts from fungal pathogens and host plants is indispensable for fully understanding the molecular events occurring during fungal-plant interactions. Recently, we developed an improved LongSAGE method called robust-long serial analysis of gene expression (RL-SAGE) for deep transcriptome analysis of fungal and plant genomes. Using this method, we made 10 RL-SAGE libraries from two plant species (Oryza sativa and Zea mays) and one fungal pathogen (Magnaporthe grisea). Many of the transcripts identified from these libraries were novel in comparison with their corresponding EST collections. Bioinformatic tools and databases for analyzing the RL-SAGE data were developed. Our results demonstrate that RL-SAGE is an effective approach for large-scale identification of expressed genes in fungal and plant genomes.
Subjects
Gene Expression Profiling , Gene Expression Regulation/genetics , Genes, Fungal , Genes, Plant , Magnaporthe/genetics , Oryza/microbiology , Rhizoctonia/genetics , Base Sequence , Blotting, Northern , DNA, Complementary/genetics , Databases, Genetic , Gene Library , Host-Parasite Interactions , Magnaporthe/pathogenicity , Magnaporthe/physiology , Molecular Sequence Data , Mycelium , Plant Leaves/microbiology , RNA, Fungal/isolation & purification , RNA, Messenger/isolation & purification , Rhizoctonia/pathogenicity , Rhizoctonia/physiology , Software
ABSTRACT
Knowledge of RNA three-dimensional topological structures provides important insight into the relationship between RNA structural components and function. It is often likely that near-complete sets of biochemical and biophysical data containing structural restraints are not available, but one still wants to obtain knowledge about approximate topological folding of RNA. In this regard, general methods for determining such topological structures with minimum readily available restraints are lacking. Naked RNAs are difficult to crystallize and NMR spectroscopy is generally limited to small RNA fragments. By nature, sequence determines structure and all interactions that drive folding are self-contained within sequence. Nevertheless, there is little apparent correlation between primary sequences and three-dimensional folding unless supplemented with experimental or phylogenetic data. Thus, there is an acute need for a robust high-throughput method that can rapidly determine topological structures of RNAs guided by some experimental data. We present here a novel method (RS3D) that can assimilate the RNA secondary structure information, small-angle X-ray scattering data, and any readily available tertiary contact information to determine the topological fold of RNA. Conformations are firstly sampled at glob level where each glob represents a nucleotide. Best-ranked glob models can be further refined against solvent accessibility data, if available, and then converted to explicit all-atom coordinates for refinement against SAXS data using the Xplor-NIH program. RS3D is widely applicable to a variety of RNA folding architectures currently present in the structure database. Furthermore, we demonstrate applicability and feasibility of the program to derive low-resolution topological structures of relatively large multi-domain RNAs.
Subjects
RNA Folding , RNA/chemistry , Scattering, Small Angle , X-Ray Diffraction , Models, Molecular
ABSTRACT
BACKGROUND: Rice blast, caused by the fungal pathogen Magnaporthe grisea, is a devastating disease causing tremendous yield loss in rice production. The public availability of the complete genome sequence of M. grisea provides ample opportunities to understand the molecular mechanism of its pathogenesis on rice plants at the transcriptome level. To identify all the expressed genes encoded in the fungal genome, we have analyzed the mycelium and appressorium transcriptomes using massively parallel signature sequencing (MPSS), robust-long serial analysis of gene expression (RL-SAGE) and oligoarray methods. RESULTS: The MPSS analyses identified 12,531 and 12,927 distinct significant tags from mycelia and appressoria, respectively, while the RL-SAGE analysis identified 16,580 distinct significant tags from the mycelial library. When matching these 12,531 mycelial and 12,927 appressorial significant tags to the annotated CDS, 500 bp upstream and 500 bp downstream of CDS, 6,735 unique genes in mycelia and 7,686 unique genes in appressoria were identified. A total of 7,135 mycelium-specific and 7,531 appressorium-specific significant MPSS tags were identified, which correspond to 2,088 and 1,784 annotated genes, respectively, when matching to the same set of reference sequences. Nearly 85% of the significant MPSS tags from mycelia and appressoria and 65% of the significant tags from the RL-SAGE mycelium library matched to the M. grisea genome. MPSS and RL-SAGE methods supported the expression of more than 9,000 genes, representing over 80% of the predicted genes in M. grisea. About 40% of the MPSS tags and 55% of the RL-SAGE tags represent novel transcripts since they had no matches in the existing M. grisea EST collections. Over 19% of the annotated genes were found to produce both sense and antisense tags in the protein-coding region. The oligoarray analysis identified the expression of 3,793 mycelium-specific and 4,652 appressorium-specific genes. 
A total of 2,430 mycelial genes and 1,886 appressorial genes were identified by both MPSS and oligoarray. CONCLUSION: The comprehensive and deep transcriptome analysis by MPSS and RL-SAGE methods identified many novel sense and antisense transcripts in the M. grisea genome at two important growth stages. The differentially expressed transcripts that were identified, especially those specifically expressed in appressoria, represent a genomic resource useful for gaining a better understanding of the molecular basis of M. grisea pathogenicity. Further analysis of the novel antisense transcripts will provide new insights into the regulation and function of these genes in fungal growth, development and pathogenesis in the host plants.
Subjects
Gene Expression Regulation, Fungal , Magnaporthe/genetics , Oligonucleotide Array Sequence Analysis , Transcription, Genetic , DNA, Fungal/genetics , Expressed Sequence Tags , Genetic Techniques , Magnaporthe/pathogenicity , Mycelium/genetics , RNA, Antisense/genetics
ABSTRACT
BACKGROUND: Long alpha-helical coiled-coil proteins are involved in diverse organizational and regulatory processes in eukaryotic cells. They provide cables and networks in the cyto- and nucleoskeleton, molecular scaffolds that organize membrane systems and tissues, motors, levers, rotating arms, and possibly springs. Mutations in long coiled-coil proteins have been implicated in a growing number of human diseases. Using the coiled-coil prediction program MultiCoil, we have previously identified all long coiled-coil proteins from the model plant Arabidopsis thaliana and have established a searchable Arabidopsis coiled-coil protein database. RESULTS: Here, we have identified all proteins with long coiled-coil domains from 21 additional fully sequenced genomes. Because regions predicted to form coiled-coils interfere with sequence homology determination, we have developed a sequence comparison and clustering strategy based on masking predicted coiled-coil domains. Comparing and grouping all long coiled-coil proteins from 22 genomes, the kingdom-specificity of coiled-coil protein families was determined. At the same time, a number of proteins with unknown function could be grouped with already characterized proteins from other organisms. CONCLUSION: MultiCoil predicts proteins with extended coiled-coil domains (more than 250 amino acids) to be largely absent from bacterial genomes, but present in archaea and eukaryotes. The structural maintenance of chromosomes proteins and their relatives are the only long coiled-coil protein family clearly conserved throughout all kingdoms, indicating their ancient nature. Motor proteins, membrane tethering and vesicle transport proteins are the dominant eukaryote-specific long coiled-coil proteins, suggesting that coiled-coil proteins have gained functions in the increasingly complex processes of subcellular infrastructure maintenance and trafficking control of the eukaryotic cell.
Subjects
Arabidopsis Proteins/chemistry , Proteomics/methods , Amino Acid Sequence , Animals , Bacterial Proteins , Carrier Proteins , Cell Membrane/metabolism , Cluster Analysis , Cytoskeleton/metabolism , Databases, Protein , Evolution, Molecular , Genes, Plant , Genome , Genome, Archaeal , Humans , Models, Biological , Models, Genetic , Phylogeny , Protein Conformation , Protein Folding , Protein Structure, Secondary , Protein Structure, Tertiary , Proteins , Proteome , Sequence Analysis, DNA , Software , Species Specificity
ABSTRACT
Hutchinson-Gilford progeria syndrome (HGPS) patients do not develop cancer despite a significant accumulation of DNA damage in their cells. We have recently reported that HGPS cells are refractory to experimental oncogenic transformation and we identified the bromodomain-containing 4 protein (BRD4) as a mediator of the transformation resistance. ChIP-sequencing experiments revealed distinct genome-wide binding patterns for BRD4 in HGPS cells when compared to control wild type cells. Here we provide a detailed description of the ChIP-seq dataset (NCBI GEO accession number GSE61325), the specific and common BRD4 binding sites between HGPS and control cells, and the data analysis procedure associated with the publication by Fernandez et al., 2014 in Cell Reports 9, 248-260 [1].
ABSTRACT
The pmel-1 T cell receptor transgenic mouse has been extensively employed as an ideal model system to study the mechanisms of tumor immunology, CD8+ T cell differentiation, autoimmunity and adoptive immunotherapy. The 'zygosity' of the transgene affects the transgene expression levels and may compromise optimal breeding scheme design. However, the integration sites for the pmel-1 mouse have remained uncharacterized. This is also true for many other commonly used transgenic mice created before the modern era of rapid and inexpensive next-generation sequencing. Here, we show that whole genome sequencing can be used to determine the exact pmel-1 genomic integration site, even with relatively 'shallow' (8X) coverage. The results were used to develop a validated polymerase chain reaction-based genotyping assay. For the first time, we provide a quick and convenient polymerase chain reaction method to determine the dosage of the pmel-1 transgene for this freely and publicly available mouse resource. We also demonstrate that next-generation sequencing provides a feasible approach for mapping foreign DNA integration sites, even when the original vector sequences are only partially known.
Subjects
Chromosomes, Human, Pair 2/chemistry , Mutagenesis, Insertional , Receptors, Antigen, T-Cell, alpha-beta/genetics , Transgenes , gp100 Melanoma Antigen/genetics , Animals , Base Sequence , CD8-Positive T-Lymphocytes/cytology , CD8-Positive T-Lymphocytes/immunology , Chromosome Mapping , Gene Dosage , Gene Expression , Genetic Vectors , High-Throughput Nucleotide Sequencing , Humans , Mice , Mice, Transgenic , Molecular Sequence Data , Receptors, Antigen, T-Cell, alpha-beta/immunology , gp100 Melanoma Antigen/immunology
ABSTRACT
Rice blast disease, caused by the fungal pathogen Magnaporthe grisea, is an excellent model system to study plant-fungal interactions and host defense responses. In this study, comprehensive analysis of the rice (Oryza sativa) transcriptome after M. grisea infection was conducted using robust-long serial analysis of gene expression. A total of 83,382 distinct 21-bp robust-long serial analysis of gene expression tags were identified from 627,262 individual tags isolated from the resistant (R), susceptible (S), and control (C) libraries. Sequence analysis revealed that the tags in the R and S libraries had a significantly reduced matching rate to the rice genomic and expressed sequences in comparison to the C library. The high level of one-nucleotide mismatches of the R and S library tags was due to nucleotide conversions. The A-to-G and U-to-C nucleotide conversions were the most predominant types, which were induced in the M. grisea-infected plants. Reverse transcription-polymerase chain reaction analysis showed that expression of the adenine deaminase and cytidine deaminase genes was highly induced after inoculation. In addition, many antisense transcripts were induced in infected plants and expression of four antisense transcripts was confirmed by strand-specific reverse transcription-polymerase chain reaction. These results demonstrate that there is a series of dynamic and complex transcript modifications and changes in the rice transcriptome at the M. grisea early infection stages.
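Tallying one-nucleotide mismatches by conversion type, as in the analysis above, can be sketched in a few lines. This is an illustrative assumption, not the authors' pipeline: the 21-base reference and tags are invented, sequences are given in the DNA alphabet and reported in RNA notation (T written as U), and `single_mismatch` is a hypothetical helper.

```python
from collections import Counter

def single_mismatch(tag, ref):
    """Return 'X-to-Y' (reference base to tag base) if exactly one base differs, else None."""
    if len(tag) != len(ref):
        return None
    diffs = [(r, t) for r, t in zip(ref, tag) if r != t]
    if len(diffs) != 1:
        return None
    r, t = diffs[0]
    # Report in RNA notation: the uppercase T of the DNA alphabet becomes U.
    return f"{r}-to-{t}".replace("T", "U")

# Hypothetical 21-base reference and three tags aligned to it.
ref = "ACGTACGTACGTACGTACGTA"
tags = [
    "GCGTACGTACGTACGTACGTA",  # first base A replaced by G
    "ACGCACGTACGTACGTACGTA",  # fourth base T replaced by C
    "ACGTACGTACGTACGTACGTA",  # perfect match, not counted
]

conversions = Counter(
    kind for kind in (single_mismatch(tag, ref) for tag in tags) if kind
)
print(conversions)
```

Comparing such counts between infected and control libraries is one way to see whether particular conversion types are enriched; the statistical treatment is left out of this sketch.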