1.
BMC Bioinformatics ; 18(1): 391, 2017 Sep 02.
Article in English | MEDLINE | ID: mdl-28865429

ABSTRACT

BACKGROUND: Due to the degeneracy of the genetic code, most amino acids can be encoded by multiple synonymous codons. Synonymous codons naturally occur with different frequencies in different organisms. The choice of codons may affect protein expression, structure, and function. Recombinant gene technologies commonly take advantage of the former effect by implementing a technique termed codon optimization, in which codons are replaced with synonymous ones in order to increase protein expression. This technique relies on the accurate knowledge of codon usage frequencies. Accurately quantifying codon usage bias for different organisms is useful not only for codon optimization, but also for evolutionary and translation studies: phylogenetic relations of organisms, and host-pathogen co-evolution relationships, may be explored through their codon usage similarities. Furthermore, codon usage has been shown to affect protein structure and function through interfering with translation kinetics, and cotranslational protein folding. RESULTS: Despite the obvious need for accurate codon usage tables, currently available resources are either limited in scope, encompassing only organisms from specific domains of life, or greatly outdated. Taking advantage of the exponential growth of GenBank and the creation of NCBI's RefSeq database, we have developed a new database, the High-performance Integrated Virtual Environment-Codon Usage Tables (HIVE-CUTs), to present and analyse codon usage tables for every organism with publicly available sequencing data. Compared to existing databases, this new database is more comprehensive, addresses concerns that limited the accuracy of earlier databases, and provides several new functionalities, such as the ability to view and compare codon usage between individual organisms and across taxonomical clades, through graphical representation or through commonly used indices. 
In addition, it is being routinely updated to keep up with the continuous flow of new data in GenBank and RefSeq. CONCLUSION: Given the impact of codon usage bias on recombinant gene technologies, this database will facilitate effective development and review of recombinant drug products and will be instrumental in a wide area of biological research. The database is available at hive.biochemistry.gwu.edu/review/codon.


Subjects
Codon , Nucleic Acid Databases , Animals , Humans
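
The abstract above describes codon usage tables without showing how one is tallied. The following is a minimal sketch, assuming a single in-frame coding sequence and the common "frequency per thousand codons" convention; the function name `codon_usage_per_thousand` is hypothetical and this is not the HIVE-CUTs implementation.

```python
from collections import Counter


def codon_usage_per_thousand(cds: str) -> dict[str, float]:
    """Return codon frequencies per 1000 codons for one in-frame coding sequence."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # Skip codons containing ambiguous bases such as N.
    counts = Counter(c for c in codons if set(c) <= set("ACGT"))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: 1000.0 * n / total for codon, n in counts.items()}


# Toy sequence (illustrative only): ATG GGT GGC GGT TAA
usage = codon_usage_per_thousand("ATGGGTGGCGGTTAA")
```

In this toy input GGT accounts for two of five codons. A real table would pool counts over every annotated coding sequence of an organism before normalizing.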
2.
Front Mol Biosci ; 11: 1419213, 2024.
Article in English | MEDLINE | ID: mdl-38966129

ABSTRACT

Introduction: Nucleic acid tests for blood donor screening have improved the safety of the blood supply; however, increasing numbers of emerging pathogen tests are burdensome. Multiplex testing platforms are a potential solution. Methods: The Blood Borne Pathogen Resequencing Microarray Expanded (BBP-RMAv.2) can perform multiplex detection and identification of 80 viruses, bacteria and parasites. This study evaluated pathogen detection in human blood or plasma. Samples spiked with selected pathogens, each with one of 6 viruses, 2 bacteria and 5 protozoans were tested on this platform. The nucleic acids were extracted, amplified using multiplexed sets of primers, and hybridized to a microarray. The reported sequences were aligned to a database to identify the pathogen. To directly compare the microarray to an emerging molecular approach, the amplified nucleic acids were also submitted to nanopore next generation sequencing (NGS). Results: The BBP-RMAv.2 detected viral pathogens at a concentration as low as 100 copies/ml and a range of concentrations from 1,000 to 100,000 copies/ml for all the spiked pathogens. Coded specimens were identified correctly demonstrating the effectiveness of the platform. The nanopore sequencing correctly identified most samples and the results of the two platforms were compared. Discussion: These results indicated that the BBP-RMAv.2 could be employed for multiplex detection with potential for use in blood safety or disease diagnosis. The NGS was nearly as effective at identifying pathogens in blood and performed better than BBP-RMAv.2 at identifying pathogen-negative samples.

3.
JAMIA Open ; 7(2): ooae037, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38911332

ABSTRACT

Objectives: Anaphylaxis is a severe life-threatening allergic reaction, and its accurate identification in healthcare databases can harness the potential of "Big Data" for healthcare or public health purposes. Materials and methods: This study used claims data obtained between October 1, 2015 and February 28, 2019 from the CMS database to examine the utility of machine learning in identifying incident anaphylaxis cases. We created a feature selection pipeline to identify critical features between different datasets. Then a variety of unsupervised and supervised methods were used (eg, Sammon mapping and eXtreme Gradient Boosting) to train models on datasets of differing data quality, which reflects the varying availability and potential rarity of ground truth data in medical databases. Results: Resulting machine learning model accuracies ranged from 47.7% to 94.4% when tested on ground truth data. Finally, we found new features to help experts enhance existing case-finding algorithms. Discussion: Developing precise algorithms to detect medical outcomes in claims can be a laborious and expensive process, particularly for conditions presented and coded diversely. We found it beneficial to filter out highly potent codes used for data curation to identify underlying patterns and features. To improve rule-based algorithms where necessary, researchers could use model explainers to determine noteworthy features, which could then be shared with experts and included in the algorithm. Conclusion: Our work suggests machine learning models can perform at similar levels as a previously published expert case-finding algorithm, while also having the potential to improve performance or streamline algorithm construction processes by identifying new relevant features for algorithm construction.

4.
Clin Pharmacol Ther ; 115(4): 745-757, 2024 04.
Article in English | MEDLINE | ID: mdl-37965805

ABSTRACT

In 2020, Novartis Pharmaceuticals Corporation and the U.S. Food and Drug Administration (FDA) started a 4-year scientific collaboration to approach complex new data modalities and advanced analytics. The scientific question was to find novel radio-genomics-based prognostic and predictive factors for HR+/HER2- metastatic breast cancer under a Research Collaboration Agreement. This collaboration has been providing valuable insights to help successfully implement future scientific projects, particularly using artificial intelligence and machine learning. This tutorial aims to provide tangible guidelines for a multi-omics project that includes multidisciplinary expert teams, spanning across different institutions. We cover key ideas, such as "maintaining effective communication" and "following good data science practices," followed by the four steps of exploratory projects, namely (1) plan, (2) design, (3) develop, and (4) disseminate. We break each step into smaller concepts with strategies for implementation and provide illustrations from our collaboration to further give the readers actionable guidance.


Subjects
Artificial Intelligence , Multiomics , Humans , Machine Learning , Genomics
5.
JAMIA Open ; 6(4): ooad090, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37900974

ABSTRACT

Objective: Anaphylaxis is a severe life-threatening allergic reaction, and its accurate identification in healthcare databases can harness the potential of "Big Data" for healthcare or public health purposes. Methods: This study used claims data obtained between October 1, 2015 and February 28, 2019 from the CMS database to examine the utility of machine learning in identifying incident anaphylaxis cases. We created a feature selection pipeline to identify critical features between different datasets. Then a variety of unsupervised and supervised methods were used (eg, Sammon mapping and eXtreme Gradient Boosting) to train models on datasets of differing data quality, which reflects the varying availability and potential rarity of ground truth data in medical databases. Results: Resulting machine learning model accuracies ranged between 47.7% and 94.4% when tested on ground truth data. Finally, we found new features to help experts enhance existing case-finding algorithms. Discussion: Developing precise algorithms to detect medical outcomes in claims can be a laborious and expensive process, particularly for conditions presented and coded diversely. We found it beneficial to filter out highly potent codes used for data curation to identify underlying patterns and features. To improve rule-based algorithms where necessary, researchers could use model explainers to determine noteworthy features, which could then be shared with experts and included in the algorithm. Conclusion: Our work suggests machine learning models can perform at similar levels as a previously published expert case-finding algorithm, while also having the potential to improve performance or streamline algorithm construction processes by identifying new relevant features for algorithm construction.

6.
Genome Biol ; 23(1): 12, 2022 01 07.
Article in English | MEDLINE | ID: mdl-34996510

ABSTRACT

BACKGROUND: Accurate detection of somatic mutations is challenging but critical in understanding cancer formation, progression, and treatment. We recently proposed NeuSomatic, the first deep convolutional neural network-based somatic mutation detection approach, and demonstrated performance advantages on in silico data. RESULTS: In this study, we use the first comprehensive and well-characterized somatic reference data sets from the SEQC2 consortium to investigate best practices for using a deep learning framework in cancer mutation detection. Using the high-confidence somatic mutations established for a cancer cell line by the consortium, we identify the best strategy for building robust models on multiple data sets derived from samples representing real scenarios; for example, a model trained on a combination of real and spike-in mutations had the highest average performance. CONCLUSIONS: The strategy identified in our study achieved high robustness across multiple sequencing technologies for fresh and FFPE DNA input, varying tumor/normal purities, and different coverages, with significant superiority over conventional detection approaches in general, as well as in challenging situations such as low coverage, low variant allele frequency, DNA damage, and difficult genomic regions.


Subjects
Deep Learning , Neoplasms , Genomics , Humans , Mutation , Neoplasms/genetics , Computer Neural Networks
7.
Microorganisms ; 10(1)2021 Dec 30.
Article in English | MEDLINE | ID: mdl-35056521

ABSTRACT

Very little is known about disease transmission via the gut microbiome. We hypothesized that certain inflammatory features could be transmitted via the gut microbiome and tested this hypothesis using an animal model of inflammatory diseases. Twelve-week-old healthy C57BL/6 and germ-free (GF) female and male mice received fecal matter transplants (FMT) under anaerobic conditions from TNFΔARE-/+ donors exhibiting spontaneous rheumatoid arthritis (RA) and inflammatory bowel disease (IBD) or from conventional healthy control donor mice. The gut microbiome analysis was performed using 16S rRNA amplification and sequencing, with bioinformatics analysis on the HIVE bioinformatics platform. Histology, immunohistochemistry, ELISA Multiplex analysis, and flow cytometry were conducted to confirm the inflammatory transmission status. We observed RA and IBD features transmitted in the GF mice cohort, with gut tissue disruption, cartilage alteration, elevated inflammatory mediators in the tissues, activation of CD4/CD8+ T cells, and colonization and transmission of the gut microbiome similar to the donors' profile. We did not observe a change or transmission when conventional healthy mice received FMT from TNFΔARE-/+ donors, suggesting that a healthy microbiome might withstand an unhealthy transplant. These findings show the potential involvement of the gut microbiome in inflammatory diseases. We identified a cluster of bacteria playing a role in this mechanism.

8.
Genome Med ; 13(1): 122, 2021 07 28.
Article in English | MEDLINE | ID: mdl-34321100

ABSTRACT

BACKGROUND: Gene expression is highly variable across tissues of multi-cellular organisms, influencing the codon usage of the tissue-specific transcriptome. Cancer disrupts the gene expression pattern of healthy tissue resulting in altered codon usage preferences. The topic of codon usage changes as they relate to codon demand, and tRNA supply in cancer is of growing interest. METHODS: We analyzed transcriptome-weighted codon and codon pair usage based on The Cancer Genome Atlas (TCGA) RNA-seq data from 6427 solid tumor samples and 632 normal tissue samples. This dataset represents 32 cancer types affecting 11 distinct tissues. Our analysis focused on tissues that give rise to multiple solid tumor types and cancer types that are present in multiple tissues. RESULTS: We identified distinct patterns of synonymous codon usage changes for different cancer types affecting the same tissue. For example, a substantial increase in GGT-glycine was observed in invasive ductal carcinoma (IDC), invasive lobular carcinoma (ILC), and mixed invasive ductal and lobular carcinoma (IDLC) of the breast. Change in synonymous codon preference favoring GGT correlated with change in synonymous codon preference against GGC in IDC and IDLC, but not in ILC. Furthermore, we examined the codon usage changes between paired healthy/tumor tissue from the same patient. Using clinical data from TCGA, we conducted a survival analysis of patients based on the degree of change between healthy and tumor-specific codon usage, revealing an association between larger changes and increased mortality. We have also created a database that contains cancer-specific codon and codon pair usage data for cancer types derived from TCGA, which represents a comprehensive tool for codon-usage-oriented cancer research. CONCLUSIONS: Based on data from TCGA, we have highlighted tumor type-specific signatures of codon and codon pair usage. 
Paired data revealed variable changes to codon usage patterns, which must be considered when designing personalized cancer treatments. The associated database, CancerCoCoPUTs, represents a comprehensive resource for codon and codon pair usage in cancer and is available at https://dnahive.fda.gov/review/cancercocoputs/. These findings are important to understand the relationship between tRNA supply and codon demand in cancer states and could help guide the development of new cancer therapeutics.


Subjects
Codon Usage , Codon , Computational Biology/methods , Genetic Databases , Neoplasms/diagnosis , Neoplasms/genetics , Tumor Biomarkers , Gene Expression Profiling , Neoplastic Gene Expression Regulation , Genome-Wide Association Study , Genomics/methods , Humans , Kaplan-Meier Estimate , Neoplasms/mortality , Prognosis , Transcriptome
9.
J Mol Biol ; 432(11): 3369-3378, 2020 05 15.
Article in English | MEDLINE | ID: mdl-31982380

ABSTRACT

Protein expression in multicellular organisms varies widely across tissues. Codon usage in the transcriptome of each tissue is derived from genomic codon usage and the relative expression level of each gene. We created a comprehensive computational resource that houses tissue-specific codon, codon-pair, and dinucleotide usage data for 51 Homo sapiens tissues (TissueCoCoPUTs: https://hive.biochemistry.gwu.edu/review/tissue_codon), using transcriptome data from the Broad Institute Genotype-Tissue Expression (GTEx) portal. Distances between tissue-specific codon and codon-pair frequencies were used to generate a dendrogram based on the unique patterns of codon and codon-pair usage in each tissue that are clearly distinct from the genomic distribution. This novel resource may be useful in unraveling the relationship between codon usage and tRNA abundance, which could be critical in determining translation kinetics and efficiency across tissues. Areas of investigation such as biotherapeutic development, tissue-specific genetic engineering, and genetic disease prediction will greatly benefit from this resource.


Subjects
Codon/genetics , Genetic Databases , Gene Expression Regulation/genetics , Organ Specificity/genetics , Codon Usage/genetics , Human Genome/genetics , Genotype , Humans , Internet
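
The abstract above states that tissue codon usage follows from genomic codon usage and the relative expression level of each gene. One plausible reading of that statement, sketched below, weights each gene's codon counts by its expression value before normalizing. The function names, gene names, and expression values are hypothetical, and the actual TissueCoCoPUTs pipeline may differ.

```python
from collections import Counter, defaultdict


def count_codons(cds: str) -> Counter:
    """Count in-frame codons of one coding sequence."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def weighted_codon_usage(genes: dict[str, str],
                         expression: dict[str, float]) -> dict[str, float]:
    """Expression-weighted codon fractions across a set of genes."""
    weighted: defaultdict[str, float] = defaultdict(float)
    for gene, cds in genes.items():
        weight = expression.get(gene, 0.0)  # unexpressed genes contribute nothing
        for codon, n in count_codons(cds).items():
            weighted[codon] += weight * n
    total = sum(weighted.values())
    return {c: v / total for c, v in weighted.items()} if total else {}


# Hypothetical two-gene "tissue": geneA is expressed three times as strongly as geneB.
genes = {"geneA": "GGTGGT", "geneB": "GGCGGC"}
expression = {"geneA": 3.0, "geneB": 1.0}
tissue_usage = weighted_codon_usage(genes, expression)
```

With equal weights the two glycine codons would each get a fraction of 0.5. The expression weighting shifts the tissue-level fraction toward the codons of the more highly expressed gene.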
10.
Mol Ecol Resour ; 19(2): 377-387, 2019 Mar.
Article in English | MEDLINE | ID: mdl-30506954

ABSTRACT

Whole genome sequencing of bacterial isolates has become a daily task in many laboratories, generating incredible amounts of data. However, data acquisition is not an end in itself; the goal is to acquire high-quality data useful for understanding genetic relationships. Having a method that could rapidly determine which of the many available run metrics are the most important indicators of overall run quality and having a way to monitor these during a given sequencing run would be extremely helpful to this effect. Therefore, we compared various run metrics across 486 MiSeq runs, from five different machines. By performing a statistical analysis using principal components analysis and a K-means clustering algorithm of the metrics, we were able to validate metric comparisons among instruments, allowing for the development of a predictive algorithm, which permits one to observe whether a given MiSeq run has performed adequately. This algorithm is available in an Excel spreadsheet: that is, MiSeq Instrument & Run (In-Run) Forecast. Our tool can help verify that the quantity/quality of the generated sequencing data consistently meets or exceeds recommended manufacturer expectations. Patterns of deviation from those expectations can be used to assess potential run problems and plan preventative maintenance, which can save valuable time and funding resources.


Subjects
Bacteria/genetics , Bacterial Genome , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/standards , Quality Control , Whole Genome Sequencing/methods , Whole Genome Sequencing/standards , Algorithms , Statistical Models
11.
PLoS One ; 14(5): e0216944, 2019.
Article in English | MEDLINE | ID: mdl-31100083

ABSTRACT

Most viruses are known to spontaneously generate defective viral genomes (DVG) due to errors during replication. These DVGs are subgenomic and contain deletions that render them unable to complete a full replication cycle in the absence of a co-infecting, non-defective helper virus. DVGs, especially of the copyback type, frequently observed with paramyxoviruses, have been recognized to be important triggers of the antiviral innate immune response. DVGs have therefore gained interest for their potential to alter the attenuation and immunogenicity of vaccines. To investigate this potential, accurate identification and quantification of DVGs is essential. Conventional methods, such as RT-PCR, are labor intensive and will only detect primer sequence-specific species. High throughput sequencing (HTS) is much better suited for this undertaking. Here, we present an HTS-based algorithm called DVG-profiler to identify and quantify all DVG sequences in an HTS data set generated from a virus preparation. DVG-profiler identifies DVG breakpoints relative to a reference genome and reports the directionality of each segment from within the same read. The specificity and sensitivity of the algorithm were assessed using both in silico data sets and HTS data obtained from parainfluenza virus 5, Sendai virus and mumps virus preparations. HTS data from the latter were also compared with conventional RT-PCR data and with data obtained using an alternative algorithm. The data presented here demonstrate the high specificity, sensitivity, and robustness of DVG-profiler. This algorithm was implemented within an open source cloud-based computing environment for analyzing HTS data. DVG-profiler might prove valuable not only in basic virus research but also in monitoring live attenuated vaccines for DVG content and in assuring vaccine lot-to-lot consistency.


Subjects
Algorithms , Chromosome Mapping/statistics & numerical data , Defective Viruses/genetics , Viral Genome , Mumps virus/genetics , Parainfluenza Virus 5/genetics , Sendai virus/genetics , Animals , Chromosome Mapping/methods , DNA Primers/chemical synthesis , DNA Primers/metabolism , Datasets as Topic , Defective Viruses/classification , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Molecular Typing , Mumps virus/classification , Parainfluenza Virus 5/classification , Real-Time Polymerase Chain Reaction , Sendai virus/classification , Sensitivity and Specificity
12.
J Mol Biol ; 431(13): 2434-2441, 2019 06 14.
Article in English | MEDLINE | ID: mdl-31029701

ABSTRACT

Usage of sequential codon-pairs is non-random and unique to each species. Codon-pair bias is related to but clearly distinct from individual codon usage bias. Codon-pair bias is thought to affect translational fidelity and efficiency and is presumed to be under selective pressure. It was suggested that changes in codon-pair utilization may affect human disease more significantly than changes in single codons. Although recombinant gene technologies often take codon-pair usage bias into account, codon-pair usage data/tables are not readily available, thus potentially impeding research efforts. The present computational resource (https://hive.biochemistry.gwu.edu/review/codon2) systematically addresses this issue. Building on our recent HIVE-Codon Usage Tables, we constructed a new database to include genomic codon-pair and dinucleotide statistics of all organisms with sequenced genomes available in GenBank. We believe that the growing understanding of the importance of codon-pair usage will make this resource an invaluable tool to many researchers in academia and the pharmaceutical industry.


Subjects
Codon Usage , Computational Biology/methods , Genetic Variation , Algorithms , Base Sequence , Genetic Databases , Humans
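
To make "sequential codon-pairs" concrete, here is a minimal sketch that tallies adjacent in-frame codon pairs in one coding sequence. It assumes overlapping pairs (codon i with codon i+1); the function name `codon_pair_counts` is hypothetical and the database's own counting rules may differ.

```python
from collections import Counter


def codon_pair_counts(cds: str) -> Counter:
    """Count adjacent in-frame codon pairs (codon i, codon i+1) in one sequence."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # zip(codons, codons[1:]) walks the sequence one codon at a time,
    # so every codon except the first and last appears in two pairs.
    return Counter(zip(codons, codons[1:]))


# Toy sequence (illustrative only): ATG GGT GGC GGT
pairs = codon_pair_counts("ATGGGTGGCGGT")
```

A sequence of n codons yields n - 1 pairs, which is why pair tables need far more sequence data than single-codon tables to be well populated: there are 64 x 64 possible pairs.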
13.
Article in English | MEDLINE | ID: mdl-26989153

ABSTRACT

The High-performance Integrated Virtual Environment (HIVE) is a distributed storage and compute environment designed primarily to handle next-generation sequencing (NGS) data. This multicomponent cloud infrastructure provides secure web access for authorized users to deposit, retrieve, annotate and compute on NGS data, and to analyse the outcomes using web interface visual environments appropriately built in collaboration with research and regulatory scientists and other end users. Unlike many massively parallel computing environments, HIVE uses a cloud control server which virtualizes services, not processes. It is both very robust and flexible due to the abstraction layer introduced between computational requests and operating system processes. The novel paradigm of moving computations to the data, instead of moving data to computational nodes, has proven to be significantly less taxing for both hardware and network infrastructure. The honeycomb data model developed for HIVE integrates metadata into an object-oriented model. Its distinction from other object-oriented databases is in the additional implementation of a unified application program interface to search, view and manipulate data of all types. This model simplifies the introduction of new data types, thereby minimizing the need for database restructuring and streamlining the development of new integrated information systems. The honeycomb model employs a highly secure hierarchical access control and permission system, allowing determination of data access privileges in a finely granular manner without flooding the security subsystem with a multiplicity of rules. HIVE infrastructure will allow engineers and scientists to perform NGS analysis in a manner that is both efficient and secure. HIVE is actively supported in public and private domains, and project collaborations are welcomed. Database URL: https://hive.biochemistry.gwu.edu.


Subjects
High-Throughput Nucleotide Sequencing/methods , User-Computer Interface , Computational Biology , Mutation/genetics , Poliovirus/genetics , Poliovirus Vaccines/immunology , Proteomics , Genetic Recombination , Sequence Alignment , Statistics as Topic
15.
PLoS One ; 9(6): e99033, 2014.
Article in English | MEDLINE | ID: mdl-24918764

ABSTRACT

Due to the size of Next-Generation Sequencing data, the computational challenge of sequence alignment has been vast. Inexact alignments can take up to 90% of total CPU time in bioinformatics pipelines. High-performance Integrated Virtual Environment (HIVE), a cloud-based environment optimized for storage and analysis of extra-large data, presents an algorithmic solution: the HIVE-hexagon DNA sequence aligner. HIVE-hexagon implements novel approaches to exploit both characteristics of sequence space and CPU, RAM and Input/Output (I/O) architecture to quickly compute accurate alignments. Key components of HIVE-hexagon include non-redundification and sorting of sequences; floating diagonals of linearized dynamic programming matrices; and consideration of cross-similarity to minimize computations. AVAILABILITY: https://hive.biochemistry.gwu.edu/hive/


Subjects
Sequence Alignment , DNA Sequence Analysis/methods , Genome
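
Of the components listed in the abstract above, "non-redundification and sorting of sequences" is the easiest to illustrate. The sketch below shows only the general idea, assuming that identical reads are collapsed into one entry with a multiplicity so each distinct sequence needs to be aligned once. The function name is hypothetical and this is not the HIVE-hexagon code.

```python
from collections import Counter


def nonredundant_sorted(reads: list[str]) -> list[tuple[str, int]]:
    """Collapse identical reads and return (sequence, count) pairs in sorted order."""
    counts = Counter(read.upper() for read in reads)
    # Lexicographic sorting places reads with shared prefixes next to each other.
    return sorted(counts.items())


# Toy reads (illustrative only); "acgt" duplicates "ACGT" after case folding.
unique_reads = nonredundant_sorted(["ACGT", "TTGA", "acgt", "ACGA"])
```

After an aligner processes each unique sequence once, the stored counts can be used to restore per-read coverage.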
16.
Evol Comput ; 15(4): 493-517, 2007.
Article in English | MEDLINE | ID: mdl-18021017

ABSTRACT

Efficiency has become one of the main concerns in evolutionary multiobjective optimization during recent years. One of the possible alternatives to achieve a faster convergence is to use a relaxed form of Pareto dominance that allows us to regulate the granularity of the approximation of the Pareto front that we wish to achieve. One such relaxed form of Pareto dominance that has become popular in the last few years is epsilon-dominance, which has been mainly used as an archiving strategy in some multiobjective evolutionary algorithms. Despite its advantages, epsilon-dominance has some limitations. In this paper, we propose a mechanism that can be seen as a variant of epsilon-dominance, which we call Pareto-adaptive epsilon-dominance (paepsilon-dominance). Our proposed approach tries to overcome the main limitation of epsilon-dominance: the loss of several nondominated solutions from the hypergrid adopted in the archive because of the way in which solutions are selected within each box.


Subjects
Algorithms , Theoretical Models , Information Storage and Retrieval
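
The limitation named in the abstract above is easier to see with a small example of a plain epsilon-dominance archive. The sketch below assumes minimization, a uniform additive grid of width `eps`, and the common rule of keeping one solution per box (the one nearest the box's lower corner). It illustrates ordinary epsilon-dominance archiving only, not the Pareto-adaptive variant the paper proposes, and all names are hypothetical.

```python
import math


def box(f, eps):
    """Grid-box index of an objective vector on a uniform grid of width eps."""
    return tuple(math.floor(x / eps) for x in f)


def dominates(a, b):
    """Pareto dominance for minimization: a is no worse everywhere, better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def update_archive(archive, f, eps):
    """Return the archive after offering objective vector f to it."""
    b = box(f, eps)
    if any(dominates(box(g, eps), b) for g in archive):
        return archive  # f lies in a box dominated by an archived box
    kept = []
    for g in archive:
        gb = box(g, eps)
        if dominates(b, gb):
            continue  # f's box dominates g's box, so g is dropped
        if gb == b:
            corner = tuple(i * eps for i in b)
            if math.dist(g, corner) <= math.dist(f, corner):
                return archive  # the archived occupant of this box wins
            continue  # f replaces the occupant of this box
        kept.append(g)
    kept.append(f)
    return kept


archive = []
for point in [(0.5, 2.5), (0.2, 2.2), (2.5, 0.5), (2.7, 2.7)]:
    archive = update_archive(archive, point, eps=1.0)
```

Here (0.2, 2.2) displaces (0.5, 2.5) because both fall in the same box, and (2.7, 2.7) is rejected because its box is dominated. Because whole boxes compete rather than individual points, a uniform grid can discard solutions that are mutually nondominated, which is the loss the abstract refers to.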