Results 1 - 9 of 9
1.
Genome Biol ; 23(1): 2, 2022 01 03.
Article in English | MEDLINE | ID: mdl-34980216

ABSTRACT

BACKGROUND: Reproducible detection of inherited variants with whole genome sequencing (WGS) is vital for the implementation of precision medicine and is a complicated process in which each step affects variant call quality. Systematically assessing reproducibility of inherited variants with WGS and impact of each step in the process is needed for understanding and improving quality of inherited variants from WGS. RESULTS: To dissect the impact of factors involved in detection of inherited variants with WGS, we sequence triplicates of eight DNA samples representing two populations on three short-read sequencing platforms using three library kits in six labs and call variants with 56 combinations of aligners and callers. We find that bioinformatics pipelines (callers and aligners) have a larger impact on variant reproducibility than WGS platform or library preparation. Single-nucleotide variants (SNVs), particularly outside difficult-to-map regions, are more reproducible than small insertions and deletions (indels), which are least reproducible when > 5 bp. Increasing sequencing coverage improves indel reproducibility but has limited impact on SNVs above 30×. CONCLUSIONS: Our findings highlight sources of variability in variant detection and the need for improvement of bioinformatics pipelines in the era of precision medicine with WGS.


Subject(s)
Genome, Human; Polymorphism, Single Nucleotide; High-Throughput Nucleotide Sequencing; Humans; INDEL Mutation; Reproducibility of Results; Whole Genome Sequencing
2.
BMC Med Inform Decis Mak ; 20(1): 68, 2020 04 15.
Article in English | MEDLINE | ID: mdl-32293428

ABSTRACT

BACKGROUND: Drug labels, or package inserts, play a significant role in all operations from production through drug distribution channels to the end consumer. An image of the label, also called the display panel, could be used to identify illegal, illicit, unapproved and potentially dangerous drugs. Because manual investigation is time-consuming and labor-intensive, an artificial intelligence-based deep learning model is needed for fast and accurate identification of drugs. METHODS: In addition to image-based identification technology, we take advantage of the rich text information on the pharmaceutical package insert in drug label images. In this study, we developed the Drug Label Identification through Image and Text embedding model (DLI-IT) to model text-based patterns of historical data for detection of suspicious drugs. In DLI-IT, we first trained a Connectionist Text Proposal Network (CTPN) to crop the raw image into sub-images based on the text. The texts from the cropped sub-images are recognized independently through the Tesseract OCR Engine and combined into one document for each raw image. Finally, we applied universal sentence embedding to transform these documents into vectors and find the reference images most similar to the test image by cosine similarity. RESULTS: We trained the DLI-IT model on 1749 opioid and 2365 non-opioid drug label images. The model was then tested on 300 external opioid drug label images; the results showed that it achieves up to 88% precision in drug label identification, outperforming previous image-based or text-based identification methods by up to 35%. CONCLUSION: By combining image and text embedding analysis under a deep learning framework, our DLI-IT approach achieved competitive performance in drug label identification.
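
The final retrieval step described in this abstract (document vectors compared by cosine similarity) can be illustrated with a minimal sketch. This is not the DLI-IT code: the `embed` function below is a bag-of-words stand-in for a sentence encoder, and the reference names and label texts are hypothetical examples of OCR output.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: bag-of-words counts (an assumption, NOT a sentence encoder)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(query_text, references, top_k=1):
    """Rank reference documents by cosine similarity to the query document."""
    q = embed(query_text)
    scored = [(cosine(q, embed(text)), name) for name, text in references.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]

# Hypothetical OCR text recovered from reference label images.
references = {
    "ref_opioid_a": "oxycodone hydrochloride tablets 10 mg",
    "ref_nonopioid_b": "ibuprofen tablets 200 mg",
}
query = "oxycodone hydrochloride 10 mg tablets"  # hypothetical test label text
best = most_similar(query, references)
print(best)
```

Swapping `embed` for a real sentence encoder would leave the ranking logic unchanged, since cosine similarity only needs two vectors of the same kind.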


Subject(s)
Deep Learning; Pharmaceutical Preparations; Artificial Intelligence
3.
Environ Int ; 89-90: 81-92, 2016.
Article in English | MEDLINE | ID: mdl-26826365

ABSTRACT

BACKGROUND: ToxCast data have been used to develop models for predicting in vivo toxicity. To predict the in vivo toxicity of a new chemical using a ToxCast data-based model, its ToxCast bioactivity data are needed but not normally available. The capability of predicting ToxCast bioactivity data is necessary to fully utilize ToxCast data in the risk assessment of chemicals. OBJECTIVES: We aimed to understand and elucidate the relationships between the chemicals and bioactivity data of the assays in ToxCast and to develop a network analysis-based method for predicting ToxCast bioactivity data. METHODS: We conducted modularity analysis on a quantitative network constructed from ToxCast data to explore the relationships between the assays and chemicals. We further developed Nebula (neighbor-edges based and unbiased leverage algorithm) for predicting ToxCast bioactivity data. RESULTS: Modularity analysis on the network constructed from ToxCast data yielded seven modules. Assays and chemicals in the seven modules were distinct. Leave-one-out cross-validation yielded a Q² of 0.5416, indicating ToxCast bioactivity data can be predicted by Nebula. Prediction domain analysis showed some types of ToxCast assay data could be more reliably predicted by Nebula than others. CONCLUSIONS: Network analysis is a promising approach to understand ToxCast data. Nebula is an effective algorithm for predicting ToxCast bioactivity data, helping fully utilize ToxCast data in the risk assessment of chemicals.
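
For readers unfamiliar with the Q² statistic reported above, the sketch below computes it under leave-one-out cross-validation using one common definition, Q² = 1 − PRESS / SS_total. The abstract does not state which variant the authors used, so that definition is an assumption. The `knn_mean` predictor is a hypothetical stand-in and is not the Nebula algorithm; the data are invented toy values.

```python
def loo_q2(xs, ys, predict):
    """Leave-one-out Q^2 = 1 - PRESS / SS_total.

    predict(train_xs, train_ys, x) must return a prediction for x
    using only the training points.
    """
    n = len(ys)
    mean_y = sum(ys) / n
    press = 0.0  # predictive residual sum of squares
    for i in range(n):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        y_hat = predict(train_x, train_y, xs[i])
        press += (ys[i] - y_hat) ** 2
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_total

def knn_mean(train_x, train_y, x, k=2):
    """Hypothetical stand-in predictor: mean response of the k nearest points."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Invented toy data: one descriptor value and one activity value per chemical.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.1, 2.9, 4.2, 5.0]
q2 = loo_q2(xs, ys, knn_mean)
print(round(q2, 4))
```

A Q² near 1 means held-out values are predicted almost perfectly, while a value at or below 0 means the model does no better than predicting the overall mean.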


Subject(s)
Ecotoxicology/methods; Models, Theoretical; Risk Assessment; Algorithms; Biological Assay; Humans; Toxicity Tests
4.
Pharmaceutics ; 7(4): 523-41, 2015 Nov 23.
Article in English | MEDLINE | ID: mdl-26610555

ABSTRACT

Precision medicine, or personalized medicine, has been proposed as a modernized and promising medical strategy. Genetic variants of patients are the key information for implementation of precision medicine. Next-generation sequencing (NGS) is an emerging technology for deciphering genetic variants. Alignment of raw reads to a reference genome is one of the key steps in NGS data analysis. Many algorithms have been developed for alignment of short read sequences since 2008, and users have to decide which alignment algorithm to use in their studies. Making the right selection involves choosing not only the alignment algorithm but also the set of suitable parameters to be used by the algorithm. Understanding these algorithms helps in selecting the appropriate alignment algorithm for different applications in precision medicine. Here, we review currently available algorithms and their major strategies, such as seed-and-extend and q-gram filter. We also discuss the challenges in current alignment algorithms, including alignment in multiple repeated regions, long-read alignment and alignment facilitated with known genetic variants.
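
The seed-and-extend strategy named in this abstract can be sketched in a few lines. The toy version below is only an illustration under simplifying assumptions: it uses exact k-mer seeds and an ungapped, mismatch-counting extension, whereas production aligners use compressed indexes and gapped extension. The function names, sequences and parameters are all hypothetical.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def seed_and_extend(read, reference, index, k, max_mismatches=2):
    """Return (ref_start, mismatches) for ungapped placements of the read."""
    candidates = set()
    # Seed: each k-mer of the read that occurs exactly in the reference
    # proposes a start position for the whole read.
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    hits = []
    # Extend: compare the full read against the reference at each candidate.
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

reference = "ACGTTGCAAGGCTTACGATCGGATTACA"  # invented toy reference
index = build_index(reference, k=4)
read = "GGCTTACCATCG"  # differs from reference[9:21] at one base
hits = seed_and_extend(read, reference, index, k=4)
print(hits)
```

The seed length `k` trades sensitivity for speed: shorter seeds tolerate more mismatches per read but propose more candidate positions to verify.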

5.
BMC Bioinformatics ; 15 Suppl 11: S6, 2014.
Article in English | MEDLINE | ID: mdl-25350283

ABSTRACT

BACKGROUND: Due to a significant decline in the costs associated with next-generation sequencing, it has become possible to decipher the genetic architecture of a population by sequencing a large number of individuals to a deep coverage. The Korean Personal Genomes Project (KPGP) recently sequenced 35 Korean genomes at high coverage using the Illumina HiSeq platform and made the deep sequencing data publicly available, providing the scientific community opportunities to decipher the genetic architecture of the Korean population. METHODS: In this study, we used two single nucleotide variant (SNV) calling pipelines: the raw reads obtained from whole genome sequencing of 35 Korean individuals in KPGP were mapped using BWA and SOAP2, followed by SNV calling using SAMtools and SOAPsnp, respectively. The consensus SNVs obtained from the two SNV pipelines were used to represent the SNVs of the Korean population. We compared these SNVs to those from 17 other populations provided by the HapMap consortium and the 1000 Genomes Project (1KGP) and identified SNVs that were only present in the Korean population. We studied the mutation spectrum and analyzed the genes of non-synonymous SNVs only detected in the Korean population. RESULTS: We detected a total of 8,555,726 SNVs in the 35 Korean individuals and identified 1,213,613 SNVs detected in at least one Korean individual (SNV-1) and 12,640 in all 35 Korean individuals (SNV-35) but not in 17 other populations. In contrast with the SNVs common to other populations in HapMap and 1KGP, the Korean-only SNVs had high percentages of non-silent variants, emphasizing the unique roles of these Korean-only SNVs in the Korean population. Specifically, we identified 8,361 non-synonymous Korean-only SNVs, of which 58 SNVs existed in all 35 Korean individuals. The 5,754 genes of non-synonymous Korean-only SNVs were highly enriched in some metabolic pathways. We found that adhesion is the top disease term associated with SNV-1 and Nelson syndrome is the only disease term associated with SNV-35. We found that a significant number of Korean-only SNVs are in genes that are associated with the drug term of adenosine. CONCLUSION: We identified the SNVs that were found in the Korean population but not seen in other populations, and explored the corresponding genes and pathways as well as the associated disease terms and drug terms. The results expand our knowledge of the genetic architecture of the Korean population, which will benefit the implementation of personalized medicine for the Korean population.
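
The consensus and population-filtering logic described in the METHODS can be expressed with set operations. The sketch below is an illustration only, not the authors' pipeline: variants are represented as hypothetical `(chrom, pos, ref, alt)` tuples, the per-individual call sets are invented, and the real analysis involved far more than these steps.

```python
def consensus_calls(calls_a, calls_b):
    """Keep only the SNVs reported by both calling pipelines."""
    return calls_a & calls_b

def population_specific(per_individual, other_populations):
    """Return (SNVs in at least one individual, SNVs in every individual),
    each restricted to variants absent from the other populations."""
    seen_in_any = set().union(*per_individual.values())
    seen_in_all = set.intersection(*per_individual.values())
    return seen_in_any - other_populations, seen_in_all - other_populations

# Invented toy variants: (chromosome, position, reference allele, alternate allele).
v1 = ("chr1", 1000, "A", "G")
v2 = ("chr1", 2000, "C", "T")
v3 = ("chr2", 3000, "G", "A")
v4 = ("chr2", 4000, "T", "C")

# Hypothetical call sets from two pipelines for two individuals.
pipeline_a = {"ind1": {v1, v2, v3}, "ind2": {v1, v3, v4}}
pipeline_b = {"ind1": {v1, v2}, "ind2": {v1, v3}}

consensus = {ind: consensus_calls(pipeline_a[ind], pipeline_b[ind])
             for ind in pipeline_a}
other_populations = {v3}  # hypothetical variants already seen elsewhere
snv_1, snv_all = population_specific(consensus, other_populations)
print(sorted(snv_1), sorted(snv_all))
```

Here `snv_1` and `snv_all` play the roles of SNV-1 and SNV-35 in the abstract, scaled down to two individuals.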


Subject(s)
Asian People/genetics; Polymorphism, Single Nucleotide; Disease/genetics; Gene Ontology; Genetic Association Studies; Genome, Human; High-Throughput Nucleotide Sequencing; Humans; Korea; Mutation; Sequence Alignment; Sequence Analysis, DNA; Software
6.
Nat Biotechnol ; 32(9): 926-32, 2014 Sep.
Article in English | MEDLINE | ID: mdl-25150839

ABSTRACT

The concordance of RNA-sequencing (RNA-seq) with microarrays for genome-wide analysis of differential gene expression has not been rigorously assessed using a range of chemical treatment conditions. Here we use a comprehensive study design to generate Illumina RNA-seq and Affymetrix microarray data from the same liver samples of rats exposed in triplicate to varying degrees of perturbation by 27 chemicals representing multiple modes of action (MOAs). The cross-platform concordance in terms of differentially expressed genes (DEGs) or enriched pathways is linearly correlated with treatment effect size (R² ≈ 0.8). Furthermore, the concordance is also affected by transcript abundance and biological complexity of the MOA. RNA-seq outperforms microarray (93% versus 75%) in DEG verification as assessed by quantitative PCR, with the gain mainly due to its improved accuracy for low-abundance transcripts. Nonetheless, classifiers to predict MOAs perform similarly when developed using data from either platform. Therefore, the endpoint studied and its biological complexity, transcript abundance and the genomic application are important factors in transcriptomic research and for clinical and regulatory decision making.


Subject(s)
Oligonucleotide Array Sequence Analysis; RNA, Messenger/genetics; Sequence Analysis, RNA; Animals; Rats
7.
Genome Biol ; 15(12): 523, 2014 Dec 03.
Article in English | MEDLINE | ID: mdl-25633159

ABSTRACT

BACKGROUND: Gene expression microarrays have been the primary biomarker platform ubiquitously applied in biomedical research, resulting in an enormous accumulation of data, predictive models, and biomarkers. Recently, RNA-seq has looked likely to replace microarrays, but there will be a period where both technologies co-exist. This raises two important questions: Can microarray-based models and biomarkers be directly applied to RNA-seq data? Can future RNA-seq-based predictive models and biomarkers be applied to microarray data to leverage past investment? RESULTS: We systematically evaluated the transferability of predictive models and signature genes between microarray and RNA-seq using two large clinical data sets. The complexity of cross-platform sequence correspondence was considered in the analysis and examined using three human and two rat data sets, and three levels of mapping complexity were revealed. Three algorithms representing different modeling complexity were applied to the three levels of mappings for each of the eight binary endpoints and Cox regression was used to model survival times with expression data. In total, 240,096 predictive models were examined. CONCLUSIONS: Signature genes of predictive models are reciprocally transferable between microarray and RNA-seq data for model development, and microarray-based models can accurately predict RNA-seq-profiled samples, while RNA-seq-based models are less accurate in predicting microarray-profiled samples and are affected both by the choice of modeling algorithm and the gene mapping complexity. The results suggest continued usefulness of legacy microarray data and established microarray biomarkers and predictive models in the forthcoming RNA-seq era.


Subject(s)
Gene Expression Profiling/methods; Genetic Markers; RNA/analysis; Sequence Analysis, RNA; Algorithms; Animals; Computational Biology/methods; Humans; Models, Genetic; Oligonucleotide Array Sequence Analysis; Rats
8.
BMC Bioinformatics ; 14 Suppl 14: S15, 2013.
Article in English | MEDLINE | ID: mdl-24267777

ABSTRACT

BACKGROUND: Pulsed-field gel electrophoresis (PFGE) is currently the most widely and routinely used method by the Centers for Disease Control and Prevention (CDC) and state health labs in the United States for Salmonella surveillance and outbreak tracking. Major drawbacks of commercially available PFGE analysis programs have been their difficulty in dealing with large datasets and the limited availability of analysis tools. There is a need to develop new analytical tools for PFGE data mining in order to make full use of valuable data in large surveillance databases. RESULTS: In this study, a software package was developed that implements five bioinformatics approaches for the analysis and visualization of PFGE fingerprints. The approaches include PFGE band standardization, Salmonella serotype prediction, hierarchical cluster analysis, distance matrix analysis and two-way hierarchical cluster analysis. PFGE band standardization makes cross-group analysis of large datasets possible. The Salmonella serotype prediction approach allows users to predict serotypes of Salmonella isolates based on their PFGE patterns. The hierarchical cluster analysis approach could be used to clarify subtypes and phylogenetic relationships among groups of PFGE patterns. The distance matrix and two-way hierarchical cluster analysis tools allow users to directly visualize the similarities/dissimilarities of any two individual patterns and the inter- and intra-serotype relationships of two or more serotypes, and provide a summary of the overall relationships between user-selected serotypes as well as the distinguishable band markers of these serotypes. The functionalities of these tools were illustrated on PFGE fingerprinting data from CDC PulseNet. CONCLUSIONS: The bioinformatics approaches included in the software package developed in this study were integrated with the PFGE database to enhance the data mining of PFGE fingerprints. Fast and accurate prediction makes it possible to elucidate Salmonella serotype information before conventional serological methods are pursued. The development of bioinformatics tools to distinguish the PFGE markers and serotype-specific patterns will enhance PFGE data retrieval, interpretation and serotype identification and will likely accelerate source tracking to identify the Salmonella isolates implicated in foodborne diseases.
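
As a rough illustration of the distance matrix analysis mentioned above, the sketch below compares band patterns pairwise. The abstract does not state which dissimilarity measure the software uses; the Dice coefficient is assumed here because it is widely used for band-based fingerprints. The isolate names and band sizes (in kb, treated as already standardized) are invented.

```python
def dice_distance(bands_a, bands_b):
    """Dice dissimilarity between two sets of standardized band positions."""
    if not bands_a and not bands_b:
        return 0.0
    shared = len(bands_a & bands_b)
    return 1.0 - 2.0 * shared / (len(bands_a) + len(bands_b))

def distance_matrix(patterns):
    """Return (sorted pattern names, square matrix of pairwise distances)."""
    names = sorted(patterns)
    matrix = [[dice_distance(patterns[a], patterns[b]) for b in names]
              for a in names]
    return names, matrix

# Invented PFGE patterns: band sizes in kb after standardization.
patterns = {
    "isolate_1": {30, 80, 150, 310, 670},
    "isolate_2": {30, 80, 150, 310, 700},
    "isolate_3": {45, 120, 400, 900},
}
names, matrix = distance_matrix(patterns)
for name, row in zip(names, matrix):
    print(name, [round(d, 2) for d in row])
```

A matrix like this is also the usual input to hierarchical clustering, which repeatedly merges the closest patterns or groups.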


Subject(s)
Computational Biology/methods; Electrophoresis, Gel, Pulsed-Field/methods; Salmonella/classification; Cluster Analysis; Data Mining; Databases, Genetic; Humans; Salmonella/chemistry; Salmonella/genetics; Serotyping
9.
PLoS One ; 8(3): e59224, 2013.
Article in English | MEDLINE | ID: mdl-23516614

ABSTRACT

A database was constructed consisting of 45,923 Salmonella pulsed-field gel electrophoresis (PFGE) patterns. The patterns, randomly selected from all submissions to CDC PulseNet during 2005 to 2010, included the 20 most frequent serotypes and 12 less frequent serotypes. Meta-analysis was applied to all of the PFGE patterns in the database. In the range of 20 to 1100 kb, serotype Enteritidis averaged the fewest bands at 12 bands and Paratyphi A the most with 19, with most of the 32 serotypes in the 13-15 range. The 10 most frequent bands for each of the 32 serotypes were sorted and distinguished, and the results were in concordance with those from distance matrix and two-way hierarchical cluster analyses of the patterns in the database. The hierarchical cluster analysis divided the 32 serotypes into three major groups according to dissimilarity measures, and revealed for the first time the similarities among the PFGE patterns of serotype Saintpaul to serotypes Typhimurium, Typhimurium var. 5-, and I 4,[5],12:i:-; of serotype Hadar to serotype Infantis; and of serotype Muenchen to serotype Newport. The results of the meta-analysis indicated that the pattern similarities/dissimilarities determined the serotype discrimination of the PFGE method, and that the possible PFGE markers may have utility for serotype identification. The presence of distinct, serotype-specific patterns may provide useful information to aid in determining the distribution of serotypes in the population and potentially reduce the need for laborious analyses, such as traditional serotyping.


Subject(s)
Electrophoresis, Gel, Pulsed-Field/methods; Salmonella/metabolism; Serotyping/methods; Databases, Factual; Salmonella/classification