Results 1 - 20 of 6,218
1.
Nat Commun ; 11(1): 4459, 2020 09 08.
Article in English | MEDLINE | ID: mdl-32900997

ABSTRACT

The origins of multicellular physiology are tied to evolution of gene expression. Genes can shift expression as organisms evolve, but how ancestral expression influences altered descendant expression is not well understood. To examine this, we amalgamate 1,903 RNA-seq datasets from 182 research projects, including 6 organs in 21 vertebrate species. Quality control eliminates project-specific biases, and expression shifts are reconstructed using gene-family-wise phylogenetic Ornstein-Uhlenbeck models. Expression shifts following gene duplication result in more drastic changes in expression properties than shifts without gene duplication. The expression properties are tightly coupled with protein evolutionary rate, depending on whether and how gene duplication occurred. Fluxes in expression patterns among organs are nonrandom, forming modular connections that are reshaped by gene duplication. Thus, if expression shifts, ancestral expression in some organs induces a strong propensity for expression in particular organs in descendants. Regardless of whether the shifts are adaptive or not, this supports a major role for what might be termed preadaptive pathways of gene expression evolution.


Subjects
Evolution, Molecular , Transcriptome , Animals , Databases, Nucleic Acid , Female , Gene Duplication , Humans , Male , Models, Genetic , Multigene Family , Organ Specificity , Phylogeny , Proteins/genetics , RNA-Seq , Species Specificity , Vertebrates/classification , Vertebrates/genetics
2.
PLoS One ; 15(8): e0232994, 2020.
Article in English | MEDLINE | ID: mdl-32866155

ABSTRACT

Transposable elements (TEs) are mobile genetic elements in eukaryotic genomes. Recent research highlights the important role of TEs in embryogenesis, neurodevelopment, and immune functions. However, there is a lack of a one-stop, easy-to-use computational pipeline for expression analysis of both genes and locus-specific TEs from RNA-Seq data. Here, we present GeneTEFlow, a fully automated, reproducible and platform-independent workflow for the comprehensive analysis of gene and locus-specific TE expression from RNA-Seq data, employing Nextflow and Docker technologies. This application will help researchers more easily perform integrated analysis of both gene and TE expression, leading to a better understanding of the roles of gene and TE regulation in human diseases. GeneTEFlow is freely available at https://github.com/zhongw2/GeneTEFlow.


Subjects
DNA Transposable Elements , RNA-Seq/statistics & numerical data , Software , Computational Biology , Databases, Nucleic Acid/statistics & numerical data , Gene Expression Profiling/statistics & numerical data , Genome, Human , Humans , Workflow
3.
PLoS One ; 15(9): e0238420, 2020.
Article in English | MEDLINE | ID: mdl-32931492

ABSTRACT

BACKGROUND: Patients diagnosed with Oral Floor Squamous Cell Carcinoma (OFSCC) face considerable challenges in physiology and psychology. This study explored prognostic signatures to predict prognosis in OFSCC through a detailed transcriptomic analysis. METHOD: We built an interactive competing endogenous RNA (ceRNA) network that included lncRNAs, miRNAs and mRNAs. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) were used to predict the gene functions and regulatory pathways of mRNAs. Least absolute shrinkage and selection operator algorithm (LASSO) analysis and Cox regression analysis were used to screen prognosis factors. The Kaplan-Meier method was used to analyze the survival rate of prognosis factors. Risk score was used to assess the reliability of the prediction model. RESULTS: A specific ceRNA network consisting of 56 mRNAs, 16 miRNAs and 31 lncRNAs was established. Three key genes (HOXC13, TGFBR3, KLHL40) and 4 clinical factors (age, gender, TNM, and clinical stage) were identified and effectively predicted survival time. The expression of a gene signature was validated in two external validation cohorts. The signature (areas under the curve of 3 and 5 years were 0.977 and 0.982, respectively) showed high prognostic accuracy in the complete TCGA cohort. CONCLUSIONS: Our study successfully developed an extensive ceRNA network for OFSCC and further identified a 3-mRNA and 4-clinical-factor signature, which may serve as a biomarker.


Subjects
Biomarkers, Tumor/genetics , Carcinoma, Squamous Cell/genetics , Mouth Neoplasms/genetics , RNA, Neoplasm/genetics , Aged , Aged, 80 and over , Carcinoma, Squamous Cell/mortality , Databases, Nucleic Acid , Female , Gene Expression Profiling , Gene Expression Regulation, Neoplastic , Gene Ontology , Gene Regulatory Networks , Homeodomain Proteins/genetics , Humans , Kaplan-Meier Estimate , Male , MicroRNAs/genetics , Middle Aged , Mouth Floor , Mouth Neoplasms/mortality , Muscle Proteins/genetics , Prognosis , Proteoglycans/genetics , RNA, Long Noncoding/genetics , RNA, Messenger/genetics , Receptors, Transforming Growth Factor beta/genetics , Risk Factors
4.
Nat Commun ; 11(1): 3697, 2020 07 29.
Article in English | MEDLINE | ID: mdl-32728101

ABSTRACT

As the number of genomics datasets grows rapidly, sample mislabeling has become a high stakes issue. We present CrosscheckFingerprints (Crosscheck), a tool for quantifying sample-relatedness and detecting incorrectly paired sequencing datasets from different donors. Crosscheck outperforms similar methods and is effective even when data are sparse or from different assays. Application of Crosscheck to 8851 ENCODE ChIP-, RNA-, and DNase-seq datasets enabled us to identify and correct dozens of mislabeled samples and ambiguous metadata annotations, representing ~1% of ENCODE datasets.


Subjects
High-Throughput Nucleotide Sequencing , Linkage Disequilibrium/genetics , Databases, Nucleic Acid , Genotype , HEK293 Cells , Human Umbilical Vein Endothelial Cells/metabolism , Humans , K562 Cells , Lod Score , Molecular Sequence Annotation
5.
PLoS One ; 15(6): e0234385, 2020.
Article in English | MEDLINE | ID: mdl-32603327

ABSTRACT

Utilising a reconstructed ancestral mitochondrial genome of a clade to design hybridisation capture baits can provide the opportunity for recovering mitochondrial sequences from all its descendent and even sister lineages. This approach is useful for taxa with no extant close relatives, as is often the case for rare or extinct species, and is a viable approach for the analysis of historical museum specimens. Asiatic linsangs (genus Prionodon) exemplify this situation, being rare Southeast Asian carnivores for which little molecular data is available. Using ancestral capture we recover partial mitochondrial genome sequences for seven banded linsangs (P. linsang) from historical specimens, representing the first intraspecific genetic dataset for this species. We additionally assemble a high quality mitogenome for the banded linsang using shotgun sequencing for time-calibrated phylogenetic analysis. This reveals a deep divergence between the two Asiatic linsang species (P. linsang, P. pardicolor), with an estimated divergence of ~12 million years (Ma). Although our sample size precludes any robust interpretation of the population structure of the banded linsang, we recover two distinct matrilines with an estimated tMRCA of ~1 Ma. Our results can be used as a basis for further investigation of the Asiatic linsangs, and further demonstrate the utility of ancestral capture for studying divergent taxa without close relatives.


Subjects
Genome, Mitochondrial , Viverridae/genetics , Animals , Asia, Southeastern , DNA, Mitochondrial/genetics , DNA, Mitochondrial/history , Databases, Nucleic Acid , Evolution, Molecular , Extinction, Biological , Fossils/history , Genetic Speciation , History, Ancient , Phylogeny , Phylogeography , Sequence Alignment , Sequence Analysis, DNA , Species Specificity , Viverridae/classification
8.
Rev. esp. med. legal ; 46(2): 75-80, Apr.-Jun. 2020. graph
Article in Spanish | IBECS | ID: ibc-193994

ABSTRACT

In recent years, genetics has become highly important in mass victim identification and is, in many cases, the only useful tool. Some institutions outsource these analyses to specialized laboratories. This is the case for the Human Rights Unit of the Legal Medical Service (Servicio Médico Legal, SML) of Chile, created to identify the remains of victims of the civic-military dictatorship that ruled the country between 1973 and 1990 and to return them to their families; the dictatorship left more than 1,300 people disappeared or killed whose remains were never returned. Outsourcing the analyses makes it necessary to establish a rigorous system for reviewing and controlling the quality of the work performed by the external laboratory, which includes ensuring the traceability of samples and analyses and reproducing both the comparison of genetic profiles and its statistical assessment. This paper presents the experience of the SML in this area and offers a series of recommendations that can serve as a guide for other institutions that decide to outsource genetic analyses in mass victim identification.


Subjects
Humans , Genetic Testing/methods , Databases, Nucleic Acid/organization & administration , Victims Identification , Forensic Genetics/methods , Total Quality Management/methods , Chile/epidemiology , Human Rights/legislation & jurisprudence , Forensic Anthropology/methods , Quality Control , Reference Standards
9.
Nucleic Acids Res ; 48(13): e77, 2020 07 27.
Article in English | MEDLINE | ID: mdl-32496533

ABSTRACT

A fast-growing number of non-coding RNA structures have been resolved and deposited in Protein Data Bank (PDB). In contrast to the wide range of global alignment and motif search tools, there is still a lack of local alignment tools. Among all the global alignment tools for RNA 3D structures, STAR3D has become a valuable tool for its unprecedented speed and accuracy. STAR3D compares the 3D structures of RNA molecules using consecutive base-pairs (stacks) as anchors and generates an optimal global alignment. In this article, we developed a local RNA 3D structural alignment tool, named LocalSTAR3D, which was extended from STAR3D and designed to report multiple local alignments between two RNAs. The benchmarking results show that LocalSTAR3D has better accuracy and coverage than other local alignment tools. Furthermore, the utility of this tool has been demonstrated by rediscovering kink-turn motif instances, conserved domains in group II intron RNAs, and the tRNA mimicry of IRES RNAs.


Subjects
Base Pairing , RNA/chemistry , Sequence Alignment/methods , Sequence Analysis, RNA/methods , Software , Databases, Nucleic Acid , Models, Molecular
11.
Proc Natl Acad Sci U S A ; 117(24): 13421-13427, 2020 06 16.
Article in English | MEDLINE | ID: mdl-32482858

ABSTRACT

Although the backlog of untested sexual assault kits in the United States is starting to be addressed, many municipalities are opting for selective testing of samples within a kit, where only the most probative samples are tested. We use data from the San Francisco Police Department Criminalistics Laboratory, which tests all samples but also collects information on the samples flagged by sexual assault forensic examiners as most probative, to build a standard machine learning model that predicts (based on covariates gleaned from sexual assault kit questionnaires) which samples are most probative. This model is embedded within an optimization framework that selects which samples to test from each kit to maximize the Combined DNA Index System (CODIS) yield (i.e., the number of kits that generate at least one DNA profile for the criminal DNA database) subject to a budget constraint. Our analysis predicts that, relative to a policy that tests only the samples deemed probative by the sexual assault forensic examiners, the proposed policy increases the CODIS yield by 45.4% without increasing the cost. Full testing of all samples has a slightly lower cost-effectiveness than the selective policy based on forensic examiners, but more than doubles the yield. In over half of the sexual assaults, a sample was not collected during the forensic medical exam from the body location deemed most probative by the machine learning model. Our results suggest that electronic forensic records coupled with machine learning and optimization models could enhance the effectiveness of criminal investigations of sexual assaults.


Subjects
Crime Victims , Forensic Sciences/economics , Law Enforcement/methods , Sex Offenses , Specimen Handling/economics , Adult , Cost-Benefit Analysis , Crime Victims/statistics & numerical data , DNA/analysis , Databases, Nucleic Acid , Female , Forensic Sciences/statistics & numerical data , Humans , Machine Learning , Male , San Francisco , Sex Offenses/statistics & numerical data , Specimen Handling/statistics & numerical data
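
The selection problem in the abstract above (choose which samples to test so that as many kits as possible yield at least one profile, within a budget) can be sketched roughly as follows. This is only an illustration under assumptions the abstract does not state: samples within a kit are treated as succeeding independently, every test costs one budget unit, and a simple greedy heuristic stands in for the authors' optimization framework. The function names and the probabilities are hypothetical.

```python
def kit_yield(probs):
    """Probability that at least one tested sample in a kit gives a profile,
    assuming (hypothetically) that samples succeed independently."""
    miss = 1.0
    for p in probs:
        miss *= 1.0 - p
    return 1.0 - miss


def greedy_selection(kits, budget):
    """Pick up to `budget` samples, one at a time, each time taking the
    sample with the largest marginal gain in expected kit yield."""
    chosen = {kit_id: [] for kit_id in kits}
    for _ in range(budget):
        best = None  # (gain, kit_id, sample_index)
        for kit_id, probs in kits.items():
            taken = chosen[kit_id]
            base = kit_yield([probs[i] for i in taken])
            for i, p in enumerate(probs):
                if i in taken:
                    continue
                # Gain = chance the kit has no profile yet times this sample's chance.
                gain = (1.0 - base) * p
                if best is None or gain > best[0]:
                    best = (gain, kit_id, i)
        if best is None:  # every sample is already selected
            break
        chosen[best[1]].append(best[2])
    return chosen


# Hypothetical per-sample success probabilities for three kits.
kits = {"A": [0.6, 0.5], "B": [0.4], "C": [0.3, 0.2]}
selection = greedy_selection(kits, budget=3)
expected_yield = sum(
    kit_yield([kits[k][i] for i in idx]) for k, idx in selection.items()
)
print(selection, round(expected_yield, 3))
```

With these toy numbers the heuristic spreads the three tests across the three kits rather than testing both samples of kit A, because a second sample in a kit that already has a likely hit adds little to the count of kits with at least one profile.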
12.
Interdiscip Sci ; 12(3): 368-376, 2020 Sep.
Article in English | MEDLINE | ID: mdl-32488835

ABSTRACT

A novel coronavirus, called 2019-nCoV, was recently found in Wuhan, Hubei Province of China, and now is spreading across China and other parts of the world. Although there are some drugs to treat 2019-nCoV, there is no proper scientific evidence about its activity on the virus. It is of high significance to develop a drug that can combat the virus effectively to save valuable human lives. It usually takes a much longer time to develop a drug using traditional methods. For 2019-nCoV, it is now better to rely on some alternative methods such as deep learning to develop drugs that can combat such a disease effectively since 2019-nCoV is highly homologous to SARS-CoV. In the present work, we first collected virus RNA sequences of 18 patients reported to have 2019-nCoV from the public domain database, translated the RNA into protein sequences, and performed multiple sequence alignment. After a careful literature survey and sequence analysis, 3C-like protease is considered to be a major therapeutic target and we built a protein 3D model of 3C-like protease using homology modeling. Relying on the structural model, we used a pipeline to perform large scale virtual screening by using a deep learning based method to accurately rank/identify protein-ligand interacting pairs developed recently in our group. Our model identified potential drugs for 2019-nCoV 3C-like protease by performing drug screening against four chemical compound databases (Chimdiv, Targetmol-Approved_Drug_Library, Targetmol-Natural_Compound_Library, and Targetmol-Bioactive_Compound_Library) and a database of tripeptides. Through this paper, we provided the list of possible chemical ligands (Meglumine, Vidarabine, Adenosine, D-Sorbitol, D-Mannitol, Sodium_gluconate, Ganciclovir and Chlorobutanol) and peptide drugs (combination of isoleucine, lysine and proline) from the databases to guide the experimental scientists and validate the molecules which can combat the virus in a shorter time.


Subjects
Antiviral Agents/pharmacology , Betacoronavirus/drug effects , Coronavirus Infections/drug therapy , Coronavirus Infections/virology , Deep Learning , Drug Evaluation, Preclinical/methods , Pneumonia, Viral/drug therapy , Pneumonia, Viral/virology , Viral Nonstructural Proteins/antagonists & inhibitors , Amino Acid Sequence , Antiviral Agents/chemistry , Betacoronavirus/genetics , Catalytic Domain , Coronavirus Infections/epidemiology , Cysteine Endopeptidases/chemistry , Cysteine Endopeptidases/genetics , Databases, Nucleic Acid , Databases, Pharmaceutical , Drug Design , Drug Evaluation, Preclinical/statistics & numerical data , Humans , Ligands , Models, Molecular , Molecular Dynamics Simulation , Oligopeptides/chemistry , Oligopeptides/pharmacology , Pandemics , Pneumonia, Viral/epidemiology , Sequence Alignment , Structural Homology, Protein , User-Computer Interface , Viral Nonstructural Proteins/chemistry , Viral Nonstructural Proteins/genetics
13.
J Cancer Res Ther ; 16(1): 7-12, 2020.
Article in English | MEDLINE | ID: mdl-32362602

ABSTRACT

Background: Xist is a long noncoding RNA involved in X chromosome inactivation in females. It may act as an onco-suppressor gene in hematologic malignancies, and its activity is strongly dependent on SATB1 gene expression. However, its potential role in Hodgkin's disease (HD) onset and progression is unknown. Materials and Methods: Three gene expression microarray datasets were analyzed for the expression of Xist and SATB1 in patients with classical HD, namely, GDS4222 (130 patients and 54,000 gene features), GSE39134 (29 patients and 54,000 features), and E-MEXP-507 (29 patients and 27,648 probes). The first two were oligonucleotide arrays (platform: Affymetrix gene chip HG-U133-Plus2), whereas the latter was a cDNA two-channel array (platform: OncoChip. v2). Summary and time-dependent receiver operating characteristic (ROC) analysis were applied to obtain a summary measure (summary area under the ROC curve [sAUC]) of association between gene expression and unfavorable patient outcome in each probe set. Results: Xist was overexpressed among females in each data set. A slight overexpression was associated with a good prognosis both in males (sAUC = 0.75, 95% confidence interval [CI]: 0.70-0.80) and, to a lesser extent, in females (sAUC = 0.64, 95% CI: 0.59-0.69). However, this finding was limited to the analysis of the biggest database (GDS4222). No association was found between Xist and SATB1 expression. Conclusions: A reactivation of Xist might act as an onco-suppressor gene in male patients with HD, which seems independent of SATB1 expression. The possibility that Xist could contribute to the better survival of female patients should also be investigated.


Subjects
Databases, Nucleic Acid/statistics & numerical data , Gene Expression Profiling/methods , Gene Expression Regulation, Neoplastic , Genes, Tumor Suppressor , Hodgkin Disease/genetics , Hodgkin Disease/pathology , RNA, Long Noncoding/genetics , Adult , Computational Biology/methods , Disease Progression , Female , Hodgkin Disease/metabolism , Humans , Male , Matrix Attachment Region Binding Proteins/metabolism , Middle Aged , Young Adult
14.
PLoS One ; 15(5): e0233573, 2020.
Article in English | MEDLINE | ID: mdl-32437469

ABSTRACT

The accuracy of the DNA barcoding tool depends on the existence of a comprehensive archived library of sequences reliably determined at species level by expert taxonomists. However, misidentifications are not infrequent, especially following large-scale DNA barcoding campaigns on diverse and taxonomically complex groups. In this study we used the species-rich flea beetle genus Longitarsus, that requires a high level of expertise for morphological species identification, as a case study to assess the accuracy of the DNA barcoding tool following several optimization procedures. We built a cox1 reference database of 1502 sequences representing 78 Longitarsus species, among which 117 sequences (32 species) were newly generated using a non-invasive DNA extraction method that allows keeping reference voucher specimens. Within this dataset we identified 69 taxonomic inconsistencies using barcoding gap analysis and tree topology methods. Threshold optimisation and a posteriori taxonomic revision based on newly generated reference sequences and metadata allowed resolving 44 sequences with ambiguous and incorrect identification and provided a significant improvement of the DNA barcoding accuracy and identification efficacy. Unresolved taxonomic uncertainties, due to overlapping intra- and inter-specific levels of divergences, mainly regards the Longitarsus pratensis species complex and polyphyletic groups L. melanocephalus, L. nigrofasciatus and L. erro. Such type of errors indicates either poorly established taxonomy or any biological processes that make mtDNA groups poorly predictive of species boundaries (e.g. recent speciation or interspecific hybridisation), thus providing directions for further integrative taxonomic and evolutionary studies. 
Overall, this study underlines the importance of reference vouchers and high-quality metadata associated to sequences in reference databases and corroborates, once again, the key role of taxonomists in any step of the DNA barcoding pipeline in order to generate and maintain a correct and functional reference library.


Subjects
Coleoptera/genetics , DNA Barcoding, Taxonomic , Animals , Coleoptera/classification , DNA/genetics , DNA/isolation & purification , DNA Barcoding, Taxonomic/methods , Databases, Nucleic Acid , Evolution, Molecular
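
The barcoding gap mentioned in the abstract above is commonly understood as the difference between the smallest between-species distance and the largest within-species distance. A minimal sketch of that idea follows, assuming pre-aligned sequences of equal length and an uncorrected p-distance; the species labels and sequences are invented for illustration, and the study's own distance model and thresholds are not given in the abstract.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def barcoding_gap(records):
    """Return min interspecific distance minus max intraspecific distance.

    `records` is a list of (species_label, aligned_sequence) pairs and must
    contain at least one within-species and one between-species pair.
    A value at or below zero means the two distributions overlap.
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (intra if sp1 == sp2 else inter).append(p_distance(s1, s2))
    return min(inter) - max(intra)


# Hypothetical toy alignment: two species, two sequences each.
records = [
    ("sp_a", "ACGTACGTAC"),
    ("sp_a", "ACGTACGTAT"),
    ("sp_b", "ACGTTCGGAA"),
    ("sp_b", "ACGTTCGGAT"),
]
gap = barcoding_gap(records)
print(round(gap, 3))
```

A gap that is zero or negative corresponds to the overlapping intra- and interspecific divergences that the abstract reports as unresolved.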
15.
PLoS One ; 15(5): e0233717, 2020.
Article in English | MEDLINE | ID: mdl-32469983

ABSTRACT

Metastasis is known as a key step in cancer recurrence and can be stimulated by multiple factors. Calumenin (CALU) is one of these factors and has a direct impact on cancer metastasis, yet its underlying mechanisms have not been completely elucidated. The current study aimed to identify CALU co-expressed genes, their signaling pathways, and their expression status in human cancers. To this end, CALU-associated genes were visualized using the Cytoscape plugin BisoGenet and annotated with the Enrichr web-based application. The list of CALU-related diseases was retrieved using the DisGenNet, and cancer datasets were downloaded from The Cancer Genome Atlas (TCGA) and analyzed with the Cufflink software. ROC curve analysis was used to estimate the diagnostic accuracy of DEGs in each cancer, and Kaplan-Meier survival analysis was performed to plot the overall survival of patients. The protein level of the signature biomarkers was measured in 40 biopsy specimens and matched adjacent normal tissues collected from CRC and lung cancer patients. Analysis of the CALU co-expressed gene network in TCGA datasets indicated that the network is markedly altered in human colon (COAD) and lung (LUAD) cancers. Diagnostic accuracy estimation of differentially expressed genes showed that a gene panel consisting of CALU, AURKA, and MCM2 was able to distinguish tumors from healthy samples. Cancer cases with abnormal expression of the signature genes had a significantly lower survival rate than other patients. Additionally, comparison of CALU, AURKA, and MCM2 proteins between healthy samples and early and advanced tumors showed that the level of these proteins increased through the normal-carcinoma transition in both types of cancer. These data indicate that the interactions between CALU, AURKA, and MCM2 have a pivotal role in cancer development and therefore need to be explored in the future.


Subjects
Aurora Kinase A , Calcium-Binding Proteins , Colonic Neoplasms , Databases, Nucleic Acid , Gene Expression Regulation, Neoplastic , Lung Neoplasms , Minichromosome Maintenance Complex Component 2 , Aurora Kinase A/biosynthesis , Aurora Kinase A/genetics , Calcium-Binding Proteins/biosynthesis , Calcium-Binding Proteins/genetics , Colonic Neoplasms/diagnosis , Colonic Neoplasms/genetics , Colonic Neoplasms/metabolism , Colonic Neoplasms/mortality , Disease-Free Survival , Female , Humans , Lung Neoplasms/diagnosis , Lung Neoplasms/genetics , Lung Neoplasms/metabolism , Lung Neoplasms/mortality , Male , Minichromosome Maintenance Complex Component 2/biosynthesis , Minichromosome Maintenance Complex Component 2/genetics , Neoplasm Metastasis , Survival Rate
16.
Toxicol Lett ; 329: 80-84, 2020 Sep 01.
Article in English | MEDLINE | ID: mdl-32360788

ABSTRACT

A large number of computer-based prediction methods to determine the potential of chemicals to induce mutations at the gene level have been developed over the last decades. Conversely, only a few such methods are currently available to predict potential structural and numerical chromosome aberrations. Even fewer of these are based on the preferred testing method for this endpoint, i.e. the micronucleus test. For the present work, in vivo micronucleus test results of 718 structurally diverse compounds were collected and applied for the construction of new models by means of the freely available SARpy in silico model building software. Multiple QSAR models were created using parameter variation and manual verification of (non-) alerting structures. To this end, the original set of 718 compounds was split into a training (80 %) and a test (20 %) set. SARpy was applied on the training set to automatically extract sets of rules by generating and selecting substructures based on their prediction performance, whereas the test set was used to evaluate model performance. Five different splits were made randomly, each of which had a similar balance between positive and negative substances compared to the full dataset. All generated models were characterised by an overall better performance than existing free and commercial models for the same endpoint, while demonstrating high coverage.


Subjects
Chromosomes/drug effects , Computer Simulation , Databases, Nucleic Acid , Micronucleus Tests , Models, Biological , Quantitative Structure-Activity Relationship , Animals , Sensitivity and Specificity , Software
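
The splitting scheme described in the abstract above (five random 80/20 splits of 718 compounds, each keeping roughly the class balance of the full set) can be sketched with a stratified split. This is a minimal illustration, not the authors' procedure: the function name is hypothetical, and the count of 218 positive compounds is an invented placeholder, since the abstract does not report the class balance.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices into train/test, sampling each class separately so the
    test set keeps about the same class balance as the full dataset."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return sorted(train), sorted(test)


# Hypothetical labels: 1 = micronucleus-positive, 0 = negative (718 in total).
labels = [1] * 218 + [0] * 500

# Five random splits, one per seed.
splits = [stratified_split(labels, seed=s) for s in range(5)]
for train, test in splits:
    positive_share = sum(labels[i] for i in test) / len(test)
    print(len(train), len(test), round(positive_share, 3))
```

Sampling each class separately is what keeps the positive share of every test set close to that of the full dataset; a plain random split would only match it on average.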
17.
BMC Bioinformatics ; 21(1): 211, 2020 May 24.
Article in English | MEDLINE | ID: mdl-32448124

ABSTRACT

BACKGROUND: GenBank contains over 3 million viral sequences. The National Center for Biotechnology Information (NCBI) previously made available a tool for validating and annotating influenza virus sequences that is used to check submissions to GenBank. Before this project, there was no analogous tool in use for non-influenza viral sequence submissions. RESULTS: We developed a system called VADR (Viral Annotation DefineR) that validates and annotates viral sequences in GenBank submissions. The annotation system is based on the analysis of the input nucleotide sequence using models built from curated RefSeqs. Hidden Markov models are used to classify sequences by determining the RefSeq they are most similar to, and feature annotation from the RefSeq is mapped based on a nucleotide alignment of the full sequence to a covariance model. Predicted proteins encoded by the sequence are validated with nucleotide-to-protein alignments using BLAST. The system identifies 43 types of "alerts" that (unlike the previous BLAST-based system) provide deterministic and rigorous feedback to researchers who submit sequences with unexpected characteristics. VADR has been integrated into GenBank's submission processing pipeline allowing for viral submissions passing all tests to be accepted and annotated automatically, without the need for any human (GenBank indexer) intervention. Unlike the previous submission-checking system, VADR is freely available (https://github.com/nawrockie/vadr) for local installation and use. VADR has been used for Norovirus submissions since May 2018 and for Dengue virus submissions since January 2019. Since March 2020, VADR has also been used to check SARS-CoV-2 sequence submissions. Other viruses with high numbers of submissions will be added incrementally. CONCLUSION: VADR improves the speed with which non-flu virus submissions to GenBank can be checked and improves the content and quality of the GenBank annotations. 
The availability and portability of the software allow researchers to run the GenBank checks prior to submitting their viral sequences, and thereby gain confidence that their submissions will be accepted immediately without the need to correspond with GenBank staff. Reciprocally, the adoption of VADR frees GenBank staff to spend more time on services other than checking routine viral sequence submissions.


Subjects
Betacoronavirus , Coronavirus Infections , Databases, Nucleic Acid , Molecular Sequence Annotation , Pandemics , Pneumonia, Viral , Software , Betacoronavirus/genetics , Coronavirus Infections/genetics , DNA Viruses , Genomics , Humans , Molecular Sequence Annotation/standards , Pneumonia, Viral/genetics , Viruses
18.
BMC Bioinformatics ; 21(1): 174, 2020 May 04.
Article in English | MEDLINE | ID: mdl-32366294

ABSTRACT

BACKGROUND: Transcriptome analysis by next-generation sequencing has become a popular technique in recent years. This approach is quite suitable for non-model organism study, as de novo assembly is independent of prior genomic sequences of organisms. De novo sequencing has benefited many studies on commercially important fish species. However, to understand the functions of these assembled sequences, they still need to be annotated with existing sequence databases. By combining Basic Local Alignment Search Tool (BLAST) and Gene Ontology analysis, we were able to identify homologous sequences of assembled sequences and describe their characteristics using pre-defined tags for each gene, though the above conventional annotation results obtained for non-model assembled sequences was still associated with a lack of pre-defined tags and poorly documented records in the database. RESULTS: We introduced Blast2Fish, a novel approach for performing functional enrichment analysis on non-model teleost fish transcriptome data. The Blast2Fish pipeline was designed to be a reference-based enrichment method. Instead of annotating the BLAST single top hit by a pre-defined gene-to-tag database, we included 500 hits to search related PubMed articles and parse biological terms. These descriptive terms were then sorted and recorded as annotations for the query. The results showed that Blast2Fish was capable of providing meaningful annotations on immunology topics for non-model fish transcriptome analysis. CONCLUSION: Blast2Fish provides a novel approach for annotating sequences of non-model fish. The reference-based strategy allows annotation to be performed without pre-defined tags for each gene. This method strongly benefits non-model teleost fish studies for gene functional enrichment analysis.


Subjects
Computational Biology/methods , Fish Proteins/genetics , Fishes/genetics , Molecular Sequence Annotation/methods , Animals , Databases, Nucleic Acid , Fish Proteins/chemistry , Fish Proteins/metabolism , Fishes/metabolism , Gene Expression Profiling , Genomics , High-Throughput Nucleotide Sequencing , Internet , Software , Transcriptome
19.
Hum Genet ; 139(8): 1065-1075, 2020 Aug.
Article in English | MEDLINE | ID: mdl-32248359

ABSTRACT

Wilson disease (WD) is a genetic disorder of copper metabolism caused by variants in the copper-transporting P-type ATPase gene ATP7B. Estimates of WD population prevalence vary, with 1 in 30,000 generally quoted. However, some genetic studies have reported much higher prevalence rates. The aim of this study was to estimate the population prevalence of WD and the pathogenicity/penetrance of WD variants by determining the frequency of ATP7B variants in a genomic sequence database. A catalogue of WD-associated ATP7B variants was constructed, and frequency information for these variants was then extracted from the gnomAD data set. Pathogenicity of variants was assessed by (a) comparing gnomAD allele frequencies against the number of reports for variants in the WD literature and (b) using variant effect prediction algorithms. 231 WD-associated ATP7B variants were identified in the gnomAD data set, giving an initial estimated population prevalence of around 1 in 2400. After exclusion of WD-associated ATP7B variants with predicted low penetrance, the revised estimate showed a prevalence of around 1 in 20,000, with higher rates in the Asian and Ashkenazi Jewish populations. Reanalysis of other recent genetic studies using our penetrance criteria also predicted lower population prevalences for WD in the UK and France than had been reported. Our results suggest that differences in variant penetrance can explain the discrepancy between reported epidemiological and genetic prevalences of WD. They also highlight the challenge in defining penetrance when assigning causality to some ATP7B variants.


Subjects
Copper-Transporting ATPases/genetics , Genetic Variation/genetics , Hepatolenticular Degeneration/genetics , Copper/metabolism , Databases, Nucleic Acid , Gene Frequency , Hepatolenticular Degeneration/epidemiology , Humans , Penetrance , Prevalence
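
To show how allele frequencies from a database can translate into a prevalence figure like those in the abstract above, here is a minimal sketch. It assumes Hardy-Weinberg equilibrium and a fully penetrant autosomal recessive model, in which prevalence is the square of the summed pathogenic allele frequency. The abstract does not state that the study used exactly this calculation, and the function names and example frequencies are hypothetical.

```python
import math

def recessive_prevalence(allele_freqs):
    """Expected prevalence of a recessive disorder under Hardy-Weinberg.

    allele_freqs: frequencies of the pathogenic alleles (illustrative
    values, not taken from gnomAD). Affected individuals carry two
    pathogenic alleles, so prevalence is q squared with q the summed
    frequency. Assumption: full penetrance and random mating.
    """
    q = sum(allele_freqs)
    return q * q

def implied_allele_frequency(one_in_n):
    """Summed pathogenic allele frequency implied by a '1 in N' prevalence."""
    return math.sqrt(1.0 / one_in_n)

# Summed frequencies implied by the two estimates quoted in the abstract.
q_initial = implied_allele_frequency(2400)    # roughly 0.020
q_revised = implied_allele_frequency(20000)   # roughly 0.0071

# Dropping a presumed low-penetrance allele lowers the estimate sharply,
# because prevalence scales with the square of the summed frequency.
with_all = recessive_prevalence([0.005, 0.002, 0.013])
without_low_penetrance = recessive_prevalence([0.005, 0.002])
```

Because prevalence depends on the square of the summed frequency, removing a few relatively common low-penetrance variants can shift the estimate by close to an order of magnitude, which is consistent with the gap between the two figures reported.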
20.
PLoS One ; 15(4): e0231814, 2020.
Article in English | MEDLINE | ID: mdl-32298363

ABSTRACT

Applications of biological knowledge, such as forensics, often require the determination of biological materials to a species level. As such, DNA-based approaches to identification, particularly DNA barcoding, are attracting increased interest. The capacity of DNA barcodes to assign newly encountered specimens to a species relies upon access to informatics platforms, such as BOLD and GenBank, which host libraries of reference sequences and support the comparison of new sequences to them. As parameterization of these libraries expands, DNA barcoding has the potential to make valuable contributions in diverse applied contexts. However, a recent publication called for caution after finding that both platforms performed poorly in identifying specimens of 17 common insect species. This study follows up on this concern by asking if the misidentifications reflected problems in the reference libraries or in the query sequences used to test them. Because this reanalysis revealed that missteps in acquiring and analyzing the query sequences were responsible for most misidentifications, a workflow is described to minimize such errors in future investigations. The present study also revealed the limitations imposed by the lack of a polished species-level taxonomy for many groups. In such cases, applications can be strengthened by mapping the geographic distributions of sequence-based species proxies rather than waiting for the maturation of formal taxonomic systems based on morphology.


Subjects
DNA/genetics , Databases, Nucleic Acid , Insects/genetics , Animals , DNA Barcoding, Taxonomic , Data Accuracy , Phylogeny , Experimental Error , Species Specificity