1.
Comput Biol Med ; 175: 108497, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38678944

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing downstream analysis. Traditional PCA, a main workhorse in dimensionality reduction, lacks the ability to capture geometrical structure information embedded in the data, and previous graph Laplacian regularizations are limited by the analysis of only a single scale. We propose a topological Principal Components Analysis (tPCA) method by the combination of persistent Laplacian (PL) technique and L2,1 norm regularization to address multiscale and multiclass heterogeneity issues in data. We further introduce a k-Nearest-Neighbor (kNN) persistent Laplacian technique to improve the robustness of our persistent Laplacian method. The proposed kNN-PL is a new algebraic topology technique which addresses the many limitations of the traditional persistent homology. Rather than inducing filtration via the varying of a distance threshold, we introduced kNN-tPCA, where filtrations are achieved by varying the number of neighbors in a kNN network at each step, and find that this framework has significant implications for hyper-parameter tuning. We validate the efficacy of our proposed tPCA and kNN-tPCA methods on 11 diverse benchmark scRNA-seq datasets, and showcase that our methods outperform other unsupervised PCA enhancements from the literature, as well as popular Uniform Manifold Approximation (UMAP), t-Distributed Stochastic Neighbor Embedding (tSNE), and Projection Non-Negative Matrix Factorization (NMF) by significant margins. 
For example, tPCA provides up to 628%, 78%, and 149% improvements over UMAP, tSNE, and NMF, respectively, on classification in the F1 metric, and kNN-tPCA offers 53%, 63%, and 32% improvements over UMAP, tSNE, and NMF, respectively, on clustering in the ARI metric.


Subjects
Principal Component Analysis , RNA Sequence Analysis , Single-Cell Analysis , Single-Cell Analysis/methods , Humans , RNA Sequence Analysis/methods , Algorithms , RNA-Seq/methods
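The abstract above describes a filtration obtained by varying the number of neighbors in a kNN network rather than a distance threshold. The following is a minimal sketch of that idea only, assuming Euclidean distance and a symmetrized neighbor graph; the function name `knn_filtration` and the toy points are hypothetical, and the persistent Laplacian and tPCA steps themselves are not implemented here.

```python
import math


def knn_filtration(points, max_k):
    """Return one undirected edge set per k = 1..max_k.

    Each edge set connects every point to its k nearest neighbors.
    Because the k nearest neighbors are contained in the k+1 nearest
    neighbors, the edge sets are nested and form a filtration of graphs.
    """
    n = len(points)
    # For each point, all other point indices ordered by increasing distance.
    neighbors = [
        sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for i in range(n)
    ]
    filtration = []
    for k in range(1, max_k + 1):
        # frozenset makes the edge undirected: {i, j} equals {j, i}.
        edges = {frozenset((i, j)) for i in range(n) for j in neighbors[i][:k]}
        filtration.append(edges)
    return filtration


# Toy data (assumed for illustration): four points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
filtration = knn_filtration(points, max_k=3)
for k, edges in enumerate(filtration, start=1):
    print(k, len(edges))
```

In this sketch the only hyper-parameter of the filtration is the integer range of k, which may be part of why the abstract mentions implications for hyper-parameter tuning.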
2.
J Comput Appl Math ; 445, 2024 Aug 01.
Article in English | MEDLINE | ID: mdl-38464901

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) is a relatively new technology that has stimulated enormous interest in statistics, data science, and computational biology due to the high dimensionality, complexity, and large scale associated with scRNA-seq data. Nonnegative matrix factorization (NMF) offers a unique approach due to its meta-gene interpretation of resulting low-dimensional components. However, NMF approaches suffer from the lack of multiscale analysis. This work introduces two persistent Laplacian regularized NMF methods, namely, topological NMF (TNMF) and robust topological NMF (rTNMF). By employing a total of 12 datasets, we demonstrate that the proposed TNMF and rTNMF significantly outperform all other NMF-based methods. We have also utilized TNMF and rTNMF for the visualization of popular Uniform Manifold Approximation and Projection (UMAP) and t-distributed stochastic neighbor embedding (t-SNE).

3.
Comput Biol Med ; 171: 108211, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38422960

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) has emerged as a transformative technology, offering unparalleled insights into the intricate landscape of cellular diversity and gene expression dynamics. scRNA-seq analysis represents a challenging and cutting-edge frontier within the field of biological research. Differential geometry serves as a powerful mathematical tool in various applications of scientific research. In this study, we introduce, for the first time, a multiscale differential geometry (MDG) strategy for addressing the challenges encountered in scRNA-seq data analysis. We assume that intrinsic properties of cells lie on a family of low-dimensional manifolds embedded in the high-dimensional space of scRNA-seq data. Multiscale cell-cell interactive manifolds are constructed to reveal complex relationships in the cell-cell network, where curvature-based features for cells can decipher the intricate structural and biological information. We showcase the utility of our novel approach by demonstrating its effectiveness in classifying cell types. This innovative application of differential geometry in scRNA-seq analysis opens new avenues for understanding the intricacies of biological networks and holds great potential for network analysis in other fields.


Subjects
Data Analysis , Gene Expression Profiling , RNA Sequence Analysis/methods , Gene Expression Profiling/methods , Cluster Analysis
4.
J Chem Inf Model ; 64(7): 2829-2838, 2024 Apr 08.
Article in English | MEDLINE | ID: mdl-37402705

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing the downstream analysis. We present Correlated Clustering and Projection (CCP), a new data-domain dimensionality reduction method, for the first time. CCP projects each cluster of similar genes into a supergene defined as the accumulated pairwise nonlinear gene-gene correlations among all cells. Using 14 benchmark data sets, we demonstrate that CCP has significant advantages over classical principal component analysis (PCA) for clustering and/or classification problems with intrinsically high dimensionality. In addition, we introduce the Residue-Similarity index (RSI) as a novel metric for clustering and classification and the R-S plot as a new visualization tool. We show that the RSI correlates with accuracy without requiring the knowledge of the true labels. The R-S plot provides a unique alternative to the uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (t-SNE) for data with a large number of cell types.


Subjects
Algorithms , Single-Cell Analysis , Single-Cell Analysis/methods , Cluster Analysis , Principal Component Analysis , RNA Sequence Analysis/methods , Gene Expression Profiling/methods
5.
ArXiv ; 2023 Oct 23.
Article in English | MEDLINE | ID: mdl-37961744

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing downstream analysis. Traditional PCA, a main workhorse in dimensionality reduction, lacks the ability to capture geometrical structure information embedded in the data, and previous graph Laplacian regularizations are limited by the analysis of only a single scale. We propose a topological Principal Components Analysis (tPCA) method by the combination of persistent Laplacian (PL) technique and L2,1 norm regularization to address multiscale and multiclass heterogeneity issues in data. We further introduce a k-Nearest-Neighbor (kNN) persistent Laplacian technique to improve the robustness of our persistent Laplacian method. The proposed kNN-PL is a new algebraic topology technique which addresses the many limitations of the traditional persistent homology. Rather than inducing filtration via the varying of a distance threshold, we introduced kNN-tPCA, where filtrations are achieved by varying the number of neighbors in a kNN network at each step, and find that this framework has significant implications for hyper-parameter tuning. We validate the efficacy of our proposed tPCA and kNN-tPCA methods on 11 diverse benchmark scRNA-seq datasets, and showcase that our methods outperform other unsupervised PCA enhancements from the literature, as well as popular Uniform Manifold Approximation (UMAP), t-Distributed Stochastic Neighbor Embedding (tSNE), and Projection Non-Negative Matrix Factorization (NMF) by significant margins. 
For example, tPCA provides up to 628%, 78%, and 149% improvements over UMAP, tSNE, and NMF, respectively, on classification in the F1 metric, and kNN-tPCA offers 53%, 63%, and 32% improvements over UMAP, tSNE, and NMF, respectively, on clustering in the ARI metric.

6.
J Chem Inf Model ; 63(1): 335-342, 2023 01 09.
Article in English | MEDLINE | ID: mdl-36577010

ABSTRACT

Accurate and reliable forecasting of emerging dominant severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants enables policymakers and vaccine makers to get prepared for future waves of infections. The last three waves of SARS-CoV-2 infections caused by dominant variants, Omicron (BA.1), BA.2, and BA.4/BA.5, were accurately foretold by our artificial intelligence (AI) models built with biophysics, genotyping of viral genomes, experimental data, algebraic topology, and deep learning. On the basis of newly available experimental data, we analyzed the impacts of all possible viral spike (S) protein receptor-binding domain (RBD) mutations on the SARS-CoV-2 infectivity. Our analysis sheds light on viral evolutionary mechanisms, i.e., natural selection through infectivity strengthening and antibody resistance. We forecast that BP.1, BL*, BA.2.75*, BQ.1*, and particularly BN.1* have a high potential to become the new dominant variants to drive the next surge. Our key projection about these variants dominance made on Oct. 18, 2022 (see arXiv:2210.09485) became reality in late November 2022.


Subjects
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , Artificial Intelligence , Antibodies
7.
ArXiv ; 2022 Oct 18.
Article in English | MEDLINE | ID: mdl-36299737

ABSTRACT

Accurate and reliable forecasting of emerging dominant severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants enables policymakers and vaccine makers to get prepared for future waves of infections. The last three waves of SARS-CoV-2 infections caused by dominant variants Omicron (BA.1), BA.2, and BA.4/BA.5 were accurately foretold by our artificial intelligence (AI) models built with biophysics, genotyping of viral genomes, experimental data, algebraic topology, and deep learning. Based on newly available experimental data, we analyzed the impacts of all possible viral spike (S) protein receptor-binding domain (RBD) mutations on the SARS-CoV-2 infectivity. Our analysis sheds light on viral evolutionary mechanisms, i.e., natural selection through infectivity strengthening and antibody resistance. We forecast that BA.2.10.4, BA.2.75, BQ.1.1, and particularly, BA.2.75+R346T, have high potential to become new dominant variants to drive the next surge.

8.
Prog Earth Planet Sci ; 9(1): 11, 2022.
Article in English | MEDLINE | ID: mdl-35127336

ABSTRACT

The exact occurrence frequency of noctilucent clouds (NLCs) in middle latitudes is significant information because it is thought to be sensitive to long-term atmospheric change. We conducted NLC observations from airline jets in the Northern Hemisphere during the summer of 2019 to evaluate the effectiveness of NLC observation from airborne platforms. In cooperation with the Japanese airline All Nippon Airways (ANA), imaging observations of NLCs were conducted on 13 flights from Jun 8 to Jul 12. After careful analysis, 8 of these 13 flights were found to have successfully detected NLCs from middle latitudes (lower than 55° N) during their cruising phase. Based on the results of these test observations, it is shown that an airline jet is a powerful tool to continuously monitor the occurrence frequency of NLCs at midlatitudes, which is generally difficult with a polar-orbiting satellite due to sparse sampling in both the temporal and spatial domains. The advantages of NLC observation from jets over satellite observation from the point of view of imaging geometry are also presented.

9.
ACS Infect Dis ; 8(3): 546-556, 2022 03 11.
Article in English | MEDLINE | ID: mdl-35133792

ABSTRACT

The surge of COVID-19 infections has been fueled by new SARS-CoV-2 variants, namely Alpha, Beta, Gamma, Delta, and so forth. The molecular mechanism underlying such surge is elusive due to the existence of 28 554 unique mutations, including 4 653 non-degenerate mutations on the spike protein. Understanding the molecular mechanism of SARS-CoV-2 transmission and evolution is a prerequisite to foresee the trend of emerging vaccine-breakthrough variants and the design of mutation-proof vaccines and monoclonal antibodies. We integrate the genotyping of 1 489 884 SARS-CoV-2 genomes, a library of 130 human antibodies, tens of thousands of mutational data, topological data analysis, and deep learning to reveal SARS-CoV-2 evolution mechanism and forecast emerging vaccine-breakthrough variants. We show that prevailing variants can be quantitatively explained by infectivity-strengthening and vaccine-escape (co-)mutations on the spike protein RBD due to natural selection and/or vaccination-induced evolutionary pressure. We illustrate that infectivity strengthening mutations were the main mechanism for viral evolution, while vaccine-escape mutations become a dominating viral evolutionary mechanism among highly vaccinated populations. We demonstrate that Lambda is as infectious as Delta but is more vaccine-resistant. We analyze emerging vaccine-breakthrough comutations in highly vaccinated countries, including the United Kingdom, the United States, Denmark, and so forth. Finally, we identify sets of comutations that have a high likelihood of massive growth: [A411S, L452R, T478K], [L452R, T478K, N501Y], [V401L, L452R, T478K], [K417N, L452R, T478K], [L452R, T478K, E484K, N501Y], and [P384L, K417N, E484K, N501Y]. We predict they can escape existing vaccines. We foresee an urgent need to develop new virus combating strategies.

10.
ArXiv ; 2021 Sep 09.
Article in English | MEDLINE | ID: mdl-34518803

ABSTRACT

The recent global surge in COVID-19 infections has been fueled by new SARS-CoV-2 variants, namely Alpha, Beta, Gamma, Delta, etc. The molecular mechanism underlying such surge is elusive due to 4,653 non-degenerate mutations on the spike protein, which is the target of most COVID-19 vaccines. The understanding of the molecular mechanism of transmission and evolution is a prerequisite to foresee the trend of emerging vaccine-breakthrough variants and the design of mutation-proof vaccines and monoclonal antibodies. We integrate the genotyping of 1,489,884 SARS-CoV-2 genome isolates, 130 human antibodies, tens of thousands of mutational data points, topological data analysis, and deep learning to reveal the SARS-CoV-2 evolution mechanism and forecast emerging vaccine-escape variants. We show that infectivity-strengthening and antibody-disruptive co-mutations on the S protein RBD can quantitatively explain the infectivity and virulence of all prevailing variants. We demonstrate that Lambda is as infectious as Delta but is more vaccine-resistant. We analyze emerging vaccine-breakthrough co-mutations in 20 countries, including the United Kingdom, the United States, Denmark, Brazil, and Germany. We envision that natural selection through infectivity will continue to be the main mechanism for viral evolution among unvaccinated populations, while antibody-disruptive co-mutations will fuel the future growth of vaccine-breakthrough variants among fully vaccinated populations. Finally, we have identified the co-mutations that have a great likelihood of becoming dominant: [A411S, L452R, T478K], [L452R, T478K, N501Y], [V401L, L452R, T478K], [K417N, L452R, T478K], [L452R, T478K, E484K, N501Y], and [P384L, K417N, E484K, N501Y]. We predict they, particularly the last four, will break through existing vaccines. We foresee an urgent need to develop new vaccines that target these co-mutations.

11.
Comput Biol Med ; 131: 104264, 2021 04.
Article in English | MEDLINE | ID: mdl-33647832

ABSTRACT

Coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has a worldwide devastating effect. Understanding the evolution and transmission of SARS-CoV-2 is of paramount importance for controlling, combating and preventing COVID-19. Due to the rapid growth in both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emergent large-data challenge. We introduce a dimension-reduced K-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). By using four benchmark datasets, we found that UMAP is the best-suited technique due to its stable, reliable, and efficient performance, its ability to improve clustering accuracy, especially for large Jaccard distanced-based datasets, and its superior clustering visualization. The UMAP-assisted K-means clustering enables us to shed light on increasingly large datasets from SARS-CoV-2 genome isolates.


Subjects
Algorithms , COVID-19/genetics , Nucleic Acid Databases , Viral Genome , Mutation , Phylogeny , SARS-CoV-2/genetics , Humans
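The abstract above refers to Jaccard distance-based datasets as input to dimension reduction and K-means clustering. As a rough illustration of how such a distance matrix might be built, here is a minimal sketch that treats each genome isolate as a set of mutation labels. The isolate names and mutation sets are made up for the example, and the dimension-reduction and clustering steps are not shown.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical isolates, each described by the set of mutations it carries.
isolates = {
    "iso_1": {"m1", "m2", "m3", "m4"},
    "iso_2": {"m1", "m2", "m3", "m4", "m5"},
    "iso_3": {"m6", "m7"},
}

names = sorted(isolates)
# Symmetric pairwise distance matrix, rows and columns ordered as in `names`.
distance_matrix = [
    [jaccard_distance(isolates[r], isolates[c]) for c in names] for r in names
]

for name, row in zip(names, distance_matrix):
    print(name, [round(d, 3) for d in row])
```

Each row of such a matrix can serve as a feature vector for a sample, which is one plausible way to feed the data into PCA, t-SNE, or UMAP before clustering.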
13.
Commun Biol ; 4(1): 228, 2021 02 15.
Article in English | MEDLINE | ID: mdl-33589648

ABSTRACT

SARS-CoV-2 has been mutating since it was first sequenced in early January 2020. Here, we analyze 45,494 complete SARS-CoV-2 genome sequences in the world to understand their mutations. Among them, 12,754 sequences are from the United States. Our analysis suggests the presence of four substrains and eleven top mutations in the United States. These eleven top mutations belong to 3 disconnected groups. The first and second groups consisting of 5 and 8 concurrent mutations are prevailing, while the other group with three concurrent mutations gradually fades out. Moreover, we reveal that female immune systems are more active than those of males in responding to SARS-CoV-2 infections. One of the top mutations, 27964C > T-(S24L) on ORF8, has an unusually strong gender dependence. Based on the analysis of all mutations on the spike protein, we uncover that two of four SARS-CoV-2 substrains in the United States become potentially more infectious.


Subjects
COVID-19/virology , Mutation/genetics , SARS-CoV-2/genetics , 5' Untranslated Regions/genetics , Amino Acid Sequence , Angiotensin-Converting Enzyme 2/chemistry , Angiotensin-Converting Enzyme 2/metabolism , Molecular Evolution , Female , Humans , Male , Molecular Models , Nucleocapsid/metabolism , Open Reading Frames/genetics , Single Nucleotide Polymorphism/genetics , Protein Binding , Protein Domains , Protein Folding , SARS-CoV-2/pathogenicity , Coronavirus Spike Glycoprotein/chemistry , Coronavirus Spike Glycoprotein/genetics , Thermodynamics , United States
14.
J Phys Chem Lett ; 11(23): 10007-10015, 2020 Dec 03.
Article in English | MEDLINE | ID: mdl-33179934

ABSTRACT

One of the major challenges in controlling the coronavirus disease 2019 (COVID-19) outbreak is its asymptomatic transmission. The pathogenicity and virulence of asymptomatic COVID-19 remain mysterious. On the basis of the genotyping of 75775 SARS-CoV-2 genome isolates, we reveal that asymptomatic infection is linked to SARS-CoV-2 11083G>T mutation (i.e., L37F at nonstructure protein 6 (NSP6)). By analyzing the distribution of 11083G>T in various countries, we unveil that 11083G>T may correlate with the hypotoxicity of SARS-CoV-2. Moreover, we show a global decaying tendency of the 11083G>T mutation ratio indicating that 11083G>T hinders the SARS-CoV-2 transmission capacity. Artificial intelligence, sequence alignment, and network analysis are applied to show that NSP6 mutation L37F may have compromised the virus's ability to undermine the innate cellular defense against viral infection via autophagy regulation. This assessment is in good agreement with our genotyping of the SARS-CoV-2 evolution and transmission across various countries and regions over the past few months.


Subjects
Asymptomatic Infections , COVID-19/transmission , SARS-CoV-2/genetics , Artificial Intelligence , COVID-19/virology , Viral Genome , Genotype , High-Throughput Nucleotide Sequencing , Humans , Mutation , Viral Proteins/genetics
15.
Viruses ; 12(10), 2020 09 27.
Article in English | MEDLINE | ID: mdl-32992592

ABSTRACT

The transmission and evolution of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are of paramount importance in controlling and combating the coronavirus disease 2019 (COVID-19) pandemic. Currently, over 15,000 SARS-CoV-2 single mutations have been recorded, which have a great impact on the development of diagnostics, vaccines, antibody therapies, and drugs. However, little is known about SARS-CoV-2's evolutionary characteristics and general trend. In this work, we present a comprehensive genotyping analysis of existing SARS-CoV-2 mutations. We reveal that host immune response via APOBEC and ADAR gene editing gives rise to near 65% of recorded mutations. Additionally, we show that children under age five and the elderly may be at high risk from COVID-19 because of their overreaction to the viral infection. Moreover, we uncover that populations of Oceania and Africa react significantly more intensively to SARS-CoV-2 infection than those of Europe and Asia, which may explain why African Americans were shown to be at increased risk of dying from COVID-19, in addition to their high risk of COVID-19 infection caused by systemic health and social inequities. Finally, our study indicates that for two viral genome sequences of the same origin, their evolution order may be determined from the ratio of mutation type, C > T over T > C.


Subjects
Betacoronavirus/genetics , Betacoronavirus/immunology , Coronavirus Infections/immunology , Coronavirus Infections/virology , Molecular Evolution , Viral Pneumonia/immunology , Viral Pneumonia/virology , COVID-19 , Female , Gene Editing , Viral Genome , Genotype , Host-Pathogen Interactions , Humans , Male , Mutation , Pandemics , Single Nucleotide Polymorphism , SARS-CoV-2 , Sequence Alignment , Viral Proteins/genetics
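The abstract above closes with the ratio of mutation type C > T over T > C. As a small illustration of the arithmetic only, the sketch below counts mutation types in a list of labels written in the position-plus-substitution style used elsewhere on this page (for example 11083G>T). The helper names and the sample list are assumptions for the example, and no interpretation of the resulting ratio is implied.

```python
from collections import Counter


def mutation_type(label):
    """Strip the leading genome position, e.g. '11083G>T' -> 'G>T'."""
    return label.lstrip("0123456789")


def mutation_type_ratio(labels, numerator="C>T", denominator="T>C"):
    """Ratio of two mutation-type counts, or None if the denominator count is zero."""
    counts = Counter(mutation_type(label) for label in labels)
    if counts[denominator] == 0:
        return None
    return counts[numerator] / counts[denominator]


# Made-up mutation list for one sequence relative to a reference.
sample = ["100C>T", "200C>T", "300T>C", "400G>T"]
print(mutation_type_ratio(sample))
```

Returning `None` when no T > C mutation is present avoids a division by zero and leaves it to the caller to decide how to handle that case.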
16.
Genomics ; 112(6): 5204-5213, 2020 11.
Article in English | MEDLINE | ID: mdl-32966857

ABSTRACT

Effective, sensitive, and reliable diagnostic reagents are of paramount importance for combating the ongoing coronavirus disease 2019 (COVID-19) pandemic when there is neither a preventive vaccine nor a specific drug available for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It will cause a large number of false-positive and false-negative tests if currently used diagnostic reagents are undermined. Based on genotyping of 31,421 SARS-CoV-2 genome samples collected up to July 23, 2020, we reveal that essentially all of the current COVID-19 diagnostic targets have undergone mutations. We further show that SARS-CoV-2 has the most mutations on the targets of various nucleocapsid (N) gene primers and probes, which have been widely used around the world to diagnose COVID-19. To understand whether SARS-CoV-2 genes have mutated unevenly, we have computed the mutation rate and mutation h-index of all SARS-CoV-2 genes, indicating that the N gene is one of the most non-conservative genes in the SARS-CoV-2 genome. We show that due to human immune response induced APOBEC mRNA (C > T) editing, diagnostic targets should also be selected to avoid cytidines. Our findings might enable optimally selecting the conservative SARS-CoV-2 genes and proteins for the design and development of COVID-19 diagnostic reagents, prophylactic vaccines, and therapeutic medicines. AVAILABILITY: Interactive real-time online Mutation Tracker.


Subjects
COVID-19 Testing , COVID-19/virology , Mutation , SARS-CoV-2/genetics , Coronavirus Envelope Proteins/genetics , DNA Primers , Genotyping Techniques , Humans , Single Nucleotide Polymorphism , SARS-CoV-2/isolation & purification
17.
Res Sq ; 2020 Aug 11.
Article in English | MEDLINE | ID: mdl-32818213

ABSTRACT

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been mutating since it was first sequenced in early January 2020. The genetic variants have developed into a few distinct clusters with different properties. Since the United States (US) has the highest number of viral infected patients globally, it is essential to understand the US SARS-CoV-2. Using genotyping, sequence-alignment, time-evolution, k-means clustering, protein-folding stability, algebraic topology, and network theory, we reveal that the US SARS-CoV-2 has four substrains and five top US SARS-CoV-2 mutations were first detected in China (2 cases), Singapore (2 cases), and the United Kingdom (1 case). The next three top US SARS-CoV-2 mutations were first detected in the US. These eight top mutations belong to two disconnected groups. The first group consisting of 5 concurrent mutations is prevailing, while the other group with three concurrent mutations gradually fades out. We identify that one of the top mutations, 27964C>T-(S24L) on ORF8, has an unusually strong gender dependence. Based on the analysis of all mutations on the spike protein, we further uncover that three of four US SARS-CoV-2 substrains become more infectious. Our study calls for effective viral control and containing strategies in the US.

18.
ArXiv ; 2020 Aug 20.
Article in English | MEDLINE | ID: mdl-32839723

ABSTRACT

The transmission and evolution of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are of paramount importance to the controlling and combating of coronavirus disease 2019 (COVID-19) pandemic. Currently, near 15,000 SARS-CoV-2 single mutations have been recorded, having a great ramification to the development of diagnostics, vaccines, antibody therapies, and drugs. However, little is known about SARS-CoV-2 evolutionary characteristics and general trend. In this work, we present a comprehensive genotyping analysis of existing SARS-CoV-2 mutations. We reveal that host immune response via APOBEC and ADAR gene editing gives rise to near 65% of recorded mutations. Additionally, we show that children under age five and the elderly may be at high risk from COVID-19 because of their overreacting to the viral infection. Moreover, we uncover that populations of Oceania and Africa react significantly more intensively to SARS-CoV-2 infection than those of Europe and Asia, which may explain why African Americans were shown to be at increased risk of dying from COVID-19, in addition to their high risk of getting sick from COVID-19 caused by systemic health and social inequities. Finally, our study indicates that for two viral genome sequences of the same origin, their evolution order may be determined from the ratio of mutation type C > T over T > C.

19.
J Chem Inf Model ; 60(12): 5853-5865, 2020 12 28.
Article in English | MEDLINE | ID: mdl-32530284

ABSTRACT

Tremendous effort has been given to the development of diagnostic tests, preventive vaccines, and therapeutic medicines for coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Much of this development has been based on the reference genome collected on January 5, 2020. Based on the genotyping of 15 140 genome samples collected up to June 1, 2020, we report that SARS-CoV-2 has undergone 8309 single mutations which can be clustered into six subtypes. We introduce mutation ratio and mutation h-index to characterize the protein conservativeness and unveil that SARS-CoV-2 envelope protein, main protease, and endoribonuclease protein are relatively conservative, while SARS-CoV-2 nucleocapsid protein, spike protein, and papain-like protease are relatively nonconservative. In particular, we have identified mutations on 40% of nucleotides in the nucleocapsid gene in the population level, signaling potential impacts on the ongoing development of COVID-19 diagnosis, vaccines, and antibody and small-molecular drugs.


Subjects
COVID-19 , SARS-CoV-2/classification , SARS-CoV-2/metabolism , Antiviral Antibodies/metabolism , COVID-19/diagnosis , COVID-19/epidemiology , COVID-19/prevention & control , COVID-19/therapy , Coronavirus 3C Proteases/chemistry , Coronavirus 3C Proteases/genetics , Coronavirus Envelope Proteins/chemistry , Coronavirus Envelope Proteins/genetics , Coronavirus Nucleocapsid Proteins/chemistry , Coronavirus Nucleocapsid Proteins/genetics , Coronavirus Papain-Like Proteases/chemistry , Coronavirus Papain-Like Proteases/genetics , Endoribonucleases/chemistry , Endoribonucleases/genetics , Viral Genome , Genotype , Geography , Humans , Mutant Proteins/chemistry , Mutant Proteins/genetics , Mutation , Phosphoproteins/chemistry , Phosphoproteins/genetics , Protein Conformation , Coronavirus Spike Glycoprotein/chemistry , Coronavirus Spike Glycoprotein/genetics , Vaccines/metabolism , Viral Nonstructural Proteins/chemistry , Viral Nonstructural Proteins/genetics
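The abstract above introduces a mutation h-index but does not define it on this page. The sketch below assumes a definition by analogy with the bibliometric h-index: the largest h such that a gene has at least h distinct mutations, each observed at least h times. That definition, the function name, and the counts are assumptions for illustration and should be checked against the paper itself.

```python
def mutation_h_index(occurrence_counts):
    """Largest h such that at least h mutations each occur at least h times.

    `occurrence_counts` holds one integer per distinct mutation on a gene:
    the number of genome samples in which that mutation was observed.
    """
    ranked = sorted(occurrence_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` mutations all occur at least `rank` times
        else:
            break
    return h


# Made-up occurrence counts for the distinct mutations of one gene.
counts_for_gene = [10, 8, 5, 4, 3]
print(mutation_h_index(counts_for_gene))
```

Under this assumed definition, a gene reaches a high h-index only when many distinct mutations are each widespread, which a simple count of distinct mutations would not capture.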
20.
ArXiv ; 2020 Dec 30.
Article in English | MEDLINE | ID: mdl-33398244

ABSTRACT

Coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has a worldwide devastating effect. The understanding of evolution and transmission of SARS-CoV-2 is of paramount importance for the COVID-19 control, combating, and prevention. Due to the rapid growth of both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emergent large-data challenge. We introduce a dimension-reduced k-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). By using four benchmark datasets, we found that UMAP is the best-suited technique due to its stable, reliable, and efficient performance, its ability to improve clustering accuracy, especially for large Jaccard distanced-based datasets, and its superior clustering visualization. The UMAP-assisted k-means clustering enables us to shed light on increasingly large datasets from SARS-CoV-2 genome isolates.
