Results 1 - 20 of 44
1.
Sensors (Basel) ; 23(2), 2023 Jan 11.
Article in English | MEDLINE | ID: mdl-36679658

ABSTRACT

In this paper, the problem of trajectory design for energy harvesting unmanned aerial vehicles (UAVs) is studied. In the considered model, the UAV acts as a moving base station to serve the ground users, while collecting energy from the charging stations located at the center of a user group. For this purpose, the UAV must be examined and repaired regularly. Consequently, it is necessary to optimize the trajectory design of the UAV while jointly considering the maintenance costs, the reward of serving users, the energy management, and the user service time. To capture the relationship among these factors, we first model the completion of service and the harvested energy as the reward, and the energy consumption during the deployment as the cost. Then, the deployment profitability is defined as the ratio of the reward to the cost of the UAV trajectory. Based on this definition, the trajectory design problem is formulated as an optimization problem whose goal is to maximize the deployment profitability of the UAV. To solve this problem, a foraging-based algorithm is proposed to find the optimal trajectory so as to maximize the deployment profitability and minimize the average user service time. The proposed algorithm can find the optimal trajectory for the UAV in polynomial time. Fundamental analysis shows that the proposed algorithm achieves the maximal deployment profitability. Simulation results show that, compared to a Q-learning algorithm, the proposed algorithm effectively reduces the operation time and the average user service time while achieving the maximal deployment profitability.


Subject(s)
Algorithms, Intelligence, Physical Phenomena, Computer Simulation, Reward
2.
Sensors (Basel) ; 22(1), 2022 Jan 04.
Article in English | MEDLINE | ID: mdl-35009885

ABSTRACT

In order to reduce the amount of hyperspectral imaging (HSI) data transmission required through hyperspectral remote sensing (HRS), we propose a structured low-rank and joint-sparse (L&S) data compression and reconstruction method. The proposed method exploits spatial and spectral correlations in HSI data using sparse Bayesian learning and compressive sensing (CS). By utilizing a simultaneously L&S data model, we employ the information of the principal components and Bayesian learning to reconstruct the hyperspectral images. The simulation results demonstrate that the proposed method is superior to LRMR and SS&LR methods in terms of reconstruction accuracy and computational burden under the same signal-to-noise ratio (SNR) and compression ratio.

3.
J Theor Biol ; 515: 110604, 2021 04 21.
Article in English | MEDLINE | ID: mdl-33508323

ABSTRACT

The ongoing global pandemic of the infectious disease COVID-19, caused by the 2019 novel coronavirus (SARS-CoV-2, formerly 2019-nCoV), presents critical threats to public health and the economy. The genome of SARS-CoV-2 has been sequenced and structurally annotated, yet little is known of the intrinsic organization and evolution of the genome. To this end, we present a mathematical method for the genomic spectrum, a kind of barcode, of SARS-CoV-2 and common human coronaviruses. The genomic spectrum is constructed according to the periodic distributions of nucleotides and therefore reflects the unique characteristics of the genome. The results demonstrate that coronavirus SARS-CoV-2 exhibits predominant latent periodicity-2 regions of non-structural proteins 3, 4, 5, and 6. Further analysis of the latent periodicity-2 regions suggests that the dinucleotide imbalances are increased during evolution and may confer the evolutionary fitness of the virus. In particular, SARS-CoV-2 isolates have increased latent periodicity-2 and periodicity-3 during the COVID-19 pandemic. The special strong periodicity-2 regions and the intensity of periodicity-2 in the SARS-CoV-2 whole genome may become diagnostic and pharmaceutical targets in monitoring and curing the COVID-19 disease.


Subject(s)
Molecular Evolution, Viral Genome, Theoretical Models, Period Circadian Proteins/genetics, SARS-CoV-2/genetics, Virulence/genetics, Base Sequence, COVID-19/epidemiology, COVID-19/virology, Taxonomic DNA Barcoding/methods, Viral Genome/genetics, Genomics, 21st Century History, Humans, Open Reading Frames/genetics, Pandemics, Phylogeny, Viral RNA/genetics, SARS-CoV-2/pathogenicity, DNA Sequence Analysis
4.
J Theor Biol ; 530: 110885, 2021 12 07.
Article in English | MEDLINE | ID: mdl-34478743

ABSTRACT

The world faces a great unforeseen challenge through the COVID-19 pandemic caused by coronavirus SARS-CoV-2. The virus genome structure and evolution are positioned front and center for gaining further insights into vaccine development, monitoring of transmission trajectories, and prevention of zoonotic infections of new coronaviruses. Of particular interest are the genomic elements Inverse Repeats (IRs), which maintain genome stability, regulate gene expression, and are the targets of mutations. However, little research attention has been given to IR content analysis in the SARS-CoV-2 genome. In this study, we propose a geometric analysis method and use it to investigate the distributions of IRs in SARS-CoV-2 and its related coronavirus genomes. The method represents each genomic IR sequence pair as a single point and constructs the geometric shape of the genome using the IRs. Thus, the IR shape can be considered as the signature of the genome. The genomes of different coronaviruses are then compared using the constructed IR shapes. The results demonstrate that the SARS-CoV-2 genome, specifically, has an abundance of IRs, and the IRs in coronavirus genomes show an increase during evolution events.


Subject(s)
COVID-19, SARS-CoV-2, Viral Genome/genetics, Genomics, Humans, Pandemics, Phylogeny
5.
Genomics ; 112(5): 3588-3596, 2020 09.
Article in English | MEDLINE | ID: mdl-32353474

ABSTRACT

The emerging global infectious COVID-19 disease caused by novel Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has presented critical threats to global public health and the economy since it was identified in late December 2019 in China. The virus has gone through various pathways of evolution. To understand the evolution and transmission of SARS-CoV-2, genotyping of virus isolates is of great importance. This study presents an accurate method for effectively genotyping SARS-CoV-2 viruses using complete genomes. The method employs the multiple sequence alignments of the genome isolates with the SARS-CoV-2 reference genome. The single-nucleotide polymorphism (SNP) genotypes are then measured by Jaccard distances to track the relationship of virus isolates. The genotyping analysis of SARS-CoV-2 isolates from around the globe reveals that specific multiple mutations are the predominant mutation type during the current epidemic. The proposed method serves as an effective tool for monitoring and tracking the epidemic of pathogenic viruses in their global and local genetic variations. The genotyping analysis shows that the genes encoding the S proteins and RNA polymerase, RNA primase, and nucleoprotein undergo frequent mutations. These mutations are critical for vaccine development in disease control.


Subject(s)
Betacoronavirus/genetics, Genomics, Genotyping Techniques/methods, Mutation, Single Nucleotide Polymorphism, COVID-19, Coronavirus Infections, Molecular Evolution, Viral Genome, Humans, Pandemics, Viral Pneumonia, SARS-CoV-2, Sequence Alignment
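
The Jaccard-distance step named in the abstract above can be sketched as follows. This is a minimal illustration, assuming each isolate's SNP genotype is stored as a set of (position, alternate base) pairs relative to the reference genome; the function name and the example profiles are hypothetical and are not taken from the paper.

```python
def jaccard_distance(snps_a, snps_b):
    """Jaccard distance between two SNP genotypes given as sets."""
    a, b = set(snps_a), set(snps_b)
    union = a | b
    if not union:
        # Both isolates match the reference exactly: treat them as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical SNP profiles, written as (genome position, alternate base).
isolate_1 = {(241, "T"), (3037, "T"), (23403, "G")}
isolate_2 = {(241, "T"), (3037, "T"), (14408, "T"), (23403, "G")}

# Three shared SNPs out of four distinct SNPs overall.
print(jaccard_distance(isolate_1, isolate_2))  # 0.25
```

A distance of 0 means the two isolates carry the same SNPs; a distance of 1 means they share none.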
6.
Genomics ; 112(2): 1847-1852, 2020 03.
Article in English | MEDLINE | ID: mdl-31704313

ABSTRACT

A novel method is proposed to detect the acceptor and donor splice sites using chaos game representation and artificial neural network. In order to achieve high accuracy, inputs to the neural network, or feature vector, shall reflect the true nature of the DNA segments. Therefore it is important to have one-to-one numerical representation, i.e. a feature vector should be able to represent the original data. Chaos game representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a respective position on the plane in a one-to-one manner. Using CGR, a DNA sequence can be mapped to a numerical sequence that reflects the true nature of the original sequence. In this research, we propose to use CGR as feature input to a neural network to detect splice sites on the NN269 dataset. Computational experiments indicate that this approach gives good accuracy while being simpler than other methods in the literature, with only one neural network component. The code and data for our method can be accessed from this link: https://github.com/thoang3/portfolio/tree/SpliceSites_ANN_CGR.


Subject(s)
Neural Networks (Computer), RNA Splice Sites, DNA Sequence Analysis/methods, Humans, Nonlinear Dynamics, Software
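
The chaos game representation described in the abstract above can be sketched in a few lines. The corner assignment (A, C, G, T on the corners of the unit square) and the starting point at the center are a common convention and an assumption here; the abstract does not state which layout the authors use, and the function name is hypothetical.

```python
# Assumed corner layout on the unit square (a common CGR convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return one (x, y) point per nucleotide of seq."""
    x, y = 0.5, 0.5  # start at the center of the square
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the nucleotide's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


print(cgr_points("ACG"))  # [(0.25, 0.25), (0.125, 0.625), (0.5625, 0.8125)]
```

The resulting coordinate list is the kind of numerical sequence that could be fed to a neural network as a feature vector.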
7.
Genomics ; 112(6): 5204-5213, 2020 11.
Article in English | MEDLINE | ID: mdl-32966857

ABSTRACT

Effective, sensitive, and reliable diagnostic reagents are of paramount importance for combating the ongoing coronavirus disease 2019 (COVID-19) pandemic when there is neither a preventive vaccine nor a specific drug available for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It will cause a large number of false-positive and false-negative tests if currently used diagnostic reagents are undermined. Based on genotyping of 31,421 SARS-CoV-2 genome samples collected up to July 23, 2020, we reveal that essentially all of the current COVID-19 diagnostic targets have undergone mutations. We further show that SARS-CoV-2 has the most mutations on the targets of various nucleocapsid (N) gene primers and probes, which have been widely used around the world to diagnose COVID-19. To understand whether SARS-CoV-2 genes have mutated unevenly, we have computed the mutation rate and mutation h-index of all SARS-CoV-2 genes, indicating that the N gene is one of the most non-conservative genes in the SARS-CoV-2 genome. We show that due to human immune response induced APOBEC mRNA (C > T) editing, diagnostic targets should also be selected to avoid cytidines. Our findings might enable optimally selecting the conservative SARS-CoV-2 genes and proteins for the design and development of COVID-19 diagnostic reagents, prophylactic vaccines, and therapeutic medicines. AVAILABILITY: Interactive real-time online Mutation Tracker.


Subject(s)
COVID-19 Testing, COVID-19/virology, Mutation, SARS-CoV-2/genetics, Coronavirus Envelope Proteins/genetics, DNA Primers, Genotyping Techniques, Humans, Single Nucleotide Polymorphism, SARS-CoV-2/isolation & purification
8.
J Chem Inf Model ; 60(12): 5853-5865, 2020 12 28.
Article in English | MEDLINE | ID: mdl-32530284

ABSTRACT

Tremendous effort has been given to the development of diagnostic tests, preventive vaccines, and therapeutic medicines for coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Much of this development has been based on the reference genome collected on January 5, 2020. Based on the genotyping of 15 140 genome samples collected up to June 1, 2020, we report that SARS-CoV-2 has undergone 8309 single mutations which can be clustered into six subtypes. We introduce mutation ratio and mutation h-index to characterize the protein conservativeness and unveil that SARS-CoV-2 envelope protein, main protease, and endoribonuclease protein are relatively conservative, while SARS-CoV-2 nucleocapsid protein, spike protein, and papain-like protease are relatively nonconservative. In particular, we have identified mutations on 40% of nucleotides in the nucleocapsid gene at the population level, signaling potential impacts on the ongoing development of COVID-19 diagnosis, vaccines, and antibody and small-molecule drugs.


Subject(s)
COVID-19, SARS-CoV-2/classification, SARS-CoV-2/metabolism, Antiviral Antibodies/metabolism, COVID-19/diagnosis, COVID-19/epidemiology, COVID-19/prevention & control, COVID-19/therapy, Coronavirus 3C Proteases/chemistry, Coronavirus 3C Proteases/genetics, Coronavirus Envelope Proteins/chemistry, Coronavirus Envelope Proteins/genetics, Coronavirus Nucleocapsid Proteins/chemistry, Coronavirus Nucleocapsid Proteins/genetics, Coronavirus Papain-Like Proteases/chemistry, Coronavirus Papain-Like Proteases/genetics, Endoribonucleases/chemistry, Endoribonucleases/genetics, Viral Genome, Genotype, Geography, Humans, Mutant Proteins/chemistry, Mutant Proteins/genetics, Mutation, Phosphoproteins/chemistry, Phosphoproteins/genetics, Protein Conformation, Coronavirus Spike Glycoprotein/chemistry, Coronavirus Spike Glycoprotein/genetics, Vaccines/metabolism, Viral Nonstructural Proteins/chemistry, Viral Nonstructural Proteins/genetics
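
The abstract above names a mutation h-index without defining it. The sketch below assumes it follows the bibliometric h-index by analogy: the largest h such that at least h distinct mutations in a gene have each been observed at least h times. Both that definition and the sample counts are assumptions made for illustration only.

```python
def mutation_h_index(occurrence_counts):
    """Largest h such that at least h mutations each occur at least h times.

    occurrence_counts: one integer per distinct mutation in a gene,
    giving how many sampled genomes carry that mutation.
    """
    h = 0
    ranked = sorted(occurrence_counts, reverse=True)
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Made-up occurrence counts for six distinct mutations in one gene.
print(mutation_h_index([120, 45, 7, 3, 3, 1]))  # 3
```

Under this reading, a high h-index flags a gene with many mutations that are each widespread, which is consistent with the abstract's use of the index to rank genes by conservativeness.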
9.
J Theor Biol ; 412: 138-145, 2017 01 07.
Article in English | MEDLINE | ID: mdl-27816675

ABSTRACT

Repetitive elements are important in genomic structures, functions and regulations, yet effective methods in precisely identifying repetitive elements in DNA sequences are not fully accessible, and the relationship between repetitive elements and periodicities of genomes is not clearly understood. We present an ab initio method to quantitatively detect repetitive elements and infer the consensus repeat pattern in repetitive elements. The method uses the measure of the distribution uniformity of nucleotides at periodic positions in DNA sequences or genomes. It can identify periodicities, consensus repeat patterns, copy numbers and perfect levels of repetitive elements. The results of using the method on different DNA sequences and genomes demonstrate efficacy and accuracy in identifying repeat patterns and periodicities. The complexity of the method is linear with respect to the lengths of the analyzed sequences. The Python programs in this study are freely available to the public upon request or at https://github.com/cyinbox/DNADU.


Subject(s)
DNA/genetics, Genome, Nucleic Acid Repetitive Sequences, DNA Sequence Analysis/methods, Software
10.
Genomics ; 108(3-4): 134-142, 2016 10.
Article in English | MEDLINE | ID: mdl-27538895

ABSTRACT

Numerical encoding plays an important role in DNA sequence analysis via computational methods, in which numerical values are associated with corresponding symbolic characters. After numerical representation, digital signal processing methods can be exploited to analyze DNA sequences. To reflect the biological properties of the original sequence, it is vital that the representation is one-to-one. Chaos Game Representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a respective position on the plane, which allows the DNA sequence to be depicted as an image. Using CGR, a biological sequence can be transformed one-to-one to a numerical sequence that preserves the main features of the original sequence. In this research, we propose to encode DNA sequences by considering 2D CGR coordinates as complex numbers, and apply digital signal processing methods to analyze their evolutionary relationship. Computational experiments indicate that this approach gives comparable results to the state-of-the-art multiple sequence alignment method, Clustal Omega, and is significantly faster. The MATLAB code for our method can be accessed from: www.mathworks.com/matlabcentral/fileexchange/57152.


Subject(s)
DNA Sequence Analysis/methods, Software, Algorithms, Sequence Alignment/methods
11.
Mol Phylogenet Evol ; 99: 53-62, 2016 06.
Article in English | MEDLINE | ID: mdl-26988414

ABSTRACT

Due to vast sequence divergence among different viral groups, sequence alignment is not directly applicable to genome-wide comparative analysis of viruses. More and more attention has been paid to alignment-free methods for whole genome comparison and phylogenetic tree reconstruction. Among alignment-free methods, the recently proposed "Natural Vector (NV) representation" has successfully been used to study the phylogeny of multi-segmented viruses based on a 12-dimensional genome space derived from the nucleotide sequence structure. But the preference of proteomes over genomes for the determination of viral phylogeny was not deeply investigated. As the translated products of genes, proteins directly form the shape of viral structure and are vital for all metabolic pathways. In this study, using the NV representation of a protein sequence along with the Hausdorff distance suitable to compare point sets, we construct a 60-dimensional protein space to analyze the evolutionary relationships of 4021 viruses by whole-proteomes in the current NCBI Reference Sequence Database (RefSeq). We also take advantage of the previously developed natural graphical representation to recover viral phylogeny. Our results demonstrate that the proposed method is efficient and accurate for classifying viruses. The accuracy rates of our predictions such as for Baltimore II viruses are as high as 95.9% for family labels, 95.7% for subfamily labels and 96.5% for genus labels. Finally, we discover that proteomes lead to better viral classification when reliable protein sequences are abundant. In other cases, the accuracy rates using proteomes are still comparable to that of genomes.


Subject(s)
Viral Proteins/chemistry, Viruses/classification, Amino Acid Sequence, Protein Databases, Viral Genome, Phylogeny, Proteome/chemistry, Proteome/genetics, Viruses/genetics
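
The 60-dimensional protein space mentioned in the abstract above is consistent with three numbers per amino acid (20 x 3 = 60). The sketch below assumes the commonly cited form of the natural vector: for each amino acid, its count, its mean position, and its normalized second central moment of position. The abstract does not spell out these components, so the layout, the 1-based positions, and the function name are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def natural_vector(protein):
    """Return [counts..., mean positions..., second moments...] (length 60)."""
    protein = protein.upper()
    n = len(protein)
    counts, means, moments = [], [], []
    for aa in AMINO_ACIDS:
        # 1-based positions at which this amino acid occurs.
        positions = [i for i, ch in enumerate(protein, start=1) if ch == aa]
        n_k = len(positions)
        if n_k == 0:
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu = sum(positions) / n_k
        # Second central moment, normalized by n_k * n (assumed convention).
        d2 = sum((p - mu) ** 2 for p in positions) / (n_k * n)
        counts.append(n_k)
        means.append(mu)
        moments.append(d2)
    return counts + means + moments


vec = natural_vector("ACA")
print(len(vec), vec[0], vec[20], vec[40])
```

Each protein becomes one point in this space, so a whole proteome is a point set, which is where a set-to-set metric such as the Hausdorff distance named in the abstract would come in.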
12.
J Math Biol ; 73(5): 1053-1079, 2016 11.
Article in English | MEDLINE | ID: mdl-26942584

ABSTRACT

Periodic elements play important roles in genomic structures and functions, yet some complex periodic elements in genomes are difficult to detect by conventional methods such as digital signal processing and statistical analysis. We propose a periodic power spectrum (PPS) method for analyzing periodicities of DNA sequences. The PPS method employs periodic nucleotide distributions of DNA sequences and directly calculates power spectra at specific periodicities. The magnitude of a PPS reflects the strength of a signal on periodic positions. In comparison with Fourier transform, the PPS method avoids spectral leakage, and reduces background noise that appears high in Fourier power spectrum. Thus, the PPS method can effectively capture hidden periodicities in DNA sequences. Using a sliding window approach, the PPS method can precisely locate periodic regions in DNA sequences. We apply the PPS method for detection of hidden periodicities in different genome elements, including exons, microsatellite DNA sequences, and whole genomes. The results show that the PPS method can minimize the impact of spectral leakage and thus capture true hidden periodicities in genomes. In addition, performance tests indicate that the PPS method is more effective and efficient than a fast Fourier transform. The computational complexity of the PPS algorithm is [Formula: see text]. Therefore, the PPS method may have a broad range of applications in genomic analysis. The MATLAB programs for implementing the PPS method are available from MATLAB Central ( http://www.mathworks.com/matlabcentral/fileexchange/55298 ).


Subject(s)
Algorithms, Genomics/methods, DNA Sequence Analysis/methods, Base Sequence, Genome/genetics, Nucleotides/analysis
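
One way to read "periodic nucleotide distributions" in the abstract above is to count, for each nucleotide, how often it falls on each position modulo the period, and then take the power of the first Fourier coefficient of those counts. The sketch below follows that reading; the normalization and the exact formula used in the paper are not given in the abstract, so this is an assumption and not the published PPS definition.

```python
import cmath


def periodic_power(seq, period):
    """Power at a given period, computed from position-modulo-period counts."""
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        # counts[q] = occurrences of `base` at positions j with j % period == q
        counts = [0] * period
        for j, ch in enumerate(seq):
            if ch == base:
                counts[j % period] += 1
        # First Fourier coefficient of the count vector.
        amplitude = sum(
            c * cmath.exp(-2j * cmath.pi * q / period)
            for q, c in enumerate(counts)
        )
        total += abs(amplitude) ** 2
    return total


repeat = "ACG" * 10  # a perfect period-3 repeat
print(periodic_power(repeat, 3), periodic_power(repeat, 2))
```

For the perfect period-3 repeat the power at period 3 is large while the power at period 2 is essentially zero, because each nucleotide is spread evenly over the two residues modulo 2.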
13.
J Theor Biol ; 382: 99-110, 2015 Oct 07.
Article in English | MEDLINE | ID: mdl-26151589

ABSTRACT

DNA sequence similarity comparison is one of the major steps in computational phylogenetic studies. The sequence comparison of closely related DNA sequences and genomes is usually performed by multiple sequence alignments (MSA). While the MSA method is accurate for some types of sequences, it may produce incorrect results when DNA sequences have undergone rearrangements, as in many bacterial and viral genomes. It is also limited by its computational complexity for comparing large volumes of data. Previously, we proposed an alignment-free method that exploits the full information contents of DNA sequences by Discrete Fourier Transform (DFT), but still with some limitations. Here, we present a significantly improved method for the similarity comparison of DNA sequences by DFT. In this method, we map DNA sequences into 2-dimensional (2D) numerical sequences and then apply DFT to transform the 2D numerical sequences into the frequency domain. In the 2D mapping, the nucleotide composition of a DNA sequence is a determining factor, and the 2D mapping reduces the nucleotide composition bias in the distance measure, thus improving the similarity measure of DNA sequences. To compare the DFT power spectra of DNA sequences with different lengths, we propose an improved even scaling algorithm to extend shorter DFT power spectra to the longest length of the underlying sequences. After the DFT power spectra are evenly scaled, the spectra have the same dimensionality in the Fourier frequency space, and the Euclidean distances of the full Fourier power spectra of the DNA sequences are used as the dissimilarity metric. The improved DFT method, with increased computational performance by 2D numerical representation, is applicable to DNA sequences of any length range. We assess the accuracy of the improved DFT similarity measure in hierarchical clustering of different DNA sequences including simulated and real datasets. The method yields accurate and reliable phylogenetic trees and demonstrates that the improved DFT dissimilarity measure is an efficient and effective similarity measure of DNA sequences. Due to its high efficiency and accuracy, the proposed DFT similarity measure is successfully applied to phylogenetic analysis of individual genes and large whole bacterial genomes.


Subject(s)
Fourier Analysis, Genome, Genetic Models, Phylogeny, Algorithms, Animals, Base Sequence, Cluster Analysis, Computer Simulation, Mitochondrial Genome, Humans, Mammals/genetics, Mutation/genetics, NADH Dehydrogenase/genetics, Nucleotides/genetics
14.
J Theor Biol ; 372: 135-45, 2015 May 07.
Article in English | MEDLINE | ID: mdl-25747773

ABSTRACT

A novel clustering method is proposed to classify genes and genomes. For a given DNA sequence, a binary indicator sequence of each nucleotide is constructed, and Discrete Fourier Transform is applied on these four sequences to attain respective power spectra. Mathematical moments are built from these spectra, and multidimensional vectors of real numbers are constructed from these moments. Cluster analysis is then performed in order to determine the evolutionary relationship between DNA sequences. The novelty of this method is that sequences with different lengths can be compared easily via the use of power spectra and moments. Experimental results on various datasets show that the proposed method provides an efficient tool to classify genes and genomes. It not only gives comparable results but also is remarkably faster than other multiple sequence alignment and alignment-free methods.


Subject(s)
DNA/genetics, DNA Sequence Analysis/methods, Algorithms, Animals, Bacteria/genetics, Cluster Analysis, Computational Biology, Coronavirus/genetics, Mitochondrial DNA/genetics, Molecular Evolution, Fourier Analysis, Genome, Bacterial Genome, Humans, Computer-Assisted Image Processing, Influenza A Virus/genetics, Genetic Models, Phylogeny, Rhinovirus/genetics, Sequence Alignment/methods
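
The first two steps of the abstract above, building one binary indicator sequence per nucleotide and taking the DFT power spectrum of each, can be sketched as follows. The DFT is written out directly for clarity (quadratic time) where a real pipeline would use an FFT; the function names are hypothetical, and the moment construction is left out because the abstract does not give its formula.

```python
import cmath


def indicator_sequences(seq):
    """Map a DNA string to four 0/1 sequences, one per nucleotide."""
    seq = seq.upper()
    return {base: [1 if ch == base else 0 for ch in seq] for base in "ACGT"}


def power_spectrum(signal):
    """Squared magnitude of each DFT coefficient (naive O(n^2) DFT)."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


indicators = indicator_sequences("AACA")
print(indicators["A"])                  # [1, 1, 0, 1]
print(power_spectrum(indicators["A"]))  # first entry is the squared count of A
```

The zero-frequency term equals the squared count of the nucleotide, so composition information is carried in the spectrum alongside positional structure.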
15.
J Theor Biol ; 359: 18-28, 2014 Oct 21.
Article in English | MEDLINE | ID: mdl-24911780

ABSTRACT

Multiple sequence alignment (MSA) is a prominent method for classification of DNA sequences, yet it is hampered by inherent limitations in computational complexity. Alignment-free methods have been developed over the past decade for more efficient comparison and classification of DNA sequences than MSA. However, most alignment-free methods may lose structural and functional information of DNA sequences because they are based on feature extractions. Therefore, they may not fully reflect the actual differences among DNA sequences. Alignment-free methods with information conservation are needed for more accurate comparison and classification of DNA sequences. We propose a new alignment-free similarity measure of DNA sequences using the Discrete Fourier Transform (DFT). In this method, we map DNA sequences into four binary indicator sequences and apply DFT to the indicator sequences to transform them into the frequency domain. The Euclidean distance of the full DFT power spectra of the DNA sequences is used as the similarity distance metric. To compare the DFT power spectra of DNA sequences with different lengths, we propose an even scaling method to extend shorter DFT power spectra to equal the longest length of the sequences compared. After the DFT power spectra are evenly scaled, the DNA sequences are compared in the same DFT frequency space dimensionality. We assess the accuracy of the similarity metric in hierarchical clustering using simulated DNA and virus sequences. The results demonstrate that the DFT-based method is an effective and accurate measure of DNA sequence similarity.


Subject(s)
Cluster Analysis, DNA/analysis, Fourier Analysis, Sequence Alignment/methods, Algorithms, Base Sequence, Computational Biology, Humans, Molecular Sequence Data, Myeloid Cell Leukemia Sequence 1 Protein/analysis, Myeloid Cell Leukemia Sequence 1 Protein/genetics, Phylogeny, DNA Sequence Analysis, Nucleic Acid Sequence Homology
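
The comparison step in the abstract above needs power spectra of equal length before a Euclidean distance can be taken. The sketch below uses plain linear interpolation to stretch the shorter spectrum; this is a stand-in chosen for illustration, since the abstract does not give the exact even scaling rule, and both function names are hypothetical.

```python
import math


def stretch(spectrum, m):
    """Linearly interpolate a spectrum of length n onto m points (m >= n)."""
    n = len(spectrum)
    if n == m:
        return list(spectrum)
    if n == 1:
        return [spectrum[0]] * m
    out = []
    for k in range(m):
        pos = k * (n - 1) / (m - 1)  # fractional index into the original
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(spectrum[lo] * (1 - frac) + spectrum[hi] * frac)
    return out


def spectral_distance(spec_a, spec_b):
    """Euclidean distance after stretching both spectra to the longer length."""
    m = max(len(spec_a), len(spec_b))
    a, b = stretch(spec_a, m), stretch(spec_b, m)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(stretch([0.0, 2.0], 3))                             # [0.0, 1.0, 2.0]
print(spectral_distance([1.0, 1.0], [1.0, 1.0, 4.0]))     # 3.0
```

The pairwise distances produced this way can be collected into a matrix and passed to any hierarchical clustering routine.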
16.
J Comput Biol ; 29(9): 1001-1021, 2022 09.
Article in English | MEDLINE | ID: mdl-35593919

ABSTRACT

The comparison of DNA sequences is of great significance in genomics analysis. Although the traditional multiple sequence alignment (MSA) method is popularly used for evolutionary analysis, optimally aligning k sequences becomes computationally intractable when k increases due to the intrinsic computational complexity of MSA. Despite numerous k-mer alignment-free methods being proposed, the existing k-mer alignment-free methods may not truly capture the contextual structures of the sequences. In this study, we present a novel k-mer contextual alignment-free method (called kmer2vec), in which the sequence k-mers are semantically embedded as word2vec vectors, an essential technique in natural language processing. Consequently, the method converts each DNA/RNA sequence into a point in the word2vec high-dimensional space and compares DNA sequences in that space. Because the word2vec vectors are trained from the contextual relationship of k-mers in the genomes, the method may extract valuable structural information from the sequences and reflect the relationship among them properly. The proposed method is optimized on the parameters from word2vec training and verified in the phylogenetic analysis of large whole genomes, including coronavirus and bacterial genomes. The results demonstrate the effectiveness of the method on phylogenetic tree construction and species clustering. The method runs much faster than the MSA method, and the phylogenetic relationships constructed by the kmer2vec method are more accurate than those from the conventional k-mer alignment-free method. Therefore, this approach can provide new perspectives for phylogeny and evolution and make it possible to analyze large genomes. In addition, we discuss special parameterization in the k-mer word2vec embedding construction. An effective tool for rapid SARS-CoV-2 typing can also be derived when combining kmer2vec with clustering methods.


Subject(s)
Algorithms, COVID-19, Base Sequence, Humans, Phylogeny, SARS-CoV-2/genetics, DNA Sequence Analysis/methods
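
The data preparation implied by the abstract above can be sketched as follows: a sequence is cut into overlapping k-mers, which play the role of words, and a sequence-level point is then formed from the k-mer vectors. Averaging the vectors is an assumption here, since the abstract does not say how a sequence is reduced to one point. The toy embedding is hand-written for illustration; a real one would come from a trained word2vec model, which is not shown.

```python
def kmer_sentence(seq, k):
    """Split a sequence into overlapping k-mers (the 'words')."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sequence_point(tokens, embedding):
    """Average the vectors of the known k-mers (assumed aggregation rule)."""
    dim = len(next(iter(embedding.values())))
    vectors = [embedding[t] for t in tokens if t in embedding]
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


print(kmer_sentence("ATGCGA", 3))  # ['ATG', 'TGC', 'GCG', 'CGA']

# Hypothetical 2-dimensional embedding, standing in for trained vectors.
toy_embedding = {"ATG": [1.0, 0.0], "TGC": [0.0, 1.0]}
print(sequence_point(kmer_sentence("ATGC", 3), toy_embedding))  # [0.5, 0.5]
```

Once every genome is a point, any standard distance or clustering method can be applied to the points.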
17.
ACS Infect Dis ; 8(3): 546-556, 2022 03 11.
Article in English | MEDLINE | ID: mdl-35133792

ABSTRACT

The surge of COVID-19 infections has been fueled by new SARS-CoV-2 variants, namely Alpha, Beta, Gamma, Delta, and so forth. The molecular mechanism underlying such surge is elusive due to the existence of 28 554 unique mutations, including 4 653 non-degenerate mutations on the spike protein. Understanding the molecular mechanism of SARS-CoV-2 transmission and evolution is a prerequisite to foresee the trend of emerging vaccine-breakthrough variants and the design of mutation-proof vaccines and monoclonal antibodies. We integrate the genotyping of 1 489 884 SARS-CoV-2 genomes, a library of 130 human antibodies, tens of thousands of mutational data, topological data analysis, and deep learning to reveal SARS-CoV-2 evolution mechanism and forecast emerging vaccine-breakthrough variants. We show that prevailing variants can be quantitatively explained by infectivity-strengthening and vaccine-escape (co-)mutations on the spike protein RBD due to natural selection and/or vaccination-induced evolutionary pressure. We illustrate that infectivity strengthening mutations were the main mechanism for viral evolution, while vaccine-escape mutations become a dominating viral evolutionary mechanism among highly vaccinated populations. We demonstrate that Lambda is as infectious as Delta but is more vaccine-resistant. We analyze emerging vaccine-breakthrough comutations in highly vaccinated countries, including the United Kingdom, the United States, Denmark, and so forth. Finally, we identify sets of comutations that have a high likelihood of massive growth: [A411S, L452R, T478K], [L452R, T478K, N501Y], [V401L, L452R, T478K], [K417N, L452R, T478K], [L452R, T478K, E484K, N501Y], and [P384L, K417N, E484K, N501Y]. We predict they can escape existing vaccines. We foresee an urgent need to develop new virus combating strategies.

18.
J Comput Biol ; 28(3): 269-282, 2021 03.
Article in English | MEDLINE | ID: mdl-33290131

ABSTRACT

Directly computing Fourier power spectra at fractional periods of real sequences can be beneficial in many digital signal processing applications. In this article, we present a fast algorithm to compute the fractional Fourier power spectra of real sequences. For a real sequence of length m = nl, we may deduce its congruence derivative sequence with a length of l. The discrete Fourier transform of the original sequence can be calculated from the discrete Fourier transform of the congruence derivative sequence. The relation between the discrete Fourier transforms of the two sequences yields the special features of the Fourier power spectra at the integer and fractional periods of a real sequence. It has been proved mathematically that after calculating the Fourier power spectrum (FPS) at an integer period, the Fourier power spectra of the fractional periods related to this integer period can be easily represented by the computational result of the FPS at the integer period for the sequence. Computational experiments using simulated sinusoidal data and a protein sequence show that the computed results are a kind of Fourier power spectra corresponding to new frequencies that cannot be obtained from the traditional discrete Fourier transform. Therefore, the algorithm would be a new realization method for the discrete Fourier transform of a real sequence.


Subject(s)
Computational Biology/methods, Algorithms, Fourier Analysis, Humans, Mathematics/methods, DNA Sequence Analysis/methods
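
The relation between the two transforms mentioned in the abstract above can be written out for the integer-period case. The notation below is ours, and it assumes the congruence derivative sequence sums the entries of the original sequence at positions that are congruent modulo l; the abstract does not spell out that definition, and the fractional-period result is not derived here.

```latex
% Assumed notation: x_0, ..., x_{m-1} real, m = n l, and
% y_q = sum of the x_j with j congruent to q modulo l.
\begin{align*}
X(k) &= \sum_{j=0}^{m-1} x_j \, e^{-2\pi i\, jk/m},
\qquad
y_q = \sum_{r=0}^{n-1} x_{q+rl}, \quad q = 0, \dots, l-1, \\
X(n) &= \sum_{q=0}^{l-1} \sum_{r=0}^{n-1} x_{q+rl}\,
        e^{-2\pi i\,(q+rl)\,n/(nl)}
      = \sum_{q=0}^{l-1} y_q \, e^{-2\pi i\, q/l}, \\
\lvert X(n) \rvert^2 &= \Bigl\lvert \sum_{q=0}^{l-1} y_q \,
        e^{-2\pi i\, q/l} \Bigr\rvert^2 .
\end{align*}
```

The middle step uses the fact that the factor contributed by r is e^{-2\pi i r} = 1, so the power at period l of the length-m sequence needs only the l values of the shorter sequence.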
19.
Comput Biol Med ; 131: 104264, 2021 04.
Article in English | MEDLINE | ID: mdl-33647832

ABSTRACT

Coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has a worldwide devastating effect. Understanding the evolution and transmission of SARS-CoV-2 is of paramount importance for controlling, combating and preventing COVID-19. Due to the rapid growth in both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emergent large-data challenge. We introduce a dimension-reduced K-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). By using four benchmark datasets, we found that UMAP is the best-suited technique due to its stable, reliable, and efficient performance, its ability to improve clustering accuracy, especially for large Jaccard distanced-based datasets, and its superior clustering visualization. The UMAP-assisted K-means clustering enables us to shed light on increasingly large datasets from SARS-CoV-2 genome isolates.


Subject(s)
Algorithms, COVID-19/genetics, Nucleic Acid Databases, Viral Genome, Mutation, Phylogeny, SARS-CoV-2/genetics, Humans
20.
ArXiv ; 2021 Sep 09.
Article in English | MEDLINE | ID: mdl-34518803

ABSTRACT

The recent global surge in COVID-19 infections has been fueled by new SARS-CoV-2 variants, namely Alpha, Beta, Gamma, Delta, etc. The molecular mechanism underlying such surge is elusive due to 4,653 non-degenerate mutations on the spike protein, which is the target of most COVID-19 vaccines. The understanding of the molecular mechanism of transmission and evolution is a prerequisite to foresee the trend of emerging vaccine-breakthrough variants and the design of mutation-proof vaccines and monoclonal antibodies. We integrate the genotyping of 1,489,884 SARS-CoV-2 genome isolates, 130 human antibodies, tens of thousands of mutational data points, topological data analysis, and deep learning to reveal the SARS-CoV-2 evolution mechanism and forecast emerging vaccine-escape variants. We show that infectivity-strengthening and antibody-disruptive co-mutations on the S protein RBD can quantitatively explain the infectivity and virulence of all prevailing variants. We demonstrate that Lambda is as infectious as Delta but is more vaccine-resistant. We analyze emerging vaccine-breakthrough co-mutations in 20 countries, including the United Kingdom, the United States, Denmark, Brazil, and Germany. We envision that natural selection through infectivity will continue to be the main mechanism for viral evolution among unvaccinated populations, while antibody-disruptive co-mutations will fuel the future growth of vaccine-breakthrough variants among fully vaccinated populations. Finally, we have identified the co-mutations that have a high likelihood of becoming dominant: [A411S, L452R, T478K], [L452R, T478K, N501Y], [V401L, L452R, T478K], [K417N, L452R, T478K], [L452R, T478K, E484K, N501Y], and [P384L, K417N, E484K, N501Y]. We predict they, particularly the last four, will break through existing vaccines. We foresee an urgent need to develop new vaccines that target these co-mutations.
