ABSTRACT
The emergence of third-generation single-molecule sequencing (TGS) technology has revolutionized the generation of long reads, which are essential for genome assembly and have been widely employed in sequencing the SARS-CoV-2 virus during the COVID-19 pandemic. Although long-read sequencing has been crucial in understanding the evolution and transmission of the virus, the high error rate associated with these reads can lead to inadequate genome assembly and downstream biological interpretation. In this study, we evaluate the accuracy and robustness of machine learning (ML) models using six different embedding techniques on SARS-CoV-2 error-incorporated genome sequences. Our analysis includes two types of error-incorporated genome sequences: those generated using simulation tools to emulate error profiles of long-read sequencing platforms and those generated by introducing random errors. We show that the spaced k-mers embedding method achieves high accuracy in classifying error-free SARS-CoV-2 genome sequences, and the spaced k-mers and weighted k-mers embedding methods are highly accurate in predicting error-incorporated sequences. The fixed-length vectors generated by these methods contribute to the high accuracy achieved. Our study provides valuable insights for researchers to effectively evaluate ML models and gain a better understanding of the approach for accurate identification of critical SARS-CoV-2 genome sequences.
Subjects
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , DNA Sequence Analysis/methods , Pandemics , High-Throughput Nucleotide Sequencing/methods , Algorithms , Machine Learning
ABSTRACT
The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome: millions of sequences and counting. This amount of data, while being orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses, is nonetheless a rich resource for machine learning (ML) approaches as alternatives for extracting such important information from these data. It is hence of utmost importance to design a framework for testing and benchmarking the robustness of these ML models. This paper makes the first effort (to our knowledge) to benchmark the robustness of ML models by simulating biological sequences with errors. In this paper, we introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show from experiments on a wide array of ML models that some simulation-based approaches with different perturbation budgets are more robust (and accurate) than others for specific embedding methods to certain noise simulations on the input sequences. Our benchmarking framework may assist researchers in properly assessing different ML models and help them understand the behavior of the SARS-CoV-2 virus or avoid possible future pandemics.
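The idea of perturbing a sequence under a fixed budget can be illustrated with a minimal sketch. It assumes uniform random substitutions only, which is a simplification and not the platform-specific Illumina or PacBio error profiles the abstract refers to; the function names `perturb_sequence` and `hamming` and the `budget` parameter are hypothetical and chosen for illustration.

```python
import random

NUCLEOTIDES = "ACGT"


def perturb_sequence(seq, budget, rng):
    """Substitute a fraction `budget` of positions with a different nucleotide.

    Assumption: errors are uniform random substitutions (no insertions,
    deletions, or platform-specific error profile).
    """
    n_errors = int(len(seq) * budget)
    # Sample distinct positions so that exactly n_errors sites change.
    positions = rng.sample(range(len(seq)), n_errors)
    bases = list(seq)
    for pos in positions:
        alternatives = [b for b in NUCLEOTIDES if b != bases[pos]]
        bases[pos] = rng.choice(alternatives)
    return "".join(bases)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)          # fixed seed for reproducibility
original = "ACGT" * 25          # toy 100-nt sequence, not real viral data
noisy = perturb_sequence(original, budget=0.1, rng=rng)
print(hamming(original, noisy))
```

A robustness benchmark in this spirit would train a classifier on embeddings of the original sequences and compare its accuracy on embeddings of the perturbed ones across several budgets.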
Subjects
Computer Simulation , Viral Genome , Machine Learning , Research Design , SARS-CoV-2 , Machine Learning/standards , SARS-CoV-2/classification , SARS-CoV-2/genetics , Viral Genome/genetics , Viral Proteins/genetics , COVID-19/virology , RNA Sequence Analysis
ABSTRACT
With the properties of aggressive cancer and heterogeneous tumor biology, triple-negative breast cancer (TNBC) is a type of breast cancer known for its poor clinical outcome. The lack of estrogen, progesterone, and human epidermal growth factor receptors in TNBC tumors leads to fewer treatment options in the clinic. The incidence of TNBC is higher, and clinical outcomes are worse, in African American (AA) women than in European American (EA) women. The significant factors responsible for the racial disparity in TNBC are socioeconomic lifestyle and tumor biology. The current study used open-source gene expression data from triple-negative breast cancer samples annotated with racial information. We applied a state-of-the-art Support Vector Machine (SVM) classifier with a recursive feature elimination approach to the gene expression data to identify significant biomarkers deregulated in AA women and EA women. We also included Spearman's rho and Ward's linkage method in our feature selection workflow. Our proposed method generates 24 features/genes that can classify the AA and EA samples with 98% accuracy. We also performed Kaplan-Meier analysis and the log-rank test on the 24 features/genes. Of the 24 genes, we discuss only 2, KLK10 and LRRC37A2, for which deregulated expression correlates with cancer progression and a poor survival rate. We believe that further improvement of our method with a larger amount of RNA-seq gene expression data will provide more accurate insight into racial disparity in TNBC.
Subjects
Health Status Disparities , Triple Negative Breast Neoplasms , Female , Humans , Tumor Biomarkers/genetics , Black or African American/genetics , Support Vector Machine , Triple Negative Breast Neoplasms/ethnology , Triple Negative Breast Neoplasms/pathology , White People/genetics
ABSTRACT
Machine learning (ML) models, such as SVM, for tasks like classification and clustering of sequences, require a definition of distance/similarity between pairs of sequences. Several methods have been proposed to compute the similarity between sequences, such as the exact approach that counts the number of matches between k-mers (sub-sequences of length k) and an approximate approach that estimates pairwise similarity scores. Although exact methods yield better classification performance, they pose high computational costs, limiting their applicability to a small number of sequences. The approximate algorithms are proven to be more scalable and perform comparably to (sometimes better than) the exact methods; they are designed in a "general" way to deal with different types of sequences (e.g., music, protein, etc.). Although general applicability is a desired property of an algorithm, it is not the case in all scenarios. For example, in the current COVID-19 (coronavirus) pandemic, there is a need for an approach that can deal specifically with the coronavirus. To this end, we propose a series of ways to improve the performance of the approximate kernel (using minimizers and information gain) in order to enhance its predictive performance on coronavirus sequences. More specifically, we improve the quality of the approximate kernel using domain knowledge (computed using information gain) and efficient preprocessing (using minimizers computation) to classify coronavirus spike protein sequences corresponding to different variants (e.g., Alpha, Beta, Gamma). We report results using different classification and clustering algorithms and evaluate their performance using multiple evaluation metrics. Using two datasets, we show that our proposed method helps improve the kernel's performance compared to the baseline and state-of-the-art approaches in the healthcare domain.
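The k-mer and minimizer notions mentioned above can be sketched as follows. This is a minimal illustration, not the authors' approximate kernel: it computes an exact count-based similarity over minimizers, and it assumes a minimizer is the lexicographically smallest m-mer within a k-mer. The function names and the toy protein fragment are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping sub-sequences of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizer(kmer, m):
    """Assumed definition: lexicographically smallest m-mer inside the k-mer."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def minimizer_spectrum(seq, k, m):
    """Count how often each minimizer occurs over all k-mers of the sequence."""
    return Counter(minimizer(km, m) for km in kmers(seq, k))


def kernel(a, b, k, m):
    """Exact similarity: dot product of the two minimizer count vectors."""
    sa, sb = minimizer_spectrum(a, k, m), minimizer_spectrum(b, k, m)
    return sum(count * sb[key] for key, count in sa.items())


seq = "MFVFLV"  # toy amino-acid fragment, for illustration only
print(minimizer_spectrum(seq, k=5, m=3))
print(kernel(seq, seq, k=5, m=3))
```

Replacing each k-mer by its minimizer shrinks the feature space, which is one reason such preprocessing can make kernel computation cheaper.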
ABSTRACT
Peripheral neurons comprise a critical component of the tumor microenvironment (TME). The role of the autonomic innervation in cancer has been firmly established. However, the effect of the afferent (sensory) neurons on tumor progression remains unclear. Utilizing surgical and chemical skin sensory denervation methods, we showed that afferent neurons supported the growth of melanoma tumors in vivo and demonstrated that sensory innervation limited the activation of effective antitumor immune responses. Specifically, sensory ablation led to improved leukocyte recruitment into tumors, with decreased presence of lymphoid and myeloid immunosuppressive cells and increased activation of T-effector cells within the TME. Cutaneous sensory nerves hindered the maturation of intratumoral high endothelial venules and limited the formation of mature tertiary lymphoid-like structures containing organized clusters of CD4+ T cells and B cells. Denervation further increased T-cell clonality and expanded the B-cell repertoire in the TME. Importantly, CD8a depletion prevented denervation-dependent antitumor effects. Finally, we observed that gene signatures of inflammation and the content of neuron-associated transcripts inversely correlated in human primary cutaneous melanomas, with the latter representing a negative prognostic marker of patient overall survival. Our results suggest that tumor-associated sensory neurons negatively regulate the development of protective antitumor immune responses within the TME, thereby defining a novel target for therapeutic intervention in the melanoma setting.
Subjects
Melanoma , Skin Neoplasms , Tertiary Lymphoid Structures , Humans , Immunity , Tumor Microenvironment
ABSTRACT
The availability of millions of SARS-CoV-2 (Severe Acute Respiratory Syndrome-Coronavirus-2) sequences in public databases such as GISAID (Global Initiative on Sharing All Influenza Data) and EMBL-EBI (European Molecular Biology Laboratory-European Bioinformatics Institute) in the United Kingdom allows a detailed study of the evolution, genomic diversity, and dynamics of a virus like never before. Here, we identify novel variants and subtypes of SARS-CoV-2 by clustering sequences, adapting methods originally designed for haplotyping intrahost viral populations. We assess our results using clustering entropy, the first time it has been used in this context. Our clustering approach reaches lower entropies compared with other methods, and we are able to boost this even further through gap filling and Monte Carlo-based entropy minimization. Moreover, our method clearly identifies the well-known Alpha variant in the U.K. and GISAID data sets, and is also able to detect the much less represented (<1% of the sequences) Beta (South Africa), Epsilon (California), and Gamma and Zeta (Brazil) variants in the GISAID data set. Finally, we show that each variant identified has high selective fitness, based on the growth rate of its cluster over time. This demonstrates that our clustering approach is a viable alternative for detecting even rare subtypes in very large data sets.
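A clustering entropy of the kind used above to assess results can be sketched as follows. The exact definition used by the authors is not given in the abstract, so this sketch assumes one plausible form: for each cluster, sum the Shannon entropy of the nucleotide distribution at every aligned position, then average over clusters weighted by cluster size. All function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the symbols observed at one aligned position."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cluster_entropy(seqs):
    """Sum of per-position entropies for equal-length sequences in a cluster."""
    return sum(column_entropy(col) for col in zip(*seqs))


def clustering_entropy(clusters):
    """Assumed aggregate: size-weighted mean of the cluster entropies."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)


# Toy aligned sequences: two variants differing at the third position.
separated = [["ACGT", "ACGT"], ["ACTT", "ACTT"]]
merged = [["ACGT", "ACTT", "ACGT", "ACTT"]]
print(clustering_entropy(separated), clustering_entropy(merged))
```

Under this assumed definition, a clustering that separates the two toy variants scores lower than one that merges them, which matches the abstract's use of lower entropy as the better outcome.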
Subjects
Cluster Analysis , Computational Biology/methods , Brazil , Genetic Databases , Entropy , Humans , Monte Carlo Method , South Africa , United Kingdom , United States
ABSTRACT
In this article, we present our novel pipeline for analysis of metabolic activity using a microbial community's metatranscriptome sequence data set for validation. Our method is based on the expectation-maximization (EM) algorithm and provides enzyme expression and pathway activity levels. Further expanding our analysis, we consider individual enzymatic activity and compute enzyme participation coefficients to approximate the metabolic pathway activity more accurately. We apply our EM pathways pipeline to a metatranscriptomic data set of a plankton community from surface waters of the Northern Gulf of Mexico. The data set consists of RNA-seq data and respective environmental parameters, which were sampled at two depths, six times a day over multiple 24-hour cycles. Furthermore, we discuss microbial dependence on the day-night cycle within our findings based on a three-way correlation of the enzyme expression during antipodal times (midnight and noon). We show that the enzyme participation levels strongly affect the metabolic activity estimates: that is, marginal and multiple linear regression of enzymatic and metabolic pathway activity correlated significantly with the recorded environmental parameters. Our analysis statistically validates that EM-based methods produce meaningful results, as our method confirms statistically significant dependence of metabolic pathway activity on the environmental parameters, such as salinity, temperature, brightness, and a few others.
Subjects
Bacteria/genetics , Gene Expression Profiling/methods , Metabolic Networks and Pathways , Plankton/microbiology , Algorithms , Gulf of Mexico , Linear Models , Metagenomics , RNA Sequence Analysis
ABSTRACT
Epidermal growth factor receptor (EGFR) and human epidermal growth factor receptor 3 (HER3) have been investigated as triple-negative breast cancer (TNBC) biomarkers. Reduced EGFR levels can be compensated by increases in HER3; thus, assaying EGFR and HER3 together may improve prognostic value. In a multi-institutional cohort of 510 TNBC patients, we analyzed the impact of HER3, EGFR, or combined HER3-EGFR protein expression in pre-treatment samples on breast cancer-specific and distant metastasis-free survival (BCSS and DMFS, respectively). A subset of 60 TNBC samples were RNA-sequenced using massive parallel sequencing. The combined HER3-EGFR score outperformed individual HER3 and EGFR scores, with high HER3-EGFR score independently predicting worse BCSS (Hazard Ratio [HR] = 2.30, p = 0.006) and DMFS (HR = 1.78, p = 0.041, respectively). TNBCs with high HER3-EGFR scores exhibited significantly suppressed ATM signaling and differential expression of a network predicted to be controlled by low TXN activity, resulting in activation of EGFR, PARP1, and caspases and inhibition of p53 and NFκB. Nuclear PARP1 protein levels were higher in HER3-EGFR-high TNBCs based on immunohistochemistry (p = 0.036). Assessing HER3 and EGFR protein expression in combination may identify which adjuvant chemotherapy-treated TNBC patients have a higher risk of treatment resistance and may benefit from a dual HER3-EGFR inhibitor and a PARP1 inhibitor.
Subjects
Neoplastic Gene Expression Regulation , Gene Regulatory Networks , ErbB-3 Receptor/genetics , Triple Negative Breast Neoplasms/genetics , Adult , Antineoplastic Agents/therapeutic use , Tumor Biomarkers/genetics , Tumor Biomarkers/metabolism , Cohort Studies , ErbB Receptors/genetics , ErbB Receptors/metabolism , Female , Gene Expression Profiling , Humans , Human Mammary Glands/drug effects , Human Mammary Glands/metabolism , Human Mammary Glands/pathology , Middle Aged , NF-kappa B/genetics , NF-kappa B/metabolism , Neoplasm Staging , Poly(ADP-Ribose) Polymerase-1/genetics , Poly(ADP-Ribose) Polymerase-1/metabolism , Prognosis , ErbB-3 Receptor/metabolism , Survival Analysis , Thioredoxins/genetics , Thioredoxins/metabolism , Triple Negative Breast Neoplasms/diagnosis , Triple Negative Breast Neoplasms/drug therapy , Triple Negative Breast Neoplasms/mortality , Tumor Suppressor Protein p53/genetics , Tumor Suppressor Protein p53/metabolism
ABSTRACT
With the emerging advances made in genomics and functional genomics approaches, there is a critical and growing unmet need to integrate plural datasets in order to identify driver genes in cancer. An integrative approach, with the convergence of multiple types of genetic evidence, can limit false positives through a posterior filtering strategy and reduce the need for multiple hypothesis testing to identify true cancer vulnerabilities. We performed a pooled shRNA screen against 906 human genes in the oral cancer cell line AW13516 in triplicate. The genes that were depleted in the screen were integrated with copy number alteration and gene expression data and ranked based on ROAST analysis, using an integrative scoring system, DepRanker, to compute a Rank Impact Score (RIS) for each gene. The RIS-based ranking of candidate driver genes was used to identify the putative oncogenes AURKB and TK1 as essential for oral cancer cell proliferation. We validated the findings, showing that shRNA mediated genetic knockdown of TK1 or pharmacological inhibition of AURKB by AZD-1152 HQPA in AW13516 cells could significantly impede their proliferation. Next we analysed alterations in AURKB and TK1 genes in head and neck cancer and their association with prognosis using data on 528 patients obtained from TCGA. Patients harbouring alterations in AURKB and TK1 genes were associated with poor survival. To summarise, we present DepRanker as a simple yet robust package with no third-party dependencies for the identification of potential driver genes from a pooled shRNA functional genomic screen by integrating results from RNAi screens with gene expression and copy number data. Using DepRanker, we identify AURKB and TK1 as potential therapeutic targets in oral cancer. DepRanker is in the public domain and available for download at http://www.actrec.gov.in/pi-webpages/AmitDutt/DepRanker/DepRanker.html.
Subjects
Aurora Kinase B/genetics , Gene Drive Technology/methods , Head and Neck Neoplasms/genetics , Small Interfering RNA/genetics , Thymidine Kinase/genetics , Cell Line , Genomics/methods , Humans , Oncogenes , Software , Survival , Tongue Neoplasms/genetics
ABSTRACT
Triple-negative breast cancer (TNBC) is characterized by the absence of estrogen and progesterone receptors and absence of amplification of human epidermal growth factor receptor (HER2). This disease has no approved treatment with a poor prognosis particularly in African-American (AA) as compared to European-American (EA) patients. Gene ontology analysis showed specific gene pathways that are differentially regulated and gene signatures that are differentially expressed in AA as compared to EA. Such differences might underlie the basis for the aggressive nature and poor prognosis of TNBC in AA patients. In-depth studies of these pathways and differential genetic signature might give significant clues to improve our understanding of tumor biology associated with AA TNBC to advance the prognosis and survival rates. Along with gene ontology analysis, we suggest that post-translational modifications (PTM) could also play a crucial role in the dismal survival rate of AA TNBC patients. Further investigations are necessary to explore this terrain of PTMs to identify the racially disparate burden in TNBC.
Subjects
Health Status Disparities , Progesterone Receptors/metabolism , Triple Negative Breast Neoplasms/ethnology , Black or African American/genetics , Female , Gene Expression Profiling , Humans , Phenotype , Prognosis , ErbB-2 Receptor/metabolism , Triple Negative Breast Neoplasms/genetics , Triple Negative Breast Neoplasms/mortality , Tumor Microenvironment , White People/genetics
ABSTRACT
The rarity of gallbladder cancer in the developed world has contributed to the generally poor understanding of the disease. Our integrated analysis of whole exome sequencing, copy number alterations, immunohistochemical, and phospho-proteome array profiling indicates ERBB2 alterations in 40% of early-stage rare gallbladder tumors, among an ethnically distinct population not studied before, occurring through overexpression in 24% (n = 25) and recurrent mutations in 14% of tumors (n = 44), along with co-occurring KRAS mutation in 7% of tumors (n = 44). We demonstrate that ERBB2 heterodimerizes with EGFR to constitutively activate the ErbB signaling pathway in gallbladder cells. Consistent with this, treatment with ERBB2-specific or EGFR-specific shRNA, or with the covalent EGFR family inhibitor Afatinib, inhibits tumor-associated characteristics of the gallbladder cancer cells. Furthermore, we observe that an in vivo reduction in tumor size of gallbladder xenografts in response to Afatinib is paralleled by a reduction in the amounts of phospho-ERK in tumors harboring the KRAS (G13D) mutation but not the KRAS (G12V) mutation, supporting an essential role of the ErbB pathway. Overall, besides implicating ERBB2 as an important therapeutic target under neo-adjuvant or adjuvant settings, we present the first evidence that the presence of KRAS mutations may preclude gallbladder cancer patients from responding to anti-EGFR treatment, similar to a clinical algorithm commonly practiced to opt for anti-EGFR treatment in colorectal cancer.
Subjects
Antineoplastic Agents/therapeutic use , Gallbladder Neoplasms/genetics , Proto-Oncogene Proteins p21(ras)/genetics , ErbB-2 Receptor/genetics , Adult , Afatinib/pharmacology , Afatinib/therapeutic use , Aged , Animals , Antineoplastic Agents/pharmacology , Tumor Cell Line , DNA Mutational Analysis , ErbB Receptors/antagonists & inhibitors , ErbB Receptors/metabolism , Female , Gallbladder/pathology , Gallbladder Neoplasms/drug therapy , Gallbladder Neoplasms/pathology , Humans , Male , Mice , Inbred NOD Mice , SCID Mice , Middle Aged , Mutation , Neoplasm Staging , Phosphorylation/drug effects , ErbB-2 Receptor/metabolism , Signal Transduction/drug effects , Signal Transduction/genetics , Treatment Outcome , Exome Sequencing , Xenograft Model Antitumor Assays
ABSTRACT
Cancer is predominantly a somatic disease. A mutant allele present in a cancer cell genome is considered somatic when it is absent from both the paired normal genome and public SNP databases. The current build of dbSNP, the most comprehensive public SNP database, however, inadequately represents several non-European Caucasian populations, posing a limitation in cancer genomic analyses of data from these populations. We present the Tata Memorial Centre-SNP Database (TMC-SNPdb) as the first open source, flexible, upgradable, and freely available SNP database (accessible through dbSNP build 149 and ANNOVAR), representing 114 309 unique germline variants generated from whole exome data of 62 normal samples derived from cancer patients of Indian origin. The TMC-SNPdb is presented with a companion subtraction tool that can be executed from the command line or through an easy-to-use graphical user interface, with the ability to deplete additional Indian population-specific SNPs over and above the dbSNP and 1000 Genomes databases. Using an institutionally generated whole exome data set of 132 samples of Indian origin, we demonstrate that TMC-SNPdb could deplete 42, 33 and 28% false positive somatic events post dbSNP depletion in Indian origin tongue, gallbladder, and cervical cancer samples, respectively. Beyond cancer somatic analyses, we anticipate utility of the TMC-SNPdb in several Mendelian germline diseases. In addition to dbSNP build 149 and ANNOVAR, the TMC-SNPdb along with the subtraction tool is available for download in the public domain at the following database URL: http://www.actrec.gov.in/pi-webpages/AmitDutt/TMCSNP/TMCSNPdp.html.
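The subtraction step described above can be sketched as a set difference over variant keys. This is only an illustration of the idea and not the TMC-SNPdb companion tool or its interface: the `(chromosome, position, ref, alt)` key, the function name `subtract_known_variants`, and all coordinates below are hypothetical.

```python
def subtract_known_variants(candidates, *databases):
    """Keep only candidate variants absent from every germline database.

    Each variant is a (chromosome, position, ref, alt) tuple; this key
    format is an assumption made for the sketch.
    """
    known = set().union(*databases)
    return [v for v in candidates if v not in known]


# Hypothetical candidate somatic calls from a tumor/normal comparison.
candidates = [
    ("chr1", 1000, "A", "G"),
    ("chr2", 2000, "C", "T"),
    ("chr7", 3000, "T", "G"),
]
# Hypothetical germline databases: a general one and a population-specific one.
general_db = {("chr1", 1000, "A", "G")}
population_db = {("chr2", 2000, "C", "T")}

after_general = subtract_known_variants(candidates, general_db)
after_both = subtract_known_variants(candidates, general_db, population_db)
print(len(after_general), len(after_both))
```

In this toy case the population-specific database removes one call that the general database alone would have left in place, which is the kind of false positive somatic event the abstract describes depleting.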
Subjects
Asian People/genetics , Nucleic Acid Databases , Human Genome , Germ-Line Mutation , Neoplasms/genetics , Single Nucleotide Polymorphism , Female , Humans , India , Male
ABSTRACT
BACKGROUND: We earlier proposed a genetic model for gallbladder carcinogenesis and its dissemination cascade. However, the association between gallbladder cancer and the 'inflammatory stimulus' that drives the initial cascade in the model remained unclear. A recent study suggested that infection with Salmonella can lead to changes in the host signalling pathways in gallbladder cancer. FINDINGS: We examined the whole exomes of 26 primary gallbladder tumour and paired normal samples for the presence of 143 HPV (Human papilloma virus) types along with 6 common Salmonella serotypes (S. typhi Ty2, S. typhi CT18, S. typhimurium LT2, S. choleraesuis SCB67, S. paratyphi TCC, and S. paratyphi SPB7) using a computational subtraction pipeline based on HPVDetector, which we recently described. Based on our evaluation of these 26 whole exome gallbladder primary tumours and matched normal samples, typhoidal Salmonella species were found in 11 of 26 gallbladder cancer samples and non-typhoidal Salmonella species in 12 of 26, with 6 samples co-infected with both. CONCLUSIONS: We present the first evidence to support the association of non-typhoidal Salmonella species, along with typhoidal strains, with gallbladder cancer. Salmonella infection in the chronic carrier state fits the role of the 'inflammatory stimulus' in the genetic model for gallbladder carcinogenesis and may play a role in gallbladder cancer analogous to that of Helicobacter pylori in gastric cancer.