Results 1 - 20 of 244
1.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38349062

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool to gain biological insights at the cellular level. However, due to technical limitations of the existing sequencing technologies, low gene expression values are often omitted, leading to inaccurate gene counts. Existing methods, including advanced deep learning techniques, struggle to reliably impute gene expressions due to a lack of mechanisms that explicitly consider the underlying biological knowledge of the system. In reality, it has long been recognized that gene-gene interactions may serve as reflective indicators of underlying biological processes, presenting discriminative signatures of the cells. A genomic data analysis framework that is capable of leveraging the underlying gene-gene interactions is thus highly desirable and could allow for more reliable identification of distinctive patterns of the genomic data through extraction and integration of intricate biological characteristics of the genomic data. Here we tackle the problem in two steps to exploit the gene-gene interactions of the system. We first reposition the genes into a 2D grid such that their spatial configuration reflects their interactive relationships. To alleviate the need for labeled ground truth gene expression datasets, a self-supervised 2D convolutional neural network is employed to extract the contextual features of the interactions from the spatially configured genes and impute the omitted values. Extensive experiments with both simulated and experimental scRNA-seq datasets are carried out to demonstrate the superior performance of the proposed strategy against the existing imputation methods.


Subject(s)
Deep Learning , Genetic Epistasis , Data Analysis , Genomics , Gene Expression , Gene Expression Profiling , RNA Sequence Analysis
2.
Genes Chromosomes Cancer ; 63(9): e23275, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39324485

ABSTRACT

Concurrent testing of numerous genes for hereditary breast cancer (BC) is available but can result in management difficulties. We evaluated use of an expanded BC gene panel in women of diverse South African ancestries and assessed use of African genomic data to reclassify variants of uncertain significance (VUS). A total of 331 women of White, Black African, or Mixed Ancestry with BC had a 9-gene panel test, with an additional 75 genes tested in those without a pathogenic/likely pathogenic (P/LP) variant. The proportion of VUS reclassified using ClinGen gene-specific allele frequency (AF) thresholds or an AF > 0.001 in nonguideline genes in African genomic data was determined. The 9-gene panel identified 58 P/LP variants, but only two of the P/LP variants detected using the 75-gene panel were in confirmed BC genes, resulting in a total of 60 (18.1%) in all participants. P/LP variant prevalence was similar across ancestry groups, but VUS prevalence was higher in Black African and Mixed Ancestry than in White participants. In total, 611 VUS were detected, representing 324 distinct variants. In African genomic data, 10.8% (9/83) of VUS met ClinGen AF thresholds, while 10.8% (26/240) of VUS in nonguideline genes had an AF > 0.001. Overall, 27.0% of VUS occurrences could potentially be reclassified using African genomic data. Thus, expanding the gene panel yielded few clinically actionable variants but many VUS, particularly in participants of Black African and Mixed Ancestry. However, use of African genomic data has the potential to reclassify a significant proportion of VUS.


Subject(s)
Black People , Breast Neoplasms , Humans , Breast Neoplasms/genetics , Breast Neoplasms/ethnology , Female , South Africa/epidemiology , Middle Aged , Adult , Black People/genetics , Prevalence , Genetic Variation , Aged , Genetic Predisposition to Disease , Gene Frequency , Genetic Testing/methods , White People/genetics
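
The allele-frequency screen described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the variant labels and frequencies are invented, the function names are hypothetical, and only the 0.001 cutoff and the reported counts (9/83, 26/240, 60/331) come from the abstract.

```python
# Illustrative sketch only: variant labels and frequencies are invented.
AF_THRESHOLD = 0.001  # cutoff quoted in the abstract for nonguideline genes


def flag_for_reclassification(variants, threshold=AF_THRESHOLD):
    """Return the labels of variants whose allele frequency exceeds the threshold."""
    return [name for name, af in variants.items() if af > threshold]


def percent(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Hypothetical VUS with allele frequencies taken from a reference population.
example_vus = {"VUS_A": 0.0004, "VUS_B": 0.0031, "VUS_C": 0.0010, "VUS_D": 0.0200}

print(flag_for_reclassification(example_vus))  # VUS_C sits exactly on the cutoff and is not flagged
print(percent(9, 83), percent(26, 240), percent(60, 331))
```

The comparison is strict (`>`), matching the "AF > 0.001" wording, so a variant exactly at the cutoff is left unflagged.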
3.
BMC Bioinformatics ; 25(1): 160, 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38649820

ABSTRACT

BACKGROUND: The reconstruction of the evolutionary history of organisms has been greatly influenced by the advent of molecular techniques, leading to a significant increase in studies utilizing genomic data from different species. However, the lack of standardization in gene nomenclature poses a challenge in database searches and evolutionary analyses, impacting the accuracy of results obtained. RESULTS: To address this issue, a Python class for standardizing gene nomenclatures, SynGenes, has been developed. It automatically recognizes and converts different nomenclature variations into a standardized form, facilitating comprehensive and accurate searches. Additionally, SynGenes offers a web form for individual searches using different names associated with the same gene. The SynGenes database contains a total of 545 gene name variations for mitochondrial genes and 2485 for chloroplast genes, providing a valuable resource for researchers. CONCLUSIONS: The SynGenes platform offers a solution for standardizing gene nomenclatures of mitochondrial and chloroplast genes and provides a standardized search solution for specific markers in GenBank. Evaluation of the effectiveness of SynGenes through searches conducted on GenBank and PubMed Central demonstrated its ability to yield a greater number of results than conventional searches, ensuring more comprehensive and accurate results. This tool is crucial for accurate database searches and, consequently, evolutionary analyses, addressing the challenges posed by non-standardized gene nomenclature.


Subject(s)
Molecular Evolution , Terminology as Topic , Chloroplast Genes , Mitochondrial Genes , Genetic Databases , Chloroplasts/genetics , Internet , Software
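
The idea of mapping name variants to one standard symbol can be illustrated with a small lookup table. The sketch below is an assumption-laden toy, not the SynGenes class or its API: the table, `normalize` and `standardize` are hypothetical names, and the handful of synonyms shown stands in for the much larger database described in the abstract.

```python
# Hypothetical synonym table; this is not the SynGenes database or its API.
SYNONYMS = {
    "COI": {"COI", "COX1", "CO1", "COXI", "cytochrome c oxidase subunit I"},
    "CYTB": {"CYTB", "COB", "cytochrome b"},
    "rbcL": {"rbcL", "RuBisCO large subunit"},
}


def normalize(name):
    """Lower-case a name and drop spaces, hyphens and other punctuation."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


# Reverse index: normalized variant -> standard symbol.
LOOKUP = {
    normalize(variant): standard
    for standard, variants in SYNONYMS.items()
    for variant in variants
}


def standardize(name):
    """Return the standard symbol for a gene name, or None if it is unknown."""
    return LOOKUP.get(normalize(name))


for query in ["cox1", "CO-I", "Cyt b", "nad1"]:
    print(query, "->", standardize(query))
```

Normalizing before lookup lets one table entry absorb differences in case, spacing and hyphenation, so only genuinely different names need to be listed.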
4.
BMC Bioinformatics ; 25(1): 8, 2024 Jan 03.
Article in English | MEDLINE | ID: mdl-38172657

ABSTRACT

BACKGROUND: The increasing volume and complexity of genomic data pose significant challenges for effective data management and reuse. Public genomic data often undergo similar preprocessing across projects, leading to redundant or inconsistent datasets and inefficient use of computing resources. This is especially pertinent for bioinformaticians engaged in multiple projects. Tools have been created to address challenges in managing and accessing curated genomic datasets; however, the practical utility of such tools is greatest for users who work with specific types of data or are technically inclined toward a particular programming language. Currently, there is a gap in the availability of an R-specific solution for efficient data management and versatile data reuse. RESULTS: Here we present ReUseData, an R software tool that overcomes some of the limitations of existing solutions and provides a versatile and reproducible approach to effective data management within R. ReUseData facilitates the transformation of ad hoc scripts for data preprocessing into Common Workflow Language (CWL)-based data recipes, allowing for the reproducible generation of curated data files in their generic formats. The data recipes are standardized and self-contained, making them easily portable and reproducible across various computing platforms. ReUseData also streamlines the reuse of curated data files and their integration into downstream analysis tools and workflows with different frameworks. CONCLUSIONS: ReUseData provides a reliable and reproducible approach for genomic data management within the R environment to enhance the accessibility and reusability of genomic data. The package is available at Bioconductor ( https://bioconductor.org/packages/ReUseData/ ) with additional information on the project website ( https://rcwl.org/dataRecipes/ ).


Subject(s)
Data Management , Genomics , Software , Programming Languages , Workflow
5.
BMC Bioinformatics ; 25(1): 288, 2024 Sep 03.
Article in English | MEDLINE | ID: mdl-39227781

ABSTRACT

BACKGROUND: The variant call format (VCF) file is a structured and comprehensive text file crucial for researchers and clinicians in interpreting and understanding genomic variation data. It contains essential information about variant positions in the genome, along with alleles, genotype calls, and quality scores. Analyzing and visualizing these files, however, poses significant challenges due to the need for diverse resources and robust features for in-depth exploration. RESULTS: To address these challenges, we introduce variant graph craft (VGC), a VCF file visualization and analysis tool. VGC offers a wide range of features for exploring genetic variations, including extraction of variant data, intuitive visualization, and graphical representation of samples with genotype information. VGC is designed primarily for the analysis of patient cohorts, but it can also be adapted for use with individual probands or families. It integrates seamlessly with external resources, providing insights into gene function and variant frequencies in sample data. VGC includes gene function and pathway information from Molecular Signatures Database (MSigDB) for GO terms, KEGG, Biocarta, Pathway Interaction Database, and Reactome. Additionally, it dynamically links to gnomAD for variant information and incorporates ClinVar data for pathogenic variant information. VGC supports the Human Genome Assembly Hg37 and Hg38, ensuring compatibility with a wide range of data sets, and accommodates various approaches to exploring genetic variation data. It can be tailored to specific user needs with optional phenotype input data. CONCLUSIONS: In summary, VGC provides a comprehensive set of features tailored to researchers working with genomic variation data. Its intuitive interface, rapid filtering capabilities, and the flexibility to perform queries using custom groups make it an effective tool in identifying variants potentially associated with diseases. VGC operates locally, ensuring data security and privacy by eliminating the need for cloud-based VCF uploads, making it a secure and user-friendly tool. It is freely available at https://github.com/alperuzun/VGC .


Subject(s)
Genetic Variation , Software , Humans , Genetic Variation/genetics , Genetic Databases , Genomics/methods , Genotype
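
To make the file layout mentioned in this abstract concrete, the sketch below parses a single VCF data line using the standard column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then one column per sample). It is unrelated to the VGC code base: `parse_vcf_record`, the sample names and the example record are all made up for illustration, and real files are better handled with an established VCF library.

```python
def parse_vcf_record(line, sample_names):
    """Parse one tab-separated VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]

    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info_map = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_map[key] = value if sep else True

    # FORMAT (column 9) names the colon-separated fields of each sample column.
    format_keys = fields[8].split(":") if len(fields) > 8 else []
    genotypes = {
        name: dict(zip(format_keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": None if variant_id == "." else variant_id,
        "ref": ref,
        "alt": alt.split(","),  # several alternate alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_map,
        "genotypes": genotypes,
    }


# Made-up record with two hypothetical samples.
line = "chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=30;DB\tGT:DP\t0/1:14\t1/2:16"
record = parse_vcf_record(line, ["sample1", "sample2"])
print(record["alt"], record["info"], record["genotypes"]["sample2"]["GT"])
```

Values inside INFO and the sample columns are kept as strings here; typing them properly would require reading the `##INFO` and `##FORMAT` header lines, which this sketch skips.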
6.
Mol Genet Genomics ; 299(1): 96, 2024 Oct 09.
Article in English | MEDLINE | ID: mdl-39382723

ABSTRACT

DNA transposons are diverse in fish genomes and have been described to generate genomic evolutionary novelties. hAT transposable element data are scarce in Teleostei genomes, making it challenging to conduct comparative genomic studies to understand their neutrality or function. This study aimed to perform a genomic and molecular characterization of hAT copies to assess the diversity of these elements and associate changes in these sequences to genomic and karyotypic novelties in Apareiodon sp. The data revealed that hAT TEs are highly abundant in the Apareiodon sp. genome, with few possibly autonomous copies. Highly conserved sequences with likely functional transposases were observed in nine hAT elements. A great diversity of hAT subgroups was observed, especially from Ac, Charlie, Blackjack, Tip100, hAT6, and hAT5, and a similar wave of hAT genomic invasion was identified in the genome for these six groups of hAT sequences. The data also revealed a distinct number of microsatellites within degenerated hAT copies. hAT sites were demonstrated to be dispersed in the Apareiodon sp. chromosomes and not involved in W chromosome-specific region differentiation. In conclusion, the genomic analysis revealed a great diversity of hAT elements, possible autonomous copies, and differentiation of degenerated transposable elements into tandem sequences.


Subject(s)
DNA Transposable Elements , Genome , Phylogeny , DNA Transposable Elements/genetics , Animals , Genome/genetics , Molecular Evolution , Microsatellite Repeats/genetics , Genomics/methods , Fishes/genetics , Fishes/classification
7.
Proc Biol Sci ; 291(2019): 20232805, 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38503333

ABSTRACT

Cholera continues to be a global health threat. Understanding how cholera spreads between locations is fundamental to the rational, evidence-based design of intervention and control efforts. Traditionally, cholera transmission models have used cholera case-count data. More recently, whole-genome sequence data have qualitatively described cholera transmission. Integrating these data streams may provide much more accurate models of cholera spread; however, no systematic analyses have been performed so far to compare traditional case-count models to the phylodynamic models from genomic data for cholera transmission. Here, we use high-fidelity case-count and whole-genome sequencing data from the 1991 to 1998 cholera epidemic in Argentina to directly compare the epidemiological model parameters estimated from these two data sources. We find that phylodynamic methods applied to cholera genomics data provide comparable estimates that are in line with established methods. Our methodology represents a critical step in building a framework for integrating case-count and genomic data sources for cholera epidemiology and other bacterial pathogens.


Subject(s)
Cholera , Epidemics , Humans , Cholera/epidemiology , Cholera/microbiology , Disease Outbreaks , Genomics/methods , Whole Genome Sequencing
8.
Brief Bioinform ; 23(5)2022 Sep 20.
Article in English | MEDLINE | ID: mdl-36049234

ABSTRACT

Many biological applications are essentially pairwise comparison problems, such as inferring evolutionary relationships among genomic sequences, contig binning on metagenomic data, and cell type identification from single-cell gene expression profiles. Pairwise comparison requires a suitable dissimilarity metric. However, no single metric is fully suited to every biological application, so it is necessary to learn a metric from the data that adapts to the application of interest. Therefore, in this study, we propose MEtric Learning with Triplet network (MELT), which learns a nonlinear mapping from the original space to an embedding space in order to keep similar data close together and dissimilar data far apart. MELT is a weakly supervised and data-driven comparison framework that offers a more adaptive and accurate dissimilarity, learned in the absence of label information when supervised methods are not applicable. We applied MELT in three typical applications of genomic data comparison, including hierarchical genomic sequences, longitudinal microbiome samples and longitudinal single-cell gene expression profiles, which have no distinctive grouping information. In the experiments, MELT demonstrated its empirical utility in comparison to many widely used dissimilarity metrics. MELT is expected to accommodate a more extensive set of applications in large-scale genomic comparisons. MELT is available at https://github.com/Ying-Lab/MELT.


Subject(s)
Algorithms , Metagenomics , Learning , Metagenome , Metagenomics/methods
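
The objective behind a triplet network ("keep similar data close, dissimilar data far apart") is commonly written as a margin-based triplet loss. The abstract does not state MELT's exact loss, so the sketch below assumes the standard form with unsquared Euclidean distance; the function names, the margin value and the toy embeddings are illustrative only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss on already-embedded points.

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is; otherwise it grows linearly.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = (0.0, 0.0)
near = (0.0, 1.0)   # distance 1 from the anchor
far = (3.0, 4.0)    # distance 5 from the anchor

print(triplet_loss(anchor, near, far))   # well-separated triplet
print(triplet_loss(anchor, far, near))   # violated triplet
```

In a trained model the three inputs would be outputs of the learned nonlinear mapping; here they are fixed points so that the behaviour of the loss itself is easy to check by hand.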
9.
Ann Hematol ; 103(1): 117-123, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38030891

ABSTRACT

Myelofibrosis (MF) is commonly diagnosed in older individuals and has not been extensively studied in young patients. Given the infrequent diagnosis in young patients, analyzing this cohort may identify factors that predict disease development/progression. We retrospectively analyzed clinical/genomic characteristics, treatments, and outcomes of patients with MF aged 18-50 years (YOUNG) at diagnosis. Sixty-three YOUNG patients were compared to 663 patients diagnosed at 51 or older (OLDER). YOUNG patients were more likely to be female, harbor driver CALR mutations, lack splicing gene mutations, and have low-risk disease by dynamic international prognostic scoring system (DIPSS) at presentation. Thirty-six patients (60%) presented with incidental lab findings and 19 (32%) with symptomatic disease. Median time to first treatment was 9.4 months (mo). Fourteen (22%) YOUNG patients underwent allogeneic hematopoietic stem cell transplant (median 57.4 mo post-diagnosis). Five (8%) developed blast-phase disease (median 99 mo post-diagnosis). Median overall survival (OS) for YOUNG patients was not reached compared to 62.8 mo in the OLDER cohort (p < 0.001). The survival advantage for YOUNG patients lost significance when compared to OLDER patients lacking splicing mutations (p = 0.11). Thirty-one (49%) had comorbidities predating MF diagnosis. Presence of a comorbidity correlated with increased disease risk as measured by serial DIPSS (p = 0.02). Increased disease risk correlated with decreased OS (p = 0.05). MF is rare in young adults, has distinct clinical/molecular correlates, and has a favorable prognosis. The high frequency of inflammatory comorbidities and their correlation with progression of disease risk clinically highlight the role of inflammation in MF pathogenesis.


Subject(s)
Hematopoietic Stem Cell Transplantation , Primary Myelofibrosis , Young Adult , Humans , Female , Aged , Male , Primary Myelofibrosis/diagnosis , Primary Myelofibrosis/therapy , Primary Myelofibrosis/genetics , Retrospective Studies , Prognosis , Hematopoietic Stem Cell Transplantation/adverse effects , Comorbidity , Mutation
10.
Liver Int ; 44(6): 1286-1289, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38426626

ABSTRACT

Recent advancements in artificial intelligence (AI) present both opportunities and challenges within the scientific community. This study explores the capability of AI to replicate findings from genetic research, focusing on findings from prior work. Using an AI model without exposing any raw data, we created a dataset that closely mirrors the results of our original study, illustrating the ease of fabricating datasets with authenticity. This approach highlights the risks associated with AI misuse in scientific research. The study emphasizes the critical importance of maintaining the integrity of scientific inquiry in an era increasingly influenced by advanced AI technologies.


Subject(s)
Artificial Intelligence , Humans , Genetic Research , Cohort Studies
11.
Stat Appl Genet Mol Biol ; 22(1)2023 Jan 01.
Article in English | MEDLINE | ID: mdl-37622330

ABSTRACT

Permutation tests are widely used for statistical hypothesis testing when the sampling distribution of the test statistic under the null hypothesis is analytically intractable or unreliable due to finite sample sizes. One critical challenge in the application of permutation tests in genomic studies is that an enormous number of permutations are often needed to obtain reliable estimates of very small p-values, leading to intensive computational effort. To address this issue, we develop algorithms for the accurate and efficient estimation of small p-values in permutation tests for paired and independent two-group genomic data, and our approaches leverage a novel framework for parameterizing the permutation sample spaces of those two types of data respectively using the Bernoulli and conditional Bernoulli distributions, combined with the cross-entropy method. The performance of our proposed algorithms is demonstrated through the application to two simulated datasets and two real-world gene expression datasets generated by microarray and RNA-Seq technologies and comparisons to existing methods such as crude permutations and SAMC, and the results show that our approaches can achieve orders of magnitude of computational efficiency gains in estimating small p-values. Our approaches offer promising solutions for the improvement of computational efficiencies of existing permutation test procedures and the development of new testing methods using permutations in genomic data analysis.


Subject(s)
Genomics , Research Design , Entropy , Algorithms , Data Analysis
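
The computational burden this abstract refers to is easiest to see in the baseline it improves on. The sketch below implements a crude Monte Carlo permutation test for two independent groups, not the cross-entropy method of the paper; the function names, data and permutation count are illustrative assumptions. With B random permutations the estimate (hits + 1) / (B + 1) can never fall below 1 / (B + 1), so resolving a p-value near 1e-8 would take on the order of 1e8 permutations.

```python
import random


def mean(values):
    return sum(values) / len(values)


def crude_permutation_p_value(group_a, group_b, n_permutations=9999, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabelling of the pooled observations
        permuted = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if permuted >= observed:
            hits += 1

    # Adding 1 to both counts keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


def smallest_reportable_p(n_permutations):
    """Lower bound on the estimate obtainable from n_permutations shuffles."""
    return 1 / (n_permutations + 1)


print(crude_permutation_p_value([10, 11, 12, 13, 14], [0, 1, 2, 3, 4], n_permutations=999))
print(smallest_reportable_p(999))
```

Importance-sampling schemes such as the one described in the abstract aim to reach the same small p-values with far fewer draws by sampling the extreme relabellings more often and reweighting them.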
12.
Skin Res Technol ; 30(6): e13770, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38881051

ABSTRACT

BACKGROUND: Melanoma is one of the most malignant forms of skin cancer, with a high mortality rate in the advanced stages. Therefore, early and accurate detection of melanoma plays an important role in improving patients' prognosis. Biopsy is the traditional method for melanoma diagnosis, but this method lacks reliability. Therefore, it is important to apply new methods to diagnose melanoma effectively. AIM: This study presents a new approach to classify melanoma using deep neural networks (DNNs) with combined multiple modal imaging and genomic data, which could potentially provide more reliable diagnosis than current medical methods for melanoma. METHOD: We built a dataset of dermoscopic images, histopathological slides and genomic profiles. We developed a custom framework composed of two widely established types of neural networks: Convolutional Neural Networks (CNNs) for analysing image data, and Graph Neural Networks, which can learn graph structure, for analysing genomic data. We trained and evaluated the proposed framework on this dataset. RESULTS: The developed multi-modal DNN achieved higher accuracy than traditional medical approaches. The mean accuracy of the proposed model was 92.5% with an area under the receiver operating characteristic curve of 0.96, suggesting that the multi-modal DNN approach can detect critical morphologic and molecular features of melanoma beyond the limitations of traditional AI and traditional machine learning approaches. The combination of cutting-edge AI may allow access to a broader range of diagnostic data, which can allow dermatologists to make more accurate decisions and refine treatment strategies. However, the application of the framework will have to be validated at a larger scale and more clinical trials need to be conducted to establish whether this novel diagnostic approach will be more effective and feasible.


Subject(s)
Deep Learning , Dermoscopy , Melanoma , Skin Neoplasms , Humans , Melanoma/genetics , Melanoma/diagnostic imaging , Melanoma/diagnosis , Melanoma/pathology , Skin Neoplasms/genetics , Skin Neoplasms/diagnostic imaging , Skin Neoplasms/pathology , Dermoscopy/methods , Computer Neural Networks , Reproducibility of Results , Genomics/methods , Female , Male , Middle Aged , Adult , Aged
13.
BMC Med Ethics ; 25(1): 51, 2024 May 05.
Article in English | MEDLINE | ID: mdl-38706004

ABSTRACT

Data access committees (DACs) gatekeep access to secured genomic and related health datasets yet are challenged to keep pace with the rising volume and complexity of data generation. Automated decision support (ADS) systems have been shown to support consistency, compliance, and coordination of data access review decisions. However, we lack understanding of how DAC members perceive the value added by ADS, if any, to the quality and effectiveness of their reviews. In this qualitative study, we report findings from 13 semi-structured interviews with DAC members from around the world to identify relevant barriers and facilitators to implementing ADS for genomic data access management. Participants generally supported pilot studies that test ADS performance, for example in cataloging data types, verifying user credentials and tagging datasets for use terms. Concerns related to over-automation, lack of human oversight, low prioritization, and misalignment with institutional missions tempered enthusiasm for ADS among the DAC members we engaged. Tension for change in the institutional settings within which DACs operated was a powerful motivator for DAC members to consider implementing ADS in their access workflows, as were perceptions of the relative advantage of ADS over the status quo. Future research is needed to build the evidence base around the comparative effectiveness and decisional outcomes of institutions that do or do not use ADS in their workflows.


Subject(s)
Datasets as Topic , Decision Support Techniques , Genomics , Software , Automation , Workflow , Interviews as Topic , Data Systems , Datasets as Topic/legislation & jurisprudence , Humans
14.
Int J Mol Sci ; 25(12)2024 Jun 11.
Article in English | MEDLINE | ID: mdl-38928128

ABSTRACT

The process of identification and management of neurological disorder conditions faces challenges, prompting the investigation of novel methods in order to improve diagnostic accuracy. In this study, we conducted a systematic literature review to identify the significance of genetics- and molecular-pathway-based machine learning (ML) models in treating neurological disorder conditions. According to the study's objectives, search strategies were developed to extract the research studies using digital libraries. We followed rigorous study selection criteria. A total of 24 studies met the inclusion criteria and were included in the review. We classified the studies based on neurological disorders. The included studies highlighted multiple methodologies and exceptional results in treating neurological disorders. The study findings underscore the potential of the existing models, presenting personalized interventions based on the individual's conditions. The findings offer better-performing approaches that handle genetics and molecular data to generate effective outcomes. Moreover, we discuss the future research directions and challenges, emphasizing the demand for generalizing existing models in real-world clinical settings. This study contributes to advancing knowledge in the field of diagnosis and management of neurological disorders.


Subject(s)
Machine Learning , Nervous System Diseases , Humans , Nervous System Diseases/diagnosis , Nervous System Diseases/genetics
15.
Dev World Bioeth ; 2024 Jan 31.
Article in English | MEDLINE | ID: mdl-38298031

ABSTRACT

This article considers the practical question of how research institutions should best structure their legal relationship with the human genomic data that they generate. The analysis, based on South African law, is framed by the legal position that although a research institution that generates human genomic data is not automatically the owner thereof, it is well positioned to claim ownership of newly generated data instances. Given that the research institution exerts effort to generate the data, it can be argued that it has a moral right to claim ownership of such data. Combined with the fact that it has an interest in having comprehensive rights in such data, it appears that the prudent policy for research institutions is to claim ownership of the human genomic data instances that they generate. This policy is tested against two opposing policy positions. The first opposing policy position is that research participants should own the data that relate to them. However, in light of data protection legislation that already provides extensive protections to research participants, bestowing data ownership on research participants would offer little benefit to such individuals, while leading to significant practical problems for research institutions. The second opposing policy position is that the concept of ownership should be abandoned in favour of data custodianship. This opposing position is problematic, as avoiding reference to ownership is a denial of legal reality and hence not a useful policy. Also, avoiding reference to ownership will leave research institutions with limited legal remedies in the event of appropriation of data by third parties. Accordingly, it is concluded that the wisest policy for research institutions is indeed to explicitly claim ownership of the human genomic data instances that they generate.

16.
J Law Med ; 31(2): 258-272, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38963246

ABSTRACT

This section explores the challenges involved in translating genomic research into genomic medicine. A number of priorities have been identified in the Australian National Health Genomics Framework for addressing these challenges. Responsible collection, storage, use and management of genomic data is one of these priorities, and is the primary theme of this section. The recent release of Genomical, an Australian data-sharing platform, is used as a case study to illustrate the type of assistance that can be provided to the health care sector in addressing this priority. The section first describes the National Framework and other drivers involved in the move towards genomic medicine. The section then examines key ethical, legal and social factors at play in genomics, with particular focus on privacy and consent. Finally, the section examines how Genomical is being used to help ensure that the move towards genomic medicine is ethically, legally and socially sound and that it optimises advances in both genomic and information technology.


Subject(s)
Genomics , Information Dissemination , Humans , Genomics/legislation & jurisprudence , Genomics/ethics , Australia , Information Dissemination/legislation & jurisprudence , Information Dissemination/ethics , Informed Consent/legislation & jurisprudence , Genetic Privacy/legislation & jurisprudence , Confidentiality/legislation & jurisprudence
17.
BMC Bioinformatics ; 24(1): 25, 2023 Jan 23.
Article in English | MEDLINE | ID: mdl-36690931

ABSTRACT

In clinical trials, identification of prognostic and predictive biomarkers has become essential to precision medicine. Prognostic biomarkers can be useful for the prevention of the occurrence of the disease, and predictive biomarkers can be used to identify patients with potential benefit from the treatment. Previous research has mainly focused on clinical characteristics, and the use of genomic data in this area has hardly been studied. A new method is required to simultaneously select prognostic and predictive biomarkers in high-dimensional genomic data where biomarkers are highly correlated. We propose a novel approach called PPLasso that integrates prognostic and predictive effects into one statistical model. PPLasso also takes into account the correlations between biomarkers that can alter the biomarker selection accuracy. Our method consists of transforming the design matrix to remove the correlations between the biomarkers before applying the generalized Lasso. In a comprehensive numerical evaluation, we show that PPLasso outperforms the traditional Lasso and other extensions on both prognostic and predictive biomarker identification in various scenarios. Finally, our method is applied to publicly available transcriptomic and proteomic data.


Subject(s)
Tumor Biomarkers , Proteomics , Humans , Prognosis , Biomarkers , Statistical Models , Genomics
18.
BMC Bioinformatics ; 24(1): 354, 2023 Sep 21.
Article in English | MEDLINE | ID: mdl-37735350

ABSTRACT

BACKGROUND: Plummeting DNA sequencing costs in recent years have enabled genome sequencing projects to scale up by several orders of magnitude, which is transforming genomics into a highly data-intensive field of research. This development provides the much-needed statistical power required for genotype-phenotype predictions in complex diseases. METHODS: To efficiently leverage this wealth of information, we assessed several genomic data science tools. We focused on on-premise installations to cope with situations where data confidentiality, compliance regulations, and similar constraints rule out cloud-based solutions. We established a comprehensive qualitative and quantitative comparison between BCFtools, SnpSift, Hail, GEMINI, and OpenCGA. The tools were compared in terms of data storage technology, query speed, scalability, annotation, data manipulation, visualization, data output representation, and availability. RESULTS: Tools that leverage sophisticated data structures proved the most suitable for large-scale projects, with varying degrees of scalability, compared with flat-file manipulation tools (e.g., BCFtools and SnpSift). Remarkably, for small to mid-size projects, even lightweight relational database. CONCLUSION: The assessment criteria provide insights into the typical questions posed in scalable genomics and serve as guidance for the development of scalable computational infrastructure in genomics.


Subject(s)
Data Science , Genomics , Chromosome Mapping , Factual Databases , DNA Sequence Analysis
19.
J Physiol ; 601(9): 1611-1623, 2023 05.
Article in English | MEDLINE | ID: mdl-36762618

ABSTRACT

Synthesis of DNA fragments based on gene sequences that are available in public resources has become an efficient and affordable method that has gradually replaced traditional cloning approaches such as PCR cloning from cDNA. However, database entries based on genome sequencing results are prone to errors, which can lead to false sequence information and, ultimately, to errors in the functional characterisation of proteins such as ion channels and transporters in heterologous expression systems. We have identified five common problems that repeatedly appear in public resources: (1) not every gene has yet been annotated; (2) not all gene annotations are necessarily correct; (3) transcripts may contain automated corrections; (4) there are mismatches between gene, mRNA and protein sequences; and (5) splicing patterns often lack experimental validation. This technical review highlights these issues and provides a strategy to bypass them, in order to avoid critical mistakes that could impact future studies of any gene/protein of interest in heterologous expression systems.
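
Problem (4) above, mismatches between mRNA and protein records, lends itself to a simple automated check before ordering a synthetic fragment. The following is a minimal sketch, not the strategy described in the review: it assumes the coding sequence starts in frame at its first base, uses the standard genetic code only, and the function names (`translate`, `compare_to_protein`) and example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate an in-frame coding sequence up to the first stop codon."""
    seq = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # "X" marks ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def compare_to_protein(cds, annotated_protein):
    """Return (mismatches, length_difference) between translation and record.

    Each mismatch is (1-based position, translated residue, annotated residue).
    """
    translated = translate(cds)
    mismatches = [(pos + 1, t, a)
                  for pos, (t, a) in enumerate(zip(translated, annotated_protein))
                  if t != a]
    return mismatches, len(translated) - len(annotated_protein)


# Illustrative sequences only; not taken from any database entry.
print(compare_to_protein("ATGGCCAAGTAA", "MAR"))
```

A non-empty mismatch list or a non-zero length difference does not say which record is wrong; it only flags the entry for manual inspection against the genomic sequence.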


Subject(s)
Proteins , Base Sequence , Amino Acid Sequence , Complementary DNA/genetics , Complementary DNA/metabolism , Proteins/genetics
20.
BMC Genomics ; 24(1): 560, 2023 Sep 22.
Article in English | MEDLINE | ID: mdl-37736708

ABSTRACT

BACKGROUND: Genomic data-based machine learning tools are promising for real-time surveillance activities performing source attribution of foodborne bacteria such as Listeria monocytogenes. Given the heterogeneity of machine learning practices, our aim was to identify those influencing the source prediction performance of the usual holdout method combined with the repeated k-fold cross-validation method. METHODS: A large collection of 1 100 L. monocytogenes genomes with known sources was built according to several genomic metrics to ensure authenticity and completeness of genomic profiles. Based on these genomic profiles (i.e. 7-locus alleles, core alleles, accessory genes, core SNPs and pan kmers), we developed a versatile workflow assessing the prediction performance of different combinations of training dataset splitting (i.e. 50, 60, 70, 80 and 90%), data preprocessing (i.e. with or without near-zero variance removal), and learning models (i.e. BLR, ERT, RF, SGB, SVM and XGB). The performance metrics included accuracy, Cohen's kappa, F1-score, area under the curve for the receiver operating characteristic curve, precision recall curve or precision recall gain curve, and execution time. RESULTS: The average testing accuracies from accessory genes and pan kmers were significantly higher than accuracies from core alleles or SNPs. While the accuracies from 70 and 80% of training dataset splitting were not significantly different, those from 80% were significantly higher than those from the other tested proportions. The near-zero variance removal did not allow results to be produced for 7-locus alleles, did not significantly impact the accuracy for core alleles, accessory genes and pan kmers, and significantly decreased the accuracy for core SNPs. The SVM and XGB models did not differ significantly in accuracy from each other and reached significantly higher accuracies than BLR, SGB, ERT and RF, in that order. However, the SVM model required more computing power than the XGB model, especially for large numbers of descriptors such as core SNPs and pan kmers. CONCLUSIONS: In addition to recommendations about machine learning practices for L. monocytogenes source attribution based on genomic data, the present study also provides a freely available workflow to solve other balanced or unbalanced multiclass phenotypes from binary and categorical genomic profiles of other microorganisms without source code modifications.
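
The evaluation scheme named in the abstract, a holdout split combined with repeated k-fold cross-validation on the training portion, can be sketched in plain Python as follows. This is an illustrative index-splitting sketch only, not the published workflow: the function names (`holdout_split`, `repeated_kfold`), the choice of 10 folds and 3 repeats, and the seeds are assumptions, and no stratification by source class is performed.

```python
import random


def holdout_split(n_samples, train_fraction, seed=0):
    """Shuffle sample indices and split them into training and testing sets."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = int(round(n_samples * train_fraction))
    return indices[:cut], indices[cut:]


def repeated_kfold(indices, k, repeats, seed=0):
    """Yield (training, validation) index lists for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = list(indices)
        rng.shuffle(shuffled)
        # Deal the shuffled indices into k folds of near-equal size.
        folds = [shuffled[i::k] for i in range(k)]
        for i in range(k):
            validation = folds[i]
            training = [idx for j, fold in enumerate(folds) if j != i
                        for idx in fold]
            yield training, validation


# 1 100 genomes with an 80% training proportion, as in the abstract;
# the fold and repeat counts below are assumed for illustration.
train_idx, test_idx = holdout_split(1100, 0.80)
splits = list(repeated_kfold(train_idx, k=10, repeats=3))
print(len(train_idx), len(test_idx), len(splits))
```

Model tuning would use only the cross-validation splits drawn from the training indices, with the held-out testing indices reserved for the final accuracy estimate.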


Subject(s)
Listeria monocytogenes , Listeria monocytogenes/genetics , Genomics , Supervised Machine Learning , Machine Learning , Alleles