Search | BVS Bolivia

1.

Methods for evaluating unsupervised vector representations of genomic regions.

Zheng, Guangtao; Rymuza, Julia; Gharavi, Erfaneh; LeRoy, Nathan J; Zhang, Aidong; Sheffield, Nathan C.

NAR Genom Bioinform ; 6(3): lqae086, 2024 Sep.

Article in English | MEDLINE | ID: mdl-39131817

ABSTRACT

Representation learning models have become a mainstay of modern genomics. These models are trained to yield vector representations, or embeddings, of various biological entities, such as cells, genes, individuals, or genomic regions. Recent applications of unsupervised embedding approaches have been shown to learn relationships among genomic regions that define functional elements in a genome. Unsupervised representation learning of genomic regions is free of the supervision from curated metadata and can condense rich biological knowledge from publicly available data to region embeddings. However, there exists no method for evaluating the quality of these embeddings in the absence of metadata, making it difficult to assess the reliability of analyses based on the embeddings, and to tune model training to yield optimal results. To bridge this gap, we propose four evaluation metrics: the cluster tendency score (CTS), the reconstruction score (RCS), the genome distance scaling score (GDSS), and the neighborhood preserving score (NPS). The CTS and RCS statistically quantify how well region embeddings can be clustered and how well the embeddings preserve information in training data. The GDSS and NPS exploit the biological tendency of regions close in genomic space to have similar biological functions; they measure how much such information is captured by individual region embeddings in a set. We demonstrate the utility of these statistical and biological scores for evaluating unsupervised genomic region embeddings and provide guidelines for learning reliable embeddings.
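The idea behind the biological scores, that regions close together on the genome should also be close together in embedding space, can be illustrated with a small neighbor-overlap calculation. The sketch below is not the paper's NPS definition. The function names, the midpoint-based genomic distance, and the Euclidean embedding distance are all assumptions made for illustration.

```python
import math

def genomic_distance(a, b):
    """Distance between region midpoints; infinite across chromosomes (assumed convention)."""
    if a[0] != b[0]:
        return math.inf
    return abs((a[1] + a[2]) / 2 - (b[1] + b[2]) / 2)

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def k_nearest(i, n, k, dist):
    """Indices of the k items closest to item i under dist(i, j)."""
    others = [j for j in range(n) if j != i]
    return set(sorted(others, key=lambda j: dist(i, j))[:k])

def neighborhood_overlap(regions, embeddings, k):
    """Mean fraction of each region's k genomic neighbors that are also embedding neighbors."""
    n = len(regions)
    total = 0.0
    for i in range(n):
        genome_nn = k_nearest(
            i, n, k, lambda a, b: genomic_distance(regions[a], regions[b])
        )
        embed_nn = k_nearest(
            i, n, k, lambda a, b: euclidean(embeddings[a], embeddings[b])
        )
        total += len(genome_nn & embed_nn) / k
    return total / n

# Hypothetical toy data: two pairs of adjacent regions on chr1.
regions = [("chr1", 0, 100), ("chr1", 200, 300),
           ("chr1", 10000, 10100), ("chr1", 10200, 10300)]
consistent = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
scrambled = [[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.1, 5.0]]

print(neighborhood_overlap(regions, consistent, k=1))
print(neighborhood_overlap(regions, scrambled, k=1))
```

In this toy case the embeddings that place adjacent regions together score higher than the scrambled assignment, which is the kind of contrast such a score is meant to expose.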

2.

Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings.

LeRoy, Nathan J; Smith, Jason P; Zheng, Guangtao; Rymuza, Julia; Gharavi, Erfaneh; Brown, Donald E; Zhang, Aidong; Sheffield, Nathan C.

NAR Genom Bioinform ; 6(3): lqae073, 2024 Sep.

Article in English | MEDLINE | ID: mdl-38974799

ABSTRACT

Data from the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) are now widely available. One major computational challenge is dealing with high dimensionality and inherent sparsity, which is typically addressed by producing lower dimensional representations of single cells for downstream clustering tasks. Current approaches produce such individual cell embeddings directly through a one-step learning process. Here, we propose an alternative approach by building embedding models pre-trained on reference data. We argue that this provides a more flexible analysis workflow that also has computational performance advantages through transfer learning. We implemented our approach in scEmbed, an unsupervised machine-learning framework that learns low-dimensional embeddings of genomic regulatory regions to represent and analyze scATAC-seq data. scEmbed performs well in terms of clustering ability and has the key advantage of learning patterns of region co-occurrence that can be transferred to other, unseen datasets. Moreover, models pre-trained on reference data can be exploited to build fast and accurate cell-type annotation systems without the need for other data modalities. scEmbed is implemented in Python, is open source, and is available at https://github.com/databio/geniml. Pre-trained models from this work are available on Hugging Face: https://huggingface.co/databio.

3.

Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets.

Gharavi, Erfaneh; LeRoy, Nathan J; Zheng, Guangtao; Zhang, Aidong; Brown, Donald E; Sheffield, Nathan C.

Bioengineering (Basel) ; 11(3), 2024 Mar 08.

Article in English | MEDLINE | ID: mdl-38534537

ABSTRACT

As available genomic interval data increase in scale, we require fast systems to search them. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but this approach leads to challenges like sparsity, high dimensionality, and computational expense. We require novel methods to quickly and flexibly query large, messy genomic interval databases. Here, we develop a genomic interval search system using representation learning. We train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, we develop a system that solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string, suggesting new labels for database region sets, and retrieving database region sets similar to a query region set. We evaluate these use cases and show that jointly learned representations of region sets and metadata are a promising approach for fast, flexible, and accurate genomic region information retrieval.
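The three retrieval tasks described above all reduce to ranking items by distance in a shared embedding space. The following is a minimal sketch of that idea using cosine similarity. The vectors, the label and region-set names, and the `rank_by_similarity` helper are hypothetical and are not taken from the authors' system.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, candidates, top_k=3):
    """Return the top_k candidate names ordered by cosine similarity to query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in scored[:top_k]]

# Hypothetical co-embeddings: metadata labels and region sets share one space.
label_vectors = {"liver": [0.9, 0.1], "blood": [0.1, 0.9]}
region_set_vectors = {
    "set_A": [0.8, 0.2],
    "set_B": [0.2, 0.8],
    "set_C": [0.7, 0.1],
}

# Task 1: retrieve region sets related to a query string.
print(rank_by_similarity(label_vectors["liver"], region_set_vectors, top_k=2))

# Task 2: suggest a label for a database region set.
print(rank_by_similarity(region_set_vectors["set_B"], label_vectors, top_k=1))

# Task 3: retrieve region sets similar to a query region set (excluding itself).
others = {k: v for k, v in region_set_vectors.items() if k != "set_A"}
print(rank_by_similarity(region_set_vectors["set_A"], others, top_k=1))
```

Because labels and region sets live in the same space, one ranking function serves all three tasks. Only the choice of query vector and candidate pool changes.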

4.

Equisetum arvense L aqueous extract: a novel chemotherapeutic supplement for treatment of human colon carcinoma.

Wang, Lei; Zhang, Luojun; Zheng, Guangtao; Luo, Haiping; El-Kott, Attalla F; El-Kenawy, Ayman E.

Arch Med Sci ; 19(5): 1472-1478, 2023.

Article in English | MEDLINE | ID: mdl-37732051

ABSTRACT

Introduction: Equisetum arvense L is a plant with a long history of human use. It is now recommended for external use to heal wounds and for internal use to relieve urinary tract and prostate disorders. In the current study, the antioxidant, cytotoxic, and anti-human colorectal cancer properties of Equisetum arvense L were investigated in vitro. Material and methods: Total phenolic content, total flavonoid content, radical scavenging activity, and ferrous ion chelating were assessed to evaluate the antioxidant activity. The 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl-2H-tetrazolium bromide (MTT) assay was chosen to investigate the anticancer activity of the plant extract. Results: The plant extract scavenged the free radical 2,2-diphenyl-1-picrylhydrazyl (DPPH) with an IC50 of 12.3 ± 0.7 µg/ml, better than the positive controls. The extract was also rich in phenolic compounds, with a total phenolic content of 396.2 ± 3.2 mg GAE/g. In the MTT assay, human colorectal carcinoma cell lines (HCT-8 [HRT-18], Ramos.2G6.4C10, HT-29, and HCT 116) and a normal cell line (HUVEC) were used to study the cytotoxicity and anticancer potential of Equisetum arvense L against human colorectal cancer. Conclusions: Viability of the human colorectal carcinoma cell lines was very low after treatment with the Equisetum arvense L extract, with no cytotoxicity towards the normal (HUVEC) cell line. The strongest anti-colorectal carcinoma effect was observed in the HT-29 cell line.

5.

Embeddings of genomic region sets capture rich biological associations in lower dimensions.

Gharavi, Erfaneh; Gu, Aaron; Zheng, Guangtao; Smith, Jason P; Cho, Hyun Jae; Zhang, Aidong; Brown, Donald E; Sheffield, Nathan C.

Bioinformatics ; 37(23): 4299-4306, 2021 Dec 07.

Article in English | MEDLINE | ID: mdl-34156475

ABSTRACT

MOTIVATION: Genomic region sets summarize functional genomics data and define locations of interest in the genome such as regulatory regions or transcription factor binding sites. The number of publicly available region sets has increased dramatically, leading to challenges in data analysis. RESULTS: We propose a new method to represent genomic region sets as vectors, or embeddings, using an adapted word2vec approach. We compared our approach to two simpler methods based on interval unions or term frequency-inverse document frequency and evaluated the methods in three ways: First, by classifying the cell line, antibody or tissue type of the region set; second, by assessing whether similarity among embeddings can reflect simulated random perturbations of genomic regions; and third, by testing robustness of the proposed representations to different signal thresholds for calling peaks. Our word2vec-based region set embeddings reduce dimensionality from more than a hundred thousand to 100 without significant loss in classification performance. The vector representation could identify cell line, antibody and tissue type with over 90% accuracy. We also found that the vectors could quantitatively summarize simulated random perturbations to region sets and are more robust to subsampling the data derived from different peak calling thresholds. Our evaluations demonstrate that the vectors retain useful biological information in relatively lower-dimensional spaces. We propose that vector representation of region sets is a promising approach for efficient analysis of genomic region data. AVAILABILITY AND IMPLEMENTATION: https://github.com/databio/regionset-embedding. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
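One of the simpler baselines mentioned above, term frequency-inverse document frequency, can be sketched by treating each region set as a document whose words are region identifiers. This is an illustrative sketch only. The region-string tokens, the `tfidf_vectors` helper, and the specific weighting (relative term frequency times the log of inverse document frequency) are assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Map each named token list to a TF-IDF vector over a shared vocabulary.

    Assumes every document is non-empty.
    """
    vocabulary = sorted({tok for doc in documents.values() for tok in doc})
    n_docs = len(documents)
    # Number of documents in which each token appears at least once.
    doc_freq = Counter(tok for doc in documents.values() for tok in set(doc))
    vectors = {}
    for name, doc in documents.items():
        counts = Counter(doc)
        vectors[name] = [
            (counts[tok] / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok in vocabulary
        ]
    return vocabulary, vectors

# Hypothetical region sets, already mapped to a common universe of regions.
region_sets = {
    "set_A": ["chr1:0-100", "chr1:500-600"],
    "set_B": ["chr1:0-100", "chr2:50-150"],
    "set_C": ["chr1:0-100"],
}

vocabulary, vectors = tfidf_vectors(region_sets)
print(vocabulary)
print(vectors["set_A"])
```

A region present in every set receives zero weight under this scheme, so the vectors emphasize regions that distinguish one set from another. The vector length equals the vocabulary size, which is the high dimensionality that a learned embedding is meant to reduce.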

Subject(s)

Genomics, Protein Binding
