2.
BMC Res Notes ; 17(1): 18, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-38183153

ABSTRACT

OBJECTIVES: This article presents the process of extraction and treatment of two datasets from the General Ombudsman of the Brazilian Unified Health System (OUVSUS). The resulting datasets allow the analysis of manifestation characteristics and sociodemographic profile of the citizens that performed these manifestations. DATA DESCRIPTION: The first dataset depicts the characteristics of the manifestations registered by the General Ombudsman. Each row represents an individual manifestation and contains information such as the registration date, classification, input channel, and subject, among others. The second dataset is constituted of sociodemographic information for each citizen that performed a manifestation, and characteristics such as sexual orientation, race, age, and geographic location of the citizen are presented, among others.


Subjects
Datasets as Topic; Demography; Humans; Brazil
3.
Nature ; 625(7994): 321-328, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38200296

ABSTRACT

Multiple sclerosis (MS) is a neuro-inflammatory and neurodegenerative disease that is most prevalent in Northern Europe. Although it is known that inherited risk for MS is located within or in close proximity to immune-related genes, it is unknown when, where and how this genetic risk originated¹. Here, by using a large ancient genome dataset from the Mesolithic period to the Bronze Age², along with new Medieval and post-Medieval genomes, we show that the genetic risk for MS rose among pastoralists from the Pontic steppe and was brought into Europe by the Yamnaya-related migration approximately 5,000 years ago. We further show that these MS-associated immunogenetic variants underwent positive selection both within the steppe population and later in Europe, probably driven by pathogenic challenges coinciding with changes in diet, lifestyle and population density. This study highlights the critical importance of the Neolithic period and Bronze Age as determinants of modern immune responses and their subsequent effect on the risk of developing MS in a changing environment.


Subjects
Genetic Predisposition to Disease; Genome, Human; Grassland; Multiple Sclerosis; Humans; Datasets as Topic; Diet/ethnology; Diet/history; Europe/ethnology; Genetic Predisposition to Disease/history; Genetics, Medical; History, 15th Century; History, Ancient; History, Medieval; Human Migration/history; Life Style/ethnology; Life Style/history; Multiple Sclerosis/genetics; Multiple Sclerosis/history; Multiple Sclerosis/immunology; Neurodegenerative Diseases/genetics; Neurodegenerative Diseases/history; Neurodegenerative Diseases/immunology; Population Density
5.
PLoS One ; 19(1): e0296929, 2024.
Article in English | MEDLINE | ID: mdl-38277376

ABSTRACT

Every day thousands of news articles are published on the web, and filtering tools can be used to extract knowledge on specific topics. The categorization of news into a predefined set of topics is a subject widely studied in the literature; however, most works are restricted to documents in English. In this work, we make two contributions. First, we introduce a Portuguese news dataset collected from WikiNews, an open-source outlet that provides news from different sources. Since there is a lack of datasets for Portuguese, and an existing one is from a single news channel, we aim to introduce a dataset from different news channels. The availability of comprehensive datasets plays a key role in advancing research. Second, we compare different architectures for Portuguese news classification, exploring different text representations (BoW, TF-IDF, Embedding) and classification techniques (SVM, CNN, DJINN, BERT) for documents in Portuguese, covering classical methods and current technologies. We show the trade-off between accuracy and training time for this application. We aim to show the capabilities of available algorithms and the challenges faced in the area.


Subjects
Datasets as Topic; Internet; Humans; Algorithms; Portugal
6.
Nature ; 625(7994): 329-337, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38200294

ABSTRACT

Major migration events in Holocene Eurasia have been characterized genetically at broad regional scales¹⁻⁴. However, insights into the population dynamics in the contact zones are hampered by a lack of ancient genomic data sampled at high spatiotemporal resolution⁵⁻⁷. Here, to address this, we analysed shotgun-sequenced genomes from 100 skeletons spanning 7,300 years of the Mesolithic period, Neolithic period and Early Bronze Age in Denmark and integrated these with proxies for diet (¹³C and ¹⁵N content), mobility (⁸⁷Sr/⁸⁶Sr ratio) and vegetation cover (pollen). We observe that Danish Mesolithic individuals of the Maglemose, Kongemose and Ertebølle cultures form a distinct genetic cluster related to other Western European hunter-gatherers. Despite shifts in material culture, they displayed genetic homogeneity from around 10,500 to 5,900 calibrated years before present, when Neolithic farmers with Anatolian-derived ancestry arrived. Although the Neolithic transition was delayed by more than a millennium relative to Central Europe, it was very abrupt and resulted in a population turnover with limited genetic contribution from local hunter-gatherers. The succeeding Neolithic population, associated with the Funnel Beaker culture, persisted for only about 1,000 years before immigrants with eastern Steppe-derived ancestry arrived. This second and equally rapid population replacement gave rise to the Single Grave culture with an ancestry profile more similar to present-day Danes. In our multiproxy dataset, these major demographic events are manifested as parallel shifts in genotype, phenotype, diet and land use.


Subjects
Genome, Human; Genomics; Human Migration; Scandinavians and Nordic People; Humans; Denmark/ethnology; Emigrants and Immigrants/history; Genotype; Scandinavians and Nordic People/genetics; Scandinavians and Nordic People/history; Human Migration/history; Genome, Human/genetics; History, Ancient; Pollen; Diet/history; Hunting/history; Farmers/history; Culture; Phenotype; Datasets as Topic
7.
Radiol Imaging Cancer ; 6(1): e230100, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38240671

ABSTRACT

Purpose: To characterize the demographic distribution of The Cancer Imaging Archive (TCIA) studies and compare them with those of the U.S. cancer population. Materials and Methods: In this retrospective study, data from TCIA studies were examined for the inclusion of demographic information. Of 189 studies in TCIA up until April 2023, a total of 83 human cancer studies were found to contain supporting demographic data. The median patient age and the sex, race, and ethnicity proportions of each study were calculated and compared with those of the U.S. cancer population, provided by the Surveillance, Epidemiology, and End Results Program and the Centers for Disease Control and Prevention U.S. Cancer Statistics Data Visualizations Tool. Results: The median age of TCIA patients was 6.84 years lower than that of the U.S. cancer population (P = .047), and TCIA studies contained more female than male patients (53% vs 47%). American Indian and Alaska Native, Black or African American, and Hispanic patients were underrepresented in TCIA studies by 47.7%, 35.8%, and 14.7%, respectively, compared with the U.S. cancer population. Conclusion: The results demonstrate that the patient demographics of TCIA data sets do not reflect those of the U.S. cancer population, which may decrease the generalizability of artificial intelligence radiology tools developed using these imaging data sets. Keywords: Ethics, Meta-Analysis, Health Disparities, Cancer Health Disparities, Machine Learning, Artificial Intelligence, Race, Ethnicity, Sex, Age, Bias. Published under a CC BY 4.0 license.


Subjects
Neoplasms; Female; Humans; Male; Artificial Intelligence; Ethnicity; Neoplasms/diagnostic imaging; Neoplasms/epidemiology; Retrospective Studies; Racial Groups; Datasets as Topic
8.
Nature ; 626(8000): 792-798, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38297125

ABSTRACT

Crop production is a large source of atmospheric ammonia (NH₃), which poses risks to air quality, human health and ecosystems¹⁻⁵. However, estimating global NH₃ emissions from croplands is subject to uncertainties because of data limitations, thereby limiting the accurate identification of mitigation options and efficacy⁴,⁵. Here we develop a machine learning model for generating crop-specific and spatially explicit NH₃ emission factors globally (5-arcmin resolution) based on a compiled dataset of field observations. We show that global NH₃ emissions from rice, wheat and maize fields in 2018 were 4.3 ± 1.0 Tg N yr⁻¹, lower than previous estimates that did not fully consider fertilizer management practices⁶⁻⁹. Furthermore, spatially optimizing fertilizer management, as guided by the machine learning model, has the potential to reduce the NH₃ emissions by about 38% (1.6 ± 0.4 Tg N yr⁻¹) without altering total fertilizer nitrogen inputs. Specifically, we estimate potential NH₃ emissions reductions of 47% (44-56%) for rice, 27% (24-28%) for maize and 26% (20-28%) for wheat cultivation, respectively. Under future climate change scenarios, we estimate that NH₃ emissions could increase by 4.0 ± 2.7% under SSP1-2.6 and 5.5 ± 5.7% under SSP5-8.5 by 2030-2060. However, targeted fertilizer management has the potential to mitigate these increases.


Subjects
Ammonia; Crop Production; Fertilizers; Ammonia/analysis; Ammonia/metabolism; Crop Production/methods; Crop Production/statistics & numerical data; Crop Production/trends; Datasets as Topic; Ecosystem; Fertilizers/adverse effects; Fertilizers/analysis; Fertilizers/statistics & numerical data; Machine Learning; Nitrogen/analysis; Nitrogen/metabolism; Oryza/metabolism; Soil/chemistry; Triticum/metabolism; Zea mays/metabolism; Climate Change/statistics & numerical data
9.
J Mol Biol ; 436(2): 168374, 2024 01 15.
Article in English | MEDLINE | ID: mdl-38182301

ABSTRACT

Variant effect predictors assess whether a substitution is pathogenic or benign. Most predictors, including those that are structure-based, are designed for globular proteins in aqueous environments and do not consider that the variant residue is located within the membrane. We report Missense3D-TM, which provides a structure-based assessment of the impact of a missense variant located within a membrane. On a dataset of 2,078 pathogenic and 1,060 benign variants, spanning 711 proteins from 706 structures, Missense3D-TM achieved an accuracy of 66%, a Matthews correlation coefficient of 0.37, a sensitivity of 58% and a specificity of 81%. Missense3D-TM performed similarly to mCSM-membrane: accuracy 66% vs 61% (p = 0.02) on an unbalanced test set and 70% vs 67% (p = 0.20) on a balanced test set. The Missense3D-TM website provides an analysis of the structural effects of the variant along with its predicted position within the membrane. The web server is available at http://missense3d.bc.ic.ac.uk/.
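
The accuracy and correlation coefficient quoted in this abstract can be checked against the stated class sizes, sensitivity and specificity. The sketch below is only a consistency check: the confusion-matrix counts are reconstructed from the rounded percentages in the abstract, so they are approximations and not figures reported by the authors, and the function names are illustrative.

```python
import math


def confusion_from_rates(n_pos, n_neg, sensitivity, specificity):
    """Approximate (TP, FN, TN, FP) from class sizes and rounded rates."""
    tp = round(n_pos * sensitivity)
    tn = round(n_neg * specificity)
    return tp, n_pos - tp, tn, n_neg - tn


def accuracy(tp, fn, tn, fp):
    """Fraction of all variants classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def matthews_cc(tp, fn, tn, fp):
    """Matthews correlation coefficient; 0.0 if any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Class sizes and rates as quoted in the abstract (rates are rounded).
counts = confusion_from_rates(2078, 1060, 0.58, 0.81)
print(round(accuracy(*counts), 2), round(matthews_cc(*counts), 2))
```

Under these assumptions the reconstructed values round to the 66% accuracy and 0.37 coefficient given in the abstract, which suggests the four reported figures are mutually consistent.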


Subjects
Membrane Proteins; Mutation, Missense; Protein Domains; Imaging, Three-Dimensional; Datasets as Topic; Membrane Proteins/chemistry; Membrane Proteins/genetics
10.
J Mol Biol ; 436(4): 168444, 2024 02 15.
Article in English | MEDLINE | ID: mdl-38218366

ABSTRACT

Many examples are known of regions of intrinsically disordered proteins that fold into α-helices upon binding to their targets. These helical binding motifs (HBMs) can be partially helical also in the unbound state, and this so-called residual structure can affect binding affinity and kinetics. To investigate the underlying mechanisms governing the formation of residual helical structure, we assembled a dataset of experimental helix contents of 65 peptides containing HBMs that fold upon binding. The average residual helicity is 17% and increases to 60% upon target binding. The helix contents of residual and target-bound structures do not correlate; however, the relative location of helix elements in both states shows a strong overlap. Compared to the general disordered regions, HBMs are enriched in amino acids with high helix preference, and these residues are typically involved in target binding, explaining the overlap in helix positions. In particular, we find that leucine residues and leucine motifs in HBMs are the major contributors to helix stabilization and target binding. For the two model peptides, we show that substitution of leucine motifs with other hydrophobic residues (valine or isoleucine) leads to reduction of residual helicity, supporting the role of leucine as helix stabilizer. Of the three hydrophobic residues, only leucine can efficiently stabilize residual helical structure. We suggest that the high occurrence of leucine motifs and a general preference for leucine at binding interfaces in HBMs can be explained by its unique ability to stabilize helical elements.


Subjects
Intrinsically Disordered Proteins; Leucine; Intrinsically Disordered Proteins/chemistry; Leucine/chemistry; Peptides/chemistry; Protein Structure, Secondary; Amino Acid Motifs; Datasets as Topic; Hydrophobic and Hydrophilic Interactions; Protein Binding; Models, Chemical
11.
In Silico Biol ; 15(1-2): 11-21, 2023.
Article in English | MEDLINE | ID: mdl-37927254

ABSTRACT

Single cell transcriptomics has recently seen a surge in popularity, leading to the need for data analysis pipelines that are reproducible, modular, and interoperable across different systems and institutions. To meet this demand, we introduce scAN1.0, a processing pipeline for analyzing 10X single cell RNA sequencing data. scAN1.0 is built using the Nextflow DSL2 and can be run on most computational systems. The modular design of Nextflow pipelines enables easy integration and evaluation of different blocks for specific analysis steps. We demonstrate the usefulness of scAN1.0 by showing its ability to examine the impact of the mapping step during the analysis of two datasets: (i) a 10X scRNAseq of a human pituitary gonadotroph tumor dataset and (ii) a murine 10X scRNAseq acquired on CD8 T cells during an immune response.


Subjects
RNA-Seq; Single-Cell Gene Expression Analysis; Software; Datasets as Topic; Humans; Animals; Mice; Pituitary Neoplasms/genetics; CD8-Positive T-Lymphocytes; Gene Expression Profiling; Computational Biology; Workflow
12.
Science ; 382(6673): eadi1910, 2023 11 24.
Article in English | MEDLINE | ID: mdl-37995242

ABSTRACT

Microbial systems underpin many biotechnologies, including CRISPR, but the exponential growth of sequence databases makes it difficult to find previously unidentified systems. In this work, we develop the fast locality-sensitive hashing-based clustering (FLSHclust) algorithm, which performs deep clustering on massive datasets in linearithmic time. We incorporated FLSHclust into a CRISPR discovery pipeline and identified 188 previously unreported CRISPR-linked gene modules, revealing many additional biochemical functions coupled to adaptive immunity. We experimentally characterized three HNH nuclease-containing CRISPR systems, including the first type IV system with a specified interference mechanism, and engineered them for genome editing. We also identified and characterized a candidate type VII system, which we show acts on RNA. This work opens new avenues for harnessing CRISPR and for the broader exploration of the vast functional diversity of microbial proteins.


Subjects
CRISPR-Associated Proteins; CRISPR-Cas Systems; Data Mining; Gene Editing; CRISPR-Cas Systems/genetics; Humans; HEK293 Cells; Cluster Analysis; Algorithms; CRISPR-Associated Proteins/chemistry; CRISPR-Associated Proteins/classification; CRISPR-Associated Proteins/genetics; DNA Cleavage; RNA, Guide, CRISPR-Cas Systems; Datasets as Topic; Data Mining/methods
13.
PLoS One ; 18(11): e0286791, 2023.
Article in English | MEDLINE | ID: mdl-37917732

ABSTRACT

Colon cancer is a significant global health problem, and early detection is critical for improving survival rates. Traditional detection methods, such as colonoscopies, can be invasive and uncomfortable for patients. Machine Learning (ML) algorithms have emerged as a promising approach for non-invasive colon cancer classification using genetic data or patient demographics and medical history. One approach is to use ML to analyse genetic data, or patient demographics and medical history, to predict the likelihood of colon cancer. However, due to the challenges imposed by variable gene expression and the high dimensionality of cancer-related datasets, traditional transductive ML applications have limited accuracy and risk overfitting. In this paper, we propose a new hybrid feature selection model called HMLFSM-Hybrid Machine Learning Feature Selection Model to improve colon cancer gene classification. We developed a multifilter hybrid model including a two-phase feature selection approach, combining Information Gain (IG) and Genetic Algorithms (GA), and minimum Redundancy Maximum Relevance (mRMR) coupling with Particle Swarm Optimization (PSO). We critically tested our model on three colon cancer genetic datasets and found that the new framework outperformed other models with significant accuracy improvements (95%, ~97%, and ~94% accuracies for datasets 1, 2, and 3 respectively). The results show that our approach improves the classification accuracy of colon cancer detection by highlighting important and relevant genes, eliminating irrelevant ones, and revealing the genes that have a direct influence on the classification process. For colon cancer gene analysis, and along with our experiments and literature review, we found that selective input feature extraction prior to feature selection is essential for improving predictive performance.
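
Information Gain, one of the filters named in this abstract, scores a feature by how much knowing its value reduces the entropy of the class labels. The sketch below illustrates that single quantity on toy data. It is not the HMLFSM pipeline: the genetic algorithm, mRMR and particle swarm stages are not shown, and the discretised "hi"/"lo" expression levels and sample labels are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data (hypothetical): 1 = tumour sample, 0 = normal sample.
labels = [1, 1, 0, 0]
gene_a = ["hi", "hi", "lo", "lo"]  # separates the classes perfectly
gene_b = ["hi", "lo", "hi", "lo"]  # carries no class information

print(information_gain(gene_a, labels), information_gain(gene_b, labels))
```

In a filter-style selection step, features would typically be ranked by this score and only the top-ranked ones passed on to a later search stage; the cut-off is a design choice that the abstract does not specify.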


Subjects
Colonic Neoplasms; Support Vector Machine; Humans; Algorithms; Colonic Neoplasms/diagnosis; Colonic Neoplasms/genetics; Machine Learning; Datasets as Topic
14.
Nature ; 623(7989): 987-991, 2023 Nov.
Article in English | MEDLINE | ID: mdl-38030778

ABSTRACT

Theories of innovation emphasize the role of social networks and teams as facilitators of breakthrough discoveries¹⁻⁴. Around the world, scientists and inventors are more plentiful and interconnected today than ever before⁴. However, although there are more people making discoveries, and more ideas that can be reconfigured in new ways, research suggests that new ideas are getting harder to find⁵,⁶, contradicting recombinant growth theory⁷,⁸. Here we shed light on this apparent puzzle. Analysing 20 million research articles and 4 million patent applications from across the globe over the past half-century, we begin by documenting the rise of remote collaboration across cities, underlining the growing interconnectedness of scientists and inventors globally. We further show that across all fields, periods and team sizes, researchers in these remote teams are consistently less likely to make breakthrough discoveries relative to their on-site counterparts. Creating a dataset that allows us to explore the division of labour in knowledge production within teams and across space, we find that among distributed team members, collaboration centres on late-stage, technical tasks involving more codified knowledge. Yet they are less likely to join forces in conceptual tasks (such as conceiving new ideas and designing research) when knowledge is tacit⁹. We conclude that despite striking improvements in digital technology in recent years, remote teams are less likely to integrate the knowledge of their members to produce new, disruptive ideas.


Subjects
Diffusion of Innovation; International Cooperation; Inventions; Inventors; Patents as Topic; Research Personnel; Research Report; Datasets as Topic; Group Processes; Knowledge; Patents as Topic/statistics & numerical data; Research Personnel/organization & administration; Research Personnel/psychology; Research Personnel/trends; Research Report/trends; Social Networking; Inventions/classification; Inventions/statistics & numerical data; Inventors/organization & administration; Inventors/psychology; Cooperative Behavior
15.
Sci Data ; 10(1): 837, 2023 11 28.
Article in English | MEDLINE | ID: mdl-38017024

ABSTRACT

Extracellular vesicles (EVs) play major roles in cell-to-cell communication and are excellent biomarker candidates. However, studying plasma extracellular vesicles is challenging due to contaminants. Here, we performed a proteomics meta-analysis of public data to refine the plasma EV composition by separating EV proteins and contaminants into different clusters. We obtained two clusters with a total of 1717 proteins that were depleted of known contaminants and enriched in EV markers, with an independently validated true-positive rate of 71%. These clusters had 133 clusters of differentiation (CD) antigens and were enriched with proteins from cell-to-cell communication and signaling. We compared our data with the proteins deposited in PeptideAtlas, making our refined EV protein list a resource for mechanistic and biomarker studies. As a use case example for this resource, we validated the type 1 diabetes biomarker proplatelet basic protein in EVs and showed that it regulates apoptosis of β cells and macrophages, two key players in the disease development. Our approach provides a refinement of the EV composition and a resource for the scientific community.


Subjects
Extracellular Vesicles; Proteomics; Antigens, CD/metabolism; Biomarkers; Extracellular Vesicles/metabolism; Proteins; Signal Transduction; Datasets as Topic; Humans; Animals
16.
Sci Rep ; 13(1): 18897, 2023 11 02.
Article in English | MEDLINE | ID: mdl-37919325

ABSTRACT

Extent of resection after surgery is one of the main prognostic factors for patients diagnosed with glioblastoma. To achieve this, accurate segmentation and classification of residual tumor from post-operative MR images is essential. The current standard method for estimating it is subject to high inter- and intra-rater variability, and an automated method for segmentation of residual tumor in early post-operative MRI could lead to a more accurate estimation of extent of resection. In this study, two state-of-the-art neural network architectures for pre-operative segmentation were trained for the task. The models were extensively validated on a multicenter dataset with nearly 1000 patients, from 12 hospitals in Europe and the United States. The best performance achieved was a 61% Dice score, and the best classification performance was about 80% balanced accuracy, with a demonstrated ability to generalize across hospitals. In addition, the segmentation performance of the best models was on par with human expert raters. The predicted segmentations can be used to accurately classify the patients into those with residual tumor, and those with gross total resection.
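
The Dice score reported in this abstract measures the overlap between a predicted segmentation and a reference one as twice the shared volume divided by the sum of the two volumes. A minimal sketch on flattened binary masks follows; the mask values are invented, and returning 1.0 when both masks are empty is an assumed convention, not something the abstract states.

```python
def dice_score(pred, truth):
    """Dice overlap between two equal-length binary masks (0/1 values)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2 * intersection / total


# Hypothetical flattened masks: 1 = residual tumour voxel, 0 = background.
predicted = [1, 1, 0, 0]
reference = [1, 0, 1, 0]
print(dice_score(predicted, reference))
```

How the empty-mask case is handled matters for this task in particular, because patients with gross total resection have no residual tumour in the reference segmentation.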


Subjects
Glioblastoma; Humans; Europe; Glioblastoma/diagnostic imaging; Glioblastoma/surgery; Glioblastoma/pathology; Image Processing, Computer-Assisted/methods; Magnetic Resonance Imaging/methods; Neoplasm, Residual/diagnostic imaging; Neural Networks, Computer; Multicenter Studies as Topic; Datasets as Topic
17.
Nature ; 622(7982): 348-358, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37794188

ABSTRACT

High-throughput proteomics platforms measuring thousands of proteins in plasma combined with genomic and phenotypic information have the power to bridge the gap between the genome and diseases. Here we performed association studies of Olink Explore 3072 data generated by the UK Biobank Pharma Proteomics Project¹ on plasma samples from more than 50,000 UK Biobank participants with phenotypic and genotypic data, stratifying on British or Irish, African and South Asian ancestries. We compared the results with those of a SomaScan v4 study on plasma from 36,000 Icelandic people², for 1,514 of whom Olink data were also available. We found modest correlation between the two platforms. Although cis protein quantitative trait loci were detected for a similar absolute number of assays on the two platforms (2,101 on Olink versus 2,120 on SomaScan), the proportion of assays with such supporting evidence for assay performance was higher on the Olink platform (72% versus 43%). A considerable number of proteins had genomic associations that differed between the platforms. We provide examples where differences between platforms may influence conclusions drawn from the integration of protein levels with the study of diseases. We demonstrate how leveraging the diverse ancestries of participants in the UK Biobank helps to detect novel associations and refine genomic location. Our results show the value of the information provided by the two most commonly used high-throughput proteomics platforms and demonstrate the differences between them that at times provide useful complementarity.


Subjects
Blood Proteins; Disease Susceptibility; Genomics; Genotype; Phenotype; Proteomics; Humans; Africa/ethnology; Asia, Southern/ethnology; Biological Specimen Banks; Blood Proteins/analysis; Blood Proteins/genetics; Datasets as Topic; Genome, Human/genetics; Iceland/ethnology; Ireland/ethnology; Plasma/chemistry; Proteome/analysis; Proteome/genetics; Proteomics/methods; Quantitative Trait Loci; United Kingdom
19.
Brief Bioinform ; 24(6), 2023 09 22.
Article in English | MEDLINE | ID: mdl-37798252

ABSTRACT

The emergence of massive datasets exploring the multiple levels of molecular biology has made their analysis and knowledge transfer more complex. Flexible tools to manage big biological datasets could be of great help for standardizing the usage of developed data visualizations and integration methods. Business intelligence (BI) tools have been used in many fields as exploratory tools. They have numerous connectors to link numerous data repositories with a unified graphic interface, offering an overview of data and facilitating interpretation for decision makers. BI tools could be a flexible and user-friendly way of handling molecular biological data with interactive visualizations. However, it is rather uncommon to see such tools used for the exploration of massive and complex datasets in biological fields. We believe that two main obstacles could be the reason. Firstly, we posit that the ways to import data into BI tools are not compatible with biological databases. Secondly, BI tools may not be adapted to certain particularities of complex biological data, namely, the size, the variability of datasets and the availability of specialized visualizations. This paper highlights the use of five BI tools (Elastic Kibana, Siren Investigate, Microsoft Power BI, Salesforce Tableau and Apache Superset) with which the massive data management repository engine Elasticsearch is compatible. Four case studies will be discussed in which these BI tools were applied to biological datasets with different characteristics. We conclude that the performance of the tools depends on the complexity of the biological questions and the size of the datasets.


Subjects
Datasets as Topic; Software; Data Visualization
20.
J Mol Biol ; 435(20): 168260, 2023 10 15.
Article in English | MEDLINE | ID: mdl-37678708

ABSTRACT

Short tandem repeats (STRs) are consecutive repetitions of one to six nucleotide motifs. They are hypervariable due to the high prevalence of repeat unit insertions or deletions primarily caused by polymerase slippage during replication. Genetic variation at STRs has been shown to influence a range of traits in humans, including gene expression, cancer risk, and autism. Until recently STRs have been poorly studied since they pose significant challenges to bioinformatics analyses. Moreover, genome-wide analysis of STR variation in population-scale cohorts requires large amounts of data and computational resources. However, the recent advent of genome-wide analysis tools has resulted in multiple large genome-wide datasets of STR variation spanning nearly two million genomic loci in thousands of individuals from diverse populations. Here we present WebSTR, a database of genetic variation and other characteristics of genome-wide STRs across human populations. WebSTR is based on reference panels of more than 1.7 million human STRs created with state of the art repeat annotation methods and can easily be extended to include additional cohorts or species. It currently contains data based on STR genotypes for individuals from the 1000 Genomes Project, H3Africa, the Genotype-Tissue Expression (GTEx) Project and colorectal cancer patients from the TCGA dataset. WebSTR is implemented as a relational database with programmatic access available through an API and a web portal for browsing data. The web portal is publicly available at https://webstr.ucsd.edu.
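
The definition at the start of this abstract (consecutive repetitions of a motif of one to six nucleotides) can be made concrete with a small scan over a sequence. The sketch below is a naive exact-match finder and is not how WebSTR or its reference panels were built: the function name, the minimum of three copies and the example sequence are assumptions, and real annotation tools also tolerate imperfect repeats.

```python
import re


def find_strs(sequence, min_repeats=3):
    """Return (start, motif, copies) for exact tandem repeats of 1-6 nt motifs."""
    # The lazy quantifier makes the regex try the shortest motif first.
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits


# Hypothetical sequence with four consecutive copies of the motif CAG.
print(find_strs("TTCAGCAGCAGCAGGA"))
```

Variation at such a locus shows up as a different copy count between individuals, which is the kind of genotype the abstract describes the database as storing.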


Subjects
Databases, Genetic; Genetic Variation; Genome, Human; Microsatellite Repeats; Humans; Computational Biology; Genotype; Microsatellite Repeats/genetics; Genome-Wide Association Study; Datasets as Topic; Colorectal Neoplasms/genetics