1.
Genome Res ; 33(10): 1734-1746, 2023 10.
Article in English | MEDLINE | ID: mdl-37879860

ABSTRACT

Although it is ubiquitous in genomics, the current human reference genome (GRCh38) is incomplete: It is missing large sections of heterochromatic sequence, and as a singular, linear reference genome, it does not represent the full spectrum of human genetic diversity. To characterize gaps in GRCh38 and human genetic diversity, we developed an algorithm for sequence location approximation using nuclear families (ASLAN) to identify the region of origin of reads that do not align to GRCh38. Using unmapped reads and variant calls from whole-genome sequences (WGSs), ASLAN uses a maximum likelihood model to identify the most likely region of the genome that a subsequence belongs to given the distribution of the subsequence in the unmapped reads and phasings of families. Validating ASLAN on synthetic data and on reads from the alternative haplotypes in the decoy genome, ASLAN localizes >90% of 100-bp sequences with >92% accuracy and ∼1 Mb of resolution. We then ran ASLAN on 100-mers from unmapped reads from WGS from more than 700 families, and compared ASLAN localizations to alignment of the 100-mers to the recently released T2T-CHM13 assembly. We found that many unmapped reads in GRCh38 originate from telomeres and centromeres that are gaps in GRCh38. ASLAN localizations are in high concordance with T2T-CHM13 alignments, except in the centromeres of the acrocentric chromosomes. Comparing ASLAN localizations and T2T-CHM13 alignments, we identified sequences missing from T2T-CHM13 or sequences with high divergence from their aligned region in T2T-CHM13, highlighting new hotspots for genetic diversity.
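
The maximum-likelihood idea described in this abstract can be illustrated with a toy model. The sketch below is not ASLAN itself: it assumes one sibling pair per family, a presence/absence call for a k-mer in each sibling, and a per-region flag saying whether the two siblings inherited identical haplotypes there. The function name `localize`, the region labels, and the error rate `err` are all hypothetical.

```python
import math

def localize(kmer_presence, sharing, err=0.05):
    """Return the region whose sibling-sharing pattern best explains a k-mer.

    kmer_presence: list of (sib1_has_kmer, sib2_has_kmer), one pair per family.
    sharing: dict mapping region name -> list of booleans, one per family,
             True when the two siblings carry identical haplotypes there.
    err: assumed probability of a discordant observation in a region where
         the siblings are identical (hypothetical value).
    """
    best_region, best_loglik = None, -math.inf
    for region, identical in sharing.items():
        loglik = 0.0
        for (a, b), same in zip(kmer_presence, identical):
            concordant = (a == b)
            if same:
                # Identical haplotypes: the siblings should agree on the k-mer.
                p = 1.0 - err if concordant else err
            else:
                # Toy assumption: no information when haplotypes differ.
                p = 0.5
            loglik += math.log(p)
        if loglik > best_loglik:
            best_region, best_loglik = region, loglik
    return best_region, best_loglik

# Three families; the k-mer's pattern matches the sharing pattern of region A.
presence = [(True, True), (False, False), (True, False)]
sharing = {
    "regionA": [True, True, False],
    "regionB": [False, False, True],
}
region, loglik = localize(presence, sharing)
print(region)
```

With more families, the log-likelihoods of the true region and unrelated regions separate further, which is why family count matters for resolution in this kind of approach.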


Subjects
Human Genome, Genomics, Humans, Algorithms, Telomere/genetics, Genetic Variation, DNA Sequence Analysis
2.
Genome Res ; 33(10): 1747-1756, 2023 10.
Article in English | MEDLINE | ID: mdl-37879861

ABSTRACT

Large, whole-genome sequencing (WGS) data sets containing families provide an important opportunity to identify crossovers and shared genetic material in siblings. However, the high variant calling error rates of WGS in some areas of the genome can result in spurious crossover calls, and the special inheritance status of the X Chromosome presents challenges. We have developed a hidden Markov model that addresses these issues by modeling the inheritance of variants in families in the presence of error-prone regions and inherited deletions. We call our method PhasingFamilies. We validate PhasingFamilies using the platinum genome family NA1281 (precision: 0.81; recall: 0.97), as well as simulated genomes with known crossover positions (precision: 0.93; recall: 0.92). Using 1925 quads from the Simons Simplex Collection, we found that PhasingFamilies resolves crossovers to a median resolution of 3527.5 bp. These crossovers recapitulate existing recombination rate maps, including for the X Chromosome; produce sibling pair IBD that matches expected distributions; and are validated by the haplotype estimation tool SHAPEIT. We provide an efficient, open-source implementation of PhasingFamilies that can be used to identify crossovers from family sequencing data.


Subjects
Genome, Inheritance Patterns, Humans, Whole Genome Sequencing, Haplotypes
3.
Sci Rep ; 13(1): 11353, 2023 07 13.
Article in English | MEDLINE | ID: mdl-37443184

ABSTRACT

While healthy gut microbiomes are critical to human health, pertinent microbial processes remain largely undefined, partially due to differential bias among profiling techniques. By simultaneously integrating multiple profiling methods, multi-omic analysis can define generalizable microbial processes, and is especially useful in understanding complex conditions such as Autism. Challenges with integrating heterogeneous data produced by multiple profiling methods can be overcome using Latent Dirichlet Allocation (LDA), a promising natural language processing technique that identifies topics in heterogeneous documents. In this study, we apply LDA to multi-omic microbial data (16S rRNA amplicon, shotgun metagenomic, shotgun metatranscriptomic, and untargeted metabolomic profiling) from the stool of 81 children with and without Autism. We identify topics, or microbial processes, that summarize complex phenomena occurring within gut microbial communities. We then subset stool samples by topic distribution, and identify metabolites, specifically neurotransmitter precursors and fatty acid derivatives, that differ significantly between children with and without Autism. We identify clusters of topics, deemed "cross-omic topics", which we hypothesize are representative of generalizable microbial processes observable regardless of profiling method. Interpreting topics, we find each represents a particular diet, and we heuristically label each cross-omic topic as: healthy/general function, age-associated function, transcriptional regulation, and opportunistic pathogenesis.


Subjects
Autistic Disorder, Gastrointestinal Microbiome, Microbiota, Child, Humans, Gastrointestinal Microbiome/genetics, Multiomics, 16S Ribosomal RNA/genetics, Microbiota/genetics
4.
Pac Symp Biocomput ; 28: 55-60, 2023.
Article in English | MEDLINE | ID: mdl-36540964

ABSTRACT

The following sections are included: Introduction, Understanding and Predicting Molecular Networks, Making Use of Family Structure, Applying Traditional Graph Algorithms to Novel Tasks, Representing Uncertainty in Networks, Conclusion, References.


Subjects
Algorithms, Computational Biology, Humans
5.
Virol J ; 19(1): 225, 2022 12 24.
Article in English | MEDLINE | ID: mdl-36566197

ABSTRACT

While hundreds of thousands of human whole genome sequences (WGS) have been collected in the effort to better understand genetic determinants of disease, these whole genome sequences have less frequently been used to study another major determinant of human health: the human virome. Using the unmapped reads from WGS of over 1000 families, we present insights into the human blood DNA virome, focusing particularly on human herpesvirus (HHV) 6A, 6B, and 7. In addition to extensively cataloguing the viruses detected in WGS of human whole blood and lymphoblastoid cell lines, we use the family structure of our dataset to show that household drives transmission of several viruses, and identify the Mendelian inheritance patterns characteristic of inherited chromosomally integrated human herpesvirus 6 (iciHHV-6). Consistent with prior studies, we find that 0.6% of our dataset's population has iciHHV, and we locate candidate integration sequences for these cases. We document genetic diversity within exogenous and integrated HHV species and within integration sites of HHV-6. Finally, in the first observation of its kind, we present evidence that suggests widespread de novo HHV-6B integration and HHV-7 integration and reactivation in lymphoblastoid cell lines. These findings show that the unmapped read space of WGS is a promising source of data for virology research.


Subjects
Human Herpesvirus 6, Roseolovirus Infections, Humans, Human Herpesvirus 6/genetics, Virus Integration, Sequence Analysis, Cell Line
6.
Sci Rep ; 12(1): 17034, 2022 10 11.
Article in English | MEDLINE | ID: mdl-36220843

ABSTRACT

Observational studies have shown that the composition of the human gut microbiome in children diagnosed with Autism Spectrum Disorder (ASD) differs significantly from that of their neurotypical (NT) counterparts. Thus far, reported ASD-specific microbiome signatures have been inconsistent. To uncover reproducible signatures, we compiled 10 publicly available raw amplicon and metagenomic sequencing datasets alongside new data generated from an internal cohort (the largest ASD cohort to date), unified them with standardized pre-processing methods, and conducted a comprehensive meta-analysis of all taxa and variables detected across multiple studies. By screening metadata to test associations between the microbiome and 52 variables in multiple patient subsets and across multiple datasets, we determined that differentially abundant taxa in ASD versus NT children were dependent upon age, sex, and bowel function, thus marking these variables as potential confounders in case-control ASD studies. Several taxa, including the strains Bacteroides stercoris t__190463 and Clostridium M bolteae t__180407, and the species Granulicatella elegans and Massilioclostridium coli, exhibited differential abundance in ASD compared to NT children only after subjects with bowel dysfunction were removed. Adjusting for age, sex and bowel function resulted in adding or removing significantly differentially abundant taxa in ASD-diagnosed individuals, emphasizing the importance of collecting and controlling for these metadata. We have performed the largest (n = 690) and most comprehensive systematic analysis of ASD gut microbiome data to date. Our study demonstrated the importance of accounting for confounding variables when designing statistical comparative analyses of ASD- and NT-associated gut bacterial profiles. 
Mitigating these confounders identified robust microbial signatures across cohorts, signifying the importance of accounting for these factors in comparative analyses of ASD and NT-associated gut profiles. Such studies will advance the understanding of different patient groups to deliver appropriate therapeutics by identifying microbiome traits germane to the specific ASD phenotype.


Subjects
Autism Spectrum Disorder, Gastrointestinal Microbiome, Microbiota, Autism Spectrum Disorder/genetics, Bacteria/genetics, Child, Gastrointestinal Microbiome/genetics, Humans, Metagenome
7.
Sci Rep ; 12(1): 9863, 2022 06 14.
Article in English | MEDLINE | ID: mdl-35701436

ABSTRACT

The unmapped readspace of whole genome sequencing data tends to be large but is often ignored. We posit that it contains valuable signals of both human infection and contamination. Using unmapped and poorly aligned reads from whole genome sequences (WGS) of over 1000 families and nearly 5000 individuals, we present insights into common viral, bacterial, and computational contamination that plagues whole genome sequencing studies. We present several notable results: (1) In addition to known contaminants such as Epstein-Barr virus and phiX, sequences from whole blood and lymphocyte cell lines contain many other contaminants, likely originating from storage, prep, and sequencing pipelines. (2) The sequencing plate and biological source of a sample strongly influence its contamination profile. (3) Y-chromosome fragments not on the human reference genome commonly mismap to bacterial reference genomes. Both experiment-derived and computational contamination are prominent in next-generation sequencing data. Such contamination can compromise results from WGS as well as metagenomics studies, and standard protocols for identifying and removing contamination should be developed to ensure the fidelity of sequencing-based studies.


Subjects
Bacteriophages, Epstein-Barr Virus Infections, Computational Biology, Bacterial Genome, Human Genome, Viral Genome, Human Herpesvirus 4/genetics, High-Throughput Nucleotide Sequencing/methods, Humans, Whole Genome Sequencing
8.
Article in English | MEDLINE | ID: mdl-35634270

ABSTRACT

Artificial Intelligence (A.I.) solutions are increasingly considered for telemedicine. For these methods to serve children and their families in home settings, it is crucial to ensure the privacy of the child and parent or caregiver. To address this challenge, we explore the potential for global image transformations to provide privacy while preserving the quality of behavioral annotations. Crowd workers have previously been shown to reliably annotate behavioral features in unstructured home videos, allowing machine learning classifiers to detect autism using the annotations as input. We evaluate this method with videos altered via pixelation, dense optical flow, and Gaussian blurring. On a balanced test set of 30 videos of children with autism and 30 neurotypical controls, we find that the visual privacy alterations do not drastically alter any individual behavioral annotation at the item level. The AUROC on the evaluation set was 90.0% ±7.5% for unaltered videos, 85.0% ±9.0% for pixelation, 85.0% ±9.0% for optical flow, and 83.3% ±9.3% for blurring, demonstrating that an aggregation of small changes across behavioral questions can collectively result in increased misdiagnosis rates. We also compare crowd answers against clinicians who provided the same annotations for the same videos as crowd workers, and we find that clinicians have higher sensitivity in their recognition of autism-related symptoms. We also find that there is a linear correlation (r = 0.75, p < 0.0001) between the mean Clinical Global Impression (CGI) score provided by professional clinicians and the corresponding score emitted by a previously validated autism classifier with crowd inputs, indicating that the classifier's output probability is a reliable estimate of the clinical impression of autism. 
A significant correlation is maintained with privacy alterations, indicating that crowd annotations can approximate clinician-provided autism impression from home videos in a privacy-preserved manner.
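
For readers unfamiliar with the AUROC figures quoted in this abstract, the metric can be computed directly as the fraction of (positive, negative) pairs in which the positive case receives the higher score, counting ties as half. The sketch below uses made-up scores, not data from the study, and the function name `auroc` is an assumption for illustration.

```python
def auroc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison of scores.

    scores_pos: classifier scores for the positive class (e.g. autism).
    scores_neg: classifier scores for the negative class (e.g. controls).
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0      # positive correctly ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical scores: one positive (0.4) is ranked below one negative (0.7).
print(auroc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the drop from 90.0% to roughly 83-85% under the privacy alterations.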

9.
JMIR Public Health Surveill ; 8(7): e31306, 2022 07 21.
Article in English | MEDLINE | ID: mdl-35605128

ABSTRACT

BACKGROUND: Selection bias and unmeasured confounding are fundamental problems in epidemiology that threaten study internal and external validity. These phenomena are particularly dangerous in internet-based public health surveillance, where traditional mitigation and adjustment methods are inapplicable, unavailable, or out of date. Recent theoretical advances in causal modeling can mitigate these threats, but these innovations have not been widely deployed in the epidemiological community. OBJECTIVE: The purpose of our paper is to demonstrate the practical utility of causal modeling to both detect unmeasured confounding and selection bias and guide model selection to minimize bias. We implemented this approach in an applied epidemiological study of the COVID-19 cumulative infection rate in the New York City (NYC) spring 2020 epidemic. METHODS: We collected primary data from Qualtrics surveys of Amazon Mechanical Turk (MTurk) crowd workers residing in New Jersey and New York State across 2 sampling periods: April 11-14 and May 8-11, 2020. The surveys queried the subjects on household health status and demographic characteristics. We constructed a set of possible causal models of household infection and survey selection mechanisms and ranked them by compatibility with the collected survey data. The most compatible causal model was then used to estimate the cumulative infection rate in each survey period. RESULTS: There were 527 and 513 responses collected for the 2 periods, respectively. Response demographics were highly skewed toward a younger age in both survey periods. Despite the extremely strong relationship between age and COVID-19 symptoms, we recovered minimally biased estimates of the cumulative infection rate using only primary data and the most compatible causal model, with a relative bias of +3.8% and -1.9% from the reported cumulative infection rate for the first and second survey periods, respectively. 
CONCLUSIONS: We successfully recovered accurate estimates of the cumulative infection rate from an internet-based crowdsourced sample despite considerable selection bias and unmeasured confounding in the primary data. This implementation demonstrates how simple applications of structural causal modeling can be effectively used to determine falsifiable model conditions, detect selection bias and confounding factors, and minimize estimate bias through model selection in a novel epidemiological context. As the disease and social dynamics of COVID-19 continue to evolve, public health surveillance protocols must continue to adapt; the emergence of Omicron variants and the shift to at-home testing are recent challenges. Rigorous and transparent methods to develop, deploy, and diagnose adapted surveillance protocols will be critical to their success.
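
The relative bias figures reported in the RESULTS of this abstract (+3.8% and -1.9%) follow the usual definition: the difference between estimate and reference, divided by the reference. A minimal sketch, assuming that definition; the numeric inputs below are hypothetical and chosen only to reproduce a +3.8% result, since the abstract does not give the underlying rates.

```python
def relative_bias(estimate, reference):
    """Signed relative bias of an estimate with respect to a reference value."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return (estimate - reference) / reference

# Hypothetical rates: a reference of 25.0% and an estimate of 25.95%.
bias = relative_bias(0.2595, 0.25)
print(f"{bias:+.1%}")
```

The sign indicates direction: a positive value means the survey-based estimate overshoots the reported rate, a negative value means it undershoots.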


Subjects
COVID-19, COVID-19/epidemiology, Epidemiologic Confounding Factors, Humans, Internet, New York City/epidemiology, SARS-CoV-2, Selection Bias
10.
JMIR Pediatr Parent ; 5(2): e26760, 2022 Apr 08.
Article in English | MEDLINE | ID: mdl-35394438

ABSTRACT

BACKGROUND: Automated emotion classification could aid those who struggle to recognize emotions, including children with developmental behavioral conditions such as autism. However, most computer vision emotion recognition models are trained on adult emotion and therefore underperform when applied to child faces. OBJECTIVE: We designed a strategy to gamify the collection and labeling of child emotion-enriched images to boost the performance of automatic child emotion recognition models to a level closer to what will be needed for digital health care approaches. METHODS: We leveraged our prototype therapeutic smartphone game, GuessWhat, which was designed in large part for children with developmental and behavioral conditions, to gamify the secure collection of video data of children expressing a variety of emotions prompted by the game. Independently, we created a secure web interface to gamify the human labeling effort, called HollywoodSquares, tailored for use by any qualified labeler. We gathered and labeled 2155 videos, 39,968 emotion frames, and 106,001 labels on all images. With this drastically expanded pediatric emotion-centric database (>30 times larger than existing public pediatric emotion data sets), we trained a convolutional neural network (CNN) computer vision classifier of happy, sad, surprised, fearful, angry, disgusted, and neutral expressions expressed by children. RESULTS: The classifier achieved a 66.9% balanced accuracy and 67.4% F1-score on the entirety of the Child Affective Facial Expression (CAFE) data set as well as a 79.1% balanced accuracy and 78% F1-score on CAFE Subset A, a subset containing at least 60% human agreement on emotion labels. This performance is at least 10 percentage points higher than all previously developed classifiers evaluated against CAFE, the best of which reached a 56% balanced accuracy even when combining "anger" and "disgust" into a single class.
CONCLUSIONS: This work validates that mobile games designed for pediatric therapies can generate high volumes of domain-relevant data sets to train state-of-the-art classifiers to perform tasks helpful to precision health efforts.
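
Balanced accuracy, the headline metric in this abstract, is the unweighted mean of per-class recall, so a frequent class such as "happy" cannot dominate the score. The sketch below shows the computation on invented labels; it is not the study's evaluation code, and the function name is an assumption.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Invented example: 4 "happy" frames and 2 "sad" frames.
y_true = ["happy", "happy", "happy", "happy", "sad", "sad"]
y_pred = ["happy", "happy", "happy", "sad", "sad", "happy"]
print(balanced_accuracy(y_true, y_pred))  # recall 3/4 and 1/2 average to 0.625
```

Plain accuracy on the same example would be 4/6, which is higher only because the majority class is predicted well.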

11.
J Med Internet Res ; 24(2): e31830, 2022 02 15.
Article in English | MEDLINE | ID: mdl-35166683

ABSTRACT

BACKGROUND: Autism spectrum disorder (ASD) is a widespread neurodevelopmental condition with a range of potential causes and symptoms. Standard diagnostic mechanisms for ASD, which involve lengthy parent questionnaires and clinical observation, often result in long waiting times for results. Recent advances in computer vision and mobile technology hold potential for speeding up the diagnostic process by enabling computational analysis of behavioral and social impairments from home videos. Such techniques can improve objectivity and contribute quantitatively to the diagnostic process. OBJECTIVE: In this work, we evaluate whether home videos collected from a game-based mobile app can be used to provide diagnostic insights into ASD. To the best of our knowledge, this is the first study attempting to identify potential social indicators of ASD from mobile phone videos without the use of eye-tracking hardware, manual annotations, and structured scenarios or clinical environments. METHODS: Here, we used a mobile health app to collect over 11 hours of video footage depicting 95 children engaged in gameplay in a natural home environment. We used automated data set annotations to analyze two social indicators that have previously been shown to differ between children with ASD and their neurotypical (NT) peers: (1) gaze fixation patterns, which represent regions of an individual's visual focus and (2) visual scanning methods, which refer to the ways in which individuals scan their surrounding environment. We compared the gaze fixation and visual scanning methods used by children during a 90-second gameplay video to identify statistically significant differences between the 2 cohorts; we then trained a long short-term memory (LSTM) neural network to determine if gaze indicators could be predictive of ASD. RESULTS: Our results show that gaze fixation patterns differ between the 2 cohorts; specifically, we could identify 1 statistically significant region of fixation (P<.001). 
In addition, we also demonstrate that there are unique visual scanning patterns that exist for individuals with ASD when compared to NT children (P<.001). A deep learning model trained on coarse gaze fixation annotations demonstrates mild predictive power in identifying ASD. CONCLUSIONS: Ultimately, our study demonstrates that heterogeneous video data sets collected from mobile devices hold potential for quantifying visual patterns and providing insights into ASD. We show the importance of automated labeling techniques in generating large-scale data sets while simultaneously preserving the privacy of participants, and we demonstrate that specific social engagement indicators associated with ASD can be identified and characterized using such data.


Subjects
Autism Spectrum Disorder, Mobile Applications, Autism Spectrum Disorder/diagnosis, Child, Handheld Computers, Ocular Fixation, Humans, Social Participation
12.
Pac Symp Biocomput ; 27: 313-324, 2022.
Article in English | MEDLINE | ID: mdl-34890159

ABSTRACT

As the last decade of human genomics research begins to bear the fruit of advancements in precision medicine, it is important to ensure that genomics' improvements in human health are distributed globally and equitably. An important step to ensuring health equity is to improve the human reference genome to capture global diversity by including a wide variety of alternative haplotypes, sequences that are not currently captured on the reference genome. We present a method that localizes 100 basepair (bp) long sequences extracted from short-read sequencing that can ultimately be used to identify what regions of the human genome non-reference sequences belong to. We extract reads that don't align to the reference genome, and compute the population's distribution of 100-mers found within the unmapped reads. We use genetic data from families to identify shared genetic material between siblings and match the distribution of unmapped k-mers to these inheritance patterns to determine the most likely genomic region of a k-mer. We perform this localization with two highly interpretable methods of artificial intelligence: a computationally tractable Hidden Markov Model coupled to a Maximum Likelihood Estimator. Using a set of alternative haplotypes with known locations on the genome, we show that our algorithm is able to localize 96% of k-mers with over 90% accuracy and less than 1 Mb median resolution. As the collection of sequenced human genomes grows larger and more diverse, we hope that this method can be used to improve the human reference genome, a critical step in addressing precision medicine's diversity crisis.


Subjects
Artificial Intelligence, Human Genome, Computational Biology, Genomics, High-Throughput Nucleotide Sequencing, Humans, DNA Sequence Analysis
13.
BMC Bioinformatics ; 22(1): 509, 2021 Oct 19.
Article in English | MEDLINE | ID: mdl-34666677

ABSTRACT

BACKGROUND: Sequencing partial 16S rRNA genes is a cost-effective method for quantifying the microbial composition of an environment, such as the human gut. However, downstream analysis relies on binning reads into microbial groups by either considering each unique sequence as a different microbe, querying a database to get taxonomic labels from sequences, or clustering similar sequences together. These approaches do not fully capture evolutionary relationships between microbes, limiting the ability to identify differentially abundant groups of microbes between a diseased and control cohort. We present sequence-based biomarkers (SBBs), an aggregation method that groups and aggregates microbes using single variants and combinations of variants within their 16S sequences. We compare SBBs against other existing aggregation methods (OTU clustering and MicroPheno or DiTaxa features) in several benchmarking tasks: biomarker discovery via permutation test, biomarker discovery via linear discriminant analysis, and phenotype prediction power. We demonstrate that SBBs perform on par with or better than the state-of-the-art methods in biomarker discovery and phenotype prediction. RESULTS: On two independent datasets, SBBs identify differentially abundant groups of microbes with similar or higher statistical significance than existing methods in both a permutation-test-based analysis and using linear discriminant analysis effect size. By grouping microbes by SBB, we can identify several differentially abundant microbial groups (FDR<.1) between children with autism and neurotypical controls in a set of 115 discordant siblings. Porphyromonadaceae, Ruminococcaceae, and an unnamed species of Blastocystis were significantly enriched in autism, while Veillonellaceae was significantly depleted. Likewise, aggregating microbes by SBB on a dataset of obese and lean twins, we find several significantly differentially abundant microbial groups (FDR<.1).
We observed Megasphaera and Sutterellaceae highly enriched in obesity, and Phocaeicola significantly depleted. SBBs also perform on par with or better than existing aggregation methods as features in a phenotype prediction model, predicting the autism phenotype with an ROC-AUC score of .64 and the obesity phenotype with an ROC-AUC score of .84. CONCLUSIONS: SBBs provide a powerful method for aggregating microbes to perform differential abundance analysis as well as phenotype prediction. Our source code can be freely downloaded from http://github.com/briannachrisman/16s_biomarkers .
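
One of the benchmarks named in this abstract is biomarker discovery via permutation test. The sketch below shows a generic two-sample permutation test on the difference in mean abundance; it is not taken from the linked repository, and the function name, the abundance values, and the default number of permutations are assumptions for illustration.

```python
import random

def permutation_test(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        # Reassign group labels at random by shuffling the pooled values.
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        # Small tolerance guards against floating-point round-off.
        if abs(mean_a - mean_b) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the p-value strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical relative abundances of one microbial group in two cohorts.
cases = [10, 11, 12, 13, 14, 15]
controls = [0, 1, 2, 3, 4, 5]
print(permutation_test(cases, controls, n_perm=2000))
```

When many microbial groups are tested this way, the resulting p-values still need multiple-testing correction, which is the role of the FDR threshold quoted in the abstract.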


Subjects
Gastrointestinal Microbiome, Biomarkers, Cluster Analysis, Gastrointestinal Microbiome/genetics, Humans, 16S Ribosomal RNA/genetics, Software
14.
BioData Min ; 14(1): 28, 2021 May 03.
Article in English | MEDLINE | ID: mdl-33941233

ABSTRACT

BACKGROUND: Machine learning approaches for predicting disease risk from high-dimensional whole genome sequence (WGS) data often result in unstable models that can be difficult to interpret, limiting the identification of putative sets of biomarkers. Here, we design and validate a graph-based methodology based on maximum flow, which leverages the presence of linkage disequilibrium (LD) to identify stable sets of variants associated with complex multigenic disorders. RESULTS: We apply our method to a previously published logistic regression model trained to identify variants in simple repeat sequences associated with autism spectrum disorder (ASD); this L1-regularized model exhibits high predictive accuracy yet demonstrates great variability in the features selected from over 230,000 possible variants. In order to improve model stability, we extract the variants assigned non-zero weights in each of 5 cross-validation folds and then assemble the five sets of features into a flow network subject to LD constraints. The maximum flow formulation allowed us to identify 55 variants, which we show to be more stable than the features identified by the original classifier. CONCLUSION: Our method allows for the creation of machine learning models that can identify predictive variants. Our results help pave the way towards biomarker-based diagnosis methods for complex genetic disorders.
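
The abstract does not say how the flow network is built from cross-validation folds and LD constraints, so the following is only a generic illustration of the underlying primitive: a maximum-flow computation using the Edmonds-Karp algorithm. The toy graph, node names, and capacities are hypothetical.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow on a dict-of-dicts capacity graph."""
    if source == sink:
        raise ValueError("source and sink must differ")
    # Residual graph: forward capacities plus zero-capacity reverse edges.
    residual = {source: {}, sink: {}}
    for u, neighbours in capacity.items():
        residual.setdefault(u, {})
        for v, cap in neighbours.items():
            residual[u][v] = residual[u].get(v, 0) + cap
            residual.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # Breadth-first search for the shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push flow along the path and update reverse edges.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

toy = {"s": {"a": 3, "b": 2}, "a": {"b": 1, "t": 2}, "b": {"t": 3}}
print(max_flow(toy, "s", "t"))
```

In a feature-selection setting, edges saturated by the maximum flow indicate which nodes carry the flow, but how the paper maps variants and LD blocks onto nodes and capacities would need to be taken from the full text.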

15.
BioData Min ; 14(1): 27, 2021 Apr 23.
Article in English | MEDLINE | ID: mdl-33892748

ABSTRACT

BACKGROUND: As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates since it allows us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample. RESULTS: We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method's versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites. 
CONCLUSION: Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
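
The observable signal this abstract relies on is the Mendelian error: a child genotype that cannot be formed by taking one allele from each parent. The sketch below covers only that basic check for autosomal diploid sites and a raw error fraction; it does not reproduce the paper's estimation of per-sample precision and recall, and the function names and example trios are hypothetical.

```python
def is_mendelian_consistent(child, mother, father):
    """True if the child's two alleles can come one from each parent.

    Genotypes are pairs of allele codes, e.g. (0, 1) for heterozygous.
    Assumes an autosomal, diploid site.
    """
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)

def mendelian_error_fraction(trios):
    """Fraction of (child, mother, father) genotype trios that are inconsistent."""
    errors = sum(1 for c, m, f in trios if not is_mendelian_consistent(c, m, f))
    return errors / len(trios)

# Hypothetical genotype calls at four sites for one trio.
sites = [
    ((0, 1), (0, 0), (1, 1)),  # consistent
    ((0, 0), (0, 1), (0, 1)),  # consistent
    ((1, 1), (0, 0), (0, 1)),  # error: mother cannot contribute allele 1
    ((0, 1), (0, 1), (0, 0)),  # consistent
]
print(mendelian_error_fraction(sites))
```

Only a fraction of sequencing errors surface this way, since many miscalls still produce a consistent trio, which is why the abstract describes extrapolating from observed Mendelian errors to genome-wide estimates.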

16.
Sci Rep ; 11(1): 7620, 2021 04 07.
Article in English | MEDLINE | ID: mdl-33828118

ABSTRACT

Standard medical diagnosis of mental health conditions requires licensed experts who are increasingly outnumbered by those at risk, limiting reach. We test the hypothesis that a trustworthy crowd of non-experts can efficiently annotate behavioral features needed for accurate machine learning detection of the common childhood developmental disorder Autism Spectrum Disorder (ASD) for children under 8 years old. We implement a novel process for identifying and certifying a trustworthy distributed workforce for video feature extraction, selecting a workforce of 102 workers from a pool of 1,107. Two previously validated ASD logistic regression classifiers, evaluated against parent-reported diagnoses, were used to assess the accuracy of the trusted crowd's ratings of unstructured home videos. A representative balanced sample of videos (N = 50) was evaluated with and without face box and pitch shift privacy alterations, with AUROC and AUPRC scores > 0.98. With both privacy-preserving modifications, sensitivity is preserved (96.0%) while maintaining specificity (80.0%) and accuracy (88.0%) at levels comparable to prior classification methods without alterations. We find that machine learning classification from features extracted by a certified nonexpert crowd achieves high performance for ASD detection from natural home videos of the child at risk and maintains high sensitivity when privacy-preserving mechanisms are applied. These results suggest that privacy-safeguarded crowdsourced analysis of short home videos can help enable rapid and mobile machine-learning detection of developmental delays in children.
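
The three percentages reported with privacy alterations are mutually consistent with a balanced set of 50 videos. Assuming 25 ASD and 25 non-ASD videos, counts of 24 true positives, 1 false negative, 20 true negatives, and 5 false positives reproduce them exactly. These counts are inferred from the percentages and are not stated in the abstract; the sketch below simply checks the arithmetic.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Inferred (not reported) counts for a balanced sample of 50 videos.
sens, spec, acc = classification_metrics(tp=24, fn=1, tn=20, fp=5)
print(f"{sens:.1%} {spec:.1%} {acc:.1%}")
```

On a balanced sample, accuracy is the mean of sensitivity and specificity, which is why 96.0% and 80.0% yield 88.0%.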


Subjects
Autism Spectrum Disorder/diagnosis , Behavior Observation Techniques/methods , Crowdsourcing/methods , Adult , Algorithms , Child , Child, Preschool , Data Accuracy , Female , Humans , Logistic Models , Machine Learning , Male , Mental Disorders/diagnosis , Middle Aged , Sensitivity and Specificity
17.
BioData Min ; 14(1): 20, 2021 Mar 20.
Article in English | MEDLINE | ID: mdl-33743803

ABSTRACT

The evolutionary dynamics of SARS-CoV-2 have been carefully monitored since the COVID-19 pandemic began in December 2019. However, analysis has focused primarily on single nucleotide polymorphisms and largely ignored the role of insertions and deletions (indels) as well as recombination in SARS-CoV-2 evolution. Using sequences from the GISAID database, we catalogue over 100 insertions and deletions in the SARS-CoV-2 consensus sequences. We hypothesize that these indels are artifacts of recombination events between SARS-CoV-2 replicates whereby RNA-dependent RNA polymerase (RdRp) re-associates with a homologous template at a different locus ("imperfect homologous recombination"). We provide several independent pieces of evidence that suggest this. (1) The indels from the GISAID consensus sequences are clustered at specific regions of the genome. (2) These regions are also enriched for 5' and 3' breakpoints in the transcription regulatory site (TRS) independent transcriptome, presumably sites of RdRp template-switching. (3) Within raw reads, these indel hotspots have cases of both high intra-host heterogeneity and intra-host homogeneity, suggesting that these indels are both consequences of de novo recombination events within a host and artifacts of previous recombination. We briefly analyze the indels in the context of RNA secondary structure, noting that indels preferentially occur in "arms" and loop structures of the predicted folded RNA, suggesting that secondary structure may be a mechanism for TRS-independent template-switching in SARS-CoV-2 or other coronaviruses. These insights into the relationship between structural variation and recombination in SARS-CoV-2 can improve our reconstructions of the SARS-CoV-2 evolutionary history as well as our understanding of the process of RdRp template-switching in RNA viruses.

18.
Pac Symp Biocomput ; 26: 14-25, 2021.
Article in English | MEDLINE | ID: mdl-33691000

ABSTRACT

Crowd-powered telemedicine has the potential to revolutionize healthcare, especially during times that require remote access to care. However, sharing private health data with strangers from around the world is not compatible with data privacy standards, requiring a stringent filtration process to recruit reliable and trustworthy workers who can go through the proper training and security steps. The key challenge, then, is to identify capable, trustworthy, and reliable workers through high-fidelity evaluation tasks without exposing any sensitive patient data during the evaluation process. We contribute a set of experimentally validated metrics for assessing the trustworthiness and reliability of crowd workers tasked with providing behavioral feature tags to unstructured videos of children with autism and matched neurotypical controls. The workers are blinded to diagnosis and blinded to the goal of using the features to diagnose autism. These behavioral labels are fed as input to a previously validated binary logistic regression classifier for detecting autism cases using categorical feature vectors. While the metrics do not incorporate any ground truth labels of child diagnosis, linear regression using the 3 correlative metrics as input can predict the mean probability of the correct class of each worker with a mean average error of 7.51% for performance on the same set of videos and 10.93% for performance on a distinct balanced video set with different children. These results indicate that crowd workers can be recruited for performance based largely on behavioral metrics on a crowdsourced task, enabling an affordable way to filter crowd workforces into a trustworthy and reliable diagnostic workforce.


Subjects
Autism Spectrum Disorder , Autistic Disorder , Telemedicine , Autism Spectrum Disorder/diagnosis , Child , Computational Biology , Humans , Reproducibility of Results
19.
ISME Commun ; 1(1): 80, 2021 Dec 18.
Article in English | MEDLINE | ID: mdl-37938270

ABSTRACT

Autism Spectrum Disorder (ASD) is a complex neurodevelopmental disorder influenced by both genetic and environmental factors. Recently, gut dysbiosis has emerged as a powerful contributor to ASD symptoms. In this study, we recruited over 100 age-matched sibling pairs (between 2 and 8 years old) where one had an ASD diagnosis and the other was developing typically (TD) (432 samples total). We collected stool samples over four weeks, tracked over 100 lifestyle and dietary variables, and surveyed behavior measures related to ASD symptoms. We identified 117 amplicon sequencing variants (ASVs) that were significantly different in abundance between sibling pairs across all three timepoints, 11 of which were supported by at least two contrast methods. We additionally identified dietary and lifestyle variables that differ significantly between cohorts, and further linked those variables to the ASVs they statistically relate to. Overall, dietary and lifestyle features were explanatory of ASD phenotype using logistic regression; however, global compositional microbiome features were not. Leveraging our longitudinal behavior questionnaires, we additionally identified 11 ASVs associated with changes in reported anxiety over time within and across all individuals. Lastly, we find that overall microbiome composition (beta-diversity) is associated with specific ASD-related behavioral characteristics.

20.
Cognit Comput ; 13(5): 1363-1373, 2021 Sep.
Article in English | MEDLINE | ID: mdl-35669554

ABSTRACT

Background/Introduction: Emotion detection classifiers traditionally predict discrete emotions. However, emotion expressions are often subjective, thus requiring a method to handle compound and ambiguous labels. We explore the feasibility of using crowdsourcing to acquire reliable soft-target labels and evaluate an emotion detection classifier trained with these labels. We hypothesize that training with labels that are representative of the diversity of human interpretation of an image will result in predictions that are similarly representative on a disjoint test set. We also hypothesize that crowdsourcing can generate distributions which mirror those generated in a lab setting. Methods: We center our study on the Child Affective Facial Expression (CAFE) dataset, a gold standard collection of images depicting pediatric facial expressions along with 100 human labels per image. To test the feasibility of crowdsourcing to generate these labels, we used Microworkers to acquire labels for 207 CAFE images. We evaluate both unfiltered workers as well as workers selected through a short crowd filtration process. We then train two versions of a ResNet-152 neural network on soft-target CAFE labels using the original 100 annotations provided with the dataset: (1) a classifier trained with traditional one-hot encoded labels, and (2) a classifier trained with vector labels representing the distribution of CAFE annotator responses. We compare the resulting softmax output distributions of the two classifiers with a 2-sample independent t-test of L1 distances between the classifier's output probability distribution and the distribution of human labels. Results: While agreement with CAFE is weak for unfiltered crowd workers, the filtered crowd agree with the CAFE labels 100% of the time for happy, neutral, sad and "fear + surprise", and 88.8% for "anger + disgust". While the F1-score for a one-hot encoded classifier is much higher (94.33% vs. 78.68%) with respect to the ground truth CAFE labels, the output probability vector of the crowd-trained classifier more closely resembles the distribution of human labels (t=3.2827, p=0.0014). Conclusions: For many applications of affective computing, reporting an emotion probability distribution that accounts for the subjectivity of human interpretation can be more useful than an absolute label. Crowdsourcing, including a sufficient filtering mechanism for selecting reliable crowd workers, is a feasible solution for acquiring soft-target labels.
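
The contrast between one-hot and soft-target labels, and the L1 distance used to compare distributions, can be sketched in plain Python. This is an illustration under stated assumptions, not the study's training code: the three emotion classes, the vote counts, and the function names are hypothetical, and cross-entropy against a soft target is shown as one common way to train on distribution labels, which the abstract does not specify.

```python
import math


def soft_label(votes):
    """Turn per-class annotator vote counts into a probability distribution."""
    total = sum(votes)
    return [v / total for v in votes]


def one_hot_label(votes):
    """Keep only the majority class, discarding annotator disagreement."""
    top = votes.index(max(votes))
    return [1.0 if i == top else 0.0 for i in range(len(votes))]


def l1_distance(p, q):
    """Sum of absolute differences between two distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


def soft_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of a predicted distribution against a soft target."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


# Hypothetical image: 100 annotators vote over (happy, neutral, sad).
votes = [70, 20, 10]
soft = soft_label(votes)        # [0.7, 0.2, 0.1]
hard = one_hot_label(votes)     # [1.0, 0.0, 0.0]
gap = l1_distance(hard, soft)   # 0.3 + 0.2 + 0.1
```

A classifier that outputs the one-hot vector for this image is 0.6 away from the human label distribution in L1 terms, even though it predicts the majority class correctly; this is the kind of gap the abstract's t-test on L1 distances measures.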
