Results 1 - 20 of 86
1.
Sci Rep ; 14(1): 8855, 2024 04 17.
Article in English | MEDLINE | ID: mdl-38632488

ABSTRACT

Health and disease are fundamentally influenced by microbial communities and their genes (the microbiome). An in-depth analysis of microbiome structure that enables the classification of individuals based on their health status can be crucial for enhancing diagnostics and treatment strategies. In this paper, we present a novel semi-supervised methodology, Randomized Feature Selection based Latent Dirichlet Allocation (RFSLDA), to study the impact of the gut microbiome on a subject's health status. Since the data in our study consist of fuzzy, self-reported health labels, traditional supervised learning approaches may not be suitable. As a first step, exploiting the similarity between documents in text analysis and gut-microbiome data, we employ Latent Dirichlet Allocation (LDA), a topic-modeling approach that uses microbiome counts as features to group subjects into relatively homogeneous clusters, without invoking any knowledge of their observed health status (labels). We then leverage the observed health status of subjects to associate each cluster with its most similar health status, making the approach semi-supervised. Finally, a feature selection technique is incorporated into the model to improve overall classification performance. The proposed method provides a semi-supervised topic-modeling approach that helps handle the high dimensionality of microbiome data in association studies. Our experiments show that our semi-supervised classification algorithm achieves higher classification accuracy than popular supervised learning approaches such as SVM and the multinomial logistic model, while remaining efficient. The RFSLDA framework is attractive because it (i) enhances clustering accuracy by identifying key bacteria types as indicators of health status, (ii) identifies key bacteria types within each group based on estimates of the proportions of bacteria types within the groups, and (iii) computes a measure of within-group similarity to identify highly similar subjects in terms of their health status.
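
The clustering-then-labeling idea above can be sketched in a few lines. The snippet below is an illustrative toy, not the authors' implementation: it fits scikit-learn's LDA on simulated microbiome counts and maps each latent cluster to its majority health label; the data, cluster count, and mapping rule are all assumptions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(60, 40))      # subjects x bacteria-type counts (toy)
labels = rng.integers(0, 2, size=60)     # fuzzy, self-reported health labels

lda = LatentDirichletAllocation(n_components=3, random_state=0)
theta = lda.fit_transform(X)             # per-subject topic proportions
cluster = theta.argmax(axis=1)           # dominant topic = cluster

# Semi-supervised step: associate each cluster with its most common label.
cluster_to_label = {c: np.bincount(labels[cluster == c]).argmax()
                    for c in np.unique(cluster)}
pred = np.array([cluster_to_label[c] for c in cluster])
print("training agreement:", (pred == labels).mean())
```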


Subjects
Gastrointestinal Microbiome; Microbiota; Humans; Algorithms
2.
J Phys Chem A ; 127(40): 8437-8446, 2023 Oct 12.
Article in English | MEDLINE | ID: mdl-37773038

ABSTRACT

Machine learning models are widely used in science and engineering to predict the properties of materials and solve complex problems. However, training large models can take days and tuning hyperparameters can take months, making it challenging to achieve optimal performance. To address this issue, we propose a Knowledge Enhancing (KE) algorithm that transfers knowledge gained from a lower-capacity model to a higher-capacity model, improving training efficiency and performance. We focus on the problem of predicting the bandgap of an unknown material and present a theoretical analysis and experimental verification of our algorithm. Our experiments show that the performance of our knowledge-enhancement model improves by at least 10.21% over current methods on OMDB datasets. We believe that our generic idea of knowledge enhancement will be useful for solving other problems and provides a promising direction for future research.
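
Since the abstract does not spell out KE's mechanics, the sketch below shows one generic way to pass knowledge from a low-capacity model to a higher-capacity one: the small model's prediction is fed to the larger model as an extra feature. Data, model choices, and the transfer rule are illustrative assumptions, not the paper's method.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))                             # toy material descriptors
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=500)   # toy "bandgap" target

small = Ridge().fit(X, y)                        # cheap low-capacity model
X_aug = np.column_stack([X, small.predict(X)])   # inject its knowledge as a feature
big = GradientBoostingRegressor().fit(X_aug, y)  # higher-capacity model trains on it
```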

3.
Sci Rep ; 13(1): 3292, 2023 Feb 25.
Article in English | MEDLINE | ID: mdl-36841850

ABSTRACT

Recent advances in technology have led to an explosion of data in virtually all domains of our lives. Modern biomedical devices can acquire a large number of physical readings from patients, often stored as time series. Such time series data can form the basis for important research to advance healthcare and well-being. Due to several considerations, including data size and patient privacy, the original, full data may not be available to secondary parties or researchers; instead, a subset of the data may be made available. A fast and reliable record linkage algorithm enables us to accurately match patient records in the original and subset databases while maintaining privacy. The problem of record linkage when the attributes include time series has not been studied much in the literature. We make two main contributions in this paper. First, we propose a novel, very efficient, and scalable record linkage algorithm for time series data; it is 400× faster than previous work. Second, we introduce a privacy-preserving framework that enables health institutions to safely release their raw time-series records to researchers with a bare minimum of identifying information.
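
As a rough illustration of linkage on time series, the toy below matches each released record to its nearest original record under Euclidean distance after z-normalization. The actual algorithm's speed comes from pruning and indexing techniques not shown here, and the privacy framework is omitted entirely.

```python
import numpy as np

def znorm(ts):
    # Normalize each series to zero mean, unit variance.
    return (ts - ts.mean(axis=-1, keepdims=True)) / (ts.std(axis=-1, keepdims=True) + 1e-9)

original = znorm(np.random.default_rng(2).normal(size=(100, 64)))  # full database
subset = original[[3, 17, 42]] + 0.01                              # released subset

# Link each released record to its closest original record.
d = ((subset[:, None, :] - original[None, :, :]) ** 2).sum(axis=2)
print(d.argmin(axis=1))   # expected: [ 3 17 42]
```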

5.
J Biomed Inform ; 130: 104094, 2022 06.
Article in English | MEDLINE | ID: mdl-35550929

ABSTRACT

Record linkage is an important problem studied widely in many domains, including biomedical informatics. A standard version of the problem is to cluster records from several datasets such that each cluster holds records pertaining to just one individual. Because datasets are typically huge, existing record linkage algorithms take a very long time, so developing fast algorithms is essential. The incremental version of the problem is to link previously clustered records with new records added to the input datasets. We have created a novel algorithm that efficiently performs both standard and incremental record linkage. It leverages a set of efficient techniques that significantly restrict the number of record-pair comparisons and distance computations. Our algorithm shows an average speed-up of 2.4× (up to 4×) on the standard linkage problem compared to the state-of-the-art, without any drop in linkage performance. On average, it can incrementally link records in just 33% of the time required to link them from scratch. Our algorithms achieve comparable or superior linkage performance and outperform the state-of-the-art in linking time in all cases where the number of comparison attributes is greater than two, which is common in practice. The proposed algorithm is very efficient and could be used in practical record linkage applications, especially when records are added over time and the linkage output must be updated frequently.
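
One standard way to restrict record-pair comparisons, as the abstract describes, is blocking: only records sharing a cheap key are ever compared. The sketch below is a minimal, generic example with hypothetical fields; the paper's filters are more sophisticated.

```python
from collections import defaultdict

records = [("r1", "smith", "1980"), ("r2", "smyth", "1980"), ("r3", "jones", "1975")]

blocks = defaultdict(list)
for rec in records:
    blocks[(rec[1][0], rec[2])].append(rec)   # key: first letter + birth year

pairs = [(a, b) for block in blocks.values()
         for i, a in enumerate(block) for b in block[i + 1:]]
print(pairs)   # only (r1, r2) survive blocking; r3 is never compared
```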


Subjects
Algorithms; Medical Record Linkage; Medical Record Linkage/methods
6.
Sci Total Environ ; 811: 151406, 2022 Mar 10.
Article in English | MEDLINE | ID: mdl-34748851

ABSTRACT

Indoor radon concentrations are controlled by both human factors and geological factors, and it is important to separate the anthropogenic and geogenic contributions. We show that there is a positive correlation between the radiometric map of uranium in the ground and radon measured in households in Sweden. A map of gamma radiation is used to obtain an equivalent uranium concentration (ppm eU) for each postcode area. The aggregated uranium content is compared to the yearly average indoor radon concentration for different types of houses. Interestingly, modern households show reduced radon concentrations even in postcode areas with high average uranium concentrations. This shows that modern construction is effective at reducing the correlation with background uranium concentrations and at minimizing the health risk associated with radon exposure. These correlations and predictive housing parameters could assist in monitoring higher-risk areas.


Subjects
Air Pollutants, Radioactive; Air Pollution, Indoor; Radiation Monitoring; Radon; Uranium; Air Pollutants, Radioactive/analysis; Air Pollution, Indoor/analysis; Housing; Humans; Radon/analysis; Sweden; Uranium/analysis
8.
Hum Genomics ; 15(1): 66, 2021 11 09.
Article in English | MEDLINE | ID: mdl-34753514

ABSTRACT

BACKGROUND: We are now observing an explosion of gene expression data with phenotypes, which enables us to accurately identify genes responsible for particular medical conditions and to classify them as drug targets. Like other phenotype data in the medical domain, gene expression data with phenotypes form a heavily underdetermined system. In domains with a very large number of features but a very small sample size (e.g., DNA microarray, RNA-seq, and GWAS data), it is often reported that several contrasting feature subsets may yield nearly equally optimal results; this phenomenon is known as instability. Considering these facts, we have developed a supervised gene selection algorithm that selects a set of robust and stable genes with better prediction ability from gene expression datasets with phenotypes. Stability and robustness are ensured by class-level and instance-level perturbations, respectively. RESULTS: We have performed rigorous experimental evaluations on 10 real gene expression microarray datasets with phenotypes. They reveal that our algorithm outperforms the state-of-the-art algorithms with respect to stability and classification accuracy. We have also performed biological enrichment analysis based on gene ontology-biological process (GO-BP) terms, disease ontology (DO) terms, and biological pathways. CONCLUSIONS: The performance evaluations show that our proposed method is an effective and efficient supervised gene selection algorithm.
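
The instance-level half of this perturbation scheme can be illustrated with a simple bootstrap wrapper around a univariate selector; the snippet below is a hedged sketch on simulated data, not the authors' algorithm, and it omits the class-level perturbations.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 1000))     # samples x genes (toy)
y = rng.integers(0, 2, size=80)     # binary phenotype

counts = np.zeros(X.shape[1])
for _ in range(50):                 # instance-level perturbation via bootstrap
    idx = rng.choice(len(y), size=len(y), replace=True)
    sel = SelectKBest(f_classif, k=20).fit(X[idx], y[idx])
    counts[sel.get_support()] += 1

stable_genes = np.argsort(counts)[-20:]   # most consistently selected genes
```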


Subjects
Algorithms; Machine Learning; Oligonucleotide Array Sequence Analysis/methods; Phenotype
9.
PLoS One ; 16(8): e0253383, 2021.
Article in English | MEDLINE | ID: mdl-34437542

ABSTRACT

The dimensionality of the spatially distributed channels and the temporal resolution of electroencephalogram (EEG)-based brain-computer interfaces (BCI) undermine emotion recognition models. Thus, before modeling such data in the final stage of the learning pipeline, adequate preprocessing, transformation, and extraction of temporal (i.e., time-series signal) and spatial (i.e., electrode channel) features are essential to recognize underlying human emotions. Conventionally, inter-subject variations are dealt with by avoiding the sources of variation (e.g., outliers) or by turning the problem into a subject-dependent one. We address this issue by preserving and learning from individual particularities in responses to affective stimuli. This paper investigates and proposes a subject-independent emotion recognition framework that mitigates subject-to-subject variability in such systems. Using an unsupervised feature selection algorithm, we reduce the feature space extracted from time-series signals. For the spatial features, we propose a subject-specific unsupervised learning algorithm that learns from inter-channel co-activation online. We tested this framework on real EEG benchmarks, namely DEAP, MAHNOB-HCI, and DREAMER. We trained and tested the selection outcomes using nested cross-validation and a support vector machine (SVM), and compared our results with state-of-the-art subject-independent algorithms. Our results show enhanced performance, classifying human affect (i.e., valence and arousal) 16%-27% more accurately than other studies. This work not only outperforms other subject-independent studies reported in the literature but also proposes an online analysis solution for affect recognition.
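
A minimal stand-in for the evaluation pipeline described above, unsupervised feature reduction followed by an SVM under cross-validation, is sketched below with scikit-learn; the variance-based selector and simulated data are placeholders for the paper's specific algorithms and EEG features.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 300))     # EEG-derived features (toy)
y = rng.integers(0, 2, size=120)    # e.g., high/low valence

# Unsupervised selection (no labels used), then an SVM classifier.
pipe = make_pipeline(VarianceThreshold(0.8), StandardScaler(), SVC())
print(cross_val_score(pipe, X, y, cv=5).mean())
```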


Subjects
Brain-Computer Interfaces; Electroencephalography/methods; Emotions; Algorithms; Electrodes; Humans; Support Vector Machine
10.
J Comput Biol ; 28(11): 1104-1112, 2021 11.
Article in English | MEDLINE | ID: mdl-34448623

ABSTRACT

A biological pathway is an ordered set of interactions between intracellular molecules whose collective activity impacts cellular function, for example by controlling metabolite synthesis or by regulating the expression of sets of genes. Pathways play a key role in advanced studies of genomics. However, existing pathway analytics methods are inadequate for extracting the meaningful biological structure underlying the network of pathways, and they lack automation. Given these circumstances, we propose a novel graph-theoretic method to analyze disease-related genes through a weighted network of biological pathways. The method automatically extracts biological structure hidden in the complex network, such as clusters of pathways and their relevance and the significance of each pathway and gene. We have demonstrated the effectiveness of the proposed method on a set of genes associated with coronavirus disease 2019.
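
To make the weighted-pathway-network idea concrete, the toy below scores pathways in a small weighted graph with networkx, using PageRank as one possible proxy for pathway significance; the pathway names, edge weights, and scoring choice are illustrative assumptions.

```python
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("TNF signaling", "IL-17 signaling", 12),   # weight = shared genes (toy)
    ("TNF signaling", "NF-kB signaling", 9),
    ("IL-17 signaling", "NF-kB signaling", 7),
    ("NF-kB signaling", "Apoptosis", 3),
])
scores = nx.pagerank(G, weight="weight")
print(max(scores, key=scores.get))   # most central pathway in the network
```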


Subjects
Algorithms; COVID-19/genetics; COVID-19/metabolism; Computational Biology/methods; Metabolic Networks and Pathways/genetics; Databases, Genetic; Humans
11.
J Perinatol ; 41(5): 1100-1109, 2021 05.
Article in English | MEDLINE | ID: mdl-33589729

ABSTRACT

OBJECTIVE: To investigate seasonality and temporal trends in the incidence of necrotizing enterocolitis (NEC). STUDY DESIGN: A retrospective cohort study from two tertiary NICUs in northern and central Connecticut involving 16,761 infants admitted over a 28-year period. Various perinatal and neonatal risk factors were evaluated by univariate, multivariate, and spectral density analyses. RESULTS: The incidence of NEC was unchanged over the 28 years of the study. Gestational age, birth weight, and birth month (birth in April/May) were independently associated with stage II or III NEC even after adjusting for confounding factors (p < 0.05). Yearly NEC incidence showed a multi-modal distribution with spectral density spikes approximately every 10 years. CONCLUSION(S): Temporal and seasonal factors may play a role in NEC, with a peak incidence in infants born in April/May and periodicity spikes approximately every 10 years. These trends suggest non-random, possibly environmental, factors influencing NEC.
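
A spectral-density check for periodicity in a yearly incidence series can be done with a periodogram, as sketched below on simulated data containing a 10-year cycle; this illustrates the type of analysis, not the study's data or exact method.

```python
import numpy as np
from scipy.signal import periodogram

years = np.arange(28)
# Simulated yearly incidence with a 10-year cycle plus noise.
incidence = 5 + 2 * np.sin(2 * np.pi * years / 10) \
            + np.random.default_rng(5).normal(0, 0.5, 28)

freqs, power = periodogram(incidence)
peak = freqs[1:][power[1:].argmax()]          # skip the zero-frequency term
print(f"dominant period ≈ {1 / peak:.1f} years")
```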


Subjects
Enterocolitis, Necrotizing; Connecticut/epidemiology; Female; Gestational Age; Humans; Incidence; Infant, Newborn; Infant, Very Low Birth Weight; Pregnancy; Retrospective Studies; Risk Factors; Seasons
13.
Article in English | MEDLINE | ID: mdl-32931433

ABSTRACT

Discovering patterns in biological sequences is a crucial step in extracting useful information from them. Motifs can be viewed as patterns that occur exactly, or with minor changes, across some or all of the biological sequences. Motif search has numerous applications, including the identification of transcription factors and their binding sites, composite regulatory patterns, and similarity among families of proteins. The general problem of motif search is intractable. One of the most studied models of motif search in the literature is Edit-distance-based Motif Search (EMS). In EMS, the goal is to find all patterns of length l that occur within an edit distance of at most d in each of the input sequences. Existing EMS algorithms do not scale well on challenging instances and large datasets. In this paper, the current state-of-the-art EMS solver is advanced by exploiting the idea of dimension reduction, and a novel idea for reducing the cardinality of the alphabet is proposed. The algorithm we propose, EMS3, is exact; that is, it finds all the motifs present in the input sequences. EMS3 can also be viewed as a divide-and-conquer algorithm. We provide theoretical analyses establishing the efficiency of EMS3, and extensive experiments on standard benchmark datasets (synthetic and real-world) show that it outperforms the existing state-of-the-art algorithm (EMS2).
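
For intuition, the sketch below is a brute-force EMS checker (not EMS3 itself, which adds alphabet reduction and other pruning): it tests which l-mers of the first sequence occur within edit distance d in every sequence. Restricting candidates to exact substrings of the first sequence is a simplification; a true EMS solver also considers patterns that appear nowhere exactly.

```python
def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def occurs(motif, seq, d):
    l = len(motif)
    for k in range(-d, d + 1):        # indels allow windows of length l + k
        w = l + k
        for i in range(max(0, len(seq) - w + 1)):
            if edit_distance(motif, seq[i:i + w]) <= d:
                return True
    return False

def ems_brute(seqs, l, d):
    cands = {seqs[0][i:i + l] for i in range(len(seqs[0]) - l + 1)}
    return {m for m in cands if all(occurs(m, s, d) for s in seqs)}

print(ems_brute(["ACGTACGT", "ACCTACGT", "ACGAACGT"], 5, 1))
```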


Subjects
Algorithms; Computational Biology/methods; Sequence Analysis, Protein/methods; Amino Acid Motifs/genetics; Binding Sites/genetics; Transcription Factors/chemistry; Transcription Factors/genetics
14.
J Parallel Distrib Comput ; 143: 47-66, 2020 Sep.
Article in English | MEDLINE | ID: mdl-32699464

ABSTRACT

In prior work, stochastic dual coordinate ascent (SDCA) has been parallelized in multi-core environments where cores communicate through shared memory, or in multi-processor distributed-memory environments where processors communicate through message passing. In this paper, we propose a hybrid SDCA framework for multi-core clusters, the most common high-performance computing environment, consisting of multiple nodes each having multiple cores and its own shared memory. We distribute data across nodes, where each node solves a local problem in an asynchronous parallel fashion on its cores, and the local updates are then aggregated via an asynchronous across-node update scheme. The proposed doubly asynchronous method converges to a global solution for L-Lipschitz continuous loss functions, and at a linear convergence rate if a smooth convex loss function is used. Extensive empirical comparison shows that our algorithm scales better than the best known shared-memory methods and runs faster than previous distributed-memory methods. Big datasets, such as a 280 GB dataset from the LIBSVM repository, cannot be accommodated on a single node and hence cannot be solved by a single-node parallel algorithm. For such a dataset, our hybrid algorithm takes less than 30 seconds to achieve a duality gap of 10^-5 on 16 nodes each using 12 cores, which is significantly faster than the best known distributed algorithms, such as CoCoA+, which take more than 160 seconds on 16 nodes.
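
For reference, a serial SDCA-style dual coordinate update for an L2-regularized linear SVM (hinge loss) is sketched below; the paper's contribution, the hybrid shared-/distributed-memory parallelization of such updates, is not reproduced here.

```python
import numpy as np

def sdca_svm(X, y, C=1.0, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha, w = np.zeros(n), np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):             # one dual coordinate at a time
            g = y[i] * X[i] @ w - 1.0            # gradient for coordinate i
            new = np.clip(alpha[i] - g / (X[i] @ X[i]), 0.0, C)
            w += (new - alpha[i]) * y[i] * X[i]  # keep primal w in sync
            alpha[i] = new
    return w

X = np.random.default_rng(6).normal(size=(200, 10))
y = np.sign(X[:, 0] + 0.1)
print((np.sign(X @ sdca_svm(X, y)) == y).mean())   # training accuracy
```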

15.
Health Psychol Rev ; 14(1): 145-158, 2020 03.
Article in English | MEDLINE | ID: mdl-31941434

ABSTRACT

The evidence base in health psychology is vast and growing rapidly. These factors make it difficult (and sometimes practically impossible) to consider all available evidence when making decisions about the state of knowledge on a given phenomenon (e.g., associations of variables, effects of interventions on particular outcomes). Systematic reviews, meta-analyses, and other rigorous syntheses of the research mitigate this problem by providing concise, actionable summaries of knowledge in a given area of study. Yet, conducting these syntheses has grown increasingly laborious owing to the fast accumulation of new evidence; existing, manual methods for synthesis do not scale well. In this article, we discuss how semi-automation via machine learning and natural language processing methods may help researchers and practitioners to review evidence more efficiently. We outline concrete examples in health psychology, highlighting practical, open-source technologies available now. We indicate the potential of more advanced methods and discuss how to avoid the pitfalls of automated reviews.
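
As a concrete example of such semi-automation, the toy below ranks abstracts by predicted relevance with a TF-IDF text classifier, so reviewers can screen likely-relevant papers first; the abstracts and labels are fabricated placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

abstracts = [
    "randomized trial of a walking intervention for blood pressure",
    "protein folding dynamics simulated at atomic resolution",
    "mindfulness app reduces self-reported stress in adults",
    "quantum error correction with surface codes",
]
relevant = [1, 0, 1, 0]   # screener's labels on a small seed set

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(abstracts, relevant)
# Probability that a new abstract is relevant to the review question.
print(model.predict_proba(["exercise intervention for anxiety"])[:, 1])
```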


Subjects
Behavioral Medicine; Machine Learning; Natural Language Processing; Systematic Reviews as Topic; Humans
16.
BMC Genomics ; 20(Suppl 5): 424, 2019 Jun 06.
Article in English | MEDLINE | ID: mdl-31167665

ABSTRACT

BACKGROUND: Motifs are crucial patterns with numerous applications, including the identification of transcription factors and their binding sites, composite regulatory patterns, and similarity between families of proteins. Several motif models have been proposed in the literature. The (l,d)-motif model is one of the most widely studied; however, it sometimes reports more spurious motifs than expected. We interpret a motif as a biologically significant entity that is evolutionarily preserved within some distance, and it may be highly improbable that the motif undergoes the same number of changes in each of the species. To address this issue, in this paper we introduce a new model that generalizes the (l,d)-motif model. This model, called the (l,d1,d2)-motif model (LDDMS), is also NP-hard. We present three elegant and efficient exact algorithms to solve the LDDMS problem: LDDMS1, LDDMS2, and LDDMS3. RESULTS: We performed both theoretical analyses and empirical tests of these algorithms. The theoretical analyses demonstrate that our algorithms have lower computational cost than the pattern-driven approach, and empirical results on both simulated and real datasets show that each of the three algorithms has advantages on some (l,d1,d2) instances. CONCLUSIONS: We proposed the LDDMS model, which is more practically relevant, along with three efficient exact algorithms to solve the problem. Moreover, our algorithms parallelize well. We believe the idea behind this new model can also be extended to other motif search problems, such as Edit-distance-based Motif Search (EMS) and Simple Motif Search (SMS).


Subjects
Algorithms; Amino Acid Motifs; Nucleotide Motifs; Computational Biology; Humans; Models, Theoretical; Sequence Analysis, DNA/methods; Sequence Analysis, Protein/methods
17.
Bioinformatics ; 35(9): e1-e7, 2019 05 01.
Article in English | MEDLINE | ID: mdl-31051040

ABSTRACT

MOTIVATION: Next-generation sequencing (NGS) technologies have revolutionized genomic research by reducing the cost of whole-genome sequencing. One of the biggest challenges posed by modern sequencing technology is the economical storage of NGS data: storing raw data is infeasible because of its enormous size and high redundancy. In this article, we address the problem of storing and transmitting large FASTQ files using innovative compression techniques. RESULTS: We introduce a new lossless, non-reference-based FASTQ compression algorithm named lossless FastQ compressor. We have compared our algorithm with other state-of-the-art big-data compression algorithms, namely gzip, bzip2, fastqz, fqzcomp, G-SQZ, SCALCE, Quip, DSRC, and DSRC-LZ. This comparison reveals that our algorithm achieves better compression ratios; the improvement obtained is up to 225%. For example, on one of the datasets (SRR065390_1), the average improvement over all the algorithms compared is 74.62%. AVAILABILITY AND IMPLEMENTATION: The implementations are freely available for non-commercial purposes and can be downloaded from http://engr.uconn.edu/∼rajasek/FastqPrograms.zip.
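
The common idea behind non-reference FASTQ compression, splitting each record into identifier, sequence, and quality streams and compressing each homogeneous stream separately, can be illustrated as below; this is a generic sketch, not the paper's algorithm.

```python
import bz2

fastq = b"@read1\nACGTACGT\n+\nIIIIHHHH\n@read2\nACGTTGCA\n+\nHHHHIIII\n"
lines = fastq.splitlines()
ids, seqs, quals = lines[0::4], lines[1::4], lines[3::4]

# Compress each homogeneous stream on its own; they compress better apart.
streams = [b"\n".join(s) for s in (ids, seqs, quals)]
compressed = [bz2.compress(s) for s in streams]
print(sum(map(len, compressed)), "bytes vs", len(bz2.compress(fastq)))
```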

18.
Bioinformatics ; 35(17): 2932-2940, 2019 09 01.
Article in English | MEDLINE | ID: mdl-30649204

ABSTRACT

MOTIVATION: Metagenomics is the study of genetic material sampled directly from natural habitats. It has the potential to reveal previously hidden diversity of microscopic life, largely owing to highly parallel, low-cost next-generation sequencing technology. Conventional approaches align metagenomic reads onto known reference genomes to identify microbes in a sample. Since such a collection of reference genomes is very large, this approach often needs high-end computing machines with large memory, which are not often available to researchers. Alternative approaches follow an alignment-free methodology, where the presence of a microbe is predicted using information about the unique k-mers present in the microbial genomes. However, such approaches suffer from high false-positive rates due to the trade-off between the value of k and the available computational resources. In this article, we propose a highly efficient metagenomic sequence classification (MSC) algorithm that is a hybrid of both approaches. Instead of aligning reads to full genomes, MSC aligns reads onto a set of carefully chosen, shorter, and highly discriminating model sequences built from the unique k-mers of each reference sequence. RESULTS: Microbiome researchers are generally interested in two objectives of a taxonomic classifier: (i) detecting prevalence, i.e., the taxa present in a sample, and (ii) estimating their relative abundances. MSC is primarily designed to detect prevalence, and experimental results show that MSC is indeed more effective and efficient than the other state-of-the-art algorithms in terms of accuracy, memory, and runtime. Moreover, MSC outputs an approximate estimate of the abundances. AVAILABILITY AND IMPLEMENTATION: The implementations are freely available for non-commercial purposes and can be downloaded from https://drive.google.com/open?id=1XirkAamkQ3ltWvI1W1igYQFusp9DHtVl.
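
A toy version of the alignment-free side of this hybrid, classifying reads by votes over k-mers unique to each reference, is sketched below; MSC itself instead aligns reads to short discriminating model sequences built from such k-mers, which this snippet does not implement.

```python
from collections import Counter

refs = {"taxonA": "ACGTACGTGGCC", "taxonB": "TTGGAACCTTGA"}
k = 4

kmer_owner = {}
for name, g in refs.items():
    for i in range(len(g) - k + 1):
        kmer = g[i:i + k]
        if kmer_owner.get(kmer, name) != name:
            kmer_owner[kmer] = None          # shared across taxa: not unique
        else:
            kmer_owner[kmer] = name

def classify(read):
    votes = Counter(kmer_owner.get(read[i:i + k]) for i in range(len(read) - k + 1))
    votes.pop(None, None)                    # drop unseen/non-unique k-mers
    return votes.most_common(1)[0][0] if votes else "unclassified"

print(classify("ACGTACGT"))   # -> taxonA
```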


Subjects
Metagenome; Metagenomics; Sequence Analysis, DNA; Algorithms; High-Throughput Nucleotide Sequencing
19.
Article in English | MEDLINE | ID: mdl-29993557

ABSTRACT

With advances in next-generation sequencing technology, huge amounts of data have been, and continue to be, generated in biology. A bottleneck in dealing with such datasets lies in developing effective algorithms for extracting useful information from them; algorithms for finding patterns in biological data pave the way for extracting crucial information from these voluminous datasets. In this paper we focus on a fundamental pattern, namely the closest l-mers. Given a set of m biological strings S1, S2, …, Sm and an integer l, the problem of interest is to find an l-mer from each string such that the distance among them is the least. That is, we want to find m l-mers X1, X2, …, Xm such that Xi is an l-mer in Si (for 1 ≤ i ≤ m) and the Hamming distance among these m l-mers is the least among all such possible choices. This problem has many applications, including motif search; algorithms for finding the closest l-mers have been used in solving the (l,d)-motif search problem (see, e.g., [PeSz00, DBR07]). In this paper, novel algorithms are proposed for this problem for the case of m = 3. A comprehensive experimental evaluation is performed for m = 3, along with a further empirical study of m = 4 and 5.
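
A brute-force reference implementation for m = 3 makes the problem statement concrete: minimize the sum of pairwise Hamming distances over all triples of l-mers. The paper's algorithms reach the same optimum much faster; this sketch is only a correctness baseline.

```python
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def closest_lmers(s1, s2, s3, l):
    lmers = lambda s: [s[i:i + l] for i in range(len(s) - l + 1)]
    # Exhaustively score every triple of l-mers, one from each string.
    return min(product(lmers(s1), lmers(s2), lmers(s3)),
               key=lambda t: hamming(t[0], t[1]) + hamming(t[0], t[2])
                             + hamming(t[1], t[2]))

print(closest_lmers("ACGTAC", "TACGTT", "GACGTA", 4))   # ('ACGT', 'ACGT', 'ACGT')
```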

20.
Nucleic Acids Res ; 46(D1): D465-D470, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29140456

ABSTRACT

Minimotif Miner (MnM) is a database and web system for analyzing short functional peptide motifs, termed minimotifs. We present an update to MnM, growing the database from ∼300 000 to >1 000 000 minimotif consensus sequences and instances. This growth comes largely from updating data from existing databases and annotating articles that use high-throughput approaches to analyze different types of post-translational modifications. Another update maps human proteins and their minimotifs to known human variants from dbSNP, build 150. MnM 4 can now be used to generate mechanistic hypotheses about how human genetic variation affects minimotifs and outcomes. One example of the utility of the combined minimotif/SNP tool is the identification of a loss-of-function missense SNP in a ubiquitylation minimotif encoded in the excision repair cross-complementing 2 (ERCC2) nucleotide excision repair gene. This SNP reaches genome-wide significance for many types of cancer, and the variant identified with MnM 4 suggests a more detailed mechanistic hypothesis concerning the role of ERCC2 in cancer. Other updates to the web system include a new architecture, with migration of the web system and database to Docker containers for better performance and management. Web links: minimotifminer.org and mnm.engr.uconn.edu.


Subjects
Databases, Protein; Peptides/chemistry; Protein Processing, Post-Translational; Receptors, G-Protein-Coupled/chemistry; Software; Xeroderma Pigmentosum Group D Protein/chemistry; Amino Acid Sequence; Binding Sites; Consensus Sequence; Gene Ontology; Genome, Human; Humans; Internet; Models, Molecular; Molecular Sequence Annotation; Neoplasms/genetics; Neoplasms/metabolism; Neoplasms/pathology; Peptides/genetics; Peptides/metabolism; Polymorphism, Single Nucleotide; Protein Binding; Protein Interaction Domains and Motifs; Receptors, G-Protein-Coupled/genetics; Receptors, G-Protein-Coupled/metabolism; Sequence Alignment; Xeroderma Pigmentosum Group D Protein/genetics; Xeroderma Pigmentosum Group D Protein/metabolism