1.
Bioinformatics ; 39(8)2023 08 01.
Article in English | MEDLINE | ID: mdl-37527019

ABSTRACT

MOTIVATION: Many real-world problems can be modeled as annotated graphs. Scalable graph algorithms that extract actionable information from such data are in demand since these graphs are large, varying in topology, and have diverse node/edge annotations. When these graphs change over time they create dynamic graphs, and open the possibility to find patterns across different time points. In this article, we introduce a scalable algorithm that finds unique dense regions across time points in dynamic graphs. Such algorithms have applications in many different areas, including the biological, financial, and social domains. RESULTS: There are three important contributions to this manuscript. First, we designed a scalable algorithm, USNAP, to effectively identify dense subgraphs that are unique to a time stamp given a dynamic graph. Importantly, USNAP provides a lower bound of the density measure in each step of the greedy algorithm. Second, insights and understanding obtained from validating USNAP on real data show its effectiveness. While USNAP is domain independent, we applied it to four non-small cell lung cancer gene expression datasets. Stages in non-small cell lung cancer were modeled as dynamic graphs, and input to USNAP. Pathway enrichment analyses and comprehensive interpretations from literature show that USNAP identified biologically relevant mechanisms for different stages of cancer progression. Third, USNAP is scalable, and has a time complexity of O(m + m_c log n_c + n_c log n_c), where m is the number of edges, and n is the number of vertices in the dynamic graph; m_c is the number of edges, and n_c is the number of vertices in the collapsed graph. AVAILABILITY AND IMPLEMENTATION: The code of USNAP is available at https://www.cs.utoronto.ca/~juris/data/USNAP22.


Subjects
Carcinoma, Non-Small-Cell Lung , Lung Neoplasms , Humans , Algorithms
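
As a rough illustration of the greedy, density-tracking style of search described in the abstract above, the sketch below implements the classic min-degree peeling heuristic for the densest subgraph of a single static graph. It is not USNAP: the function name, the edges-per-node density measure, and the toy graph are assumptions made for illustration, and the abstract does not specify USNAP's graph-collapse step or its uniqueness criterion.

```python
from collections import defaultdict
import heapq


def greedy_densest_subgraph(edges):
    """Greedy peeling: repeatedly delete a minimum-degree node and remember
    the densest intermediate subgraph (density = edges / nodes).

    Node labels must be mutually comparable (e.g. all ints or all strings)
    because they are used to break ties in the heap.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    alive = set(adj)
    num_edges = sum(degree.values()) // 2

    heap = [(d, v) for v, d in degree.items()]
    heapq.heapify(heap)

    best_density = num_edges / len(alive) if alive else 0.0
    best_size = len(alive)
    removal_order = []

    while alive:
        d, v = heapq.heappop(heap)
        if v not in alive or d != degree[v]:
            continue  # stale heap entry, skip it
        alive.remove(v)
        removal_order.append(v)
        num_edges -= degree[v]  # degree[v] counts only surviving neighbours
        for w in adj[v]:
            if w in alive:
                degree[w] -= 1
                heapq.heappush(heap, (degree[w], w))
        if alive:
            density = num_edges / len(alive)
            if density > best_density:
                best_density, best_size = density, len(alive)

    # The best subgraph consists of the last `best_size` nodes to be removed.
    n = len(removal_order)
    return best_density, set(removal_order[n - best_size:])


# Toy graph (an assumption): a 4-clique on nodes 0..3 with a tail 3-4-5.
toy_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
density, nodes = greedy_densest_subgraph(toy_edges)
```

On this toy graph the peeling removes the tail first and reports the 4-clique, with density 6/4 = 1.5. For the edges-per-node density, this heuristic is known to return a subgraph at least half as dense as the optimum, which is one example of the kind of per-step density guarantee the abstract alludes to.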
2.
J Biomed Inform ; 139: 104296, 2023 03.
Article in English | MEDLINE | ID: mdl-36736937

ABSTRACT

Given a cardiac-arrest patient being monitored in the ICU (intensive care unit) for brain activity, how can we predict their health outcomes as early as possible? Early decision-making is critical in many applications, e.g. monitoring patients may assist in early intervention and improved care. On the other hand, early prediction on EEG data poses several challenges: (i) earliness-accuracy trade-off; observing more data often increases accuracy but sacrifices earliness, (ii) large-scale (for training) and streaming (online decision-making) data processing, and (iii) multi-variate (due to multiple electrodes) and multi-length (due to varying length of stay of patients) time series. Motivated by this real-world application, we present BeneFitter that infuses the incurred savings from an early prediction as well as the cost from misclassification into a unified domain-specific target called benefit. Unifying these two quantities allows us to directly estimate a single target (i.e. benefit), and importantly, (a) is efficient and fast, with training time linear in the number of input sequences, and can operate in real-time for decision-making, (b) can handle multi-variate and variable-length time-series, suitable for patient data, and (c) is effective, providing up to 2× time-savings with equal or better accuracy as compared to competitors.


Subjects
Awareness , Intensive Care Units , Humans , Time Factors , Outcome Assessment, Health Care , Electroencephalography
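
The abstract above describes folding the savings from an early prediction and the cost of a misclassification into one target called benefit, but it does not give the formula. The sketch below shows one way such a target could be written. The linear form, the parameter names `saving_per_step` and `misclassification_cost`, and their default values are all hypothetical and are not taken from BeneFitter.

```python
def benefit(decision_time, series_length, correct,
            saving_per_step=1.0, misclassification_cost=10.0):
    """Hypothetical benefit: savings for every time step skipped by deciding
    early, minus a fixed penalty if the early prediction is wrong."""
    savings = saving_per_step * (series_length - decision_time)
    penalty = 0.0 if correct else misclassification_cost
    return savings - penalty


def best_decision_time(correct_at, series_length, **costs):
    """Return the decision time t (1-based) with the highest benefit.

    correct_at[t - 1] says whether a prediction made after observing
    t time steps would have been correct (known only on training data).
    """
    return max(
        range(1, series_length + 1),
        key=lambda t: benefit(t, series_length, correct_at[t - 1], **costs),
    )


# Toy example (an assumption): predictions become correct from step 4 onward.
correct_at = [False] * 3 + [True] * 7
t_star = best_decision_time(correct_at, series_length=10)
```

With these toy numbers the best time to decide is step 4: deciding earlier incurs the misclassification penalty, and deciding later gives up savings without improving correctness. This is the earliness-accuracy trade-off expressed as a single number.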
3.
Methods ; 132: 34-41, 2018 01 01.
Article in English | MEDLINE | ID: mdl-28684340

ABSTRACT

Can we use graph mining algorithms to find patterns in tumor molecular mechanisms? Can we model disease progression with multiple time-specific graph comparison algorithms? In this paper, we will focus on this area. Our main contributions are 1) we proposed the Temporal-Omics (Temp-O) workflow to model tumor progression in non-small cell lung cancer (NSCLC) using graph comparisons between multiple stage-specific graphs, and 2) we showed that temporal structures are meaningful in the tumor progression of NSCLC. Other identified temporal structures that were not highlighted in this paper may also be used to gain insights to possible novel mechanisms. Importantly, the Temp-O workflow is generic; while we applied it on NSCLC, it can be applied in other cancers and diseases. We used gene expression data from tumor samples across disease stages to model lung cancer progression, creating stage-specific tumor graphs. Validating our findings in independent datasets showed that differences in temporal network structures capture diverse mechanisms in NSCLC. Furthermore, results showed that structures are consistent and potentially biologically important as we observed that genes with similar protein names were captured in the same cliques for all cliques in all datasets. Importantly, the identified temporal structures are meaningful in the tumor progression of NSCLC as they agree with the molecular mechanism in the tumor progression or carcinogenesis of NSCLC. In particular, the identified major histocompatibility complex of class II temporal structures capture mechanisms concerning carcinogenesis; the proteasome temporal structures capture mechanisms that are in early or late stages of lung cancer; the ribosomal cliques capture the role of ribosome biosynthesis in cancer development and sustainment. Further, on a large independent dataset we validated that temporal network structures identified proteins that are prognostic for overall survival in NSCLC adenocarcinoma.


Subjects
Carcinoma, Non-Small-Cell Lung/pathology , Lung Neoplasms/pathology , Biomarkers, Tumor/genetics , Biomarkers, Tumor/metabolism , Carcinoma, Non-Small-Cell Lung/genetics , Carcinoma, Non-Small-Cell Lung/metabolism , Carcinoma, Non-Small-Cell Lung/mortality , Disease Progression , Gene Regulatory Networks , Humans , Kaplan-Meier Estimate , Lung Neoplasms/genetics , Lung Neoplasms/metabolism , Lung Neoplasms/mortality , Models, Biological , Molecular Sequence Annotation , Transcriptome
4.
J Integr Neurosci ; 15(3): 381-402, 2016 Sep.
Article in English | MEDLINE | ID: mdl-27774837

ABSTRACT

We propose a nonlinear dynamic model for invasive electroencephalogram analysis that learns the optimal parameters of the neural population model via the Levenberg-Marquardt algorithm. We introduce the crucial windows in which the estimated parameters present patterns before seizure onset. The optimal parameters minimize the error between the observed signal and the signal generated by the model. The proposed approach effectively discriminates between healthy signals and epileptic seizure signals. We evaluate the proposed method using an electroencephalogram dataset with normal and epileptic seizure sequences. The empirical results show patterns in the parameters as a seizure approaches and indicate that the method is efficient in analyzing nonlinear epilepsy electroencephalogram data. The accuracy of estimating the optimal parameters is improved by using the nonlinear dynamic model.


Subjects
Brain/diagnostic imaging , Electroencephalography/methods , Epilepsy/diagnostic imaging , Nonlinear Dynamics , Pattern Recognition, Automated/methods , Signal Processing, Computer-Assisted , Algorithms , Brain/physiopathology , Brain/surgery , Datasets as Topic , Electrodes, Implanted , Epilepsy/physiopathology , Epilepsy/surgery , Humans , Seizures/diagnostic imaging , Seizures/physiopathology , Seizures/surgery
5.
Article in English | MEDLINE | ID: mdl-36201417

ABSTRACT

Law enforcement and domain experts can detect human trafficking (HT) in online escort websites by analyzing suspicious clusters of connected ads. How can we explain clustering results intuitively and interactively, visualizing potential evidence for experts to analyze? We present TRAFFICVIS, the first interface for cluster-level HT detection and labeling. Developed through months of participatory design with domain experts, TRAFFICVIS provides coordinated views in conjunction with carefully chosen backend algorithms to effectively show spatio-temporal and text patterns to a wide variety of anti-HT stakeholders. We build upon state-of-the-art text clustering algorithms by incorporating shared metadata as a signal of connected and possibly suspicious activity, then visualize the results. Domain experts can use TRAFFICVIS to label clusters as HT, or other, suspicious, but non-HT activity such as spam and scam, quickly creating labeled datasets to enable further HT research. Through domain expert feedback and a usage scenario, we demonstrate TRAFFICVIS's efficacy. The feedback was overwhelmingly positive, with repeated high praises for the usability and explainability of our tool, the latter being vital for indicting possible criminals.

6.
Neuroimage ; 58(2): 537-48, 2011 Sep 15.
Article in English | MEDLINE | ID: mdl-21729758

ABSTRACT

The traditional approach to functional image analysis models images as matrices of raw voxel intensity values. Although such a representation is widely utilized and heavily entrenched both within neuroimaging and in the wider data mining community, the strong interactions among space, time, and categorical modes such as subject and experimental task inherent in functional imaging yield a dataset with "high-order" structure, which matrix models are incapable of exploiting. Reasoning across all of these modes of data concurrently requires a high-order model capable of representing relationships between all modes of the data in tandem. We thus propose to model functional MRI data using tensors, which are high-order generalizations of matrices equivalent to multidimensional arrays or data cubes. However, several unique challenges exist in the high-order analysis of functional medical data: naïve tensor models are incapable of exploiting spatiotemporal locality patterns, standard tensor analysis techniques exhibit poor efficiency, and mixtures of numeric and categorical modes of data are very often present in neuroimaging experiments. Formulating the problem of image clustering as a form of Latent Semantic Analysis and using the WaveCluster algorithm as a baseline, we propose a comprehensive hybrid tensor and wavelet framework for clustering, concept discovery, and compression of functional medical images which successfully addresses these challenges. Our approach reduced runtime and dataset size on a 9.3GB finger opposition motor task fMRI dataset by up to 98% while exhibiting improved spatiotemporal coherence relative to standard tensor, wavelet, and voxel-based approaches. 
Our clustering technique was capable of automatically differentiating between the frontal areas of the brain responsible for task-related habituation and the motor regions responsible for executing the motor task, in contrast to a widely used fMRI analysis program, SPM, which only detected the latter region. Furthermore, our approach discovered latent concepts suggestive of subject handedness nearly 100× faster than standard approaches. These results suggest that a high-order model is an integral component to accurate scalable functional neuroimaging.


Subjects
Image Processing, Computer-Assisted/methods , Magnetic Resonance Imaging/methods , Adult , Algorithms , Cluster Analysis , Data Interpretation, Statistical , Data Mining , Diffusion Tensor Imaging , Factor Analysis, Statistical , Female , Fuzzy Logic , Humans , Image Processing, Computer-Assisted/statistics & numerical data , Magnetic Resonance Imaging/statistics & numerical data , Male , Models, Statistical , Principal Component Analysis , Wavelet Analysis
7.
Bioinformatics ; 26(12): i47-56, 2010 Jun 15.
Article in English | MEDLINE | ID: mdl-20529936

ABSTRACT

MOTIVATION: Microarray profiling of mRNA abundance is often ill suited for temporal-spatial analysis of gene expressions in multicellular organisms such as Drosophila. Recent progress in image-based genome-scale profiling of whole-body mRNA patterns via in situ hybridization (ISH) calls for development of accurate and automatic image analysis systems to facilitate efficient mining of complex temporal-spatial mRNA patterns, which will be essential for functional genomics and network inference in higher organisms. RESULTS: We present SPEX(2), an automatic system for embryonic ISH image processing, which can extract, transform, compare, classify and cluster spatial gene expression patterns in Drosophila embryos. Our pipeline for gene expression pattern extraction outputs the precise spatial locations and strengths of the gene expression. We performed experiments on the largest publicly available collection of Drosophila ISH images, and show that our method achieves excellent performance in automatic image annotation, and also finds clusters that are significantly enriched, both for gene ontology functional annotations, and for annotation terms from a controlled vocabulary used by human curators to describe these images. AVAILABILITY: Software will be available at http://www.sailing.cs.cmu.edu/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Drosophila/embryology , Drosophila/genetics , Gene Expression , Image Processing, Computer-Assisted/methods , In Situ Hybridization/methods , RNA, Messenger/analysis , Software , Animals , Gene Expression Profiling/methods , RNA, Messenger/metabolism
8.
Front Big Data ; 3: 594302, 2020.
Article in English | MEDLINE | ID: mdl-33997776

ABSTRACT

How can we detect fraudulent lockstep behavior in large-scale multi-aspect data (i.e., tensors)? Can we detect it when data are too large to fit in memory or even on a disk? Past studies have shown that dense subtensors in real-world tensors (e.g., social media, Wikipedia, TCP dumps, etc.) signal anomalous or fraudulent behavior such as retweet boosting, bot activities, and network attacks. Thus, various approaches, including tensor decomposition and search, have been proposed for detecting dense subtensors rapidly and accurately. However, existing methods suffer from low accuracy, or they assume that tensors are small enough to fit in main memory, which is unrealistic in many real-world applications such as social media and web. To overcome these limitations, we propose D-Cube, a disk-based dense-subtensor detection method, which also can run in a distributed manner across multiple machines. Compared to state-of-the-art methods, D-Cube is (1) Memory Efficient: requires up to 1,561× less memory and handles 1,000× larger data (2.6TB), (2) Fast: up to 7× faster due to its near-linear scalability, (3) Provably Accurate: gives a guarantee on the densities of the detected subtensors, and (4) Effective: spotted network attacks from TCP dumps and synchronized behavior in rating data most accurately.

9.
Bioinformatics ; 24(13): i250-8, 2008 Jul 01.
Article in English | MEDLINE | ID: mdl-18586722

ABSTRACT

MOTIVATION: Protein complexes integrate multiple gene products to coordinate many biological functions. Given a graph representing pairwise protein interaction data, one can search for subgraphs representing protein complexes. Previous methods for performing such a search relied on the assumption that complexes form a clique in that graph. While this assumption is true for some complexes, it does not hold for many others. New algorithms are required in order to recover complexes with other types of topological structure. RESULTS: We present an algorithm for inferring protein complexes from weighted interaction graphs. By using graph topological patterns and biological properties as features, we model each complex subgraph by a probabilistic Bayesian network (BN). We use a training set of known complexes to learn the parameters of this BN model. The log-likelihood ratio derived from the BN is then used to score subgraphs in the protein interaction graph and identify new complexes. We applied our method to protein interaction data in yeast. As we show, our algorithm achieved a considerable improvement over clique-based algorithms in terms of its ability to recover known complexes. We discuss some of the new complexes predicted by our algorithm and determine that they likely represent true complexes. AVAILABILITY: Matlab implementation is available on the supporting website: www.cs.cmu.edu/~qyj/SuperComplex.


Subjects
Algorithms , Cluster Analysis , Models, Biological , Protein Interaction Mapping/methods , Proteome/metabolism , Signal Transduction/physiology , Computer Simulation
10.
Big Data ; 4(3): 179-91, 2016 09.
Article in English | MEDLINE | ID: mdl-27642720

ABSTRACT

Multiaspect data are ubiquitous in modern Big Data applications. For instance, different aspects of a social network are the different types of communication between people, the time stamp of each interaction, and the location associated to each individual. How can we jointly model all those aspects and leverage the additional information that they introduce to our analysis? Tensors, which are multidimensional extensions of matrices, are a principled and mathematically sound way of modeling such multiaspect data. In this article, our goal is to popularize tensors and tensor decompositions to Big Data practitioners by demonstrating their effectiveness, outlining challenges that pertain to their application in Big Data scenarios, and presenting our recent work that tackles those challenges. We view this work as a step toward a fully automated, unsupervised tensor mining tool that can be easily and broadly adopted by practitioners in academia and industry.


Subjects
Data Mining , Computer Simulation , Machine Learning
11.
PLoS One ; 11(3): e0151027, 2016.
Article in English | MEDLINE | ID: mdl-26974560

ABSTRACT

Complex networks have been shown to exhibit universal properties, with one of the most consistent patterns being the scale-free degree distribution, but are there regularities obeyed by the r-hop neighborhood in real networks? We answer this question by identifying another power-law pattern that describes the relationship between the fractions of node pairs C(r) within r hops and the hop count r. This scale-free distribution is pervasive and describes a large variety of networks, ranging from social and urban to technological and biological networks. In particular, inspired by the definition of the fractal correlation dimension D2 on a point-set, we consider the hop-count r to be the underlying distance metric between two vertices of the network, and we examine the scaling of C(r) with r. We find that this relationship follows a power-law in real networks within the range 2 ≤ r ≤ d, where d is the effective diameter of the network, that is, the 90-th percentile distance. We term this relationship as power-hop and the corresponding power-law exponent as power-hop exponent h. We provide theoretical justification for this pattern under successful existing network models, while we analyze a large set of real and synthetic network datasets and we show the pervasiveness of the power-hop.


Subjects
Models, Theoretical , Social Support , Humans
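
The quantity C(r) in the abstract above, the fraction of node pairs within r hops, can be computed with one breadth-first search per node, and a power-law exponent can then be read off as the slope of log C(r) against log r. The sketch below is a minimal version of that idea, assuming an undirected graph given as an adjacency dictionary. The function names and the plain least-squares fit are illustrative choices, not the authors' procedure.

```python
from collections import deque
import math


def pairs_within_hops(adj, max_r):
    """Return {r: C(r)} for r = 1..max_r, where C(r) is the fraction of
    ordered node pairs (u != v) whose hop distance is at most r."""
    n = len(adj)
    if n < 2:
        raise ValueError("need at least two nodes")
    exact = [0] * (max_r + 1)  # exact[d] = ordered pairs at distance exactly d
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == max_r:
                continue  # do not expand beyond max_r hops
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    exact[dist[w]] += 1
                    queue.append(w)
    total_pairs = n * (n - 1)
    result, running = {}, 0
    for r in range(1, max_r + 1):
        running += exact[r]
        result[r] = running / total_pairs
    return result


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    den = sum((a - mx) ** 2 for a in lx)
    if den == 0:
        raise ValueError("need at least two distinct x values")
    return sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / den


# Toy graph (an assumption): a path 0-1-2-3-4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
c = pairs_within_hops(path, max_r=4)
```

For a real network, the abstract suggests restricting the fit to 2 ≤ r ≤ d, with d the effective diameter, for example `loglog_slope(rs, [c[r] for r in rs])` over that range of `rs`.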
12.
Stat Anal Data Min ; 9(4): 269-290, 2016 Aug.
Article in English | MEDLINE | ID: mdl-27672406

ABSTRACT

How can we correlate the neural activity in the human brain as it responds to typed words, with properties of these terms (like 'edible', 'fits in hand')? In short, we want to find latent variables that jointly explain both the brain activity and the behavioral responses. This is one of many settings of the Coupled Matrix-Tensor Factorization (CMTF) problem. Can we enhance any CMTF solver, so that it can operate on potentially very large datasets that may not fit in main memory? We introduce Turbo-SMT, a meta-method capable of doing exactly that: it boosts the performance of any CMTF algorithm and parallelizes it, producing sparse and interpretable solutions (up to 65-fold). Additionally, we improve upon ALS, the work-horse algorithm for CMTF, with respect to efficiency and robustness to missing values. We apply Turbo-SMT to BrainQ, a dataset consisting of a (nouns, brain voxels, human subjects) tensor and a (nouns, properties) matrix, with coupling along the nouns dimension. Turbo-SMT is able to find meaningful latent variables, as well as to predict brain activity with competitive accuracy. Finally, we demonstrate the generality of Turbo-SMT, by applying it on a Facebook dataset (users, 'friends', wall-postings); there, Turbo-SMT spots spammer-like anomalies.

13.
IEEE Trans Cybern ; 44(1): 54-65, 2014 Jan.
Article in English | MEDLINE | ID: mdl-23757533

ABSTRACT

Consider someone who is starting research on a topic that is unfamiliar to them. Which seminal papers have influenced the topic the most? What is the genealogy of the seminal papers in this topic? These are the questions that they can raise, which we try to answer in this paper. First, we propose an algorithm that finds a set of seminal papers on a given topic. We also address the performance and scalability issues of this sophisticated algorithm. Next, we discuss the measures to decide how much a paper is influenced by another paper. Then, we propose an algorithm that constructs a genealogy of the seminal papers by using the influence measure and citation information. Finally, through extensive experiments with a large volume of real-world academic literature data, we show the effectiveness and efficiency of our approach.

14.
J R Soc Interface ; 11(96): 20140283, 2014 Jul 06.
Article in English | MEDLINE | ID: mdl-24789562

ABSTRACT

Network robustness is an important principle in biology and engineering. Previous studies of global networks have identified both redundancy and sparseness as topological properties used by robust networks. By focusing on molecular subnetworks, or modules, we show that module topology is tightly linked to the level of environmental variability (noise) the module expects to encounter. Modules internal to the cell that are less exposed to environmental noise are more connected and less robust than external modules. A similar design principle is used by several other biological networks. We propose a simple change to the evolutionary gene duplication model which gives rise to the rich range of module topologies observed within real networks. We apply these observations to evaluate and design communication networks that are specifically optimized for noisy or malicious environments. Combined, joint analysis of biological and computational networks leads to novel algorithms and insights benefiting both fields.


Subjects
Computer Communication Networks , Saccharomyces cerevisiae/genetics , Gene Duplication , Gene Regulatory Networks , Saccharomyces cerevisiae/metabolism , Signal Transduction , Systems Biology
15.
Proc SIAM Int Conf Data Min ; 2014: 118-126, 2014.
Article in English | MEDLINE | ID: mdl-26473087

ABSTRACT

How can we correlate the neural activity in the human brain as it responds to typed words, with properties of these terms (like 'edible', 'fits in hand')? In short, we want to find latent variables, that jointly explain both the brain activity, as well as the behavioral responses. This is one of many settings of the Coupled Matrix-Tensor Factorization (CMTF) problem. Can we accelerate any CMTF solver, so that it runs within a few minutes instead of tens of hours to a day, while maintaining good accuracy? We introduce TURBO-SMT, a meta-method capable of doing exactly that: it boosts the performance of any CMTF algorithm, by up to 200×, along with an up to 65 fold increase in sparsity, with comparable accuracy to the baseline. We apply TURBO-SMT to BRAINQ, a dataset consisting of a (nouns, brain voxels, human subjects) tensor and a (nouns, properties) matrix, with coupling along the nouns dimension. TURBO-SMT is able to find meaningful latent variables, as well as to predict brain activity with competitive accuracy.

16.
Big Data ; 2(4): 216-29, 2014 Dec.
Article in English | MEDLINE | ID: mdl-27442756

ABSTRACT

Given a simple noun such as apple, and a question such as "Is it edible?," what processes take place in the human brain? More specifically, given the stimulus, what are the interactions between (groups of) neurons (also known as functional connectivity) and how can we automatically infer those interactions, given measurements of the brain activity? Furthermore, how does this connectivity differ across different human subjects? In this work, we show that this problem, even though originating from the field of neuroscience, can benefit from big data techniques; we present a simple, novel good-enough brain model, or GeBM in short, and a novel algorithm Sparse-SysId, which are able to effectively model the dynamics of the neuron interactions and infer the functional connectivity. Moreover, GeBM is able to simulate basic psychological phenomena such as habituation and priming (whose definition we provide in the main text). We evaluate GeBM by using real brain data. GeBM produces brain activity patterns that are strikingly similar to the real ones, where the inferred functional connectivity is able to provide neuroscientific insights toward a better understanding of the way that neurons interact with each other, as well as detect regularities and outliers in multisubject brain activity measurements.

17.
Comput Math Methods Med ; 2013: 545613, 2013.
Article in English | MEDLINE | ID: mdl-23710252

ABSTRACT

Recently, data with complex characteristics such as epilepsy electroencephalography (EEG) time series have emerged. Epilepsy EEG data has special characteristics including nonlinearity, nonnormality, and nonperiodicity. Therefore, it is important to find a suitable forecasting method that covers these special characteristics. In this paper, we propose a coercively adjusted autoregression (CA-AR) method that forecasts future values from a multivariable epilepsy EEG time series. We use the technique of random coefficients, which forcefully adjusts the coefficients with -1 and 1. The fractal dimension is used to determine the order of the CA-AR model. We applied the CA-AR method, which reflects these special characteristics of the data, to forecast future values of epilepsy EEG data. Experimental results show that, compared to previous methods, the proposed method forecasts faster and more accurately.


Subjects
Diagnosis, Computer-Assisted/statistics & numerical data , Electroencephalography/statistics & numerical data , Epilepsy/diagnosis , Models, Neurological , Computational Biology , Humans , Neural Networks, Computer , Regression Analysis
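
For readers unfamiliar with autoregressive forecasting, the baseline that the abstract above builds on, the sketch below fits an ordinary first-order autoregression x_t ≈ a · x_(t-1) by least squares and iterates it forward. It is plain AR(1), not CA-AR: the coefficient adjustment with -1 and 1 and the fractal-dimension choice of model order are not reproduced, and the function names and toy series are assumptions.

```python
def fit_ar1(series):
    """Least-squares estimate of a in x_t = a * x_(t-1) (no intercept)."""
    num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    den = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    if den == 0:
        raise ValueError("series too short or all zeros")
    return num / den


def forecast_ar1(series, steps):
    """Forecast `steps` future values by repeatedly applying the fitted coefficient."""
    a = fit_ar1(series)
    last = series[-1]
    predictions = []
    for _ in range(steps):
        last = a * last
        predictions.append(last)
    return predictions


# Toy series (an assumption): exact doubling, so the fitted coefficient is 2.
toy = [1, 2, 4, 8]
coef = fit_ar1(toy)
future = forecast_ar1(toy, steps=2)
```

A higher-order model would regress each value on several previous values instead of one. The abstract indicates that CA-AR selects that order from the fractal dimension of the series.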