Results 1 - 17 of 17
1.
NPJ Digit Med ; 6(1): 92, 2023 May 22.
Article in English | MEDLINE | ID: mdl-37217691

ABSTRACT

In machine learning (ML), association patterns in the data, paths in decision trees, and weights between layers of the neural network are often entangled due to multiple underlying causes, thus masking the pattern-to-source relation, weakening prediction, and defying explanation. This paper presents a revolutionary ML paradigm: pattern discovery and disentanglement (PDD) that disentangles associations and provides an all-in-one knowledge system capable of (a) disentangling patterns to associate with distinct primary sources; (b) discovering rare/imbalanced groups, detecting anomalies and rectifying discrepancies to improve class association, pattern and entity clustering; and (c) organizing knowledge for statistically supported interpretability for causal exploration. Results from case studies have validated such capabilities. The explainable knowledge reveals pattern-source relations on entities, and underlying factors for causal inference, and clinical study and practice; thus, addressing the major concern of interpretability, trust, and reliability when applying ML to healthcare, which is a step towards closing the AI chasm.

2.
Sci Rep ; 11(1): 5688, 2021 03 11.
Article in English | MEDLINE | ID: mdl-33707478

ABSTRACT

Machine learning has made impressive advances in many applications akin to human cognition for discernment. However, success has been limited in the areas of relational datasets, particularly for data with low volume, imbalanced groups, and mislabeled cases, with outputs that typically lack transparency and interpretability. The difficulties arise from the subtle overlapping and entanglement of functional and statistical relations at the source level. Hence, we have developed the Pattern Discovery and Disentanglement System (PDD), which is able to discover explicit patterns from data of various sizes and with imbalanced groups, and to screen out anomalies. We present herein four case studies on biomedical datasets to substantiate the efficacy of PDD. It improves prediction accuracy and facilitates transparent interpretation of discovered knowledge in an explicit representation framework, the PDD Knowledge Base, that links the sources, the patterns, and individual patients. Hence, PDD promises broad and ground-breaking applications in genomic and biomedical machine learning.

3.
BMC Med Inform Decis Mak ; 21(1): 16, 2021 01 09.
Article in English | MEDLINE | ID: mdl-33422088

ABSTRACT

BACKGROUND: Statistical data analysis, especially advanced machine learning (ML) methods, has attracted considerable interest in clinical practice. We are looking for interpretability of the diagnostic/prognostic results that will bring confidence to doctors, patients and their relatives in therapeutics and clinical practice. When datasets are imbalanced in diagnostic categories, we notice that ordinary ML methods might produce results overwhelmed by the majority classes, diminishing prediction accuracy. Hence, methods are needed that can produce explicit, transparent and interpretable results in decision-making, without sacrificing accuracy, even for data with imbalanced groups. METHODS: In order to interpret the clinical patterns and conduct diagnostic prediction of patients with high accuracy, we develop a novel method, Pattern Discovery and Disentanglement for Clinical Data Analysis (cPDD), which is able to discover patterns (correlated traits/indicants) and use them to classify clinical data even if the class distribution is imbalanced. In the most general setting, a relational dataset is a large table such that each column represents an attribute (trait/indicant), and each row contains a set of attribute values (AVs) of an entity (patient). Compared to the existing pattern discovery approaches, cPDD can discover a small succinct set of statistically significant high-order patterns from clinical data for interpreting and predicting the disease class of the patients even when groups are small and rare. RESULTS: Experiments on synthetic and thoracic clinical datasets showed that cPDD can 1) discover a smaller set of succinct significant patterns compared to other existing pattern discovery methods; 2) allow the users to interpret succinct sets of patterns coming from uncorrelated sources, even when the groups are rare/small; and 3) obtain better performance in prediction compared to other interpretable classification approaches.
CONCLUSIONS: In conclusion, cPDD discovers fewer patterns with greater comprehensive coverage to improve the interpretability of patterns discovered. Experimental results on synthetic data validated that cPDD discovers all patterns implanted in the data, displays them precisely and succinctly with statistical support for interpretation and prediction, a capability which the traditional ML methods lack. The success of cPDD as a novel interpretable method in solving the imbalanced class problem shows its great potential for clinical data analysis for years to come.


Subjects
Algorithms, Machine Learning, Statistical Data Interpretation, Humans
4.
BMC Med Genomics ; 11(Suppl 5): 103, 2018 Nov 20.
Article in English | MEDLINE | ID: mdl-30453949

ABSTRACT

BACKGROUND: A protein family has similar and diverse functions locally conserved. An aligned pattern cluster (APC) can reflect the conserved functionality. Discovering aligned residue associations (ARAs) in APCs can reveal subtle inner working characteristics of conserved regions of protein families. However, ARAs corresponding to different functionalities/subgroups/classes could be entangled because of subtle multiple entwined factors. METHODS: To discover and disentangle patterns from mixed-mode datasets, such as APCs when the residues are replaced by a list of their fundamental biochemical properties, this paper presents a novel method, Extended Aligned Residual Association Discovery and Disentanglement (E-ARADD). E-ARADD discretizes the numerical data to transform the mixed-mode dataset into an event-value dataset, constructs an ARA Frequency Matrix and then converts it into an adjusted Statistical Residual (SR) Vector Space (SRV) capturing statistical deviation from randomness. By applying Principal Component (PC) Decomposition to the SRV, PCs ranked by their variance are obtained. Finally, the disentangled ARAs are discovered when the projections on a PC are re-projected to a vector space with the same basis vectors as the SRV. RESULTS: Experiments on synthetic, cytochrome c and class A scavenger receptor data have shown that E-ARADD can a) disentangle the entwined ARAs in APCs (with residues or biochemical properties), and b) reveal subtle AR clusters relating to classes, subtle subgroups or specific functionalities. CONCLUSIONS: E-ARADD can discover and disentangle ARs and ARAs entangled in functionality and location of protein families to reveal functional subgroups and subgroup characteristics of biological conserved regions. Experimental results on synthetic data provide proof-of-concept validation of the successful disentanglement that reveals class-associated ARAs with or without class labels as input. Experiments on cytochrome c data proved the efficacy of E-ARADD in handling both types of residue data. Our novel methodology can not only discover and disentangle ARs and ARAs in specific statistical/functional spaces (PCs and RSRVs) but also identify their locations in the protein family functional domains. The success of E-ARADD shows its great potential for proteomic research, drug discovery and precision and personalized genetic medicine.
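
The first E-ARADD steps named in the abstract above (building an ARA frequency matrix and converting it into adjusted statistical residuals) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the "adjusted statistical residual" is Haberman's adjusted standardized residual of a two-way contingency table, and the helper names `pair_frequency_table` and `adjusted_residuals` are hypothetical. The principal component decomposition and re-projection steps are omitted.

```python
import math
from collections import Counter


def pair_frequency_table(aligned, col_a, col_b):
    """Count how often each residue pair occurs at two aligned columns.

    `aligned` is a list of equal-length strings (hypothetical toy input).
    """
    counts = Counter((seq[col_a], seq[col_b]) for seq in aligned)
    rows = sorted({a for a, _ in counts})
    cols = sorted({b for _, b in counts})
    table = [[counts[(a, b)] for b in cols] for a in rows]
    return rows, cols, table


def adjusted_residuals(table):
    """Adjusted standardized residuals (assumed form) of a two-way table.

    residual = (observed - expected)
               / sqrt(expected * (1 - row_total/N) * (1 - col_total/N))
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            variance = expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            # A zero variance means the cell carries no information.
            out_row.append((observed - expected) / math.sqrt(variance) if variance > 0 else 0.0)
        result.append(out_row)
    return result


if __name__ == "__main__":
    toy_alignment = ["AC"] * 10 + ["GT"] * 10
    rows, cols, table = pair_frequency_table(toy_alignment, 0, 1)
    for label, residual_row in zip(rows, adjusted_residuals(table)):
        print(label, [round(r, 3) for r in residual_row])
```

Large positive residuals mark residue pairs that co-occur more often than independence would predict. Large negative residuals mark pairs that co-occur less often than expected.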


Subjects
Computational Biology/methods, Algorithms, Cluster Analysis, Cytochromes c/chemistry, Cytochromes c/metabolism, Principal Component Analysis
5.
Sci Rep ; 8(1): 14841, 2018 10 04.
Article in English | MEDLINE | ID: mdl-30287904

ABSTRACT

Residue-residue close contact (R2R-C) data procured from three-dimensional protein-protein interaction (PPI) experiments is currently used for predicting residue-residue interaction (R2R-I) in PPI. However, due to complex physicochemical environments, R2R-I incidences, facilitated by multiple factors, are usually entangled in the source environment and masked in the acquired data. Here we present a novel method, P2K (Pattern to Knowledge), to disentangle R2R-I patterns and render highly succinct discriminative information expressed in different specific R2R-I statistical/functional spaces. Since such knowledge is not visible in the data acquired, we refer to it as deep knowledge. Leveraging the deep knowledge discovered to construct machine learning models for sequence-based R2R-I prediction, without trial-and-error combination of the features over external knowledge of sequences, our R2R-I predictor was validated for its effectiveness under stringent leave-one-complex-out-alone cross-validation in a benchmark dataset, and was surprisingly demonstrated to perform better than an existing sequence-based R2R-I predictor by 28% (p: 1.9E-08). P2K is accessible via our web server at https://p2k.uwaterloo.ca.

6.
IEEE Trans Nanobioscience ; 17(3): 209-218, 2018 07.
Article in English | MEDLINE | ID: mdl-29994222

ABSTRACT

Functional region identification is of fundamental importance for protein sequence analysis. Such knowledge provides better scientific understanding and could assist drug discovery. To date, domain annotation is one approach, but it needs to leverage existing databases. For de novo discovery, motif discovery locates and aligns locally homologous sub-sequences to obtain a position-weight matrix (PWM), which is a fixed-length representation model, whereas protein functional region size varies. It thus requires a computationally expensive exhaustive search to obtain a PWM with a width in the optimal range. This paper presents a new method known as pattern-directed aligned pattern clustering (PD-APCn) to discover and align patterns in conserved protein functional regions. It adopts the aligned pattern cluster (APC) with patterns of variable length and strong support to direct the incremental APC expansion. It allows substitution and frame-shift mutations until a robust termination condition is reached. The concept of a breakpoint gap is introduced to identify spots of mutations, such as substitutions and frame shifts. Experiments on synthetic data sets with different sizes and noise levels showed that PD-APCn outperforms MEME with much higher recall and F-measure and runs 665 times faster than MEME. When applied to the Cytochrome C and Ubiquitin families, it found all key binding sites within the APCs.


Subjects
Computational Biology/methods, Automated Pattern Recognition/methods, Sequence Alignment/methods, Protein Sequence Analysis/methods, Algorithms, Cluster Analysis, Protein Databases, Humans, Proteins/chemistry, Proteins/genetics
7.
Proteomes ; 6(1)2018 Feb 08.
Article in English | MEDLINE | ID: mdl-29419792

ABSTRACT

A protein family has similar and diverse functions locally conserved as aligned sequence segments. Further discovering their association patterns could reveal subtle family subgroup characteristics. Since aligned residue associations (ARAs) in Aligned Pattern Clusters (APCs) are complex and intertwined due to entangled function, factors, and variance in the source environment, we have recently developed a novel method, Aligned Residue Association Discovery and Disentanglement (ARADD), to solve this problem. ARADD first obtains from an APC an ARA Frequency Matrix and converts it to an adjusted statistical residual vector space (SRV). It then disentangles the SRV into Principal Components (PCs) and re-projects their vectors to an SRV to reveal succinct orthogonal AR groups. In this study, we applied ARADD to class A scavenger receptors (SR-A), a subclass of a diverse protein family binding to modified lipoproteins with diverse biological functionalities not explicitly known. Our experimental results demonstrated that ARADD can unveil subtle subgroups in sequence segments with diverse functionality and highly variable sequence lengths. We also demonstrated that the ARAs captured in a Position Weight Matrix or an APC were entangled in biological function and domain location but disentangled by ARADD to reveal different subclasses without knowing their actual occurrence positions.

8.
Article in English | MEDLINE | ID: mdl-26336137

ABSTRACT

Understanding binding cores is of fundamental importance in deciphering Protein-DNA (TF-TFBS) binding and gene regulation. Limited by expensive experiments, it is promising to discover them with variations directly from sequence data. Although existing computational methods have produced satisfactory results, they are one-to-one mappings with no site-specific information on residue/nucleotide variations, where these variations in binding cores may impact binding specificity. This study presents a new representation for modeling binding cores by incorporating variations and an algorithm to discover them from only sequence data. Our algorithm takes protein and DNA sequences from TRANSFAC (a Protein-DNA Binding Database) as input; discovers from both sets of sequences conserved regions in Aligned Pattern Clusters (APCs); associates them as Protein-DNA Co-Occurring APCs; ranks the Protein-DNA Co-Occurring APCs according to their co-occurrence, and among the top ones, finds three-dimensional structures to support each binding core candidate. If successful, candidates are verified as binding cores. Otherwise, homology modeling is applied to their close matches in PDB to attain new chemically feasible binding cores. Our algorithm obtains binding cores with higher precision and much faster runtime ( ≥ 1,600x) than that of its contemporaries, discovering candidates that do not co-occur as one-to-one associated patterns in the raw data. AVAILABILITY: http://www.pami.uwaterloo.ca/~ealee/files/tcbbPnDna2015/Release.zip.


Subjects
Cluster Analysis, Computational Biology/methods, DNA-Binding Proteins/chemistry, DNA/chemistry, Sequence Alignment/methods, Algorithms, DNA/analysis, DNA/genetics, DNA/metabolism, DNA-Binding Proteins/genetics, DNA-Binding Proteins/metabolism, Data Mining, Protein Binding, DNA Sequence Analysis, Protein Sequence Analysis
9.
Methods ; 110: 26-34, 2016 11 01.
Article in English | MEDLINE | ID: mdl-27476008

ABSTRACT

Predicting Protein-Protein Interaction (PPI) is important for making new discoveries in the molecular mechanisms inside a cell. Traditionally, new PPIs are identified through biochemical experiments, but such methods are labor-intensive, expensive, time-consuming and technically ineffective due to high false positive rates. Sequence-based prediction is currently the most readily applicable and cost-effective method. It exploits known PPI databases to construct classifiers for predicting unknown PPIs based only on sequence data without requiring any other prior knowledge. Among existing sequence-based methods, most feature-based methods use exact sequence patterns with fixed length as features - a constraint which is biologically unrealistic. SVM with Pairwise String Kernel gives better predictive performance. However, it is difficult to interpret biologically because it is kernel-based and computes no concrete feature values. Here we have developed a novel method, WeMine-P2P, to overcome these drawbacks. By assuming that the regions/sites that mediate PPI are more conserved, WeMine-P2P first discovers/locates the conserved sequence patterns in protein sequences in the form of Aligned Pattern Clusters (APCs), allowing pattern variations with variable length. It then pairs up all APCs into a set of Co-Occurring APC (cAPC) pairs, and computes a cAPC-PPI score for each cAPC pair on all PPI pairs. It further constructs a feature vector composed of all cAPC pairs with their cAPC-PPI scores for each PPI pair and uses them for constructing a PPI predictor. Through 40 independent experiments, we showed that (1) WeMine-P2P outperforms the well-known algorithm, PIPE2, which also utilizes co-occurring amino acid sequence segments but does not allow variable lengths and pattern variations; (2) WeMine-P2P achieves satisfactory PPI prediction performance, comparable to the SVM-based methods particularly among unseen protein sequences, with a potential reduction of feature dimension of 1280×; (3) unlike SVM-based methods, WeMine-P2P renders interpretable biological features from which we observed that co-occurring sequence patterns from the compositional bias regions are more discriminative. WeMine-P2P is extendable to predict other biosequence interactions such as Protein-DNA interactions.


Subjects
Computational Biology/methods, Protein Interaction Mapping/methods, Protein Interaction Maps/genetics, Protein Sequence Analysis/methods, Algorithms, Amino Acid Sequence/genetics
10.
Bioinformatics ; 32(16): 2427-34, 2016 08 15.
Article in English | MEDLINE | ID: mdl-27153647

ABSTRACT

MOTIVATION: Evolutionarily conserved amino acids within proteins characterize functional or structural regions. Conversely, less conserved amino acids within these regions are generally areas of evolutionary divergence. A priori knowledge of biological function and species can help interpret the amino acid differences between sequences. However, this information is often erroneous or unavailable, hampering discovery with supervised algorithms. Also, most of the current unsupervised methods depend on full sequence similarity and become inaccurate when proteins diverge (e.g. inversions, deletions, insertions). Due to these and other shortcomings, we developed a novel unsupervised algorithm which discovers highly conserved regions and uses two types of information measures: (i) data measures computed from input sequences; and (ii) class measures computed using a priori class groupings in order to reveal subgroups (i.e. classes) or functional characteristics. RESULTS: Using known and putative sequences of two proteins belonging to a relatively uncharacterized protein family, we were able to group evolutionarily related sequences and identify conserved regions, which are strong homologous association patterns called Aligned Pattern Clusters, within individual proteins and across the members of this family. An initial synthetic demonstration and in silico results reveal that (i) the data measures are unbiased and (ii) our class measures can accurately rank the quality of the evolutionarily relevant groupings. Furthermore, combining our data and class measures allowed us to interpret the results by inferring regions of biological importance within the binding domain of these proteins. Compared to popular supervised methods, our algorithm has a superior runtime and comparable accuracy.
AVAILABILITY AND IMPLEMENTATION: The dataset and results are available at www.pami.uwaterloo.ca/~ealee/files/classification2015
CONTACT: akcwong@uwaterloo.ca
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms, Sequence Alignment, Amino Acid Sequence, Amino Acids, Computational Biology/methods, Protein Databases, Proteins, Protein Sequence Analysis/methods
11.
BMC Bioinformatics ; 15 Suppl 12: S2, 2014.
Article in English | MEDLINE | ID: mdl-25474736

ABSTRACT

BACKGROUND: The large influx of biological sequences underscores the importance of identifying and correlating conserved regions in homologous sequences to acquire valuable biological knowledge. These conserved regions contain statistically significant residue associations as sequence patterns. Thus, patterns from two conserved regions co-occurring frequently on the same sequences are inferred to have joint functionality. A method for finding conserved regions in protein families with frequent co-occurrence patterns is proposed. The biological significance of the discovered clusters of conserved regions with co-occurrence patterns can be validated by their three-dimensional closeness of amino acids and the biological functionality found in those regions as supported by published work. METHODS: Using existing algorithms, we discovered statistically significant amino acid associations as sequence patterns. We then aligned and clustered them into Aligned Pattern Clusters (APCs) corresponding to conserved regions with amino acid conservation and variation. When one APC frequently co-occurred with another APC, the two APCs have high co-occurrence. We then clustered APCs with high co-occurrence into what we refer to as Co-occurrence APC Clusters (Co-occurrence Clusters). RESULTS: Our results show that for Co-occurrence Clusters, the three-dimensional distance between their amino acids is closer than average amino acid distances. For the Co-occurrence Clusters of the ubiquitin and the cytochrome c families, we observed biological significance among the residing amino acids of the APCs within the same cluster. In ubiquitin, the residues are responsible for ubiquitination as well as conventional and unconventional ubiquitin-bindings. In cytochrome c, amino acids in the first co-occurrence cluster contribute to binding of other proteins in the electron transport chain, and amino acids in the second co-occurrence cluster contribute to the stability of the axial heme ligand.
CONCLUSIONS: Thus, our co-occurrence clustering algorithm can efficiently find and rank conserved regions that contain patterns that frequently co-occur on the same proteins. Co-occurring patterns are biologically significant due to their three-dimensional closeness and other evidence reported in the literature. These results play an important role in drug discovery as biologists can quickly identify the target for drugs to conduct detailed preclinical studies.
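
The idea of ranking APC pairs by how often they occur on the same sequences can be illustrated with a short sketch. The abstract does not state the co-occurrence score, so this example assumes a Jaccard index over the sets of sequence IDs in which each APC occurs. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def cooccurrence(seqs_a, seqs_b):
    """Jaccard index of two sets of sequence IDs (an assumed score)."""
    a, b = set(seqs_a), set(seqs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_apc_pairs(occurrences):
    """Rank every APC pair by co-occurrence, highest first.

    `occurrences` maps an APC label to the IDs of the sequences containing it.
    """
    scored = [
        (cooccurrence(occurrences[p], occurrences[q]), p, q)
        for p, q in combinations(sorted(occurrences), 2)
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    toy = {"APC1": {1, 2, 3, 4}, "APC2": {1, 2, 3}, "APC3": {4, 5}}
    for score, p, q in rank_apc_pairs(toy):
        print(p, q, round(score, 2))
```

Pairs at the top of the ranking would be the candidates to group into a co-occurrence cluster. The clustering step itself is not shown.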


Subjects
Algorithms, Protein Sequence Analysis/methods, Amino Acid Sequence Homology, Amino Acids/chemistry, Cluster Analysis, Cytochromes c/chemistry, Protein Conformation, Proteins/chemistry, Proteins/classification, Sequence Alignment, Ubiquitin/chemistry
12.
Article in English | MEDLINE | ID: mdl-26356022

ABSTRACT

Discovering sequence patterns with variations unveils significant functions of a protein family. Existing combinatorial methods of discovering patterns with variations are computationally expensive, and probabilistic methods require more elaborate probabilistic representation of the amino acid associations. To overcome these shortcomings, this paper presents a new computationally efficient method for representing patterns with variations in a compact representation called Aligned Pattern Cluster (AP Cluster). To tackle the runtime, our method discovers a shortened list of non-redundant statistically significant sequence associations based on our previous work. To address the representation of protein functional regions, our pattern alignment and clustering step, presented in this paper, captures the conservations and variations of the aligned patterns. We further refine our solution to allow more coverage of sequences by extending the AP Clusters containing only statistically significant patterns to Weak and Conserved AP Clusters. When applied to the cytochrome c, the ubiquitin, and the triosephosphate isomerase protein families, our algorithm identifies the binding segments as well as the binding residues. When compared to other methods, ours discovers all binding sites in the AP Clusters with superior entropy and coverage. The identification of patterns with variations helps biologists to avoid time-consuming simulations and experiments. (Software available upon request).


Subjects
Computational Biology/methods, Proteins/chemistry, Proteins/classification, Sequence Alignment/methods, Algorithms, Cluster Analysis, Proteins/physiology, Protein Sequence Analysis
13.
BMC Bioinformatics ; 12 Suppl 5: S5, 2011.
Article in English | MEDLINE | ID: mdl-21989090

ABSTRACT

BACKGROUND: Discovering patterns from gene expression levels is regarded as a classification problem when tissue classes of the samples are given and solved as a discrete-data problem by discretizing the expression levels of each gene into intervals maximizing the interdependence between that gene and the class labels. However, when class information is unavailable, discovering gene expression patterns becomes difficult. METHODS: For a gene pool with a large number of genes, we first cluster the genes into smaller groups. In each group, we use the representative gene, the one with the highest interdependence with others in the group, to drive the discretization of the gene expression levels of other genes. Treating intervals as discrete events, association patterns of events can be discovered. If the gene groups obtained are crisp gene clusters, significant patterns overlapping different gene clusters cannot be found. This paper presents a new method of "fuzzifying" the crisp gene clusters to overcome this problem. RESULTS: To evaluate the effectiveness of our approach, we first apply the above described procedure to a synthetic data set and then to a gene expression data set with known class labels. The class labels are not used in either analysis but are used later as the ground truth in a classificatory problem for assessing the algorithm's effectiveness in fuzzy gene clustering and discretization. The results show the efficacy of the proposed method. The existence of correlation among continuous valued gene expression levels suggests that certain genes in the gene groups have high interdependence with other genes in the group. Fuzzification of a crisp gene cluster allows the cluster to take in genes from other clusters so that overlapping relationships among gene clusters can be uncovered. Hence, previously unknown hidden patterns residing in overlapping gene clusters are discovered. From the experimental results, the high order patterns discovered reveal multiple gene interaction patterns in cancerous tissues not found in normal tissues. It was also found that for the colon cancer experiment, 70% of the top patterns and most of the discriminative patterns between cancerous and normal tissues are among those spanning different crisp gene clusters. CONCLUSIONS: We show that the proposed method for analyzing the error-prone microarray is effective even without the presence of tissue class information. A unified framework is presented, allowing fast and accurate pattern discovery for gene expression data. For a large gene set, to discover a comprehensive set of patterns, gene clustering, gene expression discretization and gene cluster fuzzification are absolutely necessary.


Subjects
Algorithms, Colonic Neoplasms/genetics, Gene Expression Profiling/methods, Cluster Analysis, Humans, Oligonucleotide Array Sequence Analysis/methods
14.
J Bioinform Comput Biol ; 8(5): 789-807, 2010 Oct.
Article in English | MEDLINE | ID: mdl-20981888

ABSTRACT

Comparative genomics is concerned with the study of genome structure and function of different species. It can provide useful information for the derivation of evolutionary and functional relationships between genomes. Previous work on genome comparison focuses mainly on comparing the entire genomes for visualization without further analysis. As many interesting patterns may exist between genomes and may lead to the discovery of functional gene segments (groups of genes), we propose an algorithm called Multi-Level Genome Comparison Algorithm (MGC) that can be used to facilitate the analysis of genomes at multiple levels during the comparison process to discover sequential and regional consistency in gene segments. Different genomes may have common sub-sequences that differ from each other due to mutations, lateral gene transfers, gene rearrangements, etc., and these sub-sequences are usually not easily identified. Not all the genes can have a perfect one-to-one matching with each other. It is quite possible for one-to-many or many-to-many ambiguous relationships to exist between them. To perform the tasks effectively, MGC takes such ambiguity into consideration during genome comparison by representing genomes in a graph and then makes use of a graph mining algorithm called the Multi-Level Attributed Graph Mining Algorithm (MAGMA) to build a hierarchical multi-level graph structure to facilitate genome comparison. To determine the effectiveness of these proposed algorithms, experiments were performed using intra- and inter-species comparisons of microbial genomes. The results show that the proposed algorithms are able to discover multiple level matching patterns that show the similarities and dissimilarities among different genomes, in addition to confirming the specific role of the genes in the genomes.


Subjects
Algorithms, Data Mining/methods, Genomics/statistics & numerical data, Animals, Chlamydia muridarum/classification, Chlamydia muridarum/genetics, Chlamydiales/classification, Chlamydiales/genetics, Chlamydophila pneumoniae/classification, Chlamydophila pneumoniae/genetics, Computational Biology, Bacterial Genome, Humans, Genetic Models, Sequence Alignment/statistics & numerical data, Species Specificity
15.
J Comput Biol ; 15(6): 625-37, 2008.
Article in English | MEDLINE | ID: mdl-18631025

ABSTRACT

This paper reports the discovery of statistically significant association patterns of gene expression levels from microarray data. By association patterns, we mean certain gene expression intensity intervals having statistically significant associations among themselves and with the tissue classes, such as cancerous and normal tissues. We describe how the significance of the associations among gene expression levels can be evaluated using a statistical measure in an objective manner. If an association is found to be significant based on the measure, we say that it is statistically significant. Given a gene expression data set, we first cluster the entire gene pool comprising all the genes into groups by optimizing the correlation (or more precisely, interdependence) among the gene expression levels within gene groups. From each group, we select one or several genes that are most correlated with other genes within that group to form a smaller gene pool. This gene pool then constitutes the most representative genes from the original pool. Our pattern discovery algorithm is then used, for the first time, to discover the significant association patterns of gene expression levels among the genes from the small pool. With our method, it is more effective to discover and express the associations in terms of their intensity intervals. Hence, we discretize each gene's expression levels into intervals maximizing the interdependence between the gene expression and the tissue classes. From this data set of gene expression intervals, we discover the association patterns representing statistically significant associations, some positive and some negative, with different tissue classes. We apply our pattern discovery methodology to the colon-cancer microarray gene expression data set. It consists of 2000 genes and 62 samples taken from colon cancer or normal subjects. The statistically significant combinations of gene expression levels that repress or activate colon cancer are revealed in the colon-cancer data set. The discovered association patterns are ranked according to their statistical significance and displayed for interpretation and further analysis.
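
The class-dependent discretization step described above can be sketched for the simplest case of a single cut point. This is an illustration under stated assumptions, not the paper's algorithm: interdependence is approximated here by the mutual information between the binned expression level and the tissue class, only one binary cut is searched, and the function names and toy values are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def best_cut_point(values, labels):
    """Return (cut, score) for the binary split that is most informative
    about the class labels, or (None, 0.0) if no split carries information."""
    candidates = sorted(set(values))
    best_cut, best_score = None, 0.0
    # Candidate cuts are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        cut = (lo + hi) / 2
        bins = [v > cut for v in values]
        score = mutual_information(bins, labels)
        if score > best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score


if __name__ == "__main__":
    expression = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
    tissue = ["normal"] * 3 + ["tumor"] * 3
    print(best_cut_point(expression, tissue))
```

In this toy case the selected cut separates the two tissue classes perfectly, so the score reaches its maximum for a balanced binary class, ln 2.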


Subjects
Colonic Neoplasms/metabolism, Gene Expression Profiling, Automated Pattern Recognition, Algorithms, Humans, Oligonucleotide Array Sequence Analysis/methods
16.
IEEE Trans Syst Man Cybern B Cybern ; 37(4): 1022-9, 2007 Aug.
Article in English | MEDLINE | ID: mdl-17702298

ABSTRACT

This correspondence presents a two-stage classification learning algorithm. The first stage approximates the class-conditional distribution of a discrete space using a separate mixture model, and the second stage investigates the class posterior probabilities by training a network. The first stage explores the generative information that is inherent in each class by using the Chow-Liu (CL) method, which approximates high-dimensional probability with a tree structure, namely, a dependence tree, whereas the second stage concentrates on discriminative learning to distinguish between classes. The resulting learning algorithm integrates the advantages of both generative learning and discriminative learning. Because it uses CL dependence-tree estimation, we call our algorithm CL-Net. Empirical tests indicate that the proposed learning algorithm makes significant improvements when compared with the related classifiers that are constructed by either generative learning or discriminative learning.
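
The Chow-Liu step mentioned above can be sketched as follows: estimate the mutual information between every pair of discrete variables, then keep the maximum-weight spanning tree over those pairs. This is a minimal illustration of the general Chow-Liu method, not the CL-Net implementation. The function names are hypothetical, and the mixture model and the discriminative network stage are omitted.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def chow_liu_edges(rows):
    """Edges of a maximum-weight spanning tree over the columns of `rows`.

    Edge weights are pairwise mutual information; the tree is built with
    Kruskal's algorithm and a small union-find structure.
    """
    columns = list(zip(*rows))
    d = len(columns)
    # Heaviest (most dependent) pairs first.
    weighted = sorted(
        ((mutual_information(columns[i], columns[j]), i, j)
         for i, j in combinations(range(d), 2)),
        reverse=True,
    )
    parent = list(range(d))

    def find(a):
        # Follow parent links to the set representative, halving the path.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    edges = []
    for _, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding this edge creates no cycle
            parent[ri] = rj
            edges.append((i, j))
    return edges


# Toy data: column 1 copies column 0, column 2 is independent of both.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
edges = chow_liu_edges(rows)
```

In a two-stage setup like the one described, one such tree would be estimated per class from that class's samples only.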


Subjects
Algorithms , Decision Support Techniques , Discriminant Analysis , Models, Theoretical , Neural Networks, Computer , Pattern Recognition, Automated/methods , Computer Simulation
17.
Article in English | MEDLINE | ID: mdl-17044174

ABSTRACT

This paper presents an attribute clustering method that groups genes based on their interdependence so as to mine meaningful patterns from gene expression data. It can be used for gene grouping, selection, and classification. Partitioning a relational table into attribute subgroups allows a small number of attributes within or across the groups to be selected for analysis. By clustering attributes, the search dimension of a data mining algorithm is reduced. This reduction is especially important for data mining in gene expression data because such data typically consist of a huge number of genes (attributes) and a small number of gene expression profiles (tuples). Most data mining algorithms are developed and optimized to scale with the number of tuples rather than the number of attributes. The situation becomes even worse when the number of attributes overwhelms the number of tuples, in which case the likelihood of reporting patterns that are actually irrelevant and arise only by chance becomes rather high. For these reasons, gene grouping and selection are important preprocessing steps for many data mining algorithms to be effective when applied to gene expression data. This paper defines the problem of attribute clustering and introduces a methodology for solving it. Our proposed method groups interdependent attributes into clusters by optimizing a criterion function derived from an information measure that reflects the interdependence between attributes. By applying our algorithm to gene expression data, meaningful clusters of genes are discovered. Grouping genes based on attribute interdependence within each group helps to capture different aspects of gene association patterns in each group. Significant genes selected from each group then contain useful information for gene expression classification and identification.
To evaluate the performance of the proposed approach, we applied it to two well-known gene expression data sets and compared our results with those obtained by other methods. Our experiments show that the proposed method is able to find meaningful clusters of genes. By selecting a subset of genes that have high multiple-interdependence with others within clusters, significant classification information can be obtained. Thus, a small pool of selected genes can be used to build classifiers with a very high classification rate. From the pool, gene expressions of different categories can be identified.
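
The abstract does not give the exact information measure. One plausible normalized form, assumed here for illustration, is mutual information divided by joint entropy. It is 0 for independent attributes and 1 for attributes that determine each other. The sketch below computes it for discretized attributes and picks, within a given group, the attribute with the highest summed interdependence with the others. The function names and the toy data are hypothetical, and the clustering loop itself is not shown.

```python
from collections import Counter
from math import log


def entropy(values):
    """Empirical Shannon entropy (in nats) of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def interdependence(xs, ys):
    """Mutual information normalized by joint entropy, in [0, 1]."""
    joint = entropy(list(zip(xs, ys)))
    if joint == 0.0:
        return 0.0  # both attributes are constant; nothing to share
    return (entropy(xs) + entropy(ys) - joint) / joint


def most_representative(group):
    """Name of the attribute most interdependent with the rest of `group`.

    `group` maps an attribute name to its list of discretized values.
    """
    def score(name):
        return sum(
            interdependence(group[name], group[other])
            for other in group if other != name
        )
    return max(group, key=score)


# Toy group: g1 and g2 carry the same information, g3 is unrelated.
group = {
    "g1": [0, 0, 1, 1],
    "g2": [0, 0, 1, 1],
    "g3": [0, 1, 0, 1],
}
representative = most_representative(group)
```

Selecting one or a few such representatives per cluster is what reduces the gene pool before classification.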


Subjects
Cluster Analysis , Databases, Protein , Gene Expression Profiling/methods , Information Storage and Retrieval/methods , Multigene Family/physiology , Oligonucleotide Array Sequence Analysis/methods , Pattern Recognition, Automated/methods , Algorithms