Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 21
Filtrar
1.
BMC Bioinformatics ; 25(1): 225, 2024 Jun 26.
Artigo em Inglês | MEDLINE | ID: mdl-38926641

RESUMO

PURPOSE: Large Language Models (LLMs) like Generative Pre-trained Transformer (GPT) from OpenAI and LLaMA (Large Language Model Meta AI) from Meta AI are increasingly recognized for their potential in the field of cheminformatics, particularly in understanding Simplified Molecular Input Line Entry System (SMILES), a standard method for representing chemical structures. These LLMs also have the ability to decode SMILES strings into vector representations. METHOD: We investigate the performance of GPT and LLaMA compared to pre-trained models on SMILES in embedding SMILES strings on downstream tasks, focusing on two key applications: molecular property prediction and drug-drug interaction prediction. RESULTS: We find that SMILES embeddings generated using LLaMA outperform those from GPT in both molecular property and DDI prediction tasks. Notably, LLaMA-based SMILES embeddings show results comparable to pre-trained models on SMILES in molecular prediction tasks and outperform the pre-trained models for the DDI prediction tasks. CONCLUSION: The performance of LLMs in generating SMILES embeddings shows great potential for further investigation of these models for molecular embedding. We hope our study bridges the gap between LLMs and molecular embedding, motivating additional research into the potential of LLMs in the molecular representation field. GitHub: https://github.com/sshaghayeghs/LLaMA-VS-GPT .


Assuntos
Quimioinformática , Quimioinformática/métodos , Interações Medicamentosas , Estrutura Molecular
2.
Bioinformatics ; 38(5): 1369-1377, 2022 02 07.
Artigo em Inglês | MEDLINE | ID: mdl-34875000

RESUMO

MOTIVATION: Drug repurposing is a potential alternative to the traditional drug discovery process. Drug repurposing can be formulated as a recommender system that recommends novel indications for available drugs based on known drug-disease associations. This article presents a method based on non-negative matrix factorization (NMF-DR) to predict the drug-related candidate disease indications. This work proposes a recommender system-based method for drug repurposing to predict novel drug indications by integrating drug and diseases related data sources. For this purpose, this framework first integrates two types of disease similarities, the associations between drugs and diseases, and the various similarities between drugs from different views to make a heterogeneous drug-disease interaction network. Then, an improved non-negative matrix factorization-based method is proposed to complete the drug-disease adjacency matrix with predicted scores for unknown drug-disease pairs. RESULTS: The comprehensive experimental results show that NMF-DR achieves superior prediction performance when compared with several existing methods for drug-disease association prediction. AVAILABILITY AND IMPLEMENTATION: The program is available at https://github.com/sshaghayeghs/NMF-DR. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Biologia Computacional , Reposicionamento de Medicamentos , Biologia Computacional/métodos , Reposicionamento de Medicamentos/métodos , Algoritmos , Descoberta de Drogas
3.
BMC Bioinformatics ; 23(1): 143, 2022 Apr 20.
Artigo em Inglês | MEDLINE | ID: mdl-35443626

RESUMO

'De novo' drug discovery is costly, slow, and with high risk. Repurposing known drugs for treatment of other diseases offers a fast, low-cost/risk and highly-efficient method toward development of efficacious treatments. The emergence of large-scale heterogeneous biomolecular networks, molecular, chemical and bioactivity data, and genomic and phenotypic data of pharmacological compounds is enabling the development of new area of drug repurposing called 'in silico' drug repurposing, i.e., computational drug repurposing (CDR). The aim of CDR is to discover new indications for an existing drug (drug-centric) or to identify effective drugs for a disease (disease-centric). Both drug-centric and disease-centric approaches have the common challenge of either assessing the similarity or connections between drugs and diseases. However, traditional CDR is fraught with many challenges due to the underlying complex pharmacology and biology of diseases, genes, and drugs, as well as the complexity of their associations. As such, capturing highly non-linear associations among drugs, genes, diseases by most existing CDR methods has been challenging. We propose a network-based integration approach that can best capture knowledge (and complex relationships) contained within and between drugs, genes and disease data. A network-based machine learning approach is applied thereafter by using the extracted knowledge and relationships in order to identify single and pair of approved or experimental drugs with potential therapeutic effects on different breast cancer subtypes. Indeed, further clinical analysis is needed to confirm the therapeutic effects of identified drugs on each breast cancer subtype.


Assuntos
Neoplasias da Mama , Reposicionamento de Medicamentos , Neoplasias da Mama/tratamento farmacológico , Neoplasias da Mama/genética , Biologia Computacional/métodos , Descoberta de Drogas , Reposicionamento de Medicamentos/métodos , Feminino , Humanos , Aprendizado de Máquina
4.
Brief Bioinform ; 19(2): 325-340, 2018 03 01.
Artigo em Inglês | MEDLINE | ID: mdl-28011753

RESUMO

Driven by high-throughput sequencing techniques, modern genomic and clinical studies are in a strong need of integrative machine learning models for better use of vast volumes of heterogeneous information in the deep understanding of biological systems and the development of predictive models. How data from multiple sources (called multi-view data) are incorporated in a learning system is a key step for successful analysis. In this article, we provide a comprehensive review on omics and clinical data integration techniques, from a machine learning perspective, for various analyses such as prediction, clustering, dimension reduction and association. We shall show that Bayesian models are able to use prior information and model measurements with various distributions; tree-based methods can either build a tree with all features or collectively make a final decision based on trees learned from each view; kernel methods fuse the similarity matrices learned from individual views together for a final similarity matrix or learning model; network-based fusion methods are capable of inferring direct and indirect associations in a heterogeneous network; matrix factorization models have potential to learn interactions among features from different views; and a range of deep neural networks can be integrated in multi-modal learning for capturing the complex mechanism of biological systems.


Assuntos
Redes Reguladoras de Genes , Aprendizado de Máquina , Modelos Biológicos , Biologia de Sistemas/métodos , Animais , Humanos
5.
BMC Bioinformatics ; 19(Suppl 14): 410, 2018 Nov 20.
Artigo em Inglês | MEDLINE | ID: mdl-30453876

RESUMO

BACKGROUND: The prediction of calmodulin-binding (CaM-binding) proteins plays a very important role in the fields of biology and biochemistry, because the calmodulin protein binds and regulates a multitude of protein targets affecting different cellular processes. Computational methods that can accurately identify CaM-binding proteins and CaM-binding domains would accelerate research in calcium signaling and calmodulin function. Short-linear motifs (SLiMs), on the other hand, have been effectively used as features for analyzing protein-protein interactions, though their properties have not been utilized in the prediction of CaM-binding proteins. RESULTS: We propose a new method for the prediction of CaM-binding proteins based on both the total and average scores of known and new SLiMs in protein sequences using a new scoring method called sliding window scoring (SWS) as features for the prediction module. A dataset of 194 manually curated human CaM-binding proteins and 193 mitochondrial proteins have been obtained and used for testing the proposed model. The motif generation tool, Multiple EM for Motif Elucidation (MEME), has been used to obtain new motifs from each of the positive and negative datasets individually (the SM approach) and from the combined negative and positive datasets (the CM approach). Moreover, the wrapper criterion with random forest for feature selection (FS) has been applied followed by classification using different algorithms such as k-nearest neighbors (k-NN), support vector machines (SVM), naive Bayes (NB) and random forest (RF). CONCLUSIONS: Our proposed method shows very good prediction results and demonstrates how information contained in SLiMs is highly relevant in predicting CaM-binding proteins. Further, three new CaM-binding motifs have been computationally selected and biologically validated in this study, and which can be used for predicting CaM-binding proteins.


Assuntos
Proteínas de Ligação a Calmodulina/química , Biologia Computacional/métodos , Motivos de Aminoácidos , Sequência de Aminoácidos , Teorema de Bayes , Cálcio/metabolismo , Humanos , Probabilidade , Estrutura Quaternária de Proteína , Reprodutibilidade dos Testes , Máquina de Vetores de Suporte
6.
BMC Bioinformatics ; 16 Suppl 4: S3, 2015.
Artigo em Inglês | MEDLINE | ID: mdl-25734691

RESUMO

BACKGROUND: Cellular processes are known to be modular and are realized by groups of proteins implicated in common biological functions. Such groups of proteins are called functional modules, and many community detection methods have been devised for their discovery from protein interaction networks (PINs) data. In current agglomerative clustering approaches, vertices with just a very few neighbors are often classified as separate clusters, which does not make sense biologically. Also, a major limitation of agglomerative techniques is that their computational efficiency do not scale well to large PINs. Finally, PIN data obtained from large scale experiments generally contain many false positives, and this makes it hard for agglomerative clustering methods to find the correct clusters, since they are known to be sensitive to noisy data. RESULTS: We propose a local similarity premetric, the relative vertex clustering value, as a new criterion allowing to decide when a node can be added to a given node's cluster and which addresses the above three issues. Based on this criterion, we introduce a novel and very fast agglomerative clustering technique, FAC-PIN, for discovering functional modules and protein complexes from a PIN data. CONCLUSIONS: Our proposed FAC-PIN algorithm is applied to nine PIN data from eight different species including the yeast PIN, and the identified functional modules are validated using Gene Ontology (GO) annotations from DAVID Bioinformatics Resources. Identified protein complexes are also validated using experimentally verified complexes. Computational results show that FAC-PIN can discover functional modules or protein complexes from PINs more accurately and more efficiently than HC-PIN and CNM, the current state-of-the-art approaches for clustering PINs in an agglomerative manner.


Assuntos
Algoritmos , Biologia Computacional/métodos , Mapeamento de Interação de Proteínas/métodos , Proteínas de Saccharomyces cerevisiae/metabolismo , Saccharomyces cerevisiae/metabolismo , Análise por Conglomerados , Saccharomyces cerevisiae/genética , Proteínas de Saccharomyces cerevisiae/genética , Transdução de Sinais , Vocabulário Controlado
7.
Front Pharmacol ; 13: 908549, 2022.
Artigo em Inglês | MEDLINE | ID: mdl-35873597

RESUMO

Drug repurposing is the process of discovering new indications (i.e., diseases or conditions) for already approved drugs. Many computational methods have been proposed for predicting new associations between drugs and diseases. In this article, we proposed a new method, called DR-HGNN, an integrative heterogeneous graph neural network-based method for multi-labeled drug repurposing, to discover new indications for existing drugs. For this purpose, we first used the DTINet dataset to construct a heterogeneous drug-protein-disease (DPD) network, which is a graph composed of four types of nodes (drugs, proteins, diseases, and drug side effects) and eight types of edges. Second, we labeled each drug-protein edge, dp i,j = (d i , p j ), of the DPD network with a set of diseases, {δ i,j,1, … , δ i,j,k } associated with both d i and p j and then devised multi-label ranking approaches which incorporate neural network architecture that operates on the heterogeneous graph-structured data and which leverages both the interaction patterns and the features of drug and protein nodes. We used a derivative of the GraphSAGE algorithm, HinSAGE, on the heterogeneous DPD network to learn low-dimensional vector representation of features of drugs and proteins. Finally, we used the drug-protein network to learn the embeddings of the drug-protein edges and then predict the disease labels that act as bridges between drugs and proteins. The proposed method shows better results than existing methods applied to the DTINet dataset, with an AUC of 0.964.

8.
Bioinformatics ; 26(18): 2281-8, 2010 Sep 15.
Artigo em Inglês | MEDLINE | ID: mdl-20639411

RESUMO

MOTIVATION: Clustering gene expression data given in terms of time-series is a challenging problem that imposes its own particular constraints. Traditional clustering methods based on conventional similarity measures are not always suitable for clustering time-series data. A few methods have been proposed recently for clustering microarray time-series, which take the temporal dimension of the data into account. The inherent principle behind these methods is to either define a similarity measure appropriate for temporal expression data, or pre-process the data in such a way that the temporal relationships between and within the time-series are considered during the subsequent clustering phase. RESULTS: We introduce pairwise gene expression profile alignment, which vertically shifts two profiles in such a way that the area between their corresponding curves is minimal. Based on the pairwise alignment operation, we define a new distance function that is appropriate for time-series profiles. We also introduce a new clustering method that involves multiple expression profile alignment, which generalizes pairwise alignment to a set of profiles. Extensive experiments on well-known datasets yield encouraging results of at least 80% classification accuracy.


Assuntos
Algoritmos , Perfilação da Expressão Gênica/métodos , Análise de Sequência com Séries de Oligonucleotídeos/métodos , Análise por Conglomerados , Expressão Gênica
9.
Front Genet ; 10: 256, 2019.
Artigo em Inglês | MEDLINE | ID: mdl-30972106

RESUMO

Genomic profiles among different breast cancer survivors who received similar treatment may provide clues about the key biological processes involved in the cells and finding the right treatment. More specifically, such profiling may help personalize the treatment based on the patients' gene expression. In this paper, we present a hierarchical machine learning system that predicts the 5-year survivability of the patients who underwent though specific therapy; The classes are built on the combination of two parts that are the survivability information and the given therapy. For the survivability information part, it defines whether the patient survives the 5-years interval or deceased. While the therapy part denotes the therapy has been taken during that interval, which includes hormone therapy, radiotherapy, or surgery, which totally forms six classes. The Model classifies one class vs. the rest at each node, which makes the tree-based model creates five nodes. The model is trained using a set of standard classifiers based on a comprehensive study dataset that includes genomic profiles and clinical information of 347 patients. A combination of feature selection methods and a prediction method are applied on each node to identify the genes that can predict the class at that node, the identified genes for each class may serve as potential biomarkers to the class's treatment for better survivability. The results show that the model identifies the classes with high-performance measurements. An exhaustive analysis based on relevant literature shows that some of the potential biomarkers are strongly related to breast cancer survivability and cancer in general.

10.
BMC Cancer ; 8: 95, 2008 Apr 10.
Artigo em Inglês | MEDLINE | ID: mdl-18402686

RESUMO

BACKGROUND: Mutations in the mitochondrial genome (mtgenome) have been associated with many disorders, including breast cancer. Nipple aspirate fluid (NAF) from symptomatic women could potentially serve as a minimally invasive sample for breast cancer screening by detecting somatic mutations in this biofluid. This study is aimed at 1) demonstrating the feasibility of NAF recovery from symptomatic women, 2) examining the feasibility of sequencing the entire mitochondrial genome from NAF samples, 3) cross validation of the Human mitochondrial resequencing array 2.0 (MCv2), and 4) assessing the somatic mtDNA mutation rate in benign breast diseases as a potential tool for monitoring early somatic mutations associated with breast cancer. METHODS: NAF and blood were obtained from women with symptomatic benign breast conditions, and we successfully assessed the mutation load in the entire mitochondrial genome of 19 of these women. DNA extracts from NAF were sequenced using the mitochondrial resequencing array MCv2 and by capillary electrophoresis (CE) methods as a quality comparison. Sequencing was performed independently at two institutions and the results compared. The germline mtDNA sequence determined using DNA isolated from the patient's blood (control) was compared to the mutations present in cellular mtDNA recovered from patient's NAF. RESULTS: From the cohort of 28 women recruited for this study, NAF was successfully recovered from 23 participants (82%). Twenty two (96%) of the women produced fluids from both breasts. Twenty NAF samples and corresponding blood were chosen for this study. Except for one NAF sample, the whole mtgenome was successfully amplified using a single primer pair, or three pairs of overlapping primers. Comparison of MCv2 data from the two institutions demonstrates 99.200% concordance. Moreover, MCv2 data was 99.999% identical to CE sequencing, indicating that MCv2 is a reliable method to rapidly sequence the entire mtgenome. Four NAF samples contained somatic mutations. CONCLUSION: We have demonstrated that NAF is a suitable material for mtDNA sequence analysis using the rapid and reliable MCv2. Somatic mtDNA mutations present in NAF of women with benign breast diseases could potentially be used as risk factors for progression to breast cancer, but this will require a much larger study with clinical follow up.


Assuntos
Líquidos Corporais/citologia , Doenças Mamárias/genética , Análise Mutacional de DNA , DNA Mitocondrial/análise , Mitocôndrias/genética , Mamilos/patologia , Adulto , Biópsia por Agulha , Líquidos Corporais/química , Doenças Mamárias/sangue , Estudos de Viabilidade , Feminino , Genoma Mitocondrial , Humanos , Pessoa de Meia-Idade , Análise de Sequência com Séries de Oligonucleotídeos
11.
Evol Bioinform Online ; 14: 1176934318790266, 2018.
Artigo em Inglês | MEDLINE | ID: mdl-30116102

RESUMO

Analyzing the genetic activity of breast cancer survival for a specific type of therapy provides a better understanding of the body response to the treatment and helps select the best course of action and while leading to the design of drugs based on gene activity. In this work, we use supervised and nonsupervised machine learning methods to deal with a multiclass classification problem in which we label the samples based on the combination of the 5-year survivability and treatment; we focus on hormone therapy, radiotherapy, and surgery. The proposed nonsupervised hierarchical models are created to find the highest separability between combinations of the classes. The supervised model consists of a combination of feature selection techniques and efficient classifiers used to find a potential set of biomarker genes specific to response to therapy. The results show that different models achieve different performance scores with accuracies ranging from 80.9% to 100%. We have investigated the roles of many biomarkers through the literature and found that some of the discriminative genes in the computational model such as ZC3H11A, VAX2, MAF1, and ZFP91 are related to breast cancer and other types of cancer.

12.
J Comput Biol ; 24(8): 756-766, 2017 Aug.
Artigo em Inglês | MEDLINE | ID: mdl-28650678

RESUMO

Breast cancer is a complex disease that can be classified into at least 10 different molecular subtypes. Appropriate diagnosis of specific subtypes is critical for ensuring the best possible patient treatment and response to therapy. Current computational methods for determining the subtypes are based on identifying differentially expressed genes (i.e., biomarkers) that can best discriminate the subtypes. Such approaches, however, are known to be unreliable since they yield different biomarker sets when applied to data sets from different studies. Gathering knowledge about the functional relationship among genes will identify "network biomarkers" that will enrich the criteria for biomarker selection. Cancer network biomarkers are subnetworks of functionally related genes that "work in concert" to perform functions associated with a tumorigenic. We propose a machine learning framework that can be used to identify network biomarkers and driver genes for each specific breast cancer subtype. Our results show that the resulting network biomarkers can separate one subtype from the others with very high accuracy.


Assuntos
Biomarcadores Tumorais/genética , Neoplasias da Mama/genética , Neoplasias da Mama/patologia , Perfilação da Expressão Gênica/métodos , Redes Reguladoras de Genes , Genômica/métodos , Mapas de Interação de Proteínas , Biomarcadores Tumorais/metabolismo , Neoplasias da Mama/classificação , Neoplasias da Mama/metabolismo , Feminino , Regulação Neoplásica da Expressão Gênica , Humanos , Transcriptoma
13.
Artigo em Inglês | MEDLINE | ID: mdl-26336144

RESUMO

Accurately reconstructing gene regulatory network (GRN) from gene expression data is a challenging task in systems biology. Although some progresses have been made, the performance of GRN reconstruction still has much room for improvement. Because many regulatory events are asynchronous, learning gene interactions with multiple time delays is an effective way to improve the accuracy of GRN reconstruction. Here, we propose a new approach, called Max-Min high-order dynamic Bayesian network (MMHO-DBN) by extending the Max-Min hill-climbing Bayesian network technique originally devised for learning a Bayesian network's structure from static data. Our MMHO-DBN can explicitly model the time lags between regulators and targets in an efficient manner. It first uses constraint-based ideas to limit the space of potential structures, and then applies search-and-score ideas to search for an optimal HO-DBN structure. The performance of MMHO-DBN to GRN reconstruction was evaluated using both synthetic and real gene expression time-series data. Results show that MMHO-DBN is more accurate than current time-delayed GRN learning methods, and has an intermediate computing performance. Furthermore, it is able to learn long time-delayed relationships between genes. We applied sensitivity analysis on our model to study the performance variation along different parameter settings. The result provides hints on the setting of parameters of MMHO-DBN.


Assuntos
Redes Reguladoras de Genes/genética , Modelos Estatísticos , Biologia de Sistemas/métodos , Algoritmos , Teorema de Bayes , Bases de Dados Genéticas , Saccharomyces cerevisiae/genética
14.
F1000Res ; 5: 2124, 2016.
Artigo em Inglês | MEDLINE | ID: mdl-28620450

RESUMO

Genomic aberrations and gene expression-defined subtypes in the large METABRIC patient cohort have been used to stratify and predict survival. The present study used normalized gene expression signatures of paclitaxel drug response to predict outcome for different survival times in METABRIC patients receiving hormone (HT) and, in some cases, chemotherapy (CT) agents. This machine learning method, which distinguishes sensitivity vs. resistance in breast cancer cell lines and validates predictions in patients; was also used to derive gene signatures of other HT  (tamoxifen) and CT agents (methotrexate, epirubicin, doxorubicin, and 5-fluorouracil) used in METABRIC. Paclitaxel gene signatures exhibited the best performance, however the other agents also predicted survival with acceptable accuracies. A support vector machine (SVM) model of paclitaxel response containing genes  ABCB1, ABCB11, ABCC1, ABCC10, BAD, BBC3, BCL2, BCL2L1, BMF, CYP2C8, CYP3A4, MAP2, MAP4, MAPT, NR1I2, SLCO1B3, TUBB1, TUBB4A, and TUBB4B was 78.6% accurate in predicting survival of 84 patients treated with both HT and CT (median survival ≥ 4.4 yr). Accuracy was lower (73.4%) in 304 untreated patients. The performance of other machine learning approaches was also evaluated at different survival thresholds. Minimum redundancy maximum relevance feature selection of a paclitaxel-based SVM classifier based on expression of genes  BCL2L1, BBC3, FGF2, FN1, and  TWIST1 was 81.1% accurate in 53 CT patients. In addition, a random forest (RF) classifier using a gene signature ( ABCB1, ABCB11, ABCC1, ABCC10, BAD, BBC3, BCL2, BCL2L1, BMF, CYP2C8, CYP3A4, MAP2, MAP4, MAPT, NR1I2,SLCO1B3, TUBB1, TUBB4A, and TUBB4B) predicted >3-year survival with 85.5% accuracy in 420 HT patients. A similar RF gene signature showed 82.7% accuracy in 504 patients treated with CT and/or HT. These results suggest that tumor gene expression signatures refined by machine learning techniques can be useful for predicting survival after drug therapies.

15.
Appl Bioinformatics ; 4(1): 1-11, 2005.
Artigo em Inglês | MEDLINE | ID: mdl-16000008

RESUMO

Following the invention of microarrays in 1994, the development and applications of this technology have grown exponentially. The numerous applications of microarray technology include clinical diagnosis and treatment, drug design and discovery, tumour detection, and environmental health research. One of the key issues in the experimental approaches utilising microarrays is to extract quantitative information from the spots, which represent genes in a given experiment. For this process, the initial stages are important and they influence future steps in the analysis. Identifying the spots and separating the background from the foreground is a fundamental problem in DNA microarray data analysis. In this review, we present an overview of state-of-the-art methods for microarray image segmentation. We discuss the foundations of the circle-shaped approach, adaptive shape segmentation, histogram-based methods and the recently introduced clustering-based techniques. We analytically show that clustering-based techniques are equivalent to the one-dimensional, standard k-means clustering algorithm that utilises the Euclidean distance.


Assuntos
Algoritmos , Perfilação da Expressão Gênica/métodos , Interpretação de Imagem Assistida por Computador/métodos , Hibridização in Situ Fluorescente/métodos , Microscopia de Fluorescência/métodos , Análise de Sequência com Séries de Oligonucleotídeos/métodos , Reconhecimento Automatizado de Padrão/métodos
16.
Source Code Biol Med ; 8(1): 10, 2013 Apr 16.
Artigo em Inglês | MEDLINE | ID: mdl-23591137

RESUMO

BACKGROUND: Non-negative matrix factorization (NMF) has been introduced as an important method for mining biological data. Though there currently exists packages implemented in R and other programming languages, they either provide only a few optimization algorithms or focus on a specific application field. There does not exist a complete NMF package for the bioinformatics community, and in order to perform various data mining tasks on biological data. RESULTS: We provide a convenient MATLAB toolbox containing both the implementations of various NMF techniques and a variety of NMF-based data mining approaches for analyzing biological data. Data mining approaches implemented within the toolbox include data clustering and bi-clustering, feature extraction and selection, sample classification, missing values imputation, data visualization, and statistical comparison. CONCLUSIONS: A series of analysis such as molecular pattern discovery, biological process identification, dimension reduction, disease prediction, visualization, and statistical comparison can be performed using this toolbox.

17.
Artigo em Inglês | MEDLINE | ID: mdl-23929868

RESUMO

Microarray data can be used to detect diseases and predict responses to therapies through classification models. However, the high dimensionality and low sample size of such data result in many computational problems such as reduced prediction accuracy and slow classification speed. In this paper, we propose a novel family of nonnegative least-squares classifiers for high-dimensional microarray gene expression and comparative genomic hybridization data. Our approaches are based on combining the advantages of using local learning, transductive learning, and ensemble learning, for better prediction performance. To study the performances of our methods, we performed computational experiments on 17 well-known data sets with diverse characteristics. We have also performed statistical comparisons with many classification techniques including the well-performing SVM approach and two related but recent methods proposed in literature. Experimental results show that our approaches are faster and achieve generally a better prediction performance over compared methods.


Assuntos
Biologia Computacional/métodos , Bases de Dados Genéticas , Análise dos Mínimos Quadrados , Algoritmos , Perfilação da Expressão Gênica , Humanos , Neoplasias/genética , Neoplasias/metabolismo , Análise de Sequência com Séries de Oligonucleotídeos , Máquina de Vetores de Suporte
18.
BMC Syst Biol ; 7 Suppl 4: S6, 2013.
Artigo em Inglês | MEDLINE | ID: mdl-24565287

RESUMO

BACKGROUND: High-throughput genomic and proteomic data have important applications in medicine including prevention, diagnosis, treatment, and prognosis of diseases, and molecular biology, for example pathway identification. Many of such applications can be formulated to classification and dimension reduction problems in machine learning. There are computationally challenging issues with regards to accurately classifying such data, and which due to dimensionality, noise and redundancy, to name a few. The principle of sparse representation has been applied to analyzing high-dimensional biological data within the frameworks of clustering, classification, and dimension reduction approaches. However, the existing sparse representation methods are inefficient. The kernel extensions are not well addressed either. Moreover, the sparse representation techniques have not been comprehensively studied yet in bioinformatics. RESULTS: In this paper, a Bayesian treatment is presented on sparse representations. Various sparse coding and dictionary learning models are discussed. We propose fast parallel active-set optimization algorithm for each model. Kernel versions are devised based on their dimension-free property. These models are applied for classifying high-dimensional biological data. CONCLUSIONS: In our experiment, we compared our models with other methods on both accuracy and computing time. It is shown that our models can achieve satisfactory accuracy, and their performance are very efficient.


Assuntos
Biologia Computacional/métodos , Algoritmos , Inteligência Artificial , Teorema de Bayes
19.
Artigo em Inglês | MEDLINE | ID: mdl-20876933

RESUMO

This paper demonstrates the use of qualitative probabilistic networks (QPNs) to aid Dynamic Bayesian Networks (DBNs) in the process of learning the structure of gene regulatory networks from microarray gene expression data. We present a study which shows that QPNs define monotonic relations that are capable of identifying regulatory interactions in a manner that is less susceptible to the many sources of uncertainty that surround gene expression data. Moreover, we construct a model that maps the regulatory interactions of genetic networks to QPN constructs and show its capability in providing a set of candidate regulators for target genes, which is subsequently used to establish a prior structure that the DBN learning algorithm can use and which 1) distinguishes spurious correlations from true regulations, 2) enables the discovery of sets of coregulators of target genes, and 3) results in a more efficient construction of gene regulatory networks. The model is compared to the existing literature using the known gene regulatory interactions of Drosophila Melanogaster.


Assuntos
Redes Reguladoras de Genes/genética , Modelos Estatísticos , Algoritmos , Animais , Teorema de Bayes , Drosophila melanogaster/genética , Perfilação da Expressão Gênica/métodos , Regulação da Expressão Gênica
20.
Int J Data Min Bioinform ; 5(3): 266-86, 2011.
Artigo em Inglês | MEDLINE | ID: mdl-21805823

RESUMO

The appreciation of biofilm structures in digital images can be subjective to the observer, and hence it is necessary to analyse the underlying images in useful parameters by means of quantification that is, ideally, free of errors. This paper proposes a combination of techniques for segmentation of biofilm images through an optimal multi-level thresholding algorithm and a set of clustering validity indices, including the determination of the best number of thresholds. The results, which are validated through Rand Index and a quantification process performed in a laboratory, are similar to the quantification and segmentation done by an expert.


Assuntos
Algoritmos , Biofilmes , Processamento de Imagem Assistida por Computador/métodos
SELEÇÃO DE REFERÊNCIAS
Detalhe da pesquisa