Results 1 - 20 of 23
1.
J Biomed Inform ; 132: 104135, 2022 08.
Article in English | MEDLINE | ID: mdl-35842217

ABSTRACT

Certain categories in multi-category biomedical relation extraction are linguistically similar to some extent. Keywords related to these categories and the syntactic structures of samples from them have notable features that are very useful in biomedical relation extraction. Pre-trained models have been widely used and have achieved great success in biomedical relation extraction, but they are still unable to mine this kind of information accurately. To solve this problem, we present a syntax-enhanced model based on category keywords. First, we prune syntactic dependency trees according to category keywords obtained with the chi-square test. This pruning reduces the noise introduced by current syntactic parsing tools while retaining useful information related to the categories. Next, to encode the category-related syntactic dependency trees, we present a syntactic transformer, which enhances the ability of the pre-trained model to capture syntactic structures and to distinguish multiple categories. We evaluate our method on three biomedical datasets. Compared with state-of-the-art models, our method performs better on these datasets. We conduct further analysis to verify the effectiveness of our method.
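As a rough illustration of the keyword step described above, the sketch below scores vocabulary terms against relation-category labels with a chi-square test and keeps the top-scoring terms. The corpus, labels, and cutoff are toy placeholders, not the paper's data or configuration.

```python
# Minimal sketch: selecting per-category keywords with a chi-square test.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
import numpy as np

docs = [
    "aspirin inhibits cyclooxygenase activity",
    "gene X upregulates protein Y expression",
    "drug A binds receptor B with high affinity",
]
labels = [0, 1, 0]  # toy relation-category labels

vec = CountVectorizer()
X = vec.fit_transform(docs)
scores, _ = chi2(X, labels)                 # chi-square score per vocabulary term
terms = np.array(vec.get_feature_names_out())
top = terms[np.argsort(scores)[::-1][:5]]
print("candidate category keywords:", top)
```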


Subjects
Linguistics
2.
BMC Bioinformatics ; 22(1): 379, 2021 Jul 22.
Article in English | MEDLINE | ID: mdl-34294047

ABSTRACT

BACKGROUND: Autism spectrum disorder (ASD) implies a spectrum of symptoms rather than a single phenotype. ASD can affect brain connectivity to different degrees depending on symptom severity. Given their excellent learning capability, graph neural network (GNN) methods have recently been used to uncover functional connectivity patterns and biological mechanisms in neuropsychiatric disorders such as ASD. However, it remains challenging to develop an accurate GNN learning model and to understand how specific decisions of these graph models are made in brain network analysis. RESULTS: In this paper, we propose a graph attention network based learning and interpreting method, namely GAT-LI, which learns to classify functional brain networks of ASD individuals versus healthy controls (HC) and interprets the learned graph model with feature importance. Specifically, GAT-LI includes a graph learning stage and an interpreting stage. First, in the graph learning stage, a new graph attention network model, namely GAT2, uses graph attention layers to learn node representations and a novel attention pooling layer to obtain the graph representation for functional brain network classification. We experimentally compared the GAT2 model's performance on the ABIDE I database of 1035 subjects against the classification performance of other well-known models, and the results showed that the GAT2 model achieved the best classification performance. We also experimentally compared the influence of different brain network construction methods on the GAT2 model, and used a larger synthetic graph dataset with 4000 samples to validate the utility and power of the GAT2 model. Second, in the interpreting stage, we used GNNExplainer to interpret the learned GAT2 model with feature importance. We experimentally compared GNNExplainer with two well-known interpretation methods, Saliency Map and DeepLIFT, and the results showed that GNNExplainer achieved the best interpretation performance. We further used the interpretation method to identify the features that contributed most to classifying ASD versus HC. CONCLUSION: We propose GAT-LI, a two-stage learning and interpreting method, to classify functional brain networks and interpret feature importance in the graph model. The method should also be useful for classification and interpretation tasks on graph data from other biomedical scenarios.
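To make the pooling idea concrete, the sketch below shows an attention-style pooling layer that collapses node embeddings into one graph embedding, loosely following the attention pooling described for GAT2. The class name, dimensions, and data are illustrative assumptions, not the published architecture.

```python
# Minimal sketch of attention pooling over node embeddings.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # learns a relevance score per node

    def forward(self, node_feats):       # node_feats: (num_nodes, dim)
        weights = torch.softmax(self.score(node_feats), dim=0)
        return (weights * node_feats).sum(dim=0)   # graph embedding: (dim,)

pool = AttentionPooling(dim=16)
nodes = torch.randn(90, 16)              # e.g., 90 brain regions
print(pool(nodes).shape)                 # torch.Size([16])
```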


Subjects
Brain, Magnetic Resonance Imaging, Brain Mapping, Humans, Neural Networks (Computer), Polymers
3.
Biophys J ; 119(6): 1056-1064, 2020 09 15.
Article in English | MEDLINE | ID: mdl-32891186

ABSTRACT

The microstructure of the extracellular matrix (ECM) plays a key role in cell migration, especially nonproteolytic migration. It is difficult, however, to measure some properties of the ECM, such as stiffness and passability for cell migration. On the basis of a network model of collagen fibers in the ECM, which has been applied successfully to simulate mechanical behaviors such as the stress-strain relationship, damage, and failure, we proposed a series of methods to study microstructural properties, including pore size and pore stiffness, and to search for possible migration paths for cells. Finally, with a given criterion, we quantitatively evaluated the passability of the ECM network for cell migration. The fiber network model with microstructure and the analysis methods presented in this study further our understanding of, and our ability to evaluate, the properties of an ECM network.
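One way to picture the passability criterion is as a graph search: pores become nodes, neighbouring pores become edges, and a path is "passable" only if every pore on it exceeds the cell's minimum squeeze-through size. The pore sizes, connectivity, and threshold below are entirely synthetic and only illustrate the idea, not the paper's fiber-network model.

```python
# Minimal sketch: path search through pores above a size threshold.
import networkx as nx

pores = {0: 4.2, 1: 2.8, 2: 5.1, 3: 3.6, 4: 4.9}   # pore id -> size (toy units)
edges = [(0, 1), (1, 2), (0, 3), (3, 4), (4, 2)]    # neighbouring pores

def passable_path(start, goal, min_size):
    g = nx.Graph()
    g.add_nodes_from(p for p, s in pores.items() if s >= min_size)
    g.add_edges_from((u, v) for u, v in edges if u in g and v in g)
    if start not in g or goal not in g or not nx.has_path(g, start, goal):
        return None
    return nx.shortest_path(g, start, goal)

print(passable_path(0, 2, min_size=3.5))   # [0, 3, 4, 2]
print(passable_path(0, 2, min_size=4.5))   # None: only pores 2 and 4 remain
```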


Subjects
Extracellular Matrix, Cell Movement
4.
BMC Bioinformatics ; 21(1): 109, 2020 Mar 18.
Article in English | MEDLINE | ID: mdl-32183707

ABSTRACT

BACKGROUND: Advanced sequencing machines dramatically speed up the generation of genomic data, making efficient compression of sequencing data extremely urgent and significant. Quality scores are the most difficult part of the standard sequencing data format FASTQ to compress, and their compression has become a conundrum in the development of FASTQ compression. Existing lossless compressors of quality scores mainly rely on patterns produced by specific sequencers and on complex context-modeling techniques to address the problem of low compression ratio. However, the main drawbacks of these compressors are weak robustness, meaning unstable or even unavailable results on some sequencing files, and slow compression speed. Meanwhile, some compressors attempt to construct a fine-grained index structure to address slow random-access decompression, but they do so at the cost of compression speed and large index files, which makes them inefficient and impractical. Therefore, an efficient lossless compressor of quality scores with strong robustness, high compression ratio, and fast compression and random-access decompression speed is urgently needed and of great significance. RESULTS: In this paper, based on the idea of maximizing the use of hardware resources, we propose LCQS, a lossless compression tool specialized for quality scores. It consists of four sequential processing steps: partitioning, indexing, packing and parallelizing. Experimental results reveal that LCQS outperforms all the other state-of-the-art compressors on all criteria except compression speed on the dataset SRR1284073. Furthermore, LCQS shows strong robustness on all the test datasets, with compression-speed acceleration ratios of up to 29.1x, compressed file sizes reduced by up to 28.78%, and random-access decompression speed increased by up to 2.1x. Additionally, LCQS exhibits strong scalability: the compression speed increases almost linearly as the size of the input dataset increases. CONCLUSION: The ability to handle all kinds of quality scores and its superiority in compression ratio and compression speed make LCQS a highly efficient and advanced lossless quality-score compressor, along with its fast random-access decompression. Our tool LCQS can be downloaded from https://github.com/SCUT-CCNL/LCQS and is freely available for non-commercial usage.
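The partition-and-index idea behind random-access decompression can be sketched in a few lines: compress fixed-size blocks of quality strings independently and record each block's byte offset, so any block can be decompressed without touching the rest of the stream. This is a generic illustration using zlib, not LCQS's actual codec, block size, or index format.

```python
# Minimal sketch: block-wise compression with an offset index for random access.
import zlib

def compress_blocks(quality_lines, block_size=1000):
    blob, index = bytearray(), []
    for i in range(0, len(quality_lines), block_size):
        chunk = "\n".join(quality_lines[i:i + block_size]).encode()
        comp = zlib.compress(chunk, 6)
        index.append((len(blob), len(comp)))   # (offset, length) per block
        blob.extend(comp)
    return bytes(blob), index

def read_block(blob, index, block_id):
    off, length = index[block_id]
    return zlib.decompress(blob[off:off + length]).decode().split("\n")

quals = ["IIIIHHHGGFF@@!!" for _ in range(5000)]   # toy quality strings
blob, idx = compress_blocks(quals)
print(len(idx), read_block(blob, idx, 3)[0])
```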


Subjects
Data Compression/methods, Algorithms, Genomics, DNA Sequence Analysis, Software
5.
BMC Med Inform Decis Mak ; 20(Suppl 3): 129, 2020 07 09.
Article in English | MEDLINE | ID: mdl-32646413

ABSTRACT

BACKGROUND: With the rapid development of sequencing technologies, collecting diverse types of cancer omics data has become more cost-effective. Many computational methods have attempted to represent and fuse multiple omics into a comprehensive view of cancer. However, different types of omics are related and heterogeneous. Most existing methods do not consider the differences between omics types, so the biological knowledge in individual omics may not be fully exploited. Moreover, for a given task (e.g., predicting overall survival), these methods tend to use sample similarity or domain knowledge to learn a more reasonable representation of omics, but this is not enough. METHODS: To learn more useful representations of individual omics and fuse them to improve prediction, we propose an autoencoder-based method named MOSAE (Multi-omics Supervised Autoencoder). In our method, a specific autoencoder is designed for each omics type according to its dimensionality to generate omics-specific representations. Then, a supervised autoencoder is constructed on top of each specific autoencoder, using labels to force it to learn both omics-specific and task-specific representations. Finally, the representations generated by the supervised autoencoders for the different omics are fused in a traditional but powerful way, and the fused representation is used for subsequent predictive tasks. RESULTS: We applied our method to the TCGA Pan-Cancer dataset to predict four clinical outcome endpoints (OS, PFI, DFI, and DSS). Compared with traditional and state-of-the-art methods, MOSAE achieved better predictive performance. We also tested the effect of each improvement, and all had a positive effect on predictive performance. CONCLUSIONS: Predicting clinical outcome endpoints is very important for precision and personalized medicine, and multi-omics fusion is an effective way to address this problem. MOSAE is a powerful multi-omics fusion method that generates both omics-specific and task-specific representations for given endpoint prediction tasks and improves predictive performance.
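The core "supervised autoencoder" idea, a reconstruction loss plus a label-driven loss on the same latent code, can be sketched as below. Layer sizes, data, and the simple loss combination are placeholder assumptions, not the MOSAE configuration.

```python
# Minimal sketch of a supervised autoencoder: reconstruction + classification loss.
import torch
import torch.nn as nn

class SupervisedAE(nn.Module):
    def __init__(self, in_dim, latent_dim, n_classes):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU())
        self.dec = nn.Linear(latent_dim, in_dim)   # reconstructs the omics input
        self.clf = nn.Linear(latent_dim, n_classes)  # ties the code to the task

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), self.clf(z), z

model = SupervisedAE(in_dim=2000, latent_dim=64, n_classes=2)
x = torch.randn(32, 2000)                  # toy omics batch
y = torch.randint(0, 2, (32,))             # toy endpoint labels
recon, logits, z = model(x)
loss = nn.MSELoss()(recon, x) + nn.CrossEntropyLoss()(logits, y)
loss.backward()
```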


Subjects
Neoplasms, Humans, Neoplasms/genetics, Precision Medicine
6.
BMC Bioinformatics ; 20(1): 76, 2019 Feb 14.
Article in English | MEDLINE | ID: mdl-30764760

ABSTRACT

BACKGROUND: The advance of next-generation sequencing enables higher throughput at lower price, and as a basic step of high-throughput sequencing data analysis, variant calling is widely used in disease research, clinical treatment and medical research. However, current mainstream variant callers suffer from serious computational bottlenecks, resulting in long-tail tasks when run on large datasets. This prevents high scalability on multi-node, multi-core clusters and leads to long runtimes and inefficient use of computing resources. Thus, a highly scalable tool that can run in a distributed environment would be very useful for accelerating variant calling on large-scale genome data. RESULTS: In this paper, we present ADS-HCSpark, a scalable tool for variant calling based on the Apache Spark framework. ADS-HCSpark accelerates variant calling by parallelizing the mainstream GATK HaplotypeCaller algorithm across multiple cores and nodes. To address the computation skew in HaplotypeCaller, a parallel strategy of adaptive data segmentation is proposed and a variant calling algorithm based on adaptive data segmentation is implemented, achieving good scalability on both single-node and multi-node setups. To meet the requirement that adjacent data blocks have overlapping boundaries, the Hadoop-BAM library is customized to partition BAM files into overlapped blocks, further improving the accuracy of variant calling. CONCLUSIONS: ADS-HCSpark is a scalable variant calling tool based on the Apache Spark framework that parallelizes the GATK HaplotypeCaller algorithm. Evaluated on our cluster, in the best case achieved on this experimental platform, ADS-HCSpark is 74% faster than GATK3.8 HaplotypeCaller in single-node experiments, 57% faster than GATK4.0 HaplotypeCallerSpark and 27% faster than SparkGA in multi-node experiments, with better scalability and accuracy above 99%. The source code of ADS-HCSpark is publicly available at https://github.com/SCUT-CCNL/ADS-HCSpark.git .
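The overlapping-boundary requirement mentioned above can be pictured with a small helper that steps each new block back by a fixed overlap, so variants near a boundary are seen by both neighbouring blocks. The block size and overlap are illustrative values, not the tool's actual partitioning logic.

```python
# Minimal sketch: splitting a region into blocks with overlapped boundaries.
def overlapped_blocks(total_len, block_size, overlap):
    """Yield (start, end) pairs; adjacent blocks share `overlap` bases."""
    start = 0
    while start < total_len:
        end = min(start + block_size, total_len)
        yield (start, end)
        if end == total_len:
            break
        start = end - overlap   # step back so boundaries overlap

blocks = list(overlapped_blocks(total_len=10_000_000,
                                block_size=2_000_000,
                                overlap=10_000))
print(blocks[:3])
# [(0, 2000000), (1990000, 3990000), (3980000, 5980000)]
```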


Subjects
Algorithms, Genetic Variation, Haplotypes/genetics, Software, Genetic Databases, Genome, High-Throughput Nucleotide Sequencing/methods, Humans, DNA Sequence Analysis/methods, Time Factors
7.
BMC Med Inform Decis Mak ; 19(Suppl 2): 65, 2019 04 09.
Article in English | MEDLINE | ID: mdl-30961622

ABSTRACT

BACKGROUND: Named entity recognition (NER), a key step in the extraction of health information, faces many challenges in Chinese electronic medical records (EMRs). First, the casual use of Chinese abbreviations and doctors' personal styles can result in multiple expressions of the same entity, and we lack a common Chinese medical dictionary for accurate entity extraction. Second, electronic medical records contain entities from a variety of categories, and the lengths of entities in different categories vary greatly, which increases the difficulty of extraction for Chinese NER. Therefore, entity boundary detection becomes the key to accurate entity extraction in Chinese EMRs, and we need a model that supports recognition of entities of multiple lengths without relying on any medical dictionary. METHODS: In this study, we incorporate part-of-speech (POS) information into a deep learning model to improve the accuracy of Chinese entity boundary detection. To avoid incorrect POS tagging of long entities, we propose a method called reduced POS tagging, which keeps the tags of general words but not of apparent medical entities. The proposed model, named SM-LSTM-CRF, consists of three layers: a self-matching attention layer, which calculates the relevance of each character to the entire sentence; an LSTM (Long Short-Term Memory) layer, which captures the context feature of each character; and a CRF (Conditional Random Field) layer, which labels characters based on their features and transition rules. RESULTS: Experimental results on a Chinese EMR dataset show that the F1 value of SM-LSTM-CRF is 2.59% higher than that of LSTM-CRF. Adding the POS feature to the model yields an improvement of about 7.74% in F1. Reduced POS tagging reduces incorrect tagging of long entities, increasing the F1 value by a further 2.42% and reaching an F1 score of 80.07%. CONCLUSIONS: The POS feature produced by reduced POS tagging, together with the self-matching attention mechanism, tightly constrains entity boundaries and performs well in the recognition of clinical entities.
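The first layer's "relevance of each character to the entire sentence" is essentially self-attention over the character sequence; a bare-bones version is sketched below. The class name, scaling, and dimensions are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of a self-matching attention layer over character embeddings.
import torch
import torch.nn as nn

class SelfMatchingAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)

    def forward(self, chars):                        # chars: (seq_len, dim)
        scores = self.q(chars) @ self.k(chars).T     # char-to-sentence scores
        attn = torch.softmax(scores / chars.size(-1) ** 0.5, dim=-1)
        return attn @ chars                          # context-aware characters

layer = SelfMatchingAttention(dim=32)
sentence = torch.randn(20, 32)                       # 20 character embeddings
print(layer(sentence).shape)                         # torch.Size([20, 32])
```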


Subjects
Deep Learning, Electronic Health Records, Information Storage and Retrieval, Natural Language Processing, Attention, China, Humans, Language, Speech
8.
ScientificWorldJournal ; 2014: 907515, 2014.
Article in English | MEDLINE | ID: mdl-24983011

ABSTRACT

Recommending news stories to users based on their preferences has long been a favourite domain for recommender systems research. Traditional systems strive to satisfy their users by tracking users' reading histories and choosing suitable candidate news articles to recommend. However, most news websites do not require users to register before reading news. Besides, the latent relations between news and microblogs, the popularity of particular news items, and news organization are not addressed or solved efficiently by previous approaches. To solve these issues, we propose an effective personalized news recommendation method based on microblog user profile building and subclass popularity prediction, in which we propose a news organization method using hybrid classification and clustering, implement a subclass popularity prediction method, and construct user profiles accordingly. We designed several experiments comparing our system with state-of-the-art approaches on a real-world dataset, and the experimental results demonstrate that our system significantly improves accuracy and diversity on mass text data.


Subjects
Communications Media, Internet, Theoretical Models, Algorithms, Humans
9.
Article in English | MEDLINE | ID: mdl-39074019

ABSTRACT

Deep learning methods have advanced quickly in brain imaging analysis over the past few years, but they are usually restricted by limited labeled data. Pre-training on unlabeled data has brought promising improvements in feature learning in many domains, such as natural language processing. However, this technique is under-explored in brain network analysis. In this paper, we focus on pre-training methods with Transformer networks to leverage existing unlabeled data for brain functional network classification. First, we propose a Transformer-based neural network, named BrainNPT, for brain functional network classification. The proposed method uses a classification token as an embedding vector for the Transformer model to effectively capture the representation of brain networks. Second, we propose a pre-training framework for the BrainNPT model to leverage unlabeled brain network data and learn the structural information of brain functional networks. The classification experiments demonstrate that the BrainNPT model without pre-training achieved the best performance compared with the state-of-the-art models, and the BrainNPT model with pre-training strongly outperformed the state-of-the-art models. Pre-training improved the accuracy of the BrainNPT model by 8.75% compared with the model without pre-training. We further compared the pre-training strategies and the data augmentation methods, analyzed the influence of the model's parameters, and explained the trained model.
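The classification-token idea can be sketched as a learnable token prepended to the brain-region tokens before a Transformer encoder, with the prediction read from that token's position. The class name, sizes, and random input below are illustrative assumptions, not the BrainNPT architecture or its pre-training objective.

```python
# Minimal sketch: Transformer encoder with a prepended classification token.
import torch
import torch.nn as nn

class TinyBrainTransformer(nn.Module):
    def __init__(self, n_regions=90, dim=64, n_classes=2):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))   # classification token
        self.proj = nn.Linear(n_regions, dim)             # embed each region's row
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, fc):                  # fc: (batch, n_regions, n_regions)
        tokens = self.proj(fc)              # one token per brain region
        cls = self.cls.expand(fc.size(0), -1, -1)
        out = self.encoder(torch.cat([cls, tokens], dim=1))
        return self.head(out[:, 0])         # classify from the cls position

model = TinyBrainTransformer()
print(model(torch.randn(8, 90, 90)).shape)  # torch.Size([8, 2])
```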


Subjects
Algorithms, Brain, Deep Learning, Neural Networks (Computer), Humans, Brain/physiology, Brain/diagnostic imaging, Nerve Net/physiology, Nerve Net/diagnostic imaging, Magnetic Resonance Imaging, Natural Language Processing
10.
Comput Biol Med ; 181: 109042, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39180856

ABSTRACT

Pathological images and molecular omics provide important information for predicting diagnosis and prognosis. These two kinds of heterogeneous modal data contain complementary information, and their effective fusion can better reveal the complex mechanisms of cancer. However, because of their different representation learning methods, the expressive strength of different modalities varies greatly across tasks, so many multimodal fusion approaches do not achieve the best results. In this paper, we propose MBFusion, which addresses multiple tasks, such as prediction of diagnosis and prognosis, through multi-modal balanced fusion. The MBFusion framework uses two kinds of specially constructed graph convolutional networks to extract features from molecular omics data, and uses ResNet to extract features from pathological image data while retaining important deep features with attention and clustering; this effectively improves both feature representations, making their expressive ability balanced and comparable. The features of these two modalities are then fused through a cross-attention Transformer, and the fused features are used to learn both cancer subtype classification and survival analysis through multi-task learning. We compare MBFusion with other state-of-the-art methods on two public cancer datasets, and MBFusion shows an improvement of up to 10.1% across three evaluation metrics. In ablation experiments, we explore the contribution of each modality and each framework module to performance. Furthermore, the interpretability of MBFusion is explained in detail to show its practical value.
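Cross-attention fusion of two modality feature sets can be illustrated with a single multi-head attention call in which one modality's tokens query the other's. The token counts, dimensions, and random inputs are toy assumptions, not MBFusion's actual feature extractors or fusion block.

```python
# Minimal sketch of cross-attention between image-derived and omics-derived tokens.
import torch
import torch.nn as nn

dim = 64
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

img_tokens = torch.randn(2, 16, dim)     # pathology-image features (toy)
omics_tokens = torch.randn(2, 8, dim)    # molecular-omics features (toy)

# Image tokens query the omics tokens; the output mixes both modalities.
fused, _ = attn(query=img_tokens, key=omics_tokens, value=omics_tokens)
print(fused.shape)                        # torch.Size([2, 16, 64])
```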


Subjects
Neoplasms, Humans, Neoplasms/diagnosis, Prognosis, Machine Learning, Neural Networks (Computer), Deep Learning
11.
J Affect Disord ; 364: 266-273, 2024 Nov 01.
Article in English | MEDLINE | ID: mdl-39137835

ABSTRACT

BACKGROUND: Functional connectivity has been shown to fluctuate over time. The present study aimed to identify major depressive disorder (MDD) with dynamic functional connectivity (dFC) from resting-state fMRI data, which would help produce tools for early depression diagnosis and enhance our understanding of depressive etiology. METHODS: Resting-state fMRI data of 178 subjects were collected, including 89 MDD patients and 89 healthy controls. We propose a spatio-temporal learning and explaining framework for dFC analysis. An effective spatio-temporal model is developed to classify MDD versus healthy controls with dFCs. The model is a stacked neural network that learns network structure information with a multi-layer perceptron based spatial encoder and learns time-varying patterns with a Transformer based temporal encoder. We propose to explain the spatio-temporal model with a two-stage explanation method of important-feature extraction and disorder-relevant pattern exploration. The layer-wise relevance propagation (LRP) method is introduced to extract the most relevant input features in the model, and the attention mechanism with LRP is applied to extract the important time steps of dFCs. The disorder-relevant functional connections, brain regions, and brain states in the model are further explored and identified. RESULTS: We achieved the best classification performance in identifying MDD from healthy controls with dFC data. The most important functional connections, brain regions, and dynamic states closely related to MDD have been identified. LIMITATIONS: Data preprocessing may affect the classification performance of the model, and this study needs further validation in a larger patient population. CONCLUSIONS: The experimental results demonstrate that the proposed spatio-temporal model can effectively classify MDD and uncover structural and temporal patterns of dFCs in depression.
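The two-part encoder idea, an MLP over each dFC window followed by a Transformer over the window sequence, can be sketched as below. Sizes, pooling choice, and input shapes are illustrative assumptions rather than the paper's model.

```python
# Minimal sketch: MLP spatial encoder + Transformer temporal encoder for dFC windows.
import torch
import torch.nn as nn

class SpatioTemporalNet(nn.Module):
    def __init__(self, fc_dim, dim=64, n_classes=2):
        super().__init__()
        self.spatial = nn.Sequential(nn.Linear(fc_dim, dim), nn.ReLU())
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, dfc):                   # dfc: (batch, windows, fc_dim)
        h = self.temporal(self.spatial(dfc))  # encode each window, then time
        return self.head(h.mean(dim=1))       # pool over time steps

model = SpatioTemporalNet(fc_dim=4005)        # e.g., vectorized 90x90 FC windows
print(model(torch.randn(4, 30, 4005)).shape)  # torch.Size([4, 2])
```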


Subjects
Major Depressive Disorder, Magnetic Resonance Imaging, Humans, Major Depressive Disorder/physiopathology, Major Depressive Disorder/diagnostic imaging, Adult, Female, Male, Brain/physiopathology, Brain/diagnostic imaging, Neural Networks (Computer), Connectome/methods, Spatio-Temporal Analysis, Young Adult, Brain Mapping, Case-Control Studies
12.
Comput Biol Med ; 166: 107535, 2023 Sep 28.
Article in English | MEDLINE | ID: mdl-37788508

ABSTRACT

In recent years, pre-trained language models (PLMs) have dominated natural language processing (NLP) and achieved outstanding performance in various NLP tasks, including PLM-based dense retrieval. However, in the biomedical domain, the effectiveness of PLM-based dense retrieval models still needs to be improved because of the diversity and ambiguity of entity expressions caused by the abundance of biomedical entities. To alleviate this semantic gap, in this paper we propose a method that incorporates external knowledge at the entity level into a dense retrieval model to enrich the dense representations of queries and documents. Specifically, we first add self-attention and information interaction modules to the Transformer layer of the BERT architecture to perform fusion and interaction between query/document text and entity embeddings from knowledge graphs. We then propose an entity similarity loss that constrains the model to better learn external knowledge from entity embeddings, and a weighted entity concatenation mechanism to balance the impact of entity representations when matching queries and documents. Experiments on two publicly available biomedical retrieval datasets show that our proposed method outperforms state-of-the-art dense retrieval methods. In terms of NDCG, the proposed method (called ELK) improves the ranking performance of coCondenser by at least 5% on both datasets, and also obtains further performance gains over the state-of-the-art EVA methods. Although ELK has a more sophisticated architecture, its average query latency remains within the same order of magnitude as that of other efficient methods.
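The weighted entity concatenation idea, scaling the entity vector by a learned weight before appending it to the text representation, can be sketched as follows. The gating function, dimensions, and inputs are assumptions for illustration, not ELK's published formulation.

```python
# Minimal sketch: weighting an entity embedding before concatenation for matching.
import torch
import torch.nn as nn

class WeightedEntityConcat(nn.Module):
    def __init__(self, text_dim, ent_dim):
        super().__init__()
        self.gate = nn.Linear(text_dim + ent_dim, 1)   # learns the entity weight

    def forward(self, text_vec, ent_vec):
        w = torch.sigmoid(self.gate(torch.cat([text_vec, ent_vec], dim=-1)))
        return torch.cat([text_vec, w * ent_vec], dim=-1)

fuse = WeightedEntityConcat(text_dim=768, ent_dim=128)
query_text = torch.randn(4, 768)        # e.g., BERT [CLS] embeddings (toy)
query_ents = torch.randn(4, 128)        # aggregated KG entity embeddings (toy)
print(fuse(query_text, query_ents).shape)   # torch.Size([4, 896])
```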

13.
Article in English | MEDLINE | ID: mdl-35044920

ABSTRACT

The development of omics data and biomedical images has greatly advanced the progress of precision medicine in diagnosis, treatment, and prognosis. The fusion of omics and imaging data, i.e., omics-imaging fusion, offers a new strategy for understanding complex diseases. However, because of issues such as the limited number of samples, the high dimensionality of features, and the heterogeneity of different data types, efficiently learning complementary or associated discriminative fusion information from omics and imaging data remains a challenge. Recently, numerous machine learning methods have been proposed to alleviate these problems. In this review, from the perspective of fusion levels and fusion methods, we first provide an overview of preprocessing and feature extraction methods for omics and imaging data, and comprehensively analyze and summarize the basic forms and variations of commonly used and newly emerging fusion methods, along with their advantages, disadvantages, and applicable scope. We then describe public datasets and compare experimental results of various fusion methods on the ADNI and TCGA datasets. Finally, we discuss future prospects and highlight remaining challenges in the field.


Subjects
Computational Biology, Machine Learning, Computational Biology/methods
14.
Asian J Psychiatr ; 82: 103511, 2023 Apr.
Article in English | MEDLINE | ID: mdl-36791609

ABSTRACT

The present study aims to identify suicide risk in major depressive disorder (MDD) patients from structural MRI (sMRI) data using deep learning. In this paper, we collected sMRI data of 288 MDD patients, including 110 patients with suicidal ideation (SI), 93 patients with suicide attempts (SA), and 85 patients without suicidal ideation or attempts (NS). We developed interpretable deep neural network models to classify patients in three tasks: SA-versus-SI, SA-versus-NS, and SI-versus-NS. Furthermore, we interpreted the models by extracting the features that contributed most to the classification and further discussed these features and the corresponding ROIs/brain regions.


Subjects
Deep Learning, Major Depressive Disorder, Humans, Attempted Suicide, Major Depressive Disorder/diagnostic imaging, Suicidal Ideation
15.
Artif Intell Med ; 126: 102260, 2022 04.
Article in English | MEDLINE | ID: mdl-35346442

ABSTRACT

Morphological attributes from histopathological images and molecular profiles from genomic data are important information for the diagnosis, prognosis, and therapy of cancers. By integrating these heterogeneous but complementary data, many multi-modal methods have been proposed to study the complex mechanisms of cancers, and most of them achieve comparable or better results than previous single-modal methods. However, these multi-modal methods are restricted to a single task (e.g., survival analysis or grade classification) and thus neglect the correlation between different tasks. In this study, we present a multi-modal fusion framework based on multi-task correlation learning (MultiCoFusion) for survival analysis and cancer grade classification, which combines the power of multiple modalities and multiple tasks. Specifically, a pre-trained ResNet-152 and a sparse graph convolutional network (SGCN) are used to learn the representations of histopathological images and mRNA expression data, respectively. These representations are then fused by a fully connected neural network (FCNN), which is also a multi-task shared network. Finally, the results of survival analysis and cancer grade classification are output simultaneously. The framework is trained with an alternating scheme. We systematically evaluate our framework on glioma datasets from The Cancer Genome Atlas (TCGA). Results demonstrate that MultiCoFusion learns better representations than traditional feature extraction methods. With the help of multi-task alternating learning, even simple multi-modal concatenation achieves better performance than other deep learning and traditional methods. Multi-task learning improves the performance of multiple tasks, not just one of them, and it is effective on both single-modal and multi-modal data.
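The multi-task shared-network idea, one fused representation feeding separate survival and grade heads, can be sketched as below. The feature dimensions, the plain squared-error style risk head, and the joint forward pass are simplifying assumptions; the paper's actual losses and alternating training scheme are not reproduced here.

```python
# Minimal sketch: shared fusion network with survival-risk and grade heads.
import torch
import torch.nn as nn

class MultiTaskFusion(nn.Module):
    def __init__(self, img_dim, gene_dim, hidden=128, n_grades=3):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(img_dim + gene_dim, hidden), nn.ReLU())
        self.risk_head = nn.Linear(hidden, 1)        # survival-related output
        self.grade_head = nn.Linear(hidden, n_grades)

    def forward(self, img_feat, gene_feat):
        h = self.shared(torch.cat([img_feat, gene_feat], dim=-1))
        return self.risk_head(h).squeeze(-1), self.grade_head(h)

model = MultiTaskFusion(img_dim=2048, gene_dim=240)
risk, grade_logits = model(torch.randn(16, 2048), torch.randn(16, 240))
print(risk.shape, grade_logits.shape)   # torch.Size([16]) torch.Size([16, 3])
```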


Subjects
Glioma, Neural Networks (Computer), Genomics, Humans, Prognosis
16.
IEEE J Biomed Health Inform ; 25(8): 3219-3229, 2021 08.
Article in English | MEDLINE | ID: mdl-33449889

ABSTRACT

The curse of dimensionality, caused by high dimensionality and low sample size, is a major challenge in gene expression data analysis. The real situation is even worse: labelling data is laborious and time-consuming, so only a small part of the limited samples will be labelled. Having so few labelled samples further increases the difficulty of training deep learning models. Interpretability is an important requirement in biomedicine. Many existing deep learning methods attempt to provide interpretability, but are rarely applied to gene expression data. Recent semi-supervised graph convolutional network methods try to address these problems by smoothing label information over a graph. However, to the best of our knowledge, these methods only utilize graphs in either the feature space or the sample space, which restricts their performance. We propose a transductive semi-supervised representation learning method called hierarchical graph convolution network (HiGCN) to aggregate the information of gene expression data in both the feature and sample spaces. HiGCN first utilizes external knowledge to construct a feature graph and a similarity kernel to construct a sample graph. Then, two spatial-based GCNs are used to aggregate information on these graphs. To validate the model's performance, synthetic and real datasets are used to lend empirical support. Compared with two recent models and three traditional models, HiGCN learns better representations of gene expression data, and these representations improve the performance of downstream tasks, especially when the model is trained on few labelled samples. Important features can be extracted from our model to provide reliable interpretability.
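A single spatial GCN aggregation step over a sample-similarity graph, the kind of propagation described above for the sample space, can be sketched as follows. The similarity kernel, normalization, and sizes are illustrative assumptions, not HiGCN's construction.

```python
# Minimal sketch: one spatial GCN layer over a sample-similarity graph.
import torch
import torch.nn as nn

def normalize_adj(adj):
    adj = adj + torch.eye(adj.size(0))              # add self-loops
    deg_inv_sqrt = adj.sum(1).pow(-0.5)
    return deg_inv_sqrt[:, None] * adj * deg_inv_sqrt[None, :]

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj_norm):
        return torch.relu(adj_norm @ self.lin(x))    # propagate then transform

samples = torch.randn(50, 1000)                      # 50 samples, 1000 genes (toy)
sim = torch.relu(torch.corrcoef(samples))            # toy sample-similarity kernel
layer = GCNLayer(1000, 64)
print(layer(samples, normalize_adj(sim)).shape)      # torch.Size([50, 64])
```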


Subjects
Supervised Machine Learning, Gene Expression, Humans
17.
J Comput Biol ; 27(9): 1350-1360, 2020 09.
Article in English | MEDLINE | ID: mdl-31904999

ABSTRACT

The storage and analysis of massive genetic variation datasets in variant call format (VCF) have become a great challenge with the rapid growth of genetic variation data in recent years. Traditional single-process toolkits become increasingly inefficient when analyzing massive genetic variation data. While emerging distributed storage technologies such as Apache Kudu offer attractive solutions, a distributed storage toolkit for VCF datasets is still needed. In this article, we present Variant-Kudu, an efficient genome toolkit for storing and analyzing massive genetic variation datasets. Based on a new distributed scheme, the genetic variation data are segmented and stored in Kudu across multiple nodes. With this scheme, data can be randomly accessed at low latency and scanned efficiently. To reduce query execution time, a distributed bitmap index strategy is proposed and a parallel query method is designed, which expedite analyses of massive genetic variation data. Variant-Kudu is a scalable toolkit for analyzing massive genetic variation datasets, and our experiments demonstrate that Variant-Kudu achieves high performance on a multi-node cluster.
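The bitmap-index idea behind the query strategy can be illustrated in miniature: one bit vector per indexed attribute value, combined with bitwise AND to answer a filter without scanning every row. The toy variant table below is a stand-in; it does not reflect Variant-Kudu's distributed layout or index format.

```python
# Minimal sketch: bitmap indexing for variant filter queries.
import numpy as np

chroms = np.array(["1", "1", "2", "2", "2", "X"])
types  = np.array(["SNP", "INDEL", "SNP", "SNP", "INDEL", "SNP"])

# Build one bitmap per distinct value of each indexed column.
bitmap = {("chrom", v): chroms == v for v in np.unique(chroms)}
bitmap.update({("type", v): types == v for v in np.unique(types)})

# Query: SNPs on chromosome 2 -> AND the two bitmaps, read out row ids.
mask = bitmap[("chrom", "2")] & bitmap[("type", "SNP")]
print(np.flatnonzero(mask))   # [2 3]
```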


Subjects
Big Data, Genetic Variation/genetics, Genome/genetics, Software/statistics & numerical data
18.
JMIR Med Inform ; 8(5): e17644, 2020 May 29.
Article in English | MEDLINE | ID: mdl-32469325

ABSTRACT

BACKGROUND: Most current methods for intrasentence relation extraction in the biomedical literature are inadequate for document-level relation extraction, in which a relationship may cross sentence boundaries. Hence, some approaches have been proposed to extract relations by splitting document-level datasets through heuristic rules and learning methods. However, these approaches may introduce additional noise and do not really solve the problem of intersentence relation extraction. It is challenging to avoid this noise while extracting cross-sentence relations. OBJECTIVE: This study aimed to avoid the errors introduced by dividing the document-level dataset, to verify that a self-attention structure can extract biomedical relations in documents with long-distance dependencies and complex semantics, and to discuss the relative benefits of different entity pretreatment methods for biomedical relation extraction. METHODS: This paper proposes a new data preprocessing method and applies a pretrained self-attention structure to document-level biomedical relation extraction, with an entity replacement method to capture very long-distance dependencies and complex semantics. RESULTS: Compared with state-of-the-art approaches, our method greatly improved precision and increased the F1 value. Through experiments on biomedical entity pretreatment, we found that a model using an entity replacement method can improve performance. CONCLUSIONS: When all target entity pairs in a document-level dataset are considered as a whole, a pretrained self-attention structure is suitable for capturing very long-distance dependencies and learning the textual context and complicated semantics. A replacement method for biomedical entities is conducive to biomedical relation extraction, especially to document-level relation extraction.
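The entity replacement pretreatment can be pictured as swapping every mention of a target entity for a typed placeholder token before the text reaches the self-attention model, so mentions of the same pair are unified across sentences. The placeholder token names and example text below are assumptions for illustration, not the paper's exact tokens or data.

```python
# Minimal sketch: replace target entity mentions with typed placeholder tokens.
import re

def replace_entities(text, chemical, disease):
    text = re.sub(re.escape(chemical), "@CHEMICAL$", text, flags=re.IGNORECASE)
    text = re.sub(re.escape(disease), "@DISEASE$", text, flags=re.IGNORECASE)
    return text

doc = ("Aspirin was administered daily. Two weeks later the patient "
       "developed gastric ulcer, which resolved after aspirin was stopped.")
print(replace_entities(doc, "aspirin", "gastric ulcer"))
# Both aspirin mentions become @CHEMICAL$ even across sentence boundaries,
# so the document-level entity pair is treated as a whole.
```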

19.
Comput Math Methods Med ; 2020: 1394830, 2020.
Article in English | MEDLINE | ID: mdl-32508974

ABSTRACT

Deep neural networks have recently been applied to the study of brain disorders such as autism spectrum disorder (ASD) with great success. However, the internal logic of these networks is difficult to interpret, especially with regard to how specific network architecture decisions are made. In this paper, we study an interpretable neural network model as a method to identify ASD participants from functional magnetic resonance imaging (fMRI) data and to interpret the model's results in a precise and consistent manner. First, we propose an interpretable fully connected neural network (FCNN) to classify two groups, ASD versus healthy controls (HC), based on input data from resting-state functional connectivity (rsFC) between regions of interest (ROIs). The proposed FCNN model is a piecewise linear neural network (PLNN) that uses the piecewise linear function LeakyReLU as its activation function. We experimentally compared the FCNN model against widely used classification models, including support vector machines (SVM), random forests, and two new classes of deep neural network models, on a large dataset of 871 subjects from the ABIDE I database. The results show that the proposed FCNN model achieves the highest classification accuracy. Second, we propose an interpretation method that explains the trained model with an exact linear formula for each input sample and identifies the decision features that contributed most to the classification of ASD versus HC participants. We also discuss the implications of our proposed approach for fMRI data classification and interpretation.
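The "exact linear formula per sample" property of a piecewise linear network can be demonstrated directly: with LeakyReLU activations, each input lies inside a region where the network is exactly linear, so the gradient at that input gives the local per-feature coefficients and the residual gives the bias. The tiny untrained network and random input below are toy stand-ins, not the paper's trained model or its interpretation procedure.

```python
# Minimal sketch: local linear coefficients of a LeakyReLU network via its gradient.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(200, 32), nn.LeakyReLU(),
    nn.Linear(32, 2))                          # 2 classes: ASD vs. HC (toy)

x = torch.randn(1, 200, requires_grad=True)    # one rsFC feature vector (toy)
logit_asd = net(x)[0, 0]
logit_asd.backward()

local_weights = x.grad[0]                      # exact coefficients in this region
bias = (logit_asd - (local_weights * x[0]).sum()).item()
top_features = torch.topk(local_weights.abs(), k=5).indices
print("local bias:", round(bias, 4), "top features:", top_features.tolist())
```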


Subjects
Autism Spectrum Disorder/diagnostic imaging, Deep Learning, Autism Spectrum Disorder/classification, Autism Spectrum Disorder/physiopathology, Case-Control Studies, Computational Biology, Connectome/statistics & numerical data, Factual Databases, Functional Neuroimaging/statistics & numerical data, Humans, Linear Models, Magnetic Resonance Imaging/statistics & numerical data, Neural Networks (Computer), Support Vector Machine
20.
Genes (Basel) ; 10(11)2019 11 04.
Article in English | MEDLINE | ID: mdl-31689965

ABSTRACT

(1) Background: The DNA sequence alignment process is an essential step in genome analysis. BWA-MEM has been a prevalent single-node alignment tool because of its high speed and accuracy. However, the exponential growth of genome data calls for a multi-node solution that can handle large volumes of data, which currently remains a challenge. Spark is a ubiquitous big data platform that has been exploited to assist genome alignment in meeting this challenge. Nonetheless, existing works that use Spark to optimize BWA-MEM suffer from high overhead. (2) Methods: In this paper, we present PipeMEM, a framework that accelerates BWA-MEM with lower overhead by using the pipe operation in Spark. We additionally use a pipeline structure and in-memory computation to accelerate PipeMEM. (3) Results: Our experiments showed that, on paired-end alignment tasks, our framework had low overhead. In a multi-node environment, our framework was, on average, 2.27× faster than BWASpark (an alignment tool in the Genome Analysis Toolkit (GATK)) and 2.33× faster than SparkBWA. (4) Conclusions: PipeMEM can accelerate BWA-MEM in the Spark environment with high performance and low overhead.
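Spark's pipe operation, the mechanism the framework is built around, streams each partition's records through an external command's stdin/stdout. In the sketch below a harmless `cat` stands in for the real aligner invocation (something like `bwa mem ...` in actual use, which is an assumption here), and the read records are toy placeholders.

```python
# Minimal sketch of Spark's pipe operation with a placeholder command.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipe-sketch").getOrCreate()
sc = spark.sparkContext

reads = sc.parallelize([
    "@r1\nACGTACGT\n+\nIIIIIIII",
    "@r2\nTTGGCCAA\n+\nIIIIIIII",
], numSlices=2)

# Each partition is streamed through the external command's stdin/stdout;
# `cat` keeps this sketch runnable without an aligner installed.
aligned = reads.pipe("cat")
print(aligned.collect())
spark.stop()
```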


Subjects
High-Throughput Nucleotide Sequencing/methods, Sequence Alignment/methods, DNA Sequence Analysis/methods, Algorithms, Big Data, Chromosome Mapping, Human Genome, Humans, Software