1.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38762789

ABSTRACT

Identifying drug-target interactions (DTIs) holds significant importance in drug discovery and development, playing a crucial role in various areas such as virtual screening, drug repurposing and identification of potential drug side effects. However, existing methods commonly exploit only a single type of feature from drugs and targets, suffering from miscellaneous challenges such as high sparsity and cold-start problems. We propose a novel framework called MSI-DTI (Multi-Source Information-based Drug-Target Interaction Prediction) to enhance prediction performance, which obtains feature representations from different views by integrating biometric features and knowledge graph representations from multi-source information. Our approach involves constructing a Drug-Target Knowledge Graph (DTKG), obtaining multiple feature representations from diverse information sources for SMILES sequences and amino acid sequences, incorporating network features from DTKG and performing an effective multi-source information fusion. Subsequently, we employ a multi-head self-attention mechanism coupled with residual connections to capture higher-order interaction information between sparse features while preserving lower-order information. Experimental results on DTKG and two benchmark datasets demonstrate that our MSI-DTI outperforms several state-of-the-art DTI prediction methods, yielding more accurate and robust predictions. The source code and datasets are publicly accessible at https://github.com/KEAML-JLU/MSI-DTI.
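
The self-attention step with a residual connection mentioned in this abstract can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the MSI-DTI implementation: it uses a single head and no learned projection matrices, and the function names and toy feature vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_residual(features):
    """Single-head scaled dot-product self-attention plus a residual connection.

    features: list of n feature vectors, each of length d.
    Assumption: queries, keys and values are the raw features (no learned weights).
    """
    d = len(features[0])
    out = []
    for q in features:
        # Similarity of this feature to every feature, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in features]
        weights = softmax(scores)
        # Attention output: weighted mix of all feature vectors.
        attended = [sum(w * v[j] for w, v in zip(weights, features)) for j in range(d)]
        # Residual connection: the original (lower-order) feature is added back.
        out.append([x + a for x, a in zip(q, attended)])
    return out

updated = self_attention_residual([[1.0, 0.0], [0.0, 1.0]])
```

The residual addition is what lets the original feature survive alongside the mixed, higher-order representation.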


Subjects
Drug Discovery , Computational Biology/methods , Algorithms , Humans
2.
Brief Bioinform ; 23(5)2022 09 20.
Article in English | MEDLINE | ID: mdl-35596953

ABSTRACT

Coronavirus disease 2019 (COVID-19) has infected hundreds of millions of people and killed millions of them. As an RNA virus, COVID-19 is more susceptible to variation than other viruses. Many problems involved in this epidemic have made biosafety and biosecurity (hereafter collectively referred to as 'biosafety') a popular and timely topic globally. Biosafety research covers a broad and diverse range of topics, and it is important to quickly identify hotspots and trends in biosafety research through big data analysis. However, the data-driven literature on biosafety research discovery is quite scant. We developed a novel topic model based on latent Dirichlet allocation, affinity propagation clustering and the PageRank algorithm (LDAPR) to extract knowledge from biosafety research publications from 2011 to 2020. Then, we conducted hotspot and trend analysis with LDAPR and carried out further studies, including annual hot topic extraction, a 10-year keyword evolution trend analysis, topic map construction, hot region discovery and fine-grained correlation analysis of interdisciplinary research topic trends. These analyses revealed valuable information that can guide epidemic prevention work: (1) the research enthusiasm over a certain infectious disease not only is related to its epidemic characteristics but also is affected by the progress of research on other diseases, and (2) infectious diseases are not only strongly related to their corresponding microorganisms but also potentially related to other specific microorganisms. The detailed experimental results and our code are available at https://github.com/KEAML-JLU/Biosafety-analysis.
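
Of the three components named in this abstract, PageRank is the easiest to show in a few lines. The sketch below is a generic power-iteration PageRank, not the LDAPR code; the abstract does not say which graph is ranked, so the toy graph of topic labels and the function name are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict {node: [outgoing neighbours]}."""
    nodes = sorted(set(links) | {n for outs in links.values() for n in outs})
    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[n] for n in nodes if not links.get(n))
        new_rank = {}
        for n in nodes:
            incoming = sum(
                rank[m] / len(links[m]) for m in nodes if n in links.get(m, [])
            )
            new_rank[n] = (1 - damping) / count + damping * (incoming + dangling / count)
        rank = new_rank
    return rank

# Hypothetical graph: an edge means "topic points to a related topic".
toy_links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
scores = pagerank(toy_links)
top_topic = max(scores, key=scores.get)
```

In this toy graph the node with the most incoming weight ends up ranked first, which is the sense in which PageRank can surface "hot" items.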


Subjects
COVID-19 , Biosecurity , COVID-19/epidemiology , Containment of Biohazards/methods , Humans , Machine Learning , RNA
3.
Entropy (Basel) ; 23(3)2021 Mar 12.
Article in English | MEDLINE | ID: mdl-33809188

ABSTRACT

Machine learning models can automatically discover biomedical research trends and promote the dissemination of information and knowledge. Text feature representation is a critical and challenging task in natural language processing. Most methods of text feature representation are based on word representation. A good representation can capture semantic and structural information. In this paper, two fusion algorithms are proposed, namely, the Tr-W2v and Ti-W2v algorithms. They are based on the classical text feature representation model and consider the importance of words. The results show that the effectiveness of the two fusion text representation models is better than the classical text representation model, and the results based on the Tr-W2v algorithm are the best. Furthermore, based on the Tr-W2v algorithm, trend analyses of cancer research are conducted, including correlation analysis, keyword trend analysis, and improved keyword trend analysis. The discovery of the research trends and the evolution of hotspots for cancers can help doctors and biological researchers collect information and provide guidance for further research.
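
The abstract does not define the Tr-W2v and Ti-W2v algorithms. One common way to combine word importance with word vectors is a TF-IDF-weighted average of the vectors, sketched below. Whether either algorithm works this way is an assumption; the function name, the smoothed IDF formula and the toy vectors are all hypothetical.

```python
import math
from collections import Counter

def tfidf_weighted_vector(doc, corpus, vectors):
    """Average the word vectors of `doc`, weighting each word by its TF-IDF score.

    doc: list of tokens; corpus: list of token lists; vectors: {word: [floats]}.
    """
    counts = Counter(doc)
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, count in counts.items():
        if word not in vectors:
            continue  # out-of-vocabulary words are skipped
        tf = count / len(doc)
        df = sum(1 for d in corpus if word in d)
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0  # smoothed IDF
        weight = tf * idf
        total = [t + weight * v for t, v in zip(total, vectors[word])]
        weight_sum += weight
    return [t / weight_sum for t in total] if weight_sum else total

toy_vectors = {"tumor": [1.0, 0.0], "the": [0.0, 1.0]}
toy_corpus = [["the", "tumor"], ["the", "cell"], ["the", "gene"]]
doc_vector = tfidf_weighted_vector(toy_corpus[0], toy_corpus, toy_vectors)
```

Here the rarer word "tumor" pulls the document vector towards its own direction more strongly than the ubiquitous word "the".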

4.
BMC Med Genet ; 21(1): 65, 2020 03 30.
Article in English | MEDLINE | ID: mdl-32228543

ABSTRACT

BACKGROUND: Several obesity susceptibility loci in genes, including GNPDA2, SH2B1, TMEM18, MTCH2, CDKAL1, FAIM2, and MC4R, have been identified by genome-wide association studies. The purpose of this study was to investigate whether these loci are associated with the concurrence of obesity and type 2 diabetes in Chinese Han patients. METHODS: Using the SNaPshot technique, we genotyped seven single nucleotide polymorphisms (SNPs) in 439 Chinese patients living in Northeast China who presented at The Second Hospital of Jilin University. We analyzed the associations between these seven alleles and clinical characteristics. RESULTS: Risk alleles near TMEM18 (rs6548238) were associated with increased waist circumference, waist/hip ratio, body mass index (BMI), fasting plasma glucose, hemoglobin A1c, diastolic blood pressure, triglycerides, total cholesterol, and low-density lipoprotein-cholesterol; risk alleles of CDKAL1 (rs7754840) were associated with increased waist circumference and waist/hip ratio; and FAIM2 (rs7138803) risk alleles were linked to increased BMI, diastolic blood pressure, and triglycerides (all P < 0.05). After adjusting for sex and age, loci near TMEM18 (rs6548238) and FAIM2 (rs7138803), but not SH2B1 (rs7498665), near GNPDA2 (rs10938397), MTCH2 (rs10838738) and near MC4R (rs12970134), were associated with increased risk for type 2 diabetes in obese individuals. CONCLUSION: We found that loci near TMEM18 (rs6548238), CDKAL1 (rs7754840), and FAIM2 (rs7138803) may be associated with obesity-related indicators, and loci near TMEM18 (rs6548238) and FAIM2 (rs7138803) may increase susceptibility of concurrent type 2 diabetes associated with obesity.


Subjects
Apoptosis Regulatory Proteins/genetics , Diabetes Mellitus, Type 2/genetics , Genetic Loci , Membrane Proteins/genetics , Obesity/genetics , tRNA Methyltransferases/genetics , Adolescent , Adult , Aged , Asian People/ethnology , Asian People/genetics , Case-Control Studies , China/epidemiology , Diabetes Mellitus, Type 2/complications , Diabetes Mellitus, Type 2/ethnology , Female , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Male , Middle Aged , Obesity/complications , Obesity/ethnology , Polymorphism, Single Nucleotide , Young Adult
5.
Int J Mol Sci ; 21(6)2020 Mar 22.
Article in English | MEDLINE | ID: mdl-32235704

ABSTRACT

With recent advances in single-cell RNA sequencing, enormous transcriptome datasets have been generated. These datasets have furthered our understanding of cellular heterogeneity and its underlying mechanisms in homogeneous populations. Single-cell RNA sequencing (scRNA-seq) data clustering can group cells belonging to the same cell type based on patterns embedded in gene expression. However, scRNA-seq data are high-dimensional, noisy, and sparse, owing to the limitation of existing scRNA-seq technologies. Traditional clustering methods are not effective and efficient for high-dimensional and sparse matrix computations. Therefore, several dimension reduction methods have been introduced. To validate a reliable and standard research routine, we conducted a comprehensive review and evaluation of four classical dimension reduction methods and five clustering models. Four experiments were progressively performed on two large scRNA-seq datasets using 20 models. Results showed that the feature selection method contributed positively to high-dimensional and sparse scRNA-seq data. Moreover, feature-extraction methods were able to promote clustering performance, although not consistently. Independent component analysis (ICA) performed well in those small compressed feature spaces, whereas principal component analysis was steadier than all the other feature-extraction methods. In addition, ICA was not ideal for fuzzy C-means clustering in scRNA-seq data analysis. K-means clustering was combined with feature-extraction methods to achieve good results.


Subjects
Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Algorithms , Animals , Cluster Analysis , Gene Expression Profiling/methods , Mice , Transcriptome
6.
J Med Internet Res ; 21(5): e12957, 2019 05 24.
Article in English | MEDLINE | ID: mdl-31127715

ABSTRACT

BACKGROUND: It is of great importance for researchers to publish research results in high-quality journals. However, it is often challenging to choose the most suitable publication venue, given the exponential growth of journals and conferences. Although recommender systems have achieved success in promoting movies, music, and products, very few studies have explored recommendation of publication venues, especially for biomedical research. No recommender system exists that can specifically recommend journals in PubMed, the largest collection of biomedical literature. OBJECTIVE: We aimed to propose a publication recommender system, named Pubmender, to suggest suitable PubMed journals based on a paper's abstract. METHODS: In Pubmender, pretrained word2vec was first used to construct the start-up feature space. Subsequently, a deep convolutional neural network was constructed to achieve a high-level representation of abstracts, and a fully connected softmax model was adopted to recommend the best journals. RESULTS: We collected 880,165 papers from 1130 journals in PubMed Central and extracted abstracts from these papers as an empirical dataset. We compared different recommendation models such as Cavnar-Trenkle on the Microsoft Academic Search (MAS) engine, a collaborative filtering-based recommender system for the digital library of the Association for Computing Machinery (ACM) and CiteSeer. We found the accuracy of our system for the top 10 recommendations to be 87.0%, 22.9%, and 196.0% higher than that of MAS, ACM, and CiteSeer, respectively. In addition, we compared our system with Journal Finder and Journal Suggester, which are tools of Elsevier and Springer, respectively, that help authors find suitable journals in their series. The results revealed that the accuracy of our system was 329% higher than that of Journal Finder and 406% higher than that of Journal Suggester for the top 10 recommendations. 
Our web service is freely available at https://www.keaml.cn:8081/. CONCLUSIONS: Our deep learning-based recommender system can suggest an appropriate journal list to help biomedical scientists and clinicians choose suitable venues for their papers.
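
The "top 10 recommendations" accuracy reported in this abstract is a top-k accuracy over softmax outputs. A minimal sketch of that metric follows, assuming each paper has one true journal and the model emits one score per journal; the function names and toy scores are hypothetical and unrelated to the Pubmender code.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_accuracy(score_rows, true_labels, k):
    """Fraction of samples whose true label is among the k highest-probability classes."""
    hits = 0
    for scores, label in zip(score_rows, true_labels):
        probs = softmax(scores)
        ranked = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
        hits += label in ranked[:k]
    return hits / len(true_labels)

# Three hypothetical abstracts scored against three hypothetical journals.
toy_scores = [[2.0, 1.0, 0.1], [0.1, 0.2, 3.0], [1.0, 0.9, 0.8]]
toy_labels = [0, 1, 2]
accuracy_at_2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```

Raising k can only keep or increase this score, which is why top-10 figures are much higher than top-1 figures for the same model.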


Subjects
Deep Learning/trends , Biomedical Research , Humans , Publications , Validation Studies as Topic
7.
Front Big Data ; 7: 1346958, 2024.
Article in English | MEDLINE | ID: mdl-38650693

ABSTRACT

Introduction: Acupuncture and tuina, acknowledged as ancient and highly efficacious therapeutic modalities within the domain of Traditional Chinese Medicine (TCM), have provided pragmatic treatment pathways for numerous patients. To address the problems of ambiguity in the concept of TCM acupuncture and tuina treatment protocols, the lack of accurate quantitative assessment of treatment protocols, and the diversity of TCM systems, we have established a map-filling technique for modern literature to achieve personalized medical recommendations. Methods: (1) Extensive acupuncture and tuina data were collected, analyzed, and processed to establish a concise TCM domain knowledge base. (2) A template-free Chinese text NER joint training method (TemplateFC) was proposed, which enhances the EntLM model with BiLSTM and CRF layers. Appropriate rules were set for ERE. (3) A comprehensive knowledge graph comprising 10,346 entities and 40,919 relationships was constructed based on modern literature. Results: A robust TCM KG with a wide range of entities and relationships was created. The template-free joint training approach significantly improved NER accuracy, especially in Chinese text, addressing issues related to entity identification and tokenization differences. The KG provided valuable insights into acupuncture and tuina, facilitating efficient information retrieval and personalized treatment recommendations. Discussion: The integration of KGs in TCM research is essential for advancing diagnostics and interventions. Challenges in NER and ERE were effectively tackled using hybrid approaches and innovative techniques. The comprehensive TCM KG we built contributes to bridging the gap in TCM knowledge and serves as a valuable resource for specialists and non-specialists alike.

8.
Front Genet ; 14: 1151962, 2023.
Article in English | MEDLINE | ID: mdl-37205122

ABSTRACT

The exploration of important biomarkers associated with cancer development is crucial for diagnosing cancer, designing therapeutic interventions, and predicting prognoses. The analysis of gene co-expression provides a systemic perspective on gene networks and can be a valuable tool for mining biomarkers. The main objective of co-expression network analysis is to discover highly synergistic sets of genes, and the most widely used method is weighted gene co-expression network analysis (WGCNA). With the Pearson correlation coefficient, WGCNA measures gene correlation, and uses hierarchical clustering to identify gene modules. The Pearson correlation coefficient reflects only the linear dependence between variables, and the main drawback of hierarchical clustering is that once two objects are clustered together, the process cannot be reversed. Hence, readjusting inappropriate cluster divisions is not possible. Existing co-expression network analysis methods rely on unsupervised methods that do not utilize prior biological knowledge for module delineation. Here we present a method for identification of outstanding modules in a co-expression network using a knowledge-injected semi-supervised learning approach (KISL), which utilizes a priori biological knowledge and a semi-supervised clustering method to address the issue existing in the current GCN-based clustering methods. To measure the linear and non-linear dependence between genes, we introduce a distance correlation due to the complexity of the gene-gene relationship. Eight RNA-seq datasets of cancer samples are used to validate its effectiveness. In all eight datasets, the KISL algorithm outperformed WGCNA when comparing the silhouette coefficient, Calinski-Harabasz index and Davies-Bouldin index evaluation metrics. According to the results, KISL clusters had better cluster evaluation values and better gene module aggregation.
Enrichment analysis of the recognition modules demonstrated their effectiveness in discovering modular structures in biological co-expression networks. In addition, as a general method, KISL can be applied to various co-expression network analyses based on similarity metrics. Source codes for the KISL and the related scripts are available online at https://github.com/Mowonhoo/KISL.git.
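
The contrast drawn in this abstract between Pearson correlation and distance correlation can be made concrete with a small sketch of the standard sample distance correlation for two one-dimensional variables. It is a plain-Python illustration, not the KISL code; the function names and toy expression values are hypothetical.

```python
import math

def _centered_distances(values):
    """Pairwise absolute distances, double-centered (row, column and grand means)."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    # The matrix is symmetric, so column means equal row means.
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]

def distance_correlation(x, y):
    """Sample distance correlation between two equal-length numeric sequences."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    dcov2 = sum(a[i][j] * b[i][j] for i, j in pairs) / n ** 2
    dvar_x = sum(a[i][j] ** 2 for i, j in pairs) / n ** 2
    dvar_y = sum(b[i][j] ** 2 for i, j in pairs) / n ** 2
    if dvar_x == 0 or dvar_y == 0:
        return 0.0  # a constant variable carries no dependence information
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))

# Toy "expression" values with a purely quadratic relationship.
gene_x = [-2.0, -1.0, 0.0, 1.0, 2.0]
gene_y = [v * v for v in gene_x]
dependence = distance_correlation(gene_x, gene_y)
```

For these toy values the Pearson correlation is exactly zero because the relationship is symmetric around zero, while the distance correlation is clearly positive, which is the kind of non-linear dependence the abstract refers to.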

9.
Sci Rep ; 13(1): 2, 2023 01 02.
Article in English | MEDLINE | ID: mdl-36593288

ABSTRACT

More and more people are under high pressure in modern society, leading to growing mental disorders, such as antenatal depression for pregnant women. Antenatal depression can affect pregnant women's physical and psychological health and child outcomes, and cause postpartum depression. Therefore, it is essential to detect the antenatal depression of pregnant women early. This study aims to predict pregnant women's antenatal depression and identify factors that may lead to antenatal depression. First, a questionnaire was designed, based on the daily life of pregnant women. The survey was conducted on pregnant women in a hospital, where 5666 pregnant women participated. As the collected data are imbalanced and high-dimensional, we developed a one-class classifier named Stacked Auto Encoder Support Vector Data Description (SAE-SVDD) to distinguish depressed pregnant women from normal ones. To validate the method, SAE-SVDD was first applied to three benchmark datasets. The results showed that SAE-SVDD was effective, with its F-scores better than other popular classifiers. For the antenatal depression problem, the F-score of SAE-SVDD was higher than 0.87, demonstrating that the questionnaire is informative and the classification method is successful. Then, by an improved Term Frequency-Inverse Document Frequency (TF-IDF) analysis, the critical factors of antenatal depression were identified as work stress, marital status, husband support, passive smoking, and alcohol consumption. With its generalizability, SAE-SVDD can be applied to analyze other questionnaires.


Subjects
Pregnancy Complications , Pregnant Women , Female , Humans , Pregnancy , Alcohol Drinking , Marital Status , Pregnancy Complications/diagnosis , Pregnant Women/psychology , Surveys and Questionnaires
10.
J Cardiovasc Transl Res ; 16(4): 896-904, 2023 08.
Article in English | MEDLINE | ID: mdl-36928587

ABSTRACT

The visual inspection of coronary artery stenosis is known to be significantly affected by variation, due to the presence of other tissues, camera movements, and uneven illumination. More accurate and intelligent coronary angiography diagnostic models are necessary for addressing these problems. In this study, 2980 medical images from 949 patients are collected and a novel deep learning-based coronary angiography (DLCAG) diagnostic system is proposed. Firstly, we design a module of coronary classification. Then, we introduce RetinaNet to balance positive and negative samples and improve the recognition accuracy. Additionally, DLCAG adopts instance segmentation to segment the stenosis of vessels and depict the degree of the stenosis vessels. Our DLCAG is available at http://101.132.120.184:8077/. To use the system, doctors only need to log in and upload the coronary angiography videos; a diagnostic report is then generated automatically.


Subjects
Coronary Stenosis , Deep Learning , Humans , Coronary Angiography/methods , Constriction, Pathologic , Coronary Stenosis/diagnostic imaging , Heart , Coronary Vessels/diagnostic imaging , Computed Tomography Angiography/methods
11.
Nat Commun ; 14(1): 7554, 2023 Nov 20.
Article in English | MEDLINE | ID: mdl-37985761

ABSTRACT

Lunar surface chemistry is essential for revealing petrological characteristics to understand the evolution of the Moon. Existing chemistry mapping from Apollo and Luna returned samples could only calibrate chemical features before 3.0 Gyr, missing the critical late period of the Moon. Here we present major oxides chemistry maps by adding distinctive 2.0 Gyr Chang'e-5 lunar soil samples in combination with a deep learning-based inversion model. The inferred chemical contents are more precise than the Lunar Prospector Gamma-Ray Spectrometer (GRS) maps and are closest to returned samples abundances compared to existing literature. The verification of in situ measurement data acquired by the Chang'e 3 and Chang'e 4 lunar rovers demonstrated that Chang'e-5 samples are indispensable ground truth in mapping lunar surface chemistry. From these maps, young mare basalt units are determined which can be potential sites for future sample return missions to constrain the late lunar magmatic and thermal history.

12.
Nat Commun ; 11(1): 6358, 2020 12 22.
Article in English | MEDLINE | ID: mdl-33353954

ABSTRACT

Impact craters, which can be considered the lunar equivalent of fossils, are the most dominant lunar surface features and record the history of the Solar System. We address the problem of automatic crater detection and age estimation. From initially small numbers of recognized craters and dated craters, i.e., 7895 and 1411, respectively, we progressively identify new craters and estimate their ages with Chang'E data and stratigraphic information by transfer learning using deep neural networks. This results in the identification of 109,956 new craters, which is more than a dozen times greater than the initial number of recognized craters. The formation systems of 18,996 newly detected craters larger than 8 km are estimated. Here, a new lunar crater database for the mid- and low-latitude regions of the Moon is derived and distributed to the planetary community together with the related data analysis.

13.
Int J Biol Sci ; 15(10): 2065-2074, 2019.
Article in English | MEDLINE | ID: mdl-31592230

ABSTRACT

About 29.8 million people worldwide had been diagnosed with Alzheimer's disease (AD) in 2015, and the number is projected to triple by 2050. In 2018, AD was the fifth leading cause of death in Americans aged 65 years or older, but the progress of AD drug research is very limited. It is helpful to identify the key factors and research trends of AD to guide more effective further studies. We proposed a framework named LDAP, which combined the latent Dirichlet allocation model and affinity propagation algorithm to extract research topics from 95,876 AD-related papers published from 2007 to 2016. Trends and hotspots analyses were performed on LDAP results. We found that the focus points of AD research for the past 10 years include 15 diseases, 15 amino acids, peptides, and proteins, 9 enzymes and coenzymes, 7 hormones, 7 carbohydrates, 5 lipids, 2 organophosphonates, 18 chemicals, 11 compounds, 13 symptoms, and 20 phenomena. Our LDAP framework allowed us to trace the evolution of research trends and the most popular areas of interest (hotspots) on disease, protein, symptom, and phenomena. Meanwhile, 556 AD-related genes were identified, which are enriched in 12 KEGG pathways including the AD pathway and nitrogen metabolism pathway. Our results are freely available at https://www.keaml.cn/Alzheimer.


Subjects
Alzheimer Disease , Biomedical Research/trends , Machine Learning , Humans , PubMed , United States
14.
BMC Syst Biol ; 13(1): 13, 2019 Jan 22.
Article in English | MEDLINE | ID: mdl-30670065

ABSTRACT

It was highlighted that the original article [1] contained a typesetting error in the last name of Allon Canaan. This was incorrectly captured as Allon Canaann in the original article which has since been updated.

15.
J Ambient Intell Humaniz Comput ; 10(5): 2029-2040, 2019 May.
Article in English | MEDLINE | ID: mdl-31068980

ABSTRACT

With the massive volume and rapid growth of data, feature space study is of great importance. To avoid the complex training processes in deep learning models which project original feature space into low-dimensional ones, we propose a novel feature space learning (FSL) model. The main contributions in our approach are: (1) FSL can not only select useful features but also adaptively update feature values and span new feature spaces; (2) four FSL algorithms are proposed with the feature space updating procedure; (3) FSL can provide a better data understanding and learn descriptive and compact feature spaces without the tough training for deep architectures. Experimental results on benchmark data sets demonstrate that FSL-based algorithms performed better than the classical unsupervised, semi-supervised learning and even incremental semi-supervised algorithms. In addition, we show a visualization of the learned feature space results. With the carefully designed learning strategy, FSL dynamically disentangles explanatory factors, suppresses noise accumulation and semantic shift, and constructs easy-to-understand feature spaces.

16.
Neural Process Lett ; 50(1): 103-119, 2019 Aug.
Article in English | MEDLINE | ID: mdl-35035261

ABSTRACT

Automatically describing contents of an image using natural language has drawn much attention because it not only integrates computer vision and natural language processing but also has practical applications. Using an end-to-end approach, we propose a bidirectional semantic attention-based guiding of long short-term memory (Bag-LSTM) model for image captioning. The proposed model consciously refines image features from previously generated text. By fine-tuning the parameters of convolution neural networks, Bag-LSTM obtains more text-related image features via feedback propagation than other models. As opposed to existing guidance-LSTM methods which directly add image features into each unit of an LSTM block, our fine-tuned model dynamically leverages more text-conditional image features, acquired by the semantic attention mechanism, as guidance information. Moreover, we exploit bidirectional gLSTM as the caption generator, which is capable of learning long term relations between visual features and semantic information by making use of both historical and future contextual information. In addition, variations of the Bag-LSTM model are proposed in an effort to sufficiently describe high-level visual-language interactions. Experiments on the Flickr8k and MSCOCO benchmark datasets demonstrate the effectiveness of the model compared with the baseline algorithms; for example, it scores 51.2% higher than BRNN on the CIDEr metric.

17.
Genes (Basel) ; 9(1)2018 Jan 05.
Article in English | MEDLINE | ID: mdl-29303984

ABSTRACT

Lung cancer is the second most commonly diagnosed carcinoma and is the leading cause of cancer death. Although significant progress has been made towards its understanding and treatment, unraveling the complexities of lung cancer is still hampered by a lack of comprehensive knowledge on the mechanisms underlying the disease. High-throughput and multidimensional genomic data have shed new light on cancer biology. In this study, we developed a network-based approach integrating somatic mutations, the transcriptome, DNA methylation, and protein-DNA interactions to reveal the key regulators in lung adenocarcinoma (LUAD). By combining Bayesian network analysis with tissue-specific transcription factor (TF) and targeted gene interactions, we inferred 15 disease-related core regulatory networks in co-expression gene modules associated with LUAD. Through target gene set enrichment analysis, we identified a set of key TFs, including known cancer genes that potentially regulate the disease networks. These TFs were significantly enriched in multiple cancer-related pathways. Specifically, our results suggest that hepatitis viruses may contribute to lung carcinogenesis, highlighting the need for further investigations into the roles that viruses play in treating lung cancer. Additionally, 13 putative regulatory long non-coding RNAs (lncRNAs), including three that are known to be associated with lung cancer, and nine novel lncRNAs were revealed by our study. These lncRNAs and their target genes exhibited high interaction potentials and demonstrated significant expression correlations between normal lung and LUAD tissues. We further extended our study to include 16 solid-tissue tumor types and determined that the majority of these lncRNAs have putative regulatory roles in multiple cancers, with a few showing lung-cancer specific regulations. 
Our study provides a comprehensive investigation of transcription factor and lncRNA regulation in the context of LUAD regulatory networks and yields new insights into the regulatory mechanisms underlying LUAD. The novel key regulatory elements discovered by our research offer new targets for rational drug design and accompanying therapeutic strategies.

18.
Sci Rep ; 8(1): 8995, 2018 Jun 07.
Article in English | MEDLINE | ID: mdl-29875368

ABSTRACT

A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has not been fixed in the paper.

19.
20.
BMC Med Genomics ; 11(Suppl 5): 106, 2018 Nov 20.
Article in English | MEDLINE | ID: mdl-30453959

ABSTRACT

BACKGROUND: Non-small cell lung cancer (NSCLC) accounts for more than about 80% of lung cancers. The early stages of NSCLC can be treated with complete resection with a good prognosis. However, most cases are detected at a late stage of the disease. The average survival rate of the patients with invasive lung cancer is only about 4%. Adenocarcinoma in situ (AIS) is an intermediate subtype of lung adenocarcinoma that exhibits early stage growth patterns but can develop into invasion. METHODS: In this study, we used RNA-seq data from normal, AIS, and invasive lung cancer tissues to identify a gene module that represents the distinguishing characteristics of AIS as AIS-specific genes. Two differential expression analysis algorithms were employed to identify the AIS-specific genes. Then, the subset of the best-performing AIS-specific genes for the early lung cancer prediction was selected by random forest. Finally, the performance of the early lung cancer prediction was assessed using random forest, support vector machine (SVM) and artificial neural networks (ANNs) on four independent early lung cancer datasets including one tumor-educated blood platelets (TEPs) dataset. RESULTS: Based on the differential expression analysis, 107 AIS-specific genes that consisted of 93 protein-coding genes and 14 long non-coding RNAs (lncRNAs) were identified. The significant functions associated with these genes include angiogenesis and ECM-receptor interaction, which are highly related to cancer development and contribute to the smoking-free lung cancers. Moreover, 12 of the AIS-specific lncRNAs are involved in lung cancer progression by potentially regulating the ECM-receptor interaction pathway. The feature selection by random forest identified 20 of the AIS-specific genes as early stage lung cancer signatures using the dataset obtained from The Cancer Genome Atlas (TCGA) lung adenocarcinoma samples.
Of the 20 signatures, two were lncRNAs, BLACAT1 and CTD-2527I21.15, which have been reported to be associated with bladder cancer, colorectal cancer and breast cancer. In blind classification for three independent tissue sample datasets, these signature genes consistently yielded about 98% accuracy for distinguishing early stage lung cancer from normal cases. However, the prediction accuracy for the blood platelet samples was only 64.35% (sensitivity 78.1%, specificity 50.59%, and AUROC 0.747). CONCLUSIONS: The comparison of AIS with normal and invasive tumor revealed disease-specific genes and offered new insights into the mechanism underlying AIS progression into an invasive tumor. These genes can also serve as the signatures for early diagnosis of lung cancer with high accuracy. The expression profile of gene signatures identified from tissue cancer samples yielded remarkable early cancer prediction for tissue samples but relatively lower accuracy for blood platelet samples.
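
The accuracy, sensitivity and specificity quoted in this abstract are all derived from one confusion matrix. The sketch below shows the standard definitions; the counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    tp/fn: cancer cases predicted correctly/incorrectly.
    tn/fp: normal cases predicted correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)   # share of true cases that are detected
    specificity = tn / (tn + fp)   # share of normal cases that are cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Hypothetical counts: 10 cancer samples and 10 normal samples.
sens, spec, acc = classification_metrics(tp=8, fn=2, tn=5, fp=5)
```

When the two classes are the same size, accuracy is simply the mean of sensitivity and specificity. The mean of the reported 78.1% and 50.59% is 64.345%, which would be consistent with, though it does not establish, roughly balanced classes in the platelet dataset.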


Subjects
Adenocarcinoma in Situ/pathology , Lung Neoplasms/pathology , Adenocarcinoma in Situ/genetics , Area Under Curve , Databases, Genetic , Disease Progression , Gene Expression Regulation, Neoplastic , Humans , Lung Neoplasms/genetics , Machine Learning , Neoplasm Staging , Open Reading Frames/genetics , RNA, Long Noncoding/genetics , ROC Curve , Transcriptome