ABSTRACT
Background and objective: Pelvic bone tumors represent a harmful orthopedic condition, encompassing both benign and malignant forms. Addressing the limited accuracy of current machine learning algorithms for bone tumor image segmentation, we have developed an enhanced segmentation algorithm that combines an improved fully convolutional neural network (FCNN-4s) with a conditional random field (CRF) to achieve more precise segmentation. Methodology: The improved fully convolutional network (FCNN-4s) was employed to perform initial segmentation on preprocessed images. Batch normalization layers were introduced after each convolutional layer to expedite training convergence and enhance the accuracy of the trained model. Subsequently, a fully connected conditional random field (CRF) was integrated to fine-tune the segmentation results, refining the boundaries of pelvic bone tumors and achieving high-quality segmentation. Results: The experimental outcomes demonstrate a significant enhancement in segmentation accuracy and stability compared with the conventional convolutional neural network segmentation algorithm. The algorithm achieves an average Dice coefficient of 93.31% and offers superior real-time performance. Conclusion: In contrast to the conventional convolutional neural network segmentation algorithm, the algorithm presented in this paper has a more refined structure that proficiently addresses over-segmentation and under-segmentation in pelvic bone tumor segmentation. The model exhibits superior real-time performance and robust stability, and achieves heightened segmentation accuracy.
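The batch-normalization placement described above is a standard pattern; the following minimal PyTorch sketch (our illustration, not the authors' code; channel counts and input size are invented) shows a convolutional block with batch normalization inserted after the convolution, as in FCNN-4s.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch: int, out_ch: int) -> nn.Sequential:
    """One convolutional block with batch normalization after the conv layer."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),   # speeds up training convergence
        nn.ReLU(inplace=True),
    )

# Illustrative encoder stage; layer sizes are assumptions, not FCNN-4s specs.
encoder = nn.Sequential(conv_bn_relu(1, 64), conv_bn_relu(64, 64), nn.MaxPool2d(2))
features = encoder(torch.randn(1, 1, 256, 256))  # dummy single-channel slice
print(features.shape)  # torch.Size([1, 64, 128, 128])
```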
ABSTRACT
BACKGROUND: In the healthcare domain today, despite the substantial adoption of electronic health information systems, a significant proportion of medical reports still exist in paper-based formats. As a result, there is a significant demand for digitizing the information in these paper-based reports. However, the digitization of paper-based laboratory reports into a structured data format can be challenging because of their non-standard layouts, which include various data types such as text, numeric values, reference ranges, and units. Therefore, it is crucial to develop a highly scalable and lightweight technique that can effectively identify and extract information from laboratory test reports and convert them into a structured data format for downstream tasks. METHODS: We developed an end-to-end Natural Language Processing (NLP)-based pipeline for extracting information from paper-based laboratory test reports. Our pipeline consists of two main modules: an optical character recognition (OCR) module and an information extraction (IE) module. The OCR module is applied to locate and identify text from scanned laboratory test reports using state-of-the-art OCR algorithms. The IE module is then used to extract meaningful information from the OCR results to form digitized tables of the test reports. The IE module consists of five sub-modules: time detection, headline position, line normalization, Named Entity Recognition (NER) with a Conditional Random Fields (CRF)-based method, and step detection for multi-column layouts. Finally, we evaluated the performance of the proposed pipeline on 153 laboratory test reports collected from Peking University First Hospital (PKU1). RESULTS: In the OCR module, we evaluated the accuracy of text detection and recognition results at three different levels and achieved an average accuracy of 0.93. In the IE module, we extracted four laboratory test entities: test item name, test result, test unit, and reference value range. The overall F1 score is 0.86 on the 153 laboratory test reports collected from PKU1. With a single CPU, the average inference time per report is only 0.78 s. CONCLUSION: In this study, we developed a practical lightweight pipeline to digitize and extract information from paper-based laboratory test reports of diverse types and with different layouts that can be adopted in real clinical environments with minimal computing resource requirements. The high evaluation performance on the real-world hospital dataset validated the feasibility of the proposed pipeline.
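As a rough sketch of the CRF-based NER sub-module, the snippet below uses the sklearn-crfsuite package; the feature set, the label scheme (TEST_NAME, RESULT, UNIT, REF_RANGE) and the sample report line are all illustrative assumptions, not the paper's actual design.

```python
import sklearn_crfsuite

def token_features(tokens, i):
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.replace(".", "", 1).isdigit(),   # numeric test results
        "has_dash": "-" in tok,                          # ranges like 11.0-15.0
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# One OCR'd report line; the label names are illustrative, not the paper's.
line = ["Hemoglobin", "13.2", "g/dL", "11.0-15.0"]
X = [[token_features(line, i) for i in range(len(line))]]
y = [["TEST_NAME", "RESULT", "UNIT", "REF_RANGE"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)
print(crf.predict(X))
```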
Subjects
Algorithms, Natural Language Processing, Humans, Information Storage and Retrieval, University Hospitals, Electronic Health Records
ABSTRACT
Background and objective: Bone tumors are a harmful orthopedic disease with both benign and malignant forms. Aiming at the limited accuracy of existing machine learning algorithms for bone tumor image segmentation, we propose a bone tumor image segmentation algorithm based on an improved fully convolutional neural network, which consists of a fully convolutional network (FCNN-4s) and a conditional random field (CRF). Methodology: The improved fully convolutional neural network (FCNN-4s) was used to perform coarse segmentation on preprocessed images. Batch normalization layers were added after each convolutional layer to accelerate the convergence of network training and improve the accuracy of the trained model. Then, a fully connected conditional random field (CRF) was fused to refine the bone tumor boundaries in the coarse segmentation results, achieving fine segmentation. Results: The experimental results show that, compared with the traditional convolutional neural network segmentation algorithm, the proposed algorithm greatly improves segmentation accuracy and stability; the average Dice coefficient reaches 91.56%, and real-time performance is better. Conclusion: Compared with the traditional convolutional neural network segmentation algorithm, the algorithm in this paper has a more refined structure, which can effectively solve the problems of over-segmentation and under-segmentation of bone tumors. The segmentation model has better real-time performance and strong stability, and can achieve higher segmentation accuracy.
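For the fully connected CRF refinement stage, a common off-the-shelf implementation is the pydensecrf package; the sketch below shows how a coarse softmax output could be refined, with kernel parameters guessed for illustration rather than taken from the paper.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def refine(probs: np.ndarray, image: np.ndarray, iters: int = 5) -> np.ndarray:
    """probs: (n_labels, H, W) softmax output of the FCNN; image: (H, W, 3) uint8."""
    n_labels, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, n_labels)
    d.setUnaryEnergy(unary_from_softmax(probs))
    d.addPairwiseGaussian(sxy=3, compat=3)                           # smoothness kernel
    d.addPairwiseBilateral(sxy=60, srgb=10, rgbim=image, compat=10)  # appearance kernel
    q = np.array(d.inference(iters)).reshape(n_labels, h, w)
    return q.argmax(axis=0)  # refined label map
```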
ABSTRACT
To date, several POS taggers have been introduced to facilitate semantic analysis in different languages. However, POS tagging becomes intricate in morphologically complex languages such as Amharic. In this paper, we evaluated different models for Amharic POS tagging: a bidirectional long short-term memory network, a convolutional neural network combined with a bidirectional long short-term memory network, and a conditional random field. Various features, both language-dependent and language-independent, were explored in the conditional random field model. In addition, word-level and character-level features were analyzed in the deep neural network models, with a convolutional neural network used to encode features at the word and character levels. Each model's performance was evaluated on a dataset containing 321K tokens manually tagged with 31 POS tags. The best performance, 97.23% accuracy, was obtained by the end-to-end deep neural network model combining a convolutional neural network, a bidirectional long short-term memory network, and a conditional random field. This is the highest accuracy reported for the Amharic POS tagging task and is competitive with contemporary taggers for other languages.
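The character-level encoding step can be illustrated with a small PyTorch module; vocabulary size, embedding width and filter count below are invented for demonstration, not the paper's settings.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Encode a word's characters into a fixed-size feature vector."""
    def __init__(self, n_chars=300, char_dim=30, n_filters=30, kernel=3):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel, padding=1)

    def forward(self, char_ids):                  # (batch, word_len)
        x = self.emb(char_ids).transpose(1, 2)    # (batch, char_dim, word_len)
        x = torch.relu(self.conv(x))
        return x.max(dim=2).values                # max-over-time pooling

# One word of 7 characters; the output feeds the BiLSTM alongside word vectors.
print(CharCNN()(torch.randint(1, 300, (1, 7))).shape)  # torch.Size([1, 30])
```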
ABSTRACT
BACKGROUND: Ground-glass opacities (GGOs) appearing in computed tomography (CT) scans may indicate potential lung malignancy. Proper management of GGOs based on their features can prevent the development of lung cancer. Electronic health records are rich sources of information on GGO nodules and their granular features, but most of the valuable information is embedded in unstructured clinical notes. OBJECTIVE: We aimed to develop, test, and validate a deep learning-based natural language processing (NLP) tool that automatically extracts GGO features to inform the longitudinal trajectory of GGO status from large-scale radiology notes. METHODS: We developed a bidirectional long short-term memory with conditional random field deep-learning NLP pipeline to extract GGO and granular features of GGO retrospectively from radiology notes of 13,216 lung cancer patients. We evaluated the pipeline with quality assessments and characterized the longitudinal distribution of nodule features across the cohort to assess changes in size and solidity over time. RESULTS: Our NLP pipeline, built on the GGO ontology we developed, achieved between 95% and 100% precision, 89% and 100% recall, and 92% and 100% F1-scores on different GGO features. We deployed this GGO NLP model to extract and structure comprehensive characteristics of GGOs from 29,496 radiology notes of 4521 lung cancer patients. Longitudinal analysis revealed that size increased in 16.8% (240/1424) of patients, decreased in 14.6% (208/1424), and remained unchanged in 68.5% (976/1424) in their last note compared with the first note. Among 1127 patients who had longitudinal radiology notes of GGO status, 815 (72.3%) were reported to have stable status, and 259 (23%) had increased/progressed status in the subsequent notes. CONCLUSIONS: Our deep learning-based NLP pipeline can automatically extract granular GGO features at scale from electronic health records when this information is documented in radiology notes and help inform the natural history of GGO. This will open the way for a new paradigm in lung cancer prevention and early detection.
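At inference time, a (BiLSTM-)CRF tagger of this kind decodes the best label sequence with the Viterbi algorithm. The self-contained sketch below uses made-up scores and a toy BIO label set for GGO mentions; it is an illustration of the decoding step, not the authors' code.

```python
import numpy as np

def viterbi(emissions: np.ndarray, transitions: np.ndarray) -> list:
    """emissions: (T, K) per-token label scores; transitions: (K, K) scores."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t]  # (prev K, curr K)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):        # follow back-pointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

labels = ["O", "B-GGO", "I-GGO"]
em = np.array([[2., 0., 0.], [0., 3., 0.], [0., 0., 2.]])  # toy token scores
tr = np.zeros((3, 3)); tr[1, 2] = 1.0                      # favor B-GGO -> I-GGO
print([labels[i] for i in viterbi(em, tr)])                # ['O', 'B-GGO', 'I-GGO']
```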
ABSTRACT
Reading psychology is an important basis for formulating news strategies. The purpose of this paper is to study how to analyze the psychological mechanism of reading and news communication strategies based on affective computing. The paper introduces affective computing technology, expounds the concept of conditional random fields and related algorithms, and designs and analyzes cases of news communication strategies. The 300 news and information short videos with the highest playback volume released by the Z software between April 1, 2022 and June 1, 2022 were used as research samples and analyzed from multiple angles. Among the 300 selected news videos, 252 (84.00%) were original content, and the remaining short videos came from the platform's self-media accounts. At present, it is necessary to continuously develop intelligent products that meet the needs of users at different stages, and to carry out personalized design in strict accordance with the living habits and natural conditions of different groups of people. The experience of different groups using new media products should then be continuously improved so that their reading interest is stimulated more effectively.
ABSTRACT
Sequence-structure alignment for protein sequences is an important task for template-based modeling of protein 3D structures. Building a reliable sequence-structure alignment is a challenging problem, especially for remote-homologue target proteins. We built a sequence-structure alignment method called CRFalign, which improves upon a base alignment model based on HMM-HMM comparison by employing pairwise conditional random fields in combination with nonlinear scoring functions of structural and sequence features. The nonlinear scoring part is implemented by a set of gradient boosted regression trees. In addition to sequence profile features, various position-dependent structural features are employed, including secondary structures and solvent accessibilities. Training is performed on reference alignments at the superfamily or twilight-zone level chosen from the SABmark benchmark set. We found that CRFalign produces a relative improvement in average alignment accuracy on validation sets of the SABmark benchmark. We also tested CRFalign on 51 sequence-structure pairs involving 15 FM target domains of CASP14, where CRFalign improved the average modeling accuracy on these hard targets (TM-CRFalign ≃42.94%) compared with HHalign (TM-HHalign ≃39.05%) and MRFalign (TM-MRFalign ≃36.93%). CRFalign was incorporated into our template search framework CRFpred and tested on a random set of 300 target proteins consisting of Easy, Medium and Hard sets, where it showed reasonable template search performance.
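The nonlinear scoring idea can be sketched with scikit-learn's gradient boosted regression trees mapping paired features to a match score that a pairwise CRF aligner could consume; the features and training targets below are synthetic stand-ins, not CRFalign's actual inputs.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Illustrative columns: profile similarity, secondary-structure match,
# solvent-accessibility difference (synthetic data, invented relationship).
X = rng.random((500, 3))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(0, 0.1, 500)

scorer = GradientBoostingRegressor(n_estimators=200, max_depth=3).fit(X, y)
pair = np.array([[0.8, 1.0, 0.1]])   # one residue pair's features
print(scorer.predict(pair))          # nonlinear match score for the CRF
```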
Subjects
Algorithms, Proteins, Amino Acid Sequence, Protein Secondary Structure, Proteins/chemistry, Sequence Alignment, Solvents
ABSTRACT
Named entities are the main carriers of relevant medical knowledge in Electronic Medical Records (EMR). Because of the specific structure of the Chinese language, clinical electronic medical records suffer from problems such as word segmentation ambiguity and polysemy, so a Clinical Named Entity Recognition (CNER) model based on multi-head self-attention combined with a BILSTM neural network and Conditional Random Fields is proposed. First, a pre-trained language model organically combines char vectors and word vectors for the text sequences of the original dataset. The sequences are then fed in parallel into the multi-head self-attention module and the BILSTM neural network module, and the outputs of the two modules are spliced to obtain multi-level information such as contextual information and feature-association weights. Finally, entity annotation is performed by the CRF. Multiple comparison experiments show that the structure of the proposed model is reasonable and robust and effectively improves Chinese CNER. The model can extract multi-level and more comprehensive text features, compensating for the loss of long-distance dependencies, and offers better applicability and recognition performance.
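The parallel attention/BiLSTM structure with spliced outputs can be sketched in PyTorch as follows; dimensions and head counts are illustrative assumptions, and the CRF layer on top is omitted.

```python
import torch
import torch.nn as nn

class ParallelEncoder(nn.Module):
    """Multi-head self-attention branch and BiLSTM branch over the same
    embeddings, concatenated before a CRF layer (not shown)."""
    def __init__(self, dim=128, heads=4, hidden=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.bilstm = nn.LSTM(dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, x):                     # x: (batch, seq, dim)
        a, _ = self.attn(x, x, x)             # self-attention branch
        b, _ = self.bilstm(x)                 # BiLSTM branch, 2*hidden wide
        return torch.cat([a, b], dim=-1)      # spliced features for the CRF

enc = ParallelEncoder()
print(enc(torch.randn(2, 20, 128)).shape)     # torch.Size([2, 20, 256])
```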
Subjects
Language, Natural Language Processing, Attention, China, Electronic Health Records
ABSTRACT
BACKGROUND: The COVID-19 pandemic has created a pressing need for integrating information from disparate sources in order to assist decision makers. Social media is important in this respect; however, to make sense of the textual information it provides and be able to automate the processing of large amounts of data, natural language processing methods are needed. Social media posts are often noisy, yet they may provide valuable insights regarding the severity and prevalence of the disease in the population. Here, we adopt a triage and diagnosis approach to analyzing social media posts using machine learning techniques for the purpose of disease detection and surveillance. We thus obtain useful prevalence and incidence statistics to identify disease symptoms and their severities, motivated by public health concerns. OBJECTIVE: This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts in order to provide researchers and public health practitioners with additional information on the symptoms, severity, and prevalence of the disease rather than to provide an actionable decision at the individual level. METHODS: The text processing pipeline first extracted COVID-19 symptoms and related concepts, such as severity, duration, negations, and body parts, from patients' posts using conditional random fields. An unsupervised rule-based algorithm was then applied to establish relations between concepts in the next step of the pipeline. The extracted concepts and relations were subsequently used to construct 2 different vector representations of each post. These vectors were separately applied to build support vector machine learning models to triage patients into 3 categories and diagnose them for COVID-19. RESULTS: We reported macro- and microaveraged F1 scores in the range of 71%-96% and 61%-87%, respectively, for the triage and diagnosis of COVID-19 when the models were trained on human-labeled data. Our experimental results indicated that similar performance can be achieved when the models are trained using predicted labels from concept extraction and rule-based classifiers, thus yielding end-to-end machine learning. In addition, we highlighted important features uncovered by our diagnostic machine learning models and compared them with the most frequent symptoms revealed in another COVID-19 data set. In particular, we found that the most important features are not always the most frequent ones. CONCLUSIONS: Our preliminary results show that it is possible to automatically triage and diagnose patients for COVID-19 from social media natural language narratives, using a machine learning pipeline in order to provide information on the severity and prevalence of the disease for use within health surveillance systems.
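The last stage of such a pipeline, vectorizing extracted concepts and training an SVM, can be sketched with scikit-learn; the concept features, triage categories and tiny dataset below are invented for illustration and do not reproduce the paper's vector representations.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC

# Each post is reduced to its extracted concepts/relations (illustrative names).
posts = [
    {"symptom=cough": 1, "severity=mild": 1},
    {"symptom=dyspnea": 1, "severity=severe": 1, "body_part=chest": 1},
    {"symptom=fever": 1, "duration=days": 1},
]
triage = ["stay_home", "seek_help", "stay_home"]   # toy category labels

vec = DictVectorizer()
X = vec.fit_transform(posts)
clf = SVC(kernel="linear").fit(X, triage)
print(clf.predict(vec.transform([{"symptom=dyspnea": 1, "severity=severe": 1}])))
```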
Subjects
COVID-19, Social Media, COVID-19/diagnosis, COVID-19/epidemiology, Humans, Natural Language Processing, Pandemics, SARS-CoV-2, Triage
ABSTRACT
Purpose: Multisource images are of great interest in medical imaging because they enable the use of complementary information from different sources, such as the T1 and T2 modalities in MRI. However, multisource data can also be subject to redundancy and correlation. The question is how to efficiently fuse the multisource information without reinforcing the redundancy. We propose a method for segmenting multisource images that are statistically correlated. Approach: The proposed method continues prior work in which we introduced the copula model into hidden Markov fields (HMF). To achieve the multisource segmentations, we use a functional measure of dependency called a "copula," which is incorporated into conditional random fields (CRF). Contrary to HMF, where prior knowledge of the hidden states is modeled by the HMF itself, in CRF there is no prior information and only the distribution of the hidden states conditional on the observations can be known. This conditional distribution depends on the data and can be modeled by an energy function composed of two terms. The first groups voxels with similar intensities in the same class; the second encourages a pair of voxels to be in the same class if the difference between their intensities is not too large. Results: A comparison between HMF and CRF is performed via theory and experiments using both simulated and real data from BRATS 2013. Moreover, our method is compared with different state-of-the-art methods, including supervised (convolutional neural networks) and unsupervised (hierarchical MRF) approaches. Our unsupervised method gives results similar to decision trees on synthetic images and to convolutional neural networks on real images; both of those methods are supervised. Conclusions: We compare two statistical methods using the copula, HMF and CRF, to deal with multicorrelated images and demonstrate the interest of using the copula. In both models, the copula considerably improves the results compared with individual segmentations.
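In our notation (the authors' exact potentials are not given in the abstract), a two-term CRF energy of the kind described might be written as:

```latex
\begin{equation}
E(\mathbf{x}\mid\mathbf{y}) \;=\; \sum_{i} \psi_u(x_i, y_i)
\;+\; \sum_{(i,j)\in\mathcal{N}}
\exp\!\left(-\frac{(y_i - y_j)^2}{2\sigma^2}\right)\,
\mathbb{1}\!\left[x_i \neq x_j\right],
\end{equation}
```

where the unary term ψ_u groups voxels of similar intensity into the same class, and the pairwise term penalizes label changes between neighboring voxels whose intensities differ little.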
ABSTRACT
Imaging sonar systems are widely used for monitoring fish behavior in turbid or low-ambient-light waters. Analyzing fish behavior in sonar images often requires fish segmentation. In this paper, Mask R-CNN is adopted for segmenting fish in sonar images. Sonar images acquired from different shallow waters can differ considerably in the contrast between fish and background, and that difference can make a Mask R-CNN trained on examples collected from one fish farm ineffective for fish segmentation at other fish farms. In this paper, a preprocessing convolutional neural network (PreCNN) is proposed to provide "standardized" feature maps for Mask R-CNN and to ease applying a Mask R-CNN trained on one fish farm to others. PreCNN aims at decoupling the learning of fish instances from the learning of fish-culture environments. PreCNN is a semantic segmentation network integrated with conditional random fields; it can utilize successive sonar images and can be trained by semi-supervised learning to make use of unlabeled information. Experimental results have shown that Mask R-CNN on the output of PreCNN is more accurate than Mask R-CNN directly on sonar images, and applying Mask R-CNN plus PreCNN trained on one fish farm to new fish farms is also more effective.
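Chaining a standardizing network in front of Mask R-CNN can be sketched with torchvision; the PreCNN below is a crude placeholder (the real PreCNN is a CRF-integrated semantic segmentation network), and the class count is an assumption.

```python
import torch
import torch.nn as nn
from torchvision.models.detection import maskrcnn_resnet50_fpn

pre_cnn = nn.Sequential(               # placeholder PreCNN: 1-ch sonar -> 3-ch map
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid(),  # keep values in [0, 1]
)
mask_rcnn = maskrcnn_resnet50_fpn(num_classes=2).eval()  # fish vs background

sonar = torch.rand(1, 1, 512, 512)     # dummy sonar frame
with torch.no_grad():
    outputs = mask_rcnn(list(pre_cnn(sonar)))  # standard Mask R-CNN interface
print(outputs[0].keys())               # boxes, labels, scores, masks
```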
Subjects
Computer-Assisted Image Processing, Computer Neural Networks, Sound, Specimen Handling, Supervised Machine Learning
ABSTRACT
Semantic segmentation is one of the essential prerequisites for computer vision tasks, but edge-precise segmentation remains challenging due to the potential lack of a proper model of the low-level relations between pixels. We previously presented Refined UNet v2, a concatenation of a network backbone and a subsequent embedded conditional random field (CRF) layer, which coarsely performs pixel-wise classification and refines the edges of segmentation regions in a one-stage way. However, the CRF layer of v2 employs a gray-scale global observation (image) to construct contrast-sensitive bilateral features, which cannot achieve the desired performance on ambiguous edges. In addition, the naïve depth-wise Gaussian filter cannot always be computed efficiently, especially for longer-range message-passing steps. To address these issues, in this paper we upgrade the bilateral message-passing kernel and the efficient implementation of Gaussian filtering in the CRF layer, yielding Refined UNet v3, which effectively captures ambiguous edges and accelerates the message-passing procedure. Specifically, the inherited UNet coarsely locates cloud and shadow regions, and the embedded CRF layer refines the edges of the forthcoming segmentation proposals. A multi-channel guided Gaussian filter is applied to the bilateral message-passing step, improving the detection of ambiguous edges that are hard for the gray-scale counterpart to identify, while fast Fourier transform-based (FFT-based) Gaussian filtering enables an efficient and potentially range-agnostic implementation. Furthermore, Refined UNet v3 can be extended to segmentation on multi-spectral datasets, and the corresponding refinement examination confirms the improvement in shadow retrieval. Experiments and corresponding results demonstrate that the proposed update outperforms its counterpart in terms of detecting vague edges, shadow retrieval, and isolated redundant regions, and that it is practically efficient in our TensorFlow implementation. The demo source code is available at https://github.com/92xianshen/refined-unet-v3.
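FFT-based Gaussian filtering, which makes the message-passing cost independent of the kernel range, can be reproduced with SciPy's frequency-domain Gaussian; this is a generic sketch of the technique, not the repository's TensorFlow code.

```python
import numpy as np
from scipy.ndimage import fourier_gaussian

def fft_gaussian_blur(img: np.ndarray, sigma: float) -> np.ndarray:
    """Blur by multiplying in the frequency domain: O(N log N) for any sigma."""
    return np.fft.ifft2(fourier_gaussian(np.fft.fft2(img), sigma=sigma)).real

x = np.random.rand(256, 256)
print(np.allclose(fft_gaussian_blur(x, 1e-6), x, atol=1e-6))  # tiny sigma ~ identity
```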
Subjects
Computer-Assisted Image Processing, Computer Neural Networks
ABSTRACT
Neuronal ensembles are groups of neurons with coordinated activity that could represent sensory, motor, or cognitive states. The study of how neuronal ensembles are built, recalled, and involved in the guiding of complex behaviors has been limited by the lack of experimental and analytical tools to reliably identify and manipulate neurons that have the ability to activate entire ensembles. Such pattern completion neurons have also been proposed as key elements of artificial and biological neural networks. Indeed, the relevance of pattern completion neurons is highlighted by growing evidence that targeting them can activate neuronal ensembles and trigger behavior. As a method to reliably detect pattern completion neurons, we use conditional random fields (CRFs), a type of probabilistic graphical model. We apply CRFs to identify pattern completion neurons in ensembles in experiments using in vivo two-photon calcium imaging from primary visual cortex of male mice and confirm the CRF predictions with two-photon optogenetics. To test the broader applicability of CRFs, we also analyze publicly available calcium imaging data (Allen Institute Brain Observatory dataset) and demonstrate that CRFs can reliably identify neurons that predict specific features of visual stimuli. Finally, to explore the scalability of CRFs, we apply them to in silico network simulations and show that CRF-identified pattern completion neurons have increased functional connectivity. These results demonstrate the potential of CRFs to characterize and selectively manipulate neural circuits. SIGNIFICANCE STATEMENT We describe a graph theory method to identify and optically manipulate neurons with pattern completion capability in mouse cortical circuits. Using calcium imaging and two-photon optogenetics in vivo, we confirm that key neurons identified by this method can recall entire neuronal ensembles. This method could be broadly applied to manipulate neuronal ensemble activity to trigger behavior or for therapeutic applications in brain prostheses.
Subjects
Neurological Models, Neurons/physiology, Visual Pattern Recognition/physiology, Probability, Visual Cortex/physiology, Animals, Male, Mice, Inbred C57BL Mice, Multiphoton Fluorescence Microscopy/methods, Neurons/chemistry, Optogenetics/methods, Photic Stimulation/methods, Visual Cortex/chemistry, Visual Cortex/cytology
ABSTRACT
Recently, automatic computer-aided detection (CAD) of COVID-19 using radiological images has received a great deal of attention from many researchers and medical practitioners, and consequently several CAD frameworks and methods have been presented in the literature to assist the radiologist physicians in performing diagnostic COVID-19 tests quickly, reliably and accurately. This paper presents an innovative framework for the automatic detection of COVID-19 from chest X-ray (CXR) images, in which a rich and effective representation of lung tissue patterns is generated from the gray level co-occurrence matrix (GLCM) based textural features. The input CXR image is first preprocessed by spatial filtering along with median filtering and contrast limited adaptive histogram equalization to improve the CXR image's poor quality and reduce image noise. Automatic thresholding by the optimized formula of Otsu's method is applied to find a proper threshold value to best segment lung regions of interest (ROIs) out from CXR images. Then, a concise set of GLCM-based texture features is extracted to accurately represent the segmented lung ROIs of each CXR image. Finally, the normalized features are fed into a trained discriminative latent-dynamic conditional random fields (LDCRFs) model for fine-grained classification to divide the cases into two categories: COVID-19 and non-COVID-19. The presented method has been experimentally tested and validated on a relatively large dataset of frontal CXR images, achieving an average accuracy, precision, recall, and F1-score of 95.88%, 96.17%, 94.45%, and 95.79%, respectively, which compare favorably with and occasionally exceed those previously reported in similar studies in the literature.
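The preprocessing and GLCM feature stage can be approximated with scikit-image as below; filter settings, GLCM offsets and the chosen texture properties are illustrative guesses, not the paper's exact configuration, and the LDCRF classifier itself is omitted.

```python
import numpy as np
from skimage.exposure import equalize_adapthist
from skimage.feature import graycomatrix, graycoprops
from skimage.filters import median, threshold_otsu

cxr = np.random.randint(0, 256, (256, 256)).astype(np.uint8)   # stand-in CXR
enhanced = (equalize_adapthist(cxr) * 255).astype(np.uint8)    # CLAHE
enhanced = median(enhanced)                                    # median denoising
mask = enhanced > threshold_otsu(enhanced)                     # crude ROI proxy
roi = np.where(mask, enhanced, 0)                              # keep segmented region

glcm = graycomatrix(roi, distances=[1], angles=[0, np.pi / 2],
                    levels=256, symmetric=True, normed=True)
features = [graycoprops(glcm, p).mean()
            for p in ("contrast", "homogeneity", "energy", "correlation")]
print(features)  # texture vector that would feed the LDCRF classifier
```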
Subjects
COVID-19, Humans, SARS-CoV-2
ABSTRACT
Promoter annotation is an important task in the analysis of a genome. One of the main challenges in this task is locating the border between the promoter region and the transcribing region of the gene, the transcription start site (TSS). The TSS is the reference point for delimiting the DNA sequence responsible for the assembly of the transcribing complex. As the same gene can have more than one TSS, to delimit the promoter region it is important to locate the TSS closest to the site of the beginning of translation. This paper presents TSSFinder, new software for the prediction of the TSS signal of eukaryotic genes that is significantly more accurate than other available software. We are currently the only application to offer pre-trained models for six different eukaryotic organisms: Arabidopsis thaliana, Drosophila melanogaster, Gallus gallus, Homo sapiens, Oryza sativa and Saccharomyces cerevisiae. Additionally, our software can be easily customized for specific organisms using only 125 DNA sequences with a validated TSS signal and corresponding genomic locations as a training set. TSSFinder is a valuable new tool for the annotation of genomes. The TSSFinder source code and a docker container can be downloaded from http://tssfinder.github.io; alternatively, TSSFinder is also available as a web service at http://sucest-fun.org/wsapp/tssfinder/.
Subjects
Computational Biology/methods, Eukaryota/genetics, Genome, Genomics/methods, Genetic Promoter Regions, Software, Transcription Start Site, Algorithms, Genetic Databases, Reproducibility of Results, DNA Sequence Analysis, Web Browser
ABSTRACT
TransMembrane β-Barrel (TMBB) proteins located in the outer membranes of Gram-negative bacteria are crucial for many important biological processes and primary candidates as drug targets. Structure determination of TMBB proteins is challenging and hence computational methods devised for the analysis of TMBB proteins are important for complementing experimental approaches. Here, we present a novel web server called BetAware-Deep that is able to accurately identify the topology of TMBB proteins (i.e. the number and orientation of membrane-spanning segments along the protein sequence) and to discriminate them from other protein types. The method in BetAware-Deep defines new features by exploiting a non-canonical computation of the hydrophobic moment and by adopting sequence-profile weighting of the White & Wimley hydrophobicity scale. These features are processed using a two-step approach based on deep learning and probabilistic graphical models. BetAware-Deep has been trained on a dataset comprising 58 TMBBs and benchmarked on a novel set of 15 TMBB proteins. Results showed that BetAware-Deep outperforms two recently released state-of-the-art methods for topology prediction, predicting correct topologies of 10 out of 15 proteins. TMBB detection was also assessed on a larger dataset comprising 1009 TMBB proteins and 7571 non-TMBB proteins. Even in this benchmark, BetAware-Deep scored at the level of top-performing methods. A web server has been developed allowing users to analyze input protein sequences and providing topology prediction together with a rich set of information including a graphical representation of the residue-level annotations and prediction probabilities. BetAware-Deep is available at https://busca.biocomp.unibo.it/betaware2.
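For reference, the classical (Eisenberg-style) hydrophobic moment is computed as below; BetAware-Deep's non-canonical variant and its profile-weighted White & Wimley scale are not reproduced here, and the scale values in the sketch are invented.

```python
import math

def hydrophobic_moment(h_values, delta_deg=100.0):
    """|sum_n h_n * exp(i*n*delta)|; delta ~100 deg for helices, ~160-180 for strands."""
    delta = math.radians(delta_deg)
    re = sum(h * math.cos(n * delta) for n, h in enumerate(h_values))
    im = sum(h * math.sin(n * delta) for n, h in enumerate(h_values))
    return math.hypot(re, im)

scale = {"L": 0.56, "K": -0.99, "F": 0.32, "S": -0.13}   # invented example values
window = "LKLFSKLL"
print(hydrophobic_moment([scale.get(a, 0.0) for a in window], delta_deg=160.0))
```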
Subjects
Algorithms, Internet, Membrane Proteins/chemistry, Protein Databases, Prokaryotic Cells, Protein Secondary Structure
ABSTRACT
Increasingly popular online museums have significantly changed the way people acquire cultural knowledge, and they have been generating abundant amounts of cultural-relics data. In recent years, researchers have used deep learning models that can automatically extract complex features and have rich representation capabilities to implement named-entity recognition (NER). However, the lack of labeled data in the field of cultural relics makes it difficult for deep learning models that rely on labeled data to achieve excellent performance. To address this problem, this paper proposes a semi-supervised deep learning model named SCRNER (Semi-supervised model for Cultural Relics' Named Entity Recognition) that utilizes a bidirectional long short-term memory (BiLSTM) and conditional random fields (CRF) model trained on scarce labeled data and abundant unlabeled data to attain effective performance. To satisfy semi-supervised sample selection, we propose a repeat-labeled (relabeled) strategy that selects samples of high confidence to enlarge the training set iteratively. In addition, we use embeddings from language models (ELMo) to dynamically acquire word representations as the input of the model, addressing the blurred boundaries of cultural-object names and the particular characteristics of Chinese texts in the field of cultural relics. Experimental results demonstrate that our proposed model, trained on limited labeled data, achieves effective performance on the task of named entity recognition of cultural relics.
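The relabeled strategy amounts to a self-training loop like the schematic below; `train`, `predict_with_confidence` and the threshold are placeholders, not SCRNER's actual interfaces.

```python
def self_train(labeled, unlabeled, train, predict_with_confidence,
               threshold=0.95, rounds=5):
    """Iteratively tag unlabeled sentences, keep high-confidence ones, retrain."""
    model = train(labeled)
    for _ in range(rounds):
        confident, rest = [], []
        for sent in unlabeled:
            tags, conf = predict_with_confidence(model, sent)
            (confident if conf >= threshold else rest).append((sent, tags))
        if not confident:
            break                          # nothing new to learn from
        labeled = labeled + confident      # enlarge the training set
        unlabeled = [s for s, _ in rest]
        model = train(labeled)             # retrain on the enlarged set
    return model
```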
ABSTRACT
Our goal is to summarise and aggregate information from social media regarding the symptoms of a disease, the drugs used, and the treatment effects, both positive and negative. To achieve this, we first apply a supervised machine learning method to automatically extract medical concepts from natural language text. In an environment such as social media, where new data is continuously streamed, we need a methodology that allows us to continuously train on the new data. To attain such incremental re-training, a semi-supervised methodology is developed that is capable of learning new concepts from a small set of labelled data together with a much larger set of unlabelled data. The semi-supervised methodology deploys a conditional random field (CRF) as the baseline training algorithm for extracting medical concepts. The methodology iteratively augments the training set with sentences having high confidence and adds terms to existing dictionaries to be used as features with the baseline model for further classification. Our empirical results show that the baseline CRF performs strongly across a range of different dictionary and training sizes; when the baseline is built with the full training data, the F1 score reaches the range 84%-90%. Moreover, we show that the semi-supervised method produces a mild but significant improvement over the baseline, being significantly more accurate in most cases than the underlying baseline model.
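One plausible way to score sentence confidence for such augmentation uses sklearn-crfsuite's per-token marginals (the paper's exact confidence measure is not specified in the abstract): a sentence is "confident" if every token's best-label marginal clears a threshold.

```python
import sklearn_crfsuite

def sentence_confidence(crf: sklearn_crfsuite.CRF, feats) -> float:
    """Minimum over tokens of the highest per-token label marginal."""
    marginals = crf.predict_marginals_single(feats)   # list of {label: prob}
    return min(max(dist.values()) for dist in marginals)

# Usage (assuming a fitted `crf` and a featurized sentence `feats`):
# if sentence_confidence(crf, feats) > 0.9: add the sentence to the training set
```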
Subjects
Social Media, Algorithms, Humans, Language, Supervised Machine Learning
ABSTRACT
We designed a location-context-semantics-based conditional random field (LCS-CRF) framework for the semantic classification of airborne laser scanning (ALS) point clouds. For ALS datasets of high spatial resolution but with severe noise, more context and semantics cues, besides location information, can be exploited to counteract the decreased discriminative power of features for classification. This paper focuses on the semantic classification of ALS data using mixed location-context-semantics cues, which are integrated into a higher-order CRF framework by modeling the probabilistic potentials. The location cues, modeled by the unary potentials, provide basic information for discriminating the various classes. The pairwise potentials consider spatial contextual information by establishing neighboring interactions between points to favor spatial smoothing. The semantics cues are explicitly encoded in the higher-order potentials, which operate at the level of clusters with similar geometric and radiometric properties, guaranteeing classification accuracy based on semantic rules. To demonstrate the performance of our approach, two standard benchmark datasets were utilized. Experiments show that our method achieves superior classification results, with an overall accuracy of 83.1% on the Vaihingen Dataset and 94.3% on the Graphics and Media Lab (GML) Dataset A, compared with other classification algorithms in the literature.
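Schematically (our notation, not necessarily the authors'), the higher-order CRF energy combines the three kinds of potentials described above:

```latex
\begin{equation}
E(\mathbf{x}) \;=\; \sum_{i} \psi_u(x_i)
\;+\; \sum_{(i,j)\in\mathcal{N}} \psi_p(x_i, x_j)
\;+\; \sum_{c\in\mathcal{C}} \psi_h(\mathbf{x}_c),
\end{equation}
```

where ψ_u encodes location cues, ψ_p spatial smoothing between neighboring points, and ψ_h semantic rules over clusters c of points with similar geometric and radiometric properties.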
ABSTRACT
BACKGROUND: Schistosomiasis and infection by soil-transmitted helminths are some of the world's most prevalent neglected tropical diseases. Infection by more than one parasite (co-infection) is common and can contribute to clinical morbidity in children. Geostatistical analyses of parasite infection data are key for developing mass drug administration strategies, yet most methods ignore co-infections when estimating risk. Infection status for multiple parasites can act as a useful proxy for data-poor individual-level or environmental risk factors while avoiding regression dilution bias. Conditional random fields (CRF) is a multivariate graphical network method that opens new doors in parasite risk mapping by (i) predicting co-infections with high accuracy; (ii) isolating associations among parasites; and (iii) quantifying how these associations change across landscapes. METHODS: We built a spatial CRF to estimate infection risks for Ascaris lumbricoides, Trichuris trichiura, hookworms (Ancylostoma duodenale and Necator americanus) and Schistosoma mansoni using data from a national survey of Rwandan schoolchildren. We used an ensemble learning approach to generate spatial predictions by simulating from the CRF's posterior distribution with a multivariate boosted regression tree that captured non-linear relationships between predictors and covariance in infection risks. This CRF ensemble was compared against single parasite gradient boosted machines to assess each model's performance and prediction uncertainty. RESULTS: Parasite co-infections were common, with 19.57% of children infected with at least two parasites. The CRF ensemble achieved higher predictive power than single-parasite models by improving estimates of co-infection prevalence at the individual level and classifying schools into World Health Organization treatment categories with greater accuracy. The CRF uncovered important environmental and demographic predictors of parasite infection probabilities. Yet even after capturing demographic and environmental risk factors, the presences or absences of other parasites were strong predictors of individual-level infection risk. Spatial predictions delineated high-risk regions in need of anthelminthic treatment interventions, including areas with higher than expected co-infection prevalence. CONCLUSIONS: Monitoring studies routinely screen for multiple parasites, yet statistical models generally ignore this multivariate data when assessing risk factors and designing treatment guidelines. Multivariate approaches can be instrumental in the global effort to reduce and eventually eliminate neglected helminth infections in developing countries.
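The multivariate idea, predicting several parasites jointly so that one species' status informs another's, can be loosely sketched with scikit-learn; the data below are synthetic, the predictor names are invented, and the spatial CRF layer the paper simulates from is omitted.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(1)
X = rng.random((400, 4))              # e.g. age, elevation, rainfall, wealth index
risk = X[:, 1] < 0.5                  # a shared environmental driver (synthetic)
Y = np.column_stack(
    [risk & (rng.random(400) < p) for p in (0.6, 0.5, 0.4)]
).astype(int)                         # columns: Ascaris, Trichuris, hookworm (toy)

model = MultiOutputClassifier(GradientBoostingClassifier()).fit(X, Y)
print(model.predict_proba(X[:2]))     # per-parasite infection probabilities
```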