Results 1 - 19 of 19
1.
Am J Public Health ; 107(12): 1910-1915, 2017 Dec.
Article in English | MEDLINE | ID: mdl-29048960

ABSTRACT

OBJECTIVES: To deploy a methodology accurately identifying tweets marketing the illegal online sale of controlled substances. METHODS: We first collected tweets from the Twitter public application program interface stream filtered for prescription opioid keywords. We then used unsupervised machine learning (specifically, topic modeling) to identify topics associated with illegal online marketing and sales. Finally, we conducted Web forensic analyses to characterize different types of online vendors. We analyzed 619 937 tweets containing the keywords codeine, Percocet, fentanyl, Vicodin, Oxycontin, oxycodone, and hydrocodone over a 5-month period from June to November 2015. RESULTS: A total of 1778 tweets (< 1%) were identified as marketing the sale of controlled substances online; 90% had embedded hyperlinks, but only 46 were "live" at the time of the evaluation. Seven distinct URLs linked to Web sites marketing or illegally selling controlled substances online. CONCLUSIONS: Our methodology can identify illegal online sale of prescription opioids from large volumes of tweets. Our results indicate that controlled substances are trafficked online via different strategies and vendors. Public Health Implications. Our methodology can be used to identify illegal online sellers in criminal violation of the Ryan Haight Online Pharmacy Consumer Protection Act.


Subject(s)
Analgesics, Opioid; Crime; Availability of Medications via Internet; Prescription Drug Misuse; Social Media/statistics & numerical data; Humans; Marketing; Public Health; Unsupervised Machine Learning
2.
Addict Behav ; 65: 289-295, 2017 Feb.
Article in English | MEDLINE | ID: mdl-27568339

ABSTRACT

INTRODUCTION: Nonmedical use of prescription medications/drugs (NMUPD) is a serious public health threat, particularly in relation to the prescription opioid analgesic abuse epidemic. While attention to this problem has been growing, there remains an urgent need to develop novel strategies in the field of "digital epidemiology" to better identify, analyze and understand trends in NMUPD behavior. METHODS: We conducted surveillance of the popular microblogging site Twitter by collecting 11 million tweets filtered for three commonly abused prescription opioid analgesic drugs: Percocet® (acetaminophen/oxycodone), OxyContin® (oxycodone), and oxycodone. Unsupervised machine learning was applied to the subset of tweets for each analgesic drug to discover underlying latent themes regarding risk behavior. A two-step process of obtaining themes and filtering out unwanted tweets was carried out in three subsequent rounds of machine learning. RESULTS: Using this methodology, 2.3 million tweets were identified that contained content relevant to analgesic NMUPD. The underlying themes were identified for each drug, and the most representative tweets of each theme were annotated for NMUPD behavioral risk factors. The primary themes identified indicate high levels of social media discussion about polydrug abuse on Twitter, including specific mention of various polydrug combinations involving other classes of prescription drugs and illicit drug abuse. CONCLUSIONS: This study presents a methodology to filter Twitter content for NMUPD behavior while also identifying underlying themes with minimal human intervention. Results from the study track accurately with the inclusion/exclusion criteria used to isolate NMUPD-related risk behaviors of interest and also provide insight into NMUPD behavior that has a high level of social media engagement. Results suggest that this could be a viable methodology for big data substance abuse surveillance, data collection, and analysis compared with other studies that rely upon content analysis and human coding schemes.


Subject(s)
Opioid-Related Disorders/epidemiology; Prescription Drug Misuse/statistics & numerical data; Social Media/statistics & numerical data; Unsupervised Machine Learning/statistics & numerical data; Humans; Risk Factors
3.
PLoS One ; 11(12): e0166694, 2016.
Article in English | MEDLINE | ID: mdl-27992437

ABSTRACT

On-line social networks publish information on a high volume of real-world events almost instantly, becoming a primary source for breaking news. Some of these real-world events can end up having a very strong impact on on-line social networks. The effect of such events can be analyzed from several perspectives, one of them being the intensity and characteristics of the collective activity they produce in the social platform. We study 5,234 real-world news events encompassing 43 million messages discussed on the Twitter microblogging service over approximately 1 year. We show empirically that exogenous news events naturally create collective patterns of bursty behavior in combination with long periods of inactivity in the network. This type of behavior agrees with patterns previously observed in other types of natural collective phenomena, as well as in individual human communications. In addition, we propose a methodology to classify news events according to the different levels of intensity in activity that they produce. In particular, we analyze the most highly active events and observe a consistent and strikingly different collective reaction from users when they are exposed to such events. This reaction is independent of an event's reach and scope. We further observe that extremely high-activity events have characteristics that are quite distinguishable at the beginning stages of their outbreak. This allows us to predict with high precision the top 8% of events that will have the most impact on the social network by using just the first 5% of the information of an event's lifetime evolution. This strongly implies that high-activity events are naturally prioritized collectively by the social network, engaging users early on, well before they are brought to the mainstream audience.
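The early-prediction claim (top 8% of events from the first 5% of an event's lifetime) can be illustrated with a toy ranking heuristic. The timestamps below are synthetic, and the rank-by-early-volume rule is an assumption standing in for the paper's actual predictor.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-event message timestamps (hours since event start)
events = [np.sort(rng.uniform(0, 24, size=n))
          for n in rng.integers(50, 5000, size=200)]

def early_volume(ts, frac=0.05):
    """Messages arriving in the first `frac` of the event's observed lifetime."""
    cutoff = ts[-1] * frac
    return int(np.searchsorted(ts, cutoff))

total = np.array([len(ts) for ts in events])
early = np.array([early_volume(ts) for ts in events])

# Predict the top 8% most impactful events from early volume alone
k = max(1, int(0.08 * len(events)))
pred_top = set(np.argsort(early)[-k:])
true_top = set(np.argsort(total)[-k:])
precision = len(pred_top & true_top) / k
```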


Asunto(s)
Medios de Comunicación Sociales/estadística & datos numéricos , Medios de Comunicación , Humanos , Red Social
4.
Med Sci Sports Exerc ; 48(5): 951-7, 2016 May.
Article in English | MEDLINE | ID: mdl-27089222

ABSTRACT

PURPOSE: Walking for health is recommended by health agencies, partly based on epidemiological studies of self-reported behaviors. Accelerometers are now replacing survey data, but it is not clear that intensity-based cut points reflect the behaviors previously reported. New computational techniques can help classify raw accelerometer data into behaviors meaningful for public health. METHODS: Five hundred twenty days of triaxial 30-Hz accelerometer data from three studies (n = 78) were employed as training data. Study 1 included prescribed activities completed in natural settings. The other two studies included multiple days of free-living data with SenseCam-annotated ground truth. The two populations in the free-living data sets were demographically and physically different. Random forest classifiers were trained on each data set, and classification accuracy was assessed both on each classifier's own training data set and when applied to the other available data sets. Accelerometer cut points were also compared with the ground truth from the three data sets. RESULTS: The random forest classified all behaviors with over 80% accuracy. Classifiers developed on the prescribed data performed with higher accuracy than the free-living data classifier, but did not perform as well on the free-living data sets. Many of the observed behaviors occurred at different intensities compared with those identified by existing cut points. CONCLUSIONS: New machine learning classifiers developed from prescribed activities (study 1) were considerably less accurate when applied to free-living populations or to a functionally different population (studies 2 and 3). These classifiers, developed on free-living data, may have value when applied to large cohort studies with existing hip accelerometer data.
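A random forest behavior classifier of the kind described here can be sketched in a few lines with scikit-learn; the per-window features and labels below are synthetic stand-ins for real accelerometer data, and the feature set is invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
# Hypothetical per-window features (e.g. mean/sd of triaxial acceleration)
n = 600
X = rng.normal(size=(n, 6))
y = rng.integers(0, 4, size=n)  # 4 behavior classes, e.g. sit/stand/walk/vehicle
X[:, 0] += y  # make one feature weakly informative of the class

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # cross-validated accuracy per fold
```

With real accelerometer features the same pattern applies; only the feature extraction step changes.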


Subject(s)
Accelerometry/methods; Algorithms; Exercise; Machine Learning; Monitoring, Ambulatory/methods; Activities of Daily Living/classification; Adult; Aged; Bicycling; Female; Humans; Male; Middle Aged; Obesity; Overweight; Public Health
5.
Med Sci Sports Exerc ; 48(5): 933-40, 2016 May.
Article in English | MEDLINE | ID: mdl-26673126

ABSTRACT

PURPOSE: Accelerometers are a valuable tool for objective measurement of physical activity (PA). Wrist-worn devices may improve compliance over standard hip placement, but more research is needed to evaluate their validity for measuring PA in free-living settings. Traditional cut-point methods for accelerometers can be inaccurate and need testing in free living with wrist-worn devices. In this study, we developed and tested the performance of machine learning (ML) algorithms for classifying PA types from both hip and wrist accelerometer data. METHODS: Forty overweight or obese women (mean age = 55.2 ± 15.3 yr; BMI = 32.0 ± 3.7) wore two ActiGraph GT3X+ accelerometers (right hip, nondominant wrist; ActiGraph, Pensacola, FL) for seven free-living days. Wearable cameras captured ground truth activity labels. A classifier consisting of a random forest and hidden Markov model classified the accelerometer data into four activities (sitting, standing, walking/running, and riding in a vehicle). Free-living wrist and hip ML classifiers were compared with each other, with traditional accelerometer cut points, and with an algorithm developed in a laboratory setting. RESULTS: The ML classifier obtained average values of 89.4% and 84.6% balanced accuracy over the four activities using the hip and wrist accelerometer, respectively. In our data set with average values of 28.4 min of walking or running per day, the ML classifier predicted average values of 28.5 and 24.5 min of walking or running using the hip and wrist accelerometer, respectively. Intensity-based cut points and the laboratory algorithm significantly underestimated walking minutes. CONCLUSIONS: Our results demonstrate the superior performance of our PA-type classification algorithm, particularly in comparison with traditional cut points. Although the hip algorithm performed better, additional compliance achieved with wrist devices might justify using a slightly lower performing algorithm.
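The random forest plus hidden Markov model design can be illustrated by its smoothing stage alone: a Viterbi pass over per-minute class probabilities with a hand-set "sticky" transition matrix (an assumption; the study learns its transition patterns from data). The probabilities below are invented and show a one-minute misclassification being smoothed away.

```python
import numpy as np

def viterbi_smooth(prob, self_stay=0.95):
    """Smooth per-minute class probabilities with a sticky transition prior."""
    T, K = prob.shape
    A = np.full((K, K), (1 - self_stay) / (K - 1))
    np.fill_diagonal(A, self_stay)
    logA = np.log(A)
    logp = np.log(np.clip(prob, 1e-12, None))
    delta = logp[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA      # scores[i, j]: from state i to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logp[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(delta.argmax())
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# Noisy classifier output flickering between "sitting" (0) and "walking" (1):
# one weak walking minute inside a sitting bout, then a real walking bout
prob = np.array([[0.9, 0.1]] * 5 + [[0.4, 0.6]] + [[0.9, 0.1]] * 5
                + [[0.1, 0.9]] * 6)
path = viterbi_smooth(prob)
```

The single ambiguous minute is absorbed into the sitting bout, while the sustained walking bout at the end survives.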


Subject(s)
Accelerometry/instrumentation; Algorithms; Exercise; Activities of Daily Living/classification; Adult; Aged; Female; Hip; Humans; Machine Learning; Markov Chains; Middle Aged; Monitoring, Ambulatory/methods; Obesity; Overweight; Sensitivity and Specificity; Wrist
6.
IEEE Trans Pattern Anal Mach Intell ; 37(4): 697-712, 2015 Apr.
Article in English | MEDLINE | ID: mdl-26353288

ABSTRACT

The bag-of-systems (BoS) representation is a descriptor of motion in a video, where dynamic texture (DT) codewords represent the typical motion patterns in spatio-temporal patches extracted from the video. The efficacy of the BoS descriptor depends on the richness of the codebook, which depends on the number of codewords in the codebook. However, for even modest sized codebooks, mapping videos onto the codebook results in a heavy computational load. In this paper we propose the BoS Tree, which constructs a bottom-up hierarchy of codewords that enables efficient mapping of videos to the BoS codebook. By leveraging the tree structure to efficiently index the codewords, the BoS Tree allows for fast look-ups in the codebook and enables the practical use of larger, richer codebooks. We demonstrate the effectiveness of BoS Trees on classification of four video datasets, as well as on annotation of a video dataset and a music dataset. Finally, we show that, although the fast look-ups of BoS Tree result in different descriptors than BoS for the same video, the overall distance (and kernel) matrices are highly correlated resulting in similar classification performance.
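The fast-lookup idea behind the BoS Tree can be sketched with a two-level k-means tree over plain Euclidean descriptors. This is a deliberate simplification: the paper's codewords are dynamic-texture models compared via probabilistic similarity, not centroids, but the bottom-up hierarchy and the reduced number of comparisons per lookup are the same idea.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(1000, 16))  # hypothetical patch descriptors

# Two-level codeword tree: coarse clusters, each refined into fine codewords
coarse = KMeans(n_clusters=4, n_init=10, random_state=0).fit(descriptors)
fine = {
    c: KMeans(n_clusters=8, n_init=10, random_state=0).fit(
        descriptors[coarse.labels_ == c]
    )
    for c in range(4)
}

def lookup(x):
    """Map a descriptor to a codeword with 4 + 8 distance checks
    instead of a flat scan over all 32 codewords."""
    c = int(coarse.predict(x[None])[0])
    w = int(fine[c].predict(x[None])[0])
    return c, w

branch, word = lookup(descriptors[0])
```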

7.
PLoS One ; 9(12): e114046, 2014.
Article in English | MEDLINE | ID: mdl-25549335

ABSTRACT

Massively parallel collaboration and emergent knowledge generation are described through a large-scale survey for archaeological anomalies within ultra-high-resolution earth-sensing satellite imagery. Over 10K online volunteers contributed 30K hours (3.4 years), examined 6,000 km², and generated 2.3 million feature categorizations. Motivated by the search for Genghis Khan's tomb, participants were tasked with finding an archaeological enigma that lacks any historical description of its potential visual appearance. Without a pre-existing reference for validation, we turn toward consensus, defined by kernel density estimation, to pool human perception of "out of the ordinary" features across a vast landscape. This consensus served as the training mechanism within a self-evolving feedback loop between a participant and the crowd, essentially driving a collective reasoning engine for anomaly detection. The resulting map led a National Geographic expedition to confirm 55 archaeological sites across a vast landscape. Increased ground-truthed accuracy was observed in participants exposed to the peer feedback loop compared with those who worked in isolation, suggesting that collective reasoning can emerge within networked groups to outperform the aggregate independent ability of individuals to define the unknown.
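The kernel-density-estimation consensus can be sketched as follows, with invented tag coordinates: volunteer tags falling in high-density regions are kept as consensus anomalies, while isolated tags are discarded. The quantile threshold is an arbitrary assumption for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Hypothetical volunteer tags (x, y map coordinates): a tight cluster of
# agreement near (5, 5), plus scattered isolated tags
cluster = rng.normal(loc=5.0, scale=0.2, size=(60, 2))
scatter = rng.uniform(0, 10, size=(40, 2))
tags = np.vstack([cluster, scatter])

kde = gaussian_kde(tags.T)       # pooled "perception" density over the map
density = kde(tags.T)            # density evaluated at each tag

# Tags whose local density clears a threshold form the consensus anomalies
threshold = np.quantile(density, 0.75)
consensus = tags[density >= threshold]
```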


Subject(s)
Archaeology/methods; Crowdsourcing; Satellite Imagery; Female; Humans; Male
8.
Physiol Meas ; 35(11): 2191-203, 2014 Nov.
Article in English | MEDLINE | ID: mdl-25340969

ABSTRACT

Wrist accelerometers are being used in population level surveillance of physical activity (PA) but more research is needed to evaluate their validity for correctly classifying types of PA behavior and predicting energy expenditure (EE). In this study we compare accelerometers worn on the wrist and hip, and the added value of heart rate (HR) data, for predicting PA type and EE using machine learning. Forty adults performed locomotion and household activities in a lab setting while wearing three ActiGraph GT3X+ accelerometers (left hip, right hip, non-dominant wrist) and a HR monitor (Polar RS400). Participants also wore a portable indirect calorimeter (COSMED K4b2), from which EE and metabolic equivalents (METs) were computed for each minute. We developed two predictive models: a random forest classifier to predict activity type and a random forest of regression trees to estimate METs. Predictions were evaluated using leave-one-user-out cross-validation. The hip accelerometer obtained an average accuracy of 92.3% in predicting four activity types (household, stairs, walking, running), while the wrist accelerometer obtained an average accuracy of 87.5%. Across all 8 activities combined (laundry, window washing, dusting, dishes, sweeping, stairs, walking, running), the hip and wrist accelerometers obtained average accuracies of 70.2% and 80.2% respectively. Predicting METs using the hip or wrist devices alone obtained root mean square errors (rMSE) of 1.09 and 1.00 METs per 6 min bout, respectively. Including HR data improved MET estimation, but did not significantly improve activity type classification. These results demonstrate the validity of random forest classification and regression forests for PA type and MET prediction using accelerometers. The wrist accelerometer proved more useful in predicting activities with significant arm movement, while the hip accelerometer was superior for predicting locomotion and estimating EE.
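The regression side of this design (a forest of regression trees estimating METs, evaluated leave-one-user-out) might look like the sketch below; the features, subject grouping, and MET ground truth are all synthetic stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict

rng = np.random.default_rng(0)
# Hypothetical minute-level features from a hip accelerometer (+ heart rate)
n = 400
subjects = np.repeat(np.arange(10), 40)  # 10 participants, 40 minutes each
X = rng.normal(size=(n, 5))
# Synthetic METs driven by one feature, plus noise
mets = 1.5 + 2.0 * np.abs(X[:, 0]) + 0.3 * rng.normal(size=n)

reg = RandomForestRegressor(n_estimators=100, random_state=0)
# Leave-one-user-out cross-validation, mirroring the study design
pred = cross_val_predict(reg, X, mets, groups=subjects, cv=LeaveOneGroupOut())
rmse = float(np.sqrt(np.mean((pred - mets) ** 2)))
```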


Subject(s)
Accelerometry/instrumentation; Artificial Intelligence; Energy Metabolism; Hip; Monitoring, Ambulatory/instrumentation; Motor Activity; Wrist; Adult; Calorimetry, Indirect; Female; Heart Rate; Humans; Male; Motor Activity/physiology
9.
Front Public Health ; 2: 36, 2014.
Article in English | MEDLINE | ID: mdl-24795875

ABSTRACT

BACKGROUND: Active travel is an important area in physical activity research, but objective measurement of active travel is still difficult. Automated methods to measure travel behaviors will improve research in this area. In this paper, we present a supervised machine learning method for transportation mode prediction from global positioning system (GPS) and accelerometer data. METHODS: We collected a dataset of about 150 h of GPS and accelerometer data from two research assistants following a protocol of prescribed trips consisting of five activities: bicycling, riding in a vehicle, walking, sitting, and standing. We extracted 49 features from 1-min windows of this data. We compared the performance of several machine learning algorithms and chose a random forest algorithm to classify the transportation mode. We used a moving average output filter to smooth the output predictions over time. RESULTS: The random forest algorithm achieved 89.8% cross-validated accuracy on this dataset. Adding the moving average filter to smooth output predictions increased the cross-validated accuracy to 91.9%. CONCLUSION: Machine learning methods are a viable approach for automating measurement of active travel, particularly for measuring travel activities that traditional accelerometer data processing methods misclassify, such as bicycling and vehicle travel.
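The moving-average output filter is the simplest part of this pipeline to show in code. The sketch below applies it to invented per-minute class probabilities, where one spurious minute inside a steady bout is smoothed away; the study's 49 features and trained forest are omitted.

```python
import numpy as np

def smooth_predictions(proba, window=3):
    """Moving-average filter over per-window class probabilities,
    then argmax to get the smoothed class prediction."""
    kernel = np.ones(window) / window
    # Edge-pad so the output has one smoothed label per input window
    padded = np.pad(proba, ((window // 2, window // 2), (0, 0)), mode="edge")
    smoothed = np.stack(
        [np.convolve(padded[:, k], kernel, mode="valid")
         for k in range(proba.shape[1])],
        axis=1,
    )
    return smoothed.argmax(axis=1)

# One spurious "vehicle" minute (class 1) inside a walking bout (class 0)
proba = np.array([[0.8, 0.2]] * 4 + [[0.3, 0.7]] + [[0.8, 0.2]] * 4)
labels = smooth_predictions(proba)
```

The lone misclassified minute is outvoted by its neighbors, matching the accuracy gain the abstract reports for the filtered output.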

10.
IEEE Trans Pattern Anal Mach Intell ; 36(3): 521-35, 2014 Mar.
Article in English | MEDLINE | ID: mdl-24457508

ABSTRACT

The problem of cross-modal retrieval from multimedia repositories is considered. This problem addresses the design of retrieval systems that support queries across content modalities, for example, using an image to search for texts. A mathematical formulation is proposed, equating the design of cross-modal retrieval systems to that of isomorphic feature spaces for different content modalities. Two hypotheses are then investigated regarding the fundamental attributes of these spaces. The first is that low-level cross-modal correlations should be accounted for. The second is that the space should enable semantic abstraction. Three new solutions to the cross-modal retrieval problem are then derived from these hypotheses: correlation matching (CM), an unsupervised method which models cross-modal correlations, semantic matching (SM), a supervised technique that relies on semantic representation, and semantic correlation matching (SCM), which combines both. An extensive evaluation of retrieval performance is conducted to test the validity of the hypotheses. All approaches are shown successful for text retrieval in response to image queries and vice versa. It is concluded that both hypotheses hold, in a complementary form, although evidence in favor of the abstraction hypothesis is stronger than that for correlation.

11.
Article in English | MEDLINE | ID: mdl-26247061

ABSTRACT

Physical activity monitoring in free-living populations has many applications for public health research, weight-loss interventions, context-aware recommendation systems and assistive technologies. We present a system for physical activity recognition that is learned from a free-living dataset of 40 women who wore multiple sensors for seven days. The multi-level classification system first learns low-level codebook representations for each sensor and uses a random forest classifier to produce minute-level probabilities for each activity class. Then a higher-level HMM layer learns patterns of transitions and durations of activities over time to smooth the minute-level predictions.

12.
IEEE Trans Pattern Anal Mach Intell ; 35(7): 1606-21, 2013 Jul.
Article in English | MEDLINE | ID: mdl-23681990

ABSTRACT

Dynamic texture (DT) is a probabilistic generative model, defined over space and time, that represents a video as the output of a linear dynamical system (LDS). The DT model has been applied to a wide variety of computer vision problems, such as motion segmentation, motion classification, and video registration. In this paper, we derive a new algorithm for clustering DT models that is based on the hierarchical EM algorithm. The proposed clustering algorithm is capable of both clustering DTs and learning novel DT cluster centers that are representative of the cluster members in a manner that is consistent with the underlying generative probabilistic model of the DT. We also derive an efficient recursive algorithm for sensitivity analysis of the discrete-time Kalman smoothing filter, which is used as the basis for computing expectations in the E-step of the HEM algorithm. Finally, we demonstrate the efficacy of the clustering algorithm on several applications in motion analysis, including hierarchical motion clustering, semantic motion annotation, and learning bag-of-systems (BoS) codebooks for dynamic texture recognition.

13.
Proc Natl Acad Sci U S A ; 109(17): 6411-6, 2012 Apr 24.
Article in English | MEDLINE | ID: mdl-22460786

ABSTRACT

Searching for relevant content in a massive amount of multimedia information is facilitated by accurately annotating each image, video, or song with a large number of relevant semantic keywords, or tags. We introduce game-powered machine learning, an integrated approach to annotating multimedia content that combines the effectiveness of human computation, through online games, with the scalability of machine learning. We investigate this framework for labeling music. First, a socially oriented music annotation game called Herd It collects reliable music annotations based on the "wisdom of the crowds." Second, these annotated examples are used to train a supervised machine learning system. Third, the machine learning system actively directs the annotation games to collect new data that will most benefit future model iterations. Once trained, the system can automatically annotate a corpus of music much larger than what could be labeled using human computation alone. Automatically annotated songs can be retrieved based on their semantic relevance to text-based queries (e.g., "funky jazz with saxophone," "spooky electronica," etc.). Based on the results presented in this paper, we find that actively coupling annotation games with machine learning provides a reliable and scalable approach to making massive amounts of multimedia data searchable.
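The active-learning loop described here (train, find the examples the model is least certain about, direct the game to annotate those) can be sketched with uncertainty sampling; the song features, labels, and round counts below are invented, and a plain logistic model stands in for the study's classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical audio features; binary tag (say, "has saxophone") to learn
X = rng.normal(size=(500, 10))
w = rng.normal(size=10)
y_all = (X @ w > 0).astype(int)

labeled = list(range(20))                    # songs already tagged by players
pool = [i for i in range(500) if i not in labeled]

for _ in range(5):                           # five rounds of active collection
    clf = LogisticRegression().fit(X[labeled], y_all[labeled])
    p = clf.predict_proba(X[pool])[:, 1]
    # direct the game toward the song the model is least certain about
    uncertain = int(np.argmin(np.abs(p - 0.5)))
    labeled.append(pool.pop(uncertain))

final_acc = LogisticRegression().fit(X[labeled], y_all[labeled]).score(X, y_all)
```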

14.
Neural Comput ; 24(6): 1391-407, 2012 Jun.
Article in English | MEDLINE | ID: mdl-22364501

ABSTRACT

The concave-convex procedure (CCCP) is an iterative algorithm that solves d.c. (difference of convex functions) programs as a sequence of convex programs. In machine learning, CCCP is extensively used in many learning algorithms, including sparse support vector machines (SVMs), transductive SVMs, and sparse principal component analysis. Though CCCP is widely used in many applications, its convergence behavior has received relatively little specific attention. Yuille and Rangarajan analyzed its convergence in their original paper; however, we believe the analysis is not complete. The convergence of CCCP can be derived from the convergence of the d.c. algorithm (DCA), proposed in the global optimization literature to solve general d.c. programs, whose proof relies on d.c. duality. In this note, we follow a different reasoning and show how Zangwill's global convergence theory of iterative algorithms provides a natural framework to prove the convergence of CCCP. This underlines Zangwill's theory as a powerful and general framework to deal with the convergence issues of iterative algorithms, after also being used to prove the convergence of algorithms like expectation-maximization and generalized alternating minimization. In this note, we provide a rigorous analysis of the convergence of CCCP by addressing two questions: when does CCCP find a local minimum or a stationary point of the d.c. program under consideration, and when does the sequence generated by CCCP converge? We also present an open problem on the issue of local convergence of CCCP.
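A worked instance of CCCP makes the convergence question concrete. Take f(x) = x^4 - x^2, a d.c. function with convex parts u(x) = x^4 and v(x) = x^2. Each CCCP step linearizes v at the current iterate and minimizes u(x) - v'(x_t) x, which here means solving 4x^3 = 2x_t. The starting point is an arbitrary assumption.

```python
import numpy as np

# CCCP for f(x) = x**4 - x**2, decomposed as u(x) - v(x) with
# u(x) = x**4 and v(x) = x**2 (both convex).
# Step: x_{t+1} = argmin_x u(x) - v'(x_t) * x, i.e. solve 4 x**3 = 2 x_t.

x = 2.0  # arbitrary starting point
for _ in range(50):
    x = np.cbrt(x / 2.0)

# The iterates converge to x* = 1/sqrt(2), a stationary point of f
# (f'(x) = 4x**3 - 2x vanishes there), illustrating the convergence
# behavior the note analyzes.
```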

15.
IEEE Trans Image Process ; 20(2): 570-85, 2011 Feb.
Article in English | MEDLINE | ID: mdl-20736139

ABSTRACT

Recently, many object localization models have shown that incorporating contextual cues can greatly improve accuracy over using appearance features alone. Many of these models have therefore explored different types of contextual sources, but consider only one level of contextual interaction at a time. Thus, how much context can truly contribute to object localization when cues from all levels are integrated simultaneously remains an open question. Moreover, the relative importance of the different contextual levels and appearance features across different object classes remains to be explored. Here we introduce a novel framework for multiple-class object localization that incorporates different levels of contextual interactions. We study contextual interactions at the pixel, region and object level based upon three different sources of context: semantic, boundary support, and contextual neighborhoods. Our framework learns a single similarity metric from multiple kernels, combining pixel and region interactions with appearance features, and then applies a conditional random field to incorporate object-level interactions. To effectively integrate different types of feature descriptions, we extend the large margin nearest neighbor method to a novel algorithm that supports multiple kernels. We perform experiments on three challenging image databases: Graz-02, MSRC and PASCAL VOC 2007. Experimental results show that our model outperforms current state-of-the-art contextual frameworks and reveals individual contributions for each contextual interaction level as well as appearance features, indicating their relative importance for object localization.

16.
Genome Biol ; 9 Suppl 1: S2, 2008.
Article in English | MEDLINE | ID: mdl-18613946

ABSTRACT

BACKGROUND: Several years after sequencing the human genome and the mouse genome, much remains to be discovered about the functions of most human and mouse genes. Computational prediction of gene function promises to help focus limited experimental resources on the most likely hypotheses. Several algorithms using diverse genomic data have been applied to this task in model organisms; however, the performance of such approaches in mammals has not yet been evaluated. RESULTS: In this study, a standardized collection of mouse functional genomic data was assembled; nine bioinformatics teams used this data set to independently train classifiers and generate predictions of function, as defined by Gene Ontology (GO) terms, for 21,603 mouse genes; and the best performing submissions were combined in a single set of predictions. We identified strengths and weaknesses of current functional genomic data sets and compared the performance of function prediction algorithms. This analysis inferred functions for 76% of mouse genes, including 5,000 currently uncharacterized genes. At a recall rate of 20%, a unified set of predictions averaged 41% precision, with 26% of GO terms achieving a precision better than 90%. CONCLUSION: We performed a systematic evaluation of diverse, independently developed computational approaches for predicting gene function from heterogeneous data sources in mammals. The results show that currently available data for mammals allows predictions with both breadth and accuracy. Importantly, many highly novel predictions emerge for the 38% of mouse genes that remain uncharacterized.


Subject(s)
Algorithms; Mice/genetics; Proteins/genetics; Proteins/metabolism; Animals; Mice/metabolism
17.
Genome Biol ; 9 Suppl 1: S6, 2008.
Article in English | MEDLINE | ID: mdl-18613950

ABSTRACT

In predicting hierarchical protein function annotations, such as terms in the Gene Ontology (GO), the simplest approach makes predictions for each term independently. However, this approach has the unfortunate consequence that the predictor may assign to a single protein a set of terms that are inconsistent with one another; for example, the predictor may assign a specific GO term to a given protein ('purine nucleotide binding') but not assign the parent term ('nucleotide binding'). Such predictions are difficult to interpret. In this work, we focus on methods for calibrating and combining independent predictions to obtain a set of probabilistic predictions that are consistent with the topology of the ontology. We call this procedure 'reconciliation'. We begin with a baseline method for predicting GO terms from a collection of data types using an ensemble of discriminative classifiers. We apply the method to a previously described benchmark data set, and we demonstrate that the resulting predictions are frequently inconsistent with the topology of the GO. We then consider 11 distinct reconciliation methods: three heuristic methods; four variants of a Bayesian network; an extension of logistic regression to the structured case; and three novel projection methods - isotonic regression and two variants of a Kullback-Leibler projection method. We evaluate each method in three different modes - per term, per protein and joint - corresponding to three types of prediction tasks. Although the principal goal of reconciliation is interpretability, it is important to assess whether interpretability comes at a cost in terms of precision and recall. Indeed, we find that many apparently reasonable reconciliation methods yield reconciled probabilities with significantly lower precision than the original, unreconciled estimates. On the other hand, we find that isotonic regression usually performs better than the underlying, unreconciled method, and almost never performs worse; isotonic regression appears to be able to use the constraints from the GO network to its advantage. An exception to this rule is the high precision regime for joint evaluation, where Kullback-Leibler projection yields the best performance.
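Reconciliation along a single root-to-leaf chain reduces to one-dimensional isotonic regression, sketched below with invented probabilities. The paper's methods operate on full GO DAGs, which this sketch does not handle; it only shows the core constraint that a child term can never be more probable than its parent.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical raw classifier probabilities along one root-to-leaf GO path,
# e.g. binding -> nucleotide binding -> purine nucleotide binding.
raw = np.array([0.60, 0.45, 0.70])  # child exceeds its parent: inconsistent

# Reconcile: probabilities must be non-increasing from root to leaf,
# staying as close as possible (least squares) to the raw estimates
iso = IsotonicRegression(increasing=False)
depth = np.arange(len(raw))
reconciled = iso.fit_transform(depth, raw)
```

The violating pair (0.45, 0.70) is pooled to its average, 0.575, leaving a consistent, interpretable chain.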


Subject(s)
Algorithms; Proteins/genetics; Proteins/metabolism; Animals; Bayes Theorem; Computational Biology; Humans; Logistic Models; Mice; Terminology as Topic
18.
Genome Res ; 15(5): 724-36, 2005 May.
Article in English | MEDLINE | ID: mdl-15867433

ABSTRACT

A large gene expression database has been produced that characterizes the gene expression and physiological effects of hundreds of approved and withdrawn drugs, toxicants, and biochemical standards in various organs of live rats. In order to derive useful biological knowledge from this large database, a variety of supervised classification algorithms were compared using a 597-microarray subset of the data. Our studies show that several types of linear classifiers based on Support Vector Machines (SVMs) and Logistic Regression can be used to derive readily interpretable drug signatures with high classification performance. Both methods can be tuned to produce classifiers of drug treatments in the form of short, weighted gene lists which upon analysis reveal that some of the signature genes have a positive contribution (act as "rewards" for the class-of-interest) while others have a negative contribution (act as "penalties") to the classification decision. The combination of reward and penalty genes enhances performance by keeping the number of false positive treatments low. The results of these algorithms are combined with feature selection techniques that further reduce the length of the drug signatures, an important step towards the development of useful diagnostic biomarkers and low-cost assays. Multiple signatures with no genes in common can be generated for the same classification end-point. Comparison of these gene lists identifies biological processes characteristic of a given class.
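The short weighted gene lists described here are what L1-regularized logistic regression produces directly. The sketch below uses a synthetic expression matrix with five truly informative genes carrying a mix of positive ("reward") and negative ("penalty") weights; the matrix, sample counts, and regularization strength are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical expression matrix: 100 treatments x 500 genes,
# with 5 genes truly associated with the class of interest
n, p = 100, 500
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:5] = [2, -2, 2, -2, 2]   # "reward" and "penalty" genes
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(int)

# Strong L1 penalty drives most coefficients to exactly zero,
# leaving a short, weighted gene signature
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

signature = np.flatnonzero(clf.coef_[0])  # indices of the signature genes
```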


Subject(s)
Algorithms; Classification/methods; Gene Expression Regulation; Oligonucleotide Array Sequence Analysis/methods; Oligonucleotide Array Sequence Analysis/standards; Pharmaceutical Preparations/metabolism; RNA, Messenger/isolation & purification; Animals; Bone Marrow/metabolism; Dose-Response Relationship, Drug; Kidney/metabolism; Liver/metabolism; Logistic Models; Male; Myocardium/metabolism; Principal Component Analysis; Rats; Rats, Sprague-Dawley; Reproducibility of Results
19.
Bioinformatics ; 20(16): 2626-35, 2004 Nov 01.
Article in English | MEDLINE | ID: mdl-15130933

ABSTRACT

MOTIVATION: During the past decade, the new focus on genomics has highlighted a particular challenge: to integrate the different views of the genome that are provided by various types of experimental data. RESULTS: This paper describes a computational framework for integrating and drawing inferences from a collection of genome-wide measurements. Each dataset is represented via a kernel function, which defines generalized similarity relationships between pairs of entities, such as genes or proteins. The kernel representation is both flexible and efficient, and can be applied to many different types of data. Furthermore, kernel functions derived from different types of data can be combined in a straightforward fashion. Recent advances in the theory of kernel methods have provided efficient algorithms to perform such combinations in a way that minimizes a statistical loss function. These methods exploit semidefinite programming techniques to reduce the problem of finding optimizing kernel combinations to a convex optimization problem. Computational experiments performed using yeast genome-wide datasets, including amino acid sequences, hydropathy profiles, gene expression data and known protein-protein interactions, demonstrate the utility of this approach. A statistical learning algorithm trained from all of these data to recognize particular classes of proteins--membrane proteins and ribosomal proteins--performs significantly better than the same algorithm trained on any single type of data. AVAILABILITY: Supplementary data at http://noble.gs.washington.edu/proj/sdp-svm
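The kernel-combination idea can be sketched with a fixed convex combination of per-view kernel matrices fed to a precomputed-kernel SVM. Learning the combination weights by semidefinite programming, as the paper does, is omitted here: the weights below are hand-set assumptions, and the two "views" are synthetic stand-ins for sequence- and expression-derived data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, linear_kernel

rng = np.random.default_rng(0)
# Two hypothetical "views" of the same proteins
n = 200
y = rng.integers(0, 2, size=n)
view1 = rng.normal(size=(n, 5)) + y[:, None]        # informative view
view2 = rng.normal(size=(n, 3)) + 0.5 * y[:, None]  # weakly informative view

# Fixed convex combination of per-view kernels (weights hand-set here;
# the paper learns them via semidefinite programming)
K = 0.7 * rbf_kernel(view1) + 0.3 * linear_kernel(view2)

clf = SVC(kernel="precomputed").fit(K, y)
acc = clf.score(K, y)  # training accuracy on the combined kernel
```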


Subject(s)
Algorithms; Chromosome Mapping/methods; Databases, Protein; Gene Expression Profiling/methods; Models, Genetic; Proteins/genetics; Sequence Analysis, Protein/methods; Artificial Intelligence; Databases, Genetic; Fungal Proteins/chemistry; Fungal Proteins/genetics; Genomics/methods; Information Storage and Retrieval/methods; Membrane Proteins/genetics; Membrane Proteins/metabolism; Models, Statistical; Pattern Recognition, Automated; Proteins/analysis; Proteins/chemistry; Proteins/classification; Ribosomal Proteins/chemistry; Ribosomal Proteins/genetics; Sequence Alignment; Sequence Homology, Amino Acid; Systems Integration