1.
Big Data ; 7(4): 286-307, 2019 Dec.
Article in English | MEDLINE | ID: mdl-31860341

ABSTRACT

The outstanding performance of deep learning (DL) for computer vision and natural language processing has fueled increased interest in applying these algorithms more broadly in both research and practice. This study investigates the application of DL techniques to classification of large sparse behavioral data, which has become ubiquitous in the age of big data collection. We report on an extensive search through DL architecture variants and compare the predictive performance of DL with that of carefully regularized logistic regression (LR), which previously (and repeatedly) has been found to be the most accurate machine learning technique generally for sparse behavioral data. At a high level, we demonstrate that by following recommendations from the literature, researchers and practitioners who are not DL experts can achieve world-class performance using DL. More specifically, we report several findings. As a main result, applying DL on 39 big sparse behavioral classification tasks demonstrates a significant performance improvement compared with LR. A follow-up result suggests that if one were to choose the best shallow technique (rather than just LR), there still would often be an improvement from using DL, but that in this case the magnitude of the improvement might not justify the high cost. Investigating when DL performs better, we find that worse performance is obtained for data sets with low signal-from-noise separability, in line with prior results comparing linear and nonlinear classifiers. Exploring why the deep architectures work well, we show that using the first-layer features learned by DL yields better generalization performance for a linear model than do unsupervised feature-reduction methods (e.g., singular-value decomposition). However, to do well enough to beat well-regularized LR with the original sparse representation, more layers from the deep distributed architecture are needed. With respect to interpreting how deep models come to their decisions, we demonstrate how the neurons on the lowest layer of the deep architecture capture nuances from the raw fine-grained features and allow intuitive interpretation. Looking forward, we propose the use of instance-level counterfactual explanations to gain insight into why deep models classify individual data instances the way they do.


Subjects
Big Data; Deep Learning; Humans; Natural Language Processing; Neural Networks, Computer
2.
Big Data ; 6(3): 191-213, 2018 Sep 1.
Article in English | MEDLINE | ID: mdl-30283728

ABSTRACT

We develop a number of data-driven investment strategies that demonstrate how machine learning and data analytics can be used to guide investments in peer-to-peer loans. We detail the process starting with the acquisition of (real) data from a peer-to-peer lending platform all the way to the development and evaluation of investment strategies based on a variety of approaches. We focus heavily on how to apply and evaluate the data science methods, and resulting strategies, in a real-world business setting. The material presented in this article can be used by instructors who teach data science courses, at the undergraduate or graduate levels. Importantly, we go beyond just evaluating predictive performance of models, to assess how well the strategies would actually perform, using real, publicly available data. Our treatment is comprehensive and ranges from qualitative to technical, but is also modular, which gives instructors the flexibility to focus on specific parts of the case, depending on the topics they want to cover. The learning concepts include the following: data cleaning and ingestion, classification/probability estimation modeling, regression modeling, analytical engineering, calibration curves, data leakage, evaluation of model performance, basic portfolio optimization, evaluation of investment strategies, and using Python for data science.
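
Of the learning concepts listed in this abstract, the calibration curve is easy to illustrate in a few lines. The sketch below is not taken from the article; it assumes equal-width probability bins, and the function name `calibration_curve` and its parameters are hypothetical. It compares the mean predicted probability in each bin with the observed rate of positives (for example, loan defaults).

```python
def calibration_curve(y_true, y_prob, n_bins=10):
    """Return (mean predicted probability, observed positive rate, count)
    for every non-empty equal-width bin over [0, 1].

    Hypothetical helper for illustration; y_true holds 0/1 labels and
    y_prob holds predicted probabilities in [0, 1].
    """
    prob_sums = [0.0] * n_bins
    positives = [0] * n_bins
    counts = [0] * n_bins
    for label, prob in zip(y_true, y_prob):
        # A probability of exactly 1.0 is placed in the last bin.
        b = min(int(prob * n_bins), n_bins - 1)
        prob_sums[b] += prob
        positives[b] += label
        counts[b] += 1
    return [
        (prob_sums[b] / counts[b], positives[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]
```

A well-calibrated model yields bins whose first two values are close to each other; large gaps suggest that the probability estimates should not be used directly in downstream calculations such as expected returns.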


Subjects
Data Science/methods; Investments; Data Science/education; Datasets as Topic; Female; Humans; Investments/trends; Machine Learning; Organizational Case Studies; Peer Group
3.
Big Data ; 5(3): 197-212, 2017 Sep.
Article in English | MEDLINE | ID: mdl-28933942

ABSTRACT

Recent studies show the remarkable power of fine-grained information disclosed by users on social network sites to infer users' personal characteristics via predictive modeling. Similar fine-grained data are being used successfully in other commercial applications. In response, attention is turning increasingly to the transparency that organizations provide to users as to what inferences are drawn and why, as well as to what sort of control users can be given over inferences that are drawn about them. In this article, we focus on inferences about personal characteristics based on information disclosed by users' online actions. As a use case, we explore personal inferences that are made possible from "Likes" on Facebook. We first present a means for providing transparency into the information responsible for inferences drawn by data-driven models. We then introduce the "cloaking device," a mechanism for users to inhibit the use of particular pieces of information in inference. Using these analytical tools, we ask two main questions: (1) How much information must users cloak to significantly affect inferences about their personal traits? We find that usually users must cloak only a small portion of their actions to inhibit inference. We also find that, encouragingly, false-positive inferences are significantly easier to cloak than true-positive inferences. (2) Can firms change their modeling behavior to make cloaking more difficult? The answer is a definitive yes. We demonstrate a simple modeling change that requires users to cloak substantially more information to affect the inferences drawn. The upshot is that organizations can provide transparency and control even into complicated, predictive model-driven inferences, but they also can make control easier or harder for their users.
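
To make the cloaking idea concrete, here is a minimal sketch under stated assumptions: the inference is a linear score over binary "Like" features, and Likes are cloaked greedily in order of decreasing positive weight. The abstract does not specify the model or the selection procedure, so the functions `score` and `cloak`, the weights, and the threshold are all hypothetical.

```python
from typing import Dict, List, Set


def score(weights: Dict[str, float], bias: float, likes: Set[str]) -> float:
    """Linear score: bias plus the weights of the Likes the model may use."""
    return bias + sum(weights.get(like, 0.0) for like in likes)


def cloak(weights: Dict[str, float], bias: float, likes: Set[str],
          threshold: float = 0.0) -> List[str]:
    """Greedily cloak the highest-weight Likes until the score is <= threshold.

    Assumed procedure for illustration only. If the score is still above the
    threshold after all positive-weight Likes are cloaked, the inference
    cannot be inhibited this way; the caller can re-check with `score`.
    """
    visible = set(likes)
    cloaked: List[str] = []
    ranked = sorted(likes, key=lambda k: weights.get(k, 0.0), reverse=True)
    for like in ranked:
        if score(weights, bias, visible) <= threshold:
            break  # the inference is no longer drawn
        if weights.get(like, 0.0) <= 0.0:
            break  # remaining Likes cannot lower the score
        visible.remove(like)
        cloaked.append(like)
    return cloaked
```

The length of the returned list relative to the number of Likes corresponds to the quantity the first question asks about: how much a user must hide to change the inference.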


Subjects
Individuality; Adult; Female; Humans; Male
4.
Big Data ; 3(2): 90-102, 2015 Jun.
Article in English | MEDLINE | ID: mdl-27447433

ABSTRACT

Online systems promise to improve advertisement targeting via the massive and detailed data available. However, there often are too few data on exactly the outcome of interest, such as purchases, for accurate campaign evaluation and optimization (due to low conversion rates, cold start periods, lack of instrumentation of offline purchases, and long purchase cycles). This paper presents a detailed treatment of proxy modeling, which is based on the identification of a suitable alternative (proxy) target variable when data on the true objective are in short supply (or even completely nonexistent). The paper has a two-fold contribution. First, the potential of proxy modeling is demonstrated clearly, based on a massive-scale experiment across 58 real online advertising campaigns. Second, we assess the value of different specific proxies for evaluating and optimizing online display advertising, showing striking results. The results include bad news and good news. The most commonly cited and used proxy is a click on an ad. The bad news is that across a large number of campaigns, clicks are not good proxies for evaluation or for optimization: clickers do not resemble buyers. The good news is that an alternative sort of proxy performs remarkably well: observed visits to the brand's website. Specifically, predictive models built based on brand site visits, which are much more common than purchases, do a remarkably good job of predicting which browsers will make a purchase. The practical bottom line: evaluating and optimizing campaigns using clicks seems wrongheaded; however, there is an easy and attractive alternative: use a well-chosen site-visit proxy instead.


Subjects
Advertising/statistics & numerical data; Online Systems; Data Interpretation, Statistical; Humans; Internet; Models, Statistical
5.
Big Data ; 2(3): 117-28, 2014 Sep.
Article in English | MEDLINE | ID: mdl-27442492

ABSTRACT

In August 2013, we held a panel discussion at the KDD 2013 conference in Chicago on the subject of data science, data scientists, and start-ups. KDD is the premier conference on data science research and practice. The panel discussed the pros and cons for top-notch data scientists of the hot data science start-up scene. In this article, we first present background on our panelists. Our four panelists have unquestionable pedigrees in data science and substantial experience with start-ups from multiple perspectives (founders, employees, chief scientists, venture capitalists). For the casual reader, we next present a brief summary of the experts' opinions on eight of the issues the panel discussed. The rest of the article presents a lightly edited transcription of the entire panel discussion.

6.
Big Data ; 1(1): 51-9, 2013 Mar.
Article in English | MEDLINE | ID: mdl-27447038

ABSTRACT

Companies have realized they need to hire data scientists, academic institutions are scrambling to put together data-science programs, and publications are touting data science as a hot (even "sexy") career choice. However, there is confusion about what exactly data science is, and this confusion could lead to disillusionment as the concept diffuses into meaningless buzz. In this article, we argue that there are good reasons why it has been hard to pin down exactly what data science is. One reason is that data science is intricately intertwined with other concepts of growing importance, such as big data and data-driven decision making. Another reason is the natural tendency to associate what a practitioner does with the definition of the practitioner's field; this can result in overlooking the fundamentals of the field. We believe that trying to define the boundaries of data science precisely is not of the utmost importance. We can debate the boundaries of the field in an academic setting, but in order for data science to serve business effectively, it is important (i) to understand its relationships to other important related concepts, and (ii) to begin to identify the fundamental principles underlying data science. Once we embrace (ii), we can much better understand and explain exactly what data science has to offer. Furthermore, only once we embrace (ii) should we be comfortable calling it data science. In this article, we present a perspective that addresses all these concepts. We close by offering, as examples, a partial list of fundamental principles underlying data science.

7.
Big Data ; 1(4): 215-26, 2013 Dec.
Article in English | MEDLINE | ID: mdl-27447254

ABSTRACT

With the increasingly widespread collection and processing of "big data," there is natural interest in using these data assets to improve decision making. One of the best understood ways to use data to improve decision making is via predictive analytics. An important, open question is: to what extent do larger data actually lead to better predictive models? In this article we empirically demonstrate that when predictive models are built from sparse, fine-grained data, such as data on low-level human behavior, we continue to see marginal increases in predictive performance even to very large scale. The empirical results are based on data drawn from nine different predictive modeling applications, from book reviews to banking transactions. This study provides a clear illustration that larger data indeed can be more valuable assets for predictive analytics. This implies that institutions with larger data assets (plus the skill to take advantage of them) potentially can obtain substantial competitive advantage over institutions without such access or skill. Moreover, the results suggest that it is worthwhile for companies with access to such fine-grained data, in the context of a key predictive task, to gather both more data instances and more possible data features. As an additional contribution, we introduce an implementation of the multivariate Bernoulli Naïve Bayes algorithm that can scale to massive, sparse data.
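
The abstract does not describe how the implementation scales, so the following is only a sketch of one common way to make multivariate Bernoulli naive Bayes efficient on sparse data, not necessarily the authors' method. The idea is to precompute, per class, the log-likelihood of an all-zeros instance and then adjust it only for the features that are active, so that scoring costs time proportional to the number of active features. The class name `SparseBernoulliNB`, its methods, and the Laplace smoothing parameter `alpha` are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class SparseBernoulliNB:
    """Bernoulli naive Bayes over instances given as sets of active feature indices."""

    def __init__(self, n_features, alpha=1.0):
        self.n_features = n_features  # total number of binary features
        self.alpha = alpha            # Laplace smoothing; must be > 0
        self.base = {}     # class -> log prior + sum over all j of log(1 - p_j)
        self.delta = {}    # class -> {j: log p_j - log(1 - p_j)} for seen features
        self.default = {}  # class -> delta for features never active in that class

    def fit(self, rows, labels):
        class_sizes = Counter(labels)
        active_counts = defaultdict(Counter)
        for active, label in zip(rows, labels):
            active_counts[label].update(active)  # each row is a set of indices
        n = len(labels)
        for c, n_c in class_sizes.items():
            denom = n_c + 2 * self.alpha
            p0 = self.alpha / denom  # smoothed probability of an unseen feature
            seen = active_counts[c]
            # Start from the all-zeros instance; unseen features share p0.
            base = math.log(n_c / n) + (self.n_features - len(seen)) * math.log(1 - p0)
            delta = {}
            for j, k in seen.items():
                p = (k + self.alpha) / denom
                base += math.log(1 - p)
                delta[j] = math.log(p) - math.log(1 - p)
            self.base[c] = base
            self.delta[c] = delta
            self.default[c] = math.log(p0) - math.log(1 - p0)
        return self

    def log_joint(self, active):
        """Log P(class, instance), touching only the active features."""
        return {
            c: self.base[c] + sum(self.delta[c].get(j, self.default[c]) for j in active)
            for c in self.base
        }

    def predict(self, active):
        scores = self.log_joint(active)
        return max(scores, key=scores.get)
```

For example, `SparseBernoulliNB(n_features=6).fit([{0, 1}, {0, 2}, {3, 4}, {3, 5}], [1, 1, 0, 0])` can then score a new instance such as `{0}` without iterating over the five inactive features.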

8.
Big Data ; 2(2): 71-2, 2014 Jun.
Article in English | MEDLINE | ID: mdl-27442299