Outlier detection for questionnaire data in biobanks.
Sakurai, Rieko; Ueki, Masao; Makino, Satoshi; Hozawa, Atsushi; Kuriyama, Shinichi; Takai-Igarashi, Takako; Kinoshita, Kengo; Yamamoto, Masayuki; Tamiya, Gen.
Affiliation
  • Sakurai R; Statistical Genetics Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan.
  • Ueki M; Tohoku Medical Megabank Organization, Tohoku University, Sendai, Japan.
  • Makino S; Statistical Genetics Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan.
  • Hozawa A; Tohoku Medical Megabank Organization, Tohoku University, Sendai, Japan.
  • Kuriyama S; Statistical Genetics Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan.
  • Takai-Igarashi T; Graduate School of Medicine, Tohoku University, Sendai, Japan.
  • Kinoshita K; Tohoku Medical Megabank Organization, Tohoku University, Sendai, Japan.
  • Yamamoto M; Graduate School of Medicine, Tohoku University, Sendai, Japan.
  • Tamiya G; Tohoku Medical Megabank Organization, Tohoku University, Sendai, Japan.
Int J Epidemiol; 48(4): 1305-1315, 2019-08-01.
Article in English | MEDLINE | ID: mdl-30848787
BACKGROUND: Biobanks increasingly collect, process and store omics data together with more conventional epidemiological information, which necessitates considerable effort in data cleaning. An efficient outlier detection method that reduces manual labour is highly desirable. METHOD: We develop an unsupervised machine-learning method for outlier detection, namely kurPCA, that uses principal component analysis combined with kurtosis to ascertain the existence of outliers. In addition, we propose a novel regression adjustment approach to improve detection, namely the regression adjustment for data by systematic missing patterns (RAMP). RESULT: Application to epidemiological record data in a large-scale biobank (Tohoku Medical Megabank Organization, Japan) shows that a combination of kurPCA and RAMP effectively detects known errors or inconsistent patterns. CONCLUSIONS: The simulation and application results confirm that our methods perform well. The proposed methods are useful for many practical analysis scenarios.
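
The abstract describes kurPCA only as principal component analysis combined with kurtosis, so the following Python sketch illustrates that general idea and is not the authors' algorithm. It assumes complete numeric data, and the function name `kurtosis_pca_outliers` and both thresholds are hypothetical choices made for illustration. Components whose scores have high excess kurtosis are treated as likely to be driven by a few extreme rows, and rows with extreme standardized scores on those components are flagged.

```python
import numpy as np


def kurtosis_pca_outliers(X, kurtosis_threshold=3.0, score_threshold=4.0):
    """Flag rows with extreme scores on heavy-tailed principal components.

    Illustrative sketch only: the thresholds are arbitrary assumptions,
    not values taken from the paper.
    """
    X = np.asarray(X, dtype=float)

    # Standardize each column; constant columns are left at zero.
    std = X.std(axis=0)
    std[std == 0] = 1.0
    Z = (X - X.mean(axis=0)) / std

    # PCA via singular value decomposition: scores = U * S.
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    scores = U * S

    # Drop components with numerically zero variance.
    keep = S > 1e-10 * S.max()
    scores = scores[:, keep]

    # Standardize the scores of each retained component.
    centered = scores - scores.mean(axis=0)
    standardized = centered / centered.std(axis=0)

    # Excess kurtosis per component (0 for a Gaussian).
    excess_kurtosis = (standardized ** 4).mean(axis=0) - 3.0

    # A row is flagged if it is extreme on any heavy-tailed component.
    heavy = excess_kurtosis > kurtosis_threshold
    flagged = (np.abs(standardized[:, heavy]) > score_threshold).any(axis=1)
    return flagged, excess_kurtosis
```

Calling `kurtosis_pca_outliers(X)` on a rows-by-variables array returns a boolean mask of flagged rows and the excess kurtosis of each retained component. Real questionnaire data would also need the missing-value handling that RAMP addresses, which this sketch does not attempt.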

Full text: 1 Databases: MEDLINE Main subject: Algorithms / Surveys and Questionnaires / Models, Statistical / Machine Learning Type of study: Diagnostic_studies / Guideline / Risk_factors_studies Limits: Humans Language: En Journal: Int J Epidemiol Year: 2019 Document type: Article Affiliation country: Japan
