ABSTRACT
Producing or sharing Child Sexual Exploitation Material (CSEM) is a severe crime that Law Enforcement Agencies (LEAs) fight daily. When an LEA seizes a computer from a potential producer or consumer of CSEM, it analyzes the suspect's storage devices looking for evidence. Manual inspection for CSEM is time-consuming, and Spanish police have limited time to execute a search warrant. Our approach to speeding up the identification of CSEM-related files is to analyze only the file names and their absolute paths rather than their content. The main challenge lies in handling short and sparse texts that are deliberately distorted by file owners using obfuscated words and user-defined naming patterns. We present two approaches to CSEM identification. The first employs two independent classifiers, one for the file name and the other for the file path, and then combines their outputs. Conversely, the second approach uses only the file name classifier to iterate over an absolute path. Both operate at the character n-gram level, and novel binary and orthographic features are introduced to enrich the text representation. We benchmarked six classification models based on machine learning and convolutional neural networks. The proposed classifier achieves an F1 score of 0.988, which makes it a promising tool for LEAs.
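As a rough illustration of the character n-gram representation mentioned above, the following minimal sketch counts character n-grams separately for a file name and for its parent path. The function names, the n-gram range, the lowercasing step, and the example path are assumptions made for illustration; the abstract does not specify the authors' preprocessing, features, or classifiers.

```python
from collections import Counter
from pathlib import PurePosixPath


def char_ngrams(text, n_min=1, n_max=3):
    """Count every character n-gram of length n_min..n_max in the lowercased text."""
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def split_name_and_path(absolute_path):
    """Split a POSIX-style absolute path into (file name, parent directory)."""
    p = PurePosixPath(absolute_path)
    return p.name, str(p.parent)


if __name__ == "__main__":
    # Hypothetical, harmless example path.
    name, parent = split_name_and_path("/home/user/docs/report_2020.txt")
    print(name, char_ngrams(name, 2, 2).most_common(3))
    print(parent, char_ngrams(parent, 2, 2).most_common(3))
```

In a two-classifier setup like the first approach described, each of the two count vectors would feed its own model; how the outputs are combined is not stated in the abstract and is left out here.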
Subject(s)
Benchmarking, Crime, Humans, Child, Family, Law Enforcement, Machine Learning
ABSTRACT
Grass feeding is a specific quality criterion for beef. The effects of grass silage intake on quality characteristics (fatty acid, fat-soluble vitamin, and lipid-derived volatile composition) of intramuscular and perirenal fat from fattening bull weaners were studied. Visible (VIS) and near-infrared (NIR) spectra were also obtained from perirenal fat. Perirenal fat analysis was performed for feeding differentiation purposes. A total of 22 Tudanca breed 11-month-old bulls were finished on three different diets: grass silage and a commercial concentrate ad libitum (GS-AC), grass silage ad libitum and the commercial concentrate restricted to half of the intake of the GS-AC group (GS-LC), and barley straw and concentrate ad libitum (Str-AC). Feeding had a significant effect (p < 0.05) on γ-linolenic acid and the n-6/n-3 fatty acid ratio. Furthermore, β-carotene content was greater in beef from the silage groups than in the Str-AC group. Feeding also affected the perirenal fat composition. Beef from silage-fed bulls and straw-fed bulls could be differentiated by fatty acid percentages (especially 18:0, t-18:1, and c9-18:1), β-carotene content, b* colour value, and carotenoid colour index. However, the VIS and NIR spectral data showed poor differentiating performance, and the volatile composition did not have appreciable differentiation power.
ABSTRACT
This study evaluates several feature ranking techniques together with some classifiers based on machine learning to identify relevant factors regarding the probability of contracting breast cancer and improve the performance of risk prediction models for breast cancer in a healthy population. The dataset, with 919 cases and 946 controls, comes from the MCC-Spain study and includes only environmental and genetic features. Breast cancer is a major public health problem. Our aim is to analyze which factors in the cancer risk prediction model are the most important for breast cancer prediction. Likewise, quantifying the stability of feature selection methods becomes essential before trying to gain insight into the data. This paper assesses several feature selection algorithms in terms of performance for a set of predictive models. Furthermore, their robustness is quantified to analyze both the similarity between the feature selection rankings and their own stability. The ranking provided by the SVM-RFE approach leads to the best performance in terms of the area under the ROC curve (AUC) metric. The top-47 ranked features obtained with this approach, fed to a Logistic Regression classifier, achieve an AUC of 0.616. This is an improvement of 5.8% in comparison with the full feature set. Furthermore, the SVM-RFE ranking technique turned out to be highly stable (as did Random Forest), whereas Relief and the wrapper approaches are quite unstable. This study demonstrates that the stability and performance of the model should be studied together: Random Forest and SVM-RFE turned out to be the most stable algorithms, but in terms of model performance SVM-RFE outperforms Random Forest.
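One common way to put a number on the stability of a feature ranking method is to rerun it on several resamples of the data and compare the resulting top-k feature sets. The sketch below uses the mean pairwise Jaccard index for that comparison. This is an assumption chosen for illustration: the abstract does not say which stability metric the study uses, and the function names and toy rankings are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two collections treated as sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def topk_stability(rankings, k):
    """Mean pairwise Jaccard index of the top-k features across several rankings.

    Each ranking is a list of feature names ordered from most to least relevant,
    for example one ranking per bootstrap resample of the training data.
    """
    tops = [ranking[:k] for ranking in rankings]
    pairs = list(combinations(tops, 2))
    if not pairs:
        raise ValueError("at least two rankings are required")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical rankings produced by one method on three resamples.
    runs = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
    print(topk_stability(runs, k=2))
```

A value of 1.0 means every resample selected the same top-k set; values near 0 mean the selected sets barely overlap.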
Subject(s)
Breast Neoplasms, Algorithms, Area Under Curve, Breast Neoplasms/epidemiology, Female, Humans, Logistic Models, Machine Learning, Support Vector Machine
ABSTRACT
Retrieving text embedded within images is a challenging task in real-world settings. Multiple problems, such as low resolution and the orientation of the text, can hinder the extraction of information. These problems are common in environments such as Tor Darknet and Child Sexual Abuse images, where text extraction is crucial in the prevention of illegal activities. In this work, we evaluate eight text recognizers and, to increase the performance of text transcription, we combine these recognizers with rectification networks and super-resolution algorithms. We test our approach on four state-of-the-art and two custom datasets (TOICO-1K and Child Sexual Abuse (CSA)-text, based on text retrieved from Tor Darknet and Child Sexual Exploitation Material, respectively). We obtained a correctly recognized words score of 0.3170 on the TOICO-1K dataset when we combined Deep Convolutional Neural Networks (CNN) and rectification-based recognizers. For the CSA-text dataset, applying resolution enhancements achieved a final score of 0.6960. The highest performance increase was achieved on the ICDAR 2015 dataset, with an improvement of 4.83% when combining the MORAN recognizer and the Residual Dense resolution approach. We conclude that rectification outperforms super-resolution when applied separately, while their combination achieves the best average improvements in the chosen datasets.
ABSTRACT
Face recognition is a valuable forensic tool for criminal investigators since it helps identify individuals in scenarios of criminal activity, such as cases involving fugitives or child sexual abuse. It is, however, a very challenging task, as it must handle low-quality images from real-world settings and meet real-time requirements. Deep learning approaches for face detection have proven to be very successful, but they require substantial computing power and processing time. In this work, we evaluate the speed-accuracy tradeoff of three popular deep-learning-based face detectors on the WIDER Face and UFDD data sets on several CPUs and GPUs. We also develop a regression model capable of estimating the performance, in terms of both processing time and accuracy. We expect this to become a very useful tool for end users in forensic laboratories who need to estimate the performance of different face detection options. Experimental results showed that the best speed-accuracy tradeoff is achieved with images resized to 50% of the original size on GPUs and images resized to 25% of the original size on CPUs. Moreover, performance can be estimated using multiple linear regression models with a Mean Absolute Error (MAE) of 0.113, which is very promising for the forensic field.
Subject(s)
Deep Learning, Face, Child, Forensic Sciences, Humans
ABSTRACT
BACKGROUND AND OBJECTIVE: Risk prediction models aim at identifying people at higher risk of developing a target disease. Feature selection is particularly important both to improve prediction model performance by avoiding overfitting and to identify the leading cancer risk (and protective) factors. Assessing the stability of feature selection/ranking algorithms becomes an important issue when the aim is to analyze the features with the most predictive power. METHODS: This work focuses on colorectal cancer, assessing several feature ranking algorithms in terms of performance for a set of risk prediction models (Neural Networks, Support Vector Machines (SVM), Logistic Regression, k-Nearest Neighbors and Boosted Trees). Additionally, their robustness is evaluated following a conventional approach with scalar stability metrics and a visual approach proposed in this work to study both the similarity among feature ranking techniques and their individual stability. A comparative analysis is carried out between the most relevant features found in this study and the features provided by experts according to state-of-the-art knowledge. RESULTS: The two best performance results in terms of Area Under the ROC Curve (AUC) are achieved with an SVM classifier using the top-41 features selected by the SVM wrapper approach (AUC=0.693) and with Logistic Regression using the top-40 features selected by the Pearson correlation coefficient (AUC=0.689). Experiments showed that performing feature selection contributes to classification performance, with a 3.9% and 1.9% improvement in AUC for the SVM and Logistic Regression classifiers, respectively, with respect to the results using the full feature set. The visual approach proposed in this work shows that the Neural Network-based wrapper ranking is the most unstable while Random Forest is the most stable.
CONCLUSIONS: This study demonstrates that stability and model performance should be studied jointly: Random Forest turned out to be the most stable algorithm but was outperformed by others in terms of model performance, while the SVM wrapper and the Pearson correlation coefficient are moderately stable and achieve good model performance.
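For readers unfamiliar with the AUC metric reported above, the following minimal sketch computes it from its rank-based definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name and the toy labels and scores are illustrative assumptions and are unrelated to the study's data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 class labels; scores: matching classifier scores.
    Runs in O(n_pos * n_neg), which is fine for small illustrative inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    y_true = [0, 0, 1, 1]
    y_score = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(y_true, y_score))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for values such as the 0.693 and 0.689 reported in the abstract.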
Subject(s)
Colorectal Neoplasms/diagnostic imaging, Computer Neural Networks, Risk Assessment/methods, Support Vector Machine, Area Under Curve, Computer Graphics, Computer Simulation, Factual Databases, Humans, ROC Curve, Software, Spain/epidemiology
ABSTRACT
Automated assessment of sperm quality is an important challenge in the veterinary field. In this paper, we explore how to describe the acrosomes of boar spermatozoa using image analysis so that they can be automatically categorized as intact or damaged. Our proposal aims at characterizing the acrosomes by means of texture features. The texture is described using first-order statistics and features derived from the co-occurrence matrix of the image, both computed from the original image and from the coefficients yielded by the Discrete Wavelet Transform. Texture descriptors are evaluated and compared with moment-based descriptors in terms of the classification accuracy they provide. Experimental results with Multilayer Perceptron and k-Nearest Neighbours classifiers show that texture descriptors outperform moment-based descriptors, reaching an accuracy of 94.93%, which makes this approach very attractive for the veterinary community.
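To make the co-occurrence matrix idea concrete, the sketch below builds a normalized gray-level co-occurrence matrix for a single pixel offset and derives three standard texture features from it. The offset, the number of gray levels, the choice of features, and the small test image are assumptions for illustration; the abstract does not list the exact descriptors used, and the wavelet-domain step is not covered here.

```python
def glcm(image, dx=1, dy=0, levels=4):
    """Normalized co-occurrence matrix for pixel pairs separated by (dx, dy).

    image is a list of rows of integer gray levels in range(levels).
    Entry [i][j] is the relative frequency of level i followed by level j.
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    if total == 0:
        raise ValueError("offset leaves no valid pixel pairs")
    return [[v / total for v in row] for row in counts]


def texture_features(p):
    """Contrast, angular second moment and homogeneity of a normalized GLCM."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    contrast = sum(v * (i - j) ** 2 for i, j, v in cells)
    asm = sum(v * v for _, _, v in cells)
    # Homogeneity has several variants; this one divides by 1 + (i - j)^2.
    homogeneity = sum(v / (1 + (i - j) ** 2) for i, j, v in cells)
    return {"contrast": contrast, "asm": asm, "homogeneity": homogeneity}


if __name__ == "__main__":
    img = [[0, 0, 1, 1],
           [0, 0, 1, 1],
           [0, 2, 2, 2],
           [2, 2, 3, 3]]
    print(texture_features(glcm(img)))
```

Feature vectors of this kind, computed per image, are what a classifier such as a Multilayer Perceptron or k-Nearest Neighbours model would take as input.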
Subject(s)
Acrosome, Spermatozoa, Animals, Male, Swine
ABSTRACT
Decision theory shows that the optimal decision is a function of the posterior class probabilities. More specifically, in binary classification, the optimal decision is based on the comparison of the posterior probabilities with some threshold. Therefore, the most accurate estimates of the posterior probabilities are required near these decision thresholds. This paper discusses the design of objective functions that provide more accurate estimates of the probability values, taking into account the characteristics of each decision problem. We propose learning algorithms based on the stochastic gradient minimization of these loss functions. We show that the performance of the classifier is improved when these algorithms behave like sample selectors: samples near the decision boundary are the most relevant during learning.
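The threshold rule described above can be made concrete with a small worked sketch. It assumes a binary problem in which correct decisions cost nothing, a false positive costs `cost_fp`, and a false negative costs `cost_fn`; these names and the cost model are assumptions introduced for illustration and are not taken from the paper. Deciding "positive" has expected cost (1 - p) * cost_fp and deciding "negative" has expected cost p * cost_fn, so the positive decision is cheaper exactly when the posterior p exceeds cost_fp / (cost_fp + cost_fn).

```python
def bayes_threshold(cost_fp, cost_fn):
    """Posterior threshold above which deciding 'positive' minimizes expected cost."""
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def expected_cost(posterior, decision, cost_fp, cost_fn):
    """Expected cost of a decision (1 = positive, 0 = negative) given P(positive | x)."""
    if decision == 1:
        return (1.0 - posterior) * cost_fp  # pay only if the true class is negative
    return posterior * cost_fn              # pay only if the true class is positive


def decide(posterior, cost_fp, cost_fn):
    """Return 1 if the posterior exceeds the cost-derived threshold, else 0."""
    return 1 if posterior > bayes_threshold(cost_fp, cost_fn) else 0


if __name__ == "__main__":
    # With false negatives four times as costly, the threshold drops to 0.2.
    for p in (0.1, 0.3, 0.6):
        print(p, decide(p, cost_fp=1.0, cost_fn=4.0), decide(p, cost_fp=1.0, cost_fn=1.0))
```

The sketch also shows why posterior estimates matter most near the threshold: a posterior of 0.3 leads to different decisions under the two cost settings, while 0.1 and 0.6 lead to the same decision in both, so estimation errors far from the threshold rarely change the outcome.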