Results 1-7 of 7
1.
Evol Comput; 1-32, 2024 Jan 26.
Article in English | MEDLINE | ID: mdl-38271633

ABSTRACT

Genetic Programming (GP) often uses large training sets and requires all individuals to be evaluated on all training cases during selection. Random down-sampled lexicase selection evaluates individuals on only a random subset of the training cases, allowing more individuals to be explored with the same number of program executions. However, sampling randomly can exclude important cases from the down-sample for a number of generations, while cases that measure the same behavior (synonymous cases) may be overused. In this work, we introduce Informed Down-Sampled Lexicase Selection. This method leverages population statistics to build down-samples that contain more distinct and therefore informative training cases. Through an empirical investigation across two different GP systems (PushGP and Grammar-Guided GP), we find that informed down-sampling significantly outperforms random down-sampling on a set of contemporary program synthesis benchmark problems. Through an analysis of the created down-samples, we find that important training cases are included in the down-sample consistently across independent evolutionary runs and systems. We hypothesize that this improvement can be attributed to the ability of Informed Down-Sampled Lexicase Selection to maintain more specialist individuals over the course of evolution, while still benefiting from reduced per-evaluation costs.
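
To make the mechanism concrete, here is a minimal Python sketch of lexicase selection restricted to a down-sample, together with a greedy, distance-based construction of an "informed" down-sample. It is an illustration under assumed data structures (an error matrix indexed by individual and case), not the authors' PushGP or Grammar-Guided GP implementation; the parent-sampling rate and the scheduling of down-sample updates used in the paper are simplified away.

```python
import random

def lexicase_select(error_matrix, down_sample):
    """Select one parent index via lexicase selection on a down-sample.

    error_matrix[i][c] is the error of individual i on training case c;
    down_sample is a list of case indices forming the current down-sample.
    """
    candidates = list(range(len(error_matrix)))
    cases = list(down_sample)
    random.shuffle(cases)
    for c in cases:
        best = min(error_matrix[i][c] for i in candidates)
        candidates = [i for i in candidates if error_matrix[i][c] == best]
        if len(candidates) == 1:
            break
    return random.choice(candidates)


def informed_down_sample(error_matrix, sample_size, parent_share=0.01):
    """Greedy sketch of an informed down-sample: pick training cases whose
    solve/fail profiles on a small random sample of parents are far apart,
    so that synonymous cases are less likely to be chosen together."""
    n_individuals, n_cases = len(error_matrix), len(error_matrix[0])
    n_parents = max(1, int(parent_share * n_individuals))
    parents = random.sample(range(n_individuals), n_parents)
    # Binary solve vectors: 1 if the sampled parent solves the case (error 0).
    solve = [[1 if error_matrix[p][c] == 0 else 0 for p in parents]
             for c in range(n_cases)]

    def distance(a, b):  # Hamming distance between two solve vectors
        return sum(x != y for x, y in zip(solve[a], solve[b]))

    chosen = [random.randrange(n_cases)]
    while len(chosen) < min(sample_size, n_cases):
        remaining = [c for c in range(n_cases) if c not in chosen]
        # Add the case that is, on average, farthest from those already chosen.
        chosen.append(max(remaining,
                          key=lambda c: sum(distance(c, s) for s in chosen)))
    return chosen
```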

2.
BMC Med Res Methodol; 23(1): 125, 2023 May 24.
Article in English | MEDLINE | ID: mdl-37226114

ABSTRACT

BACKGROUND: Cancer registries collect patient-specific information about cancer diseases. The collected information is verified and made available to clinical researchers, physicians, and patients. When processing information, cancer registries verify that the patient-specific records they collect are plausible. This means that the collected information about a particular patient makes medical sense. METHODS: Unsupervised machine learning approaches can detect implausible electronic health records without human guidance. Therefore, this article investigates two unsupervised anomaly detection approaches, a pattern-based approach (FindFPOF) and a compression-based approach (autoencoder), to identify implausible electronic health records in cancer registries. Unlike most existing work that analyzes synthetic anomalies, we compare the performance of both approaches and a baseline (random selection of records) on a real-world dataset. The dataset contains 21,104 electronic health records of patients with breast, colorectal, and prostate tumors. Each record consists of 16 categorical variables describing the disease, the patient, and the diagnostic procedure. The samples identified by FindFPOF, the autoencoder, and a random selection (a total of 785 different records) are evaluated in a real-world scenario by medical domain experts. RESULTS: Both anomaly detection methods are good at detecting implausible electronic health records. First, domain experts identified [Formula: see text] of 300 randomly selected records as implausible. With FindFPOF and the autoencoder, [Formula: see text] of the proposed 300 records in each sample were implausible. This corresponds to a precision of [Formula: see text] for FindFPOF and the autoencoder. Second, for 300 randomly selected records that were labeled by domain experts, the sensitivity of the autoencoder was [Formula: see text] and the sensitivity of FindFPOF was [Formula: see text]. Both anomaly detection methods had a specificity of [Formula: see text]. Third, FindFPOF and the autoencoder suggested samples with a different distribution of values than the overall dataset. For example, both anomaly detection methods suggested a higher proportion of colorectal records, the tumor localization with the highest percentage of implausible records in a randomly selected sample. CONCLUSIONS: Unsupervised anomaly detection can significantly reduce the manual effort of domain experts to find implausible electronic health records in cancer registries. In our experiments, the manual effort was reduced by a factor of approximately 3.5 compared to evaluating a random sample.
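
The compression-based approach can be illustrated with a short, hedged sketch: one-hot encode the categorical variables, fit a small autoencoder (approximated here with scikit-learn's MLPRegressor trained to reproduce its own input), and flag the records with the highest reconstruction error for expert review. This is not the registry's actual pipeline, the pattern-based FindFPOF method is omitted, and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.neural_network import MLPRegressor

def rank_implausible(records, n_flag=300):
    """Rank categorical health records by autoencoder reconstruction error.

    records: array-like of shape (n_records, n_categorical_variables).
    Returns the indices of the n_flag records with the highest error, i.e.
    the candidates to hand to domain experts for plausibility review.
    """
    X = OneHotEncoder(handle_unknown="ignore").fit_transform(records).toarray()

    # A narrow hidden layer forces the network to compress the records;
    # implausible value combinations tend to reconstruct poorly.
    autoencoder = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500,
                               random_state=0)
    autoencoder.fit(X, X)

    errors = np.mean((X - autoencoder.predict(X)) ** 2, axis=1)
    return np.argsort(errors)[::-1][:n_flag]
```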


Subjects
Colorectal Neoplasms, Physicians, Prostatic Neoplasms, Male, Humans, Electronic Health Records, Registries
3.
Evol Comput; 30(1): 51-74, 2022 Mar 01.
Article in English | MEDLINE | ID: mdl-34428302

ABSTRACT

Linear Genetic Programming (LGP) represents programs as sequences of instructions and has a Directed Acyclic Graph (DAG) dataflow. The results of instructions are stored in registers that can be used as arguments by other instructions. Instructions that are disconnected from the main part of the program are called noneffective instructions, or structural introns. They also appear in other DAG-based GP approaches like Cartesian Genetic Programming (CGP). This article studies four hypotheses on the role of structural introns: noneffective instructions (1) serve as evolutionary memory, where evolved information is stored and later used in search, (2) preserve population diversity, (3) allow neutral search, where structural introns increase the number of neutral mutations and improve performance, and (4) serve as genetic material to enable program growth. We study different variants of LGP controlling the influence of introns for symbolic regression, classification, and digital circuits problems. We find that there is (1) evolved information in the noneffective instructions that can be reactivated and that (2) structural introns can promote programs with higher effective diversity. However, neither effect influences LGP search performance. On the other hand, allowing mutations to be applied not only to effective but also to noneffective instructions (3) increases the rate of neutral mutations and (4) contributes to program growth by making use of the genetic material available as structural introns. This is accompanied by a significant increase in LGP performance, which makes structural introns important for LGP.
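
Structural introns can be identified with the standard backward marking pass over a linear program: an instruction is effective only if the register it writes is later read on a path to an output register. The sketch below assumes a simplified instruction format (destination register and two source registers, no branches) and is not tied to the specific LGP variants studied in the article.

```python
def mark_effective(program, output_registers):
    """Mark structurally effective instructions in a linear GP program.

    Each instruction is a tuple (dest, src1, src2) of register indices.
    Instructions whose result never reaches an output register are
    structural introns (noneffective instructions).
    """
    needed = set(output_registers)
    effective = [False] * len(program)
    # Walk the program backwards: an instruction is effective if it writes
    # a register that a later effective instruction (or the output) reads.
    for i in range(len(program) - 1, -1, -1):
        dest, src1, src2 = program[i]
        if dest in needed:
            effective[i] = True
            needed.discard(dest)          # this write satisfies the demand...
            needed.update((src1, src2))   # ...but its operands are now needed
    return effective


# Example with registers r0..r3, program output in r0:
program = [
    (3, 1, 2),   # r3 = f(r1, r2)  -> structural intron, r3 is never used
    (0, 1, 1),   # r0 = f(r1, r1)  -> effective
]
print(mark_effective(program, output_registers=[0]))  # [False, True]
```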


Subjects
Algorithms, Biological Evolution
4.
Int J Med Inform; 185: 105387, 2024 May.
Article in English | MEDLINE | ID: mdl-38428200

ABSTRACT

BACKGROUND: Cancer registries link a large number of electronic health records reported by medical institutions to already registered records of the matching individual and tumor. Records are automatically linked using deterministic and probabilistic approaches; machine learning is rarely used. Records that cannot be matched automatically with sufficient accuracy are typically processed manually. For application, it is important to know how well record linkage approaches match real-world records and how much manual effort is required to achieve the desired linkage quality. We study the task of linking reported records to the matching registered tumor in cancer registries. METHODS: We compare the tradeoff between linkage quality and manual effort of five machine learning methods (logistic regression, random forest, gradient boosting, neural network, and a stacked method) to a deterministic baseline. The record linkage methods are compared in a two-class setting (no-match/match) and a three-class setting (no-match/undecided/match). A cancer registry collected and linked the dataset consisting of categorical variables matching 145,755 reported records with 33,289 registered tumors. RESULTS: In the two-class setting, the gradient boosting, neural network, and stacked models have higher accuracy and F1 score (accuracy: 0.968-0.978, F1 score: 0.983-0.988) than the deterministic baseline (accuracy: 0.964, F1 score: 0.980) when the same records are manually processed (0.89% of all records). In the three-class setting, these three machine learning methods can automatically process all reported records and still have higher accuracy and F1 score than the deterministic baseline. The linkage quality of the machine learning methods studied, except for the neural network, increases as the number of manually processed records increases. CONCLUSION: Machine learning methods can significantly improve linkage quality and reduce the manual effort required by medical coders to match tumor records in cancer registries compared to a deterministic baseline. Our results help cancer registries estimate how linkage quality increases as more records are manually processed.
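
A hedged sketch of how the three-class setting can be realized: train a probabilistic match classifier on numerically encoded record-pair comparison features and convert its match probability into no-match, undecided, or match using two thresholds; undecided pairs are the ones routed to manual processing by medical coders. The model choice, feature encoding, and threshold values are illustrative assumptions, not the registry's configuration; moving the thresholds apart trades more manual effort for higher linkage quality.

```python
from sklearn.ensemble import GradientBoostingClassifier

def three_class_linkage(X_train, y_train, X_new, lower=0.2, upper=0.8):
    """Turn a binary match classifier into the three-class linkage setting.

    y_train is 1 when a reported record matches the registered tumor in the
    pair and 0 otherwise. Pairs whose predicted match probability falls
    between the two thresholds are labeled 'undecided' and handed over for
    manual processing.
    """
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X_train, y_train)

    labels = []
    for p in model.predict_proba(X_new)[:, 1]:
        if p >= upper:
            labels.append("match")
        elif p <= lower:
            labels.append("no-match")
        else:
            labels.append("undecided")
    return labels
```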


Subjects
Electronic Health Records, Neoplasms, Humans, Medical Record Linkage/methods, Neoplasms/epidemiology, Registries, Factual Databases
5.
Int J Med Inform; 186: 105414, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38531255

ABSTRACT

BACKGROUND: Urothelial bladder cancer (UBC) is characterized by a high recurrence rate, which is predicted by scoring systems. However, recent studies show the superiority of Machine Learning (ML) models. Nevertheless, these ML approaches are rarely used in medical practice because most of them are black-box models that cannot adequately explain how a prediction is made. OBJECTIVE: We investigate the global feature importance of different ML models. By providing information on the most relevant features, we can facilitate the use of ML in everyday medical practice. DESIGN, SETTING, AND PARTICIPANTS: The data is provided by the cancer registry Rhineland-Palatinate gGmbH, Germany. It consists of numerical and categorical features of 1,944 patients with UBC. We retrospectively predict 2-year recurrence through ML models using Support Vector Machine, Gradient Boosting, and Artificial Neural Network. We then determine the global feature importance using performance-based Permutation Feature Importance (PFI) and variance-based Feature Importance Ranking Measure (FIRM). RESULTS: We show reliable recurrence prediction of UBC with 82.02% to 83.89% F1-Score, 83.95% to 84.49% Precision, and an overall performance of 69.20% to 70.82% AUC on testing data, depending on the model. Gradient Boosting performs best among all black-box models with an average F1-Score (83.89%), AUC (70.82%), and Precision (83.95%). Furthermore, we show consistency across PFI and FIRM by identifying the same features as relevant across the different models. These features are exclusively therapeutic measures and are consistent with findings from both medical research and clinical trials. CONCLUSIONS: We confirm the superiority of ML black-box models in predicting UBC recurrence compared to more traditional logistic regression. In addition, we present an approach that increases the explanatory power of black-box models by identifying the underlying influence of input features, thus facilitating the use of ML in clinical practice and providing improved recurrence prediction through black-box models.
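
Performance-based Permutation Feature Importance can be sketched in a few lines: shuffle one feature column at a time and record how much a chosen score of the fitted model drops. The sketch below uses the F1 score and plain numpy arrays; it is a generic PFI implementation, not the authors' exact setup, and the variance-based FIRM measure is not shown.

```python
import numpy as np
from sklearn.metrics import f1_score

def permutation_feature_importance(model, X, y, n_repeats=10, seed=0):
    """Performance-based permutation feature importance (PFI).

    X: numpy array of shape (n_samples, n_features); model is already fitted.
    For each feature, its column is shuffled and the drop in F1 score is
    measured; larger average drops indicate more important features.
    """
    rng = np.random.default_rng(seed)
    baseline = f1_score(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Break the association between feature j and the outcome.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - f1_score(y, model.predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances
```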


Subjects
Biomedical Research, Transitional Cell Carcinoma, Urinary Bladder Neoplasms, Humans, Urinary Bladder Neoplasms/diagnosis, Urinary Bladder Neoplasms/epidemiology, Urinary Bladder, Retrospective Studies
6.
Evol Comput; 11(4): 381-415, 2003.
Article in English | MEDLINE | ID: mdl-14629864

ABSTRACT

This paper discusses how the use of redundant representations influences the performance of genetic and evolutionary algorithms. Representations are redundant if the number of genotypes exceeds the number of phenotypes. A distinction is made between synonymously and non-synonymously redundant representations. Representations are synonymously redundant if the genotypes that represent the same phenotype are very similar to each other. Non-synonymously redundant representations do not allow genetic operators to work properly and result in a lower performance of evolutionary search. When using synonymously redundant representations, the performance of selectorecombinative genetic algorithms (GAs) depends on the modification of the initial supply. We have developed theoretical models for synonymously redundant representations showing that both the necessary population size to solve a problem and the number of generations until convergence scale with O(2^{k_r}/r), where k_r is the order of redundancy and r is the number of genotypic building blocks (BBs) that represent the optimal phenotypic BB. As a result, uniformly redundant representations do not change the behavior of GAs. Only by increasing r, which means overrepresenting the optimal solution, does GA performance increase. Therefore, non-uniformly redundant representations can only be used advantageously if a priori information exists regarding the optimal solution. The validity of the proposed theoretical concepts is illustrated for the binary trivial voting mapping and the real-valued link-biased encoding. Our empirical investigations show that the developed population sizing and time-to-convergence models allow an accurate prediction of the empirical results.
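
The binary trivial voting mapping mentioned above can be sketched directly: each phenotypic bit is encoded by k_r genotypic bits and decoded by majority vote, and r counts how many of the 2^{k_r} genotypic blocks decode to the optimal phenotypic bit. The tie-breaking rule for even k_r below is an assumption made for the sketch, not taken from the paper.

```python
from itertools import product

def trivial_voting_decode(genotype, k_r):
    """Binary trivial voting mapping: every phenotypic bit is encoded by
    k_r genotypic bits and decoded by majority vote (ties decode to 0 here,
    an assumed simplification)."""
    phenotype = []
    for i in range(0, len(genotype), k_r):
        block = genotype[i:i + k_r]
        phenotype.append(1 if sum(block) > k_r / 2 else 0)
    return phenotype


def count_r(k_r, optimal_bit=1):
    """Number r of genotypic blocks of length k_r that decode to the optimal
    phenotypic bit. Uniform redundancy means r = 2**k_r / 2; overrepresenting
    the optimum (larger r) is what improves GA performance, matching the
    O(2^{k_r}/r) scaling from the abstract."""
    return sum(1 for block in product([0, 1], repeat=k_r)
               if trivial_voting_decode(list(block), k_r)[0] == optimal_bit)


print(count_r(k_r=3))  # 4 of the 8 blocks decode to 1 -> uniformly redundant
```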


Subjects
Algorithms, Biological Evolution, Computational Biology, Genetic Models, Genotype, Phenotype, Population Density, Time Factors
7.
Evol Comput; 10(1): 75-97, 2002.
Article in English | MEDLINE | ID: mdl-11911783

ABSTRACT

When using genetic and evolutionary algorithms for network design, choosing a good representation scheme for the construction of the genotype is important for algorithm performance. One of the most common representation schemes for networks is the characteristic vector representation. However, when encoding trees, crossover and mutation produce invalid individuals that are either under- or over-specified. When constructing the offspring or repairing the invalid individuals that do not represent a tree, it is impossible to distinguish the importance of the links that should be used. These problems can be overcome by transferring the concept of random keys from scheduling and ordering problems to the encoding of trees. This paper investigates the performance of a simple genetic algorithm (SGA) using network random keys (NetKeys) for the one-max tree and a real-world problem. The comparison between the network random keys and the characteristic vector encoding shows that despite the effects of stealth mutation, which favors the characteristic vector representation, selectorecombinative SGAs with NetKeys have some advantages for small and easy optimization problems. With more complex problems, SGAs with network random keys significantly outperform SGAs using characteristic vectors. This paper shows that random keys can be used for the encoding of trees, and that genetic algorithms using network random keys are able to solve complex tree problems much faster than when using the characteristic vector. Users should therefore be encouraged to use network random keys for the representation of trees.
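
Decoding a network random key vector can be sketched as a Kruskal-style construction: sort the links of the complete graph by their keys and insert them greedily, skipping links that would close a cycle. Because every key vector decodes to a valid tree, the under- and over-specification problems of the characteristic vector do not arise. The code below is an illustrative sketch under these assumptions, not the paper's implementation.

```python
import random

def enumerate_edges(n_nodes):
    """All possible links of the complete graph, in a fixed order; the key
    vector holds one real-valued key per link in this order."""
    for a in range(n_nodes):
        for b in range(a + 1, n_nodes):
            yield (a, b)


def decode_netkeys(keys, n_nodes):
    """Decode a network random key vector into a spanning tree.

    Links are considered in order of decreasing key and added unless they
    would close a cycle, so every key vector maps to a valid tree and no
    repair of invalid individuals is needed.
    """
    edges = list(enumerate_edges(n_nodes))
    order = sorted(range(len(edges)), key=lambda i: keys[i], reverse=True)

    parent = list(range(n_nodes))        # union-find to detect cycles
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for i in order:
        a, b = edges[i]
        root_a, root_b = find(a), find(b)
        if root_a != root_b:             # link connects two components
            parent[root_a] = root_b
            tree.append((a, b))
            if len(tree) == n_nodes - 1:
                break
    return tree


# Any random key vector decodes to a valid 5-node spanning tree.
keys = [random.random() for _ in range(5 * 4 // 2)]
print(decode_netkeys(keys, n_nodes=5))
```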


Subjects
Algorithms, Biological Evolution, Molecular Evolution, Genetic Models, Random Allocation, Research Design