Results 1 - 13 of 13
1.
Entropy (Basel); 24(5), 2022 May 13.
Article in English | MEDLINE | ID: mdl-35626569

ABSTRACT

Federated learning is a framework in which multiple devices or institutions, called local clients, collaboratively train a global model without sharing their data. In federated learning with a central server, an aggregation algorithm integrates the model information sent from local clients to update the parameters of the global model. The sample mean is the simplest and most commonly used aggregation method, but it is not robust to data with outliers or under the Byzantine problem, where Byzantine clients send malicious messages to interfere with the learning process. Several robust aggregation methods have been introduced in the literature, including the marginal median, the geometric median, and the trimmed mean. In this article, we propose an alternative robust aggregation method, named γ-mean, which is a minimum divergence estimator based on the robust density power divergence. The γ-mean aggregation mitigates the influence of Byzantine clients by assigning them smaller weights. This weighting scheme is data-driven and controlled by the γ value. Robustness is discussed from the viewpoint of the influence function, and numerical results are presented.
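The abstract does not spell out the estimator, but the γ-mean idea can be illustrated with an iteratively reweighted mean whose weights shrink exponentially with a client's squared distance from the current center, which is the form density-power-divergence estimation takes under a Gaussian working model. The sketch below is a hypothetical illustration of that weighting scheme, not the paper's exact algorithm; the function name gamma_mean and the fixed scale parameter are assumptions.

```python
# Hypothetical sketch of a gamma-mean-style robust aggregation.
import numpy as np

def gamma_mean(updates, gamma=0.5, n_iter=50, scale=1.0):
    """Iteratively reweighted mean: weights decay exponentially with the
    squared distance to the current center, so outlying (possibly Byzantine)
    client updates receive small weights. gamma=0 recovers the sample mean."""
    center = np.mean(updates, axis=0)                  # start from the plain average
    for _ in range(n_iter):
        d2 = np.sum((updates - center) ** 2, axis=1)
        w = np.exp(-gamma * d2 / (2.0 * scale ** 2))   # data-driven weights
        w /= w.sum()
        center = w @ updates                           # weighted aggregation
    return center, w

# Toy example: 8 honest clients near the truth, 2 Byzantine clients far away.
rng = np.random.default_rng(0)
honest = rng.normal(loc=1.0, scale=0.1, size=(8, 5))
byzantine = rng.normal(loc=10.0, scale=0.1, size=(2, 5))
updates = np.vstack([honest, byzantine])
center, w = gamma_mean(updates, gamma=0.5)
print(np.round(w, 3))      # Byzantine clients get near-zero weight
print(np.round(center, 2)) # close to the honest mean of 1.0
```

Setting gamma=0 makes all weights equal and recovers the plain sample mean, matching the abstract's description of γ as the knob that controls robustness.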

2.
Biometrics; 75(1): 245-255, 2019 Mar.
Article in English | MEDLINE | ID: mdl-30052272

ABSTRACT

Sufficient dimension reduction (SDR) continues to be an active field of research. When estimating the central subspace (CS), inverse-regression-based SDR methods involve solving a generalized eigenvalue problem, which can be problematic in the large-p-small-n situation. In recent years, new techniques, called randomized algorithms or random sketching, have emerged in numerical linear algebra for high-dimensional and large-scale problems. To overcome the large-p-small-n SDR problem, we combine the idea of statistical inference with random sketching to propose a new SDR method, called integrated random-partition SDR (iRP-SDR). Our method consists of three steps: (i) randomly partition the covariates into subsets to construct an envelope subspace of low dimension; (ii) obtain a sketch of the CS by applying a conventional SDR method within the constructed envelope subspace; (iii) repeat the above two steps many times and integrate the multiple sketches to form the final estimate of the CS. After describing the details of these steps, we establish the asymptotic properties of iRP-SDR. Unlike existing methods, iRP-SDR defers the determination of the structural dimension to the last stage, which makes it more adaptive to high-dimensional settings. The advantageous performance of iRP-SDR is demonstrated via simulation studies and a practical example analyzing EEG data.
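A schematic sketch of the three steps, assuming sliced inverse regression (SIR) as the "conventional SDR method" applied inside each random partition; the actual envelope construction and integration in iRP-SDR are more involved than this simplified aggregation of projection matrices, and all names and the toy model below are assumptions.

```python
# Schematic sketch of random-partition SDR with a basic SIR inside each subset.
import numpy as np

def sir_directions(X, y, n_slices=5, d=1):
    """Basic sliced inverse regression: eigenvectors of the covariance of
    slice means of the standardized predictors."""
    n, p = X.shape
    Xs = (X - X.mean(0)) / X.std(0)
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    M = np.zeros((p, p))
    for idx in slices:
        m = Xs[idx].mean(0)
        M += len(idx) / n * np.outer(m, m)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, -d:]                                     # top-d directions

def irp_sdr(X, y, d=1, n_rep=200, subset_size=5, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    P = np.zeros((p, p))
    for _ in range(n_rep):
        S = rng.choice(p, size=subset_size, replace=False)  # (i) random subset
        B = sir_directions(X[:, S], y, d=d)                 # (ii) SDR sketch
        full = np.zeros((p, d))
        full[S] = B                                         # embed in full space
        P += full @ full.T                                  # (iii) integrate
    vals, vecs = np.linalg.eigh(P / n_rep)
    return vecs[:, -d:]

# Toy check: y depends on X only through one linear combination.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
beta = np.zeros(20); beta[:3] = [1.0, -1.0, 0.5]
y = X @ beta + 0.2 * rng.normal(size=300)
b = irp_sdr(X, y).ravel()
print(np.round(b / np.linalg.norm(b), 2))  # concentrates on the first 3 coordinates
```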


Subject(s)
Electroencephalography/statistics & numerical data; Models, Theoretical; Alcoholism/pathology; Algorithms; Brain/drug effects; Computer Simulation; Humans; Machine Learning
3.
Biometrics; 74(1): 145-154, 2018 Mar.
Article in English | MEDLINE | ID: mdl-28493315

ABSTRACT

Logistic regression is among the most widely used statistical methods for linear discriminant analysis. In many applications, we only observe possibly mislabeled responses, and fitting a conventional logistic regression can lead to biased estimation. One common resolution is to fit a mislabel logistic regression model that explicitly accounts for mislabeled responses. Another is to adopt robust M-estimation, which down-weights suspected instances. In this work, we propose a new robust mislabel logistic regression based on γ-divergence. Our proposal possesses two advantageous features: (1) it does not need to model the mislabel probabilities; (2) the minimum γ-divergence estimation leads to a weighted estimating equation without the need for any bias-correction term, that is, it is automatically bias-corrected. These features make the proposed γ-logistic regression more robust in model fitting and more intuitive to interpret through its simple weighting scheme. Our method is also easy to implement, and two types of algorithms are provided. Simulation studies and an application to the Pima data demonstrate the performance of γ-logistic regression.
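The weighting idea can be sketched as follows: observations whose observed label the current fit finds implausible are down-weighted, and the model is refit. This is a simplified alternating scheme under an assumed weight form (the fitted probability of the observed label raised to the power γ), not the paper's estimating-equation algorithm; the function name and toy data are assumptions.

```python
# Hypothetical sketch of gamma-divergence-style robust logistic regression:
# observations the current fit finds implausible get down-weighted.
import numpy as np
from sklearn.linear_model import LogisticRegression

def gamma_logistic(X, y, gamma=0.5, n_iter=20):
    model = LogisticRegression(max_iter=1000)
    w = np.ones(len(y))
    for _ in range(n_iter):
        model.fit(X, y, sample_weight=w)
        p = model.predict_proba(X)           # fitted class probabilities
        p_obs = p[np.arange(len(y)), y]      # probability of the observed label
        w = p_obs ** gamma                   # suspected mislabels -> small weight
    return model, w

# Toy example with 10% flipped labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
flip = rng.choice(500, size=50, replace=False)
y[flip] ^= 1                                 # mislabel 50 responses
model, w = gamma_logistic(X, y)
clean = np.setdiff1d(np.arange(500), flip)
print("mean weight, clean vs flipped:",
      w[clean].mean().round(2), w[flip].mean().round(2))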


Subject(s)
Bias; Logistic Models; Probability; Algorithms; Classification; Computer Simulation; Humans
4.
Biometrics ; 72(1): 85-94, 2016 Mar.
Article in English | MEDLINE | ID: mdl-26288029

ABSTRACT

Finding an efficient and computationally feasible approach to deal with the curse of high dimensionality is a daunting challenge faced by modern biological science. The problem becomes even more severe when interactions are the research focus. To improve the performance of statistical analyses, we propose a sparse and low-rank (SLR) screening based on the combination of a low-rank interaction model and Lasso screening. SLR models the interaction effects using a low-rank matrix to achieve a parsimonious parametrization. The low-rank model increases the efficiency of statistical inference; hence, SLR screening can detect gene-gene interactions more accurately than conventional methods. We also discuss incorporating SLR screening into the Screen-and-Clean approach (Wasserman and Roeder, 2009; Wu et al., 2010), which suffers a smaller penalty from the Bonferroni correction and can assign p-values to the identified variables in a high-dimensional model. We apply the proposed screening procedure to the Warfarin dosage study and the CoLaus study. The results suggest that the new procedure can identify main and interaction effects that would have been omitted by conventional screening methods.
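A minimal sketch of the low-rank interaction idea in its rank-1 form: the p-by-p interaction matrix is written as an outer product ab^T, so the quadratic term x^T a b^T x = (x^T a)(x^T b) costs 2p parameters rather than p^2, and the model is linear in b when a is fixed (and vice versa). The alternating Lasso fit below is an illustrative heuristic under these assumptions, not the SLR screening procedure itself; all names are hypothetical.

```python
# Illustrative rank-1 low-rank interaction fit via alternating Lasso.
import numpy as np
from sklearn.linear_model import Lasso

def rank1_interaction_fit(X, y, alpha=0.05, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    a = rng.normal(size=p)                  # random start for one factor
    beta = np.zeros(p); b = np.zeros(p)
    for _ in range(n_iter):
        # With a fixed, y ~ X beta + (X a) * (X b) is linear in (beta, b).
        Z = np.hstack([X, (X @ a)[:, None] * X])
        coef = Lasso(alpha=alpha).fit(Z, y).coef_
        beta, b = coef[:p], coef[p:]
        # Symmetric step: refit a with b held fixed.
        Z = np.hstack([X, (X @ b)[:, None] * X])
        coef = Lasso(alpha=alpha).fit(Z, y).coef_
        beta, a = coef[:p], coef[p:]
    return beta, a, b

# Toy run: a main effect of variable 0 plus a 0-by-1 interaction.
# Alternating fits are heuristic; several random restarts may be needed.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))
y = X[:, 0] + 2 * X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=200)
beta, a, b = rank1_interaction_fit(X, y)
print(np.argwhere(np.abs(np.outer(a, b)) > 0.1))  # recovered interaction support
```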


Subject(s)
Algorithms; Data Interpretation, Statistical; High-Throughput Screening Assays/methods; Models, Statistical; Protein Interaction Mapping/methods; Regression Analysis; Computer Simulation; Pattern Recognition, Automated/methods; Reproducibility of Results; Sensitivity and Specificity
5.
Carbohydr Polym; 322: 121338, 2023 Dec 15.
Article in English | MEDLINE | ID: mdl-37839831

ABSTRACT

Machine learning (ML) has been used for many clinical decision-making processes and diagnostic procedures in bioinformatics applications. We examined eight algorithms, namely linear discriminant analysis (LDA), logistic regression (LR), k-nearest neighbor (KNN), random forest (RF), gradient boosting machine (GBM), support vector machine (SVM), naïve Bayes classifier (NB), and artificial neural network (ANN) models, to evaluate their ability to classify and predict four tissue types in Wolfiporia extensa from monosaccharide composition profiles. All eight ML-based models performed well, with AUC exceeding 0.8. Five models, namely LDA, KNN, RF, GBM, and ANN, performed excellently in the four-tissue-type classification (AUC > 0.9), and all eight models were good predictors in the three-tissue-type classification (AUC > 0.8). Notably, all eight ML-based methods outperformed the single linear discriminant analysis (LDA) plotting method. For large sample sizes, ML-based methods perform better than traditional regression techniques and could potentially increase the accuracy of identifying tissue samples of W. extensa.
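A sketch of this kind of model comparison with scikit-learn, scoring each classifier by cross-validated AUC. The synthetic features are a stand-in for the monosaccharide composition profiles, which are not reproduced here; the eight model classes match those named in the abstract.

```python
# Compare eight off-the-shelf classifiers by cross-validated AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           random_state=0)   # stand-in for sugar profiles
models = {
    "LDA": LinearDiscriminantAnalysis(),
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=0),
    "GBM": GradientBoostingClassifier(random_state=0),
    "SVM": SVC(probability=True),
    "NB": GaussianNB(),
    "ANN": MLPClassifier(max_iter=2000, random_state=0),
}
for name, m in models.items():
    auc = cross_val_score(m, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUC = {auc.mean():.2f} +/- {auc.std():.2f}")
```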


Subject(s)
Wolfiporia; Bayes Theorem; Machine Learning; Algorithms; Neural Networks, Computer
6.
BMC Genet; 12: 48, 2011 May 19.
Article in English | MEDLINE | ID: mdl-21592403

ABSTRACT

BACKGROUND: With the completion of the international HapMap project, many studies have investigated the association between complex diseases and haplotype variants. Such haplotype-based association studies, however, often face two difficulties: one is the large number of haplotype configurations in the chromosome region under study, and the other is the ambiguity in haplotype phase when only genotype data are observed. The latter complexity can be handled with an EM algorithm that incorporates family data, whereas the former can be more problematic, especially when haplotypes of rare frequencies are involved. Here, based on family data, we propose to cluster long haplotypes of linked SNPs in a biological sense, so that the number of haplotypes can be reduced and the power of statistical tests of association can be increased. RESULTS: In this paper we employ family genotype data and combine a clustering scheme with a likelihood ratio statistic to test the association between quantitative phenotypes and haplotype variants. Haplotypes are first grouped based on their evolutionary closeness to establish a set of core haplotypes. Then, for each family, we construct the transmission and non-transmission phases in terms of these core haplotypes, treating phase ambiguity as weights. The likelihood ratio test (LRT) is then conducted with these weighted and clustered haplotypes to test for association with disease. Via its core-coding system, this combination of evolution-guided haplotype clustering and weighted assignment in the LRT incorporates both haplotype phase ambiguity and transmission uncertainty into the analysis. Simulation studies show that the proposed procedure is more informative and powerful than three family-based association tests: FAMHAP, FBAT, and an LRT with a group consisting exclusively of rare haplotypes. CONCLUSIONS: The proposed procedure accounts for uncertainty in phase determination and in transmission, utilizes the evolutionary information contained in haplotypes, reduces the dimension of the haplotype space and the degrees of freedom of the tests, and performs better in association studies. This evolution-guided clustering procedure is particularly useful for long haplotypes containing linked SNPs, and is applicable to other haplotype-based association tests. The procedure is implemented in R and is free for download.
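The haplotype-grouping step can be sketched with ordinary hierarchical clustering on pairwise Hamming distance, a crude stand-in for the evolution-guided clustering used in the paper; the example haplotypes and the cutoff t=0.2 are arbitrary illustrations.

```python
# Sketch of the haplotype-grouping step via hierarchical clustering
# on pairwise Hamming distance (a stand-in for evolutionary closeness).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Rows = haplotypes over 8 linked SNPs (0/1 alleles).
haps = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],   # one mutation away from the first haplotype
    [1, 1, 1, 0, 0, 1, 1, 0],
    [1, 1, 1, 0, 0, 1, 1, 1],
    [1, 0, 1, 0, 1, 0, 1, 0],
])
d = pdist(haps, metric="hamming")          # pairwise allele mismatch rate
Z = linkage(d, method="average")
groups = fcluster(Z, t=0.2, criterion="distance")
print(groups)   # close haplotypes merge into a few core groups
```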


Subject(s)
Algorithms; Family; Genome-Wide Association Study/methods; Haplotypes; Cluster Analysis; Humans; Likelihood Functions
7.
Schizophr Res; 238: 10-19, 2021 Dec.
Article in English | MEDLINE | ID: mdl-34562833

ABSTRACT

Nonlinear dynamical analysis has been used to quantify the complexity of brain signals across temporal scales. Power law scaling is a well-validated method in physics for describing the dynamics of a system in the frequency domain, ranging from noisy oscillations to complex fluctuations. In this research, we investigated power-law characteristics in large-scale resting-state fMRI data from schizophrenia patients and healthy participants in the Taiwan Aging and Mental Illness cohort. We extracted the power spectral density (PSD) of the resting signal by Fourier transform. Power law scaling of the PSD was estimated as the slope of a regression line fitted to the logarithm of the PSD. A t-test was used to assess the statistical difference in power law scaling between schizophrenia patients and healthy participants. Significant differences in power law scaling were found in six brain regions. Compared with healthy participants, schizophrenia patients had significantly more positive power law scaling (i.e., more homogeneous frequency components) in four brain regions (left precuneus, left medial dorsal nucleus, right inferior frontal gyrus, and right middle temporal gyrus) and less positive power law scaling (i.e., more dominant lower-frequency components) in the bilateral putamen. Moreover, power law scaling correlated significantly with the severity of psychosis. These findings suggest that schizophrenia involves abnormal brain signal complexity linked to psychotic symptoms. Power law scaling, which represents the dynamical properties of the resting-state fMRI signal, may serve as a novel functional brain imaging marker for evaluating patients with mental illness.
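The slope estimate itself is straightforward to reproduce in outline: compute the PSD and fit a line to log(PSD) versus log(frequency). A toy sketch on a simulated 1/f²-like signal; the sampling rate and window length are assumptions, not the study's preprocessing.

```python
# Minimal sketch of the power-law estimate: slope of a line fitted to
# log(PSD) vs log(frequency) for a simulated BOLD-like signal.
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(0)
fs = 0.5                                  # Hz; assumed fMRI-like rate (TR = 2 s)
x = np.cumsum(rng.normal(size=1024))      # random walk: a toy 1/f^2 signal
f, psd = welch(x, fs=fs, nperseg=256)
mask = f > 0                              # drop the zero-frequency bin
slope, intercept = np.polyfit(np.log10(f[mask]), np.log10(psd[mask]), 1)
print(f"power-law exponent (slope): {slope:.2f}")   # near -2 for this toy signal
```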


Subject(s)
Schizophrenia; Brain/diagnostic imaging; Brain Mapping; Humans; Magnetic Resonance Imaging/methods; Rest; Schizophrenia/diagnostic imaging
8.
Radiol Imaging Cancer; 3(4): e210010, 2021 Jul.
Article in English | MEDLINE | ID: mdl-34241550

ABSTRACT

Purpose: To identify distinguishing CT radiomic features of pancreatic ductal adenocarcinoma (PDAC) and to investigate whether radiomic analysis with machine learning can distinguish between patients who have PDAC and those who do not. Materials and Methods: This retrospective study included contrast material-enhanced CT images obtained from 2012 to 2018 in 436 patients with PDAC and 479 healthy controls from Taiwan, randomly divided for training and testing. Another 100 patients with PDAC (enriched for small PDACs) and 100 controls from Taiwan (2004 to 2011) were identified for testing. An additional 182 patients with PDAC and 82 healthy controls from the United States were randomly divided for training and testing. Images were processed into patches. An XGBoost (https://xgboost.ai/) model was trained to classify patches as cancerous or noncancerous, and patients were classified as having or not having PDAC on the basis of the proportion of patches classified as cancerous. For both patch-based and patient-based classification, the models were characterized as either a local model (trained on Taiwanese data only) or a generalized model (trained on both Taiwanese and U.S. data). Sensitivity, specificity, and accuracy were calculated for patch- and patient-based analyses. Results: The median tumor size was 2.8 cm (interquartile range, 2.0-4.0 cm) in the 536 Taiwanese patients with PDAC (mean age, 65 years ± 12 [standard deviation]; 289 men). Compared with normal pancreas, PDACs had lower values for radiomic features reflecting intensity and higher values for radiomic features reflecting heterogeneity. When the generalized model was tested on the Taiwanese and U.S. test sets, respectively, sensitivity was 94.7% (177 of 187) and 80.6% (29 of 36); specificity, 95.4% (187 of 196) and 100% (16 of 16); accuracy, 95.0% (364 of 383) and 86.5% (45 of 52); and area under the curve, 0.98 and 0.91. Conclusion: Radiomic analysis with machine learning enabled accurate detection of PDAC at CT and could identify patients with PDAC. Keywords: CT, Computer Aided Diagnosis (CAD), Pancreas, Computer Applications-Detection/Diagnosis. Supplemental material is available for this article. © RSNA, 2021.
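The two-stage decision rule can be sketched as a patch-level XGBoost classifier followed by a patient-level call based on the fraction of patches predicted cancerous. The features, threshold, and patch grouping below are stand-ins; the paper's radiomic feature extraction and tuning are not reproduced.

```python
# Schematic sketch: patch-level XGBoost classifier, patient-level call by
# the proportion of patches predicted cancerous.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n_patch, n_feat = 2000, 30
X = rng.normal(size=(n_patch, n_feat))            # stand-in radiomic features
y = (X[:, 0] + 0.5 * X[:, 1]
     + rng.normal(scale=0.5, size=n_patch) > 0).astype(int)

clf = XGBClassifier(n_estimators=100, max_depth=3, eval_metric="logloss")
clf.fit(X[:1500], y[:1500])                       # train on patch labels

def classify_patient(patch_features, threshold=0.3):
    """Patient-level call from the fraction of patches predicted cancerous;
    the 0.3 threshold is an assumed illustration."""
    p = clf.predict_proba(patch_features)[:, 1]
    frac = np.mean(p > 0.5)
    return frac > threshold, frac

# Pretend these 50 held-out patches belong to one patient.
flag, frac = classify_patient(X[1500:1550])
print(flag, round(frac, 2))
```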


Subject(s)
Carcinoma, Pancreatic Ductal; Pancreatic Neoplasms; Aged; Humans; Male; Pancreas/diagnostic imaging; Pancreatic Neoplasms/diagnostic imaging; Retrospective Studies; Tomography, X-Ray Computed
9.
BMC Bioinformatics; 10: 44, 2009 Feb 03.
Article in English | MEDLINE | ID: mdl-19187562

ABSTRACT

BACKGROUND: Selecting influential genes from microarray data often faces the difficulty of a large number of genes measured on a relatively small group of subjects. In addition to the curse of dimensionality, many gene selection methods weight the contribution from each subject equally. This equal-contribution assumption cannot account for possible dependence among subjects who respond similarly to the disease, and it may restrict the selection of influential genes. RESULTS: We propose a novel approach to gene selection based on kernel similarities and kernel weights. We do not assume uniform subject contributions. Weights are calculated via regularized least squares support vector regression (RLS-SVR) of class labels on kernel similarities and are used to weight each subject's contribution. The cumulative sums of weighted expression levels are then ranked to select responsible genes. The procedure also works for multiclass classification. We demonstrate this algorithm on acute leukemia, colon cancer, small round blue cell tumors of childhood, breast cancer, and lung cancer studies, using kernel Fisher discriminant analysis and support vector machines as classifiers, and compare it with other procedures. CONCLUSION: This approach is easy to implement and fast to compute for both binary and multiclass problems. The gene set provided by the RLS-SVR weight-based approach contains fewer genes and achieves higher accuracy than other procedures.
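A rough sketch of the weighting idea, substituting plain kernel ridge regression (a close relative of RLS-SVR) for the paper's weight computation; the RBF kernel, the regularization λ = 1, and the gene-scoring rule below are assumptions.

```python
# Rough sketch: subject weights from a regularized least-squares fit of
# class labels on kernel similarities; genes ranked by absolute weighted
# expression sums.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
n, p = 60, 500
X = rng.normal(size=(n, p))                      # expression matrix (subjects x genes)
y = np.repeat([-1.0, 1.0], n // 2)
X[y > 0, :10] += 1.0                             # 10 truly informative genes

K = rbf_kernel(X, gamma=1.0 / p)                 # subject-by-subject similarities
lam = 1.0
alpha = np.linalg.solve(K + lam * np.eye(n), y)  # regularized LS dual weights
score = np.abs(alpha @ X)                        # weighted contribution per gene
top = np.argsort(score)[::-1][:10]
print(sorted(top))                               # mostly among the first 10 genes
```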


Subject(s)
Algorithms; Artificial Intelligence; Computational Biology/methods; Genes, Neoplasm; Neoplasms/genetics; Cluster Analysis; Gene Expression Profiling/methods; Humans; Least-Squares Analysis; Oligonucleotide Array Sequence Analysis/methods
10.
IEEE Trans Neural Netw; 18(1): 1-13, 2007 Jan.
Article in English | MEDLINE | ID: mdl-17278457

ABSTRACT

The reduced support vector machine (RSVM) was proposed for dealing with large data sets, with the practical objectives of overcoming computational difficulties and reducing model complexity. In this paper, we study the RSVM from the viewpoints of sampling design, robustness, and the spectral analysis of the reduced kernel. We consider the nonlinear separating surface as a mixture of kernels. Instead of a full model, the RSVM uses a reduced mixture with kernels sampled from a certain candidate set. Our main results center on two themes: the robustness of the random-subset mixture model, and the spectral analysis of the reduced kernel. Robustness is judged by three criteria: 1) a model variation measure; 2) the model bias (deviation) between the reduced model and the full model; and 3) the test power in distinguishing the reduced model from the full one. For the spectral analysis, we compare the eigenstructures of the full kernel matrix and the approximation kernel matrix, where the approximation kernels are generated by uniform random subsets. The small discrepancies between them indicate that the approximation kernels retain most of the information in the full kernel that is relevant for learning tasks. We focus on statistical theory for the reduced-set method, mainly in the context of the RSVM, but the use of a uniform random subset is not limited to the RSVM: it can act as a supplemental algorithm on top of a basic optimization algorithm, wherein the actual optimization takes place on the subset-approximated data, and the statistical properties discussed in this paper remain valid.
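The spectral comparison can be illustrated with a Nyström-style approximation built from a uniform random subset, a standard construction closely related to (though not identical with) the reduced kernel studied here; the subset size m = 50 and the toy data are arbitrary.

```python
# Sketch: leading eigenvalues of the full RBF kernel matrix versus a
# Nystrom-style approximation from a uniform random subset.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
m = 50                                       # size of the uniform random subset
idx = rng.choice(len(X), size=m, replace=False)

K = rbf_kernel(X, X)                         # full 500 x 500 kernel
C = rbf_kernel(X, X[idx])                    # 500 x m reduced kernel
W = C[idx]                                   # m x m subset block
K_approx = C @ np.linalg.pinv(W) @ C.T       # Nystrom approximation
K_approx = (K_approx + K_approx.T) / 2       # symmetrize against round-off

ev_full = np.sort(np.linalg.eigvalsh(K))[::-1][:5]
ev_appr = np.sort(np.linalg.eigvalsh(K_approx))[::-1][:5]
print(np.round(ev_full, 1))
print(np.round(ev_appr, 1))                  # leading eigenvalues are close
```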


Subject(s)
Algorithms; Artificial Intelligence; Computing Methodologies; Models, Statistical; Pattern Recognition, Automated/methods; Cluster Analysis; Computer Simulation
11.
Neural Comput ; 21(11): 3179-213, 2009 Nov.
Article in English | MEDLINE | ID: mdl-19686071

ABSTRACT

This letter discusses the robustness of kernel principal component analysis. A class of new robust procedures is proposed based on the eigenvalue decomposition of a weighted covariance. The proposed procedures place less weight on deviant patterns and are thus more resistant to data contamination and model deviation. Theoretical influence functions are derived, and numerical examples are presented. Both theoretical and numerical results indicate that the proposed robust method outperforms the conventional approach in being less sensitive to outliers. Our robust method and results also apply to functional principal component analysis.
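The weighted-covariance idea is easiest to see in the linear (non-kernel) case: down-weight points with large Mahalanobis distance, re-estimate the covariance, iterate, then take eigenvectors. The Huber-type weight and cutoff c below are assumed examples; the paper develops the kernelized procedure and its influence functions.

```python
# Illustrative linear-case sketch of robust PCA via an iteratively
# reweighted covariance: deviant points get smaller weights, so the
# leading eigenvectors are less outlier-sensitive.
import numpy as np

def robust_pca(X, n_iter=20, c=2.0):
    w = np.ones(len(X))
    for _ in range(n_iter):
        mu = (w[:, None] * X).sum(0) / w.sum()        # weighted mean
        Xc = X - mu
        S = (w[:, None] * Xc).T @ Xc / w.sum()        # weighted covariance
        d = np.sqrt(np.sum(Xc @ np.linalg.pinv(S) * Xc, axis=1))
        w = np.where(d < c, 1.0, c / d)               # Huber-type down-weighting
    vals, vecs = np.linalg.eigh(S)
    return vecs[:, ::-1], w                           # eigenvectors, final weights

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])
X[:10] += 20.0                                        # contaminate 5% of the points
vecs, w = robust_pca(X)
print(np.round(w[:10], 2))                            # outliers are down-weighted
```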


Subject(s)
Principal Component Analysis/methods; Algorithms; Humans; Linear Models; Models, Statistical; Reproducibility of Results
13.
Hum Reprod; 22(5): 1363-72, 2007 May.
Article in English | MEDLINE | ID: mdl-17234673

ABSTRACT

BACKGROUND: The maximal number of live births (k) per donor has usually been determined from cultural and social perspectives; it has rarely been decided on the basis of scientific evidence or discussed from a mathematical or probabilistic viewpoint. METHODS AND RESULTS: To recommend a value for k, we propose three criteria for evaluating its impact on consanguinity and disease incidence due to artificial insemination by donor (AID). The first approach optimizes k under the criterion of a fixed tolerable number of consanguineous matings due to AID. The second optimizes k under a fixed allowable average coefficient of inbreeding; this approach is particularly helpful when the impact on the public is of interest. The third criterion considers specific inherited diseases and is useful when evaluating an individual's risk of genetic disease; it can easily be adapted when different diseases are considered. All these derivations assume a shortage of gamete donors, with demand exceeding supply. CONCLUSIONS: Our results indicate that a strong degree of assortative mating, a small population size, and an insufficient supply of gamete donors lead to a greater risk of consanguinity. Recommendations under other settings are also tabulated for reference. A web site for calculating the limit on live births per donor is available.


Subject(s)
Consanguinity; Insemination, Artificial, Heterologous/statistics & numerical data; Cystic Fibrosis/epidemiology; Depressive Disorder/epidemiology; Female; Genetic Diseases, Inborn/prevention & control; Hemochromatosis/epidemiology; Humans; Insemination, Artificial, Heterologous/legislation & jurisprudence; Male; Pregnancy; Pregnancy Rate; Prevalence; Probability; Schizophrenia/epidemiology; Spinocerebellar Ataxias/epidemiology