ABSTRACT
To address the limitations of LiDAR dynamic target detection methods, which often require heuristic thresholding, indirect computational assistance, supplementary sensor data, or post-detection processing, we propose an innovative method based on multidimensional features. Using the differences in position and geometric structure between point cloud clusters scanned from the same target in adjacent frames, the motion states of the point cloud clusters are comprehensively evaluated. To enable automatic, precise pairing of point cloud clusters of the same target across adjacent frames, a double registration algorithm for point cloud cluster centroids is proposed. The iterative closest point (ICP) algorithm is employed for approximate interframe pose estimation during coarse registration. The random sample consensus (RANSAC) and four-parameter transformation algorithms are employed to obtain precise interframe pose relations during fine registration. These steps unify the coordinate systems of adjacent point clouds and facilitate the association of point cloud clusters belonging to the same target. Based on the paired point cloud clusters, a classification feature system is used to construct an XGBoost decision tree. To improve XGBoost training efficiency, a dimensionality reduction algorithm combining Spearman's rank correlation coefficient with bidirectional search is proposed to expedite construction of the optimal classification feature subset. After XGBoost generates preliminary results, a double Boyer-Moore voting-sliding window algorithm is proposed to refine the final LiDAR dynamic target detection accuracy. To validate the efficacy and efficiency of our method, an experimental platform is established, real-world data are collected, and pertinent experiments are designed. The experimental results confirm the soundness of our method: the dynamic target correct detection rate is 92.41%, the static target false detection rate is 1.43%, and the average detection time is 0.0299 s. Our method exhibits notable advantages over open-source comparative methods, achieving highly efficient and precise LiDAR dynamic target detection.
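A minimal sketch of the feature-selection and classification stage described above, assuming illustrative feature names, thresholds, and hyperparameters: features are ranked by absolute Spearman correlation with the motion label, redundant ones are greedily dropped, and the survivors feed an XGBoost classifier. The bidirectional-search refinement and the Boyer-Moore voting step are not reproduced here.

```python
import numpy as np
from scipy.stats import spearmanr
from xgboost import XGBClassifier

def spearman_filter(X, y, redundancy_thresh=0.9):
    """Rank features by |Spearman rho| with the motion label, then greedily
    drop any feature highly correlated with an already-kept one."""
    rho = np.array([abs(spearmanr(X[:, j], y).correlation) for j in range(X.shape[1])])
    kept = []
    for j in np.argsort(-rho):
        if all(abs(spearmanr(X[:, j], X[:, k]).correlation) < redundancy_thresh
               for k in kept):
            kept.append(j)
    return kept

# X: per-pair cluster features across adjacent frames (hypothetical examples:
# centroid shift, bounding-box change, point-count ratio); y: 1 dynamic, 0 static.
rng = np.random.default_rng(0)
X = rng.random((200, 12))
y = rng.integers(0, 2, 200)

cols = spearman_filter(X, y)
clf = XGBClassifier(n_estimators=100, max_depth=4)
clf.fit(X[:, cols], y)
motion_labels = clf.predict(X[:, cols])
```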
ABSTRACT
The two-stage feature screening method for linear models applies dimension reduction at the first stage to screen out nuisance features and dramatically reduce the dimension to a moderate size; at the second stage, penalized methods such as LASSO and SCAD can be applied for feature selection. Most subsequent work on sure independence screening has focused mainly on the linear model. This motivates us to extend independence screening to generalized linear models, and in particular to binary responses, by using the point-biserial correlation. We develop a two-stage feature screening method called point-biserial sure independence screening (PB-SIS) for high-dimensional generalized linear models, aiming for high selection accuracy and low computational cost. We demonstrate that PB-SIS is a highly efficient feature screening method and that it possesses the sure independence property under certain regularity conditions. A set of simulation studies confirms the sure independence property and the accuracy and efficiency of PB-SIS. Finally, we apply PB-SIS to a real data example to show its effectiveness.
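A minimal sketch of the two-stage PB-SIS idea, with the screening size d = n/log(n) and the penalty strength as illustrative assumptions: rank features by absolute point-biserial correlation with the binary response, keep the top d, then run an L1-penalized logistic regression on the survivors.

```python
import numpy as np
from scipy.stats import pointbiserialr
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 5000
X = rng.standard_normal((n, p))
y = (X[:, 0] - X[:, 1] + rng.standard_normal(n) > 0).astype(int)

# Stage 1: marginal screening by |point-biserial correlation|.
scores = np.array([abs(pointbiserialr(y, X[:, j]).correlation) for j in range(p)])
d = int(n / np.log(n))                 # common screening size; an assumption here
keep = np.argsort(-scores)[:d]

# Stage 2: sparse selection (LASSO-type) on the screened submatrix.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X[:, keep], y)
selected = keep[np.flatnonzero(lasso.coef_.ravel())]
```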
ABSTRACT
The Fine-Gray proportional subdistribution hazards (PSH) model is among the most popular regression models for competing-risks time-to-event data. This article develops a fast safe feature elimination method, named PSH-SAFE, for fitting the penalized Fine-Gray PSH model with a Lasso (or adaptive Lasso) penalty. Our PSH-SAFE procedure is straightforward to implement, fast, and scales well to ultrahigh dimensional data. We also show that, as a feature screening procedure, PSH-SAFE is safe in the sense that the eliminated features are guaranteed to be inactive in the original Lasso (or adaptive Lasso) estimator for the penalized PSH model. We evaluate the performance of the PSH-SAFE procedure in terms of computational efficiency, screening efficiency and safety, run-time, and prediction accuracy on multiple simulated datasets and a real bladder cancer dataset. Our empirical results show that the PSH-SAFE procedure possesses the desired screening efficiency and safety properties and can offer substantially improved computational efficiency as well as similar or better prediction performance in comparison to baseline competitors.
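PSH-SAFE's elimination rule is specific to the Fine-Gray partial likelihood and is not reproduced here; as a flavor of what "safe" screening means, the sketch below applies the classical SAFE rule for the plain linear Lasso, under which any eliminated feature provably has a zero coefficient at the given penalty level, so discarding it cannot change the solution.

```python
import numpy as np

def safe_keep(X, y, lam):
    """Classical SAFE test for the linear Lasso: features failing the bound
    are guaranteed inactive at penalty level lam and can be safely discarded."""
    corr = np.abs(X.T @ y)            # |x_j' y| for every feature j
    lam_max = corr.max()              # smallest lam giving an all-zero fit
    bound = lam - np.linalg.norm(X, axis=0) * np.linalg.norm(y) \
                  * (lam_max - lam) / lam_max
    return np.flatnonzero(corr >= bound)

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2000))
y = X[:, 0] + 0.1 * rng.standard_normal(100)
lam = 0.9 * np.abs(X.T @ y).max()     # near lam_max, where the rule bites
survivors = safe_keep(X, y, lam)      # indices that must be kept
```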
Subjects
Urinary Bladder Neoplasms, Humans, Mass Screening, Proportional Hazards Models, Research, Urinary Bladder Neoplasms/diagnosis
ABSTRACT
To reduce the economic losses caused by bearing failures and prevent safety accidents, it is necessary to develop an effective method to predict the remaining useful life (RUL) of rolling bearings. However, the degradation inside a bearing is difficult to monitor in real time, and external uncertainties significantly affect bearing degradation. Therefore, this paper proposes a new bearing RUL prediction method based on long short-term memory (LSTM) with uncertainty quantification. First, a fusion metric related to runtime (or degradation) is proposed to reflect the latent degradation process. Then, an improved dropout method based on nonparametric kernel density estimation is developed to improve the accuracy of RUL estimation. The PHM2012 dataset is adopted to verify the proposed method, and comparison results illustrate that the proposed prediction model can accurately obtain both the point estimate and the probability distribution of the bearing RUL.
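A sketch of the uncertainty-quantification idea using standard Monte Carlo dropout, with a plain Gaussian KDE standing in for the paper's improved nonparametric kernel density step; the architecture sizes and input features are assumptions.

```python
import torch
import torch.nn as nn
from scipy.stats import gaussian_kde

class RULNet(nn.Module):
    def __init__(self, n_feat=4, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_feat, hidden, num_layers=2,
                            dropout=0.3, batch_first=True)
        self.drop = nn.Dropout(0.3)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)                     # (batch, time, hidden)
        return self.head(self.drop(out[:, -1]))  # RUL from the last time step

model = RULNet()
model.train()                        # deliberately keep dropout active at inference
x = torch.randn(1, 50, 4)            # one monitored degradation sequence
with torch.no_grad():
    samples = torch.cat([model(x) for _ in range(200)]).squeeze(1).numpy()

point_estimate = samples.mean()      # point RUL estimate
density = gaussian_kde(samples)      # nonparametric estimate of the RUL distribution
```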
Subjects
Neural Networks (Computer), Probability, Uncertainty
ABSTRACT
Network analysis has drawn great attention in recent years and is applied across a wide range of disciplines, including but not limited to social science, finance, and genetics. In practice, one typically collects abundant covariates along with the response variable. Since the network structure makes the responses at different nodes no longer independent, existing screening methods may not perform well for network data. We propose a network-based sure independence screening (NW-SIS) method that explicitly takes the network structure into consideration. The strong screening consistency property of NW-SIS is rigorously established. We further investigate the estimation of the network effect and establish the √n-consistency of the estimator. The finite sample performance of the proposed method is assessed by simulation studies and illustrated by an empirical analysis of a dataset from the Chinese stock market.
ABSTRACT
To solve current problems in medical equipment maintenance, this study proposes an intelligent fault diagnosis method for medical equipment based on a long short-term memory (LSTM) network. First, with no circuit drawings available and the circuit board signal flow unknown, the symptom phenomena and port electrical signals of seven different fault categories were collected and preprocessed by feature coding, normalization, fusion, and screening. Then, an intelligent fault diagnosis model was built based on LSTM, and the fused and screened multimodal features were used to carry out fault diagnosis classification and identification experiments. The results were compared with those obtained using the port electrical signals alone, the symptom phenomena alone, and the fusion of the two. In addition, the fault diagnosis algorithm was compared with a BP neural network (BPNN), a recurrent neural network (RNN), and a convolutional neural network (CNN). The results show that, based on the fused and screened multimodal features, the average classification accuracy of the LSTM model reaches 0.9709, higher than that achieved using the port electrical signals alone, the symptom phenomena alone, or the fusion of the two, and also higher than that of BPNN, RNN, and CNN. This provides a feasible new approach for intelligent fault diagnosis of similar equipment.
Subjects
Short-Term Memory, Neural Networks (Computer), Algorithms, Electricity
ABSTRACT
BACKGROUND: Feature screening plays a critical role in ultrahigh dimensional data analyses, where the number of features exponentially exceeds the number of observations. It is increasingly common in biomedical research to have a case-control (binary) response and extremely large-scale categorical features, yet approaches that accommodate such data types are limited in the extant literature. In this article, we propose a new feature screening approach based on the iterative trend correlation (ITC-SIS, for short) to detect important susceptibility loci associated with polycystic ovary syndrome (PCOS) affection status by screening 731,442 SNP features collected from genome-wide association studies. RESULTS: We prove that the trend-correlation-based screening approach satisfies the theoretical strong screening consistency property under a set of reasonable conditions, which provides appealing theoretical support for its performance. We demonstrate through various simulation designs that the finite sample performance of ITC-SIS is accurate and fast. CONCLUSION: ITC-SIS serves as a good alternative method for detecting disease susceptibility loci in clinical genomic data.
Assuntos
Genetic Predisposition to Disease, Polycystic Ovary Syndrome/diagnosis, Polycystic Ovary Syndrome/genetics, Case-Control Studies, Female, Genome, Genome-Wide Association Study/methods, Humans
ABSTRACT
Genome-wide association studies (GWAS) have become an essential technology for exploring the genetic mechanisms of complex traits. To reduce the computational burden, it is well accepted to remove unrelated single nucleotide polymorphisms (SNPs) before GWAS, e.g., by using the iterative sure independence screening expectation-maximization Bayesian Lasso (ISIS EM-BLASSO) method. In this work, a modified version of ISIS EM-BLASSO is proposed, which first reduces the number of SNPs by a screening methodology based on Pearson correlation and mutual information, then estimates the effects via EM-Bayesian Lasso (EM-BLASSO), and finally detects the true quantitative trait nucleotides (QTNs) through a likelihood ratio test. We call our method two-stage mutual-information-based Bayesian Lasso (MBLASSO). Under three simulation scenarios, MBLASSO improves statistical power and retains higher effect estimation accuracy compared with three other algorithms. Moreover, MBLASSO performs best in model fitting, achieves the highest accuracy of detected associations, and detects 21 genes in Arabidopsis thaliana datasets that no other method identifies.
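A sketch of the first (screening) stage only, under an assumed union rule and illustrative thresholds; the EM-Bayesian Lasso and likelihood ratio test stages are not reproduced.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(2)
n, p = 150, 2000
snps = rng.integers(0, 3, size=(n, p)).astype(float)    # 0/1/2 genotype codes
trait = snps[:, 0] - 0.8 * snps[:, 1] + rng.standard_normal(n)

# Absolute Pearson correlation of every SNP with the trait, computed directly.
Xc = snps - snps.mean(axis=0)
yc = trait - trait.mean()
pearson = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# Nonparametric per-SNP mutual information with the trait.
mi = mutual_info_regression(snps, trait)

# Keep a SNP if either screen flags it (thresholds are assumptions).
keep = np.flatnonzero((pearson > 0.2) | (mi > 0.05))
```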
ABSTRACT
In this study, we propose a novel model-free feature screening method, called weighted mean squared deviation (WMSD), for ultrahigh dimensional binary features in binary classification. Compared with the chi-square statistic and mutual information, WMSD gives binary features with occurrence probabilities near 0.5 a better chance of being selected. In addition, the asymptotic properties of the proposed method are theoretically investigated under the assumption log p = o(n). In practice, the number of retained features is selected by a Pearson correlation coefficient method that exploits the power-law property of the feature distribution. Lastly, an empirical study of Chinese text classification illustrates that the proposed method performs well when the dimension of selected features is relatively small.
ABSTRACT
This project proposes an approach to identify significant single nucleotide polymorphism (SNP) effects, both additive and dominant, on the dynamic growth of poplar in diameter and height. The annual changes in yearly phenotypes over regular observation periods are treated as multiple responses. In total, 156,362 candidate SNPs are studied, with phenotypes recorded for 64 poplar trees. To address this ultrahigh dimensionality, the paper adopts a two-stage approach: first, conventional genome-wide association studies (GWAS) and the distance correlation sure independence screening (DC-SIS) method (Li et al., 2012) are combined to reduce the model dimension to the order of the sample size; second, grouped penalized regression is applied to further refine the model and choose the final sparse set of SNPs. The multiple-response structure is also carefully addressed. The SNP effects on the dynamic diameter and height growth patterns of poplar are systematically analyzed, and a series of intensive simulation studies is performed to validate the proposed approach.
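A sketch of the stage-one DC-SIS step: each SNP is ranked by its sample distance correlation with the multivariate growth response, and the top-ranked SNPs are retained. The implementation below is the standard biased-sample distance correlation estimator; the toy data and cutoff are illustrative.

```python
import numpy as np

def dist_corr(x, y):
    """Sample distance correlation between 1-D x and 1-D or 2-D y."""
    def centered(a):
        d = (np.abs(a[:, None] - a[None, :]) if a.ndim == 1
             else np.linalg.norm(a[:, None, :] - a[None, :, :], axis=2))
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()
    A, B = centered(x), centered(y)
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

rng = np.random.default_rng(3)
n, p = 64, 5000
snps = rng.integers(0, 3, size=(n, p)).astype(float)
growth = np.column_stack([snps[:, 0] + rng.standard_normal(n),        # diameter
                          snps[:, 0] ** 2 + rng.standard_normal(n)])  # height

scores = np.array([dist_corr(snps[:, j], growth) for j in range(p)])
top = np.argsort(-scores)[:n]   # reduce the dimension to the order of the sample size
```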
Subjects
Genome-Wide Association Study, Single Nucleotide Polymorphism/genetics, Populus/genetics, Genetic Models, Statistical Models, Phenotype, Populus/growth & development
ABSTRACT
One of the main roles of omics-based association studies with high-throughput technologies is to screen relevant molecular features, such as genetic variants, genes, and proteins, from a large pool of candidates based on their associations with the phenotype of interest. Typically, screened features are subject to validation studies using more established or conventional assays, where the number of evaluable features is relatively limited, so there may exist a fixed number of features measurable by these assays. This limitation necessitates narrowing a feature set down to a fixed size following an initial screening analysis via multiple testing with adjustment for multiplicity. We propose a two-stage screening approach that controls the false discovery rate (FDR) for a fixed-size feature set subject to validation studies, rather than for the feature set from the initial screening analysis. Out of the feature set selected at the first stage under a relaxed FDR level, a fraction of features with the greatest statistical significance is selected first. From the remaining feature set, features are selected based on biological consideration only, without regard to any statistical information, which allows the FDR level to be evaluated for the finally selected fixed-size feature set. We discuss the improvement in power offered by the proposed two-stage approach. Simulation experiments based on parametric models and real microarray datasets demonstrated a substantial increase in the number of features screened for biological consideration compared with the standard screening approach, allowing for more extensive and in-depth biological investigations in omics association studies.
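A minimal sketch of the two-stage selection under assumed values for the relaxed FDR level, the panel size, and the statistical fraction; the biology-driven picks are randomized stand-ins for expert choices.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(4)
pvals = np.concatenate([rng.uniform(0, 1e-4, 50),   # 50 truly associated features
                        rng.uniform(0, 1, 5000)])   # null features

# Stage 1: Benjamini-Hochberg at a relaxed FDR level (q1 = 0.20, an assumption).
reject = multipletests(pvals, alpha=0.20, method="fdr_bh")[0]
survivors = np.flatnonzero(reject)

# Stage 2: fixed validation panel of m = 40; half by significance, half by biology.
m, m1 = 40, 20
by_stat = survivors[np.argsort(pvals[survivors])][:m1]
pool = np.setdiff1d(survivors, by_stat)
by_biology = rng.choice(pool, size=m - m1, replace=False)  # stand-in for expert picks
panel = np.concatenate([by_stat, by_biology])
```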
Subjects
Biometry/methods, False-Positive Reactions, Algorithms, Computer Simulation, Statistical Data Interpretation, Early Detection of Cancer, Genetic Testing, Humans, Microarray Analysis, Genetic Models, Phenotype
ABSTRACT
BACKGROUND: Although the dimension of the entire genome can be extremely large, only a parsimonious set of influential SNPs is correlated with a particular complex trait and important for predicting it. Efficiently and accurately selecting these influential SNPs from millions of candidates is in high demand but poses challenges. We propose a backward elimination iterative distance correlation (BE-IDC) procedure to select the smallest subset of SNPs that guarantees sufficient prediction accuracy, while also resolving the unclear threshold issue of traditional feature screening approaches. RESULTS: Across six simulation settings, the adaptive threshold estimated by BE-IDC performed uniformly better than the fixed-threshold methods used in the current literature. We also applied BE-IDC to an Arabidopsis thaliana genome-wide dataset: out of 216,130 SNPs, BE-IDC selected four influential SNPs and recovered the same FRIGIDA gene reported by two other traditional methods. CONCLUSIONS: BE-IDC achieves both the prediction accuracy and the computational speed that genomic selection demands.
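A sketch of the backward-elimination loop, using the third-party dcor package for distance correlation; the stopping tolerance below is an illustrative stand-in for the paper's adaptive threshold.

```python
import numpy as np
import dcor

rng = np.random.default_rng(5)
n = 100
X = rng.integers(0, 3, size=(n, 30)).astype(float)   # an already-screened SNP set
y = X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(n)

active = list(range(X.shape[1]))
base = dcor.distance_correlation(X[:, active], y)
while len(active) > 1:
    # Cost of removing each SNP: drop in distance correlation with the trait.
    losses = [base - dcor.distance_correlation(np.delete(X[:, active], i, axis=1), y)
              for i in range(len(active))]
    i_min = int(np.argmin(losses))
    if losses[i_min] > 0.01:      # every removal now costs too much: stop
        break
    del active[i_min]             # eliminate the least informative SNP
    base = dcor.distance_correlation(X[:, active], y)
```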
Subjects
Arabidopsis/genetics, Genetic Models, Single Nucleotide Polymorphism, Arabidopsis Proteins/genetics, Computer Simulation, Plant Genome, Genome-Wide Association Study, Genomics, Phenotype, Plant Breeding
ABSTRACT
In this article, we study the problem of testing the mean vectors of high dimensional data in both the one-sample and two-sample cases. The proposed testing procedures employ maximum-type statistics and parametric bootstrap techniques to compute the critical values. Unlike existing tests that rely heavily on structural conditions on the unknown covariance matrices, the proposed tests allow general covariance structures and therefore enjoy a wide scope of applicability in practice. To enhance the power of the tests against sparse alternatives, we further propose two-step procedures with a preliminary feature screening step. Theoretical properties of the proposed tests are investigated. Through extensive numerical experiments on synthetic datasets and a human acute lymphoblastic leukemia gene expression dataset, we illustrate the performance of the new tests and how they may assist in detecting disease-associated gene sets. The proposed methods have been implemented in the R package HDtest, available on CRAN.
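A sketch of the one-sample max-type test with a parametric bootstrap calibrated from the sample covariance; the dimensions are kept small here so the empirical covariance is well conditioned, which is an illustrative simplification.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, B = 100, 40, 500
X = rng.standard_normal((n, p))
X[:, 0] += 0.6                                   # one shifted mean component

xbar, s = X.mean(axis=0), X.std(axis=0, ddof=1)
T_obs = np.max(np.sqrt(n) * np.abs(xbar) / s)    # maximum-type test statistic

# Parametric bootstrap: resample from N(0, Sigma_hat) and recompute the maximum.
Sigma = np.cov(X, rowvar=False)
boot = np.empty(B)
for b in range(B):
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    boot[b] = np.max(np.sqrt(n) * np.abs(Z.mean(axis=0)) / Z.std(axis=0, ddof=1))

pval = (1 + np.sum(boot >= T_obs)) / (1 + B)
```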
Subjects
Computer Simulation, Genetic Association Studies, Statistical Data Interpretation, Gene Expression, Genetic Association Studies/statistics & numerical data, Humans, Precursor Cell Lymphoblastic Leukemia-Lymphoma/genetics
ABSTRACT
Differentiating breast cancer subtypes based on miRNA data helps doctors provide more personalized treatment plans for patients. This paper explores the interaction between miRNA pairs and develops a novel ensemble regularized polynomial logistic regression method for screening nonlinear features of breast cancer. Three different types of second-order polynomial logistic regression with an elastic net penalty (SOPLR-EN), each containing 10 identical models, were integrated to determine the most suitable sample set for feature screening using a bootstrap sampling strategy. One single feature and 39 nonlinear features were obtained by retaining features that appeared at least 15 times across the 30 integrated models and were involved in the classification of at least four subtypes. A second-order polynomial logistic regression with a ridge penalty (SOPLR-R) built on the screened feature set achieved 82.30% classification accuracy in distinguishing breast cancer subtypes, surpassing the performance of six other methods. Furthermore, 11 nonlinear miRNA biomarkers were identified, and their significant relevance to breast cancer was illustrated through six types of biological analysis.
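A sketch of the SOPLR-EN base learner and the stability count across bootstrap fits; the hyperparameters and toy data are assumptions, and the three SOPLR-EN variants are collapsed into a single loop for brevity.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n, p, n_boot = 300, 15, 30
X = rng.standard_normal((n, p))                  # miRNA expression (toy)
y = rng.integers(0, 4, n)                        # four breast cancer subtypes

poly = PolynomialFeatures(degree=2, include_bias=False)
Xp = poly.fit_transform(X)                       # main effects + miRNA-pair terms

counts = np.zeros(Xp.shape[1])
for _ in range(n_boot):
    idx = rng.integers(0, n, n)                  # bootstrap resample
    en = LogisticRegression(penalty="elasticnet", solver="saga",
                            l1_ratio=0.5, C=0.5, max_iter=3000)
    en.fit(Xp[idx], y[idx])
    counts += (np.abs(en.coef_) > 1e-8).any(axis=0)   # nonzero in any subtype

screened = np.flatnonzero(counts >= 15)          # features stable across >= 15 fits
```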
Subjects
Tumor Biomarkers, Breast Neoplasms, MicroRNAs, Humans, Breast Neoplasms/genetics, MicroRNAs/genetics, Female, Logistic Models, Tumor Biomarkers/genetics, Algorithms, Neoplastic Gene Expression Regulation, Computational Biology/methods, Gene Expression Profiling/methods
ABSTRACT
Recent advances in sequencing technologies have allowed the collection of massive genome-wide information that substantially enhances the diagnosis and prognosis of head and neck cancer. Identifying predictive markers for survival time is crucial for devising prognostic systems and learning the underlying molecular drivers of the cancer course. In this paper, we introduce α-KIDS, a model-free feature screening procedure with false discovery rate (FDR) control for ultrahigh dimensional right-censored data that is robust against unknown censoring mechanisms. Specifically, our two-stage procedure initially selects a set of important features with a dual screening mechanism using nonparametric reproducing-kernel-based ANOVA statistics, and then identifies a refined set of features under directional FDR control through a unified knockoff procedure. The finite-sample properties of our method and its novelty in light of existing alternatives are evaluated via simulation studies. Furthermore, we illustrate our methodology by application to motivating right-censored head and neck (HN) cancer survival data derived from The Cancer Genome Atlas, with further validation on similar HN cancer data from the Gene Expression Omnibus database. The methodology can be implemented via the R package DSFDRC, available on GitHub.
ABSTRACT
In single-cell RNA sequencing (scRNA-seq) studies, cell types and their marker genes are often identified by clustering and differentially expressed gene (DEG) analysis. A common practice is to select genes using surrogate criteria such as variance and deviance, cluster cells using the selected genes, and then detect markers by DEG analysis assuming known cell types. The surrogate criteria can miss important genes or select unimportant ones, while DEG analysis suffers from a selection-bias problem. We present Festem, a statistical method for the direct selection of cell-type markers for downstream clustering. Festem distinguishes marker genes whose distributions are heterogeneous across cells and therefore informative for clustering. Simulations and scRNA-seq applications demonstrate that Festem sensitively selects markers with high precision and enables the identification of cell types often missed by other methods. In a large intrahepatic cholangiocarcinoma dataset, we identify diverse CD8+ T cell types and potential prognostic marker genes.
Subjects
Single-Cell Analysis, Single-Cell Analysis/methods, Humans, Cluster Analysis, Gene Expression Profiling/methods, RNA Sequence Analysis/methods, Tumor Biomarkers/genetics, Tumor Biomarkers/metabolism, CD8-Positive T Lymphocytes/metabolism, Cholangiocarcinoma/genetics, Cholangiocarcinoma/pathology, Genetic Markers/genetics
ABSTRACT
Deep convolutional neural networks (CNNs) have made great progress in medical image classification. However, they struggle to establish effective spatial associations and tend to extract similar low-level features, resulting in information redundancy. To overcome these limitations, we propose a stereo spatial decoupling network (TSDNets), which can leverage the multi-dimensional spatial details of medical images. We use an attention mechanism to progressively extract the most discriminative features along three directions: horizontal, vertical, and depth. Moreover, a cross feature screening strategy is used to divide the original feature maps into three levels: important, secondary, and redundant. Specifically, we design a cross feature screening module (CFSM) and a semantic guided decoupling module (SGDM) to model multi-dimensional spatial relationships, thereby enhancing the feature representation capabilities. Extensive experiments conducted on multiple open-source baseline datasets demonstrate that our TSDNets outperforms previous state-of-the-art models.
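A generic reconstruction, not the authors' exact blocks, of what direction-wise attention over a feature volume can look like; the shapes and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class DirectionalAttention(nn.Module):
    """Progressively reweight a 3-D feature volume with attention maps
    normalized along the depth, vertical, and horizontal axes in turn."""
    def __init__(self, channels):
        super().__init__()
        # one 1x1x1 conv per direction to produce attention logits
        self.convs = nn.ModuleList([nn.Conv3d(channels, channels, 1)
                                    for _ in range(3)])

    def forward(self, x):                              # x: (B, C, D, H, W)
        for dim, conv in zip((2, 3, 4), self.convs):   # depth, vertical, horizontal
            attn = torch.softmax(conv(x), dim=dim)     # normalize along one axis
            x = x * attn                               # progressive reweighting
        return x

feat = torch.randn(2, 16, 8, 32, 32)    # e.g. features from a volumetric scan
out = DirectionalAttention(16)(feat)
```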
ABSTRACT
It is important to quantify differences in the returns to skills using online job advertisement data, which have attracted great interest in both labor economics and statistics. In this paper, we study the relationship between the posted salary and the job requirements in online labor markets. There are two challenges. First, the posted salary is always presented in interval-valued form, for example, 5k-10k yuan per month; simply taking the midpoint or the lower bound as a proxy for salary may result in biased estimators. Second, the number of potential skill words generated from the job advertisements by word segmentation is very large, and many of them may not contribute to the salary. To this end, we propose a new feature screening method, Absolute Distribution Difference Sure Independence Screening (ADD-SIS), to select important skill words for the interval-valued response. The marginal utility for feature screening is based on differences between distribution functions estimated by nonparametric maximum likelihood, which fully exploits the interval information. The method is model-free and robust to outliers. Numerical simulations show that the new method, by using the interval information, selects important predictors more efficiently than methods based only on single points of the intervals. In a real data application, we study the text of job advertisements for data scientists and data analysts on a major Chinese online job posting website and explore the skill words important for salary. We find that skill words such as optimization, long short-term memory (LSTM), convolutional neural networks (CNN), and collaborative filtering are positively correlated with salary, while words such as Excel, Office, and data collection may contribute negatively.
ABSTRACT
In the era of precision medicine, time-to-event outcomes such as time to death or progression are routinely collected, along with high-throughput covariates. These high-dimensional data defy classical survival regression models, which are either infeasible to fit or likely to incur low predictability due to overfitting. To overcome this, recent emphasis has been placed on developing novel approaches for feature selection and survival prognostication. We review various cutting-edge methods that handle survival outcome data with high-dimensional predictors, highlighting recent innovations in machine learning approaches to survival prediction. We cover the statistical intuitions and principles behind these methods and conclude with extensions to more complex settings, where competing events are observed. We exemplify these methods with applications to the Boston Lung Cancer Survival Cohort study, one of the largest cancer epidemiology cohorts investigating the complex mechanisms of lung cancer.
ABSTRACT
Obesity is a highly heritable condition that affects increasing numbers of adults and, concerningly, of children. However, only a small fraction of its heritability has been attributed to specific genetic variants. These variants are traditionally ascertained from genome-wide association studies (GWAS), which utilize samples of tens or hundreds of thousands of individuals for whom a single summary measurement (e.g., BMI) is collected. An alternative approach is to focus on a smaller, more deeply characterized sample in conjunction with advanced statistical models that leverage longitudinal phenotypes. Here, novel functional data analysis (FDA) techniques are used to capitalize on longitudinal growth information from a cohort of children between birth and three years of age. In an ultrahigh dimensional setting, hundreds of thousands of single nucleotide polymorphisms (SNPs) are screened, and selected SNPs are used to construct two polygenic risk scores (PRS) for childhood obesity using a weighting approach that incorporates the dynamic and joint nature of SNP effects. These scores are significantly higher in children with (vs. without) rapid infant weight gain, a predictor of obesity later in life. Using two independent cohorts, it is shown that the genetic variants identified in very young children are also informative in older children and in adults, consistent with early childhood obesity being predictive of obesity later in life. In contrast, PRSs based on SNPs identified by adult obesity GWAS are not predictive of weight gain in the cohort of young children. This provides an example of a successful application of FDA to GWAS, complemented with simulations establishing that a deeply characterized sample can be just as, if not more, effective than a comparable study with a cross-sectional response. Overall, it is demonstrated that a deep, statistically sophisticated characterization of a longitudinal phenotype can provide increased statistical power to studies with relatively small sample sizes, and it is shown how FDA approaches can be used as an alternative to traditional GWAS.
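A sketch of the final scoring step: a polygenic risk score as a weighted sum of genotype dosages over the selected SNPs, with placeholder weights standing in for the FDA-derived joint and dynamic effect estimates described above.

```python
import numpy as np

rng = np.random.default_rng(8)
n_children, n_selected = 500, 120
G = rng.integers(0, 3, size=(n_children, n_selected)).astype(float)  # 0/1/2 dosages
w = rng.standard_normal(n_selected) * 0.05    # stand-in per-SNP effect weights

prs = G @ w                                   # one polygenic risk score per child
high_risk = prs > np.quantile(prs, 0.9)       # e.g. flag the top decile
```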