ABSTRACT
The dynamics and variability of protein conformations are directly linked to their functions. Many comparative studies of X-ray protein structures have been conducted to elucidate the relevant conformational changes, dynamics and heterogeneity. The rapid increase in the number of experimentally determined structures has made comparison an effective tool for investigating protein structures. For example, it is now possible to compare structural ensembles formed by enzyme species, variants or the type of ligand bound to them. In this study, the author developed a multilevel model for estimating two covariance matrices that represent inter- and intra-ensemble variability in the Cartesian coordinate space. Principal component analysis using the two estimated covariance matrices identified inter- and intra-enzyme variabilities that appeared to be important for enzyme function, with the illustrative examples of cytochrome P450 family 2 enzymes and class A β-lactamases. In P450, in which each enzyme has an active site of a distinct size, an active-site motion shared universally between the enzymes was captured as the first principal mode of the intra-enzyme covariance matrix. In this case, the method was useful for understanding the conformational variability after adjusting for the differences in size between enzymes. The developed method is advantageous for small ensemble sizes and hence promising for comparative studies of experimentally determined structures, where ensemble sizes are smaller than those generated, for example, by molecular dynamics simulations.
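The estimation idea can be sketched compactly; the code below is a generic pooled within-/between-group covariance decomposition followed by PCA, under simplifying assumptions (structures already superposed, equal weighting of ensembles), not the author's multilevel estimator. The array names `coords` and `labels` are placeholders.

```python
import numpy as np

def inter_intra_covariances(coords, labels):
    """Pooled intra-ensemble and inter-ensemble covariance matrices.

    coords : (n_structures, 3 * n_atoms) superposed Cartesian coordinates
    labels : (n_structures,) ensemble membership (e.g. enzyme identity)
    """
    grand_mean = coords.mean(axis=0)
    d = coords.shape[1]
    intra = np.zeros((d, d))
    inter = np.zeros((d, d))
    for g in np.unique(labels):
        block = coords[labels == g]
        mu_g = block.mean(axis=0)
        centred = block - mu_g
        intra += centred.T @ centred                  # within-ensemble scatter
        diff = (mu_g - grand_mean)[:, None]
        inter += len(block) * (diff @ diff.T)         # between-ensemble scatter
    n_total = len(coords)
    return intra / n_total, inter / n_total

def principal_modes(cov, n_modes=3):
    """Leading principal modes (eigenvectors) of a covariance matrix."""
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]
    return evals[order][:n_modes], evecs[:, order][:, :n_modes]
```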
Subjects
Molecular Dynamics Simulation, Proteins, Proteins/chemistry, Protein Conformation, Catalytic Domain

ABSTRACT
Recent research identifies and corrects bias, such as excess dispersion, in the leading sample eigenvector of a factor-based covariance matrix estimated from a high-dimension, low-sample-size (HL) data set. We show that eigenvector bias can have a substantial impact on variance-minimizing optimization in the HL regime, while bias in estimated eigenvalues may have little effect. We describe a data-driven eigenvector shrinkage estimator in the HL regime called "James-Stein for eigenvectors" (JSE) and its close relationship with the James-Stein (JS) estimator for a collection of averages. We show, both theoretically and with numerical experiments, that, for certain variance-minimizing problems of practical importance, efforts to correct eigenvalues have little value in comparison to the JSE correction of the leading eigenvector. When certain extra information is present, JSE is a consistent estimator of the leading eigenvector.
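The JS/JSE relationship can be illustrated with a short sketch: the positive-part James-Stein estimator below shrinks a vector of averages toward its grand mean, and, by analogy only, the same shrinkage is applied to the entries of a leading sample eigenvector, whose dispersion around its mean entry is inflated in the HL regime. This is an illustrative analogy under simplifying assumptions, not the paper's exact JSE formula, and `noise_level` is a placeholder for the required dispersion estimate.

```python
import numpy as np

def james_stein_toward_mean(x, sigma2):
    """Positive-part James-Stein estimator shrinking a vector toward its grand mean
    (requires dimension p >= 4)."""
    p = len(x)
    xbar = x.mean()
    resid = x - xbar
    ss = np.sum(resid ** 2)
    if ss == 0.0:
        return x
    shrink = max(0.0, 1.0 - (p - 3) * sigma2 / ss)
    return xbar + shrink * resid

def shrink_leading_eigenvector(S, noise_level):
    """Illustrative analogy: JS-shrink the entries of the leading sample eigenvector."""
    evals, evecs = np.linalg.eigh(S)
    h = evecs[:, -1]                               # leading sample eigenvector
    h_shrunk = james_stein_toward_mean(h, noise_level)
    return h_shrunk / np.linalg.norm(h_shrunk)
```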
Subjects
Bias, Sample Size

ABSTRACT
BACKGROUND: Gene interaction networks are graphs in which nodes represent genes and edges represent functional interactions between them. These interactions can operate at multiple levels, for instance gene regulation, protein-protein interaction, or metabolic pathways. To analyse gene interaction networks at a large scale, gene co-expression network analysis is often applied to high-throughput gene expression data such as RNA sequencing data. With advances in sequencing technology, expression of genes can now be measured in individual cells. Single-cell RNA sequencing (scRNAseq) provides insights into cellular development, differentiation and characteristics at the transcriptomic level. High sparsity and high-dimensional data structures pose challenges in scRNAseq data analysis. RESULTS: In this study, a sparse inverse covariance matrix estimation framework for scRNAseq data is developed to capture direct functional interactions between genes. Comparative analyses on simulated scRNAseq data highlight the high performance and fast computation of Stein-type shrinkage in high-dimensional settings. Data transformation approaches also improve the performance of shrinkage methods on non-Gaussian data. Zero-inflated modelling of scRNAseq data based on a negative binomial distribution enhances shrinkage performance on zero-inflated data without interfering with non-zero-inflated count data. CONCLUSION: The proposed framework broadens the application of graphical models in scRNAseq analysis, accommodating the sparsity of count data caused by dropout events while offering high performance and fast computation. The framework is implemented as a reproducible Snakemake workflow (https://github.com/calathea24/ZINBGraphicalModel) and an R package, ZINBStein (https://github.com/calathea24/ZINBStein).
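As a rough illustration of the estimation step only (not the ZINBStein implementation, and omitting the zero-inflated negative binomial modelling), the sketch below uses scikit-learn's Ledoit-Wolf estimator as a stand-in Stein-type shrinkage covariance, inverts it, and converts the precision matrix into partial correlations that can be thresholded into a candidate gene network.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def partial_correlation_network(counts, threshold=0.1):
    """counts : (n_cells, n_genes) scRNAseq expression matrix.

    Returns the partial-correlation matrix and a boolean adjacency matrix;
    entries above `threshold` in absolute value are candidate direct interactions.
    """
    x = np.log1p(counts)                        # simple variance-stabilising transform
    cov = LedoitWolf().fit(x).covariance_       # shrinkage (Stein-type) covariance estimate
    prec = np.linalg.inv(cov)                   # inverse covariance (well conditioned after shrinkage)
    d = np.sqrt(np.diag(prec))
    pcor = -prec / np.outer(d, d)               # standardise to partial correlations
    np.fill_diagonal(pcor, 1.0)
    adjacency = np.abs(pcor) > threshold
    np.fill_diagonal(adjacency, False)
    return pcor, adjacency
```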
Subjects
Gene Regulatory Networks, RNA Sequence Analysis, Single-Cell Analysis, Single-Cell Analysis/methods, RNA Sequence Analysis/methods, Humans, Gene Expression Profiling/methods

ABSTRACT
The computation of a similarity measure for genomic data is a standard tool in computational genetics. The principal components of such matrices are routinely used to correct for biases due to confounding by population stratification, for instance in linear regressions. However, the calculation of both a similarity matrix and its singular value decomposition (SVD) is computationally intensive. The contribution of this article is threefold. First, we demonstrate that the calculation of three matrices (the covariance matrix, the weighted Jaccard matrix, and the genomic relationship matrix) can be reformulated in a unified way that allows the application of a randomized SVD algorithm, which is faster than the traditional computation. The fast SVD algorithm we present is adapted from an existing randomized SVD algorithm and ensures that all computations are carried out in sparse matrix algebra. The algorithm only assumes that row-wise and column-wise subtraction and multiplication of a vector with a sparse matrix are available, operations that are efficiently implemented in common sparse matrix packages. An exception is the so-called Jaccard matrix, which does not have a structure amenable to the fast SVD algorithm. Second, an approximate Jaccard matrix is introduced to which the fast SVD computation is applicable. Third, we establish guaranteed theoretical bounds on the accuracy (in [Formula: see text] norm and angle) between the principal components of the Jaccard matrix and those of our proposed approximation, thus putting the proposed Jaccard approximation on a solid mathematical foundation, and we derive the theoretical runtime of our algorithm. We illustrate that the approximation error is low in practice and empirically verify the theoretical runtime scalings on both simulated data and data from the 1000 Genomes Project.
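A hedged sketch of the central computational idea follows: a basic randomized SVD in which the column-centering of a sparse genotype matrix is never formed explicitly and enters only through row-wise and column-wise vector operations, as the abstract requires. This is a generic illustration, not the authors' algorithm; `G` is a placeholder sparse matrix.

```python
import numpy as np
import scipy.sparse as sp

def randomized_svd_centered(G, k, n_oversample=10, n_iter=2, seed=0):
    """Top-k SVD of the column-centered matrix A = G - 1 m^T (G sparse, n x p,
    m = column means) without ever densifying A."""
    rng = np.random.default_rng(seed)
    n, p = G.shape
    m = np.asarray(G.mean(axis=0)).ravel()             # column means, length p

    def A_mult(V):                                      # A @ V without forming A
        return G @ V - np.outer(np.ones(n), m @ V)

    def At_mult(U):                                     # A.T @ U without forming A
        return G.T @ U - np.outer(m, np.ones(n) @ U)

    Omega = rng.standard_normal((p, k + n_oversample))
    Y = A_mult(Omega)
    for _ in range(n_iter):                             # power iterations for accuracy
        Y = A_mult(At_mult(Y))
    Q, _ = np.linalg.qr(Y)                              # orthonormal basis for the range of A
    B = At_mult(Q).T                                    # B = Q.T @ A (small, dense)
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]

# Example: U, s, Vt = randomized_svd_centered(sp.random(1000, 5000, density=0.01, format="csr"), k=10)
```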
Subjects
Genome, Genomics, Algorithms, Linear Models

ABSTRACT
The design of protein interaction inhibitors is a promising approach to addressing aberrant protein interactions that cause disease. One strategy in designing inhibitors is to use peptidomimetic scaffolds that mimic the natural interaction interface. A central challenge in using peptidomimetics as protein interaction inhibitors, however, is determining how best to align the molecular scaffold to the residues of the interface it is attempting to mimic. Here we present the Scaffold Matcher algorithm, which aligns a given molecular scaffold onto hotspot residues from a protein interaction interface. To optimize the degrees of freedom of the molecular scaffold we implement the covariance matrix adaptation evolution strategy (CMA-ES), a state-of-the-art derivative-free optimization algorithm, in Rosetta. To evaluate the performance of CMA-ES, we used 26 peptides from the FlexPepDock Benchmark and compared it with three other algorithms in Rosetta: Rosetta's default minimizer, a Monte Carlo protocol of small backbone perturbations, and a genetic algorithm. We tested the algorithms' ability to align a molecular scaffold to a series of hotspot residues (i.e., constraints) along native peptides. Of the four methods, CMA-ES was able to find the lowest-energy conformation for all 26 benchmark peptides. Additionally, as a proof of concept, we applied the Scaffold Matcher algorithm with CMA-ES to align a peptidomimetic oligooxopiperazine scaffold to the hotspot residues of the substrate of the main protease of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Our implementation of CMA-ES in Rosetta provides an alternative optimization method for macromolecular modeling problems with rough energy landscapes. Finally, our Scaffold Matcher algorithm allows the identification of initial conformations of interaction inhibitors that can be further designed and optimized as high-affinity reagents.
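For readers unfamiliar with CMA-ES, the sketch below shows the ask/tell loop of the pycma package (not the Rosetta implementation described here) minimizing a hypothetical rigid-body alignment score over six degrees of freedom; `alignment_score` and the coordinate arrays are placeholders.

```python
import numpy as np
import cma  # pip install cma

def alignment_score(dof, scaffold_xyz, hotspot_xyz):
    """Hypothetical objective: RMSD between the rigidly transformed scaffold
    anchor atoms and the hotspot coordinates. dof = (tx, ty, tz, rx, ry, rz)."""
    t, ang = dof[:3], dof[3:]
    cx, cy, cz = np.cos(ang)
    sx, sy, sz = np.sin(ang)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    moved = scaffold_xyz @ (Rz @ Ry @ Rx).T + t
    return float(np.sqrt(np.mean(np.sum((moved - hotspot_xyz) ** 2, axis=1))))

scaffold_xyz = np.random.rand(4, 3)   # placeholder anchor-atom coordinates
hotspot_xyz = np.random.rand(4, 3)    # placeholder hotspot-residue coordinates

es = cma.CMAEvolutionStrategy(6 * [0.0], 0.5, {"maxiter": 200})
while not es.stop():
    candidates = es.ask()             # sample a population from the current distribution
    es.tell(candidates, [alignment_score(np.asarray(c), scaffold_xyz, hotspot_xyz)
                         for c in candidates])  # update mean and covariance from fitnesses
best_dof = es.result.xbest
```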
Subjects
Peptidomimetics, Algorithms, Peptides/chemistry, Molecular Conformation, Benchmarking

ABSTRACT
Genetic prediction holds immense promise for translating genetic discoveries into medical advances. As the high-dimensional covariance matrix (or the linkage disequilibrium (LD) pattern) of genetic variants often presents a block-diagonal structure, numerous methods account for the dependence among variants in predetermined local LD blocks. Moreover, due to privacy considerations and data protection concerns, genetic variant dependence in each LD block is typically estimated from external reference panels rather than from the original training data set. This paper presents a unified analysis of blockwise and reference panel-based estimators in a high-dimensional prediction framework without sparsity restrictions. We find that, surprisingly, even when the covariance matrix has a block-diagonal structure with well-defined boundaries, blockwise estimation methods adjusting for local dependence can be substantially less accurate than methods controlling for the whole covariance matrix. Further, estimation methods built on the original training data set and on external reference panels are likely to have varying performance in high dimensions, which may reflect the cost of having access only to summary-level data from the training data set. This analysis is based on novel results in random matrix theory for block-diagonal covariance matrices. We numerically evaluate our results using extensive simulations and real data analysis in the UK Biobank.
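The contrast being analyzed can be made concrete with a small sketch: a generic ridge-type prediction rule computed from the whole covariance matrix versus the same rule applied separately within predetermined blocks. This only illustrates the setup; it is not the paper's estimators or theory.

```python
import numpy as np

def ridge_weights(cov, xty, lam=0.1):
    """Generic ridge-type prediction weights w = (cov + lam * I)^{-1} xty."""
    return np.linalg.solve(cov + lam * np.eye(cov.shape[0]), xty)

def blockwise_ridge_weights(cov, xty, block_bounds, lam=0.1):
    """The same rule applied separately within each predetermined LD block,
    discarding all cross-block entries of the covariance matrix."""
    w = np.zeros_like(xty)
    for lo, hi in block_bounds:
        w[lo:hi] = ridge_weights(cov[lo:hi, lo:hi], xty[lo:hi], lam)
    return w
```

In the setting studied here, `cov` would typically be a noisy estimate obtained from an external reference panel rather than from the training data, which is where the blockwise and whole-matrix strategies can diverge in accuracy.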
ABSTRACT
Multi-modal data are prevalent in many scientific fields. In this study, we consider the parameter estimation and variable selection for a multi-response regression using block-missing multi-modal data. Our method allows the dimensions of both the responses and the predictors to be large, and the responses to be incomplete and correlated, a common practical problem in high-dimensional settings. Our proposed method uses two steps to make a prediction from a multi-response linear regression model with block-missing multi-modal predictors. In the first step, without imputing missing data, we use all available data to estimate the covariance matrix of the predictors and the cross-covariance matrix between the predictors and the responses. In the second step, we use these matrices and a penalized method to simultaneously estimate the precision matrix of the response vector, given the predictors, and the sparse regression parameter matrix. Lastly, we demonstrate the effectiveness of the proposed method using theoretical studies, simulated examples, and an analysis of a multi-modal imaging data set from the Alzheimer's Disease Neuroimaging Initiative.
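The first step described above (covariance estimation from block-missing data without imputation) can be sketched as a pairwise-available estimator; the variable names are placeholders and the paper's actual estimator may differ in detail.

```python
import numpy as np

def pairwise_cov(X, Y):
    """Entry-wise covariance between columns of X (n x p) and Y (n x q), with
    missing values coded as NaN; each entry uses only rows observed for both."""
    p, q = X.shape[1], Y.shape[1]
    C = np.full((p, q), np.nan)
    for j in range(p):
        for k in range(q):
            ok = ~np.isnan(X[:, j]) & ~np.isnan(Y[:, k])
            if ok.sum() > 1:
                xj, yk = X[ok, j], Y[ok, k]
                C[j, k] = np.mean((xj - xj.mean()) * (yk - yk.mean()))
    return C

# Step-1 inputs for the penalized estimation of step 2 (sketch):
# Sigma_XX = pairwise_cov(X, X); Sigma_XY = pairwise_cov(X, Y)
```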
ABSTRACT
S-parameters (scattering parameters) are widely used to characterize radio frequency (RF) components and microwave circuit modules. The vector network analyzer (VNA) is the most commonly used device for measuring S-parameters. Given the multiple frequency points, complex values, and intricate uncertainty propagation involved, accurately assessing the uncertainty of S-parameter measurements is difficult. In this study, we propose a new method for assessing S-parameter uncertainty based on covariance matrices, traced back to the nominal uncertainty of the calibration standards. First, we analyzed the relevant theory of uncertainty assessment using covariance matrices and then derived the mechanism by which Type B uncertainty propagates from the calibration standards to the error-model coefficients and to the S-parameter measurements, in order to evaluate Type B measurement uncertainty. A novel measurement system was constructed for measuring grounded coplanar waveguides using a VNA and calibration standards with 8- and 12-term error models. Initially, the model was used to assess the Type B uncertainty of the four measured S-parameters of a grounded coplanar waveguide. Next, the VNA calibrated with the 12-term error model was used to conduct multiple repeated measurements to assess the Type A uncertainty of the grounded coplanar waveguide. Finally, the combined uncertainty was constructed, demonstrating that the proposed method can be used to assess the uncertainty of S-parameters.
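At first order, each stage of the Type B propagation described above is a linear propagation of a covariance matrix through the Jacobian of the measurement model. A minimal, generic (GUM-style) sketch, not the paper's specific error-model equations:

```python
import numpy as np

def propagate_covariance(jacobian, cov_in):
    """First-order propagation of an input covariance matrix through a
    measurement model: cov_out = J @ cov_in @ J.T.

    For S-parameters, the real and imaginary parts of each parameter can be
    stacked so that cov_in and cov_out are real-valued 2x2 (or larger) blocks.
    """
    return jacobian @ cov_in @ jacobian.T
```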
ABSTRACT
In space-time adaptive processing (STAP), a coprime sampling structure can achieve better clutter suppression at a lower hardware cost than a uniform linear sampling structure. In practical applications, however, the performance of the algorithm is often limited by the number of training samples. To address this problem, this paper proposes a fast iterative coprime STAP algorithm based on truncated kernel norm minimization (TKNM). The method establishes a virtual clutter covariance matrix (CCM), introduces truncated kernel norm regularization to enforce the low rank of the CCM, and transforms the non-convex problem into a convex optimization problem. Finally, a fast iterative solution based on the alternating direction method is presented. The effectiveness and accuracy of the proposed algorithm are verified through simulation experiments.
ABSTRACT
This paper aims at achieving real-time optimal speed estimation for an induction motor using the extended Kalman filter (EKF). Speed estimation is essential for fault diagnosis in motor current signature analysis (MCSA). The estimation accuracy is achieved by estimating the noise covariance matrices of the EKF algorithm. The noise covariance matrices are determined using a modified subspace model identification approach. To this end, the method compares an estimated model of a deterministic system, derived from available input-output data sets (acquired with voltage and current sensors), with the discrete-time state-space representation used in the Kalman filter equations. This comparison yields the model uncertainties, which are then represented as noise covariance matrices. Based on the fifth-order nonlinear model of the induction motor, the rotor speed is estimated with the optimized EKF algorithm, and the algorithm is tested experimentally.
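For reference, a textbook EKF predict/update step is sketched below, parameterized by the process and measurement noise covariances Q and R that the identification procedure supplies; it is generic and does not encode the fifth-order induction-motor model.

```python
import numpy as np

def ekf_step(x, P, u, z, f, h, F_jac, H_jac, Q, R):
    """One extended Kalman filter iteration.

    x, P         : prior state estimate and its covariance
    u, z         : control input and measurement
    f, h         : nonlinear state-transition and measurement functions
    F_jac, H_jac : their Jacobians evaluated at the current estimate
    Q, R         : process / measurement noise covariances (e.g. from subspace identification)
    """
    # Predict
    F = F_jac(x, u)
    x_pred = f(x, u)
    P_pred = F @ P @ F.T + Q
    # Update
    H = H_jac(x_pred)
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```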
ABSTRACT
This paper presents a novel unscented Kalman filter (UKF) implementation with adaptive covariance matrices (ACMs) to accurately estimate the longitudinal and lateral components of vehicle velocity, and thus the sideslip angle, tyre slip angles, and tyre slip ratios, even in extreme driving conditions, including tyre-road friction variations. The adaptation strategies are applied to both the process noise and measurement noise covariances. The resulting UKF ACM is compared against a well-tuned baseline UKF with fixed covariances. Experimental test results in high tyre-road friction conditions show good performance for both filters, with only a very marginal benefit from the ACM version. However, simulated extreme tests in variable and low-friction conditions highlight the superior performance and robustness provided by the adaptation mechanism.
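One common adaptation strategy, shown here only as a hedged illustration rather than the paper's specific UKF ACM mechanism, updates the measurement noise covariance from a forgetting-factor blend with the post-update residual statistics:

```python
import numpy as np

def adapt_measurement_noise(R_prev, residual, H, P_post, alpha=0.05):
    """Residual-based adaptive update of the measurement noise covariance R:
    blend the previous estimate with the post-update residual outer product
    plus the measurement-mapped state covariance."""
    innovation_term = np.outer(residual, residual) + H @ P_post @ H.T
    return (1.0 - alpha) * R_prev + alpha * innovation_term
```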
ABSTRACT
During civil aviation flights, the aircraft needs to monitor its real-time navigation capability accurately and determine whether the onboard navigation system performance meets the required navigation performance (RNP). The airborne flight management system (FMS) uses actual navigation performance (ANP) to quantify the uncertainty of the aircraft position estimate, and its evaluation accuracy depends strongly on the position estimation covariance matrix (PECM) provided by the airborne integrated navigation system. This paper proposes an adaptive PECM estimation method based on variational Bayes (VB) to solve the problem of ANP misevaluation, which is caused by the traditional simple ANP model failing to estimate the PECM accurately under unknown time-varying noise. Combined with the 3D ANP model proposed in this paper, the accuracy of ANP evaluation can be significantly improved. This enhancement contributes to ensuring navigation integrity and operational safety during civil flights.
ABSTRACT
We consider inference problems for high-dimensional (HD) functional data with a dense number T of repeated measurements taken on a large number p of variables from a small number n of experimental units. The spatial and temporal dependence, high dimensionality, and dense repeated measurements pose theoretical and computational challenges. This paper has two aims. The first is to solve the theoretical and computational challenges in testing equivalence among covariance matrices from HD functional data. The second is to provide computationally efficient and tuning-free tools with guaranteed stochastic error control. The weak convergence of the stochastic process formed by the test statistics is established under the "large p, large T, and small n" setting. If the null is rejected, we further show that the locations of the change points can be estimated consistently. The estimator's rate of convergence is shown to depend on the data dimension, sample size, number of repeated measurements, and signal-to-noise ratio. We also show that our proposed computational algorithms can significantly reduce the computation time and are applicable to real-world data with a large number of HD repeated measurements (e.g., functional magnetic resonance imaging (fMRI) data). Simulation results demonstrate both the finite-sample performance and the computational effectiveness of our proposed procedures. We observe that the empirical size of the test is well controlled at the nominal level, and the locations of multiple change points can be accurately identified. An application to fMRI data demonstrates that our proposed methods can identify event boundaries in the preface of the television series Sherlock. Code to implement the procedures is available in an R package named TechPhD.
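As a toy illustration of covariance change-point detection (not the paper's test statistic, and with no error control), one can scan the Frobenius distance between sample covariance matrices estimated before and after each candidate time point:

```python
import numpy as np

def covariance_change_scan(X, min_seg=10):
    """Scan statistic for a single covariance change point in a (T x p) series:
    Frobenius distance between the sample covariances before and after each
    candidate time t. np.nanargmax(scores) gives a crude change-point estimate."""
    T = X.shape[0]
    scores = np.full(T, np.nan)
    for t in range(min_seg, T - min_seg):
        S1 = np.cov(X[:t], rowvar=False)
        S2 = np.cov(X[t:], rowvar=False)
        scores[t] = np.linalg.norm(S1 - S2, ord="fro")
    return scores
```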
Subjects
Algorithms, Magnetic Resonance Imaging, Computer Simulation, Magnetic Resonance Imaging/methods, Sample Size

ABSTRACT
Testing the equality of two covariance matrices is a fundamental problem in statistics, and it is especially challenging when the data are high-dimensional. Through a novel use of random integration, we can test the equality of high-dimensional covariance matrices without assuming parametric distributions for the two underlying populations, even if the dimension is much larger than the sample size. The asymptotic properties of our test for an arbitrary number of covariates and arbitrary sample sizes are studied in depth under a general multivariate model. The finite-sample performance of our test is evaluated through numerical studies. The empirical results demonstrate that our test is highly competitive with existing tests in a wide range of settings. In particular, our proposed test is distinctly powerful under settings in which there exist a few large or many small diagonal disturbances between the two covariance matrices.
ABSTRACT
A smart city is a city equipped with many sensors communicating with each other for different purposes. Cybersecurity and signal security are important in such cities, especially for airports and harbours. Any signal interference with, or attack on, the navigation of autonomous vehicles and aircraft may lead to catastrophes and risk to people's lives. It is therefore of tremendous importance to develop wireless security networks for the localisation of any radio frequency interferer in smart cities. Time of arrival, angle of arrival, time-difference of arrival, received signal strength and received signal strength difference (RSSD) are known observables used for the localisation of a signal interferer. Localisation here means estimating the coordinates of an interferer from such measurements received at established monitoring stations and sensors. The main goal of this study is to optimise the geometric configuration of the monitoring stations using a desired dilution of precision and/or variance-covariance matrix (VCM) for the transmitter's location based on the RSSD. The required mathematical models are developed and applied to Arlanda international airport in Sweden. Our numerical tests show that the same configuration is achieved using the dilution-of-precision and VCM criteria when the design resolution is lower than 20 m under the same constraints. The choice of the path-loss exponent in the mathematical models of the RSSDs is not important at such low resolutions. Finally, optimisation based on the VCM is recommended because of its larger redundancy and flexibility in selecting different desired variances and covariances for the coordinates of the transmitter.
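The optimisation criteria can be made concrete with a short sketch: for a linearized design matrix A of the localisation problem, the variance-covariance matrix of the estimated transmitter coordinates is σ²(AᵀA)⁻¹ and the dilution of precision is the root trace of the cofactor matrix. The sketch assumes unit-weight observations and a generic A, not the paper's RSSD-specific model.

```python
import numpy as np

def vcm_and_dop(A, sigma2=1.0):
    """Variance-covariance matrix and dilution of precision of the estimated
    transmitter coordinates, for a linearized design matrix A (one row per
    observation, one column per unknown coordinate)."""
    Q = np.linalg.inv(A.T @ A)            # cofactor matrix of the coordinates
    vcm = sigma2 * Q
    dop = float(np.sqrt(np.trace(Q)))
    return vcm, dop
```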
ABSTRACT
This paper focuses on building a non-invasive, low-cost sensor that can be fitted over tree trunks growing in a semiarid environment. It also proposes a new definition that mathematically characterizes the water retention capability of tree trunks. The designed sensor measures the variation in capacitance across its probes. It uses amplification and filter stages to smooth the readings, requires little power, and operates at a frequency of 100 kHz. The sensor sends data via a Long Range (LoRa) transceiver through a gateway to a processing unit. Field experiments showed that the system provides accurate readings of the moisture content. As the sensors are non-invasive, they can be fitted to branches and trunks of various sizes without altering the structure of the wood tissue. Results show that the moisture content in tree trunks increases exponentially with the measured capacitance and reflects the distinct differences between tree types. Data from known healthy trees, unhealthy trees, and defective sensors were collected and analysed statistically to show how anomalies in sensor readings can be detected based on the eigenvectors and eigenvalues of the fitted-curve coefficient matrix.
ABSTRACT
This paper proposes a fast direction of arrival (DOA) estimation method based on positive incremental modified Cholesky decomposition atomic norm minimization (PI-CANM) for augmented coprime array sensors. The approach uses coprime sampling on the augmented array to generate a non-uniform, discontinuous virtual array and then interpolates it into a uniform, continuous virtual array. On this basis, the DOA estimation problem is equivalently formulated as a gridless optimization problem, which is solved via atomic norm minimization to reconstruct a Hermitian Toeplitz covariance matrix. Furthermore, through positive incremental modified Cholesky decomposition, the covariance matrix is transformed from positive semi-definite to positive definite, which simplifies the constraints of the optimization problem and reduces the complexity of the solution. Finally, the Multiple Signal Classification (MUSIC) method is applied to the reconstructed covariance matrix, yielding initial DOA angle estimates. Experimental results show that the PI-CANM algorithm surpasses other algorithms in estimation accuracy and remains stable in difficult conditions such as low signal-to-noise ratios and limited snapshots, while also being computationally fast. The method improves both the accuracy and the computational efficiency of DOA estimation, showing potential for broad applicability.
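For illustration, the final MUSIC stage applied to a (reconstructed) covariance matrix can be sketched as below for a generic half-wavelength uniform linear array; the atomic-norm and Cholesky-decomposition steps are not shown.

```python
import numpy as np

def music_spectrum(R, n_sources, angles_deg):
    """MUSIC pseudo-spectrum from a covariance matrix R of a half-wavelength ULA.
    DOA estimates are the angles at the largest peaks of the returned spectrum."""
    n_sensors = R.shape[0]
    evals, evecs = np.linalg.eigh(R)
    En = evecs[:, :-n_sources]                      # noise subspace (smallest eigenvalues)
    d = np.arange(n_sensors)
    spectrum = []
    for theta in np.deg2rad(angles_deg):
        a = np.exp(1j * np.pi * d * np.sin(theta))  # steering vector, lambda/2 spacing
        spectrum.append(1.0 / np.linalg.norm(En.conj().T @ a) ** 2)
    return np.asarray(spectrum)
```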
ABSTRACT
Evolution-based neural architecture search methods have shown promising results, but they require high computational resources because they train each candidate architecture from scratch and then evaluate its fitness, which results in long search times. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) has shown promising results in tuning the hyperparameters of neural networks but has not been used for neural architecture search. In this work, we propose a framework called CMANAS, which applies the faster convergence of CMA-ES to the deep neural architecture search problem. Instead of training each architecture separately, we use the accuracy of a trained one-shot model (OSM) on the validation data as a prediction of an architecture's fitness, resulting in reduced search time. We also use an architecture-fitness table (AF table) to keep a record of already evaluated architectures, further reducing the search time. The architectures are modelled using a normal distribution, which is updated using CMA-ES based on the fitness of the sampled population. Experimentally, CMANAS achieves better results than previous evolution-based methods while reducing the search time significantly. The effectiveness of CMANAS is shown on two different search spaces and the datasets CIFAR-10, CIFAR-100, ImageNet and ImageNet16-120. All the results show that CMANAS is a viable alternative to previous evolution-based methods and extends the application of CMA-ES to the deep neural architecture search field.
ABSTRACT
The goal of this paper is to present a theoretical and practical introduction to generalized eigendecomposition (GED), which is a robust and flexible framework used for dimension reduction and source separation in multichannel signal processing. In cognitive electrophysiology, GED is used to create spatial filters that maximize a researcher-specified contrast. For example, one may wish to exploit an assumption that different sources have different frequency content, or that sources vary in magnitude across experimental conditions. GED is fast and easy to compute, performs well in simulated and real data, and is easily adaptable to a variety of specific research goals. This paper introduces GED in a way that ties together myriad individual publications and applications of GED in electrophysiology, and provides sample MATLAB and Python code that can be tested and adapted. Practical considerations and issues that often arise in applications are discussed.
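A minimal Python illustration of the core computation (not the paper's accompanying sample code): a generalized eigendecomposition of two channel covariance matrices, where the eigenvector with the largest generalized eigenvalue defines the spatial filter that maximizes the researcher-specified contrast.

```python
import numpy as np
from scipy.linalg import eigh

def ged_spatial_filter(S, R):
    """Generalized eigendecomposition S w = lambda R w.

    S : channel covariance of the 'signal' data (condition/band of interest)
    R : channel covariance of the 'reference' data
    Returns the spatial filter (top eigenvector) and all generalized eigenvalues.
    """
    evals, evecs = eigh(S, R)       # eigenvalues returned in ascending order
    w = evecs[:, -1]                # filter maximizing the S-to-R variance ratio
    return w, evals

# Applying the filter: component = w @ channel_data   (channel_data: channels x time)
```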
Subjects
Electroencephalography/methods, Electrophysiological Phenomena, Magnetoencephalography/methods, Oscillometry/methods, Humans, Multivariate Analysis, Computer-Assisted Signal Processing

ABSTRACT
Measures of fluctuating asymmetry (FA) have been widely adopted as an estimate of developmental instability. Arising from various sources of stress, developmental instability is associated with an organism's capacity to maintain fitness. The process of domestication has been framed as an environmental stress with human-specified parameters, suggesting that FA may manifest to a larger degree among domesticates than among their wild relatives. This study used three-dimensional geometric morphometric landmark data to (a) quantify the amount of FA in the cranium of six domestic mammal species and their wild relatives and (b) provide a novel assessment of the commonalities and differences across domestic/wild pairs in the extent to which random variation arising from the developmental system assimilates into within-population variation. The majority of domestic mammals showed greater disparity in asymmetric shape; however, only two forms (Pig, Dog) showed significantly higher disparity as well as a higher degree of asymmetry compared to their wild counterparts (Wild Boar, Wolf). Contrary to predictions, most domestic and wild forms did not show a statistically significant correspondence between symmetric shape variation and FA, although a moderate correlation value was recorded for most pairs (partial least squares r > 0.5). Within pairs, domestic and wild forms showed similar correlation magnitudes for the relationship between the asymmetric and symmetric components. In domesticates, new variation may therefore retain a general, conserved pattern in the gross structuring of the cranium, whilst also being a source for response to selection on specific features.