ABSTRACT
Gene co-expression networks (GCNs) provide multiple benefits to molecular research, including hypothesis generation and biomarker discovery. Transcriptome profiles serve as input for GCN construction and are derived from increasingly large studies with samples spanning multiple experimental conditions, treatments, time points, genotypes, etc. Such experiments, with their larger numbers of variables, confound discovery of true network edges, exclude edges, and inhibit discovery of context-specific (or condition-specific) network edges. To demonstrate this problem, a 475-sample dataset is used to show that up to 97% of GCN edges can be misleading because correlations are false or incorrect. False and incorrect correlations can occur when tests are applied without ensuring that their assumptions are met, and pairwise gene expression may not meet test assumptions if the expression of at least one gene in the pair is a function of multiple confounding variables. The 'one-size-fits-all' approach to GCN construction is therefore problematic for large, multivariable datasets. Recently, the Knowledge Independent Network Construction (KINC) toolkit has been used in multiple studies to provide a dynamic approach to GCN construction that ensures statistical tests meet their assumptions and that confounding variables are addressed. Additionally, it can associate experimental context with each edge of the network, resulting in context-specific GCNs (csGCNs). To help researchers recognize such challenges in GCN construction and to support the creation of csGCNs, we provide a review of the workflow.
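The confounding problem described in this abstract can be illustrated with a toy simulation. The sketch below is not the KINC method and uses no data from the study. It assumes two hypothetical genes whose expression is independent within each of two conditions but whose mean expression shifts together between conditions. All names and parameter values are invented for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def simulate_condition(n, mean_a, mean_b):
    """Hypothetical genes A and B: independent noise around condition means."""
    gene_a = [rng.gauss(mean_a, 1.0) for _ in range(n)]
    gene_b = [rng.gauss(mean_b, 1.0) for _ in range(n)]
    return gene_a, gene_b


# Assumption: the condition (not any gene-gene relationship) shifts both means.
a1, b1 = simulate_condition(200, 0.0, 0.0)
a2, b2 = simulate_condition(200, 6.0, 6.0)

r_pooled = pearson(a1 + a2, b1 + b2)  # all samples treated as one population
r_cond1 = pearson(a1, b1)             # condition 1 only
r_cond2 = pearson(a2, b2)             # condition 2 only

print(f"pooled r = {r_pooled:.2f}")
print(f"condition 1 r = {r_cond1:.2f}, condition 2 r = {r_cond2:.2f}")
```

Under these assumptions the pooled correlation is strong although neither condition shows a relationship on its own, which is the kind of misleading edge a single test over all samples can produce.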
Subjects
Gene Regulatory Networks, Transcriptome
ABSTRACT
Pivotal to the success of any computational experiment are the ability to make reliable predictions about the system under study and the time required to obtain them. The study of biomolecular interactions is one area of research that spans the full range of trade-offs between resolution and time required, from the quantum mechanical level to in vivo studies. At an approximate midpoint, there is coarse-grained molecular dynamics, for which the Martini force fields have become the most widely used, fast enough to simulate the entire membrane of a mitochondrion though lacking atom-specific precision. While many force fields have been parametrized to account for a specific system under study, the Martini force field has aimed at casting a wider net with more generalized bead types that have demonstrated suitability for broad use and reuse in applications from protein-graphene oxide coassembly to polysaccharide interactions. In this Account, the progressive (Martini versions 1 through 3) and peripheral (Sour Martini, constant pH, Martini Straight, Dry Martini, etc.) developmental trajectory of the Martini force field will be analyzed in terms of self-assembling systems with a focus on short (two to three amino acids) peptide self-assembly in aqueous environments. In particular, the analysis focuses on the effects of the Martini solvent model and compares how changes in bead definitions and mapping affect different systems. Considerable effort in the development of Martini has been expended to reduce the "stickiness" of amino acids to better simulate proteins in bilayers. We have included in this Account a short study of dipeptide self-assembly in water, using all mainstream Martini force fields, to examine their ability to reproduce this behavior. The three most recently released versions of Martini and variations in their solvents are used to simulate in triplicate all 400 dipeptides of the 20 gene-encoded amino acids. The ability of the force fields to model the self-assembly of the dipeptides in aqueous environments is determined by the measurement of the aggregation propensity, and additional descriptors are used to gain further insight into the dipeptide aggregates.
Subjects
Molecular Dynamics Simulation, Peptides, Proteins/chemistry, Solvents, Water/chemistry, Amino Acids, Dipeptides
ABSTRACT
BACKGROUND: Quantification of gene expression from RNA-seq data is a prerequisite for transcriptome analyses such as differential gene expression analysis and gene co-expression network construction. Individual RNA-seq experiments are growing larger, and combining multiple experiments from sequence repositories can result in datasets with thousands of samples. Processing hundreds to thousands of RNA-seq samples can result in challenges related to data management, access to sufficient computational resources, navigation of high-performance computing (HPC) systems, installation of required software dependencies, and reproducibility. Processing of larger and deeper RNA-seq experiments will become more common as sequencing technology matures. RESULTS: GEMmaker is an nf-core compliant Nextflow workflow that quantifies gene expression from small to massive RNA-seq datasets. GEMmaker ensures results are highly reproducible through the use of versioned containerized software that can be executed on a single workstation, institutional compute cluster, Kubernetes platform or the cloud. GEMmaker supports popular alignment and quantification tools, providing results in raw and normalized formats. GEMmaker is unique in that it can scale to process thousands of locally or remotely stored samples without exceeding available data storage. CONCLUSIONS: Workflows that quantify gene expression are not new, and many already address issues of portability, reusability, and scale in terms of access to CPUs. GEMmaker provides these benefits and adds the ability to scale despite limited data storage infrastructure. This allows users to process hundreds to thousands of RNA-seq samples even when data storage resources are limited. GEMmaker is freely available and fully documented with step-by-step setup and execution instructions.
Subjects
High-Throughput Nucleotide Sequencing, Software, High-Throughput Nucleotide Sequencing/methods, RNA-Seq, Reproducibility of Results, RNA Sequence Analysis/methods
ABSTRACT
BACKGROUND: Prehospitalization documentation is a challenging task and prone to loss of information, as paramedics operate in disruptive environments that require their constant attention to patients. OBJECTIVE: The aim of this study is to develop a mobile platform for hands-free prehospitalization documentation to assist first responders in operational medical environments by aggregating existing solutions for noise resiliency and domain adaptation. METHODS: The platform was built to extract meaningful medical information from real-time audio streaming at the point of injury and transmit complete documentation to a field hospital prior to patient arrival. To this end, state-of-the-art automatic speech recognition (ASR) solutions with the following modular improvements were thoroughly explored: noise-resilient ASR, multi-style training, customized lexicon, and speech enhancement. The development of the platform was strictly guided by qualitative research and simulation-based evaluation to address the relevant challenges through progressive improvements at every process step of the end-to-end solution. The primary performance metrics included medical word error rate (WER) in machine-transcribed text output and an F1 score calculated by comparing the autogenerated documentation to manual documentation by physicians. RESULTS: A total of 15,139 individual words necessary for completing the documentation were identified from all conversations that occurred during the physician-supervised simulation drills. The baseline model performed suboptimally, with a WER of 69.85% and an F1 score of 0.611. The noise-resilient ASR, multi-style training, and customized lexicon improved the overall performance; the finalized platform achieved a medical WER of 33.3% and an F1 score of 0.81 when compared to manual documentation. Speech enhancement degraded performance: the medical WER increased from 33.3% to 46.33% and the corresponding F1 score decreased from 0.81 to 0.78. All changes in performance were statistically significant (P<.001). CONCLUSIONS: This study presented a fully functional mobile platform for hands-free prehospitalization documentation in operational medical environments and lessons learned from its implementation.
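For readers unfamiliar with the metric, the following sketch computes the conventional word error rate as word-level edit distance divided by reference length. The abstract does not define how the "medical" WER restricts the vocabulary, so this is the generic metric only. The function name and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Generic WER: (substitutions + deletions + insertions) / reference words.

    Words are split on whitespace. The study's medical WER may count only
    documentation-relevant words; that restriction is not modeled here.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Row-by-row Levenshtein distance over words instead of characters.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1       # reference word missing from output
            insertion = current[j - 1] + 1   # extra word in output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# Invented example: one dropped word and one substituted word.
example_wer = word_error_rate(
    "patient has a fractured left femur",
    "patient has fractured right femur",
)
print(f"WER = {example_wer:.3f}")
```

Because insertions count as errors, WER can exceed 100% when the transcript contains many spurious words.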
Subjects
Speech Recognition Software, Speech, Documentation, Humans, Technology
ABSTRACT
We introduce the Transcriptome State Perturbation Generator (TSPG), a novel deep-learning method that uses generative adversarial networks to identify changes in gene expression that occur between tissue states. TSPG learns the transcriptome perturbations from RNA-sequencing data required to shift from a source to a target class. We apply TSPG as an effective method of detecting biologically relevant alternate expression patterns between normal and tumor human tissue samples. We demonstrate that the application of TSPG to expression data obtained from a biopsy sample of a patient's kidney cancer can identify patient-specific differentially expressed genes between their individual tumor sample and a target class of healthy kidney gene expression. By utilizing TSPG in a precision medicine application in which the patient sample is not replicated (i.e., n = 1), we present a novel technique of determining significant transcriptional aberrations that can be used to help identify potential targeted therapies.
ABSTRACT
Rhodomyrtus tomentosa is a perennial shrub native to Southeast Asia and is invasive in South Florida and Hawai'i, USA. During surveys of R. tomentosa in Hong Kong from 2013 to 2018 for potential biological control agents, we collected larvae of the stem borer, Casmara subagronoma. Larvae were shipped in stems to a USDA-ARS quarantine facility where they were reared and subjected to biology studies and preliminary host range examinations. Casmara subagronoma is the most recently described Casmara species, described from males collected in Vietnam and Indonesia. Because the original species description was based on only two male specimens, we also provide a detailed description of the female, egg, larva, and pupa. Finally, we conducted preliminary host range trials utilizing Myrtus communis, Myrcianthes fragrans, and Camellia sinensis. Casmara subagronoma emerged from M. fragrans, a Florida-native shrub, and larvae were able to survive in non-target stems for over a year (>400 days). Based on these findings and difficulty in rearing, we do not believe C. subagronoma is a suitable insect for biological control of R. tomentosa at this time, but it may warrant further study. This investigation also illustrates the importance of host surveys for conservation and taxonomic purposes.
ABSTRACT
Identifying local structure in molecular simulations is of utmost importance. The most common existing approach to identifying local structure is to calculate some geometrical quantity referred to as an order parameter. In simple cases, order parameters are physically intuitive and trivial to develop (e.g., ion-pair distance); however, in most cases, order parameter development becomes a much more difficult endeavor (e.g., crystal structure identification). Using ideas from computer vision, we adapt a specific type of neural network called a PointNet to identify local structural environments in molecular simulations. A primary challenge in applying machine learning techniques to simulation is selecting the appropriate input features. This challenge is system-specific and requires significant human input and intuition. In contrast, our approach is a generic framework that requires no system-specific feature engineering and operates on the raw output of the simulations, i.e., atomic positions. We demonstrate the method on crystal structure identification in Lennard-Jones (four different phases), water (eight different phases), and mesophase (six different phases) systems. The method achieves as high as 99.5% accuracy in crystal structure identification. The method is applicable to heterogeneous nucleation, and it can even predict the crystal phases of atoms near external interfaces. We demonstrate the versatility of our approach by using our method to identify surface hydrophobicity based solely upon positions and orientations of surrounding water molecules. Our results suggest the approach will be broadly applicable to many types of local structure in simulations.
ABSTRACT
Given the complex relationship between gene expression and phenotypic outcomes, computationally efficient approaches are needed to sift through large high-dimensional datasets in order to identify biologically relevant biomarkers. In this report, we describe a method of identifying the most salient biomarker genes in a dataset, which we call "candidate genes", by evaluating the ability of gene combinations to classify samples from a dataset, which we call "classification potential". Our algorithm, Gene Oracle, uses a neural network to test user-defined gene sets for polygenic classification potential and then uses a combinatorial approach to further decompose selected gene sets into candidate and non-candidate biomarker genes. We tested this algorithm on curated gene sets from the Molecular Signatures Database (MSigDB) quantified in RNAseq gene expression matrices obtained from The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) data repositories. First, we identified which MSigDB Hallmark subsets have significant classification potential for both the TCGA and GTEx datasets. Then, we identified the most discriminatory candidate biomarker genes in each Hallmark gene set and provided evidence that the improved biomarker potential of these genes may be due to reduced functional complexity.
Subjects
Tumor Biomarkers/genetics, Genetic Association Studies, Genetic Predisposition to Disease, Oncogenes, Algorithms, Computational Biology/methods, Genetic Databases, Gene Expression Profiling, Gene Ontology, Humans
ABSTRACT
The Old World climbing fern, Lygodium microphyllum, is a rapidly spreading environmental weed in Florida, United States. We reconstructed the complete chloroplast genome of L. microphyllum from Illumina whole-genome shotgun sequencing and investigated the phylogenetic placement of this species within the leptosporangiate ferns. The chloroplast genome is 158,891 bp and contains 87 protein-coding genes, four rRNA genes, and 27 tRNA genes. Thirty-three genes contained internal stop codons, a common feature in leptosporangiate fern genomes. The L. microphyllum genome has been deposited in GenBank under accession number MG761729.
ABSTRACT
We hypothesized that the ongoing naturalization of frost- and shade-tolerant Asian bamboos in North America could have environmental consequences involving introduced bamboos, native rodents and ultimately humans. More specifically, we asked whether the eventual masting by an abundant leptomorphic ("running") bamboo within Pacific Northwest coniferous forests could produce a temporary surfeit of food capable of driving a population irruption of a common native seed predator, the deer mouse (Peromyscus maniculatus), a hantavirus carrier. Single-choice and cafeteria-style feeding trials were conducted for deer mice with seeds of two bamboo species (Bambusa distegia and Yushania brevipaniculata), wheat, Pinus ponderosa, and native mixed diets compared to rodent laboratory feed. Adult deer mice consumed bamboo seeds as readily as they consumed native seeds. In the cafeteria-style feeding trials, Y. brevipaniculata seeds were consumed at the same rate as native seeds but more frequently than wheat seeds or rodent laboratory feed. Females produced a median litter of 4 pups on a bamboo diet. Given the ability of deer mice to reproduce frequently whenever food is abundant, we employed our feeding trial results in a modified Rosenzweig-MacArthur consumer-resource model to project the population-level response of deer mice to a suddenly available and rapidly depleted supply of bamboo seeds. The simulations predict rodent population irruptions and declines similar to reported cycles involving Asian and South American rodents but unprecedented in deer mice. Following depletion of a mast seed supply, the incidence of Sin Nombre Virus (SNV) transmission to humans could subsequently rise with dispersal of the peridomestic deer mice into nearby human settlements seeking food.
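The abstract does not state how the Rosenzweig-MacArthur model was modified, so the sketch below uses the textbook form with a Holling type II functional response. As an assumption of this sketch only, resource regrowth is set to zero to mimic a one-time seed pulse. Every parameter value is invented for illustration and none comes from the feeding trials.

```python
def simulate_mast_pulse(seeds0=1000.0, mice0=10.0, attack=0.002, handling=0.05,
                        efficiency=0.1, mortality=0.05, regrowth=0.0,
                        capacity=1000.0, dt=0.01, steps=20000):
    """Forward-Euler integration of a Rosenzweig-MacArthur system.

    dR/dt = regrowth*R*(1 - R/capacity) - f(R)*C
    dC/dt = efficiency*f(R)*C - mortality*C
    f(R)  = attack*R / (1 + attack*handling*R)   (Holling type II)

    R is the seed supply and C the mouse population. All defaults are
    hypothetical; regrowth=0 is this sketch's stand-in for a mast pulse.
    """
    seeds, mice = seeds0, mice0
    trajectory = [(0.0, seeds, mice)]
    for step in range(1, steps + 1):
        intake = attack * seeds / (1.0 + attack * handling * seeds)
        d_seeds = regrowth * seeds * (1.0 - seeds / capacity) - intake * mice
        d_mice = efficiency * intake * mice - mortality * mice
        # Clamp at zero so Euler steps cannot produce negative populations.
        seeds = max(seeds + d_seeds * dt, 0.0)
        mice = max(mice + d_mice * dt, 0.0)
        trajectory.append((step * dt, seeds, mice))
    return trajectory


trajectory = simulate_mast_pulse()
peak_time, _, peak_mice = max(trajectory, key=lambda point: point[2])
final_time, final_seeds, final_mice = trajectory[-1]
print(f"peak mice = {peak_mice:.1f} at t = {peak_time:.1f}")
print(f"final seeds = {final_seeds:.1f}, final mice = {final_mice:.3f}")
```

With no regrowth, the consumer population rises while per-capita intake covers mortality and falls once the seed supply drops below that break-even level, which gives the irruption-then-decline shape the abstract describes.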
Subjects
Bambusa/growth & development, Animals, Computer Simulation, Diet, Female, Food Preferences, Introduced Species, Male, Peromyscus, Pinus ponderosa, Population Growth, Quercus, Seeds, Triticum
ABSTRACT
BACKGROUND: In genomics, highly relevant gene interaction (co-expression) networks have been constructed by finding significant pair-wise correlations between genes in expression datasets. These networks are then mined to elucidate biological function at the polygenic level. In some cases, networks may be constructed from input samples that measure gene expression under a variety of different conditions, such as for different genotypes, environments, disease states and tissues. When large sets of samples are obtained from public repositories, it is often unmanageable to associate samples into condition-specific groups, and combining samples from various conditions has a negative effect on network size. A fixed significance threshold is often applied, which also limits the size of the final network. Therefore, we propose pre-clustering of input expression samples to approximate condition-specific grouping of samples and individual network construction of each group as a means for dynamic significance thresholding. The net effect is increased sensitivity, thus maximizing the total co-expression relationships in the final co-expression network compendium. RESULTS: A total of 86 Arabidopsis thaliana co-expression networks were constructed after k-means partitioning of 7,105 publicly available ATH1 Affymetrix microarray samples. We term each pre-sorted network a Gene Interaction Layer (GIL). Random Matrix Theory (RMT), an un-supervised thresholding method, was used to threshold each of the 86 networks independently, effectively providing a dynamic (non-global) threshold for the network. The overall gene count across all GILs reached 19,588 genes (94.7% measured gene coverage) and 558,022 unique co-expression relationships. In comparison, network construction without pre-sorting of input samples yielded only 3,297 genes (15.9%) and 129,134 relationships in the global network.
CONCLUSIONS: Here we show that pre-clustering of microarray samples helps approximate condition-specific networks and allows for dynamic thresholding using un-supervised methods. Because RMT ensures only highly significant interactions are kept, the GIL compendium consists of 558,022 unique high-quality A. thaliana co-expression relationships across almost all of the measurable genes on the ATH1 array. For A. thaliana, these networks represent the largest compendium to date of significant gene co-expression relationships, and are a means to explore complex pathway, polygenic, and pleiotropic relationships for this focal model plant. The networks can be explored at sysbio.genome.clemson.edu. Finally, this method is applicable to any large expression profile collection for any organism and is best suited where a knowledge-independent network construction method is desired.
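The effect of building one network per sample group, rather than one global network, can be sketched on synthetic data. The sketch makes two substitutions that should be read as assumptions: the sample groups are given directly instead of being found by k-means, and a fixed correlation cutoff stands in for RMT thresholding. Gene names, group sizes, and the cutoff are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def build_network(expression, samples, cutoff):
    """Return gene pairs whose |r| over the given sample indices meets cutoff."""
    genes = sorted(expression)
    edges = set()
    for i, gene_1 in enumerate(genes):
        for gene_2 in genes[i + 1:]:
            x = [expression[gene_1][s] for s in samples]
            y = [expression[gene_2][s] for s in samples]
            if abs(pearson(x, y)) >= cutoff:
                edges.add((gene_1, gene_2))
    return edges


rng = random.Random(1)
n = 100
groups = [list(range(0, n)), list(range(n, 2 * n))]  # stand-in for k-means output

# geneA and geneB are co-expressed only in the first group; geneC is noise.
gene_a = [rng.gauss(0.0, 1.0) for _ in range(2 * n)]
gene_b = [gene_a[s] + rng.gauss(0.0, 0.2) if s < n else rng.gauss(0.0, 1.0)
          for s in range(2 * n)]
gene_c = [rng.gauss(0.0, 1.0) for _ in range(2 * n)]
expression = {"geneA": gene_a, "geneB": gene_b, "geneC": gene_c}

cutoff = 0.8  # fixed cutoff used here in place of an RMT-derived threshold
global_edges = build_network(expression, groups[0] + groups[1], cutoff)
compendium = set().union(*(build_network(expression, g, cutoff) for g in groups))

print("global network edges:", sorted(global_edges))
print("compendium edges:", sorted(compendium))
```

In this toy case the condition-specific relationship is diluted below the cutoff when all samples are pooled and is recovered when each group is thresholded separately, which mirrors the gain in relationships the abstract reports for the compendium.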
Subjects
Arabidopsis/genetics, Gene Expression Profiling/methods, Plant Genes/genetics, Cluster Analysis, Gene Regulatory Networks
ABSTRACT
The study of gene relationships and their effect on biological function and phenotype is a focal point in systems biology. Gene co-expression networks built using microarray expression profiles are one technique for discovering and interpreting gene relationships. A knowledge-independent thresholding technique, such as Random Matrix Theory (RMT), is useful for identifying meaningful relationships. Highly connected genes in the thresholded network are then grouped into modules that provide insight into their collective functionality. While it has been shown that co-expression networks are biologically relevant, it has not been determined to what extent any given network is functionally robust given perturbations in the input sample set. Such a test requires hundreds of networks and hence a tool to construct them rapidly. To examine functional robustness of networks with varying input, we enhanced an existing RMT implementation for improved scalability and tested the functional robustness of human (Homo sapiens), rice (Oryza sativa) and budding yeast (Saccharomyces cerevisiae) networks. We demonstrate a dramatic decrease in network construction time and computational requirements and show that despite some variation in global properties between networks, functional similarity remains high. Moreover, the biological function captured by co-expression networks thresholded by RMT is highly robust.