ABSTRACT
Despite the growing variety of sequencing and variant-calling tools, no workflow performs equally well across the entire human genome. Understanding context-dependent performance is critical for enabling researchers, clinicians, and developers to make informed tradeoffs when selecting sequencing hardware and software. Here we describe a set of "stratifications," which are BED files that define distinct contexts throughout the genome. We define these for GRCh37/38 as well as the new T2T-CHM13 reference, adding many new hard-to-sequence regions which are critical for understanding performance as the field progresses. Specifically, we highlight the increase in hard-to-map and GC-rich stratifications in CHM13 relative to the previous references. We then compare the benchmarking performance with each reference and show the performance penalty brought about by these additional difficult regions in CHM13. Additionally, we demonstrate how the stratifications can track context-specific improvements over different platform iterations, using Oxford Nanopore Technologies as an example. The means to generate these stratifications are available as a snakemake pipeline at https://github.com/usnistgov/giab-stratifications . We anticipate this being useful in enabling precise risk-reward calculations when building sequencing pipelines for any of the commonly-used reference genomes.
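As a rough illustration of how a stratification BED file can be used, the sketch below assigns variant positions to strata and counts them per stratum. It is not taken from the giab-stratifications pipeline: the interval coordinates, the stratum names (`hard_to_map`, `gc_rich`), and the helper functions are all hypothetical, and intervals within one BED file are assumed not to overlap.

```python
from bisect import bisect_right
from collections import defaultdict

def parse_bed(lines):
    """Parse BED lines into {chrom: sorted [(start, end)]}; BED is 0-based, half-open."""
    regions = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    for chrom in regions:
        regions[chrom].sort()
    return dict(regions)

def in_regions(regions, chrom, pos0):
    """True if 0-based position pos0 lies inside an interval (assumes non-overlapping intervals)."""
    intervals = regions.get(chrom, [])
    # Index of the last interval whose start is <= pos0.
    i = bisect_right(intervals, (pos0, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos0 < intervals[i][1]

def stratified_counts(strata, variants):
    """Count variants per stratum; variant positions are 1-based, as in VCF."""
    counts = {name: 0 for name in strata}
    for chrom, pos1 in variants:
        for name, regions in strata.items():
            if in_regions(regions, chrom, pos1 - 1):
                counts[name] += 1
    return counts

# Hypothetical strata and variant calls, for illustration only.
hard_to_map = parse_bed(["chr1\t100\t200", "chr1\t500\t600"])
gc_rich = parse_bed(["chr1\t150\t300"])
strata = {"hard_to_map": hard_to_map, "gc_rich": gc_rich}
variants = [("chr1", 101), ("chr1", 160), ("chr1", 301), ("chr2", 50)]
counts = stratified_counts(strata, variants)
print(counts)
```

A variant can fall into several strata at once, which is why the counts are kept per stratum rather than partitioned.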
Subject(s)
Human Genome, Genomics, Software, Humans, Genomics/methods, DNA Sequence Analysis/methods, Benchmarking, High-Throughput Nucleotide Sequencing/methods
ABSTRACT
PURPOSE: Clonal hematopoiesis (CH) is thought to be the origin of myeloid neoplasms (MN). Yet, our understanding of the mechanisms driving CH progression to MN and clinical risk prediction of MN remains limited. The human proteome reflects complex interactions between genetic and epigenetic regulation of biological systems. We hypothesized that the plasma proteome might predict MN risk and inform our understanding of the mechanisms promoting MN development. EXPERIMENTAL DESIGN: We jointly characterized CH and plasma proteomic profiles of 46,237 individuals in the UK Biobank at baseline study entry. During 500,036 person-years of follow-up, 115 individuals developed MN. Cox proportional hazard regression was used to test for an association between plasma protein levels and MN risk. RESULTS: We identified 115 proteins associated with MN risk, of which 30% (N = 34) were also associated with CH. These were enriched for known regulators of the innate and adaptive immune system. Plasma proteomics improved the prediction of MN risk (AUC = 0.85; P = 5 × 10⁻⁹) beyond clinical factors and CH (AUC = 0.80). In an independent group (N = 381,485), we used inherited polygenic risk scores (PRS) for plasma protein levels to validate the relevance of these proteins to MN development. PRS analyses suggest that most MN-associated proteins we identified are not directly causally linked to MN risk, but rather represent downstream markers of pathways regulating the progression of CH to MN. CONCLUSIONS: These data highlight the role of immune cell regulation in the progression of CH to MN and the promise of leveraging multi-omic characterization of CH to improve MN risk stratification. See related commentary by Bhalgat and Taylor, p. 3095.
Subject(s)
Tumor Biomarkers, Proteomics, Humans, Proteomics/methods, Female, Male, Middle Aged, Tumor Biomarkers/blood, Tumor Biomarkers/genetics, Aged, Proteome, Clonal Hematopoiesis, Risk Factors, Adult, Blood Proteins/metabolism, Blood Proteins/analysis, Myeloproliferative Disorders/blood, Myeloproliferative Disorders/genetics, Myeloproliferative Disorders/diagnosis, Prognosis
ABSTRACT
Science, technology, engineering, mathematics, and medicine (STEMM) fields change rapidly and are increasingly interdisciplinary. Commonly, STEMM practitioners use short-format training (SFT) such as workshops and short courses for upskilling and reskilling, but unaddressed challenges limit SFT's effectiveness and inclusiveness. Education researchers, students in SFT courses, and organizations have called for research and strategies that can strengthen SFT in terms of effectiveness, inclusiveness, and accessibility across multiple dimensions. This paper describes the project that resulted in a consensus set of 14 actionable recommendations to systematically strengthen SFT. A diverse international group of 30 experts in education, accessibility, and life sciences came together from 10 countries to develop recommendations that can help strengthen SFT globally. Participants, including representation from some of the largest life science training programs globally, assembled findings in the educational sciences and encompassed the experiences of several of the largest life science SFT programs. The 14 recommendations were derived through a Delphi method, where consensus was achieved in real time as the group completed a series of meetings and tasks designed to elicit specific recommendations. Recommendations cover the breadth of SFT contexts and stakeholder groups and include actions for instructors (e.g., make equity and inclusion an ethical obligation), programs (e.g., centralize infrastructure for assessment and evaluation), as well as organizations and funders (e.g., professionalize training SFT instructors; deploy SFT to counter inequity). Recommendations are aligned with a purpose-built framework, "The Bicycle Principles," that prioritizes evidence-based teaching, inclusiveness, and equity, as well as the ability to scale, share, and sustain SFT. We also describe how the Bicycle Principles and recommendations are consistent with educational change theories and can overcome systemic barriers to delivering consistently effective, inclusive, and career-spanning SFT.
Subject(s)
Students, Technology, Humans, Consensus, Engineering
ABSTRACT
Motivation: For a number of neurological diseases, such as Alzheimer's disease, amyotrophic lateral sclerosis, and many others, certain genes are known to be involved in the disease mechanism. A common question is whether a structural variant in any such gene may be related to drug response in clinical trials and how this relationship can contribute to the lifecycle of drug development. Results: To this end, we introduce VariantSurvival, a tool that identifies changes in survival relative to structural variants within target genes. VariantSurvival matches annotated structural variants with genes that are clinically relevant to neurological diseases. A Cox regression model determines the change in survival between the placebo and clinical trial groups with respect to the number of structural variants in the drug target genes. We demonstrate the functionality of our approach with the exemplary case of the SETX gene. VariantSurvival has a user-friendly and lightweight graphical user interface built on the shiny web application package.
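To make the Cox regression step concrete, here is a minimal sketch that fits a single-covariate Cox model by Newton-Raphson on the partial log-likelihood. It is not VariantSurvival's implementation: the data are invented, the covariate is simply a per-patient structural variant count, tied event times are assumed absent, and the treatment-versus-placebo comparison described above is left out for brevity.

```python
import math

def cox_score_and_hessian(beta, times, events, x):
    """First and second derivative of the Cox partial log-likelihood (one covariate, no ties)."""
    grad, hess = 0.0, 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects contribute only through the risk sets
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        w = [math.exp(beta * x[j]) for j in risk]
        s0 = sum(w)
        s1 = sum(wj * x[j] for wj, j in zip(w, risk))
        s2 = sum(wj * x[j] ** 2 for wj, j in zip(w, risk))
        grad += x[i] - s1 / s0
        hess -= s2 / s0 - (s1 / s0) ** 2
    return grad, hess

def fit_cox(times, events, x, tol=1e-10, max_iter=100):
    """Newton-Raphson estimate of the log hazard ratio per unit of x."""
    beta = 0.0
    for _ in range(max_iter):
        grad, hess = cox_score_and_hessian(beta, times, events, x)
        if abs(grad) < tol:
            break
        beta -= grad / hess
    return beta

# Invented example: follow-up time, event indicator (1 = event, 0 = censored),
# and number of structural variants in the target gene.
times = [5, 8, 12, 3, 9, 14]
events = [1, 1, 0, 1, 1, 1]
sv_count = [2, 1, 0, 3, 0, 1]

beta = fit_cox(times, events, sv_count)
hazard_ratio = math.exp(beta)
print(f"log hazard ratio per SV: {beta:.3f}, hazard ratio: {hazard_ratio:.3f}")
```

A positive coefficient here means that, in this toy data set, each additional structural variant is associated with a higher hazard. A real analysis would use an established survival library and report confidence intervals.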
ABSTRACT
Serum tryptase is a biomarker used to aid in the identification of certain myeloid neoplasms, most notably systemic mastocytosis, where basal serum tryptase (BST) levels >20 ng/mL are a minor criterion for diagnosis. Although clonal myeloid neoplasms are rare, the most common cause for elevated BST levels is the genetic trait hereditary α-tryptasemia (HαT) caused by increased germline TPSAB1 copy number. To date, the precise structural variation and mechanism(s) underlying elevated BST in HαT and the general clinical utility of tryptase genotyping remain undefined. Through cloning, long-read sequencing, and assembling of the human tryptase locus from an individual with HαT, and validating our findings in vitro and in silico, we demonstrate that BST elevations arise from overexpression of replicated TPSAB1 loci encoding canonical α-tryptase protein owing to coinheritance of a linked overactive promoter element. Modeling BST levels based on TPSAB1 replication number, we generate new individualized clinical reference values for the upper limit of normal. Using this personalized laboratory medicine approach, we demonstrate the clinical utility of tryptase genotyping, finding that in the absence of HαT, BST levels >11.4 ng/mL frequently identify indolent clonal mast cell disease. Moreover, substantial BST elevations (e.g., >100 ng/mL), which would ordinarily prompt bone marrow biopsy, can result from TPSAB1 replications alone and thus be within normal limits for certain individuals with HαT.
Subject(s)
Mastocytosis, Myeloproliferative Disorders, Humans, Tryptases/genetics, Mast Cells, Reference Values, Unnecessary Procedures, Mastocytosis/diagnosis, Myeloproliferative Disorders/pathology
ABSTRACT
iCn3D was initially developed as a web-based 3D molecular viewer. It then evolved from visualization into a full-featured interactive structural analysis software. It became a collaborative research instrument through the sharing of permanent, shortened URLs that encapsulate not only annotated visual molecular scenes, but also all underlying data and analysis scripts in a FAIR manner. More recently, with the growth of structural databases, the need to analyze large structural datasets systematically led us to use Python scripts and convert the code to be used in Node.js scripts. We showed a few examples of Python scripts at https://github.com/ncbi/icn3d/tree/master/icn3dpython to export secondary structures or PNG images from iCn3D. Users just need to replace the URL in the Python scripts to export other annotations from iCn3D. Furthermore, any interactive iCn3D feature can be converted into a Node.js script to be run in batch mode, enabling an interactive analysis performed on one or a handful of protein complexes to be scaled up to analysis features of large ensembles of structures. Examples of Node.js analysis scripts are currently available at https://github.com/ncbi/icn3d/tree/master/icn3dnode. This development will enable ensemble analyses on growing structural databases such as AlphaFold or RoseTTAFold on one hand and Electron Microscopy on the other. In this paper, we also review new features such as DelPhi electrostatic potential, 3D view of mutations, alignment of multiple chains, assembly of multiple structures by realignment, dynamic symmetry calculation, 2D cartoons at different levels, interactive contact maps, and use of iCn3D in Jupyter Notebook as described at https://pypi.org/project/icn3dpy.
ABSTRACT
INTRODUCTION: Pancreatitis is a complex syndrome that results from many etiologies. Large well-characterized cohorts are needed to further understand disease risk and prognosis. METHODS: A pancreatitis cohort of more than 4,200 patients and 24,000 controls was identified in the UK BioBank (UKBB) consortium. A descriptive analysis was completed, comparing patients with acute (AP) and chronic pancreatitis (CP). The Toxic-metabolic, Idiopathic, Genetic, Autoimmune, Recurrent, and severe pancreatitis and Obstructive checklist Version 2 classification was applied to patients with AP and CP and compared with the control population. RESULTS: CP prevalence in the UKBB is 163 per 100,000. AP incidence increased from 21.4/100,000 per year between 2001 and 2005 to 48.2/100,000 per year between 2016 and 2020. Gallstones and smoking were confirmed as key risk factors for AP and CP, respectively. Both populations carry multiple risk factors and a high burden of comorbidities, including benign and malignant neoplastic disorders. DISCUSSION: The UKBB serves as a rich cohort to evaluate pancreatitis. Disease burden of AP and CP was high in this population. The association of common risk factors identified in other cohort studies was confirmed in this study. Further analysis is needed to link genomic risks and biomarkers with disease features in this population.
Subject(s)
Biological Specimen Banks, Chronic Pancreatitis, Cohort Studies, Humans, Chronic Pancreatitis/complications, Chronic Pancreatitis/epidemiology, Prevalence, United Kingdom/epidemiology
ABSTRACT
Characterizing the gut microbiota in terms of their capacity to interfere with drug metabolism is necessary to achieve drug efficacy and safety. Although examples of drug-microbiome interactions are well-documented, little has been reported about a computational pipeline for systematically identifying and characterizing bacterial enzymes that process particular classes of drugs. The goal of our study is to develop a computational approach that compiles drugs whose metabolism may be influenced by a particular class of microbial enzymes and that quantifies the variability in the collective level of those enzymes among individuals. The present paper describes this approach, with microbial β-glucuronidases as an example, which break down drug-glucuronide conjugates and reactivate the drugs or their metabolites. We identified 100 medications that may be metabolized by β-glucuronidases from the gut microbiome. These medications included morphine, estrogen, ibuprofen, midazolam, and their structural analogues. The analysis of metagenomic data available through the Sequence Read Archive (SRA) showed that the level of β-glucuronidase in the gut metagenomes was higher in males than in females, which provides a potential explanation for the sex-based differences in efficacy and toxicity for several drugs, reported in previous studies. Our analysis also showed that infant gut metagenomes at birth and 12 months of age have higher levels of β-glucuronidase than the metagenomes of their mothers and the implication of this observed variability was discussed in the context of breastfeeding as well as infant hyperbilirubinemia. Overall, despite important limitations discussed in this paper, our analysis provided useful insights on the role of the human gut metagenome in the variability in drug response among individuals. Importantly, this approach exploits drug and metagenome data available in public databases as well as open-source cheminformatics and bioinformatics tools to predict drug-metagenome interactions.
Subject(s)
Forecasting/methods, Gastrointestinal Microbiome/drug effects, Metagenomics/methods, Adult, Bacteria/genetics, Computational Biology/methods, Data Management, Female, Gastrointestinal Microbiome/genetics, Glucuronidase/genetics, Glucuronidase/metabolism, Humans, Newborn Infant, Male, Metagenome/drug effects, Metagenome/genetics, Microbiota/drug effects, Microbiota/genetics, Mothers
ABSTRACT
Viruses represent important test cases for data federation due to their genome size and the rapid increase in sequence data in publicly available databases. However, some consequences of previously decentralized (unfederated) data are a lack of consensus on, or comparisons between, feature annotations. Unifying or displaying alternative annotations should be a priority both for communities with robust entry representation and for nascent communities with burgeoning data sources. To this end, during this three-day continuation of the Virus Hunting Toolkit codeathon series (VHT-2), a new integrated and federated viral index was elaborated. This Federated Index of Viral Experiments (FIVE) integrates pre-existing and novel functional and taxonomy annotations and virus-host pairings. Variability in the context of viral genomic diversity is often overlooked in virus databases. As a proof-of-concept, FIVE was the first attempt to include viral genome variation for HIV, the most well-studied human pathogen, through viral genome diversity graphs. As of the publication of this manuscript, FIVE is the first implementation of a virus-specific federated index of such scope. FIVE is coded in BigQuery for optimal access to large quantities of data and is publicly accessible. Many database or index federation projects fail to provide easier alternatives for accessing or querying information. To this end, a Python API query system was developed to enhance the accessibility of FIVE.
Subject(s)
Computational Biology, Genetic Databases, Metagenomics/methods, Viruses/genetics, Computational Biology/methods, Genetic Variation, Viral Genome, Host-Pathogen Interactions, Humans, User-Computer Interface, Viral Proteins/genetics, Viral Proteins/metabolism, Viruses/metabolism, Web Browser
ABSTRACT
The Sequence Read Archive (SRA) is a large public repository that stores raw next-generation sequencing data from thousands of diverse scientific investigations. Despite its promise, reuse and re-analysis of SRA data has been challenged by the heterogeneity and poor quality of the metadata that describe its biological samples. Recently, the MetaSRA project standardized these metadata by annotating each sample with terms from biomedical ontologies. In this work, we present a pair of Jupyter notebook-based tools that utilize the MetaSRA for building structured datasets from the SRA in order to facilitate secondary analyses of the SRA's human RNA-seq data. The first tool, called the Case-Control Finder, finds suitable case and control samples for a given disease or condition where the cases and controls are matched by tissue or cell type. The second tool, called the Series Finder, finds ordered sets of samples for the purpose of addressing biological questions pertaining to changes over a numerical property such as time. These tools were the result of a three-day-long NCBI Codeathon in March 2019 held at the University of North Carolina at Chapel Hill.
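The tissue-matched selection that the Case-Control Finder performs can be sketched as follows. This is only an illustration of the idea, not the tool's code or the MetaSRA schema: the record layout (`id`, `tissue`, `terms`), the sample identifiers, and the term labels are hypothetical stand-ins for ontology-annotated metadata.

```python
from collections import defaultdict

def find_case_control(samples, condition):
    """Group samples by tissue and keep tissues that have both cases and controls."""
    by_tissue = defaultdict(lambda: {"case": [], "control": []})
    for sample in samples:
        # A sample is a case if it is annotated with the condition term.
        group = "case" if condition in sample["terms"] else "control"
        by_tissue[sample["tissue"]][group].append(sample["id"])
    # Matching requirement: a tissue is usable only if both groups are non-empty.
    return {t: g for t, g in by_tissue.items() if g["case"] and g["control"]}

# Hypothetical, hand-written metadata records.
samples = [
    {"id": "S1", "tissue": "liver", "terms": {"hepatitis"}},
    {"id": "S2", "tissue": "liver", "terms": set()},
    {"id": "S3", "tissue": "brain", "terms": {"hepatitis"}},
    {"id": "S4", "tissue": "blood", "terms": set()},
]

matched = find_case_control(samples, "hepatitis")
print(matched)
```

Brain and blood are dropped in this toy example because each has only one of the two groups, so no tissue-matched comparison is possible there.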
Subject(s)
Biological Ontologies, Datasets as Topic, High-Throughput Nucleotide Sequencing, Metadata, Software, Case-Control Studies, Humans, RNA-Seq
ABSTRACT
PURPOSE: The modern researcher is confronted with hundreds of published methods to interpret genetic variants. There are databases of genes and variants, phenotype-genotype relationships, algorithms that score and rank genes, and in silico variant effect prediction tools. Because variant prioritization is a multifactorial problem, a welcome development in the field has been the emergence of decision support frameworks, which make it easier to integrate multiple resources in an interactive environment. Current decision support frameworks are typically limited by closed proprietary architectures, access to a restricted set of tools, lack of customizability, Web dependencies that expose protected data, or limited scalability. METHODS: We present the Open Custom Ranked Analysis of Variants Toolkit (OpenCRAVAT), a new open-source, scalable decision support system for variant and gene prioritization. We have designed the resource catalog to be open and modular to maximize community and developer involvement, and as a result, the catalog is being actively developed and growing every month. Resources made available via the store are well suited for analysis of cancer, as well as Mendelian and complex diseases. RESULTS: OpenCRAVAT offers both a command-line utility and a dynamic graphical user interface, allowing users to install with a single command, easily download tools from an extensive resource catalog, create customized pipelines, and explore results in a richly detailed viewing environment. We present several case studies to illustrate the design of custom workflows to prioritize genes and variants. CONCLUSION: OpenCRAVAT is distinguished from similar tools by its capabilities to access and integrate an unprecedented amount of diverse data resources and computational prediction methods, which span germline, somatic, common, rare, coding, and noncoding variants.
Subject(s)
Computational Biology/organization & administration, Genetic Databases/standards, Mutation, Neoplasm Proteins/genetics, Neoplasms/genetics, Software/standards, Humans, Neoplasms/diagnosis, Neoplasms/drug therapy, User-Computer Interface, Workflow
ABSTRACT
Background: Basic and clinical scientific research at the University of South Florida (USF) have intersected to support a multi-faceted approach around a common focus on rare iron-related diseases. We proposed a modified version of the National Center for Biotechnology Information's (NCBI) Hackathon-model to take full advantage of local expertise in building "Iron Hack", a rare disease-focused hackathon. As the collaborative, problem-solving nature of hackathons tends to attract participants of highly-diverse backgrounds, organizers facilitated a symposium on rare iron-related diseases, specifically porphyrias and Friedreich's ataxia, pitched at general audiences. Methods: The hackathon was structured to begin each day with presentations by expert clinicians, genetic counselors, researchers focused on molecular and cellular biology, public health/global health, genetics/genomics, computational biology, bioinformatics, biomolecular science, bioengineering, and computer science, as well as guest speakers from the American Porphyria Foundation (APF) and Friedreich's Ataxia Research Alliance (FARA) to inform participants as to the human impact of these diseases. Results: As a result of this hackathon, we developed resources that are relevant not only to these specific disease-models, but also to other rare diseases and general bioinformatics problems. Within two and a half days, "Iron Hack" participants successfully built collaborative projects to visualize data, build databases, improve rare disease diagnosis, and study rare-disease inheritance. Conclusions: The purpose of this manuscript is to demonstrate the utility of a hackathon model to generate prototypes of generalizable tools for a given disease and train clinicians and data scientists to interact more effectively.
Subject(s)
Friedreich Ataxia, Porphyrias, Factual Databases, Humans, Iron, Rare Diseases, United States
ABSTRACT
A wealth of viral data sits untapped in publicly available metagenomic data sets when it might be extracted to create a usable index for the virological research community. We hypothesized that work of this complexity and scale could be done in a hackathon setting. Ten teams comprising over 40 participants from six countries assembled to create a crowd-sourced set of analysis and processing pipelines for a complex biological data set in a three-day event on the San Diego State University campus starting 9 January 2019. Prior to the hackathon, 141,676 metagenomic data sets from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) were pre-assembled into contiguous assemblies (contigs) by NCBI staff. During the hackathon, a subset consisting of 2953 SRA data sets (approximately 55 million contigs) was selected, which were further filtered for a minimal length of 1 kb. This resulted in 4.2 million (Mio) contigs, which were aligned using BLAST against all known virus genomes, phylogenetically clustered and assigned metadata. Out of the 4.2 Mio contigs, 360,000 contigs were labeled with domains and an additional subset containing 4400 contigs was screened for virus or virus-like genes. The work yielded valuable insights into both SRA data and the cloud infrastructure required to support such efforts, revealing analysis bottlenecks and possible workarounds thereof. Mainly: (i) conservative assemblies of SRA data improve initial analysis steps; (ii) existing bioinformatic software with weak multithreading/multicore support can be elevated by wrapper scripts to use all cores within a computing node; (iii) redesigning existing bioinformatic algorithms for a cloud infrastructure facilitates their use by a wider audience; and (iv) a cloud infrastructure allows a diverse group of researchers to collaborate effectively. The scientific findings will be extended during a follow-up event. Here, we present the applied workflows, initial results, and lessons learned from the hackathon.
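The length-filtering step mentioned above (keeping only contigs of at least 1 kb) can be sketched in a few lines. This is a generic illustration rather than the hackathon's actual pipeline; the contig names and sequences are made up, and the FASTA parser is deliberately minimal.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def filter_min_length(records, min_len=1000):
    """Keep records whose sequence is at least min_len bases long (1 kb by default)."""
    return [(h, s) for h, s in records if len(s) >= min_len]

# Made-up contigs: 999 bases, exactly 1000 bases, and 1500 bases split over two lines.
fasta_lines = [
    ">contig_a", "A" * 999,
    ">contig_b", "ACGT" * 250,
    ">contig_c", "G" * 800, "C" * 700,
]

kept = filter_min_length(read_fasta(fasta_lines))
print([header for header, _ in kept])
```

For real data the same functions can be fed an open file handle instead of a list, since `read_fasta` only iterates over lines.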
Subject(s)
Cloud Computing/standards, Viral Genome, Metagenome, Metagenomics/methods, Big Data, Human Genome, Humans, Metagenomics/standards, Software
ABSTRACT
BACKGROUND: Anthozoa, Endocnidozoa, and Medusozoa are the 3 major clades of Cnidaria. Medusozoa is further divided into 4 clades, Hydrozoa, Staurozoa, Cubozoa, and Scyphozoa-the latter 3 lineages make up the clade Acraspeda. Acraspeda encompasses extraordinary diversity in terms of life history, numerous nuisance species, taxa with complex eyes rivaling other animals, and some of the most venomous organisms on the planet. Genomes have recently become available within Scyphozoa and Cubozoa, but there are currently no published genomes within Staurozoa and Cubozoa. FINDINGS: Here we present 3 new draft genomes of Calvadosia cruxmelitensis (Staurozoa), Alatina alata (Cubozoa), and Cassiopea xamachana (Scyphozoa) for which we provide a preliminary orthology analysis that includes an inventory of their respective venom-related genes. Additionally, we identify synteny between POU and Hox genes that had previously been reported in a hydrozoan, suggesting this linkage is highly conserved, possibly dating back to at least the last common ancestor of Medusozoa, yet likely independent of vertebrate POU-Hox linkages. CONCLUSIONS: These draft genomes provide a valuable resource for studying the evolutionary history and biology of these extraordinary animals, and for identifying genomic features underlying venom, vision, and life history traits in Acraspeda.
Subject(s)
Cnidaria/genetics, Genome, Animals, Cnidaria/classification, Cnidarian Venoms/genetics, Cnidarian Venoms/metabolism, Phylogeny, Synteny, Transcriptome
ABSTRACT
BACKGROUND: Next-generation sequencing technologies can produce tens of millions of reads, often paired-end, from transcripts or genomes. But few programs can align RNA on the genome and accurately discover introns, especially with long reads. We introduce Magic-BLAST, a new aligner based on ideas from the Magic pipeline. RESULTS: Magic-BLAST uses innovative techniques that include the optimization of a spliced alignment score and selective masking during seed selection. We evaluate the performance of Magic-BLAST to accurately map short or long sequences and its ability to discover introns on real RNA-seq data sets from PacBio, Roche and Illumina runs, and on six benchmarks, and compare it to other popular aligners. Additionally, we look at alignments of human idealized RefSeq mRNA sequences perfectly matching the genome. CONCLUSIONS: We show that Magic-BLAST is the best at intron discovery over a wide range of conditions and the best at mapping reads longer than 250 bases, from any platform. It is versatile and robust to high levels of mismatches or extreme base composition, and reasonably fast. It can align reads to a BLAST database or a FASTA file. It can accept a FASTQ file as input or automatically retrieve an accession from the SRA repository at the NCBI.
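Intron discovery from spliced alignments can be illustrated with the SAM convention, where the CIGAR operation `N` marks a skipped region of the reference. The sketch below assumes alignments are available as SAM records and extracts intron coordinates from a CIGAR string; it shows only how introns are read off an existing alignment and says nothing about how Magic-BLAST computes its spliced alignments. The function name and the example alignments are hypothetical.

```python
import re

# CIGAR operations defined by the SAM specification.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference sequence.
_REF_CONSUMING = set("MDN=X")

def introns_from_cigar(pos, cigar):
    """Return 1-based inclusive (start, end) reference intervals for each N operation.

    pos is the 1-based leftmost mapping position, as in the SAM POS field.
    """
    introns = []
    ref = pos
    for length, op in _CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref, ref + length - 1))
        if op in _REF_CONSUMING:
            ref += length
    return introns

# Hypothetical alignments.
print(introns_from_cigar(100, "50M200N50M"))
print(introns_from_cigar(1000, "10S40M100N30M2I20M300N10M"))
```

Soft clips and insertions do not move the reference coordinate, which is why they are skipped when tracking where each intron starts.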
Subject(s)
RNA/genetics, Sequence Alignment, RNA Sequence Analysis/methods, Software, Algorithms, Base Sequence, Nucleic Acid Databases, Humans, Introns/genetics, ROC Curve, Time Factors
ABSTRACT
BACKGROUND: During the last decade, plant biotechnological laboratories have sparked a monumental revolution with the rapid development of next-generation sequencing technologies at affordable prices. Soon, these sequencing technologies and assembling of whole genomes will extend beyond the plant computational biologists and become commonplace within the plant biology disciplines. The current availability of large-scale genomic resources for non-traditional plant model systems (the so-called 'orphan crops') is enabling the construction of high-density integrated physical and genetic linkage maps with potential applications in plant breeding. The newly available fully sequenced plant genomes represent an incredible opportunity for comparative analyses that may reveal new aspects of genome biology and evolution. The analysis of the expansion and evolution of gene families across species is a common approach to infer biological functions. To date, the extent and role of gene families in plants have only been partially addressed and many gene families remain to be investigated. Manual identification of gene families is highly time-consuming and laborious, requiring an iterative process of manual and computational analysis to identify members of a given family, typically combining numerous BLAST searches and manually cleaning data. Due to the increasing abundance of genome sequences and the agronomical interest in plant gene families, the field needs a clear, automated annotation tool. RESULTS: Here, we present the geneHummus package, an R-based pipeline for the identification and characterization of plant gene families. The impact of this pipeline comes from a reduction in hands-on annotation time combined with high specificity and sensitivity in extracting only proteins from the RefSeq database and providing the conserved domain architectures based on SPARCLE. As a case study we focused on the auxin response factor (ARF) gene family in Cicer arietinum (chickpea) and other legumes. CONCLUSION: We anticipate that our pipeline should be suitable for any taxonomic plant family, and likely other gene families, vastly improving the speed and ease of genomic data processing.
Subject(s)
Fabaceae/genetics, Plant Genes, Multigene Family, Software, Cicer/genetics, Phylogeny, Plant Proteins/genetics, Cell Surface Receptors/genetics, Transcriptome
ABSTRACT
In March 2019, 45 scientists and software engineers from around the world converged at the University of California, Santa Cruz for the first pangenomics codeathon. The purpose of the meeting was to propose technical specifications and standards for a usable human pangenome as well as to build relevant tools for genome graph infrastructures. During the meeting, the group held several intense and productive discussions covering a diverse set of topics, including advantages of graph genomes over a linear reference representation, design of new methods that can leverage graph-based data structures, and novel visualization and annotation approaches for pangenomes. Additionally, the participants self-organized into teams that worked intensely over a three-day period to build a set of pipelines and tools for specific pangenomic applications. A summary of the questions raised and the tools developed is reported in this manuscript.
ABSTRACT
Ever return from a meeting feeling elated by all those exciting talks, yet unsure how all those presented glamorous and/or exciting tools can be useful in your research? Or do you have a great piece of software you want to share, yet only a handful of people visited your poster? We have all been there, and that is why we organized the Matchmaking for Computational and Experimental Biologists Session at the latest ISCB/GLBIO'2017 meeting in Chicago (May 15-17, 2017). The session exemplifies a novel approach, mimicking "matchmaking", to encouraging communication, making connections and fostering collaborations between computational and non-computational biologists. More specifically, the session facilitates face-to-face communication between researchers with similar or differing research interests, which we feel are critical for promoting productive discussions and collaborations. To accomplish this, three short scheduled talks were delivered, focusing on RNA-seq, integration of clinical and genomic data, and chromatin accessibility analyses. Next, small-table developer-led discussions, modeled after speed-dating, enabled each developer (including the speakers) to introduce a specific tool and to engage potential users or other developers around the table. Notably, we asked the audience whether any other tool developers would want to showcase their tool and we thus added four developers as moderators of these small-table discussions. Given the positive feedback from the tool developers, we feel that this type of session is an effective approach for promoting valuable scientific discussion, and is particularly helpful in the context of conferences where the number of participants and activities could hamper such interactions.
ABSTRACT
Quantification of gene expression and characterization of gene transcript structures are central problems in molecular biology. RNA sequencing (RNA-Seq) and chromatin immunoprecipitation sequencing (ChIP-Seq) are important methods, but can be cumbersome and difficult for beginners to learn. To teach interested students and scientists how to analyze RNA-Seq and ChIP-Seq data, we present a start-to-finish tutorial for analyzing RNA-Seq and ChIP-Seq data: SeqAcademy (source code: https://github.com/NCBI-Hackathons/seqacademy, webpage: http://www.seqacademy.org/). This user-friendly pipeline, fully written in Jupyter Notebook, emphasizes the use of publicly available RNA-Seq and ChIP-Seq data and strings together popular tools that bridge the gap between raw sequencing reads and biological insight. We demonstrate practical and conceptual considerations for various RNA-Seq and ChIP-Seq analysis steps with a biological use case: a previously published yeast experiment. This work complements existing sophisticated RNA-Seq and ChIP-Seq pipelines designed for advanced users by gently introducing the critical components of RNA-Seq and ChIP-Seq analysis to the novice bioinformatician. In conclusion, this well-documented pipeline will introduce state-of-the-art RNA-Seq and ChIP-Seq analysis tools to beginning bioinformaticians and help facilitate the analysis of the burgeoning amounts of public RNA-Seq and ChIP-Seq data.