ABSTRACT
SUMMARY: Variant Call Format (VCF), the prevailing representation for germline genotypes in population sequencing, suffers rapid size growth as larger cohorts are sequenced and more rare variants are discovered. We present Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduction and run-length encoding, delivering >10× size reduction for modern studies with practically minimal information loss. spVCF interoperates with VCF efficiently, including tabix-based random access. We demonstrate its effectiveness with the DiscovEHR and UK Biobank whole-exome sequencing cohorts. AVAILABILITY AND IMPLEMENTATION: Apache-licensed reference implementation: github.com/mlin/spVCF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
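The run-length encoding idea mentioned in the summary above can be illustrated with a toy sketch. This is a generic run-length encoder over one row of genotype cells, not the actual spVCF encoding (which the abstract does not specify); the function names and the sample row are hypothetical.

```python
from itertools import groupby


def rle_encode(cells):
    """Collapse each run of identical cells into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(cells)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list of cells."""
    cells = []
    for value, count in pairs:
        cells.extend([value] * count)
    return cells


# Hypothetical row: most samples are homozygous reference, one carries a rare variant.
row = ["0/0"] * 6 + ["0/1"] + ["0/0"] * 3
encoded = rle_encode(row)
print(encoded)  # [('0/0', 6), ('0/1', 1), ('0/0', 3)]
```

Because rare variants leave most cells in a row identical, long runs collapse into a few pairs, which is the intuition behind the size reduction; decoding restores the row exactly.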
Subjects
Genomics , Software , Base Sequence , Genotype , Germ Cells
ABSTRACT
MOTIVATION: Population-scale sequenced cohorts are foundational resources for genetic analyses, but processing raw reads into analysis-ready cohort-level variants remains challenging. RESULTS: We introduce an open-source cohort-calling method that uses the highly accurate caller DeepVariant and scalable merging tool GLnexus. Using callset quality metrics based on variant recall and precision in benchmark samples and Mendelian consistency in father-mother-child trios, we optimize the method across a range of cohort sizes, sequencing methods and sequencing depths. The resulting callsets show consistent quality improvements over those generated using existing best practices with reduced cost. We further evaluate our pipeline in the deeply sequenced 1000 Genomes Project (1KGP) samples and show superior callset quality metrics and imputation reference panel performance compared to an independently generated GATK Best Practices pipeline. AVAILABILITY AND IMPLEMENTATION: We publicly release the 1KGP individual-level variant calls and cohort callset (https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP) to foster additional development and evaluation of cohort merging methods as well as broad studies of genetic variation. Both DeepVariant (https://github.com/google/deepvariant) and GLnexus (https://github.com/dnanexus-rnd/GLnexus) are open-source, and the optimized GLnexus setup discovered in this study is also integrated into GLnexus public releases v1.2.2 and later. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ABSTRACT
OBJECTIVES: To determine patient and procedural risk factors for major complications in ultrasound (US)-guided random renal core biopsy. METHODS: Random renal biopsies performed by radiologists in the US department at a single institution between 2014 and 2018 were retrospectively reviewed. The patient's age, sex, race, and estimated glomerular filtration rate (eGFR) were recorded. The biopsy approach, needle gauge, length of cores, number of throws, and presence of a color flow tract were recorded. Outcome data included minor and major complications. Associations between variables were tested with χ2 analyses and univariable/multivariable logistic regression models. RESULTS: A total of 231 biopsies (167 native and 64 allografts) were reviewed. There was no significant difference in the sex, age, race, or eGFR between native and allograft groups. The overall rate for any complication was 18.2%, with a 4.3% rate of major complications, which was significantly greater in native compared to allograft biopsies (6% versus 0%; P = .045). A risk analysis in native biopsies only showed that major complications were significantly associated with a low eGFR such that patients with stage 4 or 5 kidney disease had higher odds of complications (odds ratio [95% confidence interval]: stage 4, 9.405 [1.995-44.338]; P = .0393; stage 5, 10.749 [2.218-52.080]; P = .0203) than patients with normal function (eGFR >60 mL/min). The presence of a color flow tract portended a 10.7 times greater risk of having any complication (95% confidence interval, 4.595-24.994; P < .001). Other procedural factors were not significantly associated with complications. CONCLUSIONS: There is an increased risk of major complications in US-guided random native kidney biopsy in patients with a low eGFR (<30 mL/min) and a patent color flow tract in the immediate postbiopsy setting.
Subjects
Image-Guided Biopsy , Interventional Ultrasonography , Biopsy , Large-Core Needle Biopsy , Humans , Kidney/diagnostic imaging , Retrospective Studies
ABSTRACT
OBJECTIVES: Image-guided tissue sampling in the workup of suspected lymphoma can be performed by core needle biopsy (CNB) or CNB with fine-needle aspiration (FNA). We compared the yield of clinically actionable diagnoses between these methods of tissue sampling. METHODS: All ultrasound-guided percutaneous peripheral lymph node biopsies from 2010 to 2017 at a single institution were retrospectively reviewed for biopsy type (CNB versus CNB + FNA), prior diagnosis of lymphoma, size of the target lymph node, number of cores, length of core specimens, and pathologic diagnosis. Lymphoma and lymphoid tissue were included; metastatic disease and nonlymphoid tissue were excluded. An oncologist specializing in lymphoma independently determined whether an actionable diagnosis could be made with the pathologic results in the context of the patient's medical record. χ2 analyses and univariable/multivariable logistic regression models were used for statistical analyses. RESULTS: Of 578 lymph node biopsies, 306 (53%) had a prior diagnosis of lymphoma; 273 (47%) were CNB, and 305 (53%) were CNB + FNA. There was no significant difference between biopsy types (CNB versus CNB + FNA) in the number of cores (median [25th, 75th percentiles], 3 [3, 4] versus 4 [3, 4]; P = .47) or total length of tissue (4.1 [2.5, 6.1] versus 3.7 [2.3, 6] cm; P = .09). There was no difference in obtaining an actionable diagnosis between biopsy types after controlling for a known history of lymphoma (P = .271) or after controlling for the number of core specimens (P = .826). CONCLUSIONS: In cases of suspected lymphoma, CNB without FNA was sufficient to obtain an actionable diagnosis.
Subjects
Lymph Nodes/diagnostic imaging , Lymph Nodes/pathology , Lymphoma/diagnostic imaging , Lymphoma/pathology , Interventional Ultrasonography/methods , Adult , Aged , Aged, 80 and over , Fine-Needle Biopsy , Large-Core Needle Biopsy , Female , Humans , Image-Guided Biopsy/methods , Male , Middle Aged , Retrospective Studies , Young Adult
ABSTRACT
Translational stop codon readthrough emerged as a major regulatory mechanism affecting hundreds of genes in animal genomes, based on recent comparative genomics and ribosomal profiling evidence, but its evolutionary properties remain unknown. Here, we leverage comparative genomic evidence across 21 Anopheles mosquitoes to systematically annotate readthrough genes in the malaria vector Anopheles gambiae, and to provide the first study of abundant readthrough evolution, by comparison with 20 Drosophila species. Using improved comparative genomics methods for detecting readthrough, we identify evolutionary signatures of conserved, functional readthrough of 353 stop codons in the malaria vector, Anopheles gambiae, and of 51 additional Drosophila melanogaster stop codons, including several cases of double and triple readthrough and of readthrough of two adjacent stop codons. We find that most differences between the readthrough repertoires of the two species arose from readthrough gain or loss in existing genes, rather than birth of new genes or gene death; that readthrough-associated RNA structures are sometimes gained or lost while readthrough persists; that readthrough is more likely to be lost at TAA and TAG stop codons; and that readthrough is under continued purifying evolutionary selection in mosquito, based on population genetic evidence. We also determine readthrough-associated gene properties that predate readthrough, and identify differences in the characteristic properties of readthrough genes between clades. We estimate more than 600 functional readthrough stop codons in mosquito and 900 in fruit fly, provide evidence of readthrough control of peroxisomal targeting, and refine the phylogenetic extent of abundant readthrough as following divergence from centipede.
Subjects
Anopheles/genetics , Anopheles/metabolism , Terminator Codon , Translational Peptide Chain Termination , Animals , Biological Evolution , Codon , Drosophila melanogaster , Molecular Evolution , Genomics , Open Reading Frames , Phylogeny , Protein Biosynthesis , Ribosomes/genetics , Ribosomes/metabolism
ABSTRACT
The comparison of related genomes has emerged as a powerful lens for genome interpretation. Here we report the sequencing and comparative analysis of 29 eutherian genomes. We confirm that at least 5.5% of the human genome has undergone purifying selection, and locate constrained elements covering ~4.2% of the genome. We use evolutionary signatures and comparisons with experimental data sets to suggest candidate functions for ~60% of constrained bases. These elements reveal a small number of new coding exons, candidate stop codon readthrough events and over 10,000 regions of overlapping synonymous constraint within protein-coding exons. We find 220 candidate RNA structural families, and nearly a million elements overlapping potential promoter, enhancer and insulator regions. We report specific amino acid residues that have undergone positive selection, 280,000 non-coding elements exapted from mobile elements and more than 1,000 primate- and human-accelerated elements. Overlap with disease-associated variants indicates that our findings will be relevant for studies of human biology, health and disease.
Subjects
Molecular Evolution , Human Genome/genetics , Genome/genetics , Mammals/genetics , Animals , Disease , Exons/genetics , Genomics , Health , Humans , Molecular Sequence Annotation , Phylogeny , RNA/classification , RNA/genetics , Genetic Selection/genetics , Sequence Alignment , DNA Sequence Analysis
ABSTRACT
Long noncoding RNAs (lncRNAs) comprise a diverse class of transcripts that structurally resemble mRNAs but do not encode proteins. Recent genome-wide studies in humans and the mouse have annotated lncRNAs expressed in cell lines and adult tissues, but a systematic analysis of lncRNAs expressed during vertebrate embryogenesis has been elusive. To identify lncRNAs with potential functions in vertebrate embryogenesis, we performed a time-series of RNA-seq experiments at eight stages during early zebrafish development. We reconstructed 56,535 high-confidence transcripts in 28,912 loci, recovering the vast majority of expressed RefSeq transcripts while identifying thousands of novel isoforms and expressed loci. We defined a stringent set of 1133 noncoding multi-exonic transcripts expressed during embryogenesis. These include long intergenic ncRNAs (lincRNAs), intronic overlapping lncRNAs, exonic antisense overlapping lncRNAs, and precursors for small RNAs (sRNAs). Zebrafish lncRNAs share many of the characteristics of their mammalian counterparts: relatively short length, low exon number, low expression, and conservation levels comparable to that of introns. Subsets of lncRNAs carry chromatin signatures characteristic of genes with developmental functions. The temporal expression profile of lncRNAs revealed two novel properties: lncRNAs are expressed in narrower time windows than are protein-coding genes and are specifically enriched in early-stage embryos. In addition, several lncRNAs show tissue-specific expression and distinct subcellular localization patterns. Integrative computational analyses associated individual lncRNAs with specific pathways and functions, ranging from cell cycle regulation to morphogenesis. Our study provides the first systematic identification of lncRNAs in a vertebrate embryo and forms the foundation for future genetic, genomic, and evolutionary studies.
Subjects
Embryonic Development/genetics , Untranslated RNA/genetics , Zebrafish/embryology , Zebrafish/genetics , Animals , Chromatin , Cluster Analysis , Computational Biology/methods , Gene Expression , Gene Expression Profiling , Developmental Gene Expression Regulation , Genomics , Mice , Open Reading Frames , Organ Specificity/genetics , Genetic Transcription
ABSTRACT
There is growing recognition that mammalian cells produce many thousands of large intergenic transcripts. However, the functional significance of these transcripts has been particularly controversial. Although there are some well-characterized examples, most (>95%) show little evidence of evolutionary conservation and have been suggested to represent transcriptional noise. Here we report a new approach to identifying large non-coding RNAs using chromatin-state maps to discover discrete transcriptional units intervening known protein-coding loci. Our approach identified approximately 1,600 large multi-exonic RNAs across four mouse cell types. In sharp contrast to previous collections, these large intervening non-coding RNAs (lincRNAs) show strong purifying selection in their genomic loci, exonic sequences and promoter regions, with greater than 95% showing clear evolutionary conservation. We also developed a functional genomics approach that assigns putative functions to each lincRNA, demonstrating a diverse range of roles for lincRNAs in processes from embryonic stem cell pluripotency to cell proliferation. We obtained independent functional validation for the predictions for over 100 lincRNAs, using cell-based assays. In particular, we demonstrate that specific lincRNAs are transcriptionally regulated by key transcription factors in these processes such as p53, NFkappaB, Sox2, Oct4 (also known as Pou5f1) and Nanog. Together, these results define a unique collection of functional lincRNAs that are highly conserved and implicated in diverse biological processes.
Subjects
Chromatin/genetics , Conserved Sequence , Mammals/genetics , RNA/genetics , Animals , Base Sequence , Cultured Cells , Conserved Sequence/genetics , Intergenic DNA , Exons/genetics , Mice , Genetic Promoter Regions/genetics , Reproducibility of Results , Transcription Factors/metabolism
ABSTRACT
Candida species are the most common cause of opportunistic fungal infection worldwide. Here we report the genome sequences of six Candida species and compare these and related pathogens and non-pathogens. There are significant expansions of cell wall, secreted and transporter gene families in pathogenic species, suggesting adaptations associated with virulence. Large genomic tracts are homozygous in three diploid species, possibly resulting from recent recombination events. Surprisingly, key components of the mating and meiosis pathways are missing from several species. These include major differences at the mating-type loci (MTL); Lodderomyces elongisporus lacks MTL, and components of the a1/2 cell identity determinant were lost in other species, raising questions about how mating and cell types are controlled. Analysis of the CUG leucine-to-serine genetic-code change reveals that 99% of ancestral CUG codons were erased and new ones arose elsewhere. Lastly, we revise the Candida albicans gene catalogue, identifying many new genes.
Subjects
Candida/physiology , Candida/pathogenicity , Molecular Evolution , Fungal Genome/genetics , Reproduction/genetics , Candida/classification , Candida/genetics , Codon/genetics , Conserved Sequence , Diploidy , Fungal Genes/genetics , Meiosis/genetics , Genetic Polymorphism , Saccharomyces/classification , Saccharomyces/genetics , Virulence/genetics
ABSTRACT
OBJECTIVE: To determine the effectiveness of the CT histogram method to characterize indeterminate adrenal nodules above 10 Hounsfield units (HU) on noncontrast CT. MATERIALS AND METHODS: Retrospective review of clinical CT data from January 2005 through 2008 identified 194 indeterminate adrenal nodules (>10 HU on noncontrast CT) in 175 patients. 20 nodules in 18 patients were excluded due to large standard deviation (SD > 30) of HU values. Of the remaining 174 nodules, 131 were classified as benign lipid-poor nodules based on size stability for ≥1 year (104), in- and opposed-phase MRI (17), adrenal washout CT (3), or biopsy (7). 43 were classified as malignant by size increase over a short time (30), avid FDG uptake on PET/CT (15), or biopsy (5). Histogram analysis was performed by drawing a circular region of interest on all adrenal nodules. Mean attenuation, total number of pixels, number of negative pixels, and percentage of negative pixels were recorded for each nodule. RESULTS: At the threshold value of >10% negative pixels, 59/131 benign nodules were correctly characterized, but 1/43 malignant nodules was falsely characterized as benign (sensitivity 45%, specificity 98%, positive predictive value 98%). With a slightly higher threshold value of >15% negative pixels, there were no false benign judgments. 36 nodules had more than 15% negative pixels, all of which were benign (sensitivity 27%, specificity 100%, positive predictive value 100%). In the subgroup of benign nodules measuring 11-20 HU, 80% and 54% were identified with threshold values of >10% and >15% negative pixels, respectively. CONCLUSION: The CT histogram method with a threshold value of >10% negative pixels can identify many benign adrenal nodules with attenuation values >10 HU on unenhanced CT with extremely high specificity. A threshold of >15% negative pixels can achieve 100% specificity. This method is highly robust provided very "noisy" CT examinations (SD > 30) are eliminated.
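The histogram criterion and the reported accuracy figures above can be checked with a short sketch. It assumes the region of interest is available as a flat list of HU values and treats "benign" as the positive call; the function names and the toy pixel values are hypothetical, while the counts (59/131, 1/43, 36/131, 0/43) come from the abstract.

```python
def percent_negative(hu_values):
    """Percentage of pixels in the region of interest with attenuation below 0 HU."""
    negatives = sum(1 for hu in hu_values if hu < 0)
    return 100.0 * negatives / len(hu_values)


def is_benign_call(hu_values, threshold_pct=10.0):
    """Call a nodule benign when its share of negative pixels exceeds the threshold."""
    return percent_negative(hu_values) > threshold_pct


def summarize(benign_called, total_benign, malignant_called_benign, total_malignant):
    """Return (sensitivity, specificity, positive predictive value) for benign calls."""
    sensitivity = benign_called / total_benign
    specificity = (total_malignant - malignant_called_benign) / total_malignant
    ppv = benign_called / (benign_called + malignant_called_benign)
    return sensitivity, specificity, ppv


# Toy region of interest (hypothetical values): 2 of 10 pixels are negative, i.e. 20%.
roi = [-12, -3, 5, 14, 22, 30, 18, 9, 41, 27]
print(percent_negative(roi), is_benign_call(roi), is_benign_call(roi, 15.0))

# Counts reported in the abstract for the >10% and >15% thresholds.
at_10 = [round(100 * x) for x in summarize(59, 131, 1, 43)]
at_15 = [round(100 * x) for x in summarize(36, 131, 0, 43)]
print(at_10, at_15)  # [45, 98, 98] [27, 100, 100]
```

The rounded values reproduce the sensitivity, specificity, and positive predictive value stated in the results for both thresholds.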
Subjects
Adrenal Gland Neoplasms/diagnostic imaging , Adrenal Glands/diagnostic imaging , Computer-Assisted Image Interpretation/methods , X-Ray Computed Tomography/methods , Differential Diagnosis , Female , Humans , Male , Middle Aged , Reproducibility of Results , Retrospective Studies , Sensitivity and Specificity
ABSTRACT
The degeneracy of the genetic code allows protein-coding DNA and RNA sequences to simultaneously encode additional, overlapping functional elements. A sequence in which both protein-coding and additional overlapping functions have evolved under purifying selection should show increased evolutionary conservation compared to typical protein-coding genes, especially at synonymous sites. In this study, we use genome alignments of 29 placental mammals to systematically locate short regions within human ORFs that show conspicuously low estimated rates of synonymous substitution across these species. The 29-species alignment provides statistical power to locate more than 10,000 such regions with resolution down to nine-codon windows, which are found within more than a quarter of all human protein-coding genes and contain ~2% of their synonymous sites. We collect numerous lines of evidence that the observed synonymous constraint in these regions reflects selection on overlapping functional elements including splicing regulatory elements, dual-coding genes, RNA secondary structures, microRNA target sites, and developmental enhancers. Our results show that overlapping functional elements are common in mammalian genes, despite the vast genomic landscape.
Subjects
Genome , Mammals/genetics , Open Reading Frames/genetics , Genetic Selection , Animals , Base Composition , Base Sequence , Codon , Initiator Codon , Computational Biology , Conserved Sequence , Genetic Enhancer Elements , Exons , Gene Order , BRCA1 Genes , Homeodomain Proteins/genetics , Humans , MicroRNAs/metabolism , Molecular Sequence Data , Mutation Rate , Nucleic Acid Conformation , Nucleosomes/metabolism , Translational Peptide Chain Initiation , RNA Splicing , Sequence Alignment , Genetic Transcription
ABSTRACT
While translational stop codon readthrough is often used by viral genomes, it has been observed for only a handful of eukaryotic genes. We previously used comparative genomics evidence to recognize protein-coding regions in 12 species of Drosophila and showed that for 149 genes, the open reading frame following the stop codon has a protein-coding conservation signature, hinting that stop codon readthrough might be common in Drosophila. We return to this observation armed with deep RNA sequence data from the modENCODE project, an improved higher-resolution comparative genomics metric for detecting protein-coding regions, comparative sequence information from additional species, and directed experimental evidence. We report an expanded set of 283 readthrough candidates, including 16 double-readthrough candidates; these were manually curated to rule out alternatives such as A-to-I editing, alternative splicing, dicistronic translation, and selenocysteine incorporation. We report experimental evidence of translation using GFP tagging and mass spectrometry for several readthrough regions. We find that the set of readthrough candidates differs from other genes in length, composition, conservation, stop codon context, and in some cases, conserved stem-loops, providing clues about readthrough regulation and potential mechanisms. Lastly, we expand our studies beyond Drosophila and find evidence of abundant readthrough in several other insect species and one crustacean, and several readthrough candidates in nematode and human, suggesting that functionally important translational stop codon readthrough is significantly more prevalent in Metazoa than previously recognized.
Subjects
Terminator Codon/physiology , Insect Genes/physiology , Open Reading Frames/physiology , Protein Biosynthesis/physiology , Animals , Drosophila Proteins/biosynthesis , Drosophila Proteins/genetics , Drosophila melanogaster , Humans
ABSTRACT
Sequencing of multiple related species followed by comparative genomics analysis constitutes a powerful approach for the systematic understanding of any genome. Here, we use the genomes of 12 Drosophila species for the de novo discovery of functional elements in the fly. Each type of functional element shows characteristic patterns of change, or 'evolutionary signatures', dictated by its precise selective constraints. Such signatures enable recognition of new protein-coding genes and exons, spurious and incorrect gene annotations, and numerous unusual gene structures, including abundant stop-codon readthrough. Similarly, we predict non-protein-coding RNA genes and structures, and new microRNA (miRNA) genes. We provide evidence of miRNA processing and functionality from both hairpin arms and both DNA strands. We identify several classes of pre- and post-transcriptional regulatory motifs, and predict individual motif instances with high confidence. We also study how discovery power scales with the divergence and number of species compared, and we provide general guidelines for comparative studies.
Subjects
Drosophila/classification , Drosophila/genetics , Molecular Evolution , Insect Genome/genetics , Genomics , Animals , Base Sequence , Binding Sites , Conserved Sequence , Drosophila Proteins/genetics , Exons/genetics , Gene Expression Regulation/genetics , Insect Genes/genetics , MicroRNAs/genetics , Molecular Sequence Data , Organ Specificity , Phylogeny , Untranslated Regions/genetics
ABSTRACT
PURPOSE: To determine which measurement of donor renal size on computed tomographic (CT) angiograms has the greatest correlation with renal function preoperatively in the donor and postoperatively in the transplant recipient. MATERIALS AND METHODS: Informed consent was waived for this retrospective HIPAA-compliant study approved by the institutional review board. Renal length, total volume, and cortical volume were measured on renal donor CT angiograms in 111 patients. Preoperative serum creatinine values for donors and postoperative creatinine values for recipients at hospital discharge and 6, 12, 24, and 36 months after transplant were collected, and estimated glomerular filtration rate (eGFR) was calculated. Correlation coefficients with 95% confidence intervals (CIs) were obtained for renal measures and donor eGFR and for renal measures adjusted to recipient body habitus and posttransplant creatinine level in the recipient. Thresholds were set for adjusted length and volumes, and the odds ratio (OR) for creatinine level less than 1.5 mg/dL at 36 months was calculated. RESULTS: Renal volumes and length were correlated with donor eGFR (r = 0.58 [95% CI: 0.44, 0.69] for cortical volume, 0.56 [95% CI: 0.42, 0.68] for total volume, and 0.43 [95% CI: 0.27, 0.57] for renal length). All three measures, adjusted to recipient body habitus, were correlated with recipient renal function from discharge (r = -0.41 to -0.43) up to 36 months after transplantation (r = -0.33 to -0.41). By using a threshold of 1.5 for cortical volume to recipient weight, 2.25 for total volume to recipient weight, and 0.175 for renal length to recipient weight, the odds of creatinine level greater than 1.5 mg/dL were four times as great for smaller kidney-to-recipient weight ratios, a statistically significant pattern for cortical volume (OR, 4.07; 95% CI: 1.10, 15.09) but not total volume (OR, 4.24; 95% CI: 0.90, 20.01) or renal length (OR, 4.08; 95% CI: 0.48, 34.29). CONCLUSION: Renal length and volumes correlated with recipient renal function up to 36 months after transplant. A low ratio of cortical volume to recipient weight was associated with diminished renal function at 36 months after transplant.
Subjects
Angiography/methods , Kidney/diagnostic imaging , Liver Transplantation , X-Ray Computed Tomography/methods , Adolescent , Adult , Aged , Biomarkers/blood , Confidence Intervals , Creatinine/blood , Female , Glomerular Filtration Rate , Humans , Kidney Function Tests , Male , Middle Aged , Nephrectomy , Organ Size , Computer-Assisted Radiographic Image Interpretation , Reproducibility of Results , Retrospective Studies
ABSTRACT
MOTIVATION: As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multispecies nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models. RESULTS: We show that PhyloCSF's classification performance in 12-species Drosophila genome alignments exceeds all other methods we compared in a previous study. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE, and as interest grows in long non-coding RNAs, often initially recognized by their lack of protein coding potential rather than conserved RNA secondary structures. AVAILABILITY AND IMPLEMENTATION: The Objective Caml source code and executables for GNU/Linux and Mac OS X are freely available at http://compbio.mit.edu/PhyloCSF CONTACT: mlin@mit.edu; manoli@mit.edu.
Subjects
Drosophila melanogaster/genetics , Genomics/methods , Open Reading Frames , Sequence Alignment/methods , Animals , Base Sequence , Drosophila/classification , Drosophila/genetics , Gene Expression Profiling , Mammals/genetics , Schizosaccharomyces/genetics
ABSTRACT
Although the Human Genome Project was completed 4 years ago, the catalog of human protein-coding genes remains a matter of controversy. Current catalogs list a total of approximately 24,500 putative protein-coding genes. It is broadly suspected that a large fraction of these entries are functionally meaningless ORFs present by chance in RNA transcripts, because they show no evidence of evolutionary conservation with mouse or dog. However, there is currently no scientific justification for excluding ORFs simply because they fail to show evolutionary conservation: the alternative hypothesis is that most of these ORFs are actually valid human genes that reflect gene innovation in the primate lineage or gene loss in the other lineages. Here, we reject this hypothesis by carefully analyzing the nonconserved ORFs, specifically their properties in other primates. We show that the vast majority of these ORFs are random occurrences. The analysis yields, as a by-product, a major revision of the current human catalogs, cutting the number of protein-coding genes to approximately 20,500. Specifically, it suggests that nonconserved ORFs should be added to the human gene catalog only if there is clear evidence of an encoded protein. It also provides a principled methodology for evaluating future proposed additions to the human gene catalog. Finally, the results indicate that there has been relatively little true innovation in mammalian protein-coding genes.
Subjects
Genetic Code , Human Genome/genetics , Genomics , Open Reading Frames/genetics , Proteins/genetics , Animals , Base Sequence , DNA Transposable Elements/genetics , Dogs , Genes/genetics , Humans , Mice , Molecular Sequence Data , Pseudogenes/genetics , DNA Sequence Analysis
ABSTRACT
BACKGROUND: Metagenomic next-generation sequencing (mNGS) has enabled the rapid, unbiased detection and identification of microbes without pathogen-specific reagents, culturing, or a priori knowledge of the microbial landscape. mNGS data analysis requires a series of computationally intensive processing steps to accurately determine the microbial composition of a sample. Existing mNGS data analysis tools typically require bioinformatics expertise and access to local server-class hardware resources. For many research laboratories, this presents an obstacle, especially in resource-limited environments. FINDINGS: We present IDseq, an open source cloud-based metagenomics pipeline and service for global pathogen detection and monitoring (https://idseq.net). The IDseq Portal accepts raw mNGS data, performs host and quality filtration steps, then executes an assembly-based alignment pipeline, which results in the assignment of reads and contigs to taxonomic categories. The taxonomic relative abundances are reported and visualized in an easy-to-use web application to facilitate data interpretation and hypothesis generation. Furthermore, IDseq supports environmental background model generation and automatic internal spike-in control recognition, providing statistics that are critical for data interpretation. IDseq was designed with the specific intent of detecting novel pathogens. Here, we benchmark novel virus detection capability using both synthetically evolved viral sequences and real-world samples, including IDseq analysis of a nasopharyngeal swab sample acquired and processed locally in Cambodia from a tourist from Wuhan, China, infected with the recently emergent SARS-CoV-2. CONCLUSION: The IDseq Portal reduces the barrier to entry for mNGS data analysis and enables bench scientists, clinicians, and bioinformaticians to gain insight from mNGS datasets for both known and novel pathogens.
Subjects
Betacoronavirus/genetics , Cloud Computing , Coronavirus Infections/virology , Metagenome , Metagenomics/methods , Viral Pneumonia/virology , Betacoronavirus/pathogenicity , COVID-19 , Coronavirus Infections/diagnosis , Genetic Databases , High-Throughput Nucleotide Sequencing/methods , Humans , Pandemics , Viral Pneumonia/diagnosis , SARS-CoV-2 , Software
ABSTRACT
Comparative genomics of multiple related species is a powerful methodology for the discovery of functional genomic elements, and its power should increase with the number of species compared. Here, we use 12 Drosophila genomes to study the power of comparative genomics metrics to distinguish between protein-coding and non-coding regions. First, we study the relative power of different comparative metrics and their relationship to single-species metrics. We find that even relatively simple multi-species metrics robustly outperform advanced single-species metrics, especially for shorter exons (≤240 nt), which are common in animal genomes. Moreover, the two capture largely independent features of protein-coding genes, with different sensitivity/specificity trade-offs, such that their combinations lead to even greater discriminatory power. In addition, we study how discovery power scales with the number and phylogenetic distance of the genomes compared. We find that species at a broad range of distances are comparably effective informants for pairwise comparative gene identification, but that these are surpassed by multi-species comparisons at similar evolutionary divergence. In particular, while pairwise discovery power plateaued at larger distances and never outperformed the most advanced single-species metrics, multi-species comparisons continued to benefit even from the most distant species with no apparent saturation. Last, we find that genes in functional categories typically considered fast-evolving can nonetheless be recovered at very high rates using comparative methods. Our results have implications for comparative genomics analyses in any species, including the human.
Subjects
Chromosome Mapping/methods , Drosophila Proteins/genetics , Drosophila/genetics , Genetic Variation/genetics , Open Reading Frames/genetics , Animals , Base Sequence , Discriminant Analysis , Drosophila/classification , Molecular Sequence Data , Reproducibility of Results , Sensitivity and Specificity , DNA Sequence Analysis/methods , Species Specificity
ABSTRACT
In March 2019, 45 scientists and software engineers from around the world converged at the University of California, Santa Cruz for the first pangenomics codeathon. The purpose of the meeting was to propose technical specifications and standards for a usable human pangenome as well as to build relevant tools for genome graph infrastructures. During the meeting, the group held several intense and productive discussions covering a diverse set of topics, including advantages of graph genomes over a linear reference representation, design of new methods that can leverage graph-based data structures, and novel visualization and annotation approaches for pangenomes. Additionally, the participants self-organized into teams that worked intensely over a three-day period to build a set of pipelines and tools for specific pangenomic applications. A summary of the questions raised and the tools developed is reported in this manuscript.
ABSTRACT
PURPOSE OF REVIEW: There is growing concern among the medical community that diagnostic radiation adds to the already increased risk of developing lymphoma that may be inherent in, or related to the treatment of, inflammatory bowel disease. This article describes recent progress in magnetic resonance enterography techniques, and examines the role of MRI in the evaluation of Crohn's disease. RECENT FINDINGS: Recent advancements in magnetic resonance technology and imaging protocol have made MRI of the small bowel feasible. With improved coils, breath-hold sequences and faster acquisition techniques, MRI capably depicts disease location, extent, and complications. Most of the current literature recognizes MRI as an excellent tool in characterizing transmural and extraluminal changes of Crohn's disease. SUMMARY: The lack of ionizing radiation is the main driving force for MRI of Crohn's disease. This advantage is magnified by the relatively young age of Crohn's disease patients. While intrinsic susceptibility to air and motion may limit its use in some patients, MRI shows promising potential as an alternative to computed tomography in monitoring disease progression or response to therapy.