ABSTRACT
Individuals of Pacific ancestry suffer some of the highest rates of health disparities yet remain vastly underrepresented in genomic research, including currently available linear and pangenome references. To begin addressing this, we developed the first Pacific ancestry pangenome reference using 23 individuals with diverse Pacific ancestry. We assembled 46 haploid genomes from these 23 individuals, resulting in highly accurate and contiguous genome assemblies with an average quality value of 55.0 and an average N50 of 40.7 Mb, marking the first de novo assembly of highly accurate Pacific ancestry genomes. We combined these assemblies to create a pangenome reference, which added 30.6 Mb of novel sequence missing from the Human Pangenome Reference Consortium (HPRC) reference. Mapping short reads to this pangenome reduced variant call errors and yielded more true-positive variants compared to the HPRC and T2T-CHM13 references. This Pacific ancestry pangenome reference serves as a resource to enhance genetic analyses for this underserved population.
ABSTRACT
Computer programming is a fundamental tool for life scientists, allowing them to carry out essential research tasks. However, despite various educational efforts, learning to write code can be a challenging endeavor for students and researchers in life-sciences disciplines. Recent advances in artificial intelligence have made it possible to translate human-language prompts to functional code, raising questions about whether these technologies can aid (or replace) life scientists' efforts to write code. Using 184 programming exercises from an introductory-bioinformatics course, we evaluated the extent to which one such tool, OpenAI's ChatGPT, could successfully complete programming tasks. ChatGPT solved 139 (75.5%) of the exercises on its first attempt. For the remaining exercises, we provided natural-language feedback to the model, prompting it to try different approaches. Within 7 or fewer attempts, ChatGPT solved 179 (97.3%) of the exercises. These findings have implications for life-sciences education and research. Instructors may need to adapt their pedagogical approaches and assessment techniques to account for these new capabilities that are available to the general public. For some programming tasks, researchers may be able to work in collaboration with machine-learning models to produce functional code.
ABSTRACT
Much of today's molecular science revolves around next-generation sequencing. Frequently, the first step in analyzing such data is aligning sequencing reads to a reference genome. This step is often taken for granted, but any analysis downstream of the alignment will be affected by the aligner's ability to correctly map sequences. In most cases, for research into chromatin structure and nucleosome positioning, ATAC-seq, ChIP-seq, and MNase-seq experiments use short read lengths. How well aligners manage these reads is critical. Most aligner programs will output mapped reads and unmapped reads. However, from a biological point of view, reads will fall into one of three categories: correctly mapped, incorrectly mapped, and unmapped. While increased sequencing depth can often compensate for unmapped reads, incorrectly and correctly mapped reads appear algorithmically identical but can produce biologically significant alterations in the results. For this reason, we are benchmarking various alignment programs to determine their propensity to incorrectly map short reads. As short-read alignment is an important step in ATAC-seq, ChIP-seq, and MNase-seq experiments, caution should be taken in mapping reads to ensure that the most accurate conclusions can be made from the data generated. Our analysis is intended to help investigators new to the field pick the alignment program best suited for their experimental conditions. In general, the aligners we tested performed well. BWA, Bowtie2, and Chromap were all exceptionally accurate, and we recommend using them. Furthermore, we show that longer read lengths do in fact lead to more accurate mappings.
Subjects
Benchmarking , Chromatin , Chromatin/genetics , Sequence Alignment , Genome , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software , Algorithms
ABSTRACT
Ramp sequences occur when the average translational efficiency of codons near the 5' end of highly expressed genes is significantly lower than the rest of the gene sequence, which counterintuitively increases translational efficiency by decreasing downstream ribosomal collisions. Here, we show that the relative codon adaptiveness within different tissues changes the existence of a ramp sequence without altering the underlying genetic code. We present the first comprehensive analysis of tissue and cell type-specific ramp sequences and report 3108 genes with ramp sequences that change between tissues and cell types, which corresponds with increased gene expression within those tissues and cells. The Ramp Atlas (https://ramps.byu.edu/) allows researchers to query precomputed ramp sequences in 18 388 genes across 62 tissues and 66 cell types and calculate tissue-specific ramp sequences from user-uploaded FASTA files through an intuitive web interface. We used The Ramp Atlas to identify seven SARS-CoV-2 genes and seven human SARS-CoV-2 entry factor genes with tissue-specific ramp sequences that may help explain viral proliferation within those tissues. We anticipate that The Ramp Atlas will facilitate personalized and creative tissue-specific ramp sequence analyses for both human and viral genes that will increase our ability to utilize this often-overlooked regulatory region.
ABSTRACT
Caranx ignobilis, commonly known as giant kingfish or giant trevally, is a large, reef-associated apex predator. It is a prized sportfish, targeted throughout its tropical and subtropical range in the Indian and Pacific Oceans. It has also gained significant interest in aquaculture due to its unusual freshwater tolerance. Here, we present a draft assembly of the estimated 625.92 Mbp nuclear genome of a C. ignobilis individual from Hawaiian waters, which host a genetically distinct population. Our 97.4% BUSCO-complete assembly has a contig NG50 of 7.3 Mbp and a scaffold NG50 of 46.3 Mbp. Twenty-five of the 203 scaffolds contain 90% of the genome. We also present noisy long-read DNA, Hi-C, and RNA-seq datasets; the RNA-seq data cover eight distinct tissues and can help with annotation and studies of freshwater tolerance. Our genome assembly and its supporting data are valuable tools for ecological and comparative genomics studies of kingfishes and other carangoid fishes.
ABSTRACT
The roundjaw bonefish, Albula glossodonta, is the most widespread albulid in the Indo-Pacific and is vulnerable to extinction. We assembled the genome of a roundjaw bonefish from Hawai'i, USA, which will be instrumental for effective transboundary management and conservation when paired with population genomics datasets. The 1.05 gigabase pair (Gbp) contig-level assembly had a 4.75 megabase pair (Mbp) NG50 and a maximum contig length of 28.2 Mbp. Scaffolding yielded an LG50 of 20 and an NG50 of 14.49 Mbp, with the longest scaffold reaching 42.29 Mbp. The genome comprised 6.5% repetitive elements and was annotated with 28.3 K protein-coding genes. We then evaluated population genetic connectivity between six atolls in the Western Indian Ocean with 38,355 SNP loci across 66 A. glossodonta individuals. We discerned shallow population structure and observed genetic homogeneity between atolls in Seychelles and reduced gene flow between Seychelles and Mauritius. The South Equatorial Current might be the limiting mechanism of this reduced gene flow. The genome assembly will be useful for addressing taxonomic uncertainties of bonefishes globally.
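The NG50 and LG50 statistics quoted in the assembly abstracts can be illustrated with a minimal sketch. It assumes the usual definitions: NG50 is the length of the shortest sequence in the smallest set of longest sequences that together cover half of the estimated genome size, and LG50 is the number of sequences in that set. The function name and the toy lengths below are hypothetical and are not taken from any of the studies.

```python
def ng50_lg50(lengths, genome_size):
    """Return (NG50, LG50) for a list of sequence lengths.

    genome_size is the estimated genome size, in the same units as lengths.
    Returns (None, None) if the sequences cover less than half of it.
    """
    half = genome_size / 2
    running = 0
    # Walk from the longest sequence to the shortest, accumulating length.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= half:
            return length, count
    return None, None


# Hypothetical contig lengths (Mbp) and a hypothetical 100 Mbp genome estimate.
toy_lengths = [40, 25, 15, 10, 5]
ng50, lg50 = ng50_lg50(toy_lengths, genome_size=100)
print(f"NG50 = {ng50} Mbp, LG50 = {lg50}")
```

Using the estimated genome size rather than the total assembly length is what distinguishes NG50 from N50, which makes values comparable across assemblies of the same species.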
ABSTRACT
INTRODUCTION: Recent clinical trials are considering inclusion of more than just apolipoprotein E (APOE) ε4 genotype as a way of reducing variability in analysis of outcomes. METHODS: Case-control data were used to compare the capacity of age, sex, and 58 Alzheimer's disease (AD)-associated single nucleotide polymorphisms (SNPs) to predict AD status using several statistical models. Model performance was assessed with Brier scores and tenfold cross-validation. Genotype and sex × age estimates from the best performing model were combined with age and intercept estimates from the general population to develop a personalized genetic risk score, termed age- and sex-adjusted GenoRisk. RESULTS: The elastic net model that included age, age × sex interaction, allelic APOE terms, and 29 additional SNPs performed the best. This model explained an additional 19% of the heritable risk compared to APOE genotype alone and achieved an area under the curve of 0.747. DISCUSSION: GenoRisk could improve the risk assessment of individuals identified for prevention studies.
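The Brier score used above to assess model performance is the mean squared difference between predicted probabilities and observed binary outcomes, with lower values indicating better-calibrated predictions. The following is a minimal sketch of that standard definition; the function name and the toy predictions are hypothetical and do not come from the study.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    if not probabilities:
        raise ValueError("at least one prediction is required")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical predicted probabilities of case status and observed status (1 = case).
predicted = [0.9, 0.2, 0.6, 0.1]
observed = [1, 0, 1, 0]
print(f"Brier score = {brier_score(predicted, observed):.3f}")
```

In a tenfold cross-validation setting, a score like this would be computed on each held-out fold and then averaged across folds.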
ABSTRACT
The bluefin trevally, Caranx melampygus, also known as the bluefin kingfish or bluefin jack, is known for its remarkable, bright-blue fins. This marine teleost is a widely prized sportfish, but few resources have been devoted to the genomics and conservation of this species because it is not targeted by large-scale commercial fisheries. Population declines from recreational and artisanal overfishing have been observed in Hawai'i, USA, resulting in both an interest in aquaculture and concerns about the long-term conservation of this species. Most research to date has been performed in Hawai'i, raising questions about the status of bluefin trevally populations across its Indo-Pacific range. Genomic resources allow for expanded research on stock status, genetic diversity, and population demography. We present a high-quality 711 Mb nuclear genome assembly of a Hawaiian bluefin trevally from noisy long reads with a contig NG50 of 1.2 Mb and longest contig length of 8.9 Mb. As measured by single-copy orthologs, the assembly was 95% complete, and the genome comprises 16.9% repetitive elements. The assembly was annotated with 33.1 K protein-coding genes, 71.4% of which were assigned putative functions, using RNA-seq data from eight tissues from the same individual. This is the first whole-genome assembly published for the carangoid genus Caranx. Using this assembled genome, a multiple sequentially Markovian coalescent model was implemented to assess population demography. Estimates of effective population size suggest population expansion has occurred since the Late Pleistocene. This genome will be a valuable resource for comparative phylogenomic studies of carangoid fishes and will help elucidate demographic history and delineate stock structure for bluefin trevally populations throughout the Indo-Pacific.
Subjects
Conservation of Natural Resources , Perciformes , Animals , Fisheries , Fishes/genetics , Genome , Perciformes/genetics
ABSTRACT
Ramp sequences increase translational speed and accuracy when rare, slowly translated codons are found at the beginnings of genes. Here, the results of the first analysis of ramp sequences in a phylogenetic construct are presented. Ramp sequences were compared from 247 vertebrates (114 mammalian and 133 non-mammalian), where the presence and absence of ramp sequences was analyzed as a binary character in a parsimony and maximum likelihood framework. Additionally, ramp sequences were mapped to the Open Tree of Life synthetic tree to determine the number of parallelisms and reversals that occurred, and those results were compared to random permutations. Parsimony and maximum likelihood analyses of the presence and absence of ramp sequences recovered phylogenies that are highly congruent with established phylogenies. Additionally, 81% of mammalian ramps and 81.2% of other vertebrate ramps had fewer parallelisms and reversals than the mean from 1000 randomly permuted trees. A chi-square analysis of completely orthologous ramp sequences resulted in a p-value < 0.001 as compared to random chance. Ramp sequences recover phylogenies comparable to those from other phylogenomic methods. Although not all ramp sequences appear to have a phylogenetic signal, more ramp sequences track speciation than expected by random chance. Therefore, ramp sequences may be used in conjunction with other phylogenomic approaches if many orthologs are taken into account. However, phylogenomic methods utilizing few orthologs should be cautious in incorporating ramp sequences because individual ramp sequences may provide conflicting signals.
Subjects
Codon Usage , Codon , Phylogeny , RNA, Transfer/genetics , Vertebrates/classification , Vertebrates/genetics , Animals
ABSTRACT
A recent genome-wide association study (GWAS) of 59 cerebrospinal fluid (CSF) proteins with a connection to Alzheimer's disease (AD) demonstrated an association between increased levels of chemokine ligand 2 (CCL2) with an atypical chemokine receptor chemokine-binding protein 2 variant V41A (ACKR2-V41A; rs2228467). High levels of CCL2 are associated with increased risk of AD development as well as other inflammatory diseases. In this study, we characterized the biological function of the ACKR2-V41A receptor compared to the wild type allele by measuring its ligand binding affinity, CCL2 scavenging efficiency, and cell activation sensitivity. We transfected Chinese hamster ovary cells with plasmids carrying wild type ACKR2 (ACKR2-WT) or the mutant ACKR2-V41A receptor. Binding affinity assays showed that ACKR2-V41A has a lower binding affinity for CCL2 and CCL4 than ACKR2-WT. CCL2 scavenging results aligned with binding affinity assays, with ACKR2-V41A cells scavenging CCL2 with a lower efficiency than ACKR2-WT. Cell activation assays also showed that ACKR2-V41A cells had significantly lower receptor upregulation (β-arrestin-dependent signaling pathway) upon stimulation compared to ACKR2-WT cells. These findings provide molecular and biological mechanistic insights into the GWAS association of ACKR2-V41A with increased levels of CCL2 in CSF and possibly other chemokine ligands. Increased CCL2 levels are associated with accelerated cognitive decline and increased risk of AD. Understanding how this atypical chemokine receptor allele increases serum markers of inflammation could lead to novel therapeutic solutions for AD.
Subjects
Alzheimer Disease/etiology , Chemokine CCL2/metabolism , Inflammation/metabolism , Mutant Proteins , Receptors, Chemokine/chemistry , Receptors, Chemokine/metabolism , Actin Depolymerizing Factors/metabolism , Alzheimer Disease/metabolism , Alzheimer Disease/pathology , Amino Acid Substitution , Animals , CHO Cells , Cricetulus , Disease Susceptibility , Humans , Hydrophobic and Hydrophilic Interactions , Inflammation/complications , Inflammation/genetics , Kinetics , Models, Molecular , Phosphorylation , Protein Binding , Protein Conformation , Receptors, Chemokine/genetics , Structure-Activity Relationship
ABSTRACT
Identical codon pairing and co-tRNA codon pairing increase translational efficiency within genes when two codons that encode the same amino acid are translated by the same tRNA before it diffuses from the ribosome. We examine the phylogenetic signal in both identical and co-tRNA codon pairing across 23 428 species using alignment-free and parsimony methods. We determined that conserved codon pairing typically has a smaller window size than the length of a ribosome, and codon pairing tracks phylogenies across various taxonomic groups. We report a comprehensive analysis of codon pairing, including the extent to which each codon pairs. Our parsimony method generally recovers phylogenies that are more congruent with the established phylogenies than our alignment-free method. However, four of the ten taxonomic groups did not have sufficient orthologous codon pairings and were therefore analyzed using only the alignment-free methods. Since the recovered phylogenies using only codon pairing largely match phylogenies from the Open Tree of Life and the NCBI taxonomy, and are comparable to trees recovered by other algorithms, we propose that codon pairing biases are phylogenetically conserved and should be considered in conjunction with other phylogenomic techniques.
Subjects
Codon/genetics , Conserved Sequence/genetics , Phylogeny , RNA, Transfer/genetics , Ribosomes/genetics
ABSTRACT
Using parsimony, we analyzed codon usages across 12,337 species and 25,727 orthologous genes to rank specific genes and codons according to their phylogenetic signal. We examined each codon within each ortholog to determine the codon usage for each species. In total, 890,814 codons were parsimony informative. Next, we compared species that used a codon with species that did not use the codon. We assessed each codon's congruence with species relationships provided in the Open Tree of Life (OTL) and determined the statistical probability of observing these results by random chance. We determined that 25,771 codons had no parallelisms or reversals when mapped to the OTL. Codon usages from orthologous genes spanning many species were 1109× more likely to be congruent with species relationships in the OTL than would be expected by random chance. Using the OTL as a reference, we show that codon usage is phylogenetically conserved within orthologous genes in archaea, bacteria, plants, mammals, and other vertebrates. We also show how to use our provided framework to test different tree hypotheses by confirming the placement of turtles as sister taxa to archosaurs.
Subjects
Codon Usage/physiology , Codon/genetics , Databases, Genetic , Genetic Speciation , Phylogeny , Animals , Archaea/classification , Archaea/genetics , Bacteria/classification , Bacteria/genetics , Conserved Sequence , Databases, Genetic/statistics & numerical data , Mammals/classification , Mammals/genetics , Plants/classification , Plants/genetics , Sequence Homology , Turtles/classification , Turtles/genetics , Vertebrates/classification , Vertebrates/genetics
ABSTRACT
Carbapenem-resistant bacteria have quickly become a worldwide concern in nosocomial infections. Of the seven known carbapenemases, four have been shown to be particularly problematic: KPC, NDM, IMP, and VIM. To date, many local and species- or carbapenemase-specific epidemiological studies have been performed, which often focus on the organism itself. This report attempts an inclusive epidemiologic study (encompassing both species and carbapenemase) using publicly available plasmid sequences from NCBI. In this report, the gene content of these various plasmids has been characterized, replicon types of the plasmids identified, and the global spread and species promiscuity of the plasmids analyzed. Additionally, the results lend support to several groups that are targeting plasmid maintenance and transfer mechanisms to slow the spread of resistance plasmids.
Subjects
Bacterial Proteins/genetics , Drug Resistance, Bacterial/genetics , Plasmids/genetics , beta-Lactamases/genetics , Anti-Bacterial Agents , Carbapenem-Resistant Enterobacteriaceae/genetics , Carbapenems , China , Databases, Nucleic Acid , Plasmids/classification , Replicon , United States
ABSTRACT
BACKGROUND: Common phylogenomic approaches for recovering phylogenies are often time-consuming and require annotations for orthologous gene relationships that are not always available. In contrast, alignment-free phylogenomic approaches typically use structure and oligomer frequencies to calculate pairwise distances between species. We have developed an approach to quickly calculate distances between species based on codon aversion. METHODS: Utilizing a novel alignment-free character state, we present CAM, an alignment-free approach to recover phylogenies by comparing differences in codon aversion motifs (i.e., the set of unused codons within each gene) across all genes within a species. Synonymous codon usage is non-random and differs between organisms, between genes, and even within a single gene, and many genes do not use all possible codons. We report a comprehensive analysis of codon aversion within 229,742,339 genes from 23,428 species across all kingdoms of life, and we provide an alignment-free framework for its use in a phylogenetic construct. For each species, we first construct a set of codon aversion motifs spanning all genes within that species. We define the pairwise distance between two species, A and B, as one minus the number of shared codon aversion motifs divided by the total codon aversion motifs of the species, A or B, containing the fewest motifs. This approach allows us to calculate pairwise distances even when substantial differences in the number of genes or a high rate of divergence between species exists. Finally, we use neighbor-joining to recover phylogenies. RESULTS: Using the Open Tree of Life and NCBI Taxonomy Database as expected phylogenies, our approach compares well, recovering phylogenies that largely match expected trees and are comparable to trees recovered using maximum likelihood and other alignment-free approaches. Our technique is much faster than maximum likelihood and similar in accuracy to other alignment-free approaches. Therefore, we propose that codon aversion be considered a phylogenetically conserved character that may be used in future phylogenomic studies. AVAILABILITY: CAM, documentation, and test files are freely available on GitHub at https://github.com/ridgelab/cam.
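The pairwise distance defined in the METHODS above can be sketched in a few lines. This is an illustration of the stated formula only, not the CAM implementation from the repository. It assumes that each motif is the set of all 64 codons minus those used in a gene's coding sequence (stop codons included), that duplicate motifs within a species are counted once, and that coding sequences are in frame. All function names and toy sequences are hypothetical.

```python
from itertools import product

# All 64 DNA codons (assumption: stop codons are counted like any other codon).
ALL_CODONS = frozenset("".join(bases) for bases in product("ACGT", repeat=3))


def codon_aversion_motif(cds):
    """Return the set of codons that never occur in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    used = {cds[i:i + 3] for i in range(0, usable, 3)}
    return ALL_CODONS - used


def species_motifs(cds_list):
    """Collect the distinct codon aversion motifs across a species' genes."""
    return {codon_aversion_motif(cds) for cds in cds_list}


def cam_distance(motifs_a, motifs_b):
    """One minus shared motifs divided by the motif count of the smaller set."""
    smaller = min(len(motifs_a), len(motifs_b))
    if smaller == 0:
        raise ValueError("each species needs at least one motif")
    return 1 - len(motifs_a & motifs_b) / smaller


# Hypothetical toy genes: the two species share one identical motif.
species_a = species_motifs(["ATGAAATAA", "ATGCCCTAA"])
species_b = species_motifs(["ATGAAATAA", "ATGGGGTAA", "ATGTTTTAA"])
print(cam_distance(species_a, species_b))
```

Dividing by the smaller motif set is what keeps the distance meaningful when one species has far more genes than the other. A matrix of such distances could then be passed to any neighbor-joining implementation.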
ABSTRACT
BACKGROUND: Plant chloroplasts and mitochondria utilize nuclear encoded proteins to replicate their DNA. These proteins are purposely built for replication in the organelle environment and are distinct from those involved in replication of the nuclear genome. These organelle-localized proteins have ancestral roots in bacterial and bacteriophage genes, supporting the endosymbiotic theory of their origin. We examined the interactions between three of these proteins from Arabidopsis thaliana: a DNA helicase-primase similar to bacteriophage T7 gp4 protein and animal mitochondrial Twinkle, and two DNA polymerases, Pol1A and Pol1B. We used a three-pronged approach to analyze the interactions, including Yeast-two-hybrid analysis, Direct Coupling Analysis (DCA), and thermophoresis. RESULTS: Yeast-two-hybrid analysis reveals residues 120-295 of Twinkle as the minimal region that can still interact with Pol1A or Pol1B. This region is a part of the primase domain of the protein and slightly overlaps the zinc-finger and RNA polymerase subdomains located within. Additionally, we observed that Arabidopsis Twinkle interacts much more strongly with Pol1A versus Pol1B. Thermophoresis also confirms that the primase domain of Twinkle has higher binding affinity than any other region of the protein. Direct-Coupling-Analysis identified specific residues in Twinkle and the DNA polymerases critical to positive interaction between the two proteins. CONCLUSIONS: The interaction of Twinkle with Pol1A or Pol1B mimics the minimal DNA replisomes of T7 phage and those present in mammalian mitochondria. However, while T7 and mammals absolutely require their homolog of Twinkle DNA helicase-primase, Arabidopsis Twinkle mutants are seemingly unaffected by this loss. This implies that while Arabidopsis mitochondria mimic minimal replisomes from T7 and mammalian mitochondria, there is an extra level of redundancy specific to loss of Twinkle function.
Subjects
Arabidopsis Proteins/genetics , Arabidopsis/genetics , Bacteriophage T7/genetics , DNA-Directed DNA Polymerase/genetics , Multienzyme Complexes/genetics , Multifunctional Enzymes/genetics , Arabidopsis/metabolism , Arabidopsis Proteins/metabolism , DNA-Directed DNA Polymerase/metabolism , Mitochondria/metabolism , Multifunctional Enzymes/metabolism
ABSTRACT
BACKGROUND: The human genome contains "dark" gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions with few mappable reads that we call dark by depth, and others that have ambiguous alignment, called camouflaged. We assess how well long-read or linked-read technologies resolve these regions. RESULTS: Based on standard whole-genome Illumina sequencing data, we identify 36,794 dark regions in 6054 gene bodies from pathways important to human health, development, and reproduction. Of these gene bodies, 8.7% are completely dark and 35.2% are ≥ 5% dark. We identify dark regions that are present in protein-coding exons across 748 genes. Linked-read or long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduce dark protein-coding regions to approximately 50.5%, 35.6%, and 9.6%, respectively. We present an algorithm to resolve most camouflaged regions and apply it to the Alzheimer's Disease Sequencing Project. We rescue a rare ten-nucleotide frameshift deletion in CR1, a top Alzheimer's disease gene, found in disease cases but not in controls. CONCLUSIONS: While we could not formally assess the association of the CR1 frameshift mutation with Alzheimer's disease due to insufficient sample size, we believe it merits investigation in a larger cohort. There remain thousands of potentially important genomic regions overlooked by short-read sequencing that are largely resolved by long-read technologies.
Subjects
Genetic Predisposition to Disease , Genome, Human , Humans , Mutation
ABSTRACT
Alzheimer's disease (AD) is mainly a late-onset neurodegenerative disorder. Substantial efforts have been made to solve the complex genetic architecture of AD as a means to identify therapeutic targets. Unfortunately, to date, no disease-altering therapeutics have been developed. As therapeutics are likely to be most effective in the early stages of disease (ie, before the onset of symptoms), a recent focus of AD research has been the identification of protective factors that prevent disease. One example is the discovery of a rare variant in the 3'-UTR of RAB10 that is protective for AD. Here, we review the possible genetic, molecular, and functional role of RAB10 in AD and potential therapeutic approaches to target RAB10.
Subjects
Alzheimer Disease/genetics , Alzheimer Disease/metabolism , rab GTP-Binding Proteins/genetics , rab GTP-Binding Proteins/metabolism , 3' Untranslated Regions , Alzheimer Disease/drug therapy , Animals , Drug Discovery , Humans , Protective Factors
ABSTRACT
Different species, genes, and locations within genes use different codons to fine-tune gene expression. Within genes, the ramp sequence assists in ribosome spacing and decreases downstream collisions by incorporating slowly-translated codons at the beginning of a gene. Although previously reported as occurring in some species, no previous attempt at extracting the ramp sequence from specific genes has been published. We present ExtRamp, a software package that quickly extracts ramp sequences from any species using the tRNA adaptation index or relative codon adaptiveness. Different filters facilitate the analysis of codon efficiency and enable identification of genes with a ramp sequence. We validate the existence of a ramp sequence in most species by running ExtRamp on 229 742 339 genes across 23 428 species. We evaluate differences in reported ramp sequences when we use different parameters. Using the strictest ramp sequence cut-off, we show that across most taxonomic groups, ramp sequences are approximately 20-40 codons long and occur in about 10% of gene sequences. We also show that in Drosophila melanogaster as gene expression increases, a higher proportion of genes have ramp sequences. We provide a framework for performing this analysis on other species. ExtRamp is freely available at https://github.com/ridgelab/ExtRamp.
Subjects
Algorithms , Codon , Sequence Analysis, DNA/methods , Animals , RNA, Transfer , Sequence Analysis, RNA/methods , Software
ABSTRACT
MOTIVATION: Orthologous gene identification is fundamental to all aspects of biology. For example, ortholog identification between species can provide functional insights for genes of unknown function and is a necessary step in phylogenetic inference. Currently, most ortholog identification algorithms require all-versus-all BLAST comparisons, which are time-consuming and memory intensive. RESULTS: In contrast to existing approaches, JustOrthologs exploits the conservation of gene structure by using the lengths of coding sequence regions and dinucleotide percentages to identify orthologs. In comparison to OrthoMCL, OMA and OrthoFinder, JustOrthologs decreases ortholog identification runtime by more than 96% and achieves comparable precision and recall scores. The computational speedup allowed us to conduct pairwise comparisons of 1197 complete genomes (780 eukaryotes and 417 archaea). We confirmed gene annotations for 384 120 genes, grouped 1 675 415 genes in previously unreported ortholog groups, and identified 51 429 potentially mislabeled genes across 622 843 ortholog groups. AVAILABILITY AND IMPLEMENTATION: JustOrthologs is an open source collaborative software package available in the GitHub repository: https://github.com/ridgelab/JustOrthologs/. All test FASTA files used for comparisons are freely available at https://github.com/ridgelab/JustOrthologs/comparisonFastaFiles/. Reference genomes used in this work are available for download from the NCBI repository: ftp://ftp.ncbi.nih.gov/genomes/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
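One of the two features named above, dinucleotide percentages, is simple to compute, and a short sketch may make the idea concrete. This is not the JustOrthologs code. It assumes overlapping dinucleotides counted along a single coding-sequence region, and the function name and toy sequence are hypothetical.

```python
from collections import Counter


def dinucleotide_percentages(sequence):
    """Return the percentage of each overlapping dinucleotide in a sequence."""
    sequence = sequence.upper()
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    if not pairs:
        return {}
    counts = Counter(pairs)
    return {pair: 100 * n / len(pairs) for pair, n in counts.items()}


# Hypothetical toy region: overlapping pairs are AT, TG, GA, AT, TG.
print(dinucleotide_percentages("ATGATG"))
```

Comparing compact per-region profiles like this, together with region lengths, avoids the all-versus-all sequence alignment that the abstract identifies as the main cost of existing methods.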
Subjects
Algorithms , Genomics , Software , Computational Biology , Molecular Sequence Annotation , Phylogeny
ABSTRACT
Polygenic scores (or genetic risk scores) quantify the aggregate of small effects from many common genetic loci that have been associated with a trait through genome-wide association. Polygenic scores were first used successfully in schizophrenia and have since been applied to multiple phenotypes including multiple sclerosis, rheumatoid arthritis, and height. Because human height is an easily measured and complex polygenic trait, polygenic height scores provide exciting insights into the predictability of aggregate common variant effect on the phenotype. Shawn Bradley is an extremely tall former professional basketball player from Brigham Young University and the National Basketball Association (NBA), measuring 2.29 meters (7'6″, 99.99999th percentile for height) tall, with no known medical conditions. Here, we present a case where a rare combination of common SNPs in one individual results in an extremely high polygenic height score that is correlated with an extreme phenotype. While polygenic scores are not clinically significant in the average case, our findings suggest that for extreme phenotypes, polygenic scores may be more successful for the prediction of individuals.
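The aggregation that a polygenic score performs is commonly written as a weighted sum: each SNP's effect size from a genome-wide association study is multiplied by the number of effect alleles the individual carries (0, 1, or 2), and the products are added. The sketch below assumes that additive form. The SNP identifiers, effect sizes, and genotype are hypothetical and are not data from the study.

```python
def polygenic_score(effect_sizes, dosages):
    """Sum effect size times effect-allele count over SNPs present in both inputs."""
    return sum(
        beta * dosages[snp]
        for snp, beta in effect_sizes.items()
        if snp in dosages  # SNPs missing from the genotype are skipped
    )


# Hypothetical per-allele effect sizes and one individual's allele counts.
effects = {"snp1": 0.5, "snp2": -0.25, "snp3": 1.0}
genotype = {"snp1": 2, "snp2": 1, "snp3": 0}
print(polygenic_score(effects, genotype))
```

A raw score like this is usually interpreted relative to the distribution of scores in a reference population, which is how an individual can be identified as falling at an extreme.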