ABSTRACT
The growing use of next-generation sequencing technologies in genetic diagnosis has produced an exponential increase in the number of variants of uncertain significance (VUS). In this manuscript, we compare three machine learning methods to classify VUS as Pathogenic or Non-pathogenic, implementing a Random Forest (RF), a Support Vector Machine (SVM), and a Multilayer Perceptron. To train the models, we extracted high-quality variants from ClinVar that were previously classified as VUS. For each variant, we retrieved nine conservation scores, the loss-of-function tool, and allele frequencies. For the RF and SVM models, hyperparameters were tuned using cross-validation with a grid search. The three models were tested on a nonoverlapping set of variants that had been classified as VUS over the last 3 years, but had been reclassified in August 2020. The three models yielded superior accuracy on this set compared to the benchmarked tools. The RF-based model yielded the best performance across different variant types and was used to create VusPrize, an open-source software tool for prioritization of VUS. We believe that our model can improve the process of genetic diagnosis in research and clinical settings.
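The hyperparameter tuning step mentioned above (cross-validation with a grid search) can be sketched in a few lines. This is only an illustration of the tuning loop: the classifier is a toy one-dimensional k-nearest-neighbour model standing in for the RF and SVM models, and the function names, the grid, and the data are all hypothetical.

```python
def knn_predict(train, query, k):
    """Majority label (0 or 1) among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def cv_accuracy(data, k, folds):
    """Mean accuracy of the toy classifier over a simple k-fold split."""
    correct = 0
    for f in range(folds):
        held_out = data[f::folds]  # every folds-th item forms one fold
        train = [item for i, item in enumerate(data) if i % folds != f]
        correct += sum(knn_predict(train, x, k) == y for x, y in held_out)
    return correct / len(data)

def grid_search(data, grid, folds=5):
    """Score every candidate hyperparameter and return the best one."""
    scores = {k: cv_accuracy(data, k, folds) for k in grid}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical toy data: feature values 1..10, label 1 when the value exceeds 5.
data = [(x, 0 if x <= 5 else 1) for x in range(1, 11)]
best_k, scores = grid_search(data, grid=[1, 3, 5])
```

Each candidate value is scored only on held-out folds, so the selected hyperparameter is not chosen on the data used to fit the model.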
Subjects
High-Throughput Nucleotide Sequencing , Machine Learning , High-Throughput Nucleotide Sequencing/methods , Humans , Neural Networks, Computer , Software , Support Vector Machine
ABSTRACT
Investigation of large structural variants (SVs) is a challenging yet important task in understanding trait differences in highly repetitive genomes. Combining different bioinformatic approaches for SV detection, we analyzed whole-genome sequencing data from 3000 rice genomes and identified 63 million individual SV calls that grouped into 1.5 million allelic variants. We found enrichment of long SVs in promoters and an excess of shorter variants in 5' UTRs. Across the rice genomes, we identified regions of high SV frequency enriched in stress response genes. We demonstrated how SVs may help in finding causative variants in genome-wide association analysis. These new insights into rice genome biology are valuable for understanding the effects SVs have on gene function, with the prospect of identifying novel agronomically important alleles that can be utilized to improve cultivated rice.
Subjects
Genetic Variation , Genome, Plant , Genomic Structural Variation , Genomics/methods , Oryza/genetics , Alleles , Chromosome Mapping , DNA Transposable Elements , Genome-Wide Association Study/methods , Phenotype , Sequence Analysis, DNA/methods , Stress, Physiological/genetics
ABSTRACT
Fruit development has been central in the evolution and domestication of flowering plants. In common bean (Phaseolus vulgaris), the principal global grain legume staple, two main production categories are distinguished by fibre deposition in pods: dry beans, with fibrous, stringy pods; and stringless snap/green beans, with reduced fibre deposition, which frequently revert to the ancestral stringy state. Here, we identify genetic and developmental patterns associated with pod fibre deposition. Transcriptional, anatomical, epigenetic and genetic regulation of pod strings were explored through RNA-seq, RT-qPCR, fluorescence microscopy, bisulfite sequencing and whole-genome sequencing. Overexpression of the INDEHISCENT ('PvIND') orthologue was observed in stringless types compared with isogenic stringy lines, associated with overspecification of weak dehiscence-zone cells throughout the pod vascular sheath. No differences in DNA methylation were correlated with this phenotype. Nonstringy varieties showed a tandemly direct duplicated PvIND and a Ty1-copia retrotransposon inserted between the two repeats. These sequence features are lost during pod reversion and are predictive of pod phenotype in diverse materials, supporting their role in PvIND overexpression and reversible string phenotype. Our results give insight into reversible gain-of-function mutations and possible genetic solutions to the reversion problem, of considerable economic value for green bean production.
Subjects
Phaseolus , Domestication , Gene Duplication , Phaseolus/genetics , Phenotype , Retroelements/genetics
ABSTRACT
BACKGROUND: Quantitative analysis of ventricular cerebrospinal fluid (vCSF) proteins following acute brain injury (ABI) may help identify pathophysiological pathways and potential biomarkers that can predict unfavorable outcome. METHODS: In this prospective proteomic analysis study, consecutive patients with severe ABI expected to require intraventricular catheterization for intracranial pressure (ICP) monitoring for at least 5 days and patients without ABI admitted for elective clipping of an unruptured cerebral aneurysm were included. vCSF samples were collected within the first 24 h after ABI and ventriculostomy insertion and then every 24 h for 5 days. In patients without ABI, a single vCSF sample was collected at the time of elective clipping. Data-independent acquisition and sequential window acquisition of all theoretical spectra (SWATH) mass spectrometry were used to compare differences in protein expression in patients with ABI and patients without ABI and in patients with traumatic and nontraumatic ABI. Differences in protein expression according to different ICP values, intensive care unit outcome, subarachnoid hemorrhage (SAH) versus traumatic brain injury (TBI), and good versus poor 3-month functional status (assessed by using the Glasgow Outcome Scale) were also evaluated. vCSF proteins with significant differences between groups were compared by using linear models and selected for gene ontology analysis using R Language and the Panther database. RESULTS: We included 50 patients with ABI (SAH n = 23, TBI n = 15, intracranial hemorrhage n = 6, ischemic stroke n = 3, others n = 3) and 12 patients without ABI. There were significant differences in the expression of 255 proteins between patients with and without ABI (p < 0.01). There were intraday and interday differences in expression of seven proteins related to increased inflammation, apoptosis, oxidative stress, and cellular response to hypoxia and injury. 
Among these, glial fibrillary acidic protein expression was higher in patients with ABI with severe intracranial hypertension (ICH) (ICP ≥ 30 mm Hg) or death compared to those without (log2 fold change: +2.4; p < 0.001), suggesting extensive primary astroglial injury or death. There were differences in the expression of 96 proteins between patients with traumatic and nontraumatic ABI (p < 0.05); intraday and interday differences were observed for six proteins related to structural damage, complement activation, and cholesterol metabolism. Thirty-nine vCSF proteins were associated with an increased risk of severe ICH (ICP ≥ 30 mm Hg) in patients with traumatic compared with nontraumatic ABI (p < 0.05). No significant differences were found in protein expression between patients with SAH versus TBI or between those with good versus poor 3-month Glasgow Outcome Scale score. CONCLUSIONS: Dysregulated vCSF protein expression after ABI may be associated with an increased risk of severe ICH and death.
Subjects
Brain Injuries, Traumatic , Brain Injuries , Intracranial Hypertension , Subarachnoid Hemorrhage , Biomarkers , Cholesterol , Glial Fibrillary Acidic Protein , Humans , Intracranial Hypertension/etiology , Intracranial Pressure/physiology , Prospective Studies , Proteomics , Subarachnoid Hemorrhage/complications
ABSTRACT
BACKGROUND: Common bean is an important staple crop in the tropics of Africa, Asia and the Americas. Smallholder farmers in particular rely on bean as a source of calories, protein and micronutrients. Drought is a major production constraint for common bean, a situation that will be aggravated under current climate change scenarios. In this context, new tools designed to understand the genetic basis governing the phenotypic responses to abiotic stress are required to improve transfer of desirable traits into cultivated beans. RESULTS: A multiparent advanced generation intercross (MAGIC) population of common bean was generated from eight Mesoamerican breeding lines representing the phenotypic and genotypic diversity of the CIAT Mesoamerican breeding program. This population was assessed under drought conditions in two field trials for yield, 100-seed weight, iron and zinc accumulation, phenology and pod harvest index. Transgressive segregation was observed for most of these traits. Yield was positively correlated with yield components and pod harvest index (PHI), and negative correlations were found with phenology traits and micromineral contents. Founder haplotypes in the population were identified using Genotyping by Sequencing (GBS). No major population structure was observed in the population. Whole Genome Sequencing (WGS) data from the founder lines was used to impute genotyping data for GWAS. Genetic mapping was carried out with two methods, using association mapping with GWAS, and linkage mapping with haplotype-based interval screening. Thirteen high-confidence QTL were identified using both methods and several QTL hotspots were found controlling multiple traits. A major QTL hotspot located on chromosome Pv01 for phenology traits and yield was identified. Further hotspots affecting several traits were observed on chromosomes Pv03 and Pv08. A major QTL for seed Fe content was contributed by MIB778, the founder line with highest micromineral accumulation.
Based on imputed WGS data, candidate genes are reported for the identified major QTL, and sequence changes were identified that could cause the phenotypic variation. CONCLUSIONS: This work demonstrates the importance of this common bean MAGIC population for genetic mapping of agronomic traits, to identify trait associations for molecular breeding tool design and as a new genetic resource for the bean research community.
Subjects
Phaseolus , Africa , Asia , Chromosome Mapping , Droughts , Phaseolus/genetics , Phenotype , Plant Breeding , Quantitative Trait Loci
ABSTRACT
MOTIVATION: Accurate detection, genotyping and downstream analysis of genomic variants from high-throughput sequencing data are fundamental features in modern production pipelines for genetic-based diagnosis in medicine or genomic selection in plant and animal breeding. Our research group maintains the Next-Generation Sequencing Experience Platform (NGSEP) as a precise, efficient and easy-to-use software solution for these features. RESULTS: Understanding that incorrect alignments around short tandem repeats are an important source of genotyping errors, we implemented in NGSEP new algorithms for realignment and haplotype clustering of reads spanning indels and short tandem repeats. We performed extensive benchmark experiments comparing NGSEP to state-of-the-art software using real data from three sequencing protocols and four species with different distributions of repetitive elements. NGSEP consistently shows comparable accuracy and better efficiency compared to the existing solutions. We expect that this work will contribute to the continuous improvement of quality in variant calling needed for modern applications in medicine and agriculture. AVAILABILITY AND IMPLEMENTATION: NGSEP is available as open-source software at http://ngsep.sf.net. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
High-Throughput Nucleotide Sequencing , Software , Algorithms , Genomics , INDEL Mutation , Sequence Analysis, DNA
ABSTRACT
Palm oil is the most consumed vegetable oil globally, and Colombia is the largest palm oil producer in South America and fourth worldwide. However, oil palm plantations in Colombia are affected by bud rot disease caused by the oomycete Phytophthora palmivora, leading to significant economic losses. Infection processes by plant pathogens involve the secretion of effector molecules, which alter the functioning or structure of host cells. Current long-read sequencing technologies provide the information needed to produce high-quality genome assemblies, enabling a comprehensive annotation of effectors. Here, we describe the development of genomic resources for P. palmivora, including a high-quality genome assembly based on long and short-read sequencing data, intraspecies variability for 12 isolates from different oil palm cultivation regions in Colombia, and a catalog of over 1,000 candidate effector proteins. A total of 45,416 genes were annotated from the new genome assembled in 2,322 contigs adding to 165.5 Mbp, which represents an improvement of two times more gene models, 33 times better contiguity, and 11 times less fragmentation compared with currently available genomic resources for the species. Analysis of nucleotide evolution in paralogs suggests a recent whole-genome duplication event. Genetic differences were identified among isolates showing variable virulence levels. We expect that these novel genomic resources contribute to the characterization of the species and the understanding of the interaction of P. palmivora with oil palm and could be further exploited as tools for the development of effective strategies for disease control.
Subjects
Phytophthora , Colombia , Genomics , Plant Diseases , South America
ABSTRACT
BACKGROUND: The apomictic reproductive mode of Brachiaria (syn. Urochloa) forage species allows breeders to faithfully propagate heterozygous genotypes through seed over multiple generations. In Brachiaria, reproductive mode segregates as a single dominant locus, the apospory-specific genomic region (ASGR). The ASGR has been mapped to an area of reduced recombination on Brachiaria decumbens chromosome 5. A primer pair designed within ASGR-BABY BOOM-like (BBML), the candidate gene for the parthenogenesis component of apomixis in Pennisetum squamulatum, was diagnostic for reproductive mode in the closely related species B. ruziziensis, B. brizantha, and B. decumbens. In this study, we used a mapping population of the distantly related commercial species B. humidicola to map the ASGR and test for conservation of ASGR-BBML sequences across Brachiaria species. RESULTS: Dense genetic maps were constructed for the maternal and paternal genomes of a hexaploid (2n = 6x = 36) B. humidicola F1 mapping population (n = 102) using genotyping-by-sequencing, simple sequence repeat, amplified fragment length polymorphism, and transcriptome-derived single nucleotide polymorphism markers. Comparative genomics with Setaria italica provided confirmation for x = 6 as the base chromosome number of B. humidicola. High-resolution molecular karyotyping indicated that the six homologous chromosomes of the sexual female parent paired at random, whereas preferential pairing of subgenomes was observed in the apomictic male parent. Furthermore, evidence for compensated aneuploidy was found in the apomictic parent, with only five homologous linkage groups identified for chromosome 5 and seven homologous linkage groups for chromosome 6. The ASGR mapped to B. humidicola chromosome 1, a region syntenic with chromosomes 1 and 7 of S. italica. The ASGR-BBML-specific PCR product cosegregated with the ASGR in the F1 mapping population, despite its location on a different carrier chromosome than in B. decumbens. CONCLUSIONS: The first dense molecular maps of B. humidicola provide strong support for cytogenetic evidence indicating a base chromosome number of six in this species. Furthermore, these results show conservation of the ASGR across the Paniceae in different chromosomal backgrounds and support postulation of the ASGR-BBML as candidate genes for the parthenogenesis component of apomixis.
Subjects
Apomixis , Brachiaria/genetics , Chromosome Mapping , Parthenogenesis/genetics , Chromosomes, Plant , Genomics , Karyotyping , Translocation, Genetic
ABSTRACT
Identification of genomic regions associated with a phenotype of interest is a fundamental step toward solving questions in biology and improving industrial research. Bulk segregant analysis (BSA) combined with high-throughput sequencing is a technique to efficiently identify these genomic regions. However, distinguishing true from spuriously linked genomic regions and accurately delineating the genomic positions of the truly linked regions requires complex statistical models, which are currently implemented in software tools that are generally difficult for non-expert users to operate. To facilitate the exploration and analysis of BSA data, we present EXPLoRA-web, a web service wrapped around our previously published algorithm EXPLoRA, which exploits linkage disequilibrium to increase the power and accuracy of quantitative trait loci identification in BSA. EXPLoRA-web provides a user-friendly interface that enables easy data upload and parallel processing of different parameter configurations. Results are provided graphically and as BED and/or text files, and the input is expected in widely used formats, enabling straightforward BSA data analysis. The web server is available at http://bioinformatics.intec.ugent.be/explora-web/.
Subjects
Algorithms , Linkage Disequilibrium , Quantitative Trait Loci , Quantitative Trait, Heritable , Software , Alleles , Animals , Bacteria/genetics , Bacteria/metabolism , Computer Graphics , Gene Frequency , High-Throughput Nucleotide Sequencing , Information Storage and Retrieval , Internet , Phenotype
ABSTRACT
BACKGROUND: The recent development and availability of different genotyping-by-sequencing (GBS) protocols has provided a cost-effective approach to perform high-resolution genomic analysis of entire populations in different species. The central component of all these protocols is the digestion of the initial DNA with known restriction enzymes, to generate sequencing fragments at predictable and reproducible sites. This allows genotyping of thousands of genetic markers on populations with hundreds of individuals. Because GBS protocols achieve parallel genotyping through high-throughput sequencing (HTS), every GBS protocol must include a bioinformatics pipeline for analysis of HTS data. Our bioinformatics group recently developed the Next Generation Sequencing Eclipse Plugin (NGSEP) for accurate, efficient, and user-friendly analysis of HTS data. RESULTS: Here we present the latest functionalities implemented in NGSEP in the context of the analysis of GBS data. We implemented a one-step wizard to perform parallel read alignment, variant identification and genotyping from HTS reads sequenced from entire populations. We added different filters for variants, samples and genotype calls as well as calculation of summary statistics overall and per sample, and diversity statistics per site. NGSEP includes a module to translate genotype calls to some of the most widely used input formats for integration with several tools to perform downstream analyses such as population structure analysis, construction of genetic maps, genetic mapping of complex traits and phenotype prediction for genomic selection. We assessed the accuracy of NGSEP on two highly heterozygous F1 cassava populations and on an inbred common bean population, and we showed that NGSEP provides similar or better accuracy compared to other widely used software packages for variant detection such as GATK, Samtools and Tassel.
CONCLUSIONS: NGSEP is a powerful, accurate and efficient bioinformatics software tool for analysis of HTS data, and also one of the best bioinformatic packages to facilitate the analysis and to maximize the genomic variability information that can be obtained from GBS experiments for population genomics.
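The restriction digestion step at the core of GBS protocols can be illustrated with a small in silico digest. This is a minimal sketch under stated assumptions: the abstract does not name the enzymes used, so PstI (recognition site CTGCAG, cutting after CTGCA on the top strand) is chosen here only as an example, and the function name and input sequence are hypothetical.

```python
def digest(sequence, site, cut_offset):
    """Split sequence at every occurrence of a recognition site.

    cut_offset is the number of bases of the site that stay on the
    upstream fragment (5 for PstI: CTGCA^G). Only the top strand is modeled.
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]

# Hypothetical 20-bp sequence containing two PstI sites.
fragments = digest("AACTGCAGTTTTCTGCAGGG", site="CTGCAG", cut_offset=5)
print(fragments)  # ['AACTGCA', 'GTTTTCTGCA', 'GGG']
```

Because the cut positions depend only on the sequence and the enzyme, the same fragments are expected in every individual carrying the same sites, which is what makes the sequenced loci reproducible across a population.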
Subjects
Genes, Plant , Genotyping Techniques , High-Throughput Nucleotide Sequencing , Computational Biology , Genotype , Manihot/genetics , Phaseolus/genetics , Sequence Analysis, DNA
ABSTRACT
Tandem repeats are short DNA sequences that are repeated head-to-tail with a propensity to be variable. They constitute a significant proportion of the human genome, also occurring within coding and regulatory regions. Variation in these repeats can alter the function and/or expression of genes, allowing organisms to swiftly adapt to novel environments. Importantly, some repeat expansions have also been linked to certain neurodegenerative diseases. Therefore, accurate sequencing of tandem repeats could contribute to our understanding of common phenotypic variability and might uncover missing genetic factors in idiopathic clinical conditions. However, despite long-standing evidence for the functional role of repeats, they are largely ignored because of technical limitations in sequencing, mapping and typing. Here, we report on a novel capture technique and data filtering protocol that allowed simultaneous sequencing of thousands of tandem repeats in the human genomes of a three-generation family using GS-FLX-plus Titanium technology. Our results demonstrated that up to 7.6% of tandem repeats in this family (4% in coding sequences) differ from the reference sequence, and identified a de novo variation in the family tree. The method opens new routes to look at this underappreciated type of genetic variability, including the identification of novel disease-related repeats.
Subjects
Genome, Human , Polymorphism, Genetic , Tandem Repeat Sequences , Base Sequence , Female , Gene Components , Humans , Male , Molecular Sequence Data , Pedigree , Sequence Analysis, DNA
ABSTRACT
Recent advances in high-throughput sequencing (HTS) technologies and computing capacity have produced unprecedented amounts of genomic data that have unraveled the genetics of phenotypic variability in several species. However, operating and integrating current software tools for data analysis still require important investments in highly skilled personnel. Developing accurate, efficient and user-friendly software packages for HTS data analysis will lead to a more rapid discovery of genomic elements relevant to medical, agricultural and industrial applications. We therefore developed Next-Generation Sequencing Eclipse Plug-in (NGSEP), a new software tool for integrated, efficient and user-friendly detection of single nucleotide variants (SNVs), indels and copy number variants (CNVs). NGSEP includes modules for read alignment, sorting, merging, functional annotation of variants, filtering and quality statistics. Analysis of sequencing experiments in yeast, rice and human samples shows that NGSEP has superior accuracy and efficiency, compared with currently available packages for variants detection. We also show that only a comprehensive and accurate identification of repeat regions and CNVs allows researchers to properly separate SNVs from differences between copies of repeat elements. We expect that NGSEP will become a strong support tool to empower the analysis of sequencing data in a wide range of research projects on different species.
Subjects
Genetic Variation , Genotyping Techniques , High-Throughput Nucleotide Sequencing , Software , Algorithms , DNA Copy Number Variations , Genomics/methods , Humans , INDEL Mutation , Oryza/genetics
ABSTRACT
The yeast Saccharomyces cerevisiae is able to accumulate ≥17% ethanol (v/v) by fermentation in the absence of cell proliferation. The genetic basis of this unique capacity is unknown. Up to now, all research has focused on tolerance of yeast cell proliferation to high ethanol levels. Comparison of maximal ethanol accumulation capacity and ethanol tolerance of cell proliferation in 68 yeast strains showed a poor correlation, but higher ethanol tolerance of cell proliferation clearly increased the likelihood of superior maximal ethanol accumulation capacity. We have applied pooled-segregant whole-genome sequence analysis to identify the polygenic basis of these two complex traits using segregants from a cross of a haploid derivative of the sake strain CBS1585 and the lab strain BY. From a total of 301 segregants, 22 superior segregants accumulating ≥17% ethanol in small-scale fermentations and 32 superior segregants growing in the presence of 18% ethanol were separately pooled and sequenced. Plotting SNP variant frequency against chromosomal position revealed eleven and eight Quantitative Trait Loci (QTLs) for the two traits, respectively, and showed that the genetic basis of the two traits is partially different. Fine-mapping and Reciprocal Hemizygosity Analysis identified ADE1, URA3, and KIN3, encoding a protein kinase involved in DNA damage repair, as specific causative genes for maximal ethanol accumulation capacity. These genes, as well as the previously identified MKT1 gene, were not linked in this genetic background to tolerance of cell proliferation to high ethanol levels. The superior KIN3 allele contained two SNPs, which are absent in all yeast strains sequenced up to now. This work provides the first insight into the genetic basis of maximal ethanol accumulation capacity in yeast and reveals for the first time the importance of DNA damage repair in yeast ethanol tolerance.
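The step of plotting SNP variant frequency against chromosomal position starts from per-marker read counts in the sequenced pool. The sketch below shows one plausible way to compute and smooth those frequencies; the input layout, the window size, and the function names are assumptions for illustration and are not taken from the study.

```python
def variant_frequencies(snps):
    """snps: list of (position, variant_reads, total_reads) for one pool.

    Returns (position, frequency) pairs, skipping markers with no coverage.
    """
    return [(pos, alt / total) for pos, alt, total in snps if total > 0]

def smooth(freqs, window):
    """Average the frequencies of all markers within +/- window bp."""
    smoothed = []
    for pos, _ in freqs:
        nearby = [f for p, f in freqs if abs(p - pos) <= window]
        smoothed.append((pos, sum(nearby) / len(nearby)))
    return smoothed

# Hypothetical counts: three covered markers and one without reads.
pool = [(100, 5, 10), (200, 9, 10), (300, 10, 10), (5000, 0, 0)]
freqs = variant_frequencies(pool)
profile = smooth(freqs, window=100)
```

In an unlinked region the pooled frequency is expected to fluctuate around 0.5, while a sustained deviation toward 1 (or 0) over neighbouring markers points to a candidate linked region.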
Subjects
Cell Proliferation , Ethanol/metabolism , Quantitative Trait Loci/genetics , Saccharomyces cerevisiae/genetics , Alcoholic Beverages/microbiology , Alleles , Chromosome Mapping , DNA Damage/drug effects , DNA Damage/genetics , DNA Repair/drug effects , DNA Repair/genetics , Drug Tolerance/genetics , Ethanol/pharmacology , Genome , Polymorphism, Single Nucleotide , Protein Serine-Threonine Kinases/genetics , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/genetics
ABSTRACT
BACKGROUND: Bulk segregant analysis (BSA) coupled to high-throughput sequencing is a powerful method to map genomic regions associated with phenotypes of interest. It relies on crossing two parents, one inferior and one superior for a trait of interest. Segregants displaying the trait of the superior parent are pooled, and their DNA is extracted and sequenced. Genomic regions linked to the trait of interest are identified by searching the pool for overrepresented alleles that normally originate from the superior parent. BSA data analysis is non-trivial due to sequencing, alignment and screening errors. RESULTS: To increase the power of the BSA technology and obtain a better distinction between spuriously and truly linked regions, we developed EXPLoRA (EXtraction of over-rePresented aLleles in BSA), an algorithm for BSA data analysis that explicitly models the dependency between neighboring marker sites by exploiting the properties of linkage disequilibrium through a Hidden Markov Model (HMM). Reanalyzing a BSA dataset for high ethanol tolerance in yeast allowed reliable identification of QTLs linked to this phenotype that could not be identified with statistical significance in the original study. Experimental validation of one of the least pronounced linked regions, by identifying its causative gene VPS70, confirmed the potential of our method. CONCLUSIONS: EXPLoRA performs at least as well as the state of the art and is robust even at low signal-to-noise ratios, i.e., when the true linkage signal is diluted by sampling or screening errors, or when few segregants are available.
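The idea of modeling the dependency between neighboring markers with an HMM can be illustrated with a two-state Viterbi decoder over pooled read counts. This is a generic sketch and not EXPLoRA's actual model: the two states, the binomial emissions, and every parameter value (`p_unlinked`, `p_linked`, `stay`) are assumptions chosen for illustration.

```python
import math

def log_binom_pmf(k, n, p):
    """Log probability of k successes in n binomial trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def viterbi_linked(counts, p_unlinked=0.5, p_linked=0.9, stay=0.95):
    """Label each marker as unlinked (0) or linked (1).

    counts: (reads_with_superior_allele, total_reads) per marker,
            ordered by chromosomal position.
    """
    probs = (p_unlinked, p_linked)
    log_stay, log_switch = math.log(stay), math.log(1 - stay)
    k, n = counts[0]
    # Uniform prior over the two states at the first marker.
    score = [math.log(0.5) + log_binom_pmf(k, n, p) for p in probs]
    back = []
    for k, n in counts[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            cands = [score[prev] + (log_stay if prev == s else log_switch)
                     for prev in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            pointers.append(best)
            new_score.append(cands[best] + log_binom_pmf(k, n, probs[s]))
        score = new_score
        back.append(pointers)
    # Trace the most likely state path back from the last marker.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

The high self-transition probability is what encodes the neighbor dependency: an isolated marker with a moderately skewed allele count is unlikely to be called linked unless adjacent markers support the same state.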
Subjects
Algorithms , Ethanol/pharmacology , Quantitative Trait Loci , Saccharomyces cerevisiae/drug effects , Chromosome Mapping , Genetic Linkage , Linkage Disequilibrium , Markov Chains , Phenotype , Saccharomyces cerevisiae/genetics
ABSTRACT
Independent determination of both haplotype sequences of an individual genome is essential to relate genetic variation to genome function, phenotype, and disease. To address the importance of phase, we have generated the most complete haplotype-resolved genome to date, "Max Planck One" (MP1), by fosmid pool-based next generation sequencing. Virtually all SNPs (>99%) and 80,000 indels were phased into haploid sequences of up to 6.3 Mb (N50 ~1 Mb). The completeness of phasing allowed determination of the concrete molecular haplotype pairs for the vast majority of genes (81%) including potential regulatory sequences, of which >90% were found to be constituted by two different molecular forms. A subset of 159 genes with potentially severe mutations in either cis or trans configurations exemplified in particular the role of phase for gene function, disease, and clinical interpretation of personal genomes (e.g., BRCA1). Extended genomic regions harboring manifold combinations of physically and/or functionally related genes and regulatory elements were resolved into their underlying "haploid landscapes," which may define the functional genome. Moreover, the majority of genes and functional sequences were found to contain individual or rare SNPs, which cannot be phased from population data alone, emphasizing the importance of molecular phasing for characterizing a genome in its molecular individuality. Our work provides the foundation to understand that the distinction of molecular haplotypes is essential to resolve the (inherently individual) biology of genes, genomes, and disease, establishing a reference point for "phase-sensitive" personal genomics. MP1's annotated haploid genomes are available as a public resource.
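The N50 figure quoted for the phased haploid sequences is a standard contiguity statistic: the length L such that blocks of length at least L cover half of the total phased length. A minimal sketch of the calculation follows; the function name and the example lengths are hypothetical.

```python
def n50(lengths):
    """Length of the block at which the running sum of lengths,
    taken from longest to shortest, first reaches half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")

# Hypothetical block lengths in Mb.
print(n50([2, 3, 4, 5, 6]))  # 5
```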
Subjects
Genome, Human , Haplotypes , Female , Genomics , High-Throughput Nucleotide Sequencing , Humans , INDEL Mutation , Male , Middle Aged , Polymorphism, Single Nucleotide , Sequence Analysis, DNA
ABSTRACT
Gene duplication stimulates evolutionary innovation as the resulting paralogs acquire mutations that lead to sub- or neofunctionalization. A comprehensive in silico analysis of paralogs in Saccharomyces cerevisiae reveals that duplicates of cell-surface and subtelomeric genes also undergo ectopic recombination, which leads to new chimaeric alleles. Mimicking such intergenic recombination events in the FLO (flocculation) family of cell-surface genes shows that chimaeric FLO alleles confer different adhesion phenotypes than the parental genes. Our results indicate that intergenic recombination between paralogs can generate a large set of new alleles, thereby providing the raw material for evolutionary adaptation and innovation.
Subjects
Gene Duplication/genetics , Mannose-Binding Lectins/genetics , Recombination, Genetic , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae/genetics , Alleles , Cell Adhesion/genetics , Evolution, Molecular , Gene Expression Regulation, Fungal , Genetic Variation , Mutation , Phenotype , Sequence Homology, Amino Acid
ABSTRACT
Determining the underlying haplotypes of individual human genomes is an essential, but currently difficult, step toward a complete understanding of genome function. Fosmid pool-based next-generation sequencing allows genome-wide generation of 40-kb haploid DNA segments, which can be phased into contiguous molecular haplotypes computationally by Single Individual Haplotyping (SIH). Many SIH algorithms have been proposed, but the accuracy of such methods has been difficult to assess due to the lack of real benchmark data. To address this problem, we generated whole genome fosmid sequence data from a HapMap trio child, NA12878, for which reliable haplotypes have already been produced. We assembled haplotypes using eight algorithms for SIH and carried out direct comparisons of their accuracy, completeness and efficiency. Our comparisons indicate that fosmid-based haplotyping can deliver highly accurate results even at low coverage and that our SIH algorithm, ReFHap, is able to efficiently produce high-quality haplotypes. We expanded the haplotypes for NA12878 by combining the current haplotypes with our fosmid-based haplotypes, producing near-to-complete new gold-standard haplotypes containing almost 98% of heterozygous SNPs. This improvement includes notable fractions of disease-related and GWA SNPs. Integrated with other molecular biological data sets, this phase information will advance the emerging field of diploid genomics.
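One way to compare candidate haplotypes against haploid fragments in Single Individual Haplotyping is the minimum error correction (MEC) score, a widely used objective in this field. The abstract does not say which objective each of the eight algorithms optimizes, so the sketch below is only an illustration; the data layout and function name are assumptions.

```python
def mec_score(haplotype, fragments):
    """Minimum error correction score of a haplotype pair.

    haplotype: list of 0/1 alleles at heterozygous sites; the second
               haplotype is taken to be its complement.
    fragments: list of dicts mapping site index -> observed allele (0/1).
    Each fragment is assigned to whichever haplotype it matches better,
    and the mismatches that remain are counted.
    """
    total = 0
    for frag in fragments:
        mismatches = sum(1 for site, allele in frag.items()
                         if haplotype[site] != allele)
        # Mismatches against the complement are the remaining sites.
        total += min(mismatches, len(frag) - mismatches)
    return total

# Hypothetical fragments over four heterozygous sites.
fragments = [{0: 0, 1: 1}, {1: 0, 2: 0, 3: 1}, {2: 1, 3: 1}]
print(mec_score([0, 1, 1, 0], fragments))  # 1
```

A score of zero means every fragment is fully consistent with one of the two haplotypes, so lower scores indicate a phasing that explains the fragment data with fewer presumed sequencing errors.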
Subjects
Human Genome , HapMap Project , Haplotypes , DNA Sequence Analysis , Algorithms , Genomics/standards , High-Throughput Nucleotide Sequencing , Humans , Single Nucleotide Polymorphism , DNA Sequence Analysis/standards
ABSTRACT
BACKGROUND: Structural variants (SVs) are genomic polymorphisms defined by their length (>50 bp). The usual types of SVs are deletions, insertions, translocations, inversions, and copy number variants. SV detection and genotyping are fundamental given the role of SVs in phenomena such as phenotypic variation and evolutionary events. Thus, methods to identify SVs using long-read sequencing data have recently been developed. FINDINGS: We present an accurate and efficient algorithm to predict germline SVs from long-read sequencing data. The algorithm starts by collecting evidence (signatures) of SVs from read alignments. Then, signatures are clustered based on a Euclidean graph with coordinates calculated from lengths and genomic positions. Clustering is performed by the DBSCAN algorithm, which provides the advantage of delimiting clusters with high resolution. Clusters are transformed into SVs, and a Bayesian model allows precise genotyping of SVs based on their supporting evidence. This algorithm is integrated into the single sample variants detector of the Next Generation Sequencing Experience Platform, which facilitates integration with other functionalities for genomics analysis. We performed multiple benchmark experiments, including simulated and real data, representing different genome profiles, sequencing technologies (PacBio HiFi, ONT), and read depths. CONCLUSION: The results show that our approach outperformed state-of-the-art tools on germline SV calling and genotyping, especially at low depths and in error-prone repetitive regions. We believe this work significantly contributes to the development of bioinformatic strategies to maximize the use of long-read sequencing technologies.
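The clustering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the coordinates are scaled or which DBSCAN parameters are used, so the signature representation (genomic position and SV length as a plain 2D point) and the values of `eps` and `min_pts` are assumptions chosen only for illustration.

```python
from math import hypot


def region_query(points, i, eps):
    """Indices of all points within Euclidean distance eps of points[i] (itself included)."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points) if hypot(xi - xj, yi - yj) <= eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN: returns one label per point, a cluster id >= 0 or -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = noise  # may later be relabeled as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels


# Hypothetical SV signatures as (genomic position, SV length), both in bp.
signatures = [
    (1000, 300), (1005, 310), (998, 295), (1002, 305),  # one candidate SV
    (5000, 1200), (5010, 1190), (5003, 1205),           # a second candidate SV
    (9000, 60),                                         # isolated signature
]

# eps and min_pts are illustrative values, not taken from the paper.
labels = dbscan(signatures, eps=50, min_pts=3)
for signature, label in zip(signatures, labels):
    print(signature, "noise" if label == -1 else f"cluster {label}")
```

In this toy input the isolated signature is labeled as noise, which is the property that makes a density-based method attractive here: spurious signatures from alignment errors do not have to be forced into a cluster.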
Subjects
Algorithms , Benchmarking , Bayes Theorem , Genotype , Cluster Analysis
ABSTRACT
The Terminal Fusarium Clade (TFC) is a group in the Nectriaceae family with agricultural and clinical relevance. In recent years, various phylogenies have been presented in the literature, showing disagreement in the topologies, but only a few studies have conducted analyses on the divergence time scale of the group. Therefore, the evolutionary history of this group is still being determined. This study aimed to understand the evolutionary history of the TFC from a phylogenomic perspective. To achieve this objective, we performed a phylogenomic analysis using the available genomes in GenBank and ran eight different pipelines. We present a new robust topology of the TFC that differs at some nodes from previous studies. These new relationships allowed us to formulate new hypotheses about the evolutionary history of the TFC. We also inferred new divergence time estimates, which differ from those of previous studies due to topology discordances and taxon sampling. The results suggest an important diversification process in the Neogene period, likely associated with the diversification of angiosperms and their predominance in terrestrial ecosystems. In conclusion, we present a robust time-scale phylogeny that allowed us to formulate new hypotheses regarding the evolutionary history of the TFC.
ABSTRACT
Pod quality and yield traits in snap bean (Phaseolus vulgaris L.) influence consumer preferences, crop adoption by farmers, and the ability of the product to be commercially competitive locally and globally. The objective of the study was to identify the quantitative trait loci (QTL) for pod quality and yield traits in a snap × dry bean recombinant inbred line (RIL) population. A total of 184 F6 RILs derived from a cross between Vanilla (snap bean) and MCM5001 (dry bean) were grown in three field sites in Kenya and one greenhouse environment in Davis, CA, USA. They were genotyped at 5,951 single nucleotide polymorphisms (SNPs), and composite interval mapping was conducted to identify QTL for 16 pod quality and yield traits, including pod wall fiber, pod string, pod size, and harvest metrics. A combined total of 44 QTL were identified in field and greenhouse trials. The QTL for pod quality were identified on chromosomes Pv01, Pv02, Pv03, Pv04, Pv06, and Pv07, and those for pod yield on Pv08. Co-localization of QTL was observed for pod quality and yield traits. Some of the identified QTL overlapped with previously mapped QTL for pod quality and yield traits, while several others were novel. The identified QTL can be used in future marker-assisted selection in snap bean.