Results 1 - 17 of 17
1.
BMC Bioinformatics ; 20(1): 148, 2019 Mar 20.
Article in English | MEDLINE | ID: mdl-30894135

ABSTRACT

BACKGROUND: Genetic studies in tetraploids are lagging behind studies of diploids, as the complex genetics of tetraploids require much more elaborate computational methodologies. Recent advancements in the development of molecular techniques and computational tools facilitate new methods for automated, high-throughput genotype calling in tetraploid species. We report on the upgrade of the widely used fitTetra software, aiming to improve its accuracy, which to date is hampered by technical artefacts in the data. RESULTS: Our upgrade of the fitTetra package is designed for more accurate modelling of complex collections of samples. The package fits a mixture model where some parameters of the model are estimated separately for each sub-collection. When a full-sib family is analyzed, we use parental genotypes to predict the expected segregation in terms of allele dosages in the offspring. More accurate modelling and use of parental data increase the accuracy of dosage calling. We tested the package on data obtained with an Affymetrix Axiom 60 k array and compared its performance with the original version and the recently published ClusterCall tool, showing that at least 20% more SNPs could be called with our updated version. CONCLUSION: Our updated software package shows clearly improved performance in genotype calling accuracy. Estimation of mixing proportions of the underlying dosage distributions is separated for full-sib families (where mixture proportions can be estimated from the parental dosages and inheritance model) and unstructured populations (where they are based on the assumption of Hardy-Weinberg equilibrium). Additionally, as the distributions of signal ratios of the dosage classes can be assumed to be the same for all populations, including parental data for some subpopulations helps to improve fitting other populations as well. The R package fitTetra 2.0 is freely available under the GNU Public License as an Additional file with this article.
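
The Hardy-Weinberg case mentioned in the abstract can be illustrated with a minimal sketch. It is not fitTetra's code: the function name `hwe_dosage_proportions` is hypothetical, and the sketch assumes random union of gametes with no double reduction, so that the expected proportion of each allele dosage (0 to 4 copies) is binomial in the allele frequency.

```python
from math import comb


def hwe_dosage_proportions(p, ploidy=4):
    """Expected proportion of each allele dosage (0..ploidy) under
    Hardy-Weinberg equilibrium, modelled as binomial sampling of alleles.

    Assumption for illustration only: no double reduction, random mating.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return [comb(ploidy, d) * p**d * (1 - p) ** (ploidy - d)
            for d in range(ploidy + 1)]


# Mixing proportions for the five tetraploid dosage classes at p = 0.5.
props = hwe_dosage_proportions(0.5)
print(props)  # [0.0625, 0.25, 0.375, 0.25, 0.0625]
```

In a mixture model of this kind, proportions like these would serve as the weights of the dosage classes; for a full-sib family they would instead be derived from the parental dosages.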


Subjects
Algorithms , Population Genetics , Single Nucleotide Polymorphism , Software , Tetraploidy , Alleles , Genotype , Humans , Oligonucleotide Array Sequence Analysis
2.
Mol Ecol ; 28(21): 4737-4754, 2019 Nov.
Article in English | MEDLINE | ID: mdl-31550391

ABSTRACT

For half a century population genetics studies have put type II restriction endonucleases to work. Now, coupled with massively-parallel, short-read sequencing, the family of RAD protocols that wields these enzymes has generated vast genetic knowledge from the natural world. Here, we describe the first software natively capable of using paired-end sequencing to derive short contigs from de novo RAD data. Stacks version 2 employs a de Bruijn graph assembler to build and connect contigs from forward and reverse reads for each de novo RAD locus, which it then uses as a reference for read alignments. The new architecture allows all the individuals in a metapopulation to be considered at the same time as each RAD locus is processed. This enables a Bayesian genotype caller to provide precise SNPs, and a robust algorithm to phase those SNPs into long haplotypes, generating RAD loci that are 400-800 bp in length. To prove its recall and precision, we tested the software with simulated data and compared reference-aligned and de novo analyses of three empirical data sets. Our study shows that the latest version of Stacks is highly accurate and outperforms other software in assembling and genotyping paired-end de novo data sets.
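
As a rough illustration of how a de Bruijn graph can join a forward and a reverse read into one contig, the following toy sketch may help. It is not the Stacks implementation: the function names, the value of k, and the example reads are all hypothetical, and the reverse read is assumed to be already reverse-complemented and error-free.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in any read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle rather than loop forever
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping forward and (reverse-complemented) reverse reads.
reads = ["ATGGCGT", "GCGTGCA"]
graph = de_bruijn_edges(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)  # ATGGCGTGCA
```

A real assembler also has to handle sequencing errors, branching paths, and coverage thresholds, which this sketch deliberately leaves out.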


Subjects
Population Genetics/methods , High-Throughput Nucleotide Sequencing/methods , DNA Sequence Analysis/methods , Algorithms , Bayes Theorem , Genotype , Humans , Metagenomics/methods , Phenotype , Single Nucleotide Polymorphism/genetics , Software
3.
Methods ; 79-80: 41-6, 2015 Jun.
Article in English | MEDLINE | ID: mdl-25644447

ABSTRACT

Next-generation sequencing (NGS) technologies, which can provide base-pair resolution genetic information for all types of genetic variations, are increasingly used in genetics research. However, due to the complex nature of NGS technologies and analytics and their relatively high cost, investigators face practical challenges for both design and analysis. These challenges are further complicated by recent methodological developments that make it possible to use haplotype information in sequencing reads. In light of these developments, we conducted comprehensive simulations to evaluate the effects of sequencing coverage, insert size of paired-end reads, and sample size on genotype calling and haplotype phasing in NGS studies. In contrast to previous studies that typically use idealized scenarios to tease out the effects of individual design and analytic decisions, we used a complete analytical pipeline from read mapping and variant detection to genotype calling and haplotype phasing so that we can assess the joint effects of multiple decisions and thus make more realistic recommendations to investigators. Consistent with previous studies, we found that the use of haplotype information in reads can improve the accuracy of genotype calling and haplotype phasing, and we also found that a mixture of short and long insert sizes of paired-end reads may offer even greater accuracy. However, this benefit is only clear in high coverage sequencing where variant detection is close to perfect. Finally, we observed that LD-based refinement methods do not always outperform single site based methods for genotype calling. Therefore, we should choose analytical methods that are appropriate to the sequencing coverage and sample size in order to use haplotype information in sequencing reads.


Subjects
Genotyping Techniques , DNA Sequence Analysis , Algorithms , Computer Simulation , Genetic Variation , Genotype , Haplotypes
4.
Anim Genet ; 46(1): 82-6, 2015 Feb.
Article in English | MEDLINE | ID: mdl-25515399

ABSTRACT

The number of polymorphisms identified with next-generation sequencing approaches depends directly on the sequencing depth and therefore on the experimental cost. Although higher levels of depth ensure more sensitive and more specific SNP calls, economic constraints limit the increase of depth for whole-genome resequencing (WGS). For this reason, capture resequencing is used for studies focusing on only some specific regions of the genome. However, several biases in capture resequencing are known to have a negative impact on the sensitivity of SNP detection. Within this framework, the aim of this study was to compare the accuracy of WGS and capture resequencing for SNP detection and genotype calling, which differ in terms of both sequencing depth and biases. We evaluated the SNP calling and genotyping accuracy in a WGS dataset (13X) and in a capture resequencing dataset (87X) performed on 11 individuals. The percentage of SNPs not identified due to a sevenfold sequencing depth decrease was estimated at 7.8% using a down-sampling procedure on the capture sequencing dataset. A comparison of the 87X capture sequencing dataset with the WGS dataset revealed that capture-related biases led to the loss of 5.2% of the SNPs detected with WGS. Nevertheless, when considering the SNPs detected by both approaches, capture sequencing appears to achieve far better SNP genotyping, with about 4.4% of the WGS genotypes that can be considered erroneous, rising to 10% when focusing on heterozygous genotypes. In conclusion, WGS and capture deep sequencing can be considered equivalent strategies for SNP detection, as the rate of SNPs not identified because of a low sequencing depth in the former is quite similar to the rate of SNPs missed because of method biases in the latter. On the other hand, capture deep sequencing clearly appears better suited to studies requiring great accuracy in genotyping.


Subjects
Genotyping Techniques/methods , High-Throughput Nucleotide Sequencing/methods , Single Nucleotide Polymorphism , DNA Sequence Analysis/methods , Animals , Chickens/genetics , Genome , Genotype
5.
Mathematics (Basel) ; 11(11)2023 Jun 01.
Article in English | MEDLINE | ID: mdl-38721066

ABSTRACT

Association testing has been widely used to study the relationship between genetic variants and phenotypes. Most association testing methods are genotype-based, i.e. they first estimate the genotype and then regress the phenotype on the estimated genotype and other variables. Direct testing methods based on next-generation sequencing (NGS) data without genotype calling have been proposed and have shown an advantage over genotype-based methods in scenarios where genotype calling is not accurate. NGS data-based single-variant testing methods have been proposed, including our previously proposed single-variant testing method, the UNC combo method [1]. NGS data-based group testing methods for continuous phenotypes have also been proposed by us using a linear model framework, which can handle continuous responses [2]. In this paper, we extend our linear model-based framework to a generalized linear model-based framework so that the methods can handle other types of responses, especially binary responses, which are commonly encountered in association studies. We have conducted extensive simulation studies to evaluate the performance of different estimators and compare our estimators with their corresponding genotype-based methods. We found that all methods have Type I errors controlled, and our NGS data-based testing methods perform better than their corresponding genotype-based methods in the literature for other types of responses, including binary responses (logistic regression) and count responses (Poisson regression), especially when sequencing depth is low. In conclusion, we have extended our previous linear model (LM) framework to a generalized linear model (GLM) framework and derived NGS data-based testing methods for a group of genetic variants. Compared with our previously proposed LM-based methods [2], the new GLM-based methods can handle more complex responses (for example, binary responses and count responses) in addition to continuous responses. Our methods fill a gap in the literature and show an advantage over their corresponding genotype-based methods.

6.
Appl Plant Sci ; 10(6): e11499, 2022.
Article in English | MEDLINE | ID: mdl-36518944

ABSTRACT

Premise: Although several software packages are available for genotyping insertion/deletion (indel) polymorphisms in genomes using next-generation sequencing data, simultaneously calling indel genotypes across many individuals for use in genetic mapping remains challenging. Methods and Results: We present an integrated pipeline, InDelGT, for the extraction of indel genotypes from a segregating population such as backcross or F2 lines, or from an F1 cross between outbred species. The InDelGT algorithm is implemented in three steps: generating an indel catalog, calling indel genotypes, and analyzing indel segregation. We demonstrated the use of the pipeline with an example data set from an F1 hybrid population of Populus and successfully constructed the two parental genetic linkage maps. Conclusions: InDelGT is a practical tool that can quickly genotype a large number of indel markers within a population following Mendelian segregation. The InDelGT pipeline is freely available on GitHub (https://github.com/tongchf/InDelGT).

7.
Comput Struct Biotechnol J ; 20: 3729-3733, 2022.
Article in English | MEDLINE | ID: mdl-35891781

ABSTRACT

RNA sequence data are commonly summarized as read counts. By contrast, so far there is no alternative to genotype calling for investigating the relationship between genetic variants determined by next-generation sequencing (NGS) and a phenotype of interest. Here we propose and evaluate the direct analysis of allele counts for genetic association tests. Specifically, we assess the potential advantage of the ratio of alternative allele counts to the total number of reads aligned at a specific position of the genome (coverage) over called genotypes. We simulated association studies based on NGS data from HapMap individuals. Genotype quality scores and allele counts were simulated using NGS data from the Personal Genome Project. Real data from the 1000 Genomes Project was also used to compare the two competing approaches. The average proportions of probability values lower than or equal to 0.05 amounted to 0.0496 for called genotypes and 0.0485 for the ratio of alternative allele counts to coverage in the null scenario, and to 0.69 for called genotypes and 0.75 for the ratio of alternative allele counts to coverage in the alternative scenario (9% power increase). The advantage in statistical power of the novel approach increased with decreasing coverage, with decreasing genotype quality and with decreasing allele frequency, reaching a 124% power increase for variants with a minor allele frequency lower than 0.05. We provide computer code in R to implement the novel approach, which does not preclude the use of complementary data quality filters before or after identification of the most promising association signals. Author summary: Genetic association tests usually rely on called genotypes. We postulate here that the direct analysis of allele counts from sequence data improves the quality of statistical inference. To evaluate this hypothesis, we investigate simulated and real data using distinct statistical approaches. We demonstrate that association tests based on allele counts rather than called genotypes achieve higher statistical power with controlled type I error rates.
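
To make the two competing predictors concrete, here is a small sketch of the ratio of alternative allele counts to coverage next to a hard genotype call. The authors provide their own R code; this Python fragment is only an illustration, with hypothetical function names and arbitrary cut-offs (0.2 and 0.8) that are not taken from the article.

```python
def alt_allele_ratio(alt_count, coverage):
    """Ratio of alternative-allele reads to all reads at a site (None if uncovered)."""
    if coverage == 0:
        return None
    return alt_count / coverage


def naive_called_genotype(alt_count, coverage, low=0.2, high=0.8):
    """Hard call (0, 1 or 2 alternative alleles) from fixed, illustrative cut-offs."""
    ratio = alt_allele_ratio(alt_count, coverage)
    if ratio is None:
        return None
    if ratio < low:
        return 0
    if ratio > high:
        return 2
    return 1


# Hypothetical (alt reads, coverage) pairs for four individuals at one site.
sites = [(0, 6), (1, 4), (3, 12), (5, 5)]
ratios = [alt_allele_ratio(a, c) for a, c in sites]
calls = [naive_called_genotype(a, c) for a, c in sites]
print(ratios)  # [0.0, 0.25, 0.25, 1.0]
print(calls)   # [0, 1, 1, 2]
```

Either vector could then be used as the explanatory variable in an association test; the ratio stays continuous instead of being forced into three classes.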

8.
Comput Biol Chem ; 94: 107417, 2021 Oct.
Article in English | MEDLINE | ID: mdl-33810991

ABSTRACT

Genotype plays a significant role in determining the characteristics of an organism, and genotype calling has been greatly accelerated by sequencing technologies. However, most parametric statistical models are unable to call genotypes effectively, because calling is influenced by the size of structural variations and by coverage fluctuations in the sequencing data. In this study, we propose a new method, called Cnngeno, for calling the genotypes of deletions from next-generation sequencing data. Cnngeno converts sequencing data into images and classifies the genotypes from these images using a convolutional neural network (CNN). Moreover, Cnngeno adopts the convolutional bootstrapping strategy to improve its robustness to noisy labels. The results show that Cnngeno performs better in terms of precision for genotype calling than other existing methods. Cnngeno is an open-source method, available at https://github.com/BRF123/Cnngeno.


Subjects
Deep Learning , High-Throughput Nucleotide Sequencing , Computer Neural Networks , Genotype , Humans
9.
Mol Ecol Resour ; 21(4): 1085-1097, 2021 May.
Article in English | MEDLINE | ID: mdl-33434329

ABSTRACT

Genotyping-by-sequencing methods such as RADseq are popular for generating genomic and population-scale data sets from a diverse range of organisms. These often lack a usable reference genome, restricting users to RADseq-specific software for processing. However, these come with limitations compared to generic next-generation sequencing (NGS) toolkits. Here, we describe and test a simple pipeline for reference-free RADseq data processing that blends de novo elements from STACKS with the full suite of state-of-the-art NGS tools. Specifically, we use the de novo RADseq assembly employed by STACKS to create a catalogue of RAD loci that serves as a reference for read mapping, variant calling and site filters. Using RADseq data from 28 zebra sequenced to ~8x depth-of-coverage we evaluate our approach by comparing the site frequency spectra (SFS) to those from alternative pipelines. Most pipelines yielded similar SFS at 8x depth, but only a genotype-likelihood-based pipeline performed similarly at low sequencing depth (2-4x). We compared the RADseq SFS with medium-depth (~13x) shotgun sequencing of eight overlapping samples, revealing that the RADseq SFS was persistently slightly skewed towards rare and invariant alleles. Using simulations and human data we confirm that this is expected when there is allelic dropout (AD) in the RADseq data. AD in the RADseq data caused a heterozygosity deficit of ~16%, which dropped to ~5% after filtering AD. Hence, AD was the most important source of bias in our RADseq data.
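
For readers unfamiliar with the site frequency spectrum used as the comparison metric above, a minimal sketch of how an unfolded SFS can be tabulated from hard genotype calls follows. It is not the pipeline described in the abstract: the function name and data are hypothetical, and it assumes diploid genotypes coded as the number of derived alleles with no missing data.

```python
from collections import Counter


def site_frequency_spectrum(genotypes):
    """Unfolded SFS from a list of sites.

    Each site is a list of diploid genotypes coded as 0, 1 or 2 derived
    alleles; every site must cover the same individuals (no missing data).
    Entry i of the result is the number of sites carrying i derived alleles.
    """
    n_chrom = 2 * len(genotypes[0])
    counts = Counter(sum(site) for site in genotypes)
    return [counts.get(i, 0) for i in range(n_chrom + 1)]


# Five hypothetical sites genotyped in three individuals.
sites = [[0, 0, 1], [0, 1, 1], [2, 2, 2], [0, 0, 0], [1, 0, 0]]
sfs = site_frequency_spectrum(sites)
print(sfs)  # [1, 2, 1, 0, 0, 0, 1]
```

Allelic dropout would make true heterozygotes (coded 1) appear as 0 or 2, which shifts sites between SFS classes; that is the kind of skew the abstract reports.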


Subjects
High-Throughput Nucleotide Sequencing , DNA Sequence Analysis , Software , Animals , Equidae/genetics , Genomics , Humans , Likelihood Functions , Loss of Heterozygosity , Single Nucleotide Polymorphism
10.
Front Genet ; 12: 655707, 2021.
Article in English | MEDLINE | ID: mdl-34262593

ABSTRACT

In addition to their common usages to study gene expression, RNA-seq data accumulated over the last 10 years are a yet-unexploited resource of SNPs in numerous individuals from different populations. SNP detection by RNA-seq is particularly interesting for livestock species since whole genome sequencing is expensive and exome sequencing tools are unavailable. These SNPs detected in expressed regions can be used to characterize variants affecting protein functions, and to study cis-regulated genes by analyzing allele-specific expression (ASE) in the tissue of interest. However, gene expression can be highly variable, and filters for SNP detection using the popular GATK toolkit are not yet standardized, making SNP detection and genotype calling by RNA-seq a challenging endeavor. We compared SNP calling results using GATK suggested filters, on two chicken populations for which both RNA-seq and DNA-seq data were available for the same samples of the same tissue. We showed, in expressed regions, an RNA-seq precision of 91% (SNPs detected by RNA-seq and shared by DNA-seq) and we characterized the remaining 9% of SNPs. We then studied the genotype (GT) obtained by RNA-seq and the impact of two factors (GT call-rate and read number per GT) on the concordance of GT with DNA-seq; we proposed thresholds for them leading to a 95% concordance. Applying these thresholds to 767 multi-tissue RNA-seq of 382 birds of 11 chicken populations, we found 9.5 M SNPs in total, of which ∼550,000 SNPs per tissue and population had a reliable GT (call rate ≥ 50%), and among them ∼340,000 had a MAF ≥ 10%. We showed that such RNA-seq data from one tissue can be used to (i) detect SNPs with a strong predicted impact on proteins, despite their scarcity in each population (16,307 SIFT deleterious missenses and 590 stop-gained), (ii) study, on a large scale, cis-regulations of gene expression, with ∼81% of protein-coding and 68% of long non-coding genes (TPM ≥ 1) that can be analyzed for ASE, and with ∼29% of them that were cis-regulated, and (iii) analyze population genetics using such SNPs located in expressed regions. This work shows that RNA-seq data can be used with good confidence to detect SNPs and associated GT within various populations and to use them for different analyses, as in GTEx studies.

11.
Mol Ecol Resour ; 20(1): 114-124, 2020 Jan.
Article in English | MEDLINE | ID: mdl-31483931

ABSTRACT

Minimally invasive sampling (MIS) is widespread in wildlife studies; however, its utility for massively parallel DNA sequencing (MPS) is limited. Poor sample quality and contamination by exogenous DNA can make MIS challenging to use with modern genotyping-by-sequencing approaches, which have been traditionally developed for high-quality DNA sources. Given that MIS is often more appropriate in many contexts, there is a need to make such samples practical for harnessing MPS. Here, we test the ability of Genotyping-in-Thousands by sequencing (GT-seq), a multiplex amplicon sequencing approach, to effectively genotype minimally invasive cloacal DNA samples collected from the Western Rattlesnake (Crotalus oreganus), a threatened species in British Columbia, Canada. As there was no previous genetic information for this species, an optimized panel of 362 SNPs was selected for use with GT-seq from a de novo restriction site-associated DNA sequencing (RADseq) assembly. Comparisons of genotypes generated within and among RADseq and GT-seq for the same individuals found low rates of genotyping error (GT-seq: 0.50%; RADseq: 0.80%) and discordance (2.57%), the latter likely due to the different genotype calling models employed. GT-seq mean genotype discordance between blood and cloacal swab samples collected from the same individuals was also minimal (1.37%). Estimates of population diversity parameters were similar across GT-seq and RADseq data sets, as were inferred patterns of population structure. Overall, GT-seq can be effectively applied to low-quality DNA samples, minimizing the inefficiencies presented by exogenous DNA typically found in minimally invasive samples and continuing the expansion of molecular ecology and conservation genetics in the genomics era.
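
A discordance rate such as those quoted above can be computed, under one plausible definition, as the fraction of loci called in both samples at which the genotypes differ. The exact definition used in the study is not given in the abstract, so the sketch below is an assumption; the function name and the example calls are hypothetical.

```python
def genotype_discordance(calls_a, calls_b):
    """Fraction of loci, called in both samples, at which the genotypes differ.

    Missing calls are represented by None and excluded from the comparison.
    """
    compared = [(a, b) for a, b in zip(calls_a, calls_b)
                if a is not None and b is not None]
    if not compared:
        raise ValueError("no loci called in both samples")
    return sum(a != b for a, b in compared) / len(compared)


# Hypothetical calls for one individual: blood sample versus cloacal swab.
blood = [0, 1, 2, 1, None, 0]
swab = [0, 1, 2, 2, 1, None]
rate = genotype_discordance(blood, swab)
print(rate)  # 0.25
```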


Subjects
Crotalus/genetics , DNA/genetics , Genotyping Techniques/methods , High-Throughput Nucleotide Sequencing/methods , Animals , British Columbia , Endangered Species , Genomics , Genotype , Single Nucleotide Polymorphism
12.
G3 (Bethesda) ; 9(3): 663-673, 2019 Mar 07.
Article in English | MEDLINE | ID: mdl-30655271

ABSTRACT

Low or uneven read depth is a common limitation of genotyping-by-sequencing (GBS) and restriction site-associated DNA sequencing (RAD-seq), resulting in high missing data rates, heterozygotes miscalled as homozygotes, and uncertainty of allele copy number in heterozygous polyploids. Bayesian genotype calling can mitigate these issues, but previously has only been implemented in software that requires a reference genome or uses priors that may be inappropriate for the population. Here we present several novel Bayesian algorithms that estimate genotype posterior probabilities, all of which are implemented in a new R package, polyRAD. Appropriate priors can be specified for mapping populations, populations in Hardy-Weinberg equilibrium, or structured populations, and in each case can be informed by genotypes at linked markers. The polyRAD software imports read depth from several existing pipelines, and outputs continuous or discrete numerical genotypes suitable for analyses such as genome-wide association and genomic prediction.
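
The general idea of Bayesian genotype calling from read depth can be sketched as follows. This is not polyRAD's algorithm or API: it is a minimal illustration that assumes a Hardy-Weinberg prior, a binomial read-sampling likelihood, and a symmetric sequencing error rate, and the function name and default error value are hypothetical.

```python
from math import comb


def genotype_posteriors(ref_reads, alt_reads, alt_freq, ploidy=2, error=0.01):
    """Posterior probability of each alternative-allele copy number (0..ploidy).

    Prior: Hardy-Weinberg proportions for allele frequency `alt_freq`.
    Likelihood: binomial sampling of reads, with a symmetric error rate.
    """
    n = ref_reads + alt_reads
    joint = []
    for g in range(ploidy + 1):
        prior = comb(ploidy, g) * alt_freq**g * (1 - alt_freq) ** (ploidy - g)
        # Probability that a single read shows the alternative allele.
        p_alt = (g / ploidy) * (1 - error) + (1 - g / ploidy) * error
        like = comb(n, alt_reads) * p_alt**alt_reads * (1 - p_alt) ** ref_reads
        joint.append(prior * like)
    total = sum(joint)
    if total == 0:
        raise ValueError("data have zero probability under the model")
    return [j / total for j in joint]


# Two reference reads and no alternative reads at a site with allele frequency 0.5.
post = genotype_posteriors(ref_reads=2, alt_reads=0, alt_freq=0.5)
print([round(p, 3) for p in post])
```

With only two reads, the heterozygote keeps a substantial posterior probability, which is how this kind of model avoids miscalling low-depth heterozygotes as homozygotes.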


Subjects
Diploidy , Single Nucleotide Polymorphism , Polyploidy , DNA Sequence Analysis/methods , Software , Algorithms , Bayes Theorem , Genotyping Techniques , High-Throughput Nucleotide Sequencing/methods , Poaceae/genetics , Uncertainty
13.
Front Plant Sci ; 9: 104, 2018.
Article in English | MEDLINE | ID: mdl-29467780

ABSTRACT

Polyploid species play significant roles in agriculture and food production. Many crop species are polyploid, such as potato, wheat, strawberry, and sugarcane. Genotyping has been a daunting task for genetic studies of polyploid crops, which lag far behind the diploid crop species. The single nucleotide polymorphism (SNP) array is considered to be one of the high-throughput, relatively cost-efficient and automated genotyping approaches. However, there are significant challenges for SNP identification in complex, polyploid genomes, which have seriously slowed SNP discovery and array development in polyploid species. Ploidy is a significant factor impacting SNP qualities and validation rates of SNP markers in SNP arrays, which have been proven to be a very important tool for genetic studies and molecular breeding. In this review, we (1) discussed the pros and cons of SNP arrays in general for high-throughput genotyping, (2) presented the challenges of and solutions to SNP calling in polyploid species, (3) summarized the SNP selection criteria and considerations of SNP array design for polyploid species, (4) illustrated SNP array applications in several different polyploid crop species, then (5) discussed challenges, available software, and their accuracy comparisons for genotype calling based on SNP array data in polyploids, and finally (6) provided a series of SNP array design and genotype calling recommendations. This review presents a complete overview of SNP array development and applications in polyploid crops, which will benefit research in molecular breeding and genetics of crops with complex genomes.

14.
J Appl Genet ; 57(1): 71-9, 2016 Feb.
Article in English | MEDLINE | ID: mdl-26055432

ABSTRACT

Application of massively parallel sequencing technology has become one of the most important issues in life sciences. Therefore, it was crucial to develop bioinformatics tools for next-generation sequencing (NGS) data processing. Currently, two of the most significant tasks include alignment to a reference genome and detection of single nucleotide polymorphisms (SNPs). In many types of genomic analyses, great numbers of reads need to be mapped to the reference genome; therefore, selection of the aligner is an essential step in NGS pipelines. Two main algorithms, suffix tries and hash tables, have been introduced for this purpose. Suffix array-based aligners are memory-efficient and work faster than hash-based aligners, but they are less accurate. In contrast, hash table algorithms tend to be slower, but more sensitive. SNP and genotype callers may also be divided into two main approaches: heuristic and probabilistic methods. A variety of software has been developed over the past several years. In this paper, we briefly review the current development of NGS data processing algorithms and present the available software.
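
The hash-table approach to read mapping mentioned above can be illustrated with a toy seed index. This sketch does not reproduce any particular aligner: the function names, the seed length k, and the sequences are hypothetical, and only exact seed matches are considered, with no extension or scoring step.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash table mapping every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_positions(read, index, k):
    """Candidate reference start positions for a read, from exact k-mer seed hits."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            candidates.add(pos - offset)
    return sorted(candidates)


reference = "ACGTTGCAAGCT"
index = build_kmer_index(reference, k=4)
hits = seed_positions("TTGCAA", index, k=4)
print(hits)  # [3]
```

A real hash-based aligner would follow each candidate position with a verification or extension step that tolerates mismatches and indels.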


Subjects
Algorithms , High-Throughput Nucleotide Sequencing/methods , Single Nucleotide Polymorphism , DNA Sequence Analysis/methods , Software , Computational Biology , Sequence Alignment
15.
J Appl Stat ; 40(6): 1372-1381, 2013.
Article in English | MEDLINE | ID: mdl-23667285

ABSTRACT

We study the genotype calling algorithms for the high-throughput single-nucleotide polymorphism (SNP) arrays. Building upon the novel SNP-RMA preprocessing approach and the state-of-the-art CRLMM approach for genotype calling, we propose a simple modification to better model and combine the information across multiple SNPs with empirical Bayes modeling, which could often significantly improve the genotype calling of CRLMM. Through applications to the HapMap Trio data set and a non-HapMap test set of high quality SNP chips, we illustrate the competitive performance of the proposed method.

16.
Comput Biol Med ; 43(9): 1171-6, 2013 Sep.
Article in English | MEDLINE | ID: mdl-23930810

ABSTRACT

We present NGSPE, a pipeline for variation discovery and genotyping of paired-end Illumina next-generation sequencing (NGS) data (http://ngspeanalysis.sourceforge.net/). This pipeline not only describes a set of sequential analytical steps, such as short-read alignment, genotype calling and functional variation annotation, that can be conducted using open-source software tools, but also provides users with a set of scripts to install the dependent software and resources and to run the pipeline on their data. A sample summary report is also provided, including the concordance rate between data generated by this pipeline and different resources, as well as a comparison between replicate samples from two commercial platforms, Illumina and Complete Genomics. Furthermore, some of the mutations identified by the pipeline were verified using Sanger sequencing.


Subjects
Human Genome/genetics , Genomics/methods , DNA Sequence Analysis/methods , Software , Humans
17.
Stat Biosci ; 5(1): 3-25, 2013 May.
Article in English | MEDLINE | ID: mdl-24489615

ABSTRACT

Massively parallel sequencing (MPS), since its debut in 2005, has transformed the field of genomic studies. These new sequencing technologies have resulted in the successful identification of causal variants for several rare Mendelian disorders. They have also begun to deliver on their promise to explain some of the missing heritability from genome-wide association studies (GWAS) of complex traits. We anticipate a rapidly growing number of MPS-based studies for a diverse range of applications in the near future. One crucial and nearly inevitable step is to detect SNPs and call genotypes at the detected polymorphic sites from the sequencing data. Here, we review statistical methods that have been proposed in the past five years for this purpose. In addition, we discuss emerging issues and future directions related to SNP detection and genotype calling from MPS data.
