Results 1 - 20 of 109
1.
Genes (Basel) ; 12(11)2021 11 18.
Article in English | MEDLINE | ID: mdl-34828415

ABSTRACT

Multiple sequence alignment (MSA) is the basis for almost all sequence comparison and molecular phylogenetic inferences. Large-scale genomic analyses are typically associated with automated progressive MSA without subsequent manual adjustment, which itself is often error-prone because of the lack of a consistent and explicit criterion. Here, I outlined several commonly encountered alignment errors that cannot be avoided by progressive MSA for nucleotide, amino acid, and codon sequences. Methods that could be automated to fix such alignment errors were then presented. I emphasized the utility of position weight matrix as a new tool for MSA refinement and illustrated its usage by refining the MSA of nucleotide and amino acid sequences. The main advantages of the position weight matrix approach include (1) its use of information from all sequences, in contrast to other commonly used methods based on pairwise alignment scores and inconsistency measures, and (2) its speedy computation, making it suitable for a large number of long viral genomic sequences.


Subject(s)
Laboratory Automation/methods, Genomics/methods, Sequence Alignment/methods, Algorithms, Animals, Laboratory Automation/standards, Genomics/standards, Humans, Phylogeny, Sensitivity and Specificity, Sequence Alignment/standards, DNA Sequence Analysis/methods, DNA Sequence Analysis/standards, Protein Sequence Analysis/methods, Protein Sequence Analysis/standards
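
The position weight matrix idea in this abstract can be illustrated with a minimal sketch. It is not the author's implementation: the uniform background, the pseudocount, and the names `build_pwm` and `score_segment` are assumptions made for illustration. Each alignment column becomes a set of log-odds scores built from all sequences, and a candidate segment is scored against those columns.

```python
import math
from collections import Counter

def build_pwm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2-odds score."""
    n_cols = len(aligned[0])
    background = 1.0 / len(alphabet)  # assumption: uniform background frequencies
    pwm = []
    for col in range(n_cols):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[col] for seq in aligned if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pwm.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background)
            for a in alphabet
        })
    return pwm

def score_segment(pwm, segment, start=0):
    """Sum the column scores of an ungapped segment placed at column `start`."""
    return sum(
        pwm[start + i][ch]
        for i, ch in enumerate(segment)
        if ch in pwm[start + i]
    )

# Hypothetical toy alignment, for illustration only.
msa = ["ACGT-A", "ACGTTA", "ACG-TA", "ACGTTA"]
pwm = build_pwm(msa)
print(score_segment(pwm, "ACG"), score_segment(pwm, "TTT"))
```

A segment that agrees with the column consensus scores higher than one that does not, which is the signal a refinement step could use when deciding where to place a poorly aligned fragment.
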
2.
Nature ; 587(7833): 246-251, 2020 11.
Article in English | MEDLINE | ID: mdl-33177663

ABSTRACT

New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies [1-3]. For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database [4] increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies [5] are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus [6], a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.


Subject(s)
Genome/genetics, Genomics/methods, Sequence Alignment/methods, Software, Vertebrates/genetics, Amnion, Animals, Computer Simulation, Genomics/standards, Haplotypes, Humans, Quality Control, Sequence Alignment/standards, Software/standards
3.
Gigascience ; 9(2)2020 02 01.
Article in English | MEDLINE | ID: mdl-32025702

ABSTRACT

BACKGROUND: Accurately identifying single-nucleotide polymorphisms (SNPs) from bacterial sequencing data is an essential requirement for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, which, together with the bioinformatic pipeline, affects the accuracy and completeness of a set of SNP calls obtained. This study evaluates the performance of 209 SNP-calling pipelines using a combination of simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia, and Klebsiella. RESULTS: We evaluated the performance of 209 SNP-calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, there was a strong inverse relationship between pipeline sensitivity and precision, and the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic bacteria such as Escherichia coli but less dominant for clonal species such as Mycobacterium tuberculosis. CONCLUSIONS: The accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the same genome from which they were sequenced, among the highest-performing pipelines was Novoalign/GATK. By contrast, when reads were aligned to particularly divergent genomes, the highest-performing pipelines often used the aligners NextGenMap or SMALT, and/or the variant callers LoFreq, mpileup, or Strelka.


Subject(s)
Bacterial Genome, Genomics/standards, Single Nucleotide Polymorphism, Software/standards, Escherichia coli/genetics, Genomics/methods, Genotyping Techniques/methods, Genotyping Techniques/standards, Mycobacterium tuberculosis/genetics, Genetic Recombination, Sequence Alignment/methods, Sequence Alignment/standards
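
The sensitivity and precision reported in this abstract are the standard ratios of true-positive calls to true variants and to all calls, respectively. The sketch below shows that arithmetic on sets of variant records. The `(position, ref, alt)` tuple layout and the function name are assumptions for illustration and do not reflect the study's evaluation code.

```python
def precision_and_sensitivity(called, truth):
    """Compare a set of called SNPs against a truth set.

    Each SNP is a hashable record; here a hypothetical (position, ref, alt) tuple.
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    sensitivity = true_positives / len(truth) if truth else 0.0
    return precision, sensitivity

# Hypothetical toy data, for illustration only.
truth = {(1042, "A", "G"), (2310, "C", "T"), (5120, "G", "A"), (7777, "T", "C")}
called = {(1042, "A", "G"), (2310, "C", "T"), (6001, "A", "C")}
precision, sensitivity = precision_and_sensitivity(called, truth)
print(f"precision={precision:.3f} sensitivity={sensitivity:.3f}")
```
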
4.
Gigascience ; 8(7)2019 07 01.
Article in English | MEDLINE | ID: mdl-31289836

ABSTRACT

BACKGROUND: Mammalian X and Y chromosomes share a common evolutionary origin and retain regions of high sequence similarity. Similar sequence content can confound the mapping of short next-generation sequencing reads to a reference genome. It is therefore possible that the presence of both sex chromosomes in a reference genome can cause technical artifacts in genomic data and affect downstream analyses and applications. Understanding this problem is critical for medical genomics and population genomic inference. RESULTS: Here, we characterize how sequence homology can affect analyses on the sex chromosomes and present XYalign, a new tool that (1) facilitates the inference of sex chromosome complement from next-generation sequencing data; (2) corrects erroneous read mapping on the sex chromosomes; and (3) tabulates and visualizes important metrics for quality control such as mapping quality, sequencing depth, and allele balance. We find that sequence homology affects read mapping on the sex chromosomes and this has downstream effects on variant calling. However, we show that XYalign can correct mismapping, resulting in more accurate variant calling. We also show how metrics output by XYalign can be used to identify XX and XY individuals across diverse sequencing experiments, including low- and high-coverage whole-genome sequencing, and exome sequencing. Finally, we discuss how the flexibility of the XYalign framework can be leveraged for other uses including the identification of aneuploidy on the autosomes. XYalign is available open source under the GNU General Public License (version 3). CONCLUSIONS: Sex chromosome sequence homology causes the mismapping of short reads, which in turn affects downstream analyses. XYalign provides a reproducible framework to correct mismapping and improve variant calling on the sex chromosomes.


Subject(s)
Human X Chromosomes/genetics, Human Y Chromosomes/genetics, High-Throughput Nucleotide Sequencing/methods, DNA Sequence Analysis/methods, Nucleic Acid Sequence Homology, Artifacts, Contig Mapping/methods, Contig Mapping/standards, Female, High-Throughput Nucleotide Sequencing/standards, Humans, Male, Sequence Alignment/methods, Sequence Alignment/standards, DNA Sequence Analysis/standards
5.
Gigascience ; 8(7)2019 07 01.
Article in English | MEDLINE | ID: mdl-31251324

ABSTRACT

Biclustering is a technique of discovering local similarities within data. For many years the complexity of the methods and parallelization issues limited its application to big data problems. With the development of novel scalable methods, biclustering has finally started to close this gap. In this paper we discuss the caveats of biclustering and present its current challenges and guidelines for practitioners. We also try to explain why biclustering may soon become one of the standards for big data analytics.


Subject(s)
Big Data, Genomics/methods, DNA Sequence Analysis/methods, Cluster Analysis, Data Mining/methods, Human Genome, Genomics/standards, Humans, Sequence Alignment/methods, Sequence Alignment/standards, DNA Sequence Analysis/standards, Software
6.
BMC Vet Res ; 15(1): 135, 2019 May 08.
Article in English | MEDLINE | ID: mdl-31068211

ABSTRACT

BACKGROUND: Porcine reproductive and respiratory syndrome (PRRS) is a major threat to the swine industry. It is caused by the PRRS virus (PRRSV). Determination and comparison of the nucleotide sequences of PRRSV strains provides useful information in support of control initiatives or epidemiological studies on transmission patterns. The alignment of sequences is the first step in analyzing sequence data, with multiple algorithms being available, but little is known on the impact of this methodological choice. Here, a study was conducted to evaluate the impact of different alignment algorithms on the resulting aligned sequence dataset and on practical issues when applied to a large field database of PRRSV open reading frame (ORF) 5 sequences collected in Quebec, Canada, from 2010 to 2014. Five multiple sequence alignment programs were compared: Clustal W, Clustal Omega, Muscle, T-Coffee and MAFFT. RESULTS: The resulting alignments showed very similar results in terms of average pairwise genetic similarity, proportion of pairwise comparisons having ≥97.5% genetic similarity and sum of pairs (SP) score, except for T-Coffee where increased length of aligned datasets as well as limitation to handle large datasets were observed. CONCLUSIONS: Based on efficiency at minimizing the number of gaps in different dataset sizes with default open gap values as well as the capability to handle a large number of sequences in a timely manner, the use of Clustal Omega might be recommended for the management of PRRSV extensive database for both research and surveillance purposes.


Subject(s)
Algorithms, Genetic Variation, Porcine Respiratory and Reproductive Syndrome Virus/genetics, Sequence Alignment/methods, Sequence Alignment/standards
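
Two of the summary statistics named in this abstract, average pairwise genetic similarity and the proportion of pairs at or above 97.5% similarity, can be sketched as follows. The abstract does not say how gapped columns were treated, so skipping any column where either sequence has a gap is an assumption, as are the function names.

```python
from itertools import combinations

def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def identity_summary(aligned, threshold=0.975):
    """Return (mean pairwise identity, share of pairs with identity >= threshold)."""
    values = [pairwise_identity(a, b) for a, b in combinations(aligned, 2)]
    mean = sum(values) / len(values)
    share = sum(v >= threshold for v in values) / len(values)
    return mean, share

# Hypothetical toy alignment, for illustration only.
aligned = ["ACGTACGT", "ACGTACGT", "ACGTAC-A"]
mean, share = identity_summary(aligned)
print(f"mean identity={mean:.4f}, share >= 97.5%={share:.3f}")
```

Running the same summary on the output of each alignment program would give the kind of side-by-side comparison the study describes.
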
7.
Genes (Basel) ; 10(2)2019 01 22.
Article in English | MEDLINE | ID: mdl-30678245

ABSTRACT

A phylogenetic tree is essential for understanding evolution, and it is usually constructed through multiple sequence alignment, which suffers from a heavy computational burden and requires sophisticated parameter tuning. Recently, alignment-free methods based on k-mer profiles or common substrings have provided alternative ways to construct phylogenetic trees. However, most of these methods ignore the global similarities between sequences or some specific valuable features, e.g., frequent patterns over all datasets. To make further improvement, we propose an alignment-free algorithm based on sequential pattern mining, where each sequence is converted into a binary representation of sequential patterns among sequences. The phylogenetic tree is then constructed by clustering a distance matrix calculated from the pattern vectors. To increase accuracy for highly divergent sequences, we consider pattern weights and filter redundant sub-patterns. Both simulated and real data demonstrate that our method outperforms other alignment-free methods, especially for large sequence sets with low similarity.


Subject(s)
Phylogeny, Sequence Alignment/methods, Software, Sequence Alignment/standards
8.
Syst Biol ; 68(3): 396-411, 2019 05 01.
Article in English | MEDLINE | ID: mdl-30329135

ABSTRACT

The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical coestimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical coestimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy has better precision and recall (with respect to the true alignments) than the other alignment methods on the simulated data sets but has consistently lower recall on the biological benchmarks (with respect to the reference alignments) than many of the other methods. In other words, we find that BAli-Phy systematically underaligns when operating on biological sequence data but shows no sign of this on simulated data. There are several potential causes for this change in performance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments, and future research is needed to determine the most likely explanation. We conclude with a discussion of the potential ramifications for each of these possibilities. [BAli-Phy; homology; multiple sequence alignment; protein sequences; structural alignment.].


Subject(s)
Classification/methods, Protein Databases, Statistical Models, Sequence Alignment/standards, Computer Simulation, Datasets as Topic
9.
PLoS Comput Biol ; 14(11): e1006547, 2018 11.
Article in English | MEDLINE | ID: mdl-30383764

ABSTRACT

Protein or DNA motifs are sequence regions which possess biological importance. These regions are often highly conserved among homologous sequences. The generation of multiple sequence alignments (MSAs) with a correct alignment of the conserved sequence motifs is still difficult to achieve, due to the fact that the contribution of these typically short fragments is overshadowed by the rest of the sequence. Here we extended the PRALINE multiple sequence alignment program with a novel motif-aware MSA algorithm in order to address this shortcoming. This method can incorporate explicit information about the presence of externally provided sequence motifs, which is then used in the dynamic programming step by boosting the amino acid substitution matrix towards the motif. The strength of the boost is controlled by a parameter, α. Using a benchmark set of alignments we confirm that a good compromise can be found that improves the matching of motif regions while not significantly reducing the overall alignment quality. By estimating α on an unrelated set of reference alignments we find there is indeed a strong conservation signal for motifs. A number of typical but difficult MSA use cases are explored to exemplify the problems in correctly aligning functional sequence motifs and how the motif-aware alignment method can be employed to alleviate these problems.


Subject(s)
Amino Acid Motifs, DNA/chemistry, Proteins/chemistry, Sequence Alignment/standards, Algorithms, Amino Acid Sequence, Conserved Sequence, HIV-1/chemistry, Amino Acid Sequence Homology, Human Immunodeficiency Virus env Gene Products/chemistry
10.
J Comput Biol ; 25(8): 841-849, 2018 08.
Article in English | MEDLINE | ID: mdl-30084692

ABSTRACT

The comparison and assessment of similarity across metagenomes remain an open problem. Uncultivated samples suffer from high variability, thus making it difficult for heuristic sequence comparison methods to find precise matches in reference databases. Finer methods are required to provide higher accuracy and certainty, although these come at the expense of larger computation times. Therefore, in this work, we present our software for the highly parallel, fine-grained pairwise alignment of metagenomes. First, an analysis of the computational limitations of performing coarse-grained global alignments in a parallel manner is described, and a solution is discussed and employed by our proposal. Second, we show that our development is competitive with state-of-the-art software in terms of speed and consumption of resources, while achieving more accurate results. In addition, the parallel scheme adopted is tested, showing a performance of up to 98% efficiency while using up to 64 cores. Sequential optimizations are also tested and show a speedup of 9× over our previous proposal.


Subject(s)
Computational Biology/methods, Metagenome, Metagenomics/methods, Metagenomics/standards, Sequence Alignment/standards, Software, Algorithms, Humans
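
The efficiency figure quoted in this abstract follows the usual definition: speedup (serial time divided by parallel time) divided by the number of cores. The sketch below shows that arithmetic. The timings are hypothetical values chosen for illustration and are not taken from the paper.

```python
def parallel_efficiency(serial_seconds, parallel_seconds, cores):
    """Return (speedup, efficiency) for one timed run."""
    speedup = serial_seconds / parallel_seconds
    return speedup, speedup / cores

# Hypothetical timings, for illustration only.
speedup, efficiency = parallel_efficiency(640.0, 10.2, 64)
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.1%}")
```
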
11.
J Comput Biol ; 25(10): 1106-1119, 2018 10.
Article in English | MEDLINE | ID: mdl-29993269

ABSTRACT

The Smith-Waterman (SW) algorithm explores all the possible alignments between two or more sequences and as a result it returns the optimal local alignment. However, the computational cost of this algorithm is very high, and the exponential growth of computation makes SW unrealistic for searching similarities in large sets of sequences. Fortunately, the dynamic programming kernel of the SW algorithm involves mathematical operations over affine control loops whose iteration space can be represented by the polyhedral model. This allows us to apply polyhedral compilation techniques to optimize the studied SW dense array code. In this article, we present an approach to generate efficient SW implementations for two and three sequences by using the transitive closure of a dependence graph and loop skewing. Generated programs are represented with parallel tiled loop nests, which exhibit significantly higher performance than that of programs obtained with closely related compilers. The approach is able to tile all loops of original loop nests as opposed to well-known affine transformation techniques. Furthermore, it allows for code optimization of three-sequence alignment. Such code cannot be generated by means of state-of-the-art automatic optimizing compilers. We demonstrate that an under-approximation of transitive closure (instead of exact transitive closure) can be used to generate valid parallel tiled code. This considerably reduces the computational complexity of the approach. Generated codes were run on cores of a modern Intel multiprocessor and they exhibit high speedup and good scalability on this platform.


Subject(s)
Algorithms, Computational Biology/methods, Sequence Alignment/methods, Sequence Alignment/standards, Humans, Software
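
For orientation, the dynamic programming kernel that this abstract optimizes can be written as a plain, untiled loop nest for two sequences. This is a textbook Smith-Waterman sketch with a linear gap penalty, not the tiled or parallel code the authors generate. The scoring values are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the score matrix; each cell depends on its upper, left and diagonal neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("TTACGTAA", "GGACGTCC"))
```

The two nested loops over `i` and `j` are the affine loop nest the abstract refers to. Because each cell reads its upper, left, and diagonal neighbours, parallelizing it requires transformations such as the loop skewing and tiling the authors describe.
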
12.
Genes Genomics ; 40(2): 189-197, 2018.
Article in English | MEDLINE | ID: mdl-29568413

ABSTRACT

In addition to the rapid advancement in Next-Generation Sequencing (NGS) technology, clinical panel sequencing is being used increasingly in clinical studies and tests. However, tools that are used in NGS data analysis have not been comparatively evaluated in performance for panel sequencing. This study aimed to evaluate the tools used in the alignment process, the first procedure in bioinformatics analysis, by comparing tools that have been widely used with ones that have been introduced recently. With the accumulated panel sequencing data, detected variant lists were cataloged and inserted into simulated reads produced from the reference genome (hg19). The amount of unmapped reads and misaligned reads, mapping quality distribution, and runtime were measured as standards for comparison. As the most widely used tools, Bowtie2 and BWA-MEM each showed high performance, with AUCs of 0.9984 and 0.9970, respectively. Kart, with superior runtime and fewer misaligned reads, also achieved a similarly high AUC (0.9723). Such a method for selecting and optimizing tools appropriate for panel sequencing can be utilized in fields requiring error minimization, such as clinical application and liquid biopsy studies.


Subject(s)
Computer Simulation, Human Genome, High-Throughput Nucleotide Sequencing/methods, Sequence Alignment/methods, Software, Genomics/methods, Genomics/standards, High-Throughput Nucleotide Sequencing/standards, Humans, Sequence Alignment/standards, DNA Sequence Analysis/methods, DNA Sequence Analysis/standards
13.
Mitochondrial DNA A DNA Mapp Seq Anal ; 29(7): 1128-1138, 2018 10.
Article in English | MEDLINE | ID: mdl-29338473

ABSTRACT

Phylogenetics and population genetics are central disciplines in evolutionary biology. Both are based on the comparison of single DNA sequences, or a concatenation of a number of these. However, with the advent of next-generation DNA sequencing technologies, the approaches that consider large genomic data sets are of growing importance for the elucidation of evolutionary relationships among species. Among these approaches, the assembly and alignment-free methods which allow an efficient distance computation and phylogeny reconstruction are of great importance. However, it is not yet clear under what quality conditions and abundance of genomic data such methods are able to infer phylogenies accurately. In the present study we assess the method originally proposed by Fan et al. for whole genome data, in the elucidation of Tomatoes' chloroplast phylogenetics using short read sequences. We find that this assembly and alignment-free method is capable of reproducing previous results under conditions of high coverage, given that low frequency k-mers (i.e. error prone data) are effectively filtered out. Finally, we present a complete chloroplast phylogeny for the best data quality candidates of the recently published 360 tomato genomes.


Subject(s)
Taxonomic DNA Barcoding/methods, Chloroplast DNA/genetics, Phylogeny, Sequence Alignment/methods, Solanum lycopersicum/genetics, Taxonomic DNA Barcoding/standards, Solanum lycopersicum/classification, Sequence Alignment/standards
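
The filtering step this abstract highlights, discarding low-frequency k-mers because they are likely to come from sequencing errors, can be sketched directly on raw reads. The value of `k`, the `min_count` threshold, and the function names are assumptions for illustration. The method assessed in the paper may count and filter differently.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def filter_low_frequency(counts, min_count=2):
    """Keep only k-mers seen at least `min_count` times."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}

# Hypothetical toy reads; the third read carries a single-base difference.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
counts = count_kmers(reads, 4)
print(filter_low_frequency(counts))
```

With high coverage, true k-mers recur across many reads while error-derived k-mers stay rare, which is why the abstract ties the method's success to coverage.
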
14.
Genomics ; 110(5): 263-273, 2018 09.
Article in English | MEDLINE | ID: mdl-29180261

ABSTRACT

Several proteins and genes are members of families that share a common evolutionary history. To outline evolutionary relationships and recognize conserved patterns, sequence comparison has become an emerging process. The current work critically investigates the role of k-mers in the composition vector method for comparing genome sequences. Generally, composition vector methods using k-mers are applied with different choices of k to compare genome sequences. For some values of k the results are satisfactory, but for other values they are not. In the proposed work, the standard composition vector method is carried out using a 3-mer string length. In addition, a special type of information-based similarity index is used as a distance measure. The work establishes that the use of 3-mers and the information-based similarity index provides satisfactory results in all cases, especially for comparison of whole genome sequences. These selections provide a sort of unified approach towards comparison of genome sequences.


Subject(s)
Algorithms, Genomics/methods, Sequence Alignment/methods, Animals, Humans, Sequence Alignment/standards
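
A 3-mer composition vector, as discussed in this abstract, can be sketched as a 64-entry frequency vector over the DNA alphabet. The abstract does not define its information-based similarity index, so cosine distance is swapped in here purely as a stand-in. The function names are also assumptions.

```python
import math
from collections import Counter
from itertools import product

def composition_vector(seq, k=3, alphabet="ACGT"):
    """Relative frequency of every possible k-mer, in a fixed lexicographic order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]

def cosine_distance(u, v):
    """Stand-in distance (assumption): 1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy sequences, for illustration only.
a = composition_vector("ACGACGACGACG")
b = composition_vector("ACGACGTCGACG")
c = composition_vector("TTTTTTTTTTTT")
print(cosine_distance(a, b), cosine_distance(a, c))
```

A matrix of such pairwise distances is what a downstream tree-building or clustering step would consume.
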
15.
Arch Med Sadowej Kryminol ; 68(4): 242-258, 2018.
Article in English | MEDLINE | ID: mdl-31025842

ABSTRACT

Although mitochondrial DNA (mtDNA) testing has been used in forensic genetics only since the mid-1990s, forensic DNA laboratories have been recently increasing the range of mtDNA sequencing, employing new analytical approaches and methods of data analysis. Therefore, it seems fitting to gather and systematize existing recommendations in the field of mtDNA analysis for forensic purposes, and formulate a set of interpretative guidelines which are especially relevant in view of recent developments in the forensic casework. The starting point is the recommendations of the International Society for Forensic Genetics (ISFG) which, in the opinion of the Polish Speaking Working Group of the ISFG (ISFG-PL), should be followed by all Polish laboratories conducting forensic testing.


Subject(s)
DNA Fingerprinting/standards, Mitochondrial DNA/genetics, Forensic Genetics/standards, DNA Sequence Analysis/standards, Forensic Genetics/methods, Humans, Poland, Sequence Alignment/standards, Scientific Societies
16.
Gigascience ; 6(11): 1-6, 2017 11 01.
Article in English | MEDLINE | ID: mdl-29048539

ABSTRACT

The BAM and CRAM formats provide a supplementary linear index that facilitates rapid access to sequence alignments in arbitrary genomic regions. Comparing consecutive entries in a BAM or CRAM index allows one to infer the number of alignment records per genomic region for use as an effective proxy of sequence depth in each genomic region. Based on these properties, we have developed indexcov, an efficient estimator of whole-genome sequencing coverage to rapidly identify samples with aberrant coverage profiles, reveal large-scale chromosomal anomalies, recognize potential batch effects, and infer the sex of a sample. Indexcov is available at https://github.com/brentp/goleft under the MIT license.


Subject(s)
Sequence Alignment/standards, Software/standards, Whole Genome Sequencing/standards, Human Genome, Humans, Quality Control, Reproducibility of Results, Sequence Alignment/methods, Whole Genome Sequencing/methods
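
The core idea in this abstract, that the gap between consecutive index entries tracks how much alignment data falls in each genomic window, can be sketched without reading a real index. This is not the indexcov implementation. The input below is a plain list of hypothetical file offsets, one per fixed-size window, and scaling by the median is an assumed normalization.

```python
from statistics import median

def relative_depth(offsets):
    """Estimate per-window depth, relative to the median window.

    `offsets` is a hypothetical list of file offsets, one per consecutive
    genomic window; the difference between neighbours approximates the
    amount of alignment data stored for that window.
    """
    sizes = [b - a for a, b in zip(offsets, offsets[1:])]
    mid = median(sizes)
    return [s / mid if mid else 0.0 for s in sizes]

# Hypothetical offsets: the fourth window holds half as much data as the others.
offsets = [0, 100, 200, 300, 350, 450]
print(relative_depth(offsets))
```

Windows that sit well below or above 1.0 across a long stretch would be candidates for the large-scale anomalies the abstract mentions, and the ratio on the sex chromosomes is the kind of signal that could support sex inference.
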
17.
Sci Rep ; 7(1): 10963, 2017 09 08.
Article in English | MEDLINE | ID: mdl-28887485

ABSTRACT

Complementary to reference-based variant detection, recent studies have revealed that many novel variants can be detected with de novo assembled genomes. To evaluate the effect of read coverage and the accuracy of assembly-based variant calling, we simulated short reads containing more than 3 million single nucleotide variants (SNVs) from the whole human genome and compared the efficiency of SNV calling between the assembly-based and alignment-based calling approaches. We assessed the quality of the assembled contigs and found that a minimum of 30X coverage of short reads was needed to ensure reliable SNV calling and to generate assembled contigs with good coverage of the genome and genes. In addition, we observed that the assembly-based approach had much lower recall and precision than the alignment-based approach, which recovered 99% of imputed SNVs. We observed similar results with experimental reads for NA24385, an individual whose germline variants are well characterized. Although it adds value for SNV detection, the assembly-based approach carries a high risk of false discovery of novel SNVs. Further improvements of de novo assembly algorithms are needed to ensure good genome completeness with resolved haplotypes and high fidelity of the assembled sequences.


Subject(s)
Contig Mapping/methods, Genome-Wide Association Study/methods, Single Nucleotide Polymorphism, Sequence Alignment/methods, DNA Sequence Analysis/methods, Algorithms, Contig Mapping/standards, Genome-Wide Association Study/standards, Humans, Sequence Alignment/standards, DNA Sequence Analysis/standards
18.
Gigascience ; 6(7): 1-8, 2017 07 01.
Article in English | MEDLINE | ID: mdl-28531267

ABSTRACT

The 1000 Genomes Project produced more than 100 trillion basepairs of short read sequence from more than 2600 samples in 26 populations over a period of five years. In its final phase, the project released over 85 million genotyped and phased variants on human reference genome assembly GRCh37. An updated reference assembly, GRCh38, was released in late 2013, but there was insufficient time for the final phase of the project analysis to change to the new assembly. Although it is possible to lift the coordinates of the 1000 Genomes Project variants to the new assembly, this is a potentially error-prone process as coordinate remapping is most appropriate only for non-repetitive regions of the genome and those that did not see significant change between the two assemblies. It will also miss variants in any region that was newly added to GRCh38. Thus, to produce the highest quality variants and genotypes on GRCh38, the best strategy is to realign the reads and recall the variants based on the new alignment. As the first step of variant calling for the 1000 Genomes Project data, we have finished remapping all of the 1000 Genomes sequence reads to GRCh38 with alternative scaffold-aware BWA-MEM. The resulting alignments are available as CRAM, a reference-based sequence compression format. The data have been released on our FTP site and are also available from European Nucleotide Archive to facilitate researchers discovering variants on the primary sequences and alternative contigs of GRCh38.


Subject(s)
Contig Mapping/methods, Human Genome Project, Sequence Alignment/methods, Whole Genome Sequencing/methods, Algorithms, Contig Mapping/standards, Humans, Reference Standards, Sequence Alignment/standards, Whole Genome Sequencing/standards
19.
Genet Mol Res ; 16(2)2017 Apr 20.
Article in English | MEDLINE | ID: mdl-28437554

ABSTRACT

Molecular identification is very useful in cases where morphology-based species identification is not possible. Examples for its application in cetaceans include the identification of carcasses of stranded animals in advanced state of decomposition and body parts that are illegally traded. One DNA region that is often used for molecular identification is the Folmer region of the mitochondrial gene cytochrome c oxidase subunit I (COI) (locus 48 to 705 bp). This locus has been used for the identification of several animal species, including whales and dolphins. The goal of the present study was to evaluate the usefulness of another region of COI, the E3-I5 (locus 685 to locus 1179; 495 bp) as a marker for identification of cetaceans from northeastern Canada and northeastern Brazil. The identification markers were successfully obtained for seven cetacean species after performing percent identity and Basic Local Alignment Search Tool analyses. The obtained markers are now publicly available and are useful for the identification of the endangered blue whale (Balaenoptera musculus), common minke whale (B. acutorostrata), vulnerable sperm whale (Physeter macrocephalus), harbor porpoise (Phocoena phocoena), common bottlenose dolphin (Tursiops truncatus), Guiana dolphin (Sotalia guianensis), and melon-headed whale (Peponocephala electra).


Subject(s)
Cetacea/genetics, Taxonomic DNA Barcoding/standards, Electron Transport Complex IV/genetics, Sequence Alignment/standards, Animals, Cetacea/classification, Taxonomic DNA Barcoding/methods, Endangered Species, Genetic Markers, Reference Standards, Sequence Alignment/methods
20.
G3 (Bethesda) ; 7(5): 1405-1416, 2017 05 05.
Article in English | MEDLINE | ID: mdl-28235826

ABSTRACT

Comparing genomes of closely related genotypes from populations with distinct demographic histories can help reveal the impact of effective population size on genome evolution. For this purpose, we present a high quality genome assembly of Daphnia pulex (PA42), and compare this with the first sequenced genome of this species (TCO), which was derived from an isolate from a population with >90% reduction in nucleotide diversity. PA42 has numerous similarities to TCO at the gene level, with an average amino acid sequence identity of 98.8% and >60% of orthologous proteins identical. Nonetheless, there is a highly elevated number of genes in the TCO genome annotation, with ∼7000 excess genes appearing to be false positives. This view is supported by the high GC content, lack of introns, and short length of these suspicious gene annotations. Consistent with the view that reduced effective population size can facilitate the accumulation of slightly deleterious genomic features, we observe more proliferation of transposable elements (TEs) and a higher frequency of gained introns in the TCO genome.


Subject(s)
Daphnia/genetics, Whole Genome Sequencing/methods, Animals, DNA Transposable Elements, Introns, Molecular Sequence Annotation/methods, Molecular Sequence Annotation/standards, Reference Standards, Sensitivity and Specificity, Sequence Alignment/methods, Sequence Alignment/standards, Whole Genome Sequencing/standards