Results 1 - 20 of 85
1.
Mol Biol Evol ; 40(12)2023 Dec 01.
Article in English | MEDLINE | ID: mdl-38085182

ABSTRACT

DNA that controls gene expression (e.g. enhancers, promoters) has seemed almost never to be conserved between distantly related animals, like vertebrates and arthropods. This is mysterious, because development of such animals is partly organized by homologous genes with similar complex expression patterns, termed "deep homology." Here, we report 25 regulatory DNA segments conserved across bilaterian animals, of which 7 are also conserved in cnidaria (coral and sea anemone). They control developmental genes (e.g. Nr2f, Ptch, Rfx1/3, Sall, Smad6, Sp5, Tbx2/3), including six homeobox genes: Gsx, Hmx, Meis, Msx, Six1/2, and Zfhx3/4. The segments contain perfectly or near-perfectly conserved CCAAT boxes, E-boxes, and other sequences recognized by regulatory proteins. More such DNA conservation will surely be found soon, as more genomes are published and sequence comparison is optimized. This reveals a control system for animal development conserved since the Precambrian.


Subjects
Anthozoa , Homeobox Genes , Animals , DNA , Transcription Factors/genetics , Anthozoa/genetics , Embryonic Development/genetics , Conserved Sequence/genetics
2.
Bioinformatics ; 39(2)2023 02 03.
Article in English | MEDLINE | ID: mdl-36702468

ABSTRACT

MOTIVATION: We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal. RESULTS: We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible. AVAILABILITY AND IMPLEMENTATION: Source code is freely available at https://gitlab.com/mcfrith/noverlap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Software , DNA Sequence Analysis/methods
3.
Mol Biol Evol ; 39(4)2022 04 11.
Article in English | MEDLINE | ID: mdl-35348724

ABSTRACT

Genomes hold a treasure trove of protein fossils: Fragments of formerly protein-coding DNA, which mainly come from transposable elements (TEs) or host genes. These fossils reveal ancient evolution of TEs and genomes, and many fossils have been exapted to perform diverse functions important for the host's fitness. However, old and highly degraded fossils are hard to identify, standard methods (e.g. BLAST) are not optimized for this task, and few Paleozoic protein fossils have been found. Here, a recently optimized method is used to find protein fossils in vertebrate genomes. It finds Paleozoic fossils predating the amphibian/amniote divergence from most major TE categories, including virus-related Polinton and Gypsy elements. It finds 10 fossils in the human genome (eight from TEs and two from host genes) that predate the last common ancestor of all jawed vertebrates, probably from the Ordovician period. It also finds types of transposon and retrotransposon not found in human before. These fossils have extreme sequence conservation, indicating exaptation: some have evidence of gene-regulatory function, and they tend to lie nearest to developmental genes. Some ancient fossils suggest "genome tectonics," where two fragments of one TE have drifted apart by up to megabases, possibly explaining gene deserts and large introns. This paints a picture of great TE diversity in our aquatic ancestors, with patchy TE inheritance by later vertebrates, producing new genes and regulatory elements on the way. Host-gene fossils too have contributed anciently conserved DNA segments. This paves the way to further studies of ancient protein fossils.


Subjects
DNA Transposable Elements , Fossils , Animals , DNA Transposable Elements/genetics , Molecular Evolution , Humans , Nucleic Acid Regulatory Sequences , Retroelements , Vertebrates/genetics
4.
Hum Mol Genet ; 30(7): 552-563, 2021 05 12.
Article in English | MEDLINE | ID: mdl-33693705

ABSTRACT

Facioscapulohumeral muscular dystrophy (FSHD) is an inherited muscle disease caused by misexpression of the DUX4 gene in skeletal muscle. DUX4 is a transcription factor, which is normally expressed in the cleavage-stage embryo and regulates gene expression involved in early embryonic development. Recent studies revealed that DUX4 also activates the transcription of repetitive elements such as endogenous retroviruses (ERVs), mammalian apparent long terminal repeat (LTR)-retrotransposons and pericentromeric satellite repeats (Human Satellite II). DUX4-bound ERV sequences also create alternative promoters for genes or long non-coding RNAs, producing fusion transcripts. To further understand transcriptional regulation by DUX4, we performed nanopore long-read direct RNA sequencing (dRNA-seq) of human muscle cells induced by DUX4, because long reads show whole isoforms with greater confidence. We successfully detected differential expression of known DUX4-induced genes and discovered 61 differentially expressed repeat loci, which are near DUX4-ChIP peaks. We also identified 247 gene-ERV fusion transcripts, of which 216 were not reported previously. In addition, long-read dRNA-seq clearly shows that RNA splicing is a common event in DUX4-activated ERV transcripts. Long-read analysis showed non-LTR transposons including Alu elements are also transcribed from LTRs. Our findings revealed further complexity of DUX4-induced ERV transcripts. This catalogue of DUX4-activated repetitive elements may provide useful information to elucidate the pathology of FSHD. Also, our results indicate that nanopore dRNA-seq has complementary strengths to conventional short-read complementary DNA sequencing.


Subjects
Homeodomain Proteins/genetics , Skeletal Muscle/metabolism , Facioscapulohumeral Muscular Dystrophy/genetics , Nanopores , Nucleic Acid Repetitive Sequences/genetics , RNA Sequence Analysis/methods , Tumor Cell Line , Gene Expression Profiling , Gene Expression Regulation , Humans , Muscle Cells/metabolism , Facioscapulohumeral Muscular Dystrophy/pathology , Protein Isoforms/genetics , RNA Isoforms/genetics , Reverse Transcriptase Polymerase Chain Reaction , RNA Sequence Analysis/statistics & numerical data
5.
Nucleic Acids Res ; 49(6): 3139-3155, 2021 04 06.
Article in English | MEDLINE | ID: mdl-33693858

ABSTRACT

Minimal absent words (MAWs) are minimal-length oligomers absent from a genome or proteome. Although some artificially synthesized MAWs have deleterious effects, there is still a lack of a strategy for the classification of non-occurring sequences as potentially malicious or benign. In this work, by using Markovian models with multiple-testing correction, we reveal significant absent oligomers, which are statistically expected to exist. This suggests that their absence is due to negative selection. We survey genomes and proteomes covering the diversity of life and find thousands of significant absent sequences. Common significant MAWs are often mono- or dinucleotide tracts, or palindromic. Significant viral MAWs are often restriction sites and may indicate unknown restriction motifs. Surprisingly, significant mammal genome MAWs are often present, but rare, in other mammals, suggesting that they are suppressed but not completely forbidden. Significant human MAWs are frequently present in prokaryotes, suggesting immune function, but rarely present in human viruses, indicating viral mimicry of the host. More than one-fourth of human proteins are one substitution away from containing a significant MAW, with the majority of replacements being predicted harmful. We provide a web-based, interactive database of significant MAWs across genomes and proteomes.


Subjects
Genetic Databases , Genomics/methods , Proteomics/methods , Animals , Genome , Humans , Markov Chains , Mutation , Peptides/chemistry , Proteome , Software , Viruses/genetics
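
The definition of a minimal absent word in the abstract above can be made concrete with a brute-force sketch. It assumes a single strand, a small alphabet, and short word lengths; the function name `minimal_absent_words` is hypothetical, and the significance test described in the abstract (Markovian models with multiple-testing correction) is not reproduced here.

```python
from itertools import product


def minimal_absent_words(seq, length, alphabet="acgt"):
    """Return words of the given length that are absent from seq
    although their longest proper prefix and suffix both occur in it.

    Every proper substring of a word lies inside its longest prefix or
    its longest suffix, so checking those two is enough for minimality.
    """
    found = []
    for letters in product(alphabet, repeat=length):
        word = "".join(letters)
        if word in seq:
            continue  # the word occurs, so it is not absent
        if word[:-1] in seq and word[1:] in seq:
            found.append(word)
    return found


if __name__ == "__main__":
    # Toy sequence, for illustration only.
    print(minimal_absent_words("acgt", 2))
```

This enumeration grows as the alphabet size raised to the word length, so it is only practical for very short words.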
6.
Bioinformatics ; 36(22-23): 5344-5350, 2021 Apr 01.
Article in English | MEDLINE | ID: mdl-33346833

ABSTRACT

MOTIVATION: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via 'seeds': simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence. RESULTS: Here, we study a simple sparse-seeding method: using seeds at positions of certain 'words' (e.g. ac, at, gc or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed 'minimizer' sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it. AVAILABILITY AND IMPLEMENTATION: Software to design and test minimally overlapping words is freely available at https://gitlab.com/mcfrith/noverlap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
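
The word-based sparse seeding described in this abstract can be illustrated with a small sketch. It uses the example words from the abstract (ac, at, gc, gt); the function names `word_positions` and `words_can_overlap` are hypothetical and are not taken from the noverlap software.

```python
def word_positions(seq, words):
    """Return start positions in seq where any of the given words begins."""
    return [i for i in range(len(seq))
            if any(seq.startswith(w, i) for w in words)]


def words_can_overlap(words):
    """True if a proper suffix of one word equals a proper prefix of a word
    in the set (possibly the same word), so two occurrences could overlap.

    Assumption: words nested inside longer words are not considered.
    """
    for a in words:
        for b in words:
            for k in range(1, min(len(a), len(b))):
                if a[-k:] == b[:k]:
                    return True
    return False


if __name__ == "__main__":
    words = {"ac", "at", "gc", "gt"}
    print(word_positions("acgtat", words))  # sampled seed positions
    print(words_can_overlap(words))
```

The example words all start with a or g and end with c or t, so no occurrence can overlap another, which is the anti-clumping property the abstract refers to.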

7.
Bioinformatics ; 36(2): 408-415, 2020 01 15.
Article in English | MEDLINE | ID: mdl-31329241

ABSTRACT

MOTIVATION: Sequence alignment remains fundamental in bioinformatics. Pair-wise alignment is traditionally based on ad hoc scores for substitutions, insertions and deletions, but can also be based on probability models (pair hidden Markov models: PHMMs). PHMMs enable us to: fit the parameters to each kind of data, calculate the reliability of alignment parts and measure sequence similarity integrated over possible alignments. RESULTS: This study shows how multiple models correspond to one set of scores. Scores can be converted to probabilities by partition functions with a 'temperature' parameter: for any temperature, this corresponds to some PHMM. There is a special class of models with balanced length probability, i.e. no bias toward either longer or shorter alignments. The best way to score alignments and assess their significance depends on the aim: judging whether whole sequences are related versus finding related parts. This clarifies the statistical basis of sequence alignment. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Statistical Models , Markov Chains , Probability , Reproducibility of Results , Sequence Alignment
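
The idea of converting scores to probabilities with a 'temperature' parameter, mentioned in the abstract above, can be sketched in a toy setting. The sketch assumes an explicit, short list of candidate alignment scores; a real partition function sums over all possible alignments by dynamic programming, which is not shown. The function name `alignment_probabilities` is hypothetical.

```python
import math


def alignment_probabilities(scores, temperature):
    """Boltzmann-weight candidate alignment scores: p_i = exp(s_i / T) / Z."""
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    z = sum(weights)  # partition function, up to the factored-out constant
    return [w / z for w in weights]


if __name__ == "__main__":
    scores = [10, 8, 8]  # made-up scores for three alternative alignments
    print(alignment_probabilities(scores, 1.0))
    print(alignment_probabilities(scores, 100.0))
```

A low temperature concentrates probability on the top-scoring alignment, while a high temperature spreads it almost evenly over the alternatives.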
8.
J Hum Genet ; 66(7): 697-705, 2021 Jul.
Article in English | MEDLINE | ID: mdl-33510365

ABSTRACT

Whole-exome sequencing (WES) can detect not only single-nucleotide variants in causal genes, but also pathogenic copy-number variations using several methods. However, there may be overlooked pathogenic variations in the out of target genome regions of WES analysis (e.g., promoters), leaving many patients undiagnosed. Whole-genome sequencing (WGS) can potentially analyze such regions. We applied long-read nanopore WGS and our recently developed analysis pipeline "dnarrange" to a patient who was undiagnosed by trio-based WES analysis, and identified a heterozygous 97-kb deletion partially involving 5'-untranslated exons of MBD5, which was outside the WES target regions. The phenotype of the patient, a 32-year-old male, was consistent with haploinsufficiency of MBD5. The transcript level of MBD5 in the patient's lymphoblastoid cells was reduced. We therefore concluded that the partial MBD5 deletion is the culprit for this patient. Furthermore, we found other rare structural variations (SVs) in this patient, i.e., a large inversion and a retrotransposon insertion, which were not seen in 33 controls. Although we considered that they are benign SVs, this finding suggests that our pipeline using long-read WGS is useful for investigating various types of potentially pathogenic SVs. In conclusion, we identified a 97-kb deletion, which causes haploinsufficiency of MBD5 in a patient with neurodevelopmental disorder, demonstrating that long-read WGS is a powerful technique to discover pathogenic SVs.


Subjects
DNA-Binding Proteins/genetics , Genetic Predisposition to Disease , Neurodevelopmental Disorders/genetics , Adult , Exome/genetics , Haploinsufficiency/genetics , Humans , Male , Insertional Mutagenesis/genetics , Neurodevelopmental Disorders/pathology , Retroelements/genetics , Whole Genome Sequencing
9.
J Hum Genet ; 65(8): 667-674, 2020 Aug.
Article in English | MEDLINE | ID: mdl-32296131

ABSTRACT

Chromothripsis is a type of chaotic complex genomic rearrangement caused by a single event of chromosomal shattering and repair processes. Chromothripsis is known to cause rare congenital diseases when it occurs in germline cells; however, current genome analysis technologies have difficulty in detecting and deciphering chromothripsis. It is possible that this type of complex rearrangement may be overlooked in rare-disease patients whose genetic diagnosis is unsolved. We applied long-read nanopore sequencing and our recently developed analysis pipeline dnarrange to a patient who has a reciprocal chromosomal translocation t(8;18)(q22;q21) as a result of chromothripsis between the two chromosomes, and fully characterized the complex rearrangements at the translocation site. The patient genome was evidently shattered into 19 fragments, and rejoined into derivative chromosomes in a random order and orientation. The reconstructed patient genome indicates loss of five genomic regions, which all overlap with microarray-detected copy number losses. We found that two disease-related genes RAD21 and EXT1 were lost by chromothripsis. These two genes could fully explain the disease phenotype with facial dysmorphisms and bone abnormality, which is likely a contiguous gene syndrome, Cornelia de Lange syndrome type IV (CdLs-4) and atypical Langer-Giedion syndrome (LGS), also known as trichorhinophalangeal syndrome type II (TRPSII). This provides evidence that our approach based on long-read sequencing can fully characterize chromothripsis in a patient's genome, which is important for understanding the phenotype of disease caused by complex genomic rearrangement.


Subjects
Cell Cycle Proteins/genetics , Chromothripsis , DNA-Binding Proteins/genetics , Cornelia de Lange Syndrome/genetics , Langer-Giedion Syndrome/genetics , N-Acetylglucosaminyltransferases/genetics , Child , Chromosome Deletion , Cornelia de Lange Syndrome/diagnosis , Cornelia de Lange Syndrome/physiopathology , Genome , Humans , Langer-Giedion Syndrome/diagnosis , Langer-Giedion Syndrome/physiopathology , Male , Nanopore Sequencing , Phenotype , DNA Sequence Analysis , Genetic Translocation
10.
J Hum Genet ; 65(5): 475-480, 2020 May.
Article in English | MEDLINE | ID: mdl-32066831

ABSTRACT

Recently, a recessively inherited intronic repeat expansion in replication factor C1 (RFC1) was identified in cerebellar ataxia with neuropathy and bilateral vestibular areflexia syndrome (CANVAS). Here, we describe a Japanese case of genetically confirmed CANVAS with autonomic failure and auditory hallucination. The case showed impaired uptake of iodine-123-metaiodobenzylguanidine and 123I-ioflupane in the cardiac sympathetic nerve and dopaminergic neurons, respectively, by single-photon emission computed tomography. Long-read sequencing identified biallelic pathogenic (AAGGG)n nucleotide repeat expansion in RFC1 and heterozygous benign (TAAAA)n and (TAGAA)n expansions in brain expressed, associated with NEDD4 (BEAN1). Enrichment of the repeat regions in RFC1 and BEAN1 using a Cas9-mediated system clearly distinguished between pathogenic and benign repeat expansions. The haplotype around RFC1 indicated that the (AAGGG)n expansion in our case was on the same ancestral allele as that of European cases. Thus, long-read sequencing facilitates precise genetic diagnosis of diseases with complex repeat structures and various expansions.


Subjects
Bilateral Vestibulopathy/genetics , Cerebellar Ataxia/genetics , DNA Repeat Expansion , Replication Protein C/genetics , DNA Sequence Analysis , Aged 80 and over , Asian People , Bilateral Vestibulopathy/diagnosis , Cerebellar Ataxia/diagnosis , Female , Humans , Japan , Nedd4 Ubiquitin Protein Ligases/genetics
11.
Nucleic Acids Res ; 46(4): 1661-1673, 2018 02 28.
Article in English | MEDLINE | ID: mdl-29272440

ABSTRACT

Genomes mutate and evolve in ways simple (substitution or deletion of bases) and complex (e.g. chromosome shattering). We do not fully understand what types of complex mutation occur, and we cannot routinely characterize arbitrarily-complex mutations in a high-throughput, genome-wide manner. Long-read DNA sequencing methods (e.g. PacBio, nanopore) are promising for this task, because one read may encompass a whole complex mutation. We describe an analysis pipeline to characterize arbitrarily-complex 'local' mutations, i.e. intrachromosomal mutations encompassed by one DNA read. We apply it to nanopore and PacBio reads from one human cell line (NA12878), and survey sequence rearrangements, both real and artifactual. Almost all the real rearrangements belong to recurring patterns or motifs: the most common is tandem multiplication (e.g. heptuplication), but there are also complex patterns such as localized shattering, which resembles DNA damage by radiation. Gene conversions are identified, including one between hemoglobin gamma genes. This study demonstrates a way to find intricate rearrangements with any number of duplications, deletions, and repositionings. It demonstrates a probability-based method to resolve ambiguous rearrangements involving highly similar sequences, as occurs in gene conversion. We present a catalog of local rearrangements in one human cell line, and show which rearrangement patterns occur.


Subjects
DNA/chemistry , Mutation , Cell Line , Gene Conversion , Humans , Inverted Repeat Sequences , Sequence Alignment , DNA Sequence Analysis , Sequence Deletion , Sequence Inversion
12.
Nucleic Acids Res ; 46(3): e18, 2018 02 16.
Article in English | MEDLINE | ID: mdl-29182778

ABSTRACT

Performing sequence alignment to identify structural variants, such as large deletions, from genome sequencing data is a fundamental task, but current methods are far from perfect. The current practice is to independently align each DNA read to a reference genome. We show that the propensity of genomic rearrangements to accumulate in repeat-rich regions imposes severe ambiguities in these alignments, and consequently on the variant calls: with current read lengths, this affects more than one third of known large deletions in the C. Venter genome. We present a method to jointly align reads to a genome, whereby alignment ambiguity of one read can be disambiguated by other reads. We show this leads to a significant improvement in the accuracy of identifying large deletions (≥20 bases), while imposing minimal computational overhead and maintaining an overall running time that is at par with current tools. A software implementation is available as an open-source Python program called JRA at https://bitbucket.org/jointreadalignment/jra-src.


Subjects
Algorithms , Base Sequence , DNA/genetics , Human Genome , Sequence Deletion , Cell Line , Datasets as Topic , High-Throughput Nucleotide Sequencing , Humans , Internet , Male , Middle Aged , Ploidies , Primary Cell Culture , Sequence Alignment , DNA Sequence Analysis , Software
13.
Ann Neurol ; 84(6): 843-853, 2018 12.
Article in English | MEDLINE | ID: mdl-30412317

ABSTRACT

OBJECTIVE: Approximately 5% of cerebral small vessel diseases are hereditary, which include COL4A1/COL4A2-related disorders. COL4A1/COL4A2 encode type IV collagen α1/2 chains in the basement membranes of cerebral vessels. COL4A1/COL4A2 mutations impair the secretion of collagen to the extracellular matrix, thereby resulting in vessel fragility. The diagnostic yield for COL4A1/COL4A2 variants is around 20 to 30%, suggesting other mutated genes might be associated with this disease. This study aimed to identify novel genes that cause COL4A1/COL4A2-related disorders. METHODS: Whole exome sequencing was performed in 2 families with suspected COL4A1/COL4A2-related disorders. We validated the role of COLGALT1 variants by constructing a 3-dimensional structural model, evaluating collagen β(1-O)galactosyltransferase 1 (ColGalT1) protein expression and ColGalT activity by Western blotting and collagen galactosyltransferase assays, and performing in vitro RNA interference and rescue experiments. RESULTS: Exome sequencing demonstrated biallelic variants in COLGALT1 encoding ColGalT1, which was involved in the post-translational modification of type IV collagen in 2 unrelated patients: c.452T > G (p.Leu151Arg) and c.1096delG (p.Glu366Argfs*15) in Patient 1, and c.460G > C (p.Ala154Pro) and c.1129G > C (p.Gly377Arg) in Patient 2. Three-dimensional model analysis suggested that p.Leu151Arg and p.Ala154Pro destabilized protein folding, which impaired enzymatic activity. ColGalT1 protein expression and ColGalT activity in Patient 1 were undetectable. RNA interference studies demonstrated that reduced ColGalT1 altered COL4A1 secretion, and rescue experiments showed that mutant COLGALT1 insufficiently restored COL4A1 production in cells compared with wild type. INTERPRETATION: Biallelic COLGALT1 variants cause cerebral small vessel abnormalities through a common molecular pathogenesis with COL4A1/COL4A2-related disorders. Ann Neurol 2018;84:843-853.


Subjects
Cerebral Small Vessel Diseases/genetics , Collagen Type IV/genetics , Genetic Predisposition to Disease/genetics , Mutation/genetics , Transformed Cell Line , Cerebral Small Vessel Diseases/diagnostic imaging , Child , DNA Mutational Analysis , Glucosyltransferases/metabolism , Humans , Magnetic Resonance Imaging , Male , Molecular Models , Mutagenesis , Messenger RNA/metabolism , Transfection
14.
Bioinformatics ; 33(6): 926-928, 2017 03 15.
Article in English | MEDLINE | ID: mdl-28039163

ABSTRACT

Summary: LAST-TRAIN improves sequence alignment accuracy by inferring substitution and gap scores that fit the frequencies of substitutions, insertions, and deletions in a given dataset. We have applied it to mapping DNA reads from IonTorrent and PacBio RS, and we show that it reduces reference bias for Oxford Nanopore reads. Availability and Implementation: the source code is freely available at http://last.cbrc.jp/. Contact: mhamada@waseda.jp or mcfrith@edu.k.u-tokyo.ac.jp. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Human Genome , Genetic Polymorphism , DNA Sequence Analysis/methods , Software , Humans
15.
Bioinformatics ; 32(2): 304-5, 2016 Jan 15.
Article in English | MEDLINE | ID: mdl-26428291

ABSTRACT

MOTIVATION: Pairwise local alignment is an indispensable tool for molecular biologists. In real time (i.e. in about 1 s), ALP (Ascending Ladder Program) calculates the E-values for protein-protein or DNA-DNA local alignments of random sequences, for arbitrary substitution score matrix, gap costs and letter abundances; and FALP (Frameshift Ascending Ladder Program) performs a similar task, although more slowly, for frameshifting DNA-protein alignments. AVAILABILITY AND IMPLEMENTATION: To permit other C++ programmers to implement the computational efficiencies in ALP and FALP directly within their own programs, C++ source codes are available in the public domain at http://go.usa.gov/3GTSW under 'ALP' and 'FALP', along with the standalone programs ALP and FALP. CONTACT: spouge@nih.gov SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology/methods , DNA/chemistry , Proteins/chemistry , DNA Sequence Analysis/methods , Protein Sequence Analysis/methods , Software , DNA/metabolism , Factual Databases , Humans , Proteins/metabolism , Sequence Alignment
16.
J Struct Funct Genomics ; 17(4): 147-154, 2016 Dec.
Article in English | MEDLINE | ID: mdl-28083762

ABSTRACT

Protein database search for public databases is a fundamental step in the target selection of proteins in structural and functional genomics and also for inferring protein structure, function, and evolution. Most database search methods employ amino acid substitution matrices to score amino acid pairs. The choice of substitution matrix strongly affects homology detection performance. We earlier proposed a substitution matrix named MIQS that was optimized for distant protein homology search. Herein we further evaluate MIQS in combination with LAST, a heuristic and fast database search tool with a tunable sensitivity parameter m, where larger m denotes higher sensitivity. Results show that MIQS substantially improves the homology detection and alignment quality performance of LAST across diverse m parameters. Against a protein database consisting of approximately 15 million sequences, LAST with m = 105 achieves better homology detection performance than BLASTP, and completes the search 20 times faster. Compared to the most sensitive existing methods being used today, CS-BLAST and SSEARCH, LAST with MIQS and m = 106 shows comparable homology detection performance at 2.0 and 3.9 times greater speed, respectively. Results demonstrate that MIQS-powered LAST is a time-efficient method for sensitive and accurate homology search.


Subjects
Computer Heuristics , Protein Databases , Protein Sequence Analysis , Algorithms , Computational Biology , Molecular Models , Proteins/chemistry , Sequence Alignment
17.
Brief Bioinform ; 15(2): 138-54, 2014 Mar.
Article in English | MEDLINE | ID: mdl-24413184

ABSTRACT

The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support 'spaced seeds' and 'subset seeds' used in many biological applications.


Subjects
Algorithms , Computational Biology/methods , Nucleic Acid Databases/statistics & numerical data , Humans , Automated Pattern Recognition/statistics & numerical data , Software
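
To show what a suffix array is and how it supports text search, here is a deliberately naive sketch. It builds the array by sorting suffixes directly, which is far slower than the SA-IS algorithm the abstract above refers to, and it does not implement DisLex; the function names are hypothetical.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction by direct sorting, not SA-IS.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return sorted start positions of pattern in text via binary search."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-letter prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-letter prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    print(sa)
    print(find_occurrences(text, sa, "ana"))
```

All suffixes that begin with the pattern sit next to each other in the sorted order, which is why two binary searches are enough to find every occurrence.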
18.
Nucleic Acids Res ; 42(8): 4823-32, 2014 Apr.
Article in English | MEDLINE | ID: mdl-24682821

ABSTRACT

Proximal promoters are fundamental genomic elements for gene expression. They vary in terms of GC percentage, CpG abundance, presence of TATA signal, evolutionary conservation, chromosomal spread of transcription start sites and breadth of expression across cell types. These properties are correlated, and it has been suggested that there are two classes of promoters: one class with high CpG, widely spread transcription start sites and broad expression, and another with TATA signals, narrow spread and restricted expression. However, it has been unclear why these properties are correlated in this way. We reexamined these features using the deep FANTOM5 CAGE data from hundreds of cell types. First, we point out subtle but important biases in previous definitions of promoters and of expression breadth. Second, we show that most promoters are rather nonspecifically expressed across many cell types. Third, promoters' expression breadth is independent of maximum expression level, and therefore correlates with average expression level. Fourth, the data show a more complex picture than two classes, with a network of direct and indirect correlations among promoter properties. By tentatively distinguishing the direct from the indirect correlations, we reveal simple explanations for them.


Subjects
Genetic Promoter Regions , Animals , CpG Islands , Statistical Data Interpretation , Humans , Mice , TATA Box , Transcription Initiation Site , Genetic Transcription
19.
Nucleic Acids Res ; 42(7): e59, 2014 Apr.
Article in English | MEDLINE | ID: mdl-24493737

ABSTRACT

Sequence similarity search is a fundamental way of analyzing nucleotide sequences. Despite decades of research, this is not a solved problem because there exist many similarities that are not found by current methods. Search methods are typically based on a seed-and-extend approach, which has many variants (e.g. spaced seeds, transition seeds), and it remains unclear how to optimize this approach. This study designs and tests seeding methods for inter-mammal and inter-insect genome comparison. By considering substitution patterns of real genomes, we design sets of multiple complementary transition seeds, which have better performance (sensitivity per run time) than previous seeding strategies. Often the best seed patterns have more transition positions than those used previously. We also point out that recent computer memory sizes (e.g. 60 GB) make it feasible to use multiple (e.g. eight) seeds for whole mammal genomes. Interestingly, the most sensitive settings achieve diminishing returns for human-dog and melanogaster-pseudoobscura comparisons, but not for human-mouse, which suggests that we still miss many human-mouse alignments. Our optimized heuristics find ∼20,000 new human-mouse alignments that are missing from the standard UCSC alignments. We tabulate seed patterns and parameters that work well so they can be used in future research.


Subjects
Human Genome , Genomics/methods , Sequence Alignment/methods , DNA Sequence Analysis/methods , Animals , Dogs , Genome , Humans , Mice
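
The spaced seeds and transition seeds mentioned in the abstract above can be illustrated with a small matcher. The pattern notation is an assumption made for this sketch: '1' means the two bases must match, '0' means any pair is allowed, and 'T' means a match or a transition (a<->g or c<->t). The function name `seed_matches` is hypothetical, and the patterns shown are not the optimized seeds tabulated in the paper.

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("a", "g"), ("g", "a"), ("c", "t"), ("t", "c")}


def seed_matches(x, y, pattern):
    """Check whether equal-length strings x and y match under a seed pattern.

    '1' = must match, '0' = any pair allowed, 'T' = match or transition.
    """
    if not (len(x) == len(y) == len(pattern)):
        raise ValueError("x, y and pattern must have equal length")
    for a, b, p in zip(x, y, pattern):
        if p == "0" or a == b:
            continue
        if p == "T" and (a, b) in TRANSITIONS:
            continue
        return False
    return True


if __name__ == "__main__":
    print(seed_matches("aag", "agg", "111"))  # exact-match seed
    print(seed_matches("aag", "agg", "1T1"))  # transition tolerated
```

Allowing transitions at some positions lets a seed tolerate the most common kind of substitution without accepting every mismatch there.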
20.
Bioinformatics ; 30(24): 3575-82, 2014 Dec 15.
Article in English | MEDLINE | ID: mdl-25172925

ABSTRACT

MOTIVATION: The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score. RESULTS: We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two 'post-genomic' applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results.


Subjects
Frameshift Mutation , Sequence Alignment/methods , Algorithms , Statistical Data Interpretation , Human Genome , Genomics , Humans , Metagenomics , Pseudogenes , DNA Sequence Analysis , Protein Sequence Analysis , RNA Sequence Analysis , Software