Results 1 - 20 of 93
1.
Cell ; 185(16): 3025-3040.e6, 2022 08 04.
Article in English | MEDLINE | ID: mdl-35882231

ABSTRACT

Non-allelic recombination between homologous repetitive elements contributes to evolution and human genetic disorders. Here, we combine short- and long-DNA read sequencing of repeat elements with a new bioinformatics pipeline to show that somatic recombination of Alu and L1 elements is widespread in the human genome. Our analysis uncovers tissue-specific non-allelic homologous recombination hallmarks; moreover, we find that centromeres and cancer-associated genes are enriched for retroelements that may act as recombination hotspots. We compare recombination profiles in human-induced pluripotent stem cells and differentiated neurons and find that the neuron-specific recombination of repeat elements accompanies chromatin changes during cell-fate determination. Finally, we report that somatic recombination profiles are altered in Parkinson's and Alzheimer's disease, suggesting a link between retroelement recombination and genomic instability in neurodegeneration. This work highlights a significant contribution of the somatic recombination of repeat elements to genomic diversity in health and disease.


Subjects
Human Genome , Retroelements , Alu Elements/genetics , Homologous Recombination , Humans , Long Interspersed Nucleotide Elements , Nucleic Acid Repetitive Sequences
2.
Mol Biol Evol ; 40(12)2023 Dec 01.
Article in English | MEDLINE | ID: mdl-38085182

ABSTRACT

DNA that controls gene expression (e.g. enhancers, promoters) has seemed almost never to be conserved between distantly related animals, like vertebrates and arthropods. This is mysterious, because development of such animals is partly organized by homologous genes with similar complex expression patterns, termed "deep homology." Here, we report 25 regulatory DNA segments conserved across bilaterian animals, of which 7 are also conserved in cnidaria (coral and sea anemone). They control developmental genes (e.g. Nr2f, Ptch, Rfx1/3, Sall, Smad6, Sp5, Tbx2/3), including six homeobox genes: Gsx, Hmx, Meis, Msx, Six1/2, and Zfhx3/4. The segments contain perfectly or near-perfectly conserved CCAAT boxes, E-boxes, and other sequences recognized by regulatory proteins. More such DNA conservation will surely be found soon, as more genomes are published and sequence comparison is optimized. This reveals a control system for animal development conserved since the Precambrian.


Subjects
Anthozoa , Homeobox Genes , Animals , DNA , Transcription Factors/genetics , Anthozoa/genetics , Embryonic Development/genetics , Conserved Sequence/genetics
3.
Bioinformatics ; 39(2)2023 02 03.
Article in English | MEDLINE | ID: mdl-36702468

ABSTRACT

MOTIVATION: We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal. RESULTS: We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible. AVAILABILITY AND IMPLEMENTATION: Source code is freely available at https://gitlab.com/mcfrith/noverlap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Software , DNA Sequence Analysis/methods
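
For orientation, the minimizer scheme that this abstract cites as an earlier heuristic can be sketched as follows. This is a minimal illustration, assuming lexicographic k-mer order (practical tools usually order k-mers by a hash); the function name `minimizer_positions` and the parameters `k` and `w` are illustrative and are not taken from the noverlap source code. It does not implement the optimized sampling approach that the abstract presents.

```python
def minimizer_positions(seq, k, w):
    """Return sorted start positions of window minimizers.

    For every window of w consecutive k-mers, keep the start position of
    the smallest k-mer (lexicographic order, leftmost on ties).
    Assumption: lexicographic order stands in for a hash-based order.
    """
    n_kmers = len(seq) - k + 1
    if n_kmers < w:
        return []  # sequence too short to hold one full window
    picked = set()
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        # Tie-break on position so the leftmost smallest k-mer wins.
        best = min(window, key=lambda i: (seq[i:i + k], i))
        picked.add(best)
    return sorted(picked)


print(minimizer_positions("acgtacgtgg", k=3, w=4))  # prints [0, 4]
```

Every window of `w` consecutive k-mers contains at least one sampled position, which is the guarantee that makes such schemes usable for seeding.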
4.
Mol Biol Evol ; 39(4)2022 04 11.
Article in English | MEDLINE | ID: mdl-35348724

ABSTRACT

Genomes hold a treasure trove of protein fossils: Fragments of formerly protein-coding DNA, which mainly come from transposable elements (TEs) or host genes. These fossils reveal ancient evolution of TEs and genomes, and many fossils have been exapted to perform diverse functions important for the host's fitness. However, old and highly degraded fossils are hard to identify, standard methods (e.g. BLAST) are not optimized for this task, and few Paleozoic protein fossils have been found. Here, a recently optimized method is used to find protein fossils in vertebrate genomes. It finds Paleozoic fossils predating the amphibian/amniote divergence from most major TE categories, including virus-related Polinton and Gypsy elements. It finds 10 fossils in the human genome (eight from TEs and two from host genes) that predate the last common ancestor of all jawed vertebrates, probably from the Ordovician period. It also finds types of transposon and retrotransposon not found in human before. These fossils have extreme sequence conservation, indicating exaptation: some have evidence of gene-regulatory function, and they tend to lie nearest to developmental genes. Some ancient fossils suggest "genome tectonics," where two fragments of one TE have drifted apart by up to megabases, possibly explaining gene deserts and large introns. This paints a picture of great TE diversity in our aquatic ancestors, with patchy TE inheritance by later vertebrates, producing new genes and regulatory elements on the way. Host-gene fossils too have contributed anciently conserved DNA segments. This paves the way to further studies of ancient protein fossils.


Subjects
DNA Transposable Elements , Fossils , Animals , DNA Transposable Elements/genetics , Molecular Evolution , Humans , Nucleic Acid Regulatory Sequences , Retroelements , Vertebrates/genetics
5.
Hum Mol Genet ; 30(7): 552-563, 2021 05 12.
Article in English | MEDLINE | ID: mdl-33693705

ABSTRACT

Facioscapulohumeral muscular dystrophy (FSHD) is an inherited muscle disease caused by misexpression of the DUX4 gene in skeletal muscle. DUX4 is a transcription factor, which is normally expressed in the cleavage-stage embryo and regulates gene expression involved in early embryonic development. Recent studies revealed that DUX4 also activates the transcription of repetitive elements such as endogenous retroviruses (ERVs), mammalian apparent long terminal repeat (LTR)-retrotransposons and pericentromeric satellite repeats (Human Satellite II). DUX4-bound ERV sequences also create alternative promoters for genes or long non-coding RNAs, producing fusion transcripts. To further understand transcriptional regulation by DUX4, we performed nanopore long-read direct RNA sequencing (dRNA-seq) of human muscle cells induced by DUX4, because long reads show whole isoforms with greater confidence. We successfully detected differential expression of known DUX4-induced genes and discovered 61 differentially expressed repeat loci, which are near DUX4-ChIP peaks. We also identified 247 gene-ERV fusion transcripts, of which 216 were not reported previously. In addition, long-read dRNA-seq clearly shows that RNA splicing is a common event in DUX4-activated ERV transcripts. Long-read analysis showed non-LTR transposons including Alu elements are also transcribed from LTRs. Our findings revealed further complexity of DUX4-induced ERV transcripts. This catalogue of DUX4-activated repetitive elements may provide useful information to elucidate the pathology of FSHD. Also, our results indicate that nanopore dRNA-seq has complementary strengths to conventional short-read complementary DNA sequencing.


Subjects
Homeodomain Proteins/genetics , Skeletal Muscle/metabolism , Facioscapulohumeral Muscular Dystrophy/genetics , Nanopores , Nucleic Acid Repetitive Sequences/genetics , RNA Sequence Analysis/methods , Tumor Cell Line , Gene Expression Profiling , Gene Expression Regulation , Humans , Muscle Cells/metabolism , Facioscapulohumeral Muscular Dystrophy/pathology , Protein Isoforms/genetics , RNA Isoforms/genetics , Reverse Transcriptase Polymerase Chain Reaction , RNA Sequence Analysis/statistics & numerical data
6.
Nucleic Acids Res ; 49(6): 3139-3155, 2021 04 06.
Article in English | MEDLINE | ID: mdl-33693858

ABSTRACT

Minimal absent words (MAWs) are minimal-length oligomers absent from a genome or proteome. Although some artificially synthesized MAWs have deleterious effects, there is still a lack of a strategy for the classification of non-occurring sequences as potentially malicious or benign. In this work, by using Markovian models with multiple-testing correction, we reveal significant absent oligomers, which are statistically expected to exist. This suggests that their absence is due to negative selection. We survey genomes and proteomes covering the diversity of life and find thousands of significant absent sequences. Common significant MAWs are often mono- or dinucleotide tracts, or palindromic. Significant viral MAWs are often restriction sites and may indicate unknown restriction motifs. Surprisingly, significant mammal genome MAWs are often present, but rare, in other mammals, suggesting that they are suppressed but not completely forbidden. Significant human MAWs are frequently present in prokaryotes, suggesting immune function, but rarely present in human viruses, indicating viral mimicry of the host. More than one-fourth of human proteins are one substitution away from containing a significant MAW, with the majority of replacements being predicted harmful. We provide a web-based, interactive database of significant MAWs across genomes and proteomes.


Subjects
Genetic Databases , Genomics/methods , Proteomics/methods , Animals , Genome , Humans , Markov Chains , Mutation , Peptides/chemistry , Proteome , Software , Viruses/genetics
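
The definition in this abstract can be made concrete with a brute-force sketch: a word is a minimal absent word if it does not occur in the sequence while every proper substring of it does, which is equivalent to checking its longest proper prefix and suffix. The function name, the `max_len` cutoff, and the default alphabet are illustrative assumptions; the Markov-model significance testing described in the abstract is not implemented here, and this enumeration is only practical for short words.

```python
from itertools import product


def minimal_absent_words(seq, alphabet="acgt", max_len=4):
    """List minimal absent words of seq up to max_len, by brute force."""
    # Collect every substring of seq with length 1..max_len.
    present = {
        seq[i:j]
        for i in range(len(seq))
        for j in range(i + 1, min(i + max_len, len(seq)) + 1)
    }
    present.add("")  # the empty word is trivially present
    maws = []
    for length in range(1, max_len + 1):
        for letters in product(alphabet, repeat=length):
            word = "".join(letters)
            # Absent itself, but both maximal proper substrings occur.
            if (word not in present
                    and word[1:] in present
                    and word[:-1] in present):
                maws.append(word)
    return maws


print(minimal_absent_words("abab", alphabet="ab", max_len=3))  # prints ['aa', 'bb']
```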
7.
Bioinformatics ; 36(22-23): 5344-5350, 2021 Apr 01.
Article in English | MEDLINE | ID: mdl-33346833

ABSTRACT

MOTIVATION: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via 'seeds': simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence. RESULTS: Here, we study a simple sparse-seeding method: using seeds at positions of certain 'words' (e.g. ac, at, gc or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed 'minimizer' sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it. AVAILABILITY AND IMPLEMENTATION: Software to design and test minimally overlapping words is freely available at https://gitlab.com/mcfrith/noverlap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
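
The word-based sampling described above can be sketched in a few lines, using the example word set from the abstract (ac, at, gc, gt). The function name `word_positions` is illustrative and is not taken from the noverlap software; the sketch only selects candidate seed positions and does not build or extend seeds.

```python
def word_positions(seq, words=("ac", "at", "gc", "gt")):
    """Return positions in seq where one of the given words starts."""
    seq = seq.lower()
    words = tuple(words)  # str.startswith accepts a tuple of prefixes
    return [i for i in range(len(seq)) if seq.startswith(words, i)]


print(word_positions("acgtatgc"))  # prints [0, 2, 4, 6]
```

Each of these four words starts with a or g and ends with c or t, so no word can begin at the position immediately after another word begins. Sampled positions are therefore never adjacent, which is one simple way to see the anti-clumping that the abstract mentions.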

8.
Bioinformatics ; 36(2): 408-415, 2020 01 15.
Article in English | MEDLINE | ID: mdl-31329241

ABSTRACT

MOTIVATION: Sequence alignment remains fundamental in bioinformatics. Pair-wise alignment is traditionally based on ad hoc scores for substitutions, insertions and deletions, but can also be based on probability models (pair hidden Markov models: PHMMs). PHMMs enable us to: fit the parameters to each kind of data, calculate the reliability of alignment parts and measure sequence similarity integrated over possible alignments. RESULTS: This study shows how multiple models correspond to one set of scores. Scores can be converted to probabilities by partition functions with a 'temperature' parameter: for any temperature, this corresponds to some PHMM. There is a special class of models with balanced length probability, i.e. no bias toward either longer or shorter alignments. The best way to score alignments and assess their significance depends on the aim: judging whether whole sequences are related versus finding related parts. This clarifies the statistical basis of sequence alignment. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Statistical Models , Markov Chains , Probability , Reproducibility of Results , Sequence Alignment
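
As a rough illustration of converting scores to probabilities, the sketch below handles substitution scores only, assuming the classic log-odds relation in which a score S(a, b) corresponds to a probability ratio exp(S(a, b) / T). It solves for the scale factor lam = 1 / T that makes the implied pair probabilities sum to one. The function names and the +1/-1 scoring scheme are illustrative assumptions; the gap handling and the pair hidden Markov models discussed in the abstract are not covered.

```python
import math


def scale_factor(scores, freqs, tol=1e-12):
    """Solve sum_ab q_a q_b exp(lam * S_ab) = 1 for lam > 0 by bisection.

    Assumes a negative expected score and at least one positive score.
    """
    def excess(lam):
        total = sum(freqs[a] * freqs[b] * math.exp(lam * scores[a][b])
                    for a in freqs for b in freqs)
        return total - 1.0

    if max(scores[a][b] for a in freqs for b in freqs) <= 0:
        raise ValueError("at least one score must be positive")
    lo, hi = 1e-6, 1.0
    if excess(lo) >= 0:
        raise ValueError("expected score must be negative")
    while excess(hi) < 0:  # grow the bracket until the root is inside
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if excess(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def pair_probabilities(scores, freqs):
    """Implied probability of each aligned letter pair (a, b)."""
    lam = scale_factor(scores, freqs)
    return {(a, b): freqs[a] * freqs[b] * math.exp(lam * scores[a][b])
            for a in freqs for b in freqs}


letters = "acgt"
freqs = {x: 0.25 for x in letters}
scores = {a: {b: (1 if a == b else -1) for b in letters} for a in letters}
lam = scale_factor(scores, freqs)
print(lam, 1.0 / lam)  # lam is close to ln 3 for this +1/-1 scheme
```

For uniform letter frequencies and +1/-1 scores, the equation reduces to x/4 + 3/(4x) = 1 with x = exp(lam), whose non-trivial solution is x = 3.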
9.
J Hum Genet ; 66(7): 697-705, 2021 Jul.
Article in English | MEDLINE | ID: mdl-33510365

ABSTRACT

Whole-exome sequencing (WES) can detect not only single-nucleotide variants in causal genes, but also pathogenic copy-number variations using several methods. However, there may be overlooked pathogenic variations in the out of target genome regions of WES analysis (e.g., promoters), leaving many patients undiagnosed. Whole-genome sequencing (WGS) can potentially analyze such regions. We applied long-read nanopore WGS and our recently developed analysis pipeline "dnarrange" to a patient who was undiagnosed by trio-based WES analysis, and identified a heterozygous 97-kb deletion partially involving 5'-untranslated exons of MBD5, which was outside the WES target regions. The phenotype of the patient, a 32-year-old male, was consistent with haploinsufficiency of MBD5. The transcript level of MBD5 in the patient's lymphoblastoid cells was reduced. We therefore concluded that the partial MBD5 deletion is the culprit for this patient. Furthermore, we found other rare structural variations (SVs) in this patient, i.e., a large inversion and a retrotransposon insertion, which were not seen in 33 controls. Although we considered that they are benign SVs, this finding suggests that our pipeline using long-read WGS is useful for investigating various types of potentially pathogenic SVs. In conclusion, we identified a 97-kb deletion, which causes haploinsufficiency of MBD5 in a patient with neurodevelopmental disorder, demonstrating that long-read WGS is a powerful technique to discover pathogenic SVs.


Subjects
DNA-Binding Proteins/genetics , Genetic Predisposition to Disease , Neurodevelopmental Disorders/genetics , Adult , Exome/genetics , Haploinsufficiency/genetics , Humans , Male , Insertional Mutagenesis/genetics , Neurodevelopmental Disorders/pathology , Retroelements/genetics , Whole Genome Sequencing
10.
J Hum Genet ; 65(8): 667-674, 2020 Aug.
Article in English | MEDLINE | ID: mdl-32296131

ABSTRACT

Chromothripsis is a type of chaotic complex genomic rearrangement caused by a single event of chromosomal shattering and repair processes. Chromothripsis is known to cause rare congenital diseases when it occurs in germline cells; however, current genome analysis technologies have difficulty in detecting and deciphering chromothripsis. It is possible that this type of complex rearrangement may be overlooked in rare-disease patients whose genetic diagnosis is unsolved. We applied long-read nanopore sequencing and our recently developed analysis pipeline dnarrange to a patient who has a reciprocal chromosomal translocation t(8;18)(q22;q21) as a result of chromothripsis between the two chromosomes, and fully characterized the complex rearrangements at the translocation site. The patient genome was evidently shattered into 19 fragments, and rejoined into derivative chromosomes in a random order and orientation. The reconstructed patient genome indicates loss of five genomic regions, which all overlap with microarray-detected copy number losses. We found that two disease-related genes RAD21 and EXT1 were lost by chromothripsis. These two genes could fully explain the disease phenotype with facial dysmorphisms and bone abnormality, which is likely a contiguous gene syndrome, Cornelia de Lange syndrome type IV (CdLs-4) and atypical Langer-Giedion syndrome (LGS), also known as trichorhinophalangeal syndrome type II (TRPSII). This provides evidence that our approach based on long-read sequencing can fully characterize chromothripsis in a patient's genome, which is important for understanding the phenotype of disease caused by complex genomic rearrangement.


Subjects
Cell Cycle Proteins/genetics , Chromothripsis , DNA-Binding Proteins/genetics , De Lange Syndrome/genetics , Langer-Giedion Syndrome/genetics , N-Acetylglucosaminyltransferases/genetics , Child , Chromosome Deletion , De Lange Syndrome/diagnosis , De Lange Syndrome/physiopathology , Genome , Humans , Langer-Giedion Syndrome/diagnosis , Langer-Giedion Syndrome/physiopathology , Male , Nanopore Sequencing , Phenotype , DNA Sequence Analysis , Genetic Translocation
11.
J Hum Genet ; 65(5): 475-480, 2020 May.
Article in English | MEDLINE | ID: mdl-32066831

ABSTRACT

Recently, a recessively inherited intronic repeat expansion in replication factor C1 (RFC1) was identified in cerebellar ataxia with neuropathy and bilateral vestibular areflexia syndrome (CANVAS). Here, we describe a Japanese case of genetically confirmed CANVAS with autonomic failure and auditory hallucination. The case showed impaired uptake of iodine-123-metaiodobenzylguanidine and 123I-ioflupane in the cardiac sympathetic nerve and dopaminergic neurons, respectively, by single-photon emission computed tomography. Long-read sequencing identified biallelic pathogenic (AAGGG)n nucleotide repeat expansion in RFC1 and heterozygous benign (TAAAA)n and (TAGAA)n expansions in brain expressed, associated with NEDD4 (BEAN1). Enrichment of the repeat regions in RFC1 and BEAN1 using a Cas9-mediated system clearly distinguished between pathogenic and benign repeat expansions. The haplotype around RFC1 indicated that the (AAGGG)n expansion in our case was on the same ancestral allele as that of European cases. Thus, long-read sequencing facilitates precise genetic diagnosis of diseases with complex repeat structures and various expansions.


Subjects
Bilateral Vestibulopathy/genetics , Cerebellar Ataxia/genetics , DNA Repeat Expansion , Replication Protein C/genetics , DNA Sequence Analysis , Aged 80 and Over , Asian People , Bilateral Vestibulopathy/diagnosis , Cerebellar Ataxia/diagnosis , Female , Humans , Japan , Nedd4 Ubiquitin Protein Ligases/genetics
12.
Nucleic Acids Res ; 46(4): 1661-1673, 2018 02 28.
Article in English | MEDLINE | ID: mdl-29272440

ABSTRACT

Genomes mutate and evolve in ways simple (substitution or deletion of bases) and complex (e.g. chromosome shattering). We do not fully understand what types of complex mutation occur, and we cannot routinely characterize arbitrarily-complex mutations in a high-throughput, genome-wide manner. Long-read DNA sequencing methods (e.g. PacBio, nanopore) are promising for this task, because one read may encompass a whole complex mutation. We describe an analysis pipeline to characterize arbitrarily-complex 'local' mutations, i.e. intrachromosomal mutations encompassed by one DNA read. We apply it to nanopore and PacBio reads from one human cell line (NA12878), and survey sequence rearrangements, both real and artifactual. Almost all the real rearrangements belong to recurring patterns or motifs: the most common is tandem multiplication (e.g. heptuplication), but there are also complex patterns such as localized shattering, which resembles DNA damage by radiation. Gene conversions are identified, including one between hemoglobin gamma genes. This study demonstrates a way to find intricate rearrangements with any number of duplications, deletions, and repositionings. It demonstrates a probability-based method to resolve ambiguous rearrangements involving highly similar sequences, as occurs in gene conversion. We present a catalog of local rearrangements in one human cell line, and show which rearrangement patterns occur.


Subjects
DNA/chemistry , Mutation , Cell Line , Gene Conversion , Humans , Inverted Repeat Sequences , Sequence Alignment , DNA Sequence Analysis , Sequence Deletion , Sequence Inversion
13.
Nucleic Acids Res ; 46(3): e18, 2018 02 16.
Article in English | MEDLINE | ID: mdl-29182778

ABSTRACT

Performing sequence alignment to identify structural variants, such as large deletions, from genome sequencing data is a fundamental task, but current methods are far from perfect. The current practice is to independently align each DNA read to a reference genome. We show that the propensity of genomic rearrangements to accumulate in repeat-rich regions imposes severe ambiguities in these alignments, and consequently on the variant calls: with current read lengths, this affects more than one third of known large deletions in the C. Venter genome. We present a method to jointly align reads to a genome, whereby alignment ambiguity of one read can be disambiguated by other reads. We show this leads to a significant improvement in the accuracy of identifying large deletions (≥20 bases), while imposing minimal computational overhead and maintaining an overall running time that is on par with current tools. A software implementation is available as an open-source Python program called JRA at https://bitbucket.org/jointreadalignment/jra-src.


Subjects
Algorithms , Base Sequence , DNA/genetics , Human Genome , Sequence Deletion , Cell Line , Datasets as Topic , High-Throughput Nucleotide Sequencing , Humans , Internet , Male , Middle Aged , Ploidies , Primary Cell Culture , Sequence Alignment , DNA Sequence Analysis , Software
14.
Ann Neurol ; 84(6): 843-853, 2018 12.
Article in English | MEDLINE | ID: mdl-30412317

ABSTRACT

OBJECTIVE: Approximately 5% of cerebral small vessel diseases are hereditary, which include COL4A1/COL4A2-related disorders. COL4A1/COL4A2 encode type IV collagen α1/2 chains in the basement membranes of cerebral vessels. COL4A1/COL4A2 mutations impair the secretion of collagen to the extracellular matrix, thereby resulting in vessel fragility. The diagnostic yield for COL4A1/COL4A2 variants is around 20 to 30%, suggesting other mutated genes might be associated with this disease. This study aimed to identify novel genes that cause COL4A1/COL4A2-related disorders. METHODS: Whole exome sequencing was performed in 2 families with suspected COL4A1/COL4A2-related disorders. We validated the role of COLGALT1 variants by constructing a 3-dimensional structural model, evaluating collagen β(1-O)galactosyltransferase 1 (ColGalT1) protein expression and ColGalT activity by Western blotting and collagen galactosyltransferase assays, and performing in vitro RNA interference and rescue experiments. RESULTS: Exome sequencing demonstrated biallelic variants in COLGALT1 encoding ColGalT1, which was involved in the post-translational modification of type IV collagen in 2 unrelated patients: c.452T>G (p.Leu151Arg) and c.1096delG (p.Glu366Argfs*15) in Patient 1, and c.460G>C (p.Ala154Pro) and c.1129G>C (p.Gly377Arg) in Patient 2. Three-dimensional model analysis suggested that p.Leu151Arg and p.Ala154Pro destabilized protein folding, which impaired enzymatic activity. ColGalT1 protein expression and ColGalT activity in Patient 1 were undetectable. RNA interference studies demonstrated that reduced ColGalT1 altered COL4A1 secretion, and rescue experiments showed that mutant COLGALT1 insufficiently restored COL4A1 production in cells compared with wild type. INTERPRETATION: Biallelic COLGALT1 variants cause cerebral small vessel abnormalities through a common molecular pathogenesis with COL4A1/COL4A2-related disorders. Ann Neurol 2018;84:843-853.


Subjects
Cerebral Small Vessel Diseases/genetics , Collagen Type IV/genetics , Genetic Predisposition to Disease/genetics , Mutation/genetics , Transformed Cell Line , Cerebral Small Vessel Diseases/diagnostic imaging , Child , DNA Mutational Analysis , Glucosyltransferases/metabolism , Humans , Magnetic Resonance Imaging , Male , Molecular Models , Mutagenesis , Messenger RNA/metabolism , Transfection
15.
Bioinformatics ; 33(6): 926-928, 2017 03 15.
Article in English | MEDLINE | ID: mdl-28039163

ABSTRACT

Summary: LAST-TRAIN improves sequence alignment accuracy by inferring substitution and gap scores that fit the frequencies of substitutions, insertions, and deletions in a given dataset. We have applied it to mapping DNA reads from IonTorrent and PacBio RS, and we show that it reduces reference bias for Oxford Nanopore reads. Availability and Implementation: the source code is freely available at http://last.cbrc.jp/. Contact: mhamada@waseda.jp or mcfrith@edu.k.u-tokyo.ac.jp. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Human Genome , Genetic Polymorphism , DNA Sequence Analysis/methods , Software , Humans
16.
BMC Bioinformatics ; 18(1): 299, 2017 Jun 12.
Article in English | MEDLINE | ID: mdl-28606054

ABSTRACT

BACKGROUND: Genome sequencing provides a powerful tool for pathogen detection and can help resolve outbreaks that pose public safety and health risks. Mapping of DNA reads to genomes plays a fundamental role in this approach, where accurate alignment and classification of sequencing data is crucial. Standard mapping methods crudely treat bases as independent from their neighbors. Accuracy might be improved by using higher order paired hidden Markov models (HMMs), which model neighbor effects, but introduce design and implementation issues that have typically made them impractical for read mapping applications. We present a variable-order paired HMM that we term VarHMM, which addresses central issues involved with higher order modeling for sequence alignment. RESULTS: Compared with existing alignment methods, VarHMM is able to model higher order distributions and quantify alignment probabilities with greater detail and accuracy. In a series of comparison tests, in which Ion Torrent sequenced DNA was mapped to similar bacterial strains, VarHMM consistently provided better strain discrimination than any of the other alignment methods that we compared with. CONCLUSIONS: Our results demonstrate the advantages of higher ordered probability distribution modeling and also suggest that further development of such models would benefit read mapping in a range of other applications as well.


Subjects
Bacterial DNA , Bacterial Genome/genetics , Genomics/methods , Sequence Alignment/methods , DNA Sequence Analysis/methods , Bacterial DNA/analysis , Bacterial DNA/classification , Bacterial DNA/genetics , Markov Chains
17.
Bioinformatics ; 32(2): 304-5, 2016 Jan 15.
Article in English | MEDLINE | ID: mdl-26428291

ABSTRACT

MOTIVATION: Pairwise local alignment is an indispensable tool for molecular biologists. In real time (i.e. in about 1 s), ALP (Ascending Ladder Program) calculates the E-values for protein-protein or DNA-DNA local alignments of random sequences, for arbitrary substitution score matrix, gap costs and letter abundances; and FALP (Frameshift Ascending Ladder Program) performs a similar task, although more slowly, for frameshifting DNA-protein alignments. AVAILABILITY AND IMPLEMENTATION: To permit other C++ programmers to implement the computational efficiencies in ALP and FALP directly within their own programs, C++ source codes are available in the public domain at http://go.usa.gov/3GTSW under 'ALP' and 'FALP', along with the standalone programs ALP and FALP. CONTACT: spouge@nih.gov SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology/methods , DNA/chemistry , Proteins/chemistry , DNA Sequence Analysis/methods , Protein Sequence Analysis/methods , Software , DNA/metabolism , Factual Databases , Humans , Proteins/metabolism , Sequence Alignment
18.
J Struct Funct Genomics ; 17(4): 147-154, 2016 Dec.
Article in English | MEDLINE | ID: mdl-28083762

ABSTRACT

Protein database search for public databases is a fundamental step in the target selection of proteins in structural and functional genomics and also for inferring protein structure, function, and evolution. Most database search methods employ amino acid substitution matrices to score amino acid pairs. The choice of substitution matrix strongly affects homology detection performance. We earlier proposed a substitution matrix named MIQS that was optimized for distant protein homology search. Herein we further evaluate MIQS in combination with LAST, a heuristic and fast database search tool with a tunable sensitivity parameter m, where larger m denotes higher sensitivity. Results show that MIQS substantially improves the homology detection and alignment quality performance of LAST across diverse m parameters. Against a protein database consisting of approximately 15 million sequences, LAST with m = 10^5 achieves better homology detection performance than BLASTP, and completes the search 20 times faster. Compared to the most sensitive existing methods being used today, CS-BLAST and SSEARCH, LAST with MIQS and m = 10^6 shows comparable homology detection performance at 2.0 and 3.9 times greater speed, respectively. Results demonstrate that MIQS-powered LAST is a time-efficient method for sensitive and accurate homology search.


Subjects
Computer Heuristics , Protein Databases , Protein Sequence Analysis , Algorithms , Computational Biology , Molecular Models , Proteins/chemistry , Sequence Alignment
19.
Brief Bioinform ; 15(2): 138-54, 2014 Mar.
Article in English | MEDLINE | ID: mdl-24413184

ABSTRACT

The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support 'spaced seeds' and 'subset seeds' used in many biological applications.


Subjects
Algorithms , Computational Biology/methods , Nucleic Acid Databases/statistics & numerical data , Humans , Automated Pattern Recognition/statistics & numerical data , Software
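
To show what a suffix array is and how it supports exact matching, here is a deliberately naive sketch. It builds the array by sorting suffixes directly, which is far slower than the SA-IS algorithm that the abstract describes, and it does not implement DisLex or any form of inexact matching. The function names are illustrative.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order.

    Naive construction by direct sorting; not SA-IS.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return sorted start positions of pattern, via binary search on sa."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


sa = suffix_array("banana")
print(sa)                                     # prints [5, 3, 1, 0, 4, 2]
print(find_occurrences("banana", sa, "ana"))  # prints [1, 3]
```

All suffixes that begin with the pattern form one contiguous block of the array, so two binary searches are enough to locate every occurrence.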
20.
Nucleic Acids Res ; 42(8): 4823-32, 2014 Apr.
Article in English | MEDLINE | ID: mdl-24682821

ABSTRACT

Proximal promoters are fundamental genomic elements for gene expression. They vary in terms of GC percentage, CpG abundance, presence of TATA signal, evolutionary conservation, chromosomal spread of transcription start sites and breadth of expression across cell types. These properties are correlated, and it has been suggested that there are two classes of promoters: one class with high CpG, widely spread transcription start sites and broad expression, and another with TATA signals, narrow spread and restricted expression. However, it has been unclear why these properties are correlated in this way. We reexamined these features using the deep FANTOM5 CAGE data from hundreds of cell types. First, we point out subtle but important biases in previous definitions of promoters and of expression breadth. Second, we show that most promoters are rather nonspecifically expressed across many cell types. Third, promoters' expression breadth is independent of maximum expression level, and therefore correlates with average expression level. Fourth, the data show a more complex picture than two classes, with a network of direct and indirect correlations among promoter properties. By tentatively distinguishing the direct from the indirect correlations, we reveal simple explanations for them.


Subjects
Genetic Promoter Regions , Animals , CpG Islands , Statistical Data Interpretation , Humans , Mice , TATA Box , Transcription Initiation Site , Genetic Transcription