1.
Nucleic Acids Res ; 52(4): 1720-1735, 2024 Feb 28.
Article in English | MEDLINE | ID: mdl-38109317

ABSTRACT

Nucleotide excision repair (NER) removes helix-distorting DNA lesions and is therefore critical for genome stability. During NER, DNA is unwound on either side of the lesion and excised, but the rules governing incision site selection, particularly in eukaryotic cells, are unclear. Excision repair-sequencing (XR-seq) sequences excised NER fragments, but analysis has been limited because the lesion location is unknown. Here, we exploit accelerated cytosine deamination rates in UV-induced CPD (cyclobutane pyrimidine dimer) lesions to precisely map their locations at C to T mismatches in XR-seq reads, revealing general and species-specific patterns of incision site selection during NER. Our data indicate that the 5' incision site occurs preferentially in HYV (i.e. not G; C/T; not T) sequence motifs, a pattern that can be explained by sequence preferences of the XPF-ERCC1 endonuclease. In contrast, the 3' incision site does not show strong sequence preferences, once truncated reads arising from mispriming events are excluded. Instead, the 3' incision is partially determined by the 5' incision site distance, indicating that the two incision events are coupled. Finally, our data reveal unique and coupled NER incision patterns at nucleosome boundaries. These findings reveal key principles governing NER incision site selection in eukaryotic cells.
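
As an illustration of the HYV motif described above, the following minimal sketch scans a sequence for every position matching the pattern (H = not G, Y = C or T, V = not T, using the standard IUPAC ambiguity codes). The function names are hypothetical and are not taken from the paper's software; the sketch only shows how such a degenerate motif can be matched.

```python
import re

# IUPAC ambiguity codes needed for the motif in the abstract:
# H = A, C or T (not G); Y = C or T; V = A, C or G (not T).
IUPAC = {"H": "[ACT]", "Y": "[CT]", "V": "[ACG]"}


def motif_to_regex(motif):
    """Translate a degenerate motif into a regex; the lookahead allows overlapping hits."""
    body = "".join(IUPAC.get(base, base) for base in motif)
    return re.compile("(?=(" + body + "))")


def find_motif(seq, motif="HYV"):
    """Return the 0-based start positions of every (possibly overlapping) motif match."""
    pattern = motif_to_regex(motif)
    return [m.start() for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    print(find_motif("ACAGTT"))
```

Overlapping matches are reported because neighbouring positions can each satisfy the motif independently.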


Subjects
Cytosine; Excision Repair; Cytosine/chemistry; Deamination; DNA Damage; Eukaryotic Cells/chemistry
2.
J Chem Theory Comput ; 18(12): 7043-7051, 2022 Dec 13.
Article in English | MEDLINE | ID: mdl-36374620

ABSTRACT

Although community or cluster identification is becoming a standard tool within the simulation community, traditional algorithms are challenging to adapt to time-dependent data. Here, we introduce temporal community identification using the Δ-screening algorithm, which has the flexibility to account for varying community compositions, merging and splitting behaviors within dynamically evolving chemical networks. When applied to a complex chemical system whose varying chemical environments cause multiple time scale behavior, Δ-screening is able to resolve the multiple time scales of temporal communities. This computationally efficient algorithm is easily adapted to a wide range of dynamic chemical systems; flexibility in implementation allows the user to increase or decrease the resolution of temporal features by controlling parameters associated with community composition and fluctuations therein.


Subjects
Algorithms; Computer Simulation
3.
iScience ; 25(11): 105273, 2022 Nov 18.
Article in English | MEDLINE | ID: mdl-36304115

ABSTRACT

De novo genome assembly is a fundamental problem in computational molecular biology that aims to reconstruct an unknown genome sequence from a set of short DNA sequences (or reads) obtained from the genome. The relative ordering of the reads along the target genome is not known a priori, which is one of the main contributors to the increased complexity of the assembly process. In this article, with the dual objective of improving assembly quality and exposing a high degree of parallelism, we present a partitioning-based approach. Our framework, BOA (bucket-order-assemble), uses bucketing alongside graph- and hypergraph-based partitioning techniques to produce a partial ordering of the reads. This partial ordering enables us to divide the read set into disjoint blocks that can be independently assembled in parallel using any state-of-the-art serial assembler of choice. Experimental results show that BOA improves both the overall assembly quality and performance.

4.
JAMA Netw Open ; 5(8): e2225508, 2022 08 01.
Article in English | MEDLINE | ID: mdl-35930285

ABSTRACT

Importance: Person-to-person contact is important for the transmission of health care-associated pathogens. Quantifying these contact patterns is crucial for modeling disease transmission and understanding routes of potential transmission. Objective: To generate and analyze the mixing matrices of hospital patients based on their contacts within hospital units. Design, Setting, and Participants: In this quality improvement study, mixing matrices were created using a weighted contact network of connected hospital patients, in which contact was defined as occupying the same hospital unit for 1 day. Participants included hospitalized patients at 299 hospital units in 24 hospitals in the Southeastern United States that were part of the Duke Antimicrobial Stewardship Outreach Network between January 2015 and December 2017. Analysis was conducted between October 2021 and February 2022. Main Outcomes and Measures: The mixing matrices of patients for each hospital unit were assessed using age, Elixhauser Score, and a measure of antibiotic exposure. Results: Among 1 549 413 hospitalized patients (median [IQR] age, 44 [26-63] years; 883 580 [56.3%] women) in 299 hospital units, some units had highly similar patterns across multiple hospitals, although the number of patients varied to a great extent. For most of the adult inpatient units, frequent mixing was observed for older adult groups, while outpatient units (eg, emergency departments and behavioral health units) showed mixing between different age groups. Most units' mixing patterns followed the marginal distribution of age; however, patients aged 90 years or older with longer lengths of stay created a secondary peak in some medical wards. From the mixing matrices by Elixhauser Score, mixing between patients with relatively higher comorbidity index was observed in intensive care units.
Mixing matrices by antibiotic spectrum, a 4-point scale based on priority for antibiotic stewardship programs, resulted in 6 major distinct patterns owing to the variation of the type of antibiotics used in different units, namely those dominated by a single antibiotic spectrum (narrow, broad, or extended), 1 pattern spanning all antibiotic spectrum types and 2 forms of narrow- and extended-spectrum dominant exposure patterns (an emergency room where patients were exposed to one type of antibiotic or the other and a pediatric ward where patients were exposed to both types). Conclusions and Relevance: This quality improvement study found that the mixing patterns of patients both within and between hospitals followed broadly expected patterns, although with a considerable amount of heterogeneity. These patterns could be used to inform mathematical models of health care-associated infections, assess the appropriateness of both models and policies for smaller community hospitals, and provide baseline information for the design of interventions that rely on altering patient contact patterns, such as practices for transferring patients within hospitals.
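
The contact definition used above (two patients occupying the same hospital unit on the same day) can be turned into a mixing matrix along the following lines. This is a minimal sketch under stated assumptions, not the study's code: the record layout `(patient, unit, day)`, the `group_of` mapping and the function name are all hypothetical, and each contact is counted once from each patient's side so that the matrix is symmetric.

```python
from collections import defaultdict
from itertools import combinations


def mixing_matrix(stays, group_of):
    """Count contacts between patient groups.

    stays    -- iterable of (patient, unit, day) records (assumed layout)
    group_of -- dict mapping each patient to a group label, e.g. an age band
    Returns a dict keyed by (group_a, group_b). Every contact adds 1 to both
    (a, b) and (b, a), so a within-group contact adds 2 to the diagonal cell.
    """
    patients_by_unit_day = defaultdict(set)
    for patient, unit, day in stays:
        patients_by_unit_day[(unit, day)].add(patient)

    matrix = defaultdict(int)
    for patients in patients_by_unit_day.values():
        # Every pair sharing a unit on a given day is one contact.
        for a, b in combinations(sorted(patients), 2):
            matrix[(group_of[a], group_of[b])] += 1
            matrix[(group_of[b], group_of[a])] += 1
    return dict(matrix)


if __name__ == "__main__":
    stays = [("p1", "ICU", 1), ("p2", "ICU", 1), ("p3", "ICU", 1),
             ("p1", "ICU", 2), ("p2", "ICU", 2)]
    groups = {"p1": "old", "p2": "old", "p3": "young"}
    print(mixing_matrix(stays, groups))
```

Grouping by a comorbidity score or an antibiotic-exposure category instead of age only requires a different `group_of` mapping.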


Subjects
Antimicrobial Stewardship; Cross Infection; Adult; Aged; Anti-Bacterial Agents/therapeutic use; Child; Cross Infection/epidemiology; Female; Hospitals; Humans; Inpatients; Male
5.
IEEE/ACM Trans Comput Biol Bioinform ; 18(4): 1535-1548, 2021.
Article in English | MEDLINE | ID: mdl-31647442

ABSTRACT

Phenomics is an emerging branch of modern biology that uses high throughput phenotyping tools to capture multiple environmental and phenotypic traits, often at massive spatial and temporal scales. The resulting high dimensional data represent a treasure trove of information for providing an in-depth understanding of how multiple factors interact and contribute to the overall growth and behavior of different genotypes. However, computational tools that can parse through such complex data and aid in extracting plausible hypotheses are currently lacking. In this article, we present Hyppo-X, a new algorithmic approach to visually explore complex phenomics data and in the process characterize the role of environment on phenotypic traits. We model the problem as one of unsupervised structure discovery, and use emerging principles from algebraic topology and graph theory for discovering higher-order structures of complex phenomics data. We present an open source software which has interactive visualization capabilities to facilitate data navigation and hypothesis formulation. We test and evaluate Hyppo-X on two real-world plant (maize) data sets. Our results demonstrate the ability of our approach to delineate divergent subpopulation-level behavior. Notably, our approach shows how environmental factors could influence phenotypic behavior, and how that effect varies across different genotypes and different time scales. To the best of our knowledge, this effort provides one of the first approaches to systematically formalize the problem of hypothesis extraction for phenomics data. Considering the infancy of the phenomics field, tools that help users explore complex data and extract plausible hypotheses in a data-guided manner will be critical to future advancements in the use of such data.


Subjects
Phenomics/methods; Phenotype; Software; Algorithms; Databases, Genetic; Zea mays/genetics; Zea mays/physiology
6.
IEEE/ACM Trans Comput Biol Bioinform ; 16(4): 1091-1106, 2019.
Article in English | MEDLINE | ID: mdl-28910776

ABSTRACT

De novo genome assembly describes the process of reconstructing an unknown genome from a large collection of short (or long) reads sequenced from the genome. A single run of a Next-Generation Sequencing (NGS) technology can produce billions of short reads, making genome assembly computationally demanding (both in terms of memory and time). One of the major computational steps in modern-day short read assemblers involves the construction and use of a string data structure called the de Bruijn graph. In fact, a majority of short read assemblers build the complete de Bruijn graph for the set of input reads, and subsequently traverse and prune low-quality edges, in order to generate genomic "contigs", the output of assembly. These steps of graph construction and traversal contribute to well over 90 percent of the runtime and memory. In this paper, we present a fast algorithm, FastEtch, that uses sketching to build an approximate version of the de Bruijn graph for the purpose of generating an assembly. The algorithm uses Count-Min sketch, which is a probabilistic data structure for streaming data sets. The result is an approximate de Bruijn graph that stores information pertaining only to a selected subset of nodes that are most likely to contribute to the contig generation step. In addition, edges are not stored; instead, the fraction of edges that contributes to contig generation is detected on the fly. This approximate approach is intended to significantly improve performance (both execution time and memory footprint) whilst possibly compromising on the output assembly quality. We present two main versions of the assembler: one that generates an assembly where each contig represents a contiguous genomic region from one strand of the DNA, and another that generates an assembly where the contigs can straddle either of the two strands of the DNA. For further scalability, we have implemented a multi-threaded parallel code.
Experimental results using our algorithm conducted on E. coli, Yeast, C. elegans, and Human (Chr2 and Chr2+3) genomes show that our method yields one of the best time-memory-quality trade-offs, when compared against many state-of-the-art genome assemblers.
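
To make the Count-Min sketch mentioned above concrete, here is a minimal, generic version applied to k-mer counting. It is an illustrative sketch only and does not reproduce FastEtch: the class, its default `width` and `depth`, and the SHA-256-based row hashing are assumptions chosen for clarity rather than speed.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: estimates never undercount but may overcount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, item, row):
        # One independent-looking hash per row, derived by prefixing the row index.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._column(item, row)] += 1

    def estimate(self, item):
        # The minimum across rows is the least collision-inflated counter.
        return min(self.table[row][self._column(item, row)]
                   for row in range(self.depth))


def count_kmers(reads, k, sketch):
    """Stream every k-mer of every read into the sketch."""
    for read in reads:
        for i in range(len(read) - k + 1):
            sketch.add(read[i:i + k])
    return sketch


if __name__ == "__main__":
    sketch = count_kmers(["ACGTACGT"], 4, CountMinSketch())
    print(sketch.estimate("ACGT"))
```

Memory use is fixed at `width * depth` counters regardless of how many distinct k-mers are streamed, which is the property that makes the structure attractive for large read sets.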


Subjects
Computational Biology/instrumentation; Contig Mapping/instrumentation; Genome; Software; Algorithms; Animals; Caenorhabditis elegans/genetics; Computational Biology/methods; Contig Mapping/methods; Escherichia coli/genetics; Genomics; High-Throughput Nucleotide Sequencing; Humans; Probability; Yeasts/genetics
7.
Article in English | MEDLINE | ID: mdl-29990252

ABSTRACT

Methods to efficiently uncover and extract community structures are required in a number of biological applications where networked data and their interactions can be modeled as graphs, and observing tightly-knit groups of vertices ("communities") can offer insights into the structural and functional building blocks of the underlying network. Classical applications of community detection have largely focused on unipartite networks, i.e., graphs built out of a single type of object. However, due to the increased availability of biological data from various sources, there is now an increasing need for handling heterogeneous networks, which are built out of multiple types of objects. In this paper, we address the problem of identifying communities from biological bipartite networks, i.e., networks where interactions are observed between two different types of objects (e.g., genes and diseases, drugs and protein complexes, plants and pollinators, and hosts and pathogens). Toward detecting communities in such bipartite networks, we make the following contributions: i) (metric) we propose a variant of bipartite modularity; ii) (algorithms) we present an efficient algorithm called biLouvain that implements a set of heuristics toward fast and precise community detection in bipartite networks (https://github.com/paolapesantez/biLouvain); and iii) (experiments) we present a thorough experimental evaluation of our algorithm, including comparison to other state-of-the-art methods to identify communities in bipartite networks. Experimental results show that our biLouvain algorithm identifies communities of comparable or better quality (as measured by bipartite modularity) than existing methods, while significantly reducing the time-to-solution by one to four orders of magnitude.


Subjects
Computational Biology/methods; Models, Biological; Algorithms; Databases, Genetic; Gene Regulatory Networks/genetics
8.
BMC Bioinformatics ; 19(1): 83, 2018 03 05.
Article in English | MEDLINE | ID: mdl-29506470

ABSTRACT

BACKGROUND: Clustering of protein sequences is of key importance in predicting the structure and function of newly sequenced proteins and is also of use for their annotation. With the advent of multiple high-throughput sequencing technologies, new protein sequences are becoming available at an extraordinary rate. The rapid growth rate has impeded deployment of existing protein clustering/annotation tools which depend largely on pairwise sequence alignment. RESULTS: In this paper, we propose an alignment-free clustering approach, coreClust, for annotating protein sequences using detected conserved regions. The proposed algorithm uses Min-Wise Independent Hashing for identifying similar conserved regions. Min-Wise Independent Hashing works by generating a (w,c)-sketch for each document and comparing these sketches. Our algorithm fits well within the MapReduce framework, permitting scalability. We show that coreClust generates results comparable to existing known methods. In particular, we show that the clusters generated by our algorithm capture the subfamilies of the Pfam domain families for which the sequences in a cluster have a similar domain architecture. We show that for a data set of 90,000 sequences (about 250,000 domain regions), the clusters generated by our algorithm give a 75% average weighted F1 score, our accuracy metric, when compared to the clusters generated by a semi-exhaustive pairwise alignment algorithm. CONCLUSIONS: The new clustering algorithm can be used to generate meaningful clusters of conserved regions. It is a scalable method that when paired with our prior work, NADDA for detecting conserved regions, provides a complete end-to-end pipeline for annotating protein sequences.
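
The min-wise hashing idea referred to above can be sketched generically as follows. This is plain MinHash over k-mer sets, offered as an assumption-laden illustration; it is not the (w,c)-sketch construction used by coreClust, and the function names, the SHA-256-based hashing and the default parameters are hypothetical. Sequences are assumed to be at least `k` residues long.

```python
import hashlib


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(token, seed):
    # Seeded 64-bit hash built from SHA-256; stands in for a random permutation.
    digest = hashlib.sha256(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_sketch(seq, k=3, num_hashes=64):
    """Keep, for each seeded hash function, the smallest hash over the k-mer set."""
    tokens = kmers(seq, k)
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of agreeing positions approximates the Jaccard similarity of the k-mer sets."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    a = minhash_sketch("MKVLAAGIVGLLLAQ")
    b = minhash_sketch("MKVLAAGIVGTTLAQ")
    print(estimate_jaccard(a, b))
```

Comparing two fixed-length sketches costs the same no matter how long the sequences are, which is what lets this kind of comparison avoid pairwise alignment.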


Subjects
Algorithms; Databases, Protein; Molecular Sequence Annotation; Sequence Alignment/methods; Amino Acid Sequence; Cluster Analysis; Phylogeny; Protein Domains; Rickettsia/classification
9.
PLoS One ; 11(8): e0161338, 2016.
Article in English | MEDLINE | ID: mdl-27552220

ABSTRACT

BACKGROUND: Identifying conserved regions in protein sequences is a fundamental operation, occurring in numerous sequence-driven analysis pipelines. It is used as a way to decode domain-rich regions within proteins, to compute protein clusters, to annotate sequence function, and to compute evolutionary relationships among protein sequences. A number of approaches exist for identifying and characterizing protein families based on their domains, and because domains represent conserved portions of a protein sequence, the primary computation involved in protein family characterization is identification of such conserved regions. However, identifying conserved regions from large collections (millions) of protein sequences presents significant challenges. METHODS: In this paper we present a new, alignment-free method for detecting conserved regions in protein sequences called NADDA (No-Alignment Domain Detection Algorithm). Our method exploits the abundance of exact matching short subsequences (k-mers) to quickly detect conserved regions, and the power of machine learning is used to improve the prediction accuracy of detection. We present a parallel implementation of NADDA using the MapReduce framework and show that our method is highly scalable. RESULTS: We have compared NADDA with Pfam and InterPro databases. For known domains annotated by Pfam, accuracy is 83%, sensitivity 96%, and specificity 44%. For sequences with new domains not present in the training set an average accuracy of 63% is achieved when compared to Pfam. A boost in results in comparison with InterPro demonstrates the ability of NADDA to capture conserved regions beyond those present in Pfam. We have also compared NADDA with ADDA and MKDOM2, assuming Pfam as ground-truth. On average NADDA shows comparable accuracy, more balanced sensitivity and specificity, and being alignment-free, is significantly faster. 
Excluding the one-time cost of training, runtimes on a single processor were 49s, 10,566s, and 456s for NADDA, ADDA, and MKDOM2, respectively, for a data set comprised of approximately 2500 sequences.
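
The use of exact-matching k-mers as a signal for conserved regions can be illustrated with a small sketch. It is not NADDA itself (which also applies machine learning and runs under MapReduce); the two functions below are hypothetical helpers that only show how a per-position profile of k-mer sharing across sequences might be computed.

```python
from collections import Counter


def kmer_document_frequency(seqs, k):
    """For each k-mer, count how many sequences contain it at least once."""
    freq = Counter()
    for seq in seqs:
        freq.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return freq


def position_profile(seq, freq, k):
    """Per-position score: number of sequences sharing the k-mer that starts there."""
    return [freq[seq[i:i + k]] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    seqs = ["MKVLAT", "GGKVLA", "KVLPPP"]
    freq = kmer_document_frequency(seqs, 3)
    print(position_profile(seqs[0], freq, 3))
```

Runs of high scores in the profile mark stretches shared by many sequences, which is the kind of evidence a downstream classifier could use to call a region conserved.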


Subjects
Conserved Sequence/genetics; Sequence Analysis, Protein; Sequence Homology, Amino Acid; Algorithms; Databases, Protein; Protein Domains; Protein Structure, Tertiary; Sequence Alignment; Software
10.
PLoS One ; 11(4): e0152404, 2016.
Article in English | MEDLINE | ID: mdl-27071032

ABSTRACT

High-throughput sequencing continues to produce an immense volume of information that is processed and assembled into mature sequence data. Data analysis tools are urgently needed that leverage the embedded DNA sequence polymorphisms and consequent changes to restriction sites or sequence motifs in a high-throughput manner to enable biological experimentation. CisSERS was developed as a standalone open source tool to analyze sequence datasets and provide biologists with individual or comparative genome organization information in terms of presence and frequency of patterns or motifs such as restriction enzyme sites. Predicted agarose gel visualization of the custom analysis results was also integrated to enhance the usefulness of the software. CisSERS offers several novel functionalities, such as handling of large and multiple datasets in parallel, multiple restriction enzyme site detection and custom motif detection features, which are seamlessly integrated with real-time agarose gel visualization. Using a simple FASTA-formatted file as input, CisSERS utilizes the REBASE enzyme database. Results from CisSERS enable the user to make decisions for designing genotyping-by-sequencing experiments, reduced representation sequencing, 3'UTR sequencing, and cleaved amplified polymorphic sequence (CAPS) molecular markers for large sample sets. CisSERS is a Java-based graphical user interface built around a Perl backbone. Several of the applications of CisSERS, including CAPS molecular marker development, were successfully validated using wet-lab experimentation. Here, we present the tool CisSERS and results from in-silico and corresponding wet-lab analyses demonstrating that CisSERS is a technology platform solution that facilitates efficient data utilization in genomics and genetics studies.
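
Restriction site detection and the fragment sizes that a predicted gel would display can be sketched in a few lines. This is an independent illustration, not CisSERS code: the function names are hypothetical, the sequence is treated as linear, and the example uses EcoRI, whose recognition site GAATTC is cut after the first G (offset 1).

```python
def find_sites(seq, site):
    """Return 0-based start positions of every occurrence of site in seq."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def fragment_lengths(seq, site, cut_offset):
    """Lengths of the fragments produced by cutting a linear sequence at each site.

    cut_offset is the distance from the start of the site to the cut position.
    """
    cuts = [p + cut_offset for p in find_sites(seq, site)]
    bounds = [0] + cuts + [len(seq)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    seq = "AAGAATTCTTTTGAATTCAA"
    print(find_sites(seq, "GAATTC"))
    print(fragment_lengths(seq, "GAATTC", 1))
```

Comparing the fragment lists obtained from two alleles is the basic check behind a CAPS marker: a polymorphism that creates or destroys a site changes the list.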


Subjects
Nucleotide Motifs/genetics; Sequence Analysis, DNA/methods; 3' Untranslated Regions/genetics; Computational Biology/methods; Computer Simulation; Genome/genetics; Genomics/methods; Genotype; Humans; Polymorphism, Genetic/genetics; Software; User-Computer Interface
11.
Pac Symp Biocomput ; : 225-34, 2012.
Article in English | MEDLINE | ID: mdl-22174278

ABSTRACT

We report the development of a novel high performance computing method for the identification of proteins from unknown (environmental) samples. The method uses computational optimization to provide an effective way to control the false discovery rate for environmental samples and complements de novo peptide sequencing. Furthermore, the method provides information based on the expressed protein in a microbial community, and thus complements DNA-based identification methods. Testing on blind samples demonstrates that the method provides 79-95% overlap with analogous results from searches involving only the correct genomes. We provide scaling and performance evaluations for the software that demonstrate the ability to carry out large-scale optimizations on 1258 genomes containing 4.2M proteins.


Subjects
Microbiota; Proteomics/statistics & numerical data; Tandem Mass Spectrometry/statistics & numerical data; Computational Biology; Computing Methodologies; Data Interpretation, Statistical; Likelihood Functions; Microbiota/genetics; Proteins/genetics; Proteins/isolation & purification; Proteome/genetics; Proteome/isolation & purification; Software
12.
Bioinformatics ; 27(21): 3072-3, 2011 Nov 01.
Article in English | MEDLINE | ID: mdl-21926122

ABSTRACT

SUMMARY: A MapReduce-based implementation called MR-MSPolygraph for parallelizing peptide identification from mass spectrometry data is presented. The underlying serial method, MSPolygraph, uses a novel hybrid approach to match an experimental spectrum against a combination of a protein sequence database and a spectral library. Our MapReduce implementation can run on any Hadoop cluster environment. Experimental results demonstrate that, relative to the serial version, MR-MSPolygraph reduces the time to solution from weeks to hours, for processing tens of thousands of experimental spectra. Speedup and other related performance studies are also reported on a 400-core Hadoop cluster using spectral datasets from environmental microbial communities as inputs. AVAILABILITY: The source code along with user documentation are available on http://compbio.eecs.wsu.edu/MR-MSPolygraph. CONTACT: ananth@eecs.wsu.edu; william.cannon@pnnl.gov. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Mass Spectrometry/methods; Peptides/chemistry; Software; Databases, Protein; Sequence Analysis, Protein
13.
BMC Genomics ; 12: 410, 2011 Aug 12.
Article in English | MEDLINE | ID: mdl-21838895

ABSTRACT

BACKGROUND: Virulence acquisition and loss is a dynamic adaptation of pathogens to thrive in changing milieus. We investigated the mechanisms of virulence loss at the whole genome level using Babesia bovis as a model apicomplexan in which genetically related attenuated parasites can be reliably derived from virulent parental strains in the natural host. We expected virulence loss to be accompanied by consistent changes at the gene level, and that such changes would be shared among attenuated parasites of diverse geographic and genetic background. RESULTS: Surprisingly, while single nucleotide polymorphisms in 14 genes distinguished all attenuated parasites from their virulent parental strains, all non-synonymous changes resulted in no deleterious amino acid modification that could consistently be associated with attenuation (or virulence) in this hemoparasite. Interestingly, however, attenuation significantly reduced the overall population's genome diversity with 81% of base pairs shared among attenuated strains, compared to only 60% of base pairs common among virulent parental parasites. There were significantly fewer genes that were unique to their geographical origins among the attenuated parasites, resulting in a simplified population structure among the attenuated strains. CONCLUSIONS: This simplified structure includes reduced diversity of the variant erythrocyte surface 1 (ves) multigene family repertoire among attenuated parasites when compared to virulent parental strains, possibly suggesting that overall variance in large protein families such as Variant Erythrocyte Surface Antigens has a critical role in expression of the virulence phenotype. In addition, the results suggest that virulence (or attenuation) mechanisms may not be shared among all populations of parasites at the gene level, but instead may reflect expansion or contraction of the population structure in response to shifting milieus.


Subjects
Babesia bovis/genetics; Babesia bovis/pathogenicity; Blood/parasitology; Genetic Variation/genetics; Genomics; Animals; Geography; Phenotype; Sequence Analysis; Species Specificity
14.
Nat Genet ; 42(10): 833-9, 2010 Oct.
Article in English | MEDLINE | ID: mdl-20802477

ABSTRACT

We report a high-quality draft genome sequence of the domesticated apple (Malus × domestica). We show that a relatively recent (>50 million years ago) genome-wide duplication (GWD) has resulted in the transition from nine ancestral chromosomes to 17 chromosomes in the Pyreae. Traces of older GWDs partly support the monophyly of the ancestral paleohexaploidy of eudicots. Phylogenetic reconstruction of Pyreae and the genus Malus, relative to major Rosaceae taxa, identified the progenitor of the cultivated apple as M. sieversii. Expansion of gene families reported to be involved in fruit development may explain formation of the pome, a Pyreae-specific false fruit that develops by proliferation of the basal part of the sepals, the receptacle. In apple, a subclade of MADS-box genes, normally involved in flower and fruit development, is expanded to include 15 members, as are other gene families involved in Rosaceae-specific metabolism, such as transport and assimilation of sorbitol.


Subjects
Gene Duplication; Genes, Plant/genetics; Genome, Plant; Malus/genetics; Flowers/genetics; Flowers/growth & development; Fruit/genetics; Fruit/growth & development; Genetic Linkage; Genome-Wide Association Study; Malus/growth & development; Phylogeny
15.
PLoS Genet ; 5(11): e1000728, 2009 Nov.
Article in English | MEDLINE | ID: mdl-19936048

ABSTRACT

Most of our understanding of plant genome structure and evolution has come from the careful annotation of small (e.g., 100 kb) sequenced genomic regions or from automated annotation of complete genome sequences. Here, we sequenced and carefully annotated a contiguous 22 Mb region of maize chromosome 4 using an improved pseudomolecule for annotation. The sequence segment was comprehensively ordered, oriented, and confirmed using the maize optical map. Nearly 84% of the sequence is composed of transposable elements (TEs) that are mostly nested within each other, of which most families are low-copy. We identified 544 gene models using multiple levels of evidence, as well as five miRNA genes. Gene fragments, many captured by TEs, are prevalent within this region. Elimination of gene redundancy from a tetraploid maize ancestor that originated a few million years ago is responsible in this region for most disruptions of synteny with sorghum and rice. Consistent with other sub-genomic analyses in maize, small RNA mapping showed that many small RNAs match TEs and that most TEs match small RNAs. These results, performed on approximately 1% of the maize genome, demonstrate the feasibility of refining the B73 RefGen_v1 genome assembly by incorporating optical map, high-resolution genetic map, and comparative genomic data sets. Such improvements, along with those of gene and repeat annotation, will serve to promote future functional genomic and phylogenomic research in maize and other grasses.


Subjects
Base Pairing/genetics; Genome, Plant/genetics; Zea mays/genetics; Base Sequence; Chromosomes, Plant/genetics; DNA Transposable Elements/genetics; Evolution, Molecular; Gene Duplication; Gene Rearrangement/genetics; Genes, Plant; Genetic Loci/genetics; Molecular Sequence Data; Mutation/genetics; Open Reading Frames/genetics; Oryza/genetics; Physical Chromosome Mapping; RNA, Plant/genetics; Sequence Homology, Nucleic Acid; Sorghum/genetics; Synteny/genetics