Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 15 de 15
Filtrar
1.
Bioinformatics ; 2024 May 24.
Artículo en Inglés | MEDLINE | ID: mdl-38788221

RESUMEN

MOTIVATION: In recent years, the focus of bioinformatics research has moved from individual sequences to collections of sequences. Given the fundamental role of the Burrows-Wheeler Transform (BWT) in string processing, a number of dedicated tools have been developed for computing the BWT of string collections. While the focus has been on improving efficiency, both in space and time, the exact definition of the BWT employed has not been at the center of attention. As we show in this paper, the different tools in use often compute non-equivalent BWT variants: the resulting transforms can differ from each other significantly, including the number r of runs, a central parameter of the BWT. Moreover, with many tools, the transform depends on the input order of the collection. In other words, on the same dataset, the same tool may output different transforms if the dataset is given in a different order. RESULTS: We studied 18 dedicated tools for computing the BWT of string collections and were able to identify 6 different BWT variants computed by these tools. We review the differences between these BWT variants, both from a theoretical and from a practical point of view, comparing them on 8 real-life biological datasets with different characteristics. We find that the differences can be extensive, depending on the datasets, and are largest on collections of many similar short sequences. The parameter r, the number of runs of the BWT, also shows notable variation between the different BWT variants; on our datasets, it varied by a multiplicative factor of up to 4.2. AVAILABILITY: Source code and scripts to replicate the results and download the data used in the article are available at https://github.com/davidecenzato/BWT-variants-for-string-collections. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

2.
Algorithms Mol Biol ; 19(1): 11, 2024 Mar 12.
Artículo en Inglés | MEDLINE | ID: mdl-38475889

RESUMEN

We introduce a new algorithm for constructing the generalized suffix array of a collection of highly similar strings. As a first step, we construct a compressed representation of the matching statistics of the collection with respect to a reference string. We then use this data structure to distribute suffixes into a partial order, and subsequently to speed up suffix comparisons to complete the generalized suffix array. Our experimental evidence with a prototype implementation (a tool we call sacamats) shows that on string collections with highly similar strings we can construct the suffix array in time competitive with or faster than the fastest available methods. Along the way, we describe a heuristic for fast computation of the matching statistics of two strings, which may be of independent interest.

3.
Proc Data Compress Conf ; 2022: 93-102, 2022 Mar.
Artículo en Inglés | MEDLINE | ID: mdl-38812828

RESUMEN

Generating pangenomic datasets is becoming increasingly common but there are still few tools able to handle them and even fewer accessible to non-specialists. Building compressed suffix trees (CSTs) for pangenomic datasets is still a major challenge but could be enormously beneficial to the community. In this paper, we present a method, which we refer to as RePFP-CST, for building CSTs in a manner that is scalable. To accomplish this, we show how to build a CST directly from VCF files without decompressing them, and to prune from the prefix-free parse (PFP) phrase boundaries whose removal reduces the total size of the dictionary and the parse. We show that these improvements reduce the time and space required for the construction of the CST, and the memory footprint of the finished CST, enabling us to build a CST for a terabyte of DNA for the first time in the literature.

4.
Theor Comput Sci ; 859: 134-148, 2021 Mar 06.
Artículo en Inglés | MEDLINE | ID: mdl-34163096

RESUMEN

Prefix normal words are binary words with the property that no factor has more 1s than the prefix of the same length. Finite prefix normal words were introduced in [Fici and Lipták, DLT 2011]. In this paper, we study infinite prefix normal words and explore their relationship to some known classes of infinite binary words. In particular, we establish a connection between prefix normal words and Sturmian words, between prefix normal words and abelian complexity, and between prefix normality and lexicographic order.

5.
Int Symp String Process Inf Retr ; 12944: 129-142, 2021 Oct.
Artículo en Inglés | MEDLINE | ID: mdl-38742019

RESUMEN

Given an input string, the Burrows-Wheeler Transform (BWT) can be seen as a reversible permutation of it that allows efficient compression and fast substring queries. Due to these properties, it has been widely applied in the analysis of genomic sequence data, enabling important tasks such as read alignment. Mantaci et al. [TCS2007] extended the notion of the BWT to a collection of strings by defining the extended Burrows-Wheeler Transform (eBWT). This definition requires no modification of the input collection, and has the property that the output is independent of the order of the strings in the collection. However, over the years, the term eBWT has been used more generally to describe any BWT of a collection of strings. The fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original eBWT, which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We also combine our new eBWT construction with a variation of prefix-free parsing (PFP) [WABI 2019] to allow for construction of the eBWT on large collections of genomic sequences. We implement this combined algorithm (pfpebwt) and evaluate it on a collection of human chromosomes 19 from the 1,000 Genomes Project, on a collection of Salmonella genomes from GenomeTrakr, and on a collection of SARS-CoV2 genomes from EBI's COVID-19 data portal. We demonstrate that pfpebwt is the fastest method for all collections, with a maximum speedup of 7.6x on the second best method. The peak memory is at most 2x larger than the second best method. Comparing with methods that are also, as our algorithm, able to report suffix array samples, we obtain a 57.1x improvement in peak memory. The source code is publicly available at https://github.com/davidecenzato/PFP-eBWT.

6.
Int Symp String Process Inf Retr ; 12944: 3-12, 2021 Oct.
Artículo en Inglés | MEDLINE | ID: mdl-38742018

RESUMEN

The extended Burrows Wheeler Transform (eBWT) was introduced by Mantaci et al. [TCS 2007] to extend the definition of the BWT to a collection of strings. In our prior work [SPIRE 2021], we give a linear-time algorithm for the eBWT that preserves the fundamental property of the original definition (i.e., the independence from the input order). The algorithm combines a modification of the Suffix Array Induced Sorting (SAIS) algorithm [IEEE Trans Comput 2011] with Prefix Free Parsing [AMB 2019; JCB 2020]. In this paper, we show how this construction algorithm leads to r-indexing the eBWT, i.e., run-length encoded eBWT and SA samples of Gagie et al. [SODA 2018] can be constructed efficiently from the components of the PFP. Moreover, we show that finding maximal exact matches (MEMs) between a query string and the r-index of the eBWT can be efficiently supported.

7.
J Neuroimaging ; 22(2): 129-36, 2012 Apr.
Artículo en Inglés | MEDLINE | ID: mdl-21447022

RESUMEN

BACKGROUND AND PURPOSE: The pathological differences underlying the clinical disease phases in multiple sclerosis (MS) are poorly characterized. We sought to explore the relationship between the distribution of white matter (WM) lesions in relapsing-remitting (RR) and secondary progressive (SP) MS and the normal regional variability of cerebral perfusion. METHODS: WM lesions were identified and quantified on a single magnetic resonance imaging scan from 1,249 patients with MS. The spatial distribution of lesions was compared between early RR, late RR, and SP MS in the context of normal cerebral perfusion patterns provided by a single-photon emission-computed tomography atlas of healthy individuals. RESULTS: Patients with SP MS had more distinct and larger lesions than patients with RR MS. Across all subjects, lesions were present in regions of relatively lower normal perfusion than normal appearing WM. Further, lesions in SP MS were more common in areas of lower perfusion as compared to the lesion distribution in early and late RR MS. CONCLUSION: Chronic plaques were more prevalent in WM regions with lower relative perfusion. Lesions in more highly perfused regions were more commonly observed in early RR MS and therefore, may be more likely to successfully remyelinate and resolve.


Asunto(s)
Encéfalo/patología , Esclerosis Múltiple Crónica Progresiva/patología , Esclerosis Múltiple Recurrente-Remitente/patología , Fibras Nerviosas Mielínicas/patología , Adulto , Encéfalo/diagnóstico por imagen , Femenino , Humanos , Imagen por Resonancia Magnética , Masculino , Persona de Mediana Edad , Esclerosis Múltiple Crónica Progresiva/diagnóstico por imagen , Esclerosis Múltiple Recurrente-Remitente/diagnóstico por imagen , Fibras Nerviosas Mielínicas/diagnóstico por imagen , Cintigrafía
8.
Bioinformatics ; 27(24): 3348-55, 2011 Dec 15.
Artículo en Inglés | MEDLINE | ID: mdl-21984769

RESUMEN

MOTIVATION: Second-generation sequencing technology has reinvigorated research using expression data, and clustering such data remains a significant challenge, with much larger datasets and with different error profiles. Algorithms that rely on all-versus-all comparison of sequences are not practical for large datasets. RESULTS: We introduce a new filter for string similarity which has the potential to eliminate the need for all-versus-all comparison in clustering of expression data and other similar tasks. Our filter is based on multiple long exact matches between the two strings, with the additional constraint that these matches must be sufficiently far apart. We give details of its efficient implementation using modified suffix arrays. We demonstrate its efficiency by presenting our new expression clustering tool, wcd-express, which uses this heuristic. We compare it to other current tools and show that it is very competitive both with respect to quality and run time. AVAILABILITY: Source code and binaries available under GPL at http://code.google.com/p/wcdest. Runs on Linux and MacOS X. CONTACT: scott.hazelhurst@wits.ac.za; zsuzsa@cebitec.uni-bielefeld.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Algoritmos , Inteligencia Artificial , Análisis por Conglomerados , Perfilación de la Expresión Génica , Biología Computacional/economía , Biología Computacional/métodos , Humanos , Programas Informáticos
9.
J Neurol Sci ; 284(1-2): 116-9, 2009 Sep 15.
Artículo en Inglés | MEDLINE | ID: mdl-19428028

RESUMEN

OBJECTIVE: To determine the rate of treatment failure in patients outside of a controlled treatment trial and to ascertain the factors physicians used to make this decision. METHODS: One hundred and thirty four patients with the diagnosis of relapsing-remitting (RR) multiple sclerosis (MS) or clinically isolated symptom (CIS) enrolled in the CLIMB study (Comprehensive Longitudinal Investigation of Multiple Sclerosis at the Brigham and Women's Hospital) were treated with either interferon beta or glatiramer acetate as their initial treatment for MS. RESULTS: The probability of failing initial treatment within 3 years was 30%. Clinical activity, defined as relapses and/or progression in disability, determined treatment failure in 35.7% (n=10) of nonresponders. New T2 hyperintense or gadolinium-enhancing lesions on MRI was used to define treatment failure in 28.6% (n=8) and new MRI lesions were used in combination with clinical activity in 35.7% (n=10). Treatment failures had a higher T2 hyperintense lesion volume (p=0.015) and number of gadolinium-enhancing lesions (p=0.0001) on the enrollment MRI than responders. CONCLUSIONS: These observations demonstrate that treating physicians use both clinical and MRI parameters to define a response to treatment and initiation of a treatment change and that baseline MRI identified those with increased risk of treatment failure.


Asunto(s)
Factores Inmunológicos/uso terapéutico , Interferón beta/uso terapéutico , Esclerosis Múltiple Recurrente-Remitente/tratamiento farmacológico , Péptidos/uso terapéutico , Adulto , Medios de Contraste , Progresión de la Enfermedad , Femenino , Estudios de Seguimiento , Gadolinio , Acetato de Glatiramer , Humanos , Incidencia , Interferón beta-1a , Interferon beta-1b , Imagen por Resonancia Magnética , Masculino , Persona de Mediana Edad , Esclerosis Múltiple Recurrente-Remitente/patología , Estudios Prospectivos , Factores de Riesgo , Insuficiencia del Tratamiento , Adulto Joven
10.
Arch Neurol ; 66(2): 234-7, 2009 Feb.
Artículo en Inglés | MEDLINE | ID: mdl-19204160

RESUMEN

BACKGROUND: Benign multiple sclerosis (MS) is defined by minimal or no disability after many years of observation, therefore a less degenerative disease process is suspected to be present in this subset of patients. OBJECTIVE: To compare brain atrophy rates in patients with long-standing benign MS vs typical early MS. DESIGN: A longitudinal prospective cohort study and a retrospective database review. SETTING: An academic MS center. PATIENTS: Thirty-nine patients with clinically defined benign MS and an age-matched group of 40 patients with early relapsing-remitting MS. MAIN OUTCOME MEASURES: Baseline demographic, treatment, brain magnetic resonance imaging measures, and annualized atrophy rates, derived from serial brain parenchymal fraction measurements across 2 years, were compared. RESULTS: In the baseline analysis, patients with benign MS were matched to the early MS group on age, sex, treatment with immunomodulatory therapy, T2 lesion volume, and brain parenchymal fraction. The mean (SD) annualized brain atrophy rate in patients with benign MS (-0.16% [0.51%]) was lower than that in patients with early MS (-0.46% [0.72%]) (P = .02). The difference remained significant after controlling for age, sex, and treatment (P = .04). CONCLUSIONS: Serial magnetic resonance imaging revealed a low 2-year rate of brain atrophy in patients with clinically benign MS, suggesting a less prominent degenerative component in its pathogenesis than in patients with typical early MS. Identification of patients with a low rate of brain atrophy may indicate a benign course.


Asunto(s)
Atrofia/patología , Encéfalo/patología , Esclerosis Múltiple/patología , Adulto , Distribución por Edad , Edad de Inicio , Anciano , Atrofia/etiología , Atrofia/fisiopatología , Encéfalo/fisiopatología , Estudios de Cohortes , Progresión de la Enfermedad , Femenino , Humanos , Estudios Longitudinales , Imagen por Resonancia Magnética , Masculino , Persona de Mediana Edad , Esclerosis Múltiple/clasificación , Esclerosis Múltiple/fisiopatología , Valor Predictivo de las Pruebas , Pronóstico , Estudios Prospectivos , Estudios Retrospectivos , Índice de Severidad de la Enfermedad
11.
Bioinformatics ; 25(2): 218-24, 2009 Jan 15.
Artículo en Inglés | MEDLINE | ID: mdl-19015140

RESUMEN

MOTIVATION: High-resolution mass spectrometry (MS) is among the most widely used technologies in metabolomics. Metabolites participate in almost all cellular processes, but most metabolites still remain uncharacterized. Determination of the sum formula is a crucial step in the identification of an unknown metabolite, as it reduces its possible structures to a hopefully manageable set. RESULTS: We present a method for determining the sum formula of a metabolite solely from its mass and the natural distribution of its isotopes. Our input is a measured isotope pattern from a high resolution mass spectrometer, and we want to find those molecules that best match this pattern. Our method is computationally efficient, and results on experimental data are very promising: for orthogonal time-of-flight mass spectrometry, we correctly identify sum formulas for >90% of the molecules, ranging in mass up to 1000 Da.


Asunto(s)
Algoritmos , Espectrometría de Masas/métodos , Metabolómica , Programas Informáticos , Isótopos/química , Peso Molecular
12.
Arch Neurol ; 65(11): 1449-53, 2008 Nov.
Artículo en Inglés | MEDLINE | ID: mdl-19001162

RESUMEN

BACKGROUND: Individual magnetic resonance imaging (MRI) disease severity measures, such as atrophy or lesions, show weak relationships to clinical status in patients with multiple sclerosis (MS). OBJECTIVE: To combine MS-MRI measures of disease severity into a composite score. DESIGN: Retrospective analysis of prospectively collected data. SETTING: Community-based and referral subspecialty clinic in an academic hospital. PATIENTS: A total of 103 patients with MS, with a mean (SD) Expanded Disability Status Scale (EDSS) score of 3.3 (2.2), of whom 62 (60.2%) had the relapsing-remitting, 33 (32.0%) the secondary progressive, and 8 (7.8%) the primary progressive form. MAIN OUTCOME MEASURES: Brain MRI measures included baseline T2 hyperintense (T2LV) and T1 hypointense (T1LV) lesion volume and brain parenchymal fraction (BPF), a marker of global atrophy. The ratio of T1LV to T2LV (T1:T2) assessed lesion severity. A Magnetic Resonance Disease Severity Scale (MRDSS) score, on a continuous scale from 0 to 10, was derived for each patient using T2LV, BPF, and T1:T2. RESULTS: The MRDSS score averaged 5.1 (SD, 2.6). Baseline MRI and EDSS correlations were moderate for BPF, T1:T2, and MRDSS and weak for T2LV. The MRDSS showed a larger effect size than the individual MRI components in distinguishing patients with the relapsing-remitting form from those with the secondary progressive form. Models containing either T2LV or MRDSS were significantly associated with disability progression during the mean (SD) 3.2 (0.3)-year observation period, when adjusting for baseline EDSS score. CONCLUSION: Combining brain MRI lesion and atrophy measures can predict MS clinical progression and provides the basis for developing an MRI-based continuous scale as a marker of MS disease severity.


Asunto(s)
Imagen por Resonancia Magnética , Esclerosis Múltiple Crónica Progresiva/diagnóstico , Esclerosis Múltiple Crónica Progresiva/fisiopatología , Esclerosis Múltiple Recurrente-Remitente/diagnóstico , Esclerosis Múltiple Recurrente-Remitente/fisiopatología , Índice de Severidad de la Enfermedad , Adulto , Atrofia , Encéfalo/patología , Progresión de la Enfermedad , Femenino , Humanos , Masculino , Persona de Mediana Edad , Valor Predictivo de las Pruebas , Estudios Prospectivos , Estudios Retrospectivos , Adulto Joven
13.
Bioinformatics ; 24(13): 1542-6, 2008 Jul 01.
Artículo en Inglés | MEDLINE | ID: mdl-18480101

RESUMEN

UNLABELLED: The wcd system is an open source tool for clustering expressed sequence tags (EST) and other DNA and RNA sequences. wcd allows efficient all-versus-all comparison of ESTs using either the d(2) distance function or edit distance, improving existing implementations of d(2). It supports merging, refinement and reclustering of clusters. It is 'drop in' compatible with the StackPack clustering package. wcd supports parallelization under both shared memory and cluster architectures. It is distributed with an EMBOSS wrapper allowing wcd to be installed as part of an EMBOSS installation (and so provided by a web server). AVAILABILITY: wcd is distributed under a GPL licence and is available from http://code.google.com/p/wcdest. SUPPLEMENTARY INFORMATION: Additional experimental results. The wcd manual, a companion paper describing underlying algorithms, and all datasets used for experimentation can also be found at www.bioinf.wits.ac.za/~scott/wcdsupp.html.


Asunto(s)
Algoritmos , Análisis por Conglomerados , Etiquetas de Secuencia Expresada , Internet , Reconocimiento de Normas Patrones Automatizadas/métodos , Alineación de Secuencia/métodos , Análisis de Secuencia de ADN/métodos , Programas Informáticos , Secuencia de Bases , Datos de Secuencia Molecular
14.
Bioinformatics ; 24(4): 591-3, 2008 Feb 15.
Artículo en Inglés | MEDLINE | ID: mdl-18174179

RESUMEN

UNLABELLED: We introduce Decomp, a tool that computes the sum formula of all molecules whose mass equals the input mass. This problem arises frequently in biochemistry and mass spectrometry (MS), when we know the molecular mass of a protein, DNA or metabolite fragment but have no other information. A closely related problem is known as the Money Changing Problem (MCP), where all masses are positive integers. Recently, efficient algorithms have been developed for the MCP, in which Decomp applies to real-valued MS data. The excellent performance of this method on proteomic and metabolomic MS data has recently been demonstrated. Decomp has an easy-to-use graphical interface, which caters for both types of users: those interested in solving MCP instances and those submitting MS data. AVAILABILITY: Decomp is freely accessible at http://bibiserv.techfak.uni-bielefeld.de/decomp/.


Asunto(s)
Biología Computacional/métodos , Espectrometría de Masas/métodos , Programas Informáticos
15.
J Proteome Res ; 4(5): 1768-74, 2005.
Artículo en Inglés | MEDLINE | ID: mdl-16212431

RESUMEN

We present AUDENS, a new platform-independent open source tool for automated de novo sequencing of peptides from MS/MS data. We implemented a dynamic programming algorithm and combined it with a flexible preprocessing module which is designed to distinguish between signal and other peaks. By applying a user-defined set of heuristics, AUDENS screens through the spectrum and assigns high relevance values to putative signal peaks. The algorithm constructs a sequence path through the MS/MS spectrum using the peak relevances to score each suggested sequence path, i.e., the corresponding amino acid sequence. At present, we consider AUDENS a prototype that unfolds its biggest potential if used in parallel with other de novo sequencing tools. AUDENS is available open source and can be downloaded with further documentation at http://www.ti.inf.ethz.ch/pw/software/audens/ .


Asunto(s)
Análisis de Secuencia de Proteína/instrumentación , Análisis de Secuencia de Proteína/métodos , Programas Informáticos , Algoritmos , Secuencia de Aminoácidos , Arabidopsis/metabolismo , Proteínas del Sistema Complemento , Internet , Iones , Espectrometría de Masas , Datos de Secuencia Molecular , Péptidos/química , Proteínas de Plantas/química , Sensibilidad y Especificidad
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA