1.
J Comput Biol ; 4(3): 297-309, 1997.
Article in English | MEDLINE | ID: mdl-9278061

ABSTRACT

Recently, Gelfand, Mironov and Pevzner (1996) proposed a spliced alignment approach to gene recognition that provides 99% accurate recognition of human genes if a related mammalian protein is available. However, even 99% accurate gene predictions are insufficient for automated sequence annotation in large-scale sequencing projects and therefore have to be complemented by experimental gene verification. One hundred percent accurate gene predictions would lead to a substantial reduction of experimental work on gene identification. Our goal is to develop an algorithm that either predicts an exon assembly with accuracy sufficient for sequence annotation or warns a biologist that the accuracy of a prediction is insufficient and further experimental work is required. We study suboptimal and error-tolerant spliced alignment problems as the first steps towards such an algorithm, and report an algorithm which provides 100% accurate recognition of human genes in 37% of cases (if a related mammalian protein is available). In 52% of genes, the algorithm predicts at least one exon with 100% accuracy.
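The exon-chaining step at the heart of the spliced alignment idea can be illustrated with a small dynamic program. This is only a sketch of the chaining step, not the authors' algorithm: it assumes each candidate exon already carries a similarity score against the related protein, and simply selects a best non-overlapping chain (weighted interval scheduling):

```python
def best_exon_chain(blocks):
    """Toy stand-in for exon assembly: among candidate exons given as
    (start, end, score) triples, pick a non-overlapping chain of maximum
    total score via weighted-interval-scheduling dynamic programming."""
    blocks = sorted(blocks, key=lambda b: b[1])   # order by end position
    best = []  # best[i] = (score, chain) using only blocks[:i+1]
    for i, (s, e, sc) in enumerate(blocks):
        # best compatible predecessor: ends at or before this block starts
        prev_score, prev_chain = 0, []
        for j in range(i):
            if blocks[j][1] <= s and best[j][0] > prev_score:
                prev_score, prev_chain = best[j]
        take = (prev_score + sc, prev_chain + [(s, e, sc)])
        skip = best[i - 1] if i else (0, [])
        best.append(max(take, skip, key=lambda x: x[0]))
    return best[-1]
```

With candidate blocks `[(0, 10, 5), (5, 15, 7), (12, 20, 4)]`, the chain `(0, 10, 5), (12, 20, 4)` wins with total score 9.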


Subject(s)
Algorithms , Genes , Nucleic Acid Conformation , RNA Splicing , Amino Acid Sequence , Animals , Binding Sites , Humans , Sequence Alignment/methods
2.
J Comput Biol ; 7(6): 777-87, 2000.
Article in English | MEDLINE | ID: mdl-11382361

ABSTRACT

Database search in tandem mass spectrometry is a powerful tool for protein identification. High-throughput spectral acquisition raises the problem of dealing with genetic variation and peptide modifications within a population of related proteins. A method that cross-correlates and clusters related spectra in large collections of uncharacterized spectra (i.e., from normal and diseased individuals) would be very valuable in functional proteomics. This problem is far from being simple since very similar peptides may have very different spectra. We introduce a new notion of spectral similarity that allows one to identify related spectra even if the corresponding peptides have multiple modifications/mutations. Based on this notion, we developed a new algorithm for mutation-tolerant database search as well as a method for cross-correlating related uncharacterized spectra.
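The idea of matching spectra under mass shifts can be sketched with a spectral convolution: histogram all pairwise peak differences, so that a single modification of mass d shows up as a large count at shift d in addition to the count at shift 0. This is a simplified illustration with toy integer masses, not the similarity measure defined in the paper:

```python
from collections import Counter

def spectral_convolution(spec_a, spec_b):
    """Histogram of pairwise mass differences between two peak lists.
    A related peptide carrying one modification of mass d produces large
    counts at shifts 0 and d."""
    return Counter(round(b - a) for a in spec_a for b in spec_b)

def shifted_similarity(spec_a, spec_b):
    """Crude two-shift similarity: peaks matched at shift 0 plus peaks
    matched at the strongest nonzero shift."""
    conv = spectral_convolution(spec_a, spec_b)
    nonzero = [c for d, c in conv.items() if d != 0]
    return conv.get(0, 0) + (max(nonzero) if nonzero else 0)
```

For example, the toy spectra `[100, 200, 300, 400]` and `[100, 200, 316, 416]` (a +16 modification affecting the later peaks) match at shift 0 twice and at shift 16 twice.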


Subject(s)
Algorithms , Image Processing, Computer-Assisted , Mass Spectrometry/methods , Mutation , Proteins/genetics , Databases, Factual , Proteins/chemistry , Software
3.
J Comput Biol ; 6(3-4): 327-42, 1999.
Article in English | MEDLINE | ID: mdl-10582570

ABSTRACT

Peptide sequencing via tandem mass spectrometry (MS/MS) is one of the most powerful tools in proteomics for identifying proteins. Because complete genome sequences are accumulating rapidly, the recent trend in interpretation of MS/MS spectra has been database search. However, de novo MS/MS spectral interpretation remains an open problem typically involving manual interpretation by expert mass spectrometrists. We have developed a new algorithm, SHERENGA, for de novo interpretation that automatically learns fragment ion types and intensity thresholds from a collection of test spectra generated from any type of mass spectrometer. The test data are used to construct optimal path scoring in the graph representations of MS/MS spectra. A ranked list of high scoring paths corresponds to potential peptide sequences. SHERENGA is most useful for interpreting sequences of peptides resulting from unknown proteins and for validating the results of database search algorithms in fully automated, high-throughput peptide sequencing.
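The graph representation mentioned above can be sketched in miniature: treat candidate prefix masses as vertices, connect two masses whose difference equals an amino acid residue mass, and take a highest-scoring (here simply: longest) path. Integer masses and a five-residue alphabet are simplifying assumptions; SHERENGA additionally learns ion types and intensity-based scores from test spectra:

```python
# Toy spectrum-graph sketch (integer masses, five residues only).
RESIDUES = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99}

def best_path(prefix_masses):
    """Longest chain 0 = m0 < m1 < ... of candidate prefix masses where
    each gap equals a residue mass; spells the corresponding peptide by
    dynamic programming over the resulting DAG."""
    masses = sorted(set(prefix_masses) | {0})
    inverse = {m: r for r, m in RESIDUES.items()}
    best = {m: "" for m in masses}      # best peptide ending at mass m
    for i, m in enumerate(masses):
        for prev in masses[:i]:
            residue = inverse.get(m - prev)
            if residue and len(best[prev]) + 1 > len(best[m]):
                best[m] = best[prev] + residue
    return max(best.values(), key=len)
```

The masses 57, 128, 215 (prefixes of the peptide GAS) are spelled correctly even when a noise peak such as 100 is mixed in, since the noise mass joins no residue-labeled path.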


Subject(s)
Algorithms , Mass Spectrometry/methods , Peptides/chemistry , Sequence Analysis/methods , Amino Acid Sequence , Databases, Factual , Evaluation Studies as Topic , Mass Spectrometry/statistics & numerical data , Sequence Analysis/statistics & numerical data
4.
J Biomol Struct Dyn ; 7(1): 63-73, 1989 Aug.
Article in English | MEDLINE | ID: mdl-2684223

ABSTRACT

A new method of DNA reading was proposed at the end of 1988 by Lysov et al. According to its authors, it offers certain advantages over the Maxam-Gilbert and Sanger methods, namely the automation and speed of DNA sequencing. Nevertheless, its use is hampered by a number of biological and mathematical problems. The present study proposes an algorithm that overcomes the computational difficulties arising in this method during reconstruction of a DNA sequence from its l-tuple composition. It is also shown that the biochemical problems connected with the loss of information about the l-tuple composition during hybridization are not crucial and can be overcome by finding a maximal flow of minimal cost in a special graph.
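Reconstruction from an l-tuple composition is naturally phrased as an Eulerian path problem: each l-tuple becomes an edge from its (l-1)-prefix to its (l-1)-suffix. A minimal Python sketch of this reduction (Hierholzer's algorithm; the paper's max-flow/min-cost machinery for handling information loss is not reproduced here):

```python
from collections import defaultdict

def reconstruct_from_spectrum(spectrum):
    """Reconstruct one DNA sequence whose l-tuple composition matches
    `spectrum` by walking an Eulerian path: each l-tuple is an edge from
    its (l-1)-prefix to its (l-1)-suffix (Hierholzer's algorithm)."""
    graph = defaultdict(list)
    indegree = defaultdict(int)
    for t in spectrum:
        graph[t[:-1]].append(t[1:])
        indegree[t[1:]] += 1
    # start at a vertex with out-degree > in-degree, else anywhere
    start = next((v for v in graph if len(graph[v]) > indegree[v]),
                 next(iter(graph)))
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(v[-1] for v in path[1:])
```

For the 3-tuple spectrum of ATGGCGT (ATG, TGG, GGC, GCG, CGT) the walk recovers the original sequence.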


Subject(s)
Base Sequence , DNA , Algorithms , Molecular Structure , Spectrum Analysis
5.
J Biomol Struct Dyn ; 6(5): 1013-26, 1989 Apr.
Article in English | MEDLINE | ID: mdl-2531596

ABSTRACT

Mathematical models of the generation of genetic texts appeared simultaneously with the first sequenced DNA. They are used to establish functional and evolutionary relations between genetic texts, to predict the number and distribution of specific sites in a sequence, and to identify "meaningful" words. The present paper deals with two problems: 1) The significance of deviations from the mean statistical characteristics of a genetic text. Anyone who has carried out statistical analysis of sequenced DNA is familiar with the question: which deviations from the expected frequencies of particular words testify to the "biological" significance of those words? We propose a formula for the variance of the number of occurrences of a word in a text, with allowance for word overlaps, making it possible to assess the significance of deviations from the expected statistical characteristics. 2) A new method for predicting the frequencies of occurrence of particular words in a genetic text using the statistical characteristics of "spaced" L-grams. The method can be used for predicting the number of restriction sites in human DNA and in planning experiments on the physical mapping and sequencing of the human genome.
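The role of word overlaps in the variance can be seen in a small Monte Carlo experiment: in an i.i.d. random text the self-overlapping word "AA" has the same expected count as the non-self-overlapping "AT", but a visibly larger count variance, which is exactly what an overlap-aware variance formula must capture. A simulation-only sketch (not the paper's closed-form formula):

```python
import random

def count_occurrences(text, word):
    """Number of (possibly overlapping) occurrences of `word` in `text`."""
    return sum(text[i:i+len(word)] == word
               for i in range(len(text) - len(word) + 1))

def count_variance(word, n=600, trials=1000, seed=0):
    """Monte Carlo estimate of the variance of the occurrence count of
    `word` in a random i.i.d. A/C/G/T text of length n."""
    rng = random.Random(seed)
    counts = [count_occurrences("".join(rng.choice("ACGT") for _ in range(n)),
                                word)
              for _ in range(trials)]
    mean = sum(counts) / trials
    return sum((c - mean) ** 2 for c in counts) / trials
```

Both words occur (n-1)/16 times on average, yet the clumping of "AA" inflates its variance well above that of "AT".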


Subject(s)
Base Sequence , Nucleotides , Bacteriophage lambda/genetics , Escherichia coli/genetics , Models, Theoretical , Nucleotides/analysis , T-Phages/genetics
6.
J Biomol Struct Dyn ; 6(5): 1027-38, 1989 Apr.
Article in English | MEDLINE | ID: mdl-2531597

ABSTRACT

Words are irregularly distributed in genetic texts. The analysis of this irregularity leads to the notion of stationary and non-stationary words. The polyW and polyS tracts are shown to be the most non-stationary words in genetic texts (here W = [A,T] and S = [G,C]; a polyW tract is a run of A/T nucleotides and a polyS tract is a run of G/C nucleotides). The distribution of stationary words suggests a method for partitioning DNA into zones. The zones obtained in the case of the phage are interpreted in the light of the Dowe hypothesis of the modular structure of bacteriophage genomes.


Subject(s)
Base Sequence , DNA , Nucleotides , Adenoviridae/genetics , Bacteriophage lambda/genetics , DNA/ultrastructure , Escherichia coli/genetics , Genes , T-Phages/genetics
7.
J Biomol Struct Dyn ; 9(2): 399-410, 1991 Oct.
Article in English | MEDLINE | ID: mdl-1741970

ABSTRACT

The SHOM method (Sequencing by Hybridization with Oligonucleotide Matrix) developed in 1988 is a new approach to nucleic acid sequencing by hybridization to an oligonucleotide matrix composed of an array of immobilized oligonucleotides. The original matrix proposed for sequencing by SHOM had to contain at least 65,536 octanucleotides. The present work describes a new family of matrices, which allows one to reduce the number of synthesized oligonucleotides 5-15 times without essentially decreasing the resolving power of the method.


Subject(s)
Base Sequence , Nucleic Acid Hybridization , Oligonucleotides , Algorithms , Computers , DNA , Genetic Techniques , Molecular Sequence Data
8.
Mol Biol (Mosk) ; 21(3): 788-96, 1987.
Article in Russian | MEDLINE | ID: mdl-3657777

ABSTRACT

A method for the physical mapping of DNA molecules based on discrete optimization and graph-theory algorithms is proposed. The input consists of the sizes of fragments from single and double restriction digests, together with the level of their measurement errors. The method makes it possible to plan experiments optimally and to construct physical maps step by step. The efficiency of the method and examples of its application are discussed.


Subject(s)
DNA , Nucleotide Mapping/methods , Software
9.
Mol Biol (Mosk) ; 23(4): 1075-9, 1989.
Article in Russian | MEDLINE | ID: mdl-2555681

ABSTRACT

The problem of finding a universal linker of minimal length containing all restriction endonuclease recognition sites is considered. We reduce this problem to the search for Eulerian and Hamiltonian paths in a graph. Discrete optimization methods allow one to construct a linker whose length is close to the minimum.
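The linker construction is essentially a shortest-common-superstring problem over the recognition sites. A greedy merge by largest overlap gives a simple approximation; the discrete-optimization approach via Eulerian and Hamiltonian paths described above can do better, so treat this only as an illustration:

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(sites):
    """Greedy approximation to the shortest string containing every site:
    drop sites contained in other sites, then repeatedly merge the pair
    with the largest suffix/prefix overlap."""
    strings = [s for s in sites if not any(s != t and s in t for t in sites)]
    while len(strings) > 1:
        k, a, b = max(((overlap(a, b), a, b)
                       for a in strings for b in strings if a is not b),
                      key=lambda x: x[0])
        strings.remove(a)
        strings.remove(b)
        strings.append(a + b[k:])    # merge, sharing the k overlapping bases
    return strings[0]
```

With the (real) EcoRI site GAATTC, BamHI site GGATCC and the illustrative site TTCGAA, the merged linker is shorter than the 18-base concatenation because GAATTC and TTCGAA share a 3-base overlap.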


Subject(s)
Models, Theoretical , Oligonucleotides , Base Sequence , DNA Restriction Enzymes , Molecular Sequence Data , Substrate Specificity
10.
Mol Biol (Mosk) ; 25(2): 552-62, 1991.
Article in Russian | MEDLINE | ID: mdl-1881398

ABSTRACT

The SHOM method (Sequencing by Hybridization with Oligonucleotide Matrix) developed in 1988 is a new approach to nucleic acid sequencing by hybridization to an oligonucleotide matrix composed of an array of immobilized oligonucleotides. The original matrix proposed for sequencing by SHOM had to contain at least 65,536 octanucleotides. The present work describes a new family of matrices for sequencing, which allows one to reduce the number of synthesized oligonucleotides 5-15 times without essentially decreasing the resolving power of the method.


Subject(s)
DNA/genetics , Nucleic Acid Hybridization , Base Sequence , Molecular Sequence Data
12.
Proc Natl Acad Sci U S A ; 103(52): 19824-9, 2006 Dec 26.
Article in English | MEDLINE | ID: mdl-17189424

ABSTRACT

We propose an approach for identifying microinversions across different species and show that microinversions provide a source of low-homoplasy evolutionary characters. These characters may be used as "certificates" to verify different branches in a phylogenetic tree, turning the challenging problem of phylogeny reconstruction into a relatively simple algorithmic problem. We estimate that there exist hundreds of thousands of microinversions in genomes of mammals from comparative sequencing projects, an untapped source of new phylogenetic characters.


Subject(s)
Biological Evolution , Animals , Computational Biology , Genetic Vectors/genetics , Humans , Mammals
13.
J Proteome Res ; 5(10): 2554-66, 2006 Oct.
Article in English | MEDLINE | ID: mdl-17022627

ABSTRACT

We have employed recently developed blind modification search techniques to generate the most comprehensive map of post-translational modifications (PTMs) in human lens constructed to date. Three aged lenses, two of which had moderate cataract, and one young control lens were analyzed using multidimensional liquid chromatography mass spectrometry. In total, 491 modification sites in lens proteins were identified. There were 155 in vivo PTM sites in crystallins: 77 previously reported sites and 78 newly detected PTM sites. Several of these sites had modifications previously undetected by mass spectrometry in lens including carboxymethyl lysine (+58 Da), carboxyethyl lysine (+72 Da), and an arginine modification of +55 Da with yet unknown chemical structure. These new modifications were observed in all three aged lenses but were not found in the young lens. Several new sites of cysteine methylation were identified indicating this modification is more extensive in lens than previously thought. The results were used to estimate the extent of modification at specific sites by spectral counting. We tested the long-standing hypothesis that PTMs contribute to age-related loss of crystallin solubility by comparing spectral counts between the water-soluble and water-insoluble fractions of the aged lenses and found that the extent of deamidation was significantly increased in the water-insoluble fractions. On the basis of spectral counting, the most abundant PTMs in aged lenses were deamidations and methylated cysteines with other PTMs present at lower levels.


Subject(s)
Amides/analysis , Crystallins/analysis , Lens, Crystalline/chemistry , Protein Processing, Post-Translational , Age Factors , Aged , Aged, 80 and over , Amino Acid Sequence , Cysteine/analysis , Humans , Infant, Newborn , Male , Methylation , Molecular Sequence Data , Peptides/analysis , Solubility
14.
Comput Chem ; 18(3): 221-3, 1994 Sep.
Article in English | MEDLINE | ID: mdl-7952892

ABSTRACT

Despite recent advances in sequencing by hybridization (SBH), large-scale DNA sequencing still relies on the random shotgun method. Even if one manages to routinely sequence short DNA fragments by SBH, these fragments have to be assembled into the final genomic sequence. Recently, various additional biochemical experiments have been suggested that could drastically increase the resolving power of SBH. However, biologists frequently cannot estimate the computational limitations of the proposed additional experiments, and no computational studies of additional experiments for SBH have been reported yet. This paper discusses a combinatorial technique that might help a biologist to analyze different additional biochemical experiments and to combine their data with SBH data to increase the resolving power of SBH.


Subject(s)
DNA/genetics , Sequence Analysis, DNA/methods , Base Sequence , Computers , Gene Rearrangement , Mathematics , Molecular Sequence Data , Nucleic Acid Hybridization , Sequence Analysis, DNA/statistics & numerical data
15.
Comput Appl Biosci ; 8(2): 121-7, 1992 Apr.
Article in English | MEDLINE | ID: mdl-1591607

ABSTRACT

When searching for local similarities in long sequences, the need for a 'rapid' similarity search becomes acute. The quadratic complexity of dynamic programming algorithms forces the use of filtration methods that eliminate sequences with a low similarity level. This paper is devoted to the theoretical substantiation of a filtration method based on the statistical distance between texts. The notion of filtration efficiency is introduced and the efficiency of several filters is estimated. It is shown that the efficiency of statistical l-tuple filtration in DNA database search is associated with a potential extension of the original four-letter alphabet and grows exponentially with increasing l. A formula that allows one to estimate the filtration parameters is presented.
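The filtration idea can be sketched as a cheap l-tuple profile comparison used as a pre-filter before dynamic programming. The L1 distance between frequency vectors below is one simple choice of statistical distance; the paper analyzes the efficiency of such filters rather than prescribing this exact form:

```python
from collections import Counter

def ltuple_profile(seq, l=3):
    """Counts of all overlapping l-tuples in `seq`."""
    return Counter(seq[i:i+l] for i in range(len(seq) - l + 1))

def statistical_distance(a, b, l=3):
    """L1 distance between normalized l-tuple frequency vectors: a cheap
    pre-filter, so that only sequence pairs below some threshold are
    passed on to quadratic dynamic programming alignment."""
    pa, pb = ltuple_profile(a, l), ltuple_profile(b, l)
    na, nb = sum(pa.values()), sum(pb.values())
    return sum(abs(pa[k] / na - pb[k] / nb) for k in set(pa) | set(pb))
```

Identical sequences are at distance 0, sequences sharing no l-tuples at the maximal distance 2, and a single substitution perturbs only the few l-tuples that cover it.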


Subject(s)
Sequence Alignment/statistics & numerical data , Amino Acid Sequence , Base Sequence , DNA/genetics , Models, Statistical , Proteins/genetics , Sequence Alignment/methods , Software
16.
Comput Appl Biosci ; 7(1): 39-49, 1991 Jan.
Article in English | MEDLINE | ID: mdl-2004273

ABSTRACT

According to the hypothesis of the modular structure of DNA, genomes consist of modules of various nature which may differ in statistical characteristics. Statistical analysis helps to reveal these differences and to predict the modular structure. This raises the question of the contribution of each word of length l (an l-tuple) to the inhomogeneity of a genetic text. The notion of stationary (i.e. relatively evenly distributed over a genome) versus non-stationary l-tuples has been introduced previously. In this paper, the dinucleotide distributions for all long sequences from GenBank were analyzed, and it was shown that non-stationary dinucleotides are closely associated with polyW and polyS tracts (W denotes the 'weak' nucleotides A or T, while S stands for the 'strong' nucleotides G or C). Thus, genome inhomogeneity is shown to be determined mainly by the AA, TT, GG, CC, AT, TA, GC and CG dinucleotides. It is demonstrated that neither 'codon usage' nor the 'isochore model' can account for this phenomenon.
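Stationarity of a word can be probed with the variance-to-mean ratio of its counts over fixed-width windows: a stationary word behaves roughly Poisson-like (ratio near 1), while a clustered, non-stationary word gives a much larger ratio. The window width and the ratio-of-1 reference point below are illustrative assumptions, not values taken from the paper:

```python
def window_counts(seq, word, width):
    """Occurrences of `word` inside each consecutive window of `seq`."""
    w = len(word)
    return [sum(seq[i+j:i+j+w] == word for j in range(width - w + 1))
            for i in range(0, len(seq) - width + 1, width)]

def stationarity_ratio(seq, word, width=50):
    """Variance-to-mean ratio of per-window counts: near 1 for an evenly
    distributed (stationary) word, large for a clustered one."""
    counts = window_counts(seq, word, width)
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 0.0
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    return var / mean
```

A polyA tract followed by AA-free text makes "AA" strongly non-stationary, while a periodic text spreads "AA" evenly.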


Subject(s)
Computer Simulation , Genomic Library , Mathematical Computing , Models, Genetic , Algorithms , Animals , Base Sequence , Humans , Markov Chains , Molecular Sequence Data , Monte Carlo Method , Nucleotides , Programming Languages , Software
17.
Bioinformatics ; 18(10): 1374-81, 2002 Oct.
Article in English | MEDLINE | ID: mdl-12376382

ABSTRACT

MOTIVATION: Gene activity is often affected by the binding of transcription factors to short fragments of DNA sequence called motifs. Identification of subtle regulatory motifs in a DNA sequence is a difficult pattern recognition problem. In this paper we design a new motif finding algorithm that can detect very subtle motifs. RESULTS: We introduce the notion of a multiprofile and use it for finding subtle motifs in DNA sequences. Multiprofiles generalize the notion of a profile and allow one to detect subtle patterns that escape detection by standard profiles. Our MULTIPROFILER algorithm outperforms other leading motif finding algorithms in a number of synthetic models. Moreover, it can be shown that in some previously studied motif models, MULTIPROFILER is capable of pushing the performance envelope to its theoretical limits. AVAILABILITY: http://www-cse.ucsd.edu/groups/bioinformatics/software.html


Subject(s)
Algorithms , Amino Acid Motifs/genetics , Regulatory Sequences, Nucleic Acid/genetics , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Base Sequence , Benchmarking , Consensus Sequence/genetics , DNA/genetics , DNA-Binding Proteins/genetics , Escherichia coli/genetics , Escherichia coli/metabolism , Molecular Sequence Data , Promoter Regions, Genetic/genetics , Quality Control , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Sensitivity and Specificity
18.
Bioinformatics ; 18(10): 1382-90, 2002 Oct.
Article in English | MEDLINE | ID: mdl-12376383

ABSTRACT

MOTIVATION: What constitutes a subtle motif? Intuitively, it is a motif that is almost indistinguishable, in the statistical sense, from random motifs. This question has important practical consequences: consider, for example, a biologist who is generating a sample of upstream regulatory sequences with the goal of finding a regulatory pattern that is shared by these sequences. If the sequences are too short, then one risks losing some of the regulatory patterns that are located further upstream. Conversely, if the sequences are too long, the motif becomes too subtle and one is then likely to encounter random motifs which are at least as significant statistically as the regulatory pattern itself. In practical terms one would like to recognize the sequence length threshold, or the twilight zone, beyond which the motifs are in some sense too subtle. RESULTS: The paper defines the motif twilight zone where every motif finding algorithm would be exposed to random motifs which are as significant as the one being sought. We also propose an objective tool for evaluating the performance of subtle motif finding algorithms. Finally, we apply these tools to evaluate the success of our MULTIPROFILER algorithm in detecting subtle motifs.


Subject(s)
Amino Acid Motifs/genetics , Models, Genetic , Models, Statistical , Regulatory Sequences, Nucleic Acid/genetics , Sequence Analysis, DNA/methods , Algorithms , Base Sequence , Benchmarking , Consensus Sequence/genetics , DNA/genetics , DNA Mutational Analysis/methods , Molecular Sequence Data , Quality Control , Reproducibility of Results , Sensitivity and Specificity , Sequence Alignment/methods , Software , Stochastic Processes
19.
Article in English | MEDLINE | ID: mdl-10786293

ABSTRACT

One current approach to quality control in DNA array manufacturing is to synthesize a small set of test probes that detect variation in the manufacturing process. These fidelity probes are identical copies of the same probe, deliberately manufactured using different steps of the manufacturing process. A known target is hybridized to these probes, and the hybridization results are indicative of the quality of the manufacturing process. It is desirable not only to detect variations, but also to analyze the variations that occur, indicating in which process step the manufacture changed. We describe a combinatorial approach which constructs a small set of fidelity probes that not only detect variations, but also point out the manufacturing step in which a variation has occurred. This algorithm is currently being used in mass production of DNA arrays at Affymetrix.


Subject(s)
DNA Probes , Oligonucleotide Array Sequence Analysis , Algorithms , Combinatorial Chemistry Techniques , Quality Control , Reproducibility of Results , Software
20.
Comput Appl Biosci ; 13(2): 205-10, 1997 Apr.
Article in English | MEDLINE | ID: mdl-9146969

ABSTRACT

Sequencing by hybridization (SBH) is a promising alternative approach to DNA sequencing and mutation detection. Analysis of the resolving power of SBH involves rather difficult combinatorial and probabilistic problems, and sometimes computer simulation is the only way to estimate the parameters and limitations of SBH experiments. This paper describes a software package, DNA-SPECTRUM, which allows one to analyze the resolving power and parameters of SBH. We also introduce the technique for visualizing multiple SBH reconstructions and describe applications of DNA-SPECTRUM to estimate various SBH parameters. DNA-SPECTRUM is available at http://www-hto.usc.edu/software/sbh/index. html.
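One way such a simulation can probe resolving power: for random sequences, the absence of any repeated (l-1)-tuple guarantees a unique reconstruction from the l-tuple spectrum, since every vertex of the corresponding graph then has in- and out-degree at most one. The sketch below uses this sufficient condition to obtain a Monte Carlo lower bound; it illustrates the kind of experiment DNA-SPECTRUM supports and is not its actual code:

```python
import random

def has_branching(seq, l):
    """True if some (l-1)-tuple of `seq` repeats. Without such a repeat,
    the spectrum graph has no branching vertex, so the reconstruction
    from the l-tuple spectrum is unique."""
    tuples = [seq[i:i+l-1] for i in range(len(seq) - l + 2)]
    return len(set(tuples)) < len(tuples)

def resolving_power(n, l, trials=500, seed=0):
    """Monte Carlo lower bound on the fraction of random length-n DNA
    sequences uniquely reconstructible from their l-tuple spectrum."""
    rng = random.Random(seed)
    unique = sum(
        not has_branching("".join(rng.choice("ACGT") for _ in range(n)), l)
        for _ in range(trials))
    return unique / trials
```

Short sequences probed with long tuples are almost always uniquely reconstructible, while long sequences probed with short tuples essentially never are, mirroring the parameter limitations the package is designed to explore.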


Subject(s)
Sequence Analysis, DNA/methods , Software , Base Sequence , Computer Graphics , Computer Simulation , DNA/genetics , Evaluation Studies as Topic , Nucleic Acid Hybridization , Sequence Analysis, DNA/statistics & numerical data