Results 1 - 14 of 14
1.
Nat Struct Mol Biol ; 30(4): 417-424, 2023 04.
Article in English | MEDLINE | ID: mdl-36914796

ABSTRACT

Non-B DNA structures formed by repetitive sequence motifs are known instigators of mutagenesis in experimental systems. Analyzing this phenomenon computationally in the human genome requires careful disentangling of intrinsic confounding factors, including overlapping and interrupted motifs and recurrent sequencing errors. Here, we show that accounting for these factors eliminates all signals of repeat-induced mutagenesis that extend beyond the motif boundary, and eliminates or dramatically shrinks the magnitude of mutagenesis within some motifs, contradicting previous reports. Mutagenesis not attributable to artifacts revealed several biological mechanisms. Polymerase slippage generates frequent indels within every variety of short tandem repeat motif, implicating slipped-strand structures. Interruption-correcting single nucleotide variants within short tandem repeats may originate from error-prone polymerases. Secondary-structure formation promotes single nucleotide variants within palindromic repeats and duplications within direct repeats. G-quadruplex motifs cause recurrent sequencing errors, whereas mutagenesis at Z-DNAs is conspicuously absent.


Subject(s)
DNA , Genome, Human , Humans , Nucleotide Motifs/genetics , Mutagenesis , DNA/genetics , DNA/chemistry , Nucleotides
2.
J Biomed Inform ; 133: 104174, 2022 09.
Article in English | MEDLINE | ID: mdl-35998814

ABSTRACT

Despite genomic sequencing rapidly transforming from a bench-side tool into a routine hospital procedure, there is a noticeable lack of genomic analysis software that supports both clinical and research workflows as well as crowdsourcing. Furthermore, most existing software packages are not forward-compatible with regard to supporting ever-changing diagnostic rules adopted by the genetics community. Regular updates of genomics databases pose challenges for reproducible and traceable automated genetic diagnostics tools. Lastly, most of the software tools score low on explainability among clinicians. We have created a fully open-source variant curation tool, AnFiSA, with the intention to invite and accept contributions from clinicians, researchers, and professional software developers. The design of AnFiSA addresses the aforementioned issues via the following architectural principles: using a multidimensional database management system (DBMS) for genomic data to address reproducibility, curated decision trees adaptable to changing clinical rules, and a crowdsourcing-friendly interface to address difficult-to-diagnose cases. We discuss how we have chosen our technology stack and describe the design and implementation of the software. Finally, we show in detail how selected workflows can be implemented using the current version of AnFiSA by a medical geneticist.


Subject(s)
Genomics , Software , Computational Biology/methods , Database Management Systems , Databases, Genetic , Genomics/methods , Reproducibility of Results , Workflow
3.
Nature ; 508(7497): 469-76, 2014 Apr 24.
Article in English | MEDLINE | ID: mdl-24759409

ABSTRACT

The discovery of rare genetic variants is accelerating, and clear guidelines for distinguishing disease-causing sequence variants from the many potentially functional variants present in any human genome are urgently needed. Without rigorous standards we risk an acceleration of false-positive reports of causality, which would impede the translation of genomic research findings into the clinical diagnostic setting and hinder biological understanding of disease. Here we discuss the key challenges of assessing sequence variants in human disease, integrating both gene-level and variant-level support for causality. We propose guidelines for summarizing confidence in variant pathogenicity and highlight several areas that require further resource development.


Subject(s)
Disease , Genetic Predisposition to Disease/genetics , Genetic Variation/genetics , Guidelines as Topic , False Positive Reactions , Genes/genetics , Humans , Information Dissemination , Publishing , Reproducibility of Results , Research Design , Translational Research, Biomedical/standards
4.
Anal Chem ; 73(9): 1917-26, 2001 May 01.
Article in English | MEDLINE | ID: mdl-11354471

ABSTRACT

MALDI-quadrupole time-of-flight mass spectrometry was applied to identify proteins from organisms whose genomes are still unknown. The identification was carried out by successively searching a sequence database: first with a peptide mass fingerprint, then with a packet of noninterpreted MS/MS spectra, and finally with peptide sequences obtained by automated interpretation of the MS/MS spectra. An "MS BLAST" homology searching protocol was developed to overcome specific limitations imposed by mass spectrometric data, such as the limited accuracy of de novo sequence predictions. This approach was tested in a small-scale proteomic project involving the identification of 15 bands of gel-separated proteins from the methylotrophic yeast Pichia pastoris, whose genome has not yet been sequenced and which is only distantly related to other fungi.


Subject(s)
Databases, Factual , Genome, Fungal , Pichia/genetics , Proteome/chemistry , Sequence Analysis, Protein/methods , Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization , Algorithms , Amino Acid Sequence , Animals , Cell Line , Dogs , Kidney/cytology , Membrane Proteins/chemistry , Molecular Sequence Data , Peptide Mapping/instrumentation , Peptide Mapping/methods , Trypsin/metabolism
5.
Hum Mol Genet ; 10(6): 591-7, 2001 Mar 15.
Article in English | MEDLINE | ID: mdl-11230178

ABSTRACT

Single nucleotide polymorphisms (SNPs) constitute the bulk of human genetic variation, occurring with an average density of approximately 1/1000 nucleotides of a genotype. SNPs are either neutral allelic variants or are under selection of various strengths, and the impact of SNPs on fitness remains unknown. Identification of SNPs affecting human phenotype, especially leading to risks of complex disorders, is one of the key problems of medical genetics. SNPs in protein-coding regions that cause amino acid variants (non-synonymous cSNPs) are most likely to affect phenotypes. We have developed a straightforward and reliable method based on physical and comparative considerations that estimates the impact of an amino acid replacement on the three-dimensional structure and function of the protein. We estimate that approximately 20% of common human non-synonymous SNPs damage the protein. The average minor allele frequency of such SNPs in our data set was two times lower than that of benign non-synonymous SNPs. The average human genotype carries approximately 10^3 damaging non-synonymous SNPs that together cause a substantial reduction in fitness.


Subject(s)
Gene Deletion , Gene Frequency/genetics , Polymorphism, Single Nucleotide , Alleles , Amino Acid Substitution/genetics , Genetic Variation , Genotype , Humans , Models, Molecular , Protein Conformation , Selection, Genetic
6.
Curr Opin Struct Biol ; 11(1): 125-30, 2001 Feb.
Article in English | MEDLINE | ID: mdl-11179902

ABSTRACT

With the massive amount of sequence and structural data being produced, new avenues emerge for exploiting the information therein for applications in several fields. Fold distributions can be mapped onto entire genomes to learn about the nature of the protein universe and many of the interactions between proteins can now be predicted solely on the basis of the genomic context of their genes. Furthermore, by utilising the new incoming data on single nucleotide polymorphisms by mapping them onto three-dimensional structures of proteins, problems concerning population, medical and evolutionary genetics can be addressed.


Subject(s)
Genomics/methods , Polymorphism, Single Nucleotide , Protein Binding , Protein Folding , Apolipoproteins E/chemistry , Apolipoproteins E/genetics , Forecasting/methods , Models, Theoretical , Phenotype , Sequence Homology, Amino Acid
10.
Protein Eng ; 12(5): 387-94, 1999 May.
Article in English | MEDLINE | ID: mdl-10360979

ABSTRACT

Sequence weighting techniques are aimed at balancing redundant observed information from subsets of similar sequences in multiple alignments. Traditional approaches apply the same weight to all positions of a given sequence, hence equal efficiency of phylogenetic changes is assumed along the whole sequence. This restrictive assumption is not required for the new method PSIC (position-specific independent counts) described in this paper. The number of independent observations (counts) of an amino acid type at a given alignment position is calculated from the overall similarity of the sequences that share the amino acid type at this position with the help of statistical concepts. This approach allows the fast computation of position-specific sequence weights even for alignments containing hundreds of sequences. The PSIC approach has been applied to profile extraction and to the fold family assignment of protein sequences with known structures. Our method was shown to be very productive in finding distantly related sequences and more powerful than Hidden Markov Models or the profile methods in WiseTools and PSI-BLAST in many cases. The profile extraction routine is available on the WWW (http://www.bork.embl-heidelberg.de/PSIC or http://www.imb.ac.ru/PSIC).


Subject(s)
Proteins/chemistry , Sequence Alignment/statistics & numerical data , Algorithms , Amino Acid Sequence , Amino Acids/chemistry , Conserved Sequence , Databases, Factual , Internet , Molecular Sequence Data , Protein Folding
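
The "traditional approaches" mentioned in the abstract above assign a single weight to each sequence. As a minimal sketch of that baseline only, the code below implements the widely used position-based weighting scheme of Henikoff and Henikoff (1994). It is not the PSIC method, whose formulas are not given in the abstract. The function name `henikoff_weights` and the toy alignment are illustrative assumptions, and gap characters are treated as ordinary symbols for simplicity.

```python
from collections import Counter


def henikoff_weights(alignment):
    """Return one normalized weight per sequence (position-based scheme).

    Each column contributes 1 / (k * n) to a sequence, where k is the number
    of distinct symbols in the column and n is the number of sequences that
    share this sequence's symbol in that column.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    weights = [0.0] * len(alignment)
    for col in range(length):
        column = [seq[col] for seq in alignment]
        counts = Counter(column)   # symbol -> number of sequences carrying it
        k = len(counts)            # number of distinct symbols in this column
        for i, symbol in enumerate(column):
            weights[i] += 1.0 / (k * counts[symbol])

    total = sum(weights)
    return [w / total for w in weights]  # normalize so the weights sum to 1


# Hypothetical toy alignment: two identical sequences and one divergent one.
toy_alignment = ["AAA", "AAA", "CCC"]
toy_weights = henikoff_weights(toy_alignment)
```

In this toy case the two redundant sequences share the weight that the divergent sequence receives alone, which is the balancing effect the abstract describes. A position-specific method such as PSIC would instead let the weight vary from column to column.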
11.
J Mol Med (Berl) ; 77(11): 754-60, 1999 Nov.
Article in English | MEDLINE | ID: mdl-10619435

ABSTRACT

Analysis of human genetic variation can shed light on the problem of the genetic basis of complex disorders. Nonsynonymous single nucleotide polymorphisms (SNPs), which affect the amino acid sequence of proteins, are believed to be the most frequent type of variation associated with the respective disease phenotype. Complete enumeration of nonsynonymous SNPs in the candidate genes will enable further association studies on panels of affected and unaffected individuals. Experimental detection of SNPs requires implementation of expensive technologies and is still far from being routine. Alternatively, SNPs can be identified by computational analysis of a publicly available expressed sequence tag (EST) database, followed by experimental verification. We performed in silico analysis of amino acid variation for 471 proteins with a documented history of experimental variation studies and with confirmed association with human diseases. This allowed us to evaluate the level of completeness of the current knowledge of nonsynonymous SNPs in well-studied, medically relevant genes and to estimate the proportion of new variants that can be added with the help of computer-aided mining in EST databases. Our results suggest that approximately 50% of frequent nonsynonymous variants are already stored in public databases. Computational methods based on the scan of an EST database can add significantly to the current knowledge, but they are greatly limited by the size of EST databases and the nonuniform coverage of genes by ESTs. Nevertheless, a considerable number of new candidate nonsynonymous SNPs in genes of medical interest were found by the EST screening procedure.


Subject(s)
Genetic Diseases, Inborn/genetics , Polymorphism, Single Nucleotide/genetics , Databases, Factual , Electronic Data Processing , Expressed Sequence Tags , Humans
12.
J Mol Biol ; 280(3): 323-6, 1998 Jul 17.
Article in English | MEDLINE | ID: mdl-9665839

ABSTRACT

Homology search techniques based on the iterative PSI-BLAST method in combination with various filters for low sequence complexity are applied to assign folds to all Mycoplasma genitalium proteins. The resulting procedure (implemented as a web server) is able to predict at least one domain in 37% of these proteins automatically, with an estimated accuracy higher than 98%. Setting aside structural features such as coiled coil or transmembrane regions, folds can be assigned to more than half of the globular proteins in a bacterium just by iterative sequence comparison.


Subject(s)
Bacterial Proteins/chemistry , Mycoplasma/chemistry , Protein Folding , Protein Conformation , Sequence Homology
13.
Proteins ; 31(3): 225-46, 1998 May 15.
Article in English | MEDLINE | ID: mdl-9593195

ABSTRACT

The parametric description of residue environments through solvent accessibility, backbone conformation, or pairwise residue-residue distances is the key to the comparison between amino acid types at protein sequence positions and residue locations in structural templates (condition of protein sequence-structure match). For the first time, the research results presented in this study clarify and make it possible to quantify, on a rigorous statistical basis, to what extent the amino acid type-specific distributions of commonly used environment parameters are discriminative with respect to the 20 amino acid types. Relying on the Bahadur theory, we estimate the probability of error in a single-sequence-structure alignment based on weak or absent discriminative power in a learning database of protein structure. We present the results for many residue environment variables and demonstrate that each fold description parameter is sensitive with respect to only a few amino acid types while indifferent to most of the other amino acid types. Even complex structural characteristics combining solvent-accessible surface area, backbone conformation, and pairwise distances distinguish only some amino acid types, whereas the others remain nondiscriminated. We find that the knowledge-based potentials currently in use treat especially Ala, Asp, Gln, His, Ser, Thr, and Tyr as essentially "average" amino acids. Thus, highly discriminative amino acid types define the alignment register in gapless sequence-structure alignments. The introduction of gaps leads to alignment ambiguities at sequence positions occupied by nondiscriminated amino acid types. Therefore, local sequence-structure alignments produced by techniques with gaps cannot be reliable. Conceptually new and more sensitive environment parameters must be invented.


Subject(s)
Amino Acids/chemistry , Protein Conformation , Chemical Phenomena , Chemistry, Physical , Databases, Factual , Mathematics , Protein Folding , Protein Structure, Secondary , Protein Structure, Tertiary , Sequence Alignment , Solvents , Templates, Genetic
14.
Protein Eng ; 10(6): 635-46, 1997 Jun.
Article in English | MEDLINE | ID: mdl-9278276

ABSTRACT

The assignment of query protein sequences to probable folds in a threading approach is based on the statistical analysis (learning) of structural properties of amino acids in known protein structures. We formalize the recognition problem in terms of mathematical statistics, namely statistical hypothesis testing. Our general formulation leads to various mathematical forms of a decision rule function for evaluation of the quality of a sequence-structure fit. Three criteria were derived according to a likelihood ratio approach. Two of them have new functional forms while the third happens to coincide with the mean force potential function previously derived under the additional assumption of the Boltzmann law. New decision rule functions employ (i) the Parzen estimator of a probability density and (ii) the newly introduced non-parametric statistic with known asymptotic distribution. We compared criteria efficiency by a 'structure seeks sequence' search for three highly populated template folds through a query library of non-homologous sequences of proteins with known 3D structure using residue accessibility as an environmental variable. Various criteria reflect different underlying statistical propositions and thus often recognize diverse correct sequence-structure matches. On the other hand, if an amino acid sequence is recognized as compatible with a template by each of the three decision rules, it appears that one can make a more reliable inference of sequence-structure relationship, since almost all false positives obtained by the three criteria differ.


Subject(s)
Amino Acid Sequence , Models, Statistical , Protein Conformation , Algorithms , Data Interpretation, Statistical , Likelihood Functions , Peptide Library , Protein Folding , Sequence Alignment , Structure-Activity Relationship , Templates, Genetic
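
The abstract above mentions a likelihood-ratio decision rule built on the Parzen estimator of a probability density. The following is a generic sketch of that idea, not the paper's actual criteria, which the abstract does not spell out. It assumes a Gaussian kernel, a fixed bandwidth, residue accessibility scaled to the range 0 to 1, and a made-up training set; the names `parzen_density`, `fit_score`, and `toy_training` are hypothetical.

```python
import math


def parzen_density(x, samples, bandwidth):
    """Gaussian-kernel Parzen estimate of a probability density at x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    )


def fit_score(positions, training, bandwidth=0.1, floor=1e-300):
    """Sum of log likelihood ratios over threaded positions.

    positions: list of (amino_acid, accessibility) pairs for a
               sequence placed onto a template.
    training:  dict mapping amino acid -> accessibilities observed in
               known structures (assumed learning set).
    Each term compares the density for that amino acid type with the
    density pooled over all types; higher totals mean a better fit.
    """
    pooled = [value for values in training.values() for value in values]
    score = 0.0
    for amino_acid, accessibility in positions:
        p_type = parzen_density(accessibility, training[amino_acid], bandwidth)
        p_any = parzen_density(accessibility, pooled, bandwidth)
        # The floor guards against log(0) when a density underflows.
        score += math.log(max(p_type, floor) / max(p_any, floor))
    return score


# Made-up training data: Leu mostly buried, Lys mostly exposed.
toy_training = {"L": [0.05, 0.10, 0.15], "K": [0.70, 0.80, 0.90]}
buried_leu = fit_score([("L", 0.10)], toy_training)
exposed_leu = fit_score([("L", 0.80)], toy_training)
```

With these toy numbers a leucine placed at a buried position scores positive and one placed at an exposed position scores negative. A real application would also need a principled bandwidth choice and a threshold derived from the test statistic's distribution.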