1.
Mol Biol Evol ; 2024 Jun 06.
Article in English | MEDLINE | ID: mdl-38842253

ABSTRACT

Despite having important biological implications, insertion and deletion (indel) events are often disregarded or mishandled during phylogenetic inference. In multiple sequence alignment, indels are represented as gaps and are estimated without considering the distinct evolutionary history of insertions and deletions. Consequently, indels are usually excluded from subsequent inference steps, such as ancestral sequence reconstruction and phylogenetic tree search. Here, we introduce indel-aware parsimony (indelMaP), a novel way to treat gaps under the parsimony criterion by considering insertions and deletions as separate evolutionary events and accounting for long indels. By identifying the precise location of an evolutionary event on the tree, we can separate overlapping indel events and use affine gap penalties for long indel modelling. Our indel-aware approach harnesses the phylogenetic signal from indels, incorporating them into all inference stages. Validation and comparison to state-of-the-art inference tools on simulated data show that indelMaP is most suitable for densely sampled datasets with closely to moderately related sequences, where it can reach alignment quality comparable to probabilistic methods and accurately infer ancestral sequences, including indel patterns. Due to its remarkable speed, our method is well-suited for epidemiological datasets, eliminating the need for downsampling and enabling the exploitation of the additional information provided by dense taxonomic sampling. Moreover, indelMaP offers new insights into the indel patterns of biologically significant sequences and advances our understanding of genetic variability by considering gaps as crucial evolutionary signals rather than mere artefacts.
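
The affine gap penalties mentioned in the abstract above can be illustrated with a small sketch: a gap of length k is charged one opening penalty plus an extension penalty for each further position, so one long indel costs less than several short ones of the same total length. The function names and penalty values below are illustrative assumptions, not indelMaP's actual interface or defaults, and the sketch scores gaps in a single aligned row rather than placing insertions and deletions on a tree as indelMaP does.

```python
def affine_gap_cost(length, gap_open=2.0, gap_extend=0.5):
    """Cost of one gap of `length` positions under an affine penalty.

    gap_open is charged for the first gapped position and gap_extend for
    each additional one (both values are hypothetical).
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


def row_gap_cost(row, gap_open=2.0, gap_extend=0.5):
    """Sum affine costs over the maximal runs of '-' in one aligned row."""
    total = 0.0
    run = 0
    for ch in row:
        if ch == "-":
            run += 1
        else:
            total += affine_gap_cost(run, gap_open, gap_extend)
            run = 0
    # Account for a gap that runs to the end of the row.
    total += affine_gap_cost(run, gap_open, gap_extend)
    return total
```

With these hypothetical penalties, one gap of three positions is cheaper than three separate single-position gaps, which is the behaviour that makes affine penalties suitable for modelling long indels.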

2.
Sci Rep ; 14(1): 3331, 2024 02 09.
Article in English | MEDLINE | ID: mdl-38336885

ABSTRACT

Short tandem repeat (STR) mutations are prevalent in colorectal cancer (CRC), especially in tumours with the microsatellite instability (MSI) phenotype. While STR length variations are known to regulate gene expression under physiological conditions, the functional impact of STR mutations in CRC remains unclear. Here, we integrate STR mutation data with clinical information and gene expression data to study the gene regulatory effects of STR mutations in CRC. We confirm that STR mutability in CRC highly depends on the MSI status, repeat unit size, and repeat length. Furthermore, we present a set of 1244 putative expression STRs (eSTRs) for which the STR length is associated with gene expression levels in CRC tumours. The length of 73 eSTRs is associated with expression levels of cancer-related genes, nine of which are CRC-specific genes. We show that linear models describing eSTR-gene expression relationships allow for predictions of gene expression changes in response to eSTR mutations. Moreover, we found an increased mutability of eSTRs in MSI tumours. Our evidence of gene regulatory roles for eSTRs in CRC highlights a mostly overlooked way through which tumours may modulate their phenotypes. Future extensions of these findings could uncover new STR-based targets in the treatment of cancer.


Subjects
Colorectal Neoplasms; Microsatellite Repeats; Humans; Microsatellite Repeats/genetics; Mutation; Microsatellite Instability; Colorectal Neoplasms/pathology; Gene Expression
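
The linear models relating eSTR length to gene expression, described in the abstract above, can be sketched as an ordinary least-squares fit of expression on repeat length. The slope then gives the predicted expression change per repeat unit gained or lost. This is a minimal sketch on made-up toy numbers; the function names are hypothetical and the study's actual model (covariates, normalisation, multiple-testing control) is not reproduced here.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all STR lengths are identical; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predicted_expression_change(slope, delta_repeat_units):
    """Predicted change in expression for a mutation of delta_repeat_units."""
    return slope * delta_repeat_units


# Synthetic toy data (NOT from the study): STR length in repeat units per
# tumour, and the expression level of the associated gene.
str_lengths = [10, 11, 12, 13, 14]
expression = [5.0, 5.5, 6.0, 6.5, 7.0]
slope, intercept = fit_line(str_lengths, expression)
```

Under such a model, a contraction by two repeat units would be predicted to shift expression by `predicted_expression_change(slope, -2)`.
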
3.
J Mol Biol ; 435(20): 168260, 2023 10 15.
Article in English | MEDLINE | ID: mdl-37678708

ABSTRACT

Short tandem repeats (STRs) are consecutive repetitions of one to six nucleotide motifs. They are hypervariable due to the high prevalence of repeat unit insertions or deletions primarily caused by polymerase slippage during replication. Genetic variation at STRs has been shown to influence a range of traits in humans, including gene expression, cancer risk, and autism. Until recently, STRs were poorly studied because they pose significant challenges to bioinformatics analyses. Moreover, genome-wide analysis of STR variation in population-scale cohorts requires large amounts of data and computational resources. However, the recent advent of genome-wide analysis tools has resulted in multiple large genome-wide datasets of STR variation spanning nearly two million genomic loci in thousands of individuals from diverse populations. Here we present WebSTR, a database of genetic variation and other characteristics of genome-wide STRs across human populations. WebSTR is based on reference panels of more than 1.7 million human STRs created with state-of-the-art repeat annotation methods and can easily be extended to include additional cohorts or species. It currently contains data based on STR genotypes for individuals from the 1000 Genomes Project, H3Africa, the Genotype-Tissue Expression (GTEx) Project and colorectal cancer patients from the TCGA dataset. WebSTR is implemented as a relational database with programmatic access available through an API and a web portal for browsing data. The web portal is publicly available at https://webstr.ucsd.edu.


Subjects
Databases, Genetic; Genetic Variation; Genome, Human; Microsatellite Repeats; Humans; Computational Biology; Genotype; Microsatellite Repeats/genetics; Genome-Wide Association Study; Datasets as Topic; Colorectal Neoplasms/genetics
4.
Life Sci Alliance ; 6(4)2023 04.
Article in English | MEDLINE | ID: mdl-36754567

ABSTRACT

The dopamine transporter gene, SLC6A3, has received substantial attention in genetic association studies of various phenotypes. Although some variable number tandem repeats (VNTRs) present in SLC6A3 have been tested in genetic association studies, results have not been consistent. VNTRs in SLC6A3 that have not been examined genetically were characterized. The Tandem Repeat Annotation Library was used to characterize the VNTRs of 64 unrelated long-read haplotype-phased SLC6A3 sequences. Sequence similarity of each repeat unit of the five VNTRs is reported, along with the correlations of SNP-SNP, SNP-VNTR, and VNTR-VNTR alleles across the gene. One of these VNTRs is a novel hyper-VNTR (hyVNTR) in intron 8 of SLC6A3, which contains a range of 3.4-133.4 repeat copies and has a consensus sequence length of 38 bp, with 82% G+C content. The 38-base repeat was predicted to form G-quadruplexes in silico and was confirmed by circular dichroism spectroscopy. In addition, this hyVNTR contains multiple putative binding sites for PRDM9, which, in combination with low levels of linkage disequilibrium around the hyVNTR, suggests it might be a recombination hotspot.


Subjects
Dopamine Plasma Membrane Transport Proteins; Minisatellite Repeats; Alleles; Dopamine Plasma Membrane Transport Proteins/genetics; Haplotypes; Introns; Minisatellite Repeats/genetics; Humans
5.
J Evol Biol ; 36(2): 321-336, 2023 02.
Article in English | MEDLINE | ID: mdl-36289560

ABSTRACT

Short tandem repeats (STRs) are units of 1-6 bp that repeat in a tandem fashion in DNA. Along with single nucleotide polymorphisms and large structural variations, they are among the major genomic variants underlying genetic, and likely phenotypic, divergence. STRs experience mutation rates that are orders of magnitude higher than other well-studied genotypic variants. Frequent copy number changes result in a wide range of alleles, and provide unique opportunities for modulating complex phenotypes through variation in repeat length. While classical studies have identified key roles of individual STR loci, the advent of improved sequencing technology, high-quality genome assemblies for diverse species, and bioinformatics methods for genome-wide STR analysis now enable more systematic study of STR variation across wide evolutionary ranges. In this review, we explore mutation and selection processes that affect STR copy number evolution, and how these processes give rise to varying STR patterns both within and across species. Finally, we review recent examples of functional and adaptive changes linked to STRs.


Subjects
Genome; Microsatellite Repeats; Mutation; Genotype; Phenotype
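
The definition used in the abstract above (units of 1-6 bp repeating in tandem) can be turned into a naive detector for perfect repeats. The sketch below uses a backreference in a regular expression and reports the start position, repeat unit and copy number of each hit. The function name and the `min_copies` threshold are assumptions made for illustration. Dedicated STR annotation tools also handle imperfect repeats and statistical filtering, which this sketch does not.

```python
import re


def find_perfect_strs(seq, min_copies=3):
    """Return (start, unit, copies) for perfect tandem repeats of 1-6 bp units.

    The lazy quantifier makes the regex try the shortest unit first at each
    position, so "CACACA" is reported with unit "CA" rather than "CACA".
    """
    seq = seq.upper()
    pattern = re.compile(r"([ACGT]{1,6}?)\1{" + str(min_copies - 1) + r",}")
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits
```

Comparing the copy number reported at the same locus in two individuals gives the kind of repeat-length allele difference that the review discusses.
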
6.
Syst Biol ; 72(2): 307-318, 2023 Jun 16.
Article in English | MEDLINE | ID: mdl-35866991

ABSTRACT

Modern phylogenetic methods allow inference of ancestral molecular sequences given an alignment and phylogeny relating present-day sequences. This provides insight into the evolutionary history of molecules, helping to understand gene function and to study biological processes such as adaptation and convergent evolution across a variety of applications. Here, we propose a dynamic programming algorithm for fast joint likelihood-based reconstruction of ancestral sequences under the Poisson Indel Process (PIP). Unlike previous approaches, our method, named ARPIP, enables the reconstruction with insertions and deletions based on an explicit indel model. Consequently, inferred indel events have an explicit biological interpretation. Likelihood computation is achieved in linear time with respect to the number of sequences. Our method consists of two steps, namely finding the most probable indel points and reconstructing ancestral sequences. First, we find the most likely indel points and prune the phylogeny to reflect the insertion and deletion events per site. Second, we infer the ancestral states on the pruned subtree in a manner similar to FastML. We applied ARPIP (Ancestral Reconstruction under PIP) on simulated data sets and on real data from the Betacoronavirus genus. ARPIP reconstructs both the indel events and substitutions with a high degree of accuracy. Our method fares well when compared to established state-of-the-art methods such as FastML and PAML. Moreover, the method can be extended to explore both optimal and suboptimal reconstructions, include rate heterogeneity through time and more. We believe it will expand the range of novel applications of ancestral sequence reconstruction. [Ancestral sequences; dynamic programming; evolutionary stochastic process; indel; joint ancestral sequence reconstruction; maximum likelihood; Poisson Indel Process; phylogeny; SARS-CoV.].


Subjects
Algorithms; INDEL Mutation; Phylogeny; Likelihood Functions; Sequence Alignment; INDEL Mutation/genetics; Evolution, Molecular
7.
Front Bioinform ; 2: 827207, 2022.
Article in English | MEDLINE | ID: mdl-36304281

ABSTRACT

Literature-based discovery (LBD) mines existing literature in order to generate new hypotheses by finding links between previously disconnected pieces of knowledge. Although automated LBD systems are becoming widespread and indispensable in a wide variety of knowledge domains, little has been done to introduce LBD to the field of natural products research. Despite growing knowledge in the natural product domain, most of the accumulated information is found in detached data pools. LBD can facilitate better contextualization and exploitation of this wealth of data, for example by formulating new hypotheses for natural product research, especially in the context of drug discovery and development. Moreover, automated LBD systems promise to accelerate the currently tedious and expensive process of lead identification, optimization, and development. Focusing on natural product research, we briefly reflect on the development of automated LBD and summarize its methods and principal data sources. In a thorough review of published use cases of LBD in the biomedical domain, we highlight the immense potential of this data mining approach for natural product research, especially in the context of drug discovery or repurposing, mode of action, and drug or substance interactions. Most of the 91 natural product-related discoveries in our sample of reported use cases of LBD were addressed to a computer science audience. Therefore, it is the wider goal of this review to introduce automated LBD to researchers who work with natural products and to facilitate the dialogue between this community and the developers of automated LBD systems.

8.
Distrib Parallel Databases ; 40(2-3): 409-440, 2022.
Article in English | MEDLINE | ID: mdl-36097541

ABSTRACT

The problem of natural language processing over structured data has become a growing research field, both within the relational database and the Semantic Web community, with significant efforts involved in question answering over knowledge graphs (KGQA). However, many of these approaches are either specifically targeted at open-domain question answering using DBpedia, or require large training datasets to translate a natural language question to SPARQL in order to query the knowledge graph. Hence, these approaches often cannot be applied directly to complex scientific datasets where no prior training data is available. In this paper, we focus on the challenges of natural language processing over knowledge graphs of scientific datasets. In particular, we introduce Bio-SODA, a natural language processing engine that does not require training data in the form of question-answer pairs for generating SPARQL queries. Bio-SODA uses a generic graph-based approach for translating user questions to a ranked list of SPARQL candidate queries. Furthermore, Bio-SODA uses a novel ranking algorithm that includes node centrality as a measure of relevance for selecting the best SPARQL candidate query. Our experiments with real-world datasets across several scientific domains, including the official bioinformatics Question Answering over Linked Data (QALD) challenge, as well as the CORDIS dataset of European projects, show that Bio-SODA outperforms publicly available KGQA systems by an F1-score of at least 20% and by an even higher factor on more complex bioinformatics datasets. Finally, we introduce Bio-SODA UX, a graphical user interface designed to assist users in the exploration of large knowledge graphs and in dynamically disambiguating natural language questions that target the data available in these graphs.

9.
Sci Rep ; 12(1): 8883, 2022 05 25.
Article in English | MEDLINE | ID: mdl-35614123

ABSTRACT

Several human pathogens exhibit distinct patterns of seasonality and circulate as pairs. For instance, influenza A virus subtypes oscillate and peak during winter seasons of the world's temperate climate zones. Alternation of dominant strains in successive influenza seasons makes epidemic forecasting a major challenge. From the start of the 2009 influenza pandemic we enrolled influenza A virus infected patients (n = 2980) in a global prospective clinical study. Complete hemagglutinin sequences were obtained from 1078 A/H1N1 and 1033 A/H3N2 viruses. We used phylodynamics to construct high resolution spatio-temporal phylogenetic hemagglutinin trees and estimated global influenza A effective reproductive numbers (R) over time (2009-2013). We demonstrate that R oscillates around R = 1 with a clear opposed alternation pattern between phases of the A/H1N1 and A/H3N2 subtypes. Moreover, we find a similar alternation pattern for the number of global viral spread between the sampled geographical locations. Both observations suggest a between-strain competition for susceptible hosts on a global level. Extrinsic factors that affect person-to-person transmission are a major driver of influenza seasonality. The data presented here indicate that cross-reactive host immunity is also a key intrinsic driver of influenza seasonality, which determines the influenza A virus strain at the onset of each epidemic season.


Subjects
Influenza A Virus, H1N1 Subtype; Influenza A virus; Influenza, Human; Hemagglutinins; Humans; Influenza A Virus, H3N2 Subtype/genetics; Influenza, Human/epidemiology; Phylogeny; Prospective Studies; Seasons
10.
J Voice ; 36(2): 294.e1-294.e12, 2022 Mar.
Article in English | MEDLINE | ID: mdl-32739034

ABSTRACT

Recent research describes the effect of Type 2 diabetes (T2D) on voice, suggesting that it can be diagnosed based on vocal clues. Although these studies have similar experimental designs with respect to the voice data and the analysis methods, the conclusions regarding the voice changes differ substantially and are at times contradictory. This is unexpected, since the mechanism of pathological deterioration behind the observed changes is the same. This year, an article published in J. of Voice suggested that vocal changes may differ among ethnicities. Before this hypothesis can be accepted, the study protocols should be improved and unified to ensure that the empirical evidence is reliable. Additionally, given the recently published data about temporal voice changes resulting from glucose swings, we propose that persons in hypo- or hyperglycemic conditions should be excluded from the experiment. Since no study succeeded in diabetes detection, it is timely to mention that there is an alternative methodology for disease detection from voice, which is far more sensitive than the state-of-the-art procedure. We propose a script that is available from the first author on request.


Subjects
Diabetes Mellitus, Type 2; Voice; Diabetes Mellitus, Type 2/complications; Diabetes Mellitus, Type 2/diagnosis; Humans
11.
Mod Pathol ; 35(2): 240-248, 2022 02.
Article in English | MEDLINE | ID: mdl-34475526

ABSTRACT

The backbone of all colorectal cancer classifications including the consensus molecular subtypes (CMS) highlights microsatellite instability (MSI) as a key molecular pathway. Although mucinous histology (generally defined as >50% extracellular mucin-to-tumor area) is a "typical" feature of MSI, it is not limited to this subgroup. Here, we investigate the association of CMS classification and mucin-to-tumor area quantified using a deep learning algorithm, and the expression of specific mucins in predicting CMS groups and clinical outcome. A weakly supervised segmentation method was developed to quantify extracellular mucin-to-tumor area in H&E images. Performance was compared to two pathologists' scores, then applied to two cohorts: (1) TCGA (n = 871 slides/412 patients) used for mucin-CMS group correlation and (2) Bern (n = 775 slides/517 patients) for histopathological correlations and next-generation Tissue Microarray construction. TCGA and CPTAC (n = 85 patients) were used to further validate mucin detection and CMS classification by gene and protein expression analysis for MUC2, MUC4, MUC5AC and MUC5B. An excellent inter-observer agreement between pathologists' scores and the algorithm was obtained (ICC = 0.92). In TCGA, mucinous tumors were predominantly CMS1 (25.7%), CMS3 (24.6%) and CMS4 (16.2%). Average mucin in CMS2 was 1.8%, indicating negligible amounts. RNA and protein expression of MUC2, MUC4, MUC5AC and MUC5B were low-to-absent in CMS2. MUC5AC protein expression correlated with aggressive tumor features, e.g., distant metastases (p = 0.0334), BRAF mutation (p < 0.0001) and mismatch repair deficiency (p < 0.0001), and with unfavorable 5-year overall survival (44% versus 65% for positive versus negative staining). MUC2 expression showed the opposite trend, correlating with less lymphatic (p = 0.0096) and venous vessel invasion (p = 0.0023), with no impact on survival. The absence of mucin-expressing tumors in CMS2 provides an important phenotype-genotype correlation. Together with MSI, mucinous histology may help predict CMS classification using only histopathology and should be considered in future image classifiers of molecular subtypes.


Subjects
Brain Neoplasms; Colorectal Neoplasms; Biomarkers, Tumor/analysis; Biomarkers, Tumor/genetics; Colorectal Neoplasms/pathology; Humans; Microsatellite Instability; Mucin-2/analysis; Mucin-2/genetics; Mutation
13.
BMC Bioinformatics ; 22(1): 518, 2021 Oct 24.
Article in English | MEDLINE | ID: mdl-34689750

ABSTRACT

BACKGROUND: Current alignment tools typically lack an explicit model of indel evolution, leading to artificially short inferred alignments (i.e., over-alignment) due to inconsistencies between the indel history and the phylogeny relating the input sequences. RESULTS: We present a new progressive multiple sequence alignment tool, ProPIP. The process of insertions and deletions is described using an explicit evolutionary model, the Poisson Indel Process (PIP). The method is based on dynamic programming and is implemented in a frequentist framework. The algorithm is implemented in C++ as a standalone program. The source code can be compiled on Linux, macOS and Microsoft Windows platforms. It is freely available on GitHub at https://github.com/acg-team/ProPIP and is distributed under the terms of the GNU GPL v3 license. CONCLUSIONS: The use of an explicit indel evolution model makes it possible to avoid over-alignment, to infer gaps in a phylogenetically consistent way and to make inferences about the rates of insertions and deletions. Instead of arbitrary gap penalties, the parameters used by ProPIP are the insertion and deletion rates, which have a biological interpretation and are contextualized in a probabilistic environment. As a result, indel rate settings may be optimised in order to infer phylogenetically meaningful gap patterns.


Subjects
Evolution, Molecular; INDEL Mutation; Algorithms; Phylogeny; Sequence Alignment; Software
14.
Cancer Genet ; 256-257: 165-178, 2021 08.
Article in English | MEDLINE | ID: mdl-34186498

ABSTRACT

PURPOSE: This study aimed to investigate factors that influence the content of circulating tumor DNA (ctDNA). METHODS: 398 serial plasma samples were collected within 1-7 consecutive days from patients with EGFR-mutated lung cancer (n = 13), RAS/RAF-mutated colorectal cancer (n = 54) and BRAF-mutated melanoma (n = 17) who presented with measurable tumor disease. The amount of ctDNA was determined by ddPCR. RESULTS: Among 82 patients who donated 2-6 serial plasma samples, 42 subjects were classified as ctDNA-positive; only 22% of cases were mutation-positive across all consecutive tests, while 24/82 (29%) patients showed presence of mutated ctDNA in some but not all blood draws. Subjects with progressing tumors had a higher probability of being detected as ctDNA-positive than patients who responded to therapy or had stable disease (39/55 (71%) vs. 4/24 (17%); p = 0.0001). Our study failed to reveal an impact of the time of day, a recent meal or prior physical exercise on the results of ctDNA testing. CONCLUSIONS: Presence of ctDNA in plasma is particularly characteristic of patients who experience clinical progression of tumor disease. Consecutive plasma tests may occasionally provide discordant data; thus, repeating the analysis may be advisable in certain cases in order to ensure the validity of a negative ctDNA result.


Subjects
Circulating Tumor DNA/blood; Exercise/physiology; Tumor Burden; Aged; Aged, 80 and over; Circulating Tumor DNA/genetics; DNA Mutational Analysis; Female; Humans; Male; Middle Aged; Mutation/genetics; Probability; Reproducibility of Results; Time Factors
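
The comparison of ctDNA-positive proportions reported in the abstract above (39/55 vs. 4/24) is a 2x2 contingency-table problem. The abstract does not say which statistical test was used, so the sketch below assumes a two-sided Fisher's exact test purely for illustration. The function name is hypothetical, and the p-value it yields need not equal the rounded value quoted in the abstract.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    total = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed
    # one (small tolerance guards against floating-point ties).
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_obs * (1 + 1e-9)
    )


# Counts taken from the abstract: ctDNA-positive / ctDNA-negative patients
# with progressing tumours (39 / 16) and with response or stable disease (4 / 20).
p_value = fisher_exact_two_sided(39, 16, 4, 20)
```

The negative counts (16 and 20) are derived here by subtraction from the group sizes given in the abstract.
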
15.
Indian J Microbiol ; 61(1): 24-30, 2021 Mar.
Article in English | MEDLINE | ID: mdl-33505089

ABSTRACT

Streptomycetes, Gram-positive bacteria with huge and GC-rich genomes, provide an ample example of codon usage bias taken to the extreme. In particular, the leucyl codon TTA is the rarest codon in all streptomycete genomes sequenced to date. It is present (usually once or twice) in 70-200 out of the 7000-8000 coding sequences that make up a typical streptomycete genome. tRNALeu UAA of streptomycetes, encoded by the bldA gene, has been shown to be present in mature form only after the onset of morphological differentiation and activation of secondary metabolism. Consequently, during the early stages of cell growth, the translation of genes carrying the TTA codon can be interrupted due to the absence of tRNALeu UAA. Several reports show that mutations of TTA to synonymous codons in certain genes indeed relieve their expression from bldA dependence. However, the deletion of bldA does not always arrest the expression of TTA-containing genes. In 2002, the nucleotides T/C downstream of TTA were suggested to favor TTA mistranslation. We tested this hypothesis using sizable datasets derived from individual Streptomyces genomes and a subset of TTA+ genes for secondary metabolism known for their active expression. Our results revealed nucleotide biases downstream of the NNA codon family, such as the preference for C and the avoidance of A. Yet, none of the observed biases was sufficient to claim a special case for the TTA codon. Hence, the issue of codon context and TTA codon mistranslation in Streptomyces deserves further elaboration.
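
The two quantities this abstract relies on, the number of in-frame TTA codons in a coding sequence and the nucleotide immediately downstream of each one, can be computed with a simple sketch like the one below. The function names are hypothetical, and the sketch assumes each input is a coding sequence given in frame from its start codon; it is not the authors' analysis pipeline.

```python
def count_in_frame_codon(cds, codon="TTA"):
    """Count occurrences of `codon` in reading frame 0 of a coding sequence."""
    cds = cds.upper()
    return sum(1 for i in range(0, len(cds) - 2, 3) if cds[i:i + 3] == codon)


def downstream_nucleotides(cds, codon="TTA"):
    """Return the nucleotide directly after each in-frame `codon`, if any."""
    cds = cds.upper()
    result = []
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] == codon and i + 3 < len(cds):
            result.append(cds[i + 3])
    return result


def tta_containing_genes(cds_by_name):
    """Map gene name -> in-frame TTA count, keeping only genes with >= 1 TTA."""
    counts = {name: count_in_frame_codon(seq) for name, seq in cds_by_name.items()}
    return {name: n for name, n in counts.items() if n > 0}
```

Tallying the output of `downstream_nucleotides` across many genes, and repeating this for other NNA codons, gives the kind of codon-context comparison the abstract describes.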

16.
J Big Data ; 8(1): 3, 2021.
Article in English | MEDLINE | ID: mdl-33489717

ABSTRACT

Knowledge graphs are a powerful concept for querying large amounts of data. These knowledge graphs are typically enormous and are often not easily accessible to end-users because they require specialized knowledge in query languages such as SPARQL. Moreover, end-users need a deep understanding of the structure of the underlying data models, which are often based on the Resource Description Framework (RDF). This drawback has led to the development of Question-Answering (QA) systems that enable end-users to express their information needs in natural language. While existing systems simplify user access, there is still room for improvement in the accuracy of these systems. In this paper we propose a new QA system for translating natural language questions into SPARQL queries. The key idea is to break up the translation process into 5 smaller, more manageable sub-tasks and use ensemble machine learning methods as well as Tree-LSTM-based neural network models to automatically learn and translate a natural language question into a SPARQL query. The performance of our proposed QA system is empirically evaluated using two renowned benchmarks: the 7th Question Answering over Linked Data Challenge (QALD-7) and the Large-Scale Complex Question Answering Dataset (LC-QuAD). Experimental results show that our QA system outperforms the state-of-the-art systems by 15% on the QALD-7 dataset and by 48% on the LC-QuAD dataset. In addition, we make our source code available.

17.
Front Bioinform ; 1: 685844, 2021.
Article in English | MEDLINE | ID: mdl-36303757

ABSTRACT

Short tandem repeats (STRs) are abundant in genomic sequences and are known for comparatively high mutation rates; STRs therefore are thought to be a potent source of genetic diversity. In protein-coding sequences STRs primarily encode disorder-promoting amino acids and are often located in intrinsically disordered regions (IDRs). STRs are frequently studied in the scope of microsatellite instability (MSI) in cancer, with little focus on the connection between protein STRs and IDRs. We believe, however, that this relationship should be explicitly included when ascertaining STR functionality in cancer. Here we explore this notion using all canonical human proteins from SwissProt, wherein we detected 3,699 STRs. Over 80% of these consisted completely of disorder-promoting amino acids. 62.1% of amino acids in STR sequences were predicted to also be in an IDR, compared to 14.2% for non-repeat sequences. Over-representation analysis showed STR-containing proteins to be primarily located in the nucleus where they perform protein- and nucleotide-binding functions and regulate gene expression. They were also enriched in cancer-related signaling pathways. Furthermore, we found enrichments of STR-containing proteins among those correlated with patient survival for cancers derived from eight different anatomical sites. Intriguingly, several of these cancer types are not known to have an MSI-high (MSI-H) phenotype, suggesting that protein STRs play a role in cancer pathology in non-MSI-H settings. Their intrinsic link with IDRs could therefore be an attractive topic of future research to further explore the role of STRs and IDRs in cancer. We speculate that our observations may be linked to the known dosage-sensitivity of disordered proteins, which could hint at a concentration-dependent gain-of-function mechanism in cancer for proteins containing STRs and IDRs.

18.
Front Bioinform ; 1: 691865, 2021.
Article in English | MEDLINE | ID: mdl-36303789

ABSTRACT

The Tandem Repeat Annotation Library (TRAL) focuses on analyzing tandem repeat units in genomic sequences. TRAL can integrate and harmonize tandem repeat annotations from a large number of external tools, and provides a statistical model for evaluating and filtering the detected repeats. TRAL version 2.0 includes new features such as a module for identifying repeats from circular profile hidden Markov models, a new repeat alignment method based on the progressive Poisson Indel Process, an improved installation procedure and a docker container. TRAL is an open-source Python 3 library and is available, together with documentation and tutorials via vital-it.ch/software/tral.

19.
Genes (Basel) ; 11(4)2020 04 09.
Article in English | MEDLINE | ID: mdl-32283633

ABSTRACT

Protein tandem repeats (TRs) are often associated with immunity-related functions and diseases. Since the last census of protein TRs in 1999, the number of curated proteins has increased more than seven-fold and new TR prediction methods have been published. TRs appear to be enriched with intrinsic disorder and vice versa. The significance and the biological reasons for this association are unknown. Here, we characterize protein TRs across all kingdoms of life and their overlap with intrinsic disorder in unprecedented detail. Using state-of-the-art prediction methods, we estimate that 50.9% of proteins contain at least one TR, often located at the sequence flanks. A positive linear correlation between the proportion of TRs and the protein length was observed universally, with Eukaryotes in general having more TRs, but when the difference in length is taken into account the difference is quite small. TRs were enriched with disorder-promoting amino acids and were inside intrinsically disordered regions. Many such TRs were homorepeats. Our results support that TRs mostly originate by duplication and are involved in essential functions such as transcription processes, structural organization, electron transport and iron-binding. In viruses, TRs are found in proteins essential for virulence.


Subjects
Intrinsically Disordered Proteins/chemistry; Proteins/chemistry; Proteome/chemistry; Tandem Repeat Sequences; Animals; Humans; Protein Conformation
20.
NAR Genom Bioinform ; 2(4): lqaa092, 2020 Dec.
Article in English | MEDLINE | ID: mdl-33575636

ABSTRACT

Recently we presented a frequentist dynamic programming (DP) approach for multiple sequence alignment based on an explicit model of indel evolution, the Poisson Indel Process (PIP). This phylogeny-aware approach produces evolutionarily meaningful gap patterns and is robust to the 'over-alignment' bias. Despite linear time complexity for the computation of marginal likelihoods, the overall method's complexity is cubic in sequence length. Inspired by the popular aligner MAFFT, we propose a new technique to accelerate the evolutionary indel-based alignment. Amino acid sequences are converted to sequences representing their physicochemical properties, and homologous blocks are identified by multi-scale short-time Fourier transform. Three three-dimensional DP matrices are then created under PIP, with homologous blocks defining sparse structures where most cells are excluded from the calculations. The homologous blocks are connected through intermediate 'linking blocks'. The homologous and linking blocks are aligned under PIP as independent DP sub-matrices and their tracebacks merged to yield the final alignment. The new algorithm can largely profit from parallel computing, yielding a theoretical speed-up estimated to be proportional to the cubic power of the number of sub-blocks in the DP matrices. We compare the new method to the original PIP approach and demonstrate it on real data.
