Pesquisa | BVS CLAP/SMR-OPAS/OMS

1.

On the Impact of the Data Acquisition Protocol on ECG Biometric Identification.

Ramos, Mariana S; Carvalho, João M; Pinho, Armando J; Brás, Susana.

Sensors (Basel) ; 21(14)2021 Jul 07.

Artigo em Inglês | MEDLINE | ID: mdl-34300385

RESUMO

Electrocardiographic (ECG) signals have been used for clinical purposes for a long time. Notwithstanding, they may also be used as the input for a biometric identification system. Several studies, as well as some prototypes, are already based on this principle. One of the methods already used for biometric identification relies on a measure of similarity based on the Kolmogorov Complexity, called the Normalized Relative Compression (NRC)-this approach evaluates the similarity between two ECG segments without the need to delineate the signal wave. This methodology is the basis of the present work. We have collected a dataset of ECG signals from twenty participants on two different sessions, making use of three different kits simultaneously-one of them using dry electrodes, placed on their fingers; the other two using wet sensors placed on their wrists and chests. The aim of this work was to study the influence of the ECG protocol collection, regarding the biometric identification system's performance. Several variables in the data acquisition are not controllable, so some of them will be inspected to understand their influence in the system. Movement, data collection point, time interval between train and test datasets and ECG segment duration are examples of variables that may affect the system, and they are studied in this paper. Through this study, it was concluded that this biometric identification system needs at least 10 s of data to guarantee that the system learns the essential information. It was also observed that "off-the-person" data acquisition led to a better performance over time, when compared to "on-the-person" places.

Assuntos

Identificação Biométrica , Compressão de Dados , Algoritmos , Eletrocardiografia , Dedos , Humanos , Processamento de Sinais Assistido por Computador

2.

AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models.

Silva, Milton; Pratas, Diogo; Pinho, Armando J.

Entropy (Basel) ; 23(5)2021 Apr 26.

Artigo em Inglês | MEDLINE | ID: mdl-33925812

RESUMO

Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, triggering emergent challenges of increasing importance, namely efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, in the literature, the number of specific protein sequence compressors is relatively low. Moreover, these specialized compressors marginally improve the compression ratio over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach and individual cache-hash memory models to the highest-context orders. Compared to the previous compressor (AC), we show gains of 2-9% and 6-7% in reference-free and reference-based modes, respectively. These gains come at the cost of three times slower computations. AC2 also improves memory usage against AC, with requirements about seven times lower, without being affected by the sequences' input size. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence with each viral protein sequence from the whole UniProt database. The results consistently show higher similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing with critical results to a current controversial subject. AC2 is available for free download under GPLv3 license.

3.

Cryfa: a secure encryption tool for genomic data.

Hosseini, Morteza; Pratas, Diogo; Pinho, Armando J.

Bioinformatics ; 35(1): 146-148, 2019 01 01.

Artigo em Inglês | MEDLINE | ID: mdl-30020420

RESUMO

Summary: The ever-increasing growth of high-throughput sequencing technologies has led to a great acceleration of medical and biological research and discovery. As these platforms advance, the amount of information for diverse genomes increases at unprecedented rates. Confidentiality, integrity and authenticity of such genomic information should be ensured due to its extremely sensitive nature. In this paper, we propose Cryfa, a fast secure encryption tool for genomic data, namely in Fasta, Fastq, VCF, SAM and BAM formats, which is also capable of reducing the storage size of Fasta and Fastq files. Cryfa uses advanced encryption standard (AES) encryption combined with a shuffling mechanism, which leads to a substantial enhancement of the security against low data complexity attacks. Compared to AES Crypt, a general-purpose encryption tool, Cryfa is an industry-oriented tool, which is able to provide confidentiality, integrity and authenticity of data at four times more speed; in addition, it can reduce the file sizes to 1/3. Due to the absence of a method similar to Cryfa, we have simulated its behavior with a combination of encryption and compression tools, for comparison purpose. For instance, our tool is nine times faster than its fastest competitor in Fasta files. Also, Cryfa has a very low memory usage (only a few megabytes), which makes it feasible to run on any computer. Availability and implementation: Source codes and binaries are available, under GPLv3, at https://github.com/pratas/cryfa. Supplementary information: Supplementary data are available at Bioinformatics online.

Assuntos

Compressão de Dados , Genômica , Sequenciamento de Nucleotídeos em Larga Escala , Software , Biologia Computacional

4.

Multimodal Emotion Evaluation: A Physiological Model for Cost-Effective Emotion Classification.

Pinto, Gisela; Carvalho, João M; Barros, Filipa; Soares, Sandra C; Pinho, Armando J; Brás, Susana.

Sensors (Basel) ; 20(12)2020 Jun 21.

Artigo em Inglês | MEDLINE | ID: mdl-32575894

RESUMO

Emotional responses are associated with distinct body alterations and are crucial to foster adaptive responses, well-being, and survival. Emotion identification may improve peoples' emotion regulation strategies and interaction with multiple life contexts. Several studies have investigated emotion classification systems, but most of them are based on the analysis of only one, a few, or isolated physiological signals. Understanding how informative the individual signals are and how their combination works would allow to develop more cost-effective, informative, and objective systems for emotion detection, processing, and interpretation. In the present work, electrocardiogram, electromyogram, and electrodermal activity were processed in order to find a physiological model of emotions. Both a unimodal and a multimodal approach were used to analyze what signal, or combination of signals, may better describe an emotional response, using a sample of 55 healthy subjects. The method was divided in: (1) signal preprocessing; (2) feature extraction; (3) classification using random forest and neural networks. Results suggest that the electrocardiogram (ECG) signal is the most effective for emotion classification. Yet, the combination of all signals provides the best emotion identification performance, with all signals providing crucial information for the system. This physiological model of emotions has important research and clinical implications, by providing valuable information about the value and weight of physiological signals for emotional classification, which can critically drive effective evaluation, monitoring and intervention, regarding emotional processing and regulation, considering multiple contexts.

Assuntos

Emoções/fisiologia , Modelos Biológicos , Redes Neurais de Computação , Análise Custo-Benefício , Eletrocardiografia , Eletromiografia , Humanos

5.

Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes.

Pratas, Diogo; Silva, Raquel M; Pinho, Armando J.

Entropy (Basel) ; 20(6)2018 May 23.

Artigo em Inglês | MEDLINE | ID: mdl-33265483

RESUMO

An efficient DNA compressor furnishes an approximation to measure and compare information quantities present in, between and across DNA sequences, regardless of the characteristics of the sources. In this paper, we compare directly two information measures, the Normalized Compression Distance (NCD) and the Normalized Relative Compression (NRC). These measures answer different questions; the NCD measures how similar both strings are (in terms of information content) and the NRC (which, in general, is nonsymmetric) indicates the fraction of one of them that cannot be constructed using information from the other one. This leads to the problem of finding out which measure (or question) is more suitable for the answer we need. For computing both, we use a state of the art DNA sequence compressor that we benchmark with some top compressors in different compression modes. Then, we apply the compressor on DNA sequences with different scales and natures, first using synthetic sequences and then on real DNA sequences. The last include mitochondrial DNA (mtDNA), messenger RNA (mRNA) and genomic DNA (gDNA) of seven primates. We provide several insights into evolutionary acceleration rates at different scales, namely, the observation and confirmation across the whole genomes of a higher variation rate of the mtDNA relative to the gDNA. We also show the importance of relative compression for localizing similar information regions using mtDNA.

6.

Three minimal sequences found in Ebola virus genomes and absent from human DNA.

Silva, Raquel M; Pratas, Diogo; Castro, Luísa; Pinho, Armando J; Ferreira, Paulo J S G.

Bioinformatics ; 31(15): 2421-5, 2015 Aug 01.

Artigo em Inglês | MEDLINE | ID: mdl-25840045

RESUMO

MOTIVATION: Ebola virus causes high mortality hemorrhagic fevers, with more than 25 000 cases and 10 000 deaths in the current outbreak. Only experimental therapies are available, thus, novel diagnosis tools and druggable targets are needed. RESULTS: Analysis of Ebola virus genomes from the current outbreak reveals the presence of short DNA sequences that appear nowhere in the human genome. We identify the shortest such sequences with lengths between 12 and 14. Only three absent sequences of length 12 exist and they consistently appear at the same location on two of the Ebola virus proteins, in all Ebola virus genomes, but nowhere in the human genome. The alignment-free method used is able to identify pathogen-specific signatures for quick and precise action against infectious agents, of which the current Ebola virus outbreak provides a compelling example.

Assuntos

DNA Viral/química , Ebolavirus/genética , Surtos de Doenças , Genoma Humano , Genoma Viral , Doença pelo Vírus Ebola/epidemiologia , Doença pelo Vírus Ebola/virologia , Humanos , Análise de Sequência de DNA , Proteínas Virais/genética

7.

MFCompress: a compression tool for FASTA and multi-FASTA data.

Pinho, Armando J; Pratas, Diogo.

Bioinformatics ; 30(1): 117-8, 2014 Jan 01.

Artigo em Inglês | MEDLINE | ID: mdl-24132931

RESUMO

MOTIVATION: The data deluge phenomenon is becoming a serious problem in most genomic centers. To alleviate it, general purpose tools, such as gzip, are used to compress the data. However, although pervasive and easy to use, these tools fall short when the intention is to reduce as much as possible the data, for example, for medium- and long-term storage. A number of algorithms have been proposed for the compression of genomics data, but unfortunately only a few of them have been made available as usable and reliable compression tools. RESULTS: In this article, we describe one such tool, MFCompress, specially designed for the compression of FASTA and multi-FASTA files. In comparison to gzip and applied to multi-FASTA files, MFCompress can provide additional average compression gains of almost 50%, i.e. it potentially doubles the available storage, although at the cost of some more computation time. On highly redundant datasets, and in comparison with gzip, 8-fold size reductions have been obtained. AVAILABILITY: Both source code and binaries for several operating systems are freely available for non-commercial use at http://bioinformatics.ua.pt/software/mfcompress/.

Assuntos

Compressão de Dados , Software , Algoritmos , Genoma , Genômica , Humanos , Cadeias de Markov

8.

GReEn: a tool for efficient compression of genome resequencing data.

Pinho, Armando J; Pratas, Diogo; Garcia, Sara P.

Nucleic Acids Res ; 40(4): e27, 2012 Feb.

Artigo em Inglês | MEDLINE | ID: mdl-22139935

RESUMO

Research in the genomic sciences is confronted with the volume of sequencing and resequencing data increasing at a higher pace than that of data storage and communication resources, shifting a significant part of research budgets from the sequencing component of a project to the computational one. Hence, being able to efficiently store sequencing and resequencing data is a problem of paramount importance. In this article, we describe GReEn (Genome Resequencing Encoding), a tool for compressing genome resequencing data using a reference genome sequence. It overcomes some drawbacks of the recently proposed tool GRS, namely, the possibility of compressing sequences that cannot be handled by GRS, faster running times and compression gains of over 100-fold for some sequences. This tool is freely available for non-commercial use at ftp://ftp.ieeta.pt/~ap/codecs/GReEn1.tar.gz.

Assuntos

Compressão de Dados , Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala , Análise de Sequência de DNA , Software , Genoma Humano , Genoma de Planta , Humanos

9.

The breakdown of the word symmetry in the human genome.

Afreixo, Vera; Bastos, Carlos A C; Garcia, Sara P; Rodrigues, João M O S; Pinho, Armando J; Ferreira, Paulo J S G.

J Theor Biol ; 335: 153-9, 2013 Oct 21.

Artigo em Inglês | MEDLINE | ID: mdl-23831271

RESUMO

Previous studies have suggested that Chargaff's second rule may hold for relatively long words (above 10nucleotides), but this has not been conclusively shown. In particular, the following questions remain open: Is the phenomenon of symmetry statistically significant? If so, what is the word length above which significance is lost? Can deviations in symmetry due to the finite size of the data be identified? This work addresses these questions by studying word symmetries in the human genome, chromosomes and transcriptome. To rule out finite-length effects, the results are compared with those obtained from random control sequences built to satisfy Chargaff's second parity rule. We use several techniques to evaluate the phenomenon of symmetry, including Pearson's correlation coefficient, total variational distance, a novel word symmetry distance, as well as traditional and equivalence statistical tests. We conclude that word symmetries are statistical significant in the human genome for word lengths up to 6nucleotides. For longer words, we present evidence that the phenomenon may not be as prevalent as previously thought.

Assuntos

Cromossomos Humanos/genética , Genoma Humano/fisiologia , Modelos Genéticos , Cromossomos Humanos/metabolismo , Humanos , Transcriptoma/fisiologia

10.

Concentration of inverted repeats along human DNA.

Bastos, Carlos A C; Afreixo, Vera; Rodrigues, João M O S; Pinho, Armando J.

J Integr Bioinform ; 20(2)2023 Jun 01.

Artigo em Inglês | MEDLINE | ID: mdl-37486620

RESUMO

This work aims to describe the observed enrichment of inverted repeats in the human genome; and to identify and describe, with detailed length profiles, the regions with significant and relevant enriched occurrence of inverted repeats. The enrichment is assessed and tested with a recently proposed measure (z-scores based measure). We simulate a genome using an order 7 Markov model trained with the data from the real genome. The simulated genome is used to establish the critical values which are used as decision thresholds to identify the regions with significant enriched concentrations. Several human genome regions are highly enriched in the occurrence of inverted repeats. This is observed in all the human chromosomes. The distribution of inverted repeat lengths varies along the genome. The majority of the regions with severely exaggerated enrichment contain mainly short length inverted repeats. There are also regions with regular peaks along the inverted repeats lengths distribution (periodic regularities) and other regions with exaggerated enrichment for long lengths (less frequent). However, adjacent regions tend to have similar distributions.

11.

AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data.

Silva, Jorge M; Qi, Weihong; Pinho, Armando J; Pratas, Diogo.

Gigascience ; 122022 Dec 28.

Artigo em Inglês | MEDLINE | ID: mdl-38091509

RESUMO

BACKGROUND: Low-complexity data analysis is the area that addresses the search and quantification of regions in sequences of elements that contain low-complexity or repetitive elements. For example, these can be tandem repeats, inverted repeats, homopolymer tails, GC-biased regions, similar genes, and hairpins, among many others. Identifying these regions is crucial because of their association with regulatory and structural characteristics. Moreover, their identification provides positional and quantity information where standard assembly methodologies face significant difficulties because of substantial higher depth coverage (mountains), ambiguous read mapping, or where sequencing or reconstruction defects may occur. However, the capability to distinguish low-complexity regions (LCRs) in genomic and proteomic sequences is a challenge that depends on the model's ability to find them automatically. Low-complexity patterns can be implicit through specific or combined sources, such as algorithmic or probabilistic, and recurring to different spatial distances-namely, local, medium, or distant associations. FINDINGS: This article addresses the challenge of automatically modeling and distinguishing LCRs, providing a new method and tool (AlcoR) for efficient and accurate segmentation and visualization of these regions in genomic and proteomic sequences. The method enables the use of models with different memories, providing the ability to distinguish local from distant low-complexity patterns. The method is reference and alignment free, providing additional methodologies for testing, including a highly flexible simulation method for generating biological sequences (DNA or protein) with different complexity levels, sequence masking, and a visualization tool for automatic computation of the LCR maps into an ideogram style. We provide illustrative demonstrations using synthetic, nearly synthetic, and natural sequences showing the high efficiency and accuracy of AlcoR. As large-scale results, we use AlcoR to unprecedentedly provide a whole-chromosome low-complexity map of a recent complete human genome and the haplotype-resolved chromosome pairs of a heterozygous diploid African cassava cultivar. CONCLUSIONS: The AlcoR method provides the ability of fast sequence characterization through data complexity analysis, ideally for scenarios entangling the presence of new or unknown sequences. AlcoR is implemented in C language using multithreading to increase the computational speed, is flexible for multiple applications, and does not contain external dependencies. The tool accepts any sequence in FASTA format. The source code is freely provided at https://github.com/cobilab/alcor.

12.

Genome analysis with distance to the nearest dissimilar nucleotide.

Afreixo, Vera; Bastos, Carlos A C; Pinho, Armando J; Garcia, Sara P; Ferreira, Paulo J S G.

J Theor Biol ; 275(1): 52-8, 2011 Apr 21.

Artigo em Inglês | MEDLINE | ID: mdl-21295040

RESUMO

DNA may be represented by sequences of four symbols, but it is often useful to convert those symbols into real or complex numbers for further analysis. Several mapping schemes have been used in the past, but most of them seem to be unrelated to any intrinsic characteristic of DNA. The objective of this work was to study a mapping scheme that is directly related to DNA characteristics, and that could be useful in discriminating between different species. Recently, we have proposed a methodology based on the inter-nucleotide distance, which proved to contribute to the discrimination among species. In this paper, we introduce a new distance, the distance to the nearest dissimilar nucleotide, which is the distance of a nucleotide to first occurrence of a different nucleotide. This distance is related to the repetition structure of single nucleotides. Using the information resulting from the concatenation of the distance to the nearest dissimilar and the inter-nucleotide distance, we found that this new distance brings additional discriminative capabilities. This suggests that the distance to the nearest dissimilar nucleotide might contribute with useful information about the evolution of the species.

Assuntos

Genoma/genética , Modelos Genéticos , Nucleotídeos/genética , Animais , Sequência de Bases , Humanos , Dados de Sequência Molecular , Filogenia , Alinhamento de Sequência , Especificidade da Espécie

13.

Genome analysis with inter-nucleotide distances.

Afreixo, Vera; Bastos, Carlos A C; Pinho, Armando J; Garcia, Sara P; Ferreira, Paulo J S G.

Bioinformatics ; 25(23): 3064-70, 2009 Dec 01.

Artigo em Inglês | MEDLINE | ID: mdl-19759198

RESUMO

MOTIVATION: DNA sequences can be represented by sequences of four symbols, but it is often useful to convert the symbols into real or complex numbers for further analysis. Several mapping schemes have been used in the past, but they seem unrelated to any intrinsic characteristic of DNA. The objective of this work was to find a mapping scheme directly related to DNA characteristics and that would be useful in discriminating between different species. Mathematical models to explore DNA correlation structures may contribute to a better knowledge of the DNA and to find a concise DNA description. RESULTS: We developed a methodology to process DNA sequences based on inter-nucleotide distances. Our main contribution is a method to obtain genomic signatures for complete genomes, based on the inter-nucleotide distances, that are able to discriminate between different species. Using these signatures and hierarchical clustering, it is possible to build phylogenetic trees. Phylogenetic trees lead to genome differentiation and allow the inference of phylogenetic relations. The phylogenetic trees generated in this work display related species close to each other, suggesting that the inter-nucleotide distances are able to capture essential information about the genomes. To create the genomic signature, we construct a vector which describes the inter-nucleotide distance distribution of a complete genome and compare it with the reference distance distribution, which is the distribution of a sequence where the nucleotides are placed randomly and independently. It is the residual or relative error between the data and the reference distribution that is used to compare the DNA sequences of different organisms.

Assuntos

DNA/química , Genoma , Genômica/métodos , Nucleotídeos/química , Análise de Sequência de DNA/métodos , Algoritmos , Sequência de Bases , Filogenia

14.

Efficient DNA sequence compression with neural networks.

Silva, Milton; Pratas, Diogo; Pinho, Armando J.

Gigascience ; 9(11)2020 11 11.

Artigo em Inglês | MEDLINE | ID: mdl-33179040

RESUMO

BACKGROUND: The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for DNA sequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models. FINDINGS: We benchmark GeCo3 as a reference-free DNA compressor in 5 datasets, including a balanced and comprehensive dataset of DNA sequences, the Y-chromosome and human mitogenome, 2 compilations of archaeal and virus genomes, 4 whole genomes, and 2 collections of FASTQ data of a human virome and ancient DNA. GeCo3 achieves a solid improvement in compression over the previous version (GeCo2) of $2.4\%$, $7.1\%$, $6.1\%$, $5.8\%$, and $6.0\%$, respectively. To test its performance as a reference-based DNA compressor, we benchmark GeCo3 in 4 datasets constituted by the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves the compression in $12.4\%$, $11.7\%$, $10.8\%$, and $10.1\%$ over the state of the art. The cost of this compression improvement is some additional computational time (1.7-3 times slower than GeCo2). The RAM use is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state of the art. CONCLUSIONS: GeCo3 is a genomic sequence compressor with a neural network mixing approach that provides additional gains over top specific genomic compressors. The proposed mixing method is portable, requiring only the probabilities of the models as inputs, providing easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 is released under GPLv3 and is available for free download at https://github.com/cobilab/geco3.

Assuntos

Sequenciamento de Nucleotídeos em Larga Escala , Software , Algoritmos , Sequência de Bases , Redes Neurais de Computação , Análise de Sequência de DNA

15.

Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements.

Hosseini, Morteza; Pratas, Diogo; Morgenstern, Burkhard; Pinho, Armando J.

Gigascience ; 9(5)2020 05 01.

Artigo em Inglês | MEDLINE | ID: mdl-32432328

RESUMO

BACKGROUND: The development of high-throughput sequencing technologies and, as its result, the production of huge volumes of genomic data, has accelerated biological and medical research and discovery. Study on genomic rearrangements is crucial owing to their role in chromosomal evolution, genetic disorders, and cancer. RESULTS: We present Smash++, an alignment-free and memory-efficient tool to find and visualize small- and large-scale genomic rearrangements between 2 DNA sequences. This computational solution extracts information contents of the 2 sequences, exploiting a data compression technique to find rearrangements. We also present Smash++ visualizer, a tool that allows the visualization of the detected rearrangements along with their self- and relative complexity, by generating an SVG (Scalable Vector Graphics) image. CONCLUSIONS: Tested on several synthetic and real DNA sequences from bacteria, fungi, Aves, and Mammalia, the proposed tool was able to accurately find genomic rearrangements. The detected regions were in accordance with previous studies, which took alignment-based approaches or performed FISH (fluorescence in situ hybridization) analysis. The maximum peak memory usage among all experiments was â¼1 GB, which makes Smash++ feasible to run on present-day standard computers.

Assuntos

Biologia Computacional/métodos , Genômica/métodos , Software , Algoritmos , Rearranjo Gênico , Genoma , Sequenciamento de Nucleotídeos em Larga Escala , Análise de Sequência de DNA/métodos

16.

On finding minimal absent words.

Pinho, Armando J; Ferreira, Paulo J S G; Garcia, Sara P; Rodrigues, João M O S.

BMC Bioinformatics ; 10: 137, 2009 May 08.

Artigo em Inglês | MEDLINE | ID: mdl-19426495

RESUMO

BACKGROUND: The problem of finding the shortest absent words in DNA data has been recently addressed, and algorithms for its solution have been described. It has been noted that longer absent words might also be of interest, but the existing algorithms only provide generic absent words by trivially extending the shortest ones. RESULTS: We show how absent words relate to the repetitions and structure of the data, and define a new and larger class of absent words, called minimal absent words, that still captures the essential properties of the shortest absent words introduced in recent works. The words of this new class are minimal in the sense that if their leftmost or rightmost character is removed, then the resulting word is no longer an absent word. We describe an algorithm for generating minimal absent words that, in practice, runs in approximately linear time. An implementation of this algorithm is publicly available at ftp://www.ieeta.pt/~ap/maws. CONCLUSION: Because the set of minimal absent words that we propose is much larger than the set of the shortest absent words, it is potentially more useful for applications that require a richer variety of absent words. Nevertheless, the number of minimal absent words is still manageable since it grows at most linearly with the string size, unlike generic absent words that grow exponentially. Both the algorithm and the concepts upon which it depends shed additional light on the structure of absent words and complement the existing studies on the topic.

Assuntos

Algoritmos , Sequência de Bases , DNA/química , Genômica/métodos , Análise de Sequência de DNA/métodos , Bases de Dados de Ácidos Nucleicos

17.

AC: A Compression Tool for Amino Acid Sequences.

Hosseini, Morteza; Pratas, Diogo; Pinho, Armando J.

Interdiscip Sci ; 11(1): 68-76, 2019 Mar.

Artigo em Inglês | MEDLINE | ID: mdl-30721401

RESUMO

Advancement of protein sequencing technologies has led to the production of a huge volume of data that needs to be stored and transmitted. This challenge can be tackled by compression. In this paper, we propose AC, a state-of-the-art method for lossless compression of amino acid sequences. The proposed method works based on the cooperation between finite-context models and substitutional tolerant Markov models. Compared to several general-purpose and specific-purpose protein compressors, AC provides the best bit-rates. This method can also compress the sequences nine times faster than its competitor, paq8l. In addition, employing AC, we analyze the compressibility of a large number of sequences from different domains. The results show that viruses are the most difficult sequences to be compressed. Archaea and bacteria are the second most difficult ones, and eukaryota are the easiest sequences to be compressed.

Assuntos

Algoritmos , Compressão de Dados , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Software , Sequência de Aminoácidos , Cadeias de Markov

18.

Distribution of Distances Between Symmetric Words in the Human Genome: Analysis of Regular Peaks.

Bastos, Carlos A C; Afreixo, Vera; Rodrigues, João M O S; Pinho, Armando J; Silva, Raquel M.

Interdiscip Sci ; 11(3): 367-372, 2019 Sep.

Artigo em Inglês | MEDLINE | ID: mdl-30911903

RESUMO

Finding DNA sites with high potential for the formation of hairpin/cruciform structures is an important task. Previous works studied the distances between adjacent reversed complement words (symmetric word pairs) and also for non-adjacent words. It was observed that for some words a few distances were favoured (peaks) and that in some distributions there was strong peak regularity. The present work extends previous studies, by improving the detection and characterization of peak regularities in the symmetric word pairs distance distributions of the human genome. This work also analyzes the location of the sequences that originate the observed strong peak periodicity in the distance distribution. The results obtained in this work may indicate genomic sites with potential for the formation of hairpin/cruciform structures.

Assuntos

DNA/química , Genoma Humano , Algoritmos , Cromossomos Humanos , Bases de Dados Genéticas , Genômica , Humanos , Modelos Genéticos , Conformação de Ácido Nucleico , Análise de Sequência de DNA/métodos , Software

19.

Biometric and Emotion Identification: An ECG Compression Based Method.

Brás, Susana; Ferreira, Jacqueline H T; Soares, Sandra C; Pinho, Armando J.

Front Psychol ; 9: 467, 2018.

Artigo em Inglês | MEDLINE | ID: mdl-29670564

RESUMO

We present an innovative and robust solution to both biometric and emotion identification using the electrocardiogram (ECG). The ECG represents the electrical signal that comes from the contraction of the heart muscles, indirectly representing the flow of blood inside the heart, it is known to convey a key that allows biometric identification. Moreover, due to its relationship with the nervous system, it also varies as a function of the emotional state. The use of information-theoretic data models, associated with data compression algorithms, allowed to effectively compare ECG records and infer the person identity, as well as emotional state at the time of data collection. The proposed method does not require ECG wave delineation or alignment, which reduces preprocessing error. The method is divided into three steps: (1) conversion of the real-valued ECG record into a symbolic time-series, using a quantization process; (2) conditional compression of the symbolic representation of the ECG, using the symbolic ECG records stored in the database as reference; (3) identification of the ECG record class, using a 1-NN (nearest neighbor) classifier. We obtained over 98% of accuracy in biometric identification, whereas in emotion recognition we attained over 90%. Therefore, the method adequately identify the person, and his/her emotion. Also, the proposed method is flexible and may be adapted to different problems, by the alteration of the templates for training the model.

20.

Metagenomic Composition Analysis of an Ancient Sequenced Polar Bear Jawbone from Svalbard.

Pratas, Diogo; Hosseini, Morteza; Grilo, Gonçalo; Pinho, Armando J; Silva, Raquel M; Caetano, Tânia; Carneiro, João; Pereira, Filipe.

Genes (Basel) ; 9(9)2018 Sep 06.

Artigo em Inglês | MEDLINE | ID: mdl-30200636

RESUMO

The sequencing of ancient DNA samples provides a novel way to find, characterize, and distinguish exogenous genomes of endogenous targets. After sequencing, computational composition analysis enables filtering of undesired sources in the focal organism, with the purpose of improving the quality of assemblies and subsequent data analysis. More importantly, such analysis allows extinct and extant species to be identified without requiring a specific or new sequencing run. However, the identification of exogenous organisms is a complex task, given the nature and degradation of the samples, and the evident necessity of using efficient computational tools, which rely on algorithms that are both fast and highly sensitive. In this work, we relied on a fast and highly sensitive tool, FALCON-meta, which measures similarity against whole-genome reference databases, to analyse the metagenomic composition of an ancient polar bear (Ursus maritimus) jawbone fossil. The fossil was collected in Svalbard, Norway, and has an estimated age of 110,000 to 130,000 years. The FASTQ samples contained 349 GB of nonamplified shotgun sequencing data. We identified and localized, relative to the FASTQ samples, the genomes with significant similarities to reference microbial genomes, including those of viruses, bacteria, and archaea, and to fungal, mitochondrial, and plastidial sequences. Among other striking features, we found significant similarities between modern-human, some bacterial and viral sequences (contamination) and the organelle sequences of wild carrot and tomato relative to the whole samples. For each exogenous candidate, we ran a damage pattern analysis, which in addition to revealing shallow levels of damage in the plant candidates, identified the source as contamination.

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA