Pesquisa | BVS Integralidade em Saúde

BIOMAPP::CHIP: large-scale motif analysis.

Garbelini, Jader M Caldonazzo; Sanches, Danilo S; Pozo, Aurora T Ramirez.

BMC Bioinformatics ; 25(1): 128, 2024 Mar 26.

Artigo em Inglês | MEDLINE | ID: mdl-38528492

RESUMO

BACKGROUND: Discovery biological motifs plays a fundamental role in understanding regulatory mechanisms. Computationally, they can be efficiently represented as kmers, making the counting of these elements a critical aspect for ensuring not only the accuracy but also the efficiency of the analytical process. This is particularly useful in scenarios involving large data volumes, such as those generated by the ChIP-seq protocol. Against this backdrop, we introduce BIOMAPP::CHIP, a tool specifically designed to optimize the discovery of biological motifs in large data volumes. RESULTS: We conducted a comprehensive set of comparative tests with state-of-the-art algorithms. Our analyses revealed that BIOMAPP::CHIP outperforms existing approaches in various metrics, excelling both in terms of performance and accuracy. The tests demonstrated a higher detection rate of significant motifs and also greater agility in the execution of the algorithm. Furthermore, the SMT component played a vital role in the system's efficiency, proving to be both agile and accurate in kmer counting, which in turn improved the overall efficacy of our tool. CONCLUSION: BIOMAPP::CHIP represent real advancements in the discovery of biological motifs, particularly in large data volume scenarios, offering a relevant alternative for the analysis of ChIP-seq data and have the potential to boost future research in the field. This software can be found at the following address: (https://github.com/jadermcg/biomapp-chip).

Assuntos

Algoritmos , Software , Análise de Sequência de DNA/métodos , Imunoprecipitação da Cromatina/métodos , Sítios de Ligação , Motivos de Nucleotídeos

MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors.

Bonidia, Robson P; Domingues, Douglas S; Sanches, Danilo S; de Carvalho, André C P L F.

Brief Bioinform ; 23(1)2022 01 17.

Artigo em Inglês | MEDLINE | ID: mdl-34750626

RESUMO

One of the main challenges in applying machine learning algorithms to biological sequence data is how to numerically represent a sequence in a numeric input vector. Feature extraction techniques capable of extracting numerical information from biological sequences have been reported in the literature. However, many of these techniques are not available in existing packages, such as mathematical descriptors. This paper presents a new package, MathFeature, which implements mathematical descriptors able to extract relevant numerical information from biological sequences, i.e. DNA, RNA and proteins (prediction of structural features along the primary sequence of amino acids). MathFeature makes available 20 numerical feature extraction descriptors based on approaches found in the literature, e.g. multiple numeric mappings, genomic signal processing, chaos game theory, entropy and complex networks. MathFeature also allows the extraction of alternative features, complementing the existing packages. To ensure that our descriptors are robust and to assess their relevance, experimental results are presented in nine case studies. According to these results, the features extracted by MathFeature showed high performance (0.6350-0.9897, accuracy), both applying only mathematical descriptors, but also hybridization with well-known descriptors in the literature. Finally, through MathFeature, we overcame several studies in eight benchmark datasets, exemplifying the robustness and viability of the proposed package. MathFeature has advanced in the area by bringing descriptors not available in other packages, as well as allowing non-experts to use feature extraction techniques.

Assuntos

Proteínas , RNA , Algoritmos , Sequência de Aminoácidos , DNA/genética , Aprendizado de Máquina , Proteínas/química , RNA/genética

BioAutoML: automated feature engineering and metalearning to predict noncoding RNAs in bacteria.

Bonidia, Robson P; Santos, Anderson P Avila; de Almeida, Breno L S; Stadler, Peter F; da Rocha, Ulisses N; Sanches, Danilo S; de Carvalho, André C P L F.

Brief Bioinform ; 23(4)2022 07 18.

Artigo em Inglês | MEDLINE | ID: mdl-35753697

RESUMO

Recent technological advances have led to an exponential expansion of biological sequence data and extraction of meaningful information through Machine Learning (ML) algorithms. This knowledge has improved the understanding of mechanisms related to several fatal diseases, e.g. Cancer and coronavirus disease 2019, helping to develop innovative solutions, such as CRISPR-based gene editing, coronavirus vaccine and precision medicine. These advances benefit our society and economy, directly impacting people's lives in various areas, such as health care, drug discovery, forensic analysis and food processing. Nevertheless, ML-based approaches to biological data require representative, quantitative and informative features. Many ML algorithms can handle only numerical data, and therefore sequences need to be translated into a numerical feature vector. This process, known as feature extraction, is a fundamental step for developing high-quality ML-based models in bioinformatics, by allowing the feature engineering stage, with design and selection of suitable features. Feature engineering, ML algorithm selection and hyperparameter tuning are often manual and time-consuming processes, requiring extensive domain knowledge. To deal with this problem, we present a new package: BioAutoML. BioAutoML automatically runs an end-to-end ML pipeline, extracting numerical and informative features from biological sequence databases, using the MathFeature package, and automating the feature selection, ML algorithm(s) recommendation and tuning of the selected algorithm(s) hyperparameters, using Automated ML (AutoML). BioAutoML has two components, divided into four modules: (1) automated feature engineering (feature extraction and selection modules) and (2) Metalearning (algorithm recommendation and hyper-parameter tuning modules). We experimentally evaluate BioAutoML in two different scenarios: (i) prediction of the three main classes of noncoding RNAs (ncRNAs) and (ii) prediction of the eight categories of ncRNAs in bacteria, including housekeeping and regulatory types. To assess BioAutoML predictive performance, it is experimentally compared with two other AutoML tools (RECIPE and TPOT). According to the experimental results, BioAutoML can accelerate new studies, reducing the cost of feature engineering processing and either keeping or improving predictive performance. BioAutoML is freely available at https://github.com/Bonidia/BioAutoML.

Assuntos

Vacinas contra COVID-19 , COVID-19 , Algoritmos , Bactérias/genética , Humanos , Aprendizado de Máquina

BioDeepfuse: a hybrid deep learning approach with integrated feature extraction techniques for enhanced non-coding RNA classification.

Avila Santos, Anderson P; de Almeida, Breno L S; Bonidia, Robson P; Stadler, Peter F; Stefanic, Polonca; Mandic-Mulec, Ines; Rocha, Ulisses; Sanches, Danilo S; de Carvalho, André C P L F.

RNA Biol ; 21(1): 1-12, 2024 Jan.

Artigo em Inglês | MEDLINE | ID: mdl-38528797

RESUMO

The accurate classification of non-coding RNA (ncRNA) sequences is pivotal for advanced non-coding genome annotation and analysis, a fundamental aspect of genomics that facilitates understanding of ncRNA functions and regulatory mechanisms in various biological processes. While traditional machine learning approaches have been employed for distinguishing ncRNA, these often necessitate extensive feature engineering. Recently, deep learning algorithms have provided advancements in ncRNA classification. This study presents BioDeepFuse, a hybrid deep learning framework integrating convolutional neural networks (CNN) or bidirectional long short-term memory (BiLSTM) networks with handcrafted features for enhanced accuracy. This framework employs a combination of k-mer one-hot, k-mer dictionary, and feature extraction techniques for input representation. Extracted features, when embedded into the deep network, enable optimal utilization of spatial and sequential nuances of ncRNA sequences. Using benchmark datasets and real-world RNA samples from bacterial organisms, we evaluated the performance of BioDeepFuse. Results exhibited high accuracy in ncRNA classification, underscoring the robustness of our tool in addressing complex ncRNA sequence data challenges. The effective melding of CNN or BiLSTM with external features heralds promising directions for future research, particularly in refining ncRNA classifiers and deepening insights into ncRNAs in cellular processes and disease manifestations. In addition to its original application in the context of bacterial organisms, the methodologies and techniques integrated into our framework can potentially render BioDeepFuse effective in various and broader domains.

Assuntos

Aprendizado Profundo , RNA não Traduzido/genética , Algoritmos , RNA , Redes Neurais de Computação

Feature extraction approaches for biological sequences: a comparative study of mathematical features.

Bonidia, Robson P; Sampaio, Lucas D H; Domingues, Douglas S; Paschoal, Alexandre R; Lopes, Fabrício M; de Carvalho, André C P L F; Sanches, Danilo S.

Brief Bioinform ; 22(5)2021 09 02.

Artigo em Inglês | MEDLINE | ID: mdl-33585910

RESUMO

As consequence of the various genomic sequencing projects, an increasing volume of biological sequence data is being produced. Although machine learning algorithms have been successfully applied to a large number of genomic sequence-related problems, the results are largely affected by the type and number of features extracted. This effect has motivated new algorithms and pipeline proposals, mainly involving feature extraction problems, in which extracting significant discriminatory information from a biological set is challenging. Considering this, our work proposes a new study of feature extraction approaches based on mathematical features (numerical mapping with Fourier, entropy and complex networks). As a case study, we analyze long non-coding RNA sequences. Moreover, we separated this work into three studies. First, we assessed our proposal with the most addressed problem in our review, e.g. lncRNA and mRNA; second, we also validate the mathematical features in different classification problems, to predict the class of lncRNA, e.g. circular RNAs sequences; third, we analyze its robustness in scenarios with imbalanced data. The experimental results demonstrated three main contributions: first, an in-depth study of several mathematical features; second, a new feature extraction pipeline; and third, its high performance and robustness for distinct RNA sequence classification. Availability:https://github.com/Bonidia/FeatureExtraction_BiologicalSequences.

Assuntos

Biologia Computacional/métodos , Aprendizado Profundo , Modelos Teóricos , RNA Circular/genética , RNA Longo não Codificante/genética , RNA Mensageiro/genética , Sequência de Bases/genética , Entropia , Análise de Fourier , Humanos , Fases de Leitura Aberta , RNA Circular/classificação , RNA Longo não Codificante/classificação , RNA Mensageiro/classificação

CRISPRloci: comprehensive and accurate annotation of CRISPR-Cas systems.

Alkhnbashi, Omer S; Mitrofanov, Alexander; Bonidia, Robson; Raden, Martin; Tran, Van Dinh; Eggenhofer, Florian; Shah, Shiraz A; Öztürk, Ekrem; Padilha, Victor A; Sanches, Danilo S; de Carvalho, André C P L F; Backofen, Rolf.

Nucleic Acids Res ; 49(W1): W125-W130, 2021 07 02.

Artigo em Inglês | MEDLINE | ID: mdl-34133710

RESUMO

CRISPR-Cas systems are adaptive immune systems in prokaryotes, providing resistance against invading viruses and plasmids. The identification of CRISPR loci is currently a non-standardized, ambiguous process, requiring the manual combination of multiple tools, where existing tools detect only parts of the CRISPR-systems, and lack quality control, annotation and assessment capabilities of the detected CRISPR loci. Our CRISPRloci server provides the first resource for the prediction and assessment of all possible CRISPR loci. The server integrates a series of advanced Machine Learning tools within a seamless web interface featuring: (i) prediction of all CRISPR arrays in the correct orientation; (ii) definition of CRISPR leaders for each locus; and (iii) annotation of cas genes and their unambiguous classification. As a result, CRISPRloci is able to accurately determine the CRISPR array and associated information, such as: the Cas subtypes; cassette boundaries; accuracy of the repeat structure, orientation and leader sequence; virus-host interactions; self-targeting; as well as the annotation of cas genes, all of which have been missing from existing tools. This annotation is presented in an interactive interface, making it easy for scientists to gain an overview of the CRISPR system in their organism of interest. Predictions are also rendered in GFF format, enabling in-depth genome browser inspection. In summary, CRISPRloci constitutes a full suite for CRISPR-Cas system characterization that offers annotation quality previously available only after manual inspection.

Assuntos

Sistemas CRISPR-Cas , Repetições Palindrômicas Curtas Agrupadas e Regularmente Espaçadas , Anotação de Sequência Molecular , Software , Proteínas Associadas a CRISPR/classificação , Proteínas Associadas a CRISPR/genética , Aprendizado de Máquina

Information Theory for Biological Sequence Classification: A Novel Feature Extraction Technique Based on Tsallis Entropy.

Bonidia, Robson P; Avila Santos, Anderson P; de Almeida, Breno L S; Stadler, Peter F; Nunes da Rocha, Ulisses; Sanches, Danilo S; de Carvalho, André C P L F.

Entropy (Basel) ; 24(10)2022 Oct 01.

Artigo em Inglês | MEDLINE | ID: mdl-37420418

RESUMO

In recent years, there has been an exponential growth in sequencing projects due to accelerated technological advances, leading to a significant increase in the amount of data and resulting in new challenges for biological sequence analysis. Consequently, the use of techniques capable of analyzing large amounts of data has been explored, such as machine learning (ML) algorithms. ML algorithms are being used to analyze and classify biological sequences, despite the intrinsic difficulty in extracting and finding representative biological sequence methods suitable for them. Thereby, extracting numerical features to represent sequences makes it statistically feasible to use universal concepts from Information Theory, such as Tsallis and Shannon entropy. In this study, we propose a novel Tsallis entropy-based feature extractor to provide useful information to classify biological sequences. To assess its relevance, we prepared five case studies: (1) an analysis of the entropic index q; (2) performance testing of the best entropic indices on new datasets; (3) a comparison made with Shannon entropy and (4) generalized entropies; (5) an investigation of the Tsallis entropy in the context of dimensionality reduction. As a result, our proposal proved to be effective, being superior to Shannon entropy and robust in terms of generalization, and also potentially representative for collecting information in fewer dimensions compared with methods such as Singular Value Decomposition and Uniform Manifold Approximation and Projection.

Sequence motif finder using memetic algorithm.

Caldonazzo Garbelini, Jader M; Kashiwabara, André Y; Sanches, Danilo S.

BMC Bioinformatics ; 19(1): 4, 2018 01 03.

Artigo em Inglês | MEDLINE | ID: mdl-29298679

RESUMO

BACKGROUND: De novo prediction of Transcription Factor Binding Sites (TFBS) using computational methods is a difficult task and it is an important problem in Bioinformatics. The correct recognition of TFBS plays an important role in understanding the mechanisms of gene regulation and helps to develop new drugs. RESULTS: We here present Memetic Framework for Motif Discovery (MFMD), an algorithm that uses semi-greedy constructive heuristics as a local optimizer. In addition, we used a hybridization of the classic genetic algorithm as a global optimizer to refine the solutions initially found. MFMD can find and classify overrepresented patterns in DNA sequences and predict their respective initial positions. MFMD performance was assessed using ChIP-seq data retrieved from the JASPAR site, promoter sequences extracted from the ABS site, and artificially generated synthetic data. The MFMD was evaluated and compared with well-known approaches in the literature, called MEME and Gibbs Motif Sampler, achieving a higher f-score in the most datasets used in this work. CONCLUSIONS: We have developed an approach for detecting motifs in biopolymers sequences. MFMD is a freely available software that can be promising as an alternative to the development of new tools for de novo motif discovery. Its open-source software can be downloaded at https://github.com/jadermcg/mfmd .

Assuntos

Algoritmos , Fatores de Transcrição/metabolismo , Sequência de Bases , Sítios de Ligação , Internet , Fatores de Transcrição/química , Fatores de Transcrição/genética , Interface Usuário-Computador

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

Detalhe da pesquisa