Search | BVS (Virtual Health Library) - MINISTRY OF HEALTH

1.

The genome and population genomics of allopolyploid Coffea arabica reveal the diversification history of modern coffee cultivars.

Salojärvi, Jarkko; Rambani, Aditi; Yu, Zhe; Guyot, Romain; Strickler, Susan; Lepelley, Maud; Wang, Cui; Rajaraman, Sitaram; Rastas, Pasi; Zheng, Chunfang; Muñoz, Daniella Santos; Meidanis, João; Paschoal, Alexandre Rossi; Bawin, Yves; Krabbenhoft, Trevor J; Wang, Zhen Qin; Fleck, Steven J; Aussel, Rudy; Bellanger, Laurence; Charpagne, Aline; Fournier, Coralie; Kassam, Mohamed; Lefebvre, Gregory; Métairon, Sylviane; Moine, Déborah; Rigoreau, Michel; Stolte, Jens; Hamon, Perla; Couturon, Emmanuel; Tranchant-Dubreuil, Christine; Mukherjee, Minakshi; Lan, Tianying; Engelhardt, Jan; Stadler, Peter; Correia De Lemos, Samara Mireza; Suzuki, Suzana Ivamoto; Sumirat, Ucu; Wai, Ching Man; Dauchot, Nicolas; Orozco-Arias, Simon; Garavito, Andrea; Kiwuka, Catherine; Musoli, Pascal; Nalukenge, Anne; Guichoux, Erwan; Reinout, Havinga; Smit, Martin; Carretero-Paulet, Lorenzo; Filho, Oliveiro Guerreiro; Braghini, Masako Toma.

Nat Genet ; 56(4): 721-731, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38622339

ABSTRACT

Coffea arabica, an allotetraploid hybrid of Coffea eugenioides and Coffea canephora, is the source of approximately 60% of coffee products worldwide, and its cultivated accessions have undergone several population bottlenecks. We present chromosome-level assemblies of a di-haploid C. arabica accession and modern representatives of its diploid progenitors, C. eugenioides and C. canephora. The three species exhibit largely conserved genome structures between diploid parents and descendant subgenomes, with no obvious global subgenome dominance. We find evidence for a founding polyploidy event 350,000-610,000 years ago, followed by several pre-domestication bottlenecks, resulting in narrow genetic variation. A split between wild accessions and cultivar progenitors occurred ~30.5 thousand years ago, followed by a period of migration between the two populations. Analysis of modern varieties, including lines historically introgressed with C. canephora, highlights their breeding histories and loci that may contribute to pathogen resistance, laying the groundwork for future genomics-based breeding of C. arabica.

Subjects

Coffea; Coffea/genetics; Coffee; Genome, Plant/genetics; Metagenomics; Plant Breeding

2.

Genomic object detection: An improved approach for transposable elements detection and classification using convolutional neural networks.

Orozco-Arias, Simon; Lopez-Murillo, Luis Humberto; Piña, Johan S; Valencia-Castrillon, Estiven; Tabares-Soto, Reinel; Castillo-Ossa, Luis; Isaza, Gustavo; Guyot, Romain.

PLoS One ; 18(9): e0291925, 2023.

Article in English | MEDLINE | ID: mdl-37733731

ABSTRACT

Analysis of eukaryotic genomes requires the detection and classification of transposable elements (TEs), a crucial but complex and time-consuming task. To improve the performance of tools that accomplish these tasks, machine learning (ML) approaches that leverage computing resources, such as graphics processing units (GPUs) and multiple central processing unit (CPU) cores, have been adopted. However, until now, the use of ML techniques has mostly been limited to the classification of TEs. Herein, a detection-classification strategy (named YORO) based on convolutional neural networks is adapted from computer vision (YOLO) to genomics. This approach enables the detection of genomic objects through the prediction of their position, length, and classification in large DNA sequences such as fully sequenced genomes. As a proof of concept, the internal protein-coding domains of LTR-retrotransposons are used to train the proposed neural network. Precision, recall, accuracy, F1-score, execution times and time ratios, as well as several graphical representations, were used as metrics to measure performance. These promising results open the door for a new generation of Deep Learning tools for genomics. The YORO architecture is available at https://github.com/simonorozcoarias/YORO.

Subjects

DNA Transposable Elements; Genomics; DNA Transposable Elements/genetics; Benchmarking; Eukaryota; Neural Networks, Computer
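The record above lists precision, recall, accuracy, and F1-score among its performance metrics. As a minimal sketch of how these four quantities relate to each other, the function below derives them from confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the YORO paper or its code.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, accuracy and F1 from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives. Undefined ratios (zero denominators) are reported as 0.0.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Made-up counts, for illustration only.
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=8)
print(metrics)
```

Because F1 ignores true negatives, it can differ noticeably from accuracy when the classes are imbalanced, which is one reason studies often report both.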

3.

InpactorDB: A Plant LTR Retrotransposon Reference Library.

Orozco-Arias, Simon; Gaviria-Orrego, Simon; Tabares-Soto, Reinel; Isaza, Gustavo; Guyot, Romain.

Methods Mol Biol ; 2703: 31-44, 2023.

Article in English | MEDLINE | ID: mdl-37646935

ABSTRACT

LTR retrotransposons (LTR-RT) are major components of plant genomes. These transposable elements participate in the structure and evolution of genes and genomes through their mobility and their copy number amplification. For example, they are commonly used as evolutionary markers in genetic, genomic, and cytogenetic approaches. However, the plant research community faces a near absence of freely available, full-length, curated, and lineage-level classified LTR retrotransposon reference sequences. In this chapter, we introduce InpactorDB, an LTR retrotransposon sequence database of 181 plant species representing 98 plant families, for a total of 67,241 non-redundant elements. We also describe how to identify and classify LTR-RTs in newly sequenced genomes in a similar way, with a standardized procedure using the Inpactor tool. InpactorDB is freely available at https://inpactordb.github.io.

Subjects

Databases, Nucleic Acid; Retroelements; Retroelements/genetics; Gene Library; Cytogenetics; Genome, Plant

4.

Requests classification in the customer service area for software companies using machine learning and natural language processing.

Arias-Barahona, María Ximena; Arteaga-Arteaga, Harold Brayan; Orozco-Arias, Simón; Flórez-Ruíz, Juan Camilo; Valencia-Díaz, Mario Andrés; Tabares-Soto, Reinel.

PeerJ Comput Sci ; 9: e1016, 2023.

Article in English | MEDLINE | ID: mdl-37346599

ABSTRACT

Artificial intelligence (AI) is widely recognized for its potential to radically transform the way we live. It enables machines to learn from experience, adjust to new inputs, and perform tasks like human beings. This research focuses on the business field. This article proposes an incident classification model using machine learning (ML) and natural language processing (NLP). The application targets the technical support area of a software development company that currently resolves customer requests manually. Through ML and NLP techniques applied to company data, it is possible to determine the category of a request submitted by the client. Reviewing historical records to analyze customer behavior and correctly provide the expected solution to the incidents presented increases customer satisfaction. This practice would also reduce the cost and time spent on relationship management with the potential consumer. This work evaluates different machine learning models, such as support vector machine (SVM), Extra Trees, and Random Forest. The SVM algorithm achieves the highest accuracy, 98.97%, with class balancing, hyper-parameter optimization, and pre-processing techniques.

5.

Machine learning applications on intratumoral heterogeneity in glioblastoma using single-cell RNA sequencing data.

Arteaga-Arteaga, Harold Brayan; Candamil-Cortés, Mariana S; Breaux, Brian; Guillen-Rondon, Pablo; Orozco-Arias, Simon; Tabares-Soto, Reinel.

Brief Funct Genomics ; 22(5): 428-441, 2023 Nov 10.

Article in English | MEDLINE | ID: mdl-37119295

ABSTRACT

Artificial intelligence is revolutionizing all fields that affect people's lives and health. One of the most critical applications is the study of tumors. This is the case for glioblastoma (GBM), whose behavior needs to be understood in order to develop effective therapies. Thanks to advances in single-cell RNA sequencing (scRNA-seq), it is possible to understand the cellular and molecular heterogeneity of GBM. Given that these tumors contain different cell groups, there is a need to apply machine learning (ML) algorithms, which allow information to be extracted to understand how the cancer changes and to broaden the search for effective treatments. We propose multiple comparisons of ML algorithms to classify cell groups based on GBM scRNA-seq data. This broad comparison can show the scientific and medical community which models achieve the best performance in this task. This work classifies the following cell groups: Tumor Core (TC), Tumor Periphery (TP) and Normal Periphery (NP), in binary and multi-class scenarios. It also presents the biomarker candidates found for the models with the best results. The analyses presented here allow us to verify the biomarker candidates to understand the genetic characteristics of GBM, which may be affected by a suitable identification of GBM heterogeneity. For the four scenarios covered, this work obtained cross-validation results of 93.03% ± 5.37%, 97.42% ± 3.94%, 98.27% ± 1.81% and 93.04% ± 6.88% for the classification of TP versus TC, TP versus NP, NP versus TP and TC (TPC), and NP versus TP versus TC, respectively.

Subjects

Glioblastoma; Humans; Glioblastoma/genetics; Glioblastoma/pathology; Artificial Intelligence; Biomarkers; Machine Learning; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods

6.

HERV-K (HML-2) insertion polymorphisms in the 8q24.13 region and their potential etiological associations with acute myeloid leukemia.

Camargo-Forero, Nicolás; Orozco-Arias, Simon; Perez Agudelo, Juan M; Guyot, Romain.

Arch Virol ; 168(4): 125, 2023 Mar 29.

Article in English | MEDLINE | ID: mdl-36988711

ABSTRACT

Human endogenous retroviruses (HERVs) are LTR retrotransposons that are present in the human genome. Among them, members of the HERV-K (HML-2) group are suspected to play a role in the development of different types of cancer, including lung, ovarian, and prostate cancer, as well as leukemia. Acute myeloid leukemia (AML) is an important disease that causes 1% of cancer deaths in the United States and has a survival rate of 28.7%. Here, we describe a method for assessing the statistical association between HERV-K (HML-2) transposable element insertion polymorphisms (TIPs) and AML, using whole-genome sequencing and read mapping with the TIP_finder software. Our results suggest that 101 polymorphisms involving HERV-K (HML-2) elements were correlated with AML, with a percentage between 44.4% and 56.6%, most of which (70) were located in the region from 8q24.13 to 8q24.21. Moreover, it was found that the TRIB1, LRATD2, POU5F1B, MYC, PCAT1, PVT1, and CCDC26 genes could be displaced or fragmented by TIPs. Furthermore, a general method was devised to facilitate analysis of the correlation between transposable element insertions and specific diseases. Finally, although the relationship between HERV-K (HML-2) TIPs and AML remains unclear, the data reported in this study indicate a statistical correlation, as supported by the χ² test with p-values < 0.05.

Subjects

Endogenous Retroviruses; Leukemia, Myeloid, Acute; Male; Humans; Endogenous Retroviruses/genetics; DNA Transposable Elements; Polymorphism, Genetic; Genome, Human; Leukemia, Myeloid, Acute/genetics; Protein Serine-Threonine Kinases; Intracellular Signaling Peptides and Proteins/genetics
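The record above reports a χ² test of association between insertion polymorphisms and AML. As a rough illustration of how such a test works for a single insertion site, the sketch below computes Pearson's χ² statistic and its p-value for a 2×2 contingency table (insertion present or absent, in cases versus controls). The function name, the table layout, and the example counts are hypothetical and are not taken from the study; no continuity correction is applied, and the paper's actual procedure may differ.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for a 2x2 table, without continuity correction.

    Hypothetical layout:
                      insertion present   insertion absent
        cases                 a                  b
        controls              c                  d

    Returns (chi2, p_value) with one degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must contain at least one sample")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, chi2 is a squared standard normal variable,
    # so the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts, for illustration only.
chi2, p = chi_square_2x2(20, 10, 10, 20)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

When many insertion sites are tested at once, a multiple-testing correction would normally be considered before interpreting individual p-values.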

7.

High nucleotide similarity of three Copia lineage LTR retrotransposons among plant genomes.

Orozco-Arias, Simon; Dupeyron, Mathilde; Gutiérrez-Duque, David; Tabares-Soto, Reinel; Guyot, Romain.

Genome ; 66(3): 51-61, 2023 Mar 01.

Article in English | MEDLINE | ID: mdl-36623262

ABSTRACT

Transposable elements (TEs) are mobile elements found in the majority of eukaryotic genomes. TEs deeply impact the structure and evolution of chromosomes and can induce mutations affecting coding genes. In plants, the major group of TEs is long terminal repeat retrotransposons (LTR-RTs). They are classified into superfamilies (Gypsy, Copia) and subclassified into lineages. Horizontal transfer (HT), defined as the nonsexual transmission of genetic material between species, is a process allowing LTR-RTs to invade a new genome. Although this phenomenon was considered rare, recent studies demonstrate numerous transfers of LTR-RTs. This study aims to determine which LTR-RT lineages are shared with high similarity among 69 plant genomes. We identified and classified 88,450 LTR-RTs and found 143 cases of high similarity between pairs of genomes. Most of them involved three Copia lineages (Oryco/Ivana, Retrofit/Ale, and Tork/Tar/Ikeros). A detailed analysis of three cases of high similarity involving the Tork/Tar/Ikeros group shows an uneven distribution of the elements in the phylogeny and incongruence between phylogenetic tree topologies, indicating that they could have originated from HTs. Overall, our results suggest that LTR-RT Copia lineages share outstanding similarity between distant species and may be involved in HT mechanisms more frequently than initially estimated.

Subjects

Nucleotides; Retroelements; Phylogeny; Genome, Plant; Terminal Repeat Sequences/genetics; Evolution, Molecular

8.

G-SAIP: Graphical Sequence Alignment Through Parallel Programming in the Post-Genomic Era.

Piña, Johan S; Orozco-Arias, Simon; Tobón-Orozco, Nicolas; Camargo-Forero, Leonardo; Tabares-Soto, Reinel; Guyot, Romain.

Evol Bioinform Online ; 19: 11769343221150585, 2023.

Article in English | MEDLINE | ID: mdl-36703866

ABSTRACT

A common task in bioinformatics is to compare DNA sequences to identify similarities between organisms at the sequence level. One approach to such comparison is the dot-plot, a 2-dimensional graphical representation used to analyze DNA or protein alignments. Dot-plot alignment software existed before the sequencing revolution, and it is now limited when dealing with large sequences, resulting in very long execution times. High-Performance Computing (HPC) techniques have been successfully used in many applications to reduce computing times, but so far, very few applications for graphical sequence alignment using HPC have been reported. Here, we present G-SAIP (Graphical Sequence Alignment in Parallel), software capable of spawning multiple distributed processes on CPUs over a supercomputing infrastructure to speed up dot-plot generation by up to 1.68× compared with the fastest current tools, and to improve the efficiency of comparative structural genomic analysis, phylogenetics (through the benefits of pairwise alignments for comparison between genomes), repetitive structure identification, and assembly quality checking.
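To make the dot-plot idea in the record above concrete, here is a minimal serial sketch: a dot is placed at (i, j) whenever the word starting at position i of the first sequence equals the word starting at position j of the second. The function name and the `word` parameter are hypothetical; this is not G-SAIP's algorithm or interface, and it does none of the parallel distribution the abstract describes.

```python
def dot_plot(seq_a, seq_b, word=3):
    """Return the set of (i, j) pairs where seq_a[i:i+word] == seq_b[j:j+word]."""
    # Index every word of seq_b by its start positions so that each word of
    # seq_a is looked up once instead of being compared against all of seq_b.
    positions = {}
    for j in range(len(seq_b) - word + 1):
        positions.setdefault(seq_b[j:j + word], []).append(j)

    dots = set()
    for i in range(len(seq_a) - word + 1):
        for j in positions.get(seq_a[i:i + word], []):
            dots.add((i, j))
    return dots


# Comparing a short sequence with itself yields the main diagonal.
for i, j in sorted(dot_plot("ACGTAC", "ACGTAC")):
    print(i, j)
```

Runs of dots along a diagonal indicate shared segments, and off-diagonal runs in a self-comparison indicate repeats. Since each position of the first sequence is processed independently, the outer loop is the natural place to split work across processes.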

9.

Inpactor2: a software based on deep learning to identify and classify LTR-retrotransposons in plant genomes.

Orozco-Arias, Simon; Lopez-Murillo, Luis Humberto; Candamil-Cortés, Mariana S; Arias, Maradey; Jaimes, Paula A; Paschoal, Alexandre Rossi; Tabares-Soto, Reinel; Isaza, Gustavo; Guyot, Romain.

Brief Bioinform ; 24(1), 2023 Jan 19.

Article in English | MEDLINE | ID: mdl-36502372

ABSTRACT

LTR-retrotransposons are the most abundant repeat sequences in plant genomes and play an important role in evolution and biodiversity. Their characterization is of great importance to understand their dynamics. However, the identification and classification of these elements remains a challenge today. Moreover, current software can be relatively slow (from hours to days), sometimes involves a lot of manual work, and does not reach satisfactory levels of precision and sensitivity. Here we present Inpactor2, an accurate and fast application that creates LTR-retrotransposon reference libraries in a very short time. Inpactor2 takes an assembled genome as input and follows a hybrid approach (deep learning and structure-based) to detect elements, filter partial sequences and finally classify intact sequences into superfamilies and, as very few tools do, into lineages. This tool takes advantage of multi-core and GPU architectures to decrease execution times. Using the rice genome, Inpactor2 showed a run time of 5 minutes (faster than other tools) and has the best accuracy and F1-Score of the tools tested here, also having the second best accuracy and specificity only surpassed by EDTA, but achieving 28% higher sensitivity. For large genomes, Inpactor2 is up to seven times faster than other available bioinformatics tools.

Subjects

Deep Learning; Retroelements; Retroelements/genetics; Terminal Repeat Sequences/genetics; Genome, Plant; Software; Evolution, Molecular; Phylogeny

10.

Cov-caldas: A new COVID-19 chest X-Ray dataset from state of Caldas-Colombia.

Alzate-Grisales, Jesús Alejandro; Mora-Rubio, Alejandro; Arteaga-Arteaga, Harold Brayan; Bravo-Ortiz, Mario Alejandro; Arias-Garzón, Daniel; López-Murillo, Luis Humberto; Mercado-Ruiz, Esteban; Villa-Pulgarin, Juan Pablo; Cardona-Morales, Oscar; Orozco-Arias, Simon; Buitrago-Carmona, Felipe; Palancares-Sosa, Maria Jose; Martínez-Rodríguez, Fernanda; Contreras-Ortiz, Sonia H; Saborit-Torres, Jose Manuel; Montell Serrano, Joaquim Ángel; Ramirez-Sánchez, María Mónica; Sierra-Gaber, Mario Alfonso; Jaramillo-Robledo, Oscar; de la Iglesia-Vayá, Maria; Tabares-Soto, Reinel.

Sci Data ; 9(1): 757, 2022 Dec 07.

Article in English | MEDLINE | ID: mdl-36476596

ABSTRACT

The emergence of COVID-19 as a global pandemic forced researchers worldwide in various disciplines to investigate and propose efficient strategies and/or technologies to prevent COVID-19 from further spreading. One of the main challenges to be overcome is the fast and efficient detection of COVID-19 using deep learning approaches and medical images such as Chest Computed Tomography (CT) and Chest X-ray images. In order to contribute to this challenge, a new dataset was collected in collaboration with "S.E.S Hospital Universitario de Caldas" ( https://hospitaldecaldas.com/ ) from Colombia and organized following the Medical Imaging Data Structure (MIDS) format. The dataset contains 7,307 chest X-ray images divided into 3,077 and 4,230 COVID-19 positive and negative images. Images were subjected to a selection and anonymization process to allow the scientific community to use them freely. Finally, different convolutional neural networks were used to perform technical validation. This dataset contributes to the scientific community by tackling significant limitations regarding data quality and availability for the detection of COVID-19.

Subjects

COVID-19; Humans; COVID-19/diagnostic imaging; X-Rays; Colombia

11.

Automatic curation of LTR retrotransposon libraries from plant genomes through machine learning.

Orozco-Arias, Simon; Candamil-Cortes, Mariana S; Jaimes, Paula A; Valencia-Castrillon, Estiven; Tabares-Soto, Reinel; Isaza, Gustavo; Guyot, Romain.

J Integr Bioinform ; 19(3), 2022 Sep 01.

Article in English | MEDLINE | ID: mdl-35822734

ABSTRACT

Transposable elements are mobile sequences that can move and insert themselves into chromosomes, activating under internal or external stimuli and giving the organism the ability to adapt to the environment. Annotating transposable elements in genomic data is currently considered a crucial task for understanding key aspects of organisms such as phenotype variability, species evolution, and genome size, among others. Because of the way they replicate, LTR retrotransposons are the most common transposable elements in plants, accounting in some cases for up to 80% of all DNA information. To annotate these elements, a reference library is usually created and curated to eliminate TE fragments and false positives, and the elements are then annotated in the genome using the homology method. However, the curation process can take weeks and requires extensive manual work and the execution of multiple time-consuming bioinformatics programs. Here, we propose a machine learning-based approach to perform this process automatically on plant genomes, obtaining up to 91.18% F1-score. This approach was tested with four plant species, obtaining up to 93.6% F1-score (Oryza granulata) in only 22.61 s, where bioinformatics methods took approximately 6 h. This acceleration demonstrates that the ML-based approach is efficient and could be used in massive sequencing projects.

Subjects

Retroelements; Terminal Repeat Sequences; DNA Transposable Elements; Evolution, Molecular; Genome, Plant; Machine Learning; Plants/genetics; Retroelements/genetics

12.

Machine learning applications to predict two-phase flow patterns.

Arteaga-Arteaga, Harold Brayan; Mora-Rubio, Alejandro; Florez, Frank; Murcia-Orjuela, Nicolas; Diaz-Ortega, Cristhian Eduardo; Orozco-Arias, Simon; delaPava, Melissa; Bravo-Ortíz, Mario Alejandro; Robinson, Melvin; Guillen-Rondon, Pablo; Tabares-Soto, Reinel.

PeerJ Comput Sci ; 7: e798, 2021.

Article in English | MEDLINE | ID: mdl-34909465

ABSTRACT

Recent advances in artificial intelligence, with traditional machine learning algorithms and deep learning architectures, solve complex classification problems. This work presents the performance of different artificial intelligence models in classifying two-phase flow patterns, showing the best alternatives for this specific classification problem using two-phase flow regimes (liquid and gas) in pipes. Flow patterns are affected by physical variables such as superficial velocity, viscosity, density, and surface tension. They also depend on the construction characteristics of the pipe, such as the angle of inclination and the diameter. We selected 12 databases (9,029 samples) to train and test machine learning models, considering these variables that influence the flow patterns. The primary dataset is Shoham (1982), containing 5,675 samples with six different flow patterns. An extensive set of metrics validated the results obtained. The most relevant features for training the models using the Shoham (1982) dataset are gas and liquid superficial velocities, angle of inclination, and diameter. Regarding the algorithms, the Extra Trees model classifies the flow patterns with the highest degree of fidelity, achieving an accuracy of 98.8%.

13.

COVID-19 detection in X-ray images using convolutional neural networks.

Arias-Garzón, Daniel; Alzate-Grisales, Jesús Alejandro; Orozco-Arias, Simon; Arteaga-Arteaga, Harold Brayan; Bravo-Ortiz, Mario Alejandro; Mora-Rubio, Alejandro; Saborit-Torres, Jose Manuel; Serrano, Joaquim Ángel Montell; de la Iglesia Vayá, Maria; Cardona-Morales, Oscar; Tabares-Soto, Reinel.

Mach Learn Appl ; 6: 100138, 2021 Dec 15.

Article in English | MEDLINE | ID: mdl-34939042

ABSTRACT

The COVID-19 global pandemic affects health care and lifestyle worldwide, and its early detection is critical to controlling the spread of cases and mortality. The current leading diagnostic test is the reverse transcription polymerase chain reaction (RT-PCR); because the turnaround time and cost of these tests are high, other fast and accessible diagnostic tools are needed. Inspired by recent research that correlates the presence of COVID-19 with findings in chest X-ray images, this paper's approach uses existing deep learning models (VGG19 and U-Net) to process these images and classify them as positive or negative for COVID-19. The proposed system involves a preprocessing stage with lung segmentation, which removes the surroundings that offer no relevant information for the task and may produce biased results; after this initial stage comes the classification model, trained under the transfer learning scheme; and finally, results analysis and interpretation via heat map visualization. The best models achieved a COVID-19 detection accuracy of around 97%.

14.

Sensitivity of deep learning applied to spatial image steganalysis.

Tabares-Soto, Reinel; Arteaga-Arteaga, Harold Brayan; Mora-Rubio, Alejandro; Bravo-Ortíz, Mario Alejandro; Arias-Garzón, Daniel; Alzate-Grisales, Jesús Alejandro; Orozco-Arias, Simon; Isaza, Gustavo; Ramos-Pollán, Raúl.

PeerJ Comput Sci ; 7: e616, 2021.

Article in English | MEDLINE | ID: mdl-34604512

ABSTRACT

In recent years, the traditional approach to spatial image steganalysis has shifted to deep learning (DL) techniques, which have improved the detection accuracy while combining feature extraction and classification in a single model, usually a convolutional neural network (CNN). The main contribution from researchers in this area is new architectures that further improve detection accuracy. Nevertheless, the preprocessing and partition of the database influence the overall performance of the CNN. This paper presents the results achieved by novel steganalysis networks (Xu-Net, Ye-Net, Yedroudj-Net, SR-Net, Zhu-Net, and GBRAS-Net) using different combinations of image and filter normalization ranges, various database splits, and different activation functions for the preprocessing stage, as well as an analysis of the activation maps and of how to report accuracy. These results demonstrate how sensitive steganalysis systems are to changes in any stage of the process, and how important it is for researchers in this field to register and report their work thoroughly. We also propose a set of recommendations for the design of experiments in steganalysis with DL.

15.

K-mer-based machine learning method to classify LTR-retrotransposons in plant genomes.

Orozco-Arias, Simon; Candamil-Cortés, Mariana S; Jaimes, Paula A; Piña, Johan S; Tabares-Soto, Reinel; Guyot, Romain; Isaza, Gustavo.

PeerJ ; 9: e11456, 2021.

Article in English | MEDLINE | ID: mdl-34055489

ABSTRACT

Every day, more plant genomes become available in public databases, and additional massive sequencing projects (i.e., those that aim to sequence thousands of individuals) are formulated and released. Nevertheless, there are not enough automatic tools to analyze this large amount of genomic information. LTR retrotransposons are the most frequent repetitive sequences in plant genomes; however, their detection and classification are commonly performed using semi-automatic and time-consuming programs. Despite the availability of several bioinformatic tools that follow different approaches to detect and classify them, none of these tools can individually obtain accurate results. Here, we used machine learning algorithms based on k-mer counts to distinguish LTR retrotransposons from other genomic sequences and to classify them into lineages/families with an F1-score of 95%, contributing to the development of an alignment-free and automatic method to analyze these sequences.
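The record above classifies sequences from k-mer counts rather than from alignments. As a minimal sketch of that feature-extraction step only, the function below turns a DNA sequence into a fixed-length vector of k-mer counts that could be passed to any classifier. The function name, the choice of k, and the handling of ambiguous bases are assumptions made for illustration, not details taken from the paper.

```python
from itertools import product


def kmer_counts(sequence, k=3):
    """Count every possible DNA k-mer in `sequence`.

    Returns a list of 4**k counts ordered lexicographically over "ACGT",
    so every sequence maps to a vector of the same length.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        # Windows containing N or other ambiguity codes are skipped
        # (an assumption of this sketch).
        if kmer in index:
            counts[index[kmer]] += 1
    return counts


features = kmer_counts("ACGTACGTNACGT", k=2)
print(len(features), sum(features))
```

Counts are often normalized by sequence length before training so that elements of different sizes remain comparable.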

16.

Strategy to improve the accuracy of convolutional neural network architectures applied to digital image steganalysis in the spatial domain.

Tabares-Soto, Reinel; Arteaga-Arteaga, Harold Brayan; Mora-Rubio, Alejandro; Bravo-Ortíz, Mario Alejandro; Arias-Garzón, Daniel; Alzate Grisales, Jesús Alejandro; Burbano Jacome, Alejandro; Orozco-Arias, Simon; Isaza, Gustavo; Ramos Pollan, Raul.

PeerJ Comput Sci ; 7: e451, 2021.

Article in English | MEDLINE | ID: mdl-33954236

ABSTRACT

In recent years, Deep Learning techniques applied to steganalysis have surpassed the traditional two-stage approach by unifying feature extraction and classification in a single model, the Convolutional Neural Network (CNN). Several CNN architectures have been proposed to solve this task, improving the detection accuracy for steganographic images, but it is unclear which computational elements are relevant. Here we present a strategy to improve accuracy, convergence, and stability during training. The strategy involves a preprocessing stage with Spatial Rich Models filters, Spatial Dropout, an Absolute Value layer, and Batch Normalization. Using the strategy improves the performance of three steganalysis CNNs and two image classification CNNs, enhancing the accuracy by 2% to 10% while reducing the training time to less than 6 h and improving the networks' stability.

17.

The absence of the caffeine synthase gene is involved in the naturally decaffeinated status of Coffea humblotiana, a wild species from Comoro archipelago.

Raharimalala, Nathalie; Rombauts, Stephane; McCarthy, Andrew; Garavito, Andréa; Orozco-Arias, Simon; Bellanger, Laurence; Morales-Correa, Alexa Yadira; Froger, Solène; Michaux, Stéphane; Berry, Victoria; Metairon, Sylviane; Fournier, Coralie; Lepelley, Maud; Mueller, Lukas; Couturon, Emmanuel; Hamon, Perla; Rakotomalala, Jean-Jacques; Descombes, Patrick; Guyot, Romain; Crouzillat, Dominique.

Sci Rep ; 11(1): 8119, 2021 Apr 14.

Article in English | MEDLINE | ID: mdl-33854089

ABSTRACT

Caffeine is the most consumed alkaloid stimulant in the world. It is synthesized through the activity of three known N-methyltransferase proteins. Here we report the 422-Mb chromosome-level assembly of the genome of Coffea humblotiana, a wild, endangered, and naturally caffeine-free species from the Comoro archipelago. We predicted 32,874 genes and anchored 88.7% of the sequence onto the 11 chromosomes. Comparative analyses with the African Robusta coffee genome (C. canephora) revealed extensive genome conservation, despite an estimated 11 million years of divergence and a broad diversity of genome sizes within the Coffea genus. In this genome, the absence of caffeine is likely due to the absence, through an illegitimate recombination mechanism, of the caffeine synthase gene, which converts theobromine into caffeine. These findings pave the way for further characterization of caffeine-free species in the Coffea genus and will guide research towards naturally decaffeinated coffee drinks for consumers.

Subjects

Coffea/genetics; Methyltransferases/genetics; Plant Proteins/genetics; Amino Acid Sequence; Caffeine/analysis; Chromosomes, Plant; Coffea/chemistry; Coffea/enzymology; Comoros; Comparative Genomic Hybridization; Evolution, Molecular; Methyltransferases/classification; Methyltransferases/deficiency; Phylogeny; Plant Leaves/chemistry; Plant Leaves/enzymology; Plant Leaves/genetics; Plant Proteins/classification; Plant Proteins/metabolism; Sequence Alignment; Sequence Analysis, RNA; Theobromine/analysis

18.

InpactorDB: A Classified Lineage-Level Plant LTR Retrotransposon Reference Library for Free-Alignment Methods Based on Machine Learning.

Orozco-Arias, Simon; Jaimes, Paula A; Candamil, Mariana S; Jiménez-Varón, Cristian Felipe; Tabares-Soto, Reinel; Isaza, Gustavo; Guyot, Romain.

Genes (Basel) ; 12(2), 2021 Jan 28.

Article in English | MEDLINE | ID: mdl-33525408

ABSTRACT

Long terminal repeat (LTR) retrotransposons are mobile elements that constitute the major fraction of most plant genomes. The identification and annotation of these elements via bioinformatics approaches represent a major challenge in the era of massive plant genome sequencing. In addition to their involvement in genome size variation, LTR retrotransposons are also associated with the function and structure of different chromosomal regions and can alter the function of coding regions, among others. Several sequence databases of plant LTR retrotransposons are available for public access, such as PGSB and RepetDB, or restricted access such as Repbase. Although these databases are useful to identify LTR-RTs in new genomes by similarity, the elements of these databases are not fully classified to the lineage (also called family) level. Here, we present InpactorDB, a semi-curated dataset composed of 130,439 elements from 195 plant genomes (belonging to 108 plant species) classified to the lineage level. This dataset has been used to train two deep neural networks (i.e., one fully connected and one convolutional) for the rapid classification of these elements. In lineage-level classification approaches, we obtain up to 98% performance, indicated by the F1-score, precision and recall scores.

Subjects

Computational Biology/methods; Databases, Nucleic Acid; Genome, Plant; Genomics/methods; Retroelements; Terminal Repeat Sequences; Machine Learning; Neural Networks, Computer; Reproducibility of Results

19.

TIP_finder: An HPC Software to Detect Transposable Element Insertion Polymorphisms in Large Genomic Datasets.

Orozco-Arias, Simon; Tobon-Orozco, Nicolas; Piña, Johan S; Jiménez-Varón, Cristian Felipe; Tabares-Soto, Reinel; Guyot, Romain.

Biology (Basel) ; 9(9), 2020 Sep 09.

Article in English | MEDLINE | ID: mdl-32917036

ABSTRACT

Transposable elements (TEs) are mobile genomic units capable of moving from one chromosomal location to another. In eukaryotes, their insertion polymorphisms may cause beneficial mutations, such as the creation of new gene functions, or deleterious ones, e.g., those associated with different types of cancer in humans. A particular type of TE called LTR retrotransposons comprises almost 8% of the human genome. Among LTR retrotransposons, human endogenous retroviruses (HERVs) bear structural and functional similarities to retroviruses. Several tools allow the detection of transposon insertion polymorphisms (TIPs) but fail to efficiently analyze large genomes or large datasets. Here, we developed a computational tool, named TIP_finder, able to detect mobile element insertions in very large genomes through high-performance computing (HPC) and parallel programming, using discordant read pair analysis. TIP_finder inputs are (i) short paired-end reads such as those obtained by Illumina, (ii) a chromosome-level reference genome sequence, and (iii) a database of consensus TE sequences. The HPC strategy we propose adds scalability and provides a useful tool to analyze huge genomic datasets in a reasonable running time. TIP_finder accelerates the detection of TIPs by up to 55 times in breast cancer datasets and 46 times in cancer-free datasets compared to the fastest available algorithms. TIP_finder applies a validated strategy to find TIPs, accelerates the process through HPC, and addresses the issues of runtime for large-scale analyses in the post-genomic era. TIP_finder version 1.0 is available at https://github.com/simonorozcoarias/TIP_finder.
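
The general idea of discordant read pair analysis can be sketched as follows. This is a minimal Python illustration under stated assumptions, not TIP_finder's implementation: it assumes read pairs have already been aligned to both the reference genome and the TE consensus database, and it represents each pair as a hypothetical `(ref1, pos1, ref2, pos2)` tuple. The function name, the window size, and the `min_support` threshold are all invented for the example.

```python
from collections import Counter

def candidate_insertion_windows(pairs, te_names, window=1000, min_support=2):
    """Count read pairs with one mate on a chromosome and the other on a TE.

    pairs: iterable of (ref1, pos1, ref2, pos2) tuples, one per read pair.
    te_names: set of reference names that are TE consensus sequences.
    Returns {(chromosome, window_start, te_name): count} for windows whose
    count reaches min_support.
    """
    support = Counter()
    for ref1, pos1, ref2, pos2 in pairs:
        first_is_te = ref1 in te_names
        second_is_te = ref2 in te_names
        if first_is_te == second_is_te:
            continue  # both genomic or both TE: not informative here
        if first_is_te:
            chrom, pos, te = ref2, pos2, ref1
        else:
            chrom, pos, te = ref1, pos1, ref2
        window_start = (pos // window) * window
        support[(chrom, window_start, te)] += 1
    return {key: n for key, n in support.items() if n >= min_support}

# Hypothetical alignments; names and coordinates are made up.
pairs = [
    ("chr1", 10_250, "TE_A", 120),
    ("TE_A", 300, "chr1", 10_900),
    ("chr1", 50_100, "chr1", 50_400),  # concordant genomic pair, skipped
    ("chr2", 7_020, "TE_B", 45),       # only one supporting pair, filtered
]
hits = candidate_insertion_windows(pairs, te_names={"TE_A", "TE_B"})
```

Because each window is counted independently, work of this kind can in principle be split by chromosome or by window across processes, which is one plausible reason the approach lends itself to parallel execution.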

20.

A comparative study of machine learning and deep learning algorithms to classify cancer types based on microarray gene expression data.

Tabares-Soto, Reinel; Orozco-Arias, Simon; Romero-Cano, Victor; Segovia Bucheli, Vanesa; Rodríguez-Sotelo, José Luis; Jiménez-Varón, Cristian Felipe.

PeerJ Comput Sci ; 6: e270, 2020.

Article in English | MEDLINE | ID: mdl-33816921

ABSTRACT

Cancer classification is a topic of major interest in medicine since it allows accurate and efficient diagnosis and facilitates a successful outcome in medical treatments. Previous studies have classified human tumors using large-scale RNA profiling and supervised Machine Learning (ML) algorithms to construct a molecular-based classification of carcinoma cells from breast, bladder, adenocarcinoma, colorectal, gastroesophageal, kidney, liver, lung, ovarian, pancreas, and prostate tumors. These datasets are collectively known as the 11_tumor database. Although this database has been used in several works in the ML field, no comparative studies of different algorithms can be found in the literature. Meanwhile, advances in both hardware and software technologies have fostered considerable improvements in the precision of solutions that use ML, such as Deep Learning (DL). In this study, we compare the most widely used algorithms in classical ML and DL to classify the tumors described in the 11_tumor database. We obtained tumor identification accuracies between 90.6% (Logistic Regression) and 94.43% (Convolutional Neural Networks) using k-fold cross-validation. We also show how a tuning process may or may not significantly improve the algorithms' accuracies. Our results demonstrate an efficient and accurate classification method based on gene expression (microarray data) and ML/DL algorithms, which facilitates tumor type prediction in a multi-cancer-type scenario.
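
For readers unfamiliar with the evaluation scheme mentioned in this abstract, the Python sketch below shows k-fold cross-validation in its simplest form. It is an illustration only: the abstract does not state the value of k, whether folds were shuffled or stratified, or which library was used, so the contiguous unshuffled folds, the function names, and the majority-class stand-in model are all assumptions.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_validated_accuracy(samples, labels, fit, predict, k=5):
    """Average test accuracy over k folds for a caller-supplied model."""
    accuracies = []
    for train, test in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, samples[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

# Stand-in model (hypothetical): always predict the most common training label.
def fit_majority(train_samples, train_labels):
    return Counter(train_labels).most_common(1)[0][0]

def predict_majority(model, sample):
    return model

samples = list(range(10))
labels = ["a"] * 8 + ["b"] * 2
mean_acc = cross_validated_accuracy(samples, labels, fit_majority, predict_majority, k=5)
```

Each sample is used for testing exactly once, so the averaged accuracy reflects performance on data the model did not see during fitting. With real, ordered data the samples would normally be shuffled or stratified by class before splitting.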
