RESUMO
Life scientists are increasingly turning to high-throughput sequencing technologies in their research programs, owing to the enormous potential of these methods. In a parallel manner, the number of core facilities that provide bioinformatics support are also increasing. Notably, the generation of complex large datasets has necessitated the development of bioinformatics support core facilities that aid laboratory scientists with cost-effective and efficient data management, analysis, and interpretation. In this article, we address the challenges-related to communication, good laboratory practice, and data handling-that may be encountered in core support facilities when providing bioinformatics support, drawing on our own experiences working as support bioinformaticians on multidisciplinary research projects. Most importantly, the article proposes a list of guidelines that outline how these challenges can be preemptively avoided and effectively managed to increase the value of outputs to the end user, covering the entire research project lifecycle, including experimental design, data analysis, and management (i.e., sharing and storage). In addition, we highlight the importance of clear and transparent communication, comprehensive preparation, appropriate handling of samples and data using monitoring systems, and the employment of appropriate tools and standard operating procedures to provide effective bioinformatics support.
Assuntos
Biologia Computacional/economia , Biologia Computacional/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Pesquisa Biomédica/economia , Pesquisa Biomédica/métodos , Comunicação , Biologia Computacional/normas , Sequenciamento de Nucleotídeos em Larga Escala/economia , Sequenciamento de Nucleotídeos em Larga Escala/normas , Humanos , Projetos de Pesquisa/normasRESUMO
The persistence and emergence of new multi-drug resistant Mycobacterium tuberculosis (M. tb) strains continues to advance the devastating tuberculosis (TB) epidemic. Robust systems are needed to accurately and rapidly perform drug-resistance profiling, and machine learning (ML) methods combined with genomic sequence data may provide novel insights into drug-resistance mechanisms. Using 372 M. tb isolates, the combined utility of ML and bioinformatics to perform drug-resistance profiling is demonstrated. SNPs, InDels, and dinucleotide frequencies are explored as input features for three ML models, namely Decision Trees, Random Forest, and the eXtreme Gradient Boosted model. Using SNPs and InDels, all three models performed equally well yielding a 99% accuracy, 97% recall, and 99% F1-score. Using dinucleotide frequencies, the XGBoost algorithm was superior with a 97% accuracy, 94% recall and 97% F1-score. This study validates the use of variants and presents dinucleotide features as another effective feature encoding method for ML-based phenotype classification.
Assuntos
Antituberculosos , Farmacorresistência Bacteriana Múltipla , Aprendizado de Máquina , Mycobacterium tuberculosis , Tuberculose , Antituberculosos/farmacologia , Antituberculosos/uso terapêutico , Farmacorresistência Bacteriana Múltipla/genética , Humanos , Mycobacterium tuberculosis/efeitos dos fármacos , Mycobacterium tuberculosis/genética , Tuberculose/tratamento farmacológicoRESUMO
The world is grappling with the COVID-19 pandemic caused by the 2019 novel SARS-CoV-2. To better understand this novel virus and its relationship with other pathogens, new methods for analyzing the genome are required. In this study, intrinsic dinucleotide genomic signatures were analyzed for whole genome sequence data of eight pathogenic species, including SARS-CoV-2. The genome sequences were transformed into dinucleotide relative frequencies and classified using the extreme gradient boosting (XGBoost) model. The classification models were trained to a) distinguish between the sequences of all eight species and b) distinguish between sequences of SARS-CoV-2 that originate from different geographic regions. Our method attained 100% in all performance metrics and for all tasks in the eight-species classification problem. Moreover, the models achieved 67% balanced accuracy for the task of classifying the SARS-CoV-2 sequences into the six continental regions and achieved 86% balanced accuracy for the task of classifying SARS-CoV-2 samples as either originating from Asia or not. Analysis of the dinucleotide genomic profiles of the eight species revealed a similarity between the SARS-CoV-2 and MERS-CoV viral sequences. Further analysis of SARS-CoV-2 viral sequences from the six continents revealed that samples from Oceania had the highest frequency of TT dinucleotides as well as the lowest CG frequency compared to the other continents. The dinucleotide signatures of AC, AG,CA, CT, GA, GT, TC, and TG were well conserved across most genomes, while the frequencies of other dinucleotide signatures varied considerably. Altogether, the results from this study demonstrate the utility of dinucleotide relative frequencies for discriminating and identifying similar species.