Virus Pop-Expanding Viral Databases by Protein Sequence Simulation.
Kende, Julia; Bonomi, Massimiliano; Temmam, Sarah; Regnault, Béatrice; Pérot, Philippe; Eloit, Marc; Bigot, Thomas.
Affiliation
  • Kende J; Bioinformatics and Biostatistics Hub, Institut Pasteur, Université Paris Cité, F-75015 Paris, France.
  • Bonomi M; Department of Structural Biology and Chemistry, Institut Pasteur, Université Paris Cité, CNRS UMR 3528, F-75015 Paris, France.
  • Temmam S; Pathogen Discovery Laboratory, Institut Pasteur, Université Paris Cité, F-75015 Paris, France.
  • Regnault B; Pathogen Discovery Laboratory, Institut Pasteur, Université Paris Cité, F-75015 Paris, France.
  • Pérot P; Pathogen Discovery Laboratory, Institut Pasteur, Université Paris Cité, F-75015 Paris, France.
  • Eloit M; Pathogen Discovery Laboratory, Institut Pasteur, Université Paris Cité, F-75015 Paris, France.
  • Bigot T; Ecole Nationale Vétérinaire d'Alfort, F-94700 Maisons-Alfort, France.
Viruses; 15(6), 2023 May 24.
Article in English | MEDLINE | ID: mdl-37376527
ABSTRACT
Improving our knowledge of the virosphere, which includes unknown viruses, is a key area in virology. Metagenomics tools, which perform taxonomic assignation from high-throughput sequencing datasets, are generally evaluated with datasets derived from biological samples or in silico spiked samples containing known viral sequences present in public databases. As a result, such evaluations cannot measure the capacity of these tools to detect novel or distant viruses. Simulating realistic evolutionary directions is therefore key to benchmarking and improving these tools. Additionally, expanding current databases with realistic simulated sequences can improve the capacity of alignment-based search strategies to find distant viruses, which could lead to a better characterization of the "dark matter" of metagenomics data. Here, we present Virus Pop, a novel pipeline for simulating realistic protein sequences and adding new branches to a protein phylogenetic tree. The tool generates simulated sequences with substitution rate variations that are dependent on protein domains and inferred from the input dataset, allowing for a realistic representation of protein evolution. The pipeline also infers ancestral sequences corresponding to multiple internal nodes of the input data phylogenetic tree, enabling new sequences to be inserted at various points of interest in the group studied. We demonstrated that Virus Pop produces simulated sequences that closely match the structural and functional characteristics of real protein sequences, taking as an example the spike protein of sarbecoviruses. Virus Pop also succeeded at creating sequences that resemble real sequences not included in the databases, which facilitated the identification of a novel pathogenic human circovirus not included in the input database. In conclusion, Virus Pop is helpful for challenging taxonomic assignation tools and could help improve databases to better detect distant viruses.
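To make the idea of domain-dependent substitution rates concrete, here is a minimal Python sketch. It is not Virus Pop's implementation, which the abstract does not describe in detail. It assumes a deliberately simplified model: each site is hit by a substitution with probability 1 - exp(-rate × branch length), and a hit replaces the residue with a different amino acid drawn uniformly. The function name `simulate_descendant`, the toy ancestor sequence, and the per-site rates are all hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def simulate_descendant(ancestor, site_rates, branch_length, rng):
    """Evolve `ancestor` along one branch using per-site substitution rates.

    Simplifying assumption (not the published method): a site changes with
    probability 1 - exp(-rate * branch_length), and the new residue is drawn
    uniformly from the 19 other amino acids.
    """
    if len(ancestor) != len(site_rates):
        raise ValueError("ancestor and site_rates must have the same length")
    descendant = []
    for residue, rate in zip(ancestor, site_rates):
        p_change = 1.0 - math.exp(-rate * branch_length)
        if rng.random() < p_change:
            alternatives = [aa for aa in AMINO_ACIDS if aa != residue]
            descendant.append(rng.choice(alternatives))
        else:
            descendant.append(residue)
    return "".join(descendant)


# Hypothetical toy input: a fully conserved "domain" (rate 0) followed by
# a fast-evolving region (rate 2 substitutions per unit branch length).
ancestor = "MKTLLVAGGSDE"
site_rates = [0.0] * 6 + [2.0] * 6
descendant = simulate_descendant(ancestor, site_rates, 0.5, random.Random(0))
print(ancestor)
print(descendant)
```

In this sketch the first six positions never change because their rate is zero, while the remaining positions accumulate substitutions in proportion to the branch length. A realistic simulator would instead use an empirical amino acid substitution matrix and rates inferred from the input alignment.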

Full text: 1 Collection: 01-international Database: MEDLINE Main subject: Viruses / Computational Biology Study type: Prognostic_studies Limits: Humans Language: En Journal: Viruses Year: 2023 Document type: Article Affiliation country: France

...