Polymorphisms predicting phylogeny in hepatitis B virus.

Lourenço, José; McNaughton, Anna L; Pley, Caitlin; Obolski, Uri; Gupta, Sunetra; Matthews, Philippa C

Affiliation

Lourenço J; BioISI (Biosystems and Integrative Sciences Institute), Faculty of Sciences, University of Lisbon, Campo Grande, Lisbon 1749-016, Portugal.
McNaughton AL; Population Health Science, Bristol Medical School, University of Bristol, 5 Tyndall Ave, Bristol BS8 1UD, UK.
Pley C; Guy's and St Thomas' NHS Foundation Trust, Westminster Bridge Rd, London SE1 7EH, UK.
Obolski U; School of Public Health, Tel Aviv University, Tel Aviv 6997801, Israel.
Gupta S; Porter School of the Environment and Earth Sciences, Tel Aviv University, Tel Aviv 6997801, Israel.
Matthews PC; Department of Zoology, University of Oxford, Medawar Building for Pathogen Research, South Parks Road, Oxford OX1 3SY, UK.

Virus Evol ; 9(1): veac116, 2023.

Article in En | MEDLINE | ID: mdl-36628296

ABSTRACT

Hepatitis B viruses (HBVs) are compact viruses with circular genomes of ∼3.2 kb in length. Four genes (HBx, Core, Surface, and Polymerase) generating seven products are encoded on overlapping reading frames. Ten HBV genotypes have been characterised (A-J), which may account for differences in transmission, outcomes of infection, and treatment response. However, HBV genotyping is rarely undertaken, and sequencing remains inaccessible in many settings. We set out to assess which amino acid (aa) sites in the HBV genome are most informative for determining genotype, using a machine learning approach based on random forest algorithms (RFA). We downloaded 5,496 genome-length HBV sequences from a public database, excluding recombinant sequences, regions with conserved indels, and genotypes I and J. Each gene was separately translated into aa, and the proteins concatenated into a single sequence (length 1,614 aa). Using RFA, we searched for aa sites predictive of genotype and assessed covariation among the sites with a mutual information-based method. We were able to discriminate confidently between genotypes A-H using ten aa sites. Half of these sites (5/10) were identified in Polymerase (Pol), of which 4/5 were in the spacer domain and one in reverse transcriptase. A further 4/10 sites were located in Surface protein and a single site in HBx. There were no informative sites in Core. Properties of the aa were generally not conserved between genotypes at informative sites. Among the highest co-varying pairs of sites, there were fifty-five pairs that included one of these 'top ten' sites. Overall, we have shown that RFA analysis is a powerful tool for identifying aa sites that predict the HBV lineage, with an unexpectedly high number of such sites in the spacer domain, which has conventionally been viewed as unimportant for structure or function.
Our results improve ease of genotype prediction from limited regions of HBV sequences and may have future applications in understanding HBV evolution.
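
The abstract mentions a mutual information-based assessment of covariation between aa sites but does not say which variant was used. As a rough illustration only, the sketch below computes plain Shannon mutual information between pairs of alignment columns and ranks the pairs. The function names (`mutual_information`, `rank_covarying_pairs`) and the four-residue toy alignment are hypothetical; they are not the authors' code or HBV data.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(xs, ys):
    """Shannon mutual information (in bits) between two equal-length columns."""
    n = len(xs)
    px = Counter(xs)            # marginal counts at the first site
    py = Counter(ys)            # marginal counts at the second site
    pxy = Counter(zip(xs, ys))  # joint counts of residue pairs
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        mi += p_joint * log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi


def rank_covarying_pairs(alignment):
    """Score every pair of sites and return them sorted by decreasing MI."""
    length = len(alignment[0])
    scores = []
    for i, j in combinations(range(length), 2):
        col_i = [seq[i] for seq in alignment]
        col_j = [seq[j] for seq in alignment]
        scores.append(((i, j), mutual_information(col_i, col_j)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy alignment (assumption, not real data):
# sites 0 and 1 change together, site 2 varies independently, site 3 is constant.
toy_alignment = ["ASKL", "ASRL", "TPKL", "TPRL"]

ranked = rank_covarying_pairs(toy_alignment)
for (i, j), score in ranked[:3]:
    print(f"sites {i} and {j}: MI = {score:.3f} bits")
```

In this toy case the pair of sites 0 and 1 ranks first because their residues always change together, while pairs involving the constant site score zero. A real analysis would also need to handle gaps, ambiguous residues, and the phylogenetic background signal, none of which this sketch attempts.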

Keywords

HBV; covariation; diversity; evolution; genotype; hepadnavirus; hepatitis B virus; machine learning; mutation; phylogeny; polymorphism; selection; subgenotype

Full text: 1 Database: MEDLINE Language: En Publication year: 2023 Document type: Article