1.
PLoS Pathog ; 18(1): e1010224, 2022 Jan.
Article in English | MEDLINE | ID: mdl-34990490

ABSTRACT

[This corrects the article DOI: 10.1371/journal.ppat.1009786.].

2.
Syst Biol ; 72(6): 1280-1295, 2023 Dec 30.
Article in English | MEDLINE | ID: mdl-37756489

ABSTRACT

The bootstrap method is based on resampling sequence alignments and re-estimating trees. Felsenstein's bootstrap proportions (FBP) are the most common approach to assess the reliability and robustness of sequence-based phylogenies. However, when increasing taxon sampling (i.e., the number of sequences) to hundreds or thousands of taxa, FBP tend to return low support for deep branches. The transfer bootstrap expectation (TBE) has been recently suggested as an alternative to FBP. TBE is measured using a continuous transfer index in [0,1] for each bootstrap tree, instead of the binary {0,1} index used in FBP to measure the presence/absence of the branch of interest. TBE has been shown to yield higher and more informative supports while inducing a very low number of falsely supported branches. Nonetheless, it has been argued that TBE must be used with care due to sampling issues, especially in datasets with a high number of closely related taxa. In this study, we conduct multiple experiments by varying taxon sampling and comparing FBP and TBE support values on different phylogenetic depths, using empirical datasets. Our results show that the main critique of TBE stands in extreme cases with shallow branches and highly unbalanced sampling among clades, but that TBE is still robust in most cases, while FBP is inescapably negatively impacted by high taxon sampling. We suggest guidelines and good practices in TBE (and FBP) computing and interpretation.
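The binary index behind FBP can be illustrated with a minimal sketch. This is not the authors' code: it assumes each tree is given as the set of bipartitions it contains, each bipartition being stored as the frozenset of taxa on the side that excludes a fixed reference taxon. The function name `felsenstein_support` and the toy trees are hypothetical.

```python
from typing import FrozenSet, List, Set

Bipartition = FrozenSet[str]


def felsenstein_support(branch: Bipartition,
                        bootstrap_trees: List[Set[Bipartition]]) -> float:
    """Fraction of bootstrap trees that contain `branch` exactly (binary index)."""
    if not bootstrap_trees:
        raise ValueError("need at least one bootstrap tree")
    # Each tree contributes 1 if the branch is present, 0 otherwise.
    hits = sum(1 for tree in bootstrap_trees if branch in tree)
    return hits / len(bootstrap_trees)


# Toy example on taxa A..E; every bipartition is the side that excludes taxon "A".
ref_branch = frozenset({"D", "E"})
boot = [
    {frozenset({"D", "E"}), frozenset({"C", "D", "E"})},
    {frozenset({"D", "E"}), frozenset({"B", "C"})},
    {frozenset({"C", "D"}), frozenset({"C", "D", "E"})},
    {frozenset({"C", "E"}), frozenset({"B", "D"})},
]
fbp = felsenstein_support(ref_branch, boot)
```

A single misplaced taxon is enough to turn a hit into a miss under this index, which is consistent with the low support reported for deep branches when many taxa are sampled.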


Subject(s)
Phylogeny , Reproducibility of Results
3.
Syst Biol ; 72(6): 1387-1402, 2023 Dec 30.
Article in English | MEDLINE | ID: mdl-37703335

ABSTRACT

Multi-type birth-death (MTBD) models are phylodynamic analogies of compartmental models in classical epidemiology. They serve to infer such epidemiological parameters as the average number of secondary infections Re and the infectious time from a phylogenetic tree (a genealogy of pathogen sequences). The representatives of this model family focus on various aspects of pathogen epidemics. For instance, the birth-death exposed-infectious (BDEI) model describes the transmission of pathogens featuring an incubation period (when there is a delay between the moment of infection and becoming infectious, as for Ebola and SARS-CoV-2), and permits its estimation along with other parameters. With constantly growing sequencing data, MTBD models should be extremely useful for unravelling information on pathogen epidemics. However, existing implementations of these models in a phylodynamic framework have not yet caught up with the sequencing speed. Computing time and numerical instability issues limit their applicability to medium data sets (≤ 500 samples), while the accuracy of estimations should increase with more data. We propose a new highly parallelizable formulation of ordinary differential equations for MTBD models. We also extend them to forests to represent situations when a (sub-)epidemic started from several cases (e.g., multiple introductions to a country). We implemented it for the BDEI model in a maximum likelihood framework using a combination of numerical analysis methods for efficient equation resolution. Our implementation estimates epidemiological parameter values and their confidence intervals in two minutes on a phylogenetic tree of 10,000 samples. Comparison to the existing implementations on simulated data shows that it is not only much faster but also more accurate. An application of our tool to the 2014 Ebola epidemic in Sierra Leone is also convincing, with very fast calculation and precise estimates. As MTBD models are closely related to Cladogenetic State Speciation and Extinction (ClaSSE)-like models, our findings could also be easily transferred to the macroevolution domain.


Subject(s)
Epidemics , Hemorrhagic Fever, Ebola , Humans , Phylogeny , Hemorrhagic Fever, Ebola/epidemiology , Likelihood Functions , Epidemiological Models
4.
Nucleic Acids Res ; 50(21): 12328-12343, 2022 Nov 28.
Article in English | MEDLINE | ID: mdl-36453997

ABSTRACT

G-quadruplexes (G4s) are four-stranded nucleic acid structures formed by the stacking of G-tetrads. Here we investigated their formation and function during HIV-1 infection. Using bioinformatics and biophysics analyses, we first searched for evolutionarily conserved G4-forming sequences in the HIV-1 genome. We identified 10 G4s with conservation rates higher than those of HIV-1 regulatory sequences such as RRE and TAR. We then used porphyrin-based G4-binders to probe the formation of the G4s during infection of human cells by native HIV-1. The G4-binders efficiently inhibited HIV-1 infectivity, which is attributed to the formation of G4 structures during HIV-1 replication. Using a qRT-PCR approach, we showed that the formation of viral G4s occurs during the first 2 h post-infection and their stabilization by the G4-binders prevents initiation of reverse transcription. We also used a G4-RNA pull-down approach, based on a G4-specific biotinylated probe, to allow the direct detection and identification of viral G4-RNA in infected cells. Most of the detected G4-RNAs contain crucial regulatory elements such as the PPT and cPPT sequences as well as the U3 region. Hence, these G4s would function in the early stages of infection when the viral RNA genome is being processed for the reverse transcription step.
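The abstract does not say how candidate G4-forming sequences were located. One common first pass, shown here purely as an assumption, is a pattern scan for the canonical motif of four runs of at least three guanines separated by loops of 1 to 7 nucleotides. The function name `find_g4_candidates` and the toy sequence are hypothetical.

```python
import re
from typing import List, Tuple

# Canonical motif: four runs of >=3 guanines separated by loops of 1-7 nt.
# Wrapping the motif in a lookahead lets overlapping candidates be reported.
G4_PATTERN = re.compile(
    r"(?=(G{3,}[ACGTU]{1,7}G{3,}[ACGTU]{1,7}G{3,}[ACGTU]{1,7}G{3,}))"
)


def find_g4_candidates(seq: str) -> List[Tuple[int, str]]:
    """Return (0-based start, matched sequence) for each candidate G4 motif."""
    seq = seq.upper()
    return [(m.start(), m.group(1)) for m in G4_PATTERN.finditer(seq)]


# Toy sequence, not taken from the HIV-1 genome.
hits = find_g4_candidates("ttGGGAGGGTTGGGAGGGtt")
```

A scan of this kind only proposes candidates; conservation analysis and biophysical validation, as described in the abstract, are separate steps.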


Subject(s)
G-Quadruplexes , HIV-1 , Humans , RNA/chemistry , HIV-1/genetics , Regulatory Sequences, Nucleic Acid , Conserved Sequence
5.
PLoS Pathog ; 17(8): e1009786, 2021 Aug.
Article in English | MEDLINE | ID: mdl-34370795

ABSTRACT

CRF19 is a recombinant form of HIV-1 subtypes D, A1 and G, which was first sampled in Cuba in 1999, but was already present there in the 1980s. CRF19 has been reported almost exclusively in Cuba, where it accounts for ∼25% of new HIV-positive patients and causes rapid progression to AIDS (∼3 years). We analyzed a large data set comprising ∼350 pol and env sequences sampled in Cuba over the last 15 years and ∼350 from the Los Alamos database. This data set contained both CRF19 (∼315) and A1, D and G sequences. We performed and combined analyses for the three A1, G and D regions, using fast maximum likelihood approaches, including: (1) phylogeny reconstruction, (2) spatio-temporal analysis of the virus spread, and ancestral character reconstruction for (3) transmission mode and (4) drug resistance mutations (DRMs). We verified these results with a Bayesian approach. This allowed us to acquire new insights on the CRF19 origin and transmission patterns. We showed that CRF19 recombined between 1966 and 1977, most likely in the Cuban community stationed in the Congo region. We further investigated CRF19 spread on the Cuban province level, and discovered that the epidemic started in the 1970s, most probably in Villa Clara, that it was at first carried by heterosexual transmissions, and then quickly spread in the 1980s within the "men having sex with men" (MSM) community, with multiple transmissions back to heterosexuals. The analysis of the transmission patterns of common DRMs found very few resistance transmission clusters. Our results show a very early introduction of CRF19 in Cuba, which could explain its local epidemiological success. Ignited by a major founder event, the epidemic then followed a similar pattern as other subtypes and CRFs in Cuba. The reason for the short time to AIDS remains to be understood and requires specific surveillance, in Cuba and elsewhere.


Subject(s)
Disease Transmission, Infectious/statistics & numerical data , Genetic Variation , HIV Infections/epidemiology , HIV-1/classification , Phylogeny , Bayes Theorem , Cuba/epidemiology , Female , HIV Infections/transmission , HIV Infections/virology , HIV-1/genetics , HIV-1/physiology , Humans , Male
6.
Syst Biol ; 71(3): 630-648, 2022 Apr 19.
Article in English | MEDLINE | ID: mdl-34469581

ABSTRACT

Widely used approaches for extracting phylogenetic information from aligned sets of molecular sequences rely upon probabilistic models of nucleotide substitution or amino-acid replacement. The phylogenetic information that can be extracted depends on the number of columns in the sequence alignment and will be decreased when the alignment contains gaps due to insertion or deletion events. Motivated by the measurement of information loss, we suggest assessment of the effective sequence length (ESL) of an aligned data set. The ESL can differ from the actual number of columns in a sequence alignment because of the presence of alignment gaps. Furthermore, the estimation of phylogenetic information is affected by model misspecification. Inevitably, the actual process of molecular evolution differs from the probabilistic models employed to describe this process. This disparity means the amount of phylogenetic information in an actual sequence alignment will differ from the amount in a simulated data set of equal size, which motivated us to develop a new test for model adequacy. Via theory and empirical data analysis, we show how to disentangle the effects of gaps and model misspecification. By comparing the Fisher information of actual and simulated sequences, we identify which alignment sites and tree branches are most affected by gaps and model misspecification. [Fisher information; gaps; insertion; deletion; indel; model adequacy; goodness-of-fit test; sequence alignment.].


Subject(s)
Evolution, Molecular , INDEL Mutation , Models, Genetic , Models, Statistical , Phylogeny , Sequence Alignment
7.
Bioinformatics ; 37(12): 1761-1762, 2021 Jul 19.
Article in English | MEDLINE | ID: mdl-33045068

ABSTRACT

MOTIVATION: The first cases of the COVID-19 pandemic emerged in December 2019. Until the end of February 2020, the number of available genomes was below 1000 and their multiple alignment was easily achieved using standard approaches. Subsequently, the availability of genomes has grown dramatically. Moreover, some genomes are of low quality with sequencing/assembly errors, making accurate re-alignment of all genomes nearly impossible on a daily basis. A more efficient, yet accurate approach was clearly required to pursue all subsequent bioinformatics analyses of this crucial data. RESULTS: hCoV-19 genomes are highly conserved, with very few indels and no recombination. This makes the profile HMM approach particularly well suited to align new genomes, add them to an existing alignment and filter problematic ones. Using a core of ∼2500 high quality genomes, we estimated a profile using HMMER, and implemented this profile in COVID-Align, a user-friendly interface to be used online or as standalone via Docker. The alignment of 1000 genomes requires ∼50 minutes on our cluster. Moreover, COVID-Align provides summary statistics, which can be used to determine the sequencing quality and evolutionary novelty of input genomes (e.g. number of new mutations and indels). AVAILABILITY AND IMPLEMENTATION: https://covalign.pasteur.cloud, hub.docker.com/r/evolbioinfo/covid-align. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
COVID-19 , Software , Genome , Humans , Pandemics , SARS-CoV-2
8.
Bioinformatics ; 37(11): 1506-1514, 2021 Jul 12.
Article in English | MEDLINE | ID: mdl-30726875

ABSTRACT

MOTIVATION: Most evolutionary analyses are based on pre-estimated multiple sequence alignment. Wong et al. established the existence of an uncertainty induced by multiple sequence alignment when reconstructing phylogenies. They were able to show that in many cases different aligners produce different phylogenies, with no simple objective criterion sufficient to distinguish among these alternatives. RESULTS: We demonstrate that incorporating MSA-induced uncertainty into bootstrap sampling can significantly increase correlation between clade correctness and its corresponding bootstrap value. Our procedure involves concatenating several alternative multiple sequence alignments of the same sequences, produced using different commonly used aligners. We then draw bootstrap replicates while favoring columns of the more unique aligner among the concatenated aligners. We named this concatenation and bootstrapping method Weighted Partial Super Bootstrap (wpSBOOT). We show on three simulated datasets of 16, 32 and 64 tips that our method improves the predictive power of bootstrap values. We also used as a benchmark an empirical collection of 853 one-to-one orthologous genes from seven yeast species and found wpSBOOT to significantly improve discrimination capacity between topologically correct and incorrect trees. Bootstrap values of wpSBOOT are comparable to similar readouts estimated using a single method. However, for trees reduced by 50% and 95% bootstrap thresholds, wpSBOOT yields the lowest Type I error (fewer false positives). AVAILABILITY AND IMPLEMENTATION: The automated generation of replicates has been implemented in the T-Coffee package, which is available as open-source freeware from www.tcoffee.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

9.
PLoS Comput Biol ; 17(8): e1008873, 2021 Aug.
Article in English | MEDLINE | ID: mdl-34437532

ABSTRACT

Drug resistance mutations (DRMs) appear in HIV under treatment pressure. DRMs are commonly transmitted to naive patients. The standard approach to reveal new DRMs is to test for significant frequency differences of mutations between treated and naive patients. However, we then consider each mutation individually and cannot hope to study interactions between several mutations. Here, we aim to leverage the ever-growing quantity of high-quality sequence data and machine learning methods to study such interactions (i.e. epistasis), as well as try to find new DRMs. We trained classifiers to discriminate between Reverse Transcriptase Inhibitor (RTI)-experienced and RTI-naive samples on a large HIV-1 reverse transcriptase (RT) sequence dataset from the UK (n ≈ 55,000), using all observed mutations as binary representation features. To assess the robustness of our findings, our classifiers were evaluated on independent data sets, both from the UK and Africa. Important representation features for each classifier were then extracted as potential DRMs. To find novel DRMs, we repeated this process by removing either features or samples associated with known DRMs. When keeping all known resistance signals, we detected sufficiently prevalent known DRMs, thus validating the approach. When removing features corresponding to known DRMs, our classifiers retained some prediction accuracy, and six new mutations significantly associated with resistance were identified. These six mutations have a low genetic barrier, are correlated to known DRMs, and are spatially close to either the RT active site or the regulatory binding pocket. When removing both known DRM features and sequences containing at least one known DRM, our classifiers lose all prediction accuracy. These results likely indicate that all mutations directly conferring resistance have been found, and that our newly discovered DRMs are accessory or compensatory mutations. Moreover, apart from the accessory nature of the relationships we found, we did not find any significant signal of further, more subtle epistasis combining several mutations which individually do not seem to confer any resistance.
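The "binary representation features" step, and the variant in which features for known DRMs are removed, can be sketched as follows. This is an illustration under assumptions, not the study's pipeline: each sample is assumed to be summarised as a set of mutation labels, the helper `encode_binary` is hypothetical, and the three samples are toy data.

```python
from typing import AbstractSet, Dict, List, Set, Tuple


def encode_binary(samples: Dict[str, Set[str]],
                  exclude: AbstractSet[str] = frozenset()
                  ) -> Tuple[List[str], Dict[str, List[int]]]:
    """Build one 0/1 feature per observed mutation, optionally dropping some."""
    # Feature list: every mutation seen in any sample, minus excluded labels.
    observed = {m for muts in samples.values() for m in muts}
    features = sorted(observed - set(exclude))
    # One row per sample: 1 if the sample carries the mutation, else 0.
    matrix = {sid: [int(f in muts) for f in features]
              for sid, muts in samples.items()}
    return features, matrix


# Toy data only; the labels are used as placeholders, not as study results.
toy = {
    "s1": {"K103N", "M184V"},
    "s2": {"A98S", "K103N"},
    "s3": set(),
}
all_features, all_rows = encode_binary(toy)
known_drms = {"K103N", "M184V"}
kept_features, kept_rows = encode_binary(toy, exclude=known_drms)
```

The resulting rows could be passed to any standard classifier; removing whole samples that carry a known DRM, the second variant described in the abstract, would be a filter on `toy` before encoding.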


Subject(s)
Big Data , Drug Resistance, Viral/genetics , HIV Infections/drug therapy , HIV Infections/virology , HIV-1/drug effects , HIV-1/genetics , Supervised Machine Learning , Africa , Anti-HIV Agents/pharmacology , Bayes Theorem , Computational Biology , Databases, Genetic , Decision Trees , Epistasis, Genetic , Genes, Viral , HIV Reverse Transcriptase/antagonists & inhibitors , HIV Reverse Transcriptase/chemistry , HIV Reverse Transcriptase/genetics , Humans , Logistic Models , Models, Genetic , Mutation , United Kingdom
10.
Syst Biol ; 69(3): 521-529, 2020 May 1.
Article in English | MEDLINE | ID: mdl-31432087

ABSTRACT

Reconstructing ancestral characters and traits along a phylogenetic tree is central to evolutionary biology. It is the key to understanding morphology changes among species, inferring ancestral biochemical properties of life, or recovering migration routes in phylogeography. The goal is 2-fold: to reconstruct the character state at the tree root (e.g., the region of origin of some species) and to understand the process of state changes along the tree (e.g., species flow between countries). We deal here with discrete characters, which are "unique," as opposed to sequence characters (nucleotides or amino-acids), where we assume the same model for all the characters (or for large classes of characters with site-dependent models) and thus benefit from multiple information sources. In this framework, we use mathematics and simulations to demonstrate that although each goal can be achieved with high accuracy individually, it is generally impossible to accurately estimate both the root state and the rates of state changes along the tree branches, from the observed data at the tips of the tree. This is because the global rates of state changes along the branches that are optimal for the two estimation tasks have opposite trends, leading to a fundamental trade-off in accuracy. This inherent "Darwinian uncertainty principle" concerning the simultaneous estimation of "patterns" and "processes" governs ancestral reconstructions in biology. For certain tree shapes (typically speciation trees) the uncertainty of simultaneous estimation is reduced when more tips are present; however, for other tree shapes it is not (e.g., coalescent trees used in population genetics).


Subject(s)
Classification/methods , Models, Theoretical , Phylogeny , Computer Simulation
11.
Nucleic Acids Res ; 47(W1): W260-W265, 2019 Jul 2.
Article in English | MEDLINE | ID: mdl-31028399

ABSTRACT

Phylogeny.fr, created in 2008, has been designed to facilitate the execution of phylogenetic workflows, and is nowadays widely used. However, since its development, user needs have evolved, new tools and workflows have been published, and the number of jobs has increased dramatically, thus promoting new practices, which motivated its refactoring. We developed NGPhylogeny.fr to be more flexible in terms of tools and workflows, easily installable, and more scalable. It integrates numerous tools in their latest version (e.g. TNT, FastME, MrBayes, etc.) as well as new ones designed in the last ten years (e.g. PhyML, SMS, FastTree, trimAl, BOOSTER, etc.). These tools cover a large range of usage (sequence searching, multiple sequence alignment, model selection, tree inference and tree drawing) and a large panel of standard methods (distance, parsimony, maximum likelihood and Bayesian). They are integrated in workflows, which have been already configured ('One click'), can be customized ('Advanced'), or are built from scratch ('A la carte'). Workflows are managed and run by an underlying Galaxy workflow system, which makes workflows more scalable in terms of number of jobs and size of data. NGPhylogeny.fr is deployable on any server or personal computer, and is freely accessible at https://ngphylogeny.fr.


Subject(s)
Databases, Factual , Internet , Phylogeny , Software
12.
Mol Biol Evol ; 36(9): 2069-2085, 2019 Sep 1.
Article in English | MEDLINE | ID: mdl-31127303

ABSTRACT

The reconstruction of ancestral scenarios is widely used to study the evolution of characters along phylogenetic trees. One commonly uses the marginal posterior probabilities of the character states, or the joint reconstruction of the most likely scenario. However, marginal reconstructions provide users with state probabilities, which are difficult to interpret and visualize, whereas joint reconstructions select a unique state for every tree node and thus do not reflect the uncertainty of inferences. We propose a simple and fast approach, which is in between these two extremes. We use decision-theory concepts (namely, the Brier score) to associate each node in the tree to a set of likely states. A unique state is predicted in tree regions with low uncertainty, whereas several states are predicted in uncertain regions, typically around the tree root. To visualize the results, we cluster the neighboring nodes associated with the same states and use graph visualization tools. The method is implemented in the PastML program and web server. The results on simulated data demonstrate the accuracy and robustness of the approach. PastML was applied to the phylogeography of Dengue serotype 2 (DENV2), and the evolution of drug resistances in a large HIV data set. These analyses took a few minutes and provided convincing results. PastML retrieved the main transmission routes of human DENV2 and showed the uncertainty of the human-sylvatic DENV2 geographic origin. With HIV, the results show that resistance mutations mostly emerge independently under treatment pressure, but resistance clusters are found, corresponding to transmissions among untreated patients.
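One way to turn marginal probabilities into a set of likely states with the Brier score is sketched below. It is an assumption about how such a rule can work, not a copy of PastML's code: states are ranked by marginal probability, and the number k of retained states is the one for which a uniform prediction over the top k states has the smallest squared distance to the marginals. The function name `likely_states` is hypothetical.

```python
from typing import Dict, List


def likely_states(marginals: Dict[str, float]) -> List[str]:
    """Keep the top-k states whose uniform prediction minimises the Brier score."""
    ranked = sorted(marginals.items(), key=lambda kv: kv[1], reverse=True)
    best_k, best_score = 1, float("inf")
    for k in range(1, len(ranked) + 1):
        # Retained states are predicted with probability 1/k, the others with 0.
        score = sum((p - 1.0 / k) ** 2 for _, p in ranked[:k])
        score += sum(p ** 2 for _, p in ranked[k:])
        if score < best_score:
            best_k, best_score = k, score
    return [state for state, _ in ranked[:best_k]]


# Toy marginals: one clear-cut node and one uncertain node.
confident = likely_states({"A": 0.90, "B": 0.06, "C": 0.04})
uncertain = likely_states({"A": 0.45, "B": 0.43, "C": 0.12})
```

With this rule a node keeps a single state when one probability dominates and several states when the top probabilities are close, which matches the behaviour described in the abstract for low-uncertainty and high-uncertainty regions of the tree.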


Subject(s)
Computational Biology/methods , Phylogeny , Software , Decision Theory , Dengue Virus/genetics , HIV/genetics
13.
BMC Evol Biol ; 19(1): 163, 2019 Aug 2.
Article in English | MEDLINE | ID: mdl-31375065

ABSTRACT

BACKGROUND: Ancestral character states computed from the combination of phylogenetic trees with extrinsic traits are used to decipher evolutionary scenarios in various research fields such as phylogeography, epidemiology, and ecology. Despite the existence of powerful methods and software in ancestral character state inference, difficulties may arise when interpreting the outputs of such inferences. The growing complexity of data (trees, annotations), the diversity of optimization criteria for computing trees and ancestral character states, and the combinatorial explosion of potential evolutionary scenarios if some ancestral character states do not stand out clearly from others require the design of new methods to explore associations of phylogenetic trees with extrinsic traits, to ease the visualization and interpretation of evolutionary scenarios. RESULT: We developed PastView, a user-friendly interface that includes numerical and graphical features to help users to import and/or compute ancestral character states from discrete variables and extract ancestral scenarios as sets of successive transitions of character states from the tree root to its leaves. PastView provides summarized views such as transition maps and integrates comparative tools to highlight agreements or discrepancies between methods of ancestral annotations inference. CONCLUSION: The main contribution of PastView is to assemble known numerical and graphical methods into a multi-maps graphical user interface dedicated to the computing, searching and viewing of evolutionary scenarios based on phylogenetic trees and ancestral character states. PastView is available publicly as standalone software at www.pastview.org.


Subject(s)
Phylogeny , Software , User-Computer Interface , Albania/epidemiology , Dengue/epidemiology , Dengue Virus/genetics , HIV Infections/epidemiology , HIV-1/genetics , Humans , Phenotype , Phylogeography
14.
Syst Biol ; 67(6): 997-1009, 2018 Nov 1.
Article in English | MEDLINE | ID: mdl-30295908

ABSTRACT

Phylogenetic reconstructions are essential in genomics data analyses and depend on accurate multiple sequence alignment (MSA) models. We show that all currently available large-scale progressive multiple alignment methods are numerically unstable when dealing with amino-acid sequences. They produce significantly different output when changing sequence input order. We used the HOMFAM protein sequences dataset to show that on datasets larger than 100 sequences, this instability affects on average 21.5% of the aligned residues. The resulting Maximum Likelihood (ML) trees estimated from these MSAs are equally unstable with over 38% of the branches being sensitive to the sequence input order. We established that about two-thirds of this uncertainty stems from the unordered nature of children nodes within the guide trees used to estimate MSAs. To quantify this uncertainty we developed unistrap, a novel approach that estimates the combined effect of alignment uncertainty and site sampling on phylogenetic tree branch supports. Compared with the regular bootstrap procedure, unistrap provides branch support estimates that take into account a larger fraction of the parameters impacting tree instability when processing datasets containing a large number of sequences.


Subject(s)
Classification/methods , Models, Genetic , Phylogeny , Proteins/genetics , Proteins/chemistry , Sequence Alignment , Software , Uncertainty
15.
PLoS Comput Biol ; 14(1): e1005889, 2018 Jan.
Article in English | MEDLINE | ID: mdl-29293498

ABSTRACT

Comparing and aligning protein sequences is an essential task in bioinformatics. More specifically, local alignment tools like BLAST are widely used for identifying conserved protein sub-sequences, which likely correspond to protein domains or functional motifs. However, to limit the number of false positives, these tools are used with stringent sequence-similarity thresholds and hence can miss several hits, especially for species that are phylogenetically distant from reference organisms. A solution to this problem is then to integrate additional contextual information to the procedure. Here, we propose to use domain co-occurrence to increase the sensitivity of pairwise sequence comparisons. Domain co-occurrence is a strong feature of proteins, since most protein domains tend to appear with a limited number of other domains on the same protein. We propose a method to take this information into account in a typical BLAST analysis and to construct new domain families on the basis of these results. We used Plasmodium falciparum as a case study to evaluate our method. The experimental findings showed an increase of 14% of the number of significant BLAST hits and an increase of 25% of the proteome area that can be covered with a domain. Our method identified 2240 new domains for which, in most cases, no model of the Pfam database could be linked. Moreover, our study of the quality of the new domains in terms of alignment and physicochemical properties shows that they are close to those of standard Pfam domains. Source code of the proposed approach and supplementary data are available at: https://gite.lirmm.fr/menichelli/pairwise-comparison-with-cooccurrence.


Subject(s)
Proteins/chemistry , Proteins/genetics , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Algorithms , Amino Acid Sequence , Computational Biology , Databases, Protein , Plasmodium falciparum/chemistry , Plasmodium falciparum/genetics , Protein Domains , Protozoan Proteins/chemistry , Protozoan Proteins/genetics , Sequence Alignment/statistics & numerical data , Sequence Analysis, Protein/statistics & numerical data
16.
J Math Biol ; 79(2): 485-508, 2019 Jul.
Article in English | MEDLINE | ID: mdl-31037350

ABSTRACT

The transfer distance (TD) was introduced in the classification framework and studied in the context of phylogenetic tree matching. Recently, Lemoine et al. (Nature 556(7702):452-456, 2018. https://doi.org/10.1038/s41586-018-0043-0) showed that TD can be a powerful tool to assess the branch support on large phylogenies, thus providing a relevant alternative to Felsenstein's bootstrap. This distance allows a reference branch $\beta$ in a reference tree $\mathcal{T}$ to be compared to a branch b from another tree T (typically a bootstrap tree), both on the same set of n taxa. The TD between these branches is the number of taxa that must be transferred from one side of b to the other in order to obtain $\beta$. By taking the minimum TD from $\beta$ to all branches in T we define the transfer index, denoted by $\phi(\beta, T)$, measuring the degree of agreement of T with $\beta$. Let us consider a reference branch $\beta$ having p tips on its light side and define the transfer support (TS) as $1 - \phi(\beta, T)/(p-1)$. Lemoine et al. (2018) used computer simulations to show that the TS defined in this manner is close to 0 for random "bootstrap" trees. In this paper, we demonstrate that result mathematically: when T is randomly drawn, TS converges in probability to 0 when n tends to $\infty$. Moreover, we fully characterize the distribution of $\phi(\beta, T)$ on caterpillar trees, indicating that the convergence is fast, and that even when n is small, moderate levels of branch support cannot appear by chance.
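The definitions above can be sketched in a few lines. This is a naive illustration, not the algorithm of the paper: it assumes a branch is represented by the frozenset of taxa on one of its sides, that the transfer support is normalised by p − 1 as in Lemoine et al. (2018), and it scans every branch of the other tree. The names `transfer_distance` and `transfer_support` are hypothetical.

```python
from typing import FrozenSet, Iterable

Side = FrozenSet[str]


def transfer_distance(ref: Side, other: Side, n: int) -> int:
    """Number of taxa to move so that `other` induces the same split as `ref`."""
    d = len(ref ^ other)  # taxa placed differently for this orientation
    return min(d, n - d)  # a split can be read from either side


def transfer_support(ref: Side, tree_branches: Iterable[Side], n: int) -> float:
    """1 - (transfer index) / (p - 1), with p the size of the light side of `ref`."""
    p = min(len(ref), n - len(ref))
    if p < 2:
        raise ValueError("the reference branch must be internal (p >= 2)")
    # Transfer index: minimum distance to any branch of the tree. It is capped
    # at p - 1, the value reached by a pendant branch inside the light side.
    index = min((transfer_distance(ref, b, n) for b in tree_branches),
                default=p - 1)
    index = min(index, p - 1)
    return 1.0 - index / (p - 1)


# Toy example on six taxa; the other tree is ((A,B),D),(C,(E,F)).
ref_branch = frozenset({"A", "B", "C"})
other_tree = [frozenset({"A", "B"}), frozenset({"A", "B", "D"}), frozenset({"E", "F"})]
ts = transfer_support(ref_branch, other_tree, n=6)
```

Averaging this quantity over bootstrap trees would give a support value in [0,1] that decreases gradually as taxa are misplaced, instead of dropping to 0 at the first mismatch.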


Subject(s)
Gene Transfer, Horizontal , Models, Genetic , Phylogeny , Algorithms , Computer Simulation
17.
Proc Natl Acad Sci U S A ; 113(41): 11537-11542, 2016 Oct 11.
Article in English | MEDLINE | ID: mdl-27681623

ABSTRACT

Recent experiments provide sound arguments in favor of the in vivo expression of the AntiSense Protein (ASP) of HIV-1. This putative protein is encoded on the antisense strand of the provirus genome and entirely overlapped by the env gene with reading frame -2. The existence of ASP was suggested in 1988, but is still controversial, and its function has yet to be determined. We used a large dataset of ∼23,000 HIV-1 and SIV sequences to study the origin, evolution, and conservation of the asp gene. We found that the ASP ORF is specific to group M of HIV-1, which is responsible for the human pandemic. Moreover, the correlation between the presence of asp and the prevalence of HIV-1 groups and M subtypes appeared to be statistically significant. We then looked for evidence of selection pressure acting on asp. Using computer simulations, we showed that the conservation of the ASP ORF in the group M could not be due to chance. Standard methods were ineffective in disentangling the two selection pressures imposed by both the Env and ASP proteins, an expected outcome with overlaps in frame -2. We thus developed a method based on careful evolutionary analysis of the presence/absence of stop codons, revealing that ASP does impose significant selection pressure. All of these results support the idea that asp is the 10th gene of HIV-1 group M and indicate a correlation with the spread of the pandemic.
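The basic building block of a presence/absence analysis of stop codons on the antisense strand can be sketched as below. This is only an illustration, not the authors' method: it assumes a DNA sequence over A, C, G, T, reads the reverse complement in a frame chosen by a plain 0/1/2 offset (how that offset maps to "frame -2" relative to env is left open), and the function names are hypothetical.

```python
from typing import List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Sequence of the antisense strand, read 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def antisense_stop_positions(seq: str, offset: int = 0) -> List[int]:
    """0-based codon indices of stop codons on the antisense strand."""
    rc = reverse_complement(seq)
    # Split into complete codons starting at the chosen offset.
    codons = [rc[i:i + 3] for i in range(offset, len(rc) - 2, 3)]
    return [i for i, codon in enumerate(codons) if codon in STOP_CODONS]


# Toy sense-strand fragment; its reverse complement is ATGGCATAA.
stops_frame0 = antisense_stop_positions("TTATGCCAT", offset=0)
stops_frame1 = antisense_stop_positions("TTATGCCAT", offset=1)
```

An open reading frame is conserved in a sequence when the list for the relevant frame is empty over the region of interest; comparing that pattern across many sequences is the kind of evidence the abstract refers to.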


Subject(s)
HIV Infections/epidemiology , HIV Infections/virology , HIV-1/genetics , Pandemics , Viral Proteins/genetics , Base Sequence , Conserved Sequence/genetics , Evolution, Molecular , Genome, Viral , HIV Infections/genetics , Phylogeny , Reading Frames/genetics , Selection, Genetic
18.
Mol Biol Evol ; 34(9): 2422-2424, 2017 Sep 1.
Article in English | MEDLINE | ID: mdl-28472384

ABSTRACT

Model selection using likelihood-based criteria (e.g., AIC) is one of the first steps in phylogenetic analysis. One must select both a substitution matrix and a model for rates across sites. A simple method is to test all combinations and select the best one. We describe heuristics to avoid these extensive calculations. Runtime is divided by ∼2 with results remaining nearly the same, and the method performs well compared with ProtTest and jModelTest2. Our software, "Smart Model Selection" (SMS), is implemented in the PhyML environment and available using two interfaces: command-line (to be integrated in pipelines) and a web server (http://www.atgc-montpellier.fr/phyml-sms/).
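The exhaustive baseline mentioned above (test all combinations and keep the best) reduces to ranking candidate models by AIC = 2k − 2 ln L, where k is the number of free parameters and L the maximised likelihood. The sketch below shows only that baseline, not the SMS heuristics; the log-likelihoods and parameter counts are made-up toy values, and parameters shared by all models are left out of k.

```python
from typing import Dict, Tuple


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion; lower is better."""
    return 2.0 * n_params - 2.0 * log_likelihood


def best_model(fits: Dict[str, Tuple[float, int]]) -> str:
    """Name of the model with the smallest AIC among fitted candidates."""
    return min(fits, key=lambda name: aic(*fits[name]))


# Toy fits: model name -> (maximised log-likelihood, number of free parameters).
toy_fits = {
    "JC69": (-1050.0, 0),
    "HKY85": (-1020.0, 4),
    "GTR+G": (-1018.5, 9),
}
selected = best_model(toy_fits)
```

In the toy data the richest model has the highest likelihood but loses to HKY85 once the parameter penalty is applied, which is the trade-off the criterion is meant to capture.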


Subject(s)
Computational Biology/methods , Algorithms , Likelihood Functions , Models, Genetic , Phylogeny , Software
19.
PLoS Comput Biol ; 13(3): e1005416, 2017 Mar.
Article in English | MEDLINE | ID: mdl-28263987

ABSTRACT

Inferring epidemiological parameters such as the R0 from time-scaled phylogenies is a timely challenge. Most current approaches rely on likelihood functions, which raise specific issues that range from computing these functions to finding their maxima numerically. Here, we present a new regression-based Approximate Bayesian Computation (ABC) approach, which we base on a large variety of summary statistics intended to capture the information contained in the phylogeny and its corresponding lineage-through-time plot. The regression step involves the Least Absolute Shrinkage and Selection Operator (LASSO) method, which is a robust machine learning technique. It allows us to readily deal with the large number of summary statistics, while avoiding resorting to Markov Chain Monte Carlo (MCMC) techniques. To compare our approach to existing ones, we simulated target trees under a variety of epidemiological models and settings, and inferred parameters of interest using the same priors. We found that, for large phylogenies, the accuracy of our regression-ABC is comparable to that of likelihood-based approaches involving birth-death processes implemented in BEAST2. Our approach even outperformed these when inferring the host population size with a Susceptible-Infected-Removed epidemiological model. It also clearly outperformed a recent kernel-ABC approach when assuming a Susceptible-Infected epidemiological model with two host types. Lastly, by re-analyzing data from the early stages of the recent Ebola epidemic in Sierra Leone, we showed that regression-ABC provides more realistic estimates for the duration parameters (latency and infectiousness) than the likelihood-based method. Overall, ABC based on a large variety of summary statistics and a regression method able to perform variable selection and avoid overfitting is a promising approach to analyze large phylogenies.


Subject(s)
Bayes Theorem , Disease Outbreaks/statistics & numerical data , Ebolavirus/genetics , Hemorrhagic Fever, Ebola/epidemiology , Hemorrhagic Fever, Ebola/virology , Models, Statistical , Algorithms , Computer Simulation , Ebolavirus/classification , Ebolavirus/isolation & purification , Humans , Incidence , Phylogeny , Regression Analysis , Risk Factors , Sierra Leone/epidemiology , Virus Latency/genetics