ABSTRACT
The high coverage offered by next-generation sequencing (NGS) technology has facilitated the assessment of the complexity of intra-host RNA viral populations at an unprecedented level of detail. Consequently, analysis of NGS datasets could be used to extract and infer crucial epidemiological and biomedical information on the levels of both infected individuals and susceptible populations, thus enabling the development of more effective prevention strategies and antiviral therapeutics. Such information includes drug resistance, infection stage, transmission clusters and structures of transmission networks. However, NGS data require sophisticated analysis dealing with millions of error-prone short reads per patient. Prior to the NGS era, epidemiological and phylogenetic analyses were geared toward Sanger sequencing technology; now, they must be redesigned to handle the large-scale NGS datasets and properly model the evolution of heterogeneous rapidly mutating viral populations. Additionally, dedicated epidemiological surveillance systems require big data analytics to handle millions of reads obtained from thousands of patients for rapid outbreak investigation and management. We survey bioinformatics tools analyzing NGS data for (i) characterization of intra-host viral population complexity including single nucleotide variant and haplotype calling; (ii) downstream epidemiological analysis and inference of drug-resistant mutations, age of infection and linkage between patients; and (iii) data collection and analytics in surveillance systems for fast response and control of outbreaks.
Subject(s)
Epidemiological Monitoring , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , RNA Virus Infections/virology , RNA Viruses/genetics , Humans , RNA Virus Infections/epidemiology , RNA Viruses/classification , RNA Viruses/isolation & purification , RNA Viruses/pathogenicity
ABSTRACT
Rapidly evolving RNA viruses continuously produce minority haplotypes that can become dominant if they are drug-resistant or can better evade the immune system. Therefore, early detection and identification of minority viral haplotypes may help to promptly adjust the patient's treatment plan, preventing potential disease complications. Minority haplotypes can be identified using next-generation sequencing, but sequencing noise hinders accurate identification. The elimination of sequencing noise is a non-trivial task that remains open. Here we propose CliqueSNV, which is based on extracting pairs of statistically linked mutations from noisy reads. This effectively reduces sequencing noise and enables identifying minority haplotypes with frequencies below the sequencing error rate. We comparatively assess the performance of CliqueSNV using an in vitro mixture of nine haplotypes that were derived from the mutation profile of an existing HIV patient. We show that CliqueSNV can accurately assemble viral haplotypes with frequencies as low as 0.1% and maintains consistent performance across short- and long-read sequencing platforms.
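The idea of extracting pairs of linked mutations can be illustrated with a minimal sketch. It is not the CliqueSNV implementation: the function name `linked_pairs`, the thresholds `min_joint` and `min_ratio`, and the observed-to-expected ratio used in place of a proper statistical test are all assumptions made for illustration, and every read is assumed to cover the whole region without gaps.

```python
from collections import Counter
from itertools import combinations

def linked_pairs(reads, consensus, min_joint=2, min_ratio=5.0):
    """Return pairs of variants that co-occur on reads far more often than
    independent sequencing errors would explain (hypothetical criterion)."""
    n = len(reads)
    single = Counter()  # how many reads carry each (position, base) variant
    joint = Counter()   # how many reads carry both variants of a pair
    for read in reads:
        variants = [(i, b) for i, (b, c) in enumerate(zip(read, consensus)) if b != c]
        single.update(variants)
        joint.update(combinations(variants, 2))
    linked = []
    for (u, v), observed in joint.items():
        # Expected number of co-occurrences if the two variants were independent.
        expected = single[u] * single[v] / n
        if observed >= min_joint and observed / expected >= min_ratio:
            linked.append((u, v))
    return linked

# Toy data: 100 reads, a 3% minority haplotype carrying two mutations,
# plus a few reads with unrelated sequencing errors.
consensus = "AAAAAAAA"
reads = (
    [consensus] * 93
    + ["ACAAAGAA"] * 3                         # minority haplotype: C at 1, G at 5
    + ["TAAAAAAA", "AAATAAAA", "AAAAAAAT"]     # isolated single errors
    + ["GAAAAAAC"]                             # two errors co-occurring once
)
print(linked_pairs(reads, consensus))
```

In this toy input, only the pair carried by the minority haplotype passes both thresholds; the pair seen on a single read is rejected by `min_joint`.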
Subject(s)
Algorithms , Computational Biology/methods , Haplotypes , High-Throughput Nucleotide Sequencing/methods , RNA Virus Infections/diagnosis , RNA Viruses/genetics , COVID-19/diagnosis , COVID-19/virology , Gene Frequency , HIV Infections/diagnosis , HIV Infections/virology , HIV-1/genetics , Humans , Mutation , Polymorphism, Single Nucleotide , RNA Virus Infections/virology , Reproducibility of Results , SARS-CoV-2/genetics , Sensitivity and Specificity
ABSTRACT
Outbreak investigations use data from interviews, healthcare providers, laboratories and surveillance systems. However, integrated use of data from multiple sources requires a patchwork of software that presents challenges in usability, interoperability, confidentiality, and cost. Rapid integration, visualization and analysis of data from multiple sources can guide effective public health interventions. We developed MicrobeTrace to facilitate rapid public health responses by overcoming barriers to data integration and exploration in molecular epidemiology. MicrobeTrace is a web-based, client-side, JavaScript application (https://microbetrace.cdc.gov) that runs in Chromium-based browsers and remains fully operational without an internet connection. Using publicly available data, we demonstrate the analysis of viral genetic distance networks and introduce a novel approach to minimum spanning trees that simplifies results. We also illustrate the potential utility of MicrobeTrace in support of contact tracing by analyzing and displaying data from an outbreak of SARS-CoV-2 in South Korea in early 2020. MicrobeTrace is developed and actively maintained by the Centers for Disease Control and Prevention. Users can email microbetrace@cdc.gov for support. The source code is available at https://github.com/cdcgov/microbetrace.
Subject(s)
Communicable Diseases/epidemiology , Data Visualization , Molecular Epidemiology/methods , Public Health/methods , Software , Centers for Disease Control and Prevention, U.S. , Disease Outbreaks , Humans , United States
ABSTRACT
BACKGROUND: RNA viruses mutate at extremely high rates, forming an intra-host viral population of closely related variants, which allows them to evade the host's immune system and makes them particularly dangerous. Viral outbreaks pose a significant threat to public health, and to deal with them it is critical to infer transmission clusters, i.e., to decide whether two viral samples belong to the same outbreak. Next-generation sequencing (NGS) can significantly help in tackling outbreak-related problems. While NGS data are first obtained as short reads, existing methods rely on assembled sequences. This requires reconstruction of the entire viral population, which is complicated, error-prone and time-consuming. RESULTS: The experimental validation using sequencing data from HCV outbreaks shows that the proposed algorithm can successfully identify genetic relatedness between viral populations, infer transmission direction, transmission clusters and outbreak sources, as well as decide whether the source is present in the sequenced outbreak sample and identify it. CONCLUSIONS: The introduced algorithm clusters genetically related samples, infers transmission directions and predicts sources of outbreaks. Validation on experimental data demonstrated that the algorithm is able to reconstruct various transmission characteristics. An advantage of the method is that it works directly on raw NGS reads and bypasses cumbersome read assembly, thus avoiding errors introduced by assembly and saving processing time.
Subject(s)
Hepacivirus , RNA Viruses , Algorithms , Disease Outbreaks , Hepacivirus/genetics , High-Throughput Nucleotide Sequencing
ABSTRACT
Summary: Genomic sequences are assembled into a variable but large number of contigs that should be scaffolded (ordered and oriented) to facilitate comparative or functional analysis. Finding a scaffolding is computationally challenging due to misassemblies, inconsistent coverage across the genome and long repeats. An accurate assessment of scaffolding tools should take into account multiple locations of the same contig on the reference scaffolding rather than matching a repeat to a single best location. This makes mapping of inferred scaffoldings onto the reference a computationally challenging problem. This paper formulates the repeat-aware scaffolding evaluation problem, which is to find a mapping of the inferred scaffolding onto the reference that maximizes the number of correct links, and proposes a scalable algorithm capable of handling large whole-genome datasets. Our novel scaffolding validation framework has been applied to assess most state-of-the-art scaffolding tools on a representative subset of Genome Assembly Golden-Standard Evaluations (GAGE) datasets and some novel simulated datasets. Availability and implementation: The source code of this evaluation framework is available at https://github.com/mandricigor/repeat-aware. The documentation is hosted at https://mandricigor.github.io/repeat-aware. Supplementary information: Supplementary data are available at Bioinformatics online.
Subject(s)
Contig Mapping/methods , Genome , Repetitive Sequences, Nucleic Acid , Sequence Analysis, DNA/methods , Software , Algorithms , Bacteria/genetics , Eukaryota/genetics , Genomics/methods , Humans
ABSTRACT
Motivation: Genomic analysis has become one of the major tools for disease outbreak investigations. However, existing computational frameworks for inference of transmission history from viral genomic data often do not consider intra-host diversity of pathogens and heavily rely on additional epidemiological data, such as sampling times and exposure intervals. This impedes genomic analysis of outbreaks of highly mutable viruses associated with chronic infections, such as human immunodeficiency virus and hepatitis C virus, whose transmissions are often carried out through minor intra-host variants, while the additional epidemiological information is often either unavailable or of limited use. Results: The proposed framework QUasispecies Evolution, Network-based Transmission INference (QUENTIN) addresses the above challenges by evolutionary analysis of intra-host viral populations sampled by deep sequencing and Bayesian inference using general properties of social networks relevant to infection dissemination. This method allows inference of transmission direction even without the supporting case-specific epidemiological information, identification of transmission clusters and reconstruction of transmission history. QUENTIN was validated on experimental and simulated data, and applied to investigate HCV transmission within a community of hosts with high-risk behavior. It is available at https://github.com/skumsp/QUENTIN. Contact: pskums@gsu.edu or alexz@cs.gsu.edu or rahul@sfsu.edu or yek0@cdc.gov. Supplementary information: Supplementary data are available at Bioinformatics online.
Subject(s)
Genome, Viral , High-Throughput Nucleotide Sequencing/methods , Quasispecies , Sequence Analysis, RNA/methods , Software , Bayes Theorem , Disease Outbreaks , Genomics/methods , Hepacivirus/genetics , Humans , Sequence Analysis, DNA/methods
ABSTRACT
BACKGROUND: RNA viruses such as HCV and HIV mutate at extremely high rates, and as a result, they exist in infected hosts as populations of genetically related variants. Recent advances in sequencing technologies make it possible to characterize such populations at great depth. In particular, these technologies provide new opportunities for inference of relatedness between viral samples, identification of transmission clusters and sources of infection, which are crucial tasks for viral outbreak investigations. RESULTS: We present (i) an evolutionary simulation algorithm Viral Outbreak InferenCE (VOICE) inferring genetic relatedness, (ii) an algorithm MinDistB detecting possible transmission using minimal distances between intra-host viral populations and sizes of their relative borders, and (iii) a non-parametric recursive clustering algorithm Relatedness Depth (ReD) analyzing clusters' structure to infer possible transmissions and their directions. All proposed algorithms were validated using real sequencing data from HCV outbreaks. CONCLUSIONS: All algorithms are applicable to the analysis of outbreaks of highly heterogeneous RNA viruses. Our experimental validation shows that they can successfully identify genetic relatedness between viral populations, as well as infer transmission clusters and outbreak sources.
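The minimal-distance criterion mentioned above can be sketched as follows. This is an illustration only, not the MinDistB algorithm: it covers only the minimal Hamming distance between two intra-host populations and omits the "relative border" component, whose definition is not given in the abstract. The function names and the `max_dist` threshold are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def min_distance(pop_a: list[str], pop_b: list[str]) -> int:
    """Smallest Hamming distance between any variant of one host and any variant of the other."""
    return min(hamming(x, y) for x in pop_a for y in pop_b)

def possibly_linked(pop_a: list[str], pop_b: list[str], max_dist: int) -> bool:
    """Flag a possible transmission link when the closest pair of variants is within max_dist."""
    return min_distance(pop_a, pop_b) <= max_dist

# Toy intra-host populations (hypothetical sequences).
host_a = ["ACGTACGT", "ACGTACGA"]
host_b = ["ACGTTCGA", "TTTTACGT"]
host_c = ["TTTTTTTT"]
print(min_distance(host_a, host_b), possibly_linked(host_a, host_b, max_dist=1))
print(min_distance(host_a, host_c), possibly_linked(host_a, host_c, max_dist=1))
```

Comparing the closest pair of variants, rather than consensus sequences, reflects the point made in the abstract that transmission analysis should use the whole intra-host population.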
Subject(s)
Computational Biology , Hepacivirus/genetics , Phylogeny , Quasispecies/genetics , Sequence Analysis, RNA , Algorithms , Cluster Analysis , Genome, Viral/genetics , RNA, Viral/genetics
ABSTRACT
Succinic semialdehyde dehydrogenase (SSADH) converts succinic semialdehyde (SSA) to succinic acid in the mitochondrial matrix and is involved in the metabolism of the inhibitory neurotransmitter γ-aminobutyric acid (GABA). The molecular structure of human SSADH revealed the intrinsic regulatory mechanism--redox-switch modulation--by which large conformational changes are brought about in the catalytic loop through disulfide bonding. The crystal structures revealed two SSADH conformations, and computational modeling of transformation between them can provide substantial insights into detailed dynamic redox modulation. On the basis of these two clear crystal structures, we modeled the conformational motion between these structures in silico. For that purpose, we proposed and used a geometry-based coarse-grained mathematical model of long-range protein motion and the related modeling algorithm. The algorithm is based on solving the special optimization problem, which is similar to the classical Monge-Kantorovich mass transportation problem. The modeled transformation was supported by another morphing method based on a completely different framework. The result of the modeling facilitates better interpretation and understanding of the SSADH biological role.
Subject(s)
Models, Molecular , Succinate-Semialdehyde Dehydrogenase/chemistry , Algorithms , Catalytic Domain , Disulfides/chemistry , Humans , Oxidation-Reduction , Protein Conformation
ABSTRACT
Evaluating changes in metabolic pathway activity is essential for studying disease mechanisms and developing new treatments, with significant benefits extending to human health. Here, we propose EMPathways2, a maximum likelihood pipeline based on the expectation-maximization algorithm that is capable of evaluating enzyme expression and metabolic pathway activity levels. We first estimate enzyme expression from RNA-seq data; these estimates are then used for simultaneous estimation of pathway activity levels, taking into account enzyme participation levels in each pathway. We apply the novel pipeline to RNA-seq data from several groups of mice, which provides a deeper look at the biochemical changes occurring as a result of bacterial infection, disease, and immune response. Our results show that estimated enzyme expression, pathway activity levels, and enzyme participation levels in each pathway are robust and stable across all samples. Estimated activity levels of a significant number of metabolic pathways strongly correlate with the infected and uninfected status of the respective rodent types.
ABSTRACT
Human inborn errors of immunity include rare disorders, collectively called the common variable immunodeficiency (CVID) phenotype, that entail functional and quantitative antibody deficiencies due to impaired B cells. Patients with CVID face delayed diagnoses and treatments for 5 to 15 years after symptom onset because the disorders are rare (prevalence of ~1/25,000), and there is extensive heterogeneity in CVID phenotypes, ranging from infections to autoimmunity to inflammatory conditions, overlapping with other more common disorders. The prolonged diagnostic odyssey drives excessive system-wide costs before diagnosis. Because there is no single causal mechanism, there are no genetic tests to definitively diagnose CVID. Here, we present PheNet, a machine learning algorithm that identifies patients with CVID from their electronic health records (EHRs). PheNet learns phenotypic patterns from verified CVID cases and uses this knowledge to rank patients by likelihood of having CVID. PheNet could have diagnosed more than half of our patients with CVID 1 or more years earlier than they had been diagnosed. When applied to a large EHR dataset, followed by blinded chart review of the top 100 patients ranked by PheNet, we found that 74% were highly probable to have CVID. We externally validated PheNet using >6 million records from disparate medical systems in California and Tennessee. As artificial intelligence and machine learning make their way into health care, we show that algorithms such as PheNet can offer clinical benefits by expediting the diagnosis of rare diseases.
Subject(s)
Common Variable Immunodeficiency , Electronic Health Records , Humans , Common Variable Immunodeficiency/diagnosis , Machine Learning , Algorithms , Male , Female , Phenotype , Adult , Undiagnosed Diseases/diagnosis
ABSTRACT
We investigated transmission dynamics of a large human immunodeficiency virus (HIV) outbreak among persons who inject drugs (PWID) in KY and OH during 2017-20 by using detailed phylogenetic, network, recombination, and cluster dating analyses. Using polymerase (pol) sequences from 193 people associated with the investigation, we document high HIV-1 diversity, including Subtype B (44.6 per cent); numerous circulating recombinant forms (CRFs) including CRF02_AG (2.5 per cent) and CRF02_AG-like (21.8 per cent); and many unique recombinant forms composed of CRFs with major subtypes and sub-subtypes [CRF02_AG/B (24.3 per cent), B/CRF02_AG/B (0.5 per cent), and A6/D/B (6.4 per cent)]. Cluster analysis of sequences using a 1.5 per cent genetic distance identified thirteen clusters, including a seventy-five-member cluster composed of CRF02_AG-like and CRF02_AG/B, an eighteen-member CRF02_AG/B cluster, Subtype B clusters of sizes ranging from two to twenty-three, and a nine-member A6/D and A6/D/B cluster. Recombination and phylogenetic analyses identified CRF02_AG/B variants with ten unique breakpoints likely originating from Subtype B and CRF02_AG-like viruses in the largest clusters. The addition of contact tracing results from OH to the genetic networks identified linkage between persons with Subtype B, CRF02_AG, and CRF02_AG/B sequences in the clusters supporting de novo recombinant generation. Superinfection prevalence was 13.3 per cent (8/60) in persons with multiple specimens and included infection with B and CRF02_AG; B and CRF02_AG/B; or B and A6/D/B. In addition to the presence of multiple, distinct molecular clusters associated with this outbreak, cluster dating inferred transmission associated with the largest molecular cluster occurred as early as 2006, with high transmission rates during 2017-8 in certain other molecular clusters. 
This outbreak among PWID in KY and OH was likely driven by rapid transmission of multiple HIV-1 variants including de novo viral recombinants from circulating viruses within the community. Our findings documenting the high HIV-1 transmission rate and clustering through partner services and molecular clusters emphasize the importance of leveraging multiple different data sources and analyses, including those from disease intervention specialist investigations, to better understand outbreak dynamics and interrupt HIV spread.
ABSTRACT
The UCLA ATLAS Community Health Initiative (ATLAS) has an initial target to recruit 150,000 participants from across the UCLA Health system with the goal of creating a genomic database to accelerate precision medicine efforts in California. This initiative includes a biobank embedded within the UCLA Health system that comprises de-identified genomic data linked to electronic health records (EHRs). The first freeze of data from September 2020 contains 27,987 genotyped samples imputed to 7.9 million SNPs across the genome and is linked with de-identified versions of the EHRs from UCLA Health. Here, we describe a centralized repository of the genotype data and provide tools and pipelines to perform genome- and phenome-wide association studies across a wide range of EHR-derived phenotypes and genetic ancestry groups. We demonstrate the utility of this resource through the analysis of 7 well-studied traits and recapitulate many previous genetic and phenotypic associations.
ABSTRACT
Computational methods represent the lifeblood of modern molecular biology. Benchmarking is important for all methods; for computational methods, the focus here, it is critical to dissect important steps of analysis pipelines, formally assess performance across common situations as well as edge cases, and ultimately guide users on what tools to use. Benchmarking can also be important for community building and advancing methods in a principled way. We conducted a meta-analysis of recent single-cell benchmarks to summarize their scope, extensibility, and neutrality, as well as technical features and whether best practices in open data and reproducible research were followed. The results highlight that while benchmarks often make code available and are in principle reproducible, they remain difficult to extend, for example, as new methods and new ways to assess methods emerge. In addition, embracing containerization and workflow systems would enhance reusability of intermediate benchmarking results, thus also driving wider adoption.
Subject(s)
Benchmarking , Computational Biology , Computational Biology/methods , Workflow
ABSTRACT
Global climate change, together with the cyclicity of natural and climatic processes during the growing season of berry plants, weakens the plants' defense system against (a)biotic stressors, which makes accelerated cultivar-improving breeding more urgent. A new hybrid red currant material was obtained by interspecific hybridization and studied. Correlation analysis was used to assess the relationship between adaptively significant and economical and biological traits. To assess intergenotypic variability, hierarchical clustering was used according to the studied features, which allowed combining three standard methods of multidimensional data analysis. Genotypes adapted to different stressors were identified. The genotypes 271-58-24, 44-5-2, 261-65-19, and 'Jonkheer van Tets' were found to have a higher ratio of bound water to free water as compared with the others. Moreover, the genotypes 271-58-24, 261-65-19, 77-1-47, and 'Jonkheer van Tets' were found to have less cold damage during the cold periods. The most productive genotypes were found to be 44-5-2, 143-23-35, and 1426-21-80. A dependence of yield on the beginning of differentiation of flower buds, which led to the abundance of inflorescences, was revealed. Rapid restoration of leaf hydration ensured successful adaptation of genotypes to the "temperature shock" of the growing season. The genotypes 271-58-24 and 'Jonkheer van Tets' were then observed to be far from the test traits and none of these traits were observed to characterize these two genotypes. The genotypes 261-65-19 and 77-1-47 were then observed to be characterized by their high stability to Cecidophyopsis ribis scores. Genotypes 261-65-19 and 271-58-24, obtained with the participation of 'Jonkheer van Tets' as the maternal form, showed sufficient resistance to Pseudopeziza ribis and Cecidophyopsis ribis. 
Overall results suggested that the hydration recovery of red currant plants is important for yield improvement. A new cultivar, 'Podarok Pobediteliam' (genotype 44-5-2), was obtained that meets the requirements of intensive gardening and is characterized by high adaptability, productivity, and technological effectiveness.
ABSTRACT
BACKGROUND: Large medical centers in urban areas, like Los Angeles, care for a diverse patient population and offer the potential to study the interplay between genetic ancestry and social determinants of health. Here, we explore the implications of genetic ancestry within the University of California, Los Angeles (UCLA) ATLAS Community Health Initiative, an ancestrally diverse biobank of genomic data linked with de-identified electronic health records (EHRs) of UCLA Health patients (N=36,736). METHODS: We quantify the extensive continental and subcontinental genetic diversity within the ATLAS data through principal component analysis, identity-by-descent, and genetic admixture. We assess the relationship between genetically inferred ancestry (GIA) and >1500 EHR-derived phenotypes (phecodes). Finally, we demonstrate the utility of genetic data linked with EHR to perform ancestry-specific and multi-ancestry genome- and phenome-wide scans across a broad set of disease phenotypes. RESULTS: We identify 5 continental-scale GIA clusters including European American (EA), African American (AA), Hispanic Latino American (HL), South Asian American (SAA) and East Asian American (EAA) individuals and 7 subcontinental GIA clusters within the EAA GIA corresponding to Chinese American, Vietnamese American, and Japanese American individuals. Although we broadly find that self-identified race/ethnicity (SIRE) is highly correlated with GIA, we still observe marked differences between the two, emphasizing that the populations defined by these two criteria are not analogous. We find a total of 259 significant associations between continental GIA and phecodes even after accounting for individuals' SIRE, demonstrating that for some phenotypes, GIA provides information not already captured by SIRE. GWAS identifies significant associations for liver disease in the 22q13.31 locus across the HL and EAA GIA groups (HL p-value=2.32×10⁻¹⁶, EAA p-value=6.73×10⁻¹¹). 
A subsequent PheWAS at the top SNP reveals significant associations with neurologic and neoplastic phenotypes specifically within the HL GIA group. CONCLUSIONS: Overall, our results explore the interplay between SIRE and GIA within a disease context and underscore the utility of studying the genomes of diverse individuals through biobank-scale genotyping linked with EHR-based phenotyping.
Subject(s)
Electronic Health Records , Public Health , Asian People , Biological Specimen Banks , Genomics , Humans
ABSTRACT
This article presents a novel scalable character-based phylogeny algorithm for dense viral sequencing data called SPHERE (Scalable PHylogEny with REcurrent mutations). The algorithm is based on an evolutionary model where recurrent mutations are allowed, but backward mutations are prohibited. The algorithm creates rooted character-based phylogeny trees, wherein all leaves and internal nodes are labeled by observed taxa. We show that SPHERE phylogeny is more stable than Nextstrain's, and that it accurately infers known transmission links from the early pandemic. SPHERE is a fast algorithm that can process >200,000 sequences in <2 hours, which offers a compact phylogenetic visualization of Global Initiative on Sharing All Influenza Data (GISAID) data.
Subject(s)
Mutation , Phylogeny , SARS-CoV-2/genetics , Algorithms , COVID-19/transmission , COVID-19/virology , Databases, Genetic , Humans
ABSTRACT
In this article, we present our novel pipeline for analysis of metabolic activity using a microbial community's metatranscriptome sequence data set for validation. Our method is based on the expectation-maximization (EM) algorithm and provides enzyme expression and pathway activity levels. Further expanding our analysis, we consider individual enzymatic activity and compute enzyme participation coefficients to approximate the metabolic pathway activity more accurately. We apply our EM pathways pipeline to a metatranscriptomic data set of a plankton community from surface waters of the Northern Gulf of Mexico. The data set consists of RNA-seq data and respective environmental parameters, which were sampled at two depths, six times a day over multiple 24-hour cycles. Furthermore, we discuss microbial dependence on the day-night cycle within our findings based on a three-way correlation of the enzyme expression during antipodal times, midnight and noon. We show that the enzyme participation levels strongly affect the metabolic activity estimates: that is, marginal and multiple linear regression of enzymatic and metabolic pathway activity correlated significantly with the recorded environmental parameters. Our analysis statistically validates that EM-based methods produce meaningful results, as our method confirms statistically significant dependence of metabolic pathway activity on the environmental parameters, such as salinity, temperature, brightness, and a few others.
Subject(s)
Bacteria/genetics , Gene Expression Profiling/methods , Metabolic Networks and Pathways , Plankton/microbiology , Algorithms , Gulf of Mexico , Linear Models , Metagenomics , Sequence Analysis, RNA
ABSTRACT
The availability of millions of SARS-CoV-2 (Severe Acute Respiratory Syndrome-Coronavirus-2) sequences in public databases such as GISAID (Global Initiative on Sharing All Influenza Data) and EMBL-EBI (European Molecular Biology Laboratory-European Bioinformatics Institute) (the United Kingdom) allows the evolution, genomic diversity, and dynamics of a virus to be studied in more detail than ever before. Here, we identify novel variants and subtypes of SARS-CoV-2 by clustering sequences, adapting methods originally designed for haplotyping intrahost viral populations. We assess our results using clustering entropy, which is used here for the first time in this context. Our clustering approach reaches lower entropies compared with other methods, and we are able to boost this even further through gap filling and Monte Carlo-based entropy minimization. Moreover, our method clearly identifies the well-known Alpha variant in the U.K. and GISAID data sets, and is also able to detect the much less represented (<1% of the sequences) Beta (South Africa), Epsilon (California), and Gamma and Zeta (Brazil) variants in the GISAID data set. Finally, we show that each variant identified has high selective fitness, based on the growth rate of its cluster over time. This demonstrates that our clustering approach is a viable alternative for detecting even rare subtypes in very large data sets.
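As a rough illustration of how a clustering of aligned sequences can be scored by entropy, the sketch below sums the Shannon entropy of each alignment column within a cluster and averages the cluster scores weighted by cluster size. This particular definition is an assumption made for illustration; the abstract does not specify the exact formula, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def column_entropy(column) -> float:
    """Shannon entropy (in bits) of the symbols observed in one alignment column."""
    n = len(column)
    return -sum(c / n * log2(c / n) for c in Counter(column).values())

def cluster_entropy(seqs: list[str]) -> float:
    """Sum of column entropies over an aligned cluster (0 when all sequences are identical)."""
    return sum(column_entropy(col) for col in zip(*seqs))

def clustering_entropy(clusters: list[list[str]]) -> float:
    """Size-weighted average of cluster entropies; lower means more homogeneous clusters."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)

# Toy example: the same four sequences grouped in two different ways.
good = [["AA", "AA"], ["CC", "CC"]]   # each cluster is homogeneous
bad = [["AA", "CC"], ["AA", "CC"]]    # each cluster mixes two variants
print(clustering_entropy(good), clustering_entropy(bad))
```

Under this definition, a partition that separates the two variants scores lower than one that mixes them, which is the sense in which lower entropy indicates a better clustering.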
Subject(s)
Cluster Analysis , Computational Biology/methods , Brazil , Databases, Genetic , Entropy , Humans , Monte Carlo Method , South Africa , United Kingdom , United States
ABSTRACT
HIV-1 subtype CRF01_AE is the second most predominant strain in Bulgaria, yet little is known about the molecular epidemiology of its origin and transmissibility. We used a phylodynamics approach to better understand this sub-epidemic by analyzing 270 HIV-1 polymerase (pol) sequences collected from persons diagnosed with HIV/AIDS between 1995 and 2019. Using network analyses at a 1.5% genetic distance threshold (d), we found a large 154-member outbreak cluster composed mostly of persons who inject drugs (PWID), who were predominantly men. At d = 0.5%, which was used to identify more recent transmission, the large cluster dissociated into three clusters of 18, 12, and 7 members, respectively, five dyads, and 107 singletons. Phylogenetic analysis of the Bulgarian sequences with publicly available global sequences showed that CRF01_AE likely originated from multiple Asian countries, with Vietnam as the likely source of the outbreak cluster between 1988 and 1990. Our findings indicate that CRF01_AE has been introduced into Bulgaria multiple times since 1988, and infections then rapidly spread among PWID locally with bridging to other risk groups and countries. CRF01_AE continues to spread in Bulgaria as evidenced by the more recent large clusters identified at d = 0.5%, highlighting the importance of public health prevention efforts in the PWID communities.
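The effect of the distance threshold d on cluster structure can be illustrated with a minimal sketch. It links two aligned sequences when their distance is at most d and reports the connected components of the resulting network. The raw proportion of mismatches (p-distance) is used here as a simplifying assumption; published analyses often use a substitution-model distance instead, and the abstract does not state which one was applied. All names and the toy sequences are hypothetical.

```python
from itertools import combinations

def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def threshold_clusters(seqs: dict[str, str], d: float) -> list[set[str]]:
    """Link sequences with distance <= d; return connected components with >= 2 members."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in combinations(seqs, 2):
        if p_distance(seqs[u], seqs[v]) <= d:
            parent[find(u)] = find(v)

    groups: dict[str, set[str]] = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return [g for g in groups.values() if len(g) > 1]

# Toy alignment of 200 positions: p2 differs from p1 at 1 site (0.5%),
# p3 differs from p2 at 2 sites (1.0%), p4 differs from all others at 20 sites (10%).
base = "A" * 200
seqs = {
    "p1": base,
    "p2": "C" + base[1:],
    "p3": "CGG" + base[3:],
    "p4": "T" * 20 + base[20:],
}
print(threshold_clusters(seqs, 0.015))
print(threshold_clusters(seqs, 0.005))
```

In the toy data, the three-member cluster found at d = 1.5% shrinks to a single dyad at d = 0.5%, mirroring how a stricter threshold isolates more recent transmission.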