ABSTRACT
Fewer than half of individuals with a suspected Mendelian or monogenic condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control datasets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project ONT Sequencing Consortium aims to generate LRS data from at least 800 of the 1000 Genomes Project samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37x and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.
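The read N50 quoted above (54 kbp) is a length-weighted summary: it is the read length at which reads of that length or longer contain at least half of all sequenced bases. A minimal sketch of that calculation follows. The function name `n50` and the toy read lengths are illustrative assumptions, not consortium code or data.

```python
def n50(read_lengths):
    """Return the N50 of a collection of read lengths (in bases).

    Reads are sorted from longest to shortest; the N50 is the length of
    the read at which the running total first reaches half of all bases.
    """
    lengths = sorted(read_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one read length")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Toy read lengths in bp (made up for illustration, not real data).
toy_reads = [2_000, 3_000, 4_000, 5_000, 6_000]
toy_n50 = n50(toy_reads)
print(toy_n50)  # 5000: the 6,000 bp and 5,000 bp reads together cover 11,000 of 20,000 bases
```

Because the statistic is weighted by bases rather than by read count, a handful of very long reads can raise the N50 well above the median read length.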
ABSTRACT
BACKGROUND: Different formulae have been developed globally to estimate gestational age (GA) by ultrasonography in the first trimester of pregnancy. In this study, we develop an Indian population-specific dating formula and compare its performance with published formulae. Finally, we evaluate the implications of the choice of dating method on the preterm birth (PTB) rate. The data came from GARBH-Ini, an ongoing pregnancy cohort of North Indian women established to study PTB. METHODS: The ultrasonography-based Hadlock (USG-Hadlock) and last menstrual period (LMP) based dating methods were compared by studying the distribution of their differences using Bland-Altman analysis. Using data-driven approaches, we removed data outliers more efficiently than by applying clinical parameters. We applied advanced machine learning algorithms to identify relevant features for GA estimation and developed an Indian population-specific formula (Garbhini-GA1) for the first trimester. PTB rates of Garbhini-GA1 and other formulae were compared by estimating sensitivity and accuracy. RESULTS: The performance of the Garbhini-GA1 formula, a non-linear function of crown-rump length (CRL), was equivalent to that of published formulae for estimation of first-trimester GA (limits of agreement [LoA], -0.46 to 0.96 weeks). CRL was the most crucial parameter in estimating GA, and no other clinical or socioeconomic covariates contributed to GA estimation. The estimated PTB rate across all the formulae, including LMP, ranged from 11.27% to 16.50%, with Garbhini-GA1 estimating the lowest rate with the highest sensitivity and accuracy. The LMP-based method overestimated GA by 3 days compared to the USG-Hadlock formula; at an individual level, these methods had less than 50% agreement in the classification of PTB. CONCLUSIONS: An accurate estimation of GA is crucial for the management of PTB. Garbhini-GA1, the first such formula developed in an Indian setting, estimates PTB rates with higher accuracy, especially when compared to the commonly used Hadlock formula. Our results reinforce the need to develop population-specific gestational age formulae.
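The Bland-Altman comparison mentioned in the methods can be sketched as follows. This is a minimal illustration that assumes the conventional definition of the limits of agreement (mean difference ± 1.96 standard deviations of the paired differences); the abstract does not state the exact computation used. The function name and the toy GA values are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return (bias, lower LoA, upper LoA) for paired measurements.

    Assumes the conventional 95% limits of agreement:
    mean difference +/- 1.96 * sample SD of the differences.
    """
    if len(method_a) != len(method_b):
        raise ValueError("both methods need one value per participant")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Toy GA estimates in weeks for four pregnancies (made up, not cohort data).
ga_lmp = [10.0, 11.5, 9.0, 12.0]   # LMP-based dating
ga_usg = [9.5, 11.0, 9.0, 11.5]    # ultrasound-based dating
bias, loa_low, loa_high = bland_altman(ga_lmp, ga_usg)
print(f"bias={bias:.3f} weeks, LoA=({loa_low:.3f}, {loa_high:.3f}) weeks")
```

A positive bias here would mean the first method dates pregnancies later than the second on average; the width of the limits indicates how far the two methods can disagree for an individual pregnancy.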
Subjects
Crown-Rump Length, Gestational Age, First Trimester of Pregnancy, Premature Birth/classification, Prenatal Ultrasonography/methods, Adult, Female, Humans, India, Newborn Infant, Pregnancy, Prospective Studies, Young Adult
ABSTRACT
BACKGROUND: Obtaining a precise molecular diagnosis through clinical genetic testing provides information about disease prognosis or progression, allows accurate counseling about recurrence risk, and empowers individuals to benefit from precision therapies or take part in N-of-1 trials. Unfortunately, more than half of individuals with a suspected Mendelian condition remain undiagnosed after a comprehensive clinical evaluation, and the results of any individual clinical genetic test ordered during a typical evaluation may take weeks or months to return. Furthermore, commonly used technologies, such as short-read sequencing, are limited in the types of disease-causing variation they can identify. New technologies, such as long-read sequencing (LRS), are poised to solve these problems. CONTENT: Recent technical advances have improved accuracy, increased throughput, and decreased the costs of commercially available LRS technologies. This has resolved many historical concerns about the use of LRS in the clinical environment and opened the door to widespread clinical adoption of LRS. Here, we review LRS technology, how it has been used in the research setting to clarify complex variants or identify disease-causing variation missed by prior clinical testing, and how it may be used clinically in the near future. SUMMARY: LRS is unique in that, as a single data source, it has the potential to replace nearly every other clinical genetic test offered today. When analyzed in a stepwise fashion, LRS will simplify laboratory processes, reduce barriers to comprehensive genetic testing, increase the rate of genetic diagnoses, and shorten the amount of time required to make a molecular diagnosis.
Subjects
Genetic Testing, High-Throughput Nucleotide Sequencing, Humans, DNA Sequence Analysis/methods, RNA Sequence Analysis, High-Throughput Nucleotide Sequencing/methods, Prognosis
ABSTRACT
Background: A large proportion of pregnant women in low- and middle-income countries (LMICs) seek their first antenatal care after 14 weeks of gestation. While the last menstrual period (LMP) is still the most prevalent method of determining gestational age (GA), ultrasound-based foetal biometry is considered more accurate in the second and third trimesters. The Hadlock formula, originally developed using data from a small Caucasian population, is pre-programmed in ultrasound machines and widely used worldwide, including in LMIC settings, for estimating GA and foetal weight. This approach can lead to inaccuracies when estimating GA in a diverse population. Therefore, this study aimed to develop a population-specific model for estimating GA in the late trimesters that was as accurate as GA estimation in the first trimester, using data from GARBH-Ini, a pregnancy cohort in a North Indian district hospital, and subsequently to validate the model in an independent cohort in South India. Methods: Data obtained by longitudinal ultrasonography across all trimesters of pregnancy were used to develop and validate GA models for the second and third trimesters. First-trimester GA determined by ultrasonography served as the gold standard. Garbhini-GA2, a polynomial regression model developed using a genetic algorithm-based method, performed best among the models considered. This model incorporated three of the five ultrasonographic parameters routinely measured during the second and third trimesters. To assess its performance, the Garbhini-GA2 model was compared against the Hadlock and INTERGROWTH-21st models using both the TEST set (N = 1493) from the GARBH-Ini cohort and an independent VALIDATION dataset (N = 948) from the Christian Medical College (CMC), Vellore cohort. 
Evaluation metrics, including root-mean-squared error, bias, and preterm birth (PTB) rates, were used to assess the model's accuracy and reliability. Findings: With first-trimester GA dating as the baseline, Garbhini-GA2 reduced the median error in GA estimation more than threefold compared to the Hadlock formula. Further, the PTB rate estimated using Garbhini-GA2 was more accurate than the rates estimated using the INTERGROWTH-21st and Hadlock formulae, which overestimated the rate by 22.47% and 58.91%, respectively. Interpretation: Garbhini-GA2 is the first late-trimester GA estimation model to be developed and validated using Indian population data. Its higher accuracy in GA estimation, comparable to that of first-trimester dating, and in PTB classification underscores the significance of deploying population-specific GA formulae to enhance antenatal care. Funding: The GARBH-Ini cohort study was funded by the Department of Biotechnology, Government of India (BT/PR9983/MED/97/194/2013). The ultrasound repository was partly supported by the Grand Challenges India-All Children Thriving Program, Biotechnology Industry Research Assistance Council, Department of Biotechnology, Government of India (BIRAC/GCI/0115/03/14-ACT). The research reported in this publication was made possible by a grant (BT/kiData0394/06/18) from the Grand Challenges India at Biotechnology Industry Research Assistance Council (BIRAC), an operating division jointly supported by DBT-BMGF-BIRAC. The external validation study at CMC Vellore was partly supported by a grant (BT/kiData0394/06/18) from the Grand Challenges India at Biotechnology Industry Research Assistance Council (BIRAC), an operating division jointly supported by DBT-BMGF-BIRAC and by Exploratory Research Grant (SB/20-21/0602/BT/RBCX/008481) from Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras. 
An alum endowment from Prakash Arunachalam (BIO/18-19/304/ALUM/KARH) partly funded this study at the Centre for Integrative Biology and Systems Medicine, IIT Madras.
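The evaluation metrics named in this abstract (root-mean-squared error, bias, and PTB rate) can be sketched as below. This is an illustrative sketch only: it assumes the standard definition of preterm birth as delivery before 37 completed weeks of gestation, and the function names and toy values are hypothetical rather than taken from the study.

```python
import math

# Assumption: preterm birth defined as delivery before 37 completed weeks.
PRETERM_THRESHOLD_WEEKS = 37.0


def rmse(estimated, reference):
    """Root-mean-squared error between estimated and reference GA (weeks)."""
    squared = [(e - r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared) / len(squared))


def bias(estimated, reference):
    """Mean signed error; positive values mean GA is overestimated."""
    errors = [e - r for e, r in zip(estimated, reference)]
    return sum(errors) / len(errors)


def ptb_rate(ga_at_birth_weeks):
    """Percentage of births classified as preterm under the threshold."""
    preterm = sum(1 for ga in ga_at_birth_weeks if ga < PRETERM_THRESHOLD_WEEKS)
    return 100.0 * preterm / len(ga_at_birth_weeks)


# Toy values in weeks (made up for illustration, not cohort data).
reference_ga = [20.0, 24.0, 30.0, 32.0]   # e.g. first-trimester dating carried forward
estimated_ga = [21.0, 24.0, 29.0, 34.0]   # e.g. a late-trimester formula
toy_rmse = rmse(estimated_ga, reference_ga)
toy_bias = bias(estimated_ga, reference_ga)
toy_ptb = ptb_rate([36.0, 38.5, 40.0, 36.9])
print(f"RMSE={toy_rmse:.3f} weeks, bias={toy_bias:.2f} weeks, PTB rate={toy_ptb:.1f}%")
```

Because PTB is a threshold classification, even a small systematic bias in GA estimation can move births across the 37-week cut-off, which is why the dating formula affects the estimated PTB rate.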