ABSTRACT
Amyloid fibril formation is associated with various amyloidoses, including neurodegenerative diseases such as Alzheimer's and Parkinson's diseases. Despite the numerous studies on the inhibition of amyloid formation, the prevention and treatment of a majority of amyloid-related disorders are still challenging. In this study, we investigated the effects of various plant extracts on amyloid formation of α-synuclein. We found that the extracts from Eucalyptus gunnii are able to inhibit amyloid formation, and to disaggregate preformed fibrils, in vitro. The extract itself did not lead to cell damage. In the extract, miquelianin, which is a glycosylated form of quercetin and has been detected in the plasma and the brain, was identified and assessed to have a moderate inhibitory activity, compared to the effects of ellagic acid and quercetin, which are strong inhibitors for amyloid formation. The properties of miquelianin provide insights into the mechanisms controlling the assembly of α-synuclein in the brain.
Subject(s)
Amyloid , Eucalyptus , Plant Extracts , Quercetin , alpha-Synuclein , alpha-Synuclein/metabolism , alpha-Synuclein/antagonists & inhibitors , Plant Extracts/pharmacology , Plant Extracts/chemistry , Amyloid/metabolism , Amyloid/antagonists & inhibitors , Eucalyptus/chemistry , Humans , Quercetin/pharmacology , Quercetin/chemistry , Quercetin/analogs & derivatives
ABSTRACT
BACKGROUND: The prediction of potentially pathogenic variant combinations in patients remains a key task in the field of medical genetics for the understanding and detection of oligogenic/multilocus diseases. Models tailored towards such cases can help shorten the gap of missing diagnoses and can aid researchers in dealing with the high complexity of the derived data. The predictor VarCoPP (Variant Combinations Pathogenicity Predictor), which was published in 2019 and identified potentially pathogenic variant combinations in gene pairs (bilocus variant combinations), was the first important step in this direction. Despite its usefulness and applicability, several issues still remained that hindered a better performance, such as its False Positive (FP) rate, the quality of its training set and its complex architecture.
RESULTS: We present VarCoPP2.0: the successor of VarCoPP that is a simplified, faster and more accurate predictive model identifying potentially pathogenic bilocus variant combinations. Results from cross-validation and on independent data sets reveal that VarCoPP2.0 has improved in terms of both sensitivity (95% in cross-validation and 98% during testing) and specificity (5% FP rate). At the same time, its running time shows a significant 150-fold decrease due to the selection of a simpler Balanced Random Forest model. Its positive training set now consists of variant combinations that are more confidently linked with evidence of pathogenicity, based on the confidence scores present in OLIDA, the Oligogenic Diseases Database ( https://olida.ibsquare.be ). The improvement of its performance is also attributed to a more careful selection of up-to-date features identified via an original wrapper method. We show that the combination of different variant and gene pair features together is important for predictions, highlighting the usefulness of integrating biological information at different levels.
CONCLUSIONS: Through its improved performance and faster execution time, VarCoPP2.0 enables a more accurate analysis of larger data sets linked to oligogenic diseases. Users can access the ORVAL platform ( https://orval.ibsquare.be ) to apply VarCoPP2.0 on their data.
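The abstract above attributes the speed-up to a Balanced Random Forest. As a minimal sketch of the general idea behind such models (not the VarCoPP2.0 implementation), each tree can be trained on a bootstrap sample in which every class is drawn with replacement up to the size of the smallest class. The function name `balanced_bootstrap`, the toy labels and the class sizes below are assumptions made for illustration only.

```python
import random
from collections import Counter


def balanced_bootstrap(labels, rng):
    """Return indices for one tree's training sample with equal class sizes."""
    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # The smallest class fixes how many samples are drawn from every class.
    n_minority = min(len(idxs) for idxs in by_class.values())

    sample = []
    for idxs in by_class.values():
        # Draw with replacement, the same number from each class.
        sample.extend(rng.choices(idxs, k=n_minority))
    return sample


# Hypothetical toy labels: 1 = pathogenic combination, 0 = neutral combination.
labels = [1] * 20 + [0] * 200
rng = random.Random(0)
sample = balanced_bootstrap(labels, rng)
counts = Counter(labels[i] for i in sample)
print(counts)
```

In a full forest this sampling step would be repeated once per tree, so that no single tree is dominated by the majority class.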
ABSTRACT
Notwithstanding important advances in the context of single-variant pathogenicity identification, novel breakthroughs in discerning the origins of many rare diseases require methods able to identify more complex genetic models. We present here the Variant Combinations Pathogenicity Predictor (VarCoPP), a machine-learning approach that identifies pathogenic variant combinations in gene pairs (called digenic or bilocus variant combinations). We show that the results produced by this method are highly accurate and precise, an efficacy that is endorsed when validating the method on recently published independent disease-causing data. Confidence labels of 95% and 99% are identified, representing the probability of a bilocus combination being a true pathogenic result, providing geneticists with rational markers to evaluate the most relevant pathogenic combinations and limit the search space and time. Finally, the VarCoPP has been designed to act as an interpretable method that can provide explanations on why a bilocus combination is predicted as pathogenic and which biological information is important for that prediction. This work provides an important step toward the genetic understanding of rare diseases, paving the way to clinical knowledge and improved patient care.
Subject(s)
Genetic Predisposition to Disease/genetics , Genetic Variation/genetics , Rare Diseases/genetics , Genetic Markers/genetics , Humans
ABSTRACT
DNA methylation controls gene expression, and once established, DNA methylation patterns are faithfully copied during DNA replication by the maintenance DNA methyltransferase Dnmt1. In vivo, Dnmt1 interacts with Uhrf1, which recognizes hemimethylated CpGs. Recently, we reported that Uhrf1-catalyzed K18- and K23-ubiquitinated histone H3 binds to the N-terminal region (the replication focus targeting sequence, RFTS) of Dnmt1 to stimulate its methyltransferase activity. However, it is not yet fully understood how ubiquitinated histone H3 stimulates Dnmt1 activity. Here, we show that monoubiquitinated histone H3 stimulates Dnmt1 activity toward DNA with multiple hemimethylated CpGs but not toward DNA with only a single hemimethylated CpG, suggesting an influence of ubiquitination on the processivity of Dnmt1. The Dnmt1 activity stimulated by monoubiquitinated histone H3 was additively enhanced by the Uhrf1 SRA domain, which also binds to RFTS. Thus, Dnmt1 activity is regulated by catalysis (ubiquitination)-dependent and -independent functions of Uhrf1.
Subject(s)
DNA (Cytosine-5-)-Methyltransferase 1/genetics , DNA (Cytosine-5-)-Methyltransferase 1/metabolism , Histones/metabolism , CCAAT-Enhancer-Binding Proteins/genetics , DNA/metabolism , DNA (Cytosine-5-)-Methyltransferases/genetics , DNA (Cytosine-5-)-Methyltransferases/metabolism , DNA Methylation , DNA Replication , Histones/physiology , Humans , Protein Binding , Ubiquitin/metabolism , Ubiquitin-Protein Ligases/metabolism , Ubiquitination
ABSTRACT
A tremendous amount of DNA sequencing data is being produced around the world with the ambition to capture in more detail the mechanisms underlying human diseases. While numerous bioinformatics tools exist that allow the discovery of causal variants in Mendelian diseases, little to no support is provided to do the same for variant combinations, an essential task for the discovery of the causes of oligogenic diseases. ORVAL (the Oligogenic Resource for Variant AnaLysis), which is presented here, provides an answer to this problem by focusing on generating networks of candidate pathogenic variant combinations in gene pairs, as opposed to isolated variants in unique genes. This online platform integrates innovative machine learning methods for combinatorial variant pathogenicity prediction with visualization techniques, offering several interactive and exploratory tools, such as pathogenic gene and protein interaction networks, a ranking of pathogenic gene pairs, as well as visual mappings of the cellular location and pathway information. ORVAL is the first web-based exploration platform dedicated to identifying networks of candidate pathogenic variant combinations with the sole ambition to help in uncovering oligogenic causes for patients that cannot rely on the classical disease analysis tools. ORVAL is available at https://orval.ibsquare.be.
Subject(s)
Genetic Diseases, Inborn/genetics , Genetic Predisposition to Disease , Multifactorial Inheritance/genetics , Software , Computational Biology , Genetic Diseases, Inborn/diagnosis , Humans , Mutation/genetics , Sequence Analysis, DNA
ABSTRACT
While biomedical relation extraction (bioRE) datasets have been instrumental in the development of methods to support biocuration of single variants from texts, no datasets are currently available for the extraction of digenic or even oligogenic variant relations, despite reports in the literature that epistatic effects between combinations of variants in different loci (or genes) are important to understand disease etiologies. This work presents the creation of a unique dataset of oligogenic variant combinations, geared to train tools to help in the curation of scientific literature. To overcome the hurdles associated with the number of unlabelled instances and the cost of expertise, active learning (AL) was used to optimize the annotation, thus getting assistance in finding the most informative subset of samples to label. By pre-annotating 85 full-text articles containing the relevant relations from the Oligogenic Diseases Database (OLIDA) with PubTator, text fragments featuring potential digenic variant combinations, i.e. gene-variant-gene-variant, were extracted. The resulting fragments of texts were annotated with ALAMBIC, an AL-based annotation platform. The resulting dataset, called DUVEL, is used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT and BioM-BERT. More than 500 000 text fragments were considered for annotation, finally resulting in a dataset with 8442 fragments, 794 of them being positive instances, covering 95% of the original annotated articles. When applied to gene-variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, demonstrating significant improvement compared to the non-fine-tuned model, underlining the relevance of the DUVEL dataset. This study shows how AL may play an important role in the creation of bioRE datasets relevant for biomedical curation applications.
DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. It is made freely available for research on GitHub and Hugging Face. Database URL: https://huggingface.co/datasets/cnachteg/duvel or https://doi.org/10.57967/hf/1571.
Subject(s)
Supervised Machine Learning , Humans , Data Mining/methods , Data Curation/methods , Databases, Genetic
ABSTRACT
Automatic biomedical relation extraction (bioRE) is an essential task in biomedical research in order to generate high-quality labelled data that can be used for the development of innovative predictive methods. However, building such fully labelled, high-quality bioRE data sets of adequate size for the training of state-of-the-art relation extraction models is hindered by an annotation bottleneck due to limitations on time and expertise of researchers and curators. We show here how Active Learning (AL) plays an important role in resolving this issue and improves bioRE tasks, effectively overcoming the labelling limits inherent to a data set. Six different AL strategies are benchmarked on seven bioRE data sets, using PubMedBERT as the base model, evaluating their area under the learning curve (AULC) as well as intermediate result measurements. The results demonstrate that uncertainty-based strategies, such as Least-Confident or Margin Sampling, perform statistically better in terms of F1-score, accuracy and precision than other types of AL strategies. However, in terms of recall, a diversity-based strategy, called Core-set, outperforms all strategies. AL strategies are shown to reduce the annotation need (in order to reach performance on par with training on all data) by 6% to 38%, depending on the data set, with the Margin Sampling and Least-Confident Sampling strategies moreover obtaining the best AULCs compared to the Random Sampling baseline. We show through the experiments the importance of using AL methods to reduce the amount of labelling needed to construct high-quality data sets leading to optimal performance of deep learning models. The code and data sets to reproduce all the results presented in the article are available at https://github.com/oligogenic/Deep_active_learning_bioRE.
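The two uncertainty-based strategies named above can be illustrated with a short sketch. It follows the usual textbook definitions (Least-Confident ranks samples by one minus the top class probability, Margin Sampling by the gap between the two most probable classes) and is not taken from the linked repository. The function names and the toy probabilities in `pool_probs` are assumptions for illustration.

```python
def least_confident_score(probs):
    """Higher means more uncertain: 1 minus the top class probability."""
    return 1.0 - max(probs)


def margin_score(probs):
    """Smaller means more uncertain: gap between the two most probable classes."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def select_batch(pool_probs, k, strategy):
    """Return the indices of the k unlabelled samples to annotate next."""
    if strategy == "least_confident":
        # Sort by descending uncertainty.
        key = lambda i: -least_confident_score(pool_probs[i])
    elif strategy == "margin":
        # Sort by ascending margin.
        key = lambda i: margin_score(pool_probs[i])
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(range(len(pool_probs)), key=key)[:k]


# Hypothetical predicted class probabilities for four unlabelled samples.
pool_probs = [
    [0.95, 0.03, 0.02],  # confident prediction
    [0.50, 0.45, 0.05],  # two classes almost tied (small margin)
    [0.40, 0.30, 0.30],  # low top probability
    [0.70, 0.20, 0.10],
]

print(select_batch(pool_probs, 2, "least_confident"))  # [2, 1]
print(select_batch(pool_probs, 2, "margin"))           # [1, 2]
```

With more than two classes the two strategies can rank the pool differently, as in this toy example; for binary classification they produce the same ordering.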
Subject(s)
Biomedical Research , Data Accuracy , Area Under Curve
ABSTRACT
Although standards and guidelines for the interpretation of variants identified in genes that cause Mendelian disorders have been developed, this is not the case for more complex genetic models including variant combinations in multiple genes. During a large curation process conducted on 318 research articles presenting oligogenic variant combinations, we encountered several recurring issues concerning their proper reporting and pathogenicity assessment. These mainly concern the absence of strong evidence that refutes a monogenic model and the lack of a proper genetic and functional assessment of the joint effect of the involved variants. With the increasing accumulation of such cases, it has become essential to develop standards and guidelines on how these oligogenic/multilocus variant combinations should be interpreted, validated, and reported in order to provide high-quality data and supporting evidence to the scientific community.
Subject(s)
Software , Virulence
ABSTRACT
Improving the understanding of the oligogenic nature of diseases requires access to high-quality, well-curated Findable, Accessible, Interoperable, Reusable (FAIR) data. Although first steps were taken with the development of the Digenic Diseases Database, leading to novel computational advancements to assist the field, these were also linked with a number of limitations, for instance, the ad hoc curation protocol and the inclusion of only digenic cases. The OLIgogenic diseases DAtabase (OLIDA) presents a novel, transparent and rigorous curation protocol, introducing a confidence scoring mechanism for the published oligogenic literature. The application of this protocol on the oligogenic literature generated a new repository containing 916 oligogenic variant combinations linked to 159 distinct diseases. Information extracted from the scientific literature is supplemented with current knowledge support obtained from public databases. Each entry is an oligogenic combination linked to a disease, labelled with a confidence score based on the level of genetic and functional evidence that supports its involvement in this disease. These scores allow users to assess the relevance and proof of pathogenicity of each oligogenic combination in the database, constituting markers for reporting improvements on disease-causing oligogenic variant combinations. OLIDA follows the FAIR principles, providing detailed documentation, easy data access through its application programming interface and website, use of unique identifiers and links to existing ontologies. DATABASE URL: https://olida.ibsquare.be.
Subject(s)
Software , Vocabulary, Controlled , Databases, Factual
ABSTRACT
In order to gain insight into oligogenic disorders, understanding those involving bi-locus variant combinations appears to be key. In prior work, we showed that features at multiple biological scales can already be used to discriminate among two types, i.e. disorders involving true digenic and modifier combinations. The current study expands this machine learning work towards dual molecular diagnosis cases, providing a classifier able to effectively distinguish between these three types. To reach this goal and gain an in-depth understanding of the decision process, game theory and tree decomposition techniques are applied to random forest predictors to investigate the relevance of feature combinations in the prediction. A machine learning model with high discrimination capabilities was developed, effectively differentiating the three classes in a biologically meaningful manner. Combining prediction interpretation and statistical analysis, we propose a biologically meaningful characterization of each class relying on specific feature strengths. Figuring out how biological characteristics shift samples towards one of three classes provides clinically relevant insight into the underlying biological processes as well as the disease itself.
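The abstract does not say which game-theoretic technique was applied. A common choice for interpreting predictors is the Shapley value, and the sketch below assumes that reading. It computes exact Shapley values for a toy value function by averaging each feature's marginal contribution over all feature orderings. The feature names and `toy_value` are hypothetical and do not come from the study.

```python
from itertools import permutations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    phi = {f: 0.0 for f in features}
    for order in permutations(features):
        coalition = set()
        for f in order:
            before = value_fn(frozenset(coalition))
            coalition.add(f)
            after = value_fn(frozenset(coalition))
            phi[f] += after - before
    n_orders = factorial(len(features))
    return {f: total / n_orders for f, total in phi.items()}


def toy_value(coalition):
    """Hypothetical model output when only the features in `coalition` are known."""
    score = 0.0
    if "gene_A_score" in coalition:
        score += 0.2
    if "variant_B_score" in coalition:
        score += 0.1
    # Interaction term: the two features together contribute more than alone.
    if "gene_A_score" in coalition and "variant_B_score" in coalition:
        score += 0.4
    return score


features = ["gene_A_score", "variant_B_score", "pathway_overlap"]
phi = shapley_values(toy_value, features)
print(phi)
```

The interaction term is split equally between the two interacting features, and the values sum to the difference between the full-coalition output and the empty-coalition output. Exact enumeration is only feasible for a handful of features; practical tools rely on approximations or tree-specific algorithms.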
Subject(s)
Game Theory , Genetic Predisposition to Disease/genetics , Machine Learning , Multifactorial Inheritance/genetics , Decision Trees , Humans
ABSTRACT
Facioscapulohumeral muscular dystrophy (FSHD) is associated with an activation of the double homeobox 4 (DUX4) gene, which we previously identified within the D4Z4 repeated elements in the 4q35 subtelomeric region. The pathological DUX4 mRNA is derived from the most distal D4Z4 unit and extends unexpectedly within the flanking pLAM region, which provides an intron and polyadenylation signal. The conditions that are required to develop FSHD are a permissive allele providing the polyadenylation signal and hypomethylation of the D4Z4 repeat array compared with the healthy muscle. The DUX4 protein is a 52-kDa transcription factor that initiates a large gene deregulation cascade leading to muscle atrophy, inflammation, differentiation defects, and oxidative stress, which are the key features of FSHD. DUX4 is a retrogene that is normally expressed in germline cells and is submitted to repeat-induced silencing in adult tissues. Since DUX4 mRNAs have been detected in human embryonic and induced pluripotent stem cells, we investigated whether they could also be expressed in human mesenchymal stromal cells (hMSCs). We found that DUX4 mRNAs were induced during the differentiation of hMSCs into osteoblasts and that this process involved DUX4 and new longer protein forms (58 and 70 kDa). A DUX4 mRNA with a more distant 5' start site was characterized that presented a 60-codon reading frame extension and encoded the 58-kDa protein. Transfections of hMSCs with an antisense oligonucleotide targeting DUX4 mRNAs decreased both the 52- and 58-kDa protein levels and confirmed their identity. Gain- and loss-of-function experiments in hMSCs suggested these DUX4 proteins had opposite roles in osteogenic differentiation as evidenced by the alkaline phosphatase activity and calcium deposition. Differentiation was delayed by the 58-kDa DUX4 expression and it was increased by 52-kDa DUX4. These data indicate a role for DUX4 protein forms in the osteogenic differentiation of hMSCs.