1.
Int J Mol Sci ; 25(9)2024 Apr 26.
Article in English | MEDLINE | ID: mdl-38731932

ABSTRACT

The serious drawback underlying the biological annotation of whole-genome sequence data is the p >> n problem, which means that the number of polymorphic variants (p) is much larger than the number of available phenotypic records (n). We propose a way to circumvent the problem by combining a LASSO logistic regression with deep learning to classify cows as susceptible or resistant to mastitis, based on single nucleotide polymorphism (SNP) genotypes. Among several architectures, the one with 204,642 SNPs was selected as the best. This architecture was composed of two layers with, respectively, 7 and 46 units per layer implementing respective drop-out rates of 0.210 and 0.358. The classification of the test data resulted in AUC = 0.750, accuracy = 0.650, sensitivity = 0.600, and specificity = 0.700. Significant SNPs were selected based on the SHapley Additive exPlanation (SHAP). As a final result, one GO term related to the biological process and thirteen GO terms related to molecular function were significantly enriched in the gene set that corresponded to the significant SNPs. Our findings revealed that the optimal approach can correctly predict susceptibility or resistance status for approximately 65% of cows. Genes marked by the most significant SNPs are related to the immune response and protein synthesis.


Subject(s)
Deep Learning , Mastitis, Bovine , Polymorphism, Single Nucleotide , Whole Genome Sequencing , Cattle , Mastitis, Bovine/genetics , Animals , Female , Whole Genome Sequencing/methods , Genetic Predisposition to Disease , Genotype
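
The test-set metrics quoted in the abstract above can be related to one another with a small worked example. This is only a sketch: the abstract does not report the size of the test set, so the confusion-matrix counts below are hypothetical, and treating "susceptible" as the positive class is an assumption. With equal numbers of susceptible and resistant cows, accuracy is the mean of sensitivity and specificity, which is consistent with the reported 0.650.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # positives (assumed: susceptible) correctly classified
    specificity = tn / (tn + fp)  # negatives (assumed: resistant) correctly classified
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


# Hypothetical counts: 20 susceptible and 20 resistant cows (not from the study).
acc, sens, spec = classification_metrics(tp=12, fn=8, tn=14, fp=6)
print(f"accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

With an unbalanced test set the same function still applies, but accuracy would then be a prevalence-weighted average of the two rates rather than their simple mean.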
2.
NAR Genom Bioinform ; 6(2): lqae040, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38686136

ABSTRACT

This study compared computational approaches to parallelization of an SNP calling workflow. The data comprised DNA from five Holstein-Friesian cows sequenced with the Illumina platform. The pipeline consisted of quality control, alignment to the reference genome, post-alignment, and SNP calling. Three approaches to parallelization were compared: (i) a plain Bash script in which a pipeline for each cow was executed as separate processes invoked at the same time, (ii) a Bash script wrapped in a single Nextflow process and (iii) a Nextflow script with each component of the pipeline defined as a separate process. The results demonstrated that on average, the multi-process Nextflow script performed 15-27% faster depending on the number of assigned threads, with the biggest execution time advantage over the plain Bash approach observed with 10 threads. In terms of RAM usage, the most substantial variation was observed for the multi-process Nextflow, for which it increased with the number of assigned threads, while RAM consumption of the other setups did not depend much on the number of threads assigned for computations. Due to intermediate and log files generated, disk usage was markedly higher for the multi-process Nextflow than for the plain Bash and for the single-process Nextflow.
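
The first approach described above, one pipeline per cow with all pipelines started at the same time, can be sketched in Python as follows. This is an illustration only, not the authors' Bash or Nextflow code: the stage names are taken from the abstract, `run_stage` is a hypothetical placeholder, and threads stand in for the separate processes (in practice each thread would block on an external tool).

```python
from concurrent.futures import ThreadPoolExecutor

# Stage names follow the abstract; the stage body is a placeholder (assumption).
STAGES = ["quality_control", "alignment", "post_alignment", "snp_calling"]


def run_stage(sample, stage):
    # A real workflow would call an external tool here, e.g. via subprocess.run.
    return f"{sample}:{stage}"


def run_pipeline(sample):
    # Stages for one sample run in order, since each depends on the previous one.
    return [run_stage(sample, stage) for stage in STAGES]


def run_all(samples):
    # One pipeline per sample, all started at the same time.
    with ThreadPoolExecutor(max_workers=len(samples)) as pool:
        return dict(zip(samples, pool.map(run_pipeline, samples)))


samples = ["cow1", "cow2", "cow3", "cow4", "cow5"]  # hypothetical sample names
results = run_all(samples)
print(results["cow1"])
```

Splitting each stage into its own task, as in the multi-process Nextflow variant, would let a scheduler start one sample's alignment while another sample is still in quality control, at the cost of more intermediate files.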

3.
J Appl Genet ; 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38539022

ABSTRACT

Recently, numerous studies of long non-coding RNAs (lncRNAs) have been carried out in various tissues, but their variability is still not fully understood. In this study, we characterised the inter-individual variability of lncRNAs in pigs in terms of number, length and expression. Transcriptomes collected from muscle tissue of six Polish Landrace boars (PL1-PL6), including half-brothers (PL1-PL3), were investigated using bioinformatics (lncRNA identification and functional analysis) and statistical (lncRNA variability) methods. The number of lncRNAs ranged from 1289 to 3500 per animal, and the total number of lncRNAs common to all boars was 232. The number, length and expression of lncRNAs varied significantly between individuals, and no consistent pattern was found between pairs of half-brothers. In detail, PL5 exhibited lower expression than the others, while PL4 had significantly higher expression than PL2-PL3 and PL5-PL6. Notably, the inter-individual variability of lncRNA and mRNA expression exhibited concordant patterns. The enrichment analysis for common lncRNA target genes determined a variety of biological processes that play fundamental roles in cell biology, and they were mostly related to whole-body homeostasis maintenance, energy and protein synthesis as well as dynamics of multiple nucleoprotein complexes. This study revealed high variability of the lncRNA landscape in the porcine genome. Inter-individual differences were found in three aspects: the number, length and expression of lncRNAs, which contributes to a better understanding of their complex nature.

4.
Cancers (Basel) ; 15(3)2023 Jan 27.
Article in English | MEDLINE | ID: mdl-36765737

ABSTRACT

The number of cases of pancreatic cancer in Poland in 2019 was 3852 (approx. 2% of all cancers). The course of the disease is very fast, and the average survival time from diagnosis is 6 months. Only <2% of patients live for 5 years from diagnosis, 8% live for 2 years, and almost half live for only about 3 months. A family predisposition to pancreatic cancer occurs in about 10% of cases. Several oncogenes in which somatic changes lead to the development of tumours, including genes BRCA1/2 and PALB2, TP53, CDKN2A, SMAD4, MLL3, TGFBR2, ARID1A and SF3B1, are involved in pancreatic cancer. Between 4% and 10% of individuals with pancreatic cancer will have a mutation in one of these genes. Six percent of patients with pancreatic cancer have an NTRK pathogenic fusion. The pathogenesis of pancreatic cancer can in many cases be characterised by homologous recombination deficiency (HRD), the inability of the cell to effectively repair DNA. It is estimated that from 24% to as many as 44% of pancreatic cancers show HRD. The most common causes of HRD are inactivating mutations in the genes regulating this DNA repair system, mainly BRCA1 and BRCA2, but also PALB2, RAD51C and several dozen others.

5.
Funct Integr Genomics ; 23(1): 19, 2022 Dec 23.
Article in English | MEDLINE | ID: mdl-36564645

ABSTRACT

Since copy number variants (CNVs) have been recognized as an important source of genetic and transcriptomic variation, we aimed to characterize the impact of CNVs located within coding, intergenic, upstream, and downstream gene regions on the expression of transcripts. Deletions occurred most often in introns, while duplications occurred most often in coding regions. The transcript expression was lower for deleted coding (P = 0.008) and intronic regions (P = 1.355 × 10^-10), but it was not changed in the case of upstream and downstream gene regions (P = 0.085). Moreover, the expression was decreased if duplication occurred in the coding region (P = 8.318 × 10^-5). Furthermore, a negative correlation (r = -0.27) between transcript length and its expression was observed. The correlation between the percent of deleted/duplicated transcript and transcript expression level was not significant for all concerned genomic regions in five out of six animals. The exceptions were deletions in coding regions (P = 0.004) and duplications in introns (P = 0.01) in one individual. CNVs in coding (deletions, duplications) and intronic (deletions) regions are important modulators of transcripts by reducing their expression level. We hypothesize that deletions imply severe consequences by interrupting genes. The negative correlation between the size of the transcript and its expression level found in this study is consistent with the hypothesis that selection favours shorter introns and a moderate number of exons in highly expressed genes. This may explain the transcript expression reduction by duplications. We did not find a correlation between the size of deletions/duplications and transcript expression level, suggesting that expression is modulated by CNVs regardless of their size.


Subject(s)
DNA Copy Number Variations , Genome , Animals , Genomics , Introns , Exons
6.
Sci Rep ; 12(1): 7671, 2022 05 10.
Article in English | MEDLINE | ID: mdl-35538164

ABSTRACT

Since global temperature is expected to rise by 2 °C in 2050, heat stress may become the most severe environmental factor. In the study, we illustrate the application of mixed linear models for the analysis of whole transcriptome expression in livers and adrenal tissues of Sprague-Dawley rats obtained by a heat stress experiment. By applying those models, we considered four sources of variation in transcript expression, comprising transcripts (1), genes (2), Gene Ontology terms (3), and Reactome pathways (4) and focussed on accounting for the similarity within each source, which was expressed as a covariance matrix. Models based on transcript or gene levels explained a larger proportion of log2 fold change than models fitting the functional components of Gene Ontology terms or Reactome pathways. In the liver, among the most significant genes were PNKD and TRIP12. In the adrenal tissue, one transcript of the SUCO gene was expressed more strongly in the control group than in the heat-stress group. PLEC had two transcripts, which were significantly overexpressed in the heat-stress group. PER3 was significant only at the gene level. Moving to the functional scale, five Gene Ontologies and one Reactome pathway were significant in the liver. They can be grouped into ontologies related to DNA repair, histone ubiquitination, the regulation of embryonic development and cytoplasmic translation. Linear mixed models are valuable tools for the analysis of high-throughput biological data. Their main advantages are the possibility to incorporate information on covariance between observations and circumventing the problem of multiple testing.


Subject(s)
Gene Expression Profiling , Heat Stress Disorders , Animals , Biodiversity , Heat-Shock Response/genetics , Linear Models , Rats , Rats, Sprague-Dawley , Temperature , Transcriptome
7.
J Appl Genet ; 63(3): 527-533, 2022 Sep.
Article in English | MEDLINE | ID: mdl-35590085

ABSTRACT

Copy number variants (CNVs) may cover up to 12% of the whole genome and have substantial impact on phenotypes. We used 5867 duplications and 33,181 deletions available from the 1000 Genomes Project to characterise genomic regions vulnerable to CNV formation and to identify sequence features characteristic for those regions. The GC content for deletions was lower and for duplications was higher than for randomly selected regions. In regions flanking deletions and downstream of duplications, content was higher than in the random sequences, but upstream of duplication content was lower. In duplications and downstream of deletion regions, the percentage of low-complexity sequences was not different from the randomised data. In deletions and upstream of CNVs, it was higher, while for downstream of duplications, it was lower as compared to random sequences. The majority of CNVs intersected with genic regions - mainly with introns. GC content may be associated with CNV formation and CNVs, especially duplications are initiated in low-complexity regions. Moreover, CNVs located or overlapped with introns indicate their role in shaping intron variability. Genic CNV regions were enriched in many essential biological processes such as cell adhesion, synaptic transmission, transport, cytoskeleton organization, immune response and metabolic mechanisms, which indicates that these large-scaled variants play important biological roles.


Subject(s)
DNA Copy Number Variations , Genome-Wide Association Study , Base Sequence , DNA Copy Number Variations/genetics , Genome , Genomics , Humans
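
The GC content compared between CNV regions and random regions in the abstract above is a simple per-sequence statistic. A minimal sketch follows; ignoring ambiguous bases such as N is an assumption, and the abstract does not say how the authors handled them.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return gc / total


print(gc_content("ATGC"))    # two of the four bases are G or C
print(gc_content("ggNNcc"))  # N is skipped, so only G and C are counted
```

The same function could be applied to a CNV region and to its upstream and downstream flanks to reproduce the kind of comparison the abstract describes.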
8.
J Appl Genet ; 61(4): 617-618, 2020 Dec.
Article in English | MEDLINE | ID: mdl-33044661

ABSTRACT

The original version of this paper contained an error. Figure 5 was published with the same image as Fig. 4.

9.
J Appl Genet ; 61(4): 607-616, 2020 Dec.
Article in English | MEDLINE | ID: mdl-32996082

ABSTRACT

A downside of next-generation sequencing technology is the high technical error rate. We built a tool, which uses array-based genotype information to classify next-generation sequencing-based SNPs into the correct and the incorrect calls. The deep learning algorithms were implemented via Keras. Several algorithms were tested: (i) the basic, naïve algorithm, (ii) the naïve algorithm modified by pre-imposing different weights on incorrect and correct SNP class in calculating the loss metric and (iii)-(v) the naïve algorithm modified by random re-sampling (with replacement) of the incorrect SNPs to match 30%/60%/100% of the number of correct SNPs. The training data set was composed of data from three bulls and consisted of 2,227,995 correct (97.94%) and 46,920 incorrect SNPs, while the validation data set consisted of data from one bull with 749,506 correct (98.05%) and 14,908 incorrect SNPs. The results showed that for a rare event classification problem, like incorrect SNP detection in NGS data, the most parsimonious naïve model and a model with the weighting of SNP classes provided the best results for the classification of the validation data set. Both classified 19% of truly incorrect SNPs as incorrect and 99% of truly correct SNPs as correct and resulted in the F1 score of 0.21 - the highest among the compared algorithms. We conclude the basic models were less adapted to the specificity of a training data set and thus resulted in better classification of the independent, validation data set, than the other tested models.


Subject(s)
Deep Learning , Genotyping Techniques/methods , Polymorphism, Single Nucleotide/genetics , Whole Genome Sequencing/methods , Algorithms , Animals , Cattle
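
Variants (iii)-(v) in the abstract above re-sample the rare class with replacement until it reaches a set fraction of the common class. The sketch below illustrates that idea with the Python standard library only; it is not the authors' Keras code, and it assumes that the re-sampled set replaces the original incorrect-SNP set rather than being added to it.

```python
import random


def oversample_with_replacement(minority, n_majority, ratio, seed=0):
    """Draw round(ratio * n_majority) items from `minority`, with replacement."""
    if not minority:
        raise ValueError("minority class is empty")
    target = round(ratio * n_majority)
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.choices(minority, k=target)


# Toy example: 10 "incorrect" items re-sampled against 100 "correct" items.
incorrect = list(range(10))
for ratio in (0.3, 0.6, 1.0):
    resampled = oversample_with_replacement(incorrect, n_majority=100, ratio=ratio)
    print(ratio, len(resampled))
```

For the training set in the abstract, `n_majority` would be 2,227,995 and the minority list would hold the 46,920 incorrect SNPs.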
10.
PLoS One ; 13(6): e0198419, 2018.
Article in English | MEDLINE | ID: mdl-29856873

ABSTRACT

In Bos taurus the universality of the reference genome is biased towards genetic variation represented by only two related individuals representing the same Hereford breed. Therefore, results of genetic analyses based on this reference may not be reliable. The 1000 Bull Genomes resource allows for identification of breed-specific polymorphisms and for the construction of breed-specific reference genomes. Whole-genome sequences of 936 bulls allowed us to construct seven breed-specific reference genomes of Bos taurus for Angus, Brown Swiss, Fleckvieh, Hereford, Jersey, Limousin and Simmental. In order to identify breed-specific variants, all detected SNPs were filtered within-breed to satisfy criteria of the number of missing genotypes not higher than 7% and the alternative allele frequency equal to unity. The highest number of breed-specific SNPs was identified for Jersey (130,070) and the lowest for the Simmental breed (197). Such breed-specific polymorphisms were annotated to coding regions overlapping with 78 genes in Angus, 140 in Brown Swiss, 132 in Fleckvieh, 100 in Hereford, 643 in Jersey, 10 in Limousin and no genes in Simmental. For most of the breeds, the majority of breed-specific variants from coding regions was synonymous. However, most of Fleckvieh-specific and Hereford-specific polymorphisms were missense mutations. Since the identified variants are characteristic for the analysed breeds, they form the basis of phenotypic differences observed between them, which result from different breeding programmes. Breed-specific reference genomes can enhance the accuracy of SNP driven inferences such as Genome-wide Association Studies or SNP genotype imputation.


Subject(s)
Genome , Polymorphism, Single Nucleotide , Animals , Breeding , Cattle , Gene Frequency , Genetic Variation , Genotype , Male , Whole Genome Sequencing
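
The within-breed filter described in the abstract above (no more than 7% missing genotypes and an alternative allele frequency of one) can be sketched as a small predicate. The genotype encoding used here, alternative-allele counts of 0, 1 or 2 per diploid animal with `None` for a missing call, is an assumption made for illustration and is not taken from the paper.

```python
def passes_breed_filter(genotypes, max_missing=0.07):
    """Check one SNP within one breed.

    genotypes: alternative-allele count per animal (0, 1 or 2), None if missing.
    """
    if not genotypes:
        return False
    called = [g for g in genotypes if g is not None]
    n_missing = len(genotypes) - len(called)
    if not called or n_missing > max_missing * len(genotypes):
        return False
    # Alternative allele frequency among called diploid genotypes.
    alt_freq = sum(called) / (2 * len(called))
    return alt_freq == 1.0


fixed_snp = [2] * 95 + [None] * 5            # 5% missing, all calls homozygous alternative
segregating_snp = [2] * 94 + [1] + [None] * 5
print(passes_breed_filter(fixed_snp), passes_breed_filter(segregating_snp))
```

Only the two criteria stated in the abstract are checked; any further comparison against other breeds is outside this sketch.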
11.
Mamm Genome ; 26(11-12): 658-65, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26475143

ABSTRACT

Despite the growing number of sequenced bovine genomes, the knowledge of the population-wide variation of sequences remains limited. In many studies, statistical methodology was not applied in order to relate findings in the sequenced samples to a population-wide level. Our goal was to assess the population-wide variation in DNA sequence based on whole-genome sequences of 32 Holstein-Friesian cows. The number of SNPs significantly varied across individuals. The number of identified SNPs increased with coverage, following a logarithmic curve. A total of 15,272,427 SNPs were identified, 99.16% of them being bi-allelic. Missense SNPs were classified into three categories based on their genomic location: housekeeping genes, genes undergoing strong selection, and genes neutral to selection. The number of missense SNPs was significantly higher within genes neutral to selection than in the other two categories. The number of variants located within 3'UTR and 5'UTR regions was also significantly different across gene families. Moreover, the number of insertions and deletions differed significantly among cows, varying from 261,712 to 330,103 insertions and from 271,398 to 343,649 deletions. Results not only demonstrate inter-individual variation in the number of SNPs and indels but also show that the number of missense SNPs differs across genes representing different functional backgrounds.


Subject(s)
Mastitis, Bovine/genetics , Polymorphism, Single Nucleotide , Animals , Case-Control Studies , Cattle , DNA Copy Number Variations , Female , Genome , INDEL Mutation , Mutation, Missense , Sequence Analysis, DNA
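
The abstract above states that the number of identified SNPs increased with coverage along a logarithmic curve. One way such a relationship could be fitted is ordinary least squares on the model count = a + b * ln(coverage). This is a sketch under that assumed model form; the data below are synthetic and are not values from the study.

```python
import math


def fit_log_curve(coverage, snp_counts):
    """Least-squares fit of snp_counts = a + b * ln(coverage); returns (a, b)."""
    x = [math.log(c) for c in coverage]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(snp_counts) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, snp_counts))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic data generated from a = 1.0, b = 2.0 (illustrative only).
coverage = [5, 10, 20, 40]
counts = [1.0 + 2.0 * math.log(c) for c in coverage]
a, b = fit_log_curve(coverage, counts)
print(round(a, 6), round(b, 6))
```

A positive slope b on the log scale means each doubling of coverage adds a roughly constant number of SNPs, which is the diminishing-returns pattern a logarithmic curve implies.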