Results 1 - 9 of 9
1.
Bioinformatics ; 40(4)2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38603603

ABSTRACT

MOTIVATION: Genome sequencing technologies reveal a huge amount of genomic sequences. Neural network-based methods can be prime candidates for retrieving insights from these sequences because of their applicability to large and diverse datasets. However, the highly variable lengths of genome sequences severely impair the presentation of sequences as input to the neural network. Genetic variations further complicate tasks that involve sequence comparison or alignment. RESULTS: Inspired by the theory and applications of "spaced seeds," we propose a graph representation of genome sequences called "gapped pattern graph." These graphs can be transformed through a Graph Convolutional Network to form lower-dimensional embeddings for downstream tasks. On the basis of the gapped pattern graphs, we implemented a neural network model and demonstrated its performance on diverse tasks involving microbe and mammalian genome data. Our method consistently outperformed all the other state-of-the-art methods across various metrics on all tasks, especially for the sequences with limited homology to the training data. In addition, our model was able to identify distinct gapped pattern signatures from the sequences. AVAILABILITY AND IMPLEMENTATION: The framework is available at https://github.com/deepomicslab/GCNFrame.
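The abstract above does not say how a gapped pattern graph is built, so the following is only a minimal sketch of the underlying "spaced seed" idea it cites: a binary pattern selects which positions of a sliding window are kept and which are skipped. The function name `spaced_seed_words` and the seed `1101` are hypothetical choices for illustration, not taken from GCNFrame.

```python
from collections import Counter


def spaced_seed_words(sequence: str, seed: str) -> Counter:
    """Count the words obtained by sliding a spaced seed over a sequence.

    `seed` is a string of '1' (position is kept) and '0' (position is skipped).
    """
    if not seed or set(seed) - {"0", "1"}:
        raise ValueError("seed must be a non-empty string of '0' and '1'")
    kept = [i for i, flag in enumerate(seed) if flag == "1"]
    counts = Counter()
    for start in range(len(sequence) - len(seed) + 1):
        window = sequence[start:start + len(seed)]
        counts["".join(window[i] for i in kept)] += 1
    return counts


# Five windows of length 4, each reduced to the 3 kept positions.
words = spaced_seed_words("ACGTACGA", "1101")

# A substitution at the skipped position does not change the extracted word.
same = spaced_seed_words("ACGT", "1101") == spaced_seed_words("ACTT", "1101")
```

Because the skipped position is ignored, two sequences that differ only there yield the same word, which is why gapped patterns tolerate some genetic variation.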

2.
BMC Bioinformatics ; 24(1): 40, 2023 Feb 08.
Article in English | MEDLINE | ID: mdl-36755234

ABSTRACT

BACKGROUND: Distance functions are fundamental for evaluating the differences between gene expression profiles. Such a function should output a low value if the profiles are strongly correlated, either negatively or positively, and a high value otherwise. One popular distance function is the absolute correlation distance, [Formula: see text], where [Formula: see text] is a similarity measure, such as Pearson or Spearman correlation. However, the absolute correlation distance fails to fulfill the triangle inequality, which would have guaranteed better performance in vector quantization, allowed fast data localization, and accelerated data clustering. RESULTS: In this work, we propose [Formula: see text] as an alternative. We prove that [Formula: see text] satisfies the triangle inequality when [Formula: see text] represents Pearson correlation, Spearman correlation, or cosine similarity. We show [Formula: see text] to be better than [Formula: see text], another variant of [Formula: see text] that satisfies the triangle inequality, both analytically and experimentally. We empirically compared [Formula: see text] with [Formula: see text] in gene clustering and sample clustering experiments on real-world biological data. The two distances performed similarly in both gene clustering and sample clustering with hierarchical clustering and PAM (partitioning around medoids) clustering. However, [Formula: see text] demonstrated more robust clustering. According to the bootstrap experiment, [Formula: see text] generated more robust sample-pair partitions more frequently (P-value [Formula: see text]). The statistics on the time a class "dissolved" also support the advantage of [Formula: see text] in robustness. CONCLUSION: [Formula: see text], as a variant of the absolute correlation distance, satisfies the triangle inequality and is capable of more robust clustering.
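The claim that the absolute correlation distance fails the triangle inequality can be checked with a small counterexample. The formulas are elided in this record, so the sketch below assumes the commonly used definition d(x, y) = 1 - |r(x, y)| with r the Pearson correlation. The three vectors are illustrative, not data from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def abs_corr_distance(x, y):
    """Assumed definition of the absolute correlation distance: 1 - |r|."""
    return 1.0 - abs(pearson(x, y))


# x and z are uncorrelated; y lies "halfway" between them (45 degrees from each
# once the vectors are centred), so it is moderately correlated with both.
x = [1.0, -1.0, 0.0]
z = [1.0, 1.0, -2.0]
y = [a / math.sqrt(2) + b / math.sqrt(6) for a, b in zip(x, z)]

d_xy = abs_corr_distance(x, y)
d_yz = abs_corr_distance(y, z)
d_xz = abs_corr_distance(x, z)

# The direct distance exceeds the detour through y, violating the inequality.
violates = d_xz > d_xy + d_yz
```

Here d(x, z) is 1 while d(x, y) and d(y, z) are each about 0.29, so going through y is "shorter" than going directly, which a true metric never allows.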


Subjects
Transcriptome , Cluster Analysis
3.
BMC Genomics ; 20(Suppl 2): 186, 2019 Apr 04.
Article in English | MEDLINE | ID: mdl-30967119

ABSTRACT

BACKGROUND: Recent advances in genome analysis have established that chromatin has preferred 3D conformations, which bring distant loci into contact. Identifying these contacts is important for understanding possible interactions between these loci. This has motivated the creation of the Hi-C technology, which detects long-range chromosomal interactions. Distance geometry-based algorithms, such as ChromSDE and ShRec3D, have been able to utilize Hi-C data to infer 3D chromosomal structures. However, these algorithms, being matrix-based, are space- and time-consuming on very large datasets. A human genome at 100-kilobase resolution would involve ∼30,000 loci, requiring gigabytes just to store the matrices. RESULTS: We propose a succinct representation of the distance matrices which tremendously reduces the space requirement. We give a complete solution, called SuperRec, for the inference of chromosomal structures from Hi-C data by iteratively solving the large-scale weighted multidimensional scaling problem. CONCLUSIONS: SuperRec runs faster than earlier systems without compromising on result accuracy. The SuperRec package can be obtained from http://www.cs.cityu.edu.hk/~shuaicli/SuperRec.
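The "gigabytes" figure for ∼30,000 loci can be reproduced with a short back-of-the-envelope calculation. This sketch assumes a dense square matrix of 8-byte floating-point entries, which the abstract does not state. It says nothing about SuperRec's own succinct representation.

```python
def dense_matrix_bytes(n_loci: int, bytes_per_entry: int = 8) -> int:
    """Bytes needed to store a full n x n matrix (assumed 8-byte entries)."""
    return n_loci * n_loci * bytes_per_entry


def upper_triangle_bytes(n_loci: int, bytes_per_entry: int = 8) -> int:
    """Bytes needed if only the strict upper triangle of a symmetric matrix is kept."""
    return n_loci * (n_loci - 1) // 2 * bytes_per_entry


n_loci = 30_000  # approximate locus count quoted in the abstract
full_bytes = dense_matrix_bytes(n_loci)
half_bytes = upper_triangle_bytes(n_loci)

print(f"full matrix:    {full_bytes / 1e9:.1f} GB")
print(f"upper triangle: {half_bytes / 1e9:.1f} GB")
```

Even when symmetry is exploited, a single matrix stays in the gigabyte range under these assumptions, and matrix-based methods typically hold several such matrices at once.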


Subjects
Algorithms , Chromatin/chemistry , Human Chromosomes/chemistry , Computational Biology/methods , Human Genome , Chromatin/genetics , Human Chromosomes/genetics , Computer Simulation , Humans , Molecular Models , Nucleic Acid Conformation
5.
BMC Genomics ; 17(Suppl 4): 430, 2016 Aug 18.
Article in English | MEDLINE | ID: mdl-27556418

ABSTRACT

BACKGROUND: Accurately identifying gene regulatory networks is an important task in understanding in vivo biological activities. The inference of such networks is often accomplished through the use of gene expression data. Many methods have been developed to evaluate gene expression dependencies between a transcription factor and its target genes, and some methods also eliminate transitive interactions. The regulatory (or edge) direction is undetermined if the target gene is also a transcription factor. Some methods predict the regulatory directions in gene regulatory networks by locating the eQTL single nucleotide polymorphism, or by observing the gene expression changes when knocking out or knocking down the candidate transcription factors; regrettably, these additional data are usually unavailable, especially for samples derived from human tissues. RESULTS: In this study, we propose the Context Based Dependency Network (CBDN), a method that is able to infer gene regulatory networks with the regulatory directions from gene expression data only. To determine the regulatory direction, CBDN computes the influence of source on target by evaluating the magnitude changes of expression dependencies between the target gene and the others when conditioning on the source gene. CBDN extends the data processing inequality by involving the dependency direction to distinguish between direct and transitive relationships between genes. We also define two types of important regulators which can influence a majority of the genes in the network directly or indirectly. CBDN can detect both types of important regulators by averaging the influence functions of a candidate regulator on the other genes. In our experiments with simulated and real data, even with the regulatory direction taken into account, CBDN outperforms the state-of-the-art approaches for inferring gene regulatory networks. CBDN identifies the important regulators in the predicted network: (1) TYROBP influences a batch of genes that are related to Alzheimer's disease; (2) ZNF329 and RB1 significantly regulate the 'mesenchymal' gene expression signature genes for brain tumors. CONCLUSION: By merely leveraging gene expression data, CBDN can efficiently infer the existence of gene-gene interactions as well as their regulatory directions. The constructed networks are helpful in the identification of important regulators for complex diseases.


Subjects
Signal Transducing Adaptor Proteins/genetics , Alzheimer Disease/genetics , DNA-Binding Proteins/genetics , Membrane Proteins/genetics , Retinoblastoma-Binding Proteins/genetics , Ubiquitin-Protein Ligases/genetics , Algorithms , Brain/metabolism , Brain/pathology , Computational Biology , Computer Simulation , Gene Expression Regulation/genetics , Gene Regulatory Networks/genetics , Humans , Single Nucleotide Polymorphism , Quantitative Trait Loci/genetics , Transcription Factors/genetics , Transcriptome
6.
BMC Bioinformatics ; 11: 25, 2010 Jan 13.
Article in English | MEDLINE | ID: mdl-20070892

ABSTRACT

BACKGROUND: Ab initio protein structure prediction methods generate numerous structural candidates, which are referred to as decoys. The decoy with the largest number of neighbors within a threshold distance is typically identified as the most representative decoy. However, the clustering of decoys needed for this criterion involves computations with runtimes that are at best quadratic in the number of decoys. As a result, there is currently no tool designed to cluster very large numbers of decoys exactly, which creates a bottleneck in the analysis. RESULTS: Using three strategies aimed at enhancing performance (proximate decoys organization, preliminary screening via lower and upper bounds, outliers filtering), we designed and implemented a software tool for clustering decoys called Calibur. We show empirical results indicating the effectiveness of each of the strategies employed. The strategies are further fine-tuned according to their effectiveness. Calibur demonstrated the ability to scale well with respect to increases in the number of decoys. For a sample size of approximately 30 thousand decoys, Calibur completed the analysis in one third of the time required when the strategies are not used. For practical use, Calibur is able to automatically discover from the input decoys a suitable threshold distance for clustering. Several methods for this discovery are implemented in Calibur, where by default a very fast one is used. Using the default method, Calibur reported relatively good decoys in our tests. CONCLUSIONS: Calibur's ability to handle very large protein decoy sets makes it a useful tool for clustering decoys in ab initio protein structure prediction. As the number of decoys generated by these methods increases, we believe Calibur will become important for progress in the field.
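The neighbor-count criterion and the idea of screening pairs with lower and upper bounds can be sketched as follows. This is not Calibur's implementation: it assumes plain Euclidean distance between points as a stand-in for the structural distance between decoys, uses the first point as a single reference, and the function name `most_representative` is hypothetical. The bounds come from the triangle inequality: |d(a, r) - d(b, r)| ≤ d(a, b) ≤ d(a, r) + d(b, r).

```python
import math


def most_representative(points, threshold):
    """Return (index, neighbor_count, exact_computations) for the point
    with the most neighbors within `threshold`.

    Pairs are screened with triangle-inequality bounds relative to a
    reference point, so the exact distance is computed only when needed.
    """
    ref = points[0]  # assumed reference; any point would do
    to_ref = [math.dist(p, ref) for p in points]
    counts = [0] * len(points)
    exact = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            lower = abs(to_ref[i] - to_ref[j])
            upper = to_ref[i] + to_ref[j]
            if lower > threshold:
                continue  # certainly not neighbors
            if upper <= threshold:
                is_neighbor = True  # certainly neighbors
            else:
                exact += 1
                is_neighbor = math.dist(points[i], points[j]) <= threshold
            if is_neighbor:
                counts[i] += 1
                counts[j] += 1
    best = max(range(len(points)), key=counts.__getitem__)
    return best, counts[best], exact


# Two tight groups: three points near the origin, two near (5, 5).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
best, n_neighbors, n_exact = most_representative(points, threshold=0.5)
```

In this toy input only one of the ten pairs needs an exact distance computation. The loop is still quadratic in the number of points, so the bounds reduce the cost per pair rather than the number of pairs.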


Subjects
Cluster Analysis , Computational Biology/methods , Protein Conformation , Proteins/chemistry , Software , Protein Databases , Protein Folding , Protein Sequence Analysis
7.
BMC Genomics ; 10: 42, 2009 Jan 22.
Article in English | MEDLINE | ID: mdl-19159490

ABSTRACT

BACKGROUND: Co-expressed genes tend to cluster in eukaryotic genomes. This paper analyzes the correlation between the proximity of eukaryotic genes and their transcriptional expression pattern in the zebrafish (Danio rerio) genome using available microarray data and gene annotation. RESULTS: The analyses show that neighbouring genes are significantly coexpressed in the zebrafish genome, and that the coexpression level is influenced by the intergenic distance and transcription orientation. This finding is further supported by examining the coexpression level of genes within positional clusters in the neighbourhood model. There is a positive correlation between gene coexpression and positional clustering in the zebrafish genome. CONCLUSION: The study provides another piece of evidence for the hypothesis that coexpressed genes cluster in eukaryotic genomes.


Subjects
Genome , Zebrafish/genetics , Animals , Cluster Analysis , Computational Biology , Genetic Databases , Gene Expression Profiling , Gene Expression Regulation , Genetic Models , Genetic Transcription
8.
Comput Biol Chem ; 74: 428-433, 2018 Jun.
Article in English | MEDLINE | ID: mdl-29625871

ABSTRACT

DNA fingerprinting, also known as DNA profiling, serves as a standard procedure in forensics to identify a person by the short tandem repeat (STR) loci in their DNA. By comparing the STR loci between DNA samples, practitioners can calculate a probability of match to identify the contributors of a DNA mixture. Most existing methods are based on 13 core STR loci which were identified by the Federal Bureau of Investigation (FBI). Forensic analyses of DNA mixtures based on these loci are highly variable in procedure, and suffer from subjectivity as well as bias in complex mixture interpretation. With the emergence of next-generation sequencing (NGS) technologies, the sequencing of billions of DNA molecules can be parallelized, thus greatly increasing throughput and reducing the associated costs. This allows the creation of new techniques that incorporate more loci to enable complex mixture interpretation. In this paper, we propose a likelihood ratio computation that uses NGS data for DNA testing on mixed samples. We have applied the method to 4480 simulated DNA mixtures, which combine 8 unrelated whole-genome sequencing datasets in various mixture proportions. The results confirm the feasibility of utilizing NGS data in DNA mixture interpretation. We observed an average likelihood ratio as high as 285,978 for two-person mixtures. Using our method, all 224 identity tests for two-person and three-person mixtures produced correct identifications.


Subjects
DNA Fingerprinting , DNA/genetics , High-Throughput Nucleotide Sequencing , Microsatellite Repeats/genetics , Humans
9.
Article in English | MEDLINE | ID: mdl-26357275

ABSTRACT

The Local/Global Alignment (Zemla, 2003), or LGA, is a popular method for the comparison of protein structures. One of the two components of LGA requires us to compute the longest common contiguous segments between two protein structures. That is, given two structures $A = (a_1, \ldots, a_n)$ and $B = (b_1, \ldots, b_n)$ where $a_k, b_k \in \mathbb{R}^3$, we are to find, among all the segments $f = (a_i, \ldots, a_j)$ and $g = (b_i, \ldots, b_j)$ that fulfill a certain criterion regarding their similarity, those of the maximum length. We consider the following criteria: (1) the root mean squared deviation (RMSD) between $f$ and $g$ is to be within a given $t \in \mathbb{R}$; (2) $f$ and $g$ can be superposed such that for each $k$, $i \le k \le j$, $\|a_k - b_k\| \le t$ for a given $t \in \mathbb{R}$. We give an algorithm of $O(n \log n + nl)$ time complexity when the first requirement applies, where $l$ is the maximum length of the segments fulfilling the criterion. We show an FPTAS which, for any $\epsilon \in \mathbb{R}$, finds a segment of length at least $l$, but of RMSD up to $(1 + \epsilon)t$, in $O(n \log n + n/\epsilon)$ time. We propose an FPTAS which, for any given $\epsilon \in \mathbb{R}$, finds all the segments $f$ and $g$ of the maximum length which can be superposed such that for each $k$, $i \le k \le j$, $\|a_k - b_k\| \le (1 + \epsilon)t$, thus fulfilling the second requirement approximately. The algorithm has a time complexity of $O(n \log^2 n / \epsilon^5)$ when consecutive points in $A$ are separated by the same distance (which is the case with protein structures). These worst-case runtime complexities are verified using C++ implementations of the algorithms, which we have made available at http://alcs.sourceforge.net/.


Subjects
Computational Biology/methods , Proteins/chemistry , Sequence Alignment/methods , Algorithms , Molecular Models , Protein Conformation , Protein Structural Homology