1.
J Transl Med ; 20(1): 373, 2022 08 18.
Article in English | MEDLINE | ID: mdl-35982500

ABSTRACT

BACKGROUND: Recently, extensive cancer genomic studies have revealed mutational and clinical data for large cohorts of cancer patients. For example, the Pan-Lung Cancer 2016 dataset (part of The Cancer Genome Atlas project) summarises the mutational and clinical profiles of different subtypes of Lung Cancer (LC). Mutational and clinical signatures have been used independently for tumour typification and prediction of metastasis in LC patients. Is it then possible to achieve better typifications and predictions when combining both data streams? METHODS: In a cohort of 1144 Lung Adenocarcinoma (LUAD) and Lung Squamous Cell Carcinoma (LSCC) patients, we studied the number of missense mutations (hereafter, the Total Mutational Load, TML) and the distribution of clinical variables for different classes of patients. Using the TML and different sets of clinical variables (tumour stage, age, sex, smoking status, and packs of cigarettes smoked per year), we built Random Forest classification models that calculate the likelihood of developing metastasis. RESULTS: We found that LC patients differing in age, smoking status, and tumour type had significantly different mean TMLs. Although TML was an informative feature, its effect was secondary to the "tumour stage" feature. However, its contribution to the classification is not redundant with the latter; models trained using both TML and tumour stage performed better than models trained using only one of these variables. We found that models trained on the entire dataset (i.e., without using dimensionality reduction techniques) and without resampling achieved the highest performance, with an F1 score of 0.64 (95% CrI [0.62, 0.66]). CONCLUSIONS: Clinical variables and TML should be considered together when assessing the likelihood of LC patients progressing to metastatic states, as the information these encode is not redundant. Altogether, we provide new evidence of the need for comprehensive diagnostic tools for metastasis.
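
The F1 score quoted in this abstract is the harmonic mean of precision and recall. As a reminder of how the metric is computed, here is a minimal sketch; the helper name `f1_score` and the confusion counts are assumptions chosen for illustration and are not data from the study.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, computed from confusion counts."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (assumed, not taken from the paper).
example_f1 = f1_score(tp=8, fp=2, fn=4)
```

Because F1 ignores true negatives, it is a common choice when the positive class (here, metastasis) is the one of interest.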


Subjects
Adenocarcinoma of Lung , Non-Small-Cell Lung Carcinoma , Squamous Cell Carcinoma , Lung Neoplasms , Adenocarcinoma of Lung/genetics , Adenocarcinoma of Lung/pathology , Non-Small-Cell Lung Carcinoma/pathology , Squamous Cell Carcinoma/genetics , Humans , Lung Neoplasms/genetics , Lung Neoplasms/pathology , Mutation/genetics
2.
J Bioinform Comput Biol ; 17(3): 1950011, 2019 06.
Article in English | MEDLINE | ID: mdl-31230498

ABSTRACT

Signaling pathways are responsible for the regulation of cell processes, such as monitoring the external environment, transmitting information across membranes, and making cell fate decisions. Given the increasing amount of biological data available and the recent discoveries showing that many diseases are related to the disruption of cellular signal transduction cascades, in silico discovery of signaling pathways in cell biology has become an active research topic in recent years. However, reconstruction of signaling pathways remains a challenge, mainly because of the need for systematic approaches for predicting causal relationships, like edge direction and activation/inhibition among interacting proteins in the signal flow. We propose an approach for predicting signaling pathways that integrates protein interactions, gene expression, phenotypes, and protein complex information. Our method first finds candidate pathways using a directed-edge-based algorithm and then defines a graph model that includes causal activation relationships among proteins in candidate pathways, using cell cycle gene expression and phenotypes to infer consistent pathways in yeast. Then, we incorporate protein complex coverage information to decide on the final predicted signaling pathways. We show that our approach improves the predictive results of the state of the art using different ranking metrics.
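
The first step described in this abstract, finding candidate pathways over directed edges, can be pictured as enumerating simple directed paths between a start protein and an end protein. The sketch below is a generic depth-first enumeration and not the authors' algorithm; the toy graph, the node labels, and the `max_len` bound are assumptions for illustration.

```python
from typing import Dict, List


def candidate_paths(graph: Dict[str, List[str]], source: str, target: str,
                    max_len: int) -> List[List[str]]:
    """Enumerate simple directed paths from source to target with at most max_len nodes."""
    paths: List[List[str]] = []

    def dfs(node: str, path: List[str]) -> None:
        if node == target:
            paths.append(list(path))
            return
        if len(path) >= max_len:
            return  # length bound reached without hitting the target
        for nxt in graph.get(node, []):
            if nxt not in path:  # keep paths simple (no repeated proteins)
                path.append(nxt)
                dfs(nxt, path)
                path.pop()

    dfs(source, [source])
    return paths


# Hypothetical directed interaction graph (labels are placeholders, not real proteins).
toy_graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "B"], "D": []}
paths = candidate_paths(toy_graph, "A", "D", max_len=4)
```

In a pipeline like the one described, such candidates would then be filtered or ranked using expression, phenotype, and complex information.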


Subjects
Cell Cycle , Computational Biology/methods , Multiprotein Complexes/metabolism , Signal Transduction , Algorithms , Cell Cycle/genetics , Computer Graphics , Data Visualization , Gene Expression , Protein Interaction Mapping/methods , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae Proteins/metabolism
3.
Bioinformatics ; 35(20): 4120-4128, 2019 10 15.
Article in English | MEDLINE | ID: mdl-30887042

ABSTRACT

MOTIVATION: Genome repositories are growing faster than our storage capacities, challenging our ability to store, transmit, process and analyze them. While genomes are not very compressible individually, those repositories usually contain myriads of genomes or genome reads of the same species, thereby creating opportunities for orders-of-magnitude compression by exploiting inter-genome similarities. A useful compression system, however, cannot be usable only for archival; it must also allow direct access to the sequences, ideally in transparent form so that applications do not need to be rewritten. RESULTS: We present a highly compressed filesystem that specializes in storing large collections of genomes and reads. The system obtains orders-of-magnitude compression by using Relative Lempel-Ziv, which exploits the high similarities between genomes of the same species. The filesystem transparently stores the files in compressed form, intercepting the system calls of the applications without the need to modify them. A client/server variant of the system stores the compressed files in a server, while the client's filesystem transparently retrieves and updates the data from the server. The data between client and server are also transferred in compressed form, which saves an order of magnitude in network time. AVAILABILITY AND IMPLEMENTATION: The C++ source code of our implementation is available for download at https://github.com/vsepulve/relz_fs.
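
The idea behind Relative Lempel-Ziv can be sketched in a few lines: a sequence is encoded as a list of (position, length) phrases that point into a reference sequence, so a genome similar to the reference collapses to a handful of phrases. The code below is a naive illustration under assumed names (`rlz_compress`, `rlz_decompress`) with a quadratic-time search; it is not taken from the relz_fs code base, which would rely on an index over the reference.

```python
from typing import List, Tuple, Union

# A factor is either (position in reference, length) or a literal character.
Factor = Union[Tuple[int, int], str]


def rlz_compress(reference: str, text: str) -> List[Factor]:
    """Greedily parse text into the longest phrases that occur in reference."""
    factors: List[Factor] = []
    i = 0
    while i < len(text):
        best_pos, best_len = -1, 0
        length = 1
        # Naive search: grow the phrase while it still occurs in the reference.
        while i + length <= len(text):
            pos = reference.find(text[i:i + length])
            if pos < 0:
                break
            best_pos, best_len = pos, length
            length += 1
        if best_len == 0:
            factors.append(text[i])  # character absent from the reference
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def rlz_decompress(reference: str, factors: List[Factor]) -> str:
    """Rebuild the text by copying each phrase out of the reference."""
    pieces = []
    for f in factors:
        pieces.append(f if isinstance(f, str) else reference[f[0]:f[0] + f[1]])
    return "".join(pieces)


reference = "ACGTACGTTAGC"
factors = rlz_compress(reference, "ACGTTAGCACGA")
```

Because every phrase is an independent pointer into the reference, any part of the text can be decoded without decompressing what precedes it, which is what makes direct access practical.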


Subjects
Data Compression , Genome , Software
4.
PLoS One ; 12(9): e0183460, 2017.
Article in English | MEDLINE | ID: mdl-28937982

ABSTRACT

Many proteins work together with others in groups called complexes in order to achieve a specific function. Discovering protein complexes is important for understanding biological processes and predicting protein functions in living organisms. Large-scale, high-throughput techniques have made it possible to compile protein-protein interaction networks (PPI networks), which have been used in several computational approaches for detecting protein complexes. Those predictions might guide future biological experimental research. Some approaches are topology-based, where highly connected proteins are predicted to be complexes; some propose different clustering algorithms using partitioning or overlaps among clusters, for networks modeled with unweighted or weighted graphs; and others use density of clusters and information based on protein functionality. However, some schemes still require much processing time, or the quality of their results can be improved. Furthermore, most of the results obtained with computational tools are not accompanied by an analysis of false positives. We propose an effective and efficient mining algorithm for discovering highly connected subgraphs, which is our base for defining protein complexes. Our representation is based on transforming the PPI network into a directed acyclic graph that reduces the number of represented edges and the search space for discovering subgraphs. Our approach considers weighted and unweighted PPI networks. We compare our best alternative using PPI networks from Saccharomyces cerevisiae (yeast) and Homo sapiens (human) with state-of-the-art approaches in terms of clustering, biological metrics and execution times, as well as three gold standards for yeast and two for human. Furthermore, we analyze false positive predicted complexes by searching the PDBe (Protein Data Bank in Europe) database in order to identify matching protein complexes that have been purified and structurally characterized.
Our analysis shows that more than 50 yeast protein complexes and more than 300 human protein complexes found to be false positives according to our prediction method, i.e., not described in the gold standard complex databases, in fact contain protein complexes that have been characterized structurally and documented in PDBe. We also found that some of these protein complexes have recently been classified as part of a Periodic Table of Protein Complexes. The latest version of our software is publicly available at http://doi.org/10.6084/m9.figshare.5297314.v1.
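
To make the graph vocabulary in this abstract concrete, the sketch below shows one simple way to store an undirected PPI network as a directed acyclic graph (each edge kept once, oriented by a fixed node order) and to measure how highly connected a candidate protein set is. This orientation rule and the `density` helper are assumptions for illustration only; the abstract does not specify the authors' actual transformation or mining algorithm.

```python
from typing import Dict, Iterable, List, Set, Tuple


def orient_as_dag(edges: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Keep each undirected edge once, directed from the smaller to the larger label."""
    dag: Dict[str, Set[str]] = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-interactions
        a, b = (u, v) if u < v else (v, u)
        dag.setdefault(a, set()).add(b)
        dag.setdefault(b, set())
    # Edges always go from a smaller to a larger label, so no cycle can form.
    return {node: sorted(succ) for node, succ in dag.items()}


def density(dag: Dict[str, List[str]], nodes: Iterable[str]) -> float:
    """Fraction of possible pairs inside `nodes` that are connected (1.0 for a clique)."""
    members = set(nodes)
    n = len(members)
    if n < 2:
        return 0.0
    m = sum(1 for u in members for v in dag.get(u, []) if v in members)
    return 2 * m / (n * (n - 1))


# Hypothetical interactions; the duplicate (P2, P1) is stored only once.
toy_edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P1"), ("P3", "P4"), ("P2", "P1")]
toy_dag = orient_as_dag(toy_edges)
```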


Subjects
Algorithms , Molecular Models , Protein Interaction Mapping/methods , Proteins/metabolism , Humans , Internet , Saccharomyces cerevisiae , Software
5.
Inf Retr Boston ; 20(3): 253-291, 2017.
Article in English | MEDLINE | ID: mdl-28596702

ABSTRACT

Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple [Formula: see text] model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.
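
The three queries named in this abstract can be pinned down with a brute-force baseline that scans every document. The sketch below only fixes their semantics; the function names are assumptions, patterns are assumed to be non-empty, and none of this reflects the compressed indexes the paper develops, whose purpose is to avoid such scans.

```python
from typing import List, Sequence


def occurrences(doc: str, pattern: str) -> int:
    """Count (possibly overlapping) occurrences of a non-empty pattern in doc."""
    count, start = 0, doc.find(pattern)
    while start >= 0:
        count += 1
        start = doc.find(pattern, start + 1)
    return count


def document_listing(docs: Sequence[str], pattern: str) -> List[int]:
    """Identifiers of all documents where the pattern appears."""
    return [i for i, d in enumerate(docs) if pattern in d]


def document_counting(docs: Sequence[str], pattern: str) -> int:
    """Number of documents where the pattern appears."""
    return len(document_listing(docs, pattern))


def top_k(docs: Sequence[str], pattern: str, k: int) -> List[int]:
    """The k documents where the pattern appears most often (ties broken by identifier)."""
    hits = [(occurrences(d, pattern), i) for i, d in enumerate(docs)]
    hits = [(c, i) for c, i in hits if c > 0]
    hits.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in hits[:k]]


# Toy collection of similar strings (illustrative only).
docs = ["GATTACA", "GATTAGATTA", "CCCC", "TAGATA"]
```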

6.
BMC Ecol ; 12: 1, 2012 Jan 27.
Article in English | MEDLINE | ID: mdl-22284854

ABSTRACT

BACKGROUND: The Andes-Amazon basin of Peru and Bolivia is one of the most data-poor, biologically rich, and rapidly changing areas of the world. Conservation scientists agree that this area hosts extremely high endemism, perhaps the highest in the world, yet we know little about the geographic distributions of these species and ecosystems within country boundaries. To address this need, we have developed conservation data on endemic biodiversity (~800 species of birds, mammals, amphibians, and plants) and terrestrial ecological systems (~90; groups of vegetation communities resulting from the action of ecological processes, substrates, and/or environmental gradients) with which we conduct a fine-scale conservation prioritization across the Amazon watershed of Peru and Bolivia. We modelled the geographic distributions of 435 endemic plants and all 347 endemic vertebrate species, from existing museum and herbaria specimens at a regional conservation practitioner's scale (1:250,000-1:1,000,000), based on the best available tools and geographic data. We mapped ecological systems, endemic species concentrations, and irreplaceable areas with respect to national level protected areas. RESULTS: We found that sizes of endemic species distributions ranged widely (< 20 km² to > 200,000 km²) across the study area. Bird and mammal endemic species richness was greatest within a narrow 2500-3000 m elevation band along the length of the Andes Mountains. Endemic amphibian richness was highest at 1000-1500 m elevation and concentrated in the southern half of the study area. Geographical distribution of plant endemism was highly taxon-dependent. Irreplaceable areas, defined as locations with the highest number of species with narrow ranges, overlapped slightly with areas of high endemism, yet generally exhibited unique patterns across the study area by species group.
We found that many endemic species and ecological systems are lacking national-level protection; a third of endemic species have distributions completely outside of national protected areas. Protected areas cover only 20% of areas of high endemism and 20% of irreplaceable areas. Almost 40% of the 91 ecological systems are in serious need of protection (≤ 2% of their ranges protected). CONCLUSIONS: We identify, for the first time, areas of high endemic species concentrations and high irreplaceability that have only been roughly indicated in the past at the continental scale. We conclude that new complementary protected areas are needed to safeguard these endemics and ecosystems. An expansion in protected areas will be challenged by geographically isolated micro-endemics, varied endemic patterns among taxa, increasing deforestation, resource extraction, and changes in climate. Relying on pre-existing collections and publicly accessible datasets and tools, this working framework is exportable to other regions plagued by incomplete conservation data.


Subjects
Biodiversity , Conservation of Natural Resources/methods , Demography , Ecosystem , Theoretical Models , Animals , Bolivia , Geography , Maps as Topic , Peru , Species Specificity
7.
IEEE Trans Pattern Anal Mach Intell ; 30(9): 1647-58, 2008 Sep.
Article in English | MEDLINE | ID: mdl-18617721

ABSTRACT

We introduce a new probabilistic proximity search algorithm for range and K-nearest neighbor (K-NN) searching in both coordinate and metric spaces. Although there exist solutions for these problems, they boil down to a linear scan when the space is intrinsically high-dimensional, as is the case in many pattern recognition tasks. This, for example, renders the K-NN approach to classification rather slow in large databases. Our novel idea is to predict closeness between elements according to how they order their distances towards a distinguished set of anchor objects. Each element in the space sorts the anchor objects from closest to farthest to it, and the similarity between orders turns out to be an excellent predictor of the closeness between the corresponding elements. We present extensive experiments comparing our method against state-of-the-art exact and approximate techniques, both in synthetic and real, metric and non-metric databases, measuring both CPU time and distance computations. The experiments demonstrate that our technique almost always improves upon the performance of alternative techniques, in some cases by a wide margin.
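
The ordering idea described in this abstract can be sketched as follows: each object ranks a fixed set of anchor objects by distance, and database elements are then visited in order of how similar their ranking is to the query's ranking. The code is a minimal illustration under stated assumptions: Euclidean distance, the Spearman footrule as the similarity between orders, and all function names and toy points are choices made here, not details confirmed by the abstract.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def permutation(obj: Point, anchors: Sequence[Point]) -> List[int]:
    """Rank of each anchor when anchors are sorted from closest to farthest from obj."""
    order = sorted(range(len(anchors)),
                   key=lambda j: (math.dist(obj, anchors[j]), j))
    ranks = [0] * len(anchors)
    for pos, j in enumerate(order):
        ranks[j] = pos
    return ranks


def footrule(p: Sequence[int], q: Sequence[int]) -> int:
    """Spearman footrule: total displacement between two rankings."""
    return sum(abs(a - b) for a, b in zip(p, q))


def candidate_order(query: Point, database: Sequence[Point],
                    anchors: Sequence[Point]) -> List[int]:
    """Database indices sorted by how similar their anchor ranking is to the query's."""
    qperm = permutation(query, anchors)
    return sorted(range(len(database)),
                  key=lambda i: (footrule(qperm, permutation(database[i], anchors)), i))


# Toy data (assumed for illustration).
anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
database = [(2.0, 1.0), (9.0, 1.0), (1.0, 9.0), (3.0, 1.0)]
order = candidate_order((1.5, 1.0), database, anchors)
```

In practice the database rankings would be precomputed once, and only a prefix of the candidate order would be compared against the query with the real distance, which is where the savings in distance computations come from.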


Subjects
Algorithms , Artificial Intelligence , Image Enhancement/methods , Computer-Assisted Image Interpretation/methods , Information Storage and Retrieval/methods , Automated Pattern Recognition/methods , Subtraction Technique , Computer Simulation , Statistical Data Interpretation , Statistical Models
8.
J Comput Biol ; 10(6): 903-23, 2003.
Article in English | MEDLINE | ID: mdl-14980017

ABSTRACT

The problem of fast exact and approximate searching for a pattern that contains classes of characters and bounded size gaps (CBG) in a text has a wide range of applications, among which a very important one is protein pattern matching (for instance, one PROSITE protein site is associated with the CBG [RK] - x(2,3) - [DE] - x(2,3) - Y, where the brackets match any of the letters inside, and x(2,3) matches a gap of length between 2 and 3). Currently, the only way to search for a CBG in a text is to convert it into a full regular expression (RE). However, an RE is more sophisticated than a CBG, and searching for it with an RE pattern matching algorithm complicates the search and makes it slow. This is the reason why we design in this article two new practical CBG matching algorithms that are much simpler and faster than all the RE search techniques. The first one looks exactly once at each text character. The second one does not need to consider all the text characters, and hence it is usually faster than the first one, but in bad cases may have to read the same text character more than once. We then propose a criterion based on the form of the CBG to choose a priori the faster of the two. We also show how to search permitting a few mistakes in the occurrences. We performed many practical experiments using the PROSITE database, and all of them show that our algorithms are the fastest in virtually all cases.
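
The baseline this abstract mentions, converting a CBG into a full regular expression, can be illustrated with Python's standard `re` module. The sketch below handles only the syntax shown in the example (single letters, bracketed classes, and `x`, `x(n)` or `x(n,m)` gaps); the function name `cbg_to_regex` and the test sequence are assumptions, and this shows the regular-expression route, not the specialized algorithms the article proposes.

```python
import re


def cbg_to_regex(cbg: str) -> str:
    """Translate a PROSITE-style CBG such as '[RK]-x(2,3)-[DE]-x(2,3)-Y' into a regex."""
    parts = []
    for token in cbg.replace(" ", "").split("-"):
        gap = re.fullmatch(r"x(?:\((\d+)(?:,(\d+))?\))?", token)
        if gap:
            lo, hi = gap.group(1), gap.group(2)
            if lo is None:
                parts.append(".")                     # bare x: exactly one character
            elif hi is None:
                parts.append(".{%s}" % lo)            # x(n): exactly n characters
            else:
                parts.append(".{%s,%s}" % (lo, hi))   # x(n,m): between n and m
        else:
            parts.append(token)  # a single letter or a class like [RK]
    return "".join(parts)


pattern = cbg_to_regex("[RK] - x(2,3) - [DE] - x(2,3) - Y")
# Made-up sequence: K, a 2-letter gap, D, a 3-letter gap, then Y.
match = re.search(pattern, "AAKGGDLLLYAA")
```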


Subjects
Computational Biology/methods , Automated Pattern Recognition , Proteins/chemistry , Proteins/classification , Algorithms , Information Storage and Retrieval
9.
Cochabamba; Centro de Ecología Simón I. Patiño; July 2002. 719 p. illus.
Monograph in Spanish | LIBOCS, LIBOSP | ID: biblio-1333678