ABSTRACT
The increasing prevalence of omics data sources is pushing the study of regulatory mechanisms underlying complex diseases such as cancer. However, the vast quantities of molecular features produced and the inherent interplay between them lead to a level of complexity that hampers both descriptive and predictive tasks, requiring custom-built algorithms that can extract relevant information from these sources of data. We propose a transformation that moves data centered on molecules (e.g., transcripts and proteins) to a new data space focused on putative regulatory modules given by statistically relevant co-expression patterns. To this end, the proposed transformation extracts patterns from the data through biclustering and uses them to create new variables with guarantees of interpretability and discriminative power. The transformation is shown to achieve dimensionality reductions of up to 99% and to increase the predictive performance of various classifiers across multiple omics layers. Results suggest that omics data transformations from gene-centric to pattern-centric data support both prediction tasks and human interpretation, notably contributing to precision medicine applications.
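To make the gene-centric to pattern-centric idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each pattern is a mapping from gene to an expected expression value, and that the new variable for a sample is the fraction of pattern genes it matches within a tolerance; the function name `pattern_features` and the `tol` parameter are hypothetical.

```python
def pattern_features(samples, patterns, tol=0.5):
    """Map each sample (dict: gene -> value) to one feature per pattern.

    Assumption (illustrative only): the feature is the fraction of the
    pattern's genes whose value lies within `tol` of the expected value.
    """
    transformed = []
    for sample in samples:
        row = []
        for pattern in patterns:
            matched = sum(
                1
                for gene, expected in pattern.items()
                if gene in sample and abs(sample[gene] - expected) <= tol
            )
            row.append(matched / len(pattern))
        transformed.append(row)
    return transformed


# Three gene-level variables become two pattern-level variables.
samples = [{"g1": 1.0, "g2": 2.0, "g3": 0.0}]
patterns = [{"g1": 1.0, "g2": 2.1}, {"g3": 3.0}]
features = pattern_features(samples, patterns)
```

Because there are usually far fewer patterns than molecular features, a mapping of this kind reduces dimensionality while keeping each new variable tied to a named set of genes.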
Subjects
Neoplasms , Humans , Neoplasms/genetics , Algorithms , Genomics/methods , Computational Biology/methods , Gene Expression Profiling/methods
ABSTRACT
BACKGROUND: Biclustering is increasingly used in biomedical data analysis, recommendation tasks, and text mining domains, with hundreds of biclustering algorithms proposed. When assessing the performance of these algorithms, real datasets alone are insufficient because they do not offer a solid ground truth. Synthetic data overcome this limitation by producing reference solutions to be compared with the found patterns. However, generating synthetic datasets is challenging since the generated data must ensure reproducibility, pattern representativity, and real data resemblance. RESULTS: We propose G-Bic, a dataset generator conceived to produce synthetic benchmarks for the normative assessment of biclustering algorithms. Beyond expanding on aspects of pattern coherence, data quality, and positioning properties, it further handles specificities related to mixed-type datasets and time-series data. G-Bic has the flexibility to replicate real data regularities from diverse domains. We provide the default configurations to generate reproducible benchmarks to evaluate and compare diverse aspects of biclustering algorithms. Additionally, we discuss empirical strategies to simulate the properties of real data. CONCLUSION: G-Bic is a parametrizable generator for biclustering analysis, offering a solid means to robustly assess biclustering solutions according to internal and external metrics.
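The role of a planted ground truth can be illustrated with a small sketch. This is not G-Bic's API: the function names, the Gaussian background, and the constant pattern are assumptions chosen for brevity. A seeded generator gives reproducibility, and a cell-level Jaccard index serves as one possible external metric.

```python
import random


def plant_constant_bicluster(n_rows, n_cols, bic_rows, bic_cols, value, seed=0):
    """Return a Gaussian background matrix with one constant bicluster planted.

    Hypothetical helper: a seeded random generator makes the dataset
    reproducible across runs.
    """
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]
    for r in bic_rows:
        for c in bic_cols:
            data[r][c] = value
    return data


def jaccard(found, truth):
    """Cell-level Jaccard index between two biclusters given as (rows, cols)."""
    found_cells = {(r, c) for r in found[0] for c in found[1]}
    truth_cells = {(r, c) for r in truth[0] for c in truth[1]}
    union = found_cells | truth_cells
    return len(found_cells & truth_cells) / len(union) if union else 1.0


truth = ([0, 1], [2, 3])
data = plant_constant_bicluster(4, 4, truth[0], truth[1], value=5.0, seed=1)
score = jaccard(([0, 1], [2]), truth)  # a found bicluster covering half the cells
```

With real data no `truth` exists, so only internal metrics are available; with planted patterns both kinds of metric can be reported.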
Subjects
Benchmarking , Gene Expression Profiling , Reproducibility of Results , Cluster Analysis , Algorithms
ABSTRACT
A successful high-level gymnastics performance is the result of the coordination and inter-relation of body segments to produce movement prototypes. In this context, the exploration of different movement prototypes, as well as their relations with judges' scores, can aid coaches to design better learning and practice methodologies. Therefore, we investigate if there are different movement prototypes of the technique of the handspring tucked somersault with a half twist (HTB) on a mini trampoline with a vaulting table and its relations with judges' scores. We assessed flexion/extension angles of five joints during fifty trials, using an inertial measurement unit system. All trials were scored by international judges for execution. A multivariate time series cluster analysis was performed to identify movement prototypes and their differential association with judges' scores was statistically assessed. Nine different movement prototypes were identified for the HTB technique, with two of them associated with higher scores. Statistically strong associations were found between scores and movement phases one (i.e., from the last step on the carpet to the initial contact of both feet with the mini trampoline), two (i.e., from the initial contact to the take-off on the mini trampoline) and four (i.e., from the initial contact of both hands with the vaulting table to take-off on the vaulting table) and moderate associations with movement phase six (i.e., from the tucked body position to landing with both feet on the landing mat). Our findings suggest (a) the presence of multiple movement prototypes yielding successful scoring and (b) the moderate-to-strong association of movement variations along phases one, two, four and six with judges' scores. We suggest and provide guidelines for coaches to encourage movement variability that can lead their gymnasts to functionally adapt their performance and succeed when facing different constraints.
Subjects
Gymnastics , Judgment , Movement , Adult , Humans , Young Adult , Hand , Rotation
ABSTRACT
Longitudinal cohort studies of disease progression generally combine temporal features produced under periodic assessments (clinical follow-up) with static features associated with single-time assessments, such as genetic, psychophysiological, and demographic profiles. Subspace clustering, including biclustering and triclustering stances, enables the discovery of local and discriminative patterns from such multidimensional cohort data. These highly interpretable patterns are relevant to identifying groups of patients with similar traits or progression patterns. Despite their potential, their use for improving predictive tasks in clinical domains remains unexplored. In this work, we propose to learn predictive models from static and temporal data using discriminative patterns, obtained via biclustering and triclustering, as features within a state-of-the-art classifier, thus enhancing model interpretation. triCluster is extended to find time-contiguous triclusters in temporal data (temporal patterns), and a biclustering algorithm is used to discover coherent patterns in static data. The transformed data space, composed of bicluster and tricluster features, captures local and cross-variable associations with discriminative power, yielding unique statistical properties of interest. As a case study, we applied our methodology to follow-up data from Portuguese patients with Amyotrophic Lateral Sclerosis (ALS) to predict the need for non-invasive ventilation (NIV) since the last appointment. The results showed that, in general, our methodology outperformed baseline results using the original features. Furthermore, the bicluster/tricluster-based patterns used by the classifier can be used by clinicians to understand the models by highlighting relevant prognostic patterns.
Subjects
Amyotrophic Lateral Sclerosis , Noninvasive Ventilation , Amyotrophic Lateral Sclerosis/diagnosis , Amyotrophic Lateral Sclerosis/therapy , Cluster Analysis , Humans , Longitudinal Studies , Prognosis
ABSTRACT
BACKGROUND: A considerable number of data mining approaches for biomedical data analysis, including state-of-the-art associative models, require a form of data discretization. Although diverse discretization approaches have been proposed, they generally work under a strict set of statistical assumptions which are arguably insufficient to handle the diversity and heterogeneity of clinical and molecular variables within a given dataset. In addition, although an increasing number of symbolic approaches in bioinformatics are able to assign multiple items to values occurring near discretization boundaries for superior robustness, there are no reference principles on how to perform multi-item discretizations. RESULTS: In this study, an unsupervised discretization method, DI2, for variables with arbitrarily skewed distributions is proposed. Statistical tests applied to assess differences in performance confirm that DI2 generally outperforms well-established discretization methods with statistical significance. Within classification tasks, DI2 displays either competitive or superior levels of predictive accuracy, particularly for classifiers able to accommodate border values. CONCLUSIONS: This work proposes a new unsupervised method for data discretization, DI2, that takes into account the underlying data regularities, the presence of outlier values disrupting expected regularities, as well as the relevance of border values. DI2 is available at https://github.com/JupitersMight/DI2.
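The notion of assigning multiple items to values near a discretization boundary can be sketched as follows. This is an illustration only and does not reproduce DI2's procedure or API: the cut points are taken as given, and the fixed `margin` around each cut point is an assumed simplification.

```python
def discretize_with_borders(value, cut_points, margin):
    """Return the item(s) assigned to `value`.

    `cut_points` must be sorted; item i covers the interval between
    cut point i-1 and cut point i. Assumption (illustrative only): a value
    within `margin` of a cut point receives both neighbouring items.
    """
    # Primary item: number of cut points at or below the value.
    items = {sum(1 for cut in cut_points if value >= cut)}
    for i, cut in enumerate(cut_points):
        if abs(value - cut) <= margin:
            items.update((i, i + 1))  # items on either side of the boundary
    return sorted(items)


cuts = [1.0, 2.0]  # three items: 0, 1, 2
single = discretize_with_borders(0.5, cuts, margin=0.1)   # far from any boundary
border = discretize_with_borders(1.05, cuts, margin=0.1)  # close to the first cut
```

A classifier that accepts item sets can then treat a border value as compatible with either neighbouring category instead of committing to one.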
Subjects
Algorithms , Data Mining , Computational Biology
ABSTRACT
BACKGROUND: Three-way data started to gain popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions along time, urban dynamics, or complex geophysical phenomena. Triclustering, subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data, without a known ground truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of further providing the ground truth (triclustering solution) as output. RESULTS: G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. CONCLUSIONS: Triclustering evaluation using G-Tric provides the possibility to combine both intrinsic and extrinsic metrics to compare solutions that produce more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the process of evaluating the quality of new triclustering approaches.
Subjects
Algorithms , Cluster Analysis , Factual Databases , Humans , Software , Temperature , Yeasts
ABSTRACT
BACKGROUND: In the face of the current COVID-19 pandemic, the timely prediction of upcoming medical needs for infected individuals enables better and quicker care provision when necessary and management decisions within health care systems. OBJECTIVE: This work aims to predict the medical needs (hospitalizations, intensive care unit admissions, and respiratory assistance) and survivability of individuals testing positive for SARS-CoV-2 infection in Portugal. METHODS: A retrospective cohort of 38,545 infected individuals during 2020 was used. Predictions of medical needs were performed using state-of-the-art machine learning approaches at various stages of a patient's cycle, namely, at testing (prehospitalization), at posthospitalization, and during postintensive care. A thorough optimization of state-of-the-art predictors was undertaken to assess the ability to anticipate medical needs and infection outcomes using demographic and comorbidity variables, as well as dates associated with symptom onset, testing, and hospitalization. RESULTS: For the target cohort, 75% of hospitalization needs could be identified at the time of testing for SARS-CoV-2 infection. Over 60% of respiratory needs could be identified at the time of hospitalization. Both predictions had >50% precision. CONCLUSIONS: The conducted study pinpoints the relevance of the proposed predictive models as good candidates to support medical decisions in the Portuguese population, including both monitoring and in-hospital care decisions. A clinical decision support system is further provided to this end.
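The two kinds of figures reported above answer different questions: the share of true needs that a model identifies is recall, while precision is the share of positive predictions that turn out to be correct. A minimal sketch of both computations follows; the counts used are hypothetical and are not taken from the study.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from confusion-matrix counts."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical counts for illustration only: 100 patients truly needing
# hospitalization, 75 of them flagged, plus 60 false alarms.
precision, recall = precision_recall(true_pos=75, false_pos=60, false_neg=25)
```

In this made-up example recall is 0.75 and precision is 75/135, about 0.56, which shows how a model can identify most needs while still raising a sizeable number of false alarms.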
Subjects
COVID-19/therapy , Hospitalization/statistics & numerical data , Intensive Care Units/statistics & numerical data , Artificial Respiration/statistics & numerical data , Adolescent , Adult , Aged , Aged 80 and over , COVID-19/epidemiology , Child , Preschool Child , Cohort Studies , Female , Humans , Infant , Newborn Infant , Longitudinal Studies , Male , Middle Aged , Pandemics , Portugal/epidemiology , Retrospective Studies , SARS-CoV-2/isolation & purification , Young Adult
ABSTRACT
Postoperative complications are still hard to predict despite the efforts towards the creation of clinical risk scores. The published scores contribute to the creation of specialized tools, but with limited predictive performance and reusability for implementation in the oncological context. This work aims to predict postoperative complications risk for cancer patients, offering two major contributions. First, to develop and evaluate a machine learning-based risk score, specific for the Portuguese population using a retrospective cohort of 847 cancer patients undergoing surgery between 2016 and 2018, for 4 outcomes of interest: (1) existence of postoperative complications, (2) severity level of complications, (3) number of days in the Intermediate Care Unit (ICU), and (4) postoperative mortality within 1 year. An additional cohort of 137 cancer patients from the same center was used for validation. Second, to improve the interpretability of the predictive models. In order to achieve these objectives, we propose an approach for the learning of risk predictors, offering new perspectives and insights into the clinical decision process. For postoperative complications the area under the Receiver Operating Characteristic curve (AUC) was 0.69, for complications' severity the AUC was 0.65, for the days in the ICU the mean absolute error was 1.07 days, and for 1-year postoperative mortality the AUC was 0.74, calculated on the development cohort. In this study, predictive models that could help to guide physicians in organizational and clinical decision making were developed. A web-based decision support tool is also provided to this end.
Subjects
Neoplasms , Postoperative Complications , Cohort Studies , Humans , Neoplasms/surgery , Portugal/epidemiology , Postoperative Complications/epidemiology , ROC Curve , Retrospective Studies
ABSTRACT
BACKGROUND: Biclustering has been largely applied for the unsupervised analysis of biological data, being recognised today as a key technique to discover putative modules in both expression data (subsets of genes correlated in subsets of conditions) and network data (groups of coherently interconnected biological entities). However, given its computational complexity, only recent breakthroughs on pattern-based biclustering enabled efficient searches without the restrictions that state-of-the-art biclustering algorithms place on the structure and homogeneity of biclusters. As a result, pattern-based biclustering provides the unprecedented opportunity to discover non-trivial yet meaningful biological modules with putative functions, whose coherency and tolerance to noise can be tuned and made problem-specific. METHODS: To enable the effective use of pattern-based biclustering by the scientific community, we developed BicPAMS (Biclustering based on PAttern Mining Software), a software that: 1) makes available state-of-the-art pattern-based biclustering algorithms (BicPAM (Henriques and Madeira, Alg Mol Biol 9:27, 2014), BicNET (Henriques and Madeira, Alg Mol Biol 11:23, 2016), BicSPAM (Henriques and Madeira, BMC Bioinforma 15:130, 2014), BiC2PAM (Henriques and Madeira, Alg Mol Biol 11:1-30, 2016), BiP (Henriques and Madeira, IEEE/ACM Trans Comput Biol Bioinforma, 2015), DeBi (Serin and Vingron, AMB 6:1-12, 2011) and BiModule (Okada et al., IPSJ Trans Bioinf 48(SIG5):39-48, 2007)); 2) consistently integrates their dispersed contributions; 3) further explores additional accuracy and efficiency gains; and 4) makes available graphical and application programming interfaces. RESULTS: Results on both synthetic and real data confirm the relevance of BicPAMS for biological data analysis, highlighting its essential role for the discovery of putative modules with non-trivial yet biologically significant functions from expression and network data. 
CONCLUSIONS: BicPAMS is the first biclustering tool offering the possibility to: 1) parametrically customize the structure, coherency and quality of biclusters; 2) analyze large-scale biological networks; and 3) tackle the restrictive assumptions placed by state-of-the-art biclustering algorithms. These contributions are shown to be key for an adequate, complete and user-assisted unsupervised analysis of biological data. SOFTWARE: BicPAMS and its tutorial are available at http://www.bicpams.com.
Subjects
Gene Expression , Software , Algorithms , Tumor Cell Line , Cluster Analysis , Gene Regulatory Networks , Humans , Protein Interaction Mapping
ABSTRACT
INTRODUCTION: Little attention has been paid to distress in sexual functioning or the sexual satisfaction of people who practice BDSM (Bondage and Discipline, Domination and Submission, Sadism and Masochism). AIM: The purpose of this study was to describe sociodemographic characteristics and BDSM practices and compare BDSM practitioners' sexual outcomes (in BDSM and non-BDSM contexts). METHODS: A convenience sample of 68 respondents completed an online survey that used a participatory research framework. Cronbach's alpha and average inter-item correlations assessed scale reliability, and the Wilcoxon paired samples test compared the total scores between BDSM and non-BDSM contexts separately for men and women. Open-ended questions about BDSM sexual practices were coded using a preexisting thematic tree. MAIN OUTCOME MEASURES: We used self-reported demographic factors, including age at the onset of BDSM interest, age at first BDSM experience, and favorite and most frequent BDSM practices. The Global Measure of Sexual Satisfaction measured the amount of sexual distress, including low desire, arousal, maintaining arousal, premature orgasm, and anorgasmia. RESULTS: The participants had an average age of 33.15 years, were highly educated, and waited 6 years after becoming interested in BDSM to act on their interests. The practices in which the participants most frequently engaged did not coincide with the practices in which they were most interested and were overwhelmingly conducted at home. Comparisons between genders in terms of distress in sexual functioning in BDSM and non-BDSM contexts demonstrate that, with the exception of maintaining arousal, distress in sexual functioning was statistically the same in BDSM and non-BDSM contexts for women. For men, we found that distress in sexual functioning, with the exception of premature orgasm and anorgasmia, was statistically significantly lower in the BDSM context. There were no differences in sexual satisfaction between BDSM and non-BDSM contexts for men or women. CONCLUSION: Our findings suggest that BDSM sexual activity should be addressed in clinical settings that account for BDSM identities, practices, relationships, preferences, sexual satisfaction, and distress in sexual function for men and women. Additional research needs are identified, such as the need to define distressful sexual functioning experiences and expand our understanding of the development of BDSM sexual identities.
Subjects
Personal Satisfaction , Sexual Behavior/psychology , Psychological Sexual Dysfunctions/psychology , Adolescent , Adult , Age Factors , Aged , Female , Gender Identity , Humans , Male , Masochism/psychology , Middle Aged , Orgasm , Health Care Outcome Assessment , Reproducibility of Results , Sadism/psychology , Sex Factors , Socioeconomic Factors , Surveys and Questionnaires , Young Adult
ABSTRACT
BACKGROUND: Biclustering is a critical task for biomedical applications. Order-preserving biclusters, submatrices where the values of rows induce the same linear ordering across columns, capture local regularities with constant, shifting, scaling and sequential assumptions. Additionally, biclustering approaches relying on pattern mining output deliver exhaustive solutions with an arbitrary number and positioning of biclusters. However, existing order-preserving approaches suffer from robustness, scalability and/or flexibility issues. Additionally, they are not able to discover biclusters with symmetries and parameterizable levels of noise. RESULTS: We propose new biclustering algorithms to perform flexible, exhaustive and noise-tolerant biclustering based on sequential patterns (BicSPAM). Strategies are proposed to allow for symmetries and to seize efficiency gains from item-indexable properties and/or from partitioning methods with conservative distance guarantees. Results show BicSPAM's ability to capture symmetries, handle planted noise, and scale in terms of memory and time. BicSPAM also achieves the best match scores for the recovery of hidden biclusters in synthetic datasets with varying noise distributions and levels of missing values. Finally, results on gene expression data lead to complete solutions, delivering new biclusters corresponding to putative modules with heightened biological relevance. CONCLUSIONS: BicSPAM provides an exhaustive way to discover flexible structures of order-preserving biclusters. To the best of our knowledge, BicSPAM is the first attempt to deal with order-preserving biclusters that allow for symmetries and that are robust to varying levels of noise.
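The definition of an order-preserving bicluster given above can be checked directly on a candidate submatrix. The sketch below only verifies the property for a given set of rows and columns; it is not BicSPAM, which searches for such submatrices using sequential pattern mining. The function names are hypothetical, and ties are broken by column position as a simplifying assumption.

```python
def column_ordering(row, cols):
    """Return the selected columns sorted by the row's values (ties keep column order)."""
    return tuple(sorted(cols, key=lambda c: row[c]))


def is_order_preserving(matrix, rows, cols):
    """True if every selected row induces the same ordering of the selected columns."""
    orderings = {column_ordering(matrix[r], cols) for r in rows}
    return len(orderings) <= 1


matrix = [
    [1, 5, 3],     # ordering of columns: 0 < 2 < 1
    [10, 50, 30],  # same ordering under scaling
    [2, 1, 3],     # different ordering: 1 < 0 < 2
]
coherent = is_order_preserving(matrix, rows=[0, 1], cols=[0, 1, 2])
broken = is_order_preserving(matrix, rows=[0, 1, 2], cols=[0, 1, 2])
```

The first two rows differ by a scaling factor yet share one ordering, which is why order-preserving models subsume constant, shifting and scaling assumptions.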
Subjects
Cluster Analysis , Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Algorithms , Data Mining , Gene Expression , Humans
ABSTRACT
The prescription of psychotropic drugs has been rising in Europe over the last decade. This study provides a comprehensive profile of prepandemic consumption patterns of antidepressant, antipsychotic, and anxiolytic drugs in Portugal considering full nationwide psychotropic drug prescription and dispensing records (2016-2019) against several criteria, including active ingredient, sociodemographics, medical specialty, and incurred costs. An increase of 29.6% and 34.7% in the consumption of antipsychotics and antidepressants, respectively, between 2016 and 2019 is highlighted, accompanied by an increase of EUR 37M in total expenditure (> EUR 20M in public copay) for these classes of drugs. Disparities in sociodemographic and geographical incidence are identified. Amongst other pivotal results, 64% of psychotropic drug prescriptions are undertaken by general practitioners, while only 21% are undertaken by neurological and psychiatric specialties. Nationwide patterns of psychotropic drug prescription further reveal notable trends and determinants, establishing a reference point for cross-regional studies and being currently assessed at a national level to establish psychosocial initiatives and guidelines for medical practice and training.
Subjects
Antipsychotic Agents , Medicine , Portugal/epidemiology , Psychotropic Drugs/therapeutic use , Antipsychotic Agents/therapeutic use , Antidepressive Agents/therapeutic use , Drug Prescriptions
ABSTRACT
BACKGROUND: Despite the advancements in multiagent chemotherapy in the past years, up to 10% of Hodgkin's Lymphoma (HL) cases are refractory to treatment and, after remission, patients experience an elevated risk of death from all causes. These complications are dependent on the treatment and therefore an increase in the prognostic accuracy of HL can help improve these outcomes and control treatment-related toxicity. Due to the low incidence of this cancer, there is a lack of works comprehensively assessing the predictability of treatment response, especially by resorting to machine learning (ML) advances and high-throughput technologies. METHODS: We present a methodology for predicting treatment response after two courses of Adriamycin, Bleomycin, Vinblastine and Dacarbazine (ABVD) chemotherapy, through the analysis of gene expression profiles using state-of-the-art ML algorithms. We work with expression levels of tumor samples of Classical Hodgkin's Lymphoma patients, obtained through NanoString's nCounter platform. The presented approach combines dimensionality reduction procedures and hyperparameter optimization of various selected classifiers to retrieve reference predictability levels of refractory response to ABVD treatment using the regulatory profile of diagnostic tumor samples. In addition, we propose a data transformation procedure to map the original data space into a more discriminative one using biclustering, where features correspond to discriminative putative regulatory modules. RESULTS: Through an ensemble of feature selection procedures, we identify a set of 14 genes highly representative of the result of a fluorodeoxyglucose Positron Emission Tomography (FDG-PET) scan after two courses of ABVD chemotherapy. The proposed methodology further presents an increased performance against reference levels, with the proposed space transformation yielding improvements in the majority of the tested predictive models (e.g. Decision Trees show an improvement of 20pp in both precision and recall). CONCLUSIONS: Taken together, the results reveal improvements for predicting treatment response in HL disease by resorting to sophisticated statistical and ML principles. This work further consolidates the current hypothesis on the structural difficulty of this prognostic task, showing that there is still a considerable gap to be bridged for these technologies to reach the necessary maturity for clinical practice.
Subjects
Hodgkin Disease , Humans , Hodgkin Disease/drug therapy , Hodgkin Disease/genetics , Hodgkin Disease/complications , Transcriptome , Bleomycin/therapeutic use , Doxorubicin/pharmacology , Doxorubicin/therapeutic use , Vinblastine/therapeutic use , Vinblastine/adverse effects , Dacarbazine/adverse effects , Antineoplastic Combined Chemotherapy Protocols/therapeutic use
ABSTRACT
The accurate prediction of phenotypes in microorganisms is a main challenge for systems biology. Genome-scale models (GEMs) are a widely used mathematical formalism for predicting metabolic fluxes using constraint-based modeling methods such as flux balance analysis (FBA). However, they require prior knowledge of the metabolic network of an organism and appropriate objective functions, often hampering the prediction of metabolic fluxes under different conditions. Moreover, the integration of omics data to improve the accuracy of phenotype predictions in different physiological states is still in its infancy. Here, we present a novel approach for predicting fluxes under various conditions. We explore the use of supervised machine learning (ML) models using transcriptomics and/or proteomics data and compare their performance against the standard parsimonious FBA (pFBA) approach, using Escherichia coli case studies as an example. Our results show that the proposed omics-based ML approach is promising for predicting both internal and external metabolic fluxes, with smaller prediction errors than the pFBA approach. The code, data, and detailed results are available at the project's repository[1].
ABSTRACT
This work proposes a new class of explainable prognostic models for longitudinal data classification using triclusters. A new temporally constrained triclustering algorithm, termed TCtriCluster, is proposed to comprehensively find informative temporal patterns common to a subset of patients in a subset of features (triclusters), and use them as discriminative features within a state-of-the-art classifier with guarantees of interpretability. The proposed approach further enhances prediction with the potentialities of model explainability by revealing clinically relevant disease progression patterns underlying prognostics, describing features used for classification. The proposed methodology is used in the Amyotrophic Lateral Sclerosis (ALS) Portuguese cohort (N = 1321), providing the first comprehensive assessment of the prognostic limits of five notable clinical endpoints: need for non-invasive ventilation (NIV); need for an auxiliary communication device; need for percutaneous endoscopic gastrostomy (PEG); need for a caregiver; and need for a wheelchair. Triclustering-based predictors outperform state-of-the-art alternatives, being able to predict the need for auxiliary communication device (within 180 days) and the need for PEG (within 90 days) with an AUC above 90%. The approach was validated in clinical practice, supporting healthcare professionals in understanding the link between the highly heterogeneous patterns of ALS disease progression and the prognosis.
Subjects
Amyotrophic Lateral Sclerosis , Noninvasive Ventilation , Humans , Amyotrophic Lateral Sclerosis/diagnosis , Amyotrophic Lateral Sclerosis/therapy , Prognosis , Disease Progression , Artificial Respiration , Gastrostomy
ABSTRACT
Pattern discovery and subspace clustering play a central role in the biological domain, supporting for instance putative regulatory module discovery from omics data for both descriptive and predictive ends. In the presence of target variables (e.g. phenotypes), regulatory patterns should further satisfy well-defined discriminative power properties, which are well established in the presence of categorical outcomes, yet largely disregarded for numerical outcomes, such as risk profiles and quantitative phenotypes. DISA (Discriminative and Informative Subspace Assessment), a Python software package, is proposed to evaluate patterns in the presence of numerical outcomes using well-established measures together with a novel principle able to statistically assess the correlation gain of the subspace against the overall space. Results confirm the possibility to soundly extend discriminative criteria towards numerical outcomes without the drawbacks commonly associated with discretization procedures. Results from four case studies confirm the validity and relevance of the proposed methods, further unveiling critical directions for research on biotechnology and biomedicine. Availability: DISA is freely available at https://github.com/JupitersMight/DISA under the MIT license.
Subjects
Software , Cluster Analysis
ABSTRACT
STATEMENT: Enrichment analysis of cell transcriptional responses to SARS-CoV-2 infection from biclustering solutions yields broader coverage and superior enrichment of GO terms and KEGG pathways against alternative state-of-the-art machine learning solutions, thus aiding knowledge extraction. MOTIVATION AND METHODS: The comprehensive understanding of the impacts of SARS-CoV-2 virus on infected cells is still incomplete. This work aims at comparing the role of state-of-the-art machine learning approaches in the study of cell regulatory processes affected and induced by the SARS-CoV-2 virus using transcriptomic data from both infectable cell lines available in public databases and in vivo samples. In particular, we assess the relevance of clustering, biclustering and predictive modeling methods for functional enrichment. Statistical principles to handle scarcity of observations, high data dimensionality, and complex gene interactions are further discussed. In particular, and without loss of generality, the proposed methods are applied to study the differential regulatory response of lung cell lines to SARS-CoV-2 (α-variant) against RSV, IAV (H1N1), and HPIV3 viruses. RESULTS: Gathered results show that, although clustering and predictive algorithms aid classic stances to functional enrichment analysis, more recent pattern-based biclustering algorithms significantly improve the number and quality of enriched GO terms and KEGG pathways with controlled false positive risks. Additionally, a comparative analysis of these results is performed to identify potential pathophysiological characteristics of COVID-19. These are further compared to those identified by other authors for the same virus as well as related ones such as SARS-CoV-1. The findings are particularly relevant given the lack of other works utilizing more complex machine learning algorithms within this context.
Subjects
COVID-19 , H1N1 Subtype Influenza A Virus , Cluster Analysis , Humans , Machine Learning , SARS-CoV-2
ABSTRACT
Several cities around the world rely on urban rail transit systems composed of interconnected lines, serving massive numbers of passengers on a daily basis. Assessing the location of passengers is essential to ensure the efficient and safe operation and planning of these systems. However, passenger route choices between origin and destination pairs are variable, depending on the subjective perception of travel and waiting times, required transfers, convenience factors, and on-site vehicle arrivals. This work proposes a robust methodology to estimate passenger route choices based only on automated fare collection data, i.e. without privacy-invasive sensors and monitoring devices. Unlike previous approaches, our method does not require precise train timetable information or prior route choice models, and is robust to unforeseen operational events like malfunctions and delays. Train arrival times are inferred from passenger volume spikes at the exit gates, and the likelihood of eligible routes per passenger is estimated based on the alignment between vehicle location and the passenger timings of entrance and exit. Applying this approach to automated fare collection data in Lisbon, we find that while in most cases passengers preferred the route with the fewest transfers, there was a significant number of cases where the shorter distance was preferred. Our findings are valuable for decision support among rail operators in various aspects such as passenger traffic bottleneck resolution, train allocation and scheduling, and placement of services.
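The step of inferring train arrivals from exit-gate volume spikes can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' method: exit times are seconds since the start of service, they are grouped into fixed windows, and a window counts as a spike when it reaches a minimum volume and is a local maximum. The function name and both parameters (`bin_seconds`, `min_count`) are hypothetical.

```python
from collections import Counter


def infer_arrivals(exit_times, bin_seconds=30, min_count=5):
    """Return the start times (in seconds) of windows that look like train arrivals.

    Assumption (illustrative only): a spike is a window with at least
    `min_count` exits that is not smaller than the previous window and is
    strictly larger than the next one.
    """
    counts = Counter(int(t // bin_seconds) for t in exit_times)
    arrivals = []
    for window, n in sorted(counts.items()):
        previous = counts.get(window - 1, 0)
        following = counts.get(window + 1, 0)
        if n >= min_count and n >= previous and n > following:
            arrivals.append(window * bin_seconds)
    return arrivals


# Two bursts of exits separated by a single stray validation.
exits = [100, 101, 102, 103, 104, 105, 200, 400, 401, 402, 403, 404]
arrivals = infer_arrivals(exits)
```

Once arrival times are estimated at each station, a passenger's entry and exit timestamps can be compared against the arrivals along each eligible route to score how plausible that route is.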