ABSTRACT
BACKGROUND: Type 2 diabetes (T2DM) is a chronic condition with a high disease burden worldwide, and individuals with T2DM often have other morbidities. Understanding the local multimorbidity profile of patients with T2DM will inform precision medicine and public health, so that tailored interventions can be offered according to the different profiles. METHODS: An analysis was conducted of electronic health records (2016-2021) in one hospital in Lima, Peru. Based on ICD-10 codes and the available measurements (e.g., body mass index), we identified all T2DM cases and quantified the frequency of the most common comorbidities (those in ≥1% of the sample). We also conducted k-means analysis that was informed by the most frequent comorbidities, to identify clusters of patients with T2DM and other chronic conditions. RESULTS: There were 9582 individual records with T2DM (mean age 58.6 years, 61.5% women). The most frequent chronic conditions were obesity (29.4%), hypertension (18.8%), dyslipidemia (11.3%), hypothyroidism (6.4%), and arthropathy (3.6%); and 51.6% had multimorbidity: 32.8% had only one, 14.1% had two, and 4.7% had three or more extra chronic conditions in addition to T2DM. The cluster analysis revealed four unique groups: T2DM with no other chronic disease, T2DM with obesity only, T2DM with hypertension but without obesity, and T2DM with all other chronic conditions. CONCLUSIONS: More than one in two people with T2DM had multimorbidity. Obesity, hypertension, and dyslipidemia were the most common chronic conditions that were associated with T2DM. Four clusters of chronic morbidities were found, signaling mutually exclusive profiles of patients with T2DM according to their multimorbidity profile.
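The clustering step described above can be illustrated with a minimal sketch of k-means (Lloyd's algorithm) applied to binary comorbidity indicators. The toy records, the three indicator columns, the choice of k = 4 and the function name `kmeans` are illustrative assumptions, not the study's data or code.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm on tuples of numbers; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Seed centroids from distinct points so that no two centroids start identical.
    distinct = sorted(set(points))
    centroids = [list(p) for p in rng.sample(distinct, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Hypothetical records: (obesity, hypertension, dyslipidemia) flags per patient.
records = [
    (0, 0, 0), (0, 0, 0), (0, 0, 0),
    (1, 0, 0), (1, 0, 0),
    (0, 1, 0), (0, 1, 0),
    (1, 1, 1),
]
centroids, labels = kmeans(records, k=4)
print(labels)
```

With only four distinct patterns in the toy data, every pattern ends up in its own cluster; real records would need a data-driven choice of k.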
Subjects
Type 2 Diabetes Mellitus , Dyslipidemias , Hypertension , Chronic Disease , Type 2 Diabetes Mellitus/complications , Type 2 Diabetes Mellitus/epidemiology , Dyslipidemias/complications , Female , Humans , Hypertension/complications , Hypertension/epidemiology , Male , Middle Aged , Multimorbidity , Obesity/complications , Obesity/epidemiology , Peru/epidemiology
ABSTRACT
La Oroya is a city in the Peruvian Andes that has suffered a serious deterioration in its air quality, especially due to high sulfur dioxide (SO2) emissions, which underlines the importance of knowing its sources of contamination and their variation over the years. This study therefore aimed to evaluate the immission levels and determine the sources of SO2 contamination in La Oroya. The analysis used hourly SO2 concentration data and meteorological variables (wind speed and direction) for a period of three years (2018-2020). Time series graphs, wind and pollutant roses, bivariate polar plots, k-means clustering, nonparametric statistical tests, and the conditional bivariate probability function were used to analyze the data and identify the emission sources. The mean concentration of SO2 was 264.2 µg m⁻³ for the study period, and 55.66% and 2.37% of the evaluated days exceeded the 24 h guideline values recommended by the World Health Organization and the Peruvian Environmental Quality Standard for air, respectively. The results showed a defined pattern for the daily and monthly variations, with peaks in the morning hours (0900-1000 h LT) and at the end of the year (December), respectively. The main sources of SO2 emissions identified were light and heavy vehicles that travel along the Central Highway, the La Oroya Metallurgical Complex, vehicle traffic within the city, and the diesel-electric locomotives that provide cargo and tourist passenger transportation services. The article aims to contribute to the development of adequate air quality management policies.
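The conditional bivariate probability function mentioned above is commonly defined as the fraction of observations in each wind direction and wind speed cell whose concentration reaches a chosen threshold. The sketch below follows that definition; the bin widths, the threshold, the sample values and the function name `cbpf_table` are illustrative assumptions rather than the study's settings.

```python
from collections import defaultdict

def cbpf_table(records, threshold, sector_width=45, speed_width=2):
    """records: iterable of (wind_direction_deg, wind_speed, concentration).

    Returns {(sector_start_deg, speed_bin_start): exceedances / samples}.
    """
    total = defaultdict(int)
    exceed = defaultdict(int)
    for direction, speed, conc in records:
        sector = int((direction % 360) // sector_width) * sector_width
        speed_bin = int(speed // speed_width) * speed_width
        total[(sector, speed_bin)] += 1
        if conc >= threshold:
            exceed[(sector, speed_bin)] += 1
    return {cell: exceed[cell] / total[cell] for cell in sorted(total)}

# Hypothetical hourly samples: (direction in degrees, speed in m/s, SO2 in ug/m3).
hourly = [
    (10, 1.5, 300.0),
    (20, 1.8, 120.0),
    (15, 1.2, 410.0),
    (200, 3.2, 40.0),
    (210, 3.9, 35.0),
    (205, 3.5, 260.0),
]
cbpf = cbpf_table(hourly, threshold=250.0)
for cell, probability in cbpf.items():
    print(cell, round(probability, 3))
```

Cells with a high probability point toward the direction and speed range from which high concentrations tend to arrive; in practice the threshold is often set from a percentile of the concentration series.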
ABSTRACT
BACKGROUND: Brazil has faced two simultaneous problems related to respiratory health: forest fires and the high mortality rate due to the COVID-19 pandemic. The Amazon rainforest is one of the Brazilian biomes that suffers the most from fires caused by droughts and illegal deforestation. These fires can cause respiratory diseases associated with air pollution, and the State of Pará in Brazil is the most affected. The COVID-19 pandemic combined with air pollution can potentially increase hospitalizations and deaths related to respiratory diseases. Here, we aimed to evaluate the association of fire occurrences with COVID-19 mortality rates and general respiratory disease hospitalizations in the State of Pará, Brazil. METHODS: We employed k-means clustering, a machine learning technique, together with the elbow method to identify the ideal number of clusters, dividing the cities of the State of Pará into 10 groups, from which we selected the clusters with the highest and lowest fire occurrence from 2015 to 2019. Next, an Auto-regressive Integrated Moving Average Exogenous (ARIMAX) model was proposed to study the serial correlation of respiratory disease hospitalizations and their associations with fire occurrences. For the COVID-19 analysis, we computed the mortality risk and its confidence level considering the quarterly incidence rate ratio in clusters with high and low exposure to fires. FINDINGS: Using the k-means algorithm, we identified, from the ten clusters that divided the State of Pará, two clusters with similar HDI (Human Development Index) and GDP (Gross Domestic Product) but with diverse behavior regarding hospitalizations and forest fires in the Amazon biome.
From the ARIMAX model, it was possible to show that, besides the serial correlation, fire occurrences contribute to the increase in respiratory diseases, with an observed lag of six months after the fires in the cluster with high exposure to fires. A highlight that deserves attention concerns the relationship between fire occurrences and deaths. Historically, the risk of mortality from respiratory diseases is higher (about double) in regions and periods with high exposure to fires than in those with low exposure. The same pattern holds during the COVID-19 pandemic, when the risk of mortality from COVID-19 was 80% higher in the region and period with high exposure to fires. Regarding the SARS-CoV-2 analysis, the risk of mortality related to COVID-19 is higher in the period with high exposure to fires than in the period with low exposure to fires. Another highlight concerns the relationship between fire occurrences and COVID-19 deaths: regions with high fire occurrences are associated with more COVID-19 deaths. INTERPRETATION: Decision-making is a critical problem, mainly when it involves environmental and health control policies. Environmental policies are often more cost-effective as health measures than the use of public health services. This highlights the importance of data analyses to support decision-making and to identify populations in need of better infrastructure due to historical environmental factors and the knowledge of associated health risks. The results suggest that fire occurrences contribute to the increase in respiratory disease hospitalizations. The mortality rate related to COVID-19 was higher in the period with high exposure to fires than in the period with low exposure. Regions with high fire occurrences are associated with more COVID-19 deaths, mainly in the months with a high number of fires.
FUNDING: No additional funding source was required for this study.
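The incidence rate ratio used in the COVID-19 comparison can be sketched as the ratio of two event rates, with an approximate confidence interval computed on the log scale. The death counts, the person-time denominators and the Wald-type interval are assumptions chosen for illustration (the counts are picked so that the ratio equals 1.8, matching an "80% higher" risk); the abstract does not state how its confidence level was obtained.

```python
import math

def incidence_rate_ratio(events_high, time_high, events_low, time_low, z=1.96):
    """Return (IRR, lower, upper) using a Wald interval on log(IRR)."""
    rate_high = events_high / time_high
    rate_low = events_low / time_low
    irr = rate_high / rate_low
    # Standard error of log(IRR) for two independent Poisson counts.
    se_log = math.sqrt(1 / events_high + 1 / events_low)
    lower = math.exp(math.log(irr) - z * se_log)
    upper = math.exp(math.log(irr) + z * se_log)
    return irr, lower, upper

# Hypothetical quarterly counts: deaths and person-quarters at risk.
irr, lower, upper = incidence_rate_ratio(
    events_high=180, time_high=100_000,
    events_low=100, time_low=100_000,
)
print(f"IRR = {irr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that stays above 1 indicates that the higher rate in the high-exposure group is unlikely to be explained by chance alone under these assumptions.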
ABSTRACT
In this work, an innovative approach using K-means and multivariate curve resolution-purity based algorithm (MCR-Purity) for the evaluation and quantification of carboxymyoglobin (Mb-CO) formation from Deoxy-Myoglobin (Deoxy-Mb) was presented. Through a multilevel multifactor experimental design, samples with different concentrations of Mb-CO were created. The UV-Vis spectra of these samples were submitted to K-means analysis, finding 3 clusters. The mean spectra of the clusters were extracted and it was possible to detect 2 totally differentiable groups through peaks 423 and 434 nm, which are wavelengths related to the Mb-CO and Deoxy-Mb components, respectively. The spectral data were subjected to MCR-Purity analysis. The MCR-Purity result successfully described the analyzed reaction, explaining more than 99.9% of the variance (R2) with a LOF of 1.43%. Then, a predictive model of MbCO was created through the linear relationship between MCR-Purity contributions and known concentrations of MbCO. The performance parameters of the created predictive model were R2CV = 0.98, RMSECV = 0.58 and RPDcv = 7.8 for the training set, and R2P = 0.98, RMSEP = 0.7 and RPDp = 6.8 for the test set. Thus, the predictive model presented an excellent performance considering that the Mb-CO variation is comprised between 0 and 21 µM. Therefore, these results demonstrate that the application of the proposed strategy to the analysis of spectral data presenting overlapping bands is feasible and robust.
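The performance parameters quoted above (R2, RMSE and RPD) can be computed from reference and predicted concentrations as in the sketch below. RPD is taken here as the sample standard deviation of the reference values divided by the RMSE, a common convention; the concentration values and the function name `regression_metrics` are hypothetical.

```python
import math
import statistics

def regression_metrics(reference, predicted):
    """Return (r2, rmse, rpd) for paired reference and predicted values."""
    n = len(reference)
    residual_ss = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    mean_ref = sum(reference) / n
    total_ss = sum((r - mean_ref) ** 2 for r in reference)
    rmse = math.sqrt(residual_ss / n)
    r2 = 1 - residual_ss / total_ss
    # RPD: spread of the reference values relative to the prediction error.
    rpd = statistics.stdev(reference) / rmse
    return r2, rmse, rpd

# Hypothetical Mb-CO concentrations in micromolar (reference vs. model output).
reference = [0.0, 5.0, 10.0, 15.0, 20.0]
predicted = [0.5, 4.5, 10.5, 14.5, 20.5]
r2, rmse, rpd = regression_metrics(reference, predicted)
print(f"R2 = {r2:.3f}, RMSE = {rmse:.2f}, RPD = {rpd:.1f}")
```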
Subjects
Chemometrics , Myoglobin , Multivariate Analysis , Spectrum Analysis
ABSTRACT
This article investigates relationships between the incidence of cervical cancer (CCI) and water components and quality indicators in the municipalities of Mato Grosso do Sul, between 2014 and 2017, by statistical (Pearson's Determinant) and spatial (k-means clustering) correlation. CCI showed the greatest statistical response to the average tariff charged for supply services (−36.28%) and for water (−34.15%); to the number of systematic interruptions (28.3%) and outages (22.28%); to the average per capita consumption of water (20.74%); and to the number of services performed (−17.98%), all with p-value ≤ 0.001. In Costa Rica, the city with the highest average CCI, the spatial clustering identified a greater effect of those interruptions (z-value = 8.741) and outages (z = 7.6097); whereas in Rochedo, also with high CCI, there was a greater effect of the incidence of analyses with non-standard results for total coliforms (z = 8.6803) and turbidity (z = 5.7427), with statistical correlations of 12.05% (p-value = 0.032) and 15.18% (p-value = 0.007), respectively. Data from SISAGUA revealed the presence of coliforms and high levels of turbidity, for example, in Antônio João and Tacuru, cities with high average CCI. We recommend further investigation into the relationships presented here between CCI and water.
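Assuming the percentages above are Pearson correlation coefficients expressed as percentages (the abstract does not define the statistic further), a coefficient of this kind can be computed as in the following sketch. The variable names and the five paired values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical municipal values: average water tariff vs. cervical cancer incidence.
tariff = [1.0, 2.0, 3.0, 4.0, 5.0]
incidence = [5.0, 3.0, 4.0, 1.0, 2.0]
r = pearson_r(tariff, incidence)
print(f"correlation = {100 * r:.2f}%")
```

A negative value, as with the tariff figures in the abstract, means that higher values of one variable tend to accompany lower values of the other; it does not by itself indicate causation.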
Subjects
Water Quality , Uterine Cervical Neoplasms/epidemiology , Sanitation , Public Health , Cities , Correlation of Data
ABSTRACT
Data of the commercial parameters of Pleurotus ostreatus and Pleurotus djamor were analyzed using the data mining technique: K-means clustering algorithm. The parameters evaluated were: biological efficiency, crop yield ratio, productivity rate, nutritional composition, antioxidant and antimicrobial activities in the production of fruit bodies of 50 strains of Pleurotus ostreatus and 50 strains of Pleurotus djamor, cultivated on the most representative agricultural wastes from the province of Guayas: 80% sugarcane bagasse and 20% wheat straw (M1), and 60% wheat straw and 40% sugarcane bagasse (M2). The database of the parameters obtained in experimental procedures was grouped into three clusters, providing a visualization of the strains with a higher relation to each parameter (vector) measured.
ABSTRACT
In this paper, we group South American countries based on the number of infected cases and deaths due to COVID-19. The countries considered are: Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, Peru, Paraguay, Uruguay, and Venezuela. The data used are collected from a database of Johns Hopkins University, an institution that is dedicated to sensing and monitoring the evolution of the COVID-19 pandemic. A statistical analysis, based on principal components with modern and recent techniques, is conducted. Initially, utilizing the correlation matrix, standard components and varimax rotations are calculated. Then, by using disjoint components and functional components, the countries are grouped. An algorithm that allows us to keep the principal component analysis updated with a sensor in the data warehouse is designed. As reported in the conclusions, this grouping changes depending on the number of components considered, the type of principal component (standard, disjoint or functional) and the variable to be considered (infected cases or deaths). The results obtained are compared to the k-means technique. The COVID-19 cases and their deaths vary in the different countries due to diverse reasons, as reported in the conclusions.
Subjects
COVID-19 , Pandemics , Argentina , Brazil , Chile , Colombia , Ecuador , Humans , Peru , Principal Component Analysis , SARS-CoV-2 , Uruguay , Venezuela
ABSTRACT
The aim of this study was to identify serum ferritin (SF) cut-off points (COPs) in a cohort of healthy full-term normal birth weight infants who had repeated measurements of SF and haemoglobin every 3 months during the first year of life. The study included 746 full-term infants with birth weight ≥2,500 g, having uncomplicated gestations and births. Participants received prophylactic iron supplementation (1 mg/day of iron element) from the first to the 12th month of life and did not develop anaemia during the first year of life. Two statistical methods were considered to identify COPs for low iron stores at 3, 6, 9 and 12 months of age: deviation from mean and cluster analysis. According to the K-means cluster analysis results by age and sex, COPs at 3 and 6 months for girls were 39 and 21 µg/L and for boys 23 and 11 µg/L, respectively. A single COP of 10 µg/L was identified, for girls and boys, at both 9 and 12 months. Given the physiological changes in SF concentration during the first year of life, our study identified dynamic COPs, which differed by sex in the first semester. Adequate SF COPs are necessary to identify low iron stores at an early stage of iron deficiency, which represents one of the most widespread public health problems around the world, particularly in low- and middle-income countries.
Subjects
Iron-Deficiency Anemia , Ferritins , Iron-Deficiency Anemia/epidemiology , Cohort Studies , Female , Hemoglobins , Humans , Infant , Iron/metabolism , Male
ABSTRACT
Gastrointestinal nematode infections have caused significant losses in sheep production worldwide. The improvement of host genetic resistance to worms has been used as a strategy to mitigate this problem. In this context, the inclusion of genomic information has shown potential to increase the accuracy of prediction of breeding values and speed up selection. In this study, we aimed to compare estimates of genetic parameters and breeding values for traits that indicate resistance to gastrointestinal nematode infection in Santa Inês sheep using the pedigree-based BLUP or including genomic information. There were 1478 animals in the pedigree, of which 271 were genotyped using the OvineSNP50 BeadChip (Illumina, Inc.). Host resistance was assessed using the following traits: fecal nematode egg counts (FEC); FAMACHA score (FAMACHA); and resistance to gastrointestinal nematode infection (RGNI) as a combination of FEC, FAMACHA, body condition score, and hematocrit. The genetic parameters and breeding values were estimated using single- and multi-trait analyses. For RGNI, the heritability estimates ranged from 0.25 using the single-trait genomic model (S-H) to 0.54 using the traditional multi-trait model (M-A). The heritability estimates for FEC ranged from 0.06 to 0.36, using the single-trait pedigree-based model (S-A) and the multi-trait genomic model (M-H), respectively. For FAMACHA, the heritability estimates ranged from 0.46 (M-H) to 0.54 (M-A). Estimates of genetic correlation ranged from 0.22 to 0.69. The inclusion of genomic information provided a gain in accuracy for all traits. All estimates of predictive ability obtained using genomic data in a multi-trait setting were higher than those obtained using single-trait models. The estimates of predictive ability ranged from 0.03 (S-A) to 0.46 (M-H). The heritability estimates obtained using genomic information showed that all traits evaluated are suitable for genomic selection.
Despite the low accuracies obtained, the use of the genomic model provided more accurate estimates of breeding values in comparison to the pedigree-based model.
Subjects
Genome , Genomics , Animals , Genotype , Meat , Genetic Models , Pedigree , Phenotype , Sheep/genetics , Domestic Sheep/genetics
ABSTRACT
Machine learning (ML) and its multiple applications have comparative advantages for improving the interpretation of knowledge on different agricultural processes. However, there are challenges that impede proper usage, as can be seen in phenotypic characterizations of germplasm banks. The objective of this research was to test and optimize different ML-based analysis methods for the prioritization and selection of morphological descriptors of Rubus spp. A total of 55 descriptors were evaluated in 26 genotypes, and the weight and discriminating capacity of each one were determined. ML methods such as random forest (RF), support vector machines (in the linear and radial forms), and neural networks were optimized and compared. Subsequently, the results were validated with two discriminating methods and their variants: hierarchical agglomerative clustering and K-means. The results indicated that RF presented the highest accuracy (0.768) of the methods evaluated, selecting 11 descriptors based on the purity (Gini index), importance, number of connected trees, and significance (p value < 0.05). Additionally, the K-means method with optimized descriptors based on RF had greater discriminating power on Rubus spp. accessions according to the evaluated statistics. This study presents one application of ML for the optimization of specific morphological variables for plant germplasm bank characterization.
ABSTRACT
The aim of this study was to determine the resistance to worm infection in Santa Inês sheep by combining different sets of gastrointestinal parasite resistance indicator traits, using the k-means algorithm. Records from 221 animals reared in the Mid-North sub-region of Brazil were used. The following phenotypes were used: hematocrit (HCT); white blood cell count; red blood cell count (RBC); hemoglobin (HGB); platelets; mean corpuscular hemoglobin; mean corpuscular volume; mean corpuscular hemoglobin concentration; fecal egg count (FEC); coloration of the ocular mucosa (FAMACHA score); body condition score (BCS); withers height; and rump height. Two files with phenotypic information of animals were edited: complete, including all traits, and reduced, in which only FAMACHA score, HCT, FEC, and BCS were used. For determination of worm resistance, three groups were formed using the k-means non-hierarchical clustering by combining the traits of the complete and reduced analyses. The animals of the group in which individuals had the lowest values for FEC and FAMACHA score, as well as the highest values for HCT, RBC, HGB, and BCS were classified as resistant. In the group with opposite values for the aforementioned traits, the animals were classified as sensitive. The animals of the group with values between the other two groups were classified as moderately resistant. The results obtained in complete and reduced analyses were equivalent. Thus, it is possible to identify animals of the Santa Inês sheep breed according to their status of resistance to worm infection based on a reduced trait set.
Subjects
Disease Resistance/immunology , Animal Helminthiasis/immunology , Parasitic Intestinal Diseases/veterinary , Sheep Diseases/immunology , Animals , Body Constitution/physiology , Brazil , Animal Helminthiasis/parasitology , Hematologic Tests/veterinary , Parasitic Intestinal Diseases/immunology , Parasitic Intestinal Diseases/parasitology , Sheep , Sheep Diseases/parasitology , Domestic Sheep
ABSTRACT
This data-driven work aims to analyze and classify the spatiotemporal distribution of all Brazilian states considering data as diverse as the number of Covid-19 cases, deaths, confirmed cases per 100 k inhabitants, mortality per 100 k inhabitants and case fatality rates as health indicators. We also considered population, area and population density as geographic indicators. Finally, GDP and HDI were taken into account as economic and social criteria. For this task, data were collected from April 3rd until August 8th, 2020, corresponding to epidemiological weeks 14-32, reaching three million cases and a hundred thousand deaths. With these data it was possible to classify Brazilian states into groups using multivariate methods, namely non-hierarchical (k-means) cluster analysis and factor analysis. It was possible to group all states plus the Federal District into five clusters, taking into account these 10 variables over the first five months of the epidemic. Changes in the group membership of states were observed over time and across clusters, and between three and four factors were found. However, even with great differences in health indicators over time, the number of clusters remained fixed. Also, São Paulo and Rio de Janeiro states ranked at the top across all epidemiological weeks. Correlations were observed between variables, such as the number of Covid cases and deaths with GDP, for most epidemiological weeks. Some clusters were more critical due to specific variables, including cities that are main hotspots. These multivariate findings provide a comprehensive description of the ongoing Covid-19 epidemic and may help to guide subsequent studies to understand and control virus transmission.
ABSTRACT
There has been an increase in the application of different biomaterials to repair hard tissues. Within these biomaterials, calcium phosphate (CaP) bioceramics are suitable candidates, since they can be biocompatible, biodegradable, osteoinductive, and osteoconductive. Moreover, during sintering, bioceramic materials are prone to form micropores and undergo changes in their surface topographical features, which influence cellular physiology and bone ingrowth. In this study, five geometrical properties from the surface of CaP bioceramic particles and their micropores were analyzed by data mining techniques, driven by the research question: what are the geometrical properties of individual micropores in a CaP bioceramic, and how do they relate to each other? The analysis not only shows that it is feasible to determine the existence of micropore clusters, but also to quantify their geometrical properties. As a result, these CaP bioceramic particles present three groups of micropore clusters distinctive by their geometrical properties. Consequently, this new methodological clustering assessment can be applied to advance the knowledge about CaP bioceramics and their role in bone tissue engineering.
ABSTRACT
Speaking and presenting in public are critical skills for academic and professional development. These skills are demanded across society, and their development and evaluation are a challenge faced by higher education institutions. It is difficult to evaluate them objectively, as well as to generate valuable information for professors and appropriate feedback for students. In this paper, in order to understand and detect patterns in oral student presentations, we collected data from 222 first-year Computer Engineering (CE) students at three different times, over two different years (2017 and 2018). For each presentation, using a purpose-built system and Microsoft Kinect, we detected 12 features related to body posture and oral speaking. These features were used as input for the clustering and statistical analysis, which identified three different clusters in the presentations of both years, with stronger patterns in the presentations of 2017. A Wilcoxon rank-sum test allowed us to evaluate the evolution of the presentation attributes over each year and pointed to a convergence, in the form of a reduction in the number of features that differed statistically between presentations given at the same point in the course. The results can further help to give students automatic feedback on their postures and speech throughout the presentations and may serve as baseline information for future comparisons with presentations from students of different undergraduate courses.
ABSTRACT
OBJECTIVE: To explore a conceptual framework of clinical conditions associated with preterm birth (PTB) by cluster analysis, assessing determinants for different PTB subtypes and related maternal and neonatal outcomes. METHODS: Secondary analysis of the Brazilian Multicentre Study on Preterm Birth of 33 740 births in 20 maternity hospitals between April 2011 and July 2012. In accordance with a prototype concept based on maternal, fetal, and placental conditions, an adapted k-means model and fuzzy algorithm were used to identify clusters using predefined conditions. The main outcomes were phenotype clusters and maternal and neonatal outcomes. RESULTS: Among 4150 PTBs, three clusters of PTB phenotypes were identified: women who had PTB without any predefined conditions; women with mixed conditions; and women who had pre-eclampsia, eclampsia, HELLP syndrome and fetal growth restriction. The prevalence of different preterm subtypes differed significantly in the three clusters, varying from 80.95% of provider-initiated PTBs in cluster 3 to 6.62% in cluster 1 (P<0.001). Although some maternal characteristics differed among the clusters, maternal and neonatal outcomes did not. CONCLUSIONS: The analysis identified three clusters with distinct phenotypes. Women from the different clusters had different subtypes of PTB and maternal and pregnancy characteristics.
Subjects
Pregnancy Outcome/epidemiology , Premature Birth/etiology , Adult , Brazil/epidemiology , Cluster Analysis , Female , Fetal Growth Retardation/epidemiology , Fuzzy Logic , Humans , Newborn Infant , Phenotype , Pre-Eclampsia/epidemiology , Pregnancy , Premature Birth/epidemiology , Prevalence , Risk Factors
ABSTRACT
We identified clusters of multiple dimensions of poverty according to the capability approach theory by applying data mining approaches to the Cuatro Santos Health and Demographic Surveillance database, Nicaragua. Four municipalities in northern Nicaragua constitute the Cuatro Santos area, with 25,893 inhabitants in 5,966 households (2014). A local process analyzing poverty-related problems, prioritizing suggested actions, was initiated in 1997 and generated a community action plan 2002-2015. Interventions were school breakfasts, environmental protection, water and sanitation, preventive healthcare, home gardening, microcredit, technical training, university education stipends, and use of the Internet. In 2004, a survey of basic health and demographic information was performed in the whole population, followed by surveillance updates in 2007, 2009, and 2014 linking households and individuals. Information included the house material (floor, walls) and services (water, sanitation, electricity) as well as demographic data (birth, deaths, migration). Data on participation in interventions, food security, household assets, and women's self-rated health were collected in 2014. A K-means algorithm was used to cluster the household data (56 variables) in six clusters. The poverty ranking of household clusters using the unsatisfied basic needs index variables changed when including variables describing basic capabilities. The households in the fairly rich cluster with assets such as motorbikes and computers were described as modern. Those in the fairly poor cluster, having different degrees of food insecurity, were labeled vulnerable. Poor and poorest clusters of households were traditional, e.g., in using horses for transport. Results displayed a society transforming from traditional to modern, where the forerunners were not the richest but educated, had more working members in household, had fewer children, and were food secure. 
Those lagging were the poor, traditional, and food insecure. The approach may be useful for an improved understanding of poverty and to direct local policy and interventions.
ABSTRACT
Genomic signal processing (GSP) methods which convert DNA data to numerical values have recently been proposed, which would offer the opportunity of employing existing digital signal processing methods for genomic data. One of the most used methods for exploring data is cluster analysis which refers to the unsupervised classification of patterns in data. In this paper, we propose a novel approach for performing cluster analysis of DNA sequences that is based on the use of GSP methods and the K-means algorithm. We also propose a visualization method that facilitates the easy inspection and analysis of the results and possible hidden behaviors. Our results support the feasibility of employing the proposed method to find and easily visualize interesting features of sets of DNA data.
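One widely used way to convert a DNA sequence into numerical signals is the binary indicator (Voss) representation, which produces one 0/1 sequence per nucleotide. The abstract does not say which mapping the authors used, so the sketch below is an assumption chosen for illustration, as are the function name and the example sequence.

```python
def voss_indicators(sequence):
    """Map a DNA string to four binary indicator sequences, one per base."""
    seq = sequence.upper()
    return {base: [1 if ch == base else 0 for ch in seq] for base in "ACGT"}

# Hypothetical short sequence.
signals = voss_indicators("ACGTA")
for base, signal in signals.items():
    print(base, signal)
```

Features derived from such signals, for example their power spectra, could then be passed to a clustering algorithm such as K-means.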
ABSTRACT
Lane detection for traffic surveillance in intelligent transportation systems is a challenge for vision-based systems. In this paper, a novel pixel-entropy based algorithm is proposed for the automatic detection of the number of lanes and their centers, as well as the formation of their division lines. Using as input a video from a static camera, the behavior of each pixel in the gray color space is modeled by a time series; then, for a time period τ, its histogram and then its entropy are calculated. Three different types of theoretical pixel-entropy behaviors can be distinguished: (1) the pixel-entropy at the lane center shows a high value; (2) the pixel-entropy at the lane division line shows a low value; and (3) a pixel not belonging to the road has an entropy value close to zero. From the road video, several small rectangular areas are captured, each with only a few full rows of pixels. For each pixel of these areas, the entropy is calculated; then, for each area or row, an entropy curve is produced, which, when smoothed, has as many local maxima as lanes and one more local minimum than lane division lines. For the purpose of testing, several real traffic scenarios under different weather conditions with other moving objects were used. However, these background objects, which are off the road, were filtered out. Our algorithm, compared to others based on trajectories of vehicles, shows the following advantages: (1) the lowest computational time for lane detection (only 32 s with a traffic flow of one vehicle/s per lane); and (2) better results under high traffic flow with congestion and vehicle occlusion. Instead of detecting road markings, it forms lane-dividing lines. Here, the entropies of Shannon and Tsallis were used, but the entropy of Tsallis for a selected q of a finite set achieved the best results.
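The per-pixel entropy computation described above can be sketched as follows: build a histogram of the gray values a pixel takes over the time window, then apply the Shannon or Tsallis formula to the resulting probabilities. The gray-level series, the value q = 2 and the function names are illustrative assumptions, and base-2 logarithms are used for the Shannon entropy.

```python
import math
from collections import Counter

def gray_probabilities(values):
    """Relative frequency of each gray level observed at one pixel over time."""
    counts = Counter(values)
    n = len(values)
    return [c / n for c in counts.values()]

def shannon_entropy(probs):
    """Shannon entropy in bits: -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def tsallis_entropy(probs, q):
    """Tsallis entropy: (1 - sum(p**q)) / (q - 1), defined here for q != 1."""
    if q == 1:
        raise ValueError("q must differ from 1 in this sketch")
    return (1 - sum(p ** q for p in probs)) / (q - 1)

# Hypothetical gray-level time series for two pixels.
lane_center = [10, 200] * 4   # alternates as vehicles pass over the pixel
off_road = [12] * 8           # static background

for name, series in (("lane center", lane_center), ("off road", off_road)):
    probs = gray_probabilities(series)
    print(name, shannon_entropy(probs), tsallis_entropy(probs, q=2))
```

The static pixel yields zero entropy under both definitions, while the pixel whose gray level changes with passing vehicles yields a positive value, consistent with the three behaviors listed in the abstract.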
RESUMO
This study analyzed the influence of rational selection of the training and test sets on the quality and predictivity of quantitative structure-activity relationship (QSAR) models. The study was carried out on three different datasets of Influenza Neuraminidase (H1N1) inhibitors. The three datasets were divided into training and test sets using three rational selection methods (k-means, the Kennard-Stone algorithm, and selection by activity), and the results were compared with random selection. A total of 31,490 mathematical models were then developed, and the models that met the determination coefficient thresholds r²(train) > 0.8, r²(LOO) > 0.7, and r²(test) > 0.5 and had the minimum standard deviation (SD) and minimum root-mean-square error (RMS) were selected. The selected models were validated using the internal leave-one-out method, and their predictive capacity was evaluated with the external test set. The results indicate that random selection could lead to erroneous results, whereas rational selection allows more reliable conclusions to be drawn. The QSAR models with the greatest predictive power were found using the k-means algorithm and selection by activity.
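As an illustration of one of the rational selection methods named above, the following is a minimal sketch of the classical Kennard-Stone procedure: start from the two most distant samples, then repeatedly add the sample whose nearest already-selected neighbor is farthest away. The descriptor values and the training-set size are hypothetical, and the study's actual descriptors and implementation are not given in the abstract.

```python
import math
from itertools import combinations


def kennard_stone(points, n_train):
    """Return (train_indices, test_indices) chosen by Kennard-Stone.
    Assumes n_train >= 2 and Euclidean distance between descriptor vectors."""
    # Seed with the pair of samples that are farthest apart.
    first, second = max(
        combinations(range(len(points)), 2),
        key=lambda pair: math.dist(points[pair[0]], points[pair[1]]),
    )
    selected = [first, second]
    remaining = [i for i in range(len(points)) if i not in selected]
    # Greedily add the sample with the largest distance to its
    # closest already-selected sample (max-min criterion).
    while len(selected) < n_train and remaining:
        nxt = max(
            remaining,
            key=lambda i: min(math.dist(points[i], points[j]) for j in selected),
        )
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining


# Hypothetical 2-D descriptor vectors for five compounds.
descriptors = [(0, 0), (1, 0), (2, 0), (10, 0), (5, 0)]
train, test = kennard_stone(descriptors, n_train=3)
print(train, test)
```

The max-min criterion spreads the training set across descriptor space, which is the property that distinguishes this kind of rational split from random selection.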
Assuntos
Antivirais/química , Neuraminidase/antagonistas & inibidores , Relação Quantitativa Estrutura-Atividade , Algoritmos , Antivirais/análise , Vírus da Influenza A Subtipo H1N1 , Modelos Moleculares
RESUMO
BACKGROUND: Picea chihuahuana (Chihuahua spruce), which is endemic to Mexico, is currently listed as "Endangered" on the Red List. It is found only in the Sierra Madre Occidental (SMO), Mexico. About 42,600 individuals are distributed among forty populations. These populations are fragmented and can be classified into three geographically distinct clusters in the SMO. The total area covered by P. chihuahuana populations is less than 300 ha. A recent study suggested assisted migration as an alternative to the ex situ conservation of P. chihuahuana, taking into consideration the genetic structure and diversity of the populations and predictions of the future climate of the habitat. However, detailed background information is required to develop plans for protecting and conserving the species and for successful assisted migration. Thus, it is important to identify differences between populations in relation to environmental conditions. The genetic diversity of populations, which affects the vigor, evolution, and adaptability of the species, must also be considered. In this study, we examined 14 populations of P. chihuahuana, with the overall aim of discriminating between the populations and forming clusters of this species. METHODS: Each population was represented by one 50 × 50 m plot established in the center of its respective location. Climate, soil, dasometric, and density variables, as well as genetic and species diversity, were assessed in these plots for further analyses. Putatively neutral and adaptive AFLP markers were used to calculate genetic diversity. The Affinity Propagation (AP) clustering technique and the k-means clustering algorithm were used to classify the populations into the optimal number of clusters. Stepwise binomial logistic regression was then applied to test for significant differences in variables between the southern and northern P. chihuahuana populations. Spearman's correlation test was used to analyze the relationships among all variables studied.
RESULTS: The binomial logistic regression analysis revealed that seven climate variables, geographical longitude, and the proportion of sand in the soil separated the southern from the northern populations. The northern populations grow in more arid and continental conditions and on soils with a lower proportion of sand. The mean genetic diversity of P. chihuahuana calculated from all AFLP markers studied was significantly correlated with the mean temperature in the warmest month, with warmer temperatures associated with greater genetic diversity. The genetic diversity of P. chihuahuana calculated from putatively adaptive AFLP markers was not significantly correlated with any environmental factor. DISCUSSION: Future reforestation programs should take into account that at least two different groups of P. chihuahuana (the northern and southern clusters) exist, as local adaptation takes place because of different environmental conditions.