Results 1 - 20 of 71
1.
BMC Bioinformatics; 25(1): 218, 2024 Jun 19.
Article in English | MEDLINE | ID: mdl-38898392

ABSTRACT

BACKGROUND: Compared to traditional supervised machine learning approaches employing fully labeled samples, positive-unlabeled (PU) learning techniques aim to classify "unlabeled" samples based on a smaller proportion of known positive examples. This more challenging modeling goal reflects many real-world scenarios in which negative examples are not available, posing direct challenges to defining prediction accuracy and robustness. While several studies have evaluated predictions learned from only definitive positive examples, few have investigated whether correct classification of a high proportion of known positive (KP) samples from among unlabeled samples can act as a surrogate to indicate model quality. RESULTS: In this study, we report a novel methodology combining multiple established PU learning-based strategies with permutation testing to evaluate the potential of KP samples to accurately classify unlabeled samples without using "ground truth" positive and negative labels for validation. Multivariate synthetic and real-world high-dimensional benchmark datasets were employed to demonstrate the suitability of the proposed pipeline to provide evidence of model robustness across varied underlying ground truth class label compositions among the unlabeled set and with different proportions of KP examples. Comparisons between model performance with actual and permuted labels could be used to distinguish reliable from unreliable models. CONCLUSIONS: As in fully supervised machine learning, permutation testing offers a means to set a baseline "no-information rate" benchmark in the context of semi-supervised PU learning inference tasks, providing a standard against which model performance can be compared.
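
Illustrative sketch (not the authors' pipeline): the comparison between actual and permuted labels described in this abstract can be pictured with a toy one-dimensional model. The centroid-based scorer, the names `heldout_kp_recall` and `permutation_test`, the 25% cutoff, and the synthetic data are all assumptions made for this example.

```python
import random

def heldout_kp_recall(x, is_kp, rng, top_fraction=0.25):
    """Hide half of the known positives (KP) and measure how many are recovered.

    Toy model (an assumption): pooled samples are ranked by their distance to
    the centroid of the KP samples kept for training.
    """
    kp = [i for i, flag in enumerate(is_kp) if flag]
    rng.shuffle(kp)
    half = len(kp) // 2
    train, held_out = set(kp[:half]), set(kp[half:])
    centroid = sum(x[i] for i in train) / len(train)
    # Pool = truly unlabeled samples plus the hidden KP samples.
    pool = [i for i in range(len(x)) if i not in train]
    pool.sort(key=lambda i: abs(x[i] - centroid))
    n_top = max(1, int(len(pool) * top_fraction))
    return sum(i in held_out for i in pool[:n_top]) / len(held_out)

def permutation_test(x, is_kp, n_perm=200, seed=0):
    """Compare recall under the actual labels with recall under shuffled labels."""
    rng = random.Random(seed)
    observed = heldout_kp_recall(x, is_kp, rng)
    shuffled = list(is_kp)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permuted labels carry no information about x
        null.append(heldout_kp_recall(x, shuffled, rng))
    p_value = (1 + sum(v >= observed for v in null)) / (n_perm + 1)
    return observed, p_value

# Synthetic data: 30 positives near 5.0 (20 of them labeled KP), 70 negatives near 0.
x = [5 + 0.01 * i for i in range(30)] + [0.01 * i for i in range(70)]
is_kp = [i < 20 for i in range(100)]
observed, p_value = permutation_test(x, is_kp)
```

A small `p_value` here only says that the hidden KP samples are recovered more often than under label shuffling; it is a baseline check in the spirit of the abstract, not a measure of accuracy on true negatives.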


Subjects
Machine Learning, Supervised Machine Learning, Humans, Computational Biology/methods, Algorithms
2.
BMC Bioinformatics; 24(1): 452, 2023 Nov 30.
Article in English | MEDLINE | ID: mdl-38036960

ABSTRACT

BACKGROUND: The identification of essential proteins is of great significance in biology and pathology. However, protein-protein interaction (PPI) data obtained through high-throughput technology include a high number of false positives. To overcome this limitation, numerous computational algorithms based on biological characteristics and topological features have been proposed to identify essential proteins. RESULTS: In this paper, we propose a novel method named SESN for identifying essential proteins. It is a seed expansion method based on PPI sub-networks and multiple biological characteristics. Firstly, SESN utilizes gene expression data to construct PPI sub-networks. Secondly, seed expansion is performed simultaneously in each sub-network, and the expansion process is based on the topological features of predicted essential proteins. Thirdly, the error correction mechanism is based on multiple biological characteristics and the entire PPI network. Finally, SESN analyzes the impact of each biological characteristic, including protein complex, gene expression data, GO annotations, and subcellular localization, and adopts the biological data with the best experimental results. The output of SESN is a set of predicted essential proteins. CONCLUSIONS: The analysis of each component of SESN indicates the effectiveness of all components. We conduct comparison experiments using three datasets from two species, and the experimental results demonstrate that SESN achieves superior performance compared to other methods.
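
Illustrative sketch: the abstract does not spell out SESN's expansion rule, so the code below shows only the general idea of seed expansion on an interaction network. The greedy rule (add the candidate with the most links into the current set), the function name `expand_seeds`, and the toy network are assumptions, not the published method.

```python
def expand_seeds(adjacency, seeds, n_add):
    """Greedily grow a seed set on an undirected network.

    adjacency: dict mapping each protein to the set of its interaction partners.
    At each step, add the neighbouring protein with the most links into the
    current set (a hypothetical topological criterion).
    """
    selected = set(seeds)
    for _ in range(n_add):
        candidates = {nb for p in selected for nb in adjacency.get(p, set())} - selected
        best, best_score = None, 0
        for c in sorted(candidates):  # sorted for deterministic tie-breaking
            score = len(adjacency.get(c, set()) & selected)
            if score > best_score:
                best, best_score = c, score
        if best is None:  # no reachable candidate left
            break
        selected.add(best)
    return selected

# Toy network with hypothetical protein names.
ppi = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}
expanded = expand_seeds(ppi, seeds={"A", "B"}, n_add=2)
```

In this toy network, `C` is added first (two links into the seed set) and `D` second; `E` is never reached within two steps.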


Subjects
Computational Biology, Protein Interaction Mapping, Protein Interaction Mapping/methods, Computational Biology/methods, Protein Interaction Maps, Proteins/metabolism, Algorithms
3.
RNA; 27(1): 80-98, 2021 Jan.
Article in English | MEDLINE | ID: mdl-33055239

ABSTRACT

High-throughput RNA sequencing has unveiled the complexity of the transcriptome and significantly increased the number of recorded long noncoding RNAs (lncRNAs), which have been reported to participate in a variety of biological processes. Identification of lncRNAs is a key step in lncRNA analysis, and numerous bioinformatics tools have been developed for this purpose in recent years. While these tools allow us to identify lncRNAs more efficiently and accurately, they may produce inconsistent results, which makes choosing among them confusing. We compared the performance of 41 analysis models based on 14 software packages and different data sets, including high-quality data and low-quality data from 33 species. In addition, computational efficiency, robustness, and joint prediction of the models were explored. As practical guidance, key points for lncRNA identification under different situations were summarized. In this investigation, none of these models was superior to the others under all test conditions. The performance of a model relied to a great extent on the source of transcripts and the quality of assemblies. As general references, FEELnc_all_cl, CPC, and CPAT_mouse work well in most species, while COME, CNCI, and lncScore are good choices for model organisms. Since these tools are sensitive to different factors such as the species involved and the quality of assembly, researchers must carefully select the appropriate tool based on the actual data. Alternatively, our tests suggest that joint prediction can outperform any single model if proper models are chosen. All scripts/data used in this research can be accessed at http://bioinfo.ihb.ac.cn/elit.
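
Illustrative sketch: one simple way to picture the "joint prediction" mentioned in this abstract is a vote across models. The abstract does not say how the authors combined models, so the voting rule, the function name `joint_prediction`, and the placeholder model and transcript names are assumptions.

```python
from collections import Counter

def joint_prediction(calls, min_votes):
    """Keep transcripts that at least `min_votes` models call lncRNA.

    calls: dict mapping a model name to the set of transcript IDs it predicts
    as lncRNA.
    """
    votes = Counter(tx for called in calls.values() for tx in called)
    return {tx for tx, n in votes.items() if n >= min_votes}

# Placeholder outputs from three hypothetical models.
calls = {
    "model_a": {"tx1", "tx2", "tx3"},
    "model_b": {"tx2", "tx3"},
    "model_c": {"tx3", "tx4"},
}
majority = joint_prediction(calls, min_votes=2)
unanimous = joint_prediction(calls, min_votes=3)
```

Raising `min_votes` trades sensitivity for specificity, which is one reason the choice of models matters for a joint call.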


Subjects
Computational Biology/methods, Genome, Long Noncoding RNA/genetics, Messenger RNA/genetics, Software, Animals, Benchmarking, Datasets as Topic, High-Throughput Nucleotide Sequencing, Humans, Mice, Genetic Models, Molecular Sequence Annotation, Plants/genetics, Long Noncoding RNA/classification, Long Noncoding RNA/metabolism, Messenger RNA/classification, Messenger RNA/metabolism, Species Specificity, Transcriptome
4.
Int J Mol Sci; 24(15), 2023 Jul 27.
Article in English | MEDLINE | ID: mdl-37569442

ABSTRACT

This short review, which includes 113 references, presents issues related to dibenzo[b,f]oxepine derivatives. The dibenzo[b,f]oxepine scaffold is an important framework in medicinal chemistry, and its derivatives occur in several medicinally relevant plants. At the same time, the structure, production, and therapeutic effects of dibenzo[b,f]oxepines have not been extensively discussed thus far and are presented in this review. This manuscript addresses the following issues: the extraction of dibenzo[b,f]oxepines from plants and their significance in medicine, the biosynthesis of dibenzo[b,f]oxepines, the active synthetic dibenzo[b,f]oxepine derivatives, the potential of dibenzo[b,f]oxepines as microtubule inhibitors, and perspectives for applications of dibenzo[b,f]oxepine derivatives. In conclusion, this review describes studies on various structural features and pharmacological actions of dibenzo[b,f]oxepine derivatives.


Subjects
Oxepins, Oxepins/chemistry, Oxepins/pharmacology
5.
Int J Mol Sci; 24(8), 2023 Apr 09.
Article in English | MEDLINE | ID: mdl-37108117

ABSTRACT

Studying the association of gene function, diseases, and regulatory gene network reconstruction demands data compatibility. Data from different databases follow distinct schemas and are accessible in heterogeneous ways. Although the experiments differ, data may still be related to the same biological entities. Some entities may not be strictly biological, such as geolocations of habitats or paper references, but they provide a broader context for other entities. The same entities from different datasets can share similar properties, which may or may not be found within other datasets. Joint, simultaneous data fetching from multiple data sources is complicated for the end-user or, in many cases, unsupported and inefficient due to differences in data structures and ways of accessing the data. We propose BioGraph, a new model that enables connecting and retrieving information from linked biological data originating from diverse datasets. We have tested the model on metadata collected from five diverse public datasets and successfully constructed a knowledge graph containing more than 17 million model objects, of which 2.5 million are individual biological entity objects. The model enables the selection of complex patterns and retrieval of matched results that can be discovered only by joining the data from multiple sources.


Subjects
Metadata, Factual Databases
6.
Proteomics; 22(8): e2100197, 2022 Apr.
Article in English | MEDLINE | ID: mdl-35112474

ABSTRACT

With the development of artificial intelligence (AI) technologies and the availability of large amounts of biological data, computational methods for proteomics have undergone a developmental process from traditional machine learning to deep learning. This review focuses on computational approaches and tools for the prediction of protein-DNA/RNA interactions using machine intelligence techniques. We provide an overview of the development progress of computational methods and summarize the advantages and shortcomings of these methods. We further compiled applications in tasks related to the protein-DNA/RNA interactions, and pointed out possible future application trends. Moreover, biological sequence-digitizing representation strategies used in different types of computational methods are also summarized and discussed.


Subjects
Artificial Intelligence, Big Data, Machine Learning, Proteomics, RNA
7.
Biochem Biophys Res Commun; 633: 42-44, 2022 Dec 10.
Article in English | MEDLINE | ID: mdl-36344159

ABSTRACT

Continuous and imaginative technological developments are leading to a massive accumulation of various types of data in all areas of biological research. As a result, the central importance of databases is increasing. Databases related to biology must not only be structured using controlled vocabularies, but also be fully integrated into the whole biological domain. To achieve this goal, they must be systematically grounded in biological evolution and exploit the available tools of evolutionary systematics to contribute to our understanding of life processes.


Subjects
Biological Evolution, Forests, Factual Databases
8.
Methods; 192: 3-12, 2021 Aug.
Article in English | MEDLINE | ID: mdl-32610158

ABSTRACT

Identifying disease-related genes is important for understanding the molecular mechanisms of diseases, as well as for their diagnosis and treatment. Many computational methods have been proposed to predict disease-related genes, but making full use of multi-source biological data to improve disease-gene prediction remains challenging. In this paper, we propose a novel method for predicting disease-related genes by using fast network embedding (PrGeFNE), which can integrate multiple types of associations related to diseases and genes. Specifically, we first construct a heterogeneous network by using phenotype-disease, disease-gene, protein-protein and gene-GO associations, and extract a low-dimensional representation of the nodes from the network by using a fast network embedding algorithm. Then, a dual-layer heterogeneous network is reconstructed by using the low-dimensional representation, and network propagation is applied to the dual-layer heterogeneous network to predict disease-related genes. Through cross-validation and validation on newly added associations, we demonstrate the important roles of different types of association data in improving disease-gene prediction, and confirm the excellent performance of PrGeFNE by comparison with state-of-the-art algorithms. Furthermore, we developed a web tool that helps researchers search for candidate genes of different diseases predicted by PrGeFNE, along with GO and pathway enrichment analysis of the candidate gene set. This may be useful for investigating the molecular mechanisms of diseases as well as for their experimental validation. The web tool is available at http://bioinformatics.csu.edu.cn/prgefne/.


Subjects
Algorithms, Computational Biology, Proteins
9.
Int J Mol Sci; 23(22), 2022 Nov 20.
Article in English | MEDLINE | ID: mdl-36430895

ABSTRACT

Here we developed KARAJ, a fast and flexible Linux command-line tool to automate the end-to-end process of querying and downloading a wide range of genomic and transcriptomic sequence data types. The input to KARAJ is a list of PMCIDs, publication URLs, or various types of accession numbers, from which it automates four tasks: first, it provides a summary list of accessible datasets generated by or used in these scientific articles, enabling users to select appropriate datasets; second, KARAJ calculates the size of the files that users want to download and confirms the availability of adequate space on the local disk; third, it generates a metadata table containing sample information and the experimental design of the corresponding study; and last, it enables users to download supplementary data tables attached to publications. Further, KARAJ provides a parallel downloading framework powered by Aspera Connect, which reduces the downloading time significantly.
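
Illustrative sketch: the second task in this abstract (checking that the requested files fit on the local disk) can be pictured as below. This is not KARAJ's code; the function name `has_enough_space` and the 5% safety margin are assumptions, and only the standard-library call `shutil.disk_usage` is used.

```python
import shutil

def has_enough_space(file_sizes_bytes, target_dir=".", safety_margin=0.05):
    """Return True if the files fit in `target_dir` with some headroom.

    file_sizes_bytes: sizes of the files a user wants to download.
    safety_margin: extra fraction reserved on top of the total (hypothetical).
    """
    required = sum(file_sizes_bytes) * (1 + safety_margin)
    free = shutil.disk_usage(target_dir).free  # free bytes on that filesystem
    return required <= free

# A download plan with hypothetical file sizes (2 GiB and 500 MiB).
plan = [2 * 1024**3, 500 * 1024**2]
ok_to_download = has_enough_space(plan)
```

Checking before the transfer starts avoids a partially completed download when the disk fills up midway.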


Subjects
Software, Transcriptome, Genome, Genomics, Metadata
10.
Int J Mol Sci; 23(20), 2022 Oct 14.
Article in English | MEDLINE | ID: mdl-36293133

ABSTRACT

Medical discoveries mainly depend on the capability to process and analyze biological datasets, which inundate the scientific community and are still expanding as the cost of next-generation sequencing technologies decreases. Deep learning (DL) is a viable method to exploit this massive data stream, since it has advanced quickly through successive innovations. However, an obstacle to scientific progress emerges: the difficulty of applying DL to biology, which arises because both fields are evolving at a breakneck pace, making it hard for an individual to stay at the forefront of both. This paper aims to bridge the gap and help computer scientists bring their valuable expertise into the life sciences. This work provides an overview of the most common types of biological data and data representations that are used to train DL models, with additional information on the models themselves and the various tasks that are being tackled. This is the essential information a DL expert with no background in biology needs in order to participate in DL-based research projects in biomedicine, biotechnology, and drug discovery. Alternatively, this study could also be useful to researchers in biology who want to understand and use the power of DL to gain better insights into, and extract important information from, omics data.


Subjects
Deep Learning, Drug Discovery, Biotechnology
11.
BMC Bioinformatics; 22(1): 426, 2021 Sep 08.
Article in English | MEDLINE | ID: mdl-34496758

ABSTRACT

BACKGROUND: A considerable number of data mining approaches for biomedical data analysis, including state-of-the-art associative models, require a form of data discretization. Although diverse discretization approaches have been proposed, they generally work under a strict set of statistical assumptions which are arguably insufficient to handle the diversity and heterogeneity of clinical and molecular variables within a given dataset. In addition, although an increasing number of symbolic approaches in bioinformatics are able to assign multiple items to values occurring near discretization boundaries for superior robustness, there are no reference principles on how to perform multi-item discretizations. RESULTS: In this study, an unsupervised discretization method, DI2, for variables with arbitrarily skewed distributions is proposed. Statistical tests applied to assess differences in performance confirm that DI2 generally outperforms well-established discretization methods with statistical significance. Within classification tasks, DI2 displays either competitive or superior levels of predictive accuracy, particularly for classifiers able to accommodate border values. CONCLUSIONS: This work proposes a new unsupervised method for data discretization, DI2, that takes into account the underlying data regularities, the presence of outlier values disrupting expected regularities, as well as the relevance of border values. DI2 is available at https://github.com/JupitersMight/DI2.
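
Illustrative sketch: the idea of assigning multiple items to values near a discretization boundary can be shown with plain quantile cut points. This is not the DI2 algorithm or its API; the functions `quantile_cutpoints` and `discretize` and the fixed `border` width are assumptions made for the example.

```python
import bisect

def quantile_cutpoints(values, n_bins):
    """Cut points that split the sorted values into roughly equal-sized bins."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]

def discretize(value, cutpoints, border=0.0):
    """Return the item index (or indices) assigned to `value`.

    A value closer than `border` to a cut point receives both adjacent items,
    which is the multi-item behaviour for border values.
    """
    items = {bisect.bisect_right(cutpoints, value)}
    for idx, cut in enumerate(cutpoints):
        if abs(value - cut) < border:
            items.update((idx, idx + 1))
    return sorted(items)

cutpoints = quantile_cutpoints(range(1, 13), n_bins=3)  # values 1..12
clear_value = discretize(2, cutpoints, border=0.5)
border_value = discretize(5.2, cutpoints, border=0.5)
```

With `border=0.0` (the default) every value receives exactly one item, so the multi-item behaviour is opt-in.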


Subjects
Algorithms, Data Mining, Computational Biology
12.
J Struct Biol; 212(2): 107608, 2020 Nov 01.
Article in English | MEDLINE | ID: mdl-32896658

ABSTRACT

Tandem Repeat Proteins (TRPs) are ubiquitous in cells and are enriched in eukaryotes. They contributed to the evolution of organism complexity, specializing for functions that require quick adaptability such as immunity-related functions. To investigate the hypothesis of repeat protein evolution through exon duplication and rearrangement, we designed a tool to analyze the relationships between exon/intron patterns and structural symmetries. The tool allows comparison of the structure fragments as defined by exon/intron boundaries from Ensembl against the structural element repetitions from RepeatsDB. The all-against-all pairwise structural alignment between fragments and the comparison of the two definitions (structural units and exons) are visualized in a single matrix, the "repeat/exon plot". An analysis of different repeat protein families, including the solenoids Leucine-Rich, Ankyrin, Pumilio and HEAT repeats and the β-propellers Kelch-like, WD40 and RCC1, shows different behaviors, illustrated here through examples. For each example, the analysis of the exon mapping in homologous proteins supports the conservation of their exon patterns. We propose that when a clear-cut relationship between exon and structural boundaries can be identified, it is possible to infer a specific "evolutionary pattern" which may improve TRP detection and classification.


Subjects
Exons/genetics, Proteins/genetics, Tandem Repeat Sequences/genetics, Animals, Molecular Evolution, Humans, Introns/genetics
13.
Int J Mol Sci; 21(2), 2020 Jan 15.
Article in English | MEDLINE | ID: mdl-31952211

ABSTRACT

The complexity of cancer diseases demands bioinformatic techniques and translational research based on big data and personalized medicine. Open data enables researchers to accelerate cancer studies, save resources and foster collaboration. Several tools and programming approaches are available for analyzing data, including annotation, clustering, comparison and extrapolation, merging, enrichment, functional association and statistics. We exploit openly available data via cancer gene expression analysis, apply refinement as well as enrichment analysis via gene ontology, and conclude with graph-based visualization of the protein interaction networks involved as a basis for signaling. The different databases allowed for the construction of huge networks or of more specific ones consisting of high-confidence interactions only. Several genes associated with glioma were isolated via a network analysis from top hub nodes as well as from an outlier analysis. The latter approach highlights a mitogen-activated protein kinase, a member of the histone deacetylases and a protein phosphatase as genes uncommonly associated with glioma. Cluster analysis from top hub nodes shows that several identified glioma-associated gene products function within protein complexes, including epidermal growth factors as well as cell cycle proteins or RAS proto-oncogenes. By using selected exemplary tools and open-access resources for cancer research and differential network analysis, we highlight disturbed signaling components in brain cancer subtypes of glioma.


Subjects
Brain Neoplasms/genetics, Gene Expression Profiling/methods, Neoplastic Gene Expression Regulation, Gene Regulatory Networks, Glioma/genetics, Protein Interaction Maps/genetics, Brain Neoplasms/metabolism, Brain Neoplasms/pathology, Cluster Analysis, Computational Biology/methods, Gene Ontology, Genetic Predisposition to Disease/genetics, Glioma/metabolism, Glioma/pathology, Humans, Signal Transduction/genetics
14.
Molecules; 25(18), 2020 Sep 18.
Article in English | MEDLINE | ID: mdl-32961977

ABSTRACT

The quinoline ring system has long been known as a versatile nucleus in the design and synthesis of biologically active compounds. Currently, more than one hundred quinoline compounds have been approved in therapy as antimicrobial, local anaesthetic, antipsychotic, and anticancer drugs. Indeed, over the last few years, drug discovery has seen an increase in the publication of papers and patents on quinoline derivatives possessing antiproliferative properties. This trend can be explained by the versatility and accessibility of the quinoline scaffold, from which new derivatives can be easily designed and synthesized. Among the numerous quinoline small molecules developed as antiproliferative drugs, this review focuses on compounds effective on c-Met, VEGF (vascular endothelial growth factor), and EGF (epidermal growth factor) receptors, pivotal targets for the activation of important carcinogenic pathways (Ras/Raf/MEK and PI3K/Akt/mTOR). These signalling cascades are closely connected and regulate the survival processes in the cell, such as proliferation, apoptosis, differentiation, and angiogenesis. The antiproliferative biological data of remarkable quinoline compounds have been analysed, confirming the pivotal importance of this ring system in the efficacy of several approved drugs. Furthermore, in view of an SAR (structure-activity relationship) study, the most recurrent ligand-protein interactions of the reviewed molecules are summarized.


Subjects
ErbB Receptors/antagonists & inhibitors, Proto-Oncogene Proteins c-met/antagonists & inhibitors, Quinolines/chemistry, Vascular Endothelial Growth Factor Receptors/antagonists & inhibitors, Antineoplastic Agents/chemistry, Antineoplastic Agents/metabolism, Cell Survival/drug effects, ErbB Receptors/metabolism, Humans, Molecular Dynamics Simulation, Proto-Oncogene Proteins c-met/metabolism, Quinolines/metabolism, Vascular Endothelial Growth Factor Receptors/metabolism, Signal Transduction/drug effects, Structure-Activity Relationship
15.
J Med Syst; 44(7): 122, 2020 May 25.
Article in English | MEDLINE | ID: mdl-32451808

ABSTRACT

Coronaviruses (CoVs) are a large family of viruses that are common in many animal species, including camels, cattle, cats and bats. Animal CoVs, such as Middle East respiratory syndrome-CoV, severe acute respiratory syndrome (SARS)-CoV, and the new virus named SARS-CoV-2, rarely infect and spread among humans. On January 30, 2020, the International Health Regulations Emergency Committee of the World Health Organisation declared the outbreak of the resulting disease from this new CoV called 'COVID-19', as a 'public health emergency of international concern'. This global pandemic has affected almost the whole planet and caused the death of more than 315,131 patients as of the date of this article. In this context, publishers, journals and researchers are urged to research different domains and stop the spread of this deadly virus. The increasing interest in developing artificial intelligence (AI) applications has addressed several medical problems. However, such applications remain insufficient given the high potential threat posed by this virus to global public health. This systematic review addresses automated AI applications based on data mining and machine learning (ML) algorithms for detecting and diagnosing COVID-19. We aimed to obtain an overview of this critical virus, address the limitations of utilising data mining and ML algorithms, and provide the health sector with the benefits of this technique. We used five databases, namely, IEEE Xplore, Web of Science, PubMed, ScienceDirect and Scopus and performed three sequences of search queries between 2010 and 2020. Accurate exclusion criteria and selection strategy were applied to screen the obtained 1305 articles. Only eight articles were fully evaluated and included in this review, and this number only emphasised the insufficiency of research in this important area. 
After analysing all included studies, the results were distributed following the year of publication and the commonly used data mining and ML algorithms. The results found in all papers were discussed to find the gaps in all reviewed papers. Characteristics, such as motivations, challenges, limitations, recommendations, case studies, and features and classes used, were analysed in detail. This study reviewed the state-of-the-art techniques for CoV prediction algorithms based on data mining and ML assessment. The reliability and acceptability of extracted information and datasets from implemented technologies in the literature were considered. Findings showed that researchers must proceed with insights they gain, focus on identifying solutions for CoV problems, and introduce new improvements. The growing emphasis on data mining and ML techniques in medical fields can provide the right environment for change and improvement.


Subjects
Betacoronavirus, Coronavirus Infections/diagnosis, Data Mining/methods, Machine Learning, Viral Pneumonia/diagnosis, Algorithms, COVID-19, Humans, Pandemics, SARS-CoV-2
16.
BMC Bioinformatics; 20(Suppl 18): 518, 2019 Nov 25.
Article in English | MEDLINE | ID: mdl-31760937

ABSTRACT

BACKGROUND: Identifying cancer genes is an urgent task: it enables us to understand the mechanisms of biochemical processes at the biomolecular level and facilitates the development of bioinformatics. Although a large number of methods have been proposed to identify cancer genes in recent times, the biological data used by most of these methods are still quite limited, which reflects insufficient consideration of the many factors that relate genes to diseases. RESULTS: In this paper, we propose a two-round random walk algorithm to identify cancer genes based on multiple biological data (TRWR-MB), including the protein-protein interaction (PPI) network, pathway network, microRNA similarity network, lncRNA similarity network, cancer similarity network and protein complexes. In the first-round random walk, all cancer nodes, cancer-related genes, cancer-related microRNAs and cancer-related lncRNAs associated with any cancer are used as seed nodes, and a random walker walks on a quadruple-layer heterogeneous network constructed from the multiple biological data. The first-round random walk aims to select the top-k scoring potential cancer genes. In the second-round random walk, the genes, microRNAs and lncRNAs associated with a specific cancer in the corresponding cancer class are used as seed nodes, and the walker walks on a new quadruple-layer heterogeneous network constructed from lncRNAs, microRNAs, cancers and the selected potential cancer genes. After both walks finish, we combine the results of the two rounds of RWR into a ranking score for experimental analysis. As a result, a higher area under the receiver operating characteristic curve (AUC) is obtained. In addition, case studies on identifying new cancer genes are presented.
CONCLUSION: In summary, TRWR-MB integrates multiple biological data to identify cancer genes by analyzing the relationship between genes and cancer from a variety of biomolecular perspectives.
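
Illustrative sketch: the abstract's "RWR" presumably refers to a random walk with restart, the building block of each round. The code below runs one such walk on a small single-layer graph; it does not reproduce the quadruple-layer network or the two-round scheme of TRWR-MB. The function name, the restart probability of 0.3, and the toy graph are assumptions, and the graph is assumed to be undirected with no isolated nodes.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that restarts at the seeds.

    adjacency: dict mapping each node to the set of its neighbours.
    At every step the walker jumps back to a seed with probability `restart`,
    otherwise it moves to a uniformly chosen neighbour.
    """
    nodes = sorted(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            share = (1 - restart) * p[n] / len(adjacency[n])
            for m in adjacency[n]:
                nxt[m] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:  # converged
            break
    return p

# Toy path graph a - b - c - d with a single seed node.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
scores = random_walk_with_restart(graph, seeds={"a"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

Nodes closer to the seed receive higher scores, which is what makes the steady-state vector usable as a ranking of candidates.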


Subjects
Computational Biology/methods, MicroRNAs/genetics, Molecular Sequence Annotation/methods, Neoplasms/genetics, Proteins/genetics, Long Noncoding RNA/genetics, Algorithms, Humans, Neoplasms/metabolism, Oncogenes, Protein Interaction Maps, Proteins/metabolism, Long Noncoding RNA/metabolism, ROC Curve
17.
Bull Math Biol; 81(7): 2691-2705, 2019 Jul.
Article in English | MEDLINE | ID: mdl-31256302

ABSTRACT

Model selection based on experimental data is an important challenge in biological data science. Particularly when collecting data is expensive or time-consuming, as is often the case with clinical trials and biomolecular experiments, the problem of selecting information-rich data becomes crucial for creating relevant models. We identify geometric properties of input data that result in a unique algebraic model, and we show that if the data form a staircase, or a so-called linear shift of a staircase, the ideal of the points has a unique reduced Gröbner basis and thus corresponds to a unique model. We use linear shifts to partition data into equivalence classes with the same basis. We demonstrate the utility of the results by applying them to a Boolean model of the well-studied lac operon in E. coli.
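
Illustrative sketch: assuming the usual definition of a staircase as a set of points with non-negative integer coordinates that is closed under componentwise decrease, the property can be checked directly. The function name `is_staircase` and the example point sets are assumptions; linear shifts and Gröbner basis computation are not covered here.

```python
def is_staircase(points):
    """Check that the point set is closed under componentwise decrease.

    It is enough to test the immediate predecessors of each point (one
    coordinate lowered by 1); closure for all smaller points then follows
    by induction.
    """
    pts = {tuple(p) for p in points}
    for p in pts:
        for i, c in enumerate(p):
            if c > 0:
                below = p[:i] + (c - 1,) + p[i + 1:]
                if below not in pts:
                    return False
    return True

# Boolean data points (coordinates in {0, 1}), chosen for illustration.
l_shaped = [(0, 0), (1, 0), (0, 1)]
diagonal = [(0, 0), (1, 1)]
```

`is_staircase(l_shaped)` holds, while `is_staircase(diagonal)` fails because `(1, 1)` is present without `(0, 1)` and `(1, 0)`.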


Subjects
Biological Models, Algorithms, Factual Databases, Escherichia coli/genetics, Escherichia coli/metabolism, Lac Operon, Linear Models, Mathematical Concepts, Systems Biology
18.
BMC Med Inform Decis Mak; 18(1): 97, 2018 Nov 12.
Article in English | MEDLINE | ID: mdl-30419910

ABSTRACT

BACKGROUND: Mandates abound to share publicly-funded research data for reuse, while data platforms continue to emerge to facilitate such reuse. Birth cohorts (BC) involve longitudinal designs, significant sample sizes and rich and deep datasets. Data sharing benefits include more analyses, greater research complexity, increased opportunities for collaboration, amplification of public contributions, and reduced respondent burdens. Sharing BC data involves significant challenges including consent, privacy, access policies, communication, and vulnerability of the child. Research on these issues is available for biological data, but these findings may not extend to BC data. We lack consensus on how best to approach these challenges in consent, privacy, communication and autonomy when sharing BC data. We require more stakeholder engagement to understand perspectives and generate consensus. METHODS: Parents participating in longitudinal birth cohorts completed a web-based survey investigating consent preferences for sharing their, and their child's, non-biological research data. Results from a previous qualitative inquiry informed survey development, and cognitive interviewing methods (n = 9) were used to improve the question quality and comprehension. Recruitment was via personalized email, with email and phone reminders during the 14-day window for survey completion. RESULTS: Three hundred and forty-six of 569 parents completed the survey in September 2014 (60.8%). Participants preferred consent processes for data sharing in future independent research that were less-active (i.e. no consent or opt-out). Parents' consent preferences are associated with their communication preferences. Twenty percent (20.2%) of parents generally agreed that their child should provide consent to continue participating in research at age 12, while 25.6% felt decision-making on sharing non-biological research data should begin at age 18. 
CONCLUSIONS: These findings reflect the parenting population's preference for less project-specific permission when research data are non-biological and de-identified and when governance practices are highly detailed and rigorous. Parents recognize that children should become involved in consent for secondary data use, but there is variability regarding when and how involvement occurs. These findings emphasize governance processes and participant notification rather than project-specific consent for secondary use of de-identified, non-biological data. Ultimately, parents prefer general consent processes for sharing de-identified, non-biological research data, with eventual involvement of the child.


Subjects
Information Dissemination, Informed Consent/psychology, Parents/psychology, Adolescent, Adult, Canada, Child, Preschool Child, Cross-Sectional Studies, Data Anonymization, Decision Making, Female, Humans, Infant, Newborn Infant, Male, Privacy, Qualitative Research, Surveys and Questionnaires
19.
Molecules; 21(8), 2016 Jul 28.
Article in English | MEDLINE | ID: mdl-27483216

ABSTRACT

Following the explosive growth in chemical and biological data, the shift from traditional methods of drug discovery to computer-aided means has made data mining and machine learning methods integral parts of today's drug discovery process. In this paper, extreme gradient boosting (Xgboost), which is an ensemble of Classification and Regression Trees (CART) and a variant of the Gradient Boosting Machine, was investigated for the prediction of biological activity based on a quantitative description of the compound's molecular structure. Seven datasets that are well known in the literature were used in this paper, and experimental results show that Xgboost can outperform machine learning algorithms like Random Forest (RF), Support Vector Machines (LSVM), Radial Basis Function Neural Network (RBFN) and Naïve Bayes (NB) for the prediction of biological activities. In addition to its ability to detect minority activity classes in highly imbalanced datasets, it showed remarkable performance on both high and low diversity datasets.
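
Illustrative sketch: the core of gradient boosting, on which Xgboost builds, is to fit each new tree to the residuals of the current ensemble. The code below does this with one-split trees (stumps) on one-dimensional inputs and squared error. It is plain gradient boosting, not Xgboost itself (no regularization or second-order terms), and the function names, learning rate, and toy data are assumptions.

```python
def fit_stump(x, residuals):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    values = sorted(set(x))
    mean_all = sum(residuals) / len(residuals)
    best = (values[0], mean_all, mean_all)  # fallback: no useful split
    best_sse = sum((r - mean_all) ** 2 for r in residuals)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if sse < best_sse:
            best, best_sse = (t, lm, rm), sse
    return best

def fit_boosting(x, y, n_rounds=50, learning_rate=0.1):
    """Fit an additive ensemble of stumps to (x, y)."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, lm, rm = fit_stump(x, residuals)
        stumps.append((t, lm, rm))
        pred = [pi + learning_rate * (lm if xi <= t else rm) for xi, pi in zip(x, pred)]
    return base, learning_rate, stumps

def predict(model, xi):
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lm if xi <= t else rm) for t, lm, rm in stumps)

# Toy activity data: inactive (0) for small x, active (1) for large x.
model = fit_boosting([1, 2, 3, 4], [0, 0, 1, 1])
low, high = predict(model, 1), predict(model, 4)
```

Each round shrinks the remaining residual by the learning rate, so the predictions approach 0 and 1 gradually rather than in one step.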


Subjects
Data Mining/methods, Drug Discovery/methods, Algorithms, Chemical Databases, Machine Learning, Regression Analysis
20.
Philos Trans R Soc Lond B Biol Sci; 379(1904): 20230104, 2024 Jun 24.
Article in English | MEDLINE | ID: mdl-38705176

ABSTRACT

Technological advancements in biological monitoring have facilitated the study of insect communities at unprecedented spatial scales. The progress allows more comprehensive coverage of the diversity within a given area while minimizing disturbance and reducing the need for extensive human labour. Compared with traditional methods, these novel technologies offer the opportunity to examine biological patterns that were previously beyond our reach. However, to address the pressing scientific inquiries of the future, data must be easily accessible, interoperable and reusable for the global research community. Biodiversity information standards and platforms provide the necessary infrastructure to standardize and share biodiversity data. This paper explores the possibilities and prerequisites of publishing insect data obtained with novel monitoring methods via GBIF, the most comprehensive global biodiversity data infrastructure. We describe the essential components of metadata standards and existing data standards for occurrence data on insects, including data extensions. By addressing the current opportunities, limitations, and future development of GBIF's publishing framework, we hope to encourage researchers to both share data and contribute to the further development of biodiversity data standards and publishing models. Wider commitments to open data initiatives will promote data interoperability and support cross-disciplinary scientific research and key policy indicators. This article is part of the theme issue 'Towards a toolkit for global insect biodiversity monitoring'.


Subjects
Biodiversity, Information Dissemination, Insects, Animals, Entomology/methods, Entomology/standards, Information Dissemination/methods, Metadata