Results 1 - 20 of 93
1.
J Chem Inf Model ; 64(8): 3205-3212, 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38544337

ABSTRACT

Language models trained on domain-specific corpora have been employed to increase performance in specialized tasks. However, little previous work has been reported on how specific a "domain-specific" corpus should be. Here, we test a number of language models trained on varyingly specific corpora by employing them in the task of extracting information on photocatalytic water splitting. We find that more specific corpora can benefit performance on downstream tasks. Furthermore, PhotocatalysisBERT, a model pretrained from scratch on scientific papers on photocatalytic water splitting, demonstrates improved performance over previous work in associating the correct photocatalyst with the correct photocatalytic activity during information extraction, achieving a precision of 60.8(+11.5)% and a recall of 37.2(+4.5)%.


Subjects
Photochemical Processes , Water , Water/chemistry , Catalysis
2.
J Chem Inf Model ; 64(5): 1486-1501, 2024 Mar 11.
Article in English | MEDLINE | ID: mdl-38422386

ABSTRACT

Molecular design depends heavily on optical properties for applications such as solar cells and polymer-based batteries. Accurate prediction of these properties is essential, and multiple predictive methods exist, from ab initio to data-driven techniques. Although theoretical methods, such as time-dependent density functional theory (TD-DFT) calculations, have well-established physical relevance and are among the most popular methods in computational physics and chemistry, they exhibit errors that are inherent in their approximate nature. These high-throughput electronic structure calculations also incur a substantial computational cost. With the emergence of big-data initiatives, cost-effective, data-driven methods have gained traction, although their usability is highly contingent on the degree of data quality and sparsity. In this study, we present a workflow that employs deep residual convolutional neural networks (DR-CNN) and gradient boosting feature selection to predict peak optical absorption wavelengths (λmax) exclusively from SMILES representations of dye molecules and solvents; one would normally measure λmax using UV-vis absorption spectroscopy. We use a multifidelity modeling approach, integrating 34,893 DFT calculations and 26,395 experimentally derived λmax data, to deliver more accurate predictions via a Bayesian-optimized gradient boosting machine. Our approach is benchmarked against the state of the art that is reported in the scientific literature; results demonstrate that learnt representations via a DR-CNN workflow that is integrated with other machine learning methods can accelerate the design of molecules for specific optical characteristics.


Subjects
Machine Learning , Neural Networks, Computer , Bayes Theorem , Density Functional Theory , Spectrum Analysis
3.
J Chem Inf Model ; 64(4): 1187-1200, 2024 Feb 26.
Article in English | MEDLINE | ID: mdl-38320103

ABSTRACT

Machine learning (ML) methods can train a model to predict material properties by exploiting patterns in materials databases that arise from structure-property relationships. However, the importance of ML-based feature analysis and selection is often neglected when creating such models. Such analysis and selection are especially important when dealing with multifidelity data because they afford a complex feature space. This work shows how a gradient-boosted statistical feature-selection workflow can be used to train predictive models that classify materials by their metallicity and predict their band gap against experimental measurements, as well as computational data that are derived from electronic-structure calculations. These models are fine-tuned via Bayesian optimization, using solely the features that are derived from chemical compositions of the materials data. We test these models against experimental, computational, and a combination of experimental and computational data. We find that the multifidelity modeling option can reduce the number of features required to train a model. The performance of our workflow is benchmarked against state-of-the-art algorithms, the results of which demonstrate that our approach is either comparable to or superior to them. The classification model realized an accuracy score of 0.943, a macro-averaged F1-score of 0.940, area under the curve of the receiver operating characteristic curve of 0.985, and an average precision of 0.977, while the regression model achieved a mean absolute error of 0.246, a root-mean-square error of 0.402, and R² of 0.937. This illustrates the efficacy of our modeling approach and highlights the importance of thorough feature analysis and judicious selection over a "black-box" approach to feature engineering in ML-based modeling.


Subjects
Algorithms , Machine Learning , Bayes Theorem , Workflow , Databases, Factual
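
The regression figures quoted in the abstract above (mean absolute error, root-mean-square error, and R²) follow standard definitions. The sketch below shows one way to compute them in plain Python. The function name `regression_metrics` and the toy numbers are assumptions made for illustration only; they are not the authors' code or data.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired reference and predicted values.

    Hypothetical helper for illustration; assumes y_true is not constant.
    """
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n            # mean absolute error
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    rmse = math.sqrt(ss_res / n)                     # root-mean-square error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination
    return mae, rmse, r2

# Invented toy values, not taken from the paper.
y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.5, 1.0, 1.5, 3.0]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(mae, rmse, r2)
```

A lower MAE and RMSE and an R² closer to 1 indicate a better fit, which is how the reported values of 0.246, 0.402, and 0.937 should be read.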
4.
Inorg Chem ; 62(1): 318-335, 2023 Jan 09.
Article in English | MEDLINE | ID: mdl-36541860

ABSTRACT

Contemporary electrocatalysts for the reduction of CO2 often suffer from low stability, activity, and selectivity, or a combination thereof. Mn-carbonyl complexes represent a promising class of molecular electrocatalysts for the reduction of CO2 to CO as they are able to promote this reaction at relatively mild overpotentials, whereby rare-earth metals are not required. The electronic and geometric structure of the reaction center of these molecular electrocatalysts is precisely known and can be tuned via ligand modifications. However, ligand characteristics that are required to achieve high catalytic turnover at minimal overpotential remain unclear. We consider 55 Mn-carbonyl complexes, which have previously been synthesized and characterized experimentally. Four intermediates were identified that are common across all catalytic mechanisms proposed for Mn-carbonyl complexes, and their structures were used to calculate descriptors for each of the 55 Mn-carbonyl complexes. These electronic-structure-based descriptors encompass the binding energies, the highest occupied and lowest unoccupied molecular orbitals, and partial charges. Trends in turnover frequency and overpotential with these descriptors were analyzed to afford meaningful physical insights into what ligand characteristics lead to good catalytic performance, and how this is affected by the reaction conditions. These insights can be expected to significantly contribute to the rational design of more active Mn-carbonyl electrocatalysts.

5.
J Chem Inf Model ; 63(19): 6053-6067, 2023 Oct 09.
Article in English | MEDLINE | ID: mdl-37729111

ABSTRACT

Knowledge in the chemical domain is often disseminated graphically via chemical reaction schemes. The task of describing chemical transformations is greatly simplified by introducing reaction schemes that are composed of chemical diagrams and symbols. While intuitively understood by any chemist, like most graphical representations, such drawings are not easily understood by machines; this poses a challenge in the context of data extraction. Currently available tools are limited in their scope of extraction and require manual preprocessing, thus slowing down the speed of data extraction. We present a new tool, ReactionDataExtractor v2.0, which uses a combination of neural networks and symbolic artificial intelligence to effectively remove this barrier. We have evaluated our tool on a test set composed of reaction schemes that were taken from open-source journal articles and realized F1 score metrics between 75 and 96%. These evaluation metrics can be further improved by tuning our object-detection models to a specific chemical subdomain thanks to a data-driven approach that we have adopted with synthetically generated data. The system architecture of our tool is modular, which allows it to balance speed and accuracy to afford an autonomous, high-throughput solution for image-based chemical data extraction.


Subjects
Deep Learning , Artificial Intelligence , Neural Networks, Computer
6.
J Chem Inf Model ; 63(22): 7045-7055, 2023 Nov 27.
Article in English | MEDLINE | ID: mdl-37934697

ABSTRACT

The ever-growing amount of chemical data found in the scientific literature has led to the emergence of data-driven materials discovery. The first step in the pipeline, to automatically extract chemical information from plain text, has been driven by the development of software toolkits such as ChemDataExtractor. Such data extraction processes have created a demand for parsers that efficiently enable text mining. Here, we present Snowball 2.0, a sentence parser based on a semisupervised machine-learning algorithm. It can be used to extract any chemical property without additional training. We validate its precision, recall, and F-score by training and testing a model with sentences of semiconductor band gap information curated from journal articles. Snowball 2.0 builds on two previously developed Snowball algorithms. Evaluation of Snowball 2.0 shows a 15-20% increase in recall with marginally reduced precision over the previous version which has been incorporated into ChemDataExtractor 2.0, giving Snowball 2.0 better performance in most configurations. Snowball 2.0 offers more and better parsing options for ChemDataExtractor, and it is more capable in the pipeline of automated data extraction. Snowball 2.0 also features better generalizability, performance, learning efficiencies, and user-friendliness.


Subjects
Algorithms , Software , Language , Data Mining , Supervised Machine Learning
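
The abstract above evaluates a parser by precision, recall, and F-score. As a minimal sketch of how those three numbers relate, the code below scores a set of extracted records against a hand-curated reference set. The function name `precision_recall_f1` and the placeholder records are hypothetical; this is not Snowball 2.0 or the ChemDataExtractor API.

```python
def precision_recall_f1(extracted, gold):
    """Score extracted records against a curated reference set.

    Hypothetical helper for illustration; records are compared by exact match.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Placeholder (compound, value) pairs, invented for the example.
gold = {("compound-A", "1.0 eV"), ("compound-B", "2.0 eV"),
        ("compound-C", "3.0 eV"), ("compound-D", "4.0 eV")}
extracted = {("compound-A", "1.0 eV"), ("compound-B", "2.0 eV"),
             ("compound-E", "5.0 eV")}

precision, recall, f1 = precision_recall_f1(extracted, gold)
print(precision, recall, f1)
```

Because the F-score is a harmonic mean, a large gain in recall can outweigh a small loss in precision, which is consistent with the trade-off the abstract describes.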
7.
J Chem Inf Model ; 63(7): 1961-1981, 2023 Apr 10.
Article in English | MEDLINE | ID: mdl-36940385

ABSTRACT

Text mining in the optical-materials domain is becoming increasingly important as the number of scientific publications in this area grows rapidly. Language models such as Bidirectional Encoder Representations from Transformers (BERT) have opened up a new era and brought a significant boost to state-of-the-art natural-language-processing (NLP) tasks. In this paper, we present two "materials-aware" text-based language models for optical research, OpticalBERT and OpticalPureBERT, which are trained on a large corpus of scientific literature in the optical-materials domain. These two models outperform BERT and previous state-of-the-art models in a variety of text-mining tasks about optical materials. We also release the first "materials-aware" table-based language model, OpticalTable-SQA. This is a querying facility that solicits answers to questions about optical materials using tabular information that pertains to this scientific domain. The OpticalTable-SQA model was realized by fine-tuning the Tapas-SQA model using a manually annotated OpticalTableQA data set which was curated specifically for this work. While preserving its sequential question-answering performance on general tables, the OpticalTable-SQA model significantly outperforms Tapas-SQA on optical-materials-related tables. All models and data sets are available to the optical-materials-science community.


Subjects
Data Mining , Electric Power Supplies , Language , Materials Science , Natural Language Processing
8.
J Chem Phys ; 159(19)2023 Nov 21.
Article in English | MEDLINE | ID: mdl-37971034

ABSTRACT

With the emergence of big data initiatives and the wealth of available chemical data, data-driven approaches are becoming a vital component of materials discovery pipelines or workflows. The screening of materials using machine-learning models, in particular, is increasingly gaining momentum to accelerate the discovery of new materials. However, the black-box treatment of machine-learning methods suffers from a lack of model interpretability, as feature relevance and interactions can be overlooked or disregarded. In addition, naive approaches to model training often lead to irrelevant features being used, which necessitates various regularization techniques to achieve model generalization; this incurs a high computational cost. We present a feature-selection workflow that overcomes this problem by leveraging a gradient boosting framework and statistical feature analyses to identify a subset of features, in a recursive manner, which maximizes their relevance to the target variable or classes. We subsequently obtain minimal feature redundancy through multicollinearity reduction by performing feature correlation and hierarchical cluster analyses. The features are further refined using a wrapper method, which follows a greedy search approach by evaluating all possible feature combinations against the evaluation criterion. A case study on elastic material-property prediction and a case study on the classification of materials by their metallicity illustrate our proposed workflow, although it is highly general, as demonstrated by our subsequent prediction of a wider range of material properties. Our Bayesian-optimized machine-learning models generated results, without the use of regularization techniques, which are comparable to state-of-the-art results reported in the scientific literature.
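
The multicollinearity-reduction step described above can be illustrated with a much simpler stand-in: a greedy pairwise-correlation filter. This is a sketch under stated assumptions, not the authors' workflow, which combines feature correlation with hierarchical cluster analysis. The function names, the relevance scores, and the 0.9 threshold are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def drop_collinear(features, relevance, threshold=0.9):
    """Keep features in order of decreasing relevance, skipping any feature
    whose absolute correlation with an already kept feature exceeds threshold.

    features:  dict mapping feature name -> list of values
    relevance: dict mapping feature name -> relevance score (assumed given,
               e.g. from a gradient-boosting importance ranking)
    """
    kept = []
    for name in sorted(features, key=lambda k: relevance[k], reverse=True):
        if all(abs(pearson(features[name], features[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept

# Invented toy feature table.
features = {
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [2.0, 4.0, 6.0, 8.1],    # almost a multiple of f1, so redundant
    "f3": [1.0, -1.0, 1.0, -1.0],  # weakly correlated with f1
}
relevance = {"f1": 0.8, "f2": 0.6, "f3": 0.3}
kept = drop_collinear(features, relevance)
print(kept)
```

Ranking by relevance before filtering means that, of any highly correlated pair, the more relevant feature survives.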

9.
Langmuir ; 38(3): 871-890, 2022 Jan 25.
Article in English | MEDLINE | ID: mdl-35014533

ABSTRACT

In this feature article, we discuss the fundamental use of materials-characterization methods that directly determine structural information on the dye···TiO2 interface in dye-sensitized solar cells (DSCs). This interface is usually buried within the DSC and submerged in solvent and electrolyte, which renders such metrological work nontrivial. We will show how ex-situ X-ray reflectometry (XRR), atomic-force microscopy (AFM), grazing-incidence X-ray scattering (GIXS), pair-distribution-function analysis of X-ray diffraction data (gaPDF), and in-situ neutron reflectometry (NR) can be used to deliver specific structural information on the dye···TiO2 interface regarding dye anchoring, dye aggregation, molecular dye orientation, intermolecular spacing between dye molecules, interactions between the dye molecules and the TiO2 surface, and interactions between the dye molecules and the electrolyte components and precursors. Some of these materials-characterization techniques have been developed specifically for this purpose. We will demonstrate how the direct acquisition of such information from materials-characterization experiments is crucial for assembling a holistic structural picture of this interface, which in turn can be used to develop DSC design guidelines. Moreover, we will show how these methodologies can be used in the experimental-validation process of "design-to-device" pipelines for big-data- and machine-learning-based materials discovery. We conclude with an outlook on further developments of this design-to-device approach as well as the materials characterization of more dye···TiO2 interfacial structures that involve known DSC dyes using the methods described herein. In addition, we propose to combine these formally disparate metrologies so that their complementary merits can be exploited simultaneously. New metrologies of this kind could serve as a "one-stop-shop" for the materials characterization of surfaces, interfaces, and bulk structures in DSCs and other devices with layered architectures.

10.
J Chem Inf Model ; 62(24): 6365-6377, 2022 Dec 26.
Article in English | MEDLINE | ID: mdl-35533012

ABSTRACT

A great number of scientific papers are published every year in the field of battery research, which forms a huge textual data source. However, it is difficult to explore and retrieve useful information efficiently from these large unstructured sets of text. The Bidirectional Encoder Representations from Transformers (BERT) model, trained on a large data set in an unsupervised way, provides a route to process the scientific text automatically with minimal human effort. To this end, we realized six battery-related BERT models, namely, BatteryBERT, BatteryOnlyBERT, and BatterySciBERT, each of which consists of both cased and uncased models. They have been trained specifically on a corpus of battery research papers. The pretrained BatteryBERT models were then fine-tuned on downstream tasks, including battery paper classification and extractive question-answering for battery device component classification that distinguishes anode, cathode, and electrolyte materials. Our BatteryBERT models were found to outperform the original BERT models on the specific battery tasks. The fine-tuned BatteryBERT was then used to perform battery database enhancement. We also provide a website application for its interactive use and visualization.


Subjects
Electric Power Supplies , Language , Humans , Databases, Factual , Natural Language Processing
11.
J Chem Inf Model ; 62(11): 2670-2684, 2022 Jun 13.
Article in English | MEDLINE | ID: mdl-35587269

ABSTRACT

Predicting the properties of materials prior to their synthesis is of great significance in materials science. Optical materials exhibit a large number of interesting properties that make them useful in a wide range of applications, including optical glasses, optical fibers, and laser optics. In all of these applications, refraction and its chromatic dispersion can directly reflect the characteristics of the transmitted light and determine the practical utility of the material. We demonstrate the feasibility of reconstructing chromatic-dispersion relations of well-known optical materials by aggregating data over a large number of independent sources, which are contained within a material database of experimentally determined refractive indices and wavelength values. We also employ this database to develop a machine-learning platform that can predict refractive indices of compounds without needing to know the structure or other properties of a material of interest. We present a web-based application that enables users to build their customized machine-learning models; this will help the scientific community to conduct further research into the discovery of optical materials.


Subjects
Refraction, Ocular , Refractometry , Data Mining , Light , Machine Learning
12.
J Chem Inf Model ; 62(5): 1207-1213, 2022 Mar 14.
Article in English | MEDLINE | ID: mdl-35199519

ABSTRACT

Chemical Named Entity Recognition (NER) forms the basis of information extraction tasks in the chemical domain. However, while such tasks can involve multiple domains of chemistry at the same time, currently available named entity recognizers are specialized in one part of chemistry, resulting in such workflows failing for a biased subset of mentions. This paper presents a single model that performs at close to the state-of-the-art for both organic (CHEMDNER, 89.7 F1 score) and inorganic (Matscholar, 88.0 F1 score) NER tasks at the same time. Our NER system, which uses the BERT architecture, is available as part of ChemDataExtractor 2.1, along with the data sets and scripts used to train the model.


Subjects
Information Storage and Retrieval , Inorganic Chemicals
13.
J Chem Inf Model ; 62(7): 1633-1643, 2022 Apr 11.
Article in English | MEDLINE | ID: mdl-35349259

ABSTRACT

The layout of portable document format (PDF) files is constant to any screen, and the metadata therein are latent, compared to mark-up languages such as HTML and XML. No semantic tags are usually provided, and a PDF file is not designed to be edited or its data interpreted by software. However, data held in PDF files need to be extracted in order to comply with open-source data requirements that are now government-regulated. In the chemical domain, related chemical and property data also need to be found, and their correlations need to be exploited to enable data science in areas such as data-driven materials discovery. Such relationships may be realized using text-mining software such as the "chemistry-aware" natural-language-processing tool, ChemDataExtractor; however, this tool has limited data-extraction capabilities from PDF files. This study presents the PDFDataExtractor tool, which can act as a plug-in to ChemDataExtractor. It outperforms other PDF-extraction tools for the chemical literature by coupling its functionalities to the chemical-named entity-recognition capabilities of ChemDataExtractor. The intrinsic PDF-reading abilities of ChemDataExtractor are much improved. The system features a template-based architecture. This enables semantic information to be extracted from the PDF files of scientific articles in order to reconstruct the logical structure of articles. While other existing PDF-extracting tools focus on quantity mining, this template-based system is more focused on quality mining on different layouts. PDFDataExtractor outputs information in JSON and plain text, including the metadata of a PDF file, such as paper title, authors, affiliation, email, abstract, keywords, journal, year, digital object identifier (DOI), reference, and issue number. With a self-created evaluation article set, PDFDataExtractor achieved promising precision for all key assessed metadata areas of the document text.


Subjects
Metadata , Reading , Data Mining , Natural Language Processing , Software
14.
J Chem Phys ; 156(15): 154110, 2022 Apr 21.
Article in English | MEDLINE | ID: mdl-35459320

ABSTRACT

A measure of chemical similarity is only useful if it implies similarity in some relevant property space. Typically, similarity calculations operate by assigning each molecule a chemical fingerprint: a fixed-length vector of bits where the on-bits signify the presence of a certain feature. Common fingerprinting schemes, such as extended-connectivity fingerprints, are by definition general and fail to capture much of the domain-specific theory that underpins similarity in a specific domain. In this work, a hierarchical fingerprinting scheme is developed that is bespoke to a database of ∼4500 organic molecules and their cognate optical absorption spectral properties. Our fingerprinting scheme incorporates molecular fragmentation and domain-specific chemical intuition into an algorithm that categorizes each fragment as being one of a core chemical group, a substituent, or a bridge. The algorithm is applied to every molecule in the database to generate a pool of chemically relevant fragments that are labeled according to their structural category. The fingerprint of each molecule is then composed of a nested Python dictionary specifying the unique identifiers of its constituent fragment entities and the structural links between them to give a hierarchical molecular encoding scheme. Four case studies show the application of our fingerprinting scheme to the subject database. In each case, the clustered molecules display a host of interesting chemical trends. The application that was used to develop and implement this bespoke fingerprinting scheme, referred to as ChemCluster, also exposes a host of other cheminformatics tools pertaining to this database, a selection of which is demonstrated in this work. The enhanced similarity comparisons afforded by our fingerprinting scheme, as well as the large repository of categorized fragments generated during its development, constitute the first step toward using this database in a data-driven materials discovery workflow.


Subjects
Algorithms , Cluster Analysis , Databases, Factual
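
The abstract above contrasts two encodings: a fixed-length bit vector and a nested Python dictionary of fragment identifiers. The sketch below illustrates both. The Tanimoto coefficient is a common similarity measure for bit fingerprints and is an assumption here, since the abstract does not name a metric. The dictionary keys (`core`, `substituents`, `bridges`, `links_to`, `id`) and the identifiers are hypothetical and do not reproduce the ChemCluster schema.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two equal-length bit vectors:
    shared on-bits divided by on-bits present in either vector."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(bits_a, bits_b) if a and b)
    either = sum(1 for a, b in zip(bits_a, bits_b) if a or b)
    return both / either if either else 0.0  # convention chosen here for empty vectors

def fragment_ids(node):
    """Collect every value stored under an 'id' key in a nested dict/list."""
    ids = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "id":
                ids.add(value)
            else:
                ids |= fragment_ids(value)
    elif isinstance(node, list):
        for item in node:
            ids |= fragment_ids(item)
    return ids

# Hypothetical hierarchical fingerprint: one core group, one substituent,
# and one bridge that links to a further substituent.
example_fingerprint = {
    "core": {
        "id": "core-001",
        "substituents": [{"id": "sub-007"}],
        "bridges": [
            {"id": "bridge-002",
             "links_to": {"substituents": [{"id": "sub-011"}]}},
        ],
    }
}

similarity = tanimoto([1, 1, 0, 1], [1, 0, 0, 1])
ids = fragment_ids(example_fingerprint)
print(similarity, sorted(ids))
```

The nested form keeps the structural links between fragments, which a flat bit vector discards.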
15.
Acc Chem Res ; 53(3): 599-610, 2020 Mar 17.
Article in English | MEDLINE | ID: mdl-32096410

ABSTRACT

The world needs new materials to stimulate the chemical industry in key sectors of our economy: environment and sustainability, information storage, optical telecommunications, and catalysis. Yet, nearly all functional materials are still discovered by "trial-and-error", of which the lack of predictability affords a major materials bottleneck to technological innovation. The average "molecule-to-market" lead time for materials discovery is currently 20 years. This is far too long for industrial needs, as highlighted by the Materials Genome Initiative, which has ambitious targets of up to 4-fold reductions in average molecule-to-market lead times. Such a large step change in progress can only be realistically achieved if one adopts an entirely new approach to materials discovery. Fortunately, a fundamentally new approach to materials discovery has been emerging, whereby data science with artificial intelligence offers a prospective solution to speed up these average molecule-to-market lead times. This approach is known as data-driven materials discovery. Its broad prospects have only recently become a reality, given the timely and major advances in "big data", artificial intelligence, and high-performance computing (HPC). Access to massive data sets has been stimulated by government-regulated open-access requirements for data and literature. Natural-language processing (NLP) and machine-learning (ML) tools that can mine data and find patterns therein are becoming mainstream. Exascale HPC capabilities that can aid data mining and pattern recognition and also generate their own data from calculations are now within our grasp. These timely advances present an ideal opportunity to develop data-driven materials-discovery strategies to systematically design and predict new chemicals for a given device application. This Account shows how data science can afford materials discovery via a four-step "design-to-device" pipeline that entails (1) data extraction, (2) data enrichment, (3) material prediction, and (4) experimental validation. Massive databases of cognate chemical and property information are first forged from "chemistry-aware" natural-language-processing tools, such as ChemDataExtractor, and enriched using machine-learning methods and high-throughput quantum-chemical calculations. New materials for a bespoke application can then be predicted by mining these databases with algorithmic encodings of relationships between chemical structures and physical properties that are known to deliver functional materials. These may take the form of classification, enumeration, or machine-learning algorithms. A data-mining workflow short-lists these predictions to a handful of lead candidate materials that go forward to experimental validation. This design-to-device approach is being developed to offer a roadmap for the accelerated discovery of new chemicals for functional applications. Case studies presented demonstrate its utility for photovoltaic, optical, and catalytic applications. While this Account is focused on applications in the physical sciences, the generic pipeline discussed is readily transferable to other scientific disciplines such as biology and medicine.

16.
Langmuir ; 37(5): 1970-1982, 2021 Feb 09.
Article in English | MEDLINE | ID: mdl-33492974

ABSTRACT

The nature of an interfacial structure buried within a device assembly is often critical to its function. For example, the dye/TiO2 interfacial structure that comprises the working electrode of a dye-sensitized solar cell (DSC) governs its photovoltaic output. These structures have been determined outside of the DSC device, using ex situ characterization methods; yet, they really should be probed while held within a DSC since they are modulated by the device environment. Dye/TiO2 structures will be particularly influenced by a layer of electrolyte ions that lies above the dye self-assembly. We show that electrolyte/dye/TiO2 interfacial structures can be resolved using in situ neutron reflectometry with contrast matching. We find that electrolyte constituents ingress into the self-assembled monolayer of dye molecules that anchor onto TiO2. Some dye/TiO2 anchoring configurations are modulated by the formation of electrolyte/dye intermolecular interactions. These electrolyte-influencing structural changes will affect dye-regeneration and electron-injection DSC operational processes. This underpins the importance of this in situ structural determination of electrolyte/dye/TiO2 interfaces within representative DSC device environments.

17.
J Chem Inf Model ; 61(3): 1136-1149, 2021 Mar 22.
Article in English | MEDLINE | ID: mdl-33682402

ABSTRACT

Automating the analysis portion of materials characterization by electron microscopy (EM) has the potential to accelerate the process of scientific discovery. To this end, we present a Bayesian deep-learning model for semantic segmentation and localization of particle instances in EM images. These segmentations can subsequently be used to compute quantitative measures such as particle-size distributions, radial-distribution functions, average sizes, and aspect ratios of the particles in an image. Moreover, by making use of the epistemic uncertainty of our model, we obtain uncertainty estimates of its outputs and use these to filter out false-positive predictions and hence produce more accurate quantitative measures. We incorporate our method into the ImageDataExtractor package, as ImageDataExtractor 2.0, which affords a full pipeline to automatically extract particle information for large-scale data-driven materials discovery. Finally, we present and make publicly available the Electron Microscopy Particle Segmentation (EMPS) data set. This is the first human-labeled particle instance segmentation data set, consisting of 465 EM images and their corresponding semantic instance segmentation maps.


Subjects
Image Processing, Computer-Assisted , Semantics , Bayes Theorem , Humans , Microscopy, Electron
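
The abstract above describes discarding uncertain particle detections before computing size statistics. A minimal sketch of that idea follows, assuming each detected particle already carries an area, a bounding-box width and height, and a scalar uncertainty. The record layout, the function name `summarize_particles`, and the 0.2 cut-off are hypothetical and are not the ImageDataExtractor 2.0 interface.

```python
def summarize_particles(particles, max_uncertainty=0.2):
    """Drop detections above an uncertainty cut-off, then report the count,
    mean area, and mean aspect ratio (long side / short side) of the rest."""
    kept = [p for p in particles if p["uncertainty"] <= max_uncertainty]
    if not kept:
        return {"count": 0, "mean_area": None, "mean_aspect_ratio": None}
    mean_area = sum(p["area"] for p in kept) / len(kept)
    aspect_ratios = [
        max(p["width"], p["height"]) / min(p["width"], p["height"])
        for p in kept
    ]
    return {
        "count": len(kept),
        "mean_area": mean_area,
        "mean_aspect_ratio": sum(aspect_ratios) / len(aspect_ratios),
    }

# Invented detections in arbitrary pixel units.
particles = [
    {"area": 100.0, "width": 10.0, "height": 10.0, "uncertainty": 0.05},
    {"area": 50.0, "width": 10.0, "height": 5.0, "uncertainty": 0.10},
    {"area": 400.0, "width": 20.0, "height": 20.0, "uncertainty": 0.60},  # filtered out
]
summary = summarize_particles(particles)
print(summary)
```

In this toy case the third detection would have more than doubled the mean area had it been kept, which shows why filtering likely false positives changes the reported statistics.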
18.
J Chem Inf Model ; 61(10): 4962-4974, 2021 Oct 25.
Article in English | MEDLINE | ID: mdl-34525303

ABSTRACT

Chemical reaction schemes are commonly used for visual encapsulation of chemical information. Figures of reaction schemes contain chemical transformations, the chemical species involved, as well as reaction conditions. From a data-mining point of view, they constitute rich sources, densely packed with knowledge. Yet, the challenge of automatically extracting data from them has remained largely untackled. This work presents ReactionDataExtractor, a software tool that can be used for the automatic extraction of information from multistep reaction schemes. Its capabilities include segmentation of reaction steps, regions containing reaction conditions, chemical diagrams, as well as optical character and structure recognition. A combination of rules and unsupervised machine-learning approaches is used, with bespoke algorithms that detect arrows, structures, labels, and conditions. It can be used as a low-maintenance tool for database generation capable of extracting data from large quantities of images supplied by the user. On assessment using a self-generated evaluation set, the tool achieved precision and recall metrics of between 67% and 91% in the six core areas of data extraction. The ReactionDataExtractor tool is released under the MIT license and is available to download from http://www.reactiondataextractor.org.


Subjects
Data Mining , Software , Algorithms , Databases, Factual , Unsupervised Machine Learning
19.
J Chem Inf Model ; 61(9): 4280-4289, 2021 Sep 27.
Article in English | MEDLINE | ID: mdl-34529432

ABSTRACT

The ever-growing abundance of data found in heterogeneous sources, such as scientific publications, has forced the development of automated techniques for data extraction. While in the past, in the physical sciences domain, the focus has been on the precise extraction of individual properties, attention has recently been devoted to the extraction of higher-level relationships. Here, we present a framework for the automated population of ontologies, that is, the direct extraction of a larger group of properties linked by a semantic network. We exploit data-rich sources, such as tables within documents, and present a new model concept that enables data extraction for chemical and physical properties with the ability to organize hierarchical data as nested information. Combining these capabilities with automatically generated parsers for data extraction and forward-looking interdependency resolution, we illustrate the power of our approach via the automatic extraction of a crystallographic hierarchy of information. This includes 18 interrelated submodels of nested data, extracted from an evaluation set of scientific articles, yielding an overall precision of 92.2%, across 26 different journals. Our method and associated toolkit, ChemDataExtractor 2.0, offer a key step toward the seamless integration of primary literature sources into a data-driven scientific framework.


Subjects
Materials Science , Software , Information Storage and Retrieval
20.
Chem Rev ; 119(12): 7279-7327, 2019 Jun 26.
Article in English | MEDLINE | ID: mdl-31013076

ABSTRACT

Dye-sensitized solar cells (DSCs) are a next-generation photovoltaic technology, whose natural transparency and good photovoltaic output under ambient light conditions afford them niche applications in solar-powered windows and interior design for energy-sustainable buildings. Their ability to be fabricated on flexible substrates, or as fibers, also makes them attractive as passive energy harvesters in wearable devices and textiles. Cosensitization has emerged as a method that affords efficiency gains in DSCs, being most celebrated via its role in nudging power conversion efficiencies of DSCs to reach world-record values; yet, cosensitization has a much wider potential for applications, as this review will show. Cosensitization is a chemical fabrication method that produces DSC working electrodes that contain two or more different dyes with complementary optical absorption characteristics. Dye combinations that collectively afford a panchromatic absorption spectrum emulating that of the solar emission spectrum are ideal, given that such combinations use all available sunlight. This review classifies existing cosensitization efforts into seven distinct ways that dyes have been combined in order to generate panchromatic DSCs. Seven cognate molecular-engineering strategies for cosensitization are thereby developed, which tailor optical absorption toward optimal DSC-device function.
