Machine learning algorithm extracts materials synthesis recipes from the literature
A new artificial intelligence system can sift through academic publications and extract recipes for synthesizing materials. This is an important step toward fully realizing the vision of the US Materials Genome Initiative (MGI), say developers of the data-mining technique. The MGI was launched in 2011 with the goal of accelerating the development of advanced materials. It has led to novel, computationally designed materials with applications in energy, catalysis, thermoelectrics, and hydrogen storage.
Materials researchers have made headway in identifying and designing novel compounds with desired properties. But the process of making these new materials is still slow. “The bottleneck for materials development has shifted somewhat to synthesis of a new compound once it has been predicted to have good properties from computational work,” says Elsa Olivetti, a professor of materials science and engineering at the Massachusetts Institute of Technology (MIT).
It would help to have an automatic way to extract materials recipes from previously published articles. For this, researchers have turned to machine learning, which uses algorithms trained to discern patterns in data sets. Past efforts to apply machine learning to materials synthesis have focused on extracting text from scientific literature.
But Olivetti and colleagues at the University of Massachusetts at Amherst and the University of California at Berkeley have gone a step further. They use several machine learning and natural language processing techniques to extract materials synthesis conditions from thousands of research papers. The system then analyzes this data to correlate synthesis conditions with resulting materials properties.
Their platform, as reported in a recent issue of Chemistry of Materials, automatically analyzes research articles and deduces which paragraphs contain recipes. Then it classifies the words in that text according to their roles in the recipes: numeric quantities, names of equipment, operating conditions, and names of target materials.
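To make the word-classification step concrete, here is a minimal sketch in Python. It is a hand-written, rule-based stand-in, not the researchers' trained classifier: the role names, the small equipment and unit lexicons, and the regular expressions are all assumptions chosen for illustration, and the formula pattern is deliberately crude (it would also match all-caps abbreviations).

```python
import re

# Hypothetical lexicons, invented for this illustration only.
EQUIPMENT = {"autoclave", "furnace", "oven"}
CONDITION_UNITS = {"c", "°c", "k", "h", "min"}

# A plain integer or decimal number.
NUMBER = re.compile(r"\d+(?:\.\d+)?")
# Two or more element-like symbols, each optionally followed by a count (e.g. TiO2).
FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*){2,}")


def tag_token(token):
    """Assign a coarse recipe role to a single whitespace-delimited token."""
    if NUMBER.fullmatch(token):
        return "quantity"
    if token.lower() in EQUIPMENT:
        return "equipment"
    if token.lower() in CONDITION_UNITS:
        return "condition"
    if FORMULA.fullmatch(token):
        return "material"
    return "other"


# Invented example sentence, not taken from any paper.
sentence = "TiO2 powder was heated in an autoclave at 150 C for 24 h"
tags = [(token, tag_token(token)) for token in sentence.split()]
```

A learned model replaces these fixed rules with patterns inferred from annotated text, which is why the annotated training articles described next matter.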
Machine learning typically uses very large data sets. But since materials recipe extraction is a new research area, Olivetti and her colleagues did not have large, annotated data sets. They first trained their software with about 100 academic articles that they had manually annotated. Then they used an algorithm called Word2vec, which groups together words found in similar contexts and does not require annotated data. This allowed them to increase their training set to over 640,000 articles. “The program looks for words related to synthesis, such as times, temperatures, operations, precursor, etc.,” Olivetti says.
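The idea that words used in similar contexts can be grouped without annotation can be illustrated with a short Python sketch. This is not Word2vec itself, which learns dense vectors with a neural network; it is a simpler co-occurrence count compared by cosine similarity, and the function names, window size, and toy sentences are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented toy corpus; no labels are attached to any word.
corpus = [
    "the powder was calcined at 500 c for 2 h",
    "the powder was sintered at 900 c for 4 h",
    "the precursor was dissolved in water",
    "the precursor was dissolved in ethanol",
]
vectors = cooccurrence_vectors(corpus)
sim_operations = cosine(vectors["calcined"], vectors["sintered"])
sim_unrelated = cosine(vectors["calcined"], vectors["water"])
```

In this toy corpus the two heating operations share most of their neighboring words, so `sim_operations` comes out higher than `sim_unrelated`, even though nothing told the program that both words are synthesis operations.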
Tests of the system on manually labeled data showed that it could identify paragraphs that contained recipes with 99% accuracy and label the words within those paragraphs with 86% accuracy.
Furthermore, the researchers examined the synthesis conditions for various metal oxides across more than 12,900 manuscripts. The system could retrieve calcination temperatures used in these recipes, which the researchers could group by the number of constituent elements in the target material and by whether or not the target was nanostructured. They could also use the data to predict the critical parameters needed to synthesize titania nanotubes through hydrothermal methods, and they verified their results against known mechanisms.
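The grouping described here can be sketched in a few lines of Python. The records below are invented placeholders rather than values from the study, and the field names and the choice of the median as a summary statistic are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical extracted records; temperatures are made-up values in degrees Celsius.
records = [
    {"target": "TiO2", "n_elements": 2, "nanostructured": True, "calcination_c": 450},
    {"target": "TiO2", "n_elements": 2, "nanostructured": True, "calcination_c": 500},
    {"target": "ZnO", "n_elements": 2, "nanostructured": False, "calcination_c": 600},
    {"target": "BaTiO3", "n_elements": 3, "nanostructured": False, "calcination_c": 900},
    {"target": "BaTiO3", "n_elements": 3, "nanostructured": False, "calcination_c": 1100},
    {"target": "LiCoO2", "n_elements": 3, "nanostructured": False, "calcination_c": 800},
]


def median_temperature_by_group(rows):
    """Group temperatures by (element count, nanostructured flag) and take the median."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["n_elements"], row["nanostructured"])
        groups[key].append(row["calcination_c"])
    return {key: median(temps) for key, temps in groups.items()}


summary = median_temperature_by_group(records)
```

With thousands of real records in place of these six, the same kind of summary lets a researcher compare typical calcination temperatures across classes of oxides.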
“For human researchers, the vastness of literature has become overwhelmingly large to read and distill for insight,” says Benji Maruyama, a senior materials research engineer in the US Air Force Research Laboratory at Wright-Patterson Air Force Base, Ohio. “This work represents an important milestone of using artificial intelligence to extract usable information for further experimentation.” This technique is a critical advance to address the larger challenge of building autonomous, closed-loop research systems for materials development, he says.