A data ecosystem to support machine learning in materials science

  • Ben Blaiszik (a1) (a2), Logan Ward (a1) (a2), Marcus Schwarting (a2), Jonathon Gaff (a1), Ryan Chard (a1) (a2), Daniel Pike (a3), Kyle Chard (a1) (a2) and Ian Foster (a1) (a2)...


Applying machine learning (ML) to materials science problems requires a data ecosystem that supports three things: discovery and collection of data from many sources, automated dissemination of new data across the ecosystem, and connection of data with materials-specific ML models. Here, we present two projects that address these needs: the Materials Data Facility (MDF) and the Data and Learning Hub for Science (DLHub). We use examples to show how MDF and DLHub capabilities can be used to link data with ML models, and how users can access those capabilities through web and programmatic interfaces.


Corresponding author

Address all correspondence to Ben Blaiszik at


