ColabFit: Informatics for Advanced Materials and Chemistry


The emergence of data-driven approaches for predicting material properties promises to transform materials design and synthesis. These efforts fall into two broad categories: (1) development of data-driven interatomic potentials (DDIPs) that predict material properties through molecular simulations of material response; and (2) direct prediction of material properties by learning structure–property relationships. Both approaches build on recent advances in machine learning by constructing models based on a large number of quantum or experimental input configurations. DDIPs enable predictive molecular simulations with the accuracy of first-principles methods over length and time scales comparable to those of classical molecular simulations. Direct prediction models (DPMs) enable rapid calculation of properties and are particularly useful as a materials informatics tool.

This project aims to create a computational framework, “ColabFit”, that enables researchers to rapidly develop and deploy DDIPs and DPMs for complex material systems by providing:

  • ColabFit Dataset Exchange, an online resource for archiving, discovering, and accessing datasets used for training DDIPs and DPMs. The Exchange will provide exhaustive metadata on archived datasets as well as dataset analytics providing researchers with information on their scope and quality.
  • ColabFit Tools, a Python package for constructing, manipulating, and exploring ColabFit datasets. This will make it easy for researchers to use datasets archived in the Exchange in their fitting efforts and to prepare data for upload to the Exchange.
  • Archiving of DDIPs and their training procedures in the Open Knowledgebase of Interatomic Models (OpenKIM) project, enabling seamless and portable deployment to a large number of simulation platforms. As the name “ColabFit” suggests, this set of tools enables researchers to work collaboratively on developing DDIPs and DPMs by sharing datasets and building on each other's work.
  • The KIM-based Learning-Integrated Fitting Framework (KLIFF), a PyTorch-based Python package for fitting DDIPs that is compatible with OpenKIM. KLIFF is provided as a native fitting code for ColabFit that fully integrates all of its functionality and is maintained by the ColabFit team. (Note that KLIFF is the package of choice for developing new DDIPs within the ColabFit project. However, ColabFit datasets and OpenKIM deployment can be readily used with a researcher's preferred fitting package.)
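
To make the idea of fitting an interatomic potential to reference data concrete, here is a minimal sketch. It is a toy stand-in, far simpler than a DDIP, and it does not use KLIFF, ColabFit Tools, or OpenKIM. The pair-potential form E(r) = A/r^12 − B/r^6, the function name fit_pair_potential, and the synthetic reference energies are all assumptions made for illustration.

```python
# Toy illustration only (not KLIFF): fit E(r) = A / r**12 - B / r**6 to
# reference dimer energies by linear least squares. All names are hypothetical.


def fit_pair_potential(distances, energies):
    """Return (A, B) minimizing sum_i (A * r_i**-12 - B * r_i**-6 - E_i)**2."""
    if len(distances) != len(energies):
        raise ValueError("distances and energies must have the same length")
    if len(distances) < 2:
        raise ValueError("at least two reference points are needed")

    # The model is linear in its parameters: E = a * x + c * y,
    # with x = r**-12, y = r**-6, a = A and c = -B.
    xs = [r ** -12 for r in distances]
    ys = [r ** -6 for r in distances]

    # Entries of the 2x2 normal equations.
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    syy = sum(y * y for y in ys)
    sxe = sum(x * e for x, e in zip(xs, energies))
    sye = sum(y * e for y, e in zip(ys, energies))

    det = sxx * syy - sxy * sxy
    if det == 0.0:
        raise ValueError("reference data do not determine both parameters")

    # Cramer's rule for the 2x2 system.
    a = (sxe * syy - sxy * sye) / det
    c = (sxx * sye - sxy * sxe) / det
    return a, -c


if __name__ == "__main__":
    # Synthetic "reference" energies generated from known parameters, standing
    # in for the quantum-mechanical data a real training set would contain.
    a_true, b_true = 4.0, 4.0
    rs = [0.95, 1.0, 1.1, 1.2, 1.5, 2.0]
    es = [a_true / r ** 12 - b_true / r ** 6 for r in rs]
    print(fit_pair_potential(rs, es))
```

A real DDIP replaces the two-parameter pair form with a flexible machine-learned model and is trained on many configurations, typically against energies and forces, but the principle of adjusting parameters to reproduce reference data is the same.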

This project addresses a pressing need of the molecular simulation community. The creation of the ColabFit Dataset Exchange and associated tools will give materials researchers a powerful new ability to efficiently synthesize all available data and knowledge related to their particular problem of study. Datasets shared through the Exchange and DDIPs shared through OpenKIM will be archived with full provenance and version control under a persistent digital object identifier (DOI) to enable reproducible science and R&D, and will be available for other researchers in the community to build upon and extend for their own needs. Thus, major inefficiencies in today’s materials research will be eliminated, and society as a whole can benefit from the resulting increase in scientific advancement.

ColabFit will be developed in collaboration with an international consortium of leaders in DDIP and DPM development, high-throughput first-principles computation cyberinfrastructures, and materials standards organizations. Researchers interested in joining the consortium are invited to do so at the link above.


The development of the ColabFit framework involves research and development on several fronts:

  • Development of a standard, efficient format for storing heterogeneous datasets used for training DDIPs and DPMs. This format must accommodate very large datasets, minimize data duplication, and be flexible enough to support highly diverse and application-specific data. The ColabFit Tools package will be developed to work with datasets stored in this format.
  • Development of a KIM Model Driver for DDIPs leveraging the KIM Application Programming Interface (KIM API), which enables portable deployment to KIM-compliant simulation platforms, and of an archiving strategy for recording the DDIP training procedure (e.g., the loss function and optimization algorithm).
  • Extension of the KLIFF DDIP fitting package to support new ColabFit and OpenKIM functionality. This includes the introduction of uncertainty quantification capabilities developed as part of the OpenKIM project.
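
One common way to minimize data duplication, as the format requirement in the first item calls for, is to store each atomic configuration once under a content-derived key and let datasets hold only references to those keys. The sketch below illustrates that idea using only the Python standard library. It is not the ColabFit format or the ColabFit Tools API; the names configuration_key and ConfigurationStore, the rounding tolerance, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import json


def configuration_key(symbols, positions, cell, decimals=8):
    """Return a SHA-256 hex digest of a canonical encoding of one configuration.

    Coordinates are rounded so that tiny floating-point differences do not
    yield different keys; adding 0.0 turns -0.0 into 0.0 and ints into floats.
    Atom order is significant in this sketch: the same structure with its
    atoms listed in a different order gets a different key.
    """
    canonical = {
        "symbols": list(symbols),
        "positions": [[round(float(x), decimals) + 0.0 for x in p] for p in positions],
        "cell": [[round(float(x), decimals) + 0.0 for x in v] for v in cell],
    }
    payload = json.dumps(canonical, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class ConfigurationStore:
    """Hypothetical store: configurations saved once, datasets hold keys."""

    def __init__(self):
        self.configurations = {}  # key -> configuration data (stored once)
        self.datasets = {}        # dataset name -> list of configuration keys

    def add(self, dataset, symbols, positions, cell):
        """Register a configuration in a dataset and return its key."""
        key = configuration_key(symbols, positions, cell)
        # setdefault keeps the first stored copy; a repeat adds no new data.
        self.configurations.setdefault(
            key,
            {"symbols": list(symbols), "positions": positions, "cell": cell},
        )
        self.datasets.setdefault(dataset, []).append(key)
        return key


if __name__ == "__main__":
    cell = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
    dimer = [[0.0, 0.0, 0.0], [0.0, 0.0, 2.2]]
    store = ConfigurationStore()
    store.add("dataset_a", ["Mo", "Mo"], dimer, cell)
    store.add("dataset_b", ["Mo", "Mo"], dimer, cell)  # same structure again
    print(len(store.configurations), "stored;",
          sum(len(keys) for keys in store.datasets.values()), "references")
```

The trade-off in this design is that identity is defined by the canonical encoding: anything left out of the hash (here, for example, computed properties or metadata) has to be attached to the reference rather than to the stored configuration.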

The ColabFit framework and KLIFF will be validated on a materials science target application: phase transformations in 2D transition metal dichalcogenides (TMDs), an area with important technological applications in which the PIs have extensive domain expertise. Specifically, a DDIP will be developed to study phase transformations in MoS₂ and Mo₁₋ₓWₓTe₂ systems, with the aim of generating a temperature–composition phase diagram for the latter. The training set for the DDIP will be designed to include configurations that are important for describing the various phases of the TMD and the transformations between them (e.g., configurations along the reaction pathway).