This was part of Materials Informatics: Tutorials and Hands-On

A mathematical analysis of GNoME and other materials databases

Vitaliy Kurlin, University of Liverpool

Thursday, March 14, 2024


A solid crystalline material was traditionally represented by a Crystallographic Information File (CIF) with a unit cell containing a motif of atoms, ions, or molecules, which are periodically repeated in three independent directions. This cell-based representation was highly ambiguous because a unit cell (even a primitive one of minimal volume) can be chosen in infinitely many ways. Crystallography tried to avoid this ambiguity by using a reduced cell, whose best-known example is Niggli's cell. Unfortunately, all reduced cells are discontinuous under almost any noise, which can break the symmetry and arbitrarily scale up a minimal cell.

This ambiguity was recently resolved by continuous invariants that provably distinguish all periodic crystals in general position (NeurIPS 2022) under isometry, which is a composition of translations, rotations, and reflections. All 660+ thousand real materials (with no disorder) in the Cambridge Structural Database were distinguished through 200+ billion pairwise comparisons within two days on a modest desktop. The unexpected pairs of geometric duplicates, where one atom was replaced with a different one without changing atomic coordinates, are being investigated by five journals for data integrity.

In November 2023, Nature published two papers reporting Google's GNoME database of 384+ thousand predicted materials, of which 41 were claimed to be synthesized in the Berkeley lab. The question "whether an AI-controlled lab assistant actually made any novel substances" is discussed at

This talk will report pairwise comparisons across several databases including GNoME, which turned out to contain thousands of identical CIFs. The relevant papers are linked at
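The ambiguity of cell-based representations described in the abstract can be illustrated with a minimal sketch. It is not the invariants presented in the talk. It only shows, in 2D and with integer coordinates (both simplifying assumptions), that two different primitive cells with different cell parameters can generate the same lattice. The function names `cell_parameters`, `cell_area`, and `same_lattice` are hypothetical and chosen for this illustration.

```python
import math
from fractions import Fraction


def cell_parameters(a, b):
    """Return (|a|, |b|, angle in degrees) for a 2D cell with basis vectors a, b."""
    la, lb = math.hypot(*a), math.hypot(*b)
    cos_angle = (a[0] * b[0] + a[1] * b[1]) / (la * lb)
    return la, lb, math.degrees(math.acos(cos_angle))


def cell_area(a, b):
    """Area of the 2D cell spanned by a and b (absolute determinant)."""
    return abs(a[0] * b[1] - a[1] * b[0])


def same_lattice(basis1, basis2):
    """Check whether two integer 2D bases generate the same lattice.

    This holds exactly when the change-of-basis matrix has integer
    entries and determinant +1 or -1 (a unimodular matrix).
    """
    a1, a2 = basis1
    det = a1[0] * a2[1] - a1[1] * a2[0]
    if det == 0:
        raise ValueError("basis1 is degenerate")
    rows = []
    for b in basis2:
        # Solve b = m1*a1 + m2*a2 by Cramer's rule with exact rationals.
        m1 = Fraction(b[0] * a2[1] - b[1] * a2[0], det)
        m2 = Fraction(a1[0] * b[1] - a1[1] * b[0], det)
        rows.append((m1, m2))
    if any(m.denominator != 1 for row in rows for m in row):
        return False
    m_det = rows[0][0] * rows[1][1] - rows[0][1] * rows[1][0]
    return abs(m_det) == 1


square = ((1, 0), (0, 1))   # the usual cell of the square lattice
skewed = ((1, 0), (1, 1))   # another primitive cell of the same lattice
doubled = ((2, 0), (0, 1))  # a cell of a strictly coarser lattice

print(cell_parameters(*square), cell_parameters(*skewed))
print(same_lattice(square, skewed), same_lattice(square, doubled))
```

Both `square` and `skewed` have the same minimal area yet different edge lengths and angles, so comparing raw cell parameters would wrongly report two different structures. This is the kind of ambiguity that reduced cells and, more recently, continuous isometry invariants aim to remove.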