Experiments, First-Principles Calculation, and Machine Learning
Conventional molecule design is a time-consuming, laborious, and expertise-dependent task. The most common method is trial and error. Countless material-development experiments are often unavoidable, and the whole process demands a large investment in both time and human power. Historically, the timescale for developing a new material, from laboratory to commercial application, is about 15-20 years [6]. The number of possible small-molecule structures is estimated to be on the order of 10⁶⁰ [7], constituting the well-known chemical space. It is almost impossible to explore this space via the conventional trial-and-error experimental method. First-principles calculation offers material developers another approach that circumvents the arduousness of experimentation. Using quantum mechanics, the physical and chemical properties of molecules can in principle be obtained by solving the Schrödinger equation. The first-principles strategy significantly reduces the cost of developing new molecules, yet the vast chemical space still cannot be explored efficiently, and only a small fraction of possible molecules has been investigated to date.
The concept of machine learning has existed for a long time, since Arthur Samuel coined the phrase "machine learning" in 1959. However, the lack of material data long impeded the development of machine learning in the materials science field. During the past 10 years, a large amount of data, including material structures and their corresponding properties, has been accumulated, especially from first-principles calculations. Hence, the power of machine learning has drawn people's attention away from conventional approaches. In recent years, a large number of studies have been dedicated to applying machine learning to learn the relationship between material structures and their properties from existing data. Gradually, machine learning has become a powerful method for investigating materials at large scale in the initial stage of material design.
Machine Learning Regression Model and Property Predictor
In recent years, machine learning has emerged as a burgeoning method for molecular design. Machine learning models have been used as property predictors that aim to predict molecular properties from structures by learning the implicit relation between them, as shown in Figure 6.3a. Such a model can accurately predict properties for thousands of molecules within minutes, and previous works have successfully integrated this technology into molecule-screening pipelines [8,9]. In a machine learning framework, molecular structures are transformed into a digital representation that serves as input for the machine learning model. Two ideal attributes of a molecular representation are uniqueness and invertibility. Uniqueness means that a specific molecule is represented by only one molecular representation. Invertibility means that a molecular representation can be transformed back into a single specific molecule. There are mainly two types of representation: 3D geometries and 2D molecular graphs. The latter can be further subdivided into four categories: string-based, image-based, tensor, and others. In the following, we briefly describe the extended-connectivity fingerprint (ECFP) [10], simplified molecular-input line-entry system (SMILES) [11], and Coulomb matrix (CM) [8] methods.
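As a minimal illustration of the property-predictor idea, the sketch below predicts a molecular property with a one-nearest-neighbour model over fingerprint vectors, using Tanimoto similarity. The fingerprints and property values are hypothetical toy data, and the similarity-based model is only one simple choice of learner, not a method prescribed by the works cited above.

```python
# Minimal sketch of a fingerprint-based property predictor:
# 1-nearest-neighbour regression under Tanimoto similarity.
# All fingerprints and property values below are hypothetical toy data.

def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprints."""
    on_a = {i for i, bit in enumerate(a) if bit}
    on_b = {i for i, bit in enumerate(b) if bit}
    union = on_a | on_b
    return len(on_a & on_b) / len(union) if union else 1.0

def predict(query_fp, training_set):
    """Predict a property as that of the most similar training molecule."""
    return max(training_set, key=lambda item: tanimoto(query_fp, item[0]))[1]

# toy training data: (fingerprint, property value)
training = [
    ([1, 0, 1, 1, 0, 0], 1.2),
    ([0, 1, 0, 0, 1, 1], 3.4),
    ([1, 1, 1, 0, 0, 0], 2.1),
]

print(predict([1, 0, 1, 0, 0, 0], training))  # → 1.2 (closest to first entry)
```

In practice, the nearest-neighbour rule would be replaced by a trained regressor, but the pipeline shape is the same: representation in, property out.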
The ECFP is a topological fingerprint originally used for molecular characterization, substructure search, and similarity analysis. Nowadays, it is also adopted for machine learning and statistical analysis of molecules. It basically has four steps: (a) assign each atom an identifier, (b) augment each atom identifier with information from the atom's neighborhood, (c) delete duplicated substructures, and (d) hash the list of identifiers into a fixed-length bit vector. The ECFP does not have the attribute of invertibility: the divided molecular substructures cannot be recombined exactly into the original molecule. Furthermore, because of the fixed-length bit vector, each bit may represent multiple substructures, which often complicates analysis. It is worth mentioning that each binary bit originally indicates only whether a substructure exists in a molecule; here, the bit is replaced with the number of occurrences of that substructure in the molecule, giving the so-called ECFPNUM. The ECFPNUM is used for molecular representation in this study since it is more suitable for property prediction than the ECFP.
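The four steps above can be sketched in miniature on a hand-written molecular graph. The graph (ethanol's heavy atoms) and the hashing scheme below are illustrative assumptions, not RDKit's exact ECFP algorithm; in particular, the sketch keeps occurrence counts in the spirit of ECFPNUM rather than implementing step (c)'s structural deduplication in full.

```python
import hashlib
from collections import Counter

# Toy sketch of the ECFP/ECFPNUM idea on a hand-coded molecular graph.
atoms = ["C", "C", "O"]              # heavy atoms of ethanol
bonds = {0: [1], 1: [0, 2], 2: [1]}  # adjacency list

def stable_hash(s):
    """Deterministic string hash (Python's built-in hash() is salted)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def ecfp_counts(atoms, bonds, radius=2, n_bits=16):
    ids = [stable_hash(a) for a in atoms]        # (a) initial identifiers
    counts = Counter(ids)
    for _ in range(radius):                      # (b) fold in neighbor ids
        ids = [stable_hash(str((ids[i],
                                tuple(sorted(ids[j] for j in bonds[i])))))
               for i in range(len(atoms))]
        counts.update(ids)
    # (c) real ECFP removes structurally duplicated identifiers here
    fp = [0] * n_bits                            # (d) fold into fixed length,
    for ident, c in counts.items():              #     keeping counts (ECFPNUM)
        fp[ident % n_bits] += c
    return fp

print(ecfp_counts(atoms, bonds))
```

Note how the two chemically equivalent initial "C" identifiers land in the same bit with count 2, which is exactly the information a plain binary ECFP would discard.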
SMILES is a molecular representation in the form of a line notation that describes molecular structure using ASCII characters. For example, benzene is denoted in SMILES as c1ccccc1. It is the most popular representation in the field of machine learning, since it follows a particular grammar syntax and can be directly applied to natural language processing (NLP) models. In practice, because many machine learning algorithms cannot process characters (strings) as input directly, SMILES has to be converted into numeric form. A standard way to do this is to first treat every distinct character as a token. Then, for a given molecule, every character of its SMILES is converted into a bit vector over the token set, and the bit vectors are concatenated in the order the characters appear in the SMILES to form a binary matrix that can serve directly as machine learning input. This scheme is known as one-hot encoding. Invertibility is a main advantage of SMILES, since the one-hot encoded representation can be converted back to the original molecule directly. At the same time, SMILES also suffers a drawback: one molecule can have multiple SMILES representations. This non-uniqueness stems from the fact that an arbitrary starting atom in a molecule can be used to construct its SMILES. Some cheminformatics packages, such as RDKit [12], provide functions to canonicalize SMILES. However, Bjerrum et al. argue that a latent space created from canonical SMILES may be problematic, since only the specific grammar syntax has been learned instead of the general underlying rules of molecular structure [13].
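The one-hot encoding scheme just described can be sketched as follows. The character vocabulary below is a small hypothetical token set chosen for the example; a real encoder would build it from the whole training corpus.

```python
# Minimal sketch of one-hot encoding a SMILES string.
VOCAB = ["c", "1", "C", "O", "(", ")", "="]   # toy token set (assumed)
INDEX = {ch: i for i, ch in enumerate(VOCAB)}

def one_hot(smiles):
    """Return a len(smiles) x len(VOCAB) binary matrix."""
    matrix = []
    for ch in smiles:
        row = [0] * len(VOCAB)
        row[INDEX[ch]] = 1        # exactly one bit set per character
        matrix.append(row)
    return matrix

def decode(matrix):
    """Invertibility: map the matrix back to the SMILES string."""
    return "".join(VOCAB[row.index(1)] for row in matrix)

m = one_hot("c1ccccc1")           # benzene
print(decode(m))                  # → "c1ccccc1"
```

The round trip through `decode` demonstrates the invertibility property, while the fact that `C1=CC=CC=C1` would encode to a different matrix for the same molecule illustrates the non-uniqueness problem.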
The CM is a molecular representation that describes the electrostatic interaction between atoms in a molecule. It is calculated by Equation (6.1):

$$C_{ij} = \begin{cases} 0.5\,Z_i^{2.4}, & i = j \\[4pt] \dfrac{Z_i Z_j}{\left|\mathbf{R}_i - \mathbf{R}_j\right|}, & i \neq j \end{cases} \qquad (6.1)$$

where $Z_i$ is the nuclear charge of atom $i$ and $\mathbf{R}_i$ is its Cartesian coordinate.
The diagonal elements are a polynomial fit of the potential energy of each atom itself (self-energy), while the off-diagonal elements correspond to the Coulomb interaction energy between pairs of different atoms in the molecule. The CM is computed directly from nuclear charges and atomic coordinates, the latter often obtained from quantum mechanical geometry optimization. Through machine learning, the correlation between a molecule's CM and its properties can be learned, and the learned models have exhibited a strong ability to predict molecular electronic properties. The CM also suffers from a number of issues. Molecules with different numbers of atoms yield CMs of different sizes; a common solution is to zero-pad the matrices of smaller molecules to a fixed size. Another issue is that different atom-labeling schemes lead to different CMs; a simple solution is to sort the rows and columns of the matrices by a specific atomic property, such as the row norm.
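A small sketch ties these pieces together: the Coulomb matrix of Equation (6.1), zero-padding to a fixed size, and row-norm sorting to remove the labeling ambiguity. The H2 geometry below (bond length 1.4 bohr) is a hypothetical toy input, and padding to a 3-atom size is chosen only for illustration.

```python
import math

def coulomb_matrix(Z, R, size):
    """Coulomb matrix per Equation (6.1), zero-padded to `size` and
    sorted by row norm so the result is invariant to atom ordering."""
    n = len(Z)
    C = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            if i == j:
                C[i][j] = 0.5 * Z[i] ** 2.4    # self-energy fit (diagonal)
            else:
                d = math.dist(R[i], R[j])      # internuclear distance
                C[i][j] = Z[i] * Z[j] / d      # Coulomb repulsion (off-diag)
    # sort rows (and columns identically) by descending row norm
    order = sorted(range(size), key=lambda i: -math.hypot(*C[i]))
    return [[C[i][j] for j in order] for i in order]

# toy example: H2 at a bond length of 1.4 bohr, padded as if for 3 atoms
Z = [1, 1]
R = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)]
cm = coulomb_matrix(Z, R, size=3)
print(cm)
```

The third row and column stay zero (the padding), and permuting the two hydrogen atoms in the input would leave the sorted output unchanged.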