Representing molecules efficiently is essential in cheminformatics, especially when dealing with large datasets and complex analyses. Traditional matrix representations, while structurally informative, consume considerable disk space and are impractical for routine tasks such as compound listing and online queries. This has led to the widespread adoption of linear notations, which encode molecular structures as character strings interpreted by systematic rules and offer a compact, easily manipulated alternative. For instance, representing D-alanine as a Molfile requires 612 bytes, whereas the SMILES and InChI linear notations need just 15 and 59 bytes, respectively. Their conciseness and ease of manipulation, for example on the command line or in spreadsheets, make linear notations indispensable in modern cheminformatics. This guide explores the key linear notations used today, drawing on established research in the field.
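The byte counts quoted above can be checked in a few lines of Python. This is a minimal sketch that assumes one particular isomeric SMILES and the standard InChI for D-alanine; both strings are supplied here for illustration and do not come from the original text. The 612-byte Molfile is not reproduced.

```python
# Assumed line notations for D-alanine (illustrative, not taken from the source text).
D_ALANINE_SMILES = "N[C@H](C)C(=O)O"
D_ALANINE_INCHI = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"


def size_in_bytes(notation: str) -> int:
    """Return the storage size of a notation when encoded as ASCII."""
    return len(notation.encode("ascii"))


smiles_size = size_in_bytes(D_ALANINE_SMILES)
inchi_size = size_in_bytes(D_ALANINE_INCHI)
print(f"SMILES: {smiles_size} bytes, InChI: {inchi_size} bytes")
```

A different but equally valid SMILES string for the same molecule may be a byte or two longer, so the SMILES figure depends on how the string is written.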
The Historical Quest for Universal Chemical Nomenclature
The naming of chemical compounds has evolved alongside scientific understanding and technological capabilities. Early alchemists relied on property-based names like aqua fortis (nitric acid) and sweet oil of vitriol (diethyl ether). However, the 19th century witnessed a growing demand for a systematic nomenclature in organic chemistry. The International Union of Pure and Applied Chemistry (IUPAC) stepped in to standardize chemical nomenclature, detailed extensively in the IUPAC Color Books. This IUPAC nomenclature became universally adopted in scientific literature, patents, and legal frameworks.
Despite its widespread use, IUPAC nomenclature isn’t ideally suited for cheminformatics applications. Recognizing this, in 1949, the IUPAC called for an international standard for electronic chemical notations, outlining 11 crucial “desiderata.” These included ease of use, printability, conciseness, recognizability, unique nomenclature generation, compatibility with inorganic chemistry naming, uniqueness, unambiguous enumeration, machine manipulability, association representation, and handling of partial indeterminates. This marked the beginning of a formal quest for a universally accepted chemical notation system tailored for the digital age.
In 1964, the IUPAC formalized a classification system for notations along two axes. A notation is unique if each compound has exactly one notation and non-unique if a compound can have several; it is unambiguous if each notation represents only the original compound and ambiguous if one notation can represent multiple compounds. This classification provides a framework for evaluating and comparing the chemical notations discussed throughout this guide.
While several notations were proposed to the IUPAC, the Dyson cyphering and the Wiswesser Line Notation (WLN) emerged as the frontrunners. Detailed descriptions of the other proposals can be found in key publications by Wiswesser and Gelberg, which chronicle the evolution of chemical notations up to 1984. Although the IUPAC adopted the Dyson cyphering in 1961, its complex rules and its incompatibility with standard typewriters and punched-card machines hindered its uptake. The WLN, developed in 1949, proved more practical and gained more traction within the scientific community. Both the WLN and the IUPAC-Dyson notation have since largely fallen out of favor, but their competition highlights the technological and usability considerations that shaped the search for a universal chemical notation.
Contemporary Chemical Notations: Revolutionizing Molecular Representation
Simplified Molecular Input Line Entry System (SMILES)
The Wiswesser Line Notation (WLN) demanded significant expertise and familiarity with its intricate rules. In response to this, a more intuitive system, the Simplified Molecular Input Line Entry System (SMILES), was developed in 1988 by Weininger and colleagues. SMILES rapidly became the most popular linear notation and remains so today. It was integrated into the Daylight Chemical Information Systems toolkit, which continues to maintain and develop it.
A SMILES representation is non-unique yet unambiguous. It is generated by assigning a number to each atom in a molecule and then traversing the molecular graph according to this numbering; RDKit, for example, uses a depth-first search for this traversal. Because the atom numbering is flexible, the same molecule can be written as many different SMILES strings. This property is leveraged for data augmentation through enumerated or randomized SMILES, which are generated by randomly selecting the starting atom while keeping the depth-first traversal, so that each choice yields a different atom ordering and thus a different SMILES string. Only the starting point is random: the traversal itself remains systematic and is not a random search process.
To address the issue of multiple SMILES for the same molecule, canonicalization methods have been developed to generate a unique SMILES representation. Figure 1 illustrates the difference between canonical and randomized SMILES.
Fig. 1 Canonical (a) and randomized (b) SMILES representations of aspirin. Both panels use depth-first search and differ only in the starting node of the graph traversal. Numbers indicate the traversal order, starting with node 1, and green arrows show the traversal path. (a) shows a canonical SMILES for aspirin; (b) shows an alternative atom ordering that yields a different SMILES string for the same molecule. Adapted from David et al.
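The effect of the starting atom can be reproduced with a toy depth-first writer. The sketch below is not RDKit's implementation: it assumes a hand-built graph of acetic acid with implicit hydrogens, handles only acyclic molecules (no ring closures, aromaticity, or stereochemistry), and keeps the neighbor order fixed so that the starting atom is the only thing that varies. All names in it are hypothetical.

```python
from typing import Dict, List, Tuple

# Acetic acid as a hand-built graph (hydrogens implicit): C0-C1, C1=O2, C1-O3.
ATOMS: List[str] = ["C", "C", "O", "O"]
BONDS: Dict[Tuple[int, int], str] = {(0, 1): "", (1, 2): "=", (1, 3): ""}


def neighbors(atom: int) -> List[Tuple[int, str]]:
    """Return (neighbor index, bond symbol) pairs for an atom."""
    result = []
    for (a, b), symbol in BONDS.items():
        if a == atom:
            result.append((b, symbol))
        elif b == atom:
            result.append((a, symbol))
    return result


def write_smiles(atom: int, parent: int = -1) -> str:
    """Depth-first SMILES writer for acyclic graphs (no ring closures)."""
    children = [(n, s) for n, s in neighbors(atom) if n != parent]
    text = ATOMS[atom]
    for position, (child, bond) in enumerate(children):
        branch = bond + write_smiles(child, atom)
        is_last = position == len(children) - 1
        # Every child except the last is written as a parenthesised branch.
        text += branch if is_last else f"({branch})"
    return text


# One SMILES string per possible starting atom.
enumerated = {start: write_smiles(start) for start in range(len(ATOMS))}
for start, smiles in enumerated.items():
    print(start, smiles)
```

Each of the four starting atoms produces a different string for the same molecule, which is the non-uniqueness that canonicalization is meant to remove.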
Initially, SMILES lacked stereochemical encoding. Isomeric SMILES, an extension addressing this limitation, was later introduced and is now the standard SMILES format in many software packages. Isomeric SMILES can represent configurations around double bonds (Z/E), tetrahedral centers, and other chiral centers, although support for rarer chiral types may vary. However, SMILES struggles with structures not easily represented by molecular graphs, such as organometallic compounds and ionic salts.
SMILES uses square brackets for atoms whose total bond order deviates from the standard valence. Lowercase symbols denote aromatic atoms, though some cheminformatics software has limitations with “extra” bonds on aromatic atoms. ChemAxon Extended SMILES (CXSMILES) overcomes some of these limitations by storing special features after the SMILES string. Because the extension is separated from the SMILES string by a space or tab, it can be ignored when only standard SMILES parsing is desired. CXSMILES can store various fields, including fragment grouping for ions and salts, ligand order, and coordinate bonds for organometallic compounds. A coordinate bond is written as a single bond in the SMILES part, and the CXSMILES extension identifies it as a coordinate bond.
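Because the extension follows the SMILES string after whitespace, separating the two is a simple string operation. The sketch below assumes the extension block is wrapped in vertical bars and uses `f:0.1` as an example fragment-grouping field for sodium chloride; both details are assumptions based on the ChemAxon format rather than statements from this guide. The function name is hypothetical, and the fields inside the extension are not parsed.

```python
from typing import Tuple


def split_cxsmiles(line: str) -> Tuple[str, str]:
    """Split a CXSMILES line into (core SMILES, raw extension text)."""
    parts = line.strip().split(None, 1)  # split on the first run of spaces or tabs
    if not parts:
        return "", ""
    core = parts[0]
    extension = parts[1].strip() if len(parts) > 1 else ""
    # Assumption: the extension block is delimited by vertical bars.
    if len(extension) >= 2 and extension.startswith("|") and extension.endswith("|"):
        extension = extension[1:-1]
    return core, extension


core, extension = split_cxsmiles("[Na+].[Cl-] |f:0.1|")
print(core)       # the part a plain SMILES parser would read
print(extension)  # the extra CXSMILES information
```

A reader that only understands standard SMILES can keep `core` and discard `extension`.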
The OpenSMILES specification, developed in 2007, aimed to standardize SMILES and clarify ambiguous interpretations within Daylight’s SMILES system. A key challenge with Daylight SMILES is its proprietary canonicalization algorithm, leading to variations in implementations. In 2012, a novel open-source method for generating canonical SMILES was developed, utilizing canonical labels from InChI representations. These “universal” SMILES aim to improve interoperability and comparison between chemical models across different cheminformatics toolkits.
SMILES Arbitrary Target Specification (SMARTS)
SMARTS (SMILES Arbitrary Target Specification) is an extension of SMILES designed for substructure searching. While SMILES uses symbols for atoms and bonds to define molecular connectivity, SMARTS employs a broader set of symbols for more generalized molecular graph specifications. Analogous to regular expressions in computer science, SMARTS describes sets of molecules that vary at specific atom or bond positions. It supports logical operators (“AND”, “OR”, “NOT”) and can specify isotopes and bond types (aromatic, aliphatic). Recursive SMARTS enables detailed descriptions of atom environments, such as ortho, meta, and para substitution patterns in arenes. Almost all SMILES strings are also valid SMARTS, but the reverse is not true, and decoding a SMARTS string as SMILES will generally not yield the intended pattern.
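The logical operators can be illustrated with a toy evaluator for the contents of a single bracketed atom expression. This sketch is not a SMARTS parser: it assumes only element symbols (uppercase for aliphatic, lowercase for aromatic) and the `*` wildcard as primitives, with `!` for NOT, `&` for high-precedence AND, `,` for OR, and `;` for low-precedence AND. Implicit AND, bond expressions, and recursive SMARTS are left out, and the `Atom` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    element: str            # element symbol, e.g. "C", "N", "O"
    aromatic: bool = False  # True for an atom in an aromatic ring


def match_primitive(primitive: str, atom: Atom) -> bool:
    """Match one primitive: '*', an aliphatic symbol ('C'), or an aromatic one ('c')."""
    if primitive.startswith("!"):
        return not match_primitive(primitive[1:], atom)
    if primitive == "*":
        return True
    if primitive.islower():
        return atom.aromatic and atom.element.lower() == primitive
    return (not atom.aromatic) and atom.element == primitive


def match_atom_expression(expression: str, atom: Atom) -> bool:
    """Evaluate ';' (low-precedence AND), then ',' (OR), then '&' (high-precedence AND)."""
    return all(
        any(
            all(match_primitive(primitive, atom) for primitive in option.split("&"))
            for option in clause.split(",")
        )
        for clause in expression.split(";")
    )


aliphatic_n = Atom("N")
aromatic_c = Atom("C", aromatic=True)
print(match_atom_expression("C,N", aliphatic_n))    # aliphatic carbon OR nitrogen
print(match_atom_expression("!C;!c", aromatic_c))   # NOT carbon of either kind
```

The second expression shows why SMARTS and SMILES differ in meaning: in SMARTS, `C` matches only aliphatic carbon, so excluding every carbon requires excluding `c` as well.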
International Chemical Identifier (InChI)
The IUPAC International Chemical Identifier (InChI) is a prime example of an open-source canonical notation. Introduced in 2006 by NIST under IUPAC auspices, InChI is a freely available standard for representing chemical structures. An InChI string is organized into layers, including the Main, Charge, Stereochemical, and Isotopic layers, each with sublayers. The Main layer, for instance, comprises the Chemical formula, Atom connections, and Hydrogen atoms sublayers, as depicted in Figure 2.
Fig. 2 InChI notation of aspirin. Red letters indicate the standard InChI notation prefix. “1” denotes the InChI version number, and “S” signifies a standard InChI. Blue slashes are delimiters between layers.
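The layered structure can be made concrete by splitting an InChI string at its slash delimiters. This is a minimal sketch, not the official InChI software: the aspirin string and the mapping from prefix letters to layer names are supplied here as assumptions, and the function name is hypothetical.

```python
from typing import Dict

# Prefix letters of common InChI sublayers (names chosen for this sketch).
LAYER_NAMES: Dict[str, str] = {
    "c": "connections",
    "h": "hydrogens",
    "q": "charge",
    "p": "protons",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopes",
}

# Assumed standard InChI for aspirin, used only as example input.
ASPIRIN_INCHI = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"


def split_inchi(inchi: str) -> Dict[str, str]:
    """Split an InChI string into its version, formula, and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, *segments = inchi[len(prefix):].split("/")
    layers = {"version": version}
    for index, segment in enumerate(segments):
        if index == 0 and not segment[:1].islower():
            # The formula sublayer is the only one without a prefix letter.
            layers["formula"] = segment
        else:
            name = LAYER_NAMES.get(segment[:1], segment[:1])
            layers[name] = segment[1:]
    return layers


for name, value in split_inchi(ASPIRIN_INCHI).items():
    print(f"{name}: {value}")
```

For aspirin only the three Main-layer sublayers appear, because the molecule has no charge, stereocenters, or isotopic labels to record.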
InChIKey, a hashed version of InChI, is used for web and library searching. The InChIKey’s first block represents the molecular skeleton, and the second block encodes isomerism. InChIKeys are designed to be unique representations of their parent InChI. However, InChIKey collisions, where a single InChIKey maps to multiple InChIs, can occur, albeit rarely. Unlike SMILES, InChIs are not always decodable back to the original molecular graph. SMILES retains the advantage of being more human-readable. For in-depth information on InChI applications and algorithms, refer to the works of Heller and Warr.
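The two-block design can be seen by comparing the keys of a pair of stereoisomers. The sketch below assumes the usual standard InChIKey layout (14 letters, then 8 letters followed by a standard flag and a version letter, then one protonation letter) and uses InChIKeys for L- and D-alanine that are quoted here as assumptions for illustration. The function names are hypothetical.

```python
import re
from typing import Dict

# Assumed layout: 14-letter skeleton hash, 8-letter hash of the remaining layers,
# a standard/non-standard flag (S or N), a version letter, and a protonation letter.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")

# Assumed InChIKeys, used only as example input.
L_ALANINE_KEY = "QNAYBMKLOCPYGJ-REOHCLBHSA-N"
D_ALANINE_KEY = "QNAYBMKLOCPYGJ-UWTATZPHSA-N"


def split_inchikey(key: str) -> Dict[str, str]:
    """Split an InChIKey into its named parts, or raise ValueError."""
    match = INCHIKEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, detail, standard_flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,
        "detail": detail,
        "standard_flag": standard_flag,
        "version": version,
        "protonation": protonation,
    }


def same_skeleton(key_a: str, key_b: str) -> bool:
    """True when two keys share the first (connectivity) block."""
    return split_inchikey(key_a)["skeleton"] == split_inchikey(key_b)["skeleton"]


print(same_skeleton(L_ALANINE_KEY, D_ALANINE_KEY))
```

Matching on the first block alone is a common way to search for a compound regardless of its stereochemistry, while the full key distinguishes the isomers.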
Molecular Descriptors: Encoding Molecular Properties
Beyond atom-based notations from which molecular structures can be reconstructed, molecular descriptors encode physicochemical, structural, topological, and electronic properties. These descriptors, categorized into structural keys and hashed fingerprints, are unique but ambiguous notations: a given molecule always yields the same descriptor, but different molecules can share one. They are widely used in cheminformatics. A comprehensive treatment of descriptors is beyond the scope of this guide and would warrant a dedicated review. Software such as Dragon can calculate thousands of descriptors, illustrating their breadth and diversity.
Structural Keys
Structural keys are bit strings indicating the absence (0) or presence (1) of specific chemical groups. MACCS keys and CATS are prominent examples.
- MACCS Keys: MACCS (Molecular ACCess System) keys, also known as MDL keysets, are frequently used in similarity searching. Each bit in a MACCS key represents a specific structural fragment. The 166-bit and 960-bit variants encode 166 and 960 structural fragments, respectively. Different software implementations of MACCS keys may assign different bits to the same substructure.
- CATS: Chemically Advanced Template Search (CATS), a topological pharmacophore descriptor, is designed for scaffold hopping. CATS encodes six pharmacophore point types: H-bond donor, H-bond acceptor, positive charge, negative charge, aromaticity, and lipophilicity.
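The idea of a structural key can be sketched with a tiny, made-up key dictionary. The four entries below are hypothetical and are not the MACCS definitions, and the groups present in each molecule are listed by hand rather than detected by substructure search, which is what a real implementation would do.

```python
from typing import Iterable, List

# Hypothetical four-entry key dictionary; bit position i corresponds to entry i.
KEY_DEFINITIONS: List[str] = ["carboxylic acid", "ester", "aromatic ring", "amine"]


def structural_key(groups_present: Iterable[str]) -> str:
    """Return a bit string with 1 where a predefined group is present, else 0."""
    present = set(groups_present)
    return "".join("1" if name in present else "0" for name in KEY_DEFINITIONS)


# Groups assigned by hand for two example molecules.
aspirin_key = structural_key({"carboxylic acid", "ester", "aromatic ring"})
alanine_key = structural_key({"carboxylic acid", "amine"})
print(aspirin_key, alanine_key)
```

Reordering `KEY_DEFINITIONS` changes which bit stands for which group, which is why keys produced by different implementations cannot be compared bit by bit without knowing their definitions.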
Hashed Fingerprints
Chemical fingerprints are ordered vectors encoding physicochemical or structural properties. Hashed fingerprints differ from structural keys in that their features are generated directly from the molecule, whereas keys rely on pre-defined patterns. The length of a hashed fingerprint is fixed in advance, and a hash function assigns each molecular pattern to a bit; because distinct patterns can be hashed to the same bit, a set bit does not identify a unique pattern. Daylight fingerprints, typically 512, 1024, or 2048 bits long, are topological (path-based) fingerprints that encode the connectivity paths within a molecule up to a certain length. Circular fingerprints, such as Extended Connectivity Fingerprints (ECFP), represent chemical structures by atom neighborhoods and are widely used in Quantitative Structure-Activity Relationship (QSAR) analysis. ECFPs, based on the Morgan algorithm, encode the environment of each heavy atom in circular layers up to a defined diameter.
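A path-based hashed fingerprint can be sketched in pure Python. This is a simplified illustration in the spirit of path fingerprints, not the Daylight algorithm: it assumes hand-built graphs of ethanol and 1-propanol with implicit hydrogens, ignores bond orders, and uses SHA-1 as an arbitrary choice of hash function. All names are hypothetical.

```python
import hashlib
from typing import Dict, List, Set, Tuple

Graph = Tuple[List[str], Dict[int, List[int]]]  # (atom symbols, adjacency list)

ETHANOL: Graph = (["C", "C", "O"], {0: [1], 1: [0, 2], 2: [1]})
PROPANOL: Graph = (["C", "C", "C", "O"], {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]})


def enumerate_paths(graph: Graph, max_atoms: int) -> Set[str]:
    """Collect every simple path of up to max_atoms atoms as a symbol string."""
    symbols, adjacency = graph
    found: Set[str] = set()

    def extend(path: List[int]) -> None:
        labels = [symbols[i] for i in path]
        # A path and its reverse describe the same pattern; keep one spelling.
        found.add("-".join(min(labels, labels[::-1])))
        if len(path) < max_atoms:
            for neighbor in adjacency[path[-1]]:
                if neighbor not in path:
                    extend(path + [neighbor])

    for start in range(len(symbols)):
        extend([start])
    return found


def bit_index(pattern: str, n_bits: int) -> int:
    """Hash a pattern string to a bit position in a fixed-length vector."""
    digest = hashlib.sha1(pattern.encode("ascii")).digest()
    return int.from_bytes(digest[:4], "big") % n_bits


def hashed_fingerprint(graph: Graph, max_atoms: int = 3, n_bits: int = 64) -> List[int]:
    """Set one bit per enumerated path; distinct paths may share a bit."""
    bits = [0] * n_bits
    for pattern in enumerate_paths(graph, max_atoms):
        bits[bit_index(pattern, n_bits)] = 1
    return bits


ethanol_fp = hashed_fingerprint(ETHANOL)
propanol_fp = hashed_fingerprint(PROPANOL)
print(sorted(enumerate_paths(ETHANOL, 3)))
print(sum(ethanol_fp), sum(propanol_fp))
```

Every path found in ethanol also occurs in 1-propanol, so every bit set for ethanol is also set for 1-propanol. Shrinking `n_bits` makes collisions between different paths more likely.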
The classification of fingerprints as chemical “notations” is debated. Nevertheless, chemical fingerprints are crucial in cheminformatics and drug discovery. They provide a rapid conversion from molecular graphs to vector representations suitable for numerical models like QSAR models. Fingerprints are versatile and can also encode physicochemical properties as integers (e.g., hydrogen count) or floats (e.g., molecular weight).
In conclusion, this guide provides a foundational understanding of chemical notations, essential tools in the field of cheminformatics and drug discovery. From linear notations like SMILES and InChI to molecular descriptors and fingerprints, these methods enable efficient representation and analysis of chemical structures, driving advancements in pharmaceutical research and beyond.