The Limits of Fold Space
Several key observations about the nature of proteins are in order. According to protein structure classification schemes such as CATH (Sillitoe et al. 2015), the approximately 100,000 experimentally determined protein structures in the Protein Data Bank (Berman et al. 2000) can be grouped into about 1200 unique structural folds (unique topologies). As more and more structures are solved experimentally, the number of new folds discovered increases very slowly. In fact, according to data from the RCSB, there have been almost no new folds discovered since the last edition of this book in 2008, despite a doubling of the size of the PDB in that time.
Fig. 3.2 Graph showing the number of experimentally-determined protein structures included in the CATH database together with the number of topologies as a function of year. It can be seen that although the number of structures added to CATH is increasing rapidly, the number of new folds has remained static since about 2009
These findings have led to the broad acceptance of the view that there is a finite and relatively small number of folds found in nature (Marsden et al. 2006). There are hundreds if not thousands of examples in the structure database demonstrating that highly similar structures may have radically different sequences. Thus, although it is true that highly similar sequences adopt highly similar structures, highly dissimilar sequences sometimes adopt similar structures as well.
Thus, it appears that any sequence we choose from the database of sequenced genomes has a high probability of adopting a structure we have already seen. The big question is how to determine which of the 100,000 structures is the right template and how to align our sequence to that structure. Fold recognition is concerned with the search for scoring functions that can reliably detect the compatibility of a sequence with a known structure and align them accurately when simple sequence similarity cannot be seen.
Despite the size of sequence space, i.e. the space of all possible protein sequences, the space of protein structures appears considerably smaller. Whether this is related to thermodynamics, the kinetics of folding or to evolutionary selection is difficult to say and beyond the scope of this chapter. However, Magner and coworkers (Magner et al. 2015) have recently proposed an explanation that suggests thermodynamic stability may be the primary driver for this observation. Regardless of the cause, the restricted nature of fold space is a highly fortuitous fact that has been of great benefit in the field of protein structure prediction.
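To give a rough sense of the disparity between sequence space and fold space, the short sketch below counts the possible sequences of a given length. It assumes the standard alphabet of 20 amino acids and compares the result with the figure of about 1200 folds quoted above. The function name and the chain lengths chosen are illustrative only, and the count includes every possible sequence, not only those that fold.

```python
import math

AMINO_ACIDS = 20    # assumption: the standard 20-letter amino acid alphabet
KNOWN_FOLDS = 1200  # approximate number of unique folds quoted in the text


def log10_sequence_count(length: int) -> float:
    """Return log10 of the number of possible sequences of this length (20**length)."""
    return length * math.log10(AMINO_ACIDS)


# Illustrative chain lengths; working in log10 keeps the output readable.
for length in (50, 100, 300):
    exponent = log10_sequence_count(length)
    print(f"length {length}: about 10^{exponent:.0f} possible sequences, "
          f"versus roughly {KNOWN_FOLDS} known folds")
```

Even for a modest chain of 100 residues, the number of possible sequences exceeds the number of observed folds by well over a hundred orders of magnitude, which is the sense in which fold space is described as restricted.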