When Is a Sequence Not a Sequence?

Consider the ubiquitous yellow highlighter, beloved tool of many scholars. If I am reading a paper in hard copy, I always have one nearby. Perhaps I am reading about self-reference and highlight the sequence self-reference each time the author uses it. When I reread the paper and wish to find the passages where self-reference is discussed, I don’t have to match the sequential pattern self-reference word-for-word, letter-for-letter. Why should I, when I can scan for the color yellow, a faster and simpler means of randomly accessing the sequence of interest?

But if I am reading an electronic copy of the same paper, I use the control-F search function to locate each instance of self-reference. The computer scans the entire text for the pattern of zeros and ones that corresponds to self-reference. The computer cannot use a highlighter; the bits stored in memory are impossible to locate based on color. Either there is a precise bit-for-bit match of the sequence or there is not. This is how computers do random access, strictly through sequential pattern-matching, like base readout in the cell.
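The all-or-nothing character of this kind of search can be sketched in a few lines of Python. This is only an illustration of sequential pattern-matching, not a description of how any particular program implements control-F; the function name `find_all` and the sample sentence are hypothetical.

```python
def find_all(text: str, pattern: str) -> list[int]:
    """Return every index at which pattern occurs in text (naive scan)."""
    hits = []
    if not pattern:
        return hits
    # Slide a window across the text one position at a time.
    for start in range(len(text) - len(pattern) + 1):
        # Compare element by element; a single mismatch rejects the window.
        if all(text[start + i] == pattern[i] for i in range(len(pattern))):
            hits.append(start)
    return hits


sample = "Self-reference is hard; self-reference is also fun."
print(find_all(sample, "self-reference"))
```

The scan reports only the second occurrence. The first begins with a capital S, and a capital S is a different sequence element, however similar the two may look to a human reader.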

The ability to locate a short pattern within a longer sequence is crucial to the development and operation of cells and civilizations. We met some of these short sequences in the previous chapter when we discussed the grammar of extension. In the lac operon, for example, the repressor needs to locate and bind to a specific snippet (the operator) in order to disable the pathway. Finding a short, unique pattern among millions of bases in the genome is quite a trick, and it is made possible only by rate independence. Practically speaking, you can locate a short pattern only if the sequence is stable enough to search. In the cell, DNA is a stable molecule; among humans, a written text is more stable than a spoken utterance and therefore much more amenable to random access.

When we use texts in everyday situations, sometimes we are concerned not only with the one-dimensional patterns of the sequences, their base readout, but also with their physical characteristics, their shape readout. Yellow highlighters are one example of shape readout, but underlining, boldface, italics, all-caps, and font size are all used to call our attention to certain sequences. Instead of saying “this is important” or “pay attention to this,” we emphasize a sequence by changing some physical property of the elements themselves.

When a handwriting expert testifies in court, her testimony concerns the shape of the sequences, not their meaning or interpretation. Morse code operators can identify one another through subtle rate-dependent differences in timing, what they call an operator’s fist. When typewriters were ascendant, forensics experts could confirm that a particular document came from a particular typewriter by examining the characteristics of the letters themselves. All of these cases center on physical properties of sequences rather than their meaning or semantics. It is the difference between the sentences “That essayist writes beautifully” and “That calligrapher writes beautifully.”34

It would be a strange computer indeed that slowed down its processing for emphasis when it came to the important part of a computation. However, when we speak, you and I have access to a range of rate-dependent tools to achieve emphasis. We gesture with our hands and bodies, change our tone of voice, pause, or make a face. In most cases we have no choice but to gesture; rate-dependent gestures and rate-independent sequences of speech are fundamentally entangled. Changing the typography of a written sequence is our way of preserving this entanglement in our texts. Computers have no rate-dependent options for emphasis. They are stuck with sequential pattern-matching, with base readout, which they are very good at.

One in ten genes in the human genome encodes a transcription factor.35 The binding target of a transcription factor, like the lac repressor, is a specific short pattern of As, Cs, Ts, and Gs that must be recognized quickly and with precision. Many kinds of enzymes bind to specific snippets of DNA or RNA, not only transcription factors like the lac repressor but also ligases and restriction enzymes and spliceosomes and others. For these and similar white-collar enzymes, write molecular biologists Mair Churchill and Andrew Travers, “the ability to discriminate between a DNA-binding site of a specific sequence and a random sequence is absolutely essential for correct function.”36

When a transcription factor searches through millions of base pairs to locate its target DNA sequence, is this more like a human recognizing a yellow highlight or is it more like a computer searching memory for a specific pattern of bit strings? Does the transcription factor recognize some physical character of the sequence, like its shape, or does it align to it base-by-base? Are we witnessing rate-independent base readout or rate-dependent shape readout?

It may be both. So entangled are these modes that in many cases researchers cannot tell which is dominant. “The role of DNA shape may be biochemically inseparable from base readout,” write molecular biologist Namiko Abe and colleagues. “As DNA shape is a function of the nucleotide sequence, it is difficult to tease apart whether a DNA binding protein favors a particular binding site because it recognizes its nucleotide sequence or, alternatively, structural features of the DNA molecule.”37

Imagine that the sequences CGCG and ATAA were carved out of wood so that you could hold one in each hand. If you closed your eyes and I handed them to you, how would you know which is which? Would you use base readout or shape readout? Perhaps you would respond, “Well, I feel a C and here is a G and here is another C and here is another G, so this one must be CGCG.” That would correspond to base readout. Or you might respond, “Well, all of the letters are round and not pointy, so this one must be CGCG.” That would be shape readout.38 Transcription factors face the same quandary in recognizing short sequences.
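The two answers in the wooden-block experiment can be written out as two small Python functions. This is a toy sketch only. The table `LETTER_SHAPE`, which sorts the carved letters into round and pointy, is my own simplification of the thought experiment and says nothing about the real geometry of DNA; the function names are hypothetical as well.

```python
# Hypothetical coarse "shape" of each carved letter: round or pointy.
LETTER_SHAPE = {"A": "pointy", "T": "pointy", "C": "round", "G": "round"}


def base_readout(block: str, target: str) -> bool:
    """Identify the block by checking every letter, in order."""
    return len(block) == len(target) and all(b == t for b, t in zip(block, target))


def shape_readout(block: str, shape: str) -> bool:
    """Identify the block by one overall property, ignoring letter order."""
    return all(LETTER_SHAPE[b] == shape for b in block)


# Both strategies pick out CGCG and reject ATAA.
for block in ("CGCG", "ATAA"):
    print(block, base_readout(block, "CGCG"), shape_readout(block, "round"))

# Shape readout alone cannot separate blocks that share the same overall shape.
print(shape_readout("GGCC", "round"), base_readout("GGCC", "CGCG"))
```

With only these two blocks on the table, both strategies give the same verdict. The difference shows up when a third block such as GGCC appears: it feels just as round as CGCG, so shape readout accepts it, while base readout does not.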

Two properties influence whether the transcription factor recognizes its target more by shape readout or more by base readout. One is length. The shorter the sequence, the more amenable it is to recognition by shape alone. Most sequences targeted by transcription factors are only dozens of bases long. A sequence 100 bases long would be almost impossible to recognize strictly by its shape. The longer the target pattern, the more likely it can be identified only by its sequence.

The other property is prevalence. A sequence commonly found within the cell and across species in evolutionary time is more likely to be recognized by shape. Most DNA snippets targeted by transcription factors are in widespread use in the cell and have been conserved throughout the history of life. These are the consensus sequences, the closed-class deictic field of the cell. “Evolutionary processes would typically prefer careful perception to complex cognition,” write Rob Withagen and Anthony Chemero.39 Shape readout requires careful perception, base readout complex cognition. As with the yellow highlighter, shape readout is faster and more reliable.

This is also true of natural language. The ten most common verbs in English are irregular (be, have, do, go, say, can, will, see, take, get). Irregularity is linguistic shape readout and regularity is base readout. There are no uncommon irregular verbs; the pressure to regularize, to evolve from shape readout to base readout, is too intense. In fact, this pressure appears lawful: “Irregular verbs regularize at a rate that is inversely proportional to the square root of their usage frequency,” write computational biologist Erez Lieberman-Aiden and colleagues.40
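The arithmetic of the quoted law is easy to work through. The sketch below assumes only the stated proportionality, a rate that varies as one over the square root of usage frequency. Because the quotation gives no constant of proportionality, the code compares two verbs rather than computing an absolute rate; the function name and the frequency values are hypothetical.

```python
import math


def relative_regularization_rate(freq: float, reference_freq: float) -> float:
    """How many times faster a verb regularizes than a reference verb,
    assuming rate is proportional to 1 / sqrt(usage frequency)."""
    if freq <= 0 or reference_freq <= 0:
        raise ValueError("frequencies must be positive")
    return math.sqrt(reference_freq / freq)


# A verb used 100 times less often than the reference verb.
print(relative_regularization_rate(freq=1.0, reference_freq=100.0))  # 10.0
```

On this assumption, a verb used one hundred times less often than another regularizes ten times as fast, which is why the most heavily used verbs are the ones that hold on to their irregular forms.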

Successful regulation depends on allosteric enzymes like transcription factors recognizing specific short sequences within much longer sequences. It is not always clear whether recognition is constrained by the physical shape of the sequence or by the order of bases. But it is clear that this kind of random access is only reliable when the overall sequence structure is stable, like a written text or a DNA molecule.