Designing for Software Instruments: from Gestures, through Mapping, to Sound
When designing a software-based musical instrument, either from scratch or by extending a familiar instrument, choosing its inputs and outputs is relatively easy. The instrument’s inputs are buttons, knobs, tilt sensors, cameras, or even whichever of these is found in a smartphone. Its outputs might just be dictated by what commands can be sent to the instrument’s audio synthesizer, be that a chip or software. Common outputs are pitch (how high a sound is) and loudness. Other aspects of timbre can come from a set of discrete presets, such as the trumpet and harpsichord buttons on a department store keyboard.
At this point, after choosing inputs and outputs, the real work of XD begins. To see what a difference is made by the input-to-output mapping, let’s consider three real-world examples that use the same gesture-inputs and sound-outputs, varying only the mapping.
1. Conventional pickup-and-amplifier instrument, such as an electric guitar or electric violin, plus a tilt sensor. (Duct-tape a smartphone to the instrument.) Feed the pickup and the tilt sensor into a computer (perhaps that same smartphone), which computes sound to send to the amplifier.
Inputs: tilt, pitch, and loudness.
Outputs: pitch and loudness.
Unless otherwise specified, pitch maps to pitch, and loudness to loudness.
High notes are dramatic in everything from Van Halen to Wagner. To make them easier to play while maintaining drama, when the instrument points up, raise the output pitch by an octave or two.
More tilt applies stronger pitch correction, so you can rely on this crutch only in difficult passages.
Ignore tilt, but map pitch to loudness, and loudness to pitch. (Think about that for a moment.) The language that experienced players use to describe this is unprintable.
Tilt crossfades between the previous, swapped mapping (call it brain melt) and conventional pitch-to-pitch, loudness-to-loudness. (Don’t even try to think about this one.)
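The four mappings just listed can be sketched in a few lines of Python. This is only an illustration, not a description of any real instrument: the Frame type, the 36-to-96 semitone range, the normalized 0-to-1 tilt, and the choice of two tilt thresholds for the octave shift are all assumptions made for the example.

```python
from dataclasses import dataclass

PITCH_LO, PITCH_HI = 36.0, 96.0  # assumed playable range, in semitones


@dataclass
class Frame:
    pitch: float     # semitones, MIDI-style note number (may be fractional)
    loudness: float  # 0.0 (silent) to 1.0 (full)
    tilt: float      # 0.0 (level) to 1.0 (pointing straight up)


def octave_boost(f):
    """Pointing the instrument up raises the pitch by one or two octaves."""
    octaves = min(2, int(f.tilt * 3))  # assumed thresholds at 1/3 and 2/3 tilt
    return f.pitch + 12 * octaves, f.loudness


def pitch_correct(f):
    """More tilt pulls the pitch harder toward the nearest semitone."""
    target = round(f.pitch)
    return f.pitch + f.tilt * (target - f.pitch), f.loudness


def swapped(f):
    """Ignore tilt: pitch drives loudness, and loudness drives pitch."""
    out_loudness = (f.pitch - PITCH_LO) / (PITCH_HI - PITCH_LO)
    out_pitch = PITCH_LO + f.loudness * (PITCH_HI - PITCH_LO)
    return out_pitch, min(1.0, max(0.0, out_loudness))


def tilt_crossfade(f):
    """Tilt blends the conventional mapping (tilt 0) into the swapped one (tilt 1)."""
    swapped_pitch, swapped_loudness = swapped(f)
    t = f.tilt
    return ((1 - t) * f.pitch + t * swapped_pitch,
            (1 - t) * f.loudness + t * swapped_loudness)
```

All four functions share the same inputs and the same outputs. Only the body of each function, the mapping, differs.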
The first two mappings make the instrument easier to play. The last two make it disastrously difficult, but not artistically pointless: the equally obstreperous programming language Brainfuck has inspired surprisingly many publications, by art theorists as well as computer scientists. So, mapping affects at least ease of use. Let’s see what else it can affect.
2. Pressure-sensitive tablet computer, scrubbing through an audio recording.
Inputs: pressure and x-y position of the fingertip on the tablet’s surface.
Secondary inputs, computed from the primary inputs: speed of fingertip, and duration (so far) of the current stroke.
Outputs: index into recording (position along the audiotape segment); filter parameters (wah-wah); other effects processing.
Map x to index, pressure to loudness, and y to a filter sweep. The x-mapping works like Laurie Anderson’s tape-bow violin.
Also map tip speed to reciprocal loudness, so faster scrubs are quieter. This emulates how, in a movie, we see a whip pan as being out of focus.
Also map stroke duration to filter sweep, so each stroke sounds like a “wah.”
Holding pattern: map tip speed to index, and ignore all other inputs. Thus, when the tip circles steadily, you hear one fragment of the recording. When the tip speeds up, scrubbing moves forwards in the recording. When it slows down, scrubbing rewinds.
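As a rough sketch of the tablet mappings above, the following Python derives the two secondary inputs from raw samples and then applies each mapping. Everything named here is hypothetical: the StrokeTracker class, the normalized 0-to-1 coordinates, the 1 / (1 + speed) form of "reciprocal loudness" (chosen to avoid dividing by zero), and the wah_seconds and max_speed parameters.

```python
import math


class StrokeTracker:
    """Derives secondary inputs (tip speed, stroke duration) from raw samples."""

    def __init__(self):
        self.prev = None      # (t, x, y) of the previous sample
        self.start = None     # time at which the current stroke began
        self.speed = 0.0      # tablet widths per second
        self.duration = 0.0   # seconds since the stroke began

    def update(self, t, x, y, pressure):
        if pressure <= 0.0:   # finger lifted, so the stroke ends
            self.prev = self.start = None
            self.speed = self.duration = 0.0
            return
        if self.start is None:
            self.start = t
        if self.prev is not None and t > self.prev[0]:
            pt, px, py = self.prev
            self.speed = math.hypot(x - px, y - py) / (t - pt)
        self.prev = (t, x, y)
        self.duration = t - self.start


def basic(x, y, pressure):
    """x to index into the recording, pressure to loudness, y to filter sweep."""
    return {"index": x, "loudness": pressure, "filter": y}


def whip_pan(x, y, pressure, speed):
    """Like basic, but faster scrubs are quieter."""
    out = basic(x, y, pressure)
    out["loudness"] = pressure / (1.0 + speed)  # assumed reciprocal form
    return out


def wah(x, pressure, duration, wah_seconds=0.5):
    """Like basic, but stroke duration (not y) sweeps the filter."""
    return {"index": x, "loudness": pressure,
            "filter": min(1.0, duration / wah_seconds)}


def holding_pattern(speed, max_speed=2.0):
    """Tip speed alone picks the index: steady speed holds one fragment."""
    return {"index": min(1.0, speed / max_speed)}
```

A caller would feed each touch sample to StrokeTracker.update and then pass tracker.speed or tracker.duration to whichever mapping is active.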
These last two mappings use secondary inputs. They demonstrate the antics that become possible when you use not just an input’s raw value, but also that value’s history and how fast that value is changing. The formal name for this value-history-change triplet is proportional, integral, and derivative (PID) control. (This is a fundamental mathematical way of connecting inputs to outputs, such as sensors adjusting a car engine to keep it running smoothly, or accelerometers in a quadcopter adjusting rotor speeds to compensate for wind gusts.) The point here is that a mapping need not be moment to moment, where this input value always yields that output value. Instead, the mapping might determine the output from the trajectory of input values. A similar trajectory-based mapping tool is hysteresis, which behaves like gearwheel backlash or the slop in the middle of a joystick’s range of motion.
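To make the trajectory idea concrete, here is a small Python sketch of both tools. InputTerms tracks an input's value, its running history (a simple sum of value times time step), and its rate of change; Backlash is one common way to model hysteresis as a dead band. Both class names, the rectangular integration, and the slop parameter are assumptions for illustration, not a full PID controller.

```python
class InputTerms:
    """Tracks an input's value, accumulated history, and rate of change."""

    def __init__(self):
        self.value = 0.0     # proportional term: the raw input
        self.history = 0.0   # integral term: running sum of value * dt
        self.change = 0.0    # derivative term: change per second
        self._prev = None

    def update(self, x, dt):
        self.history += x * dt
        self.change = 0.0 if self._prev is None else (x - self._prev) / dt
        self._prev = x
        self.value = x
        return self.value, self.history, self.change


class Backlash:
    """Hysteresis as a dead band: the output moves only once the input
    has pushed through the slop, like play between two gearwheels."""

    def __init__(self, slop, start=0.0):
        self.slop = slop
        self.out = start

    def update(self, x):
        half = self.slop / 2
        if x - self.out > half:      # input pushes the output upward
            self.out = x - half
        elif self.out - x > half:    # input pulls the output downward
            self.out = x + half
        return self.out              # otherwise the output stays put
```

With Backlash, the same input value can yield different outputs depending on the direction from which the input arrived, which is exactly what a moment-to-moment mapping cannot do.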
Now that we’ve seen both playability and input-value trajectories, let’s consider how literal a mapping should be.
3. Room-size optical motion capture, playing only the black keys of five stops of a pipe organ. (Although this looks like a connected environment or smart room, it still behaves like a musical instrument.)
Inputs: x-y-z positions of a few dozen markers on the costumes of dancers (see Figure 11-2).
Secondary inputs: average and spread (mean and standard deviation) of x, y, and z individually.
Outputs: pitch average, pitch spread, loudness of each organ stop.
Map average z (height) to overall loudness. Map x to pitch, in both average and spread. Map average y to a crossfade through the organ stops in a fixed sequence. The audience immediately notices that when everyone is near the floor, it gets quiet; many raised arms make it loud. Next, they see that walking from left to right (x) is like moving up the organ’s keyboard. Finally, they notice the upstage to downstage crossfade.
Within the danceable x-y-z volume, define five subvolumes, possibly overlapping. Map the number of markers in each zone to the loudness of the corresponding organ stop. Map x to pitch as before.
Map spread of y to organ-stop crossfade. Map average x to spread of pitch, and spread of x to average pitch. Map z to loudness as before. (Ignore average y, leaving it free for pure dance with no musical consequences.) Now the audience still detects a strong cause-and-effect, and still feels that the dancers directly affect the music. But the audience isn’t quite sure how. Not many could verbalize what happens on stage: low pitches when everyone’s tightly clumped left-right, high when they’re spread out; different stops depending on upstage-downstage clumping; single pitches at stage left, broad clusters at stage right.
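The three motion-capture mappings above can be sketched as follows. This is a minimal illustration under stated assumptions: marker positions are normalized to 0-to-1 on each axis, outputs are likewise normalized, zones are axis-aligned boxes, and the function names (axis_stats, crossfade, and the three mapping functions) are invented for the example.

```python
from statistics import mean, pstdev


def axis_stats(markers):
    """Average and spread (mean, standard deviation) of x, y, and z."""
    columns = zip(*markers)  # markers are (x, y, z) tuples
    return {axis: (mean(c), pstdev(c)) for axis, c in zip("xyz", columns)}


def crossfade(position, n_stops=5):
    """Loudness per stop; position 0..1 fades through the stops in sequence."""
    p = position * (n_stops - 1)
    return [max(0.0, 1.0 - abs(p - i)) for i in range(n_stops)]


def literal_mapping(markers):
    """Average z to loudness, x to pitch, average y to stop crossfade."""
    s = axis_stats(markers)
    return {"loudness": s["z"][0], "pitch_mean": s["x"][0],
            "pitch_spread": s["x"][1], "stops": crossfade(s["y"][0])}


def zone_mapping(markers, zones):
    """Each stop's loudness is the fraction of markers inside its zone."""
    def inside(m, box):
        return all(lo <= c <= hi for c, (lo, hi) in zip(m, box))
    s = axis_stats(markers)
    stops = [sum(inside(m, z) for m in markers) / len(markers) for z in zones]
    return {"pitch_mean": s["x"][0], "pitch_spread": s["x"][1], "stops": stops}


def oblique_mapping(markers):
    """Swap average and spread of x; spread of y drives the crossfade."""
    s = axis_stats(markers)
    # Values in 0..1 have a spread of at most 0.5, so doubling rescales to 0..1.
    fade = min(1.0, 2.0 * s["y"][1])
    return {"loudness": s["z"][0], "pitch_mean": s["x"][1],
            "pitch_spread": s["x"][0], "stops": crossfade(fade)}
```

The first and third functions read the same statistics and write the same outputs. They differ only in which statistic feeds which output, which is what makes one mapping literal and the other oblique.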
Figure 11-2. Motion-tracked retroreflective balls, worn by a few dancers or many dancers, can be the input gestures for a musical instrument (top: University of Illinois dance faculty Kirstie Simson and Philip Johnston experimenting in the laboratory; bottom: students improvising during a public performance)
In an evening’s performance of several dances, a simple mapping such as crossfade works well early in the program, to ensure that everyone in the audience comprehends that the dancers directly control the music. But mickey-mousing won’t stay captivating all night, so it’s good to finish with less literal mappings. Such a development of mappings, a progression from what is instantly comprehensible to what can be savored longer, also applies outside music and dance. Right after a hard day’s work a pilsner quickly slakes your thirst, but later in the evening it’s nicer to tarry over an aged port. The holy grail of an intuitive interface is better for a 20-second experience (reclining the driver’s seat) than for a 20-hour one (repainting the car). The nonintuitive stick shift may soon be preferred more by videogamers than by commuters. When designing an experience for a specialist, be they seasoned concertgoer, gourmet, car restorer, or videogamer, the experience’s very duration justifies some up-front training cost, that is, conscious reasoning (the antonym of intuition).