Thank you! Seems like that project was incredibly far ahead of its time.
The physical-modelling aspect is super interesting. Does that mean that the similarity in sound to formant-based speech synthesis is because they're both using a sawtooth wave, noise, or other relatively simple sound as the raw input? I always imagined that a physical-modelling speech synthesizer fed by a sawtooth wave would sound more like a vocoder than Votrax or TI LPC output does, but I guess not.
> Does that mean that the similarity in sound to formant-based speech synthesis is because they're both using a sawtooth wave, noise, or other relatively simple sound as the raw input?
Essentially, yes. Both are known as "source-filter" models. A sawtooth, narrow pulse, or impulse wave is a good approximation of the glottal excitation for the source signal, though many articulatory speech models use a more specialized source model that's analytically derived from real waveforms produced by the glottis. The Liljencrants-Fant (LF) derivative glottal waveform model is the most common, but a few others exist.
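To make the "source" half concrete, here's a minimal Python sketch of that kind of crude excitation (sample rate and pitch values are assumed, not from any particular synthesizer); a real articulatory model would swap the sawtooth for an LF-style glottal pulse, but the rest of the source-filter pipeline stays the same:

    import numpy as np

    fs = 16000                            # sample rate in Hz (assumed)
    f0 = 110                              # pitch in Hz (assumed)
    t = np.arange(int(fs * 0.5)) / fs     # half a second of samples

    # Crude glottal excitation: a sawtooth at the pitch period plus a touch
    # of aspiration noise. This is the "source" the vocal-tract filter shapes.
    source = 2.0 * (t * f0 % 1.0) - 1.0
    source += 0.02 * np.random.randn(len(source))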
In formant synthesis, the formant frequencies are known ahead of time and are explicitly added to the spectrum using some kind of peak filter. With waveguides, those formants are implicitly created based on the shape of the vocal tract (the vocal tract here is approximated as a series of cylindrical tubes with varying diameters).
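Here's the explicit-formant version of the "filter" half, as a rough sketch: the source from above is run through a cascade of two-pole resonators centred on assumed formant frequencies for an /a/-like vowel (the frequencies and bandwidths are illustrative). A waveguide/tube model would instead produce comparable resonances implicitly from the tube shape:

    import numpy as np
    from scipy.signal import lfilter

    fs = 16000
    t = np.arange(int(fs * 0.5)) / fs
    source = 2.0 * (t * 110 % 1.0) - 1.0          # sawtooth source, as in the sketch above

    def resonator(x, freq, bw, fs):
        # Simple two-pole resonator: pole radius set by bandwidth, angle by centre frequency.
        r = np.exp(-np.pi * bw / fs)
        theta = 2.0 * np.pi * freq / fs
        b = [1.0 - r]                             # rough gain scaling
        a = [1.0, -2.0 * r * np.cos(theta), r * r]
        return lfilter(b, a, x)

    vowel = source
    for f, bw in [(700, 80), (1200, 90), (2600, 120)]:   # assumed F1-F3 for an /a/-like vowel
        vowel = resonator(vowel, f, bw, fs)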
Human speech production/perception works by articulation changing the shape, and hence the resonant frequencies (formants), of the vocal tract, with our ear/auditory cortex then picking up these changing formants. We're especially attuned to changes in the formants since those correspond to changes in articulation. The specific resonant frequency values of the formants vary from individual to individual and aren't so important.
Similarly, the sound source (aka the voice) for human speech can vary a lot from individual to individual, so it serves more to communicate age/sex, emotion, identity, etc., than actual speech content (formant changes).
The reason articulatory synthesis (whether based on a physical model of the vocal tract, or a software simulation of one) and formant synthesis sound so similar is that both are designed to emphasize the formants (resonant frequencies) in a somewhat overly-precise way, and neither typically does a good job of accurately modelling the voice source and the other factors that would make it sound more natural. The ultimate form of formant synthesis just uses sine waves (not a source + filter model) to track the changing formant frequencies, and is still quite intelligible.
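As a toy illustration of that last point, here's a "sine-wave speech" style sketch: no source or filter at all, just one sine per formant, with the frequencies gliding between assumed values for an /a/-like and an /i/-like vowel. Whatever intelligibility it has comes purely from the formant trajectories:

    import numpy as np

    fs = 16000
    n = int(fs * 0.5)

    def formant_tone(f_start, f_end):
        # One sine wave whose frequency glides along a (linear) formant track.
        f = np.linspace(f_start, f_end, n)
        phase = 2.0 * np.pi * np.cumsum(f) / fs   # integrate instantaneous frequency
        return np.sin(phase)

    # Assumed F1/F2/F3 tracks gliding from /a/-like to /i/-like values.
    sws = (0.5 * formant_tone(700, 300)
           + 0.3 * formant_tone(1200, 2300)
           + 0.2 * formant_tone(2600, 3000))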
The "Daisy" song somehow became a staple for computer speech, and can be heard here in the 1984 DECtalk formant-synthesizer version. You can still pick up DECtalks on eBay - an impressive large VCR-sized box with a 3" 68000 processor inside.