Wednesday 30 July 2014

Rhythmic constraints on stress timing in English

What kind of embodied constraints affect the production of speech? Can we say anything we like when we like, or are there constraints in play that make some things easier than others? This is the question asked in Cummins & Port (1998) which we recently read in lab meeting (with our PhD student Agnes).

Cummins and Port asked participants to produce sentences over and over and examined when during the cycle a certain stress beat occurred. They set it up so that the beat was cued by a beep at a range of points throughout the cycle, but showed that people could reliably place the beat at only 2 or 3 points in the cycle. The big picture result is that speech production is shaped, in part, by the underlying dynamics of production described in terms of the rhythms it is set up to produce.

The nice detail here comes from the theoretical set up and analysis that drives this study. Cummins and Port are directly inspired and guided by work in coordination dynamics. Agnes is interested in this work because she's looking at ways to investigate language and speech using the tools of dynamical systems and embodied cognition - remember, our big pitch is that language is special but not magical and we should be able to study it the way we study, say, rhythmic movement coordination. 

The task was to produce sentences over and over (speech cycling) and match a set timing. Each trial included 14 repetitions of a sentence like 'Big for a duck' (the first two were dropped from analysis). Participants heard two tones, a high and low tone 700ms apart. Their job was to produce the sentence so that 'big' happened on the high tone and 'duck' happened on the low tone. Participants then did 12 'continuation' repetitions with no pacing signal. 

A cycle was defined as running from the onset of 'Big' to the next onset of 'Big' in a participant's speech. The phase of this cycle runs from 0-1. The gap between the low and high tones was varied so that the cycle was 1s - 2.33s in length. This meant the low tone (pacing 'duck') happened at a target phase that varied from 0.3 to 0.7. See Figure 1, a more useful version of their Figure 2.

Figure 1. A cycle is defined as the time between high tones, i.e. the production of ‘big’. Varying the high-low gap means the low tone occurs at different phases (relative times) in the cycle
So the task for people was to produce the word 'duck' 700ms after 'big' but at different locations in the cycle as defined by the two high tones pacing 'big'. The phases defining the low-high gap were changed randomly for each trial with a uniform probability. The question was, would there be any variation in performance when assessed in terms of phase?
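The arithmetic behind the design can be sketched in a few lines of Python. This is only an illustration of the relationship described above, not the authors' analysis code; the function names and the example onset times are hypothetical, and the only number taken from the post is the 700ms tone gap.

```python
TONE_GAP_S = 0.7  # high-to-low tone gap stated in the post (700 ms)


def cycle_duration(target_phase, gap_s=TONE_GAP_S):
    """Cycle length (s) that puts the low tone at target_phase of the cycle."""
    return gap_s / target_phase


def produced_phase(big_onset, duck_onset, next_big_onset):
    """Phase (0-1) of 'duck' within the cycle from one 'Big' onset to the next."""
    return (duck_onset - big_onset) / (next_big_onset - big_onset)


# Target phases at the two ends and the middle of the 0.3-0.7 range.
for phase in (0.3, 0.5, 0.7):
    print(f"target phase {phase:.1f} -> cycle of {cycle_duration(phase):.2f} s")
```

A target phase of 0.7 gives a 1s cycle and a target phase of 0.3 gives a 2.33s cycle, which matches the range quoted above. The same fixed 700ms interval therefore lands at very different relative positions in the cycle.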

The analogy is coordinated rhythmic movement. Any relative phase is technically possible, but people only produce two (0º and 180º) because of the way they couple their limbs (using perceived relative phase, the information for which is the relative direction of motion). Cummins and Port note that this constraint on rhythm production emerges from the coupling of the components into a task-specific device (TSD) and suggest that if rhythmic speech is also the result of a TSD, the way that TSD is built might have identifiable effects constraining the rhythms that it can produce. (They have no theory or model of what this TSD might be yet, so they can't make specific predictions, but that's OK because this paper is the proof of concept work).

The task asks people to produce a variety of phases drawn from a uniform distribution. If people can simply reproduce that uniform distribution, there is no TSD in between the stimulus and response shaping that response. If they can't, then the distributions they do produce will demonstrate the presence of a TSD-type organisation (i.e. lots of components temporarily organised into a lower dimensional synergy with specific dynamical characteristics). Recall that you might want to do this in order to make the whole system something you could control, i.e. to solve the degrees of freedom problem.

Figure 2. Frequency histograms for the phases people produced while trying to produce phases that were uniformly distributed
Figure 2 shows that people could not reproduce the uniform distribution of the stimuli. The data for three subjects showed 3 clear clusters and one showed two clusters (confirmed by more formal analysis later in the paper). The three participants were female musicians; KA was a male non-musician. (Experiment 2 replicated part of this design with female non-musicians, a male non-musician and a male musician. The latter showed 2 clusters, the rest showed 3; this task clearly has space for considerable individual variation that is not entirely accounted for by musical experience).
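To make the logic of this comparison concrete, here is a toy sketch of what "failing to reproduce a uniform distribution" looks like numerically. Everything in it is an assumption made for illustration: the attractor locations (1/3, 1/2, 2/3), the pull strength, the number of trials and the bin count are not taken from the post, the data are generated rather than real, and the chi-square statistic stands in for whatever formal analysis the paper actually used.

```python
def histogram(phases, lo=0.3, hi=0.7, bins=8):
    """Count phases in equal-width bins over [lo, hi]; outliers go in the end bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for p in phases:
        idx = max(0, min(int((p - lo) / width), bins - 1))
        counts[idx] += 1
    return counts


def chi_square_vs_uniform(counts):
    """Pearson chi-square statistic against equal expected counts in every bin."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


# Hypothetical attractors and pull strength, chosen only for this sketch.
ATTRACTORS = (1 / 3, 1 / 2, 2 / 3)
PULL = 0.8


def toy_response(target):
    """Move a target phase most of the way toward its nearest assumed attractor."""
    nearest = min(ATTRACTORS, key=lambda a: abs(a - target))
    return target + PULL * (nearest - target)


# Evenly spread targets stand in for the uniformly distributed stimuli.
targets = [0.3 + 0.4 * i / 79 for i in range(80)]
responses = [toy_response(t) for t in targets]

print("targets  :", histogram(targets), chi_square_vs_uniform(histogram(targets)))
print("responses:", histogram(responses), chi_square_vs_uniform(histogram(responses)))
```

The targets fill every bin roughly evenly, while the toy responses pile up in a few bins and leave the rest empty, so their statistic is far larger. That clumping is the signature Figure 2 is showing.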

Basic finding: Some phases are easier than others, and the clusters reflect attractors in the rhythmic dynamics of speech production in this task.
Figure 3. Plotting target phase vs error for each participant, with data sorted into clusters. 
Figure 3 shows that within each cluster, error gets worse the farther from that cluster's attractor. This is analogous to rhythmic movement coordination again; you can try to produce a mean relative phase of, say, 90º, but you will mostly fail. 

Basic result: People made the fewest errors when the required phase was in the attractor region.
Figure 4. Plotting target phase vs variability and fitting that data with quadratic regression
Figure 4 shows how production variability varied with target phase. This plot is analogous to the famous HKB potential function that describes how the required effort to produce a coordination varies with relative phase. Each plot shows a local minimum in variability that aligns with the attractors; note that as in the HKB model, not all attractors are equally stable. 
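For readers who have not seen it, the HKB potential is usually written V(φ) = -a cos(φ) - b cos(2φ), where φ is relative phase. The sketch below evaluates that function on a grid and picks out its local minima. It illustrates the analogy only and is not a fit to the speech data; the function names, the grid search and the parameter values are illustrative assumptions.

```python
import math


def hkb_potential(phi, a=1.0, b=1.0):
    """HKB potential V(phi) = -a*cos(phi) - b*cos(2*phi), with phi in radians."""
    return -a * math.cos(phi) - b * math.cos(2 * phi)


def local_minima(a=1.0, b=1.0, steps=360):
    """Relative phases (degrees, 0 <= phase < 360) where V is a local minimum on a grid."""
    values = [hkb_potential(2 * math.pi * i / steps, a, b) for i in range(steps)]
    minima = []
    for i in range(steps):
        # Phase is circular, so the neighbours wrap around the ends of the grid.
        before = values[(i - 1) % steps]
        after = values[(i + 1) % steps]
        if values[i] < before and values[i] < after:
            minima.append(360.0 * i / steps)
    return minima


for a, b in ((1.0, 1.0), (1.0, 0.2)):
    depths = {phase: hkb_potential(math.radians(phase), a, b)
              for phase in local_minima(a, b)}
    print(f"a={a}, b={b}: minima and depths {depths}")
```

With a = b = 1 there are minima at 0º and 180º, and the one at 0º is deeper, which is the "not all attractors are equally stable" point. Once b/a drops below 0.25 the 180º minimum disappears and only 0º remains.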

This paper was directly inspired by work in coordination dynamics and applied all the right lessons to the study of speech production. What they found was that people were unable to produce a constant 700ms interval between two stress beats in an English sentence when those beats occurred in the context of rhythmic production. The dynamical device assembled to produce the rhythmical speech imposed constraints on what timings were possible and these constraints affected behavior.

There is no explanation here as to what the dynamic might be or how it's composed or organised, but in principle the task analysis that led to Geoff's model of coordination dynamics could be used here as well. I seem to recall Bob Port mentioning in a talk that he had applied the HKB model to these kinds of data; this would work because the model allows you to a) add terms to include a third attractor (this is how Kelso and Zanone modelled learning) and b) parameterise it so those attractors show up at whatever phase fits the data. However, this approach suffers the same problems as the HKB model: it's purely descriptive, and because it does not include a specification of the actual dynamic at work it could easily lead you to make incorrect predictions. I'll see if I can get Fred Cummins to tell us about more recent work.

I liked this paper a lot; it was rigorous and it did not try to simply jam the HKB model onto their data. Instead, it drew inspiration from the task dynamic approach and tailored that approach to suit the task at hand. It is also a great example of how to study things like speech using the same tools and language as we use to study action more generally.

Cummins, F. & Port, R. (1998). Rhythmic constraints on stress timing in English. Journal of Phonetics, 26, 145-171.


  1. Cool to see such careful reading! Thanks. A big problem in applying HKB too directly to this is that the effectors at work here are nothing like the two hands. The magic of HKB comes from modelling the components (hands) as fairly crude non-linear oscillators (Rayleigh-Van der Pol hybrids, based on how amplitude and frequency co-vary) and then showing that with minimal assumptions about the form of coupling between the hands, the composite system admits of a simplified description, going from 4 degrees of freedom (2 hands, each with phase and rate of change of phase) to a single degree of freedom (phase difference). The dynamics of the composite system are modelled by the well-known potential function, but the important bit is showing how that falls out of the coupling between the hands.

    1. My point about the HKB model was that you, unlike many people, didn't fall into the trap of doing this kind of analysis then feeling obliged to use the HKB as a general model of coordinative phenomena. People reify the HKB and so it was nice to see you not doing it :)

      I do seem to recall Bob mentioning fitting these kinds of data to a version of the HKB which included the next term in the Fourier expansion to allow for three attractors. It was at a talk at IU a long time ago though.

      As you say, the important bit is the details of the coupling which is where Geoff's model beats the HKB hands down. I'm not sure how you'd apply that analysis to this task in practice (it's a higher order of complexity) but in principle you could do it; identify why there are those three attractors rather than any other number of other ones.

  2. That's great work!
    I used a 2-dimensional potential function (2 coupled HKB-like functions) to model a 3rd / 4th attractor in order to understand the so-called allophonic mode of perception in developmental dyslexia, including hysteresis / enhanced contrast effects that change with ageing.

    Beyond the static phoneme boundary

    1. As above: my general complaint about the HKB is that it is descriptive, not explanatory. It fits data, but that's about it; it tells you nothing about the actual underlying composition of the system (unlike this model and the related empirical work).

      Kelso was always very up front about this. He even has a paper in which he lays out his behavioural approach as being completely descriptive on purpose because he thinks that's the best way to get on with the science. People tend to forget, though!

  3. I am not sure I understand what you mean by 'descriptive' vs. 'explanatory' in relation to a theory / model. Whether a model explains anything that makes sense to a scientist is irrelevant for the evaluation of its scientific credibility. We don't know why the universe behaves as described by Quantum Physical theories (10+ interpretations), just that it does so very accurately.

    The parameters of a model that can produce testable predictions and can be fitted to empirical data to corroborate those predictions have to be associated to some structural part of reality. At least, they represent a correlation with measurement outcomes in a specific measurement context (see e.g. other applications, Van Rooij et al, 2013 Modeling dynamics of risky choice )

    We can state that the model has predictive power and is empirically accurate. What it 'means' is a matter of interpretation and adds to the explanatory power of the model / theory. A post-hoc model fit alone cannot convey any scientific knowledge (see e.g. ) the same holds for a model that cannot (yet) produce any testable predictions (e.g. most cosmic string theories).

    Potential theory is used in physics, e.g. the Higgs mechanism (Higgs potential see ), so the model itself is just mathematics, meaningless, its parameters acquire meaning after they have been prospectively associated to reliable observable phenomena.

    Perhaps the qualification has to do with the nature of potential models:
    1. They describe end-state dynamics of continuous processes, i.e. they are not a derivative with respect to time.
    2. The actual differential equations posited to underlie the change processes that generate the end-states (the dynamics as you mention in the post), are not a part of the model, though they do

    The functions I used are basically the Cusp catastrophe describing parameter settings for one DV in which a bifurcation can be observed. The energy states of a recurrent neural network with 2 stable states show similar dynamics (see e.g. ), HKB is derived from oscillatory processes. These are all such fundamental physical change processes that I do not see a problem if their potential function form is used to model end-state dynamics (or order parameter dynamics).

    A more urgent issue is finding models that can deal with coordination across time scales, the dynamic field model is an example, but the data show models need to generate scale-invariant time series across many trials, see: Wijnants, et al. (2012). A trade-off study revealing nested timescales of constraint. Frontiers in physiology, 3, 116.

    Currently attempts to do so need to introduce a 1/f noise component into the model which is not ideal of course.

    I will have a closer look at your model soon, looks very interesting!

  4. I am not sure I understand what you mean by 'descriptive' vs. 'explanatory' in relation to a theory / model....

    The parameters of a model that can produce testable predictions and can be fitted to empirical data to corroborate those predictions have to be associated to some structural part of reality.

    This is it.

    In the context of the HKB, it is literally just a couple of superimposed sine waves. You need two because you have to produce two attractors of different depths to model the data. So it describes the phenomena it was designed to describe, namely coordinated rhythmic movement between two limbs in a person not trained to do anything else.

    Geoff's model is two damped mass springs (matching the known characteristics of rhythmically moving limbs), each coupled together by the perceived phase of the other limb with this modified by the perceived relative phase (both implemented not as phase per se but as the information for phase and relative phase). From this set up, the observed phenomena emerge (whereas in the HKB they are literally programmed into the model).

    So Geoff's model explains why coordination looks the way it does in people; Kelso's only describes that it looks that way under some conditions. The HKB had a prominent failure when Kelso and Zanone (1994) made predictions about learning relative phases other than 90º. They predicted RPs close to 0º would be harder than those close to 180º because the stronger attractor would pull things like 30º harder than 150º. It's actually the reverse of this, because the HKB has the causation backwards: 0º is not stable because there's an attractor there. Rather, there's an attractor there because 0º is stable, and it is stable because that region is so clearly perceived.

    I talk about this some in this paper.