New preprint: The visual cortex extracts spectral fine details from silent speech

A guest blog post by Nina Suess

In our new preprint, we tried to find out what speech-related information can be extracted from silent lip movements by the human brain and how this information is differentially processed in healthy ageing.

It has already been shown that the visual cortex is able to track the unheard speech envelope that accompanies the lip movements. But the acoustic signal itself is much richer in detail, showing modulations of the fundamental frequency and the resonant frequencies. These frequencies are visible in the spectrogram and are crucial for the formation of speech sounds. Recent behavioural evidence shows that modulations of these frequencies (so-called “spectral fine details”) can also be extracted from the observation of lip movements. This raises the interesting question of whether this information is also represented at the level of the visual cortex. We therefore aimed to investigate whether the human cortex can extract these acoustic spectral fine details just from visual speech, and how this ability changes as a function of age.

To answer this question, we presented participants with muted videos of a person speaking and asked them to pay attention to the lip movements. We used intelligibility (forward videos vs. backward videos) to investigate whether the human brain tracks the unheard spectral acoustic modulations of speech, given that only forward speech is intelligible and therefore induces speech-related processing. We calculated coherence between brain activity, the lip movement signal and the omitted acoustic signals (the speech envelope, the fundamental frequency and the resonant frequencies modulated near the lips).
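For readers who would like a concrete picture of how such a coherence analysis can look in code, here is a minimal sketch in Python. The sampling rate, the 1–7 Hz range and the random placeholder signals are illustrative assumptions, not the actual parameters of our pipeline.

```python
# Minimal sketch: coherence between one brain time course and the acoustic
# features that were absent from the muted videos. All signals are random
# placeholders; sampling rate and frequency range are assumptions.
import numpy as np
from scipy.signal import coherence

fs = 150.0                       # common sampling rate after resampling (assumed)
n = int(fs * 60)                 # one minute of placeholder data
rng = np.random.default_rng(0)

brain = rng.standard_normal(n)   # e.g. one visual-cortex source time course
features = {
    "lip aperture": rng.standard_normal(n),   # extracted from the video
    "envelope":     rng.standard_normal(n),   # unheard speech envelope
    "f0":           rng.standard_normal(n),   # unheard fundamental frequency
    "formants":     rng.standard_normal(n),   # unheard resonant frequencies near the lips
}

for name, feat in features.items():
    f, coh = coherence(brain, feat, fs=fs, nperseg=int(4 * fs))
    band = (f >= 1) & (f <= 7)                # illustrative frequency range
    print(f"{name:12s} mean coherence 1-7 Hz: {coh[band].mean():.3f}")
```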

We could identify two main findings:

1) The visual cortex is able to track unheard acoustic information that usually accompanies lip movements

We replicated the findings from Hauswald et al. (2018), showing that the visual cortex is able to track the unheard acoustic speech envelope just by observing lip movements. Crucially, we found that the visual cortex (Figure 1A) is also able to track the unheard modulations of resonant frequencies (or formants) and of pitch (or fundamental frequency) linked to intelligible lip movements (Figure 1B). These results show that unheard spectral fine details (along with the unheard acoustic envelope) are transformed from a merely visual into a phonological representation, which strengthens the idea of the visual cortex as a “supporting” brain region for enhanced auditory speech understanding.

Figure 1

2) Ageing significantly affects the ability to track unheard resonant frequencies

Importantly, only the processing of intelligible unheard resonant frequencies decreases significantly with age, in the visual and also in the cingulate cortex (Figure 2A, 2B and 2D). This is not the case for the processing of the unheard speech envelope, the fundamental frequency or the purely visual information carried by the lip movements. This indicates that ageing particularly affects the ability to derive spectral dynamics in the frequency range of formants that are modulated near the lips. There is a clear difference between younger participants, who distinguish very clearly between intelligible and unintelligible speech (Figure 2C), and older participants, who no longer distinguish between the two conditions.


Figure 2

These results can provide new insights into speech perception under natural conditions. Until now, most research has focused on the decline of auditory speech processing abilities with age, but far less attention has been paid to how visual speech contributes to preserved speech understanding, especially under adverse conditions. Our results fit very well with studies that show a decline of spectral processing with age as a unisensory phenomenon, and we add evidence that this declined processing might also be a multisensory problem, arising from both the auditory and the visual senses.

For questions and comments please email Nina: nina.suess@sbg.ac.at or leave a comment below.

Out now: The brain separates auditory and visual “meanings” of words

In our new article, we tried to find new answers to an old question: Are word meanings “the same” in the brain, whether we hear a spoken word or lip-read the same word by watching a speaker’s face? After more than 4 years of work (I tested the first participant in August 2016), this study now found a great home in the journal eLife.

We asked our participants to do the same task in two conditions: auditory and visual. In the auditory condition, they heard a speaker say a sentence. In the visual condition, they just saw the speaker say the sentence without sound (lip reading). In both conditions, they then chose, from a list of four words, the one they had understood in the sentence.

In the auditory condition, the speech was embedded in noise so that participants would misunderstand words in some cases (on average, they understood the correct word in 70% of trials).
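As an aside for the technically curious, embedding speech in noise at a fixed signal-to-noise ratio takes only a few lines of code. The white noise and the example SNR below are placeholders; in the experiment the level was chosen so that listeners scored about 70% correct.

```python
# Minimal sketch of mixing noise into a speech waveform at a chosen SNR.
# White noise and the -3 dB value are placeholders, not the study's settings.
import numpy as np

def add_noise(speech: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Return speech plus noise scaled to the requested signal-to-noise ratio."""
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(speech.shape)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example with a synthetic tone standing in for a speech recording:
t = np.linspace(0, 1, 16000, endpoint=False)
speech = np.sin(2 * np.pi * 220 * t)
noisy = add_noise(speech, snr_db=-3.0)
```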

In the visual condition, performance was also 70% correct on average. But lip reading skills vary enormously across the population, and this is something we also saw in our data: individual performance in the lip reading task covered the whole possible range (from chance level to almost 100% correct). Needless to say, our participants were all proficient verbal speakers (mostly college students). It was suggested quite some time ago that this variability in lip reading reflects something other than normal speech perceptual abilities. Is it therefore possible that the processing of auditory and visual words is completely different in the brain?

To answer this question, we recorded our participants’ brain activity while they did the comprehension task. We used the magnetoencephalogram (MEG), which detects changes in magnetic fields outside the head that are produced by neural activity.

To analyse the brain’s activity during the perception of auditory and visual words, we used a classification approach: First, we tried to reconstruct which word participants had perceived by comparing their waveform patterns in the brain (stimulus classification, or decoding). Second, we analysed which of the classification patterns we found predicted whether participants actually perceived the correct word.
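To make the two steps a bit more tangible, here is a heavily simplified sketch with made-up data shapes and a plain scikit-learn classifier standing in for the actual MEG decoding pipeline; the variable names, trial counts and the comparison in step 2 are invented for illustration.

```python
# Heavily simplified sketch of the two analysis steps with made-up data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n_trials, n_features = 200, 50                    # trials x (sensors*timepoints), assumed
X = rng.standard_normal((n_trials, n_features))   # brain activity patterns (placeholder)
word = rng.integers(0, 4, n_trials)               # which of four words was presented
understood = rng.integers(0, 2, n_trials)         # 1 = participant chose the correct word

# Step 1: stimulus classification (decoding): reconstruct the word identity
# from the brain patterns, cross-validated across trials.
clf = LogisticRegression(max_iter=1000)
decoded = cross_val_predict(clf, X, word, cv=5)
print("decoding accuracy:", np.mean(decoded == word))

# Step 2: ask whether correct decoding goes together with correct comprehension,
# here simply by comparing behavioural accuracy between the two groups of trials.
hit = decoded == word
print("comprehension when decoded correctly:  ", understood[hit].mean())
print("comprehension when decoded incorrectly:", understood[~hit].mean())
```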

In a nutshell, two main findings emerged:

1. Areas that encode the word identity very well (e.g. sensory areas) often do not predict comprehension. Looking at it from the other angle, the areas that encode the word sub-optimally (e.g. higher order language areas) influence what we actually perceive. This is true for auditory and visual speech.
Areas that encode the stimulus (left) and areas that encode the stimulus but also predict word comprehension (right).

As once pointed out by Hickok & Poeppel, we think that the task we perform is the key that determines the results we get – and which areas are most relevant for our behaviour. In our case, higher-order language areas are most important for comprehension. But if the task was to discriminate speech sounds or lip movements, early sensory areas would probably be more task-relevant.

2. The representations for auditory and visual word identities are largely distinct. They only converge in a couple of areas, situated in the left frontal cortex and the right angular gyrus (green circles below). These areas might therefore hold some kind of a-modal perceived meaning of a word.

Summary topography of areas that predict word comprehension in the auditory and visual (lip reading) conditions.

Previous studies have often looked at brain activation across the brain using fMRI (functional magnetic resonance imaging). Activation means that something is “happening” in a brain area (leading to increased oxygen demands there). These studies usually suggest that the processing of acoustic and visual speech overlap to a large extent.

But the nature of these activations can be unclear. We think that the activation of a general language network could explain such findings, without necessarily representing specific word identities. Moreover, other studies often use categories (for example, buildings vs animals) instead of single word meanings, which could give a different picture.

Overall, our analysis of specific word identities (meanings?) shows that our brain does very different things when we listen to someone speak or when we try to lip read. This could explain why our ability to understand acoustic speech is usually not related to our ability to lip read.


Please note that this is an updated version of an earlier blog post on the preprint of the same study. Data for this study can be found here.

New preprint: The brain separates auditory and visual “meanings” of words

In our new preprint, we tried to figure out whether word meanings are “the same” in the brain, regardless of whether we hear a spoken word or lip-read the same word by watching a speaker’s face.

To answer this, participants did the same task in two conditions: auditory and visual. In the auditory condition, they heard a speaker say a sentence. In the visual condition, they just saw the speaker say the sentence without sound (lip reading). In both conditions, they then chose, from four words, the one they had understood in the sentence.

In the auditory condition, the speech was embedded in noise so that participants would misunderstand words in some cases (on average, they heard the correct word in 70% of trials).

In the visual condition, performance was also 70% correct on average. But lip reading skills vary enormously across the population, and this is something we also saw in our data: performance in the lip reading task covered the whole possible range (from chance level to almost 100% correct). Needless to say, our participants were all proficient verbal speakers (mostly college students). It was suggested quite some time ago that this variability in lip reading reflects something other than normal speech perceptual abilities. Is it therefore possible that the processing of auditory and visual words is completely different in the brain?

To answer this question, we recorded our participants’ brain activity while they did the comprehension task. We used the magnetoencephalogram (MEG), which detects changes in magnetic fields outside the head that are produced by neural activity.

To analyse the brain’s activity during the perception of auditory and visual words, we used a classification approach: First, we tried to reconstruct which word participants had perceived by comparing their waveform patterns in the brain (stimulus classification, or decoding). Second, we analysed which of the classification patterns we found predicted whether participants actually perceived the correct word.

In a nutshell, two main findings emerged:

1. Areas that encode the word identity very well (sensory areas) often do not predict comprehension. Looking at it the other way round, the areas that encode the word sub-optimally (higher order language areas) influence what we actually perceive. This is true for auditory and visual speech.

Figure: Areas that encode the word identity and areas that also predict word comprehension.

As once pointed out by Hickok & Poeppel, we think that the task we perform is the key that determines the results we get – and which areas are most relevant for our behaviour. In our case, higher-order language areas are most important for comprehension. But if the task was to discriminate speech sounds or lip movements, early sensory areas would probably be more task-relevant.

2. The representations for auditory and visual word identities are largely distinct. They only overlap in a small area, comprising the left temporal pole and inferior frontal gyrus (green area in the figure below). Our results therefore suggest that this small area might hold the a-modal perceived meaning of a word.

Figure: Overlap of auditory and visual word representations in the left temporal pole and inferior frontal gyrus.

Previous studies have often looked at brain activation across the brain using fMRI (functional magnetic resonance imaging) data. Activation means that something is “happening” in a brain area. These studies usually suggest that the processing of acoustic and visual speech overlap to a large extent.

But the nature of these activations can be unclear. We think that the activation of a general language network could explain such findings, without necessarily representing specific word identities. Moreover, other studies often use categories (for example, buildings vs animals) instead of single word meanings, which could give a different picture.

Overall, our analysis of specific word identities (meanings?) showed that our brain does very different things when we listen to someone speak or when we try to lip read. This could explain why our ability to understand acoustic speech is usually not related to our ability to lip read.

New preprint on speech tracking in auditory and motor cortices

The tracking of temporal information in speech is frequently used to study speech encoding in dynamic brain activity. Often, studies use traditional, generic frequency bands in their analysis (for example, the delta [1 – 4 Hz] or theta [4 – 8 Hz] bands). However, there are large inter-individual differences in speech rate. For example, audiobooks are typically narrated at around 150 words per minute (2.5 Hz), while the world’s fastest speaker can talk at 637 words per minute (10.6 Hz). We therefore reasoned that speech tracking analyses should take into account the specific regularities (e.g. speech rate) of the stimuli. This is exactly what we did in this study: We extracted the time-scales of phrases, words, syllables and phonemes in our sentences and based our analyses on these stimulus-specific bands.
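As an illustration of what “stimulus-specific bands” means in practice, the following sketch derives a frequency band per linguistic unit from sentence annotations; the durations, unit counts and the simple min–max band definition are invented placeholders, not the values from our stimulus set.

```python
# Illustrative derivation of stimulus-specific bands from sentence annotations.
# The durations and unit counts are invented; the min-max band definition is
# one simple choice among several possible ones.
import numpy as np

sentences = [
    {"duration": 2.8, "phrases": 2, "words": 9,  "syllables": 14, "phonemes": 32},
    {"duration": 3.1, "phrases": 2, "words": 10, "syllables": 16, "phonemes": 36},
]

for unit in ["phrases", "words", "syllables", "phonemes"]:
    # rate in units per second, i.e. the time-scale of that unit in Hz
    rates = np.array([s[unit] / s["duration"] for s in sentences])
    print(f"{unit:9s}: mean rate {rates.mean():4.1f} Hz "
          f"-> analysis band ~{rates.min():.1f}-{rates.max():.1f} Hz")
```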

Previous studies also mainly used continuous speech to analyse speech tracking. This is a fantastic, “real-world” paradigm, but it lacks the possibility to directly analyse comprehension. We therefore played single sentences to our participants and asked them, after each sentence, to indicate which of four words they had heard in the sentence. This way, we obtained a single-trial comprehension measure. We also recorded participants’ brain activity with magnetoencephalography (MEG) and performed our analyses on source projections of brain activity.

We show two different speech tracking effects that help participants comprehend speech, both acting concurrently at time-scales within the traditional delta band: First, the left middle temporal cortex (MTG) tracks speech at the word time-scale, which is probably useful for word segmentation and for mapping sound to meaning. Second, the left premotor cortex (PM) tracks speech at the phrasal time-scale, likely indicating the use of temporal predictions during speech perception.

Figure: Overlap of the speech tracking effects in the left middle temporal and premotor cortices.
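For readers who want to see the general idea in code, here is a minimal sketch of quantifying tracking in one stimulus-specific band (the word time-scale) with a generic mutual-information estimator. The band edges, filter settings, estimator and placeholder signals are illustrative assumptions and not the exact analysis of the manuscript.

```python
# Minimal sketch: speech tracking in one stimulus-specific band (word scale),
# quantified with a generic k-nearest-neighbour mutual-information estimator.
# Signals, band edges and filter settings are placeholders.
import numpy as np
from scipy.signal import butter, sosfiltfilt
from sklearn.feature_selection import mutual_info_regression

fs = 100.0
rng = np.random.default_rng(2)
n = int(fs * 60)
envelope = rng.standard_normal(n)                   # speech envelope (placeholder)
mtg = 0.3 * envelope + rng.standard_normal(n)       # "MTG" source signal (placeholder)

def bandpass(x, lo, hi, fs, order=2):
    sos = butter(order, [lo, hi], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

word_band = (2.0, 4.0)                              # assumed word time-scale
env_w = bandpass(envelope, *word_band, fs)
mtg_w = bandpass(mtg, *word_band, fs)

mi = mutual_info_regression(env_w.reshape(-1, 1), mtg_w, random_state=0)[0]
print(f"word-scale tracking (mutual information): {mi:.3f}")
```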

Previous studies have shown that the motor system is involved in predicting the timing of upcoming stimuli by using its beta rhythm. We therefore hypothesised that a cross-frequency coupling between beta power and delta phase at the phrasal time-scale could drive the effect in the motor system. This is indeed what we found, and this coupling was also directly relevant for comprehension.
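A minimal sketch of such a delta-phase/beta-power analysis, using the generic “mean vector length” coupling measure on a placeholder signal, is shown below; the frequency ranges and the coupling metric are assumptions for illustration and not necessarily those used in the study.

```python
# Minimal sketch of delta-phase / beta-power coupling using the generic
# "mean vector length" measure. Frequency ranges and the placeholder signal
# are assumptions, not the exact settings of the study.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

fs = 200.0
rng = np.random.default_rng(3)
pm = rng.standard_normal(int(fs * 60))     # premotor source signal (placeholder)

def bandpass(x, lo, hi, fs, order=2):
    sos = butter(order, [lo, hi], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

delta_phase = np.angle(hilbert(bandpass(pm, 0.6, 1.3, fs)))   # phrasal time-scale (assumed)
beta_power = np.abs(hilbert(bandpass(pm, 18.0, 25.0, fs)))    # beta band (assumed)

# Mean vector length: how unevenly beta power is distributed over the delta cycle.
mvl = np.abs(np.mean(beta_power * np.exp(1j * delta_phase)))
print(f"delta-beta phase-amplitude coupling (MVL): {mvl:.3f}")
```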

By using stimulus-specific frequency bands and single-trial comprehension, we show specific functional and perceptually relevant speech tracking processes along the auditory-motor pathway. In particular, we provide new insights regarding the function and relevance of the motor system for speech perception.

If you would like to read the full manuscript, you can find a preprint on bioRxiv here.