An investigation on the features learned for Automatic Speech Recognition (ASR).
Neural networks consist of layers, each of which holds features that activate in response to certain patterns in the input. For image-based tasks, networks have been studied using feature visualization, which produces interpretable images that stimulate the response of each feature map individually.
In this article, we seek to understand what ASR networks hear. To this end, we consider methods to sonify, rather than visualize, their feature maps. Sonification is the process of using sounds (as opposed to visuals) to apprehend complex structures. By listening to audio, we can better understand what networks respond to, and how this response varies with depth and feature index. We hear that shallow features respond strongly to simple sounds and phonemes, while deep layers react to complex words and word parts. We also observe that shallow layers preserve speaker-specific nuisance variables like pitch, while the deep layers eschew these attributes and focus on the semantic content of speech.
A phoneme is the most basic unit of sound that composes a word (in English there are 44 distinct phonemes).
How to interact with this figure: Click on the same tab to play again and click on the other tabs on the top for more examples.
This work focuses on the Jasper network.
How to interact with this figure: Click on the same tab to play again and click on the other tabs on the top for more examples.
While ASR models are encoder-decoder architectures, we focus only on the encoder. This is because Jasper's decoder is a single convolutional layer that converts the features produced by the encoder into a probability distribution over characters for each frame.
As mentioned before, the output of Jasper is a probability distribution over characters for each frame. For instance, in an utterance of the word "five," many consecutive frames might correspond to the phoneme /f/. As a result, Jasper's raw sequence of per-frame predictions might read "FFFIVE." To avoid this pitfall, Connectionist Temporal Classification (CTC) decoders are used. CTC decoding uses dynamic programming to get around the unknown alignment between the input frames and the output characters: it searches over possible alignments, collapsing repeated characters and blank tokens, to recover the transcript without redundancy.
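As a toy illustration of the collapsing step, here is a minimal greedy-decoding sketch (not Jasper's actual decoder, and greedy rather than the full dynamic-programming search): pick the most likely character per frame, collapse consecutive repeats, then drop the blank token.

```python
import numpy as np

# Minimal greedy CTC decoding sketch (illustrative only).
# `log_probs` has shape (num_frames, num_symbols); the last symbol is the CTC blank.

def greedy_ctc_decode(log_probs: np.ndarray, alphabet: str, blank_id: int) -> str:
    best_per_frame = log_probs.argmax(axis=1)      # most likely symbol per frame
    collapsed, prev = [], None
    for idx in best_per_frame:
        if idx != prev:                            # collapse consecutive repeats
            collapsed.append(idx)
        prev = idx
    return "".join(alphabet[i] for i in collapsed if i != blank_id)  # drop blanks

# Toy example: frames predicting "FFF-IIVVE" collapse to "FIVE".
alphabet = "EFIV"                                  # toy alphabet; index 4 is the blank
frames = [1, 1, 1, 4, 2, 2, 3, 3, 0]               # F F F - I I V V E
log_probs = np.full((len(frames), 5), -10.0)
for t, s in enumerate(frames):
    log_probs[t, s] = 0.0
print(greedy_ctc_decode(log_probs, alphabet, blank_id=4))  # -> "FIVE"
```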
The encoder stacks 10 blocks, each consisting of 5 convolutional sub-blocks (a 1-d convolution layer, batch normalization, ReLU, and dropout). Residual connections add the outputs of previous blocks to the input of the current block's last layer. The complete architecture of the encoder is shown in the following figure. Throughout this article, we refer to sub-blocks and layers interchangeably, and we sonify the ReLU output in each layer.
How to interact with this figure: You can either scroll horizontally or drag the network left/right to see other layers of the network. You can also move your mouse over each layer to see its connections.
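To make the sub-block structure concrete, here is a rough PyTorch sketch; the channel sizes, kernel width, and dropout rate are placeholders rather than Jasper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class JasperSubBlock(nn.Module):
    """Illustrative sketch of one Jasper sub-block: 1-d convolution,
    batch normalization, ReLU, and dropout. Hyperparameters here are
    placeholders, not Jasper's exact values."""

    def __init__(self, in_channels: int, out_channels: int,
                 kernel_size: int = 11, dropout: float = 0.2):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)
        self.bn = nn.BatchNorm1d(out_channels)
        self.relu = nn.ReLU()          # the output we sonify in this article
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, frames)
        return self.drop(self.relu(self.bn(self.conv(x))))

# A block stacks five of these sub-blocks; residual connections add the
# (projected) outputs of earlier blocks to the input of the block's last layer.
```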
Let us dive into the internal representations of Jasper as it processes 10 randomly sampled audio files from the TIMIT dataset.
How to interact with this figure: To select the layer you would like to see, click on the tabs above the plot. You can scroll/drag horizontally to see the plots for other features in the same layer. By default, each plot is only showing the activation patterns for 2 (out of 10) randomly sampled audio files. To see activation patterns for more inputs, click on the legend boxes.
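The activation profiles in these plots can be captured by recording each sub-block's ReLU output with forward hooks. Below is a minimal sketch; for illustration it uses a tiny stack of the sub-blocks sketched above and a random input, whereas in practice `model` would be a pretrained Jasper encoder (e.g., loaded from NVIDIA NeMo) and `features` a mel-spectrogram.

```python
import torch
import torch.nn as nn

# Stand-in model: two of the sub-blocks sketched above. In practice this would
# be the full pretrained Jasper encoder.
model = nn.Sequential(
    JasperSubBlock(64, 128),
    JasperSubBlock(128, 128),
)
features = torch.randn(1, 64, 300)     # stand-in (batch, mels, frames) spectrogram

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach().cpu()   # (batch, channels, frames)
    return hook

# Register a hook on every ReLU so its output is recorded during the forward pass.
handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if isinstance(m, nn.ReLU)]

with torch.no_grad():
    model(features)                    # one forward pass fills `activations`

for h in handles:
    h.remove()
```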
This pattern pervades deeper blocks as well;
the first and last layers of each block maintain similar feature profiles.
As we go deeper, the profiles in the penultimate layer of each block become more sparse. For instance, the following figure shows the activations for the 4th layer in various blocks of the network.
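One rough way to quantify this sparsity is to measure, for each recorded layer, the fraction of ReLU outputs that are exactly zero, using the `activations` dictionary captured with the hooks above.

```python
# Rough sparsity measure: fraction of ReLU outputs that are exactly zero,
# computed from the `activations` dictionary captured above.
sparsity = {
    name: (act == 0).float().mean().item()
    for name, act in activations.items()
}
for name, frac in sorted(sparsity.items()):
    print(f"{name}: {frac:.2%} of activations are zero")
```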
The trend suggests that the shallow layers of the network detect simpler patterns of audio which occur in shorter periods of time and with higher frequency, whereas the deep layers detect more complex patterns of audio which occur over longer periods of time and with lower frequency. We test this hypothesis in the following section by sonifying features from various parts of the network, and categorizing the resulting audios.
A simple way of doing feature visualization in images is to find image patches that maximally activate a particular node or layer.
Below, we pick a feature id (F) in layer (L) and find the 20 half-second audio segments that most strongly activate that feature. Then we characterize that feature at the word level or phoneme level by listing the most common words or phonemes that were present in the audio segments. We call these word-level and phoneme-level sonifications, respectively. The following figure demonstrates word-level sonification for 7 features from shallow layers. The x-axis contains the 5 most activating words for that feature, while the y-axis indicates the number of times each word has appeared in the audio segments. The 20 audio segments can also be played. Reproduce in a Notebook
How to interact with this figure: To see different examples, click on the tabs above the figure. You can also play the audio to hear what the sonification sounds like.
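Under the hood, the selection step might look like the following sketch. It assumes two hypothetical pre-computed helpers: `segment_activations`, mapping each half-second segment to the mean activation of feature F in layer L, and `segment_words`, mapping each segment to the words it contains.

```python
from collections import Counter

def word_level_sonification(segment_activations: dict, segment_words: dict, k: int = 20):
    """Pick the k segments that most strongly activate a feature and count
    the words they contain. `segment_activations` maps a segment id to the
    feature's mean activation over that segment; `segment_words` maps a
    segment id to the list of words overlapping it (hypothetical helpers)."""
    top_segments = sorted(segment_activations,
                          key=segment_activations.get, reverse=True)[:k]
    word_counts = Counter(w for seg in top_segments for w in segment_words[seg])
    return top_segments, word_counts.most_common(5)   # 5 most activating words
```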
All of the previous examples were chosen from shallow layers. We hypothesize that such shallow features respond to certain phonemes. For instance, Example 1 seems to correspond to utterances of the phoneme /o/. We test our hypothesis by phoneme-level sonification of such features. The following figure displays the phoneme-level sonification for the exact same features that were shown previously.
It is worth mentioning that in Example 1, most of the utterances in both the phoneme-level and word-level sonifications come from male speakers. In fact, many feature maps are dominated by utterances from only one gender. The following figure displays a few examples of such features.
We see that, while phonemes play an important role in determining what activates a feature map, these shallow maps are also strongly affected by speaker-dependent nuisance variables such as pitch. Below, we explore where speaker-specific information lives in the network, and where information is predominantly semantic.
We observe that speaker information plays an important role in activating certain features. As an example of speaker information, we study the speaker's gender. More precisely, we study how many of the utterances that appear in a word-level sonification are utterances by male/female speakers. We also look at how many of the feature sonifications in each layer are dominated by only one gender. We investigate to see if a trend exists, and if so, what the trend looks like.
We assume that a feature is dominated by one gender if more than 90% of sonification audio segments are utterances by that gender. The following plot shows the portion of features that are dominated by one gender in every layer.
The previous figures indicate that in shallow layers, speaker information is important for the activation of features; however, this information is discarded in deeper layers.
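A minimal sketch of this dominance statistic, assuming a hypothetical structure `sonification_genders[layer][feature]` that holds the speaker genders of the 20 segments in each feature's sonification:

```python
def gender_dominated_fraction(sonification_genders: dict, threshold: float = 0.9) -> dict:
    """For every layer, return the fraction of features whose sonification
    segments come from a single gender more than `threshold` of the time.
    `sonification_genders[layer][feature]` is a list like ['M', 'F', ...]
    (a hypothetical pre-computed structure)."""
    fractions = {}
    for layer, features in sonification_genders.items():
        dominated = 0
        for genders in features.values():
            male_share = genders.count('M') / len(genders)
            if male_share > threshold or male_share < 1 - threshold:
                dominated += 1
        fractions[layer] = dominated / len(features)
    return fractions
```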
A similar study
As a big picture, the following heatmap shows the gender dominance in every feature of the network.
The female-dominated, male-dominated, and neutral features are colored respectively blue, red, and yellow.
Every column corresponds to one layer of the network.
How to interact with this figure: Move your mouse over different cells to hear the most activating inputs for the corresponding feature map. Move your mouse out of the figure to stop playing.
A fair conclusion is that speaker information is discarded in the deep layers.
To sanity check our hypothesis, we take two different utterances of the same word, one by a male speaker and another by a female speaker, and compare the features they activate.
The following plot illustrates the difference between activated features for the two different utterances. Every column represents a layer, the left-most represents the shallowest, and the right-most represents the deepest.
How to interact with this figure: To select different words, use the tabs on the top of the figure. To select the gender of the speaker, use the buttons on the bottom of the figure.
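One simple way to quantify this comparison is the layer-wise cosine similarity between the two utterances' time-averaged activations. A rough sketch, using activation dictionaries captured with forward hooks as above:

```python
import torch
import torch.nn.functional as F

def layerwise_similarity(acts_a: dict, acts_b: dict) -> dict:
    """Cosine similarity between the time-averaged feature activations of two
    utterances, per layer. `acts_a` and `acts_b` map layer names to tensors of
    shape (1, channels, frames), e.g. as captured with the hooks above."""
    sims = {}
    for name in acts_a:
        a = acts_a[name].mean(dim=-1).flatten()   # average over frames -> (channels,)
        b = acts_b[name].mean(dim=-1).flatten()
        sims[name] = F.cosine_similarity(a, b, dim=0).item()
    return sims
```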
As we saw in the previous section, shallow layers attend to the speaker information as well as the speech information. In this section, our objective is to sonify features in a way that depends less on the speaker information. One way of doing so is by finding words/phonemes that activate features on average, with averages taken across speakers. This is different from the sonifications above that focus on finding individual utterances that activate a feature the most. We call this new process the "speaker-discarded" sonification.
The following pseudo-code displays the complete steps of the speaker-discarded word-level sonification.
The following plot shows a few examples of the results using the speaker-discarded algorithm. These results can be reproduced in Google Colab. Reproduce in a Notebook
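For readers who prefer running code to pseudo-code, a minimal Python sketch of the speaker-discarded word-level sonification might look as follows. It assumes a hypothetical pre-computed list of (word, activation) pairs, one per utterance of each word by any speaker.

```python
from collections import defaultdict

def speaker_discarded_sonification(word_activations: list, top_n: int = 5):
    """Speaker-discarded word-level sonification sketch.
    `word_activations` is a list of (word, activation) pairs, one per utterance
    of that word by any speaker (a hypothetical pre-computed structure).
    Words are ranked by their activation averaged across utterances, so no
    single speaker's voice can dominate the ranking."""
    totals, counts = defaultdict(float), defaultdict(int)
    for word, activation in word_activations:
        totals[word] += activation
        counts[word] += 1
    averages = {w: totals[w] / counts[w] for w in totals}
    return sorted(averages, key=averages.get, reverse=True)[:top_n]
```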
So far, we have shown that the features in shallow layers detect simple patterns like phonemes. In this section, we will show that deeper layers detect complex patterns such as combinations of phonemes or even words.
We assume that the average number of phonemes per word in a feature's word-level sonification is a good indicator of the complexity of the patterns that activate that feature. For this reason, we adopt the average number of phonemes per activating word as a metric of feature complexity. Similarly, we define a layer-level average phoneme count as the average of its features' phoneme-count complexities. We also compute the average phoneme count of the feature with the highest/lowest phoneme-count complexity for every layer. Our hypothesis is that shallow layers detect less complex patterns than deep layers. To verify this hypothesis quantitatively, we plot the min, average, and max phoneme count for every layer.
How to interact with this figure: Move your mouse over different columns of the figure to hear a few randomly sampled audio clips that activate the features in the corresponding layer. By selecting the Min or Max, you can listen to the audio samples activating the features with the lowest or highest average phoneme count.
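A sketch of how this complexity metric could be computed, assuming `layer_feature_words[layer][feature]` holds each feature's top activating words and `phoneme_count(word)` returns a word's phoneme count (e.g., looked up in the CMU pronouncing dictionary); both are hypothetical helpers.

```python
def phoneme_count_stats(layer_feature_words: dict, phoneme_count) -> dict:
    """Per-layer min / mean / max of the average phoneme count of each
    feature's top activating words. `layer_feature_words[layer][feature]`
    is a list of words; `phoneme_count` maps a word to its number of
    phonemes (e.g., via the CMU pronouncing dictionary)."""
    stats = {}
    for layer, features in layer_feature_words.items():
        per_feature = [
            sum(phoneme_count(w) for w in words) / len(words)
            for words in features.values()
        ]
        stats[layer] = {
            "min": min(per_feature),
            "mean": sum(per_feature) / len(per_feature),
            "max": max(per_feature),
        }
    return stats
```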
From the previous figure, it can be seen that shorter words activate shallow layers, while longer words activate
deep layers.
In this section, we display a complete sonification for every feature in every layer of the Jasper network. We believe it could be helpful for readers who would like to explore beyond what we explained in this article.
The following figure displays phoneme-level, word-level, and speaker-discarded sonifications, respectively.
How to interact with this figure: Either scroll or drag the figure to move between various layers and features.
We are deeply grateful for the Distill template and documentation. Not only did we find previous Distill articles inspiring and insightful, we also borrowed several implementations from them.
For attribution, please cite this work as
Ghiasi, et al. (2021, Oct. 11). Feature Sonification: An investigation on the features learned for Automatic Speech Recognition. Retrieved from https://cs.umd.edu/~amin/apps/visxai/sonification/
BibTeX citation
@misc{ghiasi2022sonification,
  author = {Ghiasi, Amin and Kazemi, Hamid and Huang, Ronny and Liu, Emily and Goldblum, Micah and Goldstein, Tom},
  title = {Feature Sonification: An investigation on the features learned for Automatic Speech Recognition},
  url = {https://cs.umd.edu/~amin/apps/visxai/sonification/},
  year = {2021}
}