An investigation on the features learned for Automatic Speech Recognition (ASR).
Neural networks consist of layers, each of which holds features that activate in response to certain patterns in the input. For image-based tasks, networks have been studied using feature visualization, which produces interpretable images that stimulate the response of each feature map individually.
In this article, we seek to understand what ASR networks hear. To this end, we consider methods to sonify, rather than visualize, their feature maps. Sonification is the process of using sounds (as opposed to visuals) to apprehend complex structures. By listening to audio, we can better understand what networks respond to, and how this response varies with depth and feature index. We hear that shallow features respond strongly to simple sounds and phonemes, while deep layers react to complex words and word parts. We also observe that shallow layers preserve speaker-specific nuisance variables like pitch, while the deep layers eschew these attributes and focus on the semantic content of speech.
A phoneme is the most basic unit of sound that composes a word (in English there are 44 distinct phonemes).
How to interact with this figure: Click on the same tab to play again and click on the other tabs on the top for more examples.
This work focuses on the Jasper network.
How to interact with this figure: Click on the same tab to play again and click on the other tabs on the top for more examples.
While ASR models are encoder-decoder architectures, we focus only on the encoder. This is because Jasper's decoder is a single convolutional layer that converts the features produced by the encoder into a probability distribution over characters for each frame.
As mentioned before, the output of Jasper is a probability distribution over characters for each frame. For instance, in an utterance of the word "five," many consecutive frames might correspond to the phoneme /f/. As a result, Jasper's raw sequence of per-frame predictions might read "FFFIVE." To avoid this pitfall, Connectionist Temporal Classification (CTC) decoders are used. CTC decoding uses dynamic programming to get around the unknown alignment between the input frames and the output characters: it searches over possible alignments, collapsing repeated characters and blank tokens, to recover the transcript without redundancy.
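As a toy illustration of the collapsing step, here is a minimal greedy-decoding sketch (not Jasper's actual decoder, and greedy rather than the full dynamic-programming search): pick the most likely character per frame, collapse consecutive repeats, then drop the blank token.

```python
import numpy as np

# Minimal greedy CTC decoding sketch (illustrative only).
# `log_probs` has shape (num_frames, num_symbols); the last symbol is the CTC blank.

def greedy_ctc_decode(log_probs: np.ndarray, alphabet: str, blank_id: int) -> str:
    best_per_frame = log_probs.argmax(axis=1)      # most likely symbol per frame
    collapsed, prev = [], None
    for idx in best_per_frame:
        if idx != prev:                            # collapse consecutive repeats
            collapsed.append(idx)
        prev = idx
    return "".join(alphabet[i] for i in collapsed if i != blank_id)  # drop blanks

# Toy example: frames predicting "FFF-IIVVE" collapse to "FIVE".
alphabet = "EFIV"                                  # toy alphabet; index 4 is the blank
frames = [1, 1, 1, 4, 2, 2, 3, 3, 0]               # F F F - I I V V E
log_probs = np.full((len(frames), 5), -10.0)
for t, s in enumerate(frames):
    log_probs[t, s] = 0.0
print(greedy_ctc_decode(log_probs, alphabet, blank_id=4))  # -> "FIVE"
```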
The encoder stacks 10 blocks, each consisting of 5 convolutional sub-blocks (a 1-d convolution layer, batch normalization, ReLU, and dropout). Residual connections add the outputs of previous blocks to the input of the current block's last layer. The complete architecture of the encoder is shown in the following figure. Throughout this article, we refer to sub-blocks and layers interchangeably, and we sonify the ReLU output in each layer.
How to interact with this figure: You can either scroll horizontally or drag the network left/right to see other layers of the network. You can also move your mouse over each layer to see its connections.
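To make the sub-block structure concrete, here is a rough PyTorch sketch; the channel sizes, kernel width, and dropout rate are placeholders rather than Jasper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class JasperSubBlock(nn.Module):
    """Illustrative sketch of one Jasper sub-block: 1-d convolution,
    batch normalization, ReLU, and dropout. Hyperparameters here are
    placeholders, not Jasper's exact values."""

    def __init__(self, in_channels: int, out_channels: int,
                 kernel_size: int = 11, dropout: float = 0.2):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)
        self.bn = nn.BatchNorm1d(out_channels)
        self.relu = nn.ReLU()          # the output we sonify in this article
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, frames)
        return self.drop(self.relu(self.bn(self.conv(x))))

# A block stacks five of these sub-blocks; residual connections add the
# (projected) outputs of earlier blocks to the input of the block's last layer.
```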
Let us dive into the internal representations of Jasper as it processes 10 randomly sampled audio files from the TIMIT dataset.
How to interact with this figure: To select the layer you would like to see, click on the tabs above the plot. You can scroll/drag horizontally to see the plots for other features in the same layer. By default, each plot is only showing the activation patterns for 2 (out of 10) randomly sampled audio files. To see activation patterns for more inputs, click on the legend boxes.
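The activation profiles in these plots can be captured by recording each sub-block's ReLU output with forward hooks. Below is a minimal sketch; for illustration it uses a tiny stack of the sub-blocks sketched above and a random input, whereas in practice `model` would be a pretrained Jasper encoder (e.g., loaded from NVIDIA NeMo) and `features` a mel-spectrogram.

```python
import torch
import torch.nn as nn

# Stand-in model: two of the sub-blocks sketched above. In practice this would
# be the full pretrained Jasper encoder.
model = nn.Sequential(
    JasperSubBlock(64, 128),
    JasperSubBlock(128, 128),
)
features = torch.randn(1, 64, 300)     # stand-in (batch, mels, frames) spectrogram

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach().cpu()   # (batch, channels, frames)
    return hook

# Register a hook on every ReLU so its output is recorded during the forward pass.
handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if isinstance(m, nn.ReLU)]

with torch.no_grad():
    model(features)                    # one forward pass fills `activations`

for h in handles:
    h.remove()
```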
This pattern pervades deeper blocks as well;
the first and last layers of each block maintain similar feature profiles.
As we go deeper, the profiles in the penultimate layer of each block become more sparse. For instance, the following figure shows the activations for the 4th layer in various blocks of the network.
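One rough way to quantify this sparsity is to measure, for each recorded layer, the fraction of ReLU outputs that are exactly zero, using the `activations` dictionary captured with the hooks above.

```python
# Rough sparsity measure: fraction of ReLU outputs that are exactly zero,
# computed from the `activations` dictionary captured above.
sparsity = {
    name: (act == 0).float().mean().item()
    for name, act in activations.items()
}
for name, frac in sorted(sparsity.items()):
    print(f"{name}: {frac:.2%} of activations are zero")
```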
The trend suggests that the shallow layers of the network detect simpler patterns of audio which occur in shorter periods of time and with higher frequency, whereas the deep layers detect more complex patterns of audio which occur over longer periods of time and with lower frequency. We test this hypothesis in the following section by sonifying features from various parts of the network, and categorizing the resulting audios.
A simple way of doing feature visualization in images is to find image patches that maximally activate a particular node or layer.
Below, we pick a feature id (F) in layer (L) and find the 20 half-second audio segments that most strongly activate that feature. Then we characterize that feature at the word level or phoneme level by listing the most common words or phonemes that were present in the audio segments. We call these word-level and phoneme-level sonifications, respectively. The following figure demonstrates word-level sonification for 7 features from shallow layers. The x-axis contains the 5 most activating words for that feature, while the y-axis indicates the number of times each word has appeared in the audio segments. The 20 audio segments can also be played. Reproduce in a Notebook
How to interact with this figure: To see different examples, click on the tabs above the figure. You can also play the audio to hear what the sonification sounds like.
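Under the hood, the selection step might look like the following sketch. It assumes two hypothetical pre-computed helpers: `segment_activations`, mapping each half-second segment to the mean activation of feature F in layer L, and `segment_words`, mapping each segment to the words it contains.

```python
from collections import Counter

def word_level_sonification(segment_activations: dict, segment_words: dict, k: int = 20):
    """Pick the k segments that most strongly activate a feature and count
    the words they contain. `segment_activations` maps a segment id to the
    feature's mean activation over that segment; `segment_words` maps a
    segment id to the list of words overlapping it (hypothetical helpers)."""
    top_segments = sorted(segment_activations,
                          key=segment_activations.get, reverse=True)[:k]
    word_counts = Counter(w for seg in top_segments for w in segment_words[seg])
    return top_segments, word_counts.most_common(5)   # 5 most activating words
```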
All of the previous examples were chosen from shallow layers. We hypothesize that such shallow features respond to certain phonemes. For instance, Example 1 seems to correspond to utterances of the phoneme /o/. We test our hypothesis by phoneme-level sonification of such features. The following figure displays the phoneme-level sonification for the exact same features that were shown previously.
It is worth mentioning that in Example 1, most of the utterances in both the phoneme-level and word-level sonifications come from male speakers. In fact, many feature maps are dominated by utterances from only one gender. The following figure displays a few examples of such features.
We see that, while phonemes play an important role in determining what activates a feature map, these shallow maps are also strongly affected by speaker-dependent nuisance variables such as pitch. Below, we explore where speaker-specific information lives in the network, and where information is predominantly semantic.
We observe that speaker information plays an important role in activating certain features. As an example of speaker information, we study the speaker's gender. More precisely, we study how many of the utterances that appear in a word-level sonification are utterances by male/female speakers. We also look at how many of the feature sonifications in each layer are dominated by only one gender. We investigate to see if a trend exists, and if so, what the trend looks like.
We assume that a feature is dominated by one gender if more than 90% of sonification audio segments are utterances by that gender. The following plot shows the portion of features that are dominated by one gender in every layer.
The previous figures indicate that in shallow layers, speaker information is important for the activation of features; however, this information is discarded in deeper layers.
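A minimal sketch of this dominance statistic, assuming a hypothetical structure `sonification_genders[layer][feature]` that holds the speaker genders of the 20 segments in each feature's sonification:

```python
def gender_dominated_fraction(sonification_genders: dict, threshold: float = 0.9) -> dict:
    """For every layer, return the fraction of features whose sonification
    segments come from a single gender more than `threshold` of the time.
    `sonification_genders[layer][feature]` is a list like ['M', 'F', ...]
    (a hypothetical pre-computed structure)."""
    fractions = {}
    for layer, features in sonification_genders.items():
        dominated = 0
        for genders in features.values():
            male_share = genders.count('M') / len(genders)
            if male_share > threshold or male_share < 1 - threshold:
                dominated += 1
        fractions[layer] = dominated / len(features)
    return fractions
```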
A similar study
As a big picture, the following heatmap shows the gender dominance in every feature of the network.
The female-dominated, male-dominated, and neutral features are colored respectively blue, red, and yellow.
Every column corresponds to one layer of the network.
How to interact with this figure: Move your mouse over different cells to hear the most activating inputs for the corresponding feature map. Move your mouse out of the figure to stop playing.
A fair conclusion is that speaker information is discarded in the deep layers.
To sanity check our hypothesis, we take two different utterances of the same word, one by a male speaker and another by a female speaker, and compare the features they activate.
The following plot illustrates the difference between activated features for the two different utterances. Every column represents a layer, the left-most represents the shallowest, and the right-most represents the deepest.
How to interact with this figure: To select different words, use the tabs on the top of the figure. To select the gender of the speaker, use the buttons on the bottom of the figure.
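One simple way to quantify this comparison is the layer-wise cosine similarity between the two utterances' time-averaged activations. A rough sketch, using activation dictionaries captured with forward hooks as above:

```python
import torch
import torch.nn.functional as F

def layerwise_similarity(acts_a: dict, acts_b: dict) -> dict:
    """Cosine similarity between the time-averaged feature activations of two
    utterances, per layer. `acts_a` and `acts_b` map layer names to tensors of
    shape (1, channels, frames), e.g. as captured with the hooks above."""
    sims = {}
    for name in acts_a:
        a = acts_a[name].mean(dim=-1).flatten()   # average over frames -> (channels,)
        b = acts_b[name].mean(dim=-1).flatten()
        sims[name] = F.cosine_similarity(a, b, dim=0).item()
    return sims
```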
As we saw in the previous section, shallow layers attend to the speaker information as well as the speech information. In this section, our objective is to sonify features in a way that depends less on the speaker information. One way of doing so is by finding words/phonemes that activate features on average, with averages taken across speakers. This is different from the sonifications above that focus on finding individual utterances that activate a feature the most. We call this new process the "speaker-discarded" sonification.
The following pseudo-code displays the complete steps of the speaker-discarded word-level sonification.
The following plot shows a few examples of the results using the speaker-discarded algorithm. These results can be reproduced in Google Colab. Reproduce in a Notebook
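For readers who prefer running code to pseudo-code, a minimal Python sketch of the speaker-discarded word-level sonification might look as follows. It assumes a hypothetical pre-computed list of (word, activation) pairs, one per utterance of each word by any speaker.

```python
from collections import defaultdict

def speaker_discarded_sonification(word_activations: list, top_n: int = 5):
    """Speaker-discarded word-level sonification sketch.
    `word_activations` is a list of (word, activation) pairs, one per utterance
    of that word by any speaker (a hypothetical pre-computed structure).
    Words are ranked by their activation averaged across utterances, so no
    single speaker's voice can dominate the ranking."""
    totals, counts = defaultdict(float), defaultdict(int)
    for word, activation in word_activations:
        totals[word] += activation
        counts[word] += 1
    averages = {w: totals[w] / counts[w] for w in totals}
    return sorted(averages, key=averages.get, reverse=True)[:top_n]
```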
So far, we have shown that the features in shallow layers detect simple patterns like phonemes. In this section, we will show that deeper layers detect complex patterns such as combinations of phonemes or even words.
We assume that the average number of phonemes per word in a feature's word-level sonification is a good indicator of the complexity of the patterns that activate that feature. For this reason, we adopt the average number of phonemes per activating word as a metric of feature complexity. Similarly, we define a layer-level average phoneme count as the average of its features' phoneme-count complexities. We also compute the average phoneme count of the feature with the highest/lowest phoneme-count complexity for every layer. Our hypothesis is that shallow layers detect less complex patterns than deep layers. To verify this hypothesis quantitatively, we plot the min, average, and max phoneme count for every layer.
How to interact with this figure: Move your mouse over different columns of the figure to hear a few randomly sampled audio clips that activate the features in the corresponding layer. By selecting the Min or Max, you can listen to the audio samples activating the features with the lowest or highest average phoneme count.
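A sketch of how this complexity metric could be computed, assuming `layer_feature_words[layer][feature]` holds each feature's top activating words and `phoneme_count(word)` returns a word's phoneme count (e.g., looked up in the CMU pronouncing dictionary); both are hypothetical helpers.

```python
def phoneme_count_stats(layer_feature_words: dict, phoneme_count) -> dict:
    """Per-layer min / mean / max of the average phoneme count of each
    feature's top activating words. `layer_feature_words[layer][feature]`
    is a list of words; `phoneme_count` maps a word to its number of
    phonemes (e.g., via the CMU pronouncing dictionary)."""
    stats = {}
    for layer, features in layer_feature_words.items():
        per_feature = [
            sum(phoneme_count(w) for w in words) / len(words)
            for words in features.values()
        ]
        stats[layer] = {
            "min": min(per_feature),
            "mean": sum(per_feature) / len(per_feature),
            "max": max(per_feature),
        }
    return stats
```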
From the previous figure, it can be seen that shorter words activate shallow layers, while longer words activate
deep layers.
In this section, we display a complete sonification for every feature in every layer of the Jasper network. We believe it could be helpful for readers who would like to explore beyond what we explained in this article.
The following figure displays phoneme-level, word-level, and speaker-discarded sonifications, respectively.
How to interact with this figure: Either scroll or drag the figure to move between various layers and features.
We are deeply grateful for the Distill template and documentation. Not only did we find previous Distill articles inspiring and insightful, we also borrowed several implementations from them.
For attribution, please cite this work as
Ghiasi, et al. (2021, Oct. 11). Feature Sonification: An investigation on the features learned for Automatic Speech Recognition. Retrieved from https://cs.umd.edu/~amin/apps/visxai/sonification/
BibTeX citation
@misc{ghiasi2022sonification,
  author = {Ghiasi, Amin and Kazemi, Hamid and Huang, Ronny and Liu, Emily and Goldblum, Micah and Goldstein, Tom},
  title = {Feature Sonification: An investigation on the features learned for Automatic Speech Recognition},
  url = {https://cs.umd.edu/~amin/apps/visxai/sonification/},
  year = {2021}
}