Live Speech Emotion Recognition

If you heard the following sentences during a conversation, you could likely figure out which sentiment category they fall under: calm, happy, sad, angry, fearful, surprised, or disgusted.

People with conditions that make it hard to recognize these emotions generally know how to respond to them; however, they cannot interpret visual body-language signals to understand the emotions of the person they are interacting with.

We set out to build a model that can categorize emotions such as calm, happy, sad, angry, fearful, surprised, and disgusted. Devinder worked on the live facial expression recognition side of the project—you can read his article here—and I worked on the live audio sentiment categorization part.

So, I began researching how to perform live audio sentiment categorization for this project and found that text sentiment analysis is often taught on day one of any introductory NLP course. Sentiment categorization from audio, however, is not nearly as commonly practiced, especially on live audio.

I had to figure out how to perform live audio sentiment categorization on my own and since I am a machine learning beginner, it took several long nights before the model worked. I thought I would share how I got live audio sentiment analysis to work and what my next steps are for making the model more accurate.

The dataset includes speech labeled as calm, happy, sad, angry, fearful, surprised, and disgusted. Each expression is produced at two levels of emotional intensity, normal and strong, with an additional neutral expression.

Although we are analyzing audio, we will use computer vision to categorize the emotions: the model will be trained to recognize patterns in spectrogram representations of the audio that correlate with the emotion categories.

Spectrogram with the MEL scale

The spectrograms will be placed into the directory system on your computer, each under a subdirectory named after its emotion: anger, disgust, fear, happiness, surprise, sadness, or neutral.

How To Structure the Directories

Please note that you will have to create the directories and subdirectories manually. The code will then save each spectrogram into the directory that matches its emotion label.
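
To make this step concrete, here is a minimal sketch of how the spectrograms could be generated with librosa and matplotlib. The folder names, the label_for() helper, and the spectrogram settings are my assumptions for illustration, not the original code.

```python
# Sketch: convert each labelled WAV clip into a mel-spectrogram image and save it
# under spectrograms/<emotion>/ (the emotion sub-directories are assumed to exist
# already, as described above). Paths and label_for() are illustrative assumptions.
import os
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

AUDIO_DIR = "audio_clips"     # assumed location of the labelled WAV files
SPEC_DIR = "spectrograms"     # assumed root of the emotion sub-directories

def label_for(filename):
    """Hypothetical helper: map a file name to its emotion label,
    e.g. 'angry_actor01_take3.wav' -> 'angry'."""
    return filename.split("_")[0]

def save_mel_spectrogram(wav_path, out_path):
    """Render a mel-scaled spectrogram of one clip and save it as an image."""
    y, sr = librosa.load(wav_path, sr=None)           # load audio at its native rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr)  # mel-scaled spectrogram
    mel_db = librosa.power_to_db(mel, ref=np.max)     # convert power to decibels
    fig = plt.figure(figsize=(3, 3))
    librosa.display.specshow(mel_db, sr=sr)           # image only, no axes
    plt.axis("off")
    fig.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)

for name in os.listdir(AUDIO_DIR):
    if name.endswith(".wav"):
        emotion = label_for(name)
        out_file = os.path.join(SPEC_DIR, emotion, name.replace(".wav", ".png"))
        save_mel_spectrogram(os.path.join(AUDIO_DIR, name), out_file)
```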

Now we can start training a CNN to identify emotions by looking at the spectrograms generated from the sound clips. We will use the fastai library to do this.

The pre-trained CNN (resnet34) will be trained on the spectrogram images we just generated and labeled.
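
As a rough idea of what that training step can look like, here is a sketch using the fastai (v2) image API; the path, the validation split, and the number of epochs are placeholder choices rather than the values used in the original project.

```python
# Sketch: fine-tune a pre-trained resnet34 on the spectrogram images,
# using the emotion sub-directory names as labels (fastai v2 API assumed).
from fastai.vision.all import *

path = Path("spectrograms")                    # root folder from the previous step
dls = ImageDataLoaders.from_folder(
    path,
    valid_pct=0.2, seed=42,                    # hold out 20% of the images for validation
    item_tfms=Resize(224),                     # resize spectrograms for the resnet34 backbone
)
learn = vision_learner(dls, resnet34, metrics=accuracy)   # pre-trained resnet34
learn.fine_tune(4)                             # illustrative number of epochs
learn.export("emotion_classifier.pkl")         # save the model for live inference later
```

Note that from_folder labels each image with the name of its parent directory, which is why the directory structure described above matters.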

Once training is done, we record a short clip of live audio to a WAV file and convert it into a spectrogram image, which is then saved to another directory on your computer. This saving step is optional; however, I recommend it if you want to analyze the differences between the spectrograms later. Finally, use model.predict(img) to predict the sentiment of the image.
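
A minimal sketch of that live step might look like the following; recording with sounddevice/soundfile, the file names, and the reuse of save_mel_spectrogram() from the earlier sketch are my assumptions rather than the original implementation.

```python
# Sketch: record a short clip from the microphone, turn it into a spectrogram,
# and ask the trained model which emotion it sounds like (libraries assumed).
import sounddevice as sd
import soundfile as sf
from fastai.vision.all import load_learner, PILImage

SECONDS, SAMPLE_RATE = 3, 22050

learn = load_learner("emotion_classifier.pkl")     # model exported after training

# Record a short clip from the microphone and save it as a WAV file.
recording = sd.rec(int(SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
sd.wait()                                          # block until recording finishes
sf.write("live_clip.wav", recording, SAMPLE_RATE)

# Convert the clip to a spectrogram image (helper from the earlier sketch),
# then predict its emotion.
save_mel_spectrogram("live_clip.wav", "live_clip.png")
label, _, probs = learn.predict(PILImage.create("live_clip.png"))
print(f"Predicted emotion: {label} ({probs.max():.2f})")
```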

Congratulations! If you followed along with this tutorial, you can now perform live audio sentiment categorization! If your end goal is to use ML to help the 32 million people in the United States who have trouble recognizing others’ feelings, you are one step closer to your goal!

However, if that is not your end goal, there is a wide array of other applications for live audio sentiment categorization. Below I touch on just a few of them.

A team in Amazon’s Alexa AI unit has experimented with machine learning to detect emotions like happiness, sadness, and anger. One aim of this work was to help protect veterans with PTSD by using artificial intelligence to understand their mental health from the sound of their voice.

Emotion Recognition in Hospitals

An industry that’s taking advantage of this technology is healthcare. AI-powered recognition software helps healthcare workers decide when patients need medicine or helps physicians determine which patient to help first.

Marketing

Working on this project was very fun and taught me a ton about natural language processing! I am excited to iterate on this project, as there are several faults in the way this model was trained.

Firstly, the dataset the model was trained on was limited in scope: it contained only a handful of sentences, repeated over and over by different actors.

This meant I found myself saying “kids are talking by the door” at 2 a.m. when the model finally worked. I can assure you that my acting skills were put to the test.

So, finding a dataset that covers a wide array of ethnicities and a broader range of phrases will be key to improving the accuracy of this model. I will also add sentiment analysis of what is being said, rather than only how it is being said, to further improve accuracy.

Lastly, to further improve accuracy in emotion categorization, I will merge this live audio emotion detection model with Devinder’s live facial emotion detection model.

If you have ideas on how to improve the model feel free to reach out : )
