How to Use Transfer Learning to Detect Emotions

By Boris Delovski • Jan 21, 2022

In the previous articles, we covered what emotion recognition is and how to perform emotion recognition. While modules that contain pre-built models cover a large number of use cases when it comes to detecting emotions, there are benefits to knowing how to create custom emotion recognition models. 

Custom emotion recognition models can be more accurate in some situations. For example, if a company wants to introduce an emotion recognition model to gauge how people react to their ads in malls, using images of people looking at their ads to train a custom model can lead to better results than using a pre-built solution. However, even when building custom models, it is usually a better idea to leverage transfer learning in some way. This way we don't need to rely on collecting a lot of high-quality data and can get good results with smaller datasets.

In this article, we will focus on creating an emotion recognition model that tries to predict whether a person is interested in a lesson or not.

 

The Dataset

Most datasets are focused on detecting specific emotions such as happiness, sadness, or anger. In our case, since we just want to know whether a person is interested in a lesson or not, we don't need that level of granularity. For our purposes, we can split emotions into three categories: positive, neutral, and negative. In an ideal situation, our students would display positive emotions, but even being neutral is acceptable. We are mainly trying to avoid holding classes that elicit negative emotions in our students.

To train our model, we will use a modified version of the FER2013 dataset, which contains 48x48 pixel images that each display a certain emotion. Because several of the emotions in the dataset are negative (such as fear and anger), we will modify our labels so that there are only three categories: positive, neutral, and negative.

 

Preparing the Data

Before we create our model, we need to prepare our data. First, we need to import everything we will use:

# Let's import the libraries we will use

import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, GlobalMaxPool2D
from tensorflow.keras.models import Model
from tensorflow.keras.applications.mobilenet import MobileNet
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report

We will use Keras to create our neural network and train it. When working with images in Keras, it’s convenient to use the ImageDataGenerator class. With it, we can take our data, augment it, and load it into our model for training and, later, testing. Data augmentation should only be applied to the training images; the images used for validation and testing should be left unaugmented so that they reflect the data the model will actually see. Therefore, let's define separate generators for training, validation, and testing. The validation and testing generators are configured identically, but for the sake of clarity we will create them separately.

One thing to note: always rescale images. Deep learning networks are very sensitive to unscaled data and will perform poorly with it.
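As a quick illustration of what the rescale argument does, the sketch below applies the same 1/255 factor to a handful of raw pixel values with NumPy. The array raw_pixels is made up for the example; the generator performs the equivalent multiplication on every image it loads.

```python
import numpy as np

# Hypothetical raw pixel intensities, stored as 8-bit integers (0-255)
raw_pixels = np.array([[0, 64, 128, 255]], dtype=np.uint8)

# The same factor we pass to ImageDataGenerator(rescale=1./255)
rescale = 1.0 / 255

# Convert to float first, then scale the values into the [0, 1] range
scaled_pixels = raw_pixels.astype("float32") * rescale

print(scaled_pixels.min(), scaled_pixels.max())
```

After rescaling, every value lies between 0 and 1, which is the range the network's weights are typically tuned for.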

# Define training data generator

train_datagen = ImageDataGenerator(rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)


# Define validation data and testing data generators
# Technically the same, but separated here for the 
# sake of clarity

validation_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

This is not enough to load our data into our model. The generator itself only defines how images are handled as they are loaded, that is, whether we load them as is or change them in some way. To specify the data source, we need to use one of the generator’s flow methods. The two most common methods are:

  • flow_from_directory
  • flow_from_dataframe

The more commonly used method is flow_from_directory. This method requires images to be stored in separate folders. For each class, we need to have a separate folder, and we need to store images of that class in that directory. The generator uses the structure of our directory to assign labels to images. This may seem practical and simple at first, but it requires us to create multiple directories and can lead to extra steps. If we know how our images are labeled, there is no need to go through the process of storing them in separate directories. Instead, we can just feed labels directly to Keras.

This is where flow_from_dataframe comes into play. It allows us to store all of our images in a single directory and use a pandas DataFrame to assign labels to them when loading them with the ImageDataGenerator class. To be more specific, we need two columns: one column with image file names, and one column with the labels associated with those images. This skips the extra step of sorting images into folders and gives us more detailed control. For example, if we want to skip some images that are in the folder, we can just remove them from the DataFrame. It also makes it much easier to split the data into training, validation, and testing sets.
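To make the expected layout concrete, here is a minimal sketch of such a DataFrame. The file names and labels are invented for illustration; only the two-column structure matters. The column names "files" and "target" mirror the ones used in the rest of this article.

```python
import pandas as pd

# Hypothetical file names and labels, only to show the expected layout
df_example = pd.DataFrame({
    "files": ["img_001.png", "img_002.png", "img_003.png"],
    "target": ["Happiness", "Anger", "Neutral"],
})

# Skipping an image is just a matter of dropping its row
images_to_skip = ["img_002.png"]
keep_mask = ~df_example["files"].isin(images_to_skip)
df_filtered = df_example[keep_mask].reset_index(drop=True)

print(df_filtered)
```

The image file itself can stay in the directory; since it no longer appears in the DataFrame, the generator never loads it.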

The first thing we will do is create a DataFrame from our CSV file.

# Read in data into a DataFrame

df = pd.read_csv("image_dataset.csv")

This DataFrame consists of two columns: "files" and "target". The "files" column represents our images, while the "target" column represents image labels.

Now that the DataFrame has been loaded, let's modify the labels a bit. As I mentioned earlier, we are not interested in specific emotions, just in whether they are positive, negative, or neutral. Because we will use the flow_from_dataframe method, we need to make sure that our labels are properly defined, so let's map negative emotions to the "negative" label, positive to the "positive" label, and neutral to the "neutral" label.

# Map values to positive, negative, neutral

mapping = {"Anger":"Negative", 
           "Happiness":"Positive", 
           "Fear":"Negative", 
           "Neutral":"Neutral"}

df["target"] = df["target"].map(mapping)
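One thing worth checking after a remapping like this: the pandas map method returns NaN for any value that is missing from the dictionary. The sketch below uses a small made-up DataFrame (the "Surprise" row is hypothetical and is only there to trigger the problem) to show how such rows could be found and dropped before they reach the generator.

```python
import pandas as pd

mapping = {"Anger": "Negative",
           "Happiness": "Positive",
           "Fear": "Negative",
           "Neutral": "Neutral"}

# Hypothetical data; "Surprise" is deliberately absent from the mapping
df_check = pd.DataFrame({
    "files": ["a.png", "b.png", "c.png"],
    "target": ["Anger", "Surprise", "Neutral"],
})

df_check["target"] = df_check["target"].map(mapping)

# Labels that were not in the mapping are now NaN
unmapped = df_check[df_check["target"].isna()]
print(f"Rows with unmapped labels: {len(unmapped)}")

# Drop those rows so that only the three expected categories remain
df_check = df_check.dropna(subset=["target"]).reset_index(drop=True)
```

If the count printed here is zero for the real dataset, the mapping covers every label and nothing needs to be dropped.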

Now that the labels have been re-mapped, let's shuffle our dataset and separate our data into training, validation, and testing data.

# Shuffle data

df = df.sample(frac=1).reset_index(drop=True)

# Separate data into training, validation, and test data

train = int(len(df)*0.75)
test = int(len(df)*0.9)

df_train = df.iloc[:train, :].copy()
df_validation = df.iloc[train:test, :].copy()
df_test = df.iloc[test:, :].copy()
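The two cut points above amount to a 75/15/10 split between training, validation, and testing data. A small sketch of the arithmetic, using a hypothetical helper called split_sizes and a made-up row count of 1,000:

```python
def split_sizes(n_rows, train_frac=0.75, test_start_frac=0.9):
    """Return (train, validation, test) sizes for the slicing used above."""
    train_end = int(n_rows * train_frac)        # rows [0, train_end) -> training
    test_start = int(n_rows * test_start_frac)  # rows [test_start, n_rows) -> testing
    n_validation = test_start - train_end       # rows in between -> validation
    n_test = n_rows - test_start
    return train_end, n_validation, n_test


n_train, n_validation, n_test = split_sizes(1000)
print(n_train, n_validation, n_test)
```

Because the slices are contiguous and non-overlapping, every row ends up in exactly one of the three DataFrames.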

Now both our ImageDataGenerator class and our three DataFrames are prepared. Let's use the flow_from_dataframe method to define how we will access our data. 

# Get training data

train_data = train_datagen.flow_from_dataframe(
    dataframe=df_train,
    target_size=(128,128),
    batch_size=32,
    directory="data",
    x_col="files",
    y_col="target")

# Get validation data

validation_data = validation_datagen.flow_from_dataframe(
    dataframe=df_validation,
    target_size=(128,128),
    batch_size=32,
    directory="data",
    x_col="files",
    y_col="target")

# Get testing data

test_data = test_datagen.flow_from_dataframe(
    dataframe=df_test,
    target_size=(128,128),
    batch_size=32,
    shuffle=False,
    directory="data",
    x_col="files",
    y_col="target")

There are two important things to note here. First, our images are 48x48 pixels in size, while the smallest input size for which pretrained MobileNet weights are available is 128x128. Since we plan on using pretrained weights, we will scale our images to 128x128. Upscaling somewhat decreases the quality of our images, but it is a necessary sacrifice. Second, it is extremely important to explicitly set the shuffle parameter to False for test_data. Otherwise, the order of the predictions won't match the order of the labels stored in test_data.classes, and you won't be able to measure the performance of your model correctly.
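To see why the ordering matters, consider the toy sketch below. It doesn't use Keras at all: the labels, the "perfect" predictions, and the fixed permutation standing in for a random shuffle are all made up for illustration.

```python
# Integer labels in the order the files appear in the DataFrame
# (this is the order that test_data.classes follows)
true_labels = [0, 0, 1, 1, 2, 2]


def accuracy(labels, predictions):
    """Fraction of positions where the label and the prediction agree."""
    matches = sum(1 for y, p in zip(labels, predictions) if y == p)
    return matches / len(labels)


# A hypothetical "perfect" model: the right label for every image
perfect_predictions = list(true_labels)

# Without shuffling, predictions come back in file order
aligned_accuracy = accuracy(true_labels, perfect_predictions)

# With shuffling, the same predictions come back in some other order
# (a fixed, made-up permutation stands in for the random shuffle)
shuffled_order = [3, 5, 0, 2, 4, 1]
shuffled_predictions = [perfect_predictions[i] for i in shuffled_order]
misaligned_accuracy = accuracy(true_labels, shuffled_predictions)

print(aligned_accuracy, misaligned_accuracy)
```

Even though the model in this toy example never makes a mistake, comparing its predictions with labels in a different order makes it look far worse than it is.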

 

Creating a Custom Model

The model we will use for emotion recognition is built on top of the MobileNet network. To be specific, we will use a version of MobileNet pretrained on the ImageNet dataset. However, we won't include the top of the MobileNet model. Instead, we will add a global max-pooling layer and a dense prediction layer on top of it. Note that the code below attaches these layers to the output of mobile_net.layers[-14], so the last 13 layers of the base network are left out as well. To finish, we will freeze the first 15 layers of the model.

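# Create model

# Load MobileNet pretrained on ImageNet, without its classification head
# (the classes argument has no effect when include_top=False)
mobile_net = MobileNet(
    input_shape=(128, 128, 3),
    include_top=False,
    weights="imagenet",
    classes=3
)

# Use the output of the 14th layer from the end as our feature extractor
mobile_net_output = mobile_net.layers[-14].output
global_pool = GlobalMaxPool2D(name="global_pool")(mobile_net_output)
out = Dense(3, activation="softmax", name="out_layer")(global_pool)
model = Model(inputs=mobile_net.input, outputs=out)

# Freeze the first 15 layers so their pretrained weights are not updated
for layer in model.layers[:15]:
    layer.trainable = False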

Now that we’ve defined our custom model, it is time to compile it. The loss function we will use is categorical cross-entropy. We will use Adam as our optimizer and we will track accuracy.

# Compile model

model.compile(loss="categorical_crossentropy", optimizer=Adam(0.01), metrics=["accuracy"])

Before running our model, we will also define some callbacks:

  • ModelCheckpoint - to save the best version of our model during training
  • EarlyStopping - to stop training if the model doesn't improve for a certain number of epochs
  • ReduceLROnPlateau - to reduce the learning rate if the model stops improving

# Define a path where we want to save the model

filepath = "models"

# Define some callbacks

checkpoint = ModelCheckpoint(
    filepath, 
    monitor="val_accuracy", 
    verbose=1, 
    save_best_only=True, 
    mode="max")

earlystopping = EarlyStopping(
    monitor="val_accuracy", 
    patience=15, 
    verbose=1, 
    mode="auto", 
    restore_best_weights=True)

rlrop = ReduceLROnPlateau(
    monitor="val_accuracy", 
    mode="max", 
    patience=5, 
    factor=0.5, 
    min_lr=1e-6, 
    verbose=1)

# Create a list of callbacks 

callbacks = [checkpoint, earlystopping, rlrop]
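To get a feel for how ReduceLROnPlateau interacts with the values chosen above (an initial learning rate of 0.01, factor=0.5, patience=5, min_lr=1e-6), here is a simplified pure-Python simulation. It is only a sketch of the idea: the function simulate_reduce_lr and the validation accuracies are made up, and details of the real callback such as min_delta and cooldown are ignored.

```python
def simulate_reduce_lr(val_accuracy, initial_lr=0.01, factor=0.5,
                       patience=5, min_lr=1e-6):
    """Return the learning rate in effect after each epoch (simplified)."""
    lr = initial_lr
    best = float("-inf")
    wait = 0
    lrs = []
    for acc in val_accuracy:
        if acc > best:
            # The metric improved: remember it and reset the counter
            best = acc
            wait = 0
        else:
            # No improvement: count the epoch, reduce the LR once patience runs out
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        lrs.append(lr)
    return lrs


# Hypothetical run: accuracy improves once, then plateaus for five epochs
history_lrs = simulate_reduce_lr([0.5, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6])
print(history_lrs)
```

In this made-up run the learning rate is halved only after five consecutive epochs without improvement, and it can never drop below min_lr no matter how long the plateau lasts.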

 

Training the Model 

Now that everything is ready, we can go ahead and train our model:

# Train the model

history = model.fit(
    train_data, 
    validation_data=validation_data,
    epochs=25,
    steps_per_epoch=len(train_data),
    validation_steps=len(validation_data),
    callbacks=callbacks)

Classification accuracy by itself can be misleading. Therefore, it is a much better idea to take a look at a classification report instead, since it lists precision, recall, and F1-score for each class separately.
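A toy example shows why accuracy alone can mislead. The labels and the "lazy" model below are invented for illustration: the model always predicts the majority class and still reaches a seemingly respectable accuracy.

```python
# Hypothetical test set: 8 "Neutral" images (label 1) and 2 "Negative" images (label 0)
true_labels = [1] * 8 + [0] * 2

# A lazy model that always predicts the majority class
predictions = [1] * 10


def accuracy(labels, preds):
    """Fraction of all samples that were classified correctly."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)


def recall_for_class(labels, preds, cls):
    """Fraction of the samples belonging to cls that were predicted as cls."""
    preds_for_cls = [p for y, p in zip(labels, preds) if y == cls]
    return sum(1 for p in preds_for_cls if p == cls) / len(preds_for_cls)


overall_accuracy = accuracy(true_labels, predictions)
negative_recall = recall_for_class(true_labels, predictions, 0)
neutral_recall = recall_for_class(true_labels, predictions, 1)

print(overall_accuracy, negative_recall, neutral_recall)
```

The overall accuracy looks decent, yet the model never detects a single negative example, which is exactly the class we care about most. Per-class metrics make that kind of failure visible.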

# Get predictions

predictions = np.argmax(model.predict(test_data), axis=-1)


# Get the classification report

print(classification_report(test_data.classes, predictions))

The result we get by running this code is:

The report identifies classes only by their integer labels, so we don't yet know which integer corresponds to which of our three categories. To find out, we can simply access the class_indices attribute of our generator object.

# Get dictionary that connects labels with their integer representations

labels = train_data.class_indices
labels

The result we get by running this code is a dictionary that connects classes with their integer representations:

{'Negative': 0, 'Neutral': 1, 'Positive': 2}
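With that dictionary in hand, integer predictions can be turned back into readable class names. The sketch below hard-codes the dictionary shown above and uses a few made-up predictions; the variable names are chosen for the example only.

```python
# Dictionary returned by train_data.class_indices in this article
class_indices = {"Negative": 0, "Neutral": 1, "Positive": 2}

# Invert it so that integers map back to class names
index_to_label = {index: name for name, index in class_indices.items()}

# Hypothetical integer predictions, e.g. the output of np.argmax(...)
example_predictions = [2, 0, 1, 2]
predicted_names = [index_to_label[i] for i in example_predictions]

# Class names ordered by their integer index
target_names = [index_to_label[i] for i in sorted(index_to_label)]

print(predicted_names)
print(target_names)
```

A list like target_names can also be passed to scikit-learn's classification_report through its target_names parameter, so that the report shows class names instead of integers.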

Finally, let's visualize the results.

# Show training history

def visualize_train_history(train_history, train_metric, validation_metric):
    # Plot the training curve and the validation curve for one metric
    plt.plot(train_history.history[train_metric])
    plt.plot(train_history.history[validation_metric])
    plt.title("Training History")
    plt.ylabel(train_metric)
    plt.xlabel("Epoch")
    plt.legend(["Train", "Validation"], loc="upper left")
    plt.show()

visualize_train_history(history, "loss", "val_loss")
visualize_train_history(history, "accuracy", "val_accuracy")

 

Super-resolution as a Data Preprocessing Technique

We used some basic data preprocessing in the form of simple image augmentations implemented by the Keras ImageDataGenerator. While this allowed us to achieve a good baseline accuracy, let's see if we can get even better results by using more advanced techniques. 

One of the fields of computer vision that has gained quite a lot of traction recently is super-resolution. Super-resolution imaging is a set of techniques for increasing the resolution of images. It addresses a simple but frequent issue: training set images often have a lower resolution than commonly used models expect. Using super-resolution can therefore be considered an atypical image augmentation technique.

This doesn't mean we should avoid using typical image augmentation techniques, some of which are: 

  • Rotation
  • Translation
  • Color augmentations
  • Flipping
  • Cropping 
  • Adding noise
  • Blurring

We just won't focus on them in this article because they serve a different purpose: they are used to compensate for a lack of data by artificially increasing the size of our dataset. Using standard image augmentation techniques increases the number of different images our model trains on. That increase in size is usually enough to improve the accuracy of our models.

In this article, we will focus on trying to improve accuracy without actually increasing the size of our training dataset, but instead by introducing modern data upscaling techniques. The size of our dataset will stay the same, but the resolution of the images inside it will be bigger. This is also why we won't add any new data augmentation techniques aside from increasing the resolution of our images. If we used additional augmentation methods (aside from those we used earlier on in this article) it would be very hard to gauge whether using super-resolution helped our model achieve higher accuracy or whether it was the consequence of using those other image augmentation techniques.

To increase the resolution of our images, we will use neural networks designed specifically for upscaling images while minimizing the loss of detail. Fortunately, we don't need to create such a network ourselves: OpenCV makes it easy to use pre-trained ones.
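For contrast, the simplest way to upscale an image is nearest-neighbor interpolation, which just repeats existing pixels and adds no new detail. The sketch below implements it in plain Python on a tiny made-up 2x2 "image"; the function upscale_nearest is written for this example and is not part of any library.

```python
def upscale_nearest(image, factor):
    """Upscale a 2D list of pixel values by repeating each pixel factor times."""
    return [
        [pixel for pixel in row for _ in range(factor)]  # repeat each pixel horizontally
        for row in image
        for _ in range(factor)                           # repeat each row vertically
    ]


# Hypothetical 2x2 grayscale image
tiny_image = [[1, 2],
              [3, 4]]

upscaled = upscale_nearest(tiny_image, 2)
for row in upscaled:
    print(row)
```

The result is larger but blocky, because every new pixel is a copy of an old one. Super-resolution networks instead try to predict plausible fine detail, which is why they are worth testing as a preprocessing step.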

Let's import everything we need to upscale our images:

# Import necessary libraries
# (the cv2.dnn_superres module is part of the opencv-contrib-python package)

import cv2
import os

Then we need to define the Super Resolution object.

# Create a SR object

sr = cv2.dnn_superres.DnnSuperResImpl_create()

Since we are using a pre-trained model to upscale our images, we need to download the trained model and point Python to it:

# Define path to SR model

path_to_model = "EDSR_x4.pb"

Now everything is ready. We can read in the model that we defined.

# Read the model

sr.readModel(path_to_model)
sr.setModel("edsr",4)

The second parameter we pass when setting the model tells OpenCV by how much we want to upscale our images, and it has to match the model file we downloaded (EDSR_x4.pb in our case). Our goal here is to upscale the images so that they match one of the input sizes that MobileNet was originally trained on. I have chosen 4, which means that each dimension of an image becomes 4 times bigger. Starting from 48x48 images, this leads us to an image size of 192x192, which we can feed into our network without needing to upscale the images using the Keras ImageDataGenerator.
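The choice of scale can be checked with a little arithmetic. The sketch below lists the input resolutions for which Keras offers ImageNet-pretrained MobileNet weights and tests which upscaling factors take a 48x48 image to one of them. The helper names and the candidate factors (2, 3, and 4) are assumptions made for this example.

```python
# Input resolutions with ImageNet-pretrained MobileNet weights in Keras
MOBILENET_PRETRAINED_SIZES = (128, 160, 192, 224)


def upscaled_size(original_size, scale):
    """Side length of a square image after upscaling by an integer factor."""
    return original_size * scale


def valid_scales(original_size, candidate_scales=(2, 3, 4)):
    """Scales whose output size matches one of the pretrained input sizes."""
    return [
        scale for scale in candidate_scales
        if upscaled_size(original_size, scale) in MOBILENET_PRETRAINED_SIZES
    ]


print(upscaled_size(48, 4))
print(valid_scales(48))
```

Of the factors considered here, only 4 lands exactly on a supported resolution for 48x48 images, which is why no further resizing is needed after upscaling.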

Since all of our original images are stored in the data directory, we will create a new one and call it processed_data. Each image in our data directory will get upscaled and stored inside the newly-created directory.

# Define paths to original directory and new directory

new_directory_path = "processed_data"
original_directory_path = "data"

# Create the new directory if it doesn't exist yet
# (cv2.imwrite does not create missing directories)

os.makedirs(new_directory_path, exist_ok=True)

# List images in the original directory

list_of_images = os.listdir(original_directory_path)

# Upscale images and store them in the new directory

for image_name in list_of_images:
    image = cv2.imread(os.path.join(original_directory_path, image_name))
    result = sr.upsample(image)
    cv2.imwrite(os.path.join(new_directory_path, image_name), result)

To re-run our code from earlier with these upscaled, higher-resolution images, we need to modify it slightly: the parts that reference the size of our images and the directory they are loaded from have to change. Those parts are:

# Get training data

train_data = train_datagen.flow_from_dataframe(
    dataframe=df_train,
    target_size=(192,192),
    batch_size=32,
    directory="processed_data",
    x_col="files",
    y_col="target")

# Get validation data

validation_data = validation_datagen.flow_from_dataframe(
    dataframe=df_validation,
    target_size=(192,192),
    batch_size=32,
    directory="processed_data",
    x_col="files",
    y_col="target")

# Get testing data

test_data = test_datagen.flow_from_dataframe(
    dataframe=df_test,
    target_size=(192,192),
    batch_size=32,
    shuffle=False,
    directory="processed_data",
    x_col="files",
    y_col="target")

# Define model

mobile_net = MobileNet(
    input_shape=(192, 192, 3),
    include_top=False,
    weights="imagenet",
    classes=3)

x = mobile_net.layers[-14].output
global_pool = GlobalMaxPool2D(name="global_pool")(x)
out = Dense(3, activation="softmax", name="out_layer")(global_pool)

model = Model(inputs=mobile_net.input, outputs=out)

for layer in model.layers[:15]:
    layer.trainable = False

Once we retrain our network, we get the following results:

Our results are noticeably better. An increase in accuracy from 0.80 to 0.82 might not seem like much, but keep in mind that it comes from a single technique, so it shouldn't be compared directly with the gains obtained from whole sets of augmentations. When you see bigger increases in accuracy in other examples, they are usually the product of using a set of augmentations, which in implementation looks somewhat like this:

train_datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    width_shift_range=0.25,
    height_shift_range=0.25,
    rotation_range=10,
    brightness_range=[0.2, 1.2],
    horizontal_flip=True)

Such a set would probably lead to better results, but as you can see, it is actually a combination of more than five different augmentation techniques.

Combining super-resolution upscaling from 48x48 to 192x192 with a set of augmentations such as the one we just defined is what would most likely lead to the best results. So when designing and training a neural network, we should not limit ourselves to the most common practices, but should also try to use our knowledge from other fields to improve our results.

 

Conclusion 

In this article, we demonstrated that it is possible to build a model that would gauge how interested students are in a particular lecture. By tracking whether students exhibit positive, neutral, or negative emotions during a lecture, an instructor could track which parts of the lecture are interesting and which parts may still require some adjustments to better captivate students. Great results were achieved on a fairly limited dataset, which means that there is potential for even better results with a better dataset. This is especially true when you consider that the results we got represent what can be achieved without too much model tuning, and while using almost no traditional data augmentation techniques.

Since we didn't opt for traditional data augmentation techniques, we decided to implement something else: upscaling using neural networks. This idea led to a noticeably better result and should be considered alongside implementing traditional image augmentation techniques (such as rotation, translation, color augmentation, zooming, flipping or cropping) if we want to try and achieve the best possible results with the model presented in this article.

Overall, this series of articles was designed to demonstrate the importance of emotion AI and emotion recognition, and how one can implement and use emotion recognition. The previous article in the series was designed to give readers an easy way of performing emotion recognition with just a few lines of code, while this one delved deeper into what we need to do to train our model and showed that there is a potential application of this technology in the education industry.

Boris Delovski

Data Science Trainer

Boris is a data science trainer and consultant who is passionate about sharing his knowledge with others.

Before Edlitera, Boris applied his skills in several industries, including neuroimaging and metallurgy, using data science and deep learning to analyze images.