Video Classification Experiments: combining Image with Audio features

Note: code is provided at the end of the post

A couple of weeks ago I challenged myself to build a video classifier from a dataset of personal short clips (each less than a minute long). I had fewer than 50 videos available. Based on their content, I assigned a 1 or 0 label to each of them and gave the binary classification task a go.

Given the very limited size of the dataset, I immediately discarded the option of a video-level model. More advanced Deep Learning approaches, such as 3D-ConvNets or CNN-RNN architectures, would require far more than 50 files to provide any valuable result. I had briefly tested those in the past (summary in this post), but back then I had 30k+ videos available. A completely different story.

My only hope was a frame-level approach, consisting in turning each MP4 into its individual frames and flagging each of those with the label of the parent video. Stated differently, I moved from video to image classification, hoping for frame-level performance decent enough to stay accurate even when aggregated at video level. At inference time, I planned to take a majority vote among the frames belonging to the same video to obtain MP4-level predictions. Below is a summary of the approach, in pictures.
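
As a minimal sketch of the inference-time aggregation, with made-up clip names and frame-level predictions (the full version is the map_frames_to_video method in the Appendix):

import pandas as pd

# one row per frame: parent clip name plus the frame-level prediction (0/1)
frame_preds = pd.DataFrame({
    'video': ['clip_01'] * 4 + ['clip_02'] * 3,
    'pred':  [1, 1, 0, 1, 0, 0, 1],
})

# majority vote: a clip is labeled 1 if more than half of its frames are predicted 1
video_preds = (frame_preds.groupby('video')['pred']
               .mean()
               .gt(0.5)
               .astype(int)
               .rename('video_pred')
               .reset_index())
print(video_preds)  # clip_01 -> 1, clip_02 -> 0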

The strategy presented a couple of clear disadvantages:

  1. Losing the temporal information. I created a new dataset in which the concept of a video didn’t exist anymore. Frames stood on their own.
  2. In some cases, the action which identified the video as belonging to class 1 or 0 occurred in just a fraction of the clip. This meant that a bunch of frames was tagged with a dependent variable which had very little to do with the frame itself. The information made sense at video level, not at frame level.

There were also some clear advantages though:

  1. This approach allowed me to generate a much larger dataset (a couple thousand images in total – sampling every 3rd frame of each clip). This is way better than the 50ish data points I faced at video-level.
  2. I didn’t have to handle the issue of videos of different lengths, as frames were stacked vertically. The only drawback might be one video being over-represented with respect to another, but I could live with it.

In terms of modeling, I opted for a plain and simple Random Forest. The reason for not going with a Neural Network was that I wanted to give my full attention to the various feature-extraction strategies, instead of devoting time to the purely modeling part. Not long before, I had stumbled upon a couple of blog posts illustrating how to extract the audio track from a video and how to subsequently compute relevant features out of it. I found that quite interesting and decided to give it a try. So the plan was to play around with both audio and image properties, combining them in different ways.

I ended up testing four approaches.

1. Using only image data (ConvNet as static feature extractor)

Here I simply passed every frame through a pre-trained VGG16 network and took the activations of the last convolutional block. Those are 512 x 7 x 7 tensors. I then applied a GlobalAveragePooling2D layer on top, to get rid of the 7 x 7 dimensions and spit out a 512-dimensional array. This approach produced very good results both at frame and at video level.
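
For reference, the frame-level feature extraction can be sketched as follows; this is a minimal, standalone version of what extract_features does internally (see the Appendix), with a random array standing in for a real frame:

import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.layers import GlobalAveragePooling2D
from keras.models import Model

# VGG16 without its fully-connected head, plus global average pooling to
# collapse the 7 x 7 spatial grid into a single 512-dimensional vector
base_model = VGG16(weights='imagenet', include_top=False)
pooled = GlobalAveragePooling2D()(base_model.output)
extractor = Model(inputs=base_model.input, outputs=pooled)

# a stand-in for one resized frame, shape (1, 224, 224, 3)
frame = preprocess_input(np.random.rand(1, 224, 224, 3) * 255)
print(extractor.predict(frame).shape)  # (1, 512)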

valid_d, train_d = data.extract_features((224, 224), 'vgg16', 'only_frames')
X_train, y_train = train_d['data'][:, 1:], train_d['data'][:, 0]
X_valid, y_valid = valid_d['data'][:, 1:], valid_d['data'][:, 0]
r = RandomForestClassifier(n_estimators=150, min_samples_leaf=19, oob_score=True, random_state=40)
r.fit(X_train, y_train)
Summary of approach 1

2. Boosting the image-only data approach by augmenting the dataset 

In terms of feature extraction, the strategy is identical to the one above: VGG16 is used as a static activation calculator. Before passing images through the ConvNet, though, I augmented the dataset, looping through each frame 3 times and randomly applying either rotation, shifting or flipping. Pretty standard. I hoped it would help with overfitting, as that was arguably the biggest issue I encountered with Random Forests overall (even after manually tweaking n_estimators and min_samples_leaf). I found no significant benefit: results were globally very encouraging, with the same performance as #1, but three times slower due to the augmentation step.
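
The augmentation itself is nothing fancy; a simplified sketch of the per-frame logic (the fuller version is the augment method in the Appendix):

import numpy as np
import scipy.ndimage

def random_augment(frame):
    # randomly rotate, flip or slightly shift a frame
    choice = np.random.randint(4)
    if choice == 0:
        return np.rot90(frame, np.random.randint(1, 4))  # rotate 90/180/270 degrees
    if choice == 1:
        return np.fliplr(frame)                          # horizontal flip
    if choice == 2:
        return np.flipud(frame)                          # vertical flip
    return scipy.ndimage.shift(frame, 0.2)               # small sub-pixel shift

frame = np.random.rand(224, 224, 3)                      # stand-in for a real frame
augmented = [random_augment(frame) for _ in range(3)]    # 3 augmented copies per frame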

valid_d, train_d = data.extract_features((224, 224), 'vgg16', 'augmented', aug=True, iterations=3)
X_train, y_train = train_d['data'][:, 1:], train_d['data'][:, 0]
X_valid, y_valid = valid_d['data'][:, 1:], valid_d['data'][:, 0]
r = RandomForestClassifier(n_estimators=150, min_samples_leaf=23, oob_score=True, random_state=40)
r.fit(X_train, y_train)
Summary of approach 2

3. Concatenating image data with frame-level raw audio signal

This is where it started getting interesting. The first step was to isolate the raw audio from the MP4 file. For those (like me) handling audio for the first time: be aware that, as expected, the result of this operation is yet another numeric array (an example is plotted below). Exactly what we need. Given this array, I thought the most obvious way of incorporating it into my Machine Learning pipeline was to ingest it as it was, with no post-processing on top. More specifically, I came up with a proxy for a “frame-level audio signal”. I needed that in order to concatenate each frame’s ConvNet activations with the audio trace overlapping in time with the frame itself. I.e. if the frame appeared at second 23 of the clip, I tried to associate it with the audio produced between seconds 22 and 24. Something like that. To achieve this, I split the audio array into as many chunks as the frames I had pulled from the video. As I needed exactly the same number of audio features for each frame, even across different clips, I decided to sample 100 random data points from each frame-level audio chunk. Results here were comparable with #1 and #2: faster than the augmentation strategy in #2, but more complex than #1 overall.
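
The chunking-and-sampling step boils down to something like this (file name and frame count are placeholders; the actual code lives in extract_features in the Appendix, triggered by return_ts=True):

import numpy as np
import librosa

# load the audio track previously extracted from the MP4 with ffmpeg
signal, sample_rate = librosa.load('clip.wav', res_type='kaiser_fast')
n_frames = 150  # number of frames sampled from the same clip

# split the raw signal into as many chunks as frames, then sample 100 random
# points per chunk (each chunk must contain at least 100 samples) so that
# every frame gets a fixed-size audio vector
chunks = np.array_split(signal, n_frames)
audio_features = np.vstack([np.random.choice(c, 100, replace=False) for c in chunks])
print(audio_features.shape)  # (150, 100)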

valid_d, train_d = data.extract_features((224, 224), 'vgg16', 'frames_audio_trace', audio=True, n_mfcc=100, return_ts=True)
X_train, y_train = train_d['data'][:, 1:], train_d['data'][:, 0]
X_valid, y_valid = valid_d['data'][:, 1:], valid_d['data'][:, 0]
r = RandomForestClassifier(n_estimators=150, min_samples_leaf=19, oob_score=True, random_state=40)
r.fit(X_train, y_train)
Example of raw audio signal extracted from video
Summary of approach 3

4. Concatenating image data with video-level MFCC audio features

This is a natural continuation of the previous approach, in which I had ingested the raw audio signal into the ML pipeline without any specific post-processing. It turns out that one of the most widely adopted techniques to extract relevant features from audio is to calculate its Mel Frequency Cepstral Coefficients (MFCCs). I won’t try to get into the details of what those are and how they are calculated; this post does an excellent job at that and I’d suggest you take a look at it. Just to provide some highlights, though: MFCCs are numbers which summarize how a human ear would perceive and represent an audio wave. In one sentence, the feature-extraction procedure consists in Fourier-transforming the temporal signal onto the frequency space and then processing those frequencies until they reasonably approximate how we humans hear.

Having said that, what I did in practice was to calculate the MFCCs of each video’s audio trace (`librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=100)`) and then use the coefficients at frame level. So frames from the same video had the same MFCCs. ConvNet features were there too, as usual. Results were disastrous: huge overfitting and very poor validation-set performance. I guess I could have tried an approach similar to #3, i.e. a proxy for frame-level MFCCs, just to introduce more variability into the feature set. This could have been achieved by splitting the raw audio signal into as many chunks as frames and calculating the MFCCs on top of each one of those (see the sketch after the code below). I did not try it out and I will keep it as a reminder for future experiments.

valid_d, train_d = data.extract_features((224, 224), 'vgg16', 'frames_mfcc', audio=True, n_mfcc=100)
X_train, y_train = train_d['data'][:, 1:], train_d['data'][:, 0]
X_valid, y_valid = valid_d['data'][:, 1:], valid_d['data'][:, 0]
r = RandomForestClassifier(n_estimators=250, min_samples_leaf=25, oob_score=True, random_state=40)
r.fit(X_train, y_train)
Example of MFCC coefficients extracted by post-processing raw audio signal
Summary of approach 4
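
For completeness, here is a sketch of the frame-level MFCC variant mentioned above, which I did not actually test (file name and frame count are placeholders):

import numpy as np
import librosa

signal, sample_rate = librosa.load('clip.wav', res_type='kaiser_fast')
n_frames = 150  # number of frames sampled from the same clip

# compute MFCCs chunk by chunk instead of once per video, averaging over time
# to get one 100-dimensional vector per frame
chunks = np.array_split(signal, n_frames)
frame_mfccs = np.vstack([
    np.mean(librosa.feature.mfcc(y=c, sr=sample_rate, n_mfcc=100).T, axis=0)
    for c in chunks
])
print(frame_mfccs.shape)  # (150, 100)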

Final Thoughts

In this post, I went through a few feature-engineering approaches to consider when dealing with video classification. The biggest takeaways for me were the following:

  • It was the first time I played with the audio signal from a video file. This turned out to be way easier than expected, thanks to the `librosa` python library.
  • Audio didn’t help improve model performance (at least in my case), when added on top of ConvNet image features.
  • Turning video classification into image classification (i.e. frame-level instead of video-level modeling) works pretty well. I was obliged to approach the exercise from this perspective due to the very limited size of the dataset. Nevertheless, I think it could easily be applied to much larger datasets too, where one might be tempted to immediately go for a video-level approach. It is certainly a conceptually much simpler approach than Conv3D or CNN-RNN strategies.

Appendix: Code

%load_ext autoreload
%autoreload 2
%matplotlib inline
import IPython.display as ipd
from IPython.display import HTML
import librosa
import librosa.display
import subprocess
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict
import io
import pickle
import scipy
import base64
import imageio
import itertools
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import confusion_matrix
from sklearn.externals import joblib
import keras
from keras.models import Model
from keras.preprocessing import image
from keras.preprocessing.image import ImageDataGenerator, array_to_img
from keras.applications.vgg16 import VGG16
from keras.applications.resnet50 import ResNet50
from keras.layers import GlobalAveragePooling2D
folder = f"{os.path.join(os.getcwd(), 'data', 'train')}"
class DataSet():
    
    def __init__(self, folder):
        self.folder = folder
        self.dataset = self.get_dataset()
    
    def get_dataset(self):
        dataset = pd.read_csv(os.path.join(self.folder, 'labels.txt'), sep='\t')
        dataset.columns = ['path', 'label']
        dataset['name'] = dataset.path
        dataset.path = dataset.path.apply(lambda x: os.path.join(self.folder, x + '.mp4'))
        return dataset
    
    def preprocess_input_resnet50(self, x):
        from keras.applications.resnet50 import preprocess_input
        X = np.expand_dims(x, axis=0)
        X = preprocess_input(X)
        return X[0]
    
    def preprocess_input_vgg16(self, x):
        from keras.applications.vgg16 import preprocess_input
        X = np.expand_dims(x, axis=0)
        X = preprocess_input(X)
        return X[0]
    
    def augment(self, src, choice):
        if choice == 0:
            # Rotate 90
            src = np.rot90(src, 1)
        if choice == 1:
            # flip vertically
            src = np.flipud(src)
        if choice == 2:
            # Rotate 180
            src = np.rot90(src, 2)
        if choice == 3:
            # flip horizontally
            src = np.fliplr(src)
        if choice == 4:
            # Rotate 90 counter-clockwise
            src = np.rot90(src, 3)
        if choice == 5:
            # Rotate 180 and flip horizontally
            src = np.rot90(src, 2)
            src = np.fliplr(src)
        if choice == 6:
            # leave it as is
            src = src
        if choice == 7:
            # shift
            src = scipy.ndimage.shift(src, 0.2)
        return src
        
    def prepare_image(self, img, size, preprocessing_function, aug=False):
        img = scipy.misc.imresize(img, size)
        img = np.array(img).astype(np.float64)
        if aug: img = self.augment(img, np.random.randint(8))
        img = preprocessing_function(img)
        return img
    
    def extract_audio(self, filepath):
        command = f"ffmpeg -i {filepath} -ab 160k -ac 2 -ar 44100 -vn {filepath.replace('mp4', 'wav')}"
        subprocess.call(command, shell=True)
        
    def extract_audio_features(self, filepath, n_mfcc, return_ts=False):
        if not os.path.isfile(f"{filepath.replace('mp4', 'wav')}"): self.extract_audio(filepath)
        n_mfcc = 40 if n_mfcc is None else n_mfcc
        X, sample_rate = librosa.load(f"{filepath.replace('mp4', 'wav')}", res_type='kaiser_fast')
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=n_mfcc).T,axis=0)
        if not return_ts: return mfccs
        else: return mfccs, X, sample_rate
    
    def get_mfccs(self, n_mfcc):
        data = self.dataset
        data['mfcc'] = data.path.apply(lambda x: self.extract_audio_features(x, n_mfcc))
        return data
            
    def process_video(self, filepath, size, preprocessing_function, aug=False):
        vid = imageio.get_reader(filepath)
        nframes = vid.get_meta_data()['nframes']
        l = []
        for frame in range(0, nframes, 3): 
            try:
                l.append(self.prepare_image(vid.get_data(frame), size, preprocessing_function, aug=aug))
            except RuntimeError:
                pass
        return l
    
    def extract_features(self, size, which_net, what, audio=False, n_mfcc=None, return_ts=False, aug=False, iterations=1):
        if which_net == 'resnet50': 
            preprocessing_function=self.preprocess_input_resnet50
            base_model = ResNet50(weights='imagenet', include_top=False)
        elif which_net == 'vgg16': 
            preprocessing_function=self.preprocess_input_vgg16
            base_model = VGG16(weights='imagenet', include_top=False)
            
        x = base_model.output
        x = GlobalAveragePooling2D()(x)
        model = Model(inputs=base_model.input, outputs=x)
        
        train = []
        valid = []
        train_i = 0
        valid_i = 0
        train_dict = {'idx': defaultdict(list), 'data': None}
        valid_dict = {'idx': defaultdict(list), 'data': None}
                
        for path, label, name in self.dataset.values:
            for _ in range(iterations):
                print(f'Processing: {path}; Label: {label}')
                v = np.array(self.process_video(path, size, preprocessing_function, aug=aug))
                frames = v.shape[0]
                p = np.squeeze(model.predict_on_batch(v))
                p = np.hstack((np.tile(label, frames).reshape(-1, 1), p))
                if audio: 
                    if not return_ts:
                        mfccs = self.extract_audio_features(path, n_mfcc=n_mfcc)
                        mfccs = np.tile(mfccs, (frames, 1))
                        p = np.hstack((p, mfccs))
                    else:
                        mfccs, X, sample_rate = self.extract_audio_features(path, n_mfcc=n_mfcc, return_ts=True)
                        X = np.array_split(X, frames)
                        X = [np.random.choice(i, 100, replace=False) for i in X]
                        X = np.vstack(X)
                        p = np.hstack((p, X))                    
                if 'validation' in path:
                    valid.append(p)
                    valid_dict['idx'][name].append((valid_i, valid_i+frames))
                    valid_i+=frames
                else:
                    train.append(p)
                    train_dict['idx'][name].append((train_i, train_i+frames))
                    train_i+=frames
                if not aug: break
                
        
        valid_dict['data'] = np.vstack(valid)
        train_dict['data'] = np.vstack(train)
        
        with open(what+'_train.pickle', 'wb') as handle:
            pickle.dump(train_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
        with open(what+'_valid.pickle', 'wb') as handle:
            pickle.dump(valid_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)            
        
        return valid_dict, train_dict
    
    def map_frames_to_video(self, X, r, d):
        t_p = r.predict(X)
        l = []
        for k, v in d['idx'].items():
            # account for every index range of a video (more than one when augmentation is on)
            l += [k] * sum(end - start for start, end in v)
        d = pd.DataFrame({'name': l, 'frame_pred': t_p})
        d = d.groupby('name').frame_pred.mean().reset_index()
        d['video_pred'] = np.where(d.frame_pred > .5, 1, 0)
        return d
# instantiating the class
folder = 'path to the MP4 files'
data = DataSet(folder)
########################################################
########################################################
# ML EXPERIMENTS GO HERE. AS SHOWN ABOVE
########################################################
########################################################
# from frame-level predictions to video-level picking the best model
Xv = data.map_frames_to_video(X_valid, r, valid_d)
Xt = data.map_frames_to_video(X_train, r, train_d)
Xv['split'] = 'valid'
Xt['split'] = 'train'
X = Xt.append(Xv).merge(data.dataset)
X['accuracy'] = X.label == X.video_pred