In [1]:
%matplotlib inline

10. Introducing Decord: an efficient video reader

The problem:

Training deep neural networks on videos is very time-consuming. For example, training a state-of-the-art SlowFast network on the Kinetics400 dataset takes more than 10 days on a server with 8 V100 GPUs. Slow training lengthens research cycles and makes video-related problems less approachable for newcomers and students. Several factors contribute to the slowness: large batches of data, inefficient video readers, and heavy model computation.

Another pain point is the complex data preprocessing and its huge storage cost. Take Kinetics400 as an example: the dataset has about 240K training and 20K validation videos, which together take 450G of disk space. If we decode the videos to frames and train with an image loader, the decoded frames take 6.8T of disk space, which is unacceptable to most people. The decoding itself is also slow: it takes 1.5 days with 60 workers to decode all the videos to frames. With 8 workers (as on a common laptop or standard workstation), this preprocessing takes more than a week before any actual training starts.
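The figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it treats 1T as 1000G and assumes decoding scales linearly with the number of workers, which real hardware will not match exactly.

```python
# Figures quoted in the paragraph above (Kinetics400).
video_storage_gb = 450            # encoded videos
frame_storage_gb = 6.8 * 1000     # decoded frames, treating 1T as 1000G
decode_days_60_workers = 1.5

# How much larger the decoded frames are than the encoded videos.
storage_ratio = frame_storage_gb / video_storage_gb

# Assumption: decoding parallelizes perfectly, so total work is constant.
worker_days = decode_days_60_workers * 60
decode_days_8_workers = worker_days / 8

print(f"Decoded frames need about {storage_ratio:.1f}x the disk space")
print(f"Decoding with 8 workers takes about {decode_days_8_workers:.2f} days")
```

Under these assumptions the decoded frames need roughly 15 times the disk space of the videos, and decoding on 8 workers takes on the order of 11 days.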

The solution: Decord

Given these challenges, this tutorial introduces a new video reader, Decord. Decord is efficient and flexible. It provides convenient video slicing methods through a wrapper on top of hardware-accelerated video decoders, e.g. FFMPEG/LibAV and Nvidia Codecs. It is designed to take the awkwardness out of video shuffling and to provide a smooth experience similar to a random image loader for deep learning. It also works cross-platform, e.g. on Linux, Windows and Mac OS. With the new video reader, you no longer need to decode videos to frames; you can start training directly on your video dataset, at even higher training speed.

Install

Decord is easy to install with pip:

pip install decord

Usage

We provide some usage examples here to get you started. For the complete API, please refer to the official documentation.

Suppose we want to read a video. Let's download the example video first.

In [9]:
# Download the example video
# ! pip install --upgrade gluoncv
from gluoncv import utils
url = 'https://github.com/bryanyzhu/tiny-ucf101/raw/master/abseiling_k400.mp4'
video_fname = utils.download(url)
# video_fname = './handGesture-5_1.avi'

from decord import VideoReader
vr = VideoReader(video_fname)

How to use it:

To load the video at a specific resolution so that it can be fed into a CNN for processing, pass the desired width and height:

In [3]:
# Load the video with VideoReader(), resizing every frame to 320x256
vr = VideoReader(video_fname, width=320, height=256)
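In the example above, frames come back at exactly the requested 320x256, whatever the source resolution was. If the aspect ratio should be preserved instead, one option is to compute the target size before opening the reader. The helper below is a hypothetical sketch, not part of Decord; `fit_short_side` and its parameters are names chosen here for illustration.

```python
def fit_short_side(orig_width, orig_height, short_side=256):
    """Return (width, height) scaled so the shorter edge equals short_side."""
    if orig_width <= 0 or orig_height <= 0:
        raise ValueError("frame dimensions must be positive")
    scale = short_side / min(orig_width, orig_height)
    # Round to the nearest pixel and never return a zero-sized dimension.
    new_width = max(1, round(orig_width * scale))
    new_height = max(1, round(orig_height * scale))
    return new_width, new_height


# Example: a 340x256 source scaled so that its short side is 128 pixels.
width, height = fit_short_side(340, 256, short_side=128)
print(width, height)
```

The resulting pair could then be passed as the `width` and `height` arguments of `VideoReader`. The original size would have to come from somewhere else, for instance from the shape of one frame read without resizing.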
  1. Count the frames in the video

Now that the video is loaded, we can check how many frames it contains:

In [4]:
# Count the frames in the video with len()
duration = len(vr)
print('The video contains %d frames' % duration)
The video contains 250 frames
  2. Access a specific frame

To access the frame at index 10:

In [5]:
# Access a specific frame by index
frame = vr[10]
print(frame.shape)

# Preview the frame
import matplotlib.pyplot as plt
plt.imshow(frame.asnumpy())
plt.show()
(256, 320, 3)
  3. Get multiple frames at once with get_batch()

For deep learning, we usually want to get multiple frames at once, which is what the get_batch function is for. Suppose we want a 32-frame video clip that skips one frame in between, i.e. takes every second frame for 32 frames in total:

In [6]:
frame_id_list = range(0, 64, 2)
frames = vr.get_batch(frame_id_list).asnumpy()
print(frames.shape)
(32, 256, 320, 3)
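The `range(0, 64, 2)` above only works when the video has at least 63 frames. A minimal sketch of a more defensive index generator follows; `clip_indices` is a hypothetical helper introduced here, and clamping to the last frame is just one possible policy for short videos.

```python
def clip_indices(num_frames, clip_len=32, stride=2, start=0):
    """Indices for clip_len frames taken every `stride` frames from `start`.

    Indices that would run past the end of the video are clamped to the
    last frame, so short videos repeat their final frame instead of failing.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    last = num_frames - 1
    return [min(start + i * stride, last) for i in range(clip_len)]


# For the 250-frame example video this reproduces range(0, 64, 2).
print(clip_indices(250)[:5])
# A 10-frame video cannot supply index 12, so the last frame is repeated.
print(clip_indices(10, clip_len=4, stride=4))
```

The returned list can be passed to `vr.get_batch(...)` in the same way as `frame_id_list`, for example with `num_frames=len(vr)`.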
  4. Advanced: get the key frames of the video with get_key_indices()

Another, more advanced feature is retrieving all the key frames of the video:

In [7]:
# Get the indices of the key frames (values between 0 and n)
key_indices = vr.get_key_indices()
# Fetch the key frames by index (all of them in one batch)
key_frames = vr.get_batch(key_indices)
print(key_frames.shape)

# Preview the first key frame
plt.imshow(key_frames.asnumpy()[0])
plt.show()
(1, 256, 320, 3)
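Key frames are the frames a decoder can reconstruct without looking at earlier frames, so reaching an arbitrary frame generally means decoding forward from the closest preceding key frame. The sketch below shows one way to look that frame up. `nearest_key_frame` is a hypothetical helper, and it assumes the key indices are available as a sorted list of integers.

```python
from bisect import bisect_right


def nearest_key_frame(key_indices, frame_idx):
    """Return the last key-frame index that is <= frame_idx.

    key_indices must be sorted in ascending order.
    """
    pos = bisect_right(key_indices, frame_idx)
    if pos == 0:
        raise ValueError("frame_idx precedes the first key frame")
    return key_indices[pos - 1]


# Made-up key-frame positions, purely for illustration.
example_keys = [0, 48, 96]
print(nearest_key_frame(example_keys, 50))
```

The distance between a frame and its nearest preceding key frame gives a rough idea of how much decoding a random access to that frame costs.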

Pretty flexible, right? Try it on your videos.

Speed comparison

Now we compare its speed with OpenCV VideoCapture to demonstrate its efficiency. We load the same video and read all of its frames in random order with both decoders. Each loader runs 11 times: the first run serves as a warm-up, and the remaining 10 runs are averaged.

In [8]:
import cv2
import time
import numpy as np

frames_list = np.arange(duration)
np.random.shuffle(frames_list)

# Decord: time 10 runs after one untimed warm-up run (i == 0)
for i in range(11):
    if i == 1:
        start_time = time.time()
    decord_vr = VideoReader(video_fname)
    frames = decord_vr.get_batch(frames_list)
end_time = time.time()
print('Decord takes %4.4f seconds.' % ((end_time - start_time)/10))

# OpenCV: same protocol, seeking to each frame before reading it
for i in range(11):
    if i == 1:
        start_time = time.time()
    cv2_vr = cv2.VideoCapture(video_fname)
    for frame_idx in frames_list:
        cv2_vr.set(cv2.CAP_PROP_POS_FRAMES, int(frame_idx))
        _, frame = cv2_vr.read()
    cv2_vr.release()
end_time = time.time()
print('OpenCV takes %4.4f seconds.' % ((end_time - start_time)/10))
Decord takes 2.8429 seconds.
OpenCV takes 4.2057 seconds.

In this run, Decord is about 1.5x faster than OpenCV VideoCapture (2.84 versus 4.21 seconds per pass); the exact ratio depends on the video and the hardware. We also compared against the PyAV container and observed about a 2x speedup.
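The benchmark above discards one warm-up run and averages the remaining ten. That pattern can be factored into a small helper so that every loader is timed in exactly the same way. This is a sketch only: `average_runtime` is a name introduced here, and it uses `time.perf_counter` in place of the `time.time` calls in the cell above.

```python
import time


def average_runtime(fn, warmup=1, repeats=10):
    """Call fn() warmup + repeats times; return mean seconds per timed call."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


# Stand-in workload so the example runs without any video file.
calls = []
elapsed = average_runtime(lambda: calls.append(1))
print('%d calls, %.6f seconds per timed call' % (len(calls), elapsed))
```

To time a real loader, wrap the body of one benchmark iteration in a function and pass it as `fn`.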

In conclusion, Decord is an efficient and flexible video reader. It supports get_batch, GPU loading, fast random access and more, which makes it well suited to training video deep neural networks. We use Decord in our video model training on large-scale datasets and observe speeds similar to those of image loaders on decoded video frames. This significantly reduces the data preprocessing time and the storage cost for large-scale video datasets.