To train and evaluate our ensembled models for human pose estimation, we use the HumanEva-I (or simply HumanEva) dataset. It consists of three subjects performing a set of six pre-defined actions: `Walk`, `Jog`, `Throw/Catch`, `Gesture`, `Box`, and `Combo`. The subjects' movements were recorded by three synchronized cameras at 60 Hz, and ground-truth 3D motion was simultaneously captured with a commercial motion capture system. In total, there are 56 sequences and approximately 80,000 frames. As the human pose model, a 15-joint skeleton was adopted, giving 15 keypoints per frame. Our distributed-ensembles approach can be applied to other 3D pose estimation datasets as long as the 2D/3D keypoint locations are provided.
The data, and the model checkpoints produced later on, are stored under `dbfs:/VideoPose3D`.
ls dbfs:/VideoPose3D/saved_models/humaneva/checkpoints/semi_supervised
path | name | size (bytes) |
---|---|---|
dbfs:/VideoPose3D/saved_models/humaneva/checkpoints/semi_supervised/1_members_ensemble0_iter0.ckpt | 1_members_ensemble0_iter0.ckpt | 3.4199173e7 |
dbfs:/VideoPose3D/saved_models/humaneva/checkpoints/semi_supervised/2_members_ensemble0_iter0.ckpt | 2_members_ensemble0_iter0.ckpt | 3.4199554e7 |
dbfs:/VideoPose3D/saved_models/humaneva/checkpoints/semi_supervised/2_members_ensemble0_iter1.ckpt | 2_members_ensemble0_iter1.ckpt | 3.4199554e7 |
dbfs:/VideoPose3D/saved_models/humaneva/checkpoints/semi_supervised/2_members_ensemble0_iter2.ckpt | 2_members_ensemble0_iter2.ckpt | 3.4199554e7 |
dbfs:/VideoPose3D/saved_models/humaneva/checkpoints/semi_supervised/2_members_ensemble0_iter3.ckpt | 2_members_ensemble0_iter3.ckpt | 3.4199554e7 |
dbfs:/VideoPose3D/saved_models/humaneva/checkpoints/semi_supervised/2_members_ensemble0_iter4.ckpt | 2_members_ensemble0_iter4.ckpt | 3.4199554e7 |
dbfs:/VideoPose3D/saved_models/humaneva/checkpoints/semi_supervised/2_members_ensemble1_iter0.ckpt | 2_members_ensemble1_iter0.ckpt | 3.4199554e7 |
dbfs:/VideoPose3D/saved_models/humaneva/checkpoints/semi_supervised/2_members_ensemble1_iter1.ckpt | 2_members_ensemble1_iter1.ckpt | 3.4199554e7 |
dbfs:/VideoPose3D/saved_models/humaneva/checkpoints/semi_supervised/2_members_ensemble1_iter2.ckpt | 2_members_ensemble1_iter2.ckpt | 3.4199554e7 |
dbfs:/VideoPose3D/saved_models/humaneva/checkpoints/semi_supervised/2_members_ensemble1_iter3.ckpt | 2_members_ensemble1_iter3.ckpt | 3.4199554e7 |
dbfs:/VideoPose3D/saved_models/humaneva/checkpoints/semi_supervised/2_members_ensemble1_iter4.ckpt | 2_members_ensemble1_iter4.ckpt | 3.4199554e7 |
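The listing above was produced in Databricks. To inspect the same directory programmatically, a minimal sketch using the standard `dbutils.fs.ls` utility (assuming a Databricks notebook environment, where `dbutils` is available by default) could look like this:

```python
# List the semi-supervised checkpoints and print each name with its size in MB.
# Assumes a Databricks notebook, where `dbutils` is predefined.
ckpt_dir = 'dbfs:/VideoPose3D/saved_models/humaneva/checkpoints/semi_supervised'

for f in dbutils.fs.ls(ckpt_dir):
    print(f'{f.name}: {f.size / 1e6:.1f} MB')
```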
We start by loading the data that has been pre-processed by the authors of [1] and [2]. Further down we show how we transform it into the `.csv` format, which is well supported by Apache Spark.

First, we load the two `.npz` files with the 2D and 3D keypoints of the HumanEva dataset, respectively.
pip install pip --upgrade --quiet
pip install gdown --quiet
cd /dbfs/VideoPose3D/humaneva
gdown 1EngBymOjXWPntjfNVaGZhBX7sNCNg9pu # data_2d_humaneva15_gt.npz
gdown 1ErTRudqF8ugAwopL3ieral0YMEtE28Dd # data_3d_humaneva15.npz
`data_2d_humaneva15_gt.npz` contains `pos2d` with the 2D keypoint locations of the joints of moving humans in various video sequences. The format is as follows:

- it is a dictionary with keys corresponding to different subjects: `S1`, `S2`, and `S3`
- since it was pre-split into train-validation data by the dataset authors, the keys we actually see are `Train/S1` ... `Validation/S1` ...; however, we ignore that split
- each subject maps to another dictionary with keys corresponding to different actions: `Jog`, `Box`, `Walking`, `Gestures`, `ThrowCatch`
- again, since the data was pre-split, instead of the full videos we get chunks of videos, `Jog chunk0` ...
- for each video we have three views (`camera` can be `0`, `1`, or `2`), since three cameras were recording the moving subjects during data collection
`data_3d_humaneva15.npz` contains `pos3d`, which has the same structure as `pos2d`, but instead of 2D keypoint locations it contains the ground-truth 3D keypoint locations, and it has a single entry per sequence rather than three camera views.
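Before converting anything, it helps to poke at these nested dictionaries directly. The following is a minimal sketch, assuming the two files have been downloaded to `/dbfs/VideoPose3D/humaneva` as above; the exact subject and action keys printed depend on the pre-processed files.

```python
import numpy as np

ROOTDIR = '/dbfs/VideoPose3D'
pos2d = np.load(f'{ROOTDIR}/humaneva/data_2d_humaneva15_gt.npz', allow_pickle=True)['positions_2d'].item()
pos3d = np.load(f'{ROOTDIR}/humaneva/data_3d_humaneva15.npz', allow_pickle=True)['positions_3d'].item()

# Top level: subjects (with the authors' Train/Validation prefixes).
print(list(pos2d.keys()))

# One level down: action chunks for a single subject.
subject = list(pos2d.keys())[0]
print(list(pos2d[subject].keys()))

# For one action chunk: three camera views in 2D, a single array in 3D.
action = list(pos2d[subject].keys())[0]
print(len(pos2d[subject][action]))      # 3 camera views
print(pos2d[subject][action][0].shape)  # (n_frames, 15, 2)
print(pos3d[subject][action].shape)     # (n_frames, 15, 3)
```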
We transform the `.npz` files into `.csv` files that will be used later on when working with RDDs. We split the data into train and test subsets for convenience and make sure that both contain a reasonable portion of data for each subject and each action.
import numpy as np
import pandas as pd

ROOTDIR = '/dbfs/VideoPose3D'
path2d = f'{ROOTDIR}/humaneva/data_2d_humaneva15_gt.npz'
path3d = f'{ROOTDIR}/humaneva/data_3d_humaneva15.npz'

pos2d = np.load(path2d, allow_pickle=True)['positions_2d'].item()
pos3d = np.load(path3d, allow_pickle=True)['positions_3d'].item()

pos_data = []
pos_data_train = []
pos_data_test = []

assert(pos2d.keys() == pos3d.keys())
for subject in pos2d.keys():
    print(f'{subject}: {sum([pos2d[subject][action][0].shape[0] for action in pos2d[subject].keys()])} frames in total')
    assert(pos2d[subject].keys() == pos3d[subject].keys())
    print(list(pos2d[subject].keys()))

    # Train-Test split: reserve the second chunk of each action (when present) for testing
    actions_for_test = [[a for a in pos2d[subject].keys() if action_name in a] for action_name in ['Jog', 'Box', 'Walking', 'Gestures', 'ThrowCatch']]
    actions_for_test = [a[1] for a in actions_for_test if len(a)>1]

    # Add to full data
    for action in pos2d[subject].keys():
        for camera in [0,1,2]:
            n_frames = pos2d[subject][action][camera].shape[0]
            assert(n_frames==pos3d[subject][action].shape[0])
            frames = np.hstack([
                pos2d[subject][action][camera].reshape(n_frames,-1),
                pos3d[subject][action].reshape(n_frames,-1)])
            row = [[subject, action, camera, *frame] for frame in frames]
            pos_data.extend(row)

    # Add to train data
    for action in set(pos2d[subject].keys()) - set(actions_for_test):
        for camera in [0,1,2]:
            n_frames = pos2d[subject][action][camera].shape[0]
            assert(n_frames==pos3d[subject][action].shape[0])
            frames = np.hstack([
                pos2d[subject][action][camera].reshape(n_frames,-1),
                pos3d[subject][action].reshape(n_frames,-1)])
            row = [[subject, action, camera, *frame] for frame in frames]
            pos_data_train.extend(row)

    # Add to test data
    for action in actions_for_test:
        for camera in [0,1,2]:
            n_frames = pos2d[subject][action][camera].shape[0]
            assert(n_frames==pos3d[subject][action].shape[0])
            frames = np.hstack([
                pos2d[subject][action][camera].reshape(n_frames,-1),
                pos3d[subject][action].reshape(n_frames,-1)])
            row = [[subject, action, camera, *frame] for frame in frames]
            pos_data_test.extend(row)

print('Creating full dataframe...')
pos_df = pd.DataFrame(pos_data, columns=['Subject','Action','Camera'] + (','.join([f'x{i+1},y{i+1}' for i in range(15)])).split(',') + (','.join([f'X{i+1},Y{i+1},Z{i+1}' for i in range(15)])).split(','))
print('Creating train dataframe...')
pos_df_train = pd.DataFrame(pos_data_train, columns=['Subject','Action','Camera'] + (','.join([f'x{i+1},y{i+1}' for i in range(15)])).split(',') + (','.join([f'X{i+1},Y{i+1},Z{i+1}' for i in range(15)])).split(','))
print('Creating test dataframe...')
pos_df_test = pd.DataFrame(pos_data_test, columns=['Subject','Action','Camera'] + (','.join([f'x{i+1},y{i+1}' for i in range(15)])).split(',') + (','.join([f'X{i+1},Y{i+1},Z{i+1}' for i in range(15)])).split(','))

SAVE = False
if SAVE:
    pos_df.to_csv(f'{ROOTDIR}/humaneva/humaneva15.csv')
    pos_df_train.to_csv(f'{ROOTDIR}/humaneva/humaneva15_train.csv')
    pos_df_test.to_csv(f'{ROOTDIR}/humaneva/humaneva15_test.csv')
print('Done.')
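Since the point of the `.csv` format is to make the data easy to consume from Apache Spark, here is a minimal sketch of reading the exported files back as Spark DataFrames. It assumes a `SparkSession` named `spark` is already available (as it is in a Databricks notebook) and that `SAVE` was set to `True` above so the files exist.

```python
# Read the exported train/test splits back as Spark DataFrames.
# Assumes `spark` (a SparkSession) is available, e.g. in Databricks.
train_sdf = spark.read.csv(
    'dbfs:/VideoPose3D/humaneva/humaneva15_train.csv',
    header=True, inferSchema=True)
test_sdf = spark.read.csv(
    'dbfs:/VideoPose3D/humaneva/humaneva15_test.csv',
    header=True, inferSchema=True)

print(train_sdf.count(), test_sdf.count())
train_sdf.select('Subject', 'Action', 'Camera', 'x1', 'y1', 'X1', 'Y1', 'Z1').show(5)
```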
We also experimented with 2D keypoint detections produced by Mask R-CNN, which we download in the cell below.
cd /dbfs/VideoPose3D/humaneva/
wget https://dl.fbaipublicfiles.com/video-pose-3d/data_2d_humaneva15_detectron_pt_coco.npz
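As a quick sanity check, the detections can be loaded and inspected much like the ground-truth keypoints. This is a sketch only; we assume the downloaded detection file follows the same `positions_2d` layout as the ground-truth file.

```python
import numpy as np

# Load Mask R-CNN (Detectron) 2D detections; assumed to share the
# nested subject -> action -> camera layout of the ground-truth file.
det = np.load('/dbfs/VideoPose3D/humaneva/data_2d_humaneva15_detectron_pt_coco.npz',
              allow_pickle=True)
print(det.files)  # archive contents

pos2d_det = det['positions_2d'].item()
print(list(pos2d_det.keys()))  # subjects
```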
We pre-save the skeleton definition for the HumanEva dataset.
humaneva_skeleton = {
    # parents[i] is the index of joint i's parent in the kinematic tree; -1 marks the root
    'parents': [-1, 0, 1, 2, 3, 1, 5, 6, 0, 8, 9, 0, 11, 12, 1],
    # indices of the joints on the left and right side of the body
    'joints_left': [2, 3, 4, 8, 9, 10],
    'joints_right': [5, 6, 7, 11, 12, 13],
}
np.savez(f'{ROOTDIR}/humaneva/humaneva_skeleton.npz', data=humaneva_skeleton)
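For reference, the `parents` array is all that is needed to recover the set of bones (parent-child joint pairs) that the plotting code further below draws as chains. A minimal sketch, using the dictionary we just saved:

```python
# Derive the list of bones (parent, child) from the kinematic tree.
# Joint 0 is the root (parent == -1) and has no incoming bone.
bones = [(p, c) for c, p in enumerate(humaneva_skeleton['parents']) if p != -1]
print(bones)
# e.g. bone (0, 1) connects the root to joint 1, (1, 2) joint 1 to joint 2, ...
```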
Let's plot the first frames of the video sequences in our dataset to understand what kind of input is expected by the neural network (more details on that in the next notebook).
from IPython.display import HTML
from matplotlib import pyplot as plt
from matplotlib.animation import FuncAnimation
import matplotlib as mpl
desc_list = []
pos2d_list = []
for row in pos_df.iterrows():
    if row[1].values[2]==0:  # keep only camera 0
        desc_list.append(': '.join(row[1].values[:2]) + f' (Cam {row[1].values[2]})')
        pos2d_list.append(np.array(row[1].values[3:3+15*2]).reshape(15,2))
pos2d_list = np.array(pos2d_list)
figure, ax = plt.subplots(figsize=(10,10))
ax.axis('equal')
xmin, xmax = pos2d_list[:,:,0].min(), pos2d_list[:,:,0].max()
ymin, ymax = pos2d_list[:,:,1].min(), pos2d_list[:,:,1].max()
def animation_function(i):
    ax.clear()
    # Step through every other frame; use the same index for the title and the keypoints
    idx = 2*i
    # Setting title as subject + action + camera
    ax.set_title(desc_list[idx])
    # Setting limits for x and y axis
    ax.set_xlim(xmin, xmax)
    ax.set_ylim(ymin, ymax)
    ax.invert_yaxis()
    # Plotting the 2D keypoints
    x = pos2d_list[idx,:,0]
    y = pos2d_list[idx,:,1]
    plt.scatter(x, y)
    # Plotting the 2D skeleton as five chains of joints
    for indices in [np.array([-1, 1, 0]),
                    np.array([1, 2, 3, 4]),
                    np.array([1, 5, 6, 7]),
                    np.array([0, 8, 9, 10]),
                    np.array([0, 11, 12, 13])]:
        plt.plot(x[indices], y[indices], 'b-')
animation = FuncAnimation(figure, animation_function, frames=500)
#Output too big for mdbook. Animation available in Databricks.
#HTML(animation.to_jshtml())
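Because the rendered animation is too large to embed in the mdbook version, a single static frame can be saved instead. A minimal sketch (the output path under `ROOTDIR` is just an example):

```python
# Render one frame of the animation and save it as a static image,
# so something can be embedded where the interactive animation cannot.
animation_function(0)
figure.savefig(f'{ROOTDIR}/humaneva/first_frame_preview.png', dpi=100, bbox_inches='tight')
```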