In my new project at work I had to process a fairly large set of image data for a multi-label, multi-class classification task. Despite GPU utilization being close to 100%, a single training epoch over 2 million images took close to 3.5 hours to run. This is a big issue if you’re running your baseline experiments and want quick results. My first thought was that since I was processing the original full-size images, each of which was at least a few MB, the bottleneck was disk I/O. I used ImageMagick’s mogrify to resize all 2 million images, which took a long time. To my astonishment, resizing the images didn’t reduce the training time at all! Well, not noticeably. So I went through the code and found that the major bottleneck was the image augmentation operations in PyTorch.
from torchvision import transforms
def get_image_transforms() -> transforms.Compose:
    """
    These transformations, meant for data augmentation, are a bottleneck since
    all the operations are done on CPU and then the tensors are copied to the
    GPU device.
    """
    return transforms.Compose([
        transforms.RandomSizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
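To see why this hurts, here is a rough sketch of how such a transform pipeline is typically consumed; the ImageFolder dataset, paths, batch size, and worker count below are illustrative placeholders, not my project’s actual setup. Every image is decoded and augmented on the CPU inside the DataLoader workers, and only the finished tensors are copied to the GPU in the training loop.

import torch
from torch.utils.data import DataLoader
from torchvision import datasets

# Placeholder dataset and loader, for illustration only.
train_dataset = datasets.ImageFolder("train/", transform=get_image_transforms())
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=8)

device = torch.device("cuda")
for images, labels in train_loader:
    # The augmentations above already ran on the CPU inside the worker
    # processes; only now are the resulting tensors copied to the GPU.
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward / backward pass ...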
While browsing GitHub I found that people at NVIDIA had recently released a library, DALI, that is designed to tackle exactly this issue. The library is still under active development and supports fast data augmentation for all the major ML frameworks out there: PyTorch, TensorFlow, and MXNet.
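To give a feel for what this looks like, here is a minimal sketch of a DALI pipeline that performs the same decode, random-resized-crop, flip, and normalize steps on the GPU and feeds PyTorch directly. It uses DALI’s fn/pipeline_def interface, and the directory, batch size, and thread count are placeholder values, not the ones from my project.

from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def
def train_pipe(data_dir):
    # Read encoded JPEGs from disk and decode them on the GPU ("mixed" device).
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    # GPU-side equivalents of the crop, flip, and normalize transforms above.
    images = fn.random_resized_crop(images, size=224)
    mirror = fn.random.coin_flip(probability=0.5)
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        mirror=mirror,
    )
    return images, labels

# Placeholder pipeline parameters, for illustration only.
pipe = train_pipe(data_dir="train/", batch_size=64, num_threads=4, device_id=0)
pipe.build()
train_loader = DALIGenericIterator(pipe, ["data", "label"], reader_name="Reader")

for batch in train_loader:
    # The augmented image batch already lives on the GPU; no host-to-device
    # copy of augmented tensors is needed in the training loop.
    images, labels = batch[0]["data"], batch[0]["label"]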