Personal live-streaming applications on mobile devices (e.g., video calls and personal broadcasting) are increasingly common. In such applications, the user may want to modify parts of the scene (e.g., replace the background or add augmented-reality effects). Accurate on-device segmentation becomes possible when the model is adapted to the target environment using on-device learning (ODL). In this demo, we present a video segmentation use case that runs efficiently on the device and generalizes well to unseen target environments.
To adapt a small, efficient segmentation model to the user's video stream, we use a much larger teacher model to generate pseudo-masks for the user and the background in the initial frames. These pseudo-labeled frames, along with the background images, are then used to fine-tune the efficient student model. This on-device distillation from a much larger model into a much smaller one mitigates the distribution shift introduced by the user's video. However, on-device fine-tuning incurs significant training time. We therefore parallelize the fine-tuning procedure with distributed training: multiple copies of the model, each initialized with different hyper-parameters and trained for a different number of iterations, are trained on separate processor cores. After training completes, the models are aggregated, and the aggregated model's predictions are used for inference of human segmentation masks.
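The pipeline above (teacher pseudo-labeling, parallel fine-tuning of student copies under different hyper-parameters, then aggregation) can be sketched as follows. This is a minimal illustrative sketch, not the actual system: the scalar one-parameter "student", the `teacher_pseudo_label` and `finetune_student` helpers, the specific learning rates and iteration counts, and parameter averaging as the aggregation rule are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
import random

# Hypothetical stand-in for the large teacher model: it produces soft
# pseudo-labels (scalars here, pseudo-masks in the real system).
def teacher_pseudo_label(x):
    return 2.0 * x

def finetune_student(frames, lr, iters, seed):
    """Fine-tune one student copy with its own hyper-parameters."""
    random.seed(seed)
    w = random.uniform(0.0, 1.0)  # each copy starts from a different init
    for _ in range(iters):
        for x in frames:
            y = teacher_pseudo_label(x)   # pseudo-label from the teacher
            grad = 2 * (w * x - y) * x    # squared-error gradient
            w -= lr * grad
    return w

frames = [0.5, 1.0, 1.5, 2.0]             # stand-in for pseudo-labeled frames
# (learning rate, iterations, seed) per worker -- different hyper-parameters
configs = [(0.05, 40, 0), (0.02, 80, 1), (0.1, 30, 2), (0.01, 120, 3)]

# Train the copies in parallel, one per processor core in the real system.
with ThreadPoolExecutor(max_workers=4) as pool:
    weights = list(pool.map(lambda c: finetune_student(frames, *c), configs))

# Aggregate the copies; simple parameter averaging is assumed here.
w_final = sum(weights) / len(weights)
prediction = w_final * 1.0                # inference with the merged model
```

Each worker distills the same teacher signal but explores a different point in hyper-parameter space, so averaging the converged copies yields a stable final model without any single long training run.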