Training and evaluation of automotive perception models, e.g., 3D object detection and semantic segmentation, require high-quality annotated images. Collecting annotated data is costly and, in some cases, inherently hard, e.g., for rare objects or events such as emergency vehicles, accidents, or odd driving behaviors. Rapid advances in the realism, diversity, and controllability of generative models have recently shown great promise for generating samples to be consumed by perception models for better training or evaluation.
We demonstrate an application of state-of-the-art generative models, namely diffusion models and differentiable renderers, to creating data for automotive perception models. A key requirement of this application is tight fidelity: the generated data must precisely match the intended annotations in 3D space. This is especially challenging for standard diffusion models, e.g., StableDiffusion or ControlNet, which are trained solely on images with no explicit understanding of 3D geometry. Consequently, most existing image editing tools built on top of standard diffusion models fail to meet the high-fidelity requirements of automotive applications. Using explicit 3D conditioning of diffusion models and post-generation fidelity filters, we reinforce the object pose fidelity of diffusion models when editing the shape and appearance of vehicles. Additionally, we showcase an application of dynamic Gaussian splatting for 3D reconstruction of driving scenes that allows us to seamlessly remove, translate, or add vehicles to the scene with high fidelity. We demonstrate these capabilities on multi-camera driving videos and rely on cross-scene and cross-dataset actor transfer to increase the diversity of edits beyond the objects already present in the scene.
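To make the notion of a post-generation fidelity filter concrete, the following is a minimal sketch, not the paper's implementation: it assumes a hypothetical conditional generator generate_with_3d_condition (a diffusion model conditioned on projected 3D boxes) and a hypothetical pretrained detector run_3d_detector, and keeps a generated edit only if the detector recovers the intended 3D pose within tolerance.

import numpy as np

def pose_error(gt_box, det_box):
    """Center distance (m) and yaw difference (rad) between two 3D boxes.

    Boxes are (x, y, z, l, w, h, yaw) arrays in the same world frame.
    """
    center_err = np.linalg.norm(gt_box[:3] - det_box[:3])
    yaw_err = np.abs((gt_box[6] - det_box[6] + np.pi) % (2 * np.pi) - np.pi)
    return center_err, yaw_err

def fidelity_filter(scene, target_boxes, n_samples=8,
                    max_center_err=0.5, max_yaw_err=np.deg2rad(10)):
    """Generate candidate edits and keep those matching the target annotations."""
    kept = []
    for _ in range(n_samples):
        # Hypothetical placeholders, not real library calls:
        image = generate_with_3d_condition(scene, target_boxes)
        detections = run_3d_detector(image)
        # Every target box must be matched by at least one detection
        # within the center and yaw tolerances.
        ok = all(
            any(pose_error(gt, det)[0] <= max_center_err and
                pose_error(gt, det)[1] <= max_yaw_err
                for det in detections)
            for gt in target_boxes
        )
        if ok:
            kept.append((image, target_boxes))
    return kept

The thresholds (0.5 m center error, 10 degrees of yaw) are illustrative assumptions; in practice they would be tuned to the tolerance of the downstream perception task.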