Poster
in
Workshop: Compositional Learning: Perspectives, Methods, and Paths Forward

Faster Slot Decoding using Masked Transformer

Akihiro Nakano · Masahiro Suzuki · Yutaka Matsuo

Keywords: [ Masked Token Prediction ] [ Image Transformers ] [ Compositional Representation ] [ Object-Centric Learning ]


Abstract:

Object-centric learning models commonly learn a set of representations, or "slots". Recent advances in object-centric learning have introduced autoregressive decoders that decode slots into features or images, allowing models to learn compositional representations from more complex, realistic datasets. However, autoregressive decoding is time-consuming due to its sequential nature, making it difficult to apply to downstream tasks such as video generation. In this paper, we introduce MaskSDT, a masked bidirectional transformer that decodes all slots simultaneously. Experiments on the 3D Shapes and CLEVR datasets demonstrate that our model improves reconstruction performance and generation speed while achieving comparable results in compositional generation.
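The abstract does not spell out MaskSDT's decoding procedure, but masked bidirectional decoding is typically iterative: start with all positions masked, predict every masked position in one parallel transformer pass, keep the most confident predictions, and re-mask the rest on a schedule. The sketch below illustrates this generic MaskGIT-style loop in plain Python; `toy_predict` is a hypothetical stand-in for the transformer's per-position prediction, and the cosine schedule is an assumption, not the paper's method.

```python
import math
import random

def masked_decode(num_tokens, predict_fn, num_steps=4, seed=0):
    """MaskGIT-style parallel decoding sketch (not the paper's exact algorithm):
    begin fully masked, predict all masked positions at once each step, commit
    the most confident predictions, and re-mask the rest on a cosine schedule."""
    rng = random.Random(seed)
    tokens = [None] * num_tokens  # None marks a masked position
    for step in range(num_steps):
        # One "transformer pass": predict every masked position in parallel.
        preds = {i: predict_fn(tokens, i, rng)
                 for i in range(num_tokens) if tokens[i] is None}
        if not preds:
            break
        # Cosine schedule: fraction of positions still masked after this step.
        frac = math.cos(math.pi / 2 * (step + 1) / num_steps)
        keep_masked = int(frac * num_tokens)
        # Commit the most confident predictions; the rest stay masked.
        ranked = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)
        n_commit = max(1, len(preds) - keep_masked)
        for i, (tok, _conf) in ranked[:n_commit]:
            tokens[i] = tok
    # Fill any positions still masked after the last scheduled step.
    for i in range(num_tokens):
        if tokens[i] is None:
            tokens[i] = predict_fn(tokens, i, rng)[0]
    return tokens

# Hypothetical stand-in for the model: returns (token, confidence) for position i.
def toy_predict(tokens, i, rng):
    return (i % 7, rng.random())

out = masked_decode(16, toy_predict, num_steps=4)
```

The point of the scheme is the cost model: the sequence is produced in `num_steps` parallel passes rather than `num_tokens` sequential ones, which is the source of the generation-speed improvement the abstract reports over autoregressive decoding.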
