

Poster in Workshop: Workshop on Responsibly Building Next Generation of Multimodal Foundation Models

BigDocs: A Permissively-Licensed Dataset for Training Vision-Language Models on Document and Code Tasks

Juan Rodriguez · Xiangru Jian · Siba Smarak Panigrahi · Tianyu Zhang · Aarash Feizi · Abhay Puri · Akshay Kalkunte Suresh · François Savard · Amirhossein Abaskohi · Ahmed Masry · Shravan Nayak · Mahsa Massoud · Rabiul Awal · Pierre-André Noël · Mats L Richter · Saverio Vadacchino · Shubham Agarwal · Sanket Biswas · Ying Zhang · Sathwik Tejaswi Madhusudhan · Joao Monteiro · Krishnamurthy Dvijotham · Torsten Scholak · Nicolas Chapados · Sean Hughes · M. Tamer Özsu · Aishwarya Agrawal · Marco Pedersoli · Chris Pal · Perouz Taslakian · David Vazquez · Issam Hadj Laradji · Spandana Gella · Sai Rajeswar Mudumba

Keywords: [ LLM ] [ vision-language models ] [ document understanding ] [ VLM ]


Abstract:

Vision and language models that can accurately understand both images and text are crucial for deeper document understanding. These models can efficiently perform enterprise-level tasks, such as receipt processing from screenshots, website and business workflow generation from sketches, and extracting information from structured documents. These tasks often require generating long, structured outputs, an area where models trained on current datasets struggle. Additionally, many existing datasets are not license-permissive, limiting their use to non-commercial applications. To address these limitations, we present BigDocs, a high-quality, permissively licensed dataset specifically curated for training Vision and Language Models (VLMs) capable of performing a wide variety of tasks. This dataset focuses on acquiring accurate image-text pairs across diverse tasks while adhering to accountability, responsibility, and transparency (ART) standards. Our preliminary experiments demonstrate that pre-training with BigDocs yields performance boosts in document reasoning and in tasks requiring long, structured outputs, such as screenshot-to-HTML, table-to-LaTeX, or image-to-SVG. We believe that VLMs trained on BigDocs have the potential to significantly enhance multimodal capabilities, benefiting broader research in multimodal document understanding.
