Poster
in
Workshop: The First Workshop on Large Foundation Models for Educational Assessment

Gemini Pro Defeated by GPT-4V: Evidence from Education

Ehsan Latif · Xiaoming Zhai

Sun 15 Dec 12:25 p.m. PST — 2 p.m. PST

Abstract:

This study compared the classification performance of Gemini Pro and GPT-4V in educational settings. Employing visual question-answering (VQA) techniques, the study examined both models' ability to read text-based rubrics and automatically score student-drawn models in science education. We conducted quantitative and qualitative analyses using a dataset of student-drawn scientific models and the NERIF (Notation-Enhanced Rubrics for Image Feedback) prompting method. The findings reveal that GPT-4V significantly outperforms Gemini Pro in both scoring accuracy and quadratic weighted kappa. The qualitative analysis suggests that the gap may stem from differences in the models' ability to process fine-grained text in images and in their overall image classification performance. Even after adapting the NERIF approach by further downsizing the input images, Gemini Pro was unable to match GPT-4V. These findings indicate GPT-4V's superior capability in handling complex multimodal educational tasks. The study concludes that while both models represent advancements in AI, GPT-4V's higher performance makes it a more suitable tool for educational applications involving multimodal data interpretation.
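The abstract reports quadratic weighted kappa (QWK) as an agreement metric between model-assigned and human-assigned rubric scores. As a minimal sketch of how such a metric is typically computed, the example below uses scikit-learn's `cohen_kappa_score` with quadratic weights on hypothetical rubric labels (the data shown is illustrative, not from the paper's dataset):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores on a 3-level rubric (0 = Beginning, 1 = Developing,
# 2 = Proficient); illustrative only, not drawn from the study's data.
human_scores = [0, 1, 2, 2, 1, 0, 2, 1]
model_scores = [0, 1, 2, 1, 1, 0, 2, 2]

# Quadratic weighting penalizes large disagreements (e.g., 0 vs. 2)
# more heavily than adjacent-category disagreements (e.g., 1 vs. 2).
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(round(qwk, 3))
```

QWK ranges from -1 to 1, with 1 indicating perfect agreement and 0 indicating chance-level agreement, which is why it is a common summary statistic for automated scoring tasks like the one described here.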
