Poster
in
Workshop: Attributing Model Behavior at Scale (ATTRIB)
GRADE: A Fine-grained Approach to Measure Sample Diversity in Text-to-Image Models
Royi Rassin · Aviv Slobodkin · Shauli Ravfogel · Yanai Elazar · Yoav Goldberg
Evaluating the diversity of text-to-image (T2I) model outputs remains a challenge, especially in capturing the fine-grained variations essential for creativity and bias mitigation. Existing diversity metrics such as Fréchet Inception Distance (FID) and Recall require reference images and are generally unreliable. We propose GRanular Attribute Diversity Evaluation (GRADE), a descriptive, fine-grained method for assessing sample diversity in T2I models that requires no reference images. GRADE estimates the distribution of attributes within generated images of a concept, such as the shape or flavor distribution of the concept "cookie", and computes its normalized entropy, yielding both interpretable insights into model behavior and a diversity score. We show that GRADE achieves over 90% agreement with human evaluation while correlating only weakly with FID and Recall, indicating that it captures new, fine-grained forms of diversity. We use GRADE to measure and compare the diversity of 12 T2I models and find that the most advanced models are the least diverse, scoring just 0.47 entropy and defaulting to depicting concepts with the same attributes (e.g., round cookies) 88% of the time, despite varied prompts. We observe an inherent trade-off between diversity and prompt adherence, akin to the precision-recall trade-off, as well as a negative correlation between diversity and model size. We further identify underspecified captions in training data as a significant contributor to low sample diversity, leading models to depict concepts with the same attributes. GRADE serves as a valuable tool for benchmarking and guiding the development of more diverse T2I models.
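As a rough illustration of the normalized-entropy score the abstract describes, the sketch below computes the Shannon entropy of an empirical attribute distribution and normalizes it to [0, 1]. This is a minimal reconstruction, not the authors' implementation: the function name, the use of observed attribute labels, and normalizing by the log of the number of distinct observed values are all assumptions.

```python
import math
from collections import Counter


def normalized_entropy(attribute_values):
    """Normalized Shannon entropy of an empirical attribute distribution.

    Sketch of the kind of score GRADE assigns: 0.0 means every sample
    shows the same attribute value; 1.0 means a uniform spread.
    Normalizing by log(#distinct observed values) is an assumption.
    """
    counts = Counter(attribute_values)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if len(probs) <= 1:
        return 0.0  # a single attribute value carries no diversity
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))


# Hypothetical example: 9 of 10 generated cookies are round.
print(normalized_entropy(["round"] * 9 + ["square"]))  # ≈ 0.47
print(normalized_entropy(["round"] * 10))              # 0.0
```

Under this toy distribution the score lands near the low-diversity regime the abstract reports for advanced models.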