As an empirical field, AI and ML research relies on a foundation of evaluation: it is critical that observers can assess and compare the results of different approaches and come to reliable conclusions about their performance and effectiveness. Indeed, evaluation has never been more important than it is today, given the rapid rise of generative models, LLMs, and related methods and the accelerating pace of progress. However, the problem of evaluation in this domain is far from trivial. There are high-level issues around defining ground truth and assessing correctness and context, logistical issues around cost and reliability, theoretical issues around defining an appropriate distribution of evaluation tasks, organizational issues around which entities can be trusted to perform evaluations without undue influence, and practical issues as researchers and developers struggle to reconcile a myriad of reported benchmarks and metrics. On top of this, we recall Feynman’s famous dictum that the most important thing in any science is “not to fool yourself – and you are the easiest person to fool.” It is all too easy to encounter issues of contamination and leakage that can invalidate results. In this talk, we take a tour through current approaches to addressing these many complexities and offer thoughts on ways forward for the field. We also share experience from Kaggle on ways that broad community efforts such as competitions can help in this domain. In particular, we describe methods that have been developed to make competitions resistant to cheating by bad actors, and how these methods are also of significant value in ensuring that benchmarks and evaluations are set up to help us as researchers avoid fooling ourselves.