Poster in Workshop: Medical Imaging meets NeurIPS
Variable Importance on Medical Images and Socio-Demographic Data
Ahmad CHAMMA · Denis A. Engemann · Bertrand Thirion
Biomarker development increasingly draws on heterogeneous sources of data, including brain images, biological samples and social data. Biobanks give access to tens of thousands of brain images alongside other social and biomedical data. These large-scale datasets make it possible to model biomedical outcomes using machine learning. To interpret predictive models, it is crucial to understand how input features influence the prediction. Over the past decades, a wide range of methods has been developed for ranking variables according to their importance in predictive models. Given the variety of settings (e.g. dimensionality, non-linearities, classification vs. regression), it remains unclear which method provides the most accurate feature rankings. Benchmarks have been conducted for multiple methods using simulations and empirical validation, yet these efforts have so far remained disconnected because of the diversity of research settings. As a result, some of the most popular methods for estimating variable importance have never been compared. In this work, we extend the literature by systematically comparing the most popular methods for linear and non-linear inputs in classification and regression tasks. For methods that provide an assessment of statistical significance, we checked whether the p-values were well calibrated. We also weighed performance metrics against computation time. Deep Neural Networks (DNN) were the most reliable at ranking variables according to their importance. SHAP values did not provide reliable population-level importance scores, whereas BART and MDI offered a reasonable tradeoff between computation time and reliability, although without statistical guarantees. Marginal selection, knockoffs and d0CRT did not generalize well when data were non-linear or correlated. Applied to biomarker learning, DNN and BART provided overall similar importance rankings. Our results emphasize the importance of systematic empirical benchmarks across applied contexts.
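To make the notion of "well calibrated" p-values concrete, here is a minimal sketch under stated assumptions; it is not the benchmark used in this work. It simulates variables that are independent of the outcome and checks that a simple marginal correlation test rejects them at roughly the nominal rate. The function names, the Gaussian simulation, and the choice of a Fisher z approximation are all illustrative assumptions.

```python
import math
import random
from statistics import NormalDist


def fisher_z_pvalue(x, y):
    """Two-sided p-value for zero Pearson correlation (Fisher z approximation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    # Under the null, atanh(r) * sqrt(n - 3) is approximately standard normal.
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rate(null_p_values, alpha=0.05):
    """Fraction of truly unimportant variables declared significant at level alpha."""
    return sum(p <= alpha for p in null_p_values) / len(null_p_values)


def simulate_null_p_values(n_samples=200, n_null=2000, seed=0):
    """P-values for variables drawn independently of the outcome (hypothetical setup)."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_samples)]
    p_values = []
    for _ in range(n_null):
        x = [rng.gauss(0, 1) for _ in range(n_samples)]
        p_values.append(fisher_z_pvalue(x, y))
    return p_values


if __name__ == "__main__":
    p_values = simulate_null_p_values()
    for alpha in (0.01, 0.05, 0.10):
        rate = false_positive_rate(p_values, alpha)
        print(f"alpha={alpha:.2f}  empirical false positive rate={rate:.3f}")
```

If the p-values are calibrated, the empirical false positive rate should stay close to each nominal level. A rate well above the nominal level would indicate an anti-conservative method, and a rate well below it a conservative one. The same check can in principle be applied to the p-values of any importance method, provided the set of truly unimportant variables is known, as it is in a simulation.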