Abstract:
Humans have a remarkable ability to rapidly learn new problem-solving strategies from a narrow range of examples and to extend them to examples outside the distribution (OOD) used in learning, but such generalization remains a challenge for neural networks. This ability seems especially important for learning new mathematical techniques, which apply to vast problem spaces (e.g., all real numbers). We explore this limitation by training neural networks on strategies for solving specified cells in $6\times6$ Sudoku puzzles, using a novel curriculum of tasks that build upon each other. We train transformers sequentially on two preliminary tasks, then assess OOD generalization of a more complex solution strategy learned from a range of restricted training distributions. Baseline models master the training distribution but fail to generalize to OOD data. However, we find that a combination of extensions is sufficient to support highly accurate and reliable OOD generalization. These results suggest directions for improving the robustness of larger transformer models under the highly imbalanced data distributions found in natural data sets.