

Poster

Source Code Foundation Models are Transferable Binary Analysis Knowledge Bases

Zian Su · Xiangzhe Xu · Ziyang Huang · Kaiyuan Zhang · Xiangyu Zhang

Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

Human-Oriented Binary Reverse Engineering (HOBRE) lies at the intersection of binary and source code, aiming to lift binary code to human-readable content relevant to source code, thereby bridging the binary-source semantic gap. Recent advancements in uni-modal code model pre-training, particularly in generative Source Code Foundation Models (SCFMs) and binary understanding models, have shown promise. However, existing approaches for HOBRE rely heavily on uni-modal models like SCFMs for supervised fine-tuning or general LLMs for prompting, resulting in sub-optimal solutions. Inspired by recent progress in multi-modal models, we argue that it is possible to harness the strengths of both uni-modal code models to bridge the semantic gap effectively. In this paper, we propose a novel probe-and-recover framework that incorporates a binary-source encoder-decoder model and black-box LLMs for binary analysis. Our approach leverages the pre-trained knowledge within SCFMs to synthesize relevant, symbol-rich code fragments as context. This additional context enables black-box LLMs (recoverers) to enhance recovery accuracy. We demonstrate significant improvements in zero-shot binary summarization and binary function name recovery, with a 10.3% relative gain in CHRF and a 16.7% relative gain in a GPT4-based metric for summarization, as well as a 6.7% and 7.4% absolute increase in token-level precision and recall for name recovery, respectively. These results highlight the effectiveness of our approach in automating and improving binary code analysis.
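
To make the name-recovery metric concrete, the following is a minimal sketch of how token-level precision and recall between a predicted and a ground-truth function name could be computed. The abstract does not specify the tokenization or matching rules, so the splitting on underscores and camelCase boundaries, the multiset overlap, and the helper names `tokenize_name` and `token_precision_recall` are all assumptions made for illustration, not the authors' evaluation code.

```python
import re
from collections import Counter


def tokenize_name(name):
    """Split a function name into lowercase tokens.

    Assumed scheme (not from the paper): split on underscores and
    camelCase boundaries, keeping acronyms such as "HTTP" together.
    """
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def token_precision_recall(predicted, reference):
    """Return (precision, recall) over the tokens of two function names.

    Overlap is counted as a multiset intersection, so a repeated token
    is only credited as many times as it appears in both names.
    """
    pred = Counter(tokenize_name(predicted))
    ref = Counter(tokenize_name(reference))
    overlap = sum((pred & ref).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical names: the prediction misses the "config" token.
    p, r = token_precision_recall("read_file", "read_config_file")
    print(f"precision={p:.2f} recall={r:.2f}")
```

Under this scheme, an absolute increase of 6.7% in precision means the average precision score itself rises by 6.7 percentage points, in contrast to the relative gains reported for the summarization metrics.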
