Decompiling the Synergy: How Humans and AI Team Up in Reverse Engineering

2025/12/18

Categories: Blogpost Tags: Software Reverse Engineering Human Study Artificial Intelligence

Software Reverse Engineering (SRE) is one of the most intellectually demanding disciplines in cybersecurity. It involves analyzing compiled software to understand how it operates without access to the source code. The goal is to determine how the software works, how it can fail, and, in some cases, how it is used to carry out attacks.

Reverse engineers analyze malware, audit software for vulnerabilities, investigate digital forensics cases, and verify critical systems where transparency is essential but source code is unavailable.
The dark side of this art is cracking, where SRE techniques are used to bypass software protections such as license checks, digital rights management, or copy-protection mechanisms.

With the rise of Large Language Models (LLMs), the flagship technology of modern Artificial Intelligence, analysts increasingly rely on AI assistants embedded into decompilers via plugins such as aiDAPal for IDA Pro or ReverserAI for Binary Ninja.
These tools promise to accelerate comprehension, but a fundamental question remained unanswered:

Do LLMs actually help human analysts perform better in real reverse engineering workflows?

Our team of researchers from Arizona State University, the University of Padua, and EURECOM set out to answer this question with the first systematic, human-centered study of LLM-assisted Software Reverse Engineering.

We present our results in Decompiling the Synergy: An Empirical Study of Human-LLM Teaming in Software Reverse Engineering, recently accepted at the Network and Distributed System Security (NDSS) Symposium 2026, to be held in February 2026 in San Diego, California.

You can download the paper 👉 here

To support this study, we (credit to Zion Basque, the first author) developed DAILA, a research-oriented LLM plugin that integrates a superset of AI features offered by existing tools.
DAILA abstracts over the underlying decompiler and currently supports IDA Pro, Ghidra, Binary Ninja, and angr, while remaining model-agnostic via LiteLLM, enabling both closed models (e.g., GPT-4o, Claude) and open ones (e.g., LLaMA).
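Conceptually, that double abstraction can be sketched in a few lines of Python. The names below are illustrative, not DAILA's actual API: each decompiler gets a thin adapter class, and every AI feature is written once against that interface, with the model call routed through an injectable completion function (in the real plugin, a wrapper around LiteLLM).

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of the two abstraction layers; DAILA's real class
# and method names may differ.

@dataclass
class Function:
    name: str
    decompiled_c: str  # pseudo-C text produced by the host decompiler

class DecompilerBackend:
    """Adapter each host (IDA Pro, Ghidra, Binary Ninja, angr) implements."""
    def current_function(self) -> Function:
        raise NotImplementedError
    def rename_function(self, old: str, new: str) -> None:
        raise NotImplementedError

def summarize_function(backend: DecompilerBackend,
                       complete: Callable[[str, str], str],
                       model: str = "gpt-4o") -> str:
    """One AI feature, written once against the backend interface."""
    func = backend.current_function()
    prompt = ("Summarize what this decompiled function does in one paragraph:\n\n"
              + func.decompiled_c)
    return complete(model, prompt)

# In the real plugin, `complete` would wrap litellm.completion, roughly:
#   resp = litellm.completion(model=model,
#                             messages=[{"role": "user", "content": prompt}])
#   return resp.choices[0].message.content
```

Swapping decompilers then only means swapping the adapter; swapping models only means changing the `model` string.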


🧪 The Study

To move beyond anecdotes, we designed a three-phase controlled human study combining fine-grained behavioral instrumentation with qualitative feedback.

Three Phases

  1. Pre-Study Survey (153 practitioners)
    We surveyed a diverse population of SRE practitioners to understand how LLMs are already used in practice.

    • 68% reported LLMs as sometimes or often beneficial
    • 86% primarily used GPT-based models
    • The most common use cases were function summarization, renaming, and explaining known algorithms
  2. Experiment Design
    We built a browser-based reverse-engineering platform integrating DAILA, exposing six AI features:

    • Function summarization
    • Function renaming
    • Variable renaming
    • Known algorithm identification
    • Vulnerability identification
    • Library documentation lookup
    • Plus a free-form chat interface

We designed two realistic CTF-style challenges, carefully balanced in size, complexity, and difficulty. Each challenge contained genuine bugs (e.g., weak cryptography, path traversal) embedded in representative program structure.

  3. Human Study
    We recruited 48 participants (24 self-reported experts and 24 novices).
    Each participant solved one challenge with LLM assistance and one without, acting as their own control.

    In total, participants produced:

    • 109 hours of recorded reverse-engineering activity
    • 96 solution write-ups
    • 1,517 distinct LLM interactions

📊 What We Found

1. AI as an Equalizer

LLMs dramatically narrow the expertise gap.
Novices using LLMs achieved a ~98% improvement in comprehension rate, reaching expert-level understanding speed.

“I would not have understood the binary half as well without the LLM.”
— study participant

This effect was consistent across both challenges and robust to different analysis metrics.


2. Experts: Redistribution, Not Acceleration

Experts did not experience a statistically significant increase in overall comprehension rate.

They did benefit selectively, most notably from early function summarization.

However, experts also reported a new verification burden: confidently wrong suggestions, especially hallucinated vulnerabilities, had to be checked, and that checking sometimes cost more time than the AI saved.


3. More Artifacts ≠ Better Understanding

With LLM support, participants recovered ~66% more artifacts (comments, variable names, function names, inferred types).

Yet comprehension did not scale with artifact count: more names, comments, and types did not translate into deeper understanding.

This suggests that the act of naming and structuring code is itself a cognitive process, one that automation can partially bypass (sometimes to the analyst’s detriment).


4. Hallucinations Are Rare, But Costly

Hallucinations occurred infrequently, but their impact was severe.

In about 20% of sessions, participants pursued false hypotheses (e.g., nonexistent buffer overflows) introduced by LLM suggestions.
In these cases, analysts spent up to 2× more time chasing bugs that were not there.

Vulnerability-identification prompts were the most error-prone feature: in our data, it was the only AI capability negatively correlated with understanding.


5. Strategy Matters More Than Frequency

Top performers shared a common habit: they consulted the LLM on their first visit to a function.

This “first-visit strategy” led to a ~63% higher understanding rate than repeated or late LLM use.

In contrast, participants who consulted the LLM repeatedly, or only after getting stuck late in the analysis, saw no comparable benefit.


💡 Key Takeaways

• LLMs close the novice-expert gap: novices reach expert-level comprehension with AI
• Experts remain irreplaceable: domain knowledge is essential for validation
• AI boosts quantity, not always quality: more artifacts ≠ deeper understanding
• Best use is early summarization: consult the AI at first contact with code
• Worst use is vulnerability detection: false positives harm trust and performance

In short: LLMs help you think faster, but only if you stay in charge.


⚖️ In Context: Comparing to Developer–AI Studies

Our findings echo the recent work “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity” by Model Evaluation & Threat Research (METR), an independent AI evaluation organization. In that randomized controlled trial, experienced open-source developers were 19% slower when allowed to use AI tools, largely due to time spent validating, correcting, or undoing AI-generated output.

Our experts did not become slower in the same uniform way, but the underlying failure mode was similar. Crucially, expertise did not provide immunity. When an LLM confidently suggested a vulnerability that did not exist, experts frequently followed the lead, sometimes spending substantially more time on those functions than they would have without AI. The cost of these errors outweighed the occasional speedups elsewhere. In our data, the vulnerability-detection feature was the only AI capability negatively correlated with understanding, and experts were among those most affected.

Taken together, our results and METR’s point to the same uncomfortable conclusion:
current LLMs do not reliably accelerate expert cognition. Instead, they introduce a new tax—verification, skepticism, and recovery from subtle errors. For experts, this tax often cancels out, or even exceeds, any raw efficiency gains.

The implication is not that experts are replaceable, but that expert workflows are fragile when augmented with tools that speak fluently but reason shallowly. Until LLMs can support long-horizon reasoning and maintain semantic grounding, expert users may remain paradoxically among the least well-served by AI assistance.


🧭 Final Thoughts

Our study shows that the future of software reverse engineering is collaborative.
LLMs do not replace experts; they reshape the cognitive workflow. Used carefully, they accelerate understanding, expose hidden structure, and empower less experienced analysts to reason like experts.

Yet the same tools can mislead when over-trusted. Hallucinated vulnerabilities and confident but wrong explanations remind us that LLMs remain probabilistic assistants, not oracles.

The real question is no longer whether we should use AI in SRE, but how we design tools, workflows, and training that keep humans firmly in the loop.

As we conclude in the paper:

“LLMs are neither oracles nor impostors, but mirrors held to human insight: their reflections sharpen only in the steady gaze of critical thought and domain expertise.”

