Code for Verifying LLM Inference to Detect Model Weight Exfiltration


axhoover ce8e06eb1a anonymize README		2026-01-07 10:13:21 -08:00
assets	update README	2025-11-06 11:12:08 -05:00
demonstration	adding a straightforward demo	2025-11-06 16:08:37 +00:00
inference_verification	update generation and verification process to take a config parameter	2025-11-06 10:40:29 -05:00
scripts	initial commit	2025-10-28 21:07:21 +00:00
.gitignore	adding a straightforward demo	2025-11-06 16:08:37 +00:00
pre-commit	initial commit	2025-10-28 21:07:21 +00:00
pyproject.toml	adding a straightforward demo	2025-11-06 16:08:37 +00:00
README.md	anonymize README	2026-01-07 10:13:21 -08:00
uv.lock	adding a straightforward demo	2025-11-06 16:08:37 +00:00

Inference Verification to Detect Model Weight Exfiltration.

As large AI models become increasingly valuable assets, the risk of model weight exfiltration from inference servers grows accordingly. An attacker controlling an inference server may exfiltrate model weights by hiding them within ordinary model responses, a strategy known as steganography. This work investigates how to verify LLM inference to defend against such attacks and, more broadly, to detect anomalous or buggy behavior during inference. To enable verification, we characterize valid sources of non-determinism in large language model inference and introduce two practical estimators for them: one for inference that uses the Gumbel-Max Trick (GM-LS) and one for inference that uses the Inverse Probability Transform (IPT-LS). These methods were developed alongside a concurrent work, Token-DiFR, which differs in minor implementation details and focuses on developing the methods rather than on applying them.
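For readers unfamiliar with the two sampling schemes named above, here is a minimal sketch of each, assuming NumPy. The function names `gumbel_max_sample` and `inverse_transform_sample` are illustrative and are not taken from this repository. Both samplers become deterministic once the random draw is fixed, so a token can be re-derived from the same seed and the same model output.

```python
import numpy as np


def gumbel_max_sample(logits, rng):
    """Gumbel-Max Trick: add i.i.d. Gumbel(0, 1) noise to the logits and take the argmax.

    The returned index is distributed according to softmax(logits).
    """
    noise = rng.gumbel(size=logits.shape)
    return int(np.argmax(logits + noise))


def inverse_transform_sample(probs, u):
    """Inverse Probability Transform: return the first index whose CDF value exceeds u.

    `u` is a uniform draw in [0, 1). The final clamp guards against
    floating-point round-off when the cumulative sum ends slightly below 1.
    """
    cdf = np.cumsum(probs)
    idx = int(np.searchsorted(cdf, u, side="right"))
    return min(idx, len(probs) - 1)


# Illustrative values only (not from the paper or the repository).
logits = np.array([1.0, 2.0, 0.5])
probs = np.exp(logits) / np.exp(logits).sum()

# Re-running with the same seed reproduces the same token for each scheme.
token_gm = gumbel_max_sample(logits, np.random.default_rng(seed=0))
token_ipt = inverse_transform_sample(probs, np.random.default_rng(seed=0).random())
```

In practice, floating-point differences between hardware and kernels can perturb the logits slightly, which is one reason a score-based check may be more robust than exact token matching.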

The world we are hoping for:

The world we want to protect against:

In the paper, we:

Formalize model weight exfiltration as a security game.
Propose a verification framework that can provably mitigate steganographic exfiltration.
Specify the trust assumptions associated with our scheme.
Give empirical bounds on exfiltratable information from an information-theoretic perspective, such that no adversary, regardless of computational power, could exfiltrate more information.

We evaluate our detection framework on several open-weight models ranging from 3B to 30B parameters. On MOE-Qwen-30B, our detector reduces exfiltratable information to <0.5% at a false-positive rate of <0.01, corresponding to a >200x slowdown for adversaries.
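One way to read the slowdown figure is as the reciprocal of the fraction of information an adversary can still exfiltrate. That interpretation is an assumption made here for illustration; the paper may define slowdown differently. Under it, the arithmetic is:

```python
def slowdown_factor(exfiltratable_fraction):
    """Assumed relationship: slowdown = 1 / (fraction of information still exfiltratable)."""
    if not 0.0 < exfiltratable_fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / exfiltratable_fraction


# A bound of 0.5% exfiltratable information corresponds to a factor of about 200,
# so a bound strictly below 0.5% gives a factor strictly above 200.
factor = slowdown_factor(0.005)
print(f"slowdown: about {factor:.0f}x")
```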

Our scheme:

Quick Start - I just want to verify if my tokens are honestly generated!

Optional pre-step: Generate tokens

python inference_verification/generate.py --config demonstration/config_example.yaml

Verify tokens

python inference_verification/verify.py \
      --input generated_outputs/generated_outputs.pkl \
      --config demonstration/config_example.yaml

You can verify 18,500 tokens in 17 seconds! Here is how the output will look:

Results:
  Total tokens: 18531
  Safe tokens: 18518 (99.93%)
  Suspicious tokens: 13 (0.07%)
  Dangerous tokens: 0 (0.00%)
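To make the summary above concrete, here is a hedged sketch of how per-token scores could be bucketed and reported in the same layout. The thresholds, the score direction (lower meaning less plausible under honest sampling), and the names `bucket` and `summarize` are all hypothetical; `verify.py` may compute and classify scores differently.

```python
from collections import Counter


def bucket(score, suspicious_below=-5.0, dangerous_below=-20.0):
    """Classify one token score using two hypothetical thresholds.

    Assumption: a lower score means the token is less plausible under honest sampling.
    """
    if score < dangerous_below:
        return "Dangerous"
    if score < suspicious_below:
        return "Suspicious"
    return "Safe"


def summarize(scores):
    """Format bucket counts in the same layout as the sample output (expects a non-empty list)."""
    counts = Counter(bucket(s) for s in scores)
    total = len(scores)
    lines = ["Results:", f"  Total tokens: {total}"]
    for label in ("Safe", "Suspicious", "Dangerous"):
        n = counts[label]
        lines.append(f"  {label} tokens: {n} ({100 * n / total:.2f}%)")
    return "\n".join(lines)


# Synthetic scores chosen to match the counts shown in the sample output.
report = summarize([0.0] * 18518 + [-10.0] * 13)
print(report)
```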

Screenshot of inference verification process (verify.py):

Quick Start - I just want to verify those plots that you made!

./scripts/test_minimal.sh

This will:

Generate text using vLLM (100 prompts, 100 tokens each)
Compute GLS/CGS verification scores
Produce Pareto frontier plots showing FPR vs extractable information trade-offs
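The Pareto frontier in the last step can be understood with a small sketch. It assumes each detector setting yields a pair (false-positive rate, extractable information) and that lower is better on both axes. The function name `pareto_frontier` and the sample points are illustrative and do not come from the repository's plotting code.

```python
def pareto_frontier(points):
    """Return the non-dominated (fpr, info) pairs, assuming lower is better on both axes.

    After sorting by FPR, a point is kept only if it has strictly less
    extractable information than every point with a lower or equal FPR.
    """
    frontier = []
    best_info = float("inf")
    for fpr, info in sorted(points):
        if info < best_info:
            frontier.append((fpr, info))
            best_info = info
    return frontier


# Made-up operating points for illustration only.
candidates = [(0.01, 0.5), (0.02, 0.6), (0.05, 0.1), (0.001, 0.9), (0.05, 0.2)]
frontier = pareto_frontier(candidates)
```

Here (0.02, 0.6) is dropped because (0.01, 0.5) is better on both axes, and (0.05, 0.2) is dropped because (0.05, 0.1) matches its FPR with less extractable information.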

Multi-Model Sweep

For production experiments across multiple models:

cd scripts
./run_GLS_all_models.sh

Repo Description

This repository contains code for inference verification applied to preventing model weight exfiltration. Specifically, we implement two methods for verifying tokens generated by large language models: the Gumbel Likelihood Score (GLS), for inference that samples with the Gumbel-Max Trick, and the Convolved Gaussian Score (CGS), for inference that samples with the Inverse Probability Transform.

Note on terminology: In the paper, these methods are referred to as:

Inverse Probability Transform (IPT) Likelihood Score (IPT-LS) - implemented in the codebase as Convolved Gaussian Score (CGS)
Gumbel-Max Likelihood Score (GM-LS) - implemented in the codebase as Gumbel Likelihood Score (GLS)