Code for Verifying LLM Inference to Detect Model Weight Exfiltration

Inference Verification to Detect Model Weight Exfiltration.

As large AI models become increasingly valuable assets, the risk of model weight exfiltration from inference servers grows accordingly. An attacker who controls an inference server may exfiltrate model weights by hiding them within ordinary model responses, a strategy known as steganography. This work investigates how to verify LLM inference to defend against such attacks and, more broadly, to detect anomalous or buggy behavior during inference. To enable verification, we characterize valid sources of non-determinism in large language model inference and introduce two practical estimators for them: one for inference that samples with the Gumbel-Max Trick (GM-LS) and one for inference that samples with the Inverse Probability Transform (IPT-LS). These methods were developed alongside a concurrent work, Token-DiFR, which has minor implementation differences and focuses on method development rather than on the application of these methods.
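For readers unfamiliar with the two sampling schemes named above, the following is a minimal NumPy sketch of each. It is an illustration under textbook definitions, not the code in this repository: the function names `sample_gumbel_max` and `sample_inverse_transform` are hypothetical, and real inference engines apply temperature, top-k/top-p filtering, and their own random number generators.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    z = logits - np.max(logits)
    p = np.exp(z)
    return p / p.sum()

def sample_gumbel_max(logits, rng):
    # Gumbel-Max Trick: add i.i.d. Gumbel(0, 1) noise to every logit and
    # take the argmax. The result is distributed as softmax(logits).
    noise = rng.gumbel(size=logits.shape)
    return int(np.argmax(logits + noise))

def sample_inverse_transform(logits, rng):
    # Inverse Probability Transform: draw u ~ Uniform[0, 1) and return the
    # first token whose cumulative probability exceeds u.
    cdf = np.cumsum(softmax(logits))
    u = rng.random()
    idx = int(np.searchsorted(cdf, u, side="right"))
    # Guard against floating-point rounding leaving cdf[-1] slightly below 1.
    return min(idx, len(cdf) - 1)

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.0])  # made-up logits for a 3-token vocabulary
print(sample_gumbel_max(logits, rng), sample_inverse_transform(logits, rng))
```

Both samplers draw from the same distribution; they differ in how the random draw maps to a token, which is why each gets its own estimator.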

The world we are hoping for:
image
The world we want to protect against:
image

In the paper, we:

  1. Formalize model weight exfiltration as a security game.
  2. Propose a verification framework that can provably mitigate steganographic exfiltration.
  3. Specify the trust assumptions associated with our scheme.
  4. Give empirical, information-theoretic bounds on exfiltratable information, such that no adversary, regardless of computational power, could exfiltrate more.
  5. Evaluate our detection framework on several open-weight models ranging from 3B to 30B parameters. On MOE-Qwen-30B, our detector reduces exfiltratable information to <0.5% at a false-positive rate of <0.01, corresponding to a >200x slowdown for adversaries.
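The link between the two figures in the last item can be checked with simple arithmetic. The sketch below assumes the slowdown is the reciprocal of the fraction of information an adversary can still exfiltrate; that reading is consistent with the numbers quoted above but is an assumption here, and `slowdown_factor` is a hypothetical helper.

```python
def slowdown_factor(exfiltratable_fraction):
    # If only a fraction f of the usual information can leak per response,
    # leaking a fixed payload takes 1 / f times as many responses.
    if not 0 < exfiltratable_fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / exfiltratable_fraction

# 0.5% exfiltratable information expressed as a fraction.
print(slowdown_factor(0.005))
```

Any fraction below 0.005 gives a factor above 200, matching the ">200x" bound.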

Our scheme: image

Quick Start - I just want to verify if my tokens are honestly generated!

Optional pre-step: Generate tokens

python inference_verification/generate.py --config demonstration/config_example.yaml 

Verify tokens

python inference_verification/verify.py \
      --input generated_outputs/generated_outputs.pkl \
      --config demonstration/config_example.yaml

You can verify about 18,500 tokens in 17 seconds! Here is how the output will look:

Results:
  Total tokens: 18531
  Safe tokens: 18518 (99.93%)
  Suspicious tokens: 13 (0.07%)
  Dangerous tokens: 0 (0.00%)
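
The percentages in this summary follow directly from the token counts. A minimal sketch that recomputes them is shown here; `summarize` and `tokens_per_second` are hypothetical helpers written for illustration, not functions from verify.py, and the category labels are copied from the output above.

```python
def summarize(counts):
    # counts maps a category label to its number of tokens.
    total = sum(counts.values())
    lines = [f"  Total tokens: {total}"]
    for label, n in counts.items():
        lines.append(f"  {label} tokens: {n} ({100 * n / total:.2f}%)")
    return "\n".join(lines)

def tokens_per_second(total_tokens, seconds):
    # Average verification throughput.
    return total_tokens / seconds

print("Results:")
print(summarize({"Safe": 18518, "Suspicious": 13, "Dangerous": 0}))
print(f"Throughput: {tokens_per_second(18531, 17):.0f} tokens/s")
```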

Screenshot of the inference verification process (verify.py): image

Quick Start - I just want to verify those plots that you made!

./scripts/test_minimal.sh

This will:

  1. Generate text using vLLM (100 prompts, 100 tokens each)
  2. Compute GLS/CGS verification scores
  3. Produce Pareto frontier plots showing the trade-off between false-positive rate (FPR) and exfiltratable information
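
To make the last step concrete, here is a minimal sketch of how a Pareto frontier can be extracted from (FPR, information) pairs when lower is better on both axes. It is not the plotting code used by the script; `pareto_frontier` is a hypothetical helper and the data points are made up.

```python
def pareto_frontier(points):
    # Keep every (fpr, info) point that no other point beats on both axes.
    # Sorting by FPR first means a point is on the frontier only if its
    # info value is strictly lower than everything seen so far.
    frontier = []
    best_info = float("inf")
    for fpr, info in sorted(points):
        if info < best_info:
            frontier.append((fpr, info))
            best_info = info
    return frontier

# Made-up operating points, one per detection threshold.
points = [(0.05, 0.01), (0.001, 0.05), (0.01, 0.02),
          (0.01, 0.005), (0.1, 0.004), (0.05, 0.004)]
print(pareto_frontier(points))
```

Each surviving point is a threshold setting where lowering the FPR further would necessarily allow more information to leak.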

Multi-Model Sweep

For production experiments across multiple models:

cd scripts
./run_GLS_all_models.sh

Repo Description

This repository contains code for inference verification, applied to preventing model weight exfiltration. Specifically, we implement the Gumbel Likelihood Score (GLS) and Convolved Gaussian Score (CGS) methods for verifying tokens generated by large language models when inference samples with the Gumbel-Max Trick and the Inverse Probability Transform, respectively.

Note on terminology: In the paper, these methods are referred to as:

  • Inverse Probability Transform (IPT) Likelihood Score (IPT-LS) - implemented in the codebase as Convolved Gaussian Score (CGS)
  • Gumbel Max Likelihood Score (GM-LS) - implemented in the codebase as Gumbel Likelihood Score (GLS)
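
As rough intuition for the Gumbel-Max case, the sketch below shows a verifier recomputing the sampling noise and measuring how far a claimed token is from the recomputed argmax. This is a simplification and not the GLS estimator itself: it assumes the verifier shares the sampling seed and can recompute the logits, and the per-position seeding convention and the names `gumbel_noise` and `gumbel_margin` are hypothetical.

```python
import numpy as np

def gumbel_noise(seed, position, vocab_size):
    # Hypothetical convention: noise for each position is derived
    # deterministically from the pair (seed, position).
    rng = np.random.default_rng([seed, position])
    return rng.gumbel(size=vocab_size)

def gumbel_margin(logits, claimed_token, seed, position):
    # Gap between the best perturbed logit and the claimed token's
    # perturbed logit. Zero means the claimed token is the exact argmax.
    perturbed = logits + gumbel_noise(seed, position, len(logits))
    return float(np.max(perturbed) - perturbed[claimed_token])

logits = np.array([1.5, 0.2, -0.3, 0.9])  # made-up verifier logits
honest = int(np.argmax(logits + gumbel_noise(7, 0, len(logits))))
print(gumbel_margin(logits, honest, 7, 0))            # honest token
print(gumbel_margin(logits, (honest + 1) % 4, 7, 0))  # substituted token
```

A small positive margin may be explained by legitimate numerical non-determinism between server and verifier, while a large one is harder to explain; deciding where that line falls is the job of a likelihood score rather than a fixed cutoff.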