Inference Verification to Detect Model Weight Exfiltration.
As large AI models become increasingly valuable assets, the risk of model weight exfiltration from inference servers grows accordingly. An attacker who controls an inference server may exfiltrate model weights by hiding them within ordinary model responses, a strategy known as steganography. This work investigates how to verify LLM inference to defend against such attacks and, more broadly, to detect anomalous or buggy behavior during inference. To enable verification, we characterize valid sources of non-determinism in large language model inference and introduce two practical estimators for them: one for inference that uses the Gumbel-Max Trick (GM-LS) and one for inference that uses the Inverse Probability Transform (IPT-LS). These methods were developed alongside a concurrent work, Token-DiFR, which has minor implementation differences and focuses on method development rather than on the application of these methods.
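To make the two sampling schemes named above concrete, here is a minimal sketch in plain Python. It is not the repository's implementation: the function names (`sample_gumbel_max`, `sample_ipt`) and the toy logits are illustrative assumptions. The point it shows is that both schemes become reproducible once the random seed is fixed, which is what lets a verifier replay the sampling step.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gumbel_max(logits, rng):
    """Gumbel-Max Trick: add i.i.d. Gumbel(0, 1) noise to each logit, take the argmax."""
    best_token, best_value = 0, -math.inf
    for token, logit in enumerate(logits):
        u = max(rng.random(), 1e-300)      # keep u strictly inside (0, 1)
        gumbel = -math.log(-math.log(u))   # Gumbel(0, 1) via inverse CDF
        if logit + gumbel > best_value:
            best_token, best_value = token, logit + gumbel
    return best_token


def sample_ipt(logits, rng):
    """Inverse Probability Transform: draw u ~ U[0, 1) and invert the token CDF."""
    probs = softmax(logits)
    u = rng.random()
    cumulative = 0.0
    for token, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return token
    return len(probs) - 1  # guard against floating-point rounding in the CDF


# Toy logits (illustrative only).
logits = [2.0, 1.0, 0.5, -1.0]

# Replaying with the same seed reproduces the same token.
token_a = sample_gumbel_max(logits, random.Random(0))
token_b = sample_gumbel_max(logits, random.Random(0))
print(token_a == token_b)
```

Both functions sample from the same softmax distribution; they differ only in how the random draw is mapped to a token, which is why each scheme gets its own estimator.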
Figure: The world we are hoping for.
Figure: The world we want to protect against.
In the paper, we:
- Formalize model weight exfiltration as a security game.
- Propose a verification framework that can provably mitigate steganographic exfiltration.
- Specify the trust assumptions associated with our scheme.
- Give empirical bounds on exfiltratable information from an information-theoretic perspective, such that no adversary, regardless of computational power, could exfiltrate more information.
- Evaluate our detection framework on several open-weight models ranging from 3B to 30B parameters. On MOE-Qwen-30B, our detector reduces exfiltratable information to <0.5% with a false-positive rate of <0.01, corresponding to a >200x slowdown for adversaries.
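The link between the <0.5% bound and the >200x slowdown can be checked with a line of arithmetic, assuming the slowdown is the reciprocal of the fraction of output capacity an adversary can still use. The sketch below works under that assumption; `slowdown_factor` and `days_to_exfiltrate` are hypothetical helper names, not part of the codebase.

```python
def slowdown_factor(exfiltratable_fraction):
    """Slowdown for an adversary limited to the given fraction of the channel.

    Assumes slowdown = 1 / fraction (an interpretation, not taken from the code).
    """
    if not 0.0 < exfiltratable_fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / exfiltratable_fraction


def days_to_exfiltrate(baseline_days, exfiltratable_fraction):
    """Scale a hypothetical unconstrained exfiltration time by the slowdown."""
    return baseline_days * slowdown_factor(exfiltratable_fraction)


# 0.5% of the original capacity corresponds to roughly a 200x slowdown.
print(round(slowdown_factor(0.005)))
# An exfiltration that would take 1 day unconstrained would take about 200 days.
print(round(days_to_exfiltrate(1.0, 0.005)))
```

Because the reported bound is strict (<0.5%), the resulting slowdown is strictly greater than 200x.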
Figure: Our scheme.
Quick Start - I just want to verify if my tokens are honestly generated!
Optional pre-step: Generate tokens
python inference_verification/generate.py --config demonstration/config_example.yaml
Verify tokens
python inference_verification/verify.py \
  --input generated_outputs/generated_outputs.pkl \
  --config demonstration/config_example.yaml
You can verify 18,500 tokens in 17 seconds! Here is how the output will look:
Results:
Total tokens: 18531
Safe tokens: 18518 (99.93%)
Suspicious tokens: 13 (0.07%)
Dangerous tokens: 0 (0.00%)
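For readers who want to post-process per-token labels themselves, the summary above can be reproduced with a small tally. This is a sketch under assumptions: it supposes each token has already been labeled `"safe"`, `"suspicious"`, or `"dangerous"`, and the `summarize` helper is hypothetical rather than a function exposed by `verify.py`.

```python
from collections import Counter


def summarize(labels):
    """Format per-token labels into a summary like the one printed above."""
    counts = Counter(labels)
    total = len(labels)
    lines = [f"Total tokens: {total}"]
    for name in ("safe", "suspicious", "dangerous"):
        n = counts.get(name, 0)
        pct = 100.0 * n / total if total else 0.0
        lines.append(f"{name.capitalize()} tokens: {n} ({pct:.2f}%)")
    return "\n".join(lines)


# Label counts taken from the example output above.
labels = ["safe"] * 18518 + ["suspicious"] * 13
print(summarize(labels))
```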
Figure: Screenshot of the inference verification process (verify.py).
Quick Start - I just want to verify those plots that you made!
./scripts/test_minimal.sh
This will:
- Generate text using vLLM (100 prompts, 100 tokens each)
- Compute GLS/CGS verification scores
- Produce Pareto frontier plots showing FPR vs extractable information trade-offs
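As a rough illustration of the last step, a Pareto frontier over (FPR, extractable information) pairs keeps only the operating points that no other point beats on both axes. The sketch below is not the plotting code from the repository; `pareto_frontier` is a hypothetical helper, and the numbers are made up for illustration, not results from the paper.

```python
def pareto_frontier(points):
    """Return the non-dominated (fpr, info) points, where lower is better on both axes."""
    frontier = []
    best_info = float("inf")
    # Sort by FPR; a point survives only if it strictly lowers the best
    # extractable-information value seen at any smaller-or-equal FPR.
    for fpr, info in sorted(points):
        if info < best_info:
            frontier.append((fpr, info))
            best_info = info
    return frontier


# Made-up (FPR, extractable information fraction) pairs for different thresholds.
operating_points = [
    (0.001, 0.040),
    (0.005, 0.020),
    (0.005, 0.030),  # dominated: same FPR, more extractable information
    (0.010, 0.004),
    (0.020, 0.010),  # dominated by (0.010, 0.004)
    (0.050, 0.002),
]

frontier = pareto_frontier(operating_points)
for fpr, info in frontier:
    print(f"FPR={fpr:.3f}  extractable={info:.3f}")
```

Each surviving point represents a threshold setting where lowering the false-positive rate further would require accepting more extractable information.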
Multi-Model Sweep
For production experiments across multiple models:
cd scripts
./run_GLS_all_models.sh
Repo Description
This repository contains code for inference verification applied to preventing model weight exfiltration. Specifically, we implement the Gumbel Likelihood Score (GLS) and Convolved Gaussian Score (CGS) methods for verifying tokens generated by large language models when inference samples with the Gumbel-Max Trick and the Inverse Probability Transform, respectively.
Note on terminology: In the paper, these methods are referred to as:
- Inverse Probability Transform (IPT) Likelihood Score (IPT-LS) - implemented in the codebase as Convolved Gaussian Score (CGS)
- Gumbel-Max Likelihood Score (GM-LS) - implemented in the codebase as Gumbel Likelihood Score (GLS)