unplug benchmarks

REGEX_VS_ML

The ML layer closes the recall gap.

Same public datasets, same threshold, fresh context per sample. The chart shows recall/F1 uplift from regex-only to regex + ML.

uplift summary

direct recall: 0.39 -> 0.98

direct F1: 0.56 -> 0.99

indirect recall: 0.05 -> 0.91

metric bars

direct recall / regex

0.39

direct recall / +ML

0.98

direct F1 / regex

0.56

direct F1 / +ML

0.99

indirect recall / regex

0.05

indirect recall / +ML

0.91

Public injection datasets, isolated single-turn sessions.
Dataset	Mode	Recall	F1	FPR
neuralchemy (direct, 4,391)	regex-only	0.39	0.56	<1%
neuralchemy (direct, 4,391)	regex + ML	0.98	0.99	<1%
microsoft llmail (indirect, 2,500)	regex-only	0.05	n/a	n/a
microsoft llmail (indirect, 2,500)	regex + ML	0.91	n/a	n/a

BASELINES

Smaller model, richer output.

Same-harness comparison against a public 184M-parameter DeBERTa prompt-injection classifier baseline. The baseline returns a document label; unplug-tiny returns document risk plus span evidence.

unplug-tiny-v1

~22M

dual-head span model: doc risk + localized spans for redact/review/block.

~12% of baseline params

public baseline

~184M

binary document classifier: unsafe/safe label for the whole input.

~8.4x larger

output shape

spans

unplug can remove only the malicious region instead of discarding the whole document.

span localization: yes

Same-harness model comparison. Values are doc-level unless noted.
Holdout	Metric	unplug-tiny	184M baseline	Delta
public validation mix (10,000)	recall / F1 / FPR	0.9998 / 0.9997 / 0.06%	0.6060 / 0.7301 / 10.03%	+39.38 recall pts
neuralchemy test (942)	recall / F1 / FPR	0.9438 / 0.9693 / 0.51%	0.8641 / 0.9173 / 2.82%	+7.97 recall pts
BIPIA indirect (2,000)	recall / F1 / FPR	0.9630 / 0.9812 / 0%	0.0275 / 0.0535 / 0%	+93.55 recall pts
NotInject benign (339)	false-positive rate	0.88%	43.36%	-42.48 FPR pts
Deepset direct OOD (281)	recall / F1 / FPR	0.6190 / 0.6914 / 10.23%	0.3714 / 0.5379 / 0.57%	higher recall, worse FPR
XSTest safe contrast (250)	false-positive rate	2.80%	0%	baseline wins

The comparison is not a blanket leaderboard claim. It uses one fixed same-harness eval snapshot across public holdouts. unplug-tiny wins on span output, size, indirect-injection recall, and NotInject false positives; the binary baseline is cleaner on XSTest-style safe/harmful contrast.

CAVEATS

Honest caveats.

scope

Numbers are single-turn. Multi-turn trajectory and crescendo detection are measured separately.

false positives

On benign text containing trigger-shaped phrases, the regex layer is the dominant false-positive source.

reproduce

Full methodology, per-dataset tables, and the reproduce command live in the SDK repo.

SDK BENCHMARKS.md model card live demo API waitlist