Image-Pretrained Vision Transformers for
Real-Time Traffic Anomaly Detection

Chengqi Li
TU Eindhoven · Mobile Perception Systems Lab

The Accuracy-Efficiency Trade-off

Can we leverage Image-Pretrained Backbones for real-time TAD?

🎯 State of the Art: Video-based Models

85.2%
AUC-ROC on DoTA (Simple-TAD/VideoMAE)
  • High accuracy via 3D spatiotemporal attention
  • Heavy computation (3D tubelet processing)
  • 76 FPS

⚡ Our Approach: Adapt VidEoMT for TAD

VidEoMT
Lightweight video model with frame-independent processing
  • Image-pretrained backbone (DINOv2 ViT-S)
  • Lightweight query propagation across frames
  • Leverage efficiency for real-time TAD deployment
Research Question
Can we adapt frame-based vision foundation models to create an efficient alternative for video anomaly detection?

VidEoMT-TAD-S Architecture

Frame-independent processing with Query Fusion (T=16 frames)

[Architecture diagram: DINOv2 ViT-S encodes each frame independently; Num_Q learnable queries are fused with the frame tokens and propagated across the T frames.]
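For concreteness, a minimal PyTorch sketch of this processing loop is given below. It assumes pre-tokenized frame features, and the module internals (cross-attention fusion, GRUCell propagator, linear head) are illustrative stand-ins rather than the exact implementation.

```python
# Minimal sketch of the VidEoMT-TAD-S processing loop (assumed layout; module internals
# are illustrative stand-ins, not the exact implementation).
import torch
import torch.nn as nn

class VidEoMTTADSketch(nn.Module):
    def __init__(self, dim=384, num_q=50, num_heads=6):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_q, dim))  # Num_Q learnable queries
        self.backbone = nn.Identity()                         # stand-in for the DINOv2 ViT-S encoder
        self.fusion = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.propagator = nn.GRUCell(dim, dim)                # query propagator across frames
        self.classifier = nn.Linear(dim, 1)                   # clip-level anomaly score head

    def forward(self, frames):
        # frames: (B, T, N, D) pre-tokenized per-frame features, e.g. T=16
        B, T, N, D = frames.shape
        q = self.queries.expand(B, -1, -1)                    # (B, Q, D)
        h = torch.zeros(B * q.shape[1], D, device=frames.device)
        for t in range(T):                                    # each frame is encoded independently
            tokens = self.backbone(frames[:, t])              # (B, N, D)
            q, _ = self.fusion(q, tokens, tokens)             # queries read the current frame
            h = self.propagator(q.reshape(-1, D), h)          # propagate query state over time
            q = h.view(B, -1, D)
        return self.classifier(q.max(dim=1).values)           # max over queries, last frame only
```

The key property is that the backbone only ever sees single frames; temporal information flows solely through the propagated query state.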

Ablation: Impact of Pretraining

Representation quality matters more than label supervision

Backbone Pretraining Comparison

Pretraining AUC-ROC FPS
DINOv2 (self-supervised) 76.5% 190
ImageNet-1K (supervised) 72.6% 202
ImageNet-21K (supervised) 72.0% 197
Fixed architecture: Q=1, Temporal Attention pooling. FPS on Titan V (FP16).
+3.9 pp
DINOv2 vs ImageNet-1K

Key Insight

Self-supervised DINOv2 outperforms supervised ImageNet-1K pretraining by 3.9 pp. More label diversity (21K vs. 1K classes) does not help.

Takeaway: Representation quality matters more than label supervision for anomaly detection.

Layer-wise Learning Rate Decay

LLRD AUC-ROC
0.6 74.9%
0.8 73.3%
Q=1, Linear propagator, Last Frame, DINOv2
Lower LLRD preserves pretrained features in early layers → +1.6 pp
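A minimal sketch of how LLRD can be set up as per-parameter learning-rate groups, assuming timm/DINOv2-style parameter names such as `blocks.<i>.`; the actual training code may group parameters differently.

```python
# Minimal sketch of layer-wise LR decay as per-parameter optimizer groups.
# Assumes timm/DINOv2-style names like "blocks.<i>."; the real training code may differ.
import torch

def llrd_param_groups(model, base_lr=1e-4, decay=0.6, num_blocks=12):
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        block_id = num_blocks                          # default: head / new modules get the base LR
        if "blocks." in name:
            block_id = int(name.split("blocks.")[1].split(".")[0])
        elif any(k in name for k in ("patch_embed", "pos_embed", "cls_token")):
            block_id = 0                               # embeddings count as the earliest layer
        lr = base_lr * decay ** (num_blocks - block_id)  # earlier layers get smaller LRs
        groups.append({"params": [param], "lr": lr})
    return groups

# optimizer = torch.optim.AdamW(llrd_param_groups(backbone), weight_decay=0.05)
```

With decay = 0.6 and 12 blocks, the earliest block trains at roughly 0.6^12 ≈ 0.002× the base LR, which is what keeps the pretrained low-level features intact.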

Ablation: Query Pooling

Aggregating Q query tokens into a single frame representation

Each frame yields Q=50 query tokens $\mathbf{q}_t^{(1)}, \dots, \mathbf{q}_t^{(Q)}$ with D=384; every variant outputs a single $\mathbf{q}_t \in \mathbb{R}^D$.
  • Max Pooling: $\mathbf{q}_t[d] = \max_i \mathbf{q}_t^{(i)}[d]$ (0 extra params)
  • Attention Pool: $\mathbf{w} = \text{softmax}(\mathbf{W} \cdot \mathbf{q})$, weighted sum of queries (296K params)
  • Mean Pool: $\mathbf{q}_t = \frac{1}{Q}\sum_i \mathbf{q}_t^{(i)}$ (0 extra params)

Ablation (Q=50)

METHOD AUC-ROC FPS PARAMS
Max 76.3% 189 0
Attention 75.1% 184 296K
Mean 73.4% 190 0
Linear propagator, Last Frame
Max pooling achieves the best AUC-ROC with zero additional parameters. Anomalous events strongly activate specific queries rather than activating all queries uniformly — max pooling preserves this discriminative signal, while mean pooling dilutes it.
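A minimal sketch of the three pooling variants over the Q query tokens, with shapes assumed as (B, Q, D); the attention-pool scorer here is a simplified stand-in, not the 296K-parameter version from the table.

```python
# Minimal sketch of the three query-pooling variants over q of shape (B, Q, D), Q=50, D=384.
import torch
import torch.nn as nn

def max_pool(q):                        # (B, Q, D) -> (B, D), 0 extra params
    return q.max(dim=1).values          # keeps the strongest per-dimension query response

def mean_pool(q):                       # (B, Q, D) -> (B, D), 0 extra params
    return q.mean(dim=1)                # averages (and dilutes) query responses

class AttentionPool(nn.Module):         # learned query weighting
    def __init__(self, dim=384):
        super().__init__()
        # Simplified scorer; the 296K-parameter version in the table uses a larger projection.
        self.score = nn.Linear(dim, 1)
    def forward(self, q):               # (B, Q, D) -> (B, D)
        w = self.score(q).softmax(dim=1)        # softmax over the Q queries
        return (w * q).sum(dim=1)
```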

Ablation: Temporal Pooling

Aggregating per-frame features across time for clip-level prediction

[Diagram: two temporal pooling strategies. Temporal Attention computes $\mathbf{w} = \text{softmax}(\mathbf{W} \cdot \mathbf{q})$ over $\mathbf{q}_1, \dots, \mathbf{q}_T$ and feeds the weighted sum to the classifier. Last Frame feeds only $\mathbf{q}_T$ to the classifier, since the GRU hidden state already encodes t = 1..T. GRU + Last Frame (77.3%) > Linear + Temporal Attention (76.5%).]

Query Propagator

PROPAGATOR AUC-ROC FPS
GRU 78.1% 200
Linear 76.3% 189
Q=50, Max pool, Last Frame

Temporal Pooling

METHOD AUC-ROC FPS
Temp. Attn 76.5% 190
Last Frame 74.9% 190
Q=1, Linear propagator
The GRU propagator accumulates temporal context in its hidden state, making additional temporal attention redundant. Only the final query $\mathbf{q}_T$ is needed for classification.
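A minimal sketch of the two temporal pooling options over the per-frame features, with an assumed shape of (B, T, D) and an illustrative parameterization of $\mathbf{W}$.

```python
# Minimal sketch of the two temporal pooling options over per-frame features (B, T, D).
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Softmax-weighted sum over the T per-frame features (paired with the Linear propagator)."""
    def __init__(self, dim=384):
        super().__init__()
        self.w = nn.Linear(dim, 1)      # illustrative parameterization of W
    def forward(self, q_seq):           # (B, T, D) -> (B, D)
        a = self.w(q_seq).softmax(dim=1)
        return (a * q_seq).sum(dim=1)

def last_frame_pool(q_seq):             # (B, T, D) -> (B, D)
    # With a GRU propagator, q_T already summarizes t = 1..T, so the last step suffices.
    return q_seq[:, -1]
```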

Summary

Accuracy-Efficiency Trade-off on DoTA Dataset

Comparison on DoTA Dataset

Model Prop. Pool AUC-ROC FPS
Simple-TAD (VideoMAE-S) – – 85.2% 76
VidEoMT-TAD-S (Q=50) GRU LF+Max 78.1% 200
VidEoMT-TAD-S (Q=1) GRU LF 77.3% 200
VidEoMT-TAD-S (Q=50) Linear LF+Max 76.3% 189
VidEoMT-TAD-S (Q=1) Linear LF 74.9% 190
All VidEoMT-TAD variants use DINOv2 pretraining. LF = Last Frame. FPS on Titan V (FP16).
Accuracy vs Speed Trade-off
[Scatter plot: AUC-ROC (%) vs. FPS. Simple-TAD at 85.2% / 76 FPS; VidEoMT-TAD variants span 74.9–78.1% at 189–200 FPS, with Q=50 + GRU the best of ours at 78.1%.]
Best config: 78.1% AUC-ROC at 200 FPS
vs. VideoMAE: 2.6× faster, −7.1 pp AUC-ROC
Note: Scaling the DINOv2 backbone from ViT-S/14 to ViT-B/14 only yields 78.3% AUC-ROC (vs. 78.1%), suggesting further gains require temporal modeling rather than a larger image encoder. (S. Orlova)

Per-Type Breakdown

Where does the accuracy gap come from?

AUC-ROC by Anomaly Type (sorted by gap)

Type Simple-TAD (VideoMAE) VidEoMT-TAD (DINOv2) Gap (pp)
Start/Stop 83.4 78.9 −4.5
Obstacle 83.1 75.8 −7.3
Lateral 85.6 77.9 −7.7
Oncoming 86.2 77.3 −8.9
Turning 86.9 77.9 −9.0
Leave Left 86.7 76.6 −10.1
Moving Ahead 85.7 75.2 −10.5
Leave Right 84.9 73.7 −11.2
Unknown 74.3 61.5 −12.8
Pedestrian 85.0 68.6 −16.4

VidEoMT Competitive

Start/Stop (−4.5pp), Obstacle (−7.3pp)
Static/slow anomalies: frame-independent processing sufficient

VidEoMT Struggles

Pedestrian (−16.4pp), Unknown (−12.8pp)
Complex trajectories require temporal modeling
Takeaway: Gap is concentrated in motion-heavy types — validates the “lack of spatiotemporal features” hypothesis.

Qualitative Verification (1/3)

Small gap categories (−4.5 to −8.9pp) — spatial cues largely sufficient.

Start/Stop (−4.5pp): Static anomaly — spatial cues alone suffice
Obstacle (−7.3pp): Stationary obstruction detected from single frame
Lateral (−7.7pp): Side-to-side motion captured by spatial features
Oncoming (−8.9pp): Focuses on the approaching hazard

Qualitative Verification (2/3)

Medium gap categories (−9.0 to −11.2pp) — increasing need for temporal context.

Turning (−9.0pp): Most common type — attends to turning vehicle
Leave Left (−10.1pp): Vehicle departing lane leftward
Moving Ahead (−10.5pp): Forward collision detected through spatial proximity
Leave Right (−11.2pp): Vehicle departing lane rightward

Qualitative Verification (3/3)

Large gap categories (−12.8 to −16.4pp) — temporal dynamics essential.

Unknown (−12.8pp): Ambiguous scenarios requiring temporal context
Pedestrian (−16.4pp): Hardest category โ€” trajectory understanding needed
Conclusion: Learned queries capture semantically relevant regions across all 10 categories. Even for Pedestrian (−16.4pp), the model attends correctly — the failure is in temporal dynamics, not spatial attention.

Limitations & Conclusion

Critical analysis and deployment guidelines

⚠️ The Accuracy Gap

The ~7 pp accuracy gap vs. VideoMAE is likely due to the lack of joint spatiotemporal features.

Frame-independent processing struggles to model complex motion dynamics (acceleration, trajectory changes) that joint spatiotemporal attention captures naturally.

Root Cause: No explicit motion modeling between frames.

📋 Deployment Guidelines

VidEoMT-TAD for:
  • Driver Alert Systems (high FPS needed)
  • Resource-constrained edge devices
  • Cases where ~78% AUC-ROC is acceptable
VideoMAE for:
  • Autonomous Braking (high recall critical)
  • Safety-critical applications
  • Cases where maximum accuracy is required

Conclusion

Image-Pretrained ViTs are a viable solution for resource-constrained environments, offering 200 FPS inference.
Possible improvements: (1) Video self-supervised pretraining (e.g., V-JEPA) to inject temporal priors; (2) Explicit spatiotemporal modeling via lightweight temporal modules (e.g., temporal convolution) in earlier backbone layers.
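As a rough illustration of improvement (2), a depthwise temporal convolution could be inserted as a residual adapter between backbone blocks; the sketch below is an assumption about how such a module might look, not an implemented component.

```python
# Rough sketch of improvement (2): a depthwise temporal convolution inserted as a residual
# adapter between ViT blocks. Shapes and placement are assumptions, not an implemented result.
import torch
import torch.nn as nn

class TemporalConvAdapter(nn.Module):
    def __init__(self, dim=384, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
    def forward(self, x):                                   # x: (B, T, N_tokens, D)
        B, T, N, D = x.shape
        y = x.permute(0, 2, 3, 1).reshape(B * N, D, T)      # convolve each token feature over time
        y = self.conv(y).reshape(B, N, D, T).permute(0, 3, 1, 2)
        return x + y                                        # residual keeps pretrained features intact
```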
Thank You!
Chengqi Li
Mobile Perception Systems Lab
TU Eindhoven

Q&A