Image-Pretrained Vision Transformers for
Real-Time Traffic Anomaly Detection

Chengqi Li
TU Eindhoven · Mobile Perception Systems Lab

The Accuracy-Efficiency Trade-off

Can we leverage Image-Pretrained Backbones for real-time TAD?

🎯 State of the Art: Video-based Models

85.2%
AUC-ROC on DoTA (Simple-TAD/VideoMAE)
  • High accuracy via 3D spatiotemporal attention
  • Heavy computation (3D tubelet processing)
  • 76 FPS

⚡ Our Approach: Adapt VidEoMT for TAD

VidEoMT
Lightweight video model with frame-independent processing
  • Image-pretrained backbone (DINOv2 ViT-S)
  • Lightweight query propagation across frames
  • Leverage efficiency for real-time TAD deployment
Research Question
Can we adapt frame-based vision foundation models to create an efficient alternative for video anomaly detection?

VidEoMT-TAD-S Architecture

Frame-independent processing with Query Fusion (T=16 frames)

[Architecture diagram: DINOv2 ViT-S encodes each frame independently; Num_Q learnable queries are fused with the frame tokens and propagated across the T frames.]
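For concreteness, a minimal PyTorch sketch of this processing loop is given below. It assumes pre-tokenized frame features, and the module internals (cross-attention fusion, GRUCell propagator, linear head) are illustrative stand-ins rather than the exact implementation.

```python
# Minimal sketch of the VidEoMT-TAD-S processing loop (assumed layout; module internals
# are illustrative stand-ins, not the exact implementation).
import torch
import torch.nn as nn

class VidEoMTTADSketch(nn.Module):
    def __init__(self, dim=384, num_q=50, num_heads=6):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_q, dim))  # Num_Q learnable queries
        self.backbone = nn.Identity()                         # stand-in for the DINOv2 ViT-S encoder
        self.fusion = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.propagator = nn.GRUCell(dim, dim)                # query propagator across frames
        self.classifier = nn.Linear(dim, 1)                   # clip-level anomaly score head

    def forward(self, frames):
        # frames: (B, T, N, D) pre-tokenized per-frame features, e.g. T=16
        B, T, N, D = frames.shape
        q = self.queries.expand(B, -1, -1)                    # (B, Q, D)
        h = torch.zeros(B * q.shape[1], D, device=frames.device)
        for t in range(T):                                    # each frame is encoded independently
            tokens = self.backbone(frames[:, t])              # (B, N, D)
            q, _ = self.fusion(q, tokens, tokens)             # queries read the current frame
            h = self.propagator(q.reshape(-1, D), h)          # propagate query state over time
            q = h.view(B, -1, D)
        return self.classifier(q.max(dim=1).values)           # max over queries, last frame only
```

The key property is that the backbone only ever sees single frames; temporal information flows solely through the propagated query state.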

Ablation: Impact of Pretraining

Representation quality matters more than label supervision

Backbone Pretraining Comparison

Pretraining AUC-ROC FPS
DINOv2 (self-supervised) 76.5% 190
ImageNet-1K (supervised) 72.6% 202
ImageNet-21K (supervised) 72.0% 197
Fixed architecture: Q=1, Temporal Attention pooling. FPS on Titan V (FP16).
+3.9 pp
DINOv2 vs ImageNet-1K

Key Insight

Self-supervised DINOv2 outperforms supervised ImageNet-1K pretraining by 3.9 pp. More label diversity (21K vs. 1K classes) does not help.

Takeaway: Representation quality matters more than label supervision for anomaly detection.

Layer-wise Learning Rate Decay

LLRD AUC-ROC
0.6 74.9%
0.8 73.3%
Q=1, Linear propagator, Last Frame, DINOv2
Lower LLRD preserves pretrained features in early layers → +1.6 pp
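A minimal sketch of how LLRD can be set up as per-parameter learning-rate groups, assuming timm/DINOv2-style parameter names such as `blocks.<i>.`; the actual training code may group parameters differently.

```python
# Minimal sketch of layer-wise LR decay as per-parameter optimizer groups.
# Assumes timm/DINOv2-style names like "blocks.<i>."; the real training code may differ.
import torch

def llrd_param_groups(model, base_lr=1e-4, decay=0.6, num_blocks=12):
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        block_id = num_blocks                          # default: head / new modules get the base LR
        if "blocks." in name:
            block_id = int(name.split("blocks.")[1].split(".")[0])
        elif any(k in name for k in ("patch_embed", "pos_embed", "cls_token")):
            block_id = 0                               # embeddings count as the earliest layer
        lr = base_lr * decay ** (num_blocks - block_id)  # earlier layers get smaller LRs
        groups.append({"params": [param], "lr": lr})
    return groups

# optimizer = torch.optim.AdamW(llrd_param_groups(backbone), weight_decay=0.05)
```

With decay = 0.6 and 12 blocks, the earliest block trains at roughly 0.6^12 ≈ 0.002× the base LR, which is what keeps the pretrained low-level features intact.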

Ablation: Query Pooling

Aggregating Q query tokens into a single frame representation

Each frame yields Q=50 query tokens $\mathbf{q}_t^{(1)}, \dots, \mathbf{q}_t^{(Q)}$ with D=384; every variant outputs a single $\mathbf{q}_t \in \mathbb{R}^D$.
  • Max Pooling: $\mathbf{q}_t[d] = \max_i \mathbf{q}_t^{(i)}[d]$ (0 extra params)
  • Attention Pool: $\mathbf{w} = \text{softmax}(\mathbf{W} \cdot \mathbf{q})$, weighted sum of queries (296K params)
  • Mean Pool: $\mathbf{q}_t = \frac{1}{Q}\sum_i \mathbf{q}_t^{(i)}$ (0 extra params)

Ablation (Q=50)

METHOD AUC-ROC FPS PARAMS
Max 76.3% 189 0
Attention 75.1% 184 296K
Mean 73.4% 190 0
Linear propagator, Last Frame
Max pooling achieves the best AUC-ROC with zero additional parameters. Anomalous events strongly activate specific queries rather than activating all queries uniformly — max pooling preserves this discriminative signal, while mean pooling dilutes it.
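A minimal sketch of the three pooling variants over the Q query tokens, with shapes assumed as (B, Q, D); the attention-pool scorer here is a simplified stand-in, not the 296K-parameter version from the table.

```python
# Minimal sketch of the three query-pooling variants over q of shape (B, Q, D), Q=50, D=384.
import torch
import torch.nn as nn

def max_pool(q):                        # (B, Q, D) -> (B, D), 0 extra params
    return q.max(dim=1).values          # keeps the strongest per-dimension query response

def mean_pool(q):                       # (B, Q, D) -> (B, D), 0 extra params
    return q.mean(dim=1)                # averages (and dilutes) query responses

class AttentionPool(nn.Module):         # learned query weighting
    def __init__(self, dim=384):
        super().__init__()
        # Simplified scorer; the 296K-parameter version in the table uses a larger projection.
        self.score = nn.Linear(dim, 1)
    def forward(self, q):               # (B, Q, D) -> (B, D)
        w = self.score(q).softmax(dim=1)        # softmax over the Q queries
        return (w * q).sum(dim=1)
```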

Ablation: Temporal Pooling

Aggregating per-frame features across time for clip-level prediction

[Diagram: two temporal pooling strategies. Temporal Attention computes $\mathbf{w} = \text{softmax}(\mathbf{W} \cdot \mathbf{q})$ over $\mathbf{q}_1, \dots, \mathbf{q}_T$ and feeds the weighted sum to the classifier. Last Frame feeds only $\mathbf{q}_T$ to the classifier, since the GRU hidden state already encodes t = 1..T. GRU + Last Frame (77.3%) > Linear + Temporal Attention (76.5%).]

Query Propagator

PROPAGATOR AUC-ROC FPS
GRU 78.1% 200
Linear 76.3% 189
Q=50, Max pool, Last Frame

Temporal Pooling

METHOD AUC-ROC FPS
Temp. Attn 76.5% 190
Last Frame 74.9% 190
Q=1, Linear propagator
The GRU propagator accumulates temporal context in its hidden state, making additional temporal attention redundant. Only the final query $\mathbf{q}_T$ is needed for classification.
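A minimal sketch of the two temporal pooling options over the per-frame features, with an assumed shape of (B, T, D) and an illustrative parameterization of $\mathbf{W}$.

```python
# Minimal sketch of the two temporal pooling options over per-frame features (B, T, D).
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Softmax-weighted sum over the T per-frame features (paired with the Linear propagator)."""
    def __init__(self, dim=384):
        super().__init__()
        self.w = nn.Linear(dim, 1)      # illustrative parameterization of W
    def forward(self, q_seq):           # (B, T, D) -> (B, D)
        a = self.w(q_seq).softmax(dim=1)
        return (a * q_seq).sum(dim=1)

def last_frame_pool(q_seq):             # (B, T, D) -> (B, D)
    # With a GRU propagator, q_T already summarizes t = 1..T, so the last step suffices.
    return q_seq[:, -1]
```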

Summary

Accuracy-Efficiency Trade-off on DoTA Dataset

Comparison on DoTA Dataset

Model Prop. Pool AUC-ROC FPS
Simple-TAD (VideoMAE-S) – – 85.2% 76
VidEoMT-TAD-S (Q=50) GRU LF+Max 78.1% 200
VidEoMT-TAD-S (Q=1) GRU LF 77.3% 200
VidEoMT-TAD-S (Q=50) Linear LF+Max 76.3% 189
VidEoMT-TAD-S (Q=1) Linear LF 74.9% 190
All VidEoMT-TAD variants use DINOv2 pretraining. LF = Last Frame. FPS on Titan V (FP16).
Accuracy vs Speed Trade-off
[Scatter plot: AUC-ROC (%) vs. FPS. Simple-TAD at 85.2% / 76 FPS; VidEoMT-TAD variants span 74.9–78.1% at 189–200 FPS, with Q=50 + GRU the best of ours at 78.1%.]
Best config: 78.1% AUC-ROC at 200 FPS
vs. VideoMAE: 2.6× faster, −7.1 pp AUC-ROC
Note: Scaling the DINOv2 backbone from ViT-S/14 to ViT-B/14 only yields 78.3% AUC-ROC (vs. 78.1%), suggesting further gains require temporal modeling rather than a larger image encoder. (S. Orlova)

Per-Type Breakdown

Where does the accuracy gap come from?

AUC-ROC by Anomaly Type (sorted by gap)

Type Simple-TAD (VideoMAE) VidEoMT-TAD (DINOv2) Gap (pp)
Start/Stop 83.4 78.9 −4.5
Obstacle 83.1 75.8 −7.3
Lateral 85.6 77.9 −7.7
Oncoming 86.2 77.3 −8.9
Turning 86.9 77.9 −9.0
Leave Left 86.7 76.6 −10.1
Moving Ahead 85.7 75.2 −10.5
Leave Right 84.9 73.7 −11.2
Unknown 74.3 61.5 −12.8
Pedestrian 85.0 68.6 −16.4

VidEoMT Competitive

Start/Stop (−4.5pp), Obstacle (−7.3pp)
Static/slow anomalies: frame-independent processing sufficient

VidEoMT Struggles

Pedestrian (−16.4pp), Unknown (−12.8pp)
Complex trajectories require temporal modeling
Takeaway: Gap is concentrated in motion-heavy types — validates the “lack of spatiotemporal features” hypothesis.

Qualitative Verification (1/3)

Small gap categories (−4.5 to −8.9pp) — spatial cues largely sufficient.

Start/Stop (−4.5pp): Static anomaly — spatial cues alone suffice
Obstacle (−7.3pp): Stationary obstruction detected from single frame
Lateral (−7.7pp): Side-to-side motion captured by spatial features
Oncoming (−8.9pp): Focuses on the approaching hazard

Qualitative Verification (2/3)

Medium gap categories (−9.0 to −11.2pp) — increasing need for temporal context.

Turning (−9.0pp): Most common type — attends to turning vehicle
Leave Left (−10.1pp): Vehicle departing lane leftward
Moving Ahead (−10.5pp): Forward collision detected through spatial proximity
Leave Right (−11.2pp): Vehicle departing lane rightward

Qualitative Verification (3/3)

Large gap categories (−12.8 to −16.4pp) — temporal dynamics essential.

Unknown (−12.8pp): Ambiguous scenarios requiring temporal context
Pedestrian (−16.4pp): Hardest category โ€” trajectory understanding needed
Conclusion: Learned queries capture semantically relevant regions across all 10 categories. Even for Pedestrian (−16.4pp), the model attends correctly — the failure is in temporal dynamics, not spatial attention.

Limitations & Conclusion

Critical analysis and deployment guidelines

⚠️ The Accuracy Gap

The ~7 pp accuracy gap vs. VideoMAE is likely due to the lack of joint spatiotemporal features.

Frame-independent processing struggles to model complex motion dynamics (acceleration, trajectory changes) that joint spatiotemporal attention captures naturally.

Root Cause: No explicit motion modeling between frames.

📋 Deployment Guidelines

VidEoMT-TAD for:
  • Driver Alert Systems (high FPS needed)
  • Resource-constrained edge devices
  • Cases where ~78% AUC-ROC is acceptable
VideoMAE for:
  • Autonomous Braking (high recall critical)
  • Safety-critical applications
  • Cases where maximum accuracy is required

Conclusion

Image-Pretrained ViTs are a viable solution for resource-constrained environments, offering 200 FPS inference.
Possible improvements: (1) Video self-supervised pretraining (e.g., V-JEPA) to inject temporal priors; (2) Explicit spatiotemporal modeling via lightweight temporal modules (e.g., temporal convolution) in earlier backbone layers.
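As a rough illustration of improvement (2), a depthwise temporal convolution could be inserted as a residual adapter between backbone blocks; the sketch below is an assumption about how such a module might look, not an implemented component.

```python
# Rough sketch of improvement (2): a depthwise temporal convolution inserted as a residual
# adapter between ViT blocks. Shapes and placement are assumptions, not an implemented result.
import torch
import torch.nn as nn

class TemporalConvAdapter(nn.Module):
    def __init__(self, dim=384, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
    def forward(self, x):                                   # x: (B, T, N_tokens, D)
        B, T, N, D = x.shape
        y = x.permute(0, 2, 3, 1).reshape(B * N, D, T)      # convolve each token feature over time
        y = self.conv(y).reshape(B, N, D, T).permute(0, 3, 1, 2)
        return x + y                                        # residual keeps pretrained features intact
```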
Thank You!
Chengqi Li
Mobile Perception Systems Lab
TU Eindhoven

Q&A