Image-Pretrained Vision Transformers for Real-Time Traffic Anomaly Detection

Nov 1, 2025

Motivation

Traffic anomaly detection (TAD) is critical for autonomous driving and requires both high accuracy and low latency. The task is to identify abnormal or dangerous events in ego-centric dashcam footage in real time, which amounts to binary classification at the frame level.

Recent work such as Simple-TAD shows that a simple encoder-only VideoMAE model with strong pre-training can outperform complex multi-stage architectures, achieving 85.2% AUC-ROC on the DoTA benchmark. However, video-pretrained models like VideoMAE process 3D spatiotemporal tokens, making them computationally expensive for edge deployment.

This raises a natural question: can we trade some accuracy for significantly faster inference by using lighter, image-pretrained backbones instead?

Approach: VidEoMT-TAD

We adapted VidEoMT, a video instance segmentation model, for the TAD task. The key idea is to use a frame-based image foundation model (DINOv2) instead of a video foundation model, and handle temporal modeling through lightweight query propagation.

VidEoMT-TAD architecture: frame-independent DINOv2 encoding with query fusion across time

The architecture works in three stages:

Frame-independent encoding: Each frame is processed independently through a DINOv2 ViT-S/14 backbone, producing 16x16 = 256 patch tokens per frame.
Query-based cross-attention: Learnable query tokens attend to patch features via cross-attention in transformer blocks 9-11. This is where the model learns what to look for in each frame.
Temporal aggregation: Queries are propagated across frames using a GRU-based updater, enabling the model to track evolving patterns across time. The final frame’s query representation is used for binary classification.
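The token count in the first stage follows directly from the patch size. The sketch below reproduces the 16x16 = 256 figure; the 224x224 input resolution is an assumption (it is the size that yields a 16x16 grid with 14-pixel patches) and the function name is hypothetical.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patch tokens) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side


# Assumed input resolution: 224x224 pixels; patch size 14 comes from ViT-S/14.
side, tokens = patch_grid(image_size=224, patch_size=14)
print(side, tokens)  # 16 256
```

Because each frame is encoded independently, the per-frame token count stays fixed regardless of clip length; only the query propagation step sees more than one frame.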

For multi-query configurations (Q > 1), we aggregate per-frame queries via max pooling, which preserves the strongest response per feature — effective because anomalous events tend to strongly activate specific queries.
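As a minimal illustration of max pooling over queries, the sketch below takes Q query vectors for one frame and keeps the element-wise maximum. The function name and the toy values are hypothetical; the real model operates on learned query embeddings.

```python
def max_pool_queries(queries: list[list[float]]) -> list[float]:
    """Element-wise max over Q query vectors that share one dimension D."""
    if not queries:
        raise ValueError("need at least one query vector")
    dim = len(queries[0])
    if any(len(q) != dim for q in queries):
        raise ValueError("all query vectors must have the same length")
    return [max(q[d] for q in queries) for d in range(dim)]


# Toy example: Q = 3 queries, D = 3 features (values are made up).
frame_queries = [
    [0.1, -0.3, 0.2],
    [0.9, 0.0, -0.5],
    [0.2, 0.4, 0.1],
]
print(max_pool_queries(frame_queries))  # [0.9, 0.4, 0.2]
```

Unlike mean pooling, a single strongly activated query determines the pooled value for a feature, no matter how many other queries stay near zero.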

Key Findings

Self-supervised pretraining matters most

We compared different backbone initializations under controlled settings:

Pretraining	AUC-ROC	FPS
DINOv2 (self-supervised)	76.5	190
ImageNet-1K (supervised)	72.6	202
ImageNet-21K (supervised)	72.0	197

DINOv2 outperforms supervised ImageNet-1K by +3.9 pp AUC-ROC (76.5 vs. 72.6) while running slightly slower (190 vs. 202 FPS). This suggests that self-supervised representation quality matters more than label diversity for anomaly detection: pretraining on more categories (ImageNet-21K, 72.0) does not help.
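For readers less familiar with the metric, AUC-ROC can be read as the probability that a randomly chosen anomalous frame receives a higher score than a randomly chosen normal frame. The sketch below computes it by direct pairwise comparison on made-up frame scores. It illustrates the standard definition and is not the evaluation code used for these experiments.

```python
def auc_roc(labels: list[int], scores: list[float]) -> float:
    """AUC-ROC via pairwise comparison; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative frame")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical per-frame labels (1 = anomalous) and model scores.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
print(round(auc_roc(labels, scores), 3))  # 0.889
```

The quadratic loop is only suitable for small examples; a rank-based implementation gives the same value in O(n log n).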

GRU-based temporal modeling

A GRU query propagator maintains hidden state across frames, enabling the model to track evolving temporal patterns. Compared to a simple linear projection baseline:

Propagator	AUC-ROC	FPS
GRU	78.1	200
Linear	76.3	189

The GRU improves AUC-ROC by +1.8 pp with no loss in throughput (200 vs. 189 FPS), which supports the view that explicit temporal memory benefits anomaly detection.
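To make the idea of a GRU query propagator concrete, here is a minimal pure-Python sketch of a GRU cell carried across frames. It is an illustration under assumptions, not the model's implementation: the class and function names are hypothetical, weights are random, biases are omitted, and we assume the per-frame query feature is the GRU input while the carried query state is the hidden state.

```python
import math
import random


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def _matvec(w: list[list[float]], v: list[float]) -> list[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in w]


class TinyGRUCell:
    """Standard GRU cell (no biases) with randomly initialised weights."""

    def __init__(self, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)

        def mat() -> list[list[float]]:
            return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

        self.wz, self.uz = mat(), mat()  # update gate
        self.wr, self.ur = mat(), mat()  # reset gate
        self.wn, self.un = mat(), mat()  # candidate state

    def step(self, x: list[float], h: list[float]) -> list[float]:
        z = [_sigmoid(a + b) for a, b in zip(_matvec(self.wz, x), _matvec(self.uz, h))]
        r = [_sigmoid(a + b) for a, b in zip(_matvec(self.wr, x), _matvec(self.ur, h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(_matvec(self.wn, x), _matvec(self.un, rh))]
        # Blend the candidate state with the previous state via the update gate.
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def propagate(cell: TinyGRUCell, frame_queries: list[list[float]]) -> list[float]:
    """Carry a hidden query state across frames; return the final state."""
    h = [0.0] * len(frame_queries[0])
    for x in frame_queries:
        h = cell.step(x, h)
    return h


# Three frames, one 4-dimensional query feature per frame (toy values).
cell = TinyGRUCell(dim=4, seed=0)
frames = [[0.5, -0.2, 0.1, 0.0], [0.4, -0.1, 0.3, 0.2], [-0.6, 0.8, 0.0, 0.1]]
print(propagate(cell, frames))
```

The update gate lets the state persist or be overwritten per feature, which is the property a single stateless linear projection lacks.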

The accuracy-efficiency trade-off

Accuracy vs throughput trade-off on the DoTA dataset

Our best model reaches 78.1% AUC-ROC at 200 FPS, about 2.6x the throughput of Simple-TAD (85.2% at 76 FPS) but 7.1 pp lower in AUC-ROC. All models exceed the 30 FPS video rate by a wide margin, leaving significant headroom for real-time deployment.

Model	AUC-ROC	FPS
Simple-TAD (VideoMAE-S)	85.2	76
VidEoMT-TAD ViT-S (Q=50, GRU)	78.1	200
VidEoMT-TAD ViT-S (Q=1, GRU)	77.3	200
VidEoMT-TAD ViT-S (Q=1, Linear)	74.9	190
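The gap and speedup figures can be recomputed from the table. The short script below does this for each VidEoMT-TAD variant against Simple-TAD; the dictionary simply restates the numbers reported above, and the helper name is hypothetical.

```python
# (AUC-ROC in %, throughput in FPS), copied from the results table.
results = {
    "Simple-TAD (VideoMAE-S)": (85.2, 76),
    "VidEoMT-TAD ViT-S (Q=50, GRU)": (78.1, 200),
    "VidEoMT-TAD ViT-S (Q=1, GRU)": (77.3, 200),
    "VidEoMT-TAD ViT-S (Q=1, Linear)": (74.9, 190),
}


def compare(baseline: str, model: str) -> tuple[float, float]:
    """Return (AUC-ROC gap in pp, throughput ratio) of model vs. baseline."""
    auc_b, fps_b = results[baseline]
    auc_m, fps_m = results[model]
    return round(auc_b - auc_m, 1), round(fps_m / fps_b, 2)


baseline = "Simple-TAD (VideoMAE-S)"
for name in results:
    if name != baseline:
        gap, speedup = compare(baseline, name)
        print(f"{name}: -{gap} pp AUC-ROC, {speedup}x throughput")
```

The variants span a gap of 7.1 to 10.3 pp and a throughput ratio of 2.5x to about 2.6x.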

When to use image-pretrained models for TAD

Our findings suggest image-pretrained models are appropriate when:

Latency-critical: Applications requiring >100 FPS (e.g., parallel multi-camera processing)
Resource-constrained: Edge devices where VideoMAE’s memory footprint is prohibitive
Acceptable accuracy: Scenarios where ~78% AUC-ROC meets requirements (e.g., driver alert systems with human oversight)

Conversely, safety-critical applications requiring high recall should prefer video-pretrained models despite computational costs.
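A rough way to reason about the latency-critical case is to convert throughput into a per-frame time budget and a number of 30 FPS camera streams one model instance could serve. This back-of-the-envelope sketch assumes the reported FPS counts individual frames on a single device and that throughput divides evenly across streams with no batching or I/O overhead; both are simplifying assumptions, and the function names are hypothetical.

```python
import math


def frame_time_ms(fps: float) -> float:
    """Average processing time per frame in milliseconds."""
    return 1000.0 / fps


def max_streams(model_fps: float, stream_fps: float = 30.0) -> int:
    """Number of camera streams one model instance can keep up with."""
    return math.floor(model_fps / stream_fps)


for name, fps in [("VidEoMT-TAD (GRU)", 200), ("Simple-TAD", 76)]:
    print(f"{name}: {frame_time_ms(fps):.1f} ms/frame, "
          f"{max_streams(fps)} streams at 30 FPS")
```

Under these assumptions, 200 FPS corresponds to 5 ms per frame and six 30 FPS streams, versus about 13 ms and two streams at 76 FPS.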

Conclusion

This internship investigated the trade-offs of using image-pretrained ViTs for traffic anomaly detection. The key takeaways:

DINOv2 > ImageNet: Self-supervised pretraining outperforms supervised ImageNet by 3.9 pp AUC-ROC
Temporal modeling helps: GRU-based query propagation improves AUC-ROC by up to 1.8 pp over a linear baseline
A significant accuracy gap exists: 7.1-10.3 pp compared to video-pretrained VideoMAE, offset by 2.5-2.6x faster inference

These results provide practical guidance on when image-pretrained backbones are suitable for TAD deployment.

Per-Category Analysis

Breaking down the accuracy gap by anomaly category reveals where image-pretrained models fall short and why:

Category	Gap (pp)	Interpretation
Start/Stop	−4.5	Static anomaly — spatial cues alone suffice
Obstacle	−7.3	Stationary obstruction detected from single frame
Lateral	−7.7	Side-to-side motion captured by spatial features
Oncoming	−8.9	Focuses on the approaching hazard
Turning	−9.0	Most common type — attends to turning vehicle
Leave Left	−10.1	Vehicle departing lane leftward
Moving Ahead	−10.5	Forward collision detected through spatial proximity
Leave Right	−11.2	Vehicle departing lane rightward
Unknown	−12.8	Ambiguous scenarios requiring temporal context
Pedestrian	−16.4	Hardest category — trajectory understanding needed

The pattern is fairly consistent: categories where the anomaly is spatially evident (e.g., a stationary obstacle, a stopped vehicle) have the smallest gaps (−4.5 to −8.9 pp). Turning, Moving Ahead, and the two lane-departure categories sit in between (−9.0 to −11.2 pp). The largest gaps (−12.8 to −16.4 pp) occur for Unknown and Pedestrian, where temporal reasoning such as predicting a pedestrian’s trajectory is needed.
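The grouping can be reproduced from the table with a simple bucketing rule. In the sketch below, the gap values are copied from the table, while the two thresholds and the bucket names are assumptions chosen to match the ranges quoted in the text.

```python
# Per-category AUC-ROC gap in pp (image-pretrained minus video-pretrained).
gaps = {
    "Start/Stop": -4.5, "Obstacle": -7.3, "Lateral": -7.7, "Oncoming": -8.9,
    "Turning": -9.0, "Leave Left": -10.1, "Moving Ahead": -10.5,
    "Leave Right": -11.2, "Unknown": -12.8, "Pedestrian": -16.4,
}


def bucket(gap: float, small: float = -8.9, large: float = -12.8) -> str:
    """Assign a gap to a severity bucket (thresholds are illustrative)."""
    if gap >= small:
        return "small"
    if gap <= large:
        return "large"
    return "intermediate"


groups: dict[str, list[str]] = {"small": [], "intermediate": [], "large": []}
for category, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    groups[bucket(gap)].append(category)

for severity, categories in groups.items():
    print(f"{severity}: {', '.join(categories)}")
```

With these thresholds, four categories fall in the small bucket, four in the intermediate bucket, and Unknown and Pedestrian in the large bucket.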

Attention Visualization

The model learns to attend to anomalous regions. Heatmaps show attention weights aggregated across queries, with confidence scores indicating anomaly probability.
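One plausible way to build such a heatmap is to average each patch's attention weight over all queries, rescale to [0, 1], and reshape to the patch grid. The sketch below does this on a toy 2x2 grid; the averaging and min-max scaling are assumptions, since the exact aggregation is not specified here, and the function name is hypothetical.

```python
def attention_heatmap(attn: list[list[float]], grid: int) -> list[list[float]]:
    """Average Q rows of patch attention, min-max scale, reshape to grid x grid."""
    n = grid * grid
    if not attn or any(len(row) != n for row in attn):
        raise ValueError("each query needs grid * grid attention weights")
    mean = [sum(row[i] for row in attn) / len(attn) for i in range(n)]
    lo, hi = min(mean), max(mean)
    span = hi - lo
    # A flat map (span == 0) is returned as all zeros.
    norm = [(m - lo) / span if span > 0 else 0.0 for m in mean]
    return [norm[r * grid:(r + 1) * grid] for r in range(grid)]


# Toy example: Q = 2 queries over a 2x2 patch grid (the model uses 16x16).
attn = [
    [0.0, 0.25, 0.25, 0.5],
    [0.25, 0.0, 0.25, 0.5],
]
for row in attention_heatmap(attn, grid=2):
    print([round(v, 2) for v in row])
```

For display, the grid would then be upsampled to the frame resolution and overlaid on the image.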

Collision event: the model tracks the anomalous vehicle throughout the clip

Below are visualizations grouped by gap severity. Even for the hardest categories, the model attends to the correct regions — the failure is in temporal dynamics, not spatial attention.