Chengqi Li
TU Eindhoven · Mobile Perception Systems Lab
Can we leverage Image-Pretrained Backbones for real-time Traffic Anomaly Detection (TAD)?
Frame-independent processing with Query Fusion (T=16 frames)
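A minimal sketch of this frame-independent pipeline, assuming the image backbone returns Q query tokens of dimension D per frame; `backbone` and `forward_clip` are illustrative names, not the released code.

```python
import torch

def forward_clip(backbone, frames):
    """Run the image backbone on every frame independently, then fuse its
    Q query tokens into one vector per frame (query fusion)."""
    B, T = frames.shape[:2]              # frames: (B, T, 3, H, W), here T = 16
    flat = frames.flatten(0, 1)          # (B*T, 3, H, W): frames never interact
    queries = backbone(flat)             # assumed output: (B*T, Q, D) query tokens
    fused = queries.max(dim=1).values    # max query fusion (ablated below)
    return fused.view(B, T, -1)          # (B, T, D) per-frame features
```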
Representation quality matters more than label supervision
| Pretraining | AUC-ROC | FPS |
|---|---|---|
| DINOv2 (self-supervised) | 76.5% | 190 |
| ImageNet-1K (supervised) | 72.6% | 202 |
| ImageNet-21K (supervised) | 72.0% | 197 |
Self-supervised DINOv2 outperforms supervised ImageNet-1K pretraining by +3.9 pp. More label diversity (21K vs. 1K classes) does not help.
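How the three initializations could be swapped in, as a hedged sketch using the public DINOv2 torch.hub entry point and timm weight tags; the exact checkpoints behind the table may differ.

```python
import timm
import torch

# Self-supervised DINOv2 ViT-S/14 via torch.hub (facebookresearch/dinov2).
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")

# Supervised ViT-S via timm; the weight tag selects the pretraining corpus,
# the architecture stays identical.
vit_in1k = timm.create_model("vit_small_patch16_224.augreg_in1k", pretrained=True)
vit_in21k = timm.create_model("vit_small_patch16_224.augreg_in21k", pretrained=True)
```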
| LLRD (layer-wise LR decay) | AUC-ROC |
|---|---|
| 0.6 | 74.9% |
| 0.8 | 73.3% |
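A minimal sketch of layer-wise learning-rate decay (LLRD) over a ViT's transformer blocks, assuming a timm-style `blocks` attribute; `base_lr` and the optimizer choice are placeholders, and the embedding/head parameter groups are omitted for brevity.

```python
import torch

def llrd_param_groups(vit, base_lr=1e-4, decay=0.6):
    """Scale each block's LR by decay**depth, so early (generic) layers
    move less than late (task-specific) ones during fine-tuning."""
    blocks = list(vit.blocks)                           # timm-style ViT block list
    n = len(blocks)
    return [{"params": block.parameters(),
             "lr": base_lr * decay ** (n - 1 - i)}      # last block keeps base_lr
            for i, block in enumerate(blocks)]

# e.g. optimizer = torch.optim.AdamW(llrd_param_groups(vit, decay=0.6), lr=1e-4)
```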
Aggregating Q query tokens into a single frame representation
| Method | AUC-ROC | FPS | Params |
|---|---|---|---|
| Max | 76.3% | 189 | 0 |
| Attention | 75.1% | 184 | 296K |
| Mean | 73.4% | 190 | 0 |
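The three variants as a sketch over a (B, Q, D) tensor of query tokens; the attention module shown is a generic scoring head, not necessarily the 296K-parameter layer reported above.

```python
import torch
import torch.nn as nn

def max_pool(q):                        # q: (B, Q, D) query tokens -> (B, D)
    return q.max(dim=1).values          # parameter-free, best in the table

def mean_pool(q):
    return q.mean(dim=1)                # parameter-free

class AttnPool(nn.Module):
    """Learned attention weights over the Q queries (adds parameters)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, q):                    # q: (B, Q, D)
        w = self.score(q).softmax(dim=1)     # (B, Q, 1) weights over queries
        return (w * q).sum(dim=1)            # (B, D)
```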
Aggregating per-frame features across time for clip-level prediction
| Propagator | AUC-ROC | FPS |
|---|---|---|
| GRU | 78.1% | 200 |
| Linear | 76.3% | 189 |
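A sketch of the two propagators over the (B, T, D) per-frame features; hidden sizes and class names are illustrative.

```python
import torch.nn as nn

class GRUPropagator(nn.Module):
    """Recurrent propagation: each frame's feature sees the frames before it."""
    def __init__(self, dim):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x):        # x: (B, T, D) per-frame features
        out, _ = self.gru(x)
        return out               # (B, T, D), temporally contextualized

class LinearPropagator(nn.Module):
    """Frame-wise projection: no information flows across time."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):        # x: (B, T, D)
        return self.proj(x)
```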
| Temporal pooling | AUC-ROC | FPS |
|---|---|---|
| Temp. Attn | 76.5% | 190 |
| Last Frame | 74.9% | 190 |
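And the two clip-level readouts applied after propagation, again as a sketch with illustrative names; a linear head on the resulting (B, D) vector would produce the anomaly score.

```python
import torch.nn as nn

class TemporalReadout(nn.Module):
    """Collapse propagated (B, T, D) features into one clip-level vector."""
    def __init__(self, dim, mode="attn"):
        super().__init__()
        self.mode = mode
        self.score = nn.Linear(dim, 1)       # used only by the attention readout

    def forward(self, h):                    # h: (B, T, D)
        if self.mode == "attn":              # temporal attention over all frames
            w = self.score(h).softmax(dim=1)
            return (w * h).sum(dim=1)        # (B, D)
        return h[:, -1]                      # last-frame readout, parameter-free
```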
Accuracy-Efficiency Trade-off on DoTA Dataset
| Model | Prop. | Pool | AUC-ROC | FPS |
|---|---|---|---|---|
| Simple-TAD (VideoMAE-S) | — | — | 85.2% | 76 |
| VidEoMT-TAD-S (Q=50) | GRU | LF+Max | 78.1% | 200 |
| VidEoMT-TAD-S (Q=1) | GRU | LF | 77.3% | 200 |
| VidEoMT-TAD-S (Q=50) | Linear | LF+Max | 76.3% | 189 |
| VidEoMT-TAD-S (Q=1) | Linear | LF | 74.9% | 190 |
Where does the accuracy gap come from?
Small gap categories (−4.5 to −8.9 pp) → spatial cues largely sufficient.
Medium gap categories (−9.0 to −11.2 pp) → increasing need for temporal context.
Large gap categories (−12.8 to −16.4 pp) → temporal dynamics essential.
Critical analysis and deployment guidelines
The ~7 pp accuracy gap vs. the VideoMAE-pretrained baseline is likely due to the lack of joint spatiotemporal features.
Frame-independent processing struggles to model complex motion dynamics (acceleration, trajectory changes) that joint spatiotemporal attention captures naturally.
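To make the contrast concrete, a shape-level sketch (with a generic `nn.MultiheadAttention` standing in for the real transformer blocks): the frame-independent model only mixes the N tokens within each frame, whereas a VideoMAE-style backbone attends over all T*N tokens at once.

```python
import torch
import torch.nn as nn

B, T, N, D = 1, 16, 196, 384                 # clip of 16 frames, 196 patch tokens each
tokens = torch.randn(B, T, N, D)
mha = nn.MultiheadAttention(D, num_heads=6, batch_first=True)

# Frame-independent (this work): attention never crosses frame boundaries,
# so motion cues must be recovered later by the temporal propagator.
x = tokens.flatten(0, 1)                     # (B*T, N, D)
per_frame, _ = mha(x, x, x)

# Joint spatiotemporal (VideoMAE-style): every token attends to every frame,
# capturing acceleration and trajectory changes directly.
y = tokens.flatten(1, 2)                     # (B, T*N, D)
joint, _ = mha(y, y, y)
```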