A Simple Framework for End-to-End Predictive Visual Tracking

ICCV 2023

1 Carnegie Mellon University, 2 National University of Singapore, 3 Tongji University
4 New York University, 5 Tsinghua University

PVT++ is motivated by the onboard latency of visual trackers. By virtue of end-to-end joint learning of motion and visual cues, PVT++ achieves online results that are comparable to the offline setting!


Visual object tracking is essential to intelligent robots. Most existing approaches have ignored the online latency that can cause severe performance degradation during real-world processing. Especially for unmanned aerial vehicles (UAVs), where robust tracking is more challenging and onboard computation is limited, the latency issue can be fatal.

In this work, we present a simple framework for end-to-end latency-aware tracking, end-to-end predictive visual tracking (PVT++). Unlike existing solutions that naively append Kalman filters after trackers, PVT++ can be jointly optimized, so that it leverages not only motion information but also the rich visual knowledge in most pre-trained tracker models for robust prediction.

Besides, to bridge the training-evaluation domain gap, we propose a relative motion factor, empowering PVT++ to generalize to the challenging and complex UAV tracking scenes. These careful designs have made the small-capacity lightweight PVT++ a widely effective solution.

Additionally, this work presents an extended latency-aware evaluation benchmark for assessing an any-speed tracker in the online setting. Empirical results on a robotic platform from the aerial perspective show that PVT++ can achieve significant performance gain on various trackers and exhibit higher accuracy than prior solutions, largely mitigating the degradation brought by latency.


  • We propose the first end-to-end general framework, PVT++, for latency-aware tracking, compensating for online latency by accurate motion prediction.
  • We propose "relative motion factor", bridging the training-evaluation domain gap.
  • We propose "auxiliary branch" and "joint training" techniques to effectively incorporate pre-existing visual features in prediction.
  • We propose "e-LAE" to evaluate an any-speed tracker in the online setting.
  • We conduct exhaustive experiments, using "e-LAE" to evaluate various SOTA trackers and validate PVT++ on several tracker models.

  • Method

    (a) Structure of PVT++, consisting of a tracker and a predictor. (b) Extended latency-aware benchmark.


    LAE sets two policies for online evaluation of trackers:

  • During inference, the tracker finds the latest frame to process when finishing the previous one.
  • During evaluation, the ground-truth is compared with the latest result from the tracker at the world time stamp.
  • However, this falls short when evaluating real-time trackers, since their latency is always 1 frame, i.e., ϕ(f) ≡ f − 1.
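The two policies above can be sketched as a small simulation. This is a hypothetical illustration, not the benchmark implementation: the frame rate, latency values, and function names are assumptions.

```python
# Sketch of the two LAE policies: (1) after finishing a frame, the tracker
# jumps to the latest available frame; (2) each world-time frame is matched
# with the latest finished result. Frame rate and latencies are assumptions.

def simulate_lae(num_frames, latency, frame_interval=1.0 / 30):
    """Return phi: for each world-time frame f, the index of the frame
    whose result is the latest one available at time f (None if no result)."""
    finish_times = []   # wall-clock time each processed frame finishes
    processed = []      # indices of frames actually processed
    t = 0.0
    next_frame = 0
    while next_frame < num_frames:
        processed.append(next_frame)
        # Processing starts when the frame arrives and the tracker is free.
        t = max(t, next_frame * frame_interval) + latency
        finish_times.append(t)
        # Policy 1: jump to the latest frame available at finish time.
        next_frame = max(next_frame + 1, int(t / frame_interval) + 1)

    phi = []
    for f in range(num_frames):
        # Policy 2: compare ground truth at frame f with the latest result.
        avail = [i for i, ft in zip(processed, finish_times)
                 if ft <= f * frame_interval]
        phi.append(avail[-1] if avail else None)
    return phi

# A real-time tracker (latency <= one frame interval) always yields
# phi(f) == f - 1, regardless of how fast it actually is.
print(simulate_lae(6, latency=0.033))  # → [None, 0, 1, 2, 3, 4]
```

Note how any tracker with latency under one frame interval produces the same ϕ(f) ≡ f − 1 mapping, which is exactly the limitation the extended benchmark addresses.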

    e-LAE sets a permitted latency threshold σ instead of strictly requiring the latest result. By sweeping thresholds σ ∈ [0, 1), we obtain a curve of online results, and e-LAE takes the area under this curve as the final comparison metric. Since different real-time trackers have distinct latency, the thresholds can distinguish between them.
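The scoring idea can be sketched as follows. The accuracy-vs-threshold functions below are illustrative assumptions, not measured results; only the sweep-and-integrate structure mirrors the description above.

```python
# Sketch of the e-LAE score: sweep the permitted latency threshold sigma
# over [0, 1), evaluate online accuracy at each sigma, and take the
# (normalized) area under the resulting curve as the final score.

def elae_score(accuracy_at_sigma, num=100):
    """Area under the accuracy-vs-sigma curve, normalized to [0, 1]."""
    sigmas = [i * 0.99 / (num - 1) for i in range(num)]
    accs = [accuracy_at_sigma(s) for s in sigmas]
    # Trapezoidal integration over the sigma range.
    area = sum((accs[i] + accs[i + 1]) / 2 * (sigmas[i + 1] - sigmas[i])
               for i in range(num - 1))
    return area / (sigmas[-1] - sigmas[0])

# Two real-time trackers that are indistinguishable under plain LAE
# now get distinct scores (illustrative accuracy models, not real data):
fast = lambda s: 0.6 + 0.05 * s   # low latency: barely helped by larger sigma
slow = lambda s: 0.4 + 0.25 * s   # near one-frame latency: gains as sigma grows
print(elae_score(fast) > elae_score(slow))  # → True
```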

    We have conducted exhaustive experiments on a robot platform using e-LAE:


    PVT++ consists of a tracker module and a predictor module. The critical designs are:

  • Relative motion factor: the training objective needs to be carefully designed to enable a generalizable framework.
  • Here our predictor outputs an "adjustment" on top of an "average-speed motion" assumption.

  • Lightweight structure: The predictor module is carefully designed as below to avoid extra latency:

  • Optimizing techniques: We develop joint training and auxiliary branch as techniques to utilize pre-existing tracker visual features.
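The relative motion factor can be sketched as below. This is a hypothetical illustration of the idea, not the paper's implementation: the function names, the two-point velocity estimate, and the hand-supplied factor are assumptions (in PVT++ the factor would be predicted by the network).

```python
# Sketch of the "relative motion factor": instead of regressing absolute
# pixel positions, the predictor outputs an adjustment on top of a
# constant-("average")-speed extrapolation, normalized by the target size
# so the training objective transfers across near/far targets and scenes.

def average_speed_extrapolation(centers, horizon):
    """Constant-velocity baseline: extrapolate `horizon` frames ahead
    from the mean per-frame displacement of recent centers."""
    vx = (centers[-1][0] - centers[0][0]) / (len(centers) - 1)
    vy = (centers[-1][1] - centers[0][1]) / (len(centers) - 1)
    cx, cy = centers[-1]
    return cx + horizon * vx, cy + horizon * vy

def apply_relative_motion_factor(centers, target_size, factor, horizon=1):
    """Combine the baseline with a scale-normalized adjustment.
    `factor` stands in for what a learned predictor would output."""
    bx, by = average_speed_extrapolation(centers, horizon)
    w, h = target_size
    # The adjustment is expressed in units of the object size, so its
    # magnitude is comparable regardless of target scale in the image.
    return bx + factor[0] * w, by + factor[1] * h

# Target moving ~10 px/frame in x; a zero factor reduces to the baseline.
centers = [(100, 50), (110, 50), (120, 50)]
print(apply_relative_motion_factor(centers, (40, 20), (0.0, 0.0)))  # → (130.0, 50.0)
```

Normalizing by the target size is what lets a small predictor trained on one domain generalize to the complex UAV scenes mentioned above.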

  • Qualitative Results


    Effect on Traditional Base Trackers

    The base trackers tend to lose the target object due to latency. Powered by PVT++, the predictive trackers are more accurate.



    Comparison with KF

    Kalman filters fall short under in-plane rotation and viewpoint change, causing the predictive trackers to fail, while our PVT++ is more robust under these challenges thanks to the incorporation of visual cues.


    @inproceedings{Li2023PVT,
          author={Li, Bowen and Huang, Ziyuan and Ye, Junjie and Li, Yiming and Scherer, Sebastian and Zhao, Hang and Fu, Changhong},
          booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
          title={{PVT++: A Simple End-to-End Latency-Aware Visual Tracking Framework}},
          year={2023}
    }


    The work was done when Bowen Li and Ziyuan Huang were visiting students in the MARS Lab at Shanghai Qizhi Institute. The authors would like to express gratitude to the developers and authors of PySot and "Towards Streaming Perception".