Tracking UAV in infrared videos using Siamese networks

Siamese network-based trackers have achieved excellent performance on visual object
tracking. Several Siamese trackers achieve state-of-the-art results on long-term visual
tracking benchmarks, confirming their effectiveness and efficiency. In this work, we study
state-of-the-art Siamese networks and then propose a model based on the Siamese
architecture for tracking UAVs in the Anti-UAV Challenge dataset, which includes 100
infrared videos. The network architecture uses a pre-trained ResNet50, depth-wise
cross-correlation, and focal loss. Experiments on the Anti-UAV infrared dataset show that
the model is robust to the varied challenges of real infrared scenes while remaining
highly efficient.

456 H . D. Thang, , N. C. Thanh, “Tracking UAV in infrared videos using Siamese networks.”
Research in science and technology
In this section, we describe the proposed network, which uses a more advanced ConvNet
to learn an effective model that enhances tracking robustness and accuracy. As shown
in figure 1, the network consists of a Siamese network backbone and multiple RPN heads.
The Siamese network backbone, an off-the-shelf convolutional network, computes the
convolutional feature maps of the template patch and the search patch. Each RPN head
includes a classification module and a regression module.
3.1. Network Backbone
In our tracking method, we adopt ResNet-50 [21] as the backbone network, modifying the
strides and adding dilated convolutions in the conv4 and conv5 blocks, as detailed in
table 1. The feature maps output by conv3, conv4, and conv5 are fed into three RPN head
modules individually.
Table 1. ResNet50 backbone.
                        Bottleneck in conv4            Bottleneck in conv5
                        conv1x1  conv3x3  conv1x1      conv1x1  conv3x3  conv1x1
original    stride      1        2        1            1        2        1
ResNet-50   padding     0        1        0            0        1        0
            dilation    1        1        1            1        1        1
modified    stride      1        1        1            1        1        1
ResNet-50   padding     0        2        0            0        4        0
            dilation    1        2        1            1        4        1
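To see what the modification in table 1 does to the feature-map resolution, the following minimal sketch applies the standard convolution output-size formula to the 3x3 convolutions of conv4 and conv5. The input size of 31 is a hypothetical value chosen only for illustration; it is not taken from the paper.

```python
def conv_out_size(n, kernel, stride, padding, dilation):
    """Spatial output size of a convolution (standard formula)."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1


# (stride, padding, dilation) of the 3x3 convolutions, as listed in table 1.
ORIGINAL = {"conv4": (2, 1, 1), "conv5": (2, 1, 1)}
MODIFIED = {"conv4": (1, 2, 2), "conv5": (1, 4, 4)}


def trace(n, settings):
    """Feature-map size after the conv4 and then the conv5 3x3 convolution."""
    sizes = []
    for block in ("conv4", "conv5"):
        stride, padding, dilation = settings[block]
        n = conv_out_size(n, 3, stride, padding, dilation)
        sizes.append(n)
    return sizes


if __name__ == "__main__":
    n = 31  # hypothetical conv3 output size, for illustration only
    print("original:", trace(n, ORIGINAL))
    print("modified:", trace(n, MODIFIED))
```

With stride 1 and padding equal to the dilation, the 3x3 convolution keeps the spatial size unchanged, so conv3, conv4, and conv5 produce feature maps of the same resolution while the dilation preserves a large receptive field.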
3.2. RPN Head
The RPN head (figure 1, right) consists of a classification module and a regression
module. Both modules receive features from the template branch and the search branch.
The features from the two branches are first adjusted to the same number of channels.
The two resulting feature maps are then combined by depth-wise cross-correlation,
computed channel by channel.
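The channel-by-channel correlation described above can be sketched in plain Python as follows. This is an illustrative sketch on nested lists, not the authors' PyTorch implementation; the function names and the toy feature maps are assumptions made for the example.

```python
def xcorr_single(search, kernel):
    """Valid 2-D cross-correlation of one search channel with one template channel."""
    sh, sw = len(search), len(search[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(sh - kh + 1):
        row = []
        for j in range(sw - kw + 1):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    acc += search[i + u][j + v] * kernel[u][v]
            row.append(acc)
        out.append(row)
    return out


def depthwise_xcorr(search_feat, template_feat):
    """Correlate channel c of the search features with channel c of the template.

    Both inputs are lists of channels (each a 2-D list) and must have the same
    number of channels; the output has that same number of channels.
    """
    if len(search_feat) != len(template_feat):
        raise ValueError("search and template must have the same channel count")
    return [xcorr_single(s, k) for s, k in zip(search_feat, template_feat)]


if __name__ == "__main__":
    # Toy 2-channel features: 3x3 search maps, 2x2 template maps.
    search = [
        [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
        [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    ]
    template = [
        [[1, 0], [0, 1]],
        [[1, 1], [1, 1]],
    ]
    print(depthwise_xcorr(search, template))
```

Because each template channel is used as the kernel for only the matching search channel, the output keeps the channel count of the inputs, and no channels are mixed at this stage.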
Figure 1. Illustration of the proposed framework. The left sub-figure shows its main
structure, where c3, c4, and c5 denote the feature maps of the backbone network. The
right sub-figure shows each RPN head, where DW-Corr means the depth-wise
cross-correlation operation and Reg/Cls Map denote the output feature maps of the RPN head.
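With k anchors, the RPN head needs to output 2k channels for classification and 4k
channels for regression. The correlation is computed on both the classification branch and
the regression branch:
$$A^{cls}_{w \times h \times 2k} = [\varphi(x)]_{cls} \star [\varphi(z)]_{cls}$$
$$A^{reg}_{w \times h \times 4k} = [\varphi(x)]_{reg} \star [\varphi(z)]_{reg} \qquad (3)$$
where $x$ and $z$ are the search patch and the template patch, $\star$ denotes the
convolution operation with $[\varphi(z)]_{cls}$ or $[\varphi(z)]_{reg}$ as the convolution
kernel, $A^{cls}_{w \times h \times 2k}$ denotes the classification map, and
$A^{reg}_{w \times h \times 4k}$ indicates the regression map.
Journal of Military Science and Technology Research, Special Issue of the FEE National Conference, 10 - 2020 457
Mathematics - Information Technology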
3.3. Classification Loss and Regression Loss
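With $y \in \{\pm 1\}$ specifying the ground-truth class and $p \in [0, 1]$ the model's
estimated probability for the class with label $y = 1$, define $p_t$:
$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases} \qquad (4)$$
The loss for classification is the focal loss [10], where $\gamma \ge 0$ is the focusing parameter:
$$L_{cls} = -(1 - p_t)^{\gamma} \log(p_t) \qquad (5)$$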
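As an illustration of how the focal loss [10] down-weights well-classified samples, here is a minimal per-sample sketch. The default gamma = 2 is the value recommended in [10], not a setting reported in this paper, and the alpha-balancing term that [10] also discusses is left out.

```python
import math


def focal_loss(p, y, gamma=2.0):
    """Focal loss for one binary prediction.

    p     : estimated probability of the positive class (label y = 1)
    y     : ground-truth label, +1 or -1
    gamma : focusing parameter (assumed default; gamma = 0 gives cross-entropy)
    """
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


if __name__ == "__main__":
    for p in (0.6, 0.9, 0.99):
        ce = focal_loss(p, 1, gamma=0.0)
        fl = focal_loss(p, 1)
        print(f"p={p}: cross-entropy={ce:.5f}, focal={fl:.5f}")
```

The factor (1 - p_t)^gamma shrinks the loss of easy samples much faster than that of hard ones, which is useful when most anchors cover background.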
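Smooth $L_1$ loss with parameter $\sigma$, applied to normalized coordinates, is used for regression:
$$smooth_{L_1}(x, \sigma) = \begin{cases} 0.5\,\sigma^2 x^2 & \text{if } |x| < 1/\sigma^2 \\ |x| - 1/(2\sigma^2) & \text{otherwise} \end{cases} \qquad (6)$$
Let $A_x, A_y$ and $A_w, A_h$ denote the center point and shape of the anchor boxes. Let
$T_x, T_y$ and $T_w, T_h$ denote those of the ground truth boxes. The normalized distance [2] is:
$$\delta[0] = \frac{T_x - A_x}{A_w}, \quad \delta[1] = \frac{T_y - A_y}{A_h}, \quad \delta[2] = \ln\frac{T_w}{A_w}, \quad \delta[3] = \ln\frac{T_h}{A_h} \qquad (7)$$
The loss for regression is:
$$L_{reg} = \sum_{i=0}^{3} smooth_{L_1}(\delta[i], \sigma) \qquad (8)$$
The overall loss of the network is a combination of the classification loss and the
regression loss:
$$L = \lambda_1 L_{cls} + \lambda_2 L_{reg} \qquad (9)$$
where $\lambda_1$ and $\lambda_2$ are hyper-parameters that balance the two parts.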
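The regression target and the combined loss can be sketched as below, assuming the normalized distance and the smooth L1 form used in SiamRPN [2]. The values sigma = 1 and lambda1 = lambda2 = 1 are placeholders for illustration; the paper does not report them.

```python
import math


def normalized_distance(anchor, target):
    """Normalized offsets between an anchor and a ground-truth box.

    Both boxes are (cx, cy, w, h): center point and shape.
    """
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = target
    return [(tx - ax) / aw, (ty - ay) / ah, math.log(tw / aw), math.log(th / ah)]


def smooth_l1(x, sigma=1.0):
    """Smooth L1: quadratic near zero, linear elsewhere (sigma is assumed)."""
    s2 = sigma * sigma
    if abs(x) < 1.0 / s2:
        return 0.5 * s2 * x * x
    return abs(x) - 0.5 / s2


def regression_loss(anchor, target, sigma=1.0):
    """Sum of smooth L1 over the four normalized offsets."""
    return sum(smooth_l1(d, sigma) for d in normalized_distance(anchor, target))


def total_loss(l_cls, l_reg, lambda1=1.0, lambda2=1.0):
    """Weighted sum of the two losses (the weights here are placeholders)."""
    return lambda1 * l_cls + lambda2 * l_reg


if __name__ == "__main__":
    anchor = (10.0, 10.0, 20.0, 40.0)
    target = (15.0, 30.0, 20.0, 40.0)
    l_reg = regression_loss(anchor, target)
    print("offsets:", normalized_distance(anchor, target))
    print("L_reg:", l_reg, "L:", total_loss(0.2, l_reg))
```

Dividing the center offsets by the anchor size and taking the logarithm of the size ratios makes the targets independent of the absolute scale of the object.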
3.4. Training and Inference
Training. Our entire network can be trained end-to-end on large-scale datasets. The
backbone network ResNet50 [21] is pre-trained on ImageNet [22] for image labeling. We
train the network with image pairs sampled from videos or still images so that it learns a
generic notion of how to measure the similarity between general objects for visual
tracking. The training sets include ImageNet VID [22], ImageNet DET [22], COCO [23], and
GOT-10k [24]. The template patch is 127×127 pixels and the search patch is 255×255
pixels. We use k=5 anchors with the hyper-parameters stride=8, scales=8, and
ratios=[0.33,0.5,1,2,3].
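The anchor settings above can be turned into concrete anchor shapes as in the following sketch. It assumes that each ratio is height divided by width and that all anchors share the area (stride × scale)²; both conventions are assumptions borrowed from common Siamese RPN implementations, not details stated in the paper.

```python
import math


def generate_anchors(stride=8, scale=8, ratios=(0.33, 0.5, 1, 2, 3)):
    """Return one (width, height) pair per aspect ratio.

    Assumes ratio = height / width and a constant area of (stride * scale)**2.
    """
    area = (stride * scale) ** 2
    anchors = []
    for ratio in ratios:
        width = math.sqrt(area / ratio)
        height = width * ratio
        anchors.append((width, height))
    return anchors


if __name__ == "__main__":
    anchors = generate_anchors()
    k = len(anchors)
    for (w, h) in anchors:
        print(f"w={w:.1f}, h={h:.1f}")
    # With k anchors, the heads output 2k and 4k channels respectively.
    print("classification channels:", 2 * k, "regression channels:", 4 * k)
```

With five ratios, k = 5, so each RPN head outputs 10 classification channels and 20 regression channels per spatial position.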
Data augmentation. We use data augmentation techniques such as flipping, shifting,
scaling, blurring, and grayscale conversion to increase the variety of samples fed to the
network.
Inference. During inference, we crop the template patch from the first frame and feed
it to the feature extraction network. For each subsequent frame, we crop the search patch
based on the target position in the previous frame, extract its features, and then perform
prediction in the search region to obtain the overall classification map and regression
map. We then obtain prediction boxes using the strategy described in [2]. After the
prediction boxes are generated, we apply a cosine window and a scale change penalty to
smooth target movements and size changes. The prediction box with the best score is then
selected, and its size is updated by linear interpolation with the state in the previous
frame. Like other long-term tracking datasets, Anti-UAV poses the challenges of severe
out-of-view, full occlusion, and fast motion. In failure cases, we gradually enlarge the
search region with a local-to-global strategy [5]. Specifically, the size of the search
region grows iteratively by a constant step while tracking failure is indicated.
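The local-to-global strategy can be sketched as a small state update. The default sizes (255 and 832) and thresholds (0.825 and 0.996) are the values given in Section 4.1. The growth step of 64 and the rule for comparing the best score against the thresholds are assumptions made for illustration.

```python
def next_search_size(size, in_failure, best_score,
                     enter_thr=0.825, leave_thr=0.996,
                     local_size=255, global_size=832, step=64):
    """Update the search-region size from the best score of the current frame.

    Assumed rule: a score below enter_thr starts the failure state and a score
    above leave_thr ends it. While in failure, the region grows by `step`
    (a hypothetical value) up to global_size; otherwise it is reset.
    """
    if not in_failure and best_score < enter_thr:
        in_failure = True
    elif in_failure and best_score > leave_thr:
        in_failure = False

    if in_failure:
        size = min(size + step, global_size)
    else:
        size = local_size
    return size, in_failure


def simulate(scores):
    """Search-region size used after each frame of a score sequence."""
    size, in_failure = 255, False
    sizes = []
    for score in scores:
        size, in_failure = next_search_size(size, in_failure, score)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    print(simulate([0.99, 0.5, 0.6, 0.9, 0.999]))
```

Using two different thresholds gives hysteresis: a moderately confident score (0.9 in the example) does not end the failure state, so the tracker keeps widening the search until it is highly confident that the target has been found again.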
4. EXPERIMENTS
We conduct experiments on the Anti-UAV infrared dataset to evaluate the performance of
the proposed tracking approach.
4.1. Implementation Details
The network backbone is pre-trained on the ImageNet-1k classification task. The
network is trained with stochastic gradient descent (SGD). We use synchronized SGD
over 4 GPUs with a total of 64 pairs per minibatch (16 pairs per GPU), and training takes
48 hours to converge. We train for a total of 20 epochs, with a warmup learning rate
increasing from 0.001 to 0.005 over the first 5 epochs and a learning rate decayed
exponentially from 0.005 to 0.00005 over the last 15 epochs. We use a weight decay of
0.0005 and a momentum of 0.9. The training loss is the sum of the classification loss
(focal loss) and the standard smooth L1 loss for regression. During the inference phase,
the sizes of the search region in the short-term phase and in defined failure cases are
set to 255 and 832, respectively. The thresholds to enter and leave failure cases are set
to 0.825 and 0.996, respectively. The code is implemented in Python using PyTorch and is
based on PySOT.
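The learning-rate schedule described above can be written as a function of the epoch index. The sketch assumes a linear warmup and a decay that is linear in log space, with both end points reached exactly. The paper gives only the start and end values, so the shape of each phase is an assumption.

```python
def learning_rate(epoch, warmup_epochs=5, total_epochs=20,
                  lr_start=0.001, lr_peak=0.005, lr_end=0.00005):
    """Learning rate for a zero-based epoch index.

    Epochs 0..4  : linear warmup from lr_start to lr_peak (assumed linear).
    Epochs 5..19 : exponential decay from lr_peak to lr_end.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch out of range")
    if epoch < warmup_epochs:
        return lr_start + (lr_peak - lr_start) * epoch / (warmup_epochs - 1)
    decay_epochs = total_epochs - warmup_epochs
    fraction = (epoch - warmup_epochs) / (decay_epochs - 1)
    return lr_peak * (lr_end / lr_peak) ** fraction


if __name__ == "__main__":
    for epoch in range(20):
        print(epoch, f"{learning_rate(epoch):.6f}")
```

In the decay phase each epoch multiplies the rate by the same constant factor, which is what makes the decay exponential.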
4.2. Evaluation
The Anti-UAV workshop (https://anti-uav.github.io) presents a benchmark dataset and
evaluation methodology for detecting and tracking UAVs. The test-dev dataset consists of
100 high-quality infrared video sequences, spanning multiple occurrences of multi-scale
UAVs. This dataset poses several challenges: varying sizes, varying ratios, motion blur,
fast motion, and indistinguishable backgrounds (figure 2).
Figure 2. Anti-UAV Dataset (https://anti-uav.github.io/dataset/).
We use this data to evaluate our algorithm. The average tracking accuracy score (acc) is
used for evaluation. It is defined as:
$$acc = \frac{1}{T} \sum_{t=1}^{T} \left[ IoU_t \, \delta(v_t > 0) + p_t \, (1 - \delta(v_t > 0)) \right] \qquad (10)$$
At frame $t$, $IoU_t$ is the IoU between the corresponding ground truth and tracking
boxes, and $v_t$ is the visibility flag of the ground truth. If the target exists in the
current frame, $\delta(v_t > 0) = 1$. When the target does not exist in the frame and
$\delta(v_t > 0) = 0$, $p_t = 1$ if the tracker's prediction is empty and $p_t = 0$
otherwise. The accuracy is averaged over all $T$ frames. Our acc score is the average of
the results on the 100 IR videos. Table 2 shows that our approach achieves the highest
average accuracy among the compared trackers, exceeding SiamRPN++ [6]. The results also
show that the released models of SiamMask [7] and SiamBAN [8] are not suitable for
long-term tracking datasets such as Anti-UAV. The SiamFC [1] code is the PyTorch version,
and the model is provided by the Anti-UAV organizer. For SiamMask [7] and SiamRPN++ [6]
we use the code and models released at https://github.com/STVIR/pysot. For SiamBAN [8] we
use the code and models released at https://github.com/hqucv/siamban. The results of
ATOM [16], DiMP [25], and SiamDW-LT [26] are taken from [20].
Table 2. Benchmark results.
Tracker name       acc
SiamBAN [8]        0.390
SiamMask [7]       0.403
SiamFC [1]         0.420
ATOM [16]          0.5322
DiMP [25]          0.5507
SiamDW-LT [26]     0.6379
SiamRPN++ [6]      0.648
Ours               0.654
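For reference, the acc score reported in table 2 can be computed for a single sequence as in the sketch below. It assumes boxes given as (x, y, w, h) and uses None both for a frame where the target is absent and for an empty prediction. These conventions are chosen for the example and are not the official evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy(ground_truth, predictions):
    """Average per-frame score over one sequence.

    ground_truth[t] is a box, or None when the target is not in the frame.
    predictions[t] is a box, or None when the tracker reports no target.
    """
    total = 0.0
    for gt, pred in zip(ground_truth, predictions):
        if gt is not None:
            # Target visible: score is the IoU (0 if nothing was predicted).
            total += iou(gt, pred) if pred is not None else 0.0
        else:
            # Target absent: score is 1 only for an empty prediction.
            total += 1.0 if pred is None else 0.0
    return total / len(ground_truth)


if __name__ == "__main__":
    gt = [(0, 0, 10, 10), (0, 0, 10, 10), None, None]
    pred = [(0, 0, 10, 10), (5, 0, 10, 10), None, (1, 1, 2, 2)]
    print(f"acc = {accuracy(gt, pred):.4f}")
```

The metric rewards a tracker for staying silent when the UAV leaves the view, which is why handling failure cases matters for the final score.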
4.3. Visual Results
To illustrate the performance of our tracker, we provide some representative results on
frames from the Anti-UAV dataset. As shown in Fig. 3, each row represents a video
sequence. The green box denotes our result and the yellow box denotes the ground truth.
[Figure 3 image: five rows of infrared frames, one row per video sequence. Legible sequence labels: 20190925_131530_1_5, 20190926_133516_1_7, 20190926_183400_1_3, 20190926_193515_1_8.]
Figure 3. Visual results of our tracker on 5 videos, all the data from Anti-UAV infrared
dataset. The yellow box denotes ground truth, the green box denotes ours.
5. CONCLUSION
In this paper, we propose a Siamese network-based tracker for UAVs in infrared video,
which consists of a ResNet50 backbone and three RPN heads for classification and
regression. Experiments on the Anti-UAV dataset show that the proposed infrared tracking
algorithm is robust to the challenges of real infrared scenes while remaining highly
efficient. Our approach attains an acc of 0.654. Compared with other methods such as
ATMF, the model still needs improvement. In future work, we plan to add strategies for
failure cases, such as random search or a classifier.
REFERENCES
[1]. L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. “Fully-
convolutional siamese networks for object tracking”. In ECCV Workshops, 2016.
[2]. B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. “High performance visual tracking with
siamese region proposal network”. In CVPR, 2018.
[3]. Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu. “Distractor-aware siamese
networks for visual object tracking”. In ECCV, 2018.
[4]. H. Fan and H. Ling. “Siamese cascaded region proposal networks for real-time visual
tracking”. In CVPR, 2019.
[5]. Z. Zhang, H. Peng, and Q. Wang. “Deeper and wider siamese networks for real-time
visual tracking”. In CVPR, 2019.
[6]. Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan.
“SiamRPN++: Evolution of siamese visual tracking with very deep networks”. In
CVPR, 2019.
[7]. Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip HS Torr. “Fast
online object tracking and segmentation: A unifying approach”. In CVPR, 2019.
[8]. Zedu Chen, Bineng Zhong, Guorong Li, Shengping Zhang, and Rongrong Ji.
“SiamBAN: Siamese Box Adaptive Network for Visual Tracking”. In
CVPR, 2020.
[9]. S. Ren, K. He, R. Girshick, and J. Sun. “Faster R-CNN: Towards real-time object
detection with region proposal networks”. In NIPS, 2015.
[10]. Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollar. “Focal Loss for
Dense Object Detection”. In ICCV, 2017.
[11]. H. Nam and B. Han. “Learning multi-domain convolutional neural networks for
visual tracking”. In CVPR, 2016.
[12]. K. Dai, D. Wang, H. Lu, C. Sun, and J. Li. “Visual tracking via adaptive spatially-
regularized correlation filters”. In CVPR, 2019.
[13]. A. S. Tripathi, M. Danelljan, L. Van Gool, and R. Timofte. “Tracking the known and
the unknown by leveraging semantic information”. In BMVC, 2019.
[14]. M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. “Beyond correlation filters:
Learning continuous convolution operators for visual tracking”. In ECCV, 2016.
[15]. M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. “ECO: Efficient convolution
operators for tracking”. In CVPR, 2017.
[16]. M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. “ATOM: accurate tracking by
overlap maximization”. In CVPR, 2019.
[17]. L. Huang, X. Zhao, and K. Huang. “Bridging the gap between detection and tracking:
A unified approach”. In ICCV, 2019.
[18]. Aybora Koksal, Kutalmis Gokalp Ince, A. Aydin Alatan. “Effect of Annotation Errors
on Drone Detection with YOLOv3”. In CVPR, 2020.
[19]. Joseph Redmon and Ali Farhadi. “Yolov3: An incremental improvement”. CoRR,
abs/1804.02767, 2018.
[20]. Chunhui Zhang, Haolin Liu, Tianyang Xu, Yong Wang, Shiming Ge. “ATMF:
Accurate Tracking by Multi-Modal Fusion”. In CVPR, 2020.
[21]. K. He, X. Zhang, S. Ren, and J. Sun. “Deep residual learning for image recognition”.
In CVPR, 2016.
[22]. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A.
Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. “ImageNet Large
Scale Visual Recognition Challenge”. IJCV, 2015.
[23]. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva
Ramanan, Piotr Dollar, and C Lawrence Zitnick. “Microsoft COCO: Common objects
in context”. In ECCV, 2014.
[24]. Lianghua Huang, Xin Zhao, and Kaiqi Huang. “GOT-10k: A large high-diversity
benchmark for generic object tracking in the wild”. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2019.
[25]. Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. “Learning
discriminative model prediction for tracking”. In ICCV, 2019.
[26]. Zhipeng Zhang, Houwen Peng. “Deeper and Wider Siamese Networks for Real-Time
Visual Tracking”. In CVPR, 2019.
ABSTRACT
TRACKING UAV MOTION IN INFRARED VIDEO
USING SIAMESE NETWORKS
Siamese network-based trackers have achieved high performance in visual object
tracking. Several Siamese networks evaluated on large object tracking datasets achieve
state-of-the-art performance, confirming their effectiveness and efficiency. In this
paper, we study state-of-the-art Siamese networks and then propose a model based on the
Siamese architecture to track UAVs in the Anti-UAV Challenge dataset, which comprises
100 infrared videos. The network architecture uses a pre-trained network (ResNet50),
depth-wise cross-correlation, and focal loss for classifying UAV versus background.
Experiments on the Anti-UAV infrared dataset show that the model is robust to the
various challenges of real infrared scenes, with high efficiency.
Keywords: UAV tracking; Siamese network; Object tracking; Deep learning.
Received 3rd August 2020
Revised 5th October 2020
Published 5th October 2020
Author affiliations:
1 Military Information Technology Institute, Academy of Military Science and Technology;
2 University of Engineering and Technology, Vietnam National University.
*Email: hoangdinhthang@gmail.com.