Wednesday, August 9, 2023

Lightweight aerial image object detection algorithm based on improved YOLOv5s


 


Abstract

YOLOv5 is one of the most popular object detection algorithms; it is divided into multiple series according to the depth and width of the network. To enable deployment on mobile or embedded devices, this paper proposes a lightweight aerial image object detection algorithm (LAI-YOLOv5s) based on an improved YOLOv5s, with relatively little computation, relatively few parameters, and relatively fast inference. First, to better detect small objects, the paper replaces the smallest-size detection head with the largest-size detection head and proposes a new feature fusion method, DFM-CPFN (Deep Feature Map Cross Path Fusion Network), to enrich the semantic information of deep features. Second, the paper designs a new module based on VoVNet to improve the feature extraction ability of the backbone network. Finally, based on the idea of ShuffleNetV2, the paper makes the network more lightweight without degrading detection accuracy. On the VisDrone2019 dataset, LAI-YOLOv5s improves the mAP@0.5 metric by 8.3% over the original algorithm. Compared with the other YOLOv5 series models and YOLOv3, LAI-YOLOv5s offers lower computational cost and higher detection accuracy.

Introduction

With the continuous application of UAVs in modern life, aerial photography technology has been widely used in civil and military fields. Object detection in aerial images is an important part of intelligent transportation systems: locating and tracking ground vehicles from aerial imagery conveys ground traffic information more clearly and helps build a mature intelligent transportation system. Because aerial images are large while objects such as vehicles in them are small and dense, detection accuracy on this task is low [1]. Traditional vehicle detection methods for aerial images usually adopt a sliding-window approach, and during feature extraction the fixed-size windows and hand-crafted features often limit detection accuracy [2]. In addition, compared with common object detection, the highly complex background and variable object appearance further increase the difficulty of object detection in aerial images [3].

Deep learning with nonlinear models has been widely used in object detection; it transforms input data into progressively more abstract features, automatically discovers the features needed for classification or detection tasks, and has powerful representation and learning capabilities. In 2015, He et al. [4] proposed the ResNet residual network, whose cross-layer connections improved performance and allowed the network to grow deeper without increasing error. In 2014, the R-CNN algorithm proposed by Girshick et al. [5] used a proposal-extraction method that segments the input image into multiple regions and merges them according to similarity, yielding about 2000 candidate regions of different sizes. This is a two-stage object detection method, with slower detection speed and poorer real-time performance. Single-stage object detection methods were therefore proposed, which obtain the final output directly from the original image. The YOLOv1 object detection algorithm proposed by Joseph Redmon et al. [6] in 2015 treated object detection as a regression problem and removed the candidate-box extraction branch; its detection speed was far faster than that of two-stage detectors. YOLOv3 [7], proposed by Joseph Redmon et al. in 2018, used the more effective Darknet-53 as the backbone network, adopted multi-scale fusion prediction based on FPN [8], and used feature maps of three sizes to detect objects. Bochkovskiy et al. [9] proposed YOLOv4 in 2020, which applied image data augmentation at the input, performed multi-path feature fusion based on PANet [10], and adopted CIoU as the bounding-box regression loss, greatly improving both detection speed and detection accuracy.

The work in this paper includes three parts:

1. To detect the small and dense objects that dominate these detection tasks, the paper proposes a new feature fusion network, DFM-CPFN (Deep Feature Map Cross Path Fusion Network). The medium-size detection head of the original algorithm is replaced, after two upsampling operations, by the largest-size detection head, and the upsampled features are fused with backbone features at each step, which enriches the location information of the deep features (a sketch of this idea follows the list).

2. To mitigate the vanishing gradients caused by deepening the network, the paper designs a VB module based on VoVNet [20] to improve the backbone network. While the residual structure is retained, the outputs of multiple convolutional layers are concatenated at the end, which better preserves the flow of features and gradients and avoids feature redundancy.

3. Because of their high computational cost, object detection algorithms are difficult to deploy on mobile devices with limited performance. To address this, the paper designs the C3SFN module based on ShuffleNetV2 [21], which makes the improved model more lightweight and effectively reduces its computational cost.
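The cross-path fusion idea in the first contribution can be illustrated with a short PyTorch sketch: the deepest feature map is upsampled twice, and at each step it is concatenated with a backbone feature map of matching resolution before feeding the largest-size detection head. The module names (CrossPathFusion, ConvBNAct), channel widths, and exact fusion wiring below are assumptions made for illustration; they are not taken from the paper's implementation of DFM-CPFN.

```python
# Hypothetical sketch of the cross-path fusion idea: upsample the deepest
# feature map twice, concatenating a backbone feature of matching resolution
# each time, so the large-size head receives deep semantics plus fine detail.
import torch
import torch.nn as nn


class ConvBNAct(nn.Module):
    """Standard conv -> batch norm -> SiLU block (YOLOv5-style)."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class CrossPathFusion(nn.Module):
    """Illustrative DFM-CPFN-like head; channel widths are assumptions."""
    def __init__(self, c_deep=512, c_mid=256, c_shallow=128):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.reduce1 = ConvBNAct(c_deep, c_mid)                    # before first upsample
        self.fuse1 = ConvBNAct(c_mid + c_mid, c_mid)               # concat with mid backbone map
        self.reduce2 = ConvBNAct(c_mid, c_shallow)                 # before second upsample
        self.fuse2 = ConvBNAct(c_shallow + c_shallow, c_shallow)   # concat with shallow map

    def forward(self, p5, c4, c3):
        # p5: deepest feature; c4 / c3: backbone features at 2x and 4x its resolution
        x = self.up(self.reduce1(p5))
        x = self.fuse1(torch.cat([x, c4], dim=1))
        x = self.up(self.reduce2(x))
        return self.fuse2(torch.cat([x, c3], dim=1))               # feeds the large-size head


if __name__ == "__main__":
    p5 = torch.randn(1, 512, 20, 20)
    c4 = torch.randn(1, 256, 40, 40)
    c3 = torch.randn(1, 128, 80, 80)
    print(CrossPathFusion()(p5, c4, c3).shape)  # torch.Size([1, 128, 80, 80])
```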

YOLOv5s methods

Figure 1 shows the network structure diagram of YOLOv5s. YOLOv5 is divided into multiple series such as s, m, and l by controlling the depth and width of the network; the series differ only in their scaling multiples. Series with deeper or wider networks achieve relatively good detection results but at relatively high computational cost, while series with shallower networks have significantly lower computational cost and faster detection speed but relatively poor detection results. Based on the idea of CSPNet [22], the C3 module in the YOLOv5s backbone divides the feature map into two paths and merges them through a cross-stage hierarchical structure; this architecture achieves richer gradient combinations while reducing the amount of calculation. The SPPF module, with its spatial pyramid pooling structure, borrows the idea of SPPNet [23]. The neck network of YOLOv5 performs multi-scale feature fusion based on PANet. Compared with FPN, PANet adds a bottom-up feature fusion path, and its output head adds a fully connected branch to improve the quality of the prediction mask.
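A simplified sketch of the two YOLOv5s blocks mentioned above may help: the C3 module splits the incoming feature map into two 1x1-convolution paths, runs bottlenecks on one path, and merges the paths at the end (the CSPNet idea), while SPPF chains three 5x5 max-poolings and concatenates all stages. Channel widths and bottleneck counts here are illustrative rather than the exact YOLOv5s configuration.

```python
# Simplified C3 (cross-stage partial) and SPPF blocks in PyTorch.
import torch
import torch.nn as nn


class Conv(nn.Module):
    """Conv -> batch norm -> SiLU."""
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class Bottleneck(nn.Module):
    """1x1 then 3x3 convolution with an optional residual connection."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = Conv(c, c, 1)
        self.cv2 = Conv(c, c, 3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y


class C3(nn.Module):
    """Two paths, one with n bottlenecks, merged at the end (CSPNet idea)."""
    def __init__(self, c1, c2, n=1):
        super().__init__()
        c_ = c2 // 2
        self.cv1 = Conv(c1, c_, 1)                      # path with bottlenecks
        self.cv2 = Conv(c1, c_, 1)                      # identity-like path
        self.m = nn.Sequential(*(Bottleneck(c_) for _ in range(n)))
        self.cv3 = Conv(2 * c_, c2, 1)                  # merge both paths

    def forward(self, x):
        return self.cv3(torch.cat([self.m(self.cv1(x)), self.cv2(x)], dim=1))


class SPPF(nn.Module):
    """Spatial pyramid pooling (fast): three chained 5x5 max-pools, all stages concatenated."""
    def __init__(self, c1, c2):
        super().__init__()
        c_ = c1 // 2
        self.cv1 = Conv(c1, c_, 1)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.cv2 = Conv(4 * c_, c2, 1)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```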

LAI-YOLOv5s

First, a new feature fusion network, DFM-CPFN (Deep Feature Map Cross Path Fusion Network), is proposed, which effectively alleviates the loss of small-object information in deep features. In addition, based on VoVNet and ShuffleNetV2, the paper designs two new modules, VB and C3SFN, which improve the feature extraction performance of the backbone network while keeping the network lightweight. Compared with other object detection algorithms and the other YOLOv5 series models, ablation experiments show that the proposed algorithm not only has a more lightweight network model but also performs better in detection accuracy and detection effect. Figure 2 shows the network structure of LAI-YOLOv5s.
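The VB module is described here only at a high level (the residual structure is retained and the outputs of several convolutional layers are concatenated at the end), so the following is a minimal sketch of the one-shot aggregation pattern that VoVNet uses and that such a block could build on. The class name VBBlock, the number of layers, the channel widths, and the residual placement are assumptions; the paper's precise VB layout is not reproduced here.

```python
# Hypothetical VB-style block: stacked 3x3 convolutions whose outputs are
# concatenated once at the end (VoVNet one-shot aggregation), reduced by a
# 1x1 convolution, with a residual connection preserved around the block.
import torch
import torch.nn as nn


class ConvBNAct(nn.Module):
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class VBBlock(nn.Module):
    """Illustrative one-shot-aggregation block with a residual connection."""
    def __init__(self, channels, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            ConvBNAct(channels, channels, k=3) for _ in range(n_layers)
        )
        # 1x1 conv reduces the concatenation of the input and every layer output
        self.aggregate = ConvBNAct(channels * (n_layers + 1), channels, k=1)

    def forward(self, x):
        feats = [x]
        y = x
        for layer in self.layers:
            y = layer(y)
            feats.append(y)
        return x + self.aggregate(torch.cat(feats, dim=1))  # residual kept
```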
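Likewise, the C3SFN module is based on the ShuffleNetV2 building block, whose core idea can be shown in a few lines: the channels are split in half, one half passes through unchanged, the other goes through a lightweight 1x1 / depthwise 3x3 / 1x1 branch, and the halves are concatenated and channel-shuffled so the branches exchange information at almost no extra cost. The class name ShuffleUnit and the way such units would be wrapped into a C3-style module are assumptions rather than the paper's exact C3SFN design.

```python
# ShuffleNetV2-style unit (stride 1): channel split, light branch, channel shuffle.
import torch
import torch.nn as nn


def channel_shuffle(x, groups=2):
    """Interleave channels across groups so the two branches exchange information."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)


class ShuffleUnit(nn.Module):
    """Illustrative ShuffleNetV2 basic unit; channel count must be even."""
    def __init__(self, channels):
        super().__init__()
        c = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False), nn.BatchNorm2d(c),  # depthwise
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)             # channel split
        out = torch.cat([x1, self.branch(x2)], dim=1)
        return channel_shuffle(out, groups=2)  # mix the two halves
```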



