Please use this identifier to cite or link to this item: http://202.28.34.124/dspace/handle/123456789/3058
Title: Deep Learning for Understanding Violence in Videos
Authors: Wimolsree Getsopon
Olarik Surinta
Mahasarakham University
olarik.s@msu.ac.th
Keywords: Violent Video Understanding
Violent Video Recognition
Video Recognition
Convolutional Neural Network
Recurrent Neural Network
Feature Extraction
Feature Fusion Technique
Issue Date:  2
Publisher: Mahasarakham University
Abstract: Chapter 1 briefly introduces violent video understanding and the research questions. Additionally, the objectives and contributions of the dissertation are described.

Chapter 2 presents the background of violent video understanding using deep learning techniques, together with related work. The background covers deep learning techniques, convolutional neural networks (CNNs), CNN architectures, 3D convolutional neural networks (3D-CNNs), recurrent neural networks (RNNs), deep feature extraction, deep feature fusion methods, and violent video datasets. The related work section, which reviews research from the earliest studies to the present, consists of six main parts: deep learning for video classification, handcrafted features for violence recognition, violence recognition with 2D-CNNs, violence recognition with 3D-CNNs, violence recognition with a combination of CNNs and RNNs, and violence recognition with fused features.

Chapter 3 proposes a fusion MobileNets-BiLSTM architecture. In the first part, I propose using the lightweight MobileNetV1 and MobileNetV2 to extract robust deep spatial features from the video, from which only 16 non-adjacent frames are selected. The spatial features are passed through global average pooling, batch normalization, and a time-distributed layer. In the second part, the spatial features from the first part are concatenated and transferred to a bidirectional long short-term memory (BiLSTM) network. The proposed fusion MobileNets-BiLSTM architecture was evaluated on the hockey fight dataset, where it achieved 95.20% accuracy on the test set.

Chapter 4 proposes a method for understanding violence in videos using deep feature integration with a 3D-CNN. I propose two CNNs that extract spatial features from their last convolutional layers at the frame level. A concatenation operation combines the frame-level spatial features of both CNNs before they are transferred to the 3D-CNN architecture, which learns spatiotemporal features and consists of batch normalization, 3D convolution, and dropout layers, plus a global average pooling layer followed by a fully connected layer. Finally, a softmax classifies each video as violent or non-violent.

Chapter 5 comprises two main sections: answers to the research questions and suggestions for future work. This chapter briefly summarizes the proposed approaches and answers the two main research questions in video understanding.
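The Chapter 3 pipeline can be illustrated with a short Keras sketch. This is a minimal reconstruction from the abstract, not the dissertation's code: the 224x224 frame size, ImageNet weights, 128 LSTM units, and the optimizer settings are assumptions of mine; only the 16-frame input, the two MobileNet backbones, the pooling/normalization/time-distributed steps, the concatenation, and the BiLSTM come from the text above.

import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import MobileNet, MobileNetV2

NUM_FRAMES = 16               # 16 non-adjacent frames per clip (Chapter 3)
FRAME_SHAPE = (224, 224, 3)   # assumed frame size; not stated in the abstract

frames = layers.Input((NUM_FRAMES, *FRAME_SHAPE))

def spatial_branch(backbone_fn):
    # Per-frame spatial features: backbone -> global average pooling ->
    # batch normalization, applied frame-by-frame via TimeDistributed.
    backbone = backbone_fn(include_top=False, weights="imagenet",
                           input_shape=FRAME_SHAPE)
    x = layers.TimeDistributed(backbone)(frames)
    x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(x)
    return layers.TimeDistributed(layers.BatchNormalization())(x)

# Fuse the MobileNetV1 and MobileNetV2 feature sequences, then model
# temporal dynamics with a bidirectional LSTM.
x = layers.Concatenate()([spatial_branch(MobileNet),
                          spatial_branch(MobileNetV2)])
x = layers.Bidirectional(layers.LSTM(128))(x)        # 128 units: an assumption
outputs = layers.Dense(2, activation="softmax")(x)   # violent vs. non-violent

model = Model(frames, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

A batch of shape (batch, 16, 224, 224, 3) then yields one two-class probability per clip.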
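Chapter 4's deep feature integration can be sketched in the same way. The abstract does not name the two 2D backbones, the number of 3D filters, or the dropout rate, so the MobileNetV1/MobileNetV2 pair, 64 filters, and the 0.5 rate below are placeholders; the ordering batch normalization, 3D convolution, dropout, global average pooling, and a fully connected softmax layer follows the text.

import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import MobileNet, MobileNetV2

NUM_FRAMES, FRAME_SHAPE = 16, (224, 224, 3)   # assumed clip length and frame size
frames = layers.Input((NUM_FRAMES, *FRAME_SHAPE))

def frame_feature_maps(backbone_fn):
    # Feature maps from the backbone's last convolutional block, computed
    # per frame; no pooling here, so the spatial layout is preserved.
    backbone = backbone_fn(include_top=False, weights="imagenet",
                           input_shape=FRAME_SHAPE)
    return layers.TimeDistributed(backbone)(frames)   # (batch, T, 7, 7, C)

# Concatenate both CNNs' frame-level feature maps on the channel axis, then
# learn spatiotemporal features with the 3D-CNN head.
x = layers.Concatenate(axis=-1)([frame_feature_maps(MobileNet),
                                 frame_feature_maps(MobileNetV2)])
x = layers.BatchNormalization()(x)
x = layers.Conv3D(64, kernel_size=3, padding="same", activation="relu")(x)
x = layers.Dropout(0.5)(x)                            # rate 0.5: placeholder
x = layers.GlobalAveragePooling3D()(x)
outputs = layers.Dense(2, activation="softmax")(x)    # fully connected + softmax

model = Model(frames, outputs)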
URI: http://202.28.34.124/dspace/handle/123456789/3058
Appears in Collections: The Faculty of Informatics

Files in This Item:
File             Size     Format
63011261001.pdf  3.97 MB  Adobe PDF

