Deep Learning for Video Subtitle Detection and Recognition

Please use this identifier to cite or link to this item: http://202.28.34.124/dspace/handle123456789/1510

Title:	Deep Learning for Video Subtitle Detection and Recognition
Authors:	Thanadol Singkhornart (ธนดล สิงขรอาสน์); Olarik Surinta (โอฬาริก สุรินต๊ะ); Mahasarakham University, The Faculty of Informatics
Keywords:	Video subtitle text recognition; Convolutional neural networks; Long short-term memory network; Connectionist temporal classification
Issue Date:	28
Publisher:	Mahasarakham University
Abstract:	Many videos are now published on Internet channels such as YouTube and Facebook. Many viewers, however, cannot understand the content of a video, whether because of language differences or hearing impairment, so subtitles are added to videos. In this thesis, we propose a deep learning approach that combines a convolutional neural network (CNN) with a long short-term memory (LSTM) network, called CNN-LSTM, to recognize video subtitles. We designed a simplified CNN architecture with 16 weight layers. The output of the last CNN layer is downsampled with max-pooling before being passed to the LSTM network. We first trained the CNN-LSTM architecture on printed-text data containing various font styles, diverse font sizes, and complicated backgrounds. Connectionist temporal classification (CTC) is used as the loss function to compute the loss value and to decode the network output. For the video subtitle dataset, we collected 24 videos from YouTube and Facebook whose subtitles contain Thai and English characters together with Thai and Arabic numerals, 157 distinct characters in total. From these videos we extracted 4,224 subtitle images. The proposed CNN-LSTM architecture achieved an average character error rate of 11.06%.
Nowadays, many videos are published over the Internet through various channels such as YouTube and Facebook. Some viewers have difficulty taking in information from videos because of language barriers or hearing problems, so subtitles are added to videos. This thesis proposes a deep learning approach that uses a convolutional neural network (CNN) together with long short-term memory (LSTM), called CNN-LSTM, to recognize subtitles in videos. We built a CNN model with 16 layers, whose final layer performs max-pooling downsampling before the result is passed to the LSTM. For training, we used subtitle images with varied styles, sizes, and backgrounds, and then used connectionist temporal classification (CTC loss) to compute the loss and decode the output. The training data were collected from 24 videos on YouTube and Facebook with subtitles containing Thai, English, Thai numerals, and Arabic numerals, 157 characters in total to be decoded. The image set contains 4,224 images in total, and the lowest average error rate obtained was 11.06%.
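The abstract names two steps that are easy to show concretely: decoding the network output with CTC and scoring the result by character error rate. The sketch below is a minimal illustration in plain Python, not the thesis implementation. It assumes best-path (greedy) CTC decoding with label index 0 as the blank symbol, and it assumes the error rate is the Levenshtein edit distance divided by the reference length, with each Unicode code point counted as one character. The thesis may define its 157 characters and its decoder differently. All function names here are hypothetical.

```python
from itertools import groupby


def ctc_greedy_decode(frame_labels, blank=0):
    """Best-path CTC decoding: collapse repeated labels, then drop blanks.

    frame_labels holds the most likely label index for each time step.
    Using index 0 as the blank is an assumption of this sketch.
    """
    collapsed = [label for label, _ in groupby(frame_labels)]
    return [label for label in collapsed if label != blank]


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance divided by reference length, as a fraction."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical 3-symbol alphabet: index 0 is the blank, 1..3 map to "c", "a", "t".
alphabet = "-cat"
frames = [1, 1, 0, 2, 2, 2, 0, 3]
decoded = "".join(alphabet[i] for i in ctc_greedy_decode(frames))
print(decoded)                                    # cat
print(character_error_rate("kitten", "sitting"))  # 0.5
```

A blank between two identical labels keeps them as separate characters, which is how CTC represents doubled letters. For Thai text, counting code points treats vowel and tone marks as separate characters, so a different unit of comparison would give a different error rate.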
Description:	Master of Science (M.Sc.)
URI:	http://202.28.34.124/dspace/handle123456789/1510
Appears in Collections:	The Faculty of Informatics

Files in This Item:

File	Description	Size	Format
63011283003.pdf		4.9 MB	Adobe PDF