Detection and recognition of cursive text from video frames

Textual content appearing in videos represents an interesting index for semantic retrieval of videos (from archives), generation of alerts (live streams), as well as high level applications like opinion mining and content summarization. The key components of such systems are detection and recognition of textual content, which also form the subject of our study. This paper presents a comprehensive framework for detection and recognition of textual content in video frames. More specifically, we target cursive scripts taking Urdu text as a case study. Detection of textual regions in video frames is carried out by fine-tuning deep neural network-based object detectors for the specific case of text detection. The script of the detected textual content is identified using convolutional neural networks (CNNs), while for recognition, we propose UrduNet, a combination of CNNs and long short-term memory (LSTM) networks. A benchmark dataset containing cursive text with more than 13,000 video frames is also developed. A comprehensive series of experiments is carried out, reporting an F-measure of 88.3% for detection and a recognition rate of 87%.


Introduction
In recent years, there has been a tremendous increase in the amount of digital multimedia data, especially video content, both in the form of video archives and live streams. According to statistics [1], 300 h of video are uploaded to YouTube every minute. A key factor responsible for this enormous increase is the availability of low-cost smart phones equipped with cameras. With such huge collections of data, there is a need for efficient as well as effective retrieval techniques that allow users to retrieve the desired content. Traditionally, videos are mostly stored with user-assigned annotations or keywords, which are called tags. When content is to be searched, a keyword provided as a query is matched with these tags to retrieve the relevant content. The assigned tags, naturally, cannot encompass the rich video content, leading to a constrained retrieval. A better and more effective strategy is to search within the actual content rather than simply matching the tags, i.e., content-based (image or) video retrieval (CBVR). CBVR systems have been researched and developed for a long time and allow a smarter way of retrieving the desired content. The term content may refer to the visual content (for example, objects or persons in the video), audio content (the spoken keywords for instance), or the textual content (News tickers, anchor names, score cards, etc.). Among these, the focus of our current study lies on textual content. More specifically, we target a smart retrieval system that exploits the textual content in videos as an index.
*Correspondence: alimirza@bahria.edu.pk 1 Bahria University, Islamabad, Pakistan
The textual content in video can be categorized into two broad classes: scene text and caption text. Scene text (Fig. 1) is captured through the camera during the video recording process and may not always be correlated with the content. Examples of scene text include advertisement banners, sign boards, and text on a T-shirt. Scene text is commonly employed for applications like robot navigation and assistance systems for the visually impaired. Artificial or caption text (Fig. 2), on the other hand, is superimposed on the video, and typical examples include News tickers, movie credits, and score cards. Caption text is generally correlated with the video content and is mostly applied for semantic retrieval of videos. The focus of our present study also lies on the caption text.

(2020) 2020: 34 Page 2 of 19

Fig. 1 Examples of scene text
The key components of a textual content based indexing and retrieval system include detection of text regions, extraction of text (segmentation from background), identification of script (for multi-script videos), and finally recognition of text (video OCR). Detection of text can be carried out using unsupervised [2][3][4], supervised [5][6][7], or hybrid [8,9] approaches. Unsupervised text detection relies on image analysis techniques to discriminate between text and non-text regions. Supervised methods, on the other hand, involve training a learning algorithm with examples of text and non-text regions to discriminate between the two. In some cases, a combination of the two techniques is also employed where the candidate text regions identified by unsupervised methods are validated through a supervised approach.
Once the text is detected, it can be fed to the recognition engine. In case of videos which include text in multiple scripts, an additional module to identify the script is required so that each type of text can be recognized through its respective OCR. While mature recognition systems are available for text in the Roman script (ABBYY FineReader and Tesseract [10], etc.), recognition of cursive scripts remains a challenging task. Furthermore, as opposed to document images which are scanned at high resolution, video text is mostly in low resolution (Fig. 3) and can appear on complex backgrounds, making its recognition more challenging.
This paper presents a comprehensive framework for video text detection and recognition in a multi-script environment. The key highlights of this study are outlined in the following.
• Development of a comprehensive dataset of video frames with ground truth information supporting evaluation of detection and recognition tasks.
• Investigation of state-of-the-art deep learning based object detectors, fine-tuning them to detection of textual content.
• Video text script identification using convolutional neural networks (CNN).
This paper is organized as follows. In the next section, we present an overview of the current state-of-the-art from the view point of text detection, script identification, and text recognition. In Section 3, we present the database developed in our study along with the ground truth information. Section 4 presents the details of the proposed framework while Section 5 details the experimental protocol, the realized results, and the corresponding discussion. Finally, Section 6 concludes the paper with a discussion on open challenges on this subject.

Background
Detection and recognition of textual content in videos, images, documents, and natural scenes has been investigated for more than four decades. The domain has matured progressively over the years, starting with trivial systems recognizing isolated digits and characters and moving to complex end-to-end systems capable of reading text in natural scenes. This section presents an overview of the notable contributions to detection and recognition of textual content in images and videos. Comprehensive and detailed surveys on the problem can be found in [11][12][13][14][15]. We organize our discussion in three main sections. We first discuss the text detection problem followed by an overview of script identification techniques. At the end, we present the current state-of-the-art and open challenges from the view point of text recognition.

Text detection
Text detection refers to localization of textual content in images. Techniques proposed for detection of text are typically categorized into unsupervised and supervised approaches. While unsupervised methods primarily rely on image analysis techniques to segment text from background, supervised techniques involve training a learning algorithm to discriminate between text and non-text regions. Unsupervised text detection techniques include edge-based methods [2,3,16], which (assume and) exploit the high contrast between text and its background; connected component-based methods [17,18], which mostly rely on the color/intensity of text pixels; and texture-based methods [19,20], which consider textual content in the image as a unique texture that distinguishes itself from the non-text regions. Texture-based methods have remained a popular choice of researchers, and features based on Gabor filters [21], wavelets [22], curvelets [23], local binary patterns (LBP) [24], discrete cosine transformation (DCT) [25], histograms of oriented gradients (HoG) [26], and Fourier transformation [27] have been investigated in the literature. Another common category of techniques includes color-based methods [28,29], which are similar in many aspects to the component-based methods and employ color information of text pixels to distinguish it from non-text regions.
Supervised approaches for detection of textual content typically employ state-of-the-art learning algorithms which are trained on examples of text and non-text blocks either using pixel values or by first extracting relevant features. Classifiers like Naïve Bayes [30], support vector machine (SVM) [31], artificial neural network (ANN) [8,32], and deep neural networks (DNN) [33] have been investigated over the years.
In recent years, deep learning based solutions have been widely applied to a variety of classification problems and have outperformed the traditional techniques. A major development contributing to the current fame of deep learning was the application of convolutional neural networks (CNNs) by Krizhevsky et al. [34] on the ImageNet Large Scale Visual Recognition competition [35], which greatly reduced the error rates. Since then, CNNs are considered to be state-of-the-art feature extractors and classifiers [36,37] and have been applied to a number of recognition tasks. While traditional CNNs are typically employed for object classification, region-based convolutional networks (R-CNN) [38] and their further enhancements Fast R-CNN [39] and Faster R-CNN [40] represent common object detectors. In addition to different variants of R-CNN, a number of new architectures have also been proposed in recent years for real time object detection. The most notable of these include YOLO (You Only Look Once) [41] and SSD (Single Shot Detector) [42]. These object detectors can be fine-tuned with textual data to serve as text detectors and are likely to provide good results. Among deep learning based techniques adapted for text detection, Huang et al. [43] employed sliding windows with CNNs to detect textual regions in low-resolution scene images. Likewise, fully convolutional networks are explored for detection of textual regions in [44] and the technique is evaluated on various ICDAR datasets. A similar work is presented by Gupta et al. [45] where CNNs are trained using synthetic data for detection of text at multiple scales from natural images. Another method called "SegLink" is proposed in [46] that relies on decomposing the text into segments (oriented boxes of words or lines) and links (connecting two adjacent segments). The segments and links are detected using fully convolutional networks at multiple scales and combined together to detect the complete text line.
In [47], a vertical anchor-based method is reported that predicts text and non-text scores of fixed size regions and reports high detection performance on the ICDAR 2013 and ICDAR 2015 datasets. In another notable work, Wang et al. [48] present a framework based on conditional random field (CRF) to detect text in scene images. The authors define a cost function by considering the color, stroke, shape, and spatial features with CNN for effective detection of textual regions.
Among other end-to-end trainable deep neural networks based systems, Liao et al. [7] present a system called "TextBoxes" which detects text in natural images in a single forward pass of the network. The technique was later extended to "TextBoxes++" and was evaluated on four public databases outperforming the state-of-the-art. He et al. [6] improved the convolutional layer of CNNs to detect text with arbitrary orientation. EAST [49] is another well-known scene text detector that provides promising results in challenging scenarios. In another study [5], an ensemble of CNNs is trained on synthetic data to detect video text in East Asian languages.
Summarizing, it can be concluded that the problem of text detection has been mostly dominated by the application of different deep learning architectures in recent years. The availability of benchmark datasets has also contributed to the rapid developments in this area. Among open problems, development of a generic text detector that could work in a multi-script environment remains a challenging issue.
In the next section, we discuss different techniques for recognition of script of the text from documents and video images.

Script recognition
In many cases, videos may contain textual content in more than one script. These scripts must be identified prior to feeding the text regions to the respective OCR engines for recognition. Script recognition has been studied by researchers for text in video images as well as printed and handwritten documents [50,51]. Recognition of script in video text is naturally much more challenging as opposed to printed or handwritten documents due to low resolution of text and in some cases complex backgrounds [52,53]. From simple methods based on template matching [54] to sophisticated structural [55] and statistical [56] features, a number of techniques have been reported in the literature. Among various features exploited to characterize the script, texture-based features [53,57,58,59] are known to be very effective reporting high classification rates. Textural measures like Gabor filters [60], LBP [24], and gray level co-occurrence matrix (GLCM) [61] have been investigated in a number of studies. The extracted features are typically fed to traditional classifiers to discriminate between the script classes under study. Among well-known methods, Zhao et al. [62] employed "Spatial Gradient Features (SGF)" to characterize script of text using a dataset of six different scripts. A similar study with "Gradient Angular Feature (GAF)" is reported in [63] where authors presented "Potential Text Candidate (PTC)" method for studying the cursiveness of text with histogram operations. Sharma et al. [64] proposed script identification using "Gradient Local Auto-Correlation (GLAC)" for English, Bengali, and Hindi script in low-resolution video frames. A recent comprehensive survey on script identification can be found in [50]. A competition on this problem was also organized in conjunction with ICDAR 2015 [65]. The competition involved four challenging tasks with 10 different languages, and among the submitted systems, Google Inc. was declared the winner of the competition.
Among recent contributions to script identification, Jieru et al. [66] combine a CNN and RNN into a single end-to-end trainable network. The technique is evaluated on multiple datasets and reports high identification rates. In [67], the authors propose a set of mid-level features for script identification with very little labeled data. Experiments on the CVSI dataset report an identification rate of more than 96%. Gomez et al. [68] employed a Naïve Bayes classifier with convolutional features to identify script in unconstrained scene text. The work was later extended to apply patch-based classification using CNNs [69]. In other recent works, transfer learning and fine-tuning with AlexNet and VGG-16 are explored in [70] for script identification. A bag of visual words model is investigated in [71] by using convolutional features extracted from image patches in the form of triplets. Bhunia et al. [72] propose a CNN-LSTM framework to extract local as well as global features. The features are weighted dynamically, and the technique is evaluated on four public datasets reporting high identification rates.
Summarizing, like many other problems, deep learning-based methods have been the dominant technique for script identification in recent years. While the initial research primarily focused on document images, script identification of text appearing in videos has been an attractive research theme for many years now. Though many sophisticated systems have been proposed, the low resolution of video images and the high similarity between different scripts keep it an active research area.
In the next section, we discuss different techniques for text recognition from documents and video images.

Text recognition
Recognition of text, generally termed OCR (Optical Character Recognition), is one of the most classical pattern recognition tasks that has been explored for decades. Recognition systems have been investigated for printed as well as handwritten documents (scanned or camera-based), text in natural scene images, and the caption text appearing in videos. State-of-the-art recognition systems (for instance Google Tesseract [10], ABBYY FineReader, etc.) are known to report close to 100% recognition rates for textual content in a number of scripts. Recognition of text in cursive scripts, however, still remains challenging, especially for the text appearing in videos.
From the view point of document OCRs, research endeavors can be categorized into two main classes, analytical (character-based) and holistic (word-based) techniques. Analytical techniques work either on isolated characters or first segment the text into characters. For text recognition in document images, a number of techniques have been presented both at character (analytical) and word (holistic) levels. Typically, techniques like graph-based models [73][74][75], Bayesian classifiers [76,77], and Hidden Markov Models [78][79][80], etc., have been explored for character level recognition of text. Likewise, a number of features and classifiers have been investigated for word level recognition [81][82][83]. A number of recent studies also employed deep learning-based solutions for analytical as well as holistic recognition of text [84,85].
Unlike document images, recognition of text from scene images offers a more challenging scenario due to different positions of camera while capturing the text, non-uniform illumination, and complex backgrounds. Among hand-engineered features, popular descriptors investigated for detection of natural scene text include the Scale Invariant Feature Transform (SIFT) [86,87], Strokelets [88], and Histogram of Oriented Gradients (HOG) [89,90], etc. Likewise, ANNs [91][92][93] and SVMs [94,95] have been commonly employed as classifiers. In addition to recognition, techniques based on word spotting have also been investigated on scene text images [96,97]. Recognition of text in road signs also represents an important subproblem within the umbrella of scene text recognition and has been explored in a number of studies [98][99][100].
A recent trend in text recognition has been the combination of convolutional and recurrent neural networks [101][102][103][104][105] where the CNN part serves to map the raw text images to effective feature representations while the recurrent part exploits the feature sequences to predict the transcription. In addition to the standard C-RNN, a number of enhancements have been proposed in the network architectures [106][107][108][109] to deal with the challenges of scene text.
From the perspective of recognition of caption text, a key challenge, as discussed earlier, is the low resolution of text. A number of studies address this problem as a pre-processing step and combine the information from multiple frames to produce a high-resolution image which is subsequently fed to the recognizer [110,111]. Recognition of caption text has been mostly employed for indexing and retrieval applications [112,113].
In the context of cursive text, a holistic technique based on multi-dimensional LSTMs is presented in [114] for recognition of Arabic video text. The technique is evaluated on two datasets, AcTiV [115] and ALIF [116,117], and reports high recognition rates. A similar work is reported in [118] where a combination of CNN and LSTM is employed to recognize Arabic text in video frames. Another deep learning-based solution is presented in [119] where Lu et al. compare the performance of different pre-trained ConvNets for detection and recognition of caption text. CNNs are also employed for recognition of Chinese video text in [5].
Recognition of Urdu text has recently received significant research attention. Most of the developed systems mainly target digitized printed documents. Similar to other cursive scripts, recognition techniques are categorized into analytical (segmentation-based) and holistic (segmentation-free) methods. Unlike other scripts, however, segmentation of Urdu text into characters is a challenging problem [120]. As a result, a number of implicit segmentation-based techniques have been recently proposed where the learning algorithm is provided with the text line images and the corresponding output transcription to learn different shapes of a character as well as the segmentation points [121][122][123]. Likewise, in holistic approaches, the word boundaries are difficult to identify; hence, subwords (ligatures) are typically employed as recognition units in these methods.
Among notable holistic approaches, HMMs have been widely employed for recognition of ligatures [124][125][126]. These techniques typically employ sliding windows to extract features from ligature images which are projected in the quantized feature space, hence representing each ligature image as a sequence. In some cases, the main body and dots are separately recognized [127] to reduce the total number of unique classes which can be very high (Urdu, for example, has more than 26,000 unique ligatures [128]). The implicit segmentation-based recognition techniques mostly employ different variants of LSTMs [121,129,130] with a connectionist temporal classification (CTC) output layer to recognize characters. A significant proportion of studies targeting recognition of Urdu text employ the publicly available UPTI [131] and CLE [132] datasets. The UPTI dataset comprises more than 10,000 synthetic text lines while the CLE dataset consists of scanned images of printed Urdu books as well as a collection of high frequency ligatures. While recognition of printed Urdu text in document images has progressively matured in recent years, research endeavors targeting caption text are fairly limited. A holistic approach for recognition of a small set of Urdu ligatures (collected from video text) is presented in [133]. Pre-trained ConvNets are employed for feature extraction and classification, and though high recognition rates are reported, the number of considered ligature classes is very limited (290 ligature classes). Likewise, an implicit segmentation-based technique using LSTMs is presented for recognition of Urdu News tickers in [134]. The experimental study is carried out on a private dataset of videos, and the recognizer performance is compared with a commercial OCR.
After having discussed the significant contributions to text detection, script identification, and text recognition, we now present the dataset that has been developed to support evaluation of text detection and recognition modules.

Dataset
For experimental study of our system, we have collected and labeled a comprehensive dataset of video frames. The frames are labeled to allow evaluation of text detection, script identification, and text recognition performance. For each frame, the location of text regions, the script information, and the ground truth transcription are stored.
The first step in database development is the collection of videos. We have collected 60 videos by recording live streams from five different News channels. All videos are recorded at a resolution of 900 × 600 and a frame rate of 25 fps. While the videos contain textual content in both Urdu and English, videos from four of the channels have dominant occurrences of Urdu text while those from one channel mostly contain textual content in English. Since successive frames in a video contain redundant information, we extract one frame every two seconds for labeling. The main reason for extracting a single frame every 2 s is to ensure that the collected frames have different textual content. This allows variation in training and test data as opposed to the case where a sequence of frames contains (mostly) similar textual content. Having as many unique words and character combinations as possible allows the learning algorithm to generalize better.
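At 25 fps, extracting one frame every 2 s amounts to keeping one frame out of every 50. The following is a minimal sketch of that sampling rule. The function name, the 0-based indexing, and the choice of keeping the first frame of each window are illustrative assumptions, not details given in this paper.

```python
def sampled_frame_indices(total_frames, fps=25, interval_s=2.0):
    """Return the indices of the frames kept when sampling one frame per interval."""
    step = int(round(fps * interval_s))  # 25 fps * 2 s = 50 frames
    if step < 1:
        raise ValueError("fps * interval_s must cover at least one frame")
    # Assumption: keep the first frame of every window of `step` frames.
    return list(range(0, total_frames, step))


# A 10 s clip at 25 fps has 250 frames; five of them are kept.
indices = sampled_frame_indices(250)
```

Any video reader can then be asked for only these indices, so that near-duplicate successive frames never reach the labeling stage.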
To facilitate the labeling process and standardize the ground truth data, a comprehensive labeling tool has been developed that allows storing the location of each textual region in a frame along with its ground truth transcription. A screen shot of the developed tool is presented in Fig. 4. The tool allows loading frames in a video and labeling them one by one for text locations as well as ground truth transcription. The ground truth information of each frame is stored as an XML file and comprises frame meta data (video and channel details) and information about text regions in the frame. For each script (English & Urdu in our case), we store information on total number of text lines, and for each line, we store a unique ID, the type of text (scene text or artificial text), the location of text region within the frame, and the transcription of text. The ground truth information of an example frame (stored in the XML format) is illustrated in Fig. 5 while a summary of the labeled data in terms of number of videos, number of frames, and number of text lines is presented in Table 1.
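The ground truth layout described above can be sketched with Python's standard XML library. All tag and attribute names below (frame, script, line, lineCount, and so on) are hypothetical placeholders chosen for illustration. The actual schema is the one shown in Fig. 5 and may differ.

```python
import xml.etree.ElementTree as ET


def build_frame_annotation(video, channel, lines_by_script):
    """Build an XML tree for one frame.

    lines_by_script maps a script name to a list of dicts with keys
    'id', 'type', 'box' (x, y, width, height) and 'text'.
    All tag and attribute names are hypothetical.
    """
    frame = ET.Element("frame", {"video": video, "channel": channel})
    for script, lines in lines_by_script.items():
        # One element per script, carrying the total number of text lines.
        script_el = ET.SubElement(
            frame, "script", {"name": script, "lineCount": str(len(lines))}
        )
        for line in lines:
            x, y, w, h = line["box"]
            line_el = ET.SubElement(
                script_el,
                "line",
                {
                    "id": str(line["id"]),
                    "type": line["type"],  # "scene" or "artificial"
                    "x": str(x), "y": str(y),
                    "width": str(w), "height": str(h),
                },
            )
            line_el.text = line["text"]  # ground truth transcription
    return frame


example = build_frame_annotation(
    "video_01", "channel_A",
    {
        "English": [{"id": 1, "type": "artificial",
                     "box": (40, 520, 300, 32), "text": "BREAKING NEWS"}],
        "Urdu": [],
    },
)
xml_string = ET.tostring(example, encoding="unicode")
```

Keeping the line count as an attribute of each script element mirrors the description above, where the total number of text lines is stored per script.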

Methods
This section presents the details of the proposed framework which is summarized in Fig. 6. The overall system comprises three main modules: text detector, script identifier, and text recognizer. On top of these modules, a wide range of systems can be developed at the application layer including indexing and retrieval, key-word-based alert generation, and content summarization. The first module, the text detector, is responsible for identifying and localizing all textual content in a frame. Since text can be in more than one script (within the same frame), the detected textual regions are fed to the script identification module which separates the text lines as a function of the script (English and Urdu being the two scripts considered in the present study). The text is finally passed to the respective recognition engines of each script to convert the images of text lines into strings which can be subsequently employed for a number of applications. Each of these modules is discussed in detail in the following.
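The flow of data through the three modules can be sketched as follows. This is an illustrative skeleton only: the function names and the stand-in components are assumptions made so the example runs without trained models, and they do not correspond to the actual implementation.

```python
def process_frame(frame, detect_text, identify_script, recognizers):
    """Run detection, script identification, and recognition on one frame.

    detect_text(frame) -> list of (box, line_image)
    identify_script(line_image) -> script name, e.g. "Urdu" or "English"
    recognizers: dict mapping script name -> callable(line_image) -> str
    """
    results = []
    for box, line_image in detect_text(frame):
        script = identify_script(line_image)
        text = recognizers[script](line_image)  # script-specific engine
        results.append({"box": box, "script": script, "text": text})
    return results


# Stand-in components (hypothetical) so the sketch runs end to end.
def fake_detector(frame):
    return [((0, 0, 10, 5), "img-en"), ((0, 6, 10, 5), "img-ur")]


def fake_script_identifier(line_image):
    return "Urdu" if line_image.endswith("ur") else "English"


fake_recognizers = {
    "English": lambda img: "english text",
    "Urdu": lambda img: "urdu text",
}

output = process_frame("frame-0", fake_detector,
                       fake_script_identifier, fake_recognizers)
```

Passing the modules in as callables keeps them independent, so a detector or recognizer can be replaced without touching the rest of the pipeline.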

Text detection
The first step in the proposed framework is the detection of candidate text regions from the extracted video frames. For detection of textual content in a given frame, we have employed state-of-the-art convolutional neural network (CNN)-based object detectors. Although many object detectors are trained with thousands of class examples and provide high accuracy in detection and recognition of different objects, these object detectors cannot be directly applied to identify text regions in images. These models have to be tuned to the specific problem of discrimination of text from non-text regions. The convolutional base of these models can be trained from scratch, or known pre-trained models (VGG, Inception, or ResNet) can be fine-tuned by training them on text and non-text regions. In our study, we investigated the following object detectors for localization of text regions. The idea is to study which of these can be better adapted to the text detection problem.
• Faster R-CNN
• Single Shot Detector (SSD)
• Efficient and Accurate Scene Text detection pipeline (EAST)
Faster R-CNN [40] is an enhanced version of its predecessors R-CNN [38] and Fast R-CNN [39]. Faster R-CNN merges a region proposal network with Fast R-CNN for effective and efficient localization and recognition of objects. The SSD (Single Shot multibox Detector) architecture was proposed by Liu et al. [42] and reported high precision on object detection on standard datasets like PascalVOC and COCO. The architecture has an input [...] also a unique characteristic of this model. While Faster R-CNN and SSD are generalized object detectors, EAST was specifically designed to detect text from scene images. In our study, however, it did not report acceptable detection performance once applied directly to the detection of caption text from video images. Consequently, in addition to Faster R-CNN and SSD, EAST was also fine-tuned to our dataset.

Fig. 7 Overview of fine-tuning object detectors for text
To investigate the performance of different object detectors on our specific problem, we carried out a comprehensive series of experiments by training these three models for various settings of hyperparameters. Text regions, containing Urdu and English text lines, are given as training examples to these detectors. Once tuned, the detection performance is evaluated using the test set of images. The overall flow of the text detection is summarized in Fig. 7. The final fine-tuned model takes a video frame as input and localizes the text regions (Fig. 8). Once the text is localized, the detected regions are fed to the script identification module to recognize the script of the detected text.
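Detection performance of this kind is commonly summarized by precision, recall, and F-measure computed from matches between predicted and ground truth boxes. The sketch below assumes a box counts as correct when its intersection over union (IoU) with an unmatched ground truth box is at least 0.5, with greedy one-to-one matching. Both the threshold and the matching rule are assumptions for illustration, since the matching criterion is not specified in this section.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_scores(predicted, ground_truth, threshold=0.5):
    """Greedy one-to-one matching; returns (precision, recall, f_measure)."""
    unmatched = list(ground_truth)
    true_positives = 0
    for box in predicted:
        # Pick the still-unmatched ground truth box with the highest overlap.
        best = max(unmatched, key=lambda g: iou(box, g), default=None)
        if best is not None and iou(box, best) >= threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure


gt = [(0, 0, 100, 20), (0, 40, 100, 60)]
pred = [(0, 0, 100, 20), (200, 200, 260, 220)]
precision, recall, f_measure = detection_scores(pred, gt)
```

In this toy case one of two predictions is correct and one of two ground truth lines is found, so precision, recall, and F-measure are all 0.5.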

Script identification
Script identification takes text lines as input and identifies the script of the text. It is important to mention that during text detection, we treat it as a two-class problem, i.e., discrimination of text from non-text regions (irrespective of the script). It is also possible to treat it as a k + 1 class problem where k represents the number of scripts (k = 2 for our problem). In other words, the detection system can be trained to detect the text in a particular script, hence avoiding the need for a separate script identification module. However, it is known that text in multiple scripts shares some common characteristics (for instance, edges of strokes, alignment, edge density, etc.). Hence, designing a system to learn what is text and what is non-text seemed a more natural approach and is followed in our study. For script identification, we employ CNNs in a classification framework (rather than detection). Urdu and English text lines are employed to fine-tune pre-trained ConvNets to discriminate between the two classes. Once trained, the model is able to separate text lines as a function of the script (Fig. 9).
Once the script of a text line is identified, it is fed to the respective recognition engine as discussed in the following.

Text recognition
As discussed earlier, we primarily target videos of News channels which contain cursive text (Urdu in our problem) along with text in the Roman script (English in our case). OCR systems for text in English (and other languages sharing the same script) are fairly mature. Hence, for recognition of English text, we employ the off-the-shelf Google Tesseract OCR engine. For cursive text in Urdu, however, the performance of the Google recognition engine was not very promising. Hence, we developed our own recognition engine (UrduNet) to recognize the text lines in Urdu. Recognition details are presented in the following.

Google Tesseract
Google Tesseract [10] is considered a state-of-the-art OCR engine which provides high accuracy for many different languages including English. In our system, we have employed Tesseract version 4.0 which was recently released by Google. Version 4.0 is developed using a deep neural network, and more specifically, it employs recurrent neural networks with the long short-term memory architecture. The English text lines are fed to the recognition engine which returns the corresponding textual strings.

UrduNet
For recognition of Urdu text, we have designed and trained our own architecture which is a combination of a convolutional and a recurrent neural network and is termed "UrduNet." The key motivation for employing a CNN is to convert raw pixel values into robust feature representations while a recurrent net is employed to model the problem using an implicit segmentation based approach. This allows directly feeding the text lines along with ground truth transcription to the model, and no ligature or character level segmentation or labeling is required. Recurrent nets have reported significant performance enhancements on problems like speech, handwriting, and caption text recognition in recent years. Since simple RNNs fail to model long-term dependencies in the input sequences, variants like long short-term memory (LSTM) networks are commonly employed. LSTM represents a special type of recurrent unit with three gates, i.e., input, output, and forget. These gates are implemented using the sigmoid function and regulate the memory of an LSTM cell. It is also common to employ bi-directional LSTMs which parse the input in both forward and backward directions and concatenate the information for better predictions.
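The gating mechanism described above can be illustrated with the standard LSTM update written for a single scalar unit. This is a textbook formulation under simplifying assumptions (one input value, one hidden value, hand-supplied weights). It is not the UrduNet implementation, and the parameter layout is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.

    params maps each of 'i', 'f', 'o', 'g' to a (w_x, w_h, bias) triple.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))      # input gate
    f = sigmoid(pre("f"))      # forget gate
    o = sigmoid(pre("o"))      # output gate
    g = math.tanh(pre("g"))    # candidate cell value
    c = f * c_prev + i * g     # updated cell memory
    h = o * math.tanh(c)       # new hidden state
    return h, c


def bidirectional(xs, fwd_params, bwd_params):
    """Run the sequence in both directions and pair the hidden states."""
    def run(seq, params):
        h, c, out = 0.0, 0.0, []
        for x in seq:
            h, c = lstm_step(x, h, c, params)
            out.append(h)
        return out

    forward = run(xs, fwd_params)
    backward = run(xs[::-1], bwd_params)[::-1]  # re-align with input order
    return list(zip(forward, backward))  # concatenation per time step
```

The forget gate scales the previous memory and the input gate scales the new candidate value, which is how the sigmoid gates regulate what the cell retains over long sequences.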
For feature extraction, we have designed a seven-layer convolutional neural network. Input text line images are pre-processed and fed to the ConvNet. The pre-processing includes height normalization, image binarization, and flipping. The flipping is carried out because Urdu is printed from right to left, unlike western languages which are printed from left to right. Flipping the image ensures that the character sequences in the transcription are in correspondence with the image. The CNN maps input text line images to a feature map which is fed as a sequence to the recurrent layers. The recurrent part of the network is implemented using two layers of bidirectional LSTMs. The LSTM outputs per-frame predictions which are converted to class labels using a CTC layer. Finally, a look-up table is used to map the class labels to the true Unicodes and produce the output transcription. A summary of these steps is illustrated in Fig. 10 while sample text lines used to train the model are presented in Fig. 11. Training is carried out in an end-to-end manner using the CTC loss function. Comprehensive findings on the impact of pre-processing and the recurrent units can be found in our related studies [135] and [136], respectively.
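The last two steps of this pipeline (per-frame predictions to class labels, then class labels to Unicode) can be sketched as follows. The sketch assumes greedy (best-path) CTC decoding with the blank at index 0; the decoding strategy actually used, the label indices, and the three-entry look-up table are illustrative assumptions and not taken from UrduNet.

```python
BLANK = 0  # assumed index of the CTC blank label

# Hypothetical look-up table from class labels to Unicode characters.
LOOKUP = {
    1: "\u0627",  # ARABIC LETTER ALEF
    2: "\u0628",  # ARABIC LETTER BEH
    3: "\u067E",  # ARABIC LETTER PEH
}


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Best-path CTC decoding.

    frame_scores holds one list of class scores per frame. The most likely
    class is taken for each frame, consecutive repeats are merged, and
    blanks are dropped.
    """
    best_path = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    labels = []
    previous = None
    for label in best_path:
        if label != previous and label != blank:
            labels.append(label)
        previous = label
    return labels


def labels_to_text(labels, table=LOOKUP):
    """Map class labels to their Unicode characters (logical order)."""
    return "".join(table[label] for label in labels)
```

For example, the frame-wise path `1, 1, blank, 1, 2, 2, blank` collapses to the labels `1, 1, 2`: the blank between the two runs of `1` is what allows a repeated character to survive the merging step. Because the line image is flipped before recognition, the decoded sequence is already in the same order as the transcription.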

Experiments, results, and discussion
To study the effectiveness of the proposed framework, a comprehensive series of experiments is carried out to evaluate the text detection, script identification, and text recognition modules. We first present the experimental protocol and realized results of text detection followed by those of script identification. Finally, we discuss the performance of our recognition engine trained using different sets of images. In addition to our own dataset (presented in Section 3), we also employed other datasets in the training process as discussed in the following.

Fig. 13 Comparison of different models on detection performance

Text detection results
Text detection is evaluated by fine-tuning different object detectors. These include Faster R-CNN [40], SSD [42], and EAST (Efficient and Accurate Scene Text detection pipeline) [49]. We used 9546 frames from our dataset containing more than 50,000 text lines for tuning these models, while evaluations are carried out using more than 21,000 text lines from 4000 frames. Since Urdu and Arabic share many common characteristics, we also employed the publicly available dataset AcTiV developed by Zayene et al. [137]. The dataset contains 1841 video frames with more than 5000 text lines from different Arabic news channels. The text lines in the AcTiV dataset are used only in the training set and not in the test set. Three experimental scenarios are considered in our evaluation as listed in the following.

• Scenario-I (S-I): Text lines from our custom-developed dataset are used to train the models.
• Scenario-II (S-II): Text lines from AcTiV dataset [137] are used to train the models.
• Scenario-III (S-III): Text lines from both our dataset and the AcTiV dataset are combined to train the models.
Table 2 summarizes the three scenarios along with the distribution of text lines into training and test sets for these scenarios.
The three models are trained for each of the scenarios by applying different settings of hyperparameters (optimized on the training set). Performance is quantified using the standard precision, recall, and F-measure, where the bounding boxes of the detector are compared with those in the ground truth. Detection results on sample video frames are illustrated in Fig. 12 while the quantified results are presented in Table 3. Comparing the performance of the three models, it can be seen that Faster R-CNN outperforms the other two. Comparing the three experimental scenarios, the lowest F-measures are reported in S-II, which can be explained by the smaller number of (Arabic) text lines (around 5000) in the training set. Furthermore, this training set did not contain any text lines in English. The detection performance of S-I and S-III is comparable, with S-III slightly better than S-I due to the larger number of text lines in the training set. Overall, the highest F-measure of 88.3% is reported when using a combination of the two datasets in the training set. A comparison of the detection performance of the three models for the three experimental scenarios is presented in Fig. 13.
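To make the evaluation measures concrete, the sketch below computes precision, recall, and F-measure from the overlap between detected and ground truth boxes. The exact matching protocol is not specified here, so the area-based formulation, the `(x1, y1, x2, y2)` box layout, and the function names are assumptions. The sketch also assumes that boxes within each set do not overlap one another.

```python
def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); zero if degenerate."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


def intersection(a, b):
    """Area shared by two boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return area((x1, y1, x2, y2))


def detection_scores(detected, truth):
    """Area-based precision, recall and F-measure for one frame.

    precision = overlapping area / total detected area
    recall    = overlapping area / total ground truth area
    Assumes boxes inside each list are mutually disjoint.
    """
    overlap = sum(intersection(d, g) for d in detected for g in truth)
    detected_area = sum(area(d) for d in detected)
    truth_area = sum(area(g) for g in truth)

    precision = overlap / detected_area if detected_area else 0.0
    recall = overlap / truth_area if truth_area else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

Under this formulation, a detected box that extends beyond the true text line lowers precision while leaving recall at 1, and a box that covers only part of the line lowers recall while leaving precision at 1.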
To provide an insight into the detection errors, a few of the errors are illustrated in Fig. 14. It can be seen that in most cases, the detector is able to detect the textual region but the localization is not perfect, i.e., in some cases, the bounding box is larger (smaller) than the actual content, leading to a reduced precision (recall).
In order to compare the performance of our text detection method with already published works, we summarize the results of various studies targeting detection of cursive caption text in Table 4. It is important to mention that since different methods are evaluated on different datasets, a direct quantitative comparison may not be very meaningful. Most of the listed studies are evaluated on relatively smaller datasets (mostly ≤ 1000 images). Moradi et al. [138] and Zayene et al. [139] report results on relatively larger datasets with F-measures of 0.89 and 0.84, respectively. In comparison to other studies, we employ a significantly larger set of images with an F-measure of 0.88, indicating the robustness and scalability of our detector.

Script identification results
Script identification refers to classification of text regions into one of the script classes. In our study, we aim to distinguish between cursive Urdu and English text. Script identification is carried out by fine-tuning a pre-trained ConvNet with more than 29,000 Urdu and around 24,000 English text lines. For consistency, the distribution of data into training and test sets is kept the same as in the experiments on text detection and is summarized in Table 5. The confusion matrix for script identification is presented in Table 6, where it can be seen that the model was able to recognize the scripts with less than 3% error rate.
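For reference, the error rate of a two-class confusion matrix is the share of off-diagonal entries. The short sketch below shows the computation; the counts are made-up placeholders chosen for illustration and are not the values of Table 6.

```python
def error_rate(confusion):
    """Fraction of misclassified samples in a confusion matrix.

    confusion[true_class][predicted_class] holds the number of samples.
    """
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(row.get(label, 0) for label, row in confusion.items())
    return 1 - correct / total if total else 0.0


# Placeholder counts for illustration only (NOT the values in Table 6).
example = {
    "Urdu": {"Urdu": 97, "English": 3},
    "English": {"Urdu": 2, "English": 98},
}
```

With these placeholder counts, 5 of 200 lines are misclassified, so `error_rate(example)` evaluates to 0.025 up to floating-point rounding.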

Text recognition results
To study the recognition performance of our model (UrduNet), we carried out a series of experiments. It is important to mention that from the viewpoint of a complete system, English text is recognized using an off-the-shelf recognition engine. Consequently, we only report the recognition results for Urdu text, which are summarized in Table 7. It can be seen that once the model is trained using text lines from video frames, a recognition rate of 83% is reported. With UPTI text lines, the recognition rate drops significantly (46%). The UPTI dataset contains high-resolution printed text lines, and when the system is trained using only these lines, the performance drops when tested on the challenging set of text lines from video frames. Combining the UPTI text lines with those from video frames (in the third experiment) yields the highest recognition rate of 87%. A few of the recognition errors are illustrated in Fig. 15, where it can be seen that a major proportion of errors results from false recognition of secondary ligatures (dots and diacritics) while the main body ligatures are correctly recognized in most cases.
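One common way to compute a recognition rate for line-level output is one minus the character error rate, using the edit distance between the predicted and ground truth transcriptions. The sketch below follows that convention; whether the rates reported here are computed in exactly this way is an assumption, and the function names are illustrative.

```python
def edit_distance(predicted, truth):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(truth) + 1))
    for i, p_char in enumerate(predicted, 1):
        current = [i]
        for j, t_char in enumerate(truth, 1):
            current.append(min(
                previous[j] + 1,                       # delete p_char
                current[j - 1] + 1,                    # insert t_char
                previous[j - 1] + (p_char != t_char),  # substitute or match
            ))
        previous = current
    return previous[-1]


def recognition_rate(pairs):
    """Character-level recognition rate over (predicted, truth) pairs."""
    errors = sum(edit_distance(p, t) for p, t in pairs)
    characters = sum(len(t) for _, t in pairs)
    return 1 - errors / characters if characters else 0.0
```

Under this measure, confusing two characters that differ only in their dots, for instance beh (U+0628) and peh (U+067E), counts as a single substitution, so errors on secondary ligatures reduce the rate even when the main body of the ligature is recognized correctly.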
We also provide a performance comparison of recent studies focusing on recognition of cursive text in general and Urdu text in particular. Since different methods are evaluated on different datasets under different experimental settings, a direct comparison of recognition rates is not the objective. The key idea is to provide an overview of the current state of the art on this problem and assess the effectiveness of our recognizer with respect to it. Many interesting observations can be made from the summary of results presented in Table 8. The recognition rates on printed document images, in general, are naturally higher than those reported on caption text. The highest reported recognition rate is 98.12% on 10,000 text lines of the UPTI dataset. It is, however, important to recall that UPTI contains synthetically generated text lines and does not offer the same kind of recognition challenges as those encountered in scanned documents or video text. In the case of video text, Zayene et al. [114] reported a 96.85% recognition rate on a relatively smaller set of around 8000 Arabic text lines. For Urdu caption text, Tayyab et al. [134] report a 93% recognition rate on approximately 20,000 text lines. Hayat et al. [133] report a high classification rate of more than 99%, but the number of ligature classes is fairly small. In our experiments, we report a recognition rate of 87% which, though not directly comparable with the reported studies, is indeed very promising considering the complexity of the problem and a much larger dataset.

Conclusion
In this paper, we presented a comprehensive framework for text detection and recognition in video frames containing textual occurrences in English and Urdu. A number of contributions are made in the presented study. We developed a comprehensive dataset of video frames with ground truth information allowing evaluation of detection and recognition tasks. For detection of textual regions, we employed state-of-the-art deep learning-based object detectors and fine-tuned them to detect text in multiple scripts. The script of the detected textual regions is identified using CNNs in a classification framework. A key contribution of this study is the development of UrduNet, a combination of a CNN and bidirectional LSTMs which reports high recognition rates for the challenging video text in cursive Urdu. In our further work on this problem, we intend to develop a complete indexing and retrieval system that can be queried for keywords. The system will be optimized to work on live streams in addition to archived videos. This will in turn allow the development of keyword-based alert generation systems. Furthermore, we plan to enhance the dataset and make it publicly available. The study can also be extended to include additional scripts by integrating their respective OCRs.