- Research Article
- Open Access
Exploiting Speech for Automatic TV Delinearization: From Streams to Cross-Media Semantic Navigation
© Guillaume Gravier et al. 2011
- Received: 25 June 2010
- Accepted: 20 January 2011
- Published: 7 February 2011
The gradual migration of television from broadcast diffusion to Internet diffusion offers countless possibilities for the generation of rich navigable contents. However, it also raises numerous scientific issues regarding delinearization of TV streams and content enrichment. In this paper, we study how speech can be used at different levels of the delinearization process, using automatic speech transcription and natural language processing (NLP) for the segmentation and characterization of TV programs and for the generation of semantic hyperlinks in videos. Transcript-based video delinearization requires natural language processing techniques robust to transcription peculiarities, such as transcription errors, and to domain and genre differences. We therefore propose to modify classical NLP techniques, initially designed for regular texts, to improve their robustness in the context of TV delinearization. We demonstrate that the modified NLP techniques can efficiently handle various types of TV material and be exploited for program description, for topic segmentation, and for the generation of semantic hyperlinks between multimedia contents. We illustrate the concept of cross-media semantic navigation with a description of our news navigation demonstrator presented during the NEM Summit 2009.
- Natural Language Processing
- Semantic Relation
- Automatic Speech Recognition
- Confidence Measure
- Word Error Rate
Television is currently undergoing a deep mutation, gradually shifting from broadcast diffusion to Internet diffusion. This so-called TV-Internet convergence raises several issues with respect to future services and authoring tools, due to fundamental differences between the two diffusion modes. The most crucial difference lies in the fact that, by nature, broadcast diffusion is eminently linear while Internet diffusion is not, thus permitting features such as navigation, search, and personalization. In particular, navigation by means of links between videos or, in a more general manner, between multimedia contents, is a crucial issue of Internet TV diffusion.
The growth and the impact of nonlinear Internet TV diffusion however remain limited for several reasons. Apart from strategic and political reasons (currently, commercial policies of broadcasters imply that navigation is almost always limited to contents within a broadcaster's site to prevent users from browsing away; however, we believe that these limitations will vanish in the future with the emergence of video portals independent of the broadcasting companies), several technical reasons prevail. Firstly, the amount of content available is limited as repurposing TV contents for Internet diffusion is a costly process. The delinearization overhead is particularly cumbersome for services which require a semantic description of contents. Secondly, most Internet diffusion sites offer poor search features and lack browsing capabilities enabling users to navigate between contents. Indeed, browsing is often limited to a suggestion of videos sharing some tags which poorly describe the content. The main reason for this fact is, again, that obtaining an exploitable semantic description of some content is a difficult task. In brief, Internet browser-enabled diffusion of TV contents is mostly limited by the lack of a detailed semantic description of TV contents for enhanced search and navigation capabilities. There is therefore a strong need for automatic delinearization tools that break streams into their constituents (programs, events, topics, etc.) and generate indexes and links for all constituents. Breaking video streams into programs has been addressed on several occasions [1–4] but does not account for a semantic interpretation. Breaking programs into their constituents has received a lot of attention for specific genres such as sports [5, 6] and broadcast news [7–9]. Most methods are nonetheless either highly domain- and genre-specific or limited in their semantic content description.
Moreover, regardless of the segmentation step, automatically enriching video contents with semantic links, possibly across modalities, has seldom been attempted [10, 11].
Spoken material embedded in videos, accessible by means of automatic speech recognition (ASR), is a key to the semantic description of video contents. However, spoken language is seldom exploited in the delinearization process, in particular for TV streams containing various types of programs. The main reason is that, apart from specific genres such as broadcast news [8, 9, 12, 13], natural language processing (NLP) and information retrieval (IR) techniques originally designed for regular texts (by regular texts, we designate texts originally designed in their written form, for which structural elements such as casing, sentence boundary markers, and possibly paragraphs are explicitly defined) are not robust enough: they fail to perform sufficiently well on automatic transcripts, mostly because of transcription errors and of the lack of sentence boundary markers, and/or they are highly dependent on a particular domain. Indeed, depending on many factors such as recording conditions or speaking style, automatic speech recognition performance can drop drastically on some programs. Hence the need for genre- and domain-independent spoken content analysis techniques robust to ASR peculiarities for TV stream delinearization.
In this paper, we propose to adapt existing NLP and IR techniques to ASR transcripts, exploiting confidence measures and external knowledge such as semantic relations, to develop robust spoken content processing techniques at various stages of the delinearization chain. We show that this strategy is efficient in robustifying the processing of noisy ASR transcripts and permits speech-based automatic delinearization of TV streams. In particular, the proposed robust spoken document processing techniques are used for content description across a wide variety of program genres and for efficient topic segmentation of news, reports and documentaries. We also propose an original method to create semantic hyperlinks across modalities, thus enabling navigational features in news videos. We finally illustrate how those techniques were used at the core of a news navigation system, delinearizing news shows for semantic cross-media navigation, demonstrated during the NEM Summit 2009.
The paper is organized as follows. We first present the speech transcription system used in this study, highlighting peculiarities of automatic transcripts. In Section 3, a bag-of-words description of TV programs is presented to automatically link programs with their synopses. In Section 4, a novel measure of lexical cohesion for topic segmentation is proposed and validated on news, reports, and documentaries. Section 5 is dedicated to an original method for the automatic generation of links across modalities, using transcription as a pivot modality. The NEM Summit 2009 news navigation demonstration is described in Section 6. Section 7 concludes the paper, providing future research directions towards better automatic delinearization technologies.
The first step to speech-based processing of TV contents is their transcription by an automatic speech recognition engine. We recall here the general principles of speech recognition to highlight the peculiarities of automatic transcripts with respect to regular texts and the impact on NLP and IR techniques. In the second part, details on the ASR system used in this work are given.
2.1. Transcription Principles
Statistical ASR systems search for the word sequence $\hat{w}$ that is most probable given the acoustic signal $y$, which, by Bayes' rule, amounts to solving

$$\hat{w} = \arg\max_{w} \; p(y \mid w)\, P[w]. \qquad (1)$$

Language models (LM), that is, probability distributions over sequences of N words (N-gram models), are used to get the prior probability $P[w]$ of a word sequence $w$. Acoustic models, typically continuous density hidden Markov models (HMM) representing phones, are used to compute the probability of the acoustic material for a given word sequence, $p(y \mid w)$. The relation between words and acoustic models of phone-like units is provided by a pronunciation dictionary which lists the words recognizable by the ASR system, along with their corresponding pronunciations. Hence, ASR systems operate on a closed vocabulary whose typical size is between 60,000 and 100,000 words. Out-of-vocabulary (OOV) words cannot be recognized as is and are therefore one cause of recognition errors, resulting in the correct word being replaced by one or several similarly sounding erroneous words. The vocabulary is usually chosen by selecting the most frequent words, possibly adding domain-specific words when necessary. However, named entities (proper names, locations, etc.) are often missing from a closed vocabulary, in particular in the case of domain-independent applications such as ours.
Evaluating (1) over all possible word sequences of unknown length is costly in spite of efficient approximate beam search strategies [14, 15] and is usually performed over short utterances of 10 s to 30 s. Hence, prior to transcription, the stream is partitioned into short sentence-like segments which are processed independently of one another by the ASR system. Regions containing speech are first detected, and each region is further broken into short utterances based on the detection of silences and breath intakes.
Clearly, ASR transcripts significantly differ from regular texts. First, recognition errors can strongly impact the grammatical structure and semantic meaning of the transcript. In particular, poor recording conditions, environmental noises such as laughter and applause, and spontaneity of speech are all factors that might occur in TV contents and which drastically increase recognition errors. Second, unlike most texts, transcripts are unstructured, lacking sentence boundary markers and paragraphs. In some cases, transcripts are also case insensitive so as to limit the number of OOV words. These oddities might be detrimental to NLP where casing and punctuation marks are often considered as critical cues. However, ASR transcripts are more than just degraded texts. In particular, word hypotheses are accompanied by confidence measures indicating for each word an estimation of its correctness by the ASR system. Using confidence measures for NLP and IR can help avoid error-prone hard decisions from the ASR system and partially compensate for recognition errors, but this requires that standard NLP and IR algorithms be modified, as we propose in this paper.
2.2. The IRENE Transcription System
In this paper, all TV programs were transcribed using our IRENE ASR system, originally developed for broadcast news transcription. IRENE implements a multiple-pass strategy, progressively narrowing the set of candidate transcriptions (the search space) in order to use more complex models. In the final steps, a 4-gram LM over a vocabulary of 65,000 words is used with context-dependent phone models to generate a list of 1,000 transcription hypotheses. Morphosyntactic tagging, using a tagger specifically designed for ASR transcripts, is used in a postprocessing stage to generate a final transcription with word-posterior-based confidence measures, combining the acoustic, language model, and morphosyntactic scores. Finally, part-of-speech tags are used for lemmatization, and, unless otherwise specified, lemmas (a lemma is an arbitrary canonical form grouping all inflexions of a word in a grammatical category, e.g., the infinitive form for verbs, the masculine singular form for adjectives, etc.) are considered instead of words in this work.
The language model probabilities were estimated on 500 million words from French newspapers and interpolated with LM probabilities estimated over 2 million words corresponding to reference transcription of radio broadcast news shows. The system exhibits a word error rate (WER) of 16% on the nonaccented news programs of the ESTER 2 evaluation campaign. As far as TV contents are concerned, we estimated word error rates ranging from 15% on news programs to more than 70% on talk shows or movies.
The first step in TV delinearization is the stream segmentation step which usually consists in splitting the stream into programs and interprograms (commercials, trailers/teasers, sponsorships, or channel jingles). Several methods have been proposed to this end, exploiting information from an electronic program guide (EPG) to segment the video stream and label each of the resulting segments with the corresponding program name [2–4, 19]. Note that stream segmentation exploiting interprogram detection results in segments corresponding to a TV program, to a fraction of a program, or, on some rare occasions, to several programs. In all cases, aligning the video signal with the EPG relies on low-level audio and visual features along with time information and does not consider speech indexing and understanding to match program descriptions with video segments.
We first describe how traditional information retrieval approaches are modified to create associations between synopses and a video segment based on its transcription. The use of these associations to label segments is then discussed only briefly.
3.1. Pairwise Comparison of Transcripts and Synopses
The entire process of associating synopses and transcripts relies on pairwise comparisons between a synopsis and a segment's transcript. We propose a technique for such a pairwise comparison, inspired by word-based textual information retrieval techniques. In order to deal with transcription errors and OOV words, some modifications of the traditional vector space model (VSM) indexing framework are proposed: confidence measures are taken into account in the index term weights, and a phonetic-based document retrieval technique, which makes it possible to retrieve in a transcript the proper nouns contained in a synopsis, is also considered.
3.1.1. Modified tf-idf Criterion
where $c_{t,d}$ denotes the average word-level confidence over all occurrences of the term $t$ in the transcript $d$. Equation (5) simply states that words for which a low confidence is estimated by the ASR system will contribute less to the tf-idf weight than words with a high confidence measure. The parameter $\theta$ is used to smooth the impact of confidence measures. Indeed, confidence measures, which correspond to a self-estimation of the correctness of each word hypothesis by the ASR system, are not fully reliable. Therefore, $\theta$, experimentally set to 0.25 in this work, prevents a word from being fully discarded on the sole basis of its confidence measure.
Given the vector of tf-idf weights for a synopsis and the vector of modified tf-idf weights for a segment's transcript, the pairwise distance between the two is given by the cosine measure between the two description vectors.
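The confidence-weighted comparison described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear interpolation between a floor and the mean confidence is an assumed form for the weighting, and the names `weighted_tf`, `tfidf`, and `cosine` are hypothetical. Only the smoothing value 0.25 comes from the text.

```python
import math
from collections import defaultdict

THETA = 0.25  # smoothing value reported in the text


def weighted_tf(words, theta=THETA):
    """Confidence-weighted term frequencies for one transcript.

    words: list of (lemma, confidence) pairs, confidence in [0, 1].
    Assumed weighting: tf * (theta + (1 - theta) * mean confidence), so a
    term with zero confidence still keeps a fraction theta of its weight.
    """
    counts = defaultdict(int)
    conf_sum = defaultdict(float)
    for lemma, conf in words:
        counts[lemma] += 1
        conf_sum[lemma] += conf
    total = len(words)
    return {
        lemma: (counts[lemma] / total)
        * (theta + (1 - theta) * conf_sum[lemma] / counts[lemma])
        for lemma in counts
    }


def tfidf(tf, idf):
    """Multiply term frequencies by inverse document frequencies."""
    return {term: weight * idf.get(term, 0.0) for term, weight in tf.items()}


def cosine(u, v):
    """Cosine measure between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Note that the cosine measure is a similarity: higher values indicate a closer match between the synopsis and the transcript.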
3.1.2. Phonetic Association
Named entities, in particular proper names, require particular attention in the context of TV content description. Indeed, proper names are frequent in this context (e.g., characters' names in movies and series) and are often included in the synopses. However, proper names are likely to be OOV words that will therefore not appear in ASR transcripts. As a consequence, proper names are likely to jeopardize or, at least, to not contribute to the distance between a transcript and a synopsis when using the tf-idf weighted vector space model.
To skirt such problems, a phonetic measure of similarity is defined to phonetically search a transcript for proper names appearing in a synopsis. Each proper name in the synopsis is automatically converted into a string of phonemes. A segmental variant of the dynamic alignment algorithm is used to find in the phonetic output of an ASR transcript the substring of phonemes that best matches the proper name's phonetization. The normalized edit distance between the proper name's phoneme string and the best matching substring defines the similarity between the ASR transcript and the proper name in the synopsis. The final distance between the synopsis and the transcript is given by summing over all proper names occurring in the synopsis.
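The phonetic search can be illustrated with standard approximate substring matching, used here as a stand-in for the segmental dynamic alignment mentioned above (the exact variant is not specified in the text). The function names and the normalization by the length of the proper name are assumptions made for this sketch.

```python
def best_substring_distance(name, transcript):
    """Normalized edit distance between a phoneme string and its best
    matching substring in a phonetic transcript.

    name, transcript: sequences of phoneme symbols.
    Assumption: the distance is normalized by the length of the name.
    """
    if not name:
        return 0.0
    # Row 0 is all zeros: a match may start anywhere in the transcript.
    prev = [0] * (len(transcript) + 1)
    for i, p in enumerate(name, start=1):
        # Column 0: the first i phonemes of the name are left unmatched.
        cur = [i] + [0] * len(transcript)
        for j, q in enumerate(transcript, start=1):
            cur[j] = min(
                prev[j] + 1,             # phoneme of the name missing
                cur[j - 1] + 1,          # extra phoneme in the transcript
                prev[j - 1] + (p != q),  # match or substitution
            )
        prev = cur
    # The match may end anywhere, hence the minimum over the last row.
    return min(prev) / len(name)


def phonetic_distance(proper_names, transcript):
    """Sum of the per-name distances over all proper names of a synopsis."""
    return sum(best_substring_distance(n, transcript) for n in proper_names)
```

A value of 0 means that the phoneme string of the name occurs verbatim in the phonetic transcript, while larger values indicate a poorer match.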
3.2. Validating the Segmentation
We demonstrate on a practical task that the comparison techniques of Section 3.1 enable the use of ASR transcripts for genre-independent characterization of TV segments in spite of potentially numerous transcription errors. The word- and phonetic-level pairwise distances are used to validate, and possibly modify, the label (i.e., the program name) attached to each segment as a result of the alignment of the stream with an EPG. This validation step is performed by associating a unique synopsis with each segment before checking whether the synopsis corresponds to the program name obtained from the EPG or not, as illustrated in Figure 2. In case of mismatch, a decision is made to maintain, to change, or to invalidate the segment's label, based on the scheduled and broadcast start times. Associating a unique synopsis with each segment relies on shortlists of candidate segments for each synopsis. For a given synopsis, two shortlists of candidate segments are established, one based on the word-level distance given by the modified tf-idf criterion, the other based on the phonetic distance. Shortlist generation is not detailed here. The synopsis associated with a given segment is the one with the highest association score among those synopses for which the shortlists contain the segment.
In spite of the very limited gain incurred by the synopsis-based label correction process, these results clearly demonstrate that the proposed lexical and phonetic pairwise distances enable us to efficiently use automatic speech transcripts as a description of TV segments, for a wide range of program genres. However, the word-level description considered is a "bag-of-words" representation which conveys only limited semantics, probably partially explaining the robustness of the description to transcription errors. For programs with reasonable error rates between 15% and 30%, such as news, documentaries, and reports, speech can be used for finer semantic analysis, provided adequate techniques are proposed to compensate for the peculiarities of automatic transcripts. In the following sections, we propose robust techniques for topic segmentation and link generation, respectively, limiting ourselves to news, documentaries, and reports.
Segmentation of programs into topics is a crucial step to allow users to directly access parts of a show dealing with their topics of interest or to navigate between the different parts of a show. Within the framework of TV delinearization, topic segmentation aims at splitting shows for which the notion of topic is relevant (e.g., broadcast news, reports, and documentaries in this work) into segments dealing with a single topic, for example, to further enrich those segments with hyperlinks. Note that such segments usually include the introduction and possibly the conclusion by the anchor speaker in addition to the development itself and are therefore hardly detectable from low-level audio and visual descriptions. (By development, we refer to the actual report on the topic of interest. A typical situation is that of news programs where the anchor introduces the subject before handing over to a live report, the latter being possibly followed by a conclusion and/or a transition to the next news item. All these elements should be kept as a single topic segment.) Moreover, contrary to the TDT framework, no prior knowledge on topics of interest is provided so as to not depend on any particular domain and, in the context of arbitrary TV contents, segments can exhibit very different lengths.
Topic segmentation has been studied for years by the NLP community, which developed methods dedicated to textual documents. Most methods rely on the notion of lexical cohesion, corresponding to lexical relations that exist within a text and that are mainly enforced by word repetitions. Topic segmentation methods using this principle are based on an analysis of the distribution of words within the text: a topic change is detected when the vocabulary changes significantly [24, 25]. As an alternative to lexical cohesion, discourse markers, obtained from a preliminary learning process or provided by a human expert, can also be used to identify topic boundaries [26, 27]. However, discourse markers are domain- and genre-dependent and sensitive to transcription errors, while lexical cohesion does not depend on specific knowledge. Lexical cohesion is nonetheless also sensitive to transcription errors. We therefore propose to improve the lexical cohesion measure at the core of one of the best text segmentation methods so as to accommodate confidence measures and to account for semantic relations other than mere word repetitions (e.g., the semantic proximity between the words "car" and "drive"), thereby compensating for the limited number of repetitions in certain genres. As we will argue in Section 4.2, the use of semantic relations serves a double purpose: better semantic description and increased robustness to transcription errors. However, such relations are often domain dependent and their use should not be detrimental to the segmentation of out-of-domain transcripts.
We briefly describe the topic segmentation method of Utiyama and Isahara, which serves as a baseline in this work, emphasizing the probabilistic lexical cohesion measure on which this method is based. We then extend this measure to successively account for confidence measures and semantic relations. Finally, experimental results on TV programs are presented in Section 4.3.
4.1. Topic Segmentation Based on Lexical Cohesion
The topic segmentation method introduced by Utiyama and Isahara for textual documents was chosen in the context of transcript-based TV program segmentation for two main reasons. First, it is currently one of the best performing methods that make no assumption on a particular domain (no discourse markers, no topic models, etc.). Second, contrary to many methods based on local measures of the lexical cohesion, the global criterion it uses makes it possible to account for the high variability in segment lengths.
In (7), lexical cohesion is considered by means of the probability terms computed independently for each segment $S_i$. As no prior model of the distribution of words for a given segment is available, generalized probabilities are used. Lexical cohesion is therefore measured as the ability of a unigram language model $\Delta_i$, whose parameters are estimated from the words in $S_i$, to predict the words in $S_i$.
The language model $\Delta_i$ is given by

$$P[u; \Delta_i] = \frac{C_i(u) + 1}{z_i}, \quad \forall u \in V_K, \qquad (8)$$

where $V_K$ is the vocabulary of the text, containing $K$ different words, and where the count $C_i(u)$ denotes the number of occurrences of $u$ in $S_i$. The probability distribution is smoothed by incrementing the count of each word by 1. The normalization term $z_i$ ensures that $\Delta_i$ is a probability mass function and, in the particular case of (8), $z_i = K + n_i$, with $n_i$ being the number of word occurrences in $S_i$.
The lexical cohesion of $S_i$ is then measured by the generalized probability

$$\ln P[S_i; \Delta_i] = \sum_{j=1}^{n_i} \ln P[w_j^i; \Delta_i], \qquad (9)$$

where $w_j^i$ denotes the $j$th word in $S_i$. Intuitively, according to (9), lexically consistent segments exhibit higher lexical cohesion values than others as the generalized probability increases as the number of repetitions increases.
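The lexical cohesion measure described above can be sketched as follows, assuming an add-one smoothed unigram model estimated on the segment itself and normalized by the number of word occurrences plus the vocabulary size. The function name `lexical_cohesion` and the toy vocabulary size are hypothetical.

```python
import math
from collections import Counter


def lexical_cohesion(segment, vocab_size):
    """Log generalized probability of a segment.

    segment: list of lemmas; vocab_size: number of distinct words in the
    whole text. The unigram model is estimated on the segment itself with
    add-one smoothing, then used to score the very same segment.
    """
    counts = Counter(segment)
    z = len(segment) + vocab_size  # normalization term
    return sum(math.log((counts[w] + 1) / z) for w in segment)


# A repetitive segment scores higher than one without any repetition.
cohesive = lexical_cohesion(["car", "car", "car", "car"], vocab_size=10)
scattered = lexical_cohesion(["car", "road", "tax", "vote"], vocab_size=10)
print(cohesive > scattered)
```

This only covers the per-segment term; selecting the best sequence of segments according to the global criterion is a separate search problem that is not shown here.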
In this work, the basic units considered were utterances as given by the partitioning step of the ASR system, thus limiting possible topic boundaries to utterance boundaries. Moreover, lexical cohesion was computed on lemmas rather than words, discarding words other than nouns, adjectives, and nonmodal verbs.
4.2. Improved Lexical Cohesion for Spoken TV Contents
As argued previously, confidence measures and semantic relations can be used as additional information to improve the generalized probability measure of lexical cohesion so as to be robust to transcription errors and to the absence of repetitions. We propose extensions of the lexical cohesion measure to account for such information.
4.2.1. Confidence Measures
where $c(w_j^i)$ corresponds to the confidence measure of $w_j^i$. Confidence measures are raised to the power of $\lambda$ in order to reduce the relative importance of words whose confidence measure value is low. Indeed, the larger $\lambda$, the smaller the impact in the total count of terms for which $c(w_j^i)$ is low.
The idea is that word occurrences with low confidence measures contribute less to the measure of the lexical cohesion than those with high confidence measures. In this case, the language model $\Delta_i$ can be either estimated from the counts $C_i(u)$, thus limiting the use of confidence measures to the probability calculation, or from the modified counts $C'_i(u)$.
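A possible reading of this extension is sketched below. It assumes that each word occurrence adds its confidence raised to a power `lam` to the count of its lemma, and that the same quantity weights the log-probability of that occurrence; both forms are assumptions made for illustration, as are the function names. The flag `use_weighted_counts` switches between the two language model estimation options mentioned above.

```python
import math
from collections import Counter, defaultdict


def weighted_counts(segment, lam):
    """Assumed modified counts: each occurrence adds confidence ** lam."""
    counts = defaultdict(float)
    for lemma, conf in segment:
        counts[lemma] += conf ** lam
    return counts


def weighted_cohesion(segment, vocab_size, lam=1.0, use_weighted_counts=True):
    """Confidence-aware lexical cohesion of a segment.

    segment: list of (lemma, confidence) pairs, confidence in [0, 1].
    The language model is estimated either from the modified counts or
    from the plain counts; in both cases each log-probability is weighted
    by confidence ** lam (an assumed form).
    """
    if use_weighted_counts:
        counts = weighted_counts(segment, lam)
    else:
        counts = Counter(lemma for lemma, _ in segment)
    z = sum(counts.values()) + vocab_size
    return sum(
        (conf ** lam) * math.log((counts[lemma] + 1) / z)
        for lemma, conf in segment
    )
```

With all confidence measures equal to 1, or with `lam` set to 0, the sketch reduces to the unweighted add-one smoothed measure.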
4.2.2. Semantic Relations
As mentioned previously, integrating semantic relations in the measure of the lexical cohesion serves a double objective. The primary goal is obviously to ensure that two semantically related words, for example, "car" and "drive", contribute to the lexical cohesion, thus avoiding erroneous topic boundaries between two such words. This is particularly crucial when short segments might occur as they exhibit few vocabulary repetitions, but semantic relations can also limit the impact of recognition errors. Indeed, contrary to correctly transcribed words, misrecognized words are unlikely to be semantically linked to other words in the segment. As a consequence, improperly transcribed words will weigh less in the edges' weights than correct words.
where $r(u, v)$ denotes the semantic proximity of words $u$ and $v$. The semantic proximity is close to 1 for highly related words and null for nonrelated words. Details on the estimation of the semantic relation function $r$ from text corpora are given in the next section.
Modified counts as defined in (12) are used to compute the language model that in turn is used to compute the generalized probability. Clearly, combining confidence measures and semantic relations is possible using confidence measures in the generalized probability computation with a language model including semantic relations and/or replacing $C_i(u)$ by $C'_i(u)$ in (12).
One of the benefits of the proposed technique is that it is by nature robust to domain mismatch. Indeed, in the worst case scenario, semantic relations learnt on a specific domain will leave $C''_i(u)$ unchanged with respect to $C_i(u)$, the relations between any two words of $S_i$ being null. In other words, out-of-domain relations will have no impact on topic segmentation, a property which does not hold for approaches based on latent semantic space or model [29, 30].
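The robustness to domain mismatch can be illustrated with a small sketch. It assumes that the count of a word is increased by the semantic proximity between that word and every other word occurrence in the segment; this additive form, the function name `relation_counts`, and the proximity value 0.8 are assumptions for illustration only.

```python
from collections import Counter


def relation_counts(segment, relations):
    """Counts enriched with semantic relations (assumed additive form).

    segment: list of lemmas.
    relations: dict mapping frozenset({u, v}) to a proximity in [0, 1];
    missing pairs are treated as unrelated (proximity 0).
    """
    counts = Counter(segment)
    enriched = {}
    for u in counts:
        bonus = sum(
            n * relations.get(frozenset((u, v)), 0.0)
            for v, n in counts.items()
            if v != u
        )
        enriched[u] = counts[u] + bonus
    return enriched


segment = ["car", "drive", "tax"]
in_domain = {frozenset(("car", "drive")): 0.8}  # hypothetical proximity
with_relations = relation_counts(segment, in_domain)
without_relations = relation_counts(segment, {})
print(with_relations)
print(without_relations)
```

When no relation applies to the words of a segment, the enriched counts equal the plain counts, so relations learnt on another domain cannot degrade the baseline in this sketch.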
4.3. Experimental Results
Table: Comparison of the news and reports corpora in terms of word repetitions and of confidence measures (columns: average number of repetitions, average confidence measure; rows: one per corpus, including reports on current affairs).
In each show, headlines and closing remarks were removed, these two particular parts disturbing the segmentation algorithm and being easily detectable from audiovisual clues. A reference segmentation was established by considering a topic change associated with each report, the start and end boundaries being, respectively, placed at the beginning of the report's introduction and at the end of the report's closing remarks. Note that, in the news corpus, considering a topic change between each report is a debatable choice as, in most cases, the first reports all refer to the main news of the day and therefore deal with the same broad topic. A total of 1,180 topic boundaries are obtained for the news corpus and 86 for reports. Recall and precision on topic boundaries are considered for evaluation purposes after alignment between reference and hypothesized boundaries, with a tolerance on boundary locations of, respectively, 10 and 30 s for news and reports. Different trade-offs between precision and recall are obtained by varying the free parameter in (7).
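The evaluation protocol can be sketched as below. The alignment procedure is not specified in the text, so the sketch assumes a simple greedy one-to-one alignment of time-sorted boundaries; the function name and the example boundary times are hypothetical.

```python
def boundary_precision_recall(reference, hypothesis, tolerance):
    """Precision and recall of hypothesized topic boundaries.

    reference, hypothesis: boundary times in seconds.
    tolerance: maximum gap in seconds for two boundaries to be aligned
    (10 s for news and 30 s for reports in the text).
    Assumption: greedy one-to-one alignment over sorted boundaries.
    """
    ref = sorted(reference)
    hyp = sorted(hypothesis)
    matched = 0
    i = j = 0
    while i < len(ref) and j < len(hyp):
        if abs(ref[i] - hyp[j]) <= tolerance:
            matched += 1
            i += 1
            j += 1
        elif hyp[j] < ref[i]:
            j += 1  # spurious hypothesis: no reference close enough
        else:
            i += 1  # missed reference boundary
    precision = matched / len(hyp) if hyp else 0.0
    recall = matched / len(ref) if ref else 0.0
    return precision, recall


precision, recall = boundary_precision_recall(
    reference=[100, 200, 300], hypothesis=[95, 260, 305, 400], tolerance=10
)
print(precision, recall)
```

Each reference boundary is matched at most once, so several hypotheses close to the same reference boundary do not inflate precision.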
We first report results regarding confidence measures before considering semantic relations. Finally, both are taken into account simultaneously.
4.3.1. Segmentation with Confidence Measures
Overall, these results demonstrate not only that including confidence measures in the lexical cohesion measure improves topic segmentation of spoken TV contents, but also that the gain obtained thanks to confidence measures is larger when the transcription quality is low. This last point is a key result which clearly demonstrates that adapting text-based NLP methods to the peculiarities of automatic transcripts is crucial, in particular when transcription error rates increase.
4.3.2. Segmentation with Semantic Relations
Words with the highest association scores, in decreasing order, for the word "cigarette", automatically extracted from newspapers articles. Italicized entries correspond to cigarette brand names.
It is important to note that, contrary to many studies on the acquisition of semantic relations, neither type of relation was obtained from thematic corpora. However, they are, to a certain extent, specific to the news domain as a consequence of the data on which they have been obtained and do not reflect the French language in general.
As for the experiments of Section 3 on speech-based program description, these results again show that adapting NLP tools to better interface with ASR is a good answer to robustness issues. However, in spite of the proposed modifications, high transcription error rates are still believed to be detrimental to topic segmentation, and progress is required towards truly genre-independent topic segmentation techniques. Still, the results presented indicate that topic segmentation of spoken contents has reached a level where it can be used as a valuable tool for the automatic delinearization of TV data, provided that its use is limited to specific program genres where reasonable error rates are achieved and where topic segmentation makes sense. This claim is supported by our experience on automatic news delinearization, as illustrated by the NEM Summit demonstration presented in Section 6 or by the Voxalead news indexing prototype presented for the ACM Multimedia 2010 Grand Challenge.
One of the key features of Internet TV diffusion is to enhance navigability by adding links between contents and across modalities. So far, we have considered speech as a descriptor for characterization or segmentation purposes. However, semantic analysis can also be used at the far end of the delinearization process illustrated in Figure 1 to automatically create links between documents. In this section, we exploit a keyword-based representation of spoken contents to create connections between segments resulting from the topic segmentation step or between a segment and related textual resources on the Web. Textual keywords extracted from speech transcripts are used as a pivot semantic representation upon which characterization and navigation functionalities can be automatically built. We propose adaptations of classical keyword extraction methods to account for spoken contents and describe original techniques to query the Web so as to create links.
We briefly highlight the specificities of keyword extraction from transcripts, exploiting the confidence measure weighted tf-idf criterion of Section 3. We then propose a robust strategy to find relations among documents, exploiting keywords and IR techniques.
5.1. Keyword Characterization
We propose the use of keywords to characterize spoken contents as keywords offer compact yet accurate semantic description capabilities. Moreover, keywords are commonly used to describe various multimedia contents such as images in Flickr or videos in portals such as YouTube. Hence, a keyword-based description is a natural candidate for cross-modal link generation.
where $N_\ell$ is the number of words whose corresponding lemma is $\ell$. This biased term frequency is used in (5) for keyword selection. In the navigation experiments presented in the next section, proper names are detected based on the part-of-speech tags and a dictionary, where nouns with no definition in the dictionary are considered as proper names.
Beyond the help provided to users in quickly understanding the content of a segment, this characterization scheme can also be used as a support for linking segments with semantically related documents.
5.2. Hyperlink Generation
In the context of delinearization and Internet TV diffusion, it is of particular interest to create links from TV segments to related resources on the Web. The problem of link generation is therefore to automatically find Web contents related to the segment's topic as characterized by the keywords. We propose a two-step procedure for link generation, where candidate Web pages are first retrieved based on keywords and then filtered using the entire transcript as a description.
5.2.1. Querying the Web with Keywords
Given keywords, contents on the Web can be automatically retrieved using classical Web search engines (Yahoo! and Bing in this work) by deriving one or several queries from the keywords. Creating a single meaningful query from a handful of keywords is not trivial, all the more so when misrecognized words are included as keywords in spite of the confidence measure weighted tf-idf scores. Using several small queries therefore appears more judicious. Another issue is that queries must be precise enough to return topic-related documents, yet not so specific that no document is retrieved. The number of keywords included in a query is a good parameter to handle these constraints. Indeed, submitting overly long queries, that is, queries composed of too many keywords, usually results in few or no hits, whereas using isolated keywords as queries is frequently ineffective since the meaning of many words is ambiguous out of context. Hence, we found that a good query generation strategy consists in building many queries combining subsets of 2 or 3 keywords. Furthermore, in practice, as words are more precise than lemmas when submitting a query, each lemma is replaced by its most frequent inflected form in the transcript of the segment or document considered.
Example of queries formed based on subsets of the 5 best-scored keywords. Queries in bold include at least one misrecognized word.
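The query generation strategy can be sketched as follows in Python. This is only an illustration under stated assumptions: the function names `most_frequent_forms` and `build_queries` are hypothetical, and the sketch simply enumerates every subset of 2 or 3 keywords. For 5 keywords this yields 20 queries, which differs from the 15 queries used in this work, since the exact subset selection rule is not detailed here.

```python
from collections import Counter
from itertools import combinations


def most_frequent_forms(tokens):
    """Map each lemma to its most frequent inflected form.

    `tokens` is a list of (word_form, lemma) pairs taken from the
    transcript of the segment (hypothetical input format).
    """
    forms = {}
    for form, lemma in tokens:
        forms.setdefault(lemma, Counter())[form] += 1
    return {lemma: counts.most_common(1)[0][0] for lemma, counts in forms.items()}


def build_queries(keywords, surface_forms, sizes=(2, 3)):
    """Build one query string per subset of 2 or 3 keyword lemmas.

    Each lemma is replaced by its most frequent inflected form when known.
    """
    queries = []
    for size in sizes:
        for subset in combinations(keywords, size):
            queries.append(" ".join(surface_forms.get(k, k) for k in subset))
    return queries
```

Because each query contains only a few keywords, a single misrecognized keyword contaminates only the queries it appears in, leaving the other queries usable.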
5.2.2. Selecting Relevant Links
The outcome of the querying strategy is a list of documents on the Web (a.k.a. hits), ordered by relevance with respect to the queries. Relevant links are established by finding among these hits the few that best match the topic of a segment, characterized by the entire set of keywords rather than by two or three keywords. Assuming that the relevance of a Web page with respect to a query decreases with its rank in the list of hits, we solely consider the first few results of each query as candidate links. In this work, 7 documents are considered for each of the 15 queries. To select the most relevant links among the candidate ones, the vector space model with tf-idf weights is used. Candidate Web pages are cleaned, converted into regular texts, and represented in the vector space model using tf-idf scores. (A Web page in HTML format is cleaned by pruning the DOM tree based on typographical clues, such as the frequencies of punctuation signs and uppercase characters, the length of sentences, or the number of non-alphanumeric characters, so as to remove irrelevant parts of the document such as menus, advertisements, abstracts, or copyright notifications.) Similarly, a segment's automatic transcript is represented by a vector of modified tf-idf scores as in Section 3. For both the Web pages and the transcript, the weight of proper names is softened as previously explained. The cosine similarity between the segment considered and each candidate Web page is finally used to keep only those candidate links with the highest similarity.
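The filtering step can be illustrated with a minimal Python sketch of tf-idf vectors and cosine similarity. It is a simplification and not the system described here: it uses plain tf-idf with a smoothed idf estimated on the candidate set itself (an assumption made to keep the example self-contained), and it omits the confidence-weighted tf-idf of Section 3 and the softening of proper names. The names `tfidf_vectors`, `cosine`, and `select_links` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (each a list of tokens).

    The idf is smoothed, log((1 + n) / (1 + df)) + 1, so that terms shared
    by all documents keep a nonzero weight (an assumption of this sketch).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_links(segment_tokens, pages, top_k=3):
    """Rank candidate pages ({url: tokens}) by similarity to the segment."""
    urls = list(pages)
    vectors = tfidf_vectors([segment_tokens] + [pages[u] for u in urls])
    segment_vector, page_vectors = vectors[0], vectors[1:]
    scored = [(u, cosine(segment_vector, pv)) for u, pv in zip(urls, page_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

In this sketch `top_k` plays the role of "the few links kept per segment"; its value is an illustrative choice, not one reported in this work.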
We have proposed a domain-independent method to automatically create links between transcripts and Web documents, using a keyword-based characterization of spoken contents. Particular emphasis has been put on robustness to transcription errors, by using modified tf-idf weights for keyword selection, by designing a querying strategy able to cope with erroneous keywords, and by using an efficient filtering technique to select relevant links based on the characterization presented in Section 3.
Though no objective evaluation of automatic link generation has been performed, we observed in the framework of the NEM Summit demonstration described in the next section that the generated links are in most cases very relevant. Moreover, in a different setting, these links were also used to collect data for the unsupervised adaptation of the ASR system language model. Good results obtained on this LM adaptation task are also an indirect measure of the quality of the links generated. Nevertheless, the proposed hyperlink creation method could still be improved. For example, pages could be clustered based on their respective similarities. By doing so, the different topic aspects of a segment could be highlighted and characterized by specific keywords extracted from clustered pages. "Key pages" could also be returned by selecting the centroid of each cluster. The scope of the topic similarity could also be changed depending on the abstraction level desired for the segment characterization. For example, pages reporting the exact event of a segment, instead of pages dealing with the same broad topic, could be returned by reintegrating proper names into keyword vectors.
Finally, let us note that, besides the retrieval of Web pages, the link generation technique proposed here can also be used to infer a structure between segments of the same media (e.g., between a collection of transcribed segments as in the example below). The technique can also be extended to cross-media link generation, assuming a keyword-based description and an efficient cross-modal filtering strategy are provided.
Note that this preliminary demonstration, limited to the broadcast news domain, is intended to illustrate automatic delinearization of (spoken) TV contents and to validate our work on a robust interface between ASR and NLP. Similar demonstrations on broadcast news collections have been developed in the past (see, e.g., [7, 8, 10, 36, 37]), but they mostly rely on genre-dependent techniques. On the contrary, we rely on robust genre- and domain-independent techniques, thus making it possible to extend the concept to virtually all kinds of contents. Moreover, all of the above-mentioned applications lack navigation capabilities other than through a regular search engine.
We briefly describe the demonstration before discussing the quality of the links generated. For lack of objective evaluation criteria, we provide a qualitative evaluation to illustrate the remaining challenges for spoken language processing in the media context.
6.1. Overview of the Hypernews Showcase
The demonstration was built on top of a collection of evening news shows from the French channel France 2 recorded daily over a 1-month period. (Le Journal de 20h, France 2, from Feb. 2, 2007 to Mar. 23, 2007.) After transcription, topic segmentation as described in Section 4 was applied to each show in order to identify segments corresponding to different topics (and hence events in the broadcast news context). Keyword extraction as described in Section 5.1 was applied in order to characterize each of the 553 segments obtained as a result of the segmentation step. Based on the resulting keywords, exogenous links to related Web sites were generated as explained in Section 5.2. Endogenous links between segments, within the collection, were established based on a simple keyword comparison heuristic. (Note that different techniques could have been used for endogenous link generation. In particular, the same filtering technique as for exogenous links could be used. The idea behind a simple keyword comparison was, in the long term, to be able to incrementally add new segments daily, a task which requires highly efficient techniques to compare two segments.)
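Since the keyword comparison heuristic is not specified further, the following Python sketch only shows one plausible form such a heuristic could take: the Jaccard overlap between the keyword sets of two segments, with a threshold. The measure, the threshold value, and the names `keyword_overlap` and `related_segments` are all assumptions made for illustration, not the heuristic actually used in the demonstration.

```python
def keyword_overlap(keywords_a, keywords_b):
    """Jaccard overlap between two keyword lists (assumed similarity measure)."""
    a, b = set(keywords_a), set(keywords_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def related_segments(segments, threshold=0.2):
    """Link segments whose keyword overlap reaches the threshold.

    `segments` maps a segment id to its keyword list. Returns a dict mapping
    each segment id to the ids of related segments, most similar first.
    """
    scored = {segment_id: [] for segment_id in segments}
    ids = list(segments)
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            score = keyword_overlap(segments[first], segments[second])
            if score >= threshold:
                scored[first].append((score, second))
                scored[second].append((score, first))
    return {
        segment_id: [other for _, other in sorted(pairs, key=lambda p: -p[0])]
        for segment_id, pairs in scored.items()
    }
```

Comparing small keyword sets is cheap, which is consistent with the stated goal of adding new segments to the collection incrementally.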
Figure 7(a) illustrates the segmentation step. Segments resulting from topic segmentation are presented as a table of contents for the show with links to the corresponding portions of the video and a few characteristic keywords to provide an overview of each topic addressed. Figure 7(b) illustrates the navigation step where "See also" provides a list of links to related documents on the Web while "Related videos" offers navigation capabilities within the collection.
6.2. Qualitative Analysis
Quantitative assessment of the automatically generated links is a difficult task, and we therefore limit ourselves to a qualitative discussion of the relevance of the generated links. As mentioned in the introduction, we are fully aware of the fact that a qualitative analysis, illustrated with a few selected examples, does not provide the ground for sound scientific conclusions as a quantitative analysis would. However, this analysis gives an idea of the types of links that can be obtained and of the remaining problems.
6.2.1. External Links
It was observed that links to external resources on the Web are mostly relevant and give access to related information. As such links are primarily generated from queries made of a few general keywords that do not emphasize named entities, they point to Web pages containing additional information rather than to Web pages dealing with the same story. (This fact is also partially explained by the time lag between the corpus (Feb.-Mar. 2007) and the date at which the demonstration's links were established (Jun. 2009), as most news articles on the Web regarding the Feb.-Mar. 2007 period had been removed from the news sites in 2009.) Taking the example of the cyclone Gamède which struck the Île de la Réunion in February 2007, illustrated in Figure 7(b), all links are relevant. Several links target sites related to cyclones in general (list of cyclones, emergency rules in case of cyclones, cyclone season, etc.) or sites dedicated to specific cyclones, including the Wikipedia page for cyclone Gamède. Additionally, one link points to a description of the geography and climate of the Île de la Réunion while the least relevant link points to a flood in Mozambique due to a cyclone.
General information links such as those described previously present a clear interest for users and offer the great advantage of not being related to news sites whose content changes at a fast pace. Moreover, the benefit of enriching contents with general-purpose links is not limited to the news domain and applies to many types of programs. For example, in movies or talk shows, users might be interested in having links to documents on the Web related to the topic discussed. However, in the news domain, more precise links to the same story or to similar stories on other media are required, a feature that is not covered by the technique proposed. We believe that accounting for the peculiar nature of named entities in the link generation process is one way of focusing links on very similar contents, while remaining domain independent.
6.2.2. Internal Links
(1) update on the island's situation (Feb. 27),
(2) snow storm and avalanches in France (Feb. 27),
(3) return to normal life after the cyclone and risk of epidemic (Mar. 3),
(4) aftermath of the cyclone (Feb. 28),
(5) damages due to the cyclone (Mar. 2).
The remaining links are mostly irrelevant, apart from two links to other natural disasters. Apart from item 2, the first links are all related to the cyclone and, using the broadcasting dates and times for navigation, one can follow the evolution of the story across time. Note that a finer organization of the collection into clusters and threads [36, 37] is possible, but the notion of threads seldom applies outside of the news domain while link generation on the basis of a few keywords is domain independent. Finally, as for external links, accounting for named entities would clearly improve relevance but possibly also prevent connections to different stories of the same nature, for example, from the cyclone to other natural disasters.
We have presented research work targeting the use of speech for the automatic delinearization of TV streams. To deal with the challenges of ASR transcripts in this context, such as potentially high error rates and domain independence, we have proposed several adaptations of traditional domain-independent information retrieval and natural language processing techniques so as to increase their robustness to the peculiarities of automatically transcribed TV data. In particular, in Section 3, we have proposed a modified tf-idf weighting scheme to exploit noisy transcripts, with error rates varying from 15% to 70%. We also adapted a lexical cohesion criterion and demonstrated that speech can be used for the segmentation of TV streams into topics. Experimental results show that, in spite of high error rates, a "bag-of-words" representation of TV contents can be used as a description for the purposes of information retrieval and navigation. All these results clearly indicate that designing techniques genuinely targeting spoken contents increases the robustness of spoken document processing to transcription errors. This in turn leads us to believe that this philosophy will pave the way towards sufficient robustness for speech to be used as a valuable source of semantic information for all genres of programs.
Clearly, not all aspects of speech-based delinearization have been tackled in this paper and much work is still required in order to make the most of speech transcripts for TV stream processing. We now briefly review the main research directions that we feel are crucial.
First of all, the potential of speech transcription has been assessed solely on a very specific content type, broadcast news, and therefore still needs to be validated on other types of programs such as investigation programs and debates, for which transcription error rates are significantly higher. However, results presented on the validation of the EPG alignment and on the topic segmentation tasks indicate that speech is a valuable source of information to process in spite of transcription errors. In this regard, we firmly believe that a better integration of NLP and ASR (accounting for confidence measures and alternate transcription hypotheses in NLP, incorporating high-level linguistic knowledge in ASR systems, accounting for phonetics in addition to a lexical transcription, and so forth) is crucial to develop robust and generic spoken content processing techniques in the framework of TV stream delinearization. In particular, named entities such as locations, organizations, or proper names play a very particular role in TV contents and should therefore receive particular attention in designing content-based descriptions. However, while acceptable named entity detection solutions exist for textual data, many factors prevent their straightforward use on automatic transcripts. Among those factors are transcription errors and, most of all, the fact that named entities are often not in the vocabulary of the ASR system and hence not recognized (see  for a detailed analysis).
Topic segmentation of TV programs is another point which requires additional research effort. Domain-independent topic segmentation methods such as the one presented in this paper exhibit almost acceptable performance. In fact, in the demonstration, we observed that in most cases, segmentation errors have little impact on the acceptability of the results. Indeed, in a segment where two topics are discussed, keywords will often be a mix of keywords characterizing each of the two topics. This has little impact on our demonstration since only a broad characterization is considered, linking segments and documents from the same broad topic. However, we expect such errors to be a strong limitation as soon as a more detailed description is required. Unless the number of keywords is increased drastically, it will be difficult to precisely characterize a two-topic segment, but significantly increasing the number of keywords will result in more noise and errors in the description. Hence, progress is still required in the topic segmentation domain. Moreover, only linear topic segmentation has been considered so far, whereas most programs clearly exhibit a hierarchical topic structure, depending on the precision one wants on a topic. A typical example is that of the main title in news shows, where several reports tackle different aspects and implications of the main title, each report possibly consisting of different points of view on a particular question. Hierarchical topic segmentation methods and hierarchical content description have seldom been studied and still require significant progress to be operational.
Finally, link generation based on automatically extracted keywords has proved quite efficient but lacks finesse in creating a semantic Web of interconnected multimedia contents, even in the news domain. More elaborate domain-independent techniques to automatically build threads based on speech understanding are still required in spite of the recent efforts in that direction [36, 37]. Moreover, most multimedia documents are by nature multimodal, and modalities other than text (possibly resulting from automatic speech transcription) should be fully exploited. Limiting ourselves to the news domain, image comparison could, for example, be used to link similar contents. Evidently, modalities other than language cannot provide as detailed a semantic description as language can, but we hope that, to a certain extent, they can compensate for errors in ASR and NLP and increase the robustness and precision of automatically generated semantic links. However, many issues remain open in this area, from the construction of threads to the use of multiple modalities for content-based comparison of documents.
From a more philosophical point of view, it is interesting to note that the key goal of topic segmentation is to shift from the notion of stream to that of document, the latter being the segment, in order to fall back on well-known information retrieval techniques which operate at the document level. For example, the very notion of tf-idf is closely related to that of document, and so is the notion of index. Establishing links between contents also strongly relies on the notion of document, as current techniques solely permit the comparison of two documents with well-defined boundaries. However, one can wonder whether the notion of document still makes sense in a continuous stream. Going back to the cyclone example of Section 6.2, it might be interesting to link a particular portion of the report to only a portion of a related document where the latter might contain more than required. The idea of hierarchical topic segmentation is one step in that direction, making it possible to choose the extent of the portion of the stream to be considered, but it might also prove interesting to revisit information retrieval techniques in the light of this reflection and design new techniques not dependent on the notion of document.
The authors are most grateful to Sébastien Campion and Mathieu Ben for their hard work on assembling bits and pieces of research results into an integrated demonstration and for presenting this demonstration during the NEM Summit 2009. This work was partially funded by OSEO in the framework of the Quaero project.
- Zimmermann J, Marmaropoulos G, van Heerden C: Interface design of video scout: a selection, recording, and segmentation system for TVs. Proceedings of the International Conference on Human-Computer Interaction, 2001.
- Liang L, Lu H, Xue X, Tan YP: Program segmentation for TV videos. Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS'05), May 2005 1549-1552.
- Naturel X, Gravier G, Gros P: Fast structuring of large television streams using program guides. In Proceedings of the International Workshop on Adaptive Multimedia Retrieval, 2006, Lecture Notes in Computer Science. Volume 4398. Edited by: Marchand-Maillet S, Bruno E, Nürnberger A, Detyniecki M. Springer; 223-232.
- Manson G, Berrani SA: Automatic TV broadcast structuring. International Journal of Digital Multimedia Broadcasting 2010, 2010.
- Xie L, Xu P, Chang SF, Divakaran A, Sun H: Structure analysis of soccer video with domain knowledge and hidden Markov models. Pattern Recognition Letters 2004, 25(7):767-775. 10.1016/j.patrec.2004.01.005
- Kijak E, Gravier G, Oisel L, Gros P: Audiovisual integration for tennis broadcast structuring. Multimedia Tools and Applications 2006, 30(3):289-311. 10.1007/s11042-006-0031-5
- Merlino A, Morey D, Maybury M: Broadcast news navigation using story segmentation. Proceedings of the 5th ACM International Multimedia Conference, November 1997 381-389.
- Maybury MT: Broadcast news navigator (BNN) demonstration. Proceedings of the International Joint Conferences on Artificial Intelligence, 2003.
- Ohtsuki K, Bessho K, Matsuo Y, Matsunaga S, Hayashi Y: Automatic multimedia indexing: combining audio, speech, and visual information to index broadcast news. IEEE Signal Processing Magazine 2006, 23(2):69-78.
- Dowman M, Tablan V, Cunningham H, Ursu C, Popov B: Semantically enhanced television news through web and video integration. Proceedings of the Multimedia and the Semantic Web, Workshop of the 2nd European Semantic Web Conference, 2005.
- Miyamori H, Tanaka K: Webified video: media conversion from TV programs to Web content for cross-media information integration. In Proceedings of the International Conference on Database and Expert Systems Applications, 2005, Lecture Notes in Computer Science. Volume 3588. Edited by: Andersen IV, Debenham JK, Wagner R, Detyniecki M. Springer; 176-185.
- Hauptmann A, Baron R, Chen M-Y, et al.: Informedia at TRECVID 2003: analyzing and searching broadcast news video. Proceedings of the Text Retrieval Conference, 2003.
- Law-To J, Grefenstette G, Gauvain JL: VoxaleadNews: robust automatic segmentation of video into browsable content. Proceedings of the 17th ACM International Conference on Multimedia (MM '09), October 2009 1119-1120.
- Deshmukh N, Ganapathiraju A, Picone J: Hierarchical search for large-vocabulary conversational speech recognition. IEEE Signal Processing Magazine 1999, 16(5):84-107. 10.1109/79.790985
- Ney HJ, Ortmanns S: Dynamic programming search for continuous speech recognition. IEEE Signal Processing Magazine 1999, 16(5):64-83. 10.1109/79.790984
- Jiang H: Confidence measures for speech recognition: a survey. Speech Communication 2005, 45(4):455-470. 10.1016/j.specom.2004.12.004
- Huet S, Gravier G, Sébillot P: Morpho-syntactic post-processing of N-best lists for improved French automatic speech recognition. Computer Speech and Language 2010, 24(4):663-684. 10.1016/j.csl.2009.10.001
- Galliano S, Gravier G, Chaubard L: The ESTER 2 evaluation campaign for the rich transcription of French radio broadcasts. Proceedings of the Annual Conference of the International Speech Communication Association, 2009.
- Naturel X, Berrani SA: Content-based TV stream analysis techniques toward building a catch-up TV service. Proceedings of the 11th IEEE International Symposium on Multimedia (ISM '09), December 2009 412-417.
- Guinaudeau C, Gravier G, Sébillot P: Can automatic speech transcripts be used for large scale TV stream description and structuring? Proceedings of the International Workshop on Content-Based Audio/Video Analysis for Novel TV Services, 2009.
- Salton G: Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Longman, Reading, Mass, USA; 1989.
- Mamou J, Carmel D, Hoory R: Spoken document retrieval from call-center conversations. Proceedings of the 29th Annual International Conference on Research and Development in Information Retrieval (SIGIR '06), August 2006 51-58.
- Allan J: Topic Detection and Tracking: Event-Based Information Organization, The Information Retrieval Series. Volume 12. Kluwer Academic, Boston, Mass, USA; 2002.
- Hearst MA: TextTiling: segmenting text into multi-paragraph subtopic passages. Computational Linguistics 1997, 23(1):33-64.
- van Mulbregt P, Carp I, Gillick L, Lowe S, Yamron J: Segmentation of automatically transcribed broadcast news text. Proceedings of the DARPA Broadcast News Workshop, 1999.
- Beeferman D, Berger A, Lafferty J: Statistical models for text segmentation. Machine Learning 1999, 34(1-3):177-210.
- Christensen H, Kolluru B, Gotoh Y, Renals S: Maximum entropy segmentation of broadcast news. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), March 2005 I1029-I1032.
- Utiyama M, Isahara H: A statistical model for domain-independent text segmentation. Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2001.
- Choi FYY, Wiemer-Hastings P, Moore J: Latent semantic analysis for text segmentation. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2001 109-117.
- Misra H, Yvon F, Jose JM, Cappé O: Text segmentation via topic modeling: an analytical study. Proceedings of the ACM 18th International Conference on Information and Knowledge Management (CIKM '09), November 2009 1553-1556.
- Fayolle J, Moreau F, Raymond C, Gravier G, Gros P: CRF-based combination of contextual features to improve a posteriori word-level confidence measures. Proceedings of the Annual International Speech Communication Association Conference (Interspeech '10), 2010.
- Daille B: Study and implementation of combined techniques for automatic extraction of terminology. In The Balancing Act: Combining Symbolic and Statistical Approaches to Language. Edited by: Resnik P, Klavans JL. MIT Press, Cambridge, Mass, USA; 1996:49-66.
- Guinaudeau C, Gravier G, Sébillot P: Improving ASR based topic segmentation of TV programs with confidence measures and semantic relations. Proceedings of the Annual International Speech Communication Association Conference (Interspeech '10), 2010.
- Law-To J, Grefenstette G, Gauvain J-L, Gravier G, Lamel L, Despres J: VoxaleadNews: robust automatic segmentation of video content into browsable and searchable subjects. Proceedings of the International Conference on Multimedia (MM '10), 2010.
- Lecorvé G, Gravier G, Sébillot P: An unsupervised web-based topic language model adaptation method. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), April 2008 5081-5084.
- Ide I, Mo H, Katayama N, Satoh S: Topic threading for structuring a large-scale news video archive. Proceedings of the International Conference on Image and Video Retrieval, 2004.
- Wu X, Ngo C-W, Li Q: Threading and autodocumenting news videos: a promising solution to rapidly browse news topics. IEEE Signal Processing Magazine 2006, 23(2):59-68.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.