Content based video retrieval provides fast and accurate content-based access to video data. Technologies for capturing, refining and transferring video content have advanced over the years, but retrieving video data by content still lacks efficiency: it requires more than simply connecting users to video databases and fetching information over networks. To address these problems, we segment a video file into frames depicting different shots and scenes, and implement key frame extraction, extracting key frames from the derived frames of the video for efficient content based retrieval of video data.
CBVR-Content Based Video Retrieval
CBIR-Content Based Image Retrieval
Content Based Video Retrieval is a relatively new search topic. It has its roots in information retrieval, as it aims at retrieving information rather than data. A Content Based Video Retrieval system interprets a collection of video data and ranks the items according to their relevance to the query. The fundamentals of CBVR consist of segmenting or indexing video data to obtain organised media that can be analysed step by step. It is often hard to find the appropriate video content you are searching for on the web, or to retrieve the particular portion of a video that is of interest. CBVR deals with such analysis: it reduces the user's manual effort by extracting the video data the user is concerned with in an automated system that is less time consuming, more systematic and more effective.
Various search engines apply different strategies and algorithms to find or segregate video data. Data Mining is also crucial to CBVR.
The essence of CBVR lies in structuring the video data based on its content: portions and sub-portions are extracted so that the video data in question can be better analysed. For this purpose, the video sequence must be partitioned into scenes, shots or frames. Once you are dealing with images, the analysis for retrieving videos or a part of a video becomes easier, as the principles of Content Based Image Retrieval (CBIR) can be applied to individual frames. This gives deeper insight into the videos concerned and helps us find patterns, which can then be applied to various applications such as CCTV surveillance, highlights generation in sports events, quick news feeds and documentaries, anomaly detection in videos, etc.
All this calls for Video Data Management.
VIDEO DATA MANAGEMENT
In a CBVR system, managing the video data is very important for fast and accurate content access. The process of Video Data Management is divided into three sub-processes, namely Video Parsing, Video Abstraction and Summarization, and Video Indexing and Retrieval. Let us look at these sub-processes individually.
1. VIDEO PARSING
Just as a long text is broken down into smaller entities such as paragraphs, sentences, words and letters, a long video stream can be broken down into smaller, more convenient pieces. This technique of splitting video data into smaller units is known as 'Video Parsing'.
These units are managed hierarchically across five levels: key-frame, shot, group, scene and video data. The amount of data at a level shrinks as its granularity becomes finer, so a key-frame carries the least data and the full video the most.
Fig: Structure of Video
The basic architectural unit is the shot, an array of frames representing an action extended in time. The frame that represents the entire shot is called the 'key-frame'. The video data is divided into shots by a technique known as segmentation, used in video shot detection. Different shots are joined together to form a group, different groups make up a scene, and the scenes joined together constitute the entire video data.
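The shot segmentation step can be illustrated with a simple histogram-difference test between consecutive frames. This is only a sketch on synthetic grayscale frames; the function names, bin count and threshold are our own assumptions, not part of any system described here.

```python
import numpy as np

def norm_hist(frame, bins=16):
    """Normalized grayscale histogram of a single frame."""
    h, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return h / h.sum()

def detect_shot_boundaries(frames, threshold=0.5):
    """Mark a shot boundary wherever consecutive histograms differ strongly."""
    boundaries = []
    for i in range(1, len(frames)):
        if np.abs(norm_hist(frames[i]) - norm_hist(frames[i - 1])).sum() > threshold:
            boundaries.append(i)
    return boundaries

# Two synthetic "shots": three dark frames followed by three bright frames.
dark = [np.full((4, 4), 20, dtype=np.uint8) for _ in range(3)]
bright = [np.full((4, 4), 220, dtype=np.uint8) for _ in range(3)]
print(detect_shot_boundaries(dark + bright))  # → [3]
```

A real detector would run on decoded video frames and tune the threshold empirically against the footage at hand.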
2. VIDEO ABSTRACTION AND SUMMARIZATION
Video Abstraction and Summarization refers to a short illustration of the original video which can be used in the later sub-processes for video indexing, retrieval, cataloguing, etc. There are two main approaches for Video Abstraction and Summarization. They are Key Frame Abstraction and Highlighting Sequences.
In Key Frame Extraction, the key frames that best depict the contents of a shot are extracted from its frames. The easiest way of extracting a key frame is to use the first, the last or the middle frame of the shot. Key Frame Extraction is discussed in detail further on.
Highlighting Sequences is also referred to as 'Video Skimming' or 'Video Summaries'. It aims at condensing a long sequence of data into a short one. The 'InforMedia Project' is a successful application of this approach: textual and visual content information is taken as input and merged to find the video sequence that highlights the important contents of the video.
3. VIDEO INDEXING AND RETRIEVAL
Video Indexing and Retrieval is necessary for efficient and effective handling of video data. It can be performed based on representative frames, motion information, or objects.
3.1 BASED ON REPRESENTATIVE FRAMES
The most common way of creating an index is by using a representative frame that depicts each shot. Now from the selected representative frame, different features are extracted and indexed based on colour, shape, texture, etc.
Any frame can be selected as the representative frame if the shots are static. Otherwise two issues arise: how many frames should be selected from each shot, and which frames should be selected? There are three methods that address the first issue.
• We select one frame per shot. This method does not take the shot's length or content changes into account.
• We select several frames per shot, with the number of frames depending on the length of the shot. This method still does not handle content changes properly.
• We divide the shot into sub-shots and select one representative frame from each sub-shot. This method takes both length and content changes into account.
The second issue is how to select the frames. There are four methods to solve it.
• We select the first frame of the shot as the representative frame, since it has been observed that the first frame usually describes the whole segment.
• We select the average frame as the representative frame. In the average frame, each pixel is the average of the pixel values at the same grid point across all frames of the segment. The frame within the segment most similar to the average frame is then selected as the representative frame.
• We select the average-histogram frame as the representative frame. The histograms of all the frames are averaged, and the frame within the segment whose histogram is closest to the average histogram is selected as the representative frame.
• Each frame is divided into background and foreground objects: the area where the primary object resides is the foreground and the area outside it is the background. A large background is constructed from the backgrounds of all the frames, the main foreground objects are superimposed on it, and the result is used as the representative frame.
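The average-histogram method above can be sketched as follows. This is a minimal illustration on synthetic grayscale frames; the function names are our own, not from the original description.

```python
import numpy as np

def norm_hist(frame, bins=16):
    """Normalized grayscale histogram of one frame."""
    h, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return h / h.sum()

def representative_frame_index(frames):
    """Index of the frame whose histogram is closest to the average histogram."""
    hists = np.array([norm_hist(f) for f in frames])
    avg = hists.mean(axis=0)                 # average histogram of the segment
    dists = np.abs(hists - avg).sum(axis=1)  # L1 distance of each frame to the average
    return int(np.argmin(dists))

# Three similar frames and one outlier: a majority frame is chosen.
frames = [np.full((4, 4), 100, dtype=np.uint8)] * 3 + [np.full((4, 4), 250, dtype=np.uint8)]
print(representative_frame_index(frames))  # → 0
```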
3.2 BASED ON MOTION INFORMATION
Another way of creating an index is by using the motion content. Motion content measures the total amount of motion, i.e. the action, within a given video. For example, a video of a person talking on a phone has small motion content, while a video of a violent car explosion has high motion content.
Motion information consists of motion uniformity, a measure of the smoothness of the motion within a video with respect to time, and motion panning, a measure that captures panning (left-to-right or right-to-left motion of a camera).
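A crude proxy for motion content is the mean absolute difference between consecutive frames; a static shot scores zero and fast-changing footage scores high. This is our own simplification for illustration, not the measure used by any cited system.

```python
import numpy as np

def motion_content(frames):
    """Average per-pixel change between consecutive frames; 0 for a static shot."""
    diffs = [np.abs(frames[i].astype(int) - frames[i - 1].astype(int)).mean()
             for i in range(1, len(frames))]
    return float(np.mean(diffs))

static = [np.full((4, 4), 50, dtype=np.uint8)] * 4                        # phone-call-style shot
action = [np.full((4, 4), 50 + 40 * i, dtype=np.uint8) for i in range(4)]  # high-action shot
print(motion_content(static), motion_content(action))  # → 0.0 40.0
```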
3.3 BASED ON OBJECTS
Another way of creating an index is based on objects. Any given scene in a video is a complex organization of parts, or objects. The content of the scene is defined by the location and physical parameters of the objects as well as their interactions with one another. Object based retrieval techniques identify the different objects and the inter-relationships amongst them. In a video sequence an object moves as a whole, so we can group pixels that move together to identify that object.
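Grouping pixels that move together can be sketched by thresholding the frame difference and labeling connected components of the moving-pixel mask. This is a minimal sketch with our own names and thresholds, assuming grayscale frames; it is not the method of any particular system.

```python
import numpy as np
from collections import deque

def moving_objects(prev, curr, diff_thresh=30):
    """Label groups of pixels that move together (4-connected components)."""
    mask = np.abs(curr.astype(int) - prev.astype(int)) > diff_thresh
    labels = np.zeros(mask.shape, dtype=int)
    next_label = 0
    for y, x in zip(*np.nonzero(mask)):
        if labels[y, x]:
            continue                      # pixel already assigned to an object
        next_label += 1
        labels[y, x] = next_label
        queue = deque([(y, x)])
        while queue:                      # flood-fill the connected moving region
            cy, cx = queue.popleft()
            for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = next_label
                    queue.append((ny, nx))
    return labels, next_label

prev = np.zeros((5, 5), dtype=np.uint8)
curr = prev.copy()
curr[0:2, 0:2] = 200   # one moving blob
curr[4, 4] = 200       # a second, disconnected blob
labels, count = moving_objects(prev, curr)
print(count)  # → 2
```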
STRATEGIES FOR RETRIEVAL
The retrieval phase uses video segments as the concerned data, and queries are performed on them. The work presented in 'Fast Video Retrieval via the Statistics of Motion Within the Regions-of-Interest' addresses the important issue of quickly retrieving semantic information from a vast multimedia database; it proposes algorithms to retrieve, from a video database, the videos that contain a requested object motion.
For efficient Content Based Video Retrieval we make use of many algorithms. Some of those algorithms are as follows:
• Key Frame Extraction Algorithm (Multimodal Content Based Browsing)
• Scene-Cut Detection Algorithm / Scene Matching
• Character Identification
• Semantic Video Retrieval Technique
• Semantically Meaningful Summaries
• Color Texture Classification
• GLCM Texture Feature Extraction
• Gabor Texture Feature
• Histogram Based Range Finder Indexing Algorithm
• Superficial Similarity Algorithm
Of these algorithms, the Key Frame Extraction Algorithm is explained in detail in this paper.
Key Frame Extraction Algorithm:
A key frame is a single still image in a video sequence that occurs at an important point in that sequence. For example, in a video of a swinging cricket bat, the bat at rest would be one key frame, and the bat at the end of its swing would be another. In this approach, key frames are extracted from all the frames of a video on the basis of color, texture, shape, etc.
Example of key frames:
The pictures below show the key frames of a parking-lot gate: only the frames in which a car enters or leaves the gate appear. Frames in which there is no car are not considered key frames.
Key Frame Extraction can be done based on:
• Shot Boundary
• Color Histogram
• Sampling
• Clustering
1. Based on Shot Boundary
A shot is a sequence of frames captured by a single camera in one continuous take. It is the most basic unit of video data. A scene is a logical grouping of shots into a semantic unit.
Key frame extraction based on shot boundaries is the easiest technique: the first and last frames of each shot are taken as the key frames. The technique is easy to understand and operate, and is most suitable for shots with still content. Its disadvantage is that it limits the number of key frames, and the extracted key frames may not represent the shots well: the first and last frames are often not stable and do not capture the major visual content, and the complexity of the video content is not considered.
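Given the indices at which new shots start, this technique reduces to collecting the first and last frame index of each shot. A minimal sketch (the helper name is our own):

```python
def first_last_keyframes(n_frames, boundaries):
    """Key frame indices: the first and last frame of each shot.

    `boundaries` lists the frame indices where a new shot begins."""
    starts = [0] + list(boundaries)
    ends = [b - 1 for b in boundaries] + [n_frames - 1]
    return sorted({i for s, e in zip(starts, ends) for i in (s, e)})

# 10 frames with shots 0-3, 4-6 and 7-9.
print(first_last_keyframes(10, [4, 7]))  # → [0, 3, 4, 6, 7, 9]
```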
2. Based on Color Histogram
A color histogram is a representation of the distribution of colors in an image. A color histogram for digital images represents the number of pixels that have colors in each of a fixed list of color ranges that span the image’s color space, the set of all possible colors. The abscissa denotes the color range and the ordinate denotes the number of pixels.
The color histogram technique represents the color feature properly. The color histogram is calculated as

H(i) = ni / N,    i = 1, 2, ..., k

where N is the total number of pixels, ni is the number of pixels with color i, and k is the number of color values in the histogram. The color histogram of an image M is then the vector H(M) = (h1, h2, ..., hk).
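The formula above can be computed directly. Below is a sketch for a grayscale image whose 0-255 pixel values are quantized into k color values; the quantization scheme is our own choice for illustration.

```python
import numpy as np

def color_histogram(image, k=8):
    """H(i) = n_i / N over k quantized color values."""
    bins = (image.astype(int) * k) // 256          # quantize 0..255 into k values
    counts = np.bincount(bins.ravel(), minlength=k)  # n_i for each color value
    return counts / image.size                     # divide by N; entries sum to 1

img = np.array([[0, 255], [128, 128]], dtype=np.uint8)
print(color_histogram(img, k=2))  # → [0.25 0.75]
```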
3. Based on Sampling
In this technique, a time interval T is set, frames are extracted at every interval, and these frames are taken as the key frames. The main disadvantage is that a fixed interval ignores the content: it can produce redundant key frames where little changes, and it can miss important frames that fall between sampling points.
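Sampling-based extraction amounts to slicing the frame list at a fixed interval; the example below (with stand-in frame numbers of our own) also shows how frames between sampling points are never selected.

```python
def sample_keyframes(frames, interval):
    """Take every `interval`-th frame, starting from the first."""
    return frames[::interval]

frames = list(range(10))            # stand-in for 10 decoded frames
print(sample_keyframes(frames, 4))  # → [0, 4, 8]; frames 1-3, 5-7 and 9 are never chosen
```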
4. Based on Clustering
Clustering is a technique in which elements are grouped together based on similarity criteria, without any advance knowledge of class definitions. The similarity of two frames is defined as the similarity of their visual content, which can be color, texture, shape, etc. In this technique, frames are clustered into classes. First, an initial class center is determined. Whether the current frame belongs to a particular class is then decided by the distance between the class center and the current frame; if the distance is too large, the frame becomes a new class center. These class centers are taken as the key frames. This technique is used to classify the frames of a shot.
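The thresholded clustering described above can be sketched over per-frame feature vectors such as color histograms. The distance measure, threshold and names below are our own assumptions for illustration.

```python
import numpy as np

def cluster_keyframes(features, threshold):
    """Class centers (frame indices) under threshold-based clustering."""
    centers = [0]                    # the first frame seeds the first class
    for i in range(1, len(features)):
        dists = [np.abs(features[i] - features[c]).sum() for c in centers]
        if min(dists) > threshold:   # far from every class: start a new class
            centers.append(i)
    return centers

# Four frames whose feature vectors form two visual groups.
features = np.array([[0.0], [0.1], [5.0], [5.1]])
print(cluster_keyframes(features, threshold=1.0))  # → [0, 2]
```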
In designing these applications and products in the area of multimedia content analysis, we must keep the user in mind at all times, because users at different levels view a technology’s usefulness differently. We can broadly classify users into two extremes:
• Nontechnical consumers
• Trained, technical, professional corporate users who regularly use the products
Nontechnical consumers are usually the end users of the designed system; they work through an interface that a technician or developer has built to ease their interaction with the system. A nontechnical user may rely on manuals, guides or instruction pamphlets to learn the system's functionality and how to use it. Such consumers may not use these products regularly either.
Example: A security officer inspecting CCTV surveillance footage may need only the stills, out of the entire video sequence, in which a car or visitor enters the gate, rather than going through the whole footage, which would be time consuming. The officer may not know how these particular stills of interest are segregated from the entire video, but the system allows the footage to be inspected efficiently.
Trained, technical, professional corporate users
For technology-savvy users working with content analysis, indexing, searching, and authoring tools on a daily basis, it makes sense to design systems that require more user sophistication. These users frequently need such systems for scientific, economic, entertainment and other types of research, so the systems must be modified and updated to better versions at short intervals to stay useful.
Example: Major news agencies and TV broadcasters own large video archives. If we develop automated indexing, analysis, and search products for these applications, it’s conceivable to have trained individuals to retrieve and access required multimedia information, much the same way as today’s trained professional librarians and information specialists retrieve information based on textual data.
Under these circumstances, we can expect the operator or user to search images via textures, color histograms, or other low-level feature analysis that we wouldn’t expect a typical consumer to be able or willing to cope with.
Activities that involve generating or using large volumes of video and multimedia data are prime candidates for taking advantage of video-content analysis techniques. The applications are as follows:
• Professional Applications
• Educational Applications
• Consumer Domain Applications
• Automated authoring of Web content
Media organizations and TV broadcasting companies are interested in showcasing their news and related entertainment information over the internet. Analysis of already composed video content through image and video understanding, speech transcription, and linguistic processing can serve to create alternative presentations of the information suitable for the Web. We can also have automated systems for collecting and archiving the various video files into binary libraries. An example is Pictorial Transcripts.
Pictorial Transcripts uses video and text analysis techniques to convert closed-captioned video programs to Hypertext Markup Language (HTML) presentations with still frames containing the visual information, accompanied by text derived from the closed captions.
• Search and Browsing Large Video Files
This facilitates efficient and effective use of resources for internal use. Major news agencies and TV broadcasters own large archives of video accumulated over many years. Besides the producers, others outside the organization use footage from these archives to meet various needs. These large archives usually exist on numerous different storage media, ranging from black-and-white film to magnetic-tape formats. Video segmentation into static images is used: we can browse these images to spot information, use image-similarity searches to find shots with similar content, and use motion analysis to categorize the video segments. Audio and speech event detection can make the process faster.
• Easy access to educational material
It facilitates turning small libraries that contain a small number of books and multimedia sources into ones with immediate access to every book, audio program, video program, and other multimedia educational material. It also gives students access to large data resources without even leaving the class.
• Indexing and Archiving Multimedia Presentations
Existing video compression and transmission standards have made it possible to transmit presentations to remote sites. We can then store these presentations for on-demand replay. Different media components of the presentation can be processed to characterize and index it.
• Indexing and Archiving Multimedia Collaborative Sessions
Communication networks give people the ability to work together despite geographic distances. The multimedia collaborative sessions involve real-time exchange of visual, textual, and auditory information.
CONSUMER DOMAIN APPLICATION
The widest audience for video-content analysis is consumers.
We all have video content pouring in through broadcast TV and cable, and as consumers we own unlabeled home videos and recorded tapes.
Consumer devices can use many additional features for video cataloging, advanced video control, personalization, profiling, and time-saving functions. Information filtering functions for converging PCs and TVs will add value to video applications beyond digital capture, playback, and interconnect.
• Video Overview and Access
• Video Content Filtering
• Enhanced Access to Broadcast Video
This paper presents the main areas involved in Content Based Video Retrieval. The main goal of a CBVR system is to provide an efficient and easy way for users to view video data described by its content. A CBVR system must be capable of handling diverse video content and must offer numerous mechanisms for finding different types of content. CBVR is not yet a mature technology. Video streaming will become mainstream in the coming years, so an efficient CBVR system is needed. Content Based Video Retrieval requires a combined approach spanning image processing, video indexing, content querying, etc. It is an evolving area of research and development, and it is desirable to develop different efficient CBVR systems so that users can make efficient use of content based video data.
• B. V. Patel and B. B. Meshram, 'Content Based Video Retrieval Systems', Shah & Anchor Kutchhi Polytechnic, Chembur, Mumbai, India, and Computer Technology Department, Veermata Jijabai Technological Institute, Matunga, Mumbai, India.
• Nevenka Dimitrova (Philips Research), Hong-Jiang Zhang (Microsoft Research), Behzad Shahraray (AT&T Labs Research), Ibrahim Sezan (Sharp Laboratories of America), Thomas Huang (University of Illinois at Urbana-Champaign) and Avideh Zakhor (University of California at Berkeley), 'Applications of Video Content Analysis and Retrieval'.
• Xiang Ma, Xu Chen, Ashfaq Khokhar and Dan Schonfeld, 'Content Based Video Retrieval, Classification and Summarization: The State-of-the-Art and the Future'.
• Stan Sclaroff, 'CBIVR: Content-Based Image and Video Retrieval'.
• 'Content-Based Image and Video Retrieval', Dagstuhl Seminar 02021, Schloss Dagstuhl, Wadern, Germany, 6-11 January 2002. http://www.dagstuhl.de/DATA/Reports/02021.
• M. Petkovic, 'Content-Based Video Retrieval', Centre for Telematics and Information Technology, University of Twente.
• Xi Li, Weiming Hu, Zhongfei Zhang, Xiaoqin Zhang and Guan Luo (2008), 'Trajectory-Based Video Retrieval Using Dirichlet Process Mixture Models', 19th British Machine Vision Conference (BMVC).
• Egon L. van den Broek, Peter M. F. Kisters and Louis G. Vuurpijl (2004), 'Design Guidelines for a Content-Based Image Retrieval Color-Selection Interface', ACM Dutch Directions in HCI, Amsterdam.