- Object Detection and Tracking in Video Using Deep Learning Techniques: A Review
- Challenges in Video Tracking
- Fundamentals of Object Tracking
- Object Representation
- Shape Representation
- Appearance Representation
- Object Detection
- Frame Differencing
- Optical Flow
- Background Subtraction
- Object Classification Method
Object Detection and Tracking in Video Using Deep Learning Techniques: A Review
Computer vision is a broad area that describes how a machine is able to recognize data available in images or scenes. It replicates what human intelligence can do with the human visual system [1-3]. The multi-level data from the real world can be extracted using the following sequence of steps: capturing, processing, analyzing and understanding [4-7]. Object tracking [8, 9] has the following processes: object representation, object detection, and object tracking.
This chapter gives an overview of object tracking methods and their fields of application. It provides the fundamental concepts needed to develop an object tracker. Different sources of image or video can be used, such as surveillance video, views from multiple cameras, and medical scanners. The system can be developed using various theories and models.
In this chapter, Section 3.1 gives an introduction. Section 3.2 focuses on challenges in visual tracking. Section 3.3 contains a detailed study of several fundamentals of object tracking methods. Section 3.4 gives a detailed study of feature extraction methods. Section 3.5 discusses various object classification methods. Section 3.6 focuses on various object tracking methods. Section 3.7 provides an introduction to machine and deep learning techniques. Section 3.8 discusses the results. Sections 3.9 and 3.10 contain the conclusion, future scope, and references.
Challenges in Video Tracking
Video tracking follows moving objects (one or many) viewed by a camera over a period of time. One major issue is clutter: often the object region of interest is similar to its background, or it may be hidden by another object present in the scene. The presence of an object in the scene can cause problems for the reasons summarized in Figure 3.1.
FIGURE 3.1 Challenges in visual tracking.
Some of the assumptions made during the tracking process are:
- • Object motion is even.
- • No sudden appearance variations take place.
Novel tracking methods can readily handle problems such as objects leaving the scene and drift. The following properties are necessary to build a complete tracking model:
i. Robustness
The tracker is able to track the object even under difficult conditions. These conditions are mainly caused by cluttered backgrounds, changes in the lighting conditions, occlusions, or complicated object motion.
ii. Adaptability
In addition to changes in the environment, the object itself can undergo changes, which requires a good adaptation technique in the tracking model.
iii. Real-time processing
Live video requires high-speed processing methods to track the object. The performance of the algorithms mainly depends on the motion of the object. The algorithm has to run at 15 frames per second to maintain good-quality output video.
Fundamentals of Object Tracking
Object tracking [8, 9] has the following processes: object representation, object detection, and object tracking.
Some applications of object tracking are tracing particular people in a video frame for video surveillance, tracking land-dwelling objects, or using satellite data for astral studies. Object selection depends on the application. If the application is traffic surveillance, then the object of interest may be a human, building, or car. For satellite applications the objects may be planets, whereas for gaming the object of interest may be the human face. Figure 3.2 shows the region of interest in a video.
Object Representation
Keeping track of an object requires a pre-processing technique that converts the data into a form the computer can process. Shape and appearance form the basis, but extracted features are also used for object representation. Additional parameters that need to be included in object representation are application domain, persistence, and goals.
The representation determines the selection of the best algorithm [10, 12]. In simple terms, object representation depends on a shape representation, an appearance representation, or both. Different shape representation methods are described in detail below.
Shape Representation
The shape of an object can be represented using several methods, which also require some calculations for locating and tracing the object. It is essential to know the advantages and disadvantages of all the techniques because not every method is appropriate for every application.
FIGURE 3.2 Objects of interest in video tracking: (left) group of people, (right) face of a single person.
FIGURE 3.3 Shape representation techniques.
The general shape representations, illustrated in Figure 3.3, are discussed below:
i. Points
Single (Figure 3.3(a)) or multiple points (Figure 3.3(b)) can be suitable for representing the objects.
These points play a vital role when the object is tracked in an image. The scenario can range from tracking a single object to tracking many objects present in the scene, in which case the objects can interact with each other. These interactions may cause errors. For simple and small objects, the point method is very suitable.
ii. Geometric shapes
This representation uses elementary geometric shapes (Figures 3.3(c) and 3.3(d)). It is suitable for both fixed and moving objects. The nature of moving objects is very complex, so the most similar parts of the objects are included in the shape template.
iii. Silhouette and contour
This technique uses a skeleton or border (Figure 3.3(g), Figure 3.3(h)) to represent an object. Within that border it also uses the enclosed region (Figure 3.3(i)). It makes the representation easier and can be used to represent both flexible and non-flexible objects. This model is able to accommodate a wide range of object forms and changes in them.
iv. Articulated shape models
Various parts of the object can be combined to form an articulated object. For example, the parts of the human body shown in Figure 3.3(e) are used during representation. Elliptical shapes are used to represent the parts of the object.
v. Skeletal model
A skeleton (Figure 3.3(f)) of the object can be extracted from the object outline. This is possible with the medial axis transform. This method is widely used in object recognition, but not in object tracking.
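Several of the shape representations above can be derived from a binary object mask. The following is a minimal sketch, assuming the object has already been segmented into a boolean NumPy array. The function name `shape_representations` and the dictionary keys are hypothetical and are not taken from the chapter.

```python
import numpy as np

def shape_representations(mask):
    """Derive simple shape representations from a boolean object mask.

    Assumption: `mask` is a 2-D array that is True on object pixels.
    """
    mask = np.asarray(mask, dtype=bool)
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        raise ValueError("mask contains no object pixels")

    # Point representation: centroid (row, col) of the object pixels.
    centroid = (float(rows.mean()), float(cols.mean()))

    # Geometric shape: inclusive bounding box (top, left, bottom, right).
    bbox = (int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max()))

    # Contour: object pixels with at least one 4-connected background neighbour.
    padded = np.pad(mask, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    contour = mask & ~interior

    # Silhouette: the full object region enclosed by the contour.
    return {"centroid": centroid, "bbox": bbox,
            "contour": contour, "silhouette": mask}

# Usage: a 4x5 rectangular object inside a 10x10 frame.
frame_mask = np.zeros((10, 10), dtype=bool)
frame_mask[3:7, 2:7] = True
rep = shape_representations(frame_mask)
```

The centroid suits small objects, while the bounding box and contour carry progressively more shape detail at a higher computational cost.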
Appearance Representation
There are a number of methods available to express an object by its appearance. Some of the most widely used representations are explained below.
i. Estimation of probability densities of object
Probability density functions (PDFs) describe how the values of a random variable are distributed. In tracking, they are used to model the appearance features, such as color or texture, of the image region inside the object, which is identified with the help of a shape model. The PDF can be either parametric (for example, a Gaussian distribution) or non-parametric (for example, a histogram).
ii. Templates
Templates can carry both appearance and spatial information. A template is built from basic elementary shapes. Templates are not suitable for challenging objects because the appearance of such objects differs between views. Including object appearance parameters as well as object postures makes the model more efficient. Templates also cause problems when the object features change, for example when the lighting changes.
iii. Active appearance models
The shape of the object can be computed using the boundary of an object or object region. For each location the appearance of the object is modeled using color, texture, or gradient.
iv. Multi-view appearance models
Different views of the object can be encoded using this method, and many techniques are available for doing so. From the various views the model generates a subspace. Principal component analysis (PCA) and independent component analysis (ICA) have been used for this purpose.
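The distinction between parametric and non-parametric probability densities drawn above can be illustrated with a small sketch. It assumes the object region is available as an array of grayscale intensities in the range 0 to 255. The function names and the bin count are illustrative choices, not values from the chapter.

```python
import numpy as np

def histogram_pdf(pixels, bins=8, value_range=(0, 256)):
    """Non-parametric appearance model: normalized intensity histogram."""
    counts, edges = np.histogram(pixels, bins=bins, range=value_range)
    total = counts.sum()
    if total == 0:
        raise ValueError("no pixel values fall inside value_range")
    return counts / total, edges

def gaussian_pdf_params(pixels):
    """Parametric appearance model: mean and variance of a single Gaussian."""
    pixels = np.asarray(pixels, dtype=float)
    return float(pixels.mean()), float(pixels.var())

# Usage: a tiny object region with four dark and two bright pixels.
region = np.array([[10, 12, 200],
                   [11, 13, 205]])
hist, bin_edges = histogram_pdf(region.ravel(), bins=4)
mean, variance = gaussian_pdf_params(region)
```

In this toy region the histogram keeps the two intensity clusters apart, whereas the single Gaussian summarizes them with one mean and variance. This is one reason a non-parametric model may be preferred for multi-colored objects.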
Object Detection
The fundamental procedure is to find the object that needs to be tracked in a video scene and then to cluster the pixels of that object, following the steps in Figure 3.4. The main data to be extracted are the moving objects, so almost all methods focus on object detection. An object can be detected when it enters the video. The types of object detection are also shown in Figure 3.4.
FIGURE 3.4 Fundamental steps in object tracking.
Frame Differencing
The target in a video sequence can be identified by computing the difference between successive images or frames. This method adapts well to dynamic environments. However, locating the moving object becomes very complicated because complete outline information is not available.
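Frame differencing as described above can be sketched in a few lines. This is a minimal illustration, assuming two grayscale frames of equal size stored as NumPy arrays. The function name and the threshold value are assumptions made for the example.

```python
import numpy as np

def frame_difference_mask(prev_frame, curr_frame, threshold=25):
    """Return a boolean mask of pixels that changed between two frames.

    Assumption: both frames are 2-D grayscale arrays of equal shape;
    `threshold` is an illustrative value, not one given in the chapter.
    """
    # Use a signed type so that the subtraction of uint8 images cannot wrap.
    prev = np.asarray(prev_frame, dtype=np.int16)
    curr = np.asarray(curr_frame, dtype=np.int16)
    if prev.shape != curr.shape:
        raise ValueError("frames must have the same shape")
    return np.abs(curr - prev) > threshold

# Usage: a bright 2x2 object moves one pixel to the right.
prev_frame = np.zeros((6, 6), dtype=np.uint8)
curr_frame = np.zeros((6, 6), dtype=np.uint8)
prev_frame[2:4, 1:3] = 200
curr_frame[2:4, 2:4] = 200
motion = frame_difference_mask(prev_frame, curr_frame)
```

Only the leading and trailing columns of the object are flagged. The column where the object overlaps itself in both frames is not, which illustrates why the complete outline cannot be recovered from the difference alone.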
Optical Flow
Optical flow is used to cluster the pixels of an image according to their motion. The complete motion data can be extracted and separated from the background. Its suitability for real-time applications depends on parameters such as noise sensitivity and anti-noise performance.
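To make the idea of optical flow concrete, the sketch below estimates a single displacement for a whole image patch. It uses a Lucas-Kanade style least-squares formulation, which is one common choice and is not named in the chapter. It assumes small motion and constant brightness, and the helper `gaussian_blob` is a hypothetical test-pattern generator.

```python
import numpy as np

def estimate_flow(frame1, frame2):
    """Estimate one (u, v) displacement in pixels for a whole patch.

    Solves Ix*u + Iy*v + It = 0 over all pixels in the least-squares
    sense, assuming small motion and constant brightness.
    """
    f1 = np.asarray(frame1, dtype=float)
    f2 = np.asarray(frame2, dtype=float)
    # np.gradient returns derivatives along rows (y) first, then columns (x).
    iy, ix = np.gradient(0.5 * (f1 + f2))
    it = f2 - f1
    a = np.stack([ix.ravel(), iy.ravel()], axis=1)
    b = -it.ravel()
    (u, v), *_ = np.linalg.lstsq(a, b, rcond=None)
    return float(u), float(v)

def gaussian_blob(size, cx, cy, sigma=8.0):
    """Hypothetical test pattern: a smooth blob centred at (cx, cy)."""
    y, x = np.mgrid[0:size, 0:size]
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))

# Usage: the blob moves half a pixel to the right between the frames.
frame1 = gaussian_blob(64, 32.0, 32.0)
frame2 = gaussian_blob(64, 32.5, 32.0)
u, v = estimate_flow(frame1, frame2)
```

Practical systems usually compute a flow vector per pixel or per window and then group pixels with similar vectors. Because the estimate relies on image derivatives, noise in the frames directly perturbs the result.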
Background Subtraction
Background modeling provides a reference model. The presence of a moving object in a video sequence can be identified by matching every frame against this reference model. Background subtraction uses different methods to find these objects. Even though the implementation is easy, it is sensitive to changes in the surroundings. The complete object data can be extracted only when the background is known. For real-time applications a static background model cannot produce good results. Some of the factors that cause the background to change are reflections, animated images, and indoor scenes. Static backgrounds also face great difficulties with outdoor scenes. Motion detection or tracking systems based on background subtraction need to cope with the following critical situations: image noise and changing lighting conditions, minor movements of non-static objects due to the wind, movements of objects, shadow regions, and multiple objects in the scene. There are two approaches:
i. Recursive algorithm [17, 18]
ii. Non-recursive algorithm
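The two approaches can be contrasted with a small sketch. It assumes that a recursive algorithm keeps a single background image and updates it with every frame (here a running average), and that a non-recursive algorithm estimates the background from a buffer of recent frames (here their median). The class names, the learning rate `alpha`, the buffer size, and the threshold are all illustrative assumptions.

```python
import numpy as np
from collections import deque

class RunningAverageBackground:
    """Recursive model: one background image updated with every frame."""

    def __init__(self, first_frame, alpha=0.05):
        self.alpha = alpha  # illustrative learning rate (assumption)
        self.background = np.asarray(first_frame, dtype=float).copy()

    def apply(self, frame, threshold=30):
        frame = np.asarray(frame, dtype=float)
        foreground = np.abs(frame - self.background) > threshold
        # Blend the new frame into the stored background.
        self.background = (1 - self.alpha) * self.background + self.alpha * frame
        return foreground

class MedianBackground:
    """Non-recursive model: background is the median of buffered frames."""

    def __init__(self, buffer_size=5):
        self.frames = deque(maxlen=buffer_size)

    def apply(self, frame, threshold=30):
        frame = np.asarray(frame, dtype=float)
        if self.frames:
            background = np.median(np.stack(list(self.frames)), axis=0)
            foreground = np.abs(frame - background) > threshold
        else:
            # No history yet, so nothing can be marked as foreground.
            foreground = np.zeros(frame.shape, dtype=bool)
        self.frames.append(frame)
        return foreground

# Usage: three empty frames, then one frame with a bright object pixel.
empty = np.full((4, 4), 50, dtype=np.uint8)
with_object = empty.copy()
with_object[1, 1] = 200
recursive = RunningAverageBackground(empty)
non_recursive = MedianBackground(buffer_size=3)
for _ in range(3):
    recursive.apply(empty)
    non_recursive.apply(empty)
fg_recursive = recursive.apply(with_object)
fg_non_recursive = non_recursive.apply(with_object)
```

The recursive model needs memory for one image only, but an object that lingers is gradually absorbed into the background. The buffered model forgets old frames completely once they leave the buffer, at the cost of storing several frames.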
Object Classification Method
The object classification method assigns a class label based on the features extracted. Object features can be based on its size, color, structure, and motion. Features used for object classification are listed below:
i. Edges: strong intensity variations are found near the object boundary. Edge detection techniques are applied to calculate these intensity variations. Edges are less sensitive to lighting conditions.
ii. Motion: motion-based classification produces good results for non-rigid objects.
iii. Color: many color spaces are available to store data from different frames.
iv. Texture: texture is used to identify the target or object of interest.
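The edge feature in the list above can be illustrated with a simple gradient-magnitude sketch. It assumes a grayscale image stored as a NumPy array and models a lighting change as a uniform additive offset, which is a simplification. The function name `edge_strength` is hypothetical.

```python
import numpy as np

def edge_strength(gray):
    """Gradient magnitude of a grayscale image (central differences)."""
    gy, gx = np.gradient(np.asarray(gray, dtype=float))
    return np.hypot(gx, gy)

# Usage: a vertical step edge, then the same scene under brighter lighting.
image = np.zeros((8, 8))
image[:, 4:] = 100.0      # step edge between columns 3 and 4
brighter = image + 30.0   # uniform lighting offset (assumption)
edges = edge_strength(image)
edges_brighter = edge_strength(brighter)
```

Both edge maps are identical because a constant offset cancels in the intensity differences. This is consistent with the observation that edges are less sensitive to lighting than raw intensity or color values.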
The main challenges in visual tracking (Figure 3.1) are:
- • Object position: the position of the intended object may vary from one video frame to another, as may its projection onto the video frame plane.
- • Ambient illumination: the ambient light on the object of interest can change in intensity, direction, and color in a video plane.
- • Noise: the video may contain some amount of noise introduced during signal acquisition, depending on the quality of the sensor.
- • Occlusions: a moving object in a scene may be hidden by another object present in the same scene, in which case the object cannot be tracked even though it is present in the scene.