
Occlusion-aware temporal frame interpolation in a highly scalable video coding setting

Published online by Cambridge University Press:  01 April 2016

Dominic Rüfenacht*
Affiliation:
School of EE&T, University of New South Wales, Sydney, Australia
Reji Mathew
Affiliation:
School of EE&T, University of New South Wales, Sydney, Australia
David Taubman
Affiliation:
School of EE&T, University of New South Wales, Sydney, Australia
*
Corresponding author: D. Rüfenacht, Email: d.ruefenacht@unsw.edu.au

Abstract

We recently proposed a bidirectional hierarchical anchoring (BIHA) of motion fields for highly scalable video coding. The BIHA scheme employs piecewise-smooth motion fields, and uses breakpoints to signal motion discontinuities. In this paper, we show how the fundamental building block of the BIHA scheme can be used to perform bidirectional, occlusion-aware temporal frame interpolation (BOA-TFI). From a “parent” motion field between two reference frames, we use information about motion discontinuities to compose motion fields from both reference frames to the target frame; these are then inverted so that they can be used to predict the target frame. During the motion inversion process, we compute a reliable occlusion mask, which is used to guide the bidirectional motion-compensated prediction of the target frame. The scheme can be used in any state-of-the-art codec, but is most beneficial if used in conjunction with a highly scalable video coder which employs piecewise-smooth motion fields with motion discontinuities. We evaluate the proposed BOA-TFI scheme on a large variety of natural and challenging computer-generated sequences, and our results compare favorably to state-of-the-art TFI methods.

Information

Type
Original Paper
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © The Authors, 2016

Fig. 1. Overview of the proposed TFI method: The inputs to the scheme are a (potentially estimated) motion field M_{a→c}; breakpoint fields estimated on M_{a→c} for frame f_a and on M_{c→e} (only used to obtain breakpoints) for frame f_c; and the two reference frames f_a and f_c. In the first step, the estimated breakpoints at the reference frames f_a and f_c (B_a and B_c) are transferred to the target frame f_b (B_b). Next, M_{a→b} is obtained by halving its parent motion field M_{a→c}. M_{a→c} and M_{a→b} are then used to infer the motion field M_{c→b}. The last step consists of inverting M_{a→b} and M_{c→b} to obtain M_{b→a} and M_{b→c}. During the motion inversion process, we compute disocclusion masks Ŝ_{b→a} and Ŝ_{b→c}, which are used to guide the bidirectional MCTFI process that temporally interpolates the frame f̂_b. Breakpoints are used to resolve double mappings and to handle occluded regions during both the motion inference and inversion processes.
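The halving and inversion steps summarized in this caption can be sketched in a few lines. This is a simplified illustration under the constant-velocity assumption, with nearest-neighbour splatting and a per-pixel motion field; the function names are ours, and the paper's actual procedure (triangular mesh warping, breakpoint-guided hole handling) is considerably more sophisticated:

```python
import numpy as np

def halve_motion(m_ac):
    """Constant-velocity assumption: a pixel travels half of its a->c
    displacement to reach the temporal midpoint b, so M_{a->b} = M_{a->c}/2."""
    return 0.5 * m_ac

def invert_motion(m_ab, shape):
    """Naive inversion of a forward motion field (h, w, 2) holding (dx, dy).
    Pixels of frame b that no pixel of frame a maps to are flagged in the
    disocclusion mask; the paper resolves double mappings and fills such
    holes using breakpoints, whereas this sketch only records them."""
    h, w = shape
    m_ba = np.zeros((h, w, 2))
    disoccluded = np.ones((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            dx, dy = m_ab[y, x]
            tx, ty = int(round(x + dx)), int(round(y + dy))
            if 0 <= tx < w and 0 <= ty < h:
                m_ba[ty, tx] = (-dx, -dy)   # motion back from f_b to f_a
                disoccluded[ty, tx] = False
    return m_ba, disoccluded
```

For a uniform rightward motion, the leftmost column of the target frame receives no mapping and is correctly flagged as disoccluded.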


Table 1. Table of notations used throughout the paper.


Fig. 2. (a) Traditional anchoring of motion fields in the target frames and (b) bidirectional hierarchical anchoring (BIHA) of motion fields at reference frames.


Fig. 3. A rectangle moves from left to right with accelerated motion. (a) shows the true location of the rectangle (green), and (b) the predicted position of the rectangle under the constant-motion assumption. Note that because the inferred motion (orange dashed line) follows the scaled motion (blue dotted line), the two motion fields M_{a→b} and M_{c→b} are geometrically consistent.


Fig. 4. Scalable geometry representation: Two breakpoints on the perimeter of the same cell can induce discontinuity information onto the root arcs (purple crosses). If the root arc contains a vertex (red cross), the inducing is stopped.


Fig. 5. Spatio-temporal induction of breakpoints. Going from coarse to fine spatial resolution, the proposed temporal induction process consists of three steps at each resolution level η: (1) Assessment of temporal compatibility of line segments induced by breakpoints between two coarse-level frames fa and fc; (2) Warping of compatible line segments to fb; (3) Spatial induction of all breakpoints to the next finer spatial resolution η−1. For better visualization, root arcs are not shown in this figure.


Fig. 6. Illustration of the proposed CAW procedure. These figures show color-coded motion fields. (a) The reference motion field is partitioned into triangles; (b) each such triangle is then mapped from the reference to the target frame, where each integer location is assigned the corresponding affine motion. In regions that become disoccluded, triangles stretch without changing orientation (e.g., the green triangle), and the affine model assigns an interpolated value between the foreground and background motion, without leaving any holes.
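The affine motion model carried by each warped triangle can be illustrated with barycentric interpolation of the three vertex motion vectors: any point inside the triangle receives a motion vector that varies linearly between the vertices, which is why a stretched triangle covers a disoccluded region without holes. A minimal sketch (the function name and array conventions are ours, not the paper's):

```python
import numpy as np

def affine_motion_in_triangle(p, verts, mv):
    """Evaluate the affine (barycentric) motion model at point p inside a
    triangle with vertices verts (3x2) carrying motion vectors mv (3x2).
    Solving p - a = l1*(b - a) + l2*(c - a) gives the barycentric weights."""
    verts = np.asarray(verts, float)
    mv = np.asarray(mv, float)
    a, b, c = verts
    t = np.array([[b[0] - a[0], c[0] - a[0]],
                  [b[1] - a[1], c[1] - a[1]]])
    l1, l2 = np.linalg.solve(t, np.asarray(p, float) - a)
    l0 = 1.0 - l1 - l2
    return l0 * mv[0] + l1 * mv[1] + l2 * mv[2]
```

For example, midway along an edge whose endpoints carry motions (0, 0) and (2, 0), the model assigns (1, 0) — a value interpolated between the two vertex motions.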


Fig. 7. Resolving double mappings in the mapped motion field by reasoning about motion discontinuities (represented as red dashed lines around the scepter). The key idea in identifying the foreground motion is that motion discontinuities travel with the foreground object.


Fig. 8. Closeup of the scene in Fig. 6, illustrating the motion extrapolation technique applied in disoccluded regions. Panel (a) shows a triangle in the reference frame f_i which straddles a motion discontinuity boundary. Panel (b) shows the warped, stretched triangle in the target frame f_j; panel (c) introduces the notation used in the text to describe the motion extrapolation procedure. Instead of linearly interpolating motion from foreground to background, we extrapolate motion from the vertices to the motion discontinuity boundary, represented by B_1 and B_2; this results in sharp boundaries, as exemplified in (d), where the blue dotted line corresponds to linearly interpolated motion, and the green solid line to extrapolated motion.
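The difference between the two motion profiles in panel (d) can be made concrete with a one-dimensional toy example along a line crossing the discontinuity. This is purely illustrative (the function and parameter names are ours): linear interpolation smears foreground and background motion across the whole span, whereas extrapolating each side's motion up to the boundary keeps the transition sharp.

```python
def motion_profile(x, boundary, mv_fg, mv_bg, interpolate=False):
    """Motion along a 1-D line with x in [0, 1] crossing a motion
    discontinuity at `boundary`. With interpolate=True, motion is linearly
    blended from foreground to background (blue dotted line in Fig. 8(d));
    otherwise each side's motion is extrapolated up to the boundary,
    producing a sharp step (green solid line)."""
    if interpolate:
        return (1.0 - x) * mv_fg + x * mv_bg
    return mv_fg if x < boundary else mv_bg
```

Just past a boundary at x = 0.25, the extrapolated profile already carries pure background motion, while the interpolated profile still mixes in most of the foreground motion.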


Fig. 9. Example results for estimated motion on a natural sequence with reasonably complex motion. Panel (a) shows the motion field M_{a→c}, estimated using [14] with default parameters; panel (b) shows the breakpoint field (at the second-coarsest spatial level, for visualization), estimated on M_{a→c} using the breakpoint estimation method described in [16]. Panel (c) shows the union of the estimated disocclusion masks, where yellow and cyan indicate that the pixel is not visible in the previous (f_a) and future (f_c) frame, respectively. Panels (d) and (e) show the inverted motion fields M_{b→a} and M_{b→c}, anchored at the target frame f_b, which together with the disocclusion masks are used to obtain (f), the bidirectionally predicted target frame f̂_b.
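The role of the disocclusion masks in the final prediction step can be sketched as follows. This is a hedged simplification of occlusion-guided bidirectional prediction (function name ours): where a pixel is visible in both references the two motion-compensated predictions are averaged, and where it is disoccluded in one reference only the other one is used. Regions visible in neither reference would need further treatment, which this sketch does not attempt.

```python
import numpy as np

def blend_predictions(pred_a, pred_c, occ_a, occ_c):
    """Occlusion-guided bidirectional prediction of the target frame.
    pred_a, pred_c: motion-compensated predictions from f_a and f_c;
    occ_a, occ_c:   boolean masks marking pixels NOT visible in f_a / f_c."""
    out = 0.5 * (pred_a + pred_c)            # visible in both: average
    only_c = occ_a & ~occ_c                  # disoccluded w.r.t. f_a
    only_a = occ_c & ~occ_a                  # disoccluded w.r.t. f_c
    out[only_c] = pred_c[only_c]
    out[only_a] = pred_a[only_a]
    return out
```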


Fig. 10. First frame of each of the natural sequences used in the experiments. All sequences are readily available at https://media.xiph.org/video/derf/.


Table 2. Quantitative comparison of the proposed method with [6,7] on common natural test sequences. In parentheses (·), we show the difference between the PSNR of the proposed BOA-TFI method and that of the respective method we compare it with (“−” means that the proposed BOA-TFI performs better, “+” that it performs worse).
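The PSNR values that the differences in this table are expressed in are the standard peak signal-to-noise ratio between the interpolated frame and the ground-truth frame; a minimal sketch, assuming 8-bit content (peak value 255):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference frame and an
    interpolated frame: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    if mse == 0:
        return float('inf')                 # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)
```

A difference of, say, 0.5 dB between two methods on the same sequence corresponds to the difference of two such values.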


Fig. 11. Qualitative comparison of TFI on natural sequences. The first row shows the full frame. The second through last rows show crops of the ground truth, the proposed BOA-TFI, Jeong et al. [6], and Veselov and Gilmutdinov [7], respectively.


Table 3. Average per-frame processing time (in seconds) over all frames tested in Section VIII.A, split into motion estimation (ME) and frame interpolation (FI), as well as the total time. We further list the CPU and the amount of RAM of the machines on which the results were obtained.


Fig. 12. TFI results on the Sintel sequence, which highlight the effectiveness of the proposed method in handling occluded regions. The first column shows the (color-coded) ground-truth motion fields between the two reference frames, which, together with the two reference frames (not shown), form the input to our method in this experiment. The second column shows the union of the forward and backward disocclusion masks produced by the proposed BOA-TFI method, where yellow pixels are locations that become disoccluded between the previous reference frame and the interpolated frame; similarly, cyan pixels are locations that become disoccluded between the future reference frame and the interpolated frame; red marks regions that are not visible in either of the reference frames. The last column shows crops of the temporally interpolated frames obtained by the proposed BOA-TFI method.