Deep learning has demonstrated its superiority in computer vision. Landsat images have specific characteristics compared with natural images: the spectral and texture features of the same class vary with the imaging conditions. In this paper, we extend deep learning classifiers for remote sensing images to large geographical regions and explore a way to make them transferable across regions. We take the Jingjinji region and Henan province in China as the study areas, and choose FCN, ResNet, and PSPNet as classifiers. The models are trained with different proportions of training samples from the Jingjinji region and then used to predict results for both study areas. Experimental results show that the overall accuracy decreases when the models are trained with small samples, but the recognition ability on mislabeled areas increases. All methods perform well when applied to the Jingjinji region, but they all need to be fine-tuned with new training samples from Henan province, because the images of Henan province have spectral features different from those of the originally trained area.
In this paper, we compare the video codecs AV1 (version 1.0.0-2242 from August 2019), HEVC (HM and x265), AVC (x264), the exploration software JEM, which is based on HEVC, and VTM (version 4.0 from February 2019), the test model of VVC, the successor of HEVC, under two fair and balanced configurations: All Intra for the assessment of intra coding, and Maximum Coding Efficiency with all codecs tuned for their best coding-efficiency settings. VTM achieves the highest coding efficiency in both configurations, followed by JEM and AV1. The worst coding efficiency is achieved by x264 and x265, even in the placebo preset for highest coding efficiency. AV1 has gained considerably in coding efficiency compared with previous versions and now outperforms HM with BD-rate gains of 24%. VTM gains 5% over AV1 in terms of BD-rate. By reporting separate numbers for the JVET and AOM test sequences, it is ensured that no bias in the test sequences exists. When comparing only intra coding tools, we observe that complexity increases exponentially for linearly increasing coding efficiency.
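The BD-rate numbers above come from the Bjøntegaard metric. As a reference point, the standard computation fits a cubic polynomial to log-rate as a function of PSNR for each codec and integrates the difference over the overlapping quality range; the sketch below is a minimal illustration with names of our own choosing, not code from the paper:

```python
import numpy as np

def bd_rate(rates_ref, psnrs_ref, rates_test, psnrs_test):
    """Bjontegaard delta rate: average bitrate difference (%) of the
    test codec relative to the reference at equal quality."""
    # Cubic fits of log10(rate) over PSNR for both codecs.
    p_ref = np.polyfit(psnrs_ref, np.log10(rates_ref), 3)
    p_test = np.polyfit(psnrs_test, np.log10(rates_test), 3)
    # Integrate both fits over the overlapping PSNR range.
    lo = max(min(psnrs_ref), min(psnrs_test))
    hi = min(max(psnrs_ref), max(psnrs_test))
    int_ref = np.diff(np.polyval(np.polyint(p_ref), [lo, hi]))[0]
    int_test = np.diff(np.polyval(np.polyint(p_test), [lo, hi]))[0]
    avg_log_diff = (int_test - int_ref) / (hi - lo)
    return (10 ** avg_log_diff - 1) * 100
```

A codec needing 10% more rate at every quality level thus yields a BD-rate of +10%.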
The kernel extreme learning machine (KELM) is more robust and learns faster than traditional neural networks, and it is thus gaining increasing attention in hyperspectral image (HSI) classification. Although the Gaussian radial basis function kernel widely used in KELM has achieved promising performance in supervised HSI classification, it does not consider the underlying data structure of HSIs. In this paper, we propose a novel spectral-spatial KELM method (termed MF-KELM) that incorporates a mean filtering kernel into the KELM model, which properly computes the mean value of the spatially neighboring pixels in the kernel space. Considering that the classification result is very noisy when training samples are limited, spatial bilateral filtering information on spectral band subsets is introduced to improve the accuracy. Experimental results show that our method outperforms KELM with other kernel functions in terms of classification accuracy and visual comparison.
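The identity that makes a mean filtering kernel computable is that the inner product of averaged feature maps equals the average of pairwise kernel evaluations over the two neighborhoods. A minimal sketch with an RBF base kernel (the names and parameterization are illustrative, not taken from the paper):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # Gaussian RBF kernel between two spectral vectors.
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def mean_filtering_kernel(neigh_i, neigh_j, gamma=1.0):
    """Kernel value between the means of two spatial neighborhoods,
    computed in the kernel space: averaging the feature maps of the
    neighbors equals averaging the pairwise kernel evaluations."""
    return float(np.mean([rbf(p, q, gamma) for p in neigh_i for q in neigh_j]))
```

The resulting kernel matrix over all pixel neighborhoods can then be plugged into the usual KELM solution in place of the pixel-wise RBF kernel.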
This study proposes two multimodal frameworks to classify pathological voice samples by combining acoustic signals and medical records. In the first framework, acoustic signals are transformed into static supervectors via Gaussian mixture models; then, a deep neural network (DNN) combines the supervectors with the medical record and classifies the voice signals. In the second framework, both acoustic features and medical data are processed through first-stage DNNs individually; then, a second-stage DNN combines the outputs of the first-stage DNNs and performs classification. Voice samples were recorded in a specific voice clinic of a tertiary teaching hospital, including three common categories of vocal diseases, i.e. glottic neoplasm, phonotraumatic lesions, and vocal paralysis. Experimental results demonstrated that the proposed framework yields significant accuracy and unweighted average recall (UAR) improvements of 2.02–10.32% and 2.48–17.31%, respectively, compared with systems that use only acoustic signals or medical records. The proposed algorithm also provides higher accuracy and UAR than traditional feature-based and model-based combination methods.
This paper reports a visible and thermal drone monitoring system that integrates deep-learning-based detection and tracking modules. The biggest challenge in adopting deep learning methods for drone detection is the paucity of training drone images, especially thermal ones. To address this issue, we develop two data augmentation techniques. One is a model-based drone augmentation technique that automatically generates visible drone images with a bounding-box label on the drone's location. The other exploits an adversarial data augmentation methodology to create thermal drone images. To track a small flying drone, we utilize the residual information between consecutive image frames. Finally, we present an integrated detection and tracking system that outperforms each individual module used alone. The experiments show that, even when trained on synthetic data, the proposed system performs well on real-world drone images with complex backgrounds. The USC drone detection and tracking dataset with user-labeled bounding boxes is available to the public.
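Model-based augmentation of the kind described composites a rendered drone crop onto a background image and obtains the bounding-box label for free. The function below is a hypothetical illustration of that idea, not the authors' pipeline:

```python
import numpy as np

def paste_drone(background, drone, mask, top, left):
    """Composite a drone crop onto a background at (top, left) using a
    binary mask, returning the synthetic image and its box label."""
    out = background.copy()
    h, w = drone.shape[:2]
    region = out[top:top + h, left:left + w]
    # Keep drone pixels where the mask is set, background elsewhere.
    out[top:top + h, left:left + w] = np.where(mask[..., None] > 0, drone, region)
    bbox = (left, top, left + w, top + h)  # (x_min, y_min, x_max, y_max)
    return out, bbox
```

Varying the paste position, scale, and background yields labeled training images at no annotation cost.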
Steering vector mismatch causes signal self-nulling in adaptive beamforming when the training data contain the desired signal component. To prevent signal self-nulling, many beamformers use robust techniques, which are usually equivalent to the diagonal loading approach. Unfortunately, diagonal loading achieves better signal enhancement at the cost of losing interference suppression capability, especially at high input signal-to-noise ratios. In this paper, a novel robust adaptive beamforming method is developed to improve the interference suppression capability. The proposed beamformer is based on worst-case performance optimization with a newly estimated steering vector and a specially set parameter. First, a subspace orthogonal to the interference steering vectors is obtained using the interference-plus-noise covariance matrix; then a new steering vector orthogonal to each interference steering vector is estimated; finally, the beamformer's weight is solved by worst-case performance optimization with a specially set parameter. The interference suppression principle is analyzed theoretically in detail, and simulation results are presented to evaluate the performance of the proposed beamformer.
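For reference, the standard worst-case performance optimization that such beamformers build on (with sample covariance matrix \(\mathbf{R}\), presumed steering vector \(\hat{\mathbf{a}}\), and mismatch bound \(\varepsilon\), the "set parameter") can be written as:

```latex
\min_{\mathbf{w}} \; \mathbf{w}^{H}\mathbf{R}\,\mathbf{w}
\quad \text{subject to} \quad
\operatorname{Re}\{\mathbf{w}^{H}\hat{\mathbf{a}}\} - \varepsilon\,\|\mathbf{w}\| \ge 1 ,
```

where a larger \(\varepsilon\) guards against a larger steering-vector mismatch, at the cost of interference suppression capability.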
It is well known that a number of convolutional neural networks (CNNs) generate checkerboard artifacts in two processes: forward propagation through upsampling layers and backpropagation through convolutional layers. A condition for avoiding these artifacts is proposed in this paper. So far, the artifacts have been studied mainly for linear multirate systems, but the conventional avoidance condition cannot be applied to CNNs due to their non-linearity. We extend the avoidance condition to CNNs and apply the proposed structure to typical CNNs to confirm its effectiveness. Experimental results demonstrate that the proposed structure can perfectly avoid generating checkerboard artifacts while keeping the excellent properties that CNNs have.
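The linear multirate-systems view of the artifact can be illustrated in 1-D: a zero-insertion upsampler followed by a convolution produces a periodic (checkerboard) pattern whenever the kernel's polyphase components have unequal sums, and avoids it when they are equal, as with a linear-interpolation kernel. A small numpy demonstration of that condition, not the paper's CNN structure:

```python
import numpy as np

def upsample_conv(x, kernel, stride=2):
    # Transposed-convolution-style upsampler: insert zeros, then filter.
    up = np.zeros(len(x) * stride)
    up[::stride] = x
    return np.convolve(up, kernel, mode="same")

x = np.ones(16)                                            # constant input
bad = upsample_conv(x, np.array([0.25, 0.5, 0.25, 0.1]))   # unequal polyphase sums
good = upsample_conv(x, np.array([0.5, 1.0, 0.5]))         # linear-interp kernel
# 'bad' oscillates on a constant input (checkerboard); 'good' stays flat.
```

For the bad kernel, the even-offset taps sum to 0.5 and the odd-offset taps to 0.6, so even a constant input produces an alternating output.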
Over the past 20 years, research on quality of experience (QoE) has actively expanded to cover aesthetic, emotional, and psychological experiences. QoE has been an important research topic in determining the perceptual factors that are essential to users, in keeping with the emergence of new display technologies. In this paper, we provide in-depth reviews of recent assessment studies in this field. Compared with previous reviews, our work examines the human factors observed over various recent displays and their associated assessment methods. We first provide a comprehensive QoE analysis of 2D displays, including image/video quality assessment (I/VQA), visual preference, and studies related to the human visual system. Second, we analyze stereoscopic 3D (S3D) QoE research on I/VQA and visual discomfort from the human-perception point of view on S3D displays. Third, we investigate QoE in head-mounted-display-based virtual reality (VR) environments, dealing with VR sickness and 360° I/VQA and their individual approaches. All of our reviews are analyzed through comparisons of benchmark models. Furthermore, we lay out QoE work on future displays and modern deep-learning applications.
In conventional studies, cryptographic techniques are used to ensure the security of transactions between a seller and a buyer in a fingerprinting system. However, the protocol for tracing a pirated copy has not been studied from the security point of view, even though collusion resistance is achieved by employing a collusion-secure fingerprinting code. In this paper, we consider the secrecy of the parameters of a fingerprinting code and the burden on a trusted center, and propose a secure tracing protocol jointly executed by a seller and a delegated server. Our main idea is to delegate authority to a server so that the center is required to operate only in the initialization phase of the system. When a pirated copy is found, the seller calculates a correlation score for each user's codeword in an encrypted domain and identifies illegal users by sending the ciphertexts of the scores as queries to the server. Information leakage from the server can be managed by restricting its responses and checking the queries for maliciousness.
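In plaintext, a tracing score of the kind described is a simple correlation between the sequence extracted from a pirated copy and each user's codeword; in the protocol above this sum would be evaluated homomorphically on ciphertexts, which this illustrative sketch omits:

```python
def correlation_score(extracted_bits, user_codeword):
    """Plus/minus-one correlation between the bit sequence extracted
    from a pirated copy and one user's codeword; users who took part
    in producing the copy tend to score high."""
    return sum(1 if p == c else -1 for p, c in zip(extracted_bits, user_codeword))
```

An innocent user's codeword, being independent of the pirated sequence, scores near zero on average, which is what makes a threshold test on the score meaningful.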
Due to the increased popularity of augmented reality (AR) and virtual reality (VR) experiences, the interest in representing the real world in an immersive fashion has never been higher. Distributing such representations enables users all over the world to freely navigate in never-before-seen media experiences. Unfortunately, such representations require a large amount of data, which is not feasible for transmission on today's networks. Thus, efficient compression technologies are in high demand. This paper proposes an approach to compressing 3D video data utilizing 2D video coding technology. The proposed solution was developed to address the needs of “tele-immersive” applications, such as VR, AR, or mixed reality with “Six Degrees of Freedom” capabilities. Volumetric video data are projected onto 2D image planes and compressed using standard 2D video coding solutions. A key benefit of this approach is its compatibility with readily available 2D video coding infrastructure. Furthermore, objective and subjective evaluations show a significant improvement in coding efficiency over the reference technology. The proposed solution was contributed to and evaluated in international standardization. Although it was not selected as the winning proposal, a very similar solution has been developed since then.
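The core projection step can be sketched as an orthographic projection of a point cloud onto an image plane, producing a depth map and a texture map that a 2D codec can then compress. This toy version (our own, with nearest-point-per-pixel occlusion handling) glosses over the patch segmentation and padding used in practice:

```python
import numpy as np

def project_to_plane(points, colors, res):
    """Project (x, y, z) points orthographically onto the XY plane,
    keeping the nearest point per pixel; returns depth and texture maps."""
    depth = np.full((res, res), -np.inf)
    tex = np.zeros((res, res, 3))
    for (x, y, z), c in zip(points, colors):
        u, v = int(x), int(y)
        if 0 <= u < res and 0 <= v < res and z > depth[v, u]:
            depth[v, u] = z     # keep the point closest to the camera
            tex[v, u] = c
    return depth, tex
```

A decoder reverses the mapping, reconstructing one 3D point per occupied pixel from the decoded depth and texture planes.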
Many studies have addressed the denoising of digital noisy images. In regression filters, a convolution kernel is determined based on the spatial distance or the photometric distance. In non-local means (NLM) filters, the pixel-wise calculation of the distance is replaced with a patch-wise one. Later, NLM filters were developed to adapt to the local statistics of an image by introducing prior knowledge in a Bayesian framework. Unlike these existing approaches, we introduce prior knowledge not on the local patch, as in NLM filters, but on the noise bias (NB), which has not been utilized so far. Although the mean of the noise is assumed to be zero before tone mapping (TM), it becomes non-zero after TM due to the non-linearity of TM. Utilizing this fact, we propose a new denoising method for tone-mapped noisy images. In this method, the pixels of the noisy image are classified into several subsets according to the observed pixel value, and the pixel values in each subset are compensated based on the prior knowledge so that the NB of the subset becomes close to zero. Experimental results confirm the effectiveness of the proposed method.
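The compensation step just described can be sketched as binning pixels by their observed value and subtracting each bin's noise bias; the bias table stands in for the prior knowledge derived from the tone-mapping curve, and all names here are illustrative:

```python
import numpy as np

def compensate_noise_bias(img, bin_edges, bias_per_bin):
    """Classify pixels into intensity subsets and subtract each subset's
    precomputed noise bias so that the bias becomes close to zero."""
    idx = np.digitize(img, bin_edges)          # subset index per pixel
    return img - np.asarray(bias_per_bin)[idx]
```

With `n` bin edges there are `n + 1` subsets, so `bias_per_bin` needs one entry per subset.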
This paper describes several important methods for the blind source separation of audio signals in an integrated manner. Two historically developed routes are featured. One started from independent component analysis and evolved to independent vector analysis (IVA) by extending the notion of independence from a scalar to a vector. In the other route, nonnegative matrix factorization (NMF) has been extended to multichannel NMF (MNMF). As a convergence point of these two routes, independent low-rank matrix analysis has been proposed, which integrates IVA and MNMF in a clever way. All the objective functions in these methods are efficiently optimized by majorization-minimization algorithms with appropriately designed auxiliary functions. Experimental results for a simple two-source two-microphone case are given to illustrate the characteristics of these five methods.
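As a concrete instance of the majorization-minimization principle mentioned above, the classic multiplicative updates for Euclidean NMF are derived from an auxiliary function that majorizes the objective, so each update is guaranteed not to increase the reconstruction error. A minimal single-channel sketch (not the multichannel variants discussed in the paper):

```python
import numpy as np

def nmf_mu(V, rank, n_iter=500, eps=1e-12, seed=0):
    """Lee-Seung multiplicative updates for V ~= W @ H (Euclidean cost),
    a textbook majorization-minimization algorithm."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # MM update for H
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # MM update for W
    return W, H
```

Because the factors stay nonnegative and each step minimizes the auxiliary function in closed form, no step size needs to be tuned, which is the practical appeal of MM optimization in these separation methods.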
A novel grayscale-based block scrambling image encryption scheme is presented not only to enhance security but also to improve the compression performance of Encryption-then-Compression (EtC) systems with JPEG compression, which are used to securely transmit images through an untrusted channel provider. The proposed scheme enables the use of a smaller block size and a larger number of blocks than the color-based image encryption scheme. Images encrypted using the proposed scheme include less color information due to the use of grayscale images, even when the original image has three color channels. These features enhance security against various attacks, such as jigsaw-puzzle-solver and brute-force attacks. Moreover, generating the grayscale-based images from a full-color image in the YCbCr color space allows the use of a color sub-sampling operation, which provides higher compression performance than the conventional grayscale-based encryption scheme, although the encrypted images have no color information. In an experiment, encrypted images were uploaded to and then downloaded from Twitter and Facebook, and the results demonstrated that the proposed scheme is effective for EtC systems and enhances the compression performance while maintaining security against brute-force and jigsaw-puzzle-solver attacks.
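The block scrambling component of such EtC schemes can be sketched as a key-driven permutation of fixed-size tiles. This toy sketch covers only the permutation step (the full scheme also rotates/inverts blocks and applies negative-positive transforms), with names of our own choosing:

```python
import numpy as np

def block_scramble(img, block=4, key=0):
    """Permute the non-overlapping block x block tiles of a grayscale
    image with a key-derived pseudorandom permutation."""
    h, w = img.shape
    nh, nw = h // block, w // block
    # Split into tiles, shuffle them, and reassemble the image.
    tiles = img.reshape(nh, block, nw, block).swapaxes(1, 2).reshape(-1, block, block)
    perm = np.random.default_rng(key).permutation(len(tiles))
    tiles = tiles[perm]
    return tiles.reshape(nh, nw, block, block).swapaxes(1, 2).reshape(h, w)
```

Since every pixel value survives unchanged inside its tile, a JPEG codec still finds locally coherent blocks to compress, which is what makes EtC compression-friendly.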
Vehicle license plate recognition in natural scenes is an important research topic in computer vision. License plate recognition in specific scenes has become a relatively mature technology. However, license plate recognition in natural scenes is still a challenge, since the image parameters are highly affected by the complicated environment. To improve the performance of license plate recognition in natural scenes, we propose a solution for recognizing real-world Chinese license plate photographs using a DCNN-RNN model. With a DCNN, the license plate is located and its features are extracted after a correction process. Finally, an RNN model decodes the deep features into characters without character segmentation. Our state-of-the-art system achieves accuracy and recall of 92.32% and 91.89% on a car-accident-scene dataset collected in natural scenes, and 92.88% and 92.09% on the Caltech Cars 1999 dataset.
In this paper, we combine video compression and modern image processing methods. We construct novel iterative filter methods for prediction signals based on Partial Differential Equation (PDE)-based methods. The central idea of the signal-adaptive filters is explained and demonstrated geometrically, and the meaning of particular parameters is discussed in detail. Furthermore, thorough parameter tests are introduced which improve the overall bitrate savings. It is shown that these filters enhance the rate-distortion performance of state-of-the-art hybrid video codecs. In particular, based on mathematical denoising techniques, two types of diffusion filters are constructed: a uniform diffusion filter using a fixed filter mask, and a signal-adaptive diffusion filter that incorporates the structures of the underlying prediction signal. The latter has the advantage of not attenuating existing edges, while the uniform filter is less complex. The filters are embedded into a software implementation based on HEVC with the additional QTBT (Quadtree plus Binary Tree) and MTT (Multi-Type Tree) block structures. Overall, the diffusion filter method achieves average bitrate savings of 2.27% for Random Access, with an average encoder runtime increase of 19% and a decoder runtime increase of 17%. For UHD (Ultra High Definition) test sequences, bitrate savings of up to 7.36% for Random Access are accomplished.
This paper proposes a novel approach for lossless coding of light field (LF) images based on a macro-pixel (MP) synthesis technique which synthesizes the entire LF image in one step. The reference views used in the synthesis process are selected based on four different view configurations and define the reference LF image. This image is stored as an array of reference MPs which collect one pixel from each reference view, being losslessly encoded as a base layer. A first contribution focuses on a novel network design for view synthesis which synthesizes the entire LF image as an array of synthesized MPs. A second contribution proposes a network model for coding which computes the MP prediction used for lossless encoding of the remaining views as an enhancement layer. Synthesis results show an average distortion of 29.82 dB based on four reference views and up to 36.19 dB based on 25 reference views. Compression results show an average improvement of 29.9% over the traditional lossless image codecs and 9.1% over the state-of-the-art.
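The macro-pixel rearrangement underlying such schemes interleaves the views so that each spatial location holds a small block of co-located pixels from all views. A compact numpy sketch, using an indexing convention of our own that is not necessarily the paper's:

```python
import numpy as np

def views_to_macropixels(views):
    """Convert a (U, V, H, W) stack of light-field views into a
    macro-pixel image: output[h*U + u, w*V + v] = views[u, v, h, w]."""
    U, V, H, W = views.shape
    # Bring spatial axes outermost, view axes innermost, then flatten.
    return views.transpose(2, 0, 3, 1).reshape(H * U, W * V)
```

A reference MP, in this layout, is just one such U x V block restricted to the chosen reference views.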
Laughter commonly occurs in daily interactions and is not simply related to funny situations but also expresses certain attitudes, serving important social functions in communication. The motivation of the present work is to generate natural motions in a humanoid robot, since miscommunication may occur if the audio and visual modalities are mismatched, especially during laughter events. In the present work, we used a multimodal dialogue database and analyzed facial, head, and body motion during laughing speech. Based on the analysis of human behaviors during laughing speech, we propose a motion generation method given the speech signal and the laughing speech intervals. Subjective experiments were conducted with our android robot by generating five different motion types, considering several modalities. Evaluation results showed the effectiveness of controlling different parts of the face, head, and upper body (eyelid narrowing, lip corner/cheek raising, eye blinking, head motion, and upper-body motion control).
One challenge faced by reinforcement learning (RL) agents is that in many environments the reward signal is sparse, leading to slow improvement of the agent’s performance in early learning episodes. Potential-based reward shaping can help to resolve the aforementioned issue of sparse reward by incorporating an expert’s domain knowledge into the learning through a potential function. Past work on reinforcement learning from demonstration (RLfD) directly mapped (sub-optimal) human expert demonstration to a potential function, which can speed up RL. In this paper we propose an introspective RL agent that significantly further speeds up the learning. An introspective RL agent records its state–action decisions and experience during learning in a priority queue. Good quality decisions, according to a Monte Carlo estimation, will be kept in the queue, while poorer decisions will be rejected. The queue is then used as demonstration to speed up RL via reward shaping. A human expert’s demonstration can be used to initialize the priority queue before the learning process starts. Experimental validation in the 4-dimensional CartPole domain and the 27-dimensional Super Mario AI domain shows that our approach significantly outperforms non-introspective RL and state-of-the-art approaches in RLfD in both domains.
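The priority-queue mechanism described can be sketched with a bounded min-heap: keep decisions whose Monte Carlo return estimate is high, evict the worst when full, and read the buffer back as a shaping potential. Class and method names below are our own, and the shaped reward would follow the usual potential-based form F(s, a, s', a') = γΦ(s', a') − Φ(s, a):

```python
import heapq

class IntrospectionBuffer:
    """Bounded priority queue of the agent's own good (state, action)
    decisions, scored by their Monte Carlo return estimates."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []            # min-heap: worst retained decision on top

    def add(self, mc_return, state, action):
        entry = (mc_return, state, action)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif mc_return > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)   # evict the worst decision

    def potential(self, state, action):
        # Shaping potential: best recorded return for this decision, else 0.
        matches = [r for r, s, a in self._heap if s == state and a == action]
        return max(matches, default=0.0)
```

A human demonstration can seed the buffer by calling `add` with the demonstrator's returns before learning starts, mirroring the initialization step described above.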
A semi-fragile watermarking scheme is proposed in this paper for detecting tampering in speech signals. The scheme can effectively identify whether or not original signals have been tampered with by embedding hidden information into them. It is based on singular-spectrum analysis, where watermark bits are embedded into speech signals by modifying a part of the singular spectrum of a host signal. Convolutional neural network (CNN)-based parameter estimation is deployed to quickly and properly select the part of the singular spectrum to be modified so that it meets inaudibility and robustness requirements. Evaluation results show that CNN-based parameter estimation reduces the computational time of the scheme and also makes the scheme blind, i.e. we require only a watermarked signal in order to extract a hidden watermark. In addition, a semi-fragility property, which allows us to detect tampering in speech signals, is achieved. Moreover, due to the time efficiency of the CNN-based parameter estimation, the proposed scheme can be practically used in real-time applications.