In this paper, a multiple-feature regularized kernel is proposed for hyperspectral imagery classification. To exploit the label information, a regularized kernel is used to refine the original kernel in the support vector machine (SVM) classifier. Furthermore, since spatial features have been widely investigated for hyperspectral imagery classification, different types of features, including spectral features, local features (i.e. local binary patterns), global features (i.e. Gabor features), and shape features (i.e. extended multi-attribute profiles), are used to provide complementary discriminative information. Finally, a majority-voting-based ensemble approach, which combines the different types of features, is adopted to further improve the classification performance. Combining different sources of discriminative information improves classification because any single feature type may perform poorly, especially when the number of training samples is limited. Experimental results demonstrate that the proposed approach outperforms state-of-the-art classifiers.
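To make the fusion step concrete, here is a minimal sketch of a majority-voting ensemble over per-feature-type SVMs (not the authors' implementation; the regularized-kernel refinement is omitted, and all names and parameter values are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

def majority_vote_ensemble(feature_sets_train, y_train, feature_sets_test):
    """Train one SVM per feature type (e.g. spectral, LBP, Gabor, EMAP)
    and fuse their hard predictions by per-sample majority vote.
    Assumes integer-coded class labels."""
    votes = []
    for X_tr, X_te in zip(feature_sets_train, feature_sets_test):
        clf = SVC(kernel="rbf", C=100, gamma="scale")  # illustrative settings
        clf.fit(X_tr, y_train)
        votes.append(clf.predict(X_te))
    votes = np.stack(votes)                 # (n_classifiers, n_samples)
    # For each test sample, pick the most frequent predicted label
    return np.apply_along_axis(
        lambda v: np.bincount(v).argmax(), axis=0, arr=votes)
```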
With a wider luminance range than conventional low dynamic range (LDR) images, high dynamic range (HDR) images are more consistent with the human visual system (HVS). Recently, the JPEG committee released a new HDR image compression standard, JPEG XT. It decomposes an input HDR image into a base layer and an extension layer. The base-layer code stream provides JPEG (ISO/IEC 10918) backward compatibility, while the extension-layer code stream helps to reconstruct the original HDR image. However, this method does not make full use of the HVS, wasting bits on regions that are imperceptible to human eyes. In this paper, a visual saliency-based HDR image compression scheme is proposed. The saliency map of the tone-mapped HDR image is first extracted and then used to guide the encoding of the extension layer, so that compression quality adapts to the saliency of the coded region. Extensive experimental results show that our method outperforms JPEG XT profiles A, B, and C as well as other state-of-the-art methods, while retaining JPEG backward compatibility.
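One simple way to realize saliency-adaptive coding is to scale each block's quantization inversely with its mean saliency; the linear mapping and parameter names below are assumptions for illustration, not the paper's exact scheme:

```python
import numpy as np

def saliency_to_qscale(saliency_block, q_min=0.5, q_max=2.0):
    """Map the mean saliency of a coding block (values in [0, 1]) to a
    quantization scale factor: salient blocks get a smaller scale (finer
    quantization, more bits), non-salient blocks a larger one."""
    s = float(np.clip(np.mean(saliency_block), 0.0, 1.0))
    return q_max - s * (q_max - q_min)
```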
Indoor localization technology, which can provide the location of a target object or person, is becoming an essential requirement for many applications and services, such as the Internet of Things (IoT) and real-time control, in the development of fifth-generation (5G) systems. A leaky coaxial cable (LCX), which can be used as an antenna, is able to detect the location of a user in a simple way owing to its radiating property. In this paper, we propose a simple method to improve the accuracy of 2-D indoor localization using multiple LCXs. In addition, we evaluate the channel-capacity loss of LCX-MIMO caused by the localization error of our proposed method.
Accurate disparity prediction is an active research topic in computer vision, and efficiently exploiting contextual information is key to improving performance. In this paper, we propose a simple yet effective non-local context attention network that exploits global context information through attention mechanisms and semantic information for stereo matching. First, we develop a 2D geometry feature learning module that obtains a more discriminative representation by taking advantage of multi-scale features, which are fused into a variance-based cost volume. Then, we construct a non-local attention matching module using the non-local block and hierarchical 3D convolutions, which can effectively regularize the cost volume and capture global contextual information. Finally, we adopt a geometry refinement module to refine the disparity map and further improve performance. Moreover, we add a warping loss function to help the model learn the matching rules of non-occluded regions. Our experiments show that (1) our approach achieves competitive results on the KITTI and SceneFlow datasets in end-point error and the fraction of erroneous pixels $({D_1})$; and (2) our method performs particularly well in reflective regions and occluded areas.
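For reference, here is a sketch of an embedded-Gaussian non-local block in 3D, the kind of building block such a matching module relies on (channel choices and details are illustrative, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class NonLocalBlock3D(nn.Module):
    """Embedded-Gaussian non-local block applied to a 3D cost volume:
    every position attends to every other position, capturing global
    context that local 3D convolutions miss."""
    def __init__(self, in_ch):
        super().__init__()
        self.inter_ch = max(in_ch // 2, 1)
        self.theta = nn.Conv3d(in_ch, self.inter_ch, 1)  # query embedding
        self.phi = nn.Conv3d(in_ch, self.inter_ch, 1)    # key embedding
        self.g = nn.Conv3d(in_ch, self.inter_ch, 1)      # value embedding
        self.out = nn.Conv3d(self.inter_ch, in_ch, 1)

    def forward(self, x):                       # x: (B, C, D, H, W)
        B, C, D, H, W = x.shape
        n = D * H * W
        q = self.theta(x).view(B, self.inter_ch, n)
        k = self.phi(x).view(B, self.inter_ch, n)
        v = self.g(x).view(B, self.inter_ch, n)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)   # (B, n, n)
        y = (v @ attn.transpose(1, 2)).view(B, self.inter_ch, D, H, W)
        return x + self.out(y)                  # residual connection
```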
As high dynamic range (HDR) and wide color gamut (WCG) contents become more and more popular in multimedia markets, color mapping of distributed contents to different rendering devices plays a pivotal role in HDR distribution ecosystems. The widely used and economical gamut-clipping (GC)-based techniques perform poorly when mapping WCG contents to narrow-gamut devices, while high-performance color-appearance-model (CAM)-based techniques are computationally too expensive for commercial applications. In this paper, we propose a novel color gamut mapping (CGM) algorithm to solve this problem. By introducing a color transition/protection zone (TPZ) and a set of perceptual hue-fidelity constraints into the CIE-1931 space, the proposed algorithm carries out CGM directly in this perceptually non-uniform space, greatly decreasing the computational complexity. The proposed TPZ achieves a reasonable compromise between saturation preservation and detail protection for out-of-gamut colors. The proposed hue-fidelity constraints are derived from measurements of human subjects' visual responses and thus effectively preserve the perceptual hue of the original colors. Experimental results show that the proposed algorithm clearly outperforms GC-CGM and performs comparably to or better than the expensive CAM-CGM. The proposed algorithm is real-time and hardware-friendly, and it is an important supplement to the SMPTE ST.2094-40 standard.
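As a rough illustration of hue-line mapping in the CIE-1931 chromaticity plane (a generic sketch, not the paper's algorithm: the TPZ compression of near-boundary in-gamut colors is omitted, and `gamut_contains` is an assumed boundary test):

```python
import numpy as np

def clip_toward_anchor(xy, gamut_contains, anchor=(0.3127, 0.3290), steps=64):
    """Move an out-of-gamut chromaticity along the straight line toward an
    anchor (here the D65 white point) until it enters the target gamut,
    which keeps the hue line fixed while reducing saturation.
    gamut_contains(xy) -> bool is assumed to test gamut membership."""
    xy = np.asarray(xy, dtype=float)
    a = np.asarray(anchor, dtype=float)
    if gamut_contains(xy):
        return xy                          # already in gamut: unchanged
    for t in np.linspace(0.0, 1.0, steps):
        cand = (1.0 - t) * xy + t * a      # walk toward the anchor
        if gamut_contains(cand):
            return cand
    return a                               # fallback: the anchor itself
```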
Vision and language are two fundamental capabilities of human intelligence. Humans routinely perform tasks through the interaction of vision and language, supporting the uniquely human capacity to talk about what they see or to imagine a picture from a natural-language description. The question of how language interacts with vision motivates researchers to expand the horizons of computer vision. In particular, “vision to language” has been one of the most popular topics of the past five years, with significant growth in both the volume of publications and the range of applications, e.g. captioning, visual question answering, visual dialog, and language navigation. Such tasks boost visual perception with more comprehensive understanding and diverse linguistic representations. Going beyond the progress made in “vision to language,” language can also contribute to vision understanding and offer new possibilities for visual content creation, i.e. “language to vision.” This process acts as a prism through which visual content is created, conditioned on the language inputs. This paper reviews the recent advances along these two dimensions: “vision to language” and “language to vision.” More concretely, the former part focuses on the development of image/video captioning, as well as typical encoder–decoder structures and benchmarks, while the latter summarizes the technologies of visual content creation. Real-world deployments and services of vision and language are elaborated as well.
Previous methods have dealt with discrete manipulation of facial attributes such as smiling, sadness, anger, and surprise, drawn from canonical expressions; they are not flexible and operate in a single modality. In this paper, we propose a novel framework that supports continuous edits and multi-modality portrait manipulation using adversarial learning. Specifically, we adapt cycle consistency to the conditional setting by leveraging additional facial landmark information. This has two effects: first, cycle mapping induces bidirectional manipulation and identity preservation; second, samples paired from different modalities can thus be utilized. To ensure high-quality synthesis, we adopt a texture loss that enforces texture consistency and multi-level adversarial supervision that facilitates gradient flow. Quantitative and qualitative experiments show the effectiveness of our framework in performing flexible, multi-modality portrait manipulation with photo-realistic effects. Code will be made public: shorturl.at/chopD.
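A minimal sketch of the conditional cycle-consistency term (the generator signature `G(x, c)` and the L1 penalty are assumptions for illustration, not the authors' exact formulation):

```python
import torch

def conditional_cycle_loss(G, x, c_src, c_tgt):
    """Conditional cycle consistency: translate a portrait x toward a
    target condition (e.g. facial landmarks), map it back under the source
    condition, and penalize the L1 reconstruction error. This is the term
    that encourages bidirectional manipulation and identity preservation."""
    x_fake = G(x, c_tgt)        # forward edit toward the target condition
    x_rec = G(x_fake, c_src)    # inverse edit back to the source condition
    return torch.mean(torch.abs(x_rec - x))
```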
Most deep latent factor models choose simple priors for simplicity or tractability, or because it is unclear what prior to use. Recent studies show that the choice of prior may have a profound effect on the expressiveness of the model, especially when the generative network has limited capacity. In this paper, we propose to learn a proper prior from data for adversarial autoencoders (AAEs). We introduce the notion of code generators, which transform manually selected simple priors into ones that better characterize the data distribution. Experimental results show that the proposed model generates images of better quality and learns better disentangled representations than AAEs in both supervised and unsupervised settings. Lastly, we demonstrate its ability to perform cross-domain translation in a text-to-image synthesis task.
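A sketch of the code-generator idea (layer sizes are illustrative, not the paper's architecture): a small network warps samples of a simple prior into learned-prior samples, which then play the role of "real" codes in the AAE's adversarial game on the latent space.

```python
import torch
import torch.nn as nn

class CodeGenerator(nn.Module):
    """Maps standard-normal noise into samples of a learned prior; the AAE
    latent discriminator then matches the encoder's codes to these samples
    instead of to the raw simple prior."""
    def __init__(self, noise_dim=64, code_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 128), nn.ReLU(),
            nn.Linear(128, code_dim))

    def sample(self, batch_size):
        z = torch.randn(batch_size, self.net[0].in_features)  # simple prior
        return self.net(z)       # transformed ("learned prior") codes
```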
We consider signal denoising via transform-domain shrinkage based on a novel risk criterion called the minimum probability of error (MPE), which measures the probability that the estimated parameter lies outside an ε-neighborhood of the true value. The underlying parameter is assumed to be deterministic. The MPE, like the mean-squared error (MSE), depends on the ground-truth parameter and therefore has to be estimated from the noisy observations. The optimum shrinkage parameter is obtained by minimizing an estimate of the MPE. When the probability of error is integrated over ε, it leads to the expected ℓ1 distortion. The proposed MPE and ℓ1 distortion formulations are applicable to various noise distributions by invoking a Gaussian mixture model approximation. Within the MPE framework, we also develop a specific extension to subband shrinkage. The denoising performance of the MPE approach turns out to be better than that of minimum-MSE-based approaches formulated within Stein's unbiased risk estimation (SURE) framework, especially in the low signal-to-noise ratio (SNR) regime. Performance comparisons with three benchmark algorithms, carried out on electrocardiogram signals and standard test signals from the Wavelab toolbox, show that the MPE framework yields SNR gains, particularly for low input SNR.
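In symbols (notation assumed, matching the verbal definitions above), the MPE of an estimate $\hat{\theta}$ of the deterministic parameter $\theta$ and its link to the expected ℓ1 distortion read:

```latex
\mathrm{MPE}(\hat{\theta};\varepsilon) \;=\; \Pr\!\left(\,|\hat{\theta}-\theta| > \varepsilon\,\right),
\qquad
\mathbb{E}\,|\hat{\theta}-\theta| \;=\; \int_{0}^{\infty} \Pr\!\left(\,|\hat{\theta}-\theta| > \varepsilon\,\right)\,\mathrm{d}\varepsilon .
```

The second identity is the standard tail-integral formula for the expectation of a non-negative random variable, which is why integrating the probability of error over ε yields the expected ℓ1 distortion.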
The majority of research in speech emotion recognition (SER) aims to recognize emotion categories. Recognizing dimensional emotion attributes is also important, however, and has several advantages over categorical emotion recognition. In this work, we investigate dimensional SER using both speech features and word embeddings. A concatenation network joins the acoustic network and the text network built from these bimodal features. We demonstrate that the bimodal features, both extracted from speech, improve the performance of dimensional SER over unimodal SER using either acoustic features or word embeddings alone. The addition of word embeddings contributes a significant improvement on the valence dimension, while the arousal and dominance dimensions also improve. We propose a multitask learning (MTL) approach for predicting all emotional attributes, which simultaneously maximizes the concordance correlation between the predicted emotion degrees and the true emotion labels. The findings suggest that MTL with two parameters represents the interrelation of emotional attributes better than the other evaluated methods. In the unimodal results, speech features attain higher performance on arousal and dominance, while word embeddings are better for predicting valence. The overall evaluation uses the concordance correlation coefficient (CCC) scores of the three emotional attributes. We also discuss some differences between categorical and dimensional emotion results from psychological and engineering perspectives.
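For concreteness, a generic sketch of a CCC-based loss, the kind of objective maximized per attribute in such training (not the authors' exact code):

```python
import torch

def ccc_loss(pred, target, eps=1e-8):
    """1 - CCC, where CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2).
    Minimizing this maximizes the concordance correlation between predicted
    and true emotion degrees for one attribute."""
    pm, tm = pred.mean(), target.mean()
    pv, tv = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - pm) * (target - tm)).mean()
    ccc = 2 * cov / (pv + tv + (pm - tm) ** 2 + eps)
    return 1.0 - ccc
```

In the MTL setting, one such loss per attribute (valence, arousal, dominance) would be combined with weighting parameters into a single training objective.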
Many end-to-end, large-vocabulary, continuous speech recognition systems now achieve better recognition performance than conventional systems. However, most of these approaches are based on bidirectional networks and sequence-to-sequence modeling, so automatic speech recognition (ASR) systems using such techniques must wait for an entire segment of voice input before they can begin processing, resulting in a lengthy time lag that can be a serious drawback in some applications. An obvious solution to this problem is to develop a speech recognition algorithm capable of processing streaming data. Therefore, in this paper we explore the possibility of a streaming, online ASR system for Japanese using a model based on unidirectional LSTMs trained with the connectionist temporal classification (CTC) criterion, with local attention. Such an approach has not been well investigated for Japanese, as most Japanese-language ASR systems employ bidirectional networks. The best result of our proposed system in the experimental evaluation was a character error rate of 9.87%.
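A minimal sketch of the streaming-friendly part of such a model, a unidirectional LSTM acoustic model trained with CTC (dimensions are illustrative, and the paper's local-attention mechanism is omitted):

```python
import torch
import torch.nn as nn

class StreamingCTCModel(nn.Module):
    """Unidirectional LSTM stack with a CTC output layer: because no
    backward recurrence is used, frames can be processed as they arrive."""
    def __init__(self, n_feats=80, n_tokens=3000, hidden=512, layers=5):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, num_layers=layers,
                            batch_first=True, bidirectional=False)
        self.proj = nn.Linear(hidden, n_tokens + 1)   # +1 for the CTC blank

    def forward(self, x):                  # x: (B, T, n_feats)
        h, _ = self.lstm(x)
        return self.proj(h).log_softmax(dim=-1)

ctc_criterion = nn.CTCLoss(blank=0)        # training objective
```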
Deep neural networks (DNNs) have the same structure as the neocognitron proposed in 1979 but achieve much better performance, largely because DNNs incorporate many heuristic techniques such as pre-training, dropout, skip connections, batch normalization (BN), and stochastic depth. However, the reason why these techniques improve performance is not fully understood. Recently, two tools for theoretical analysis have been proposed. One evaluates the generalization gap, defined as the difference between the expected loss and the empirical loss, by calculating the algorithmic stability; the other evaluates the convergence rate by calculating the eigenvalues of the Fisher information matrix of DNNs. This overview paper briefly introduces these tools and demonstrates their usefulness by showing why skip connections and BN improve performance.
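For reference, the generalization gap mentioned above can be written as follows (notation assumed: loss $\ell$, parameters $w$, data distribution $\mathcal{D}$, and $n$ training samples $z_i$):

```latex
\mathrm{gap}(w) \;=\;
\underbrace{\mathbb{E}_{z\sim\mathcal{D}}\big[\ell(w;z)\big]}_{\text{expected loss}}
\;-\;
\underbrace{\frac{1}{n}\sum_{i=1}^{n}\ell(w;z_i)}_{\text{empirical loss}} .
```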
People are at the very heart of our daily work and life. As we strive to leverage artificial intelligence to empower every person on the planet to achieve more, we need to understand people far better than we can today. Human–computer interaction plays a significant role in human-machine hybrid intelligence, and human understanding becomes a critical step in addressing the tremendous challenges of video understanding. In this paper, we share our views on why and how a human-centric approach can address challenging video understanding problems. We discuss human-centric vision tasks and their status, highlighting the challenges and how our understanding of human brain functions can be leveraged to effectively address some of them. We show that semantic models, view-invariant models, and spatial-temporal visual attention mechanisms are important building blocks. We also discuss future perspectives on video understanding.
In this paper, we study the semantic segmentation of 3D LiDAR point cloud data in urban environments for autonomous driving, and we propose a method that utilizes the surface information of the ground plane. In practice, the resolution of a LiDAR sensor installed on a self-driving vehicle is relatively low, so the acquired point cloud is quite sparse. While recent work on dense point cloud segmentation has achieved promising results, its performance degrades when applied directly to sparse point clouds. This paper focuses on the semantic segmentation of sparse point clouds obtained from a 32-channel LiDAR sensor using deep neural networks. The main contribution is the integration of ground information, which is used to group ground points that are far away from each other. Qualitative and quantitative experiments on two large-scale point cloud datasets show that the proposed method outperforms the current state of the art.
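One common way to obtain such ground-plane information is RANSAC plane fitting on the raw point cloud; the sketch below is an illustrative stand-in, not necessarily the authors' procedure:

```python
import numpy as np

def ransac_ground_plane(points, n_iter=200, dist_thresh=0.2, rng=None):
    """Estimate the dominant (ground) plane by RANSAC.
    points: (N, 3) array of x, y, z coordinates.
    Returns a boolean inlier mask of the estimated ground points."""
    rng = rng or np.random.default_rng(0)
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue                      # degenerate (collinear) sample
        normal /= norm
        d = -normal @ sample[0]
        dist = np.abs(points @ normal + d)  # point-to-plane distances
        mask = dist < dist_thresh
        if mask.sum() > best_mask.sum():
            best_mask = mask              # keep the plane with most inliers
    return best_mask
```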
Road scene understanding is a critical component of an autonomous driving system. Although deep learning-based road scene segmentation can achieve very high accuracy, its complexity is also very high for real-time applications. It is challenging to design a neural network with both high accuracy and low computational complexity. To address this issue, we investigate the advantages and disadvantages of several popular convolutional neural network (CNN) architectures in terms of speed, storage, and segmentation accuracy. We start from the fully convolutional network with VGG, and then study ResNet and DenseNet. Through detailed experiments, we select the favorable components from the existing architectures and, at the end, construct a lightweight network architecture based on DenseNet. Our proposed network, called DSNet, demonstrates real-time inference on a popular GPU platform while maintaining accuracy comparable with most previous systems. On several datasets, including the challenging Cityscapes dataset (resolution 1024 × 512), it achieves a mean Intersection over Union (mIoU) of about 69.1% with a runtime of 0.0147 s/image on a single GTX 1080Ti. We also design a more accurate model, at the price of slower speed, which achieves an mIoU of about 72.6% on the CamVid dataset.
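For reference, a minimal DenseNet-style block of the kind such a lightweight architecture builds on (growth rate and depth are illustrative, not DSNet's actual configuration):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer consumes the concatenation of all preceding feature maps,
    which encourages feature reuse and keeps the parameter count low."""
    def __init__(self, in_ch, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth, 3, padding=1, bias=False)))
            ch += growth

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)   # all features, densely connected
```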
Nowadays, the security of automatic speaker verification (ASV) systems is gaining increasing attention. As one of the common spoofing methods, replay attacks are easy to implement but difficult to detect, and many researchers focus on designing features that detect the distortions introduced by replay. Constant-Q cepstral coefficients (CQCC), based on the magnitude of the constant-Q transform (CQT), are among the most effective features in the field of replay detection. However, they ignore phase information, which may also be distorted in the replay process. In this work, we propose a CQT-based modified group delay feature (CQTMGD) that can capture the phase information of the CQT. Furthermore, a multi-branch residual convolutional network, ResNeWt, is proposed to distinguish replay attacks from bona fide attempts. We evaluated our proposal on the ASVspoof 2019 physical access dataset. Results show that CQTMGD outperforms the traditional MGD feature, and that fusion with other magnitude-based and phase-based features achieves further improvement. Our best fusion system achieved a min-tDCF of 0.0096 and an EER of 0.39% on the evaluation set, outperforming all other state-of-the-art methods in the ASVspoof 2019 physical access challenge.
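For context, the standard modified group delay function (MGDF) for one windowed frame is sketched below via the DFT; the paper's CQTMGD replaces the DFT with the constant-Q transform, and the alpha/gamma values here are typical choices, not the paper's:

```python
import numpy as np

def modified_group_delay(frame, alpha=0.4, gamma=0.9, eps=1e-12):
    """MGDF: tau = (X_R*Y_R + X_I*Y_I) / |S|^(2*gamma), then
    sign(tau)*|tau|^alpha, where X is the spectrum of the frame, Y the
    spectrum of the time-weighted frame n*x[n], and S ideally a cepstrally
    smoothed magnitude spectrum (plain |X| is used here for brevity)."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame)
    Y = np.fft.rfft(n * frame)            # spectrum of time-weighted signal
    S = np.abs(X)
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma) + eps)
    return np.sign(tau) * np.abs(tau) ** alpha
```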
In recent years, automatic speaker verification (ASV) has been used extensively for voice biometrics, leading to increased interest in securing these voice biometric systems for real-world applications. ASV systems are vulnerable to various kinds of spoofing attacks, namely synthetic speech (SS), voice conversion (VC), replay, twins, and impersonation. This paper reviews the literature on ASV spoof detection, novel acoustic feature representations, deep learning, end-to-end systems, etc. Furthermore, the paper summarizes previous studies of spoofing attacks, with emphasis on SS, VC, and replay, along with recent efforts to develop countermeasures for the spoof speech detection (SSD) task. The limitations and challenges of the SSD task are also presented. While several countermeasures have been reported in the literature, they are mostly validated on particular databases, and their performance is far from perfect; the security of voice biometric systems against spoofing attacks thus remains a challenging topic. This paper is based on a tutorial presented at the APSIPA Annual Summit and Conference 2017 to serve as a quick start for those interested in the topic.
In 2018, the Alliance for Open Media (AOMedia) finalized its first video compression format, AV1, which was jointly developed by an industry consortium of leading video technology companies. The main goal of AV1 is to provide an open-source and royalty-free video coding format that substantially outperforms state-of-the-art codecs on the market in compression efficiency while maintaining practical decoding complexity and being optimized for hardware feasibility and scalability on modern devices. To give detailed insight into how the targeted performance and feasibility are realized, this paper provides a technical overview of the key coding techniques in AV1. In addition, the coding performance gains are validated by video compression tests performed with the libaom AV1 encoder against the libvpx VP9 encoder. A preliminary comparison with two leading HEVC encoders, x265 and HM, and with the VVC reference software is also conducted on AOM's common test set and an open 4K set.
This article presents an overview of the recent standardization activities for point cloud compression (PCC). A point cloud is a 3D data representation used in diverse applications associated with immersive media, including virtual/augmented reality, immersive telepresence, autonomous driving, and cultural heritage archival. The international standards body for media compression, the Moving Picture Experts Group (MPEG), is planning to release two PCC standard specifications in 2020: video-based PCC (V-PCC) and geometry-based PCC (G-PCC). V-PCC and G-PCC will be part of the ISO/IEC 23090 series on the coded representation of immersive media content. In this paper, we provide a detailed description of both codec algorithms and their coding performance, and we also discuss certain unique aspects of point cloud compression.
Impulsive noise (IN) degrades the performance of wireless communication in modern 5G scenarios such as manufacturing and automated factories. The proposed receiver utilizes a constant false alarm rate (CFAR) criterion to obtain the threshold and combines it with blanking to further improve the performance of the conventional blanking scheme at acceptable complexity. Simulation results show that the proposed receiver achieves a lower bit error rate even when the probability of IN occurrence is very high and the power of the IN is much larger than that of the background noise.
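A minimal sketch of blanking with a CFAR-style threshold (illustrative, not the paper's exact estimator: it assumes real-valued samples and Gaussian background noise, and uses a robust median-based noise-scale estimate):

```python
import numpy as np
from scipy.special import erfinv

def blanking_receiver(r, pfa=1e-3):
    """Estimate the background-noise scale from the received samples, set
    the threshold for a fixed false-alarm probability, and zero (blank)
    samples whose magnitude exceeds it."""
    sigma = np.median(np.abs(r)) / 0.6745      # robust sigma estimate (MAD)
    # Gaussian tail: P(|r| > T) = pfa  =>  T = sigma * sqrt(2) * erfinv(1 - pfa)
    T = sigma * np.sqrt(2.0) * erfinv(1.0 - pfa)
    out = r.copy()
    out[np.abs(r) > T] = 0.0                   # blank impulsive samples
    return out
```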