Parameter estimation is generally difficult, requiring advanced methods such as expectation-maximization (EM). This chapter focuses on the ideas behind EM rather than its complex mathematical properties or proofs. We use the Gaussian mixture model (GMM) as an illustrative example to uncover what leads us to the EM algorithm, e.g., complete and incomplete data likelihoods, concave and nonconcave loss functions, and observed and hidden variables. We then derive the EM algorithm in general and show its application to GMM.
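As a hedged illustration of the alternating structure described above, the following minimal sketch runs EM on a one-dimensional, two-component GMM; the synthetic data, initial values, and iteration count are assumptions for demonstration only, not material from the chapter.

```python
# Minimal sketch: EM for a 1-D, two-component Gaussian mixture model.
# All data and starting values are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
# "Incomplete" data: observations without their component labels.
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

# Initial guesses for mixing weights, means, and variances.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: posterior responsibility of each component for each point.
    resp = pi * gaussian(x[:, None], mu, var)          # shape (n, 2)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the expected complete-data likelihood.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print("weights", pi, "means", mu, "variances", var)
```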
This chapter presents a simple but working face recognition system, which is based on the nearest neighbor search algorithm. Albeit simple, it is a complete pattern recognition pipeline. We can then examine every component in it, and analyze potential difficulties and pitfalls one may encounter. Furthermore, we introduce a problem-solving framework, which will be useful in the rest of this book and in solving other tasks.
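The following is a minimal sketch of the kind of nearest neighbor pipeline the chapter builds on: flatten each face image into a vector and assign the label of the closest training image. The random placeholder images, the image size, and the number of identities are illustrative assumptions.

```python
# Minimal sketch: nearest neighbor classification on flattened "face" images.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, h, w = 40, 5, 32, 32
train_images = rng.random((n_train, h, w))
train_labels = rng.integers(0, 4, n_train)      # 4 hypothetical identities
test_images = rng.random((n_test, h, w))

# Represent each face as a flat vector (the simplest possible feature).
X_train = train_images.reshape(n_train, -1)
X_test = test_images.reshape(n_test, -1)

# For every test face, find the training face at the smallest Euclidean
# distance and copy its label.
dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
predictions = train_labels[np.argmin(dists, axis=1)]
print(predictions)
```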
This chapter is not about one particular method (or a family of methods). Instead, it provides a set of tools useful for better pattern recognition, especially in real-world applications. These include the definition of distance metrics and vector norms, a brief introduction to the idea of distance metric learning, and power mean kernels (a family of useful metrics). We also show by example that proper normalization of our data is essential, and introduce a few data normalization and transformation methods.
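As a small hedged example of why normalization matters for distance-based methods, the sketch below shows a large-scale feature dominating the Euclidean distance until a per-feature z-score normalization is applied; the two-feature data set is invented for illustration.

```python
# Without normalization, a large-scale feature dominates Euclidean distance.
import numpy as np

# Feature 0 lives in [0, 1], feature 1 in the hundreds.
a = np.array([0.1, 200.0])
b = np.array([0.9, 210.0])
c = np.array([0.1, 400.0])

print(np.linalg.norm(a - b), np.linalg.norm(a - c))  # raw: feature 1 dominates

# z-score normalization per feature, using statistics of a reference set.
X = np.stack([a, b, c])
Xn = (X - X.mean(axis=0)) / X.std(axis=0)
an, bn, cn = Xn
print(np.linalg.norm(an - bn), np.linalg.norm(an - cn))  # scales now comparable
```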
Starting from this chapter, Part III introduces several commonly used algorithms in pattern recognition and machine learning. The support vector machine (SVM) starts from a simple and beautiful idea: the large margin. We first show that, to arrive at this idea, we may need to simplify our problem setup by assuming a linearly separable binary classification problem. Then we visualize and calculate the margin to reach the SVM formulation, which is complex and difficult to optimize. We apply the simplification procedure again until the formulation becomes viable, briefly mention the primal–dual relationship, but do not go into details of its optimization. We show that the simplification assumptions (linear, separable, and binary) can be relaxed so that SVM can solve more difficult tasks, and the key ideas involved, slack variables and kernel methods, are also useful in other tasks.
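To make the soft-margin idea concrete, here is a hedged sketch of a linear SVM trained by subgradient descent on the hinge loss (points violating the margin correspond to nonzero slack); the toy data and hyperparameters (C, learning rate, epoch count) are assumptions, and this is not the dual or kernelized formulation discussed in the chapter.

```python
# Minimal sketch: linear soft-margin SVM via subgradient descent on
# 0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)).
import numpy as np

rng = np.random.default_rng(0)
# Two roughly linearly separable clusters with labels in {-1, +1}.
X = np.vstack([rng.normal([2, 2], 0.5, (50, 2)),
               rng.normal([-2, -2], 0.5, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

w, b, C, lr = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(500):
    margins = y * (X @ w + b)
    mask = margins < 1                       # points with nonzero slack
    grad_w = w - C * (y[mask, None] * X[mask]).sum(axis=0)
    grad_b = -C * y[mask].sum()
    w -= lr * grad_w
    b -= lr * grad_b

print("w =", w, "b =", b)
print("training accuracy:", ((X @ w + b) * y > 0).mean())
```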
Information theory was developed in the communications community, but it turns out to be very useful for pattern recognition. In this chapter, we start with an example to develop the idea of uncertainty and its measurement, i.e., entropy. A few core results of information theory are introduced: entropy, joint and conditional entropy, mutual information, and their relationships. We then move to differential entropy for continuous random variables and find distributions with maximum entropy under certain constraints, which are useful for pattern recognition. Finally, we introduce applications of information theory in our context: maximum entropy learning, minimum cross entropy, feature selection, and decision trees (a widely used family of models for pattern recognition and machine learning).
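A small worked example of the core quantities mentioned above: entropy, joint entropy, and mutual information for two binary random variables. The joint distribution is invented purely for illustration.

```python
# Entropy, joint entropy, and mutual information for a toy joint P(X, Y).
import numpy as np

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])   # joint distribution P(X, Y)
p_x = p_xy.sum(axis=1)          # marginal P(X)
p_y = p_xy.sum(axis=0)          # marginal P(Y)

def H(p):
    """Shannon entropy in bits; zero-probability entries are ignored."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

h_x, h_y, h_xy = H(p_x), H(p_y), H(p_xy.ravel())
mi = h_x + h_y - h_xy           # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(f"H(X)={h_x:.3f}  H(Y)={h_y:.3f}  H(X,Y)={h_xy:.3f}  I(X;Y)={mi:.3f}")
```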
What does a probabilistic program actually compute? How can one formally reason about such probabilistic programs? This valuable guide covers these elementary questions and more. It provides a state-of-the-art overview of the theoretical underpinnings of modern probabilistic programming and their applications in machine learning, security, and other domains, at a level suitable for graduate students and non-experts in the field. In addition, the book treats the connection between probabilistic programs and mathematical logic and security (what is the probability that software leaks confidential information?), and presents three programming languages for different applications: Excel tables, program testing, and approximate computing. This title is also available as Open Access on Cambridge Core.
Successful prevention of cyberbullying depends on the adequate detection of harmful messages. Given the impossibility of human moderation on the Social Web, intelligent systems are required to identify clues of cyberbullying automatically. Much work on cyberbullying detection focuses on detecting abusive language without analyzing the severity of the event or the participants involved. Automatic analysis of participant roles in cyberbullying traces enables targeted bullying prevention strategies. In this paper, we aim to automatically detect the different participant roles involved in textual cyberbullying traces, including bullies, victims, and bystanders. We describe the construction of two cyberbullying corpora (one Dutch and one English) that were both manually annotated with bullying types and participant roles, and we perform a series of multiclass classification experiments to determine the feasibility of text-based cyberbullying participant role detection. The representative datasets present a data imbalance problem, for which we investigate feature filtering and data resampling as skew mitigation techniques. We investigate the performance of feature-engineered single and ensemble classifier setups as well as transformer-based pretrained language models (PLMs). Cross-validation experiments revealed promising results for the detection of cyberbullying roles using PLM fine-tuning techniques, with the best classifier for English (RoBERTa) yielding a macro-averaged ${F_1}$-score of 55.84% and the best one for Dutch (RobBERT) yielding an ${F_1}$-score of 56.73%. Experiment replication data and source code are available at https://osf.io/nb2r3.
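As a hedged, minimal illustration of multiclass participant-role classification and the macro-averaged F1 metric reported above, the sketch below uses a TF-IDF plus logistic regression stand-in; the tiny example texts and labels are invented, and this is not the authors' feature-engineered or PLM-based setup.

```python
# Toy multiclass role classifier (bully / victim / bystander) with macro-F1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

texts = ["you are worthless", "leave me alone please", "stop picking on him",
         "nobody likes you", "I reported that message", "why are you so mean to me"]
roles = ["bully", "victim", "bystander", "bully", "bystander", "victim"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(texts, roles)

predicted = clf.predict(texts)
# Macro-averaged F1 treats every role equally, which matters for skewed data.
print("macro-F1:", f1_score(roles, predicted, average="macro"))
```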
Using natural language processing, it is possible to extract structured information from raw text in the electronic health record (EHR) at reasonably high accuracy. However, the accurate distinction between negated and non-negated mentions of clinical terms remains a challenge. EHR text includes cases where diseases are stated not to be present or are only hypothesised, meaning a disease can be mentioned in a report without being reported as present. This makes tasks such as document classification and summarisation more difficult. We have developed the rule-based EdIE-R-Neg, part of an existing text mining pipeline called EdIE-R (Edinburgh Information Extraction for Radiology reports; https://www.ltg.ed.ac.uk/software/edie-r/) developed to process brain imaging reports, as well as two machine learning approaches: one using a bidirectional long short-term memory network and another using a feedforward neural network. These were developed on data from the Edinburgh Stroke Study (ESS) and tested on data from routine reports from NHS Tayside (Tayside). Both datasets consist of written reports from medical scans. These models are compared with two existing rule-based models: pyConText (Harkema et al. 2009. Journal of Biomedical Informatics 42(5), 839–851), a python implementation of a generalisation of NegEx, and NegBio (Peng et al. 2017. NegBio: A high-performance tool for negation and uncertainty detection in radiology reports. arXiv e-prints, p. arXiv:1712.05898), which identifies negation scopes through patterns applied to a syntactic representation of the sentence. On both the test set of the dataset from which our models were developed and the largely similar Tayside test set, the neural network models and our custom-built rule-based system outperformed the existing methods. EdIE-R-Neg scored highest on F1 score, particularly on the test set of the Tayside dataset, from which no development data were used in these experiments, showing the power of custom-built rule-based systems for negation detection on datasets of this size. The performance gap between the machine learning models and EdIE-R-Neg on the Tayside test set was reduced by adding development Tayside data to the ESS training set, demonstrating the adaptability of the neural network models.
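As a hedged illustration of the rule-based idea behind systems such as NegEx, the toy sketch below marks a clinical term as negated when a trigger word occurs within a small token window before it; the trigger list, term list, and window size are invented and are not EdIE-R-Neg's actual rules.

```python
# Toy NegEx-style negation detection: a term is negated if a trigger token
# appears within WINDOW tokens before it. All lists here are placeholders.
import re

NEG_TRIGGERS = {"no", "not", "without", "denies"}
TERMS = {"infarct", "haemorrhage", "stroke"}
WINDOW = 6  # maximum number of tokens between trigger and term

def negated_terms(sentence):
    tokens = re.findall(r"[a-z']+", sentence.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok in TERMS:
            window_tokens = tokens[max(0, i - WINDOW):i]
            if any(t in NEG_TRIGGERS for t in window_tokens):
                hits.append(tok)
    return hits

print(negated_terms("There is no evidence of acute infarct or haemorrhage."))
# -> ['infarct', 'haemorrhage']
```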
This paper proposes a task-related electroencephalogram research framework (tEEG framework) to guide scholars' research on EEG-based cognitive and affective studies in the context of design. The proposed tEEG framework aims to investigate design activities with loosely controlled experiments and to decompose a complex design process into multiple primitive cognitive activities, for which different research hypotheses on basic design activities can be effectively formulated and tested. Existing EEG techniques and methods can then be applied to analyse EEG signals related to design. Three application examples are presented at the end of this paper to demonstrate how the proposed framework can be applied to analyse design activities. Existing methods and models, currently spread across a wide spectrum of resources and fields, are also summarized to support the effective application of the tEEG framework.
Multicomponent polymer systems are of interest in organic photovoltaic and drug delivery applications, among others where diverse morphologies influence performance. An improved understanding of morphology classification, driven by composition-informed prediction tools, will aid polymer engineering practice. We use a modified Cahn–Hilliard model to simulate polymer precipitation. Such physics-based models require high-performance computations that prevent rapid prototyping and iteration in engineering settings. To reduce the required computational costs, we apply machine learning (ML) techniques, in conjunction with the simulations, for clustering and subsequent prediction of the simulated polymer-blend images. Integrating ML and simulations in this manner reduces the number of simulations needed to map out the morphology of polymer blends as a function of input parameters and also generates a data set that others can use to this end. We explore dimensionality reduction, via principal component analysis and autoencoder techniques, and analyze the resulting morphology clusters. Supervised ML using Gaussian process classification was subsequently used to predict morphology clusters from species molar fraction and interaction parameter inputs. Manual pattern clustering yielded the best results, but ML techniques were able to predict the morphology of polymer blends with ≥90% accuracy.
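The following condensed sketch shows the ML side of such a workflow under stated assumptions: flattened morphology fields are reduced with PCA, clustered, and a Gaussian process classifier then predicts the cluster from the simulation inputs (molar fraction and interaction parameter). The random arrays are placeholders for actual Cahn–Hilliard outputs, and k-means stands in for whichever clustering is used; the paper reports manual clustering worked best.

```python
# Sketch: dimensionality reduction + clustering + GP classification of
# simulated polymer-blend morphologies. All data below are placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.gaussian_process import GaussianProcessClassifier

rng = np.random.default_rng(0)
n_sims, size = 60, 32
images = rng.random((n_sims, size * size))   # flattened morphology fields
inputs = rng.random((n_sims, 2))             # [molar fraction, interaction parameter]

# Reduce each image to a few principal components, then cluster the results.
features = PCA(n_components=10).fit_transform(images)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Predict the morphology cluster directly from the simulation inputs.
gpc = GaussianProcessClassifier(random_state=0).fit(inputs, clusters)
print("predicted morphology cluster:", gpc.predict([[0.4, 0.7]]))
```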
Green–Griffiths–Kerr introduced Hodge representations to classify the Hodge groups of polarized Hodge structures, and the corresponding Mumford–Tate subdomains. We summarize how, given a fixed period domain $\mathcal{D}$, to enumerate the Hodge representations and corresponding Mumford–Tate subdomains $D \subset \mathcal{D}$. The procedure is illustrated in two examples: (i) weight two Hodge structures with $p_g = h^{2,0} = 2$; and (ii) weight three CY-type Hodge structures.
As products are developed over time and across organisations, there is a risk of unintended accumulation and misconception of the margins allocated. Accumulation of margins can result in overdesign, but under-allocation also adds risk. This paper describes the different terminology used in one organisation and shows the different roles margins play across the design process, and in particular how margins are a critical but often overlooked aspect of product platform design. The research was conducted in close collaboration with a truck manufacturer between 2013 and 2018. The objective was to gain an understanding of how the current use of margins and associated concepts evolves along the product life cycle, across the organisation, and across product platform representations. It was found that margins already play an important role throughout the entire design process; however, they are not recognised as a unified concept that is clearly communicated and tracked throughout the design process. Rather, different stakeholders have different notions of margins and do not disclose the rationale behind adding margins or the amount that they have added. Margins also enabled designers to avoid design changes, as existing components and systems can accommodate new requirements, thereby saving significant design time.
Technical challenges associated with telomere length (TL) measurements have prompted concerns regarding their utility as a biomarker of aging. Several factors influence TL assessment via qPCR, the most common measurement method in epidemiological studies, including storage conditions and DNA extraction method. Here, we tested the impact of power supply during the qPCR assay. Momentary fluctuations in power can affect the functioning of high-performance electronics, including real-time thermocyclers. We investigated whether mitigating these fluctuations by using an uninterruptible power supply (UPS) influenced TL assessment via qPCR. Samples run with a UPS had significantly lower standard deviation (p < 0.001) and coefficient of variation (p < 0.001) across technical replicates than those run without a UPS. UPS usage also improved exponential amplification efficiency at the replicate, sample, and plate levels. Together, these improvements translated into increased performance across metrics of external validity, including correlation with age, within-person correlation across tissues, and correlation between parents and offspring.
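As a minimal, hedged sketch of the replicate-level precision metrics compared in the study, the snippet below computes the standard deviation and coefficient of variation across qPCR technical replicates for each sample; the replicate values are invented placeholders.

```python
# Per-sample precision across qPCR technical replicates: SD and CV.
import numpy as np

# rows = samples, columns = technical replicate measurements (placeholder values)
replicates = np.array([[1.02, 0.98, 1.01],
                       [0.85, 0.91, 0.88],
                       [1.20, 1.15, 1.22]])

means = replicates.mean(axis=1)
sds = replicates.std(axis=1, ddof=1)   # sample standard deviation
cvs = sds / means                      # coefficient of variation per sample

for m, s, cv in zip(means, sds, cvs):
    print(f"mean={m:.3f}  sd={s:.3f}  cv={cv:.2%}")
```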