High-dimensional data is prevalent in machine learning and related areas. Indeed, it is common for the number of data dimensions to exceed the number of data examples. In such cases we seek a lower-dimensional representation of the data. In this chapter we discuss some standard methods which can also improve prediction performance by removing ‘noise’ from the representation.
High-dimensional spaces – low-dimensional manifolds
In machine learning problems data is often high-dimensional – images, bag-of-words descriptions, gene expressions, etc. In such cases we cannot expect the training data to densely populate the space, meaning that there will be large parts in which little is known about the data. For the handwritten digits from Chapter 14, the data is 784-dimensional and for binary-valued pixels the number of possible images is 2^784 ≈ 10^236. Nevertheless, we would expect that only a handful of examples of a digit should be sufficient (for a human) to understand how to recognise a 7. Digit-like images must therefore occupy a highly constrained volume in the 784 dimensions and we expect only a small number of degrees of freedom to be required to describe the data to a reasonable accuracy. Whilst the data vectors may be very high dimensional, they will therefore typically lie close to a much lower-dimensional ‘manifold’ (informally, a two-dimensional manifold corresponds to a warped sheet of paper embedded in a high-dimensional space), meaning that the distribution of the data is heavily constrained.
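As a rough illustration of this point (a sketch, not part of the original text), one can measure how few directions of variation are needed to describe digit images by examining the singular values of the centred data matrix. The snippet below uses scikit-learn’s 8×8 digits dataset as a stand-in for the 784-dimensional images discussed above; the dataset choice and the 90% variance threshold are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits

# Load the 8x8 digits (64 dimensions) as a small stand-in for
# the 784-dimensional handwritten digits discussed in the text.
X, _ = load_digits(return_X_y=True)      # shape (1797, 64)
Xc = X - X.mean(axis=0)                  # centre the data

# The singular values of the centred data matrix give the variance
# captured along each principal direction.
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()

# How many directions are needed to capture, say, 90% of the variance?
k = int(np.searchsorted(np.cumsum(explained), 0.90)) + 1
print(f"{k} of {X.shape[1]} directions capture 90% of the variance")
```

Typically far fewer directions than the ambient dimension suffice, consistent with the picture of the data lying close to a low-dimensional manifold.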