Ever since the mathematician Clive Humby coined the phrase ‘Data is the new oil’ in 2006, we have all become a bit obsessed with what data is ‘like’ so as to sell its virtues, to convince more people to be data cheerleaders and work with data as an asset. We have seen the phrase used on numerous occasions: world leaders, business leaders and publications worldwide have picked it up and acted as if it was the most important thing that Humby said. Michael Palmer, writing a blog post in November 2006, stated: ‘Data is just like crude. It's valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value’ (https://ana.blogs.com/maestros/2006/11/data_is_the_new.html).
The point to be understood is that data in its raw form doesn't really do very much. Humby's phrase also conveys that data can be used in many different ways and can be turned into a multitude of different and varied products from which we get value. It sits there full of potential, waiting for us to refine, clean, link, structure and analyse it; basically to unlock it so that we can turn it into a model for predicting when extreme weather will affect us, for coping with spikes in demand in our medical services, or for anticipating customer or citizen behaviour.
The phrase also highlights that oil and data have some attributes in common. There may be some value in taking our understanding of how we use oil and applying it to data. We can look at how oil as an asset is treated and draw useful parallels for how we can treat data. From understanding what stages oil goes through and how it is treated, we can move on to thinking about how the same principle can be applied to data and the processes data needs to go through in order to be useful; in other words, the refining process, understanding what it is going to be used for, the preparation phase and so on. The words also bring to mind the engineering and the energy required to convert oil into something useful (think about the complexity and scale of an oil refinery).
In Chapter 5 we looked at going ‘beyond’ data to more comprehensive data that extends the boundaries of conventional laws and thinking. This data creates the Halo around the original data point.
There is more to learn and understand about Halo data from the discipline of physics. In Niels Bohr's 1913 model of the atom, the most stable, lowest-energy level is found in the innermost orbit. This first orbital forms a shell around the nucleus and is assigned a principal quantum number (n) of n=1. So, in our analogy, the innermost metadata is the most stable. In data science terms it has the lowest potential energy left to release but has attained the highest order, n=1; it therefore has the greatest realised value to the business, occupying the innermost orbit and forming a shell around the central data point.
Additional orbital shells are assigned values n=2, n=3, n=4, etc.
As electrons move further away from the nucleus, they gain potential energy and become less stable. So, with our Halo data, as we move further away from the data point and more into assumption and unverified data, the data becomes less stable but the ‘potential energy’ of that data increases. For example, the political leanings of Peter may be unverified, a matter of assumption rather than fact, and so that piece of data may sit out in the n=6 orbit; but it may have huge potential energy if we can verify it as a fact. As a ‘fact’ at this stage, though, it is very unstable and very unassured: the confidence level is low. Our Halo data fits Bohr's model of the atom.
To continue with Bohr's model, atoms with electrons in their lowest-energy orbits are in a ‘ground’ state, and those with electrons in higher-energy orbits are in an ‘excited’ state. Quantum mechanics describes electrons moving from an outer orbit to an inner orbit and releasing energy as they do so. So, data points (remember our Peter example) with only simple metadata associated with them are in a ‘ground state’. Data points with a Halo of data are in an ‘excited state’. As data professionals, as data scientists, we want data in an excited state: this is where the ‘potential’ exists.
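To make the analogy concrete, here is a purely illustrative Python sketch rather than anything the chapter prescribes: a data point counts as ‘excited’ once it carries Halo data beyond its innermost metadata. The names HaloDatum, orbit and is_excited, and the use of a simple verified flag, are our own assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class HaloDatum:
    attribute: str   # e.g. 'political leaning'
    orbit: int       # n: 1 = innermost, most stable; higher = further out, less stable
    verified: bool   # has this piece of data been confirmed as fact?

def is_excited(halo):
    """A data point is 'excited' if it carries Halo data beyond the innermost orbit."""
    return any(d.orbit > 1 for d in halo)

# The Peter example: verified metadata sits at n=1 (stable, value already realised),
# while an unverified political leaning sits out at n=6 (unstable, low confidence,
# but with high potential energy if it can be verified).
peter_halo = [
    HaloDatum('name', orbit=1, verified=True),
    HaloDatum('political leaning', orbit=6, verified=False),
]

print(is_excited(peter_halo))  # True: Peter's data point is in an 'excited' state
```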
A few years ago we had the pleasure of meeting Catherine Mandungu, who is an expert in revenue operations and spends her time helping businesses to think about their revenue in a different way. She approaches data and its value from that perspective. We invited her to share her thinking with you, as it demonstrates how the value of data is understood and used from a non-data-professional point of view.
Your morning alarm goes off. You stretch a little, have a glass of water and reach for your phone. Some time ago you bought an app, SleepScore, which tracks your sleep activity. Looks good, you had a peaceful night. You also check Instagram and Facebook to see what you have missed while asleep. You like a few healthy recipe pages. You come across an advertisement for a yoga app. Perfect! You're all about health and fitness. Maybe you'll subscribe for a trial. Now, time to get ready for a working day.
You manage a commercial team at an analytics company for ecommerce businesses and have sales targets to hit, so before your online meeting with the team you check the performance stats. You need to figure out where productivity gains could be made. Later in the day, you speak to the Head of Product. You have been collecting data about deals that were lost because of missing product features, and the business needs to deliberate on how to improve the product so that more deals can be closed.
After a long working day, you pop down to the grocery shop to get something for dinner. At the checkout you are asked if you have a loyalty card. Yes indeed. Register the goods you bought; you might get a discount. Once dinner is out of the way, it's time to unwind. You decide to watch a movie on Netflix. Immediately it recommends some movies and series you might like. Great! It was going to be too much effort to look for something to watch. Finally, to bed, but you sometimes have trouble sleeping, so you got a sleep and meditation app called Calm to ease you into it. You're in bed and ready. Press play.
We are living in a digital world which has fuelled a data economy.
Let's remind ourselves of a question we posed in Chapter 2, ‘What is Metadata?’ Are we stuck in 1967? Are we stuck in the Summer of Love, Sergeant Pepper's Lonely Hearts Club Band, mini-skirts and the Ford Mustang (although we are both rather partial to the newer version of this car)? Perhaps more pertinent, are we currently stuck with an understanding and usage of metadata set by the initial thinking of 1967? Haven't we moved forward in our thinking? In Chapter 2 we also briefly explored the definition of ‘meta’ (‘after’, ‘beyond’, ‘more compre hensive’) and aligned it with metaphysics: beyond the physical laws of physics. So again we pose the question: shouldn't we be taking metadata beyond the accepted laws and definitions cast in the mould of 1967?
Again looking back to Chapter 3, we had two definitions of ontology and we discussed the second in some detail: ‘a set of concepts and categories in a subject area or domain that shows their properties and the relations between them’. We said that we would return to the first definition: ‘the branch of metaphysics dealing with the nature of being’. This first definition makes the link between ontology and metaphysics and the nature of the ‘being’ of data. This may help us to understand the inherent value in data and how it changes. It may also help us to better understand the very nature of data and explore those concepts of the ‘single version of the truth’ or the ‘golden source’.
Data and metaphysics
Let's move the conversation along and step into the world of metaphysics. Consider the ‘old school’ view of the atom, its orbiting electrons and nucleus. This is synonymous with our current view of metadata: a few pieces of information (electrons) circulating around the data point (the nucleus), in very structured and defined paths. The electrons (the metadata) occupy very defined paths and orbits, and perhaps we can see these as the types of metadata: structural, descriptive and administrative, or perhaps even as the W7 framework. The orbits occupied by the electrons (metadata) are also very structured, so perhaps they are the ontology of the metadata. Applying these concepts from the old-school view of the atom to metadata demonstrates that our current view of metadata is very rigid and fixed.
Before we look at a couple of examples, let's recap our paradigm shift on metadata. We now see metadata as a Halo around the CDE. The CDE is the nucleus; the quantum element attached to it defines its purpose; and the metadata in the Halo either adds value to the CDE or has the potential to add value. The data in the metadata Halo is assigned four values (sketched in code after this list):
n – the distance of the data from the nucleus
v – the value that the metadata provides to the CDE
p – the potential that the metadata has to provide value to the CDE
c – the level of confidence given to the metadata.
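As a minimal sketch, assuming nothing beyond the four values listed above, each item of Halo metadata could be captured as a small record, with the CDE as the nucleus that holds its Halo. The class names, field types, numeric scales and the two aggregation formulas are ours, chosen purely for illustration; they are not a schema the chapter prescribes.

```python
from dataclasses import dataclass, field

@dataclass
class HaloMetadata:
    description: str  # what this piece of Halo data says about the CDE
    n: int            # distance of the data from the nucleus (1 = innermost)
    v: float          # value the metadata currently provides to the CDE
    p: float          # potential value it could still provide to the CDE
    c: float          # confidence level given to the metadata (0.0 to 1.0)

@dataclass
class CDE:
    name: str                                 # the CDE itself (the nucleus)
    purpose: str                              # the quantum element attached to it
    halo: list = field(default_factory=list)  # the Halo of metadata around it

    def realised_value(self):
        # One arbitrary way (not prescribed by the chapter) to summarise value
        # already delivered: weight each item's current value by our confidence in it.
        return sum(m.v * m.c for m in self.halo)

    def potential_value(self):
        # Likewise arbitrary: value still locked up, discounted where confidence is low.
        return sum(m.p * (1 - m.c) for m in self.halo)

# Hypothetical usage with the Peter example.
peter = CDE(name='Peter', purpose='customer record', halo=[
    HaloMetadata('date of birth', n=1, v=0.9, p=0.1, c=1.0),
    HaloMetadata('political leaning', n=6, v=0.0, p=0.8, c=0.2),
])
print(peter.realised_value(), peter.potential_value())
```

A structure like this makes it easy to ask, for any CDE, how much value its Halo has already realised and how much still sits there as potential.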
We are pushing the concept of metadata beyond the limits of the 1960s definition to demonstrate that metadata in its own right has the potential to deliver value to the business.
This all might seem obvious and straightforward: we've always had metadata and we've always had some concept that data provides value to the business. But have we really? How much value do organisations really put on the metadata that they already have, let alone on metadata that they might collect and use? Across the wider business the answer is likely ‘very little’; it is also likely that they don't leverage the value locked up in the metadata.
We are only now starting to talk about data having value and demonstrating that value; the idea that the metadata could have a demonstrated value as well is lagging behind in our thinking.
There are a number of reasons why organisations are still stuck in the 1960s in their understanding of metadata.
1 There is a lack of creative thinking about metadata as a source of potential value. This was discussed in Chapter 2. Metadata has been viewed as an unexciting chore that must be dealt with, with no obvious RoI but plenty of overheads, looked after by the technology teams.
2 Organisations have not been accustomed to searching for and ingesting new and different types of data that may or may not deliver value to their existing datasets.
3 Linked to the above, but a reason in its own right, is a fear of failure. Organisations aren't prepared to invest in a ‘data project’ if there is no clear RoI – and even more so if they can't show that it has already effectively been done somewhere else.