Published online by Cambridge University Press: 10 September 2022
Learning outcomes of this chapter
• Adopting a broader view of metadata quality
• Why you need to clean your metadata in the context of linked data
• Identification of most common metadata quality issues
• Understanding the possibilities and limits of automated metadata cleaning
• Case study: cleaning metadata of the Schoenberg Database of Manuscripts
Introduction
‘It is not a bug, it is a feature’ is one of the more interesting lines one of us learnt when working for a software company. When a customer noticed an inconsistency in one of the products, the challenge was to convince the client that the issue was not a shortcoming but actually a quality of the software. This line comes to mind when we think about the relation between linked data and metadata quality. The lack of consistent, formalized and well-structured data on the web is often presented as the Achilles’ heel of the semantic web and linked data vision. However, we prefer to see the same reality from another viewpoint. Even the most ardent critic of linked data must admit at least one positive outcome: linked data have put metadata quality in the spotlight, finally giving this topic the attention it deserves.
If you only remember one thing from this chapter, it should be this: all metadata is dirty, but you can do something about it. Recurrent metadata quality issues, such as duplicate records or inconsistent encoding of dates or names, have a negative impact not only on the use of your metadata but also on the implementation of linked data methodologies. As Chapters 4 and 5 will demonstrate, the success rate of methods such as reconciliation and enrichment depends to a large extent on how consistent and well-structured your metadata are. Data profiling and cleaning techniques will teach you how to spot these issues and, where possible, mitigate them.
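To make the idea of profiling more concrete, here is a minimal Python sketch of how duplicate names and inconsistent date encodings might be spotted. It is an illustration only, not the method used later in the chapter: the sample records, the field names `name` and `date`, and the helper functions `normalize_name`, `date_pattern` and `profile` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"id": 1, "name": "Doe, Jane", "date": "1450"},
    {"id": 2, "name": "doe,  jane ", "date": "ca. 1450"},
    {"id": 3, "name": "Smith, John", "date": "1450-1475"},
    {"id": 4, "name": "Smith, John", "date": "15th century"},
]

def normalize_name(name):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())

def date_pattern(value):
    """Reduce a date string to its shape: ASCII digits become 9, ASCII letters become a."""
    shape = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "a", shape)

def profile(records):
    """Return names occurring more than once and a count of date shapes."""
    name_counts = Counter(normalize_name(r["name"]) for r in records)
    duplicates = {n: c for n, c in name_counts.items() if c > 1}
    date_shapes = Counter(date_pattern(r["date"]) for r in records)
    return duplicates, date_shapes

duplicates, date_shapes = profile(records)
print(duplicates)
print(date_shapes)
```

In this sketch every record has a differently shaped date, which is exactly the kind of inconsistency that profiling is meant to surface. Deciding which shape should become the norm, and whether two similar names really refer to the same entity, remains a judgement call.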
The difficulty of combining theory and practice
Data quality has attracted a lot of attention recently within academic circles. A large number of papers and books describe data quality with the help of theoretical concepts, models and frameworks which often refer to and build upon one another.