Multilingual information access
Increasingly, modern digital libraries (DLs) have to deal with an array of different media types, such as photos, paintings, sounds, maps, manuscripts, books, newspapers and archival papers, often in large volumes. For example, in mid-2011, Europeana, a portal that provides online access to much of Europe's cultural heritage, contained well over 19 million digitized cultural objects, of which over 60% were images. The challenges do not stop at handling different media types; DLs must also support digital materials that are multicultural and multilingual (Borgman, 1997; Oard, 1997). In June 2010 it was estimated that, of the 1.9 billion internet users, 27% used English, 23% Chinese and 8% Spanish (with Arabic ranked 7th) (Internet World Stats). According to the Online Computer Library Center (OCLC), in 2010 its publicly accessible global online library catalogue, WorldCat, contained nearly 197 million records for library items in 479 languages from more than 17,000 libraries in 52 countries (OCLC, 2010). More than 57% of the records in WorldCat are written in languages other than English. DL infrastructures must therefore be able to handle increasing volumes of multimedia content, and to store and present documents that are written in multiple scripts and localized to specific user groups.
Multilingual Information Access (MLIA) addresses the problem of accessing and retrieving information from collections in any language. This covers both technical aspects, such as language identification and character encoding, and the overall access and retrieval of multilingual information. Systems that process information in multiple languages (queries, documents or both) are called Multilingual Information Retrieval (MLIR) systems (Peters et al., 2012). In such systems the documents in the collection exist in different languages and search requests can be made in any language, as happens on the web. More specifically, systems that help users to cross the language boundary, by querying a multilingual collection in one language in order to retrieve relevant documents written in other languages, are referred to as Cross-Language Information Retrieval (CLIR) systems (Gey et al., 2005; Nie, 2010). An obvious question for MLIR/CLIR is ‘Why do users want to retrieve documents they presumably can't read?’.