« eLuxemburgensia » is the National Library of Luxembourg's second digitisation project. The content can be browsed or searched in full-text mode through the new interface at www.eluxemburgensia.lu. Furthermore, the digitised content has been integrated into a digital archiving system.
To offer this full-text search, the content has been run through optical character recognition (OCR) software. This process introduces errors in the recognition of words. The errors can have several causes: poor paper quality of the originals, printing imperfections, or decay of the paper over time. Because of these imperfections, OCR, even with current state-of-the-art technology, cannot correctly identify every letter in the original.
To produce good scans, the National Library first needs to ensure that the paper originals are of the best possible quality. It does so by scanning from its own collections, which are complemented as needed by volumes graciously lent by the National Archives, the Grand Séminaire - Centre Jean XXIII, the municipality of Grevenmacher and the Centre national de littérature, as well as by private collectors.
The first digitisation project (2002-2008) produced digital images, which can be viewed and downloaded at www.luxemburgensia.bnl.lu.
The digitisation project aims:
- to promote the printed heritage while protecting the fragile originals,
- to provide public online access to this heritage (within the limits of copyright regulation),
- to provide enhanced search methods for the digitised content.
Technical: METS/ALTO as digitisation formats
The National Library uses METS/ALTO for its digitisation projects. These formats are used, as far as possible, for the images and the metadata.
METS
The METS format makes it possible to model and search the logical structure of documents (pages, articles, etc.). It also keeps track of technical metadata (resolution, scanning equipment, etc.), which supports the long-term preservation of content.
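As a rough illustration of what "logical structure" means in practice, the following Python sketch walks the logical structure map of a small, made-up METS document and lists its divisions. The `structMap` and `div` elements and their `TYPE` and `LABEL` attributes come from the METS schema; the sample document, the division types `ISSUE` and `ARTICLE`, and the function name `logical_outline` are assumptions chosen for the example and may not match the profile the BnL actually uses.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# Hypothetical, heavily simplified METS document (not taken from BnL data).
SAMPLE_METS = """<mets xmlns="http://www.loc.gov/METS/">
  <structMap TYPE="PHYSICAL"><div TYPE="PAGE" LABEL="Page 1"/></structMap>
  <structMap TYPE="LOGICAL">
    <div TYPE="ISSUE" LABEL="Sample issue">
      <div TYPE="ARTICLE" LABEL="First article"/>
      <div TYPE="ARTICLE" LABEL="Second article"/>
    </div>
  </structMap>
</mets>"""


def logical_outline(mets_xml):
    """Return (depth, type, label) tuples for every div in the logical structMap."""
    root = ET.fromstring(mets_xml)
    outline = []

    def walk(div, depth):
        outline.append((depth, div.get("TYPE", ""), div.get("LABEL", "")))
        # Only direct children, so the nesting depth stays accurate.
        for child in div.findall(f"{{{METS_NS}}}div"):
            walk(child, depth + 1)

    for struct_map in root.findall(f"{{{METS_NS}}}structMap"):
        if struct_map.get("TYPE") == "LOGICAL":
            for div in struct_map.findall(f"{{{METS_NS}}}div"):
                walk(div, 0)
    return outline


if __name__ == "__main__":
    for depth, div_type, label in logical_outline(SAMPLE_METS):
        print("  " * depth + f"{div_type}: {label}")
```

Real METS files are considerably larger and also carry the technical metadata mentioned above in separate sections, which this sketch ignores.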
ALTO
The ALTO metadata format describes the layout of individual pages and stores the output of the OCR (Optical Character Recognition) program, which enables full-text search. Combining METS and ALTO makes it possible to search for specific terms and then highlight the terms found in the original article on the page images.
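To make the link between OCR text and highlighting more concrete, here is a minimal Python sketch that looks up a word in a small, made-up ALTO fragment and returns the rectangle to highlight on the page image. The `String` element and its `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes are part of the ALTO schema; the sample fragment and the function names `local_name` and `find_term` are assumptions for illustration, and this is not a description of how the BnL viewer itself is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified ALTO page (not taken from BnL data).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Luxemburger" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="40"/>
            <SP/>
            <String CONTENT="Wort" HPOS="420" VPOS="200" WIDTH="120" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def find_term(alto_xml, term):
    """Return (hpos, vpos, width, height) boxes for words equal to `term`."""
    root = ET.fromstring(alto_xml)
    wanted = term.lower()
    boxes = []
    for element in root.iter():
        if local_name(element.tag) != "String":
            continue
        if element.get("CONTENT", "").lower() == wanted:
            boxes.append(tuple(
                float(element.get(name, "0"))
                for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
            ))
    return boxes


if __name__ == "__main__":
    print(find_term(SAMPLE_ALTO, "wort"))
```

The comparison here is an exact, case-insensitive match on single words. Because OCR output contains recognition errors, a real search service would probably need something more tolerant.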
METS/ALTO in Open Data
Following the Open Data strategy of the government of the Grand Duchy of Luxembourg, the National Library of Luxembourg (BnL) has opened its data to the public and made them accessible on the website data.bnl.lu.
Anyone can now download large sets of data and reuse them freely. The data made available by the BnL is aimed at a diverse professional audience (data scientists, historians, linguists, digital humanities researchers and developers).
The datasets and metadata available on data.bnl.lu are part of the digitised Luxembourgish periodical collections which are in the public domain. The datasets range in size from 250 MB to 257 GB and allow for different levels of usage, such as training machine learning algorithms or even neural networks.
The viewer as open source package
The newspaper viewer for METS/ALTO files, which has been developed by the BnL, is available as an open source package through sourceforge.net. The project page for the bnlviewer includes the viewer itself, the search service and a sample of METS/ALTO files.
The BnL welcomes your comments and questions about the project and website.