Language filter for digitised content

The search engine on eluxemburgensia.lu now features a language filter, which allows users to narrow the search results to one or more languages. The newspapers, books and historical magazines digitised by the BnL bear witness to the culture of multilingualism present in Luxembourg and often bring together different languages on the same page. Although most of the content is in German and French, there is also content in Luxembourgish, English and other languages. With the new filter, a user who prefers to read in only one of these languages can, for example, restrict the results to German, French or Luxembourgish content.

We encourage users of eluxemburgensia to search in different languages. The language filter makes it easy to see whether the results are all in one language and whether it would be worth searching again with translated keywords.

The language information has been integrated into the metadata of the digitised documents and can now also be used for data analysis. Because algorithms for automatic language processing are often language-dependent, this work provides a good basis for the development of new tools for the analysis of historical texts.

Which languages are represented?

The two main languages are German, which accounts for 67% of the total, and French, at 31.3%. Luxembourgish is represented in a much smaller proportion, with 118,123 articles (1.4%), while English, with 10,149 articles (0.1%), is a minority language.

These are followed by 15 other, less widely used languages:

Latin 1,959
Italian 1,379
Portuguese 212
Polish 168
Dutch 78
Spanish 32
Esperanto 15
Hungarian 14
Croatian 8
Ido 4
Danish 2
Irish 2
Bosnian 2
Russian 2
Slovenian 1

What is the language detection process?

While language determination for monographs is based on bibliographic records, the process is more complicated for serial articles for which no such data exists. As optical character recognition (OCR) is not yet perfect and as some types of texts, such as lists of names, do not have an identifiable language, the algorithm developed by the BnL uses several complementary heuristics:

  • A vote among the standard algorithms fastText, CLD3 and langid
  • Dictionaries of the languages identified in the collection
  • Measures of OCR quality
  • Information about other articles in the periodical
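The first heuristic, a vote among several detectors, can be illustrated with a minimal Python sketch. This is not the BnL's implementation: the function name `vote_language`, the `min_votes` parameter and the rule of returning no result when there is no clear majority are assumptions made for illustration. The detector outputs are passed in as plain language codes, so no particular library API is assumed.

```python
from collections import Counter
from typing import Optional


def vote_language(predictions: dict[str, str], min_votes: int = 2) -> Optional[str]:
    """Return the language most detectors agree on, or None if there is no clear winner.

    `predictions` maps a detector name to the language code it predicted.
    Both the name and the `min_votes` threshold are hypothetical.
    """
    ranked = Counter(predictions.values()).most_common()
    if not ranked:
        return None
    top_language, top_votes = ranked[0]
    # A tie for first place means the detectors do not agree on a winner.
    tied = len(ranked) > 1 and ranked[1][1] == top_votes
    if top_votes < min_votes or tied:
        return None
    return top_language


# Two of three detectors agree, so the majority language wins.
print(vote_language({"fasttext": "de", "cld3": "de", "langid": "fr"}))
# All three disagree, so the vote is inconclusive.
print(vote_language({"fasttext": "de", "cld3": "fr", "langid": "lb"}))
```

In a sketch like this, an inconclusive vote would be the point at which the other heuristics (dictionaries, OCR quality, the language of neighbouring articles) could be consulted.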

For languages with fewer than 1,000 articles, the texts are reviewed by hand to check that the algorithm has indeed determined the correct language. This maintains a degree of accuracy for marginal languages. For the remaining articles, the language is not manually reviewed and inaccuracies remain. In addition, there are multilingual articles (e.g. one part in French, another in German) for which the dominant language is chosen. Finally, some content, such as lists of sports results, does not lend itself to this kind of language determination process.
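How the dominant language of a multilingual article is measured is not specified here. As one possible reading, the following Python sketch picks the language that covers the most characters. The function name `dominant_language`, the segment format and the use of character counts are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Optional


def dominant_language(segments: list[tuple[str, str]]) -> Optional[str]:
    """Return the language code covering the most characters, or None if empty.

    `segments` is a list of (language_code, text) pairs, for example one pair
    per paragraph. Measuring dominance by character count is an assumption.
    """
    totals: dict[str, int] = defaultdict(int)
    for language, text in segments:
        totals[language] += len(text)
    if not totals:
        return None
    # On a tie, max() keeps the language that appeared first in the article.
    return max(totals, key=totals.get)


article = [
    ("fr", "Bonjour"),
    ("de", "Guten Morgen, wie geht es Ihnen?"),
]
print(dominant_language(article))  # the German segment is longer
```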