You might not need Metadata

»The word Metadata in Wikidata Morse code.svg« via Wikicommons (CC-0)

When text mining a collection of resources, pretty much the first step is to check the language. If it is a multi-language collection, maybe there is at least some metadata. At SUB we often use the METS/MODS standard, which is based on XML.

Within MODS there is a mods:languageTerm element whose value can be checked against an authority file of language names (or language codes). In this case the authority is iso639-2b (LoC). Unfortunately, the schema is not applied very strictly, as we can see in the following excerpt.

lang code   occurrences   documents
ger              275479        4793
de                25980         414
de e                  1           1
dee                   1           1
fr                  230          58
fra                  38          24
fre                3607         857
por                  27          20
pt                    1           1
(blank)            3613          55
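A table like this one could be produced with a few lines of Python. The following is only a minimal sketch: it assumes each METS file is already available as an XML string and uses the standard MODS v3 namespace. The function name `count_language_terms` is hypothetical and not part of any tool used at SUB.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard MODS v3 namespace; assumed to be the one used in the METS files.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}


def count_language_terms(xml_documents):
    """Count mods:languageTerm values over an iterable of XML strings.

    Returns (occurrences, documents): how often each value appears in total,
    and in how many documents it appears at least once.
    """
    occurrences = Counter()
    documents = Counter()
    for xml_text in xml_documents:
        root = ET.fromstring(xml_text)
        values = [
            (element.text or "").strip()  # empty elements become ""
            for element in root.iterfind(".//mods:languageTerm", MODS_NS)
        ]
        occurrences.update(values)
        documents.update(set(values))  # each value counts once per document
    return occurrences, documents
```

Keeping two counters separates "how often does a value occur" from "how many files are affected", which are the two numeric columns of the table.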

Well, iso639-2b lists three-letter codes only. pt is an iso639-1 alpha-2 code for Portuguese, so that document should be counted under por, which is the correct notation. At least one can guess the correct language by cross-checking the alpha-2 code. But the data also mixes iso639-2b and iso639-2t (see fre and fra), which makes another cross-check necessary. For others, like de e, we have to guess that this means something like German, while the blank field is simply unusable as long as we are not going for automatic language detection.
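The two cross-checks can be sketched as simple lookups. This is an illustration only: the mapping tables below cover just the codes from the excerpt, not the full ISO lists, and the function name `normalize_language_code` is hypothetical. Values that cannot be mapped safely (such as de e, dee or the blank field) are returned as `None` so that they can be reviewed by hand instead of being guessed.

```python
# Partial lookup tables, limited to the codes seen in the excerpt above.
KNOWN_ISO639_2B = {"ger", "fre", "por"}
ALPHA2_TO_2B = {"de": "ger", "fr": "fre", "pt": "por"}  # iso639-1 -> iso639-2b
TERMINOLOGIC_TO_2B = {"deu": "ger", "fra": "fre"}       # iso639-2t -> iso639-2b


def normalize_language_code(raw):
    """Map a raw languageTerm value to its iso639-2b code, or None."""
    value = raw.strip().lower()
    if value in KNOWN_ISO639_2B:
        return value
    if value in ALPHA2_TO_2B:
        return ALPHA2_TO_2B[value]
    if value in TERMINOLOGIC_TO_2B:
        return TERMINOLOGIC_TO_2B[value]
    return None  # unknown or ambiguous: needs manual review


for raw in ["ger", "de", "fra", "pt", "de e", ""]:
    print(repr(raw), "->", normalize_language_code(raw))
```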

There is quite a good one in the even longer complete list of language codes of this particular collection: bel, the code for Belarusian. It is unlikely that a resource published in Belgium is written in Belarusian.

Six more flawed values are still in there, where what look like titles or headings ended up in the language field:

lang code                       occurrences   documents
Culture, leisure, sports                  1           1
Impressum                                 2           2
Inhalt                                    1           1
Inhalt des XLIII. Bandes                  1           1
New Books on the RWE Homepage             2           1
[Rezensionen]                             1           1

Besides these mistakes, it seems that most of the data is specified very well. We just have to correct about 10% of the data, which can be done via script. But the question remains: why is there no validation of these files?
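What such a validation could look like is sketched below. It is not an existing check at SUB, just an assumption-laden example: it expects the term to carry the MODS attribute authority="iso639-2b", it uses a deliberately partial set of allowed codes (a real check would load the complete list from the LoC), and the function name `invalid_language_terms` is made up for this post.

```python
import xml.etree.ElementTree as ET

# Standard MODS v3 namespace; assumed to be the one used in the METS files.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Partial list for illustration only, not the full iso639-2b code list.
ALLOWED_CODES = frozenset({"ger", "fre", "por"})


def invalid_language_terms(xml_text, allowed=ALLOWED_CODES):
    """Return all iso639-2b languageTerm values that are not allowed codes."""
    root = ET.fromstring(xml_text)
    path = ".//mods:languageTerm[@authority='iso639-2b']"
    problems = []
    for element in root.iterfind(path, MODS_NS):
        value = element.text or ""
        if value not in allowed:
            problems.append(value)
    return problems
```

Running this on every file before it is indexed would flag values like Inhalt or de right away, instead of leaving them to be discovered later.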

You can look for these values by searching the index. The rather strange values are not in the index but still in the METS files, e.g. here.

Thanks to Michelle, who provided the title for this post.

  1. 2018-02-19-metadata.csv