»The word Metadata in Wikidata Morse code.svg« via Wikicommons (CC-0)
When text mining a collection of resources, pretty much the first step is to check the language. If it is a multi-language collection, there may at least be some metadata to rely on. At SUB we often use the XML-based METS/MODS standard.
Within MODS there is a mods:languageTerm element that can be checked against an authority file for its named languages (or language codes). In this case it is iso639-2b (LoC). Unfortunately the schema is not applied very strictly, as we can see in the following excerpt.
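To get an overview of what is actually in the field, one can simply tally all mods:languageTerm values in a file. A minimal sketch (the file name and collection layout are assumptions for illustration; the MODS namespace is the official one):

```python
# Sketch: collect the distinct values of mods:languageTerm in a
# METS/MODS file and count how often each occurs, including blanks.
import xml.etree.ElementTree as ET
from collections import Counter

MODS_NS = "{http://www.loc.gov/mods/v3}"  # official MODS v3 namespace

def language_terms(source):
    """Return a Counter of all mods:languageTerm text values."""
    tree = ET.parse(source)
    terms = Counter()
    for el in tree.iter(MODS_NS + "languageTerm"):
        # Empty elements become "" so blank fields show up in the tally too.
        terms[(el.text or "").strip()] += 1
    return terms
```

Running this over the collection immediately surfaces the odd values discussed below.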
iso639-2b lists three-letter codes only. pt is an iso639-1 alpha-2 code for Portuguese, so the document should instead be tagged por, which is the correct notation. At least in this case one can guess the correct language by cross-checking the alpha-2 code. But iso639-2t codes were also mixed in (fra instead of the bibliographic fre), which makes another cross-check necessary. For other values like de e we have to guess that something like German is meant, while a blank field is simply unusable unless we fall back on automatic language detection.
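The cleanup for such values can be scripted with a simple lookup table. A sketch, where the mapping itself is an assumption built only from the examples above (a real script would have to cover the collection's full value list):

```python
# Map the flawed values we observed to their iso639-2b codes.
FIXES = {
    "pt": "por",    # iso639-1 alpha-2 -> iso639-2b alpha-3
    "fra": "fre",   # iso639-2t (terminology) -> iso639-2b (bibliographic)
    "de e": "ger",  # best guess: meant as German
}

def normalize(code):
    """Return the iso639-2b code, or None for blank fields."""
    code = code.strip()
    if not code:
        return None  # blank field: only language detection can help here
    return FIXES.get(code, code)
```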
There is quite a good one in the even longer complete list of language codes for this particular collection: bel, the code for Belarusian. It is unlikely that a resource published in Belgium is in Belarusian.
Six flaws are still in there:
| Culture, leisure, sports | 1 | 1 |
| Inhalt des XLIII. Bandes | 1 | 1 |
| New Books on the RWE Homepage | 2 | 1 |
Besides these mistakes, it seems that most of the data is specified very well. We only have to correct about 10% of it, which can be done via script. But the question remains: why is there no validation of these files?
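Such a validation pass is not hard to write: flag every languageTerm value that is not in the iso639-2b authority list. The authority set below contains only a few codes for illustration; a real check would load the full list from the Library of Congress.

```python
# Minimal authority check: return every value that is not a known
# iso639-2b code. The set here is a tiny illustrative subset.
ISO639_2B = {"por", "ger", "fre", "eng", "bel"}

def validate(values):
    """Return the values that fail the authority check, in order."""
    return [v for v in values if v.strip() not in ISO639_2B]
```

Run at ingest time, a check like this would have caught pt, fra, de e, and the blank fields before they ever reached the collection.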
Thanks to Michelle, who provided the title for this post.