You might not need Metadata
»The word Metadata in Wikidata Morse code.svg« via Wikicommons (CC-0)
When text mining a collection of resources, pretty much the first step is to check the language. If it is a multi-language collection, maybe there is at least some metadata. At SUB we often use the METS/MODS standard, which is based on XML.
Within MODS there is a mods:languageTerm element that can be checked against an authority file for its named languages (or language codes). In this case it is iso639-2b (LoC). Unfortunately the schema is not applied very strictly, as we can see in the following excerpt.
iso639-2b lists three-letter codes only. pt is an iso639-1 alpha-2 code for Portuguese, so the document should be assigned to por, which is the correct notation. At least one can guess the correct language by cross-checking the alpha-2 code. But the records also mix in fra, the terminological (iso639-2t) code for French where iso639-2b expects fre, which makes another cross-check necessary. For others like de e we have to guess that this means something like German, while the blank field is just unusable as long as we are not going for automatic language detection.
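The cross-checks described above could be sketched in Python roughly as follows. This is only a minimal illustration, not the script used at SUB: the function name normalize_language and the three lookup tables are hypothetical, and the tables hold a small hand-picked subset of codes rather than the full authority file.

```python
# Illustrative subset only; a real script would load the complete ISO 639-2 table.
ISO639_2B = {"eng", "fre", "ger", "por", "bel", "dut"}

# iso639-1 alpha-2 codes mapped to their iso639-2b equivalents (subset).
ALPHA2_TO_2B = {"en": "eng", "fr": "fre", "de": "ger",
                "pt": "por", "be": "bel", "nl": "dut"}

# iso639-2t (terminological) codes that differ from iso639-2b (subset).
T_TO_B = {"fra": "fre", "deu": "ger", "nld": "dut"}


def normalize_language(raw):
    """Return the iso639-2b code for a raw languageTerm value, or None.

    None means the value cannot be mapped automatically (blank fields,
    values like "de e") and needs manual review or language detection.
    """
    code = (raw or "").strip().lower()
    if code in ISO639_2B:
        return code                 # already in the expected notation
    if code in ALPHA2_TO_2B:
        return ALPHA2_TO_2B[code]   # first cross-check: alpha-2 code
    if code in T_TO_B:
        return T_TO_B[code]         # second cross-check: iso639-2t code
    return None
```

Values that come back as None are deliberately not guessed, so that something like de e ends up on a list for a human to look at.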
There is quite a good one in the even longer complete list of language codes of this special collection: bel, the code for Belarusian. It is unlikely that a resource published in Belgium is in Belarusian.
Six flaws are still in there:
|Culture, leisure, sports||1||1|
|Inhalt des XLIII. Bandes||1||1|
|New Books on the RWE Homepage||2||1|
Besides these mistakes, most of the data seems to be specified very well. We just have to correct about 10% of the data, which can be done via script. But the question remains: why is there no validation of these files?
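A validation step of this kind does not need much code. The following Python sketch, using only the standard library, collects every mods:languageTerm that claims the iso639-2b authority and counts the values that are not in a given set of valid codes. The function names, the SAMPLE record and the tiny set of valid codes in the usage lines are made up for illustration; only the MODS namespace URI and the authority attribute are taken from the standard.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace of MODS version 3, as used inside METS files.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical sample record, not taken from the actual collection.
SAMPLE = """<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:language>
    <mods:languageTerm authority="iso639-2b" type="code">pt</mods:languageTerm>
  </mods:language>
  <mods:language>
    <mods:languageTerm authority="iso639-2b" type="code">ger</mods:languageTerm>
  </mods:language>
  <mods:language>
    <mods:languageTerm authority="iso639-2b" type="code"/>
  </mods:language>
</mods:mods>"""


def language_terms(xml_text):
    """Return all languageTerm values that claim the iso639-2b authority."""
    root = ET.fromstring(xml_text)
    terms = []
    for term in root.iterfind(".//mods:languageTerm", MODS_NS):
        if term.get("authority") == "iso639-2b":
            # Empty elements have text None; treat them as blank values.
            terms.append((term.text or "").strip())
    return terms


def invalid_terms(xml_text, valid_codes):
    """Count the values that are not part of the authority file."""
    return Counter(t for t in language_terms(xml_text) if t not in valid_codes)


# Usage with a deliberately tiny set of valid codes:
report = invalid_terms(SAMPLE, {"ger", "por", "fre"})
for value, count in report.items():
    print(repr(value), count)
```

Run over all METS files of a collection, the summed counters would give a list of strange values like the one discussed in this post.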
You can look for these values by searching the index. The rather strange values are not in the index but still in the METS files, e.g. here.
Thanks to Michelle, who provided the title for this post.