»The word Metadata in Wikidata Morse code.svg« via Wikicommons (CC-0)
When text mining a collection of resources, pretty much the first step is to check the language. If it is a multi-language collection, there may at least be some metadata to rely on. At SUB we often use the XML-based METS/MODS standard.
Within MODS there is a mods:languageTerm element that can be checked against an authority file for its named languages (or language codes). In this case it is iso639-2b (LoC). Unfortunately the schema is not applied very strictly, as we can see in the following excerpt.
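To get an overview of what is actually in the field, one can simply tally all mods:languageTerm values in a file. A minimal sketch (the file name and collection layout are assumptions for illustration; the MODS namespace is the official one):

```python
# Sketch: collect the distinct values of mods:languageTerm in a
# METS/MODS file and count how often each occurs, including blanks.
import xml.etree.ElementTree as ET
from collections import Counter

MODS_NS = "{http://www.loc.gov/mods/v3}"  # official MODS v3 namespace

def language_terms(source):
    """Return a Counter of all mods:languageTerm text values."""
    tree = ET.parse(source)
    terms = Counter()
    for el in tree.iter(MODS_NS + "languageTerm"):
        # Empty elements become "" so blank fields show up in the tally too.
        terms[(el.text or "").strip()] += 1
    return terms
```

Running this over the collection immediately surfaces the odd values discussed below.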
iso639-2b lists three-letter codes only. pt is an iso639-1 alpha-2 code for Portuguese, so the document should instead be tagged por, which is the correct notation. At least in this case one can guess the correct language by cross-checking the alpha-2 code. But iso639-2t codes were also mixed in (fra instead of the bibliographic fre), which makes another cross-check necessary. For other values like de e we have to guess that something like German is meant, while a blank field is simply unusable unless we fall back on automatic language detection.
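The cleanup for such values can be scripted with a simple lookup table. A sketch, where the mapping itself is an assumption built only from the examples above (a real script would have to cover the collection's full value list):

```python
# Map the flawed values we observed to their iso639-2b codes.
FIXES = {
    "pt": "por",    # iso639-1 alpha-2 -> iso639-2b alpha-3
    "fra": "fre",   # iso639-2t (terminology) -> iso639-2b (bibliographic)
    "de e": "ger",  # best guess: meant as German
}

def normalize(code):
    """Return the iso639-2b code, or None for blank fields."""
    code = code.strip()
    if not code:
        return None  # blank field: only language detection can help here
    return FIXES.get(code, code)
```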
There is quite a good one in the even longer complete list of language codes for this particular collection: bel, the code for Belarusian. It is unlikely that a resource published in Belgium is in Belarusian.
Six flaws are still in there:
| Culture, leisure, sports | 1 | 1 |
| Inhalt des XLIII. Bandes | 1 | 1 |
| New Books on the RWE Homepage | 2 | 1 |
Besides these mistakes, it seems that most of the data is specified very well. We only have to correct about 10% of it, which can be done via script. But the question remains: why is there no validation of these files?
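Such a validation pass is not hard to write: flag every languageTerm value that is not in the iso639-2b authority list. The authority set below contains only a few codes for illustration; a real check would load the full list from the Library of Congress.

```python
# Minimal authority check: return every value that is not a known
# iso639-2b code. The set here is a tiny illustrative subset.
ISO639_2B = {"por", "ger", "fre", "eng", "bel"}

def validate(values):
    """Return the values that fail the authority check, in order."""
    return [v for v in values if v.strip() not in ISO639_2B]
```

Run at ingest time, a check like this would have caught pt, fra, de e, and the blank fields before they ever reached the collection.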
Thanks to Michelle, who provided the title for this post.