Here is a metadata labyrinth.
A consortium known as the “Comité International des Télécommunications de Presse” (a.k.a. IPTC) developed a standard for embedding information about photographs into digital files. Adobe Photoshop picked up on this, and soon IPTC became the standard way to embed copyright, caption, and other important metadata in digital photographs.
Unfortunately, when IPTC was developed, it wasn’t widely understood that 8 bits is not enough space to distinguish all the characters in written language. If you’re an English speaker, you might be surprised to hear this. 8 bits provides 256 distinct codes (2 raised to the 8th power is 256). Since there are only 26 letters in the Roman alphabet, you’re probably thinking, “There are 230 extra codes!” But lowercase letters are different from uppercase ones. And letters with accents are different again. But the real killer is that some Asian languages have thousands of glyphs.
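To make the arithmetic concrete, here is a minimal Python sketch. The helper name `fits_in_one_byte` is my own invention, and Latin-1 is just one convenient example of a single-byte encoding, not something the IPTC spec prescribes.

```python
# How many codes does one byte give us, and how many are left after A-Z?
codes = 2 ** 8
print(codes, codes - 26)

def fits_in_one_byte(char, encoding="latin-1"):
    """Hypothetical helper: can this character be stored as a single byte?"""
    try:
        return len(char.encode(encoding)) == 1
    except UnicodeEncodeError:
        # The encoding has no code at all for this character.
        return False

# Upper case, lower case, an accented letter, and a kanji.
for char in "Aaé日":
    print(char, fits_in_one_byte(char))
```

The first three characters fit, because Latin-1 spends its spare codes on exactly these cases. The kanji does not, and no single-byte encoding can cover thousands of glyphs like it.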
So a latecomer to the IPTC game is the use of “character encodings” to provide a way to specify one of those thousands of glyphs in an 8-bit stream. How does it work? Well, there are a variety of ways, but they all rely on some technique that lets a byte say, “hey, I’m part of a multi-byte character”. There are dozens if not hundreds of different character encodings.
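UTF-8 is one encoding where you can actually see a byte announcing "I'm part of a multi-byte character": the top bits of each byte give its role. The sketch below is only an illustration of that one encoding (other encodings use different tricks), and `describe_utf8_bytes` is a name I made up.

```python
def describe_utf8_bytes(text):
    """Return a (hex, role) pair for each byte of text encoded as UTF-8."""
    roles = []
    for byte in text.encode("utf-8"):
        if byte < 0x80:
            role = "single-byte character"    # bit pattern 0xxxxxxx
        elif byte < 0xC0:
            role = "continuation byte"        # bit pattern 10xxxxxx
        else:
            role = "lead byte"                # bit pattern 11xxxxxx
        roles.append((f"{byte:02X}", role))
    return roles

# A plain letter, the copyright sign, and a kanji.
for hex_byte, role in describe_utf8_bytes("a©日"):
    print(hex_byte, role)
```

The plain letter takes one byte, the copyright sign takes a lead byte plus one continuation byte, and the kanji takes a lead byte plus two.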
So, IPTC added support for character encodings. But unfortunately, the documentation for IPTC wasn’t written by a native English speaker — I think it was probably written by a Japanese lawyer, so it is rather hard to decipher.
The IPTC spec refers to character escape sequences, ISO 2022, and does explicitly say it supports UTF-8. I like UTF-8 — it is what Mac OS X supports natively, so it is easy to use.
Google ISO 2022, and you find the definition of “escape sequences”. These are a way to change the character encoding mid-stream. So the first sentence of your description could be plain Latin-1 (ISO 8859-1) text, with byte A9 for the copyright sign (©), but then switch to another encoding, where A9 might be a different glyph. (Strictly speaking A9 isn’t ASCII at all, since ASCII only defines 7-bit codes.) Makes my head spin.
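Here is a small Python sketch of why that matters: the very same byte, A9, decoded under a few different encodings. The list of encodings is my own pick for illustration, and `decode_byte` is a hypothetical helper, not anything from the IPTC or ISO documents.

```python
def decode_byte(raw, encodings):
    """Decode the same bytes under several encodings; None means invalid."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid on their own in this encoding.
            results[name] = None
    return results

results = decode_byte(b"\xa9", ["latin-1", "iso8859_5", "cp437", "utf-8"])
for name, char in results.items():
    print(name, repr(char))
```

Latin-1 gives the copyright sign, the Cyrillic and DOS code pages give something else entirely, and UTF-8 rejects a lone A9 because it is a continuation byte with no lead byte in front of it.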
Unfortunately, ISO 2022 doesn’t mention UTF-8.
I guess UTF-8 support in ISO 2022 is an extension?
Eventually I did find a few sketchy pages, one of which suggests that the escape sequence for UTF-8 is ESC % G. Interesting, but most of these references talk about how terminal programs should function to properly do telnet. Ugh, that doesn’t sound very authoritative with respect to metadata.
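If those pages are right, the sequence is just three bytes: ESC (1B), "%" (25), and "G" (47). A rough Python sketch of checking for it might look like this. I am assuming the sequence sits at the very start of whatever field declares the encoding, and `declares_utf8` is a hypothetical helper of mine, not part of any IPTC library.

```python
ESC = 0x1B
# ESC % G, the sequence those pages give for switching to UTF-8.
UTF8_ESCAPE = bytes([ESC]) + b"%G"

def declares_utf8(field):
    """Hypothetical helper: True if the bytes begin with ESC % G."""
    return field.startswith(UTF8_ESCAPE)

# Show the raw bytes of the escape sequence in hex.
print(" ".join(f"{b:02X}" for b in UTF8_ESCAPE))
print(declares_utf8(b"\x1b%G"))
print(declares_utf8(b"plain caption text"))
```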
I’ll keep searching….