Skipping Unicode BOMs with UnicodeBOMInputStream
Many years ago, I worked on a Java project which read XML configuration files. Everything was fine. I was using JAXP, the Java API for XML Processing, which was part of J2SE 1.4.2.
The development phase was nearly done, unit tests were running well and the test team had not reported any major issue. Suddenly, without any modification to the source code that could explain what was going on, the Java Virtual Machine started spitting out tons of exceptions:
org.xml.sax.SAXParseException: Document root element is missing.
Except the document root was not missing.
I double-checked configuration file paths and their contents again and again. I double-checked commits in the SCM. I asked the test team if they had already encountered the problem. They hadn’t.
Debugging further, I decided to look at a hex dump of the offending configuration file, which revealed three unexpected bytes (EF BB BF) before the document root element. Those bytes were in fact a Unicode Byte Order Mark (BOM) encoded in UTF-8.
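If you have no hex editor at hand, the same check can be done from Java. The following is only a minimal sketch: the class name `BomPeek` and the method `hexPrefix` are made up for illustration, and the sample data is built in memory rather than read from a real configuration file. A UTF-8 encoded BOM is the byte sequence EF BB BF, so a file saved with one shows those bytes before the first `<` (3C).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class BomPeek {
    // Hypothetical helper: returns up to `count` leading bytes as hex pairs.
    public static String hexPrefix(InputStream in, int count) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            int b = in.read();          // -1 signals end of stream
            if (b < 0) {
                break;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Simulate a file saved with a UTF-8 BOM in front of the XML.
        byte[] xml = "<config/>".getBytes(StandardCharsets.UTF_8);
        byte[] withBom = new byte[xml.length + 3];
        withBom[0] = (byte) 0xEF;
        withBom[1] = (byte) 0xBB;
        withBom[2] = (byte) 0xBF;
        System.arraycopy(xml, 0, withBom, 3, xml.length);

        // prints "EF BB BF 3C"
        System.out.println(hexPrefix(new ByteArrayInputStream(withBom), 4));
    }
}
```

To inspect a real file, pass a `java.io.FileInputStream` to `hexPrefix` instead of the in-memory stream.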
At this point, I remembered that I had made a quick tweak to some files, and that I had done it with… Microsoft’s Notepad.exe: the most widespread XML editor in the world!
I was not aware of Unicode BOMs, and I didn’t know that Notepad.exe would silently add one when saving a modified file encoded in UTF-8. Most “advanced” file editors detect BOMs and do not display them, which is why I didn’t understand what was going on. The problem is that Java neither recognizes nor skips UTF-8 BOMs at the beginning of input streams.
Has this been a Java bug from the beginning? The initial UTF-8 specification (RFC 2279, January 1998) says nothing about BOMs, whereas the latest UTF-8 specification (RFC 3629, November 2003) and the Unicode FAQ explicitly mention that UTF-8 data streams may contain an initial BOM. Java may simply have followed RFC 2279.
At the time, in 2004, some XML parsers like the Xerces 2 parser already had workarounds for UTF-8 BOMs while Crimson (the Java 1.4 built-in XML parser) simply didn’t. You had to skip UTF-8 BOMs in your own code.
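Skipping the BOM yourself can look something like the sketch below. It is not the UnicodeBOMInputStream class, just one possible approach built on `java.io.PushbackInputStream`; the names `Utf8BomSkipper` and `skipUtf8Bom` are invented for this example, it only handles the UTF-8 BOM, and `readAllBytes()` in `main` assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class Utf8BomSkipper {
    // Hypothetical helper: wraps `in` and consumes a leading EF BB BF if present.
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than asked for, so loop until 3 bytes or EOF.
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            // Not a BOM: push the bytes back so the caller sees the stream untouched.
            pin.unread(head, 0, n);
        }
        return pin;
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<config/>".getBytes(StandardCharsets.UTF_8);
        byte[] withBom = new byte[xml.length + 3];
        withBom[0] = (byte) 0xEF;
        withBom[1] = (byte) 0xBB;
        withBom[2] = (byte) 0xBF;
        System.arraycopy(xml, 0, withBom, 3, xml.length);

        InputStream clean = skipUtf8Bom(new ByteArrayInputStream(withBom));
        // prints "<config/>"
        System.out.println(new String(clean.readAllBytes(), StandardCharsets.UTF_8));
    }
}
```

The returned stream can then be handed to the XML parser in place of the raw one, for example through `DocumentBuilder.parse(InputStream)`. Streams without a BOM pass through unchanged, so the wrapper is safe to apply unconditionally.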
In 2001, someone opened bug JDK-4508058 with the sound expectation that Java should detect and skip UTF-8 BOMs at the beginning of UTF-8 streams, the same way it does for e.g. UTF-16. People complained a lot. There were even insults in the bug tracker (which Oracle filtered out when it took over). Nothing moved until November 2005, when Sun folks decided to fix and close the bug:
new state “closed, fixed in mustang(b61)”
Bug JDK-4508058 remained fixed for a while before the fix was ultimately reverted, because some other great programmers relied on that exact same bug:
the Java EE 5 RI and SJSAS 9.0 has been relying on detecting a BOM, setting the appropriate encoding, and discarding the BOM bytes before reading the input
See, they’re complaining because shipped code breaks if/when JDK behavior changes. And instead of fixing JDK-4508058 and accepting that this would be an annoyance only for Java EE 5 RI and SJSAS 9.0 users, people in charge at Sun decided we’re all living in a better world if JDK-4508058 gets closed as “won’t fix”. Because fuck you, just skip the BOM yourself.
Ten years have passed since I wrote the UnicodeBOMInputStream class, and Java still doesn’t properly deal with UTF-8 BOMs at the beginning of data.