Wednesday, 7 September 2011

The devil is in the detail

Being an expert in XML is like being an expert in comma-separated values.
  -- Terence Parr



I have no problem with anything else from Terence (including the marvelous ANTLR), but in this quote his intended meaning is on seriously thin ice. CSV is not trivial; it is one of the worst formats for data storage that I have ever encountered. And I say "one of the worst" only because I consider INI files significantly worse.

I am what I call a bidirectional software designer. This means I believe you can cock things up at both ends of the software universe, macro- and micro-development alike. CSV is one of those micro-failures that so many (inexperienced) programmers fall for. And it keeps creeping up on me. Proper handling of CSV files is a serious problem, as you have to deal with at least:


  • character set issues (ASCII, Unicode, mixed encodings)
  • quoting (double quotes, single quotes)
  • comments (yes, no, how?)
  • escaping (doubled double quotes, backslash)
  • locale issues (is 1.001 a decimal number or an integer? What about 1,001? Of course, everything is clear for 1,002.3. Or is it not?)
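
To make the locale point concrete, here is a minimal C# sketch. The class name LocaleAmbiguity and the two hand-built number formats are my own assumptions for illustration (an "English style" with "." as the decimal separator and a "German style" with "," as the decimal separator); they are not tied to any particular CSV library.

```csharp
using System;
using System.Globalization;

public static class LocaleAmbiguity
{
    // Hypothetical format: "." is the decimal separator, "," groups digits.
    public static readonly NumberFormatInfo EnglishStyle = new NumberFormatInfo
    {
        NumberDecimalSeparator = ".",
        NumberGroupSeparator = ","
    };

    // Hypothetical format: "," is the decimal separator, "." groups digits.
    public static readonly NumberFormatInfo GermanStyle = new NumberFormatInfo
    {
        NumberDecimalSeparator = ",",
        NumberGroupSeparator = "."
    };

    // Returns the parsed value, or null if the field is not a number in this format.
    public static decimal? Parse(string field, NumberFormatInfo format)
    {
        decimal value;
        return decimal.TryParse(field, NumberStyles.Number, format, out value)
            ? value
            : (decimal?)null;
    }

    public static void Main()
    {
        // Print how each convention reads the three fields from the list above.
        foreach (string field in new[] { "1.001", "1,001", "1,002.3" })
        {
            Console.WriteLine("{0}: English style = {1}, German style = {2}",
                field,
                Parse(field, EnglishStyle)?.ToString(CultureInfo.InvariantCulture) ?? "invalid",
                Parse(field, GermanStyle)?.ToString(CultureInfo.InvariantCulture) ?? "invalid");
        }
    }
}
```

Both conventions happily accept 1.001 and 1,001, but they disagree by a factor of a thousand on what the value is. Nothing in the CSV file itself tells you which reading was intended.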

XML (+ XML Schema) does away with all of this for good. Escaping and quoting are well defined. XML Schema gives you proper datatype representation. Comments are well defined. And XML is even hierarchical.
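
As a small illustration of the "well defined" part, the following C# sketch writes an awkward string into an element and reads it back. It assumes System.Xml.Linq is available; the class name EscapingRoundTrip and the sample string are made up for this example.

```csharp
using System;
using System.Xml.Linq;

public static class EscapingRoundTrip
{
    // Serializes a value as the text of an <association> element and reads it back.
    public static string RoundTrip(string value)
    {
        XElement element = new XElement("association", new XAttribute("id", "1"), value);
        string xml = element.ToString();   // markup characters are escaped on output
        return XElement.Parse(xml).Value;  // and unescaped again by the parser
    }

    public static void Main()
    {
        // Quotes, commas, angle brackets and ampersands: everything that hurts in CSV.
        string nasty = "He said \"1,001\" <or> 1.001 & more";
        Console.WriteLine(new XElement("association", nasty));
        Console.WriteLine(RoundTrip(nasty) == nasty);
    }
}
```

The point is that neither the writer nor the reader has to invent an escaping convention. The value that goes in is the value that comes out.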
But XML is not easy to fiddle with, either. Consider this XML fragment:

<associations>
  <association id="1">One</association>
  <association id="2">Two</association>
</associations>
The XML document tree of this fragment looks something like this:

associations
+ association
  + @id=1
+ association
  + @id=2
And then look at this wonderful piece of C# code:

XmlNode assocs; // the <associations> element, obtained elsewhere

if (assocs.ChildNodes.Count >= 1
    && assocs.ChildNodes[0].Name.Equals("association"))
{
    // process the association children
}
This code is intended to guard against associations elements that do not contain any association children. Since the accompanying schema ensures that association elements appear only as direct children of associations, everything is totally fine. Not particularly elegant, but fine.

Until you get an XML parser that does not ignore whitespace. If so, assocs.ChildNodes[0] suddenly becomes a whitespace-only text node that is followed by the real association node:
associations
+ XmlText (whitespace)
+ association
  + @id=1
+ XmlText (whitespace)
+ association
  + @id=2
+ XmlText (whitespace)

This is perfectly alright with the XML Schema (which is set to ignore whitespace) but causes the code above to silently ignore associations in a perfectly valid XML file. Have fun finding that bug.
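
For the record, here is a sketch of how the problem can be reproduced and avoided with System.Xml. The class name WhitespaceSafeCheck and its method names are hypothetical, and the robust variant is just one possible fix, not necessarily the one you would pick in real code. Note that XmlDocument reports whitespace-only nodes as XmlWhitespace nodes when PreserveWhitespace is set.

```csharp
using System;
using System.Linq;
using System.Xml;

public static class WhitespaceSafeCheck
{
    public const string Sample =
        "<associations>\n" +
        "  <association id=\"1\">One</association>\n" +
        "  <association id=\"2\">Two</association>\n" +
        "</associations>";

    // Loads the sample; preserveWhitespace decides whether whitespace nodes are kept.
    public static XmlElement Load(bool preserveWhitespace)
    {
        var doc = new XmlDocument { PreserveWhitespace = preserveWhitespace };
        doc.LoadXml(Sample);
        return doc.DocumentElement;
    }

    // The fragile check from the text: it looks only at the first child node.
    public static bool FragileHasAssociations(XmlNode assocs)
    {
        return assocs.ChildNodes.Count >= 1
            && assocs.ChildNodes[0].Name.Equals("association");
    }

    // A more robust variant: it considers element children only.
    public static bool RobustHasAssociations(XmlNode assocs)
    {
        return assocs.ChildNodes
            .OfType<XmlElement>()
            .Any(child => child.Name == "association");
    }

    public static void Main()
    {
        foreach (bool preserve in new[] { false, true })
        {
            XmlElement assocs = Load(preserve);
            Console.WriteLine("PreserveWhitespace={0}: fragile={1}, robust={2}",
                preserve, FragileHasAssociations(assocs), RobustHasAssociations(assocs));
        }
    }
}
```

With whitespace ignored, both checks find the associations. With whitespace preserved, the fragile check trips over the leading whitespace node and reports nothing, while the element-only check still works.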

XML solves a few problems of earlier ad hoc data formats, but it certainly is not simple, either. You still have to watch carefully for the small pitfalls that can cause chaos at the levels above. God is in the detail.