Uche Ogbuji (firstname.lastname@example.org), Principal Consultant, Fourthought, Inc.
31 Jan 2006
The use of XML has become widespread, but much of it is not well formed. When it is well formed, it's often of poor design, which makes processing and maintenance very difficult. And much of the infrastructure for serving XML can compound these problems. In response, there has been some public discussion of XML best practices, such as Henri Sivonen's document, "HOWTO Avoid Being Called a Bozo When Producing XML." Uche Ogbuji frequently discusses XML best practices on IBM developerWorks, and in this column, he gives you his opinion about the main points discussed in such articles.
I have been discussing XML best practices in this column and in other series for years. Others, such as fellow columnist Elliotte Rusty Harold, have covered the topic as well. The more XML experts who join the discussion of XML design principles, the better, so the community can converge on solid advice for developers at all levels of XML adoption. In this article, I use a recent document and a classic one to explore XML best practices in more detail.
Enter the no bozo zone
Henri Sivonen wrote a useful article, "HOWTO Avoid Being Called a Bozo When Producing XML" (see Resources). Adopting the perspective of XML-based Web feed formats, such as RSS and Atom, he goes over his Dos and Don'ts for producing well-formed XML with namespaces. As he says in his introduction:
There seem to be developers who think that well-formedness is awfully hard -- if not impossible -- to get right when producing XML programmatically and developers who can get it right and wonder why the others are so incompetent. I assume no one wants to appear incompetent or to be called names. Therefore, I hope the following list of Dos and Don'ts helps developers to move from the first group to the latter.
The first bit of advice Henri gives is, "Don't think of XML as a text format." I think this is dangerous advice. Certainly his main point is valid -- you cannot be as careless in producing or editing XML as you might be with a simple text document, but this applies to all text formats with any structure. However, saying that XML is not text denies one of the most important characteristics of XML, one that is enshrined in the very definition of XML in the specification. ("A textual object is a well-formed XML document [if it conforms to this specification.]") Henri's statement is also confusing because there is a technical definition of text in XML: essentially, the sequence of characters interpreted as XML. Text is not merely what goes within leaf elements or within attributes, which is technically called character data. Text is the fundamental fabric of all XML entities, so to say that XML is not text is a contradiction. I think it's more useful to highlight the specific ways in which XML differs from text formats with which developers might already be familiar.
This comment is an example of how Henri's advice is colored by his interest in the problem of generating well-formed Web feeds. He is right to warn people that carelessly slapping strings together and hoping they are well formed is a dangerous course. I too have written articles advising people to use mature XML toolkits rather than simple text tools when generating XML (see Resources). My concern is that the way in which Henri couches this advice is a bit confusing and could be misconstrued in the broader context of XML processing. He reiterates his advice in the sections, "Don't use text-based templates" and "Don't print". I think this should be summarized as: "Do not use mechanisms that you're not sure will result in well-formed XML." That's very important advice indeed. One approach to safe XML generation is sending SAX events, as Henri suggests in, "Use a tree or a stack (or an XML parser)." If you do so, however, do not assume you are home free. The SAX tools you use might not do all the necessary well-formedness checking. For example, some Unicode characters are not allowed in XML. You may need an additional level of checking to account for such issues.
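As a minimal sketch of that extra level of checking, the Python below generates a small document through SAX events with the standard library's XMLGenerator and validates character data against the XML 1.0 Char production first. The element names entry and title and the helpers check_xml_chars and write_entry are hypothetical, chosen only for illustration. The sketch assumes the generator itself might let a forbidden character through, which is the situation described above.

```python
import io
import re
from xml.sax.saxutils import XMLGenerator

# Anything outside the XML 1.0 "Char" production: tab, LF, CR,
# U+0020-U+D7FF, U+E000-U+FFFD and U+10000-U+10FFFF are the legal characters.
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def check_xml_chars(text):
    """Return text unchanged, or raise ValueError if XML 1.0 forbids a character."""
    match = _ILLEGAL_XML_CHARS.search(text)
    if match:
        raise ValueError(
            "character U+%04X is not allowed in XML 1.0" % ord(match.group())
        )
    return text


def write_entry(title):
    """Emit a tiny document through SAX events (hypothetical vocabulary)."""
    out = io.StringIO()
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()
    gen.startElement("entry", {})
    gen.startElement("title", {})
    # The generator escapes markup characters such as & and <;
    # the explicit check covers characters that escaping cannot rescue.
    gen.characters(check_xml_chars(title))
    gen.endElement("title")
    gen.endElement("entry")
    gen.endDocument()
    return out.getvalue()


if __name__ == "__main__":
    print(write_entry("Fish & chips"))
```

Rejecting the input is only one policy. Depending on the application, you might prefer to strip or replace the offending characters, as long as the choice is deliberate.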
Henri rightly suggests that users not try to manage namespaces by hand. As I've discussed on developerWorks, XML namespaces require a great deal of care. His suggestion that developers only think in terms of universal name [namespace Uniform Resource Identifier (URI) plus local name] is generally sound, but sometimes a developer cannot avoid dealing with prefixes or namespace declarations. In specifications, such as XSLT, a QName (prefix/local name combination) can be used within attribute values, and the prefix is supposed to be interpreted according to in-scope namespace declarations. This kind of pattern is called a QName in context. In this case, the developer must have control over the declared prefix or the resulting XML processing will fail. When developers do manage their own namespace declarations, the result is often messy because of the complexities of XML namespaces.
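To make the QName-in-context hazard concrete, here is a small sketch in Python. The function resolve_qname, the inv prefix, and the urn:example:inventory namespace are all hypothetical. The sketch assumes that the in-scope declarations are available as a simple prefix-to-URI mapping and that unprefixed names belong to no namespace.

```python
def resolve_qname(qname, in_scope):
    """Resolve a QName found in an attribute value to a universal name.

    in_scope maps declared prefixes to namespace URIs (an assumed,
    simplified view of the in-scope namespace declarations).
    Returns a (namespace URI, local name) pair.
    """
    prefix, sep, local = qname.partition(":")
    if not sep:
        # Assumption for this sketch: unprefixed names are in no namespace.
        return (None, qname)
    if prefix not in in_scope:
        raise ValueError("prefix %r is not declared in scope" % prefix)
    return (in_scope[prefix], local)


# The attribute value was authored against this declaration.
authored = {"inv": "urn:example:inventory"}
print(resolve_qname("inv:item", authored))

# A tool later rewrote the declaration to use a generated prefix but left
# the attribute value untouched, so the same QName no longer resolves.
rewritten = {"ns0": "urn:example:inventory"}
try:
    resolve_qname("inv:item", rewritten)
except ValueError as err:
    print("processing fails:", err)
```

The element and attribute names survive a prefix change because a namespace-aware tool tracks their universal names. The QName inside the attribute value is just a string to that tool, which is why the declared prefix has to stay under the developer's control.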
One way to clean up namespace syntax that might become messy while passing through a pipeline of XML processing is to insert a canonicalization step at the end of the pipeline. XML canonicalization eliminates the syntactic variations permitted by XML 1.0 and XML namespaces, including different namespace declaration patterns. Canonicalization will not eliminate all the issues that make namespace declarations treacherous to developers. In particular, it does not help with QName-in-context problems, since it does not change the prefixes used in a document, but it does reduce the mess of namespace declarations to the point where you can easily spot problems or even write code to automatically fix them. The GenX library, which is one of the XML generation options Henri suggests, automatically generates canonical XML, and many other toolkits provide canonicalization as an option.
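As a rough illustration of such a final step, the sketch below uses the canonicalize function from Python's standard xml.etree.ElementTree module, available in Python 3.8 and later. That function implements the C14N 2.0 flavor of canonicalization, which may differ in detail from what other toolkits produce. The sample documents and the urn:example:x namespace are hypothetical.

```python
from xml.etree.ElementTree import canonicalize

# Two serializations of the same information. The first has a redundant
# namespace declaration, mixed quoting, and a different attribute order.
verbose = (
    '<x:a xmlns:x="urn:example:x" b="1"  a=\'2\'>'
    '<x:b xmlns:x="urn:example:x"/>'
    "</x:a>"
)
tidy = '<x:a a="2" b="1" xmlns:x="urn:example:x"><x:b></x:b></x:a>'

canonical_verbose = canonicalize(xml_data=verbose)
canonical_tidy = canonicalize(xml_data=tidy)

# Both inputs reduce to the same canonical string, and the prefix chosen
# by the author is kept as it was.
print(canonical_verbose)
print(canonical_verbose == canonical_tidy)
```

Because the prefix is preserved, any QName that refers to it from an attribute value keeps working. The benefit is only that the declarations end up in one predictable place.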
Henri's advice about Unicode and character handling is almost completely sound. However, in "Avoid adding pretty-printing white space in character data," I think the case is a bit overstated. Pretty-printing is safe in most cases when the white space is added between elements, rather than within elements that contain character data. As Henri says, if you have the XML in Listing 1, it is usually not safe to render it as in Listing 2.
Listing 1. XML sample
Listing 2. XML sample with white space added to character data
But it is usually safe to pretty-print the XML in Listing 3, so that the output is as in Listing 4.
Listing 3. Another XML sample
Listing 4. XML sample in Listing 3 with white space added to character data
Many XML serializer tools understand this distinction between relatively safe and relatively unsafe pretty-printing. It is important to understand that the form of pretty-printing shown in Listings 3 and 4 can cause distortion if white space is added to mixed content. Such problems can be avoided if the serialization is guided by a schema. In practice, though, most vocabularies that use mixed content are not so sensitive to white space normalization, so don't worry too much about pretty-printing. You should be aware of the issues, and be sure there is an option to turn pretty-printing off (preferably, the default should be not to pretty-print). Henri recommends a pretty-printing practice as in Listing 5, but I disagree because I think it makes for ugly markup that's not friendly to manipulation by people.
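The distinction can be sketched with Python's standard ElementTree module, whose indent function (Python 3.9 and later) adds white space only where an element's text or tail is empty or already white space. The two sample documents are hypothetical and are not the listings referred to above. Without a schema, the serializer cannot tell that the second one is meant as mixed content.

```python
import xml.etree.ElementTree as ET


def normalized_text(elem):
    """Join all character data under elem and collapse runs of white space."""
    return " ".join("".join(elem.itertext()).split())


def pretty(xml_text):
    """Parse xml_text, indent the tree in place, and return the root element."""
    root = ET.fromstring(xml_text)
    ET.indent(root)
    return root


# Record-like content: the character data lives in leaf elements.
record = "<person><given>Ada</given><family>Lovelace</family></person>"
# Inline markup that a vocabulary might treat as mixed content.
mixed = "<p><b>foot</b><i>note</i></p>"

# The leaf values of the record are untouched by the added white space.
print([child.text for child in pretty(record)])

# The inline text gains a word break that the original did not have.
print(normalized_text(ET.fromstring(mixed)))
print(normalized_text(pretty(mixed)))
print(ET.tostring(pretty(mixed), encoding="unicode"))
```

A schema-guided serializer could skip indentation inside elements declared with mixed content, which is one reason an option to turn pretty-printing off remains valuable.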
Listing 5. Pretty-printing convention suggested by Henri Sivonen but not recommended by this author
From the monastery
Switching to a very different speed, the second resource I shall explore in this article is Simon St. Laurent's "Monastic XML" (see Resources). This is a collection of brief essays with advice on how to process and even think about XML for maximum effect. Simon uses the metaphors of monasticism and asceticism to suggest that it is dangerous to load XML too heavily with baggage that does not suit its simple, textual roots. In "Marking-up at the foundation," he discusses the fundamental roles of character data and markup (elements and attributes). In "Naming things and reading names," he explains why the generic identifier (also called the element type name) is an important concept and how it should be the sole primary key to the structure of the marked-up information. Realistically, if you're using XML namespaces, the primary key is the universal name (namespace URI plus local name), and this complication is one of the reasons Simon urges caution in "Namespaces as opportunity." "Accepting the discipline of trees" calls out one of XML's dirty secrets: Even though it seems that XML's hierarchical structure could be easily extended to graph structures, in practice, the modeling of graphs in XML has proven a bit difficult. But by far the most important lesson on the "Monastic XML" site is found in "Optimizing markup for processing is always premature." XML is a declarative technology, and therein lie its strengths, as well as its frustrations, for many developers. Developers who try to pull XML design too close to the details of processing generally end up making that processing more difficult in the long term. The key to success with XML is to focus on the nature of the information that needs to be represented in the abstract, separately from the technical design of the systems that need to process that information.
There is always bound to be some difference of opinion when considering XML best practices, especially in these early stages, but it is great to have a variety of voices on the topic. There are a few other sources for discussion of the topic, and I'll continue to cover them in this column. If you have sources for advice on best practices or want to share your own opinion, please join the discussion on the Thinking XML discussion forum.
Resources
- Read the online articles covered in this piece: "HOWTO Avoid Being Called a Bozo When Producing XML," by Henri Sivonen and "Monastic XML," by Simon St. Laurent.
- Learn more about the basic nature of XML, rather than just guessing. Start with "Design Principles for XML," which explains why XML is the way it is. See discussion and resources in "Once again: no excuses to ignore i18n in XML," by Uche Ogbuji, and be sure to bookmark "The Annotated XML Specification," which is the XML 1.0 specification with useful and liberal annotation by Tim Bray, one of the editors of the specification; just be aware that this covers the first edition of XML 1.0, which is now up to the third edition.
- Find out some of the hazards of producing XML through careless text printing in "Proper XML Output in Python." Despite the title, the discussion is useful for Python and non-Python programmers alike. More specific to Python, read a clarification of the details of Python and Unicode in "Confusion over Python storage form for Unicode." Both articles are by Uche Ogbuji.
- Learn more about XML canonicalization in "Introducing XML canonical form" (developerWorks, December 2004) by Uche Ogbuji.
- Find more XML resources on the developerWorks XML zone, where you'll find previous installments of the Thinking XML column, such as "Harold's Effective XML" (July 2004) and "Hacking XML Hacks" (September 2004). If you have comments on this article or any others in this column, please post them on the Thinking XML forum.
- For more on XML best practices, see the series "Principles of XML design" by Uche Ogbuji.
- Learn how you can become an IBM Certified Developer in XML and related technologies.
- Browse for books on these and other technical topics.
Get products and technologies
- Build your next development project with IBM trial software, available for download directly from developerWorks.
About the author
Uche Ogbuji is a consultant and co-founder of Fourthought, Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia, or contact him at email@example.com.