Well-Formed or Toast?
April 12, 2002
Ignoring for a moment the potential importance of validity to
data-oriented applications, you might wonder why even when an XML
document does not require a DTD (i.e., is standalone), it still
must be well-formed. In fact, if a document is not well-formed,
it cannot even be called an XML document.
The reason for insisting on well-formedness is to counteract the
"browser bloat" syndrome that occurred when the major
browser vendors decided they wanted their browsers to be able to
render the horribly inaccurate HTML developed by graduates (or
perhaps flunkies) of the Learn HTML in 2 Days or Less school.
Many Web pages contain completely invalid HTML, with improperly
nested elements, missing end tags, misspelled element names,
missing delimiters, and other aberrations. Browsers such as
Netscape Communicator and Internet Explorer do an admirable job
of recovering from these errors, but only at the expense of a
considerable amount of built-in recovery code.
Fortunately, with XML (and XHTML), parsers do not need to
implement recovery code and can therefore stay trim and
lightweight. If the parser encounters a well-formedness problem,
it should only report the problem to the calling application. It
explicitly must not attempt to correct what might be missing,
overlapping, or misspelled. Violations of well-formedness
constraints are considered fatal errors, according to the XML 1.0
Recommendation. The bottom line here is: either a document is
well-formed XML, or it's toast.
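To make the "report, don't repair" rule concrete, here is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree module as the parser. The helper name check_well_formed and the sample strings are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

def check_well_formed(text):
    """Return (True, None) if text is well-formed XML, else (False, message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # The parser reports the fatal error; it makes no attempt to repair the input.
        return False, str(err)
    return True, None

good = "<note><to>Ann</to></note>"
bad = "<note><to>Ann</note></to>"   # improperly nested end tags

print(check_well_formed(good))
print(check_well_formed(bad))
```

The calling application decides what to do with the reported error; the parser itself never guesses at what the author meant.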
The extra code necessary to do the HTML-like corrections might
not be a significant problem for a desktop PC with lots of memory.
It's more of an issue as XML is fed to handheld PCs and other
devices with limited memory and/or processing power.
Validating and Nonvalidating Parsers
The differences between validating and nonvalidating parsers are
not quite as clear as you might think. According to the XML 1.0
specification (http://www.w3.org/TR/REC-xml#proc-types),
"Validating processors must, at user option, report violations of
the constraints expressed by the declarations in the DTD, and
failures to fulfill the validity constraints given in this
specification. To accomplish this, validating XML processors
must read and process the entire DTD and all external parsed
entities referenced in the document. Non-validating processors
are required to check only the document entity, including the
entire internal DTD subset, for well-formedness."
In other words, validating parsers must read the entire DTD and
check the document against the structural constraints it describes.
You might conclude, therefore, that nonvalidating parsers do not
need to consult the DTD, but that turns out to be incorrect.
Even nonvalidating parsers need to supply default values for
attributes and to replace text based on internal entities
(discussed in chapter 4).
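The following sketch illustrates that point, assuming Python's xml.etree.ElementTree, which is built on the nonvalidating Expat parser. The memo document, the company entity, and the status attribute are invented for the example. No validation takes place, yet the internal DTD subset still affects what the application sees.

```python
import xml.etree.ElementTree as ET

# Hypothetical document whose internal DTD subset declares
# an internal entity and a default attribute value.
DOC = """<!DOCTYPE memo [
  <!ENTITY company "Example Corp">
  <!ATTLIST memo status CDATA "draft">
]>
<memo>From &company;</memo>"""

root = ET.fromstring(DOC)

memo_text = root.text             # entity reference replaced by its text
memo_status = root.get("status")  # default value supplied from the ATTLIST

print(memo_text)
print(memo_status)
```

The status attribute never appears in the memo start tag; the parser supplies it from the declaration.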
Although there used to be a class of strictly nonvalidating
parsers, they tend to be much less popular of late. Most modern
parsers (2000 and beyond) can be run in either validating or
nonvalidating mode. Why run in nonvalidating mode when a parser
is capable of validation? Because validation can significantly
impact performance, especially when long and complex DTDs are
involved. Some developers find that while enabling validation
during development and test phases is crucial, it's sometimes
beneficial to suppress validation in production systems where
document throughput is most valued and the reliability of the
data is already known. Consult the documentation of prospective
parsers to determine how to toggle this switch, and which is the
default mode. For example, the Apache Xerces parser is
nonvalidating by default.
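How the switch is exposed varies by parser. As one hedged illustration, SAX2 defines a standard validation feature flag, and the sketch below queries and tries to set it through Python's xml.sax module. It assumes the default reader bundled with Python, which is based on Expat and is nonvalidating only, so the attempt to enable validation is expected to be refused. A parser such as Xerces would be toggled through its own API.

```python
import xml.sax
from xml.sax import SAXNotRecognizedException, SAXNotSupportedException
from xml.sax.handler import feature_validation

parser = xml.sax.make_parser()

# Ask the parser which mode it is currently in.
validating = bool(parser.getFeature(feature_validation))

# Try to switch validation on; a nonvalidating-only parser refuses.
try:
    parser.setFeature(feature_validation, True)
    can_validate = True
except (SAXNotSupportedException, SAXNotRecognizedException):
    can_validate = False

print("validating by default:", validating)
print("validation can be enabled:", can_validate)
```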
Some of the more highly regarded XML parsers include:
- Apache XML Project's Xerces
- IBM's XML Parser for Java (xml4j)
- JavaSoft's XML Parser
- MSXML 4.0 Release: Microsoft XML Core Services component (aka MSXML Parser) and SDK
- Oracle's XML Parser
- ElCel Technology's XML Validator
URLs for these parsers and many more can be found on the XML
Parsers/Processors list at XMLSoftware.com,
http://www.xmlsoftware.com/parsers/.
Event-Based vs. Tree-Based Parsing
We will cover tree-based and event-based parsing in some depth
when we cover SAX and DOM in chapters 7 and 8, respectively. For
now, an overview should be sufficient.
Event-based Parsing
Event-based parsers (SAX) provide a data-centric view of XML.
When an element is encountered, the idea is to process it and
then forget about it. The event-based parser reports each element,
its list of attributes, and its content to the application as it
reads them. This is more efficient
for many types of applications, especially searches. It requires
less code and less memory since there is no need to build a large
tree in memory as you are scanning for a particular element,
attribute, and/or content sequence in an XML document.
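A search of this kind might look like the following sketch, which uses Python's xml.sax module. The TitleFinder class and the catalog data are hypothetical. The handler keeps only the text it is looking for and discards everything else as the events go by.

```python
import xml.sax

class TitleFinder(xml.sax.ContentHandler):
    """Collect the text of every <title> element, keeping nothing else."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None   # None means "not inside a <title>"

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None

CATALOG = b"""<catalog>
  <book id="b1"><title>XML Basics</title><price>20</price></book>
  <book id="b2"><title>SAX and DOM</title><price>25</price></book>
</catalog>"""

finder = TitleFinder()
xml.sax.parseString(CATALOG, finder)
print(finder.titles)
```

No tree is ever built; memory use depends on what the handler chooses to keep, not on the size of the document.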
Tree-Based Parsing
On the other hand, tree-based parsers (DOM) provide a
document-centric view of XML. In tree-based parsing, an in-memory
tree is created for the entire document, which is extremely
memory-intensive for large documents. All elements and attributes
are available at once, but not until the entire document has been
parsed. This technique is useful if you need to navigate around
the document and perhaps change various document chunks, which is
precisely why it is useful for the Document Object Model (DOM),
the aim of which is to manipulate documents via scripting
languages or Java.
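For contrast, here is a minimal tree-based sketch using Python's xml.dom.minidom, one DOM implementation among many. The catalog data and the edits made to it are invented for the example. The whole document is parsed into memory first, after which any node can be visited or changed in any order.

```python
from xml.dom import minidom

# The entire document is parsed into an in-memory tree up front.
doc = minidom.parseString(
    "<catalog><book id='b1'><title>XML Basics</title></book>"
    "<book id='b2'><title>SAX and DOM</title></book></catalog>"
)

# Navigate to the second book first, then go back to the first.
books = doc.getElementsByTagName("book")
second_title = books[1].getElementsByTagName("title")[0]

# Change document chunks in place.
second_title.firstChild.data = "SAX and DOM, 2nd Edition"
books[0].setAttribute("status", "reviewed")

# Serialize the modified tree back to XML text.
result = doc.documentElement.toxml()
print(result)

doc.unlink()   # release the tree when finished
```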
David Megginson, the main force behind SAX (Simple API for XML),
contrasts these two approaches in "Events vs. Trees" on the SAX site
(http://www.saxproject.org/?selected=event). The W3C presents its
viewpoint in an item from the DOM FAQ, "What is the relationship
between the DOM and SAX?" (http://www.w3.org/DOM/faq#SAXandDOM).
XML Family of Specifications: A Practical Guide