Parsing with a DTD
After the XML declaration, the document prolog can include a DTD, reference an external DTD, or both. In this section, you'll see the effect of the DTD on the data that the parser delivers to your application.
DTD's Effect on the Nonvalidating Parser
In this section, you'll use the Echo program to see how the data appears to the SAX parser when the data file references a DTD.
Note: The XML file used in this section is slideSample05.xml, which references slideshow1a.dtd. The output is shown in Echo07-05.txt. (The browsable versions are slideshow1a-dtd.html, slideSample05-xml.html, and Echo07-05.html.)
Running the Echo program on your latest version of slideSample.xml shows that many of the superfluous calls to the characters method have now disappeared.

Before, you saw this:

    ...
    >
    PROCESS: ...
    CHARS:
    ELEMENT: <slide
      ATTR: ...
    >
    ELEMENT: <title>
    CHARS: Wake up to ...
    END_ELM: </title>
    END_ELM: </slide>
    CHARS:
    ELEMENT: <slide
      ATTR: ...
    >
    ...

Now you see this:

    ...
    >
    PROCESS: ...
    ELEMENT: <slide
      ATTR: ...
    >
    ELEMENT: <title>
    CHARS: Wake up to ...
    END_ELM: </title>
    END_ELM: </slide>
    ELEMENT: <slide
      ATTR: ...
    >
    ...

It is evident that the whitespace characters that were formerly being echoed around the slide elements are no longer being delivered by the parser, because the DTD declares that slideshow consists solely of slide elements.

Tracking Ignorable Whitespace
Now that the DTD is present, the parser is no longer calling the characters method with whitespace that it knows to be irrelevant. From the standpoint of an application that is interested in processing only the XML data, that is great. The application is never bothered with whitespace that exists purely to make the XML file readable.

On the other hand, if you were writing an application that filtered an XML data file and you wanted to output an equally readable version of the file, then that whitespace would no longer be irrelevant: it would be essential. To get those characters, you add the ignorableWhitespace method to your application. You'll do that next.
Note: The code written in this section is contained in Echo08.java. The output is in Echo08-05.txt. (The browsable version is Echo08-05.html.)
To process the (generally) ignorable whitespace that the parser is seeing, add the following code to implement the ignorableWhitespace event handler in your version of the Echo program. The new method goes between the existing characters and processingInstruction methods:

    public void characters (char buf[], int offset, int len)
    ...
    }

    public void ignorableWhitespace (char buf[], int offset, int len)
    throws SAXException
    {
        nl();
        emit("IGNORABLE");
    }

    public void processingInstruction(String target, String data)
    ...

This code simply generates a message to let you know that ignorable whitespace was seen.
Note: Again, not all parsers are created equal. The SAX specification does not require that this method be invoked. The Java XML implementation does so whenever the DTD makes it possible.
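The effect described above can be reproduced outside the Echo program with a small self-contained sketch. It parses the same in-memory document twice, once without a DTD and once with an internal DTD subset, and counts whitespace-only calls to characters against calls to ignorableWhitespace. The class name WhitespaceCounts, the sample document, and the slide and title declarations are assumptions made for illustration; only the rule that slideshow consists solely of slide elements comes from the text. As the note says, whether a nonvalidating parser reports ignorable whitespace is implementation-dependent, so the counts are printed rather than promised.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the Echo program.
class WhitespaceCounts extends DefaultHandler {
    int whitespaceOnlyChars = 0;                     // characters() calls holding only whitespace
    int ignorable = 0;                               // ignorableWhitespace() calls
    final StringBuilder text = new StringBuilder();  // everything else

    // Illustrative document body (an assumption, not slideSample05.xml).
    static final String BODY =
          "<slideshow>\n"
        + "  <slide><title>Wake up!</title></slide>\n"
        + "  <slide><title>Overview</title></slide>\n"
        + "</slideshow>";

    // Internal DTD subset: slideshow contains only slide elements.
    static final String DTD =
          "<!DOCTYPE slideshow [\n"
        + "  <!ELEMENT slideshow (slide+)>\n"
        + "  <!ELEMENT slide (title)>\n"
        + "  <!ELEMENT title (#PCDATA)>\n"
        + "]>\n";

    @Override
    public void characters(char[] buf, int offset, int len) {
        String s = new String(buf, offset, len);
        if (s.trim().isEmpty()) {
            whitespaceOnlyChars++;
        } else {
            text.append(s);
        }
    }

    @Override
    public void ignorableWhitespace(char[] buf, int offset, int len) {
        ignorable++;
    }

    // Parses the given XML with a nonvalidating parser and returns the counts.
    static WhitespaceCounts parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(false);
            SAXParser parser = factory.newSAXParser();
            WhitespaceCounts handler = new WhitespaceCounts();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        WhitespaceCounts noDtd = parse(BODY);
        WhitespaceCounts withDtd = parse(DTD + BODY);
        System.out.println("no DTD:   CHARS(whitespace)=" + noDtd.whitespaceOnlyChars
            + " IGNORABLE=" + noDtd.ignorable);
        System.out.println("with DTD: CHARS(whitespace)=" + withDtd.whitespaceOnlyChars
            + " IGNORABLE=" + withDtd.ignorable);
    }
}
```

With a parser that behaves as described in this section, the whitespace between slide elements should move from the first counter to the second once the DTD is present, while the title text is delivered through characters in both runs.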
When you run the Echo application now, your output looks like this:

    ELEMENT: <slideshow
      ATTR: ...
    >
    IGNORABLE
    IGNORABLE
    PROCESS: ...
    IGNORABLE
    IGNORABLE
    ELEMENT: <slide
      ATTR: ...
    >
    IGNORABLE
    ELEMENT: <title>
    CHARS: Wake up to ...
    END_ELM: </title>
    IGNORABLE
    END_ELM: </slide>
    IGNORABLE
    IGNORABLE
    ELEMENT: <slide
      ATTR: ...
    >
    ...

Here, it is apparent that ignorableWhitespace is being invoked before and after comments and slide elements, whereas characters was being invoked before there was a DTD.

Cleanup
Now that you have seen ignorable whitespace echoed, remove that code from your version of the Echo program. You won't need it any more in the exercises that follow.
Note: That change has been made in Echo09.java.
Empty Elements, Revisited
Now that you understand how certain instances of whitespace can be ignorable, it is time to revise the definition of an empty element. That definition can now be expanded to include an element whose start and end tags are separated only by whitespace, provided the DTD says that the whitespace is ignorable.
Echoing Entity References
When you wrote slideSample06.xml, you defined entities for the singular and plural versions of the product name in the DTD, and you referenced them in the XML. Now it's time to see how they're echoed when you process them with the SAX parser.
Note: The XML used here is contained in slideSample06.xml, which references slideshow1b.dtd, as described in Defining Attributes and Entities in the DTD. The output is shown in Echo09-06.txt. (The browsable versions are slideSample06-xml.html, slideshow1b-dtd.html, and Echo09-06.html.)
When you run the Echo program on slideSample06.xml, the output shows that the product name has been substituted for the entity reference. The application never sees the reference itself, only the replacement text.
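The substitution can be checked in isolation with a minimal sketch. It declares two internal entities in an inline DTD subset and collects whatever the parser passes to characters. The class name EntityEchoDemo, the entity names product and products, and the replacement text "ExampleWidget" are placeholders chosen for this example; they are not taken from slideSample06.xml or slideshow1b.dtd.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the Echo program.
class EntityEchoDemo {
    // Placeholder entity names and values, declared in an internal DTD subset.
    static final String XML =
          "<!DOCTYPE slide [\n"
        + "  <!ENTITY product  \"ExampleWidget\">\n"
        + "  <!ENTITY products \"ExampleWidgets\">\n"
        + "]>\n"
        + "<slide><title>Wake up to &products;!</title></slide>";

    // Returns all character data the parser delivers for the document.
    static String echoText(String xml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] buf, int offset, int len) {
                // The parser may deliver the text in several chunks, so accumulate.
                out.append(buf, offset, len);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("CHARS: " + echoText(XML));
        // prints "CHARS: Wake up to ExampleWidgets!"
    }
}
```

The text is accumulated in a StringBuilder because a parser is free to split the content around the entity boundary into more than one characters call.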
Echoing the External Entity
In slideSample07.xml, you defined an external entity to reference a copyright file.
Note: The XML used here is contained in slideSample07.xml and in copyright.xml. The output is shown in Echo09-07.txt. (The browsable versions are slideSample07-xml.html, copyright-xml.html, and Echo09-07.html.)
When you run the Echo program on that version of the slide presentation, here is what you see:

    ...
    END_ELM: </slide>
    ELEMENT: <slide
      ATTR: type "all"
    >
    ELEMENT: <item>
    CHARS:
    This is the standard copyright message that our lawyers make us put everywhere so we don't have to shell out a million bucks every time someone spills hot coffee in their lap...
    END_ELM: </item>
    END_ELM: </slide>
    ...

Note that the newline that follows the comment in the file is echoed as a character, but the comment itself is ignored. That is why the copyright message appears to start on the next line after the CHARS: label instead of immediately after the label: the first character echoed is actually the newline that follows the comment.

Summarizing Entities
An entity that is referenced in the document content, whether internal or external, is termed a general entity. An entity that contains DTD specifications that are referenced from within the DTD is termed a parameter entity. (More on that later.)
An entity that contains XML (text and markup), and is therefore parsed, is known as a parsed entity. An entity that contains binary data (such as images) is known as an unparsed entity. (By its nature, it must be external.) We'll discuss references to unparsed entities later, in Using the DTDHandler and EntityResolver.
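To make the external parsed general entity concrete, here is a minimal sketch that mimics the copyright example without touching the file system. The handler overrides resolveEntity to supply the entity's content from a string, so the system identifier copyright.xml is never actually opened. The class name ExternalEntityDemo, the slide markup, and the placeholder copyright text are assumptions for illustration and are not the contents of slideSample07.xml or copyright.xml. The sketch also assumes the parser is configured to load external general entities, which is the usual default.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the Echo program.
class ExternalEntityDemo {
    // The document declares an external general entity and references it.
    static final String SLIDE =
          "<!DOCTYPE slide [\n"
        + "  <!ENTITY copyright SYSTEM \"copyright.xml\">\n"
        + "]>\n"
        + "<slide type=\"all\"><item>&copyright;</item></slide>";

    // Stand-in for the entity file: a comment, a newline, then the text.
    static final String COPYRIGHT =
          "<!-- A sample comment at the top of the entity file -->\n"
        + "This is a placeholder copyright message.";

    // Returns the character data delivered for the whole document.
    static String echoText() {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Serve the entity from memory instead of reading a file.
                if (systemId != null && systemId.endsWith("copyright.xml")) {
                    return new InputSource(new StringReader(COPYRIGHT));
                }
                return null; // fall back to the parser's default resolution
            }

            @Override
            public void characters(char[] buf, int offset, int len) {
                out.append(buf, offset, len);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(SLIDE)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Show the newline explicitly so the leading character is visible.
        System.out.println("CHARS: " + echoText().replace("\n", "\\n"));
    }
}
```

The collected text should begin with the newline that follows the comment, and the comment itself should not appear, which matches the behavior described for the copyright slide.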