Displaying Special Characters and CDATA
The next step is to customize the parser a bit so that you can see how to get information it usually ignores. In this section, you'll learn how the parser handles special characters and text with XML-style syntax (CDATA sections).
Handling Special Characters
In XML, an entity is an XML structure (or plain text) that has a name. Referencing the entity by name causes it to be inserted into the document in place of the entity reference. To create an entity reference, you surround the entity name with an ampersand and a semicolon:
Earlier, you put an entity reference into your XML document by coding
Note: The file containing this XML is slideSample03.xml, as described in Using an Entity Reference in an XML Document. The results of processing it are shown in Echo07-03.txt. (The browsable versions are slideSample03-xml.html and Echo07-03.html.)
When you run the Echo program on slideSample03.xml, you see the following output:
The parser has converted the reference into the entity it represents and has passed the entity to the application.
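The same behavior can be observed with a minimal SAX sketch. It is not the tutorial's Echo program: the class name EntityEchoSketch, the parseText helper, and the one-element sample document are all hypothetical, chosen only to show that the characters() callback receives the replacement text rather than the entity reference.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; it is not part of the tutorial's Echo program.
public class EntityEchoSketch {

    // Returns all character data the parser reports for the given document.
    static String parseText(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver text in several chunks, so append each one.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Assumed sample input, not the contents of slideSample03.xml.
        System.out.println(parseText("<item>a &lt; b &amp; c</item>"));
        // prints: a < b & c
    }
}
```

The handler never sees the text "&lt;" or "&amp;"; by the time characters() is called, the references have already been replaced.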
Handling Text with XML-Style Syntax
When you are handling large blocks of XML or HTML that include many special characters, you can use a CDATA section.
Note: The XML file used in this example is slideSample04.xml. The results of processing it are shown in Echo07-04.txt. (The browsable versions are slideSample04-xml.html and Echo07-04.html.)
A CDATA section works like <pre>...</pre> in HTML, only more so: all whitespace in a CDATA section is significant, and characters in it are not interpreted as XML. A CDATA section starts with <![CDATA[ and ends with ]]>. The file slideSample04.xml contains this CDATA section for a fictitious technical slide:

    ...
    <slide type="tech">
      <title>How it Works</title>
      <item>First we fozzle the frobmorten</item>
      <item>Then we framboze the staten</item>
      <item>Finally, we frenzle the fuznaten</item>
      <item><![CDATA[Diagram:
          frobmorten <--------------- fuznaten
             |            <3>             ^
             | <1>                        |   <1> = fozzle
             V                            |   <2> = framboze
           staten-------------------------+   <3> = frenzle
                          <2>
      ]]></item>
    </slide>
    </slideshow>

When you run the Echo program on the new file, you see the following output:

    ELEMENT: <item>
    CHARS:   Diagram:
    frobmorten <--------------- fuznaten
       |            <3>             ^
       | <1>                        |   <1> = fozzle
       V                            |   <2> = framboze
     staten-------------------------+   <3> = frenzle
                    <2>
    END_ELM: </item>

You can see here that the text in the CDATA section arrived as it was written. Because the parser didn't treat the angle brackets as XML, they didn't generate the fatal errors they would otherwise cause. (If the angle brackets weren't in a CDATA section, the document would not be well formed.)
Handling CDATA and Other Characters
The existence of CDATA makes the proper echoing of XML a bit tricky. If the text to be output is not in a CDATA section, then any angle brackets, ampersands, and other special characters in the text should be replaced with the appropriate entity reference. (Replacing left angle brackets and ampersands is most important; other characters will be interpreted properly without misleading the parser.)
But if the output text is in a CDATA section, then the substitutions should not occur, resulting in text like that in the earlier example. In a simple program such as our Echo application, it's not a big deal. But many XML-filtering applications will want to keep track of whether the text appears in a CDATA section, so that they can treat special characters properly. (Later, you will see how to use a LexicalHandler to find out whether or not you are processing a CDATA section.)
One other area to watch for is attributes. The text of an attribute value can also contain angle brackets and ampersands that need to be replaced by entity references, as does the quotation mark that delimits the value. (Attribute text can never be in a CDATA section, though, so there is never any question about doing that substitution.)
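The substitution rules described above can be sketched in Java as follows. This is an illustration under stated assumptions, not the tutorial's Echo application: the class CdataAwareEcho and its methods escapeText, escapeAttribute, and echo are hypothetical names. The sketch uses the standard org.xml.sax.ext.DefaultHandler2 class, which implements LexicalHandler, and registers it through the standard SAX lexical-handler property so that startCDATA() and endCDATA() are reported.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical sketch; names are illustrative, not from the tutorial.
public class CdataAwareEcho {

    // Escapes characters that would mislead a parser in ordinary text.
    // The ampersand must be replaced first so later output is not re-escaped.
    static String escapeText(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Attribute values are written in double quotes here, so quotes are escaped too.
    static String escapeAttribute(String s) {
        return escapeText(s).replace("\"", "&quot;");
    }

    // Re-serializes a document, escaping text everywhere except inside CDATA.
    static String echo(String xml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler2 handler = new DefaultHandler2() {
            private boolean inCdata = false;

            @Override
            public void startCDATA() { inCdata = true; out.append("<![CDATA["); }

            @Override
            public void endCDATA() { inCdata = false; out.append("]]>"); }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                out.append('<').append(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    // Attribute text is never CDATA, so it is always escaped.
                    out.append(' ').append(atts.getQName(i)).append("=\"")
                       .append(escapeAttribute(atts.getValue(i))).append('"');
                }
                out.append('>');
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                out.append("</").append(qName).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                String chunk = new String(ch, start, length);
                out.append(inCdata ? chunk : escapeText(chunk));
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample input mixing an attribute, escaped text, and a CDATA section.
        System.out.println(echo(
                "<item a=\"x &lt; y\">1 &lt; 2 <![CDATA[<b> & </b>]]></item>"));
    }
}
```

Tracking the inCdata flag is what lets the echo reproduce "&lt;" in ordinary text while leaving the angle brackets and ampersand inside the CDATA section untouched. The sketch ignores comments, processing instructions, and the XML declaration.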