Displaying Special Characters and CDATA
Next, we will customize the parser a bit so that you can see how to get information it usually ignores. In this section, you'll learn how the parser handles special characters and text with XML-style syntax.
Handling Special Characters
In XML, an entity is an XML structure (or plain text) that has a name. Referencing the entity by name causes it to be inserted into the document in place of the entity reference. To create an entity reference, you surround the entity name with an ampersand and a semicolon, in the form &entityName;.
Earlier, you put an entity reference into your XML document.
Note: The file containing this XML is slideSample03.xml, as described in Using an Entity Reference in an XML Document. The results of processing it are shown in Echo07-03.txt. (The browsable versions are slideSample03-xml.html and Echo07-03.html.)
When you run the Echo program on slideSample03.xml, the output shows that the parser has converted the reference into the entity it represents and has passed the entity to the application.
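The conversion described above can be reproduced in isolation. The following is a minimal sketch, not the tutorial's Echo program: the class name EntityEchoSketch, the helper collectText, and the sample XML string are all hypothetical and are not taken from slideSample03.xml. It assumes the JDK's default SAX parser and simply gathers whatever text the parser passes to characters().

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EntityEchoSketch {
    /** Parses the XML and returns all character data the parser reports. */
    static String collectText(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver text in several chunks, so append each one.
                text.append(ch, start, length);
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input (an assumption, not the tutorial's file).
        String xml = "<item>a &lt; b &amp; c</item>";
        System.out.println(collectText(xml)); // prints: a < b & c
    }
}
```

The application never sees the text &lt; or &amp;. By the time characters() is called, each reference has already been replaced by the character it stands for.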
Handling Text with XML-Style Syntax
When you are handling large blocks of XML or HTML that include many special characters, you can use a CDATA section.
Note: The XML file used in this example is slideSample04.xml. The results of processing it are shown in Echo07-04.txt. (The browsable versions are slideSample04-xml.html and Echo07-04.html.)
A CDATA section works like <pre>...</pre> in HTML, only more so: all whitespace in a CDATA section is significant, and characters in it are not interpreted as XML. A CDATA section starts with <![CDATA[ and ends with ]]>. The file slideSample04.xml contains this CDATA section for a fictitious technical slide:
...
<slide type="tech">
  <title>How it Works</title>
  <item>First we fozzle the frobmorten</item>
  <item>Then we framboze the staten</item>
  <item>Finally, we frenzle the fuznaten</item>
  <item><![CDATA[Diagram:
    frobmorten <--------------- fuznaten
        |            <3>             ^
        | <1>                        |   <1> = fozzle
        V                            |   <2> = framboze
      staten-------------------------+   <3> = frenzle
                     <2>
]]></item>
</slide>
</slideshow>
When you run the Echo program on the new file, you see the following output:
ELEMENT: <item>
CHARS: Diagram:
    frobmorten <--------------- fuznaten
        |            <3>             ^
        | <1>                        |   <1> = fozzle
        V                            |   <2> = framboze
      staten-------------------------+   <3> = frenzle
                     <2>
END_ELM: </item>
You can see here that the text in the CDATA section arrived as it was written. Because the parser didn't treat the angle brackets as XML, they didn't generate the fatal errors they would otherwise cause. (If the angle brackets weren't in a CDATA section, the document would not be well formed.)
Handling CDATA and Other Characters
The existence of CDATA makes the proper echoing of XML a bit tricky. If the text to be output is not in a CDATA section, then any angle brackets, ampersands, and other special characters in the text should be replaced with the appropriate entity reference. (Replacing left angle brackets and ampersands is most important; other characters will be interpreted properly without misleading the parser.)
But if the output text is in a CDATA section, then the substitutions should not occur, resulting in text like that in the earlier example. In a simple program such as our Echo application, it's not a big deal. But many XML-filtering applications will want to keep track of whether the text appears in a CDATA section, so that they can treat special characters properly. (Later, you will see how to use a LexicalHandler to find out whether or not you are processing a CDATA section.)
One other area to watch for is attributes. The text of an attribute value can also contain angle brackets and ampersands that need to be replaced by entity references. (Attribute text can never be in a CDATA section, though, so there is never any question about doing that substitution.)
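The rules in this section can be sketched in code. The example below is an illustration under stated assumptions, not the tutorial's Echo program: the class CdataAwareEcho and the helpers escape and echo are hypothetical names, and the sample input in main is invented. It assumes the JDK's default SAX parser, which accepts the standard SAX2 lexical-handler property, and it uses DefaultHandler2 (which implements LexicalHandler) to note when a CDATA section starts and ends. Text outside CDATA and all attribute values are escaped; text inside CDATA is written unchanged.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

public class CdataAwareEcho {
    /** Replaces the characters that would otherwise be read as markup. */
    static String escape(String s, boolean inAttribute) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"':
                    // Only a problem inside a double-quoted attribute value.
                    out.append(inAttribute ? "&quot;" : "\"");
                    break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Re-serializes the XML, escaping text only when it is outside CDATA. */
    static String echo(String xml) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler2 handler = new DefaultHandler2() {
            private boolean inCdata = false;

            @Override public void startCDATA() { inCdata = true; out.append("<![CDATA["); }
            @Override public void endCDATA() { inCdata = false; out.append("]]>"); }

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    // Attribute values are always escaped; they cannot be in CDATA.
                    out.append(' ').append(atts.getQName(i)).append("=\"")
                       .append(escape(atts.getValue(i), true)).append('"');
                }
                out.append('>');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(qName).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                String text = new String(ch, start, length);
                out.append(inCdata ? text : escape(text, false));
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Lexical events are optional in SAX; this is the standard SAX2 property for them.
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input mixing an attribute, escaped text, and a CDATA section.
        System.out.println(echo("<item note=\"a &lt; b\">x &amp; y <![CDATA[<1> = fozzle]]></item>"));
    }
}
```

Without the inCdata flag, the handler would have to choose between escaping everything (which corrupts the CDATA text by writing &lt;1&gt; inside it) and escaping nothing (which produces output that is not well formed wherever an ampersand or left angle bracket appears in ordinary text).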