Validating with XML Schema - The J2EE 1.4 Tutorial

Validating with XML Schema

You're now ready to take a deeper look at the process of XML Schema validation. Although a full treatment of XML Schema is beyond the scope of this tutorial, this section shows you the steps you take to validate an XML document using an XML Schema definition. (To learn more about XML Schema, you can review the online tutorial, XML Schema Part 0: Primer, at http://www.w3.org/TR/xmlschema-0/. You can also examine the sample programs that are part of the JAXP download. They use a simple XML Schema definition to validate personnel data stored in an XML file.)

At the end of this section, you'll also learn how to use an XML Schema definition to validate a document that contains elements from multiple namespaces.

Overview of the Validation Process

To be notified of validation errors in an XML document, the following must be true:

The factory must be configured, and the appropriate error handler must be set.

The document must be associated with at least one schema, and possibly more.

Configuring the DocumentBuilder Factory

It's helpful to start by defining the constants you'll use when configuring the factory. (These are the same constants you define when using XML Schema for SAX parsing.)
static final String JAXP_SCHEMA_LANGUAGE =
    "http://java.sun.com/xml/jaxp/properties/schemaLanguage";

static final String W3C_XML_SCHEMA =
    "http://www.w3.org/2001/XMLSchema"; 
Next, you configure DocumentBuilderFactory to generate a namespace-aware, validating parser that uses XML Schema:
...
DocumentBuilderFactory factory =
    DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
try {
  factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
}
catch (IllegalArgumentException x) {
  // Happens if the parser does not support JAXP 1.2
  ...
}
Because JAXP-compliant parsers are not namespace-aware by default, you must call setNamespaceAware(true) for schema validation to work. You also set a factory attribute to specify the schema language to use. (For SAX parsing, on the other hand, you set a property on the parser generated by the factory.)
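The overview also calls for an error handler, which the configuration code does not show. The following is a minimal sketch of one way to supply it, not the tutorial's own sample: the class names ValidatingFactorySketch and CollectingErrorHandler are hypothetical, and the choice to collect messages in a list (instead of printing them or stopping at the first one) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical helper class; the name is an assumption, not part of JAXP.
final class ValidatingFactorySketch {
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String W3C_XML_SCHEMA =
        "http://www.w3.org/2001/XMLSchema";

    // Hypothetical handler that records every problem the parser reports.
    static final class CollectingErrorHandler implements ErrorHandler {
        final List<String> messages = new ArrayList<>();

        @Override public void warning(SAXParseException e) { record("warning", e); }
        @Override public void error(SAXParseException e) { record("error", e); }
        @Override public void fatalError(SAXParseException e) throws SAXParseException {
            record("fatal", e);
            throw e; // a fatal error is rethrown so that parsing stops
        }

        private void record(String level, SAXParseException e) {
            messages.add(level + " at line " + e.getLineNumber() + ": " + e.getMessage());
        }
    }

    // Namespace-aware, validating factory that uses XML Schema.
    // An IllegalArgumentException from setAttribute is left to propagate.
    static DocumentBuilderFactory newFactory() {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true);
        factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
        return factory;
    }

    // Creates a builder from the factory and installs the given handler.
    static DocumentBuilder newBuilder(ErrorHandler handler) {
        try {
            DocumentBuilder builder = newFactory().newDocumentBuilder();
            builder.setErrorHandler(handler);
            return builder;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

After a parse, the application can inspect the handler's messages list to decide whether the document passed validation.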

Associating a Document with a Schema

Now that the program is ready to validate with an XML Schema definition, it is necessary only to ensure that the XML document is associated with (at least) one. There are two ways to do that:

With a schema declaration in the XML document

By specifying the schema(s) to use in the application

Note: When the application specifies the schema(s) to use, it overrides any schema declarations in the document.

To specify the schema definition in the document, you create XML like this:
<documentRoot
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation='YourSchemaDefinition.xsd'
>
  ... 
The first attribute defines the XML namespace (xmlns) prefix, xsi, which stands for "XML Schema instance." The second attribute specifies the schema to use for elements in the document that do not have a namespace prefix--that is, for the elements you typically define in any simple, uncomplicated XML document. (You'll see how to deal with multiple namespaces in the next section.)
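To see a declaration like this at work, here is a rough sketch that writes a document and its schema side by side in a temporary directory and then parses the document. Everything specific in it is an assumption made for illustration: the class name InDocumentSchemaSketch, the schema contents (a documentRoot holding one integer item), and the use of temporary files so that the relative schema location can be resolved against the document's location.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class; the schema and sample data are made up.
final class InDocumentSchemaSketch {
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String W3C_XML_SCHEMA =
        "http://www.w3.org/2001/XMLSchema";

    // Assumed schema: documentRoot contains exactly one integer <item>.
    static final String SCHEMA =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='documentRoot'><xs:complexType><xs:sequence>"
      + "<xs:element name='item' type='xs:int'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Wraps itemText in a document that names its own schema, then validates it. */
    static List<String> validateItem(String itemText) {
        String xml =
            "<documentRoot"
          + " xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'"
          + " xsi:noNamespaceSchemaLocation='YourSchemaDefinition.xsd'>"
          + "<item>" + itemText + "</item></documentRoot>";
        List<String> problems = new ArrayList<>();
        try {
            // Put the schema next to the document so the relative location resolves.
            Path dir = Files.createTempDirectory("schema-sketch");
            Files.write(dir.resolve("YourSchemaDefinition.xsd"),
                        SCHEMA.getBytes(StandardCharsets.UTF_8));
            Path document = dir.resolve("document.xml");
            Files.write(document, xml.getBytes(StandardCharsets.UTF_8));

            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(true);
            factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);

            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { problems.add(e.getMessage()); }
                public void error(SAXParseException e) { problems.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            // Parsing from a File gives the parser a base location for the schema.
            builder.parse(document.toFile());
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return problems;
    }
}
```

Calling validateItem("42") is expected to report no problems, while a non-numeric value such as validateItem("forty-two") should produce at least one validation message.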

You can also specify the schema file in the application:
static final String schemaSource = "YourSchemaDefinition.xsd";
static final String JAXP_SCHEMA_SOURCE =
    "http://java.sun.com/xml/jaxp/properties/schemaSource";
...
DocumentBuilderFactory factory =
    DocumentBuilderFactory.newInstance();
...
factory.setAttribute(JAXP_SCHEMA_SOURCE,
    new File(schemaSource)); 
Here, too, there are mechanisms at your disposal that will let you specify multiple schemas. We'll take a look at those next.
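Before moving on, it may help to see the application-specified approach run from start to finish. The following is a minimal sketch under stated assumptions: the class name SingleSchemaSketch and the tiny note schema are invented for illustration, and the schema is supplied as an in-memory InputStream rather than a File so that the example needs no files on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class; names and sample data are assumptions.
final class SingleSchemaSketch {
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String W3C_XML_SCHEMA =
        "http://www.w3.org/2001/XMLSchema";
    static final String JAXP_SCHEMA_SOURCE =
        "http://java.sun.com/xml/jaxp/properties/schemaSource";

    // Made-up schema: a <note> element holding exactly one <body> string.
    static final String SCHEMA =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='note'><xs:complexType><xs:sequence>"
      + "<xs:element name='body' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Parses xml against SCHEMA and returns the validation messages. */
    static List<String> validate(String xml) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(true);
            // Set the schema language before the schema source.
            factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
            // The schema source is an InputStream here instead of a File.
            factory.setAttribute(JAXP_SCHEMA_SOURCE, stream(SCHEMA));

            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { problems.add(e.getMessage()); }
                public void error(SAXParseException e) { problems.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(stream(xml));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return problems;
    }

    private static ByteArrayInputStream stream(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }
}
```

With this sketch, a document such as <note><body>hi</body></note> should validate cleanly, whereas <note><wrong/></note> should yield at least one message, because the schema does not declare a wrong element inside note.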

Validating with Multiple Namespaces

Namespaces let you combine elements that serve different purposes in the same document without having to worry about overlapping names.

Note: The material discussed in this section also applies to validating when using the SAX parser. You're seeing it here, because at this point you've learned enough about namespaces for the discussion to make sense.

To contrive an example, consider an XML data set that keeps track of personnel data. The data set may include information from the W2 tax form as well as information from the employee's hiring form, with both elements named <form> in their respective schemas.

If a prefix is defined for the tax namespace, and another prefix defined for the hiring namespace, then the personnel data could include segments like this:
<employee id="...">
  <name>....</name>
  <tax:form>
    ...w2 tax form data...
  </tax:form>
  <hiring:form>
    ...employment history, etc....
  </hiring:form>
</employee> 
The contents of the tax:form element would obviously be different from the contents of the hiring:form and would have to be validated differently.

Note, too, that in this example there is a default namespace that the unqualified element names employee and name belong to. For the document to be properly validated, the schema for that namespace must be declared, as well as the schemas for the tax and hiring namespaces.

Note: The "default" namespace is actually a specific namespace. It is defined as the "namespace that has no name." So you can't simply use one namespace as your default this week, and another namespace as the default later. This "unnamed namespace" (or "null namespace") is like the number zero. It doesn't have any value to speak of (no name), but it is still precisely defined. So a namespace that does have a name can never be used as the default namespace.

When parsed, each element in the data set will be validated against the appropriate schema, as long as those schemas have been declared. Again, the schemas can be declared either as part of the XML data set or in the program. (It is also possible to mix the declarations. In general, though, it is a good idea to keep all the declarations together in one place.)

Declaring the Schemas in the XML Data Set

To declare the schemas to use for the preceding example in the data set, the XML code would look something like this:
<documentRoot
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="employeeDatabase.xsd"
  xsi:schemaLocation=
    "http://www.irs.gov/ fullpath/w2TaxForm.xsd
     http://www.ourcompany.com/ relpath/hiringForm.xsd"
  xmlns:tax="http://www.irs.gov/"
  xmlns:hiring="http://www.ourcompany.com/"
>
  ... 
The noNamespaceSchemaLocation declaration is something you've seen before, as are the last two entries, which define the namespace prefixes tax and hiring. What's new is the entry in the middle, which defines the locations of the schemas to use for each namespace referenced in the document.

The xsi:schemaLocation declaration consists of entry pairs, where the first entry in each pair is a fully qualified URI that specifies the namespace, and the second entry contains a full path or a relative path to the schema definition. (In general, fully qualified paths are recommended. In that way, only one copy of the schema will tend to exist.)

Note that you cannot use the namespace prefixes when defining the schema locations. The xsi:schemaLocation declaration understands only namespace names and not prefixes.

Declaring the Schemas in the Application

To declare the equivalent schemas in the application, the code would look something like this:
static final String employeeSchema = "employeeDatabase.xsd";
static final String taxSchema = "w2TaxForm.xsd";
static final String hiringSchema = "hiringForm.xsd";

static final String[] schemas = {
    employeeSchema,
    taxSchema,
    hiringSchema,
};

static final String JAXP_SCHEMA_SOURCE =
  "http://java.sun.com/xml/jaxp/properties/schemaSource";

...
DocumentBuilderFactory factory =
    DocumentBuilderFactory.newInstance();
...
factory.setAttribute(JAXP_SCHEMA_SOURCE, schemas); 
Here, the array of strings that points to the schema definitions (.xsd files) is passed as the argument to the factory.setAttribute method. Note the differences from when you were declaring the schemas to use as part of the XML data set:

There is no special declaration for the default (unnamed) schema.

You don't specify the namespace name. Instead, you only give pointers to the .xsd files.

To make the namespace assignments, the parser reads the .xsd files, and finds in them the name of the target namespace they apply to. Because the files are specified with URIs, the parser can use an EntityResolver (if one has been defined) to find a local copy of the schema.
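As an illustration of the EntityResolver idea, the following sketch maps schema URIs to local copies held in memory. It is only one possible shape for such a resolver: the class name LocalSchemaResolver and its register method are hypothetical, and exactly when a given parser consults the resolver for schema lookups depends on the implementation.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver: maps schema URIs to local copies held in memory.
final class LocalSchemaResolver implements EntityResolver {
    private final Map<String, String> localCopies = new HashMap<>();

    // Associates a system identifier (URI) with the text of a local copy.
    void register(String systemId, String schemaText) {
        localCopies.put(systemId, schemaText);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String text = localCopies.get(systemId);
        if (text == null) {
            return null; // no local copy: let the parser use its default lookup
        }
        InputSource source = new InputSource(new StringReader(text));
        source.setSystemId(systemId); // keep the original URI as the identifier
        return source;
    }
}
```

You would install such a resolver with the setEntityResolver method of DocumentBuilder before parsing.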

If the schema definition does not define a target namespace, then it applies to the default (unnamed, or null) namespace. So, in our example, you would expect to see these target namespace declarations in the schemas:

employeeDatabase.xsd: none

w2TaxForm.xsd: http://www.irs.gov/

hiringForm.xsd: http://www.ourcompany.com/
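The following sketch puts these pieces together for the employee example. It is a hedged illustration, not the tutorial's actual schemas: the class name MultiNamespaceSketch, the schema contents (including the wages and startDate elements), and the use of an xs:any wildcard to admit the tax and hiring forms are all assumptions. The three schemas are passed as an array of in-memory InputStream objects, and the parser assigns each one to a namespace by reading its targetNamespace.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class; all schema contents below are made up.
final class MultiNamespaceSketch {
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";
    static final String JAXP_SCHEMA_SOURCE =
        "http://java.sun.com/xml/jaxp/properties/schemaSource";
    static final String XS = " xmlns:xs='http://www.w3.org/2001/XMLSchema'";

    // No target namespace: applies to the unqualified employee and name elements.
    static final String EMPLOYEE_SCHEMA = "<xs:schema" + XS + ">"
      + "<xs:element name='employee'><xs:complexType><xs:sequence>"
      + "<xs:element name='name' type='xs:string'/>"
      + "<xs:any namespace='##other' processContents='strict'"
      + " minOccurs='0' maxOccurs='unbounded'/>"
      + "</xs:sequence><xs:attribute name='id' type='xs:string'/>"
      + "</xs:complexType></xs:element></xs:schema>";

    static final String TAX_SCHEMA = formSchema("http://www.irs.gov/", "wages", "xs:decimal");
    static final String HIRING_SCHEMA =
        formSchema("http://www.ourcompany.com/", "startDate", "xs:date");

    // Builds a schema whose <form> element has one typed child.
    private static String formSchema(String namespace, String child, String type) {
        return "<xs:schema" + XS + " targetNamespace='" + namespace + "'"
          + " elementFormDefault='qualified'>"
          + "<xs:element name='form'><xs:complexType><xs:sequence>"
          + "<xs:element name='" + child + "' type='" + type + "'/>"
          + "</xs:sequence></xs:complexType></xs:element></xs:schema>";
    }

    // Sample document with elements from all three namespaces.
    static String employee(String wages) {
        return "<employee id='42' xmlns:tax='http://www.irs.gov/'"
          + " xmlns:hiring='http://www.ourcompany.com/'>"
          + "<name>Pat</name>"
          + "<tax:form><tax:wages>" + wages + "</tax:wages></tax:form>"
          + "<hiring:form><hiring:startDate>2004-01-15</hiring:startDate></hiring:form>"
          + "</employee>";
    }

    /** Validates xml against the three schemas and returns the messages. */
    static List<String> validate(String xml) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(true);
            factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
            // One array entry per schema; no namespace names are given here.
            Object[] schemas = {
                stream(EMPLOYEE_SCHEMA), stream(TAX_SCHEMA), stream(HIRING_SCHEMA) };
            factory.setAttribute(JAXP_SCHEMA_SOURCE, schemas);

            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { problems.add(e.getMessage()); }
                public void error(SAXParseException e) { problems.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(stream(xml));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return problems;
    }

    private static ByteArrayInputStream stream(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }
}
```

In this sketch, validate(employee("1000.00")) should report no problems, while validate(employee("lots")) should report one for the tax form, because the assumed tax schema declares wages as a decimal.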

At this point, you have seen two possible values for the schema source property when invoking the factory.setAttribute() method: a File object in factory.setAttribute(JAXP_SCHEMA_SOURCE, new File(schemaSource)) and an array of strings in factory.setAttribute(JAXP_SCHEMA_SOURCE, schemas). Here is a complete list of the possible values for that argument:

A string that points to the URI of the schema

An InputStream with the contents of the schema

A SAX InputSource

A File

An array of Objects, each of which is one of the types defined here.

Note: An array of Objects can be used only when the schema language (specified with the http://java.sun.com/xml/jaxp/properties/schemaLanguage property) has the ability to assemble a schema at runtime. Also, when an array of Objects is passed, it is illegal to have two schemas that share the same namespace.