Thursday 16 September 2010

Java: "Content is not allowed in prolog" - causes of this XML processing error

"Content is not allowed in prolog" is an error generally emitted by the Java XML parsers when they encounter data before the <?xml... declaration. The document may look fine in a text editor, but you need to go down to the byte level to understand the problem. You probably have a character encoding bug.

This code reproduces the problem:

import java.io.*;
import java.nio.charset.Charset;
import javax.xml.parsers.*;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ContentNotAllowedInProlog {
  // Parses the stream with the default SAX parser, discarding all events.
  private static void parse(InputStream stream) throws SAXException,
      ParserConfigurationException, IOException {
    SAXParserFactory.newInstance().newSAXParser().parse(stream,
        new DefaultHandler());
  }

  public static void main(String[] args) {
    String[] encodings = { "UTF-8", "UTF-16", "ISO-8859-1" };
    // Try every pairing where the declared encoding differs from the real one.
    for (String actual : encodings) {
      for (String declared : encodings) {
        if (!actual.equals(declared)) {
          String xml = "<?xml version='1.0' encoding='" + declared
              + "'?><x/>";
          byte[] encoded = xml.getBytes(Charset.forName(actual));
          try {
            parse(new ByteArrayInputStream(encoded));
            // The parser accepted mislabelled data without complaint.
            System.out.println("HIDDEN ERROR! actual:" + actual + " " + xml);
          } catch (Exception e) {
            System.out.println(e.getMessage() + " actual:" + actual + " xml:"
                + xml);
          }
        }
      }
    }
  }
}

The output:

Content is not allowed in prolog. actual:UTF-8 xml:<?xml version='1.0' encoding='UTF-16'?><x/>
HIDDEN ERROR! actual:UTF-8 <?xml version='1.0' encoding='ISO-8859-1'?><x/>
Content is not allowed in prolog. actual:UTF-16 xml:<?xml version='1.0' encoding='UTF-8'?><x/>
Content is not allowed in prolog. actual:UTF-16 xml:<?xml version='1.0' encoding='ISO-8859-1'?><x/>
HIDDEN ERROR! actual:ISO-8859-1 <?xml version='1.0' encoding='UTF-8'?><x/>
Content is not allowed in prolog. actual:ISO-8859-1 xml:<?xml version='1.0' encoding='UTF-16'?><x/>

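This code also highlights another, more insidious character encoding issue: data can be written with one encoding while declared as another, and everything seems to work. The two "HIDDEN ERROR!" cases parse cleanly only because the test document contains nothing but ASCII characters, which UTF-8 and ISO-8859-1 encode to identical bytes.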
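To see why the hidden case matters, the sketch below puts a single non-ASCII character into a document that is encoded as UTF-8 but declared as ISO-8859-1. It assumes the JDK's default DOM parser; the class and method names are made up for illustration and are not part of the original test.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class: silent corruption when the declared encoding
// (ISO-8859-1) differs from the encoding actually used (UTF-8).
class HiddenEncodingError {

  // Encodes the document as UTF-8 but declares ISO-8859-1, then returns
  // the text content that the parser hands back.
  static String parseMislabelled(String text) {
    String xml = "<?xml version='1.0' encoding='ISO-8859-1'?><x>" + text
        + "</x>";
    byte[] encoded = xml.getBytes(Charset.forName("UTF-8"));
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder().parse(new ByteArrayInputStream(encoded));
      return doc.getDocumentElement().getTextContent();
    } catch (Exception e) {
      // Not expected here: the mislabelled bytes still form a parseable document.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String original = "\u00E9"; // e with acute accent, two bytes in UTF-8
    String parsed = parseMislabelled(original);
    System.out.println("chars in: " + original.length()
        + ", chars out: " + parsed.length());
    System.out.println("intact: " + original.equals(parsed));
  }
}
```

No exception is thrown, but each UTF-8 byte is decoded as a separate ISO-8859-1 character, so the one character that went in comes back as two different ones.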

When you inspect the data in a hex editor, the problems become more apparent.

A valid UTF-16 form:

FF FE 3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00         __<_?_x_m_l_ _v_
65 00 72 00 73 00 69 00 6F 00 6E 00 3D 00 27 00         e_r_s_i_o_n_=_'_
31 00 2E 00 30 00 27 00 20 00 65 00 6E 00 63 00         1_._0_'_ _e_n_c_
6F 00 64 00 69 00 6E 00 67 00 3D 00 27 00 55 00         o_d_i_n_g_=_'_U_
54 00 46 00 2D 00 31 00 36 00 27 00 3F 00 3E 00         T_F_-_1_6_'_?_>_
3C 00 78 00 2F 00 3E 00                                 <_x_/_>_

Note: exact UTF-16 byte forms vary - big-endian, little-endian, with or without a byte-order-mark. This one is little-endian with a BOM.
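In Java, the charset name you pick decides which of these forms you get. The following sketch prints the bytes of a tiny document under three standard charset names; the class and the hex helper are illustrative only. Per the java.nio.charset.Charset documentation, "UTF-16" writes a big-endian byte-order-mark when encoding, while "UTF-16BE" and "UTF-16LE" write none.

```java
import java.nio.charset.Charset;

// Hypothetical demo class: compares the byte forms of Java's UTF-16 charsets.
class Utf16Forms {

  // Encodes text with the named charset and returns space-separated hex bytes.
  static String hex(String text, String charsetName) {
    StringBuilder out = new StringBuilder();
    for (byte b : text.getBytes(Charset.forName(charsetName))) {
      if (out.length() > 0) {
        out.append(' ');
      }
      out.append(String.format("%02X", b & 0xFF));
    }
    return out.toString();
  }

  public static void main(String[] args) {
    for (String name : new String[] { "UTF-16", "UTF-16BE", "UTF-16LE" }) {
      System.out.println(name + ": " + hex("<x/>", name));
    }
    // UTF-16:   FE FF 00 3C 00 78 00 2F 00 3E
    // UTF-16BE: 00 3C 00 78 00 2F 00 3E
    // UTF-16LE: 3C 00 78 00 2F 00 3E 00
  }
}
```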

An XML document that declares itself as UTF-16 but is really UTF-8:

EF BB BF 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E         ___<?xml version
3D 27 31 2E 30 27 20 65 6E 63 6F 64 69 6E 67 3D         ='1.0' encoding=
27 55 54 46 2D 31 36 27 3F 3E 3C 78 2F 3E               'UTF-16'?><x/>

Note: UTF-8 XML documents can come with or without a byte-order-mark. This one includes a BOM.
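If a hex editor is not to hand, a few lines of code can report which byte-order-mark, if any, starts the data, so that it can be compared with the declared encoding. This is a minimal sketch: BomSniffer and detectBom are hypothetical names, and UTF-32 marks are deliberately ignored.

```java
// Hypothetical helper: reports the encoding implied by a leading BOM.
class BomSniffer {

  // Returns "UTF-8", "UTF-16BE" or "UTF-16LE" when the data starts with the
  // matching byte-order-mark, or null when none is found. UTF-32 BOMs are
  // not handled in this sketch.
  static String detectBom(byte[] data) {
    if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
      return "UTF-8";
    }
    if (startsWith(data, 0xFE, 0xFF)) {
      return "UTF-16BE";
    }
    if (startsWith(data, 0xFF, 0xFE)) {
      return "UTF-16LE";
    }
    return null;
  }

  // Compares the first bytes of data, as unsigned values, with the prefix.
  private static boolean startsWith(byte[] data, int... prefix) {
    if (data.length < prefix.length) {
      return false;
    }
    for (int i = 0; i < prefix.length; i++) {
      if ((data[i] & 0xFF) != prefix[i]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Leading bytes taken from the two hex dumps in this post.
    byte[] utf16 = { (byte) 0xFF, (byte) 0xFE, 0x3C, 0x00 };
    byte[] utf8 = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x3C };
    byte[] noBom = { 0x3C, 0x3F, 0x78, 0x6D, 0x6C };
    System.out.println(detectBom(utf16)); // UTF-16LE
    System.out.println(detectBom(utf8));  // UTF-8
    System.out.println(detectBom(noBom)); // null
  }
}
```

A missing mark proves nothing on its own, since both UTF-8 and UTF-16 data can legitimately omit it, but a mark that contradicts the declaration is a strong hint about where the bug lies.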

XML, Java and Encodings

The code was written and tested against Sun's win32 Java 1.6.0_17 which uses a version of the Apache Xerces parser internally.
