Woodstox and the w3c 503 error

4:06 PM 1 Comments

This morning, I was testing our static analysis tool and it threw a very strange error:

Could not read source file: Server returned HTTP response code: 503 for URL http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

It came from the Woodstox parser when it was trying to load the DTD referenced in the DOCTYPE declaration of one of our files.

So, I browsed to said dtd and found this page instead:

IP blocked due to re-requesting files too often

Your IP address has been blocked from accessing our site for 24 hours due to abuse.

The specific type of abuse we observed is: re-requesting the same resource too frequently. Specifically, we received at least 500 requests for the same resource (URI) from your IP address within a ten-minute time interval.

If you are using an application that makes HTTP requests to other sites, please configure it to use an outgoing HTTP cache instead of re-requesting the same files over and over again.

... and so on.

This led me to a lot of interesting research that I won't go over here. I am simply going to show one way I found to work around the issue with the Java XMLStreamReader and Woodstox.

To get an XMLStreamReader, one can do this:

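// needs java.io.InputStream, javax.xml.stream.XMLInputFactory, javax.xml.stream.XMLStreamReader
InputStream is = ...;
XMLInputFactory factory = XMLInputFactory.newInstance(); // the factory method is newInstance(), not getInstance()
XMLStreamReader reader = factory.createXMLStreamReader(is);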

This creates a "ValidatingStreamReader", which requests the DTD each time it sees one referenced. Hence the complaint from the W3C that its xhtml1-transitional DTD was being requested too often.

There are two ways that I see to solve this, and I found one after digging in the API for a few minutes. If I change my code to read this:

InputStream is = ...;
XMLInputFactory factory = XMLInputFactory.newInstance();
// Turn off DTD processing so the parser never fetches the external DTD.
factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
XMLStreamReader reader = factory.createXMLStreamReader(is);

Then it won't request any dtds when parsing the xml file.
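To make that concrete, here is a small self-contained sketch of the same idea. The class name, the helper method, and the sample document are made up for illustration; only the standard javax.xml.stream calls come from the snippet above. It parses an XHTML string whose DOCTYPE points at the W3C DTD and counts the start tags, with DTD support switched off so that no request should go out.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class NoDtdParsing {
    // Illustrative document: the DOCTYPE references the W3C DTD, the body uses no DTD-defined entities.
    static final String SAMPLE =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
        + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title></head>"
        + "<body><p>hello</p></body></html>";

    /** Counts start tags in the given XML with DTD support disabled on the factory. */
    static int countStartElements(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            int count = 0;
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
            return count;
        } catch (XMLStreamException e) {
            // Wrapped as unchecked only to keep the sketch short.
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countStartElements(SAMPLE));
    }
}
```

One trade-off to keep in mind: with DTD support off, entities that only the DTD defines (for example &nbsp; in XHTML) are no longer declared, so documents that use them may fail to parse or lose that text, depending on the parser.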

I found another property when digging through the Woodstox code, but I can't figure out how to access it. It is defined in InputConfigFlags and referenced in ReaderConfig, a configuration object created in the Woodstox implementation of XMLStreamReader:

/**
 * If true, input factory is allowed cache parsed external DTD subsets,
 * potentially speeding up things for which DTDs are needed for: entity
 * substitution, attribute defaulting, and of course DTD-based validation.
 */
final static int CFG_CACHE_DTDS = 0x00010000;

This seems like the more appropriate solution. Any ideas on how to access it?
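Independently of that internal flag, the standard StAX API offers a hook that can give a similar effect: an XMLResolver set on the factory. The sketch below is one possible shape, not Woodstox's own caching. CachingDtdResolver, the fetcher function, and newFactory are hypothetical names introduced here; only XMLResolver and XMLInputFactory.setXMLResolver are standard. The resolver asks the fetcher for each system ID at most once and afterwards serves the bytes from memory.

```java
import java.io.ByteArrayInputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLResolver;

/** Hypothetical resolver: fetches each system ID once, then replays the cached bytes. */
class CachingDtdResolver implements XMLResolver {
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Function<String, byte[]> fetcher;

    /** The fetcher (an assumption of this sketch) maps a system ID to the entity's bytes. */
    CachingDtdResolver(Function<String, byte[]> fetcher) {
        this.fetcher = fetcher;
    }

    @Override
    public Object resolveEntity(String publicID, String systemID, String baseURI, String namespace) {
        if (systemID == null) {
            return null; // nothing to key the cache on; fall back to the parser's default handling
        }
        // computeIfAbsent calls the fetcher only on the first request for this system ID.
        byte[] bytes = cache.computeIfAbsent(systemID, fetcher);
        return bytes == null ? null : new ByteArrayInputStream(bytes);
    }

    /** Builds a factory whose readers resolve external entities through one shared cache. */
    static XMLInputFactory newFactory(Function<String, byte[]> fetcher) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setXMLResolver(new CachingDtdResolver(fetcher));
        return factory;
    }
}
```

The fetcher could just as well read a local copy of the DTD from disk or the classpath, which would avoid the network altogether. The cache lives in the resolver, so it only helps if the same factory is reused for every file rather than created fresh each time.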


"I love to teach, as a painter loves to paint, as a singer loves to sing, as a musician loves to play" - William Lyon Phelps

1 comment:

  1. Found it. The StAX2 input factory, XMLInputFactory2, has a method called configureForSpeed(). It is not the ideal solution, since configureForSpeed() undoubtedly changes other settings in addition to caching DTDs, but it appears to activate a caching mechanism.
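A narrower alternative to configureForSpeed() might be to set only the caching property by name. The sketch below assumes Woodstox exposes it as "com.ctc.wstx.cacheDTDs" (the value of WstxInputProperties.P_CACHE_DTDS, as far as I can tell); that name is an assumption to verify against the Woodstox version in use. DtdCacheConfig and enableDtdCaching are hypothetical names. The code uses only the standard XMLInputFactory API, so it compiles without Woodstox on the classpath and simply reports false when the property is not supported.

```java
import javax.xml.stream.XMLInputFactory;

class DtdCacheConfig {
    // Assumed property name for Woodstox's DTD cache switch; check it against your Woodstox version.
    static final String WSTX_CACHE_DTDS = "com.ctc.wstx.cacheDTDs";

    /** Enables DTD caching if the factory recognizes the property; returns whether it did. */
    static boolean enableDtdCaching(XMLInputFactory factory) {
        if (!factory.isPropertySupported(WSTX_CACHE_DTDS)) {
            return false; // not Woodstox, or the property name differs in this version
        }
        factory.setProperty(WSTX_CACHE_DTDS, Boolean.TRUE);
        return true;
    }

    public static void main(String[] args) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        System.out.println("DTD caching enabled: " + enableDtdCaching(factory));
    }
}
```

Whatever the setting, a cache held by the factory can only pay off if one factory instance is kept around and reused across files.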