Parsing HTML output in Java

26 August 2012

Recently I needed to parse the HTML output from a third-party site so that I could use an XPath locator on the source to identify a specific value. I used Apache HttpComponents to make the HTTP request and fetch the HTML, and thought I could parse the result using Dom4j. This is where I ran into problems. First, the HTML contained an invalid Unicode character that Dom4j could not handle, but this was easily fixed using Mark McLaren's method for stripping out all invalid characters. Then Dom4j started having issues with the HTML structure itself, so I hunted around for a way to convert the HTML into something well-formed and found HtmlCleaner, which did the trick. It also took care of the invalid Unicode characters.
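To give an idea of what the character-stripping step looks like, here is a minimal sketch. It is not Mark's exact code; it is my own reconstruction based on the character ranges the XML 1.0 specification allows, and the class and method names (InvalidXmlCharStripper, stripInvalidXmlChars, evaluate) are made up for illustration. It uses only the JDK's built-in XML parser and XPath support rather than Dom4j, and it assumes the markup is already well-formed, which real-world HTML often is not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: removes characters that XML 1.0 forbids, then runs an XPath query.
class InvalidXmlCharStripper {

    // XML 1.0 permits tab, line feed, carriage return and the ranges
    // 0x20-0xD7FF, 0xE000-0xFFFD and 0x10000-0x10FFFF.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Copies the input code point by code point, dropping anything not permitted.
    static String stripInvalidXmlChars(String in) {
        if (in == null || in.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Parses the cleaned markup and returns the string value of the first XPath match.
    static String evaluate(String markup, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(stripInvalidXmlChars(markup))));
        return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
    }

    public static void main(String[] args) throws Exception {
        // The backspace control character is not allowed in XML 1.0.
        String raw = "<html><body><span id=\"price\">42\b</span></body></html>";
        System.out.println(evaluate(raw, "//span[@id='price']"));
    }
}
```

Stripping characters only gets you past the first problem. Markup with unclosed tags or other structural errors will still fail to parse, which is why I ended up needing a cleaner.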

Thanks to Mark's post I also came across JTidy, which was next on my list to try.