Saturday, November 10, 2007

Parsing HTML with Java



Sometimes you need to parse HTML to extract data from it, for example to pull a particular ID out of a page. This can be a problem because HTML is often not well formed: it is full of tags that need not be closed, such as the br tag, so a strict XML parser will reject it. To get around this, use the parser that comes with HTMLEditorKit. The kit can also help you integrate an HTML solution with Swing. Here is some code:

HTMLEditorKit parser:

import java.io.FileReader;
import java.io.Reader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HTMLParser
{
    public static void main(String[] args) throws Exception
    {
        HTMLEditorKit.ParserCallback callback = new CallBack();
        // try-with-resources closes the file even if parsing fails
        try (Reader reader = new FileReader("d:/test.html"))
        {
            ParserDelegator delegator = new ParserDelegator();
            // true = ignore any charset declared inside the page;
            // the Reader has already decided the encoding
            delegator.parse(reader, callback, true);
        }
    }
}

// Implement the callback class, much like a SAX content handler.
// ParserCallback already provides empty implementations of every handler,
// so only the methods of interest are overridden.
class CallBack extends HTMLEditorKit.ParserCallback
{
    // tags that are currently open, innermost on top
    private final Deque<HTML.Tag> stack = new ArrayDeque<HTML.Tag>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
    {
        // get a tag and push it onto the stack
        System.out.println("Tag: " + tag);
        stack.push(tag);
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos)
    {
        // drop the closed tag, plus anything that was left open inside it
        if (stack.contains(tag))
        {
            while (!stack.pop().equals(tag)) { }
        }
    }

    @Override
    public void handleText(char[] data, int pos)
    {
        // peek at the innermost open tag. If it is the one you are interested
        // in, extract the data; otherwise return. peek() yields null when no
        // tag is open, so text outside any tag is skipped as well.
        if (!HTML.Tag.SPAN.equals(stack.peek()))
        {
            return;
        }
        System.out.println("Text: " + new String(data));
    }
}
The parser will tolerate tags that are not closed.
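
As a sketch of the ID extraction mentioned at the start, the same callback approach can collect every id attribute, including ones on tags that are never closed. The class name IdExtractor, its extractIds method, and the sample markup are made up for illustration; only ParserDelegator, HTMLEditorKit.ParserCallback, and HTML.Attribute.ID come from the JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of any library.
class IdExtractor
{
    // Returns the value of every id attribute, in document order.
    static List<String> extractIds(String html)
    {
        final List<String> ids = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback()
        {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
            {
                collect(a);
            }

            // Tags without content, such as br, arrive here instead
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet a, int pos)
            {
                collect(a);
            }

            private void collect(MutableAttributeSet a)
            {
                Object id = a.getAttribute(HTML.Attribute.ID);
                if (id != null)
                {
                    ids.add(id.toString());
                }
            }
        };
        try
        {
            // true = ignore charset declarations inside the markup
            new ParserDelegator().parse(new StringReader(html), callback, true);
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
        return ids;
    }

    public static void main(String[] args)
    {
        // Sample markup (invented): the br is never closed and the span is left open
        String html = "<div id=\"main\">Hello<br id=\"sep\">World<span id=\"price\">42</div>";
        System.out.println(extractIds(html));
    }
}
```

Collecting into a list keeps the callback free of printing, so the caller decides what to do with the IDs.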

If you would prefer a DOM solution to the parsing problem, have a look at jTidy:

http://jtidy.sourceforge.net/

A DOM solution is appropriate for HTML documents that are not too large and that need random access and in-memory modification. I have not tried jTidy myself; the lack of documentation made me stay away. The documentation available at SourceForge was pretty bad, with sample programs whose lines of code were all fused into one continuous run of characters.
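
To make the random-access point concrete without jTidy, here is a minimal sketch that uses the JDK's own HTMLDocument as a stand-in for a DOM. It is not jTidy's API, and it is not a W3C DOM either; the class name DomLikeLookup, the method textOfId, and the sample markup are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.swing.text.Element;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper for illustration; not part of any library.
class DomLikeLookup
{
    // Loads the whole page into memory, then jumps straight to one element by id.
    // Returns null when no element carries that id.
    static String textOfId(String html, String id)
    {
        try
        {
            HTMLEditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
            // Avoid a ChangedCharSetException if the page declares a charset
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
            kit.read(new StringReader(html), doc, 0);

            Element element = doc.getElement(id);
            if (element == null)
            {
                return null;
            }
            int start = element.getStartOffset();
            int end = Math.min(element.getEndOffset(), doc.getLength());
            return doc.getText(start, end - start).trim();
        }
        catch (Exception e) // IOException or BadLocationException
        {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args)
    {
        // Sample markup (invented)
        String html = "<html><body><p id=\"intro\">Hello world</p>"
                    + "<p id=\"other\">Second paragraph</p></body></html>";
        System.out.println(textOfId(html, "intro"));
    }
}
```

The trade-off is the one described above: the whole document sits in memory, which is what makes the lookup by id possible after parsing has finished.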

Another DOM-like solution is HTML Parser. Here is the link:

http://htmlparser.sourceforge.net/

This parser is more powerful. You can use a lightweight or a heavy-duty solution depending on your requirements. Here is some code for the lightweight Lexer parser. The documentation for this parser was pretty good.

Lexer code:
New Document
Here is the output:

HTML

HEAD

TITLE
New Document
/TITLE

/HEAD

BODY

/BODY
/HTML


One more solution is to use the Swing Parser class. Here is some code:

Swing parser:
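import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.TagElement;

// SwingParserDemo is a placeholder class name wrapped around the snippet
public class SwingParserDemo
{
    public static void main(String[] args) throws IOException
    {
        DTD dtd = DTD.getDTD("html.dtd");
        Parser parser = new Parser(dtd)
        {
            @Override
            protected void handleText(char[] data)
            {
                // build the string in one step instead of char by char
                System.out.println("Text: " + new String(data));
            }

            @Override
            protected void startTag(TagElement element) throws ChangedCharSetException
            {
                System.out.println("Start tag: " + element.getElement().getName());
                // let the parser do its own bookkeeping for the open tag
                super.startTag(element);
            }
        };
        // try-with-resources closes the file even if parsing fails
        try (Reader reader = new FileReader(new File("d:/test2.html")))
        {
            parser.parse(reader);
        }
    }
}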
test2.html:


Output:
Start tag: html
Start tag: head
Start tag: title
Text: New Document
Start tag: body
Start tag: test
Text: testamondo
Start tag: h1
Text: Big Header

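This parser is DTD driven. Like the HTMLEditorKit parser, it reports tags and text through callbacks as it reads, so it is better suited to a SAX-style solution than to a DOM-style one.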
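
What "DTD driven" means in practice is that the parser consults a DTD object, which is essentially a table of element definitions. According to the JDK documentation, DTD.getDTD creates a new DTD when none with that name exists, and DTD.getElement creates an element when the name is not defined yet, so a DTD obtained this way may start out nearly empty. A small sketch of that behaviour; the DTD name "sketch.dtd" and the class name DtdSketch are invented here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Element;

// Hypothetical helper for illustration; not part of any library.
class DtdSketch
{
    // True when the DTD did not define tagName beforehand
    // and getElement then created it on request.
    static boolean createdOnDemand(String dtdName, String tagName)
    {
        try
        {
            DTD dtd = DTD.getDTD(dtdName);
            boolean knownBefore = dtd.elementHash.containsKey(tagName);
            Element element = dtd.getElement(tagName);
            return !knownBefore && tagName.equals(element.getName());
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args)
    {
        // "test" is not an HTML element, yet the DTD hands back a definition for it
        System.out.println(createdOnDemand("sketch.dtd", "test"));
    }
}
```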

In conclusion

  • Use HTML Parser when you need to perform complex operations. You can choose between lightweight and heavy-duty implementations.
  • Use the HTMLEditorKit parser or the Swing Parser class when you simply intend to parse the HTML and read specific data from it.

I am refraining from recommending jTidy. I have not yet found any documentation that would let me compare it with the other parsers. If I do, I will update this article.