To do this you will inevitably need to parse the HTML of the page, and for that you will need a DOM parser.
A DOM parser takes the HTML you downloaded and turns it into a DOM object: a tree of elements that you can navigate to get the information you need.
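To make the idea concrete, here is a rough sketch of what "turning HTML into a tree you can browse" looks like, using only Python's standard `html.parser` module. The `Node`, `TreeBuilder` and `parse` names are made up for this illustration, and this is not how a real DOM library is implemented: it does not apply the HTML5 rules for implied closing tags, so treat it as a toy.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so we must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal tree node: a tag, its attributes, its children."""

    def __init__(self, tag, attrs=(), parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []  # each child is either a Node or a str of text

    def find_all(self, tag):
        """Yield every descendant element with the given tag name."""
        for child in self.children:
            if isinstance(child, Node):
                if child.tag == tag:
                    yield child
                yield from child.find_all(tag)

    def text(self):
        """Concatenate all text found below this node."""
        return "".join(c if isinstance(c, str) else c.text()
                       for c in self.children)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from the parser's start/end/data events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag name.
        # A stray closing tag that matches nothing is simply ignored.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


page = ('<html><body><h1 class="title">Headline</h1>'
        '<p>First <b>paragraph</b>.</p></body></html>')
dom = parse(page)
headlines = [h1.text() for h1 in dom.find_all("h1")]
paragraphs = [p.text() for p in dom.find_all("p")]
print(headlines, paragraphs)
```

Once you have the tree, getting the information is just a matter of walking it (`find_all("h1")`, reading `attrs`, and so on), which is what a real DOM library gives you with a much richer API.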
I've done a few projects of exactly this type, and the biggest problems you'll run into are basically two:
1) Each site (and sometimes different sections or stories within the same site) has a different HTML structure, so you have to write a separate mapping for each section / site.
2) Websites (even large ones such as UOL and Terra) have badly formed, error-ridden HTML. Sooner or later this causes an error while parsing the page, which will complicate your life.
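For the first problem, one way to organise things is to keep a small "map" per site that says where each piece of data lives, and keep the extraction code generic. The sketch below assumes that a title can be found by tag name plus CSS class. The hostnames, the class names and the `SITE_MAPS`, `FirstMatchText` and `extract_title` names are all invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical per-site maps: host -> (tag, CSS class) that holds the title.
SITE_MAPS = {
    "news.example.com": ("h1", "headline"),
    "blog.example.org": ("h2", "post-title"),
}


class FirstMatchText(HTMLParser):
    """Collects the text of the first element with the given tag and class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # > 0 while we are inside the matched element
        self.done = False   # True once the first match has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Track nested elements with the same tag name.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_title(url, html):
    """Look up the map for the URL's host and apply it to the HTML."""
    tag, css_class = SITE_MAPS[urlparse(url).hostname]
    parser = FirstMatchText(tag, css_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


news_title = extract_title(
    "https://news.example.com/story/1",
    '<div><h1 class="headline big">Hello <em>world</em></h1></div>',
)
blog_title = extract_title(
    "https://blog.example.org/post/2",
    '<article><h2 class="post-title">Other post</h2><h2>Related</h2></article>',
)
print(news_title, "|", blog_title)
```

When a site changes its layout, you only touch its entry in the map, not the crawler itself.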
The key is to find a parser that either preprocesses the HTML to correct the errors or is error-tolerant.
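As a small illustration of what "error-tolerant" means in practice, Python's built-in `html.parser` does not raise an exception on broken markup; it keeps reporting whatever tags it can recognise. The `LinkCollector` class and the sample markup below are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, however messy the markup is."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so <A HREF=...> works too.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Unclosed <p> and <a>, a stray </b>, a stray </span>, mixed-case tags.
broken = ('<div><p>Unclosed paragraph <a href="/news/1">one'
          '<A HREF="/news/2">two</b></div></span>')

collector = LinkCollector()
collector.feed(broken)
collector.close()
print(collector.links)
```

A strict XML parser would give up on input like this, which is why it is worth checking how a library handles bad markup before you commit to it.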
The last time I worked on a project like this, I wrote a small bot in Java, because there is a ready-made Java library that is perfect for this: it lets you pull data out of the HTML in a jQuery-like way. It's really cool!
link
Hugs and good luck!