Web crawler (spider) for a JSF site with AJAX, using Node.js or the JSoup API in Java


I have been given the task of creating an interface optimized for a touch monitor, pulling data from a website (link).

The site lists bus lines and lets you look up their schedules through an AJAX auto-complete field.

Because it belongs to a government agency, the chance of obtaining the data any other way is close to zero.

I thought of writing a crawler in Java or Node.js that would hit the request URL, pass along the site's parameters (the form inputs), and filter what I need out of the response. Easy! Only in theory :(

I made a request to this URL:

http://www.consultas.der.mg.gov.br/grgx/sgtm/consulta_linha.xhtml;jsessionid=1820D695BDE4B916EC808F84BD1B335D

Using these HTTP headers with the webcrawler module for Node.js:

Accept:application/xml, text/xml, */*; q=0.01
Accept-Encoding:gzip, deflate
Accept-Language:pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4
Connection:keep-alive
Content-Length:457
Content-Type:application/x-www-form-urlencoded; charset=UTF-8
Cookie:JSESSIONID=1820D695BDE4B916EC808F84BD1B335D
Faces-Request:partial/ajax
Host:www.consultas.der.mg.gov.br
Origin:http://www.consultas.der.mg.gov.br
Referer:http://www.consultas.der.mg.gov.br/grgx/sgtm/consulta_linha.xhtml
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
X-Requested-With:XMLHttpRequest

And the form data below, where I used the number 6 as the autocomplete query (on the site, this brings up a listing):

javax.faces.partial.ajax:true
javax.faces.source:form:tabview:campoBusca
javax.faces.partial.execute:form:tabview:campoBusca
javax.faces.partial.render:form:tabview:campoBusca
form:tabview:campoBusca:form:tabview:campoBusca
form:tabview:campoBusca_query:6
form:form
form:tabview:campoBusca_input:6
form:tabview:campoBusca_hinput:6
form:tabview_activeIndex:0
javax.faces.ViewState:-6275073363975845032:-2043218073946595619

But all I got back was an error response.

I also tried it in Java using JSoup, but that was worse: it returned a lifecycle exception.
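
My JSoup attempt looked roughly like this (a minimal sketch; the field names come from the capture above, and the hard-coded JSESSIONID and ViewState are the values taken from my browser session):

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class NaiveCrawler {
    public static void main(String[] args) throws Exception {
        String url = "http://www.consultas.der.mg.gov.br/grgx/sgtm/consulta_linha.xhtml";
        // JSESSIONID and ViewState are hard-coded from the browser capture
        // above; by the time this runs they no longer match a live session.
        Connection.Response res = Jsoup.connect(url)
                .method(Connection.Method.POST)
                .ignoreContentType(true) // the partial/ajax response is XML, not HTML
                .header("Faces-Request", "partial/ajax")
                .header("X-Requested-With", "XMLHttpRequest")
                .header("Accept", "application/xml, text/xml, */*; q=0.01")
                .cookie("JSESSIONID", "1820D695BDE4B916EC808F84BD1B335D")
                .data("javax.faces.partial.ajax", "true")
                .data("javax.faces.source", "form:tabview:campoBusca")
                .data("javax.faces.partial.execute", "form:tabview:campoBusca")
                .data("javax.faces.partial.render", "form:tabview:campoBusca")
                .data("form:tabview:campoBusca", "form:tabview:campoBusca")
                .data("form:tabview:campoBusca_query", "6")
                .data("form", "form")
                .data("form:tabview:campoBusca_input", "6")
                .data("form:tabview:campoBusca_hinput", "6")
                .data("form:tabview_activeIndex", "0")
                .data("javax.faces.ViewState", "-6275073363975845032:-2043218073946595619")
                .execute();
        System.out.println(res.body()); // this is where the lifecycle error shows up
    }
}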

This caught me off guard. How can I build a working web crawler in this scenario?

    
asked by anonymous 06.06.2016 / 15:04

1 answer


This means you cannot make an arbitrary request to the site; if you do, you get the lifecycle error. That happens because the system stores information in the session, and that information is not there when the request comes from outside the browser.

I don't have a ready-made solution for this particular situation, but in principle you would need to simulate real use of the system, not just the final request. The easiest way to map this out is to use the site manually while monitoring the requests in the browser's developer tools, much as you already did, but then reproduce the earlier steps too and send the session ID with every request.
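
As a sketch of that idea in Java with JSoup (untested against this particular site; the field names come from your capture, and the assumption that a plain GET of the page yields a usable ViewState may not hold if more navigation steps are required):

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.Map;

public class SessionAwareCrawler {
    public static void main(String[] args) throws Exception {
        String url = "http://www.consultas.der.mg.gov.br/grgx/sgtm/consulta_linha.xhtml";
        // Step 1: open the page like a browser would, to get a fresh session
        // cookie (JSESSIONID) and the javax.faces.ViewState that belongs to it.
        Connection.Response first = Jsoup.connect(url)
                .method(Connection.Method.GET)
                .execute();
        Map<String, String> cookies = first.cookies();
        Document page = first.parse();
        String viewState = page.select("input[name=javax.faces.ViewState]").val();
        // Step 2: replay the autocomplete AJAX request inside that same
        // session, reusing the cookies and the freshly obtained ViewState.
        Connection.Response ajax = Jsoup.connect(url)
                .method(Connection.Method.POST)
                .ignoreContentType(true) // JSF answers with partial-response XML
                .header("Faces-Request", "partial/ajax")
                .header("X-Requested-With", "XMLHttpRequest")
                .cookies(cookies)
                .data("javax.faces.partial.ajax", "true")
                .data("javax.faces.source", "form:tabview:campoBusca")
                .data("javax.faces.partial.execute", "form:tabview:campoBusca")
                .data("javax.faces.partial.render", "form:tabview:campoBusca")
                .data("form:tabview:campoBusca", "form:tabview:campoBusca")
                .data("form:tabview:campoBusca_query", "6")
                .data("form", "form")
                .data("form:tabview:campoBusca_input", "6")
                .data("form:tabview:campoBusca_hinput", "6")
                .data("form:tabview_activeIndex", "0")
                .data("javax.faces.ViewState", viewState)
                .execute();
        System.out.println(ajax.body()); // partial-response XML with the suggestions
    }
}

The essential point is that the cookie and the ViewState in the second request belong to the same live session; that pairing is exactly what an isolated, out-of-the-blue request lacks.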

  

Note: the fact that data is available on the Internet does not mean you can simply capture it and publish it on your own site. Always check the usage rights for the data.

    
answered 07.06.2016 / 01:46