How to get values from a column of several tables displayed on a web page?

6

On a web page, there are one or more tables with information that I need to get in list form.

Specifically, I need a list of the values in the second column of a table on a web page that I specify. For example, the pages below.


Link: link


Link: link

Currently, I copy the table into a spreadsheet and then filter the list. But a script (Python, Ruby or Perl) or program (Java or C#) that takes the link and returns the list would be a great help.

Pages with this kind of content always follow one of the patterns above.

    
asked by anonymous 18.12.2013 / 14:17

1 answer

9

Open the Chrome console (Tools > JavaScript Console) and type the following:

$x("//tr[position() > 1]/td[2]/p/span/text()")

This calls the JavaScript function $x (defined by the Chrome console) and returns the result of the XPath expression passed as a parameter.

Explanation of the XPath expression

  • //tr[position() > 1] : selects every tr element of the page except the first one under each parent (typically the header row of each table)
  • td[2] : selects only the second td element (i.e., the second column); you can change the column number or use td[position()=1 or position()=2] to select the first two columns, for example.
  • p/span : selects the span element within the p element (this is true for the page of your example; for other pages you should check the elements inside the td)
  • text() : selects the contents of the tag.

Use in programming languages

The most practical solution in any programming language involves XPath with some XML library. Ruby example:

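require 'nokogiri'
require 'open-uri'
require 'openssl'

# Fetch the page given as the first argument. Certificate verification is
# disabled for this request only; drop the option if the site's certificate is valid.
html = URI.open(ARGV[0], ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE).read
doc = Nokogiri::HTML(html)
# Print the second-column text of every row except the first.
doc.xpath("//tr[position() > 1]/td[2]/p/span/text()").each { |x| puts x }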

Usage example

ruby script.rb 'https://www.iomat.mt.gov.br/do/navegadorhtml/mostrar.htm?id=630237&edi_id=3580'
    
18.12.2013 / 14:53