Error when trying to extract a table from a site with R — how to solve?


I'm using this code, and I want to import the country table into R:

library(XML)
url <- "http://en.wikipedia.org/wiki/List_of_countries_by_population"
country_data <- readHTMLTable(url, which=2)

R returns the error:

Error: failed to load external entity "http://en.wikipedia.org/wiki/List_of_countries_by_population"

How should I proceed?

asked by anonymous 22.02.2017 / 20:56

1 answer


In my limited experience with web scraping in R, I've come to prefer the rvest package over XML for this kind of work. So I'm going to give you a solution with it, instead of a solution with the package you asked about:

library(rvest)

url <- "http://en.wikipedia.org/wiki/List_of_countries_by_population"

tabela <- read_html(url) %>%
  html_table(fill=TRUE) %>%
  .[[2]]

The only trick here is knowing how to identify the position of the table that interests you among everything downloaded from the page. In the specific case of the address stored in the object `url`, the table we want is at position `[[2]]`.

As far as I know, the only way to figure out the correct index is by trial and error. Maybe there is another way, but I don't know of one.
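One way to make the trial and error faster (a sketch, not the only approach): extract every table from the page at once, then inspect their dimensions and column names to spot the one you want.

```r
library(rvest)

url <- "http://en.wikipedia.org/wiki/List_of_countries_by_population"

# Parse the page once and extract every <table> as a data frame
tabelas <- read_html(url) %>%
  html_table(fill = TRUE)

# How many tables are on the page?
length(tabelas)

# Inspect dimensions and column names to locate the country table
lapply(tabelas, dim)
lapply(tabelas, names)
```

Once you spot the right table in that output, its position in the list is the index to use with `.[[n]]`.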

If the above code generates the error Error in open.connection(x, "rb") : Timeout was reached , try running the command below:

library(rvest)
library(curl)

url <- "http://en.wikipedia.org/wiki/List_of_countries_by_population"

tabela <- read_html(curl(url, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
  html_table(fill=TRUE) %>%
  .[[2]]

By using curl , we force the scraper to identify itself to the site. That way, the site does not refuse the connection that R tries to make.
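If you would rather not call curl directly, the same effect can be achieved with the httr package (a sketch under the same assumption; the "Mozilla/5.0" user-agent string is just an example value):

```r
library(rvest)
library(httr)

url <- "http://en.wikipedia.org/wiki/List_of_countries_by_population"

# Fetch the page with an explicit user agent, then parse the response body
resposta <- GET(url, user_agent("Mozilla/5.0"))

tabela <- read_html(content(resposta, as = "text")) %>%
  html_table(fill = TRUE) %>%
  .[[2]]
```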

22.02.2017 / 21:24