In my admittedly limited experience with web scraping in R, I've come to prefer the rvest package over XML for this kind of work. So I'll give you a solution with rvest instead of the package you asked about:
library(rvest)

url <- "http://en.wikipedia.org/wiki/List_of_countries_by_population"

# Parse the page, extract every table it contains, and keep the second one
tabela <- read_html(url) %>%
  html_table(fill = TRUE) %>%
  .[[2]]
The only trick here is knowing how to find the position of the table you want among everything that was downloaded. In the specific case of the address stored in url, the table we want is at position [[2]]. As far as I know, the only way to figure out the right index is trial and error; there may be another way, but I don't know of one.
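One way to speed up the trial and error is to keep the full list of tables and inspect each one's dimensions before committing to an index. A minimal sketch (the name tabelas is just illustrative):

library(rvest)

url <- "http://en.wikipedia.org/wiki/List_of_countries_by_population"

# Keep the whole list of tables instead of picking one straight away
tabelas <- read_html(url) %>% html_table(fill = TRUE)

length(tabelas)       # how many tables the page contains
lapply(tabelas, dim)  # rows and columns of each candidate
head(tabelas[[2]])    # peek at a candidate before settling on it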
If the code above generates the error Error in open.connection(x, "rb") : Timeout was reached, try running the command below:
library(rvest)
library(curl)

url <- "http://en.wikipedia.org/wiki/List_of_countries_by_population"

# Open the connection with an explicit user agent, then proceed as before
tabela <- read_html(curl(url, handle = curl::new_handle(useragent = "Mozilla/5.0"))) %>%
  html_table(fill = TRUE) %>%
  .[[2]]
When using curl this way, we force the scraper to identify itself to the site, so the site does not refuse the connection that R tries to make.
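If you prefer, the same idea also works with the httr package: send the request with an explicit user agent and parse the response body afterwards. This is only an alternative sketch, not something the solution above requires (resposta is an illustrative name):

library(rvest)
library(httr)

url <- "http://en.wikipedia.org/wiki/List_of_countries_by_population"

# Identify the scraper through httr's user_agent(), then parse the body
resposta <- GET(url, user_agent("Mozilla/5.0"))
tabela <- read_html(content(resposta, as = "text", encoding = "UTF-8")) %>%
  html_table(fill = TRUE) %>%
  .[[2]]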