Why does rvest break when processing an empty file?

When I try to process the contents of an empty file, the rvest package freezes and crashes RStudio. Here is a minimal reproduction of the problem:

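library(rvest)  # makes read_html(), html_nodes(), html_text() and %>% available

tf <- tempfile()
file.create(tf)  # creates an empty file
html_erro <- read_html(tf)
html_erro %>% html_nodes('h1') %>% html_text()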

Why is this case (an empty file) handled like this? Why does R crash instead of showing an error message?

Thank you!

    
asked by anonymous 22.11.2016 / 01:41

1 answer

I will only answer this part: why does the error happen?

When you read an empty file with the read_html function of the xml2 package using the code below:

library(xml2)

tf <- tempfile()
file.create(tf)  # creates an empty file
html_erro <- read_html(tf)

You get a list of two elements, each of type externalptr. The list itself has the classes xml_document and xml_node. This can be seen with:

str(html_erro)
List of 2
 $ node:<externalptr> 
 $ doc :<externalptr> 
 - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

Now let's look at each of the objects in this list. First, $doc:

html_erro$doc
<pointer: 0x128c4d4c0>

Note that it is a pointer to the memory address 0x128c4d4c0. Now look at $node:

html_erro$node
<pointer: 0x0> 

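It is a pointer to the address 0x0, that is, a null pointer. This is where the problem comes from. As soon as some code tries to read the value behind this pointer, it accesses a null (non-existent) memory address, which causes what is called a segmentation fault. A segmentation fault happens in the compiled code underneath R, so the operating system kills the whole R process before R has a chance to raise an ordinary error message.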
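
The null pointer can be detected from R before anything dereferences it. The sketch below is only an illustration: is_null_pointer is a hypothetical helper (it is not part of xml2 or rvest), and it relies on the assumption that identical() compares external pointers by address and that methods::new("externalptr") returns a fresh pointer whose address is NULL.

```r
# Hypothetical helper (not part of xml2 or rvest).
# Returns TRUE when `ptr` is an external pointer whose address is NULL.
is_null_pointer <- function(ptr) {
  # methods::new("externalptr") builds a fresh external pointer with a NULL
  # address; identical() is assumed to compare external pointers by address.
  typeof(ptr) == "externalptr" &&
    identical(ptr, methods::new("externalptr"))
}
```

With the object from the example above, is_null_pointer(html_erro$node) would be expected to return TRUE, while is_null_pointer(html_erro$doc) would return FALSE.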

In your case, the html_nodes function tried to access this address and hit the problem, but the same thing can happen, for example, when you call print(html_erro): the print method for xml_document tries to access that pointer and triggers the segmentation fault.
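
One way to avoid the crash is to refuse empty files before they ever reach the parser. The following is a minimal sketch, not the behaviour of xml2 itself: safe_read_html is a hypothetical wrapper name, and it assumes that a size of zero bytes is the only case that needs guarding.

```r
# Hypothetical wrapper (not part of xml2 or rvest): raise an ordinary R error
# for missing or empty files instead of handing them to the parser.
safe_read_html <- function(path) {
  if (!file.exists(path)) {
    stop("File does not exist: ", path)
  }
  # file.size() returns the size in bytes; an empty file has size 0.
  if (file.size(path) == 0) {
    stop("File is empty: ", path)
  }
  xml2::read_html(path)
}

tf <- tempfile()
file.create(tf)  # creates an empty file

# tryCatch() turns the error into a message instead of stopping the script.
result <- tryCatch(
  safe_read_html(tf),
  error = function(e) conditionMessage(e)
)
print(result)
```

Because the check runs before read_html, the empty file produces a regular error that tryCatch can handle. A file that contains only whitespace still passes the size check, so this guard covers the zero-byte case only.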

    
22.11.2016 / 02:52