What HTTP methods can a crawler crawl?


A conceptual question (or not):

Of the HTTP methods, which ones cannot be "crawled" - or interpreted - by a crawler?

  • POST
  • GET
  • PUT
  • PATCH
  • DELETE

Can anyone with knowledge of the subject answer us?

asked by anonymous 23.03.2016 / 21:19

2 answers


OPTIONS, GET, HEAD.

From the book "Cloud Standards: Agreements That Hold Together Clouds": "Web crawlers, for example, use only safe methods to avoid disturbing data on the sites they crawl."

Which makes perfect sense for the purpose of a crawler, if we think about it logically.
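In practice a crawler can stick to exactly those three methods: OPTIONS to ask which methods the server supports, HEAD to inspect a page cheaply, and GET to fetch it. A minimal sketch, assuming Python's requests library and the httpbin.org test service (my choices here, not something from the book):

import requests

url = "http://httpbin.org/get"

# OPTIONS: ask the server which methods it supports, without touching the resource.
allowed = requests.options(url).headers.get("Allow", "")
print("Allowed methods:", allowed)

# HEAD: retrieve only the headers, a cheap check before downloading anything.
head = requests.head(url)
print("Status:", head.status_code)

# GET: retrieve the page itself; a safe method, so repeating it disturbs nothing.
html = requests.get(url).text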

A great reference on the subject is link:

Idempotency and safety are important attributes of HTTP methods. An idempotent request can be made repeatedly with the same result as if it had been executed only once. If a user clicks a thumbnail of a picture and every click returns the same big cat picture, that HTTP request is idempotent. Non-idempotent requests can change each time they are called.

Safe requests are requests that do not alter the resource; non-safe requests have the ability to change a resource. For example, a user posting a comment is making a non-safe request, because the user is changing some resource on the web page; however, a user clicking the cat thumbnail is making a safe request, because clicking the picture does not change the resource on the server.

Production-safe crawlers treat certain methods, e.g. GET requests, as always safe and idempotent. Consequently, crawlers will send GET requests freely, without worrying about the effect of repeated requests or the possibility that a request might change the resource. However, safe crawlers will recognize other methods, e.g. POST requests, as non-idempotent and unsafe, so good web crawlers will not send POST requests.
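As an illustration of that last paragraph, here is a minimal sketch of what the fetch step of such a crawler could look like, refusing every method it does not consider safe (the fetch function and the safe-method set are hypothetical names of mine, using Python's requests):

import requests

SAFE_METHODS = {"GET", "HEAD"}  # what a polite crawler is willing to send

def fetch(url, method="GET"):
    # A good crawler never sends POST, PUT, PATCH or DELETE,
    # so refuse non-safe methods outright.
    if method.upper() not in SAFE_METHODS:
        raise ValueError("crawler refuses non-safe method: " + method)
    return requests.request(method, url, timeout=10)

page = fetch("http://httpbin.org/get")      # fine
# fetch("http://httpbin.org/post", "POST")  # raises ValueError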

RFC 2616 on Safe and Idempotent Methods: link

9.1.1 Safe Methods

Implementors should be aware that the software represents the user in their interactions over the Internet, and should be careful to allow the user to be aware of any actions they might take which may have an unexpected significance to themselves or others.

In particular, the convention has been established that the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval. These methods ought to be considered "safe". This allows user agents to represent other methods, such as POST, PUT and DELETE, in a special way, so that the user is made aware of the fact that a possibly unsafe action is being requested.

Naturally, it is not possible to ensure that the server does not generate side-effects as a result of performing a GET request; in fact, some dynamic resources consider that a feature. The important distinction here is that the user did not request the side-effects, so therefore cannot be held accountable for them.

9.1.2 Idempotent Methods

Methods can also have the property of "idempotence" in that (aside from error or expiration issues) the side-effects of N > 0 identical requests is the same as for a single request. The methods GET, HEAD, PUT and DELETE share this property. Also, the methods OPTIONS and TRACE SHOULD NOT have side effects, and so are inherently idempotent.

However, it is possible that a sequence of several requests is non-idempotent, even if all of the methods executed in that sequence are idempotent. (A sequence is idempotent if a single execution of the entire sequence always yields a result that is not changed by a reexecution of all, or part, of that sequence.) For example, a sequence is non-idempotent if its result depends on a value that is later modified in the same sequence.

A sequence that never has side effects is idempotent, by definition (provided that no concurrent operations are being executed on the same set of resources).
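The two sections above condense into a small lookup a crawler might consult. The groupings below follow the quoted RFC text; the code itself is only my sketch:

# From 9.1.1: GET and HEAD should take no action other than retrieval.
SAFE = {"GET", "HEAD"}

# From 9.1.2: GET, HEAD, PUT and DELETE are idempotent; OPTIONS and
# TRACE should have no side effects, so they are inherently idempotent.
IDEMPOTENT = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}

# POST is in neither set: each call may change state on the server.
def crawler_may_send(method):
    return method.upper() in SAFE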

    
23.03.2016 / 21:31

This is independent of the crawler; you can simulate any kind of request.

Curl

 curl --request POST 'http://www.somedomain.com/'
 curl --request DELETE 'http://www.somedomain.com/'
 curl --request PUT 'http://www.somedomain.com/'

source: Link

Python

>>> import requests
>>> r = requests.put("http://httpbin.org/put")
>>> r = requests.delete("http://httpbin.org/delete")
>>> r = requests.head("http://httpbin.org/get")
>>> r = requests.options("http://httpbin.org/get")

source: Link

Java

// Unirest builder calls for each HTTP verb; url is any target address.
String url = "http://www.somedomain.com/";
GetRequest getRequest = Unirest.get(url);
GetRequest headRequest = Unirest.head(url);
HttpRequestWithBody postRequest = Unirest.post(url);
HttpRequestWithBody putRequest = Unirest.put(url);
HttpRequestWithBody patchRequest = Unirest.patch(url);
HttpRequestWithBody optionsRequest = Unirest.options(url);
HttpRequestWithBody deleteRequest = Unirest.delete(url);

Lib: Link

    
23.03.2016 / 21:50