Is partial indexing of sites possible?

2

While browsing the internet and reading article after article, I came across some that are displayed only in half, or just their beginning; here is an example. In this case, only subscribers have full access.

My question is:

Is the site fully indexed by search engines, or is the page indexed only in half? Is the robots.txt file related to this?

How is this kind of delimitation of a website's content possible, deciding what the reader can and cannot read?

asked by anonymous 08.11.2017 / 03:05

2 answers

3

The question asks about two things: partial indexing of a site and partial indexing of a page.

Yes, you can use robots.txt or a <meta name="robots"> tag to indicate what you do not want indexed.
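
For illustration, this is the standard form of that meta tag; placed in a page's <head>, it asks compliant crawlers not to index the page nor follow its links:

<meta name="robots" content="noindex, nofollow">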

In fact, if the content is otherwise protected against access, you do not even need those. If the page is only accessible with a password, the content will not be indexed.

This is actually the only effective way to prevent indexing, since declaring that you do not want something indexed is just a convention. An indexer may choose not to respect it: Google respects it today, but could stop whenever it wants, and there are malicious crawlers that never do.

Obviously, only access control done on the server will be effective.
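
As a minimal sketch of what "control on the server" means, here is a Flask route that only ships the full text to authenticated subscribers; the /artigo route, the "subscriber" session flag and the texts are illustrative assumptions, not taken from the question's site:

# Minimal sketch of server-side access control (Flask).
# The route, the "subscriber" session flag and the texts are
# illustrative assumptions.
from flask import Flask, session

app = Flask(__name__)
app.secret_key = "change-me"  # required for sessions

FULL_TEXT = "the opening paragraph... and the restricted rest"
TEASER = FULL_TEXT[:25]  # only the opening is public

@app.route("/artigo")
def artigo():
    # The server decides what leaves it; an anonymous crawler
    # receives exactly what an anonymous human receives.
    if session.get("subscriber"):
        return FULL_TEXT
    return TEASER

if __name__ == "__main__":
    app.run()

Since the full text never leaves the server for anonymous requests, there is nothing for a crawler (or a curious reader) to extract.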

All of this is well known. I think the bigger question is the partial indexing of a page.

This is usually done by detecting that the client requesting the page is an indexer: when the request comes from a known indexer, the site generates a different page, so the indexer receives the whole content and can index all of it, while a normal client receives only the page's opening. Obviously it is possible to deceive the site by claiming to be the indexer. This can also lead to indexing penalties if the search engine identifies the maneuver. And of course, it will always be possible to reach the content through the indexer's cache.
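
A minimal sketch of that detection, assuming the site matches the User-Agent header against known bot names (real sites verify crawlers more carefully, e.g. via reverse DNS, precisely because this header is trivial to forge):

# Sketch of serving full content only to known indexers.
# Bot names, route and texts are illustrative assumptions; note that
# this is cloaking, which search engines may penalize when detected.
from flask import Flask, request

app = Flask(__name__)

KNOWN_BOTS = ("Googlebot", "Bingbot", "DuckDuckBot")
FULL_TEXT = "the opening paragraph... and the restricted rest"
TEASER = FULL_TEXT[:25]

@app.route("/artigo")
def artigo():
    user_agent = request.headers.get("User-Agent", "")
    if any(bot in user_agent for bot in KNOWN_BOTS):
        return FULL_TEXT  # the indexer sees (and indexes) everything
    return TEASER  # a normal visitor sees only the cover

The deception mentioned above is then one header away: a request such as curl -A "Googlebot" https://example.com/artigo (a hypothetical URL) would receive the full text.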

Obviously you can also send all the content and hide part of it via JavaScript. This protects nothing, it only pretends to, since the content is all there in the page. It may be enough to stop a layperson, but it is not protection.
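
A quick sketch of why it protects nothing; the URL and the marker string below are hypothetical, but any page that merely hides text client-side still ships that text in the raw HTML:

# Sketch: text "hidden" by JavaScript/CSS still travels in the response.
# The URL and the search string are hypothetical.
import requests

html = requests.get("https://example.com/artigo").text
# No JavaScript runs here, yet the "hidden" text is in the source:
print("full text present?", "restricted paragraph" in html)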

It is worth addressing the myth that crawlers run JavaScript. Some do, but not all, and they cannot simulate user actions the way a real user performs them. So do not count on indexing if the content depends on interaction with the user, or on anything else the indexer is not able to do; and new things the indexer cannot simulate appear all the time. Script exists on a page precisely to define non-standard flows, and that by definition makes it impossible, in practice, to simulate everything that might occur.

If you want to protect the content, only controlling it on the server will do. And obviously that only controls the display; it does not prevent someone from copying the content and posting it elsewhere, even automatically. It is worth making this clear, because some people think Santa Claus exists.

08.11.2017 / 11:28
2

Taking Googlebot as an example...

According to this article on Kissmetrics, the Google crawler indexes the entire page, including title, description, the alt attribute of images, and all the content.

According to this other article, on Search Engine Land, Googlebot is also capable of processing JavaScript in order to index dynamically DOM-inserted content. Yet another article shows exhaustive indexing tests for pages that use the most popular JavaScript frameworks and libraries (spoiler: it looks like it still does not handle AngularJS v2, Google's own framework, very well).

Because indexing depends on the crawler being able to reach the page, a link to it must exist somewhere on the internet, or its indexing must be explicitly requested through Google's webmaster tools.

Conversely, if a web crawler is able to reach restricted content to index it, then a human is able to as well, and the content is no longer restricted. For a restricted-content area to be effective, there should be no links into it that are not barred by some kind of authentication.

The robots.txt file is a map of what the webmaster does and does not want crawlers to index. For example:

User-agent: *
Disallow: /restrito/

Crawlers from the large search companies tend to obey these guidelines, but remember that not everyone respects the rules and good-neighbor policies. If the content is restricted, display it only behind authentication.

Therefore, Googlebot seems to be able to index all the content of a page, including content "revealed" dynamically via JavaScript, but not what sits behind an authentication check, as in your example of a subscribers-only area. If the webmaster of that article wants it to be indexed under terms that appear only in the restricted content, they should copy those terms in some form to the public part of the page, for example as keywords inside a <meta name="description"> tag.
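
For illustration, such a tag could look like this (the content text is hypothetical):

<!-- in the page's <head> -->
<meta name="description" content="article summary repeating the key terms of the restricted part">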

So, finally answering your question: yes, it is possible for sites to be indexed "in half", since crawlers are only able to index what they can access. Restricted areas are not indexed.

08.11.2017 / 03:39