Questions tagged as 'web-crawler'

2
answers

How not to allow indexing by search engines?

A few days ago I submitted my domain to Google and it crawled my web site and my system. I would like my system to be hidden from Google and from any other search engine. How can I do this? And how do I undo indexing that has already been done...
asked by 20.02.2015 / 15:20
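For reference, a minimal sketch of the two mechanisms usually suggested for this kind of question (the /system/ path is a placeholder, not taken from the question): a robots.txt rule to stop crawling, and a robots meta tag to stop indexing.

    # robots.txt at the site root: asks well-behaved crawlers not to crawl this path
    User-agent: *
    Disallow: /system/

    <!-- on each page that must stay out of the index -->
    <meta name="robots" content="noindex, nofollow">

Note that a URL blocked only in robots.txt can still appear in results if other sites link to it, because the crawler never gets to see the noindex tag; for content that really must stay out of the index, prefer noindex (or authentication) and let the page be crawled.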
3
answers

How does semantics / indexing work with AngularJS?

I have always wondered about this. AngularJS is a framework that is used all the time.

But I have a question about how it works with crawlers (Googlebot, for example).

Do they actually run the JavaScript and interpret the code to get the information and show the site developed on the platform?

  

With Angular, the HTML theoretically does not have the information "yet"; the controllers and so on need to run first.

The question is: How does semantics / indexing work with Angular?

    
Answer:

According to this post, Google's crawler renders pages that use JavaScript and navigates through the listed states.

Interesting parts of the post (free translation):

  

[...] we decided to try to interpret pages by running JavaScript. It is hard to do this at scale, but we decided it was worth it. [...] In recent months, our indexing system has been rendering a large number of web pages the way a regular user would see them with JavaScript enabled.

     

If resources such as JavaScript or CSS in separate files are blocked (say, with robots.txt) so that Googlebot cannot retrieve them, our indexing system will not be able to see your site the way a regular user does.

     

We recommend allowing Googlebot to retrieve your JavaScript and CSS so that your content can be better indexed.

Recommendations for Ajax / JS can be found at this link.
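As an illustration of that recommendation, a small robots.txt sketch that blocks a private area but leaves script and style assets fetchable (the paths are made up):

    User-agent: Googlebot
    Disallow: /admin/
    # do not block the assets the page needs in order to render
    Allow: /assets/*.js
    Allow: /assets/*.css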

If you want to serve the content of an Angular application to crawlers that do not have this kind of capability, you need to pre-render the content. Services such as Prerender.io exist for exactly this purpose.
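A minimal sketch of that idea, assuming a Python/Flask front controller and a hypothetical local prerender endpoint (none of the names below come from the answer): known bot user agents get a pre-rendered snapshot, while everyone else gets the normal Angular app shell.

    # sketch: serve pre-rendered HTML to crawlers, the regular SPA to everyone else
    # (assumes Flask and the `requests` library; PRERENDER_URL is a placeholder)
    import requests
    from flask import Flask, request, send_file

    app = Flask(__name__)
    BOT_AGENTS = ("googlebot", "bingbot", "yandex", "baiduspider")
    PRERENDER_URL = "http://localhost:3000/render"  # hypothetical prerender endpoint

    @app.route("/", defaults={"path": ""})
    @app.route("/<path:path>")
    def index(path):
        agent = request.headers.get("User-Agent", "").lower()
        if any(bot in agent for bot in BOT_AGENTS):
            # ask the prerender service for a fully rendered snapshot of this URL
            snapshot = requests.get(PRERENDER_URL, params={"url": request.url}, timeout=10)
            return snapshot.text, snapshot.status_code
        # regular users get the Angular single-page app shell
        return send_file("static/index.html")

In practice this check is usually done in the web server or CDN rather than in application code, but the logic is the same.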

    
Answer:

Crawlers (Googlebot, for example) read the page as plain text: they first look at the meta tags, then at the comments, and then they strip out all the code and read the remaining text. The reason is to increase processing speed and to reduce errors caused by fields that are hidden or by nodes that are removed during execution. Crawlers do not run any kind of browser technology; they only read the file. Angular is JavaScript like any other, so its elements are ignored. Only the items relevant to SEO (optimization) are taken into account during indexing.

You can find part of this explanation in the Google article Understanding Web Pages Better.

To better understand what this plain-text view looks like, request the page in question with cURL or Lynx, which are tools commonly used by crawlers.
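For example (example.com stands in for the page in question):

    # raw HTML as a crawler would fetch it (optionally identifying as Googlebot)
    curl -A "Googlebot/2.1 (+http://www.google.com/bot.html)" https://example.com/

    # the page dumped as plain text, roughly what a text-only reader sees
    lynx -dump https://example.com/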

For better indexing, we recommend creating a robots.txt file and XML sitemaps.

    
Answer:

One tip I can give you: take the course they offer. It is quick and easy, and you will get a better understanding of how semantics works:

link

    

I have always wondered about this. AngularJS is a framework that is used all the time. But I have a question about how it works with crawlers (Googlebot, for example). Do they actually run the JavaScript and interpret the code to get the information and show th...
asked by 29.06.2015 / 14:43
1
answer

Protect web pages from automated access

How can I protect my web pages from being accessed in an automated way? By search engine bots like Googlebot (I believe the basic approach is the meta tag with noindex and nofollow). By headless browsers (browsers without a graphical interface and...
asked by 20.05.2015 / 13:57
3
answers

Simples Nacional query (by CNPJ)

I'm trying to implement a Simples Nacional query; it works much like the CNPJ query on the Receita Federal site. Details I've seen so far: after the page loads, it runs an ajax call (file captcha2.js) that returns 3 ite...
asked by 21.05.2015 / 22:42
1
answer

Need for server-side rendering of JavaScript content - AngularJs

Given that, as of this year, the Google crawler runs JavaScript, and considering the indexing of content displayed using AngularJs, is there still a need for a server-side rendered version of the same content for SEO? Plus: Pr...
asked by 05.08.2014 / 14:11
2
answers

Conflict between Simple_HTML_Dom and non-object-oriented functions

I'm developing an app that has to access a list of sites saved in a database, loading all of their links. It is a test application, but I have run into a difficulty. The routine is this one: function crawler() { include_once './simple_htm...
asked by 08.05.2014 / 11:38
1
answer

Does carousel content affect SEO? Is hidden carousel content indexed?

I have a question about carousels and how their contents are indexed by crawlers. First, I believe most carousels are not very friendly from the point of view of accessibility. That alone could already hurt the indexing of the conte...
asked by 03.01.2019 / 13:38
2
answers

How to calculate an optimal value for the Scrapyd variable CONCURRENT_REQUESTS?

One of the settings that ships with Scrapyd is the number of concurrent requests (the default is 16): CONCURRENT_REQUESTS = 16. What would be the best methodology for calculating an optimal value for this setting? The goal is to get the b...
asked by 09.01.2015 / 21:01
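A hedged starting point rather than a definitive answer: these are standard Scrapy settings, but the numbers below are only illustrative. A rough rule of thumb is concurrency ≈ desired requests per second × average response time in seconds, and AutoThrottle can then adjust the value empirically.

    # settings.py - illustrative values, not a recommendation from the question
    CONCURRENT_REQUESTS = 32            # e.g. ~10 req/s at ~3 s average response time
    CONCURRENT_REQUESTS_PER_DOMAIN = 8  # keep the per-site load polite
    DOWNLOAD_DELAY = 0.25               # small fixed delay between requests to one site

    # let Scrapy tune concurrency from observed latencies instead of a fixed guess
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 1.0
    AUTOTHROTTLE_TARGET_CONCURRENCY = 8.0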
1
answer

How to track the execution and failures of Spiders?

I'm developing a module to get information about the spiders running on the company's system. Below is the model where we save the start of each operation and the job. I would like to validate whether the jobs finished correctly and fill in the rest of the...
asked by 12.01.2015 / 17:48
1
answer

Specifying to search engines that an HTML document was updated?

According to MDN, using the <time></time> tag with the datetime attribute, <time datetime="yyy-mm-dd hh:mm:ss"></time>, allows search engines to know the date the document was created, and this information is then displayed in the rich snippet of search results.

Is it possible to indicate the update date for it?
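For reference, a minimal sketch of the markup the question describes (the dates are made up):

    <article>
      <p>Published on <time datetime="2018-09-27">September 27, 2018</time>.</p>
      <p>Updated on <time datetime="2018-10-05">October 5, 2018</time>.</p>
    </article>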

    
Answer:

You can build your sitemap.xml using the changefreq tag to tell Google that you want your content to be re-indexed hourly, daily, or weekly, for example.

Here you can see the complete recommended protocol for building your sitemap; notice that you can indicate how regularly your content changes: link

The frequency with which the page changes. This value provides general information for search engines and may not match the frequency of page indexing. Valid values are:

  • always
  • hourly
  • daily
  • weekly
  • monthly
  • yearly
  • never

The "always" value should be used to describe documents that change every time they are accessed. The "never" value should be used to describe archived URLs.
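A minimal sitemap sketch using these tags (the URL and dates are placeholders); the protocol's lastmod tag is its way of stating when a page last changed, which is also relevant to the update-date part of the question:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/contacts</loc>
        <lastmod>2018-09-27</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>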

Note that the value of this tag is considered a hint and not a command.

Even so, you cannot be totally sure that Google will take this tag into account when re-indexing your content (hourly or weekly, for example):

"If the site's pages are properly linked, normal web crawlers can usually discover most of your site," but "using a sitemap does not guarantee that all the items in it will be crawled and indexed, because Google's processes rely on complex algorithms to schedule crawling. However, in most cases the sitemap benefits the site, and you will never be penalized for using it."

Source: link

Beyond that, you cannot completely control Googlebot; in an urgent case you can manually request the re-indexing of a URL. For example, if you make a security update to the contacts page, you can ask Google to re-index that page. Here you can learn more about this: link

To manually add a URL (also search for "Fetch as Google"): link

To index via the Search Console:

You can still request re-indexing of the entire site through the Search Console!

    

According to MDN, using the <time></time> tag with the datetime attribute, <time datetime="yyy-mm-dd hh:mm:ss"></time>, allows search engines to know the date the document was created, and then displays t...
asked by 27.09.2018 / 01:38