How to handle common crawling problems
Is there any content on your site that you specifically don’t want to be crawled by search engines? If that’s the case, you can prevent search engine crawlers from accessing the page (or pages). A file called robots.txt holds the key to making this happen, and using robots.txt is considerably simpler than most people realize.
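As an illustration, a minimal robots.txt placed at the root of your domain might look like the sketch below; the /private/ directory is just a placeholder, not a path any crawler treats specially:

```
# Ask all crawlers to skip a hypothetical /private/ directory,
# while leaving the rest of the site open
User-agent: *
Disallow: /private/
Allow: /
```

Each `User-agent` block applies to the named crawler (`*` means all of them), and `Disallow`/`Allow` rules are matched against URL paths.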
But as with most technical aspects of SEO, things can go astray. A crawl of a website may become stuck on the first level of URLs and refuse to go any farther. When this occurs, only the base domain, also known as the “Start URLs,” gets scanned.
There could be multiple reasons for this and other crawling problems, as well as a wide range of solutions. This article explains everything you need to know about handling the most common crawling problems. Let’s get started!
Common crawling problems
Robots.txt is used by a lot of websites with no issues at all. However, because it’s so widely used, there are a number of common problems that can occur if all the pieces don’t align correctly.
Some regular crawling issues include:
- Only 1 disallowed URL is returned
- Only 1 indexable URL with a 200 status code is returned
- The URL has trouble connecting
- 0 URLs are retrieved
Solutions to common crawling problems
If you run into a problem, chances are there’s a simple solution that can get your crawling back on track. Try these steps first:
- Review your robots.txt document for errors.
- Verify that the first URL (the Start URL) actually contains links to other pages.
- Check whether the server restricts access based on user agent or IP address.
- Check the project settings for an incorrect base domain.
- Check the advanced settings: included URLs restriction.
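When reviewing your robots.txt for errors, it can help to test the rules locally before deploying. The sketch below uses Python’s standard-library `urllib.robotparser`; the rules and paths are hypothetical, not taken from any real site:

```python
# Sketch: validate robots.txt rules locally with Python's built-in
# urllib.robotparser. Rules and paths below are illustrative placeholders.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A disallowed path should be reported as not fetchable
print(parser.can_fetch("*", "/private/report.html"))  # False
# Any other path should be fetchable
print(parser.can_fetch("*", "/blog/post.html"))       # True
```

Running checks like these against every rule you add is a quick way to catch a typo in a `Disallow` line before it blocks pages you wanted crawled.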
It’s good to know how to fix crawling issues when they occur, but it’s even better when you can avoid them altogether. Keep reading for an overview of how to use robots.txt correctly.
How to properly use the robots.txt file to optimize web crawling
A robots.txt file instructs web crawlers and search engines to bypass a website, a web page, or a set of pages within a domain. Reputable search engine crawlers respect this file’s directives and avoid the listed pages, but compliance is voluntary: robots.txt cannot force a crawler to stay away.
Using Google’s services, you can quickly create a robots.txt file. From there, you can use Google Search Console (formerly Webmaster Tools) to see exactly which URLs have been blocked. Other search engines provide similar tools and comply with the robots.txt file.
You can restrict access to specific web pages either by placing a robots.txt file at the root of your domain or by using the robots meta tag.
Using the noindex meta tag and x-robots-tag
If there is any element on a page you do not want to be indexed in any way, shape, or form, your best bet is to utilize either the noindex meta tag or the x-robots-tag, particularly when it comes to the web crawlers used by Google.
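For reference, the two mechanisms look like this. In the page’s `<head>`:

```html
<!-- Ask crawlers not to index this page -->
<meta name="robots" content="noindex">
```

Or as an HTTP response header, which is useful for non-HTML files such as PDFs that have no `<head>` to put a meta tag in:

```
X-Robots-Tag: noindex
```

Note that a crawler must be able to fetch the page to see either directive, so don’t block the same URL in robots.txt if you want the noindex to take effect.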
It should be noted that it’s possible certain content will still be indexed. If there are external links to the page on other websites, search engines may continue to index the page’s URL even when crawlers can no longer access its content directly.
Using robots.txt with password security
Spammers may use unethical search engine optimization tactics to get around the robots.txt file. If sensitive data is on a page, the best course of action is to combine the robots.txt file with password protection.
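As one example, password security for a directory can be set up with HTTP Basic Auth. The Apache `.htaccess` snippet below is a sketch, and the `.htpasswd` path is a placeholder you would replace with a real path on your server:

```
# Require a username and password for everything in this directory
AuthType Basic
AuthName "Restricted area"
AuthUserFile /full/path/to/.htpasswd
Require valid-user
```

Unlike robots.txt, which a misbehaving crawler can simply ignore, a password-protected page cannot be fetched at all without credentials.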
The robots.txt file is a simple and effective way to keep some content out of search engines, and it’s also quite easy to implement. Undoing robots.txt is just as simple: delete the file (or the relevant rule), and the webpage becomes accessible again to web crawlers for indexing.