Most WordPress site owners have heard of the robots.txt file, but not everyone knows what it does or whether they need one. What is this lightweight text file, and why should you care? Is it still necessary for modern websites? This beginner’s guide explains everything in a way that even a complete novice can understand.
Why Is It Called a Robots File?
Search engine bots, or spiders, continually crawl websites to look for new or updated content. Googlebot is the best known, but all search engines work similarly. The robots.txt file follows something called the Robots Exclusion Standard, which is simply the convention by which websites communicate with obedient web bots and crawlers.
A robots.txt file is not foolproof, as less obedient bots, such as email scrapers and malware, simply ignore it. It’s also publicly visible to anyone who looks for it. Despite that, this text file is an invaluable asset to many sites and blogs.
The ‘Do as I Ask’ File
The job of this tiny editable file is to control how web bots interact with your site’s file paths. What compliant bots crawl, and what they skip, depends on the rules in your robots.txt. That makes it an incredibly powerful yet simple tool in your Search Engine Optimisation (SEO) toolkit.
Summed up, the two primary uses of the robots.txt file look like this:
- Tells obedient bots which pages, files, or folders to crawl and index
- Tells compliant bots which pages, files, or folders NOT to crawl or index
That’s why the robots.txt is the first file a search engine bot looks for when it arrives at a website.
Robots.txt Syntax and Rules
A robots.txt file is not fixed, meaning you can open and edit it to control the rules. The language used is robots.txt syntax. It’s easy to read, but it must be exact to work. Most webmasters copy the syntax they need and paste it into the file to save time and avoid typos.
Common Allow/Disallow rules include:
- Disallow bots from crawling a directory and all its contents
- Disallow bots from crawling a single web page
- Disallow crawling of a specific file type
- Disallow crawling of an entire website
- Allow access only to a single named crawler
- Allow site access to all but a single web crawler
- Block access to a specific image
- Block all images on a site from image search results
There are others, but you get the idea.
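A few of the less obvious rules can be sketched in robots.txt syntax. Each of the three blocks below is a separate example, not one combined file; the paths are placeholders, and the * and $ wildcards in path patterns are extensions honoured by Google and Bing rather than part of the original standard:

```
# Disallow crawling of a specific file type (here, PDFs)
User-agent: *
Disallow: /*.pdf$

# Allow access only to a single named crawler (here, Googlebot)
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /

# Block all images on the site from image search results
User-agent: Googlebot-Image
Disallow: /
```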
How to Read the Robots.txt Syntax
Your robots.txt file contains at least one block of directives (guidelines) to instruct web crawlers. Each block begins with a ‘User-agent’ line, which names a specific bot or spider. A single block can also address all search engine bots using the * wildcard symbol.
Here’s what commonly used robots.txt blocks look like:
Allow all search engines full access:
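In syntax form, the ‘allow everything’ block is just a user-agent line and an empty Disallow rule:

```
User-agent: *
Disallow:
```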
Adding a forward slash / after Disallow blocks all search engines from the entire site:
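With the slash in place, the block reads:

```
User-agent: *
Disallow: /
```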
Block access to a single folder (replace /folder/ with the actual name):
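The rule names the folder’s path, so a block for a hypothetical /folder/ directory looks like this:

```
User-agent: *
Disallow: /folder/
```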
For example, you could have photos on your site that you don’t want the search engines to index. User-agent: * tells all bots not to visit the named folder.
Block access to a single file (replace ‘file’ with the actual name):
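Assuming the file lives in the site root, the block would look like this; the name and extension are placeholders:

```
User-agent: *
Disallow: /file.html
```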
Block access to a single image (replace ‘image’ with the actual name):
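One way to do this is to address the image crawler directly. Googlebot-Image is Google’s real image user-agent, while the path here is a placeholder:

```
User-agent: Googlebot-Image
Disallow: /images/image.jpg
```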
It’s vital that you use the robots.txt file correctly and don’t block or allow access to stuff by accident. Various online validators let you check your file for errors. It’s advisable to at least submit any new changes to your file to Google’s robots.txt Tester.
Common search engine user-agents
Below is a list of the user-agents most used in robots.txt files:

- Googlebot (Google web search)
- Googlebot-Image (Google image search)
- Googlebot-Video (Google video search)
- Bingbot (Microsoft Bing)
- Slurp (Yahoo!)
- DuckDuckBot (DuckDuckGo)
- Baiduspider (Baidu)
- YandexBot (Yandex)
Reasons to Disallow Search Engine Bots
The larger the site, the more time it takes to crawl. Googlebot, and the other bots, work to a crawl budget. If the files on a website exceed that budget, the bot moves on and resumes crawling from where it left off on its next visit. The way to ease this issue is to stop bots from crawling unnecessary files, which speeds up the indexing of pages that matter.
The problem is that bots crawl everything unless they’re instructed otherwise. And there are many site files on larger projects that don’t need crawling. Typical file exclusions should include theme folders, plugin files, admin pages, and others. Also, you may have private pages on your site that you don’t want to appear in a web search. You can disallow access to those too.
Here’s what a typical robots.txt file might look like:
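The sketch below uses example.com as a placeholder domain, and the sitemap file names assume the pattern the Yoast SEO plugin generates:

```
User-agent: *
# Images live in the uploads folder, so allow it explicitly
Allow: /wp-content/uploads/
# Keep plugin files out of the index
Disallow: /wp-content/plugins/
# Keep bots out of the admin area
Disallow: /wp-admin/
# Skip this particular readme file
Disallow: /readme.html
# Skip any link that includes /refer/
Disallow: /refer/

# Full XML sitemap URLs for posts and pages
Sitemap: https://example.com/post-sitemap.xml
Sitemap: https://example.com/page-sitemap.xml
```

Everything not explicitly disallowed is crawled by default, which is how the file tells bots to index all of the site’s content.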
The above robots.txt file gives visiting bots six clear instructions:
- Index ALL WordPress content files
- Index ALL WordPress images
- Don’t index (Disallow) WordPress plugin files
- Disallow access to the WP admin area
- Disallow access to this particular WP readme file
- Disallow access to links that include /refer/
The last two lines provide the full XML sitemap URLs for posts and pages.
What Should You Include in Your Robots.txt File?
Search engines are better than ever at indexing sites. When it comes to WordPress, Google actually needs access to folders that a lot of webmasters block. For this reason, I'd highly recommend you check out this post on the Yoast SEO site for best practices with robots.txt files.
How to Create a New Robots.txt File
You can create a new robots.txt file in WordPress if it’s missing. There are two ways to achieve this. One is to use the popular Yoast SEO plugin, and the other is the manual approach. Skip to the second method if you don’t have, and don’t plan to install, the Yoast plugin.
#1 Create a robots.txt using the Yoast SEO plugin
Log in to WP Dashboard and go to SEO -> Tools from the side menu.
From the Tools screen, click the File Editor link.
Click the Create robots.txt file button.
The Yoast SEO robots.txt file generator adds some basic rules to the new file. Replace these with yours if they disagree with what you need. If you’re unsure, use the rules mentioned in the above section, ‘Reasons to Disallow Search Engine Bots.’
Click the Save changes to robots.txt button when you’re done.
#2 Create and upload a robots.txt using FTP
To create a robots.txt file, open Notepad, enter your rules, and save the file as robots.txt. Then upload it to your website’s root directory (main folder) using any FTP software. Consider the free FileZilla program if you don’t have one. There’s a section in my article “Using FTP to Install WordPress Themes” if you need help setting up FileZilla.
If you ever need to delete or add rules in the robots.txt, make changes to the local copy. You then re-upload the modified file to overwrite the one on the server.
Whichever method you use, remember to test the file with an online tester straight after. They all do a good job, but most WordPress webmasters prefer to use Google’s Search Console.
You now know what a robots.txt file is and why it exists.
It’s a simple yet powerful tool that gives you more control over your SEO strategy. A well-optimised file is vital for larger sites because it avoids wasting crawl budget. Moreover, you can block access to areas of the site that you don’t want to appear in search results.