The `robots.txt` file, defined by the Robots Exclusion Protocol (also known as the Robots Exclusion Standard), is a simple text file that websites use to communicate with web crawlers and other web robots. It tells these crawlers which pages on the site should not be processed or scanned. This is primarily used to manage crawler traffic, avoid overloading the server, and keep certain sections of the site from being crawled.
How it Works
- Location: The `robots.txt` file must be placed in the root directory of the website (a small helper for deriving this location is sketched after this list).
  - For example, for the website `example.com`, the file should be located at `https://example.com/robots.txt`.
- Syntax: The file uses a specific syntax to define rules for web crawlers. The primary components are:
  - `User-agent`: Specifies the web crawler the rules apply to.
  - `Disallow`: Tells the crawler not to access certain parts of the site.
  - `Allow`: (Optional) Overrides a `Disallow` rule for specific URLs.
  - `Crawl-delay`: (Optional) Specifies the delay, in seconds, between successive requests to the server.
  - `Sitemap`: (Optional) Provides the location of the website's XML sitemap.
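Because the file must sit at the site root, a crawler can derive its location from any page URL on the site. The helper below is a minimal sketch of that derivation using Python's standard library; the function name `robots_url_for` and the sample URL are illustrative, not part of any standard API.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url_for(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site serving page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt lives at the root, never in a subdirectory.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url_for("https://example.com/blog/post.html?id=7"))
# -> https://example.com/robots.txt
```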
Example robots.txt File
Here is an example of a `robots.txt` file with some common commands:

```
User-agent: *
Disallow: /private/
Disallow: /tmp/
Allow: /public/
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
```
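To see how these rules behave, the sketch below parses the example above with Python's standard-library `urllib.robotparser` module and checks a few URLs. The crawler name `MyCrawler` and the specific paths queried are illustrative assumptions, not part of the example file.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt shown above, embedded as a string for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
Allow: /public/
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "MyCrawler" is a hypothetical crawler name; it falls under the "*" group.
print(parser.can_fetch("MyCrawler", "https://example.com/private/data.html"))  # False
print(parser.can_fetch("MyCrawler", "https://example.com/public/index.html"))  # True
print(parser.crawl_delay("MyCrawler"))                                         # 10
```

Note that `crawl_delay()` returns the value from the group that applies to the given user agent, here the `*` group.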
Commands and Their Usage
- `User-agent`
  - Specifies the web crawler to which the rules apply. An asterisk (`*`) indicates that the rules apply to all web crawlers.
  - Example: `User-agent: Googlebot`
- `Disallow`
  - Blocks access to the specified directories or pages.
  - Example: `Disallow: /private/`
- `Allow`
  - Grants access to specific directories or pages, even if a broader `Disallow` rule exists.
  - Example: `Allow: /public/`
- `Crawl-delay`
  - Sets a delay (in seconds) between successive requests to the server by the crawler.
  - Example: `Crawl-delay: 10`
- `Sitemap`
  - Specifies the location of the XML sitemap.
  - Example: `Sitemap: https://example.com/sitemap.xml`
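The sketch below shows how a polite crawler might tie these directives together: it fetches a site's `robots.txt`, skips disallowed URLs, and honours `Crawl-delay` between requests. The crawler name `MyCrawler`, the `example.com` URLs, and the one-second fallback delay are assumptions for illustration, not prescribed values.

```python
import time
import urllib.error
import urllib.request
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyCrawler"  # hypothetical crawler name

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse the site's robots.txt

# Fall back to a 1-second delay if no Crawl-delay is declared for this agent.
delay = parser.crawl_delay(USER_AGENT) or 1

for url in ["https://example.com/public/a.html", "https://example.com/private/b.html"]:
    if not parser.can_fetch(USER_AGENT, url):
        print("skipping (disallowed):", url)
        continue
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request) as response:
            print("fetched:", url, response.status)
    except urllib.error.HTTPError as err:
        print("error fetching:", url, err.code)
    time.sleep(delay)  # wait between successive requests, as Crawl-delay asks
```

On Python 3.8 and later, the `Sitemap` URLs declared in the file are also available programmatically via `parser.site_maps()`.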
Detailed Example
Consider a website with the following `robots.txt` file:

```
User-agent: *
Disallow: /admin/
Disallow: /login/
Allow: /public/
Crawl-delay: 5

User-agent: Googlebot
Disallow: /no-google/
Allow: /google-allowed/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
```
- Rules for All Crawlers (`*`):
  - Do not access the `/admin/` and `/login/` directories.
  - Allow access to the `/public/` directory.
  - Wait 5 seconds between requests.
- Rules for Googlebot:
  - Do not access the `/no-google/` directory.
  - Allow access to the `/google-allowed/` directory.
  - Wait 10 seconds between requests.
  - Note that crawlers follow only the most specific group that matches their user agent, so Googlebot obeys this group rather than the `*` group above (demonstrated in the sketch below).
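This group-selection behaviour can be checked with `urllib.robotparser`, which likewise applies only the group that matches a given user agent. `OtherBot` below is a hypothetical crawler with no group of its own.

```python
from urllib.robotparser import RobotFileParser

# The detailed example shown above, embedded as a string for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /login/
Allow: /public/
Crawl-delay: 5

User-agent: Googlebot
Disallow: /no-google/
Allow: /google-allowed/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler without its own group falls back to the "*" rules.
print(parser.can_fetch("OtherBot", "https://example.com/admin/"))       # False
print(parser.crawl_delay("OtherBot"))                                   # 5

# Googlebot matches its own group, so only those rules apply to it.
print(parser.can_fetch("Googlebot", "https://example.com/no-google/"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/admin/"))      # True (not blocked in its group)
print(parser.crawl_delay("Googlebot"))                                  # 10
```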
Tips for Using robots.txt
- Case Sensitivity: Paths in the `robots.txt` file are case-sensitive, so be careful with directory and file names.
- Testing: Use tools like Google Search Console to test your `robots.txt` file and ensure it is working as expected (a quick local check is also sketched below).
- Privacy: Remember that `robots.txt` is publicly accessible and only advisory, so don't use it to hide sensitive information.
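As a complement to Google Search Console, a draft `robots.txt` can be sanity-checked locally before it is deployed. The sketch below is one possible approach using `urllib.robotparser`; the draft rules and the expected allow/deny pairs are illustrative assumptions, not a standard test suite.

```python
from urllib.robotparser import RobotFileParser

# Draft rules to verify before uploading them to the site root.
DRAFT = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""

# (url, expected_allowed) pairs the draft should satisfy.
EXPECTATIONS = [
    ("https://example.com/private/report.pdf", False),
    ("https://example.com/public/about.html", True),
    ("https://example.com/", True),
]

parser = RobotFileParser()
parser.parse(DRAFT.splitlines())

for url, expected in EXPECTATIONS:
    allowed = parser.can_fetch("MyCrawler", url)
    status = "ok" if allowed == expected else "UNEXPECTED"
    print(f"{status}: can_fetch={allowed} expected={expected} url={url}")
```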
By properly configuring the `robots.txt` file, website administrators can control which parts of their site are accessible to compliant web crawlers, helping to manage server load and keep crawler traffic focused on the content that matters.