Technical Deep Dives
Robots.txt Configuration

The robots.txt file tells search engine crawlers which pages or sections of your site they can or cannot access. It's a text file placed in your site's root directory that provides crawling instructions.

How Robots.txt Works

1. Crawler arrives - a bot visits your site.
2. Checks robots.txt - it fetches /robots.txt before anything else.
3. Follows the rules - it crawls only the pages the rules allow.
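In code, steps 2 and 3 amount to parsing the fetched robots.txt and consulting it before each request. A minimal sketch using Python's stdlib urllib.robotparser (the function name and rules are illustrative, and step 1's network fetch is assumed to have already happened):

```python
from urllib.robotparser import RobotFileParser

def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Steps 2-3: parse the already-fetched robots.txt, then ask
    whether this user agent may crawl the given URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Hypothetical rules: block /private/ for every crawler
rules = "User-agent: *\nDisallow: /private/\n"
print(may_crawl(rules, "MyBot", "https://example.com/private/report"))  # False
print(may_crawl(rules, "MyBot", "https://example.com/index.html"))      # True
```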

Basic Syntax

# Comment - ignored by crawlers

User-agent: *
Disallow: /private/
Allow: /private/public-page.html

User-agent: Googlebot
Disallow: /no-google/

Sitemap: https://example.com/sitemap.xml
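These rules can be checked with Python's stdlib urllib.robotparser. One caveat in this sketch: the stdlib parser applies the first matching rule in file order, so the Allow line is listed before the Disallow it overrides; Google instead picks the most specific (longest) matching rule regardless of order.

```python
from urllib.robotparser import RobotFileParser

# The basic-syntax rules above, with Allow listed first for the
# stdlib parser's first-match semantics
rules = """\
User-agent: *
Allow: /private/public-page.html
Disallow: /private/

User-agent: Googlebot
Disallow: /no-google/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "/private/secret.html"))          # False
print(rp.can_fetch("*", "/private/public-page.html"))     # True
print(rp.can_fetch("Googlebot", "/no-google/page.html"))  # False
```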

Robots.txt Directives

| Directive | Purpose | Example |
|---|---|---|
| User-agent | Specifies which crawler the rules apply to | User-agent: Googlebot |
| Disallow | Blocks access to the specified path | Disallow: /admin/ |
| Allow | Permits access (overrides Disallow) | Allow: /admin/public/ |
| Sitemap | Location of the XML sitemap | Sitemap: https://... |
| Crawl-delay | Seconds between requests (not supported by Google) | Crawl-delay: 10 |

Common User Agents

  • * - All crawlers
  • Googlebot - Google's main crawler
  • Googlebot-Image - Google Images
  • Googlebot-News - Google News
  • Bingbot - Microsoft Bing
  • Slurp - Yahoo
  • DuckDuckBot - DuckDuckGo
  • Baiduspider - Baidu

Pattern Matching

| Pattern | Matches | Example |
|---|---|---|
| * | Any sequence of characters | Disallow: /*.php |
| $ | End of the URL | Disallow: /*.php$ |
| / | Root or path separator | Disallow: /folder/ |
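This wildcard matching can be approximated by translating a rule's path pattern into a regular expression. The helper below is an illustrative sketch, not a complete robots.txt matcher; the function name rule_matches is invented here:

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Translate a robots.txt path pattern into a regex:
    '*' matches any character sequence, and a trailing '$'
    anchors the pattern to the end of the URL."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    regex = "^" + regex + ("$" if anchored else "")
    return re.match(regex, path) is not None

print(rule_matches("/*.php$", "/index.php"))          # True
print(rule_matches("/*.php$", "/index.php?x=1"))      # False
print(rule_matches("/folder/", "/folder/page.html"))  # True
```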

Common Robots.txt Examples

Block Everything

User-agent: *
Disallow: /

Allow Everything

User-agent: *
Disallow:

Block Specific Folder

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/

Block URL Parameters

User-agent: *
Disallow: /*?*
Disallow: /*&*

Robots.txt vs Noindex

robots.txt

  • Blocks crawling
  • Page may still be indexed (via external links)
  • Saves crawl budget
  • Crawlers cannot see a noindex tag on a blocked page

noindex

  • Allows crawling
  • Prevents indexing
  • Page won't appear in search results
  • Best for removing pages from search

Important: Don't use robots.txt to hide pages from search. If external sites link to a blocked page, it can still appear in search results. Use noindex instead.
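For HTML pages, the noindex directive is a meta tag in the page's head:

```html
<!-- In the <head> of a page that should stay out of search results -->
<meta name="robots" content="noindex">
```

For non-HTML resources (PDFs, images), the same directive can be sent as an HTTP response header: X-Robots-Tag: noindex. Either way, the page must remain crawlable (not blocked in robots.txt) so the crawler can actually see the directive.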

Common Mistakes

  1. Blocking CSS/JS files - Prevents Google from rendering pages properly
  2. Using robots.txt for security - It's public and not a security measure
  3. Blocking entire site accidentally - Forgetting to update after development
  4. Syntax errors - Case sensitivity, missing colons, wrong paths
  5. Conflicting rules - Precedence varies by crawler: Google applies the most specific (longest) matching rule, while some parsers simply use the first match

Testing Your Robots.txt

  • Use the robots.txt report in Google Search Console (the successor to the retired robots.txt Tester)
  • Check that critical pages aren't accidentally blocked
  • Verify CSS and JS files are accessible
  • Test after any changes
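A check along these lines can be scripted with Python's stdlib urllib.robotparser; the rules and paths below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that accidentally blocks asset folders
robots = """\
User-agent: *
Disallow: /css/
Disallow: /js/
"""

rp = RobotFileParser()
rp.parse(robots.splitlines())

# Critical URLs plus rendering resources that must stay crawlable
critical = ["/", "/products/widget", "/css/main.css", "/js/app.js"]
blocked = [p for p in critical if not rp.can_fetch("Googlebot", p)]
print(blocked)  # ['/css/main.css', '/js/app.js'] -> fix before deploying
```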

External Resources