Robots.txt is one of the oldest and most fundamental tools in SEO, yet it is routinely misconfigured even on major sites. While the basic syntax is simple — allow and disallow directives for user agents — the advanced patterns for wildcard matching, crawl-delay, sitemap declarations, and section-level control can make the difference between efficient crawl budget utilization and catastrophic indexing failures.
Robots.txt Fundamentals
The robots.txt file lives at the root of your domain and provides crawl instructions to search engine bots. It is a suggestion, not a command: well-behaved bots follow it, but malicious bots ignore it. Google's crawlers follow the robots.txt directives they support strictly. The file is plain text built from User-agent, Disallow, Allow, and Sitemap directives. Rules are evaluated by specificity, not top-to-bottom: for Google, the longest matching path wins regardless of where the rule appears in the file, and when an Allow and a Disallow rule match equally, the less restrictive Allow applies.
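The specificity rule is easier to see in code. The following is a minimal sketch, assuming Google's documented precedence (longest matching pattern wins, Allow wins ties) and the two wildcard characters `*` and `$`. The function names `pattern_to_regex` and `is_allowed` are hypothetical, user-agent group selection is left out, and other crawlers may resolve conflicts differently.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex.

    '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with '.*' where the '*' wildcards were.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path: str) -> bool:
    """Decide whether `path` may be crawled.

    `rules` is a list of (directive, pattern) tuples for one user-agent group,
    where directive is 'allow' or 'disallow'. Order in the list does not matter.
    """
    best_len = -1
    best_allow = True  # no matching rule means crawling is allowed
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):  # match() anchors at the start
            allow = directive.lower() == "allow"
            # Longest pattern wins; on equal length the Allow rule wins.
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow


# Illustrative rules, not taken from any real site.
rules = [
    ("disallow", "/blog/"),
    ("allow", "/blog/public/"),
    ("disallow", "/*.pdf$"),
]
print(is_allowed(rules, "/blog/draft"))        # False
print(is_allowed(rules, "/blog/public/post"))  # True
print(is_allowed(rules, "/files/report.pdf"))  # False
```

Because the longer `/blog/public/` pattern outranks `/blog/`, the Allow rule takes effect even though it is listed second.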
Robots.txt blocks crawling, not indexing. If a page is disallowed in robots.txt but has external links pointing to it, Google may still index the URL and show it in search results with no snippet, because it cannot crawl the content. To keep a page out of the index, use noindex instead, and leave the page crawlable so that Google can actually see the directive.
Key Insight
We once audited a site that had accidentally blocked its entire /blog/ directory in robots.txt during a server migration. The error went unnoticed for four months, and the site lost 60 percent of its organic blog traffic before it was discovered. Always audit your robots.txt after any server or infrastructure change.
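A post-migration audit of this kind can be partly automated. The sketch below uses Python's standard `urllib.robotparser` to report which must-crawl paths a given robots.txt blocks. The helper name `blocked_paths`, the sample file, and the path list are assumptions for illustration. Note that the standard-library parser applies the first matching rule and does not interpret `*` or `$` inside paths, so it is only a rough check for simple prefix rules and can disagree with Google's longest-match behavior.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt: str, must_crawl, user_agent: str = "Googlebot"):
    """Return the paths from `must_crawl` that `robots_txt` disallows for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network request
    return [path for path in must_crawl if not parser.can_fetch(user_agent, path)]


# Hypothetical robots.txt left behind by a migration.
sample = """\
User-agent: *
Disallow: /blog/
Disallow: /admin/
"""

print(blocked_paths(sample, ["/blog/", "/blog/robots-guide", "/products/"]))
# ['/blog/', '/blog/robots-guide']
```

Running a check like this in a deployment pipeline, with a list of sections that must stay crawlable, would flag an accidental block before it reaches production.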
Testing and Monitoring
Validate your rules before deploying changes, and test specific URLs to confirm whether they are blocked or allowed by your configuration. Google retired the standalone robots.txt tester in Search Console in late 2023; the robots.txt report now shows the files Google fetched and any parsing problems, and the URL Inspection tool reports whether a given URL is blocked by robots.txt. After deploying changes, monitor Search Console's crawl stats for any unexpected drops in crawl activity. Set up change monitoring for your robots.txt file so that you are alerted if it is modified unexpectedly, which can happen during CMS updates or server configuration changes.
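Change monitoring does not need a dedicated tool. As a minimal sketch, a scheduled job could fetch the file, fingerprint it, and compare the result with the last known value. The function names, the whitespace normalization, and the choice of SHA-256 are assumptions; how the baseline is stored and how alerts are sent is left to your own setup.

```python
import hashlib
import urllib.request


def fetch_robots(url: str) -> str:
    """Download a robots.txt file; `url` is whatever site you monitor."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def fingerprint(robots_txt: str) -> str:
    """Hash the file after normalizing line endings and trailing whitespace."""
    lines = (line.rstrip() for line in robots_txt.strip().splitlines())
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: str, current_robots_txt: str) -> bool:
    """True when the current file differs from the stored baseline."""
    return fingerprint(current_robots_txt) != previous_fingerprint


# Usage with in-memory text; a real job would call fetch_robots() instead.
baseline = fingerprint("User-agent: *\nDisallow: /admin/\n")
print(has_changed(baseline, "User-agent: *\r\nDisallow: /admin/\r\n"))  # False
print(has_changed(baseline, "User-agent: *\nDisallow: /\n"))            # True
```

Normalizing whitespace before hashing avoids false alarms when a deployment only changes line endings, while any change to a directive still produces a different fingerprint.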