robots.txt is a plain-text file served from the root of a host (for example https://example.com/robots.txt) that tells web crawlers which URL paths they are allowed to fetch. It uses simple directives — User-agent, Allow, Disallow, and Sitemap — and is defined by RFC 9309. The file is advisory: it relies on cooperative bots and is not an access-control mechanism.
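As a rough illustration of how a cooperative crawler reads these directives, the sketch below parses a small robots.txt with `urllib.robotparser` from Python's standard library. The rules, the `ExampleBot` user-agent name, and the example.com URLs are made up for illustration. Python's parser applies rules in file order, which can differ from the longest-match behaviour described in RFC 9309, so the sample avoids overlapping Allow and Disallow rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; not taken from any real site.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())  # parse from memory, no network request

# "ExampleBot" is an assumed crawler name; it falls under the "*" group.
can_fetch_home = parser.can_fetch("ExampleBot", "https://example.com/")
can_fetch_admin = parser.can_fetch("ExampleBot", "https://example.com/admin/users")

# site_maps() (Python 3.8+) returns the Sitemap: URLs, or None if there are none.
sitemaps = parser.site_maps()

print(can_fetch_home, can_fetch_admin, sitemaps)
```

Here the home page is allowed and the admin path is refused, but only because the parser chooses to honour the file; nothing stops a non-cooperative client from requesting the path anyway.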
Why it matters
A well-formed robots.txt keeps crawlers focused on URLs that should be indexed, which helps with crawl budget on large sites and keeps crawlers out of staging or admin paths. A broken or overly aggressive file can get an entire site dropped from search results: a stray Disallow: / is one of the most common SEO regressions during a launch.
How to check
- Fetch `/robots.txt` directly and confirm it returns `200` with `Content-Type: text/plain`.
- Validate the file in Google Search Console's robots.txt report or Bing Webmaster Tools.
- Use `Disallow:` to block crawl paths, but use `noindex` headers or meta tags to keep pages out of the index; `robots.txt` does not remove URLs that are already indexed.
- Add a `Sitemap:` line that points to your sitemap.xml.
- Never put secrets or auth-only paths behind `robots.txt`; protect them with authentication.
- If you publish AI guidance, link to your llms.txt policy.
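
Some of the checks above can be automated. The following is a minimal sketch using only Python's standard library. The function names `audit_robots` and `fetch_and_audit` are hypothetical, and the blanket-disallow check is a simple heuristic that flags `Disallow: /` in any group, including groups that block a single bot on purpose, so treat a hit as a prompt for a manual look rather than a verdict.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def audit_robots(status, content_type, body):
    """Return a list of human-readable problems for one robots.txt response."""
    if status != 200:
        return [f"expected status 200, got {status}"]

    problems = []
    if not content_type.lower().startswith("text/plain"):
        problems.append(f"expected text/plain, got {content_type!r}")

    # Collect (field, value) pairs, ignoring comments and blank lines.
    fields = []
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" in line:
            name, value = line.split(":", 1)
            fields.append((name.strip().lower(), value.strip()))

    # Heuristic: a bare "Disallow: /" blocks every path for its group.
    if ("disallow", "/") in fields:
        problems.append("found 'Disallow: /' (blocks every path for that group)")
    if not any(name == "sitemap" for name, _ in fields):
        problems.append("no Sitemap: line")
    return problems


def fetch_and_audit(base_url):
    """Fetch <base_url>/robots.txt over the network and audit the response."""
    url = base_url.rstrip("/") + "/robots.txt"
    try:
        with urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return audit_robots(resp.status, resp.headers.get("Content-Type", ""), body)
    except HTTPError as err:
        # 4xx and 5xx responses are raised as exceptions by urlopen.
        return audit_robots(err.code, err.headers.get("Content-Type", ""), "")
```

Calling `fetch_and_audit("https://example.com")` makes a live request and returns an empty list when none of these checks fail. Keeping `audit_robots` separate from the network call means it can be run against a saved copy of the file, for example in a pre-launch test.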