Crawler Detection

Crawler Detection checks your website's configuration to ensure AI systems can access and index your content. This feature verifies your robots.txt and llms.txt files to identify potential access issues.

Why Crawler Access Matters

AI systems need to access your content to include information in their knowledge base, cite your pages in responses, understand your brand and offerings, and recommend your products or services. Many websites inadvertently block AI crawlers through overly restrictive robots.txt rules, missing or misconfigured llms.txt files, blocking specific user agents, or IP-based restrictions.

What Gets Checked

Your robots.txt file controls crawler access by specifying which user agents can crawl, which paths are allowed or blocked, crawl rate directives, and sitemap references. nonBot AI checks whether major AI crawlers are explicitly allowed, explicitly blocked, or covered by general rules.
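For example, a small robots.txt combining those four kinds of directives might look like this (paths and the sitemap URL are placeholders):

```text
User-agent: *
Disallow: /admin/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
```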

The llms.txt file is an emerging standard for AI access that specifies content appropriate for LLM training, indicates preferred citation methods, provides metadata about your content, and signals AI-friendly content policies.
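Because llms.txt is an evolving proposal rather than a fixed schema, there is no single required layout. A sketch along the lines described above, with every name, URL, and field purely illustrative, could read:

```text
# Example Co

> Example Co makes widgets for small businesses. Contact: press@example.com

## Products
- [Product catalog](https://example.com/products/): full product listing

## Policies
- Preferred citation: link to the source page and name "Example Co"
```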

The system checks for common AI crawler user agents including GPTBot (OpenAI), Google-Extended (Google AI), Anthropic crawlers, Perplexity crawlers, and other AI system crawlers.
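One quick way to see how a given robots.txt treats these user agents is Python's standard urllib.robotparser. The sketch below uses the user-agent strings listed above against a hypothetical robots.txt that blocks only OpenAI's crawler:

```python
from urllib import robotparser

# AI crawler user agents from the list above
AI_AGENTS = ["GPTBot", "Google-Extended", "anthropic-ai", "PerplexityBot"]

# Hypothetical robots.txt: blocks GPTBot, allows everyone else
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(SAMPLE_ROBOTS)

# Map each AI crawler to whether it may fetch a representative path
access = {agent: rp.can_fetch(agent, "/products/") for agent in AI_AGENTS}

for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With these sample rules, GPTBot reports as blocked while the other agents fall through to the general allow rule, which mirrors the "partial" status described below.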

Viewing Results

Navigate to Playbook > Crawler Detection to see overall access status (accessible, blocked, or partial), specific findings for each crawler type, configuration recommendations, and direct links to relevant files.

Green status indicates AI crawlers can access your content with no blocking rules detected. Yellow status means some crawlers are allowed while others are blocked, or specific paths may be restricted—review is recommended. Red status indicates AI crawlers are blocked and content is inaccessible for AI indexing, requiring action.

Common Issues

Some robots.txt files block all crawlers with User-agent: * Disallow: /, which prevents AI indexing entirely. Others target AI-specific crawlers with rules like User-agent: GPTBot Disallow: / that specifically prevent OpenAI's crawler.
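Written out as file contents, the two patterns described above look like this (they are alternative configurations, not one file):

```text
# Pattern 1: blocks every crawler, including all AI systems
User-agent: *
Disallow: /

# Pattern 2: blocks only OpenAI's crawler
User-agent: GPTBot
Disallow: /
```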

Without an llms.txt file, AI systems have no explicit permission signals, citation preferences are unclear, and your visibility priority may be reduced. Blocking important paths like /products/ or /about/ hides key brand information from AI systems.

Recommendations

Ensure your robots.txt permits AI access by adding explicit allow rules for GPTBot, Google-Extended, and anthropic-ai. Create an llms.txt file at your web root specifying your brand name, contact information, allowed paths, and preferred citation format.
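A robots.txt fragment with explicit allow rules for the three crawlers named above would read:

```text
User-agent: GPTBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: anthropic-ai
Allow: /
```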

You can balance access and protection by allowing public marketing content while blocking private and admin areas, blocking user-generated content if desired, and allowing product and service information.
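One way to express that balance in robots.txt, with all paths illustrative:

```text
# AI crawlers may read public marketing and product pages
User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /account/
# user-generated content, if you prefer to exclude it
Disallow: /forum/

# Same boundaries for all other crawlers
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /account/
```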

Implementation

To update robots.txt, locate the file (usually at domain.com/robots.txt), review current rules, add explicit allow rules for AI crawlers, test using online robots.txt validators, and monitor for indexing changes.

To create llms.txt, create a new file named llms.txt, add it to your web root, specify access policies, include brand information, and set citation preferences. After making changes, run a new analysis in nonBot AI, check crawler detection results, monitor for visibility changes, and allow time for AI systems to re-crawl.

Impact Timeline

Crawler access changes don't show immediate results. AI systems crawl at their own schedule, training updates happen periodically, citation changes may take weeks or months, and consistent access over time yields the best results.

Access Requirements

Crawler Detection is a subscription-only feature available in Pro, Elite, and Agency plans.

Next Steps

Review technical optimization tips in Best Practices, plan new AI-accessible content with Content Ideas, or find where your content can be cited with Citation Targets.