Commercial Crawlers · Active

CCBot

Common Crawl's open web crawler, whose archive is used by multiple AI companies for training.

Operated by Common Crawl

What is CCBot?

CCBot is the web crawler for Common Crawl, a nonprofit that maintains an open web archive used by researchers and AI companies worldwide. Common Crawl data has been used to train many of the world's largest language models, including GPT, Claude, and LLaMA.

Unlike corporate crawlers, CCBot serves the broader research community. Its archive is freely available, meaning content crawled by CCBot could end up in any number of AI training pipelines. This makes the allow/block decision particularly impactful.

CCBot respects robots.txt and publishes its IP ranges. Its crawl rate is relatively low since it aims for broad coverage rather than frequent re-crawling. The decision to block CCBot is effectively a decision about whether your content should be available in the most widely used open training dataset.

User-Agent Strings

These are the known user-agent patterns used by CCBot. Use them to identify this crawler in your server logs or configure robots.txt rules.

CCBot
ccbot
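If you want to detect CCBot in your own server logs, a minimal sketch is a case-insensitive match against the patterns above (the log line below is illustrative, not a real request):

```python
import re

# Case-insensitive pattern covering both known CCBot user-agent tokens.
CCBOT_PATTERN = re.compile(r"ccbot", re.IGNORECASE)

def is_ccbot(user_agent: str) -> bool:
    """Return True if the user-agent string matches CCBot."""
    return bool(CCBOT_PATTERN.search(user_agent))

# Hypothetical combined-log-format line for illustration.
line = '1.2.3.4 - - [10/Jan/2024:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "CCBot/2.0"'
print(is_ccbot(line))  # True
```

Because the match is case-insensitive, a single pattern covers both the `CCBot` and `ccbot` variants seen in the wild.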

robots.txt example:

User-agent: CCBot
Disallow: /private/
Allow: /

How to Manage CCBot

1. Blocking CCBot affects multiple AI training pipelines, not just one company.

2. Common Crawl data is publicly available; blocking prevents future inclusion.

3. Low crawl rates mean minimal bandwidth impact.

4. Use Switch to track CCBot alongside company-specific training crawlers.

How to block CCBot
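To opt out of the Common Crawl archive entirely, a robots.txt rule like the following blocks CCBot from the whole site (the user-agent token matches the strings listed above):

User-agent: CCBot
Disallow: /

Note that this only prevents future crawls; pages already captured in earlier Common Crawl snapshots remain in the published archive.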

Start managing CCBot today

Switch detects, tracks, and lets you build custom journeys for CCBot and 35+ other AI agents and crawlers. Set up in five minutes.

