This site lets users run full-text queries against Google's C4 dataset. We hope this helps ML practitioners better understand its contents, so that they are aware of the potential biases and issues that may be inherited through its use.
The dataset is released under the terms of ODC-BY. By using it, you are also bound by the Common Crawl Terms of Use in respect of the content contained in the dataset.
You can read more about the supported query syntax here. Each record has two fields, url and text, both of which are searchable. The fields are indexed using the Standard analyzer, which means you can't search for punctuation.
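
To illustrate why punctuation is not searchable, here is a rough Python sketch of what a standard-style analyzer does to text before indexing. It is only an approximation: the function name is hypothetical, and the real analyzer uses more careful word-boundary rules than this simple regular expression, so edge cases such as apostrophes or dotted names may be tokenized differently.

```python
import re

def approximate_standard_tokens(text):
    """Coarse approximation of a standard-style analyzer (an assumption,
    not the site's actual implementation): lowercase the text, then keep
    only runs of letters, digits, and underscores as tokens."""
    return re.findall(r"\w+", text.lower())

# Punctuation never becomes a token, so a query cannot match it.
print(approximate_standard_tokens("Hello, world! Is C++ faster?"))
# ['hello', 'world', 'is', 'c', 'faster']
```

Under this approximation a query for "C++" is reduced to the single token "c", which is why searching for the punctuation itself returns nothing useful.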