MS MARCO Web Search builds its dataset by sampling query-document clicks from one year of Bing's search logs. Rarely triggered queries are filtered out, as are queries that contain personally identifiable information, offensive content, or adult content, and queries with no click connection to the ClueWeb22 document set. The resulting set reflects the real query distribution of a commercial web search engine, which matters for training and evaluating retrieval models. Around 10 million query-document pairs are sampled from the train set and 10 thousand from the test set.
To generate large-scale, high-quality queries and query-document relevance labels, we sample query-document clicks from one year of the Bing search engine's logs.
The resulting set includes queries triggered by many users, which reflects the real query distribution of a commercial web search engine.
We sample around 10 million query-document pairs from the train set and 10 thousand query-document pairs from the test set.
The initial query set is filtered to remove queries that are rarely triggered, that contain personally identifiable information, offensive content, or adult content, or that have no click connection to the ClueWeb22 document set.
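The filtering and sampling steps described above can be illustrated with a small sketch. This is not the authors' pipeline: the `Click` record, the `min_users` threshold, the `is_flagged` predicate (standing in for the PII, offensive, and adult content checks), and the `known_docs` set (standing in for ClueWeb22 document IDs) are all hypothetical names chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Click:
    """One logged click (hypothetical schema, not the real log format)."""
    query: str
    doc_id: str
    user_id: str


def filter_clicks(clicks, known_docs, is_flagged, min_users=2):
    """Keep clicks whose query is triggered by enough distinct users,
    is not flagged by the content check, and whose clicked document
    belongs to the known collection."""
    # Count distinct users per query to detect rarely triggered queries.
    users_per_query = {}
    for c in clicks:
        users_per_query.setdefault(c.query, set()).add(c.user_id)
    return [
        c for c in clicks
        if len(users_per_query[c.query]) >= min_users  # assumed threshold
        and not is_flagged(c.query)
        and c.doc_id in known_docs
    ]


def sample_pairs(clicks, n, seed=0):
    """Sample up to n distinct (query, doc_id) pairs, reproducibly."""
    pairs = sorted({(c.query, c.doc_id) for c in clicks})
    rng = random.Random(seed)
    return rng.sample(pairs, min(n, len(pairs)))


if __name__ == "__main__":
    log = [
        Click("weather", "d1", "u1"),
        Click("weather", "d1", "u2"),
        Click("rare query", "d1", "u1"),
    ]
    kept = filter_clicks(log, known_docs={"d1"}, is_flagged=lambda q: False)
    print(sample_pairs(kept, n=10))
```

Queries whose every click points outside `known_docs` drop out entirely, which is one simple way to model the "no click connection" rule; the actual criteria and thresholds used for the dataset are not stated in this section.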