Open Source: Projects and Data

Toolkits

ReChorus: Top-K Recommendation with Implicit Feedback

ReChorus is a general PyTorch framework for Top-K recommendation with implicit feedback, designed primarily for research purposes. It aims to provide a fair benchmark for comparing state-of-the-art algorithms. We hope this can partially alleviate the problem that different papers adopt non-comparable experimental settings, so as to form a “Chorus” of recommendation algorithms.

PyTorch Top-K Recommendation

ULTRA: Unbiased Learning to Rank Algorithm

ULTRA is an Unbiased Learning To Rank Algorithms toolbox that provides a codebase for experiments and research on learning to rank with human-annotated or noisy labels. With its unified data processing pipeline, ULTRA supports multiple unbiased learning-to-rank algorithms, online learning-to-rank algorithms, and neural learning-to-rank models, as well as different methods to use and simulate noisy labels (e.g., clicks) to train and test different algorithms and ranking models.

PyTorch TensorFlow Learning to Rank

Datasets

TianGong-CRL Dataset

Description: We provide the Chinese-centric TianGong-CRL dataset to support research on epidemic-related Information Retrieval (IR) tasks and on the information needs of Chinese people in the context of COVID-19. Refined from an 82-day search log from Sogou, the second largest search engine in China, the dataset consists of two parts. The first part provides a collection of 1,492 COVID-19-related queries and the submission frequency of these queries in each province of China over the 82-day period. The second part provides a sample of COVID-19-related search logs from the same period; to protect user privacy, only session-level data is provided. We also sample a subset of 1,700 sessions from TianGong-CRL and manually label each session with five intent labels.

Logs

Image Annotation Dataset

Description: On Annotation Methodologies for Image Search Evaluation

Image Annotation

User Behavior Dataset

Description: The influence of image search intents on user behavior and satisfaction

Logs Annotation

Reading Attention Dataset

Description: Understanding Reading Attention Distribution during Relevance Judgement.

Annotation

Sogou-SRR Dataset

Description: The Sogou-SRR (Search Result Relevance) dataset was constructed to support research on search engine relevance estimation and ranking tasks. The dataset consists of 6,338 queries and the corresponding top 10 search results. For each search result, the screenshot, title, snippet, HTML source code, parse tree, and URL, as well as a 4-grade relevance score (1-4) and the result type, are provided. The queries are sampled from the search logs of Sogou.com. The sampled queries with frequency between 100 and 10,000 are usually regarded as torso queries, which are typically the most important concern in ranking algorithm design.

Logs

Sogou-QCL Dataset

Description: The Sogou-QCL dataset was created to support research on information retrieval and related human language technologies. The dataset consists of 537,366 queries, more than 9 million Chinese web pages, and five kinds of relevance labels assessed by click models. In addition, a 2,000-query dataset with 4-level human-assessed relevance labels is publicly available for research.

Logs

TianGong-ULTR Dataset

Description: The TianGong-ULTR (Unbiased Learning To Rank) dataset was constructed to support studies on unbiased learning to rank. It provides real click data sampled from the search logs of Sogou.com for training unbiased learning-to-rank algorithms, as well as a separate set of human-annotated data for evaluating their performance.

Logs Annotation

SearchSuccess Dataset

Description: This dataset was created to support research on search evaluation in exploratory search. We conducted a user study comprising 166 search sessions in three domains. Users’ interactions and explicit feedback were collected during the search process. The clicked documents collected in the user study were annotated by external assessors.

Logs

ZhihuRec Dataset

Description: The ZhihuRec dataset was collected from the knowledge-sharing platform Zhihu. It comprises around 100M interactions collected within 10 days, 798K users, 165K questions, 554K answers, 240K authors, 70K topics, and more than 501K user query logs. It also includes anonymized descriptions of users, answers, questions, authors, and topics. To the best of our knowledge, this is the largest real-world interaction dataset for personalized recommendation.

Logs

T²Ranking

Description: T²Ranking is a large-scale Chinese benchmark for passage ranking, covering both passage retrieval and re-ranking. T²Ranking comprises more than 300K queries and over 2M unique passages from real-world search engines. Specifically, we sample question-based search queries from user logs of the Sogou search engine, a popular search system in China. For each query, we extract the content of the corresponding documents from different search engines. After model-based passage segmentation and clustering-based passage de-duplication, a large-scale passage corpus is obtained. For a given query and its corresponding passages, we hire expert annotators to provide 4-level relevance judgments for each query-passage pair.

Annotation Passage retrieval Passage re-ranking

Special thanks to Shuqi Zhu for the initial construction of this page.