CoSRec

CoSRec is the first dataset explicitly designed for joint Conversational Search and Recommendation (CSR) tasks. It comprises approximately 9,000 user-system conversations generated by a Large Language Model (LLM) in the product search and recommendation domain. These conversations cover a variety of interactions, including pure search, pure recommendation, and mixed search-and-recommendation utterances. To ensure the quality of the dataset, a sample of approximately 3% of the conversations has been manually annotated to identify user intents and assess overall quality. Additionally, for 20 high-quality conversations, we provide utterance-level human-generated relevance judgments for items or documents, depending on the intent of the utterance. These annotations enable precise and effective evaluation of joint CSR systems.

A key feature of CoSRec is its agnosticism toward underlying systems and evaluation paradigms. To support this, CoSRec includes separate ground truths for search and recommendation tasks, allowing researchers to apply diverse evaluation paradigms and methodologies.

CoSRec includes 9,249 conversations split into 3 partitions:
  • CoSRec-Raw: 8,938 non-annotated conversations containing 71,656 utterances.
  • CoSRec-Crowd: 291 human-annotated conversations containing 2,329 utterances.
  • CoSRec-Curated: 20 deeply human-annotated conversations containing 150 utterances.

The CoSRec dataset comes with human-made quality assessments for a subset of 311 conversations (∼3%) corresponding to CoSRec-Crowd and CoSRec-Curated. The annotation process involved 99 semi-expert human annotators. The quality assessments are given on a 1 to 5 scale and concern 4 aspects: fluency, informativeness, logicality, and coherence.

CoSRec also includes human-labeled intents for each utterance of the CoSRec-Crowd and CoSRec-Curated conversations.
Each utterance is annotated with zero, one, or more of the “search”, “recommendation”, and “product detail” intents:
  • Search: The user asks for general information about a topic related to the product under discussion.
  • Recommendation: The user asks for products to be suggested according to their requirements.
  • Product Detail: The user inquires about details of the product under discussion.

For the 20 CoSRec-Curated conversations, the labeled intents were further refined by reviewing cases where annotators did not reach unanimity.

Along with intent labels, human annotators also provided a stand-alone formulation: a self-explanatory textual description of the information need that fully encapsulates the conversation’s context and can therefore be read independently of it. Since each conversation, and therefore each utterance, was annotated by multiple annotators, we define the longest stand-alone formulation as the canonical formulation; the others are considered reformulations.

The CoSRec-Curated portion of the dataset contains a total of 17k relevance judgments for user intents related to search and recommendation. The judgments for the search intents were created following the standard TREC-style annotation procedure (pooling passages retrieved from MS-MARCO). The judgments for the recommendation intents, instead, were created taking personalization into account. In particular, a pool of products for each intent was retrieved from a filtered version of the Amazon Reviews catalogue using personalized requests (stand-alone formulations concatenated with keywords representative of the user for whom the request is personalized). The assessors were then asked to assess the relevance of the retrieved products for the request, also taking into account a summary of the profile of the user for whom the request was personalized.
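The rule for choosing the canonical stand-alone formulation (the longest one, with the rest treated as reformulations) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `split_formulations` and the example strings are hypothetical, and the sketch assumes that length is measured in characters and that ties go to the first formulation, neither of which is specified in the description.

```python
from typing import List, Tuple


def split_formulations(formulations: List[str]) -> Tuple[str, List[str]]:
    """Return (canonical, reformulations) for a single utterance.

    The canonical formulation is the longest one; every other
    formulation is treated as a reformulation.
    """
    if not formulations:
        raise ValueError("at least one stand-alone formulation is required")
    # Assumption: length is counted in characters and ties go to the
    # earliest formulation; the dataset description specifies neither.
    canonical_index = max(range(len(formulations)),
                          key=lambda i: len(formulations[i]))
    canonical = formulations[canonical_index]
    reformulations = [f for i, f in enumerate(formulations)
                      if i != canonical_index]
    return canonical, reformulations


# Made-up formulations from three annotators for the same utterance.
example = [
    "cheap running shoes",
    "affordable running shoes for flat feet",
    "running shoes",
]
canonical, others = split_formulations(example)
```

Here `canonical` holds the second string and `others` holds the remaining two in their original order.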
For both search and recommendation intents, relevance was assessed using 3 relevance labels: (0) Not Relevant, (1) Partially Relevant, (2) Highly Relevant.
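Because the labels are graded (0, 1, 2), they can feed graded-relevance metrics. CoSRec does not prescribe any particular metric, so the following Python sketch is only one possible use: it computes nDCG@k with linear gain for a single intent. The function names, the item identifiers, and the choice to treat unjudged items as not relevant are all assumptions made for illustration.

```python
import math
from typing import Dict, List


def dcg(gains: List[int]) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranking: List[str], judgments: Dict[str, int], k: int) -> float:
    """nDCG@k of a ranked list against graded judgments (0, 1, 2)."""
    # Assumption: items without a judgment count as Not Relevant (0).
    gains = [judgments.get(item, 0) for item in ranking[:k]]
    # The ideal ranking places the highest labels first.
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments for one intent: item id -> relevance label.
judgments = {"item_a": 2, "item_b": 1, "item_c": 0}
perfect = ndcg_at_k(["item_a", "item_b", "item_c"], judgments, k=3)
reversed_run = ndcg_at_k(["item_c", "item_b", "item_a"], judgments, k=3)
```

With these toy judgments the ideally ordered run scores 1.0 and the reversed run scores lower. An exponential gain (2^label - 1) is an equally common choice and would only change the `dcg` function.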

Tags
Data and Resources
  • CoSRec GitHub

  • CoSRec SBD

Personal Data Attributes

Description: Personal Data related Information

Field Value
Anonymised Anonymized
ChildrenData No
General Data Yes
Personal Data No
Personal data was manifestly made public by the data subject No
Sensitive Data No
Additional Info
Field Value
Accessibility Both
Basic rights Download
Creation Date 2025-02-17
Creator Alessio, Marco, marco.alessio@isti.cnr.it, orcid.org/0009-0008-5043-2174
Creator Merlo, Simone, simone.merlo@phd.unipd.it, orcid.org/0009-0003-8003-4795
Data sharing agreement yes
Dataset Citation Marco Alessio, Simone Merlo, Tommaso Di Noia, Guglielmo Faggioli, Marco Ferrante, Nicola Ferro, Cristina Ioana Muntean, Franco Maria Nardini, Fedelucio Narducci, Raffaele Perego, Giuseppe Santucci, and Nicola Viterbo. 2025. CoSRec: A Joint Conversational Search and Recommendation Dataset. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2025, Padua, Italy, July 13-17, 2025. ACM.
Field/Scope of use Non-commercial research only
Group Social Impact of AI and explainable ML
License term 2025-07-14 / 2027-12-31
Processing Degree Primary
SoBigData Node SoBigData IT
Sublicense rights No
Territory of use World Wide
Thematic Cluster Text and Social Media Mining [TSMM]
system:type Dataset
Management Info
Field Value
Author Muntean Cristina
Maintainer Muntean Cristina
Version 1
Last Updated 9 June 2025, 16:10 (CEST)
Created 6 June 2025, 12:05 (CEST)