Product Reviews for Ordinal Quantification

This data set comprises a labeled training set, validation samples, and testing samples for ordinal quantification. It appears in our research paper "Ordinal Quantification Through Regularization", published at ECML-PKDD 2022.

The data is extracted from the McAuley data set of Amazon product reviews, where the goal is to predict the 5-star rating of each textual review. We have sampled this data according to two protocols that are suited for quantification research.

The goal of quantification is not to predict the star rating of each individual instance, but the distribution of ratings in sets of textual reviews. More generally, quantification aims at estimating the distribution of labels in unlabeled samples of data.

The first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification, where classes are ordered and a similarity of neighboring classes can be assumed. 5-star ratings of product reviews lie on an ordinal scale and, hence, pose such an ordinal quantification task.

This data set comprises two representations of the McAuley data. The first representation consists of TF-IDF features and is sparse. The second representation is a dense RoBERTa embedding. In our experience, logistic regression classifiers work well with both representations, and RoBERTa embeddings yield more accurate predictors than the TF-IDF features.

You can extract our data sets yourself, for instance, if you require a raw textual representation. The original McAuley data set is already public and we provide all of our extraction scripts.
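The two sampling protocols described above can be illustrated with a minimal sketch. It is not the extraction code shipped with the data set. The function names are hypothetical, and the roughness score (sum of squared second differences between neighboring classes) is an assumed stand-in for the smoothness criterion, which may differ from the one used in the paper.

```python
import random

def sample_app_prevalences(n_classes, rng):
    """Draw one label distribution uniformly from the probability simplex (APP)."""
    # Normalised i.i.d. Exp(1) variates follow a flat Dirichlet distribution,
    # so every possible label distribution is equally likely.
    draws = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]

def roughness(p):
    """Assumed smoothness score: sum of squared second differences (lower = smoother)."""
    return sum((-p[i - 1] + 2 * p[i] - p[i + 1]) ** 2 for i in range(1, len(p) - 1))

def select_smoothest(samples, fraction=0.2):
    """Keep the given fraction of distributions with the lowest roughness (APP-OQ)."""
    n_keep = int(len(samples) * fraction)
    return sorted(samples, key=roughness)[:n_keep]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
app_samples = [sample_app_prevalences(5, rng) for _ in range(1000)]  # 5 star ratings
app_oq_samples = select_smoothest(app_samples, fraction=0.2)
```

Each element of app_samples is a target distribution over the five star ratings. In a full pipeline, reviews would then be drawn from the labeled pool so that each sample matches its target distribution.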

Data and Resources

Zenodo link

Site containing the files to download

Item URL

https://data.d4science.org/ctlg/ResourceCatalogue/product_reviews_for_ordinal_quantification

Personal Data Attributes

Description: Personal Data related Information

Field	Value
Anonymised	Pseudo Anonymized
ChildrenData	No
Cross Border Authorised	Yes
Data Protection Impact Assessment	Yes
Ethics Committee Approval	Yes
General Data	Yes
Informed Consent Template	Yes
Personal Data	No
Personal data was manifestly made public by the data subject	No
Sensitive Data	No

Additional Info

Field	Value
Accessibility	Both
Accessibility Mode	Download
Availability	On-Line
Basic rights	Download
Creation Date	2022-09-16
Creator	Bunse, Mirko, mirko.bunse@cs.tu-dortmund.de, orcid.org/0000-0002-5515-6278
Dataset Citation	Bunse, Mirko, Moreo, Alejandro, Sebastiani, Fabrizio, & Senz, Martin. (2022). Product Reviews for Ordinal Quantification (v0.1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7081208
Dataset Re-Use Safeguards	None
DiskSize	23.4 GB
External Identifier	10.5281/zenodo.7081208
Field/Scope of use	Non-commercial research only
Format	zip
Group	Others
Language	eng, English
License term	2022-09-16 / 2032-09-16
Manifestation Type	Virtual
Processing Degree	Secondary
Retention Period	2022-09-16 / 2032-09-16
Size	23.4 GB
Sublicense rights	No
Territory of use	World Wide
Thematic Cluster	Text and Social Media Mining [TSMM]
system:type	Dataset

Management Info

Field	Value
Author	Moreo Alejandro
Maintainer	Moreo Alejandro
Version	1
Last Updated	17 June 2023, 08:23 (CEST)
Created	17 February 2023, 16:16 (CET)