Synthetic Datasets for Fine-Grained Fairness Analysis of Abusive Language Detection Systems

This release comprises three synthetic datasets, each covering a different type of bias grouped by target: sexism, racism and ableism. The records are separated by abuse target because fine-grained analysis of abusive language requires specialised datasets for each phenomenon. The data contain no samples from licensed datasets, so the released contents are freely available. The sexism dataset contains 1,200 non-hateful and 4,423 hateful samples; the racism dataset contains 400 non-hateful and 1,500 hateful records; the ableism dataset contains 220 hateful sentences. This label distribution differs sharply from that of traditional abusive language datasets, where the non-hateful class is prevalent. The choice is deliberate: the focus is on the phenomena surrounding social prejudice, with realistic and diverse examples intended to support in-depth exploration of the language used to convey bias.
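The class balance described above can be checked with a few lines of arithmetic. The following is a minimal sketch rather than part of the release: the counts are copied from the description, the names `COUNTS` and `summarize` are hypothetical, and the zero non-hateful count for ableism is an assumption based on the description listing only hateful sentences for that dataset.

```python
# Counts (non_hateful, hateful) as stated in the dataset description.
COUNTS = {
    "sexism": (1200, 4423),
    "racism": (400, 1500),
    # Assumption: the description reports only hateful sentences for ableism.
    "ableism": (0, 220),
}


def summarize(counts):
    """Return the total size and hateful share for each target."""
    summary = {}
    for target, (non_hateful, hateful) in counts.items():
        total = non_hateful + hateful
        summary[target] = {"total": total, "hateful_share": hateful / total}
    return summary


SUMMARY = summarize(COUNTS)
for target, stats in SUMMARY.items():
    print(f"{target}: {stats['total']} records, {stats['hateful_share']:.1%} hateful")
```

The totals computed this way agree with the Size field of this record (5623, 1900 and 220), and the hateful class is the majority in every dataset.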

Data and Resources

Synthetic Datasets for Fine-Grained Fairness ...CSV

Item URL

https://data.d4science.org/ctlg/ResourceCatalogue/synthetic_datasets_for_fine-grained_fairness_analysis_of_abusive_language_detection_systems

Personal Data Attributes

Description: Personal Data related Information

Field	Value
ChildrenData	No
Cross Border Authorised	Yes
General Data	Yes
Personal Data	No
Personal data was manifestly made public by the data subject	No
Sensitive Data	No

Additional Info

Field	Value
Accessibility	Both
Accessibility Mode	OnLine Access
Attribution requirements	Manerba, Marta Marchiori, and Sara Tonelli. "Fine-grained fairness analysis of abusive language detection systems with checklist." Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021.
Availability	On-Line
Basic rights	Modification
Creation Date	2021-12-22
Creator	Marchiori Manerba, Marta, marta.marchiori@phd.unipi.it
Dataset Citation	Manerba, Marta Marchiori, and Sara Tonelli. "Fine-grained fairness analysis of abusive language detection systems with checklist." Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021.
Dataset Re-Use Safeguards	/
Display requirements	Paper citation
Field/Scope of use	Any use
Format	csv
Format Schema	text, label
Group	Societal Debates and Misinformation
Language	eng, English
License term	2023-11-27 / 2026-11-27
Manifestation Type	Original
Processing Degree	Primary
Retention Period	2026-11-27 / 2026-11-30
Size	5623 ; 1900 ; 220 (records in the sexism ; racism ; ableism datasets)
SoBigData Node	SoBigData EU
SoBigData Node	SoBigData IT
Sublicense rights	No
Territory of use	World Wide
Thematic Cluster	Text and Social Media Mining [TSMM]
system:type	Dataset
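Given the csv format and the "text, label" schema listed above, a file of this shape could be inspected roughly as follows. This is a hedged sketch under stated assumptions: it assumes each file has a header row naming the two columns, the function name `label_distribution` is hypothetical, and the sample rows and label values are placeholders for illustration, not content or encodings taken from the released files.

```python
import csv
import io
from collections import Counter


def label_distribution(handle):
    """Count rows per label in a CSV stream with 'text' and 'label' columns."""
    reader = csv.DictReader(handle)
    # Assumption: the first row is a header naming both schema columns.
    missing = {"text", "label"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    return Counter(row["label"] for row in reader)


# Placeholder rows for illustration only; not taken from the released files.
SAMPLE = io.StringIO(
    "text,label\n"
    "placeholder sentence one,hateful\n"
    "placeholder sentence two,non-hateful\n"
    "placeholder sentence three,hateful\n"
)
DISTRIBUTION = label_distribution(SAMPLE)
print(DISTRIBUTION)
```

For a downloaded file, the same function can be given a handle from `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved. Counting whatever label values appear, instead of hard-coding them, avoids guessing how the labels are encoded.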

Management Info

Field	Value
Author	MARCHIORI MANERBA MARTA
Maintainer	MARCHIORI MANERBA MARTA
Version	1
Last Updated	1 December 2023, 09:19 (CET)
Created	27 November 2023, 16:18 (CET)