Items - SoBigData.eu Catalogue

Dataset

SWH Filenames

A 69 GB dataset with ~2.3 billion strings representing deduplicated names of source code files collected by Software Heritage, the great library of source code...

Resource: 'SWH Filenames' (ZIP; login required)

Dataset

FANCY Dataset

(NLI) FANCY (FActivity, Negation, Common-sense, hYpernimy) is a new dataset with 4000 sentence pairs concerning complex linguistic phenomena such as factivity, negation,...

Resource: 'FANCY Dataset' (login required)

Dataset

Santorini Tweets July-August 2021

This dataset contains 225,501 tweets written by 141,277 users. These tweets are geolocated in Santorini, or they contain the word or the hashtag "santorini" in the text. They...

Resource: 'tweet_santorini.csv' (ZIP; login required)

Dataset

Wikinews dataset

This dataset consists of a sample of 365 news articles published by Wikinews from November 2004 to June 2014 and annotated with about 5000 entities, each associated with a saliency...

Resource: 'entity-saliency' (JSON; login required)

Dataset

The Italian Music Dataset

The dataset is built by exploiting the Spotify and SoundCloud APIs. It is composed of over 14,500 different songs of both famous and less famous Italian musicians. Each song...

Resource: 'Dataset' (JSON; login required)

Dataset

WIRE dataset

This dataset consists of 503 pairs of Wikipedia entities drawn from the New York Times dataset with a human-assigned relatedness score. The domain experts based their...

Resource: 'WikipediaRelatedness' (HTML; login required)
Resource: 'WIRE dataset' (CSV; login required)

Dataset

Wikipedia Word Embeddings

The embeddings were created by applying word2vec skip-gram to a corpus of non-stub Wikipedia articles from a December 2015 English dump with the following parameters: -cbow 0...

Resource: 'Embeddings' (login required)

Dataset

Amazon reviews

This (link to the) dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014. This dataset includes reviews...

Resource: 'Julian McAuley's repository.' (HTML; login required)

Dataset

Conversational search dataset with labels

The CAsT 2019 data is split into two files, one for training and one for testing. - Training set: CAsT 2019 conversations from training set and from test set without...

Resource: 'Conversational dataset ...' (login required)

Dataset

Learning to quantify: LeQua 2022 datasets

The aim of LeQua 2022 (the 1st edition of the CLEF “Learning to Quantify” lab) is to allow the comparative evaluation of methods for “learning to quantify” in textual...

Resource: 'Zenodo link' (login required)

Dataset

Product Reviews for Ordinal Quantification

This data set comprises a labeled training set, validation samples, and testing samples for ordinal quantification. It appears in our research paper "Ordinal Quantification...

Resource: 'Zenodo link' (login required)

Dataset

Cherenkov Telescope Data for Ordinal Quantification

This labeled data set is targeted at ordinal quantification. It appears in our research paper "Ordinal Quantification Through Regularization", which we have published at...

Resource: 'Zenodo' (login required)

Dataset

Cross-Lingual Dataset of Crisis-Related Social Media

If you use this dataset, please cite the following paper: Fedor Vitiugin, Carlos Castillo: Cross-Lingual Query-Based Summarization of Crisis-Related Social Media: An Abstractive...

Resource: 'Cross-Lingual Dataset of ...' (login required)

Dataset

Dataset for Evaluating Abstractive Summaries of Crisis-Related Social Media

This dataset was created for evaluating summaries generated from social media content posted during five natural disasters. The dataset contains: ground truth reports created by human...

Resource: 'Dataset for Evaluating ...' (login required)

14 items found