-
SWH Filenames
A 69 GB dataset with ~2.3 billion strings representing deduplicated names of source code files collected by Software Heritage, the great library of source code...-
ZIP
The resource: 'SWH Filenames' is not accessible as guest user. You must login to access it!
-
ZIP
-
Santorini Tweets July-August 2021
This dataset contains 225.501 tweets written by 141.277 users. These tweets are geolocated in Santorini, or they contain the word or the hashtag "santorini" in the text. They...-
ZIP
The resource: 'tweet_santorini.csv' is not accessible as guest user. You must login to access it!
-
ZIP
-
Articles and comments of major Estonian newspapers
The dataset contains articles and comments of four major Estonian news portals since early 2000s to 2016. -
The Italian Music Dataset
The dataset is built by exploiting the Spotify and SoundCloud APIs. It is composed of over 14,500 different songs of both famous and less famous Italian musicians. Each song...-
JSON
The resource: 'Dataset' is not accessible as guest user. You must login to access it!
-
JSON
-
Wikipedia Word Embeddings
Embeddings were created through applying word2vec skipgram to a corpus of wikipedia non-stub articles from a December 2015 English dump with the following parameters: -cbow 0... -
Product Reviews for Ordinal Quantification
This data set comprises a labeled training set, validation samples, and testing samples for ordinal quantification. It appears in our research paper "Ordinal Quantification...