28 items found

Groups: Others Tags: Text mining

  • Dataset

    SWH Filenames

    A 69 GB dataset with ~2.3 billion strings representing deduplicated names of source code files collected by Software Heritage, the great library of source code...
    • ZIP: SWH Filenames (login required)
  • Dataset

    FANCY Dataset

    (NLI) FANCY (FActivity, Negation, Common-sense, hYpernymy) is a new dataset with 4000 sentence pairs concerning complex linguistic phenomena such as factivity, negation,...
    • Resource: FANCY Dataset (login required)
  • Dataset

    Santorini Tweets July-August 2021

    This dataset contains 225,501 tweets written by 141,277 users. These tweets are geolocated in Santorini, or they contain the word or the hashtag "santorini" in the text. They...
    • ZIP: tweet_santorini.csv (login required)
  • Method

    Quantum Distance-Based Classifier

    The Quantum Distance-Based Classifier is a technique inspired by the classical k-Nearest Neighbors that leverages quantum properties to perform prediction.
  • Method

    CLiQS

    CLiQS is a Python software package for summarizing social media texts with a diversified approach.
    • Resource: CLiQS-CM (login required)
  • Method

    Private Distributed W2V

    Accelerated training of Word Embeddings for large text corpora. Creates a word2vec-model from an input corpus of tokenized texts through the use of parallel distributed...
  • Method

    Ariadne Dutch Dendrochronology Entity Recognizer

    Identifies terms and phrases in Dutch for analysing archaeological text. The method delivers named entities of archaeological elements, wood material, sample, and date, with...
    • method-engine: Method Engine (login required)
  • Method

    Ariadne Dutch Archaeology Named Entity Recognizer

    Identifies terms and phrases in Dutch for analysing archaeological text. The method delivers named entities of archaeological context, physical object, material, time...
    • method-engine: Method Engine (login required)
  • Method

    Ariadne English Archaeology Named Entity Recognizer

    Identifies terms and phrases in English for analysing archaeological text. The method delivers named entities of archaeological context, physical object, material, time...
    • method-engine: Method Engine (login required)
  • Method

    Ariadne Swedish Archaeology Named Entity Recognizer

    Identifies terms and phrases in Swedish for analysing archaeological text. The method delivers named entities of archaeological context, physical object, material, time...
    • method-engine: Method Engine (login required)
  • Method

    Ariadne English Dendrochronology Entity Recognizer

    Identifies terms and phrases in English for analysing archaeological text. The method delivers named entities of archaeological elements, wood material, sample, and date, with...
    • method-engine: Method Engine (login required)
  • Method

    Ariadne Swedish Dendrochronology Entity Recognizer

    Identifies terms and phrases in Swedish for analysing archaeological text. The method delivers named entities of archaeological elements, wood material, sample, and date, with...
    • method-engine: Method Engine (login required)
  • Method

    GATE Cloud Chemical Entity Recogniser

    This service annotates chemical named entities using the open source OSCAR4 tagger. As well as the names of the detected entities, the tagger also returns their structure in...
    • method-engine: Method Engine (login required)
  • Dataset

    Wikinews dataset

    This dataset consists of a sample of 365 news articles published by Wikinews from November 2004 to June 2014 and annotated with about 5000 entities, each associated with a saliency...
    • JSON: entity-saliency (login required)
  • Dataset

    The Italian Music Dataset

    The dataset is built by exploiting the Spotify and SoundCloud APIs. It is composed of over 14,500 different songs of both famous and less famous Italian musicians. Each song...
    • JSON: Dataset (login required)
  • Method

    ArchiveSpark

    ArchiveSpark is an Apache Spark framework for easy data access, processing, extraction as well as derivation for Web archives and archival collections. It has a simple and...
    • Resource: ArchiveSpark on GitHub (login required)
  • Method

    Dictionary creator

    This tool creates a dictionary with inverse document frequency (idf) values from the Google NGrams dataset.
    • Resource: Source code (login required)
  • Dataset

    WIRE dataset

    This dataset consists of 503 pairs of Wikipedia entities drawn from the New York Times dataset with a human-assigned relatedness score. The domain experts based their...
    • HTML: WikipediaRelatedness (login required)
    • CSV: WIRE dataset (login required)
  • Dataset

    Wikipedia Word Embeddings

    Embeddings were created by applying word2vec skip-gram to a corpus of Wikipedia non-stub articles from a December 2015 English dump with the following parameters: -cbow 0...
    • Resource: Embeddings (login required)
  • Dataset

    Amazon reviews

    This (link to the) dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014. This dataset includes reviews...
    • HTML: Julian McAuley's repository (login required)
You can also access this registry using the API (see API Docs).