8 items found

Licenses: Creative Commons Attribution 4.0; Types: Dataset; Tags: Text mining

  • Dataset

    SWH Filenames

    A 69 GB dataset with ~2.3 billion strings representing deduplicated names of source code files collected by Software Heritage, the great library of source code...
    • ZIP
  • Dataset

    DNA 31-mers

    A 12 GB dataset containing all the ~367M unique 31-mers in the DNA sequences available in the Pizza&Chili Corpus (https://pizzachili.dcc.uchile.cl/texts.html). This dataset...
    • ZIP
  • Dataset

    DNA 12-mers

    A 179 MB dataset containing all the ~14M unique 12-mers in the DNA sequences available in the Pizza&Chili Corpus (https://pizzachili.dcc.uchile.cl/texts.html). This dataset...
    • ZIP
  • Dataset

    BioTAGME: A comprehensive platform for biological knowledge network analysis

    This network was built through BioTAGME, a system that combines TAGME, an entity-annotation framework based on the Wikipedia corpus, with a network-based inference methodology (i.e.,...
  • Dataset

    Broad Twitter Corpus

    The Broad Twitter Corpus is a named entity-annotated dataset of tweets, collected in order to capture temporal, spatial and social diversity. The goal of the corpus is to...
    • JSON
  • Dataset

    Learning to quantify: LeQua 2022 datasets

    The aim of LeQua 2022 (the 1st edition of the CLEF “Learning to Quantify” lab) is to allow the comparative evaluation of methods for “learning to quantify” in textual...
    • Zenodo link
  • Dataset

    Cherenkov Telescope Data for Ordinal Quantification

    This labeled data set is targeted at ordinal quantification. It appears in our research paper "Ordinal Quantification Through Regularization", which we have published at...
    • Zenodo
  • Dataset

    Cross-Lingual Dataset of Crisis-Related Social Media

    If you use this dataset, please cite the following paper: Fedor Vitiugin, Carlos Castillo: Cross-Lingual Query-Based Summarization of Crisis-Related Social Media: An Abstractive...
    • Cross-Lingual Dataset of ...
You can also access this registry using the API (see API Docs).