Broad Twitter Corpus

The Broad Twitter Corpus is a dataset of tweets annotated with named entities, collected to capture temporal, spatial and social diversity. Its goal is to provide a representative sample of named entities in social media. The annotations are of high quality, with high annotator agreement, and the corpus contains about 12,000 entity annotations of three types: Person, Location and Organization.

Data and Resources
  • Broad Twitter Corpus (JSON)
Additional Info
Field Value
Accessibility Both
AccessibilityMode OnLine Access
AccessibilityMode Download
Availability On-Line
Basic rights Download
Basic rights Copying
Basic rights Distribution
Basic rights Modification
Basic rights Communication
Basic rights Making available to the public
ChildrenData No
Consent obtained also covers the envisaged transfer of the personal data outside the EU No
Consent of the data subject No
CreationDate 2016-10-01
Creator Derczynski, Leon
DataProtectionDirective Data Protection Act 1998
External Identifier
Field/Scope of use Any use
Format JSON
Language eng, English
License term Not specified
ManifestationType Virtual
Personal data was manifestly made public by the data subject Yes
PersonalData No
PersonalSensitiveData Not specified
ProcessingDegree Secondary
RelatedPaper L. Derczynski, K. Bontcheva, I. Roberts. Broad Twitter Corpus: A Diverse Named Entity Recognition Resource. Proceedings of COLING, 2016
Restrictions on use Credit must be given and the license linked.
Sublicense rights No
Territory of use World Wide
ThematicCluster Text and Social Media Mining
TimeCoverage 2009-01-01/2014-12-31
system:type Dataset
Management Info
Field Value
Author Gorrell, Genevieve
Maintainer Leon Derczynski, Kalina Bontcheva, Ian Roberts
Version 1
Last Updated 22 December 2020, 18:06 (CET)
Created 29 June 2018, 11:33 (CEST)