Wikipedia Word Embeddings

The embeddings were created by applying word2vec (skip-gram) to a corpus of non-stub Wikipedia articles from a December 2015 English dump, with the following parameters:

    -cbow 0 -size 200 -window 10 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 0 -iter 10

The file uses the standard textual format for word embeddings and can be read line by line. The first line gives the dimensions of the embedding matrix (number of words, number of dimensions):

    2847710 200

Each of the following lines holds one word followed by its vector:

    word1 v_1_1 ... v_1_200
    ...
    word2847710 v_2847710_1 ... v_2847710_200

where v_i_j is the value of the i-th word embedding in the j-th dimension.
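The format described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader for this resource: the function name read_embeddings and the three-dimensional sample data are assumptions made for illustration, and the real file has 200 dimensions per word. It streams the input instead of loading it whole, which matters for a file of about 5.4GB.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_embeddings(handle: TextIO) -> Iterator[Tuple[str, List[float]]]:
    """Yield (word, vector) pairs from a word2vec text-format stream.

    Hypothetical helper: assumes the first line is "<n_words> <n_dims>"
    and every following line is a word followed by n_dims numbers.
    """
    header = handle.readline().split()
    n_dims = int(header[1])  # header[0] is the number of words
    for line in handle:
        parts = line.split()
        if len(parts) != n_dims + 1:
            continue  # skip blank or malformed lines
        yield parts[0], [float(x) for x in parts[1:]]


# Tiny made-up sample in the same layout (2 words, 3 dimensions).
sample = "2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n"
vectors = dict(read_embeddings(io.StringIO(sample)))
print(vectors["dog"])
```

For the downloaded resource, the same function could be given a file handle opened with open(path, encoding="utf-8"), where path is wherever the file was saved locally. The UTF-8 encoding is an assumption, as the catalogue entry does not state it.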

Tags
Data and Resources
  • Embeddings

Additional Info
Field Value
Accessibility Both
AccessibilityMode OnLine Access
AccessibilityMode Download
Area Societal Debates
Attribution requirements Yes, Maciej Rybinski
Availability On-Line
Basic rights Download
Basic rights Copying
Basic rights Modification
ChildrenData No
Consent obtained also covers the envisaged transfer of the personal data outside the EU N/A (Not applicable)
Consent of the data subject N/A (Not applicable)
CreationDate 2018-05-01
Creator Rybinski, Maciej
DataProtectionDirective N/A
DiskSize 5400
Display requirements
Distribution requirements
External Identifier
Field/Scope of use Any use
Format Text
FormatSchema
IP/Copyrights
Item URL http://data.d4science.org/ctlg/ResourceCatalogue/wikipedia_word_embeddings
Language eng, English
License term /Not specified
ManifestationType Virtual
Personal data was manifestly made public by the data subject N/A (Not applicable)
PersonalData No
PersonalSensitiveData No
ProcessingDegree Secondary
RelatedPaper
Requirement of non-disclosure (confidentiality mark)
Restrictions on use See GNU LGPL
Semantic Coverage Wikipedia
Size 5.4GB
Sublicense rights No
Territory of use World Wide
ThematicCluster Text and Social Media Mining
TimeCoverage 2015-01-01/2015-12-31
spatial
system:type SoBigData.eu: Dataset
Management Info
Field Value
Author Gorrell Genevieve
Maintainer Gorrell Genevieve
Version 1
Last Updated 6 July 2018, 16:42 (CEST)
Created 6 July 2018, 16:42 (CEST)