SWH Filenames

A 69 GB dataset with ~2.3 billion strings representing deduplicated names of source code files collected by Software Heritage, the great library of source code (https://www.softwareheritage.org).

Refer to https://inria.hal.science/hal-04171177 for the technical description of how the filenames were retrieved from the Software Heritage Graph.

This dataset has been used for the evaluation of compressed string dictionaries (https://doi.org/10.1007/978-3-031-43980-3_16).

Number of newline-separated strings: 2294328376
Size of the zip-compressed data: 17532778513 bytes (17.53 GB)
Size of the uncompressed data: 68976134664 bytes (68.98 GB)
Encoding: UTF-8
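
Since the uncompressed data is close to 69 GB, it may be more practical to stream the strings directly from the zip archive than to extract it first. The following is a minimal sketch in Python, not an official loader. It assumes that the archive contains a single UTF-8 text file with one filename per line (the member name is not documented here, so the first member is used by default), and the function name `iter_filenames` and the path `swh_filenames.zip` are hypothetical.

```python
import io
import zipfile
from typing import Iterator, Optional


def iter_filenames(zip_path: str, member: Optional[str] = None) -> Iterator[str]:
    """Yield the strings of the dataset one by one, without extracting the archive.

    Assumption: the archive holds one newline-separated UTF-8 text file.
    If `member` is None, the first entry of the archive is read.
    """
    with zipfile.ZipFile(zip_path) as archive:
        name = member if member is not None else archive.namelist()[0]
        with archive.open(name) as raw:
            # newline="\n" makes "\n" the only line terminator, so a "\r"
            # inside a filename is preserved instead of being split on.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")
            for line in text:
                yield line.rstrip("\n")
```

For example, `itertools.islice(iter_filenames("swh_filenames.zip"), 10)` would give the first ten strings. From the figures above, a line averages about 30 bytes including the newline (68976134664 / 2294328376), so a full pass reads the data sequentially and keeps only one line in memory at a time.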

Data and Resources
  • SWH Filenames (ZIP)
Personal Data Attributes

Description: Personal Data related Information

Field Value
Anonymised No
ChildrenData No
Cross Border Authorised Yes
Data Protection Impact Assessment No
Ethics Committee Approval No
General Data Yes
Informed Consent Template No
Personal Data No
Personal data was manifestly made public by the data subject No
Sensitive Data No
Additional Info
Field Value
Accessibility Both
Accessibility Mode OnLine Access
Accessibility Mode Download
Availability On-Line
Basic rights Download
Basic rights Copying
Basic rights Modification
Creation Date 2023-04-27 09:00
Creator Vinciguerra, Giorgio, giorgio.vinciguerra@unipi.it, orcid.org/0000-0003-0328-7791
Dataset Citation https://inria.hal.science/hal-04171177
Dataset Re-Use Safeguards /
DiskSize 68976.13 MB
Field/Scope of use Any use
Format txt
Group Others
Language eng, English
License term 2023-11-30 09:00/2999-12-31 09:00
Manifestation Type Original
Processing Degree Secondary
Retention Period 2023-11-30 09:00
SoBigData Node SoBigData IT
SoBigData Node SoBigData EU
Sublicense rights No
Territory of use World Wide
Thematic Cluster Other
system:type Dataset
Management Info
Field Value
Author Vinciguerra Giorgio
Maintainer Vinciguerra Giorgio
Version 1
Last Updated 6 December 2023, 10:46 (CET)
Created 30 November 2023, 16:51 (CET)