ArchiveSpark

ArchiveSpark is an Apache Spark framework for easy data access, processing, extraction, and derivation for Web archives and archival collections. It has a simple and expressive interface and an extensible architecture to support various derivation tools. It complies with and reuses the standard formats in the domain of Web archives, and its output is also in a standard, readable, and reusable format. An efficient selection and filtering process based on a metadata index makes this framework faster than alternative approaches without depending on any additional data stores.
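
The index-based selection described above can be illustrated with a small, library-independent sketch. This is not ArchiveSpark's API: `IndexEntry`, `select`, `load_payloads`, and `build_toy_archive`, as well as the field names and example URLs, are hypothetical and chosen only for illustration. The sketch shows just the general idea, assuming an index of lightweight metadata records that point into the archived data: filtering runs over the metadata first, and payload bytes are read afterwards only for the entries that remain.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical metadata record pointing at one payload in an archive blob."""
    url: str
    timestamp: str  # capture time, e.g. "20160601120000"
    mime: str
    status: int
    offset: int     # byte offset of the payload inside the blob
    length: int     # payload length in bytes


def build_toy_archive(
    records: Iterable[Tuple[str, str, str, int, bytes]]
) -> Tuple[bytes, List[IndexEntry]]:
    """Concatenate payloads into one blob and record where each one starts."""
    blob = b""
    index: List[IndexEntry] = []
    for url, timestamp, mime, status, payload in records:
        index.append(IndexEntry(url, timestamp, mime, status, len(blob), len(payload)))
        blob += payload
    return blob, index


def select(
    index: Iterable[IndexEntry],
    predicate: Callable[[IndexEntry], bool],
) -> Iterator[IndexEntry]:
    """Filter on metadata only; no payload bytes are read in this step."""
    return (entry for entry in index if predicate(entry))


def load_payloads(
    archive: bytes,
    entries: Iterable[IndexEntry],
) -> Iterator[Tuple[IndexEntry, bytes]]:
    """Read payloads only for the entries that survived the selection."""
    for entry in entries:
        yield entry, archive[entry.offset:entry.offset + entry.length]


# Toy data standing in for an archival collection (assumed, not real captures).
ARCHIVE, INDEX = build_toy_archive([
    ("http://example.org/", "20160601120000", "text/html", 200, b"<html>home</html>"),
    ("http://example.org/logo.png", "20160601120001", "image/png", 200, b"\x89PNG..."),
    ("http://example.org/old", "20160601120002", "text/html", 404, b"<html>gone</html>"),
])

# Keep only successful HTML captures, then touch the payload of those alone.
selected = select(INDEX, lambda e: e.mime == "text/html" and e.status == 200)
results = [(entry.url, body) for entry, body in load_payloads(ARCHIVE, selected)]
```

In this toy run only one of the three payloads is ever sliced out of the blob. On a real collection the same ordering would avoid reading records that the metadata already rules out.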

Data and Resources
  • ArchiveSpark on GitHub

Additional Info
Field | Value
Accessibility | Both
AccessibilityMode | Download
Availability | On-Line
Basic rights | Download
CreationDate | 2016-06-01
Creator | Holzmann, Helge, holzmann@L3s.de, orcid.org/0000-0003-4811-6902
Dependencies on Other SW | Apache Spark
External Identifier | github.com/helgeho/ArchiveSpark
Field/Scope of use | Any use
License term | Not specified
Owner | Holzmann, Helge, holzmann@L3s.de, orcid.org/0000-0003-4811-6902
ProgrammingLanguage | Scala
RelatedPaper | H. Holzmann, V. Goel and A. Anand. ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. 16th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL). Newark, New Jersey, USA. June 2016
Sublicense rights | No
Territory of use | World Wide
ThematicCluster | Web Analytics
UsageMode | Download
input | Any archival collection with metadata
output | Derivative dataset
system:type | Method
Management Info
Field | Value
Author | Anand Avishek
Maintainer | Helge Holzmann
Version | 2
Last Updated | 17 June 2023, 08:25 (CEST)
Created | 29 June 2018, 11:32 (CEST)