Archive Spark

ArchiveSpark is an Apache Spark framework for easy data processing, extraction, and derivation on archival collections. Originally developed for use with Web archives, it has since been extended to support any archival dataset through Data Specifications. ArchiveSpark works with the lightweight metadata records that are commonly available for the items in archival collections. Basic operations such as filtering, deduplication, grouping, and sorting are performed on these metadata records before they are enriched with additional information from the actual data records. Rather than starting from everything and removing unnecessary data, ArchiveSpark starts from metadata and extends it, which leads to significant efficiency improvements when working with archival collections.
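The metadata-first idea described above can be illustrated with a small, library-free sketch. It is written in plain Python and does not use the ArchiveSpark API; the `Record` fields, the `PAYLOADS` store, and the `load_payload` and `metadata_first` helpers are all hypothetical names chosen for illustration. The point it tries to show is that filtering and deduplication touch only the lightweight metadata, and the expensive payload read happens only for the records that survive.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Lightweight metadata record (hypothetical fields, not ArchiveSpark's schema)."""
    url: str
    mime: str
    digest: str
    enrichments: dict = field(default_factory=dict)


# Stand-in for the heavy data records, keyed by content digest (assumption).
PAYLOADS = {
    "d1": b"<html>Hello</html>",
    "d2": b"binary image bytes",
    "d3": b"<html>World</html>",
}

# Tracks which payloads were actually read, to make the saving visible.
payload_reads = []


def load_payload(digest):
    """Simulate an expensive read of the actual data record."""
    payload_reads.append(digest)
    return PAYLOADS[digest]


def metadata_first(records):
    """Filter and deduplicate on metadata, then enrich only the survivors."""
    # Step 1: filter using metadata only, no payload access.
    html = [r for r in records if r.mime == "text/html"]

    # Step 2: deduplicate by digest, still metadata only.
    seen = set()
    unique = []
    for r in html:
        if r.digest not in seen:
            seen.add(r.digest)
            unique.append(r)

    # Step 3: enrich the remaining records from the actual data.
    for r in unique:
        r.enrichments["length"] = len(load_payload(r.digest))
    return unique


records = [
    Record("http://example.org/a", "text/html", "d1"),
    Record("http://example.org/logo.png", "image/png", "d2"),
    Record("http://example.org/a-copy", "text/html", "d1"),
    Record("http://example.org/b", "text/html", "d3"),
]
result = metadata_first(records)
print(len(result), "records enriched;", len(payload_reads), "payload reads")
```

In this toy run, four metadata records lead to only two payload reads, because the image and the duplicate page are discarded before any data record is opened. The slides and notebooks listed below are the place to look for the actual ArchiveSpark calls.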

The author did not intend to violate any copyright on figures or content. If you are the legal owner of any copyrighted content, please contact us and we will remove it immediately.

Data and Resources
  • Archive Spark Slides (PDF)

    Guide to working with archival data using ArchiveSpark

  • Archive Spark Jupyter Notebooks (ZIP)

    Recipes for working with archival data (Web archives and others) using...

Additional Info
Field                         Value
Availability                  On-Site
Course                        Archive Spark
Keywords                      Archiving, Data Processing, Spark, Archival Collections
Length                        21 slides
Prerequisites                 None
Provider Institution          LUH - Leibniz Universität Hannover – L3S Research Center
Target users                  Social Scientists, Data Scientists, PhD Students, Professionals
Thematic Cluster              Web Analytics [WA]
Training material typology    Slides
system:type                   TrainingMaterial
Management Info
Field           Value
Version         1
Last Updated    8 October 2021, 13:11 (CEST)
Created         29 June 2018, 11:34 (CEST)