< Table of Contents | Related Projects > |
---|
In the traditional Spark / MapReduce approach, a dataset is loaded in full before irrelevant records are filtered out and the relevant ones are transformed into something more valuable by extracting and deriving meaningful information.
In contrast, ArchiveSpark incorporates lightweight metadata records about the items in a dataset, which are commonly available for archival collections. Basic operations such as filtering, deduplication, grouping, and sorting are performed on these metadata records before they are enriched with additional information from the actual data records. Hence, rather than starting from everything and removing unnecessary data, ArchiveSpark starts from metadata and extends it, which leads to significant efficiency improvements when working with archival collections.
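The metadata-first idea described above can be illustrated with a small, self-contained Python sketch. This is not ArchiveSpark's API (ArchiveSpark itself runs on Spark); the names `MetaRecord`, `PayloadStore`, `select`, and `enrich`, as well as the sample records, are hypothetical and exist only to show the order of operations: filter and deduplicate on cheap metadata first, then touch the expensive data records only for what remains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaRecord:
    """Hypothetical lightweight metadata record (loosely CDX-like)."""
    url: str
    timestamp: str
    mime: str
    digest: str


class PayloadStore:
    """Hypothetical stand-in for the expensive data records; counts accesses."""

    def __init__(self, payloads):
        self._payloads = payloads
        self.loads = 0

    def load(self, meta):
        self.loads += 1
        return self._payloads[(meta.url, meta.timestamp)]


def select(records):
    """Filter and deduplicate using metadata only; no payload is read here."""
    html = (r for r in records if r.mime == "text/html")
    seen = set()
    for r in sorted(html, key=lambda r: r.timestamp):
        if r.digest not in seen:  # keep the earliest capture per digest
            seen.add(r.digest)
            yield r


def enrich(records, store):
    """Extend the selected metadata with information derived from the payload."""
    for r in records:
        payload = store.load(r)  # the only place the actual data is accessed
        yield {"meta": r, "length": len(payload)}


metadata = [
    MetaRecord("a.org/", "20150101", "text/html", "d1"),
    MetaRecord("a.org/", "20150201", "text/html", "d1"),  # same content again
    MetaRecord("a.org/logo.png", "20150101", "image/png", "d2"),
    MetaRecord("b.org/", "20150301", "text/html", "d3"),
]
store = PayloadStore({
    ("a.org/", "20150101"): b"<html>a</html>",
    ("a.org/", "20150201"): b"<html>a</html>",
    ("a.org/logo.png", "20150101"): b"\x89PNG",
    ("b.org/", "20150301"): b"<html>bb</html>",
})

results = list(enrich(select(metadata), store))
```

In this toy example, four metadata records are reduced to two before any payload is read, so `store.loads` ends up smaller than the size of the collection. In a real archive the gap between the two is where the efficiency gain comes from.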
The original version of ArchiveSpark was developed for web archives, with the metadata coming from CDX (capture index) and the data being stored in (W)ARC files. With the later introduction of Data Specifications, ArchiveSpark can now be used with any archival collection that provides metadata records along with the data.
ArchiveSpark is described and published in two research papers, which you should cite when you use ArchiveSpark in your work:
- The first and main paper was the presentation of ArchiveSpark at JCDL 2016 (Best Paper Nominee). It describes the core ideas and includes benchmarks:
- We later presented the extensions to ArchiveSpark to make it a more universal / generic data processing platform for any archival collection at IEEE BigData 2017 (Short Paper):
In addition to these publications, ArchiveSpark was used as a major component in the following works:
- In combination with the temporal archive search engine Tempas, ArchiveSpark was used for a data analysis case starting from keyword queries through Tempas2ArchiveSpark:
- ArchiveSpark with ArchiveSpark2Triples was used to build a semantic layer for web archives in the following publication: