< Table of Contents Related Projects >

# Approach

In the traditional Spark / MapReduce approach, the full dataset is loaded first; irrelevant records are then filtered out, and the relevant ones are transformed into something more valuable by extracting and deriving meaningful information.

In contrast, ArchiveSpark incorporates lightweight metadata records about the items in a dataset, which are commonly available for archival collections. Basic operations such as filtering, deduplication, grouping, and sorting are performed on these metadata records before the records are enriched with additional information from the actual data. Hence, rather than starting from everything and removing unnecessary data, ArchiveSpark starts from metadata that gets extended, which leads to significant efficiency gains when working with archival collections:

*[Figure: the ArchiveSpark approach]*

The original version of ArchiveSpark was developed for web archives, with the metadata coming from CDX (capture index) and the data being stored in (W)ARC files. With the later introduction of Data Specifications, ArchiveSpark can now be used with any archival collection that provides metadata records along with the data.
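The metadata-first workflow described above can be illustrated with a minimal, self-contained sketch. This is not ArchiveSpark's actual API; the record fields, function names, and sample URLs below are assumptions chosen purely to demonstrate the pattern of filtering on cheap metadata before touching the expensive payloads.

```python
# Hypothetical sketch of the metadata-first pattern: filter on lightweight
# metadata records first, then enrich only the survivors, instead of loading
# every payload up front. (Illustrative names; not ArchiveSpark's API.)

from dataclasses import dataclass, field

@dataclass
class Record:
    # Lightweight metadata (e.g. from a CDX index): always available.
    url: str
    mime: str
    status: int
    # Derived information attached later; payloads are loaded only on demand.
    enrichments: dict = field(default_factory=dict)

def load_payload(record):
    # Stand-in for fetching the actual data record, e.g. from a (W)ARC file.
    return f"<html>content of {record.url}</html>"

def enrich(records, name, func):
    # Attach derived information to each remaining record.
    for r in records:
        r.enrichments[name] = func(r)
    return records

# Metadata-only view of the corpus; no payloads have been read yet.
corpus = [
    Record("https://example.org/a", "text/html", 200),
    Record("https://example.org/b.png", "image/png", 200),
    Record("https://example.org/c", "text/html", 404),
]

# 1. Filter cheaply on metadata alone.
selected = [r for r in corpus if r.mime == "text/html" and r.status == 200]

# 2. Enrich only the surviving records with data from the actual payload.
selected = enrich(selected, "payload", load_payload)

print([r.url for r in selected])  # only the one matching record remains
```

The efficiency gain comes from step 2 running on a (typically much smaller) filtered subset, so the expensive payload access is avoided for every record that the metadata already rules out.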

# Publications

ArchiveSpark is described in two research papers, which you should cite when you use ArchiveSpark in your work:

In addition to these publications, ArchiveSpark was used as a major component in the following works:

< Table of Contents Related Projects >