ArchiveSpark Documentation

ArchiveSpark is a Java/JVM library, written in Scala and based on Apache Spark, that provides an API for easy and efficient access to web archives and other supported datasets. It can be used as part of your own project or stand-alone, through Scala's interactive shell or notebook tools such as Jupyter.

To get familiar with ArchiveSpark, and for most common use cases, we recommend using it with Jupyter. To help you get started, we provide a pre-packaged and pre-configured Docker container with ArchiveSpark and Jupyter ready to run, just one command away: https://github.com/helgeho/ArchiveSpark-docker

To learn more about ArchiveSpark, have a look at our GitHub repository.

Basics / Background

Approach and Publications
Related Projects

Getting Started

Installing ArchiveSpark with Jupyter
Using ArchiveSpark with Jupyter
General Usage
Recipes / Examples
Building ArchiveSpark (advanced)
Using ArchiveSpark as a Library (advanced)

API Docs

Configuration
ArchiveSpark Operations
Data Specifications (DataSpecs)
Enrichment Functions

Developer Documentation

Contribute
How to Implement DataSpecs
How to Implement Enrichment Functions
