metawarc: a command-line tool for metadata extraction from files in WARC (Web ARChive) archives

metawarc (pronounced me-ta-warc) is a command-line tool for processing WARC files. Its goal is to make CLI interaction with files inside WARC archives as easy as possible. It provides a simple metawarc command for extracting metadata from images, documents and other files inside WARC archives.

Contents

1 Main features
2 File formats supported
3 Installation
- 3.1 Any OS
- 3.2 Python version
4 Usage
5 Quickstart
6 Commands
- 6.1 Index command
- 6.2 Index content command
- 6.3 Stats command
- 6.4 Dump metadata command
- 6.5 List files command
- 6.6 Dump command

1 Main features

Built-in WARC support
Metadata extraction for many file formats
Low memory footprint
Documentation
Test coverage

2 File formats supported

MS Office OLE: .doc, .xls, .ppt
MS Office XML: .docx, .xlsx, .pptx
Adobe PDF: .pdf
Images: .png, .jpg, .tiff, .jpeg, .jp2

3 Installation

3.1 Any OS

A universal installation method (that works on Windows, Mac OS X, Linux, …, and always provides the latest version) is to use pip:

# Make sure we have an up-to-date version of pip and setuptools:
$ pip install --upgrade pip setuptools

$ pip install --upgrade metawarc

(If pip installation fails for some reason, you can try easy_install metawarc as a fallback.)

3.2 Python version

Python version 3.6 or greater is required.

4 Usage

Synopsis:

$ metawarc [command] [flags] inputfile

See also metawarc --help for general help and metawarc [command] --help for help on each command.

5 Quickstart

Index all WARC files in all subfolders

$ metawarc index '*/*.warc.gz'

View file extensions statistics

$ metawarc stats -m exts

List all PDF files

$ metawarc list-files -e pdf

Dump all records with size greater than 10M and file extension 'pdf' to the 'bigpdf' directory

$ metawarc dump -q "content_length > 10000000 and ext = 'pdf'" -o bigpdf

6 Commands

6.1 Index command

Generates a DuckDB database, 'warcindex.db', with WARC file metadata. For each WARC file it also generates two Parquet files in the 'data' directory; they inherit the WARC file name and carry the suffixes '_records' and '_headers'. All of them are registered in 'warcindex.db' in the tables 'files' and 'tables'.

Analyzes 'armstat.am.warc.gz' and writes 'warcindex.db' with records and headers metadata.

$ metawarc index armstat.am.warc.gz

Analyzes all WARC files in all subfolders and writes 'warcindex.db' with records and headers metadata.

$ metawarc index '*/*.warc.gz'
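The pattern is quoted so that the shell passes it to metawarc unexpanded. As a rough illustration of which paths such a pattern selects, the sketch below applies Python's standard glob module to a throwaway directory tree. Whether metawarc expands patterns in exactly this way is an assumption, and the directory and file names are made up for the example.

```python
import glob
import os
import tempfile


def match_warcs(root, pattern="*/*.warc.gz"):
    """Return sorted matches of pattern under root, relative to root, using '/' separators."""
    matches = glob.glob(os.path.join(root, pattern))
    return sorted(os.path.relpath(m, root).replace(os.sep, "/") for m in matches)


def demo():
    """Build a small hypothetical tree in a temp directory and match the pattern against it."""
    layout = [
        "a/site1.warc.gz",
        "b/site2.warc.gz",
        "top.warc.gz",           # in the root itself, not in a subfolder
        "a/deep/site3.warc.gz",  # two levels down
        "a/notes.txt",           # wrong extension
    ]
    with tempfile.TemporaryDirectory() as root:
        for rel in layout:
            path = os.path.join(root, *rel.split("/"))
            os.makedirs(os.path.dirname(path), exist_ok=True)
            open(path, "w").close()
        return match_warcs(root)


print(demo())
```

With the standard-library semantics used here, the pattern selects only files exactly one directory level down; a file in the current directory or in a nested subfolder is not matched.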

6.2 Index content command

Analyzes WARC file records and extracts relevant metadata and content for future reuse. Supported metadata types: ooxmldocs, oledocs, pdfs, images, links. Results are saved to a Parquet file in the 'data' directory with a suffix naming the metadata type, for example '_images' for images.

Collects PDF files metadata from all WARC files

$ metawarc index-content -t pdfs

Collects all links for the selected WARC file (it should already be listed in 'warcindex.db' after running the index command)

$ metawarc index-content -i armstat.am.warc.gz -t links

6.3 Stats command

Returns the total length and count of records for each MIME type or file extension.

Processes data in 'metawarc.db' and prints total length and count for each MIME type

$ metawarc stats -m mimes

Processes data in 'metawarc.db' and prints total length and count for each file extension

$ metawarc stats -m exts
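To make the output concrete, the aggregation described here (a record count and a total length per group) can be sketched in plain Python. The sample records and the helper name stats_by_key are hypothetical; metawarc computes its figures from its own index, not from code like this.

```python
from collections import defaultdict


def stats_by_key(records):
    """Group (key, content_length) pairs and return {key: (count, total_length)}."""
    totals = defaultdict(lambda: [0, 0])
    for key, length in records:
        totals[key][0] += 1       # number of records in this group
        totals[key][1] += length  # summed length of those records
    return {key: (count, total) for key, (count, total) in totals.items()}


# Hypothetical (extension, length) pairs standing in for indexed records.
sample = [("pdf", 1200), ("html", 300), ("pdf", 800), ("png", 50)]

for ext, (count, total) in sorted(stats_by_key(sample).items()):
    print(ext, count, total)
```

The same grouping applies to MIME types if the key is the content type instead of the extension.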

6.4 Dump metadata command

Dumps metadata from tables. Supported metadata types: pdfs, ooxmldocs, oledocs, images, links.

Exports PDF file metadata and writes it to 'pdfs_metadata.jsonl'

$ metawarc dump-metadata -t pdfs -o pdfs_metadata.jsonl
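The '.jsonl' extension suggests the JSON Lines layout, with one JSON object per line; that is an assumption about the output, not something stated above. Under that assumption, a minimal reader could look like the sketch below. The field names inside each object depend on the tool and are not specified here, so the helper (read_jsonl, a hypothetical name) only parses lines generically.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of the file at path."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


# Example use, assuming the export above has been run:
#     for record in read_jsonl("pdfs_metadata.jsonl"):
#         print(sorted(record))  # list whichever keys each record carries
```

Reading line by line keeps memory use flat even when the export is large.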

6.5 List files command

Prints a list of records with id, offset, length and URL using 'metawarc.db'. Accepts a list of MIME types, a list of file extensions, or a query given as a WHERE clause.

Prints all records with mime type (content type) 'application/zip'

$ metawarc list-files -m 'application/zip'

Prints all records with file extensions 'xls' and 'xlsx'

$ metawarc list-files -e xls,xlsx

Prints all records with size greater than 10M and file extension 'pdf'

$ metawarc list-files -q "content_length > 10000000 and ext = 'pdf'"
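Because the -q value is a WHERE clause, it can help to try a clause against a few sample rows first. The sketch below does that with an in-memory SQLite table from Python's standard library, not with metawarc's own index. The table name records and the sample rows are hypothetical; the column names content_length, ext and url are taken from the examples and descriptions above, and content_length is assumed to be in bytes.

```python
import sqlite3


def select_urls(rows, where):
    """Load (url, ext, content_length) rows into a scratch table and return URLs matching where."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE records (url TEXT, ext TEXT, content_length INTEGER)")
    con.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    # The clause is interpolated as-is, so only pass trusted input.
    cur = con.execute("SELECT url FROM records WHERE " + where + " ORDER BY url")
    urls = [row[0] for row in cur]
    con.close()
    return urls


# Hypothetical sample rows.
sample_rows = [
    ("http://example.org/big.pdf", "pdf", 25000000),
    ("http://example.org/small.pdf", "pdf", 4000),
    ("http://example.org/big.zip", "zip", 90000000),
]

print(select_urls(sample_rows, "content_length > 10000000 and ext = 'pdf'"))
```

Only the first row satisfies both conditions: the small PDF fails the size test and the ZIP fails the extension test.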

6.6 Dump command

Dumps record payloads as files using 'metawarc.db' as the WARC index. Accepts a list of MIME types, a list of file extensions, or a query given as a WHERE clause. Adds a CSV file, 'records.csv', to the output directory with basic data about each dumped record.

Dumps all records with mime type (content type) 'application/zip' to 'allzip' directory

$ metawarc dump -m 'application/zip' -o allzip

Dumps all records with file extensions 'xls' and 'xlsx' to 'sheets' directory

$ metawarc dump -e xls,xlsx -o sheets

Dumps all records with size greater than 10M and file extension 'pdf' to 'bigpdf' directory

$ metawarc dump -q "content_length > 10000000 and ext = 'pdf'" -o bigpdf
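When the dump command is driven from a script, passing the arguments as a list avoids shell quoting problems with the query string. The sketch below builds such a list using only the flags shown above (-m, -e, -q, -o). The helper names are hypothetical, and two choices are assumptions made for this example: it insists on exactly one filter per call, and it joins lists of MIME types with commas the same way the extension example does.

```python
import subprocess


def build_dump_args(out_dir, mimes=None, exts=None, query=None):
    """Build an argument list for `metawarc dump` with exactly one filter."""
    filters = [("-m", mimes), ("-e", exts), ("-q", query)]
    chosen = [(flag, value) for flag, value in filters if value]
    if len(chosen) != 1:
        raise ValueError("pass exactly one of mimes, exts or query")
    flag, value = chosen[0]
    if not isinstance(value, str):
        value = ",".join(value)  # e.g. ["xls", "xlsx"] -> "xls,xlsx"
    return ["metawarc", "dump", flag, value, "-o", out_dir]


def run_dump(out_dir, **filters):
    """Run the command; needs metawarc on PATH and an index built beforehand."""
    return subprocess.run(build_dump_args(out_dir, **filters), check=True)


print(build_dump_args("sheets", exts=["xls", "xlsx"]))
print(build_dump_args("bigpdf", query="content_length > 10000000 and ext = 'pdf'"))
```

Only build_dump_args is exercised here; run_dump actually invokes the tool and raises if it exits with an error.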
