CR 1.9 providing pointers into files and bytestreams

Leah Prescott edited this page May 28, 2014 · 1 revision

Proposed change:
Add BETYPE, BEGIN and END attributes to the <file> and <stream> elements. These attributes are optional.
The BETYPE attribute must have either the value BYTE or the value IDREF.

Description:
A file or stream that is embedded in another file is represented by nested <file> elements or by a <stream> element as a child of a <file> element. Because two different kinds of files are involved, the file that contains the other file or stream is called the container file. The container file is usually a ZIP, TAR or WARC file, but it can be any container format, e.g. a TIFF file containing several images (a Multi-TIFF file).
Besides pointing into binary files using byte offsets, the proposed mechanism must also allow pointing into XML files using XML IDs.
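
As an illustration of the description above, the following Python sketch parses a small, made-up METS fragment that carries the proposed attributes and collects the pointers it finds. The fragment, the IDs, the offsets, the function name embedded_pointers and the use of END together with IDREF are assumptions for illustration only; they are not taken from the change request.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ALLOWED_BETYPES = {"BYTE", "IDREF"}  # the two values named in the proposal

# Hypothetical fragment: IDs, MIME types and offsets are invented.
FRAGMENT = """\
<fileGrp xmlns="http://www.loc.gov/METS/">
  <file ID="TAR1" MIMETYPE="application/x-tar">
    <file ID="PAGE1" MIMETYPE="image/tiff" BETYPE="BYTE" BEGIN="512" END="20992"/>
  </file>
  <file ID="XML1" MIMETYPE="application/xml">
    <stream streamType="chapter" BETYPE="IDREF" BEGIN="ch1" END="ch1"/>
  </file>
</fileGrp>
"""


def embedded_pointers(xml_text):
    """Return (container ID, element kind, BETYPE, BEGIN, END) for each pointer."""
    root = ET.fromstring(xml_text)
    pointers = []
    for container in root.iter(f"{{{METS_NS}}}file"):
        for child in container:
            betype = child.get("BETYPE")
            if betype is None:
                continue  # the attributes are optional
            if betype not in ALLOWED_BETYPES:
                raise ValueError(f"unsupported BETYPE: {betype!r}")
            kind = child.tag.split("}")[1]  # "file" or "stream"
            pointers.append(
                (container.get("ID"), kind, betype,
                 child.get("BEGIN"), child.get("END"))
            )
    return pointers


for pointer in embedded_pointers(FRAGMENT):
    print(pointer)
```

BEGIN and END are kept as strings here because their interpretation depends on BETYPE: byte offsets for BYTE, XML IDs for IDREF.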

Use case:
Storing the location of a file or stream within a container in the METS file allows the data to be read directly from the container without loading and parsing the whole container file, provided the container format allows this.
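
A minimal sketch of this use case in Python: given byte offsets taken from BEGIN and END, the embedded data is read with a single seek and read, without parsing the container. The function name read_embedded and the sample bytes are hypothetical. The sketch also assumes that BEGIN is a zero-based offset and that END points just past the last byte; the change request does not state whether END is inclusive.

```python
import io


def read_embedded(container, begin, end):
    """Return the bytes in [begin, end) from a seekable binary file object.

    Assumption: END is exclusive. Adjust if END is meant to be inclusive.
    """
    if begin < 0 or end < begin:
        raise ValueError("invalid byte range")
    container.seek(begin)
    data = container.read(end - begin)
    if len(data) != end - begin:
        raise EOFError("container is shorter than the recorded range")
    return data


# Stand-in for a container file on disk: 8 header bytes, the payload, a trailer.
container = io.BytesIO(b"HEADER--" + b"embedded payload" + b"--TRAILER")
payload = read_embedded(container, 8, 24)
print(payload)
```

The same function works on a file opened with open(path, "rb"), as long as the embedded data is stored uncompressed and contiguous in the container.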

The definition of a container file is very vague. Common container file formats are ZIP and TAR, which were designed for bundling several files into one big file. But content files themselves may also be containers for certain types of bytestreams. For example, a TIFF file may contain several images or several manifestations of the same image (different resolutions).

The container file as such may not be the main focus of interest. The embedded files and bytestreams with their metadata are usually more important. Recording the location of a content file within a container file using byte offsets may make it possible to read that file even if the internal structure of the container file is unknown. This may prove very valuable, especially for the readability of new and still immature container formats such as WARC.
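
To make this concrete, the following sketch uses Python's standard tarfile module to determine the byte range of each member of an uncompressed TAR file, which is the kind of value that could be recorded in BEGIN and END. The recorded range is then enough to recover a member by plain slicing, with no knowledge of the TAR format. The helper names build_tar and byte_ranges and the sample data are hypothetical, and END is again assumed to be exclusive.

```python
import io
import tarfile


def build_tar(members):
    """Build an uncompressed TAR archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def byte_ranges(tar_bytes):
    """Map each regular member name to its (begin, end) data offsets."""
    ranges = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for info in tar.getmembers():
            if info.isfile():
                # offset_data is where the member's data starts in the archive
                ranges[info.name] = (info.offset_data,
                                     info.offset_data + info.size)
    return ranges


tar_bytes = build_tar({"page1.txt": b"first page", "page2.txt": b"second page"})
ranges = byte_ranges(tar_bytes)

# A later reader needs only the recorded offsets, not a TAR parser.
begin, end = ranges["page2.txt"]
print(tar_bytes[begin:end])
```

This only works for containers that store members uncompressed and contiguous. For a compressed ZIP entry, for example, the byte range would yield compressed data.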

Although the <file> and <stream> elements use the same attributes as the <area> element within the structMap, the semantics are very different. The <area> element points to an area within a content file that actually manifests the <div> object. A <div> object is typically not a file but represents a logical or physical entity such as a column, a page or a chapter.
Both kinds of references into a file may even be used at the same time:

the file set consists of page images
all page images are bundled into one container file (e.g. a TAR or ZIP file)
the physical structMap defines columns
the structMap contains pointers into the page image files using the <area> element (with XHTML coordinates defining the columns in the image file)
the content <file> elements contain byte offsets defining each file's position within the container file
The location information in the <area> element and in the <file> element uses different points of reference: the <area> coordinates are relative to the page image, while the <file> byte offsets are relative to the container file.
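
The combined scenario in the list above can be sketched as follows. The Python code parses a made-up METS document in which a page image is located inside a TAR container by byte offsets, while the structMap points to columns inside that image by coordinates. All IDs, offsets and coordinates, as well as the function name resolve_areas, are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/METS/"}

# Hypothetical document: one page image in a TAR container, two columns.
METS_DOC = """\
<mets xmlns="http://www.loc.gov/METS/">
  <fileSec>
    <fileGrp>
      <file ID="TAR1" MIMETYPE="application/x-tar">
        <file ID="IMG1" MIMETYPE="image/tiff" BETYPE="BYTE" BEGIN="512" END="20992"/>
      </file>
    </fileGrp>
  </fileSec>
  <structMap TYPE="PHYSICAL">
    <div TYPE="page">
      <div TYPE="column">
        <fptr><area FILEID="IMG1" SHAPE="RECT" COORDS="0,0,600,1800"/></fptr>
      </div>
      <div TYPE="column">
        <fptr><area FILEID="IMG1" SHAPE="RECT" COORDS="600,0,1200,1800"/></fptr>
      </div>
    </div>
  </structMap>
</mets>
"""


def resolve_areas(xml_text):
    """For each <area>, pair the image's byte range with the area's coordinates."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent links, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    files = {f.get("ID"): f for f in root.iterfind(".//m:file", NS)}
    resolved = []
    for area in root.iterfind(".//m:area", NS):
        image = files[area.get("FILEID")]
        resolved.append({
            # byte offsets are relative to the container file
            "container": parent_of[image].get("ID"),
            "bytes": (int(image.get("BEGIN")), int(image.get("END"))),
            # coordinates are relative to the page image
            "coords": tuple(int(c) for c in area.get("COORDS").split(",")),
        })
    return resolved


for entry in resolve_areas(METS_DOC):
    print(entry)
```

Each result carries both references side by side: the byte range says where the image sits in the container, the coordinates say where the column sits in the image.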



contributed by markus enders on Feb 15 5:58pm
