Export formats
AltoXML
Bases: Serializer
Alto XML serializer.
This serializer uses a jinja template to produce Alto XML files according to version 4.4 of the Alto schema.
Features
- Uses Alto version 4.4.
- Includes detailed processing metadata in the <Description> block.
- Supports rendering of region locations (printspace and margins). To enable this, first make sure that the regions are tagged by calling layout.label_regions(...) before serialization.
- Will always produce a file, but the file may be empty.
Limitations
- Two-level segmentation: The Alto schema only supports two-level segmentation, i.e. pages with regions and lines. Pages with deeper segmentation will be flattened so that only the innermost regions are rendered.
- Only includes text confidence at the page level.
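The flattening described in the first limitation can be pictured with a minimal sketch. The Node class and the innermost_regions helper below are hypothetical illustrations, not htrflow's data model or implementation. The sketch assumes that a node without children is a text line and that an innermost region is a node whose children are all lines.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical segmentation node; not htrflow's real class."""
    label: str
    children: list["Node"] = field(default_factory=list)

    def is_line(self) -> bool:
        # In this sketch, a node without children is a text line.
        return not self.children


def innermost_regions(node: Node) -> list[Node]:
    """Return the nodes whose children are all lines, in document order."""
    if node.is_line():
        return []
    if all(child.is_line() for child in node.children):
        return [node]
    regions = []
    for child in node.children:
        # Intermediate regions are skipped; only their innermost
        # descendants survive the flattening.
        regions.extend(innermost_regions(child))
    return regions


# A page with three levels: page > region > sub-region > line.
page = Node("page", [
    Node("region0", [
        Node("region0.0", [Node("line0"), Node("line1")]),
        Node("region0.1", [Node("line2")]),
    ]),
    Node("region1", [Node("line3")]),
])

flat = [region.label for region in innermost_regions(page)]
print(flat)  # ['region0.0', 'region0.1', 'region1']
```

In this sketch the intermediate node region0 disappears from the output, which matches the two-level structure (regions and lines) that the limitation describes.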
Examples
Example usage with the Export pipeline step:
Source code in src/htrflow/serialization/serialization.py
validate
Validate doc against the current schema.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc | str | Input document | required |
Source code in src/htrflow/serialization/serialization.py
PageXML
Bases: Serializer
Page XML serializer
This serializer uses a jinja template to produce Page XML files according to the 2019-07-15 version of the schema.
Features
- Includes line confidence scores.
- Supports nested segmentation.
Limitations
- Will not create an output file if the page is not serializable, for example if it does not contain any regions. (This behaviour differs from the Alto serializer, which produces an empty file instead.)
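Because a non-serializable page leaves no file behind, code that consumes the export may want to check which pages are missing. The following is a minimal sketch of such a check. The missing_outputs helper, the page names, and the one-file-per-page naming with an .xml suffix are all assumptions made for illustration and are not part of htrflow.

```python
from pathlib import Path
import tempfile


def missing_outputs(page_names, out_dir, suffix=".xml"):
    """Return the page names that have no corresponding file in out_dir.

    Hypothetical helper: assumes each page is written to <name><suffix>.
    """
    out_dir = Path(out_dir)
    return [name for name in page_names if not (out_dir / f"{name}{suffix}").exists()]


with tempfile.TemporaryDirectory() as tmp:
    # Simulate an export in which only the first page was serializable.
    (Path(tmp) / "page_0001.xml").write_text("<PcGts/>", encoding="utf-8")
    skipped = missing_outputs(["page_0001", "page_0002"], tmp)

print(skipped)  # ['page_0002']
```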
Examples
Example usage with the Export pipeline step:
Source code in src/htrflow/serialization/serialization.py
validate
Validate doc against the current schema.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doc | str | Input document | required |
Source code in src/htrflow/serialization/serialization.py
Json
Bases: Serializer
JSON serializer
This serializer extracts all content from the collection and saves it as JSON. The resulting JSON file(s) include properties that are not supported by Alto or Page XML, such as region confidence scores.
Examples
Example usage with the Export pipeline step:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
one_file | | Export all pages of the collection to the same file. Defaults to False. | False |
indent | | The indentation level of the output JSON file(s). | 4 |
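The effect of the two parameters can be illustrated with a small sketch built on Python's standard json module. The export_json function, the output file names, and the page dictionaries are hypothetical and only mirror the parameter names and defaults in the table above. They do not reproduce htrflow's serializer or its output layout.

```python
import json


def export_json(pages, one_file=False, indent=4):
    """Return a mapping of output file name to JSON text.

    Hypothetical stand-in for a JSON export: `pages` maps a page name
    to its already extracted content.
    """
    if one_file:
        # All pages end up in a single document.
        return {"collection.json": json.dumps(list(pages.values()), indent=indent)}
    # Otherwise each page gets its own document.
    return {f"{name}.json": json.dumps(content, indent=indent) for name, content in pages.items()}


pages = {
    "page_0001": {"label": "page_0001", "text": "first page"},
    "page_0002": {"label": "page_0002", "text": "second page"},
}

separate = export_json(pages)                           # defaults: one file per page, indent 4
combined = export_json(pages, one_file=True, indent=2)  # single file, indent 2

print(sorted(separate))  # ['page_0001.json', 'page_0002.json']
print(list(combined))    # ['collection.json']
```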