
Export formats

AltoXML

Bases: Serializer

Alto XML serializer.

This serializer uses a Jinja template to produce Alto XML files that conform to version 4.4 of the Alto schema.

Features

  • Uses Alto version 4.4.
  • Includes detailed processing metadata in the <Description> block.
  • Supports rendering of region locations (printspace and margins). To enable this, first make sure that the regions are tagged by calling layout.label_regions(...) before serialization.
  • Will always produce a file, but the file may be empty.

Limitations

  • Two-level segmentation: The Alto schema only supports two-level segmentation, i.e. pages with regions and lines. Pages with deeper segmentation will be flattened so that only the innermost regions are rendered.
  • Only includes text confidence at the page level.
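The flattening described under "Two-level segmentation" can be pictured with a small sketch. It is not htrflow's implementation: the Node class and the innermost_regions helper are hypothetical stand-ins, and the sketch assumes that an "innermost region" is a region whose children are all lines.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a segmentation node (not htrflow's class)."""

    label: str
    is_line: bool = False
    children: list[Node] = field(default_factory=list)


def innermost_regions(node: Node) -> list[Node]:
    """Return the regions whose children are all lines, in document order."""
    if node.is_line:
        return []
    if node.children and all(child.is_line for child in node.children):
        return [node]
    regions = []
    for child in node.children:
        regions.extend(innermost_regions(child))
    return regions


# A page with three levels: page > region > sub-region > line.
page = Node("page", children=[
    Node("region0", children=[
        Node("region0.0", children=[Node("line0", is_line=True)]),
        Node("region0.1", children=[Node("line1", is_line=True)]),
    ]),
    Node("region1", children=[Node("line2", is_line=True)]),
])

flat = [region.label for region in innermost_regions(page)]
print(flat)  # ['region0.0', 'region0.1', 'region1']
```

In this sketch the outer region0 disappears from the output and only its sub-regions remain, which is the kind of information loss to expect when exporting deeply segmented pages to Alto.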

Examples

Example usage with the Export pipeline step:

- step: Export
  settings:
    dest: alto-output
    format: alto

Parameters:

Name           Type  Description                             Default
template_dir   -     Name of template directory.             _TEMPLATES_DIR
template_name  -     Name of template file in template_dir.  'alto'

Source code in src/htrflow/serialization/serialization.py
def __init__(self, template_dir=_TEMPLATES_DIR, template_name="alto"):
    """
    Arguments:
        template_dir: Name of template directory.
        template_name: Name of template file in `template_dir`.
    """
    env = Environment(loader=FileSystemLoader([template_dir, "."]))
    self.template = env.get_template(template_name)
    self.schema = os.path.join(_SCHEMA_DIR, "alto-4-4.xsd")

validate

Validate doc against the current schema

Parameters:

Name  Type  Description     Default
doc   str   Input document  required

Source code in src/htrflow/serialization/serialization.py
def validate(self, doc: str) -> None:
    """Validate `doc` against the current schema

    Arguments:
        doc: Input document

    Raises:
        xmlschema.XMLSchemaValidationError if the document violates
        the current schema.
    """
    xmlschema.validate(doc, self.schema)

PageXML

Bases: Serializer

Page XML serializer

This serializer uses a Jinja template to produce Page XML files that conform to the 2019-07-15 version of the schema.

Features

  • Includes line confidence scores.
  • Supports nested segmentation.

Limitations

  • Will not create an output file if the page is not serializable, for example if it does not contain any regions. (This behaviour differs from the Alto serializer, which instead would produce an empty file.)
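Because the two XML serializers treat non-serializable pages differently, code that consumes an export directory may want to handle both a missing file (Page XML) and an empty file (Alto). The following is a minimal sketch of such a check. The missing_or_empty helper and the "<page name>.xml" file naming are assumptions made for illustration, not part of htrflow.

```python
import tempfile
from pathlib import Path


def missing_or_empty(dest, page_names, suffix=".xml"):
    """Return the page names whose output file is absent or has zero bytes.

    Assumes one output file per page, named "<page name><suffix>".
    """
    problems = []
    for name in page_names:
        path = Path(dest) / f"{name}{suffix}"
        if not path.exists() or path.stat().st_size == 0:
            problems.append(name)
    return problems


with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp)
    (dest / "page_001.xml").write_text("<PcGts/>", encoding="utf-8")  # normal output
    (dest / "page_002.xml").write_text("", encoding="utf-8")  # empty file (Alto-like)
    # page_003.xml is deliberately never written (Page XML-like).
    problems = missing_or_empty(dest, ["page_001", "page_002", "page_003"])

print(problems)  # ['page_002', 'page_003']
```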

Examples

Example usage with the Export pipeline step:

- step: Export
  settings:
    dest: page-output
    format: page

Parameters:

Name           Type  Description                             Default
template_dir   -     Name of template directory.             _TEMPLATES_DIR
template_name  -     Name of template file in template_dir.  'page'

Source code in src/htrflow/serialization/serialization.py
def __init__(self, template_dir=_TEMPLATES_DIR, template_name="page"):
    """
    Arguments:
        template_dir: Name of template directory.
        template_name: Name of template file in `template_dir`.
    """
    env = Environment(loader=FileSystemLoader([template_dir, "."]))
    self.template = env.get_template(template_name)
    self.schema = os.path.join(_SCHEMA_DIR, "pagecontent.xsd")

validate

Validate doc against the current schema

Parameters:

Name  Type  Description     Default
doc   str   Input document  required

Source code in src/htrflow/serialization/serialization.py
def validate(self, doc: str) -> None:
    """Validate `doc` against the current schema

    Arguments:
        doc: Input document

    Raises:
        xmlschema.XMLSchemaValidationError if the document violates
        the current schema.
    """
    xmlschema.validate(doc, self.schema)

Json

Bases: Serializer

JSON serializer

This serializer extracts all content from the collection and saves it as JSON. The resulting JSON file(s) include properties that are not supported by Alto or Page XML, such as region confidence scores.

Examples

Example usage with the Export pipeline step:

- step: Export
  settings:
    dest: json-output
    format: json
    one_file: False
    indent: 2

Parameters:

Name      Type  Description                                                             Default
one_file  -     Export all pages of the collection to the same file. Defaults to False.  False
indent    -     The indentation level of the output JSON file(s).                        4

Source code in src/htrflow/serialization/serialization.py
def __init__(self, one_file=False, indent=4):
    """
    Arguments:
        one_file: Export all pages of the collection to the same file.
            Defaults to False.
        indent: The indentation level of the output json file(s).
    """
    self.one_file = one_file
    self.indent = indent
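
The effect of one_file and indent can be illustrated with the standard json module. This is a sketch under assumptions, not the serializer's actual code: the to_json_files helper, the page dictionaries, and the file names (including "collection.json") are hypothetical.

```python
import json

# Hypothetical, simplified page records; the real output has more fields.
pages = [
    {"label": "page_001", "lines": [{"text": "Hello", "confidence": 0.98}]},
    {"label": "page_002", "lines": [{"text": "World", "confidence": 0.91}]},
]


def to_json_files(pages, one_file=False, indent=4):
    """Return a mapping of (hypothetical) file name to JSON text."""
    if one_file:
        # All pages end up in a single document.
        return {"collection.json": json.dumps(pages, indent=indent)}
    # Otherwise, one document per page.
    return {f"{page['label']}.json": json.dumps(page, indent=indent) for page in pages}


separate = to_json_files(pages, one_file=False, indent=2)
combined = to_json_files(pages, one_file=True, indent=2)

print(sorted(separate))  # ['page_001.json', 'page_002.json']
print(sorted(combined))  # ['collection.json']
```

A smaller indent produces more compact files, while one_file trades many small files for a single larger one.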

PlainText

Bases: Serializer

Plain text serializer

This serializer extracts all text content from the collection and saves it as plain text. All other data (metadata, coordinates, geometries, confidence scores, and so on) is discarded.

Examples

Example usage with the Export pipeline step:

- step: Export
  settings:
    dest: text-output
    format: txt
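
To make "all other data is discarded" concrete, here is a minimal sketch of text extraction from a nested structure. The dictionary layout and the collect_text helper are assumptions for illustration and do not reflect htrflow's internal data model.

```python
def collect_text(node):
    """Collect every "text" value in document order, ignoring all other keys."""
    lines = []
    if "text" in node:
        lines.append(node["text"])
    for child in node.get("children", []):
        lines.extend(collect_text(child))
    return lines


# Hypothetical page: coordinates and confidence scores sit next to the text.
page = {
    "bbox": [0, 0, 2000, 3000],
    "children": [
        {
            "bbox": [100, 100, 1900, 600],
            "confidence": 0.9,
            "children": [
                {"text": "First line", "confidence": 0.97},
                {"text": "Second line", "confidence": 0.88},
            ],
        },
    ],
}

plain = "\n".join(collect_text(page))
print(plain)
```

Only the two text strings survive in plain; the bounding boxes and confidence values cannot be recovered from the result.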