Pipeline steps

This page lists HTRflow's built-in pipeline steps.

Base step

PipelineStep

Pipeline step base class.

Pipeline steps are implemented by subclassing this class and overriding the run() method.

run

Run the pipeline step.

Parameters:

Name        Type        Description       Default
collection  Collection  Input collection  required

Returns:

Type        Description
Collection  A new collection, updated with the results of the pipeline step.

Source code in src/htrflow/pipeline/steps.py
def run(self, collection: Collection) -> Collection:
    """
    Run the pipeline step.

    Arguments:
        collection: Input collection

    Returns:
        A new collection, updated with the results of the pipeline step.
    """

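As a rough illustration of the subclassing pattern described above, the sketch below defines a custom step that overrides run(). The Collection and PipelineStep classes here are simplified stand-ins written only so the example runs on its own, and the step UppercasePageNames is hypothetical. In real code the base class would come from HTRflow (judging by the source path above, probably htrflow.pipeline.steps).

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """Stand-in for HTRflow's Collection (assumption: just a list of page names)."""
    pages: list[str] = field(default_factory=list)


class PipelineStep:
    """Stand-in for the PipelineStep base class documented above."""

    def run(self, collection: Collection) -> Collection:
        raise NotImplementedError


class UppercasePageNames(PipelineStep):
    """Hypothetical step: returns a new collection with upper-cased page names."""

    def run(self, collection: Collection) -> Collection:
        # Build and return a new collection instead of mutating the input.
        return Collection(pages=[name.upper() for name in collection.pages])


original = Collection(pages=["page_1", "page_2"])
result = UppercasePageNames().run(original)
print(result.pages)
```

The step takes a collection and hands back an updated one, which matches the contract in the run() docstring.
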
Pre-processing steps

ProcessImages

Bases: PipelineStep

Base for image preprocessing steps.

This is a base class for all image preprocessing steps. Subclasses define their image processing operation by overriding the op() method. This step does not alter the original image. Instead, a new copy of the image is saved in the directory specified by ProcessImages.output_directory. The PageNode's image path is then updated to point to the new processed image.

Attributes:

Name              Type  Description
output_directory  str   Where to write the processed images.

op

Perform the image processing operation on image.

Parameters:

Name   Type        Description   Default
image  NumpyImage  Input image.  required

Returns:

Type        Description
NumpyImage  A processed version of image.

Source code in src/htrflow/pipeline/steps.py
def op(self, image: NumpyImage) -> NumpyImage:
    """
    Perform the image processing operation on `image`.

    Arguments:
        image: Input image.

    Returns:
        A processed version of `image`.
    """
    pass
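
A minimal sketch of what an op() override might look like, assuming NumpyImage is a NumPy array of 8-bit pixel values. The class InvertColors is hypothetical and is not part of HTRflow; in a real pipeline it would subclass ProcessImages, which is left out here so the example runs without HTRflow installed.

```python
import numpy as np


class InvertColors:
    """Hypothetical preprocessing step; in HTRflow it would subclass ProcessImages."""

    def op(self, image: np.ndarray) -> np.ndarray:
        # Return a new array; the input image is left untouched.
        return 255 - image


image = np.array([[0, 100], [200, 255]], dtype=np.uint8)
inverted = InvertColors().op(image)
print(inverted.tolist())
```

Because op() returns a new array rather than modifying its argument, it fits the behaviour described for ProcessImages, where the original image is never altered.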

Binarization

Bases: ProcessImages

Binarize images.

Runs image binarization on the collection's images. Saves the resulting images in a directory named binarized. All subsequent pipeline steps will use the binarized images.

Example YAML:

- step: Binarization

Inference steps

Inference

Bases: PipelineStep

Run model inference.

This is a generic pipeline step for any type of model inference. This step always runs the model on the images of the collection's leaf nodes.

Example YAML:

- step: Inference
  settings:
    model: DiT
    model_settings:
      model: ...

Source code in src/htrflow/pipeline/steps.py
def __init__(self, model_class, model_kwargs, generation_kwargs):
    self.model_class = model_class
    self.model_kwargs = model_kwargs
    self.generation_kwargs = generation_kwargs
    self.model = None
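
To make "leaf nodes" concrete, the sketch below collects the leaves of a small tree. The Node class is a stand-in written for this example (HTRflow's own node type is not shown on this page and carries more than a label), so treat it as an illustration of the traversal idea rather than the library's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in tree node (assumption: real nodes also hold images, text, etc.)."""
    label: str
    children: list["Node"] = field(default_factory=list)


def leaves(node: Node) -> list[Node]:
    """Return the leaf nodes below `node`, left to right."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


page = Node("page", [
    Node("region_0", [Node("line_0"), Node("line_1")]),
    Node("region_1"),
])
leaf_labels = [leaf.label for leaf in leaves(page)]
print(leaf_labels)
```

In this toy tree an inference step would see line_0, line_1 and region_1, but not region_0, because region_0 has already been segmented further.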

Segmentation

Bases: Inference

Run a segmentation model.

See Segmentation models for available models.

Example YAML:

- step: Segmentation
  settings:
    model: yolo
    model_settings:
      model: Riksarkivet/yolov9-regions-1

TextRecognition

Bases: Inference

Run a text recognition model.

See Text recognition models for available models.

Example YAML:

- step: TextRecognition
  settings:
    model: TrOCR
    model_settings:
      model: Riksarkivet/trocr-base-handwritten-hist-swe-2

Post-processing steps

Prune

Bases: PipelineStep

Remove nodes based on a given condition.

This is a generic pruning (filtering) step which removes nodes (segments, lines, words) based on the given condition. The condition is a function f such that f(node) == True if node should be removed from the tree. This step runs f on all nodes, at all segmentation levels. See the RemoveLowTextConfidencePages, RemoveLowTextConfidenceRegions and RemoveLowTextConfidenceLines steps for examples of how to formulate a condition.

Parameters:

Name       Type                    Description                                                                               Default
condition  Callable[[Node], bool]  A function f such that f(node) == True if node should be removed from the document tree.  required

Source code in src/htrflow/pipeline/steps.py
def __init__(self, condition: Callable[[Node], bool]):
    """
    Arguments:
        condition: A function `f` such that `f(node) == True` if
            `node` should be removed from the document tree.
    """
    self.condition = condition
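
The following sketch shows how a condition of this shape can drive a filter over a tree. The Node class, its confidence attribute and the prune() helper are all stand-ins invented for this example; how HTRflow traverses the tree internally is not described on this page.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    """Stand-in node with a hypothetical `confidence` attribute."""
    label: str
    confidence: float
    children: list["Node"] = field(default_factory=list)


def prune(node: Node, condition: Callable[[Node], bool]) -> None:
    """Remove every descendant of `node` for which `condition` is true."""
    node.children = [child for child in node.children if not condition(child)]
    for child in node.children:
        prune(child, condition)


page = Node("page", 1.0, [
    Node("line_0", 0.95),
    Node("line_1", 0.40),
    Node("line_2", 0.85),
])
threshold = 0.8
prune(page, lambda node: node.confidence < threshold)
remaining = [child.label for child in page.children]
print(remaining)
```

The lambda plays the role of condition: it returns True for nodes that should go, which is the same convention the RemoveLowTextConfidence steps use.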

RemoveLowTextConfidencePages

Bases: Prune

Remove all pages where the average text confidence score is below threshold.

Example YAML:

- step: RemoveLowTextConfidencePages
  settings:
    threshold: 0.8

Parameters:

Name       Type   Description                  Default
threshold  float  Confidence score threshold.  required

Source code in src/htrflow/pipeline/steps.py
def __init__(self, threshold: float):
    """
    Arguments:
        threshold: Confidence score threshold.
    """
    super().__init__(
        lambda node: node.parent and node.parent.is_root() and metrics.average_text_confidence(node) < threshold
    )

RemoveLowTextConfidenceRegions

Bases: Prune

Remove all regions where the average text confidence score is below threshold.

Example YAML:

- step: RemoveLowTextConfidenceRegions
  settings:
    threshold: 0.8

Parameters:

Name       Type   Description                  Default
threshold  float  Confidence score threshold.  required

Source code in src/htrflow/pipeline/steps.py
def __init__(self, threshold: float):
    """
    Arguments:
        threshold: Confidence score threshold.
    """
    super().__init__(lambda node: node.is_region() and metrics.average_text_confidence(node) < threshold)

RemoveLowTextConfidenceLines

Bases: Prune

Remove all lines with text confidence score below threshold.

Example YAML:

- step: RemoveLowTextConfidenceLines
  settings:
    threshold: 0.8

Parameters:

Name       Type   Description                  Default
threshold  float  Confidence score threshold.  required

Source code in src/htrflow/pipeline/steps.py
def __init__(self, threshold: float):
    """
    Arguments:
        threshold: Confidence score threshold.
    """
    super().__init__(lambda node: node.is_line() and metrics.line_text_confidence(node) < threshold)

ReadingOrderMarginalia

Bases: PipelineStep

Order regions and lines by reading order.

This step orders the pages' first- and second-level segments (corresponding to regions and lines). Both the regions and their lines are ordered using reading_order.order_regions.

Parameters:

Name      Type                    Description                             Default
two_page  Literal['auto'] | bool  Whether the page is a two-page spread.  False

Modes of two_page:
- 'auto': determine heuristically for each page using layout.is_twopage
- True: assume all pages are spreads
- False: assume all pages are single pages

Source code in src/htrflow/pipeline/steps.py
def __init__(self, two_page: Literal["auto"] | bool = False):
    """
    Arguments:
        two_page: Whether the page is a two-page spread. Three modes:
            - 'auto': determine heuristically for each page using
                `layout.is_twopage`
            - True: assume all pages are spreads
            - False: assume all pages are single pages
    """
    self.two_page = two_page
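
The three modes of two_page can be sketched as a small resolver function. Both resolve_two_page and the wider_than_tall heuristic are hypothetical: the latter merely stands in for layout.is_twopage, whose actual logic is not shown here and may differ.

```python
from typing import Callable, Literal, Union


def resolve_two_page(
    two_page: Union[Literal["auto"], bool],
    page: dict,
    heuristic: Callable[[dict], bool],
) -> bool:
    """Decide whether `page` should be treated as a two-page spread."""
    if two_page == "auto":
        # Decide per page using the supplied heuristic.
        return heuristic(page)
    # True or False: apply the same answer to every page.
    return bool(two_page)


def wider_than_tall(page: dict) -> bool:
    """Hypothetical stand-in for `layout.is_twopage`; the real heuristic may differ."""
    return page["width"] > page["height"]


spread = {"width": 4000, "height": 3000}
single = {"width": 2000, "height": 3000}

print(resolve_two_page("auto", spread, wider_than_tall))
print(resolve_two_page("auto", single, wider_than_tall))
print(resolve_two_page(True, single, wider_than_tall))
```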

OrderLines

Bases: PipelineStep

Order lines top-down.

This step orders the lines within each region top-down.

Example YAML:

- step: OrderLines
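
As an illustration of top-down ordering, the sketch below sorts lines by the top edge of their bounding boxes. The Line class and the choice of the top edge as sort key are assumptions made for this example; the page does not state which coordinate HTRflow uses.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Stand-in for a text line; `ymin` is the top edge in pixels (origin at top-left)."""
    text: str
    ymin: int


def order_top_down(lines: list[Line]) -> list[Line]:
    """Return the lines sorted from the top of the region to the bottom."""
    return sorted(lines, key=lambda line: line.ymin)


region = [Line("third", 310), Line("first", 20), Line("second", 150)]
ordered_texts = [line.text for line in order_top_down(region)]
print(ordered_texts)
```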

WordSegmentation

Bases: PipelineStep

Segment lines into words.

This step segments lines of text into words. It estimates the word boundaries from the recognized text, which means that this step must be run after a line-based text recognition model.

See also models.huggingface.trocr.WordLevelTrOCR, a version of TrOCR that outputs word-level text directly using a more sophisticated method.

Example YAML:

- step: WordSegmentation
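
One naive way to estimate word boundaries from recognized text is to assume every character takes up the same width and split the line's horizontal extent accordingly. The sketch below does exactly that. It is an assumption-laden illustration of the general idea, not HTRflow's actual algorithm, and estimate_word_boxes is a hypothetical helper.

```python
def estimate_word_boxes(text: str, xmin: int, xmax: int) -> list[tuple[str, int, int]]:
    """Split the span [xmin, xmax] among the words of `text`.

    Assumes equal-width characters, so each word gets a share of the
    line proportional to its position and length in the text.
    """
    if not text:
        return []
    char_width = (xmax - xmin) / len(text)
    boxes = []
    position = 0  # index of the current word's first character
    for word in text.split(" "):
        if word:
            start = xmin + round(position * char_width)
            end = xmin + round((position + len(word)) * char_width)
            boxes.append((word, start, end))
        position += len(word) + 1  # skip the space after the word
    return boxes


boxes = estimate_word_boxes("ab cde", xmin=0, xmax=60)
print(boxes)
```

Since the estimate depends entirely on the recognized text, it only makes sense once a text recognition step has run on the line.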

Export steps

Export

Bases: PipelineStep

Export results.

Exports the current state of the collection in the given format. This step is typically the last step of a pipeline, but it can be inserted at any pipeline stage. For example, you could put an Export step before a post-processing step in order to save a copy without post-processing. A pipeline can include as many Export steps as you like.

See Export formats or the serialization.serialization module for more details about each export format.

Example YAML:

- step: Export
  settings:
    format: alto
    dest: alto-outputs

Parameters:

Name    Type                                    Description                 Default
dest    str                                     Output directory.           required
format  Literal['alto', 'page', 'txt', 'json']  Output format as a string.  required

Source code in src/htrflow/pipeline/steps.py
def __init__(
    self,
    dest: str,
    format: Literal["alto", "page", "txt", "json"],
    **serializer_kwargs,
):
    """
    Arguments:
        dest: Output directory.
        format: Output format as a string.
    """
    self.serializer = get_serializer(format, **serializer_kwargs)
    self.dest = dest
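
To illustrate placing Export at more than one stage, here is a step list written as plain Python data that mirrors the structure of the YAML examples on this page. The destination directory names are made up for the sketch, and how such a list is loaded into a pipeline is not covered here.

```python
# Step names and settings mirror the YAML examples on this page;
# the destination directory names are hypothetical.
steps = [
    {
        "step": "TextRecognition",
        "settings": {
            "model": "TrOCR",
            "model_settings": {"model": "Riksarkivet/trocr-base-handwritten-hist-swe-2"},
        },
    },
    # First export: a copy of the results before any post-processing.
    {"step": "Export", "settings": {"format": "txt", "dest": "txt-unfiltered"}},
    {"step": "RemoveLowTextConfidenceLines", "settings": {"threshold": 0.8}},
    # Second export: the results after low-confidence lines were removed.
    {"step": "Export", "settings": {"format": "txt", "dest": "txt-filtered"}},
]

export_destinations = [s["settings"]["dest"] for s in steps if s["step"] == "Export"]
print(export_destinations)
```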

ExportImages

Bases: PipelineStep

Export the collection's images.

This step writes all existing images (regions, lines, etc.) in the collection to disk. The exported images are the images that have been passed to previous Inference steps and the images that would be passed to a following Inference step.

Example YAML:

- step: ExportImages
  settings:
    dest: exported_images

Parameters:

Name  Type  Description             Default
dest  str   Destination directory.  required

Source code in src/htrflow/pipeline/steps.py
def __init__(self, dest: str):
    """
    Arguments:
        dest: Destination directory.
    """
    self.dest = dest
    os.makedirs(self.dest, exist_ok=True)

Misc

Break

Bases: PipelineStep

Break the pipeline! Used for testing.

Example YAML:

- step: Break

ImportSegmentation

Bases: PipelineStep

Import segmentation from PageXML files.

This step replicates the line segmentation from PageXML files. It can be used to import ground truth segmentation for evaluation purposes.

Example YAML:

- step: ImportSegmentation
  settings:
    source: /path/to/pageXMLs

Parameters:

Name    Type  Description                                                                                                                            Default
source  str   Path to a directory with PageXML files. The XML files must have the same names as the input image files (ignoring the file extension).  required

Source code in src/htrflow/pipeline/steps.py
def __init__(self, source: str):
    """
    Arguments:
        source: Path to a directory with PageXML files. The XML files
            must have the same names as the input image files (ignoring
            the file extension).
    """
    self.source = source
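
The naming requirement can be illustrated with a small matcher that pairs image files with XML files by file name, ignoring the extension. match_by_stem is a hypothetical helper and the file paths are invented; what HTRflow does with an image that has no matching XML file is not stated on this page, so the sketch simply maps it to None.

```python
from pathlib import Path
from typing import Optional


def match_by_stem(image_paths: list[str], xml_paths: list[str]) -> dict[str, Optional[str]]:
    """Pair each image with the XML file that shares its name, ignoring the extension."""
    xml_by_stem = {Path(path).stem: path for path in xml_paths}
    return {image: xml_by_stem.get(Path(image).stem) for image in image_paths}


matches = match_by_stem(
    ["scans/letter_01.jpg", "scans/letter_02.png"],
    ["gt/letter_01.xml", "gt/other.xml"],
)
print(matches)
```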