Pipeline steps

This page lists HTRflow's built-in pipeline steps.

Base step¶

`PipelineStep` ¶

Pipeline step base class.

Pipeline steps are implemented by subclassing this class and overriding the run() method.

`run` ¶

Run the pipeline step.

Parameters:

Name	Type	Description	Default
`collection`	`Collection`	Input collection	required

Returns:

Type	Description
`Collection`	A new collection, updated with the results of the pipeline step.

Source code in src/htrflow/pipeline/steps.py

def run(self, collection: Collection) -> Collection:
    """
    Run the pipeline step.

    Arguments:
        collection: Input collection

    Returns:
        A new collection, updated with the results of the pipeline step.
    """
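The subclassing pattern can be sketched as follows. This is a minimal illustration, not HTRflow code: `Collection` and `PipelineStep` are simplified stand-ins defined here so the example runs on its own, and the `UppercaseLabels` step and the `labels` attribute are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """Stand-in for HTRflow's Collection (hypothetical, for illustration only)."""
    labels: list[str] = field(default_factory=list)


class PipelineStep:
    """Stand-in base class mirroring the documented interface."""

    def run(self, collection: Collection) -> Collection:
        raise NotImplementedError


class UppercaseLabels(PipelineStep):
    """Hypothetical custom step: upper-cases every label."""

    def run(self, collection: Collection) -> Collection:
        # Return a new collection instead of mutating the input.
        return Collection(labels=[label.upper() for label in collection.labels])


original = Collection(labels=["page", "region"])
result = UppercaseLabels().run(original)
print(result.labels)
```

The custom step only has to honour the contract stated above: take a collection and return the updated one.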

Pre-processing steps¶

`ProcessImages` ¶

Bases: PipelineStep

Base for image preprocessing steps.

This is a base class for all image preprocessing steps. Subclasses define their image processing operation by overriding the op() method. This step does not alter the original image. Instead, a new copy of the image is saved in the directory specified by ProcessImages.output_directory. The PageNode's image path is then updated to point to the new processed image.

Attributes:

Name	Type	Description
`output_directory`	`str`	Where to write the processed images.

`op` ¶

Perform the image processing operation on image.

Parameters:

Name	Type	Description	Default
`image`	`NumpyImage`	Input image.	required

Returns:

Type	Description
`NumpyImage`	A processed version of `image`.

Source code in src/htrflow/pipeline/steps.py

def op(self, image: NumpyImage) -> NumpyImage:
    """
    Perform the image processing operation on `image`.

    Arguments:
        image: Input image.

    Returns:
        A processed version of `image`.
    """
    pass
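A subclass that overrides `op()` might look like the sketch below. To stay dependency-free it uses nested lists of grayscale values in place of a `NumpyImage`, and both the `ProcessImages` stand-in and the `Invert` step are hypothetical; saving the result to `output_directory` is left out.

```python
Image = list[list[int]]  # stand-in for NumpyImage: rows of 0-255 grayscale values


class ProcessImages:
    """Stand-in for the documented base class (hypothetical simplification)."""

    def op(self, image: Image) -> Image:
        raise NotImplementedError


class Invert(ProcessImages):
    """Hypothetical preprocessing step that inverts grayscale values."""

    def op(self, image: Image) -> Image:
        # Build a new image; the input is left untouched.
        return [[255 - pixel for pixel in row] for row in image]


source = [[0, 128], [255, 64]]
processed = Invert().op(source)
print(processed)
```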

`Binarization` ¶

Bases: ProcessImages

Binarize images.

Runs image binarization on the collection's images. Saves the resulting images in a directory named binarized. All subsequent pipeline steps will use the binarized images.

Example YAML:

- step: Binarization

Inference steps¶

`Inference` ¶

Bases: PipelineStep

Run model inference.

This is a generic pipeline step for any type of model inference. This step always runs the model on the images of the collection's leaf nodes.

Example YAML:

- step: Inference
  settings:
    model: DiT
    model_settings:
      model: ...

Source code in src/htrflow/pipeline/steps.py

def __init__(self, model_class, model_kwargs, generation_kwargs):
    self.model_class = model_class
    self.model_kwargs = model_kwargs
    self.generation_kwargs = generation_kwargs
    self.model = None
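To illustrate what "leaf nodes" means here, the sketch below collects the leaves of a small document tree. The `Node` class and the `leaves` helper are hypothetical stand-ins, not HTRflow's own types; the point is only that inference would target the deepest segments produced so far.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in for a document tree node (hypothetical)."""
    name: str
    children: list["Node"] = field(default_factory=list)


def leaves(node: Node) -> list[Node]:
    """Return the nodes without children, in depth-first order."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


page = Node("page", [
    Node("region0", [Node("line0"), Node("line1")]),
    Node("region1"),
])
leaf_names = [leaf.name for leaf in leaves(page)]
print(leaf_names)
```

In this toy tree, `region1` has not been segmented into lines, so it is itself a leaf and would be passed to the model alongside the two lines.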

`Segmentation` ¶

Bases: Inference

Run a segmentation model.

See Segmentation models for available models.

Example YAML:

- step: Segmentation
  settings:
    model: yolo
    model_settings:
      model: Riksarkivet/yolov9-regions-1

Source code in src/htrflow/pipeline/steps.py

def __init__(self, model_class, model_kwargs, generation_kwargs):
    self.model_class = model_class
    self.model_kwargs = model_kwargs
    self.generation_kwargs = generation_kwargs
    self.model = None

`TextRecognition` ¶

Bases: Inference

Run a text recognition model.

See Text recognition models for available models.

Example YAML:

- step: TextRecognition
  settings:
    model: TrOCR
    model_settings:
      model: Riksarkivet/trocr-base-handwritten-hist-swe-2

Source code in src/htrflow/pipeline/steps.py

def __init__(self, model_class, model_kwargs, generation_kwargs):
    self.model_class = model_class
    self.model_kwargs = model_kwargs
    self.generation_kwargs = generation_kwargs
    self.model = None

Post-processing steps¶

`Prune` ¶

Bases: PipelineStep

Remove nodes based on a given condition.

This is a generic pruning (filtering) step that removes nodes (segments, lines, words) based on the given condition. The condition is a function `f` such that `f(node) == True` if `node` should be removed from the tree. This step runs `f` on all nodes, at all segmentation levels. See the `RemoveLowTextConfidence[Lines|Regions|Pages]` steps for examples of how to formulate a condition.

Parameters:

Name	Type	Description	Default
`condition`	`Callable[[Node], bool]`	A function `f` such that `f(node) == True` if `node` should be removed from the document tree.	required

Source code in src/htrflow/pipeline/steps.py

def __init__(self, condition: Callable[[Node], bool]):
    """
    Arguments:
        condition: A function `f` such that `f(node) == True` if
            `node` should be removed from the document tree.
    """
    self.condition = condition
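The effect of a pruning condition can be sketched on a toy tree. Everything below is an assumption made for illustration: the `Node` class, its `score` attribute, and the `prune` helper are hypothetical and do not reflect how HTRflow stores or traverses its document tree.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    """Stand-in tree node (hypothetical); `score` is an invented attribute."""
    name: str
    score: float = 1.0
    children: list["Node"] = field(default_factory=list)


def prune(node: Node, condition: Callable[[Node], bool]) -> None:
    """Remove every descendant for which `condition` is true, at all levels."""
    node.children = [child for child in node.children if not condition(child)]
    for child in node.children:
        prune(child, condition)


page = Node("page", children=[
    Node("region0", children=[Node("line0", 0.9), Node("line1", 0.3)]),
    Node("region1", score=0.2, children=[Node("line2", 0.95)]),
])
prune(page, lambda node: node.score < 0.5)
remaining = [
    (region.name, [line.name for line in region.children])
    for region in page.children
]
print(remaining)
```

Note that removing `region1` also discards `line2`, even though that line alone would pass the condition.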

`RemoveLowTextConfidencePages` ¶

Bases: Prune

Remove all pages where the average text confidence score is below threshold.

Example YAML:

- step: RemoveLowTextConfidencePages
  settings:
    threshold: 0.8

Parameters:

Name	Type	Description	Default
`threshold`	`float`	Confidence score threshold.	required

Source code in src/htrflow/pipeline/steps.py

def __init__(self, threshold: float):
    """
    Arguments:
        threshold: Confidence score threshold.
    """
    super().__init__(
        lambda node: node.parent and node.parent.is_root() and metrics.average_text_confidence(node) < threshold
    )

`RemoveLowTextConfidenceRegions` ¶

Bases: Prune

Remove all regions where the average text confidence score is below threshold.

Example YAML:

- step: RemoveLowTextConfidenceRegions
  settings:
    threshold: 0.8

Parameters:

Name	Type	Description	Default
`threshold`	`float`	Confidence score threshold.	required

Source code in src/htrflow/pipeline/steps.py

def __init__(self, threshold: float):
    """
    Arguments:
        threshold: Confidence score threshold.
    """
    super().__init__(lambda node: node.is_region() and metrics.average_text_confidence(node) < threshold)

`RemoveLowTextConfidenceLines` ¶

Bases: Prune

Remove all lines with text confidence score below threshold.

Example YAML:

- step: RemoveLowTextConfidenceLines
  settings:
    threshold: 0.8

Parameters:

Name	Type	Description	Default
`threshold`	`float`	Confidence score threshold.	required

Source code in src/htrflow/pipeline/steps.py

def __init__(self, threshold: float):
    """
    Arguments:
        threshold: Confidence score threshold.
    """
    super().__init__(lambda node: node.is_line() and metrics.line_text_confidence(node) < threshold)

`ReadingOrderMarginalia` ¶

Bases: PipelineStep

Order regions and lines by reading order.

This step orders the pages' first- and second-level segments (corresponding to regions and lines). Both the regions and their lines are ordered using reading_order.order_regions.

Parameters:

Name	Type	Description	Default
`two_page`	`Literal['auto'] | bool`	Whether the page is a two-page spread. One of: `'auto'` (determine heuristically for each page using `layout.is_twopage`), `True` (assume all pages are spreads), or `False` (assume all pages are single pages).	`False`

Source code in src/htrflow/pipeline/steps.py

def __init__(self, two_page: Literal["auto"] | bool = False):
    """
    Arguments:
        two_page: Whether the page is a two-page spread. Three modes:
            - 'auto': determine heuristically for each page using
                `layout.is_twopage`
            - True: assume all pages are spreads
            - False: assume all pages are single pages
    """
    self.two_page = two_page
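The three `two_page` modes could be resolved per page roughly as sketched below. The heuristic is a placeholder: `looks_like_spread` is a hypothetical stand-in for `layout.is_twopage`, whose real signature and logic are not shown on this page, and `resolve_two_page` is an invented helper name.

```python
from typing import Literal, Union


def looks_like_spread(width: int, height: int) -> bool:
    """Hypothetical stand-in for `layout.is_twopage`: wide images count as spreads."""
    return width > height


def resolve_two_page(
    two_page: Union[Literal["auto"], bool], width: int, height: int
) -> bool:
    """Turn the three documented modes into a per-page yes/no decision."""
    if two_page == "auto":
        return looks_like_spread(width, height)
    return bool(two_page)


auto_wide = resolve_two_page("auto", 3000, 2000)
auto_tall = resolve_two_page("auto", 2000, 3000)
forced_on = resolve_two_page(True, 2000, 3000)
forced_off = resolve_two_page(False, 3000, 2000)
print(auto_wide, auto_tall, forced_on, forced_off)
```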

`OrderLines` ¶

Bases: PipelineStep

Order lines top-down.

This step orders the lines within each region top-down.

Example YAML:

- step: OrderLines
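Top-down ordering can be sketched as a sort on each line's vertical position. The dictionary layout and the `(xmin, ymin, xmax, ymax)` bounding-box convention are assumptions for this example; the actual step may use a different key, such as the line's center.

```python
# Lines of one region, in the arbitrary order a segmentation model returned them.
lines = [
    {"id": "line_b", "bbox": (40, 310, 900, 360)},
    {"id": "line_a", "bbox": (42, 120, 880, 170)},
    {"id": "line_c", "bbox": (38, 505, 910, 555)},
]

# Sort by the top edge (ymin) so that the uppermost line comes first.
ordered = sorted(lines, key=lambda line: line["bbox"][1])
ordered_ids = [line["id"] for line in ordered]
print(ordered_ids)
```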

`WordSegmentation` ¶

Bases: PipelineStep

Segment lines into words.

This step segments lines of text into words. It estimates the word boundaries from the recognized text, which means that this step must be run after a line-based text recognition model.

See also `models.huggingface.trocr.WordLevelTrOCR`, which is a version of TrOCR that outputs word-level text directly using a more sophisticated method.

Example YAML:

- step: WordSegmentation
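One simple way to estimate word boundaries from recognized text is to divide the line's width in proportion to character counts. The sketch below shows that idea only; it is an assumed heuristic, not necessarily the method this step uses, and `estimate_word_boxes` is a hypothetical helper.

```python
def estimate_word_boxes(text: str, xmin: int, xmax: int) -> list[tuple[str, int, int]]:
    """Split the horizontal span [xmin, xmax] among words by character count.

    Each character, including the spaces between words, is assumed to take up
    the same width.
    """
    words = text.split()
    if not words:
        return []
    total_chars = sum(len(word) for word in words) + len(words) - 1
    char_width = (xmax - xmin) / total_chars
    boxes = []
    cursor = 0  # position in characters from the start of the line
    for word in words:
        start = xmin + round(cursor * char_width)
        end = xmin + round((cursor + len(word)) * char_width)
        boxes.append((word, start, end))
        cursor += len(word) + 1  # skip the space after the word
    return boxes


boxes = estimate_word_boxes("hello world", 0, 1100)
print(boxes)
```

Because the estimate depends entirely on the recognized text, it only makes sense after a line-level text recognition step has run.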

Export steps¶

`Export` ¶

Bases: PipelineStep

Export results.

Exports the current state of the collection in the given format. This step is typically the last step of a pipeline, but it can be inserted at any pipeline stage. For example, you could put an Export step before a post-processing step in order to save a copy without post-processing. A pipeline can include as many Export steps as you like.

See Export formats or the `serialization.serialization` module for more details about each export format.

Example YAML:

- step: Export
  settings:
    format: Alto
    dest: alto-outputs

Parameters:

Name	Type	Description	Default
`dest`	`str`	Output directory.	required
`format`	`Literal['alto', 'page', 'txt', 'json']`	Output format as a string.	required

Source code in src/htrflow/pipeline/steps.py

def __init__(
    self,
    dest: str,
    format: Literal["alto", "page", "txt", "json"],
    **serializer_kwargs,
):
    """
    Arguments:
        dest: Output directory.
        format: Output format as a string.
    """
    self.serializer = get_serializer(format, **serializer_kwargs)
    self.dest = dest
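The effect of placing an Export step before and after a post-processing step can be simulated with plain functions. This is a toy model under stated assumptions: the collection is reduced to `(text, confidence)` pairs, the `exports` dictionary stands in for files on disk, and the `export` and `remove_low_confidence` helpers are hypothetical.

```python
from typing import Callable

Lines = list[tuple[str, float]]  # stand-in collection: (text, confidence) pairs
exports: dict[str, Lines] = {}   # stand-in for files written to disk


def export(dest: str) -> Callable[[Lines], Lines]:
    """Hypothetical export step: record a snapshot under `dest`, pass data on."""
    def step(lines: Lines) -> Lines:
        exports[dest] = list(lines)
        return lines
    return step


def remove_low_confidence(threshold: float) -> Callable[[Lines], Lines]:
    """Hypothetical post-processing step: drop lines below `threshold`."""
    def step(lines: Lines) -> Lines:
        return [line for line in lines if line[1] >= threshold]
    return step


pipeline = [export("raw"), remove_low_confidence(0.8), export("filtered")]
data: Lines = [("Stockholm", 0.95), ("???", 0.41)]
for step in pipeline:
    data = step(data)

print(exports["raw"])
print(exports["filtered"])
```

The first export captures the state before filtering, while the second only sees what the post-processing step kept.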

`ExportImages` ¶

Bases: PipelineStep

Export the collection's images.

This step writes all existing images (regions, lines, etc.) in the collection to disk. The exported images are the images that have been passed to previous Inference steps and the images that would be passed to a following Inference step.

Example YAML:

- step: ExportImages
  settings:
    dest: exported_images

Parameters:

Name	Type	Description	Default
`dest`	`str`	Destination directory.	required

Source code in src/htrflow/pipeline/steps.py

def __init__(self, dest: str):
    """
    Arguments:
        dest: Destination directory.
    """
    self.dest = dest
    os.makedirs(self.dest, exist_ok=True)

Misc¶

`Break` ¶

Bases: PipelineStep

Break the pipeline! Used for testing.

Example YAML:

- step: Break

`ImportSegmentation` ¶

Bases: PipelineStep

Import segmentation from PageXML files.

This step replicates the line segmentation from PageXML files. It can be used to import ground truth segmentation for evaluation purposes.

Example YAML:

- step: ImportSegmentation
  settings:
    source: /path/to/pageXMLs

Parameters:

Name	Type	Description	Default
`source`	`str`	Path to a directory with PageXML files. The XML files must have the same names as the input image files (ignoring the file extension).	required

Source code in src/htrflow/pipeline/steps.py

def __init__(self, source: str):
    """
    Arguments:
        source: Path to a directory with PageXML files. The XML files
            must have the same names as the input image files (ignoring
            the file extension).
    """
    self.source = source
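The naming requirement can be illustrated by pairing image files with XML files on their names minus the extension. The helper `match_pagexml` and the example paths are hypothetical; how the step itself looks up files is not shown on this page.

```python
from pathlib import Path
from typing import Optional


def match_pagexml(
    image_paths: list[str], xml_paths: list[str]
) -> dict[str, Optional[str]]:
    """Pair each image with the XML file that shares its name minus extension."""
    xml_by_stem = {Path(path).stem: path for path in xml_paths}
    return {image: xml_by_stem.get(Path(image).stem) for image in image_paths}


matches = match_pagexml(
    ["scans/letter_001.jpg", "scans/letter_002.png"],
    ["gt/letter_001.xml"],
)
print(matches)
```

An image without a same-named XML file, like `letter_002.png` here, has no segmentation to import.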
