Pipeline steps
This page lists HTRflow's built-in pipeline steps.
Base step
PipelineStep
Pipeline step base class.
Pipeline steps are implemented by subclassing this class and
overriding the run() method.
run
Run the pipeline step.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
collection | Collection | Input collection | required |
Returns:
Type | Description |
---|---|
Collection | A new collection, updated with the results of the pipeline step. |
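The subclassing pattern described above can be sketched as follows. This is a minimal illustration, not HTRflow's code: the `Collection` and `PipelineStep` classes below are simplified stand-ins defined locally, and `UppercaseLabels` and `run_pipeline` are hypothetical names introduced only for the example.

```python
# Stand-ins for illustration only; the real classes live in HTRflow.
class Collection:
    """Hypothetical minimal collection that holds a list of page labels."""

    def __init__(self, pages):
        self.pages = list(pages)


class PipelineStep:
    """Stand-in base class that mirrors the documented run() interface."""

    def run(self, collection):
        raise NotImplementedError


class UppercaseLabels(PipelineStep):
    """Hypothetical step that returns a new collection with upper-cased labels."""

    def run(self, collection):
        # Build and return a new collection instead of mutating the input.
        return Collection(label.upper() for label in collection.pages)


def run_pipeline(steps, collection):
    """Feed the output of each step into the next one (assumed driver loop)."""
    for step in steps:
        collection = step.run(collection)
    return collection


source = Collection(["page_1", "page_2"])
result = run_pipeline([UppercaseLabels()], source)
print(result.pages)
```

The input collection is left untouched here, which matches the documented return value of "a new collection".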
Pre-processing steps
ProcessImages
Bases: PipelineStep
Base for image preprocessing steps.
This is a base class for all image preprocessing steps. Subclasses
define their image processing operation by overriding the op()
method. This step does not alter the original image. Instead, a new
copy of the image is saved in the directory specified by
ProcessImages.output_directory. The PageNode's image path is then
updated to point to the new processed image.
Attributes:
Name | Type | Description |
---|---|---|
output_directory | str | Where to write the processed images. |
op
Perform the image processing operation on image.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
image | NumpyImage | Input image. | required |
Returns:
Type | Description |
---|---|
NumpyImage | A processed version of image. |
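The copy-then-repoint behaviour described for ProcessImages might look roughly like the sketch below. It is an assumption-laden illustration rather than HTRflow's implementation: images are plain nested lists instead of NumPy arrays, a page is a hypothetical dict with `image` and `image_path` keys, and `ProcessImagesSketch`, `InvertSketch` and `process` are names made up for the example.

```python
import tempfile
from pathlib import Path


class ProcessImagesSketch:
    """Stand-in for the documented pattern: subclasses override op()."""

    def __init__(self, output_directory):
        self.output_directory = Path(output_directory)

    def op(self, image):
        raise NotImplementedError

    def process(self, page):
        # Apply the operation to the pixels; the input page is not modified.
        processed = self.op(page["image"])
        self.output_directory.mkdir(parents=True, exist_ok=True)
        new_path = self.output_directory / Path(page["image_path"]).name
        # Placeholder for real image encoding: store a text dump of the pixels.
        new_path.write_text(repr(processed))
        # Return a page whose image path points to the processed copy.
        return {**page, "image": processed, "image_path": str(new_path)}


class InvertSketch(ProcessImagesSketch):
    """Hypothetical operation: invert 8-bit grayscale pixel values."""

    def op(self, image):
        return [[255 - px for px in row] for row in image]


original = {"image_path": "scans/page_1.png", "image": [[0, 255], [100, 200]]}
with tempfile.TemporaryDirectory() as tmp:
    updated = InvertSketch(tmp).process(original)
    copy_exists = Path(updated["image_path"]).exists()
print(updated["image"], copy_exists)
```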
Binarization
Inference steps
Inference
Bases: PipelineStep
Run model inference.
This is a generic pipeline step for any type of model inference. This step always runs the model on the images of the collection's leaf nodes.
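To illustrate what "leaf nodes" means here, the sketch below walks a small tree and collects the nodes without children. The `Node` dataclass and the `leaves` helper are hypothetical stand-ins, not HTRflow classes; the point is only that a model would see lines where a region has been segmented into lines, and the region itself where it has not.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a node in the document tree."""

    label: str
    children: list["Node"] = field(default_factory=list)


def leaves(node):
    """Yield every node that has no children, in depth-first order."""
    if not node.children:
        yield node
    else:
        for child in node.children:
            yield from leaves(child)


page = Node("page", [
    Node("region_1", [Node("line_1"), Node("line_2")]),
    Node("region_2"),  # not segmented further, so it is itself a leaf
])
leaf_labels = [leaf.label for leaf in leaves(page)]
print(leaf_labels)
```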
Example YAML:
Source code in src/htrflow/pipeline/steps.py
Segmentation
Bases: Inference
Run a segmentation model.
See Segmentation models for available models.
Example YAML:
Source code in src/htrflow/pipeline/steps.py
TextRecognition
Bases: Inference
Run a text recognition model.
See Text recognition models for available models.
Example YAML:
- step: TextRecognition
  settings:
    model: TrOCR
    model_settings:
      model: Riksarkivet/trocr-base-handwritten-hist-swe-2
Source code in src/htrflow/pipeline/steps.py
Post-processing steps
Prune
Bases: PipelineStep
Remove nodes based on a given condition.
This is a generic pruning (filtering) step which removes nodes
(segments, lines, words) based on the given condition. The
condition is a function f such that f(node) == True if node
should be removed from the tree. This step runs f on all nodes,
at all segmentation levels. See the
RemoveLowTextConfidence[Lines|Regions|Pages] steps for examples
of how to formulate condition.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
condition | Callable[[Node], bool] | A function f such that f(node) == True if node should be removed from the tree. | required |
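A condition of this kind can be sketched on a toy tree as shown below. Everything here is illustrative: `Node` is a local stand-in with an assumed `confidence` attribute, and `prune` is a hypothetical helper that applies the condition at every level below the root. It is not the code behind the Prune step.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a node in the document tree."""

    label: str
    confidence: float = 1.0
    children: list["Node"] = field(default_factory=list)


def prune(node, condition):
    """Return a copy of node without the descendants for which condition is True."""
    kept = [prune(child, condition) for child in node.children if not condition(child)]
    return Node(node.label, node.confidence, kept)


def low_confidence(node):
    """Example condition: True means the node should be removed."""
    return node.confidence < 0.5


page = Node("page", children=[
    Node("region_1", children=[Node("line_1", 0.9), Node("line_2", 0.2)]),
    Node("region_2", confidence=0.1, children=[Node("line_3", 0.8)]),
])
pruned = prune(page, low_confidence)
region_labels = [region.label for region in pruned.children]
line_labels = [line.label for line in pruned.children[0].children]
print(region_labels, line_labels)
```

Removing a node also removes its subtree, so `line_3` disappears together with `region_2` even though its own score is high.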
Source code in src/htrflow/pipeline/steps.py
RemoveLowTextConfidencePages
Bases: Prune
Remove all pages where the average text confidence score is below threshold.
Example YAML:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
threshold | float | Confidence score threshold. | required |
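The page-level rule can be illustrated with plain numbers. The sketch below assumes each page is represented by a list of per-line confidence scores, and that a page without any recognized lines counts as 0.0. Both are assumptions made for the example, as are the function names.

```python
from statistics import mean


def average_confidence(line_scores):
    """Mean line confidence; an empty page is treated as 0.0 (an assumption)."""
    return mean(line_scores) if line_scores else 0.0


def remove_low_confidence_pages(pages, threshold):
    """Keep only pages whose average confidence is not below threshold."""
    return {
        name: scores
        for name, scores in pages.items()
        if average_confidence(scores) >= threshold
    }


pages = {
    "page_a": [0.9, 0.8],  # average 0.85, kept
    "page_b": [0.9, 0.2],  # average 0.55, removed
    "page_c": [],          # no text, removed under the assumption above
}
kept = remove_low_confidence_pages(pages, threshold=0.6)
print(sorted(kept))
```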
Source code in src/htrflow/pipeline/steps.py
RemoveLowTextConfidenceRegions
Bases: Prune
Remove all regions where the average text confidence score is below threshold.
Example YAML:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
threshold | float | Confidence score threshold. | required |
Source code in src/htrflow/pipeline/steps.py
RemoveLowTextConfidenceLines
Bases: Prune
Remove all lines with text confidence score below threshold.
Example YAML:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
threshold | float | Confidence score threshold. | required |
Source code in src/htrflow/pipeline/steps.py
ReadingOrderMarginalia
Bases: PipelineStep
Order regions and lines by reading order.
This step orders the pages' first- and second-level segments
(corresponding to regions and lines). Both the regions and their
lines are ordered using reading_order.order_regions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
two_page | Literal['auto'] \| bool | Whether the page is a two-page spread. Three modes: 'auto' (determine heuristically for each page), True, or False. | False |
Source code in src/htrflow/pipeline/steps.py
OrderLines
WordSegmentation
Bases: PipelineStep
Segment lines into words.
This step segments lines of text into words. It estimates the word boundaries from the recognized text, which means that this step must be run after a line-based text recognition model.
See also models.huggingface.trocr.WordLevelTrOCR, which is a
version of TrOCR that outputs word-level text directly using a more
sophisticated method.
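One naive way to estimate word boundaries from recognized text is to divide the line's horizontal extent in proportion to character positions. The sketch below shows that idea only. It assumes fixed-width characters and a line described by its left and right x-coordinates, and `estimate_word_boxes` is a hypothetical helper, not necessarily the method this step uses.

```python
def estimate_word_boxes(text, x_min, x_max):
    """Assign each word an x-range proportional to its character positions."""
    if not text.strip():
        return []
    # Assume every character, including spaces, occupies the same width.
    char_width = (x_max - x_min) / len(text)
    boxes = []
    position = 0
    for word in text.split(" "):
        if word:
            start = x_min + position * char_width
            end = x_min + (position + len(word)) * char_width
            boxes.append((word, round(start), round(end)))
        # Advance past the word and the space that follows it.
        position += len(word) + 1
    return boxes


boxes = estimate_word_boxes("hello big world", 0, 150)
print(boxes)
```

Because the estimate relies entirely on the transcription, it only makes sense once text recognition has produced a line of text, which is why the step must come after a line-based recognition model.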
Example YAML:
Export steps
Export
Bases: PipelineStep
Export results.
Exports the current state of the collection in the given format.
This step is typically the last step of a pipeline, but it can be
inserted at any pipeline stage. For example, you could put an
Export step before a post-processing step in order to save a copy
without post-processing. A pipeline can include as many Export
steps as you like.
See Export formats or the serialization.serialization module for
more details about each export format.
Example:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dest | str | Output directory. | required |
format | Literal['alto', 'page', 'txt', 'json'] | Output format as a string. | required |
Source code in src/htrflow/pipeline/steps.py
ExportImages
Bases: PipelineStep
Export the collection's images.
This step writes all existing images (regions, lines, etc.) in the
collection to disk. The exported images are the images that have
been passed to previous Inference steps and the images that would
be passed to a following Inference step.
Example YAML:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dest | str | Destination directory. | required |
Source code in src/htrflow/pipeline/steps.py
Misc
Break
ImportSegmentation
Bases: PipelineStep
Import segmentation from PageXML files.
This step replicates the line segmentation from PageXML files. It can be used to import ground truth segmentation for evaluation purposes.
Example YAML:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source | str | Path to a directory with PageXML files. The XML files must have the same names as the input image files (ignoring the file extension). | required |