Models
Base model
BaseModel
Bases: ABC
Model base class
This is the abstract base class of HTRflow models. It handles batching of inputs, some shared initialization arguments and generic logging.
Concrete model implementations subclass this class and define their prediction method in _predict().
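The relationship between the public batching method and the model-specific _predict() can be illustrated with a small, dependency-free sketch. The classes ToyBaseModel and UppercaseModel are hypothetical stand-ins written for this illustration. They are not HTRflow's actual implementation, which also handles devices, logging and progress bars.

```python
from abc import ABC, abstractmethod
from itertools import islice
from typing import Any, Iterable, Iterator


class ToyBaseModel(ABC):
    """Hypothetical stand-in for a batching base class (not HTRflow's BaseModel)."""

    def predict(self, inputs: Iterable[Any], batch_size: int = 1, **kwargs) -> list:
        """Template method: split the inputs into batches and delegate to _predict()."""
        results = []
        for batch in self._batches(inputs, batch_size):
            results.extend(self._predict(batch, **kwargs))
        return results

    @staticmethod
    def _batches(inputs: Iterable[Any], batch_size: int) -> Iterator[list]:
        """Yield lists of at most batch_size items without materializing all inputs."""
        iterator = iter(inputs)
        while batch := list(islice(iterator, batch_size)):
            yield batch

    @abstractmethod
    def _predict(self, batch: list, **kwargs) -> list:
        """Model-specific prediction on a single batch."""


class UppercaseModel(ToyBaseModel):
    """Toy subclass that 'recognizes' text by upper-casing strings."""

    def _predict(self, batch: list, **kwargs) -> list:
        return [item.upper() for item in batch]


model = UppercaseModel()
# The input is a generator, so items are pulled batch by batch.
print(model.predict((word for word in ["a", "b", "c"]), batch_size=2))
```

The subclass only implements _predict(); batching lives in the base class, which is the division of labour described above.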
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device | str \| None | Model device as a string, recognizable by torch. Defaults to | None |
allow_tf32 | bool | Allow running matrix multiplications with TensorFloat-32. This speeds up inference at the expense of inference quality. On Ampere and newer CUDA devices, enabling TF32 can improve performance for matrix multiplications and convolutions. Read more here: https://huggingface.co/docs/diffusers/optimization/fp16#tensorfloat-32 | True |
allow_cudnn_benchmark | bool | When True, enables cuDNN benchmarking to select the fastest convolution algorithms for fixed input sizes, potentially increasing performance. Note that this may introduce nondeterminism. Defaults to False. Read more here: https://huggingface.co/docs/transformers/en/perf_train_gpu_one#tf32 | False |
Source code in src/htrflow/models/base_model.py
_predict (abstractmethod)
predict
Perform inference on images.
Takes an arbitrary number of inputs and runs batched inference. The inputs can be streamed from an iterator and don't need to be read into memory simultaneously. Prints a progress bar using tqdm. This is a template method which uses the model-specific _predict(...).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
images | Collection[NumpyImage] | Input images | required |
batch_size | int | Inference batch size, defaults to 1 | 1 |
image_scaling_factor | float | If < 1, all input images will be downscaled by this factor, which can be useful for speeding up inference on higher resolution images. All geometric data in the result (e.g., bounding boxes) are reported with respect to the original resolution. | 1.0 |
tqdm_kwargs | dict[str, Any] \| None | Optional keyword arguments to control the progress bar. | None |
**kwargs | | Optional keyword arguments that are forwarded to the model-specific prediction method | {} |
Source code in src/htrflow/models/base_model.py
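The note on image_scaling_factor says that geometric results are reported with respect to the original resolution. A minimal sketch of what that mapping amounts to for a bounding box is given below. The helper rescale_box is hypothetical and written only for this illustration; how HTRflow performs the conversion internally is not shown on this page.

```python
def rescale_box(box, image_scaling_factor):
    """Map (xmin, ymin, xmax, ymax) from the downscaled image to the original image.

    Hypothetical helper: coordinates found on an image scaled by
    `image_scaling_factor` are divided by that factor to undo the scaling.
    """
    if image_scaling_factor <= 0:
        raise ValueError("image_scaling_factor must be positive")
    return tuple(round(value / image_scaling_factor) for value in box)


# A box detected on an image that was downscaled to half its size.
box_on_downscaled = (10, 20, 110, 70)
print(rescale_box(box_on_downscaled, 0.5))  # (20, 40, 220, 140)
```

With a factor of 1.0 the box is returned unchanged, which matches the default behaviour of no scaling.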
Text recognition models
TrOCR
Bases: BaseModel
HTRflow adapter of the transformer-based OCR model TrOCR.
Uses huggingface's implementation of TrOCR. For further information, see https://huggingface.co/docs/transformers/model_doc/trocr.
Example usage with the TextRecognition step:
- step: TextRecognition
  settings:
    model: TrOCR
    model_settings:
      model: Riksarkivet/trocr-base-handwritten-hist-swe-2
      device: cpu
      model_kwargs:
        revision: 6ecbb5d643430385e1557001ae78682936f8747f
    generation_settings:
      batch_size: 8
      num_beams: 1
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | str | Path or name of pretrained VisionEncoderDecoderModel. | required |
processor | str \| None | Optional path or name of pretrained TrOCRProcessor. If not given, the model path or name is used. | None |
model_kwargs | dict[str, Any] \| None | Model initialization kwargs which are forwarded to VisionEncoderDecoderModel.from_pretrained. | None |
processor_kwargs | dict[str, Any] \| None | Processor initialization kwargs which are forwarded to TrOCRProcessor.from_pretrained. | None |
kwargs | | Additional kwargs which are forwarded to BaseModel's init. | {} |
Source code in src/htrflow/models/huggingface/trocr.py
_predict
TrOCR-specific prediction method.
This method is used by predict() and should typically not be called directly. However, predict() forwards additional kwargs to this method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
images | list[ndarray] | Input images. | required |
**generation_kwargs | | Optional keyword arguments that are forwarded to the model's .generate() method. | {} |
Source code in src/htrflow/models/huggingface/trocr.py
WordLevelTrOCR
Bases: TrOCR
A version of TrOCR which outputs words instead of lines.
This TrOCR wrapper uses the model's attention weights to estimate word boundaries. See notebook [TODO: link] for more details. It does not support beam search, but can otherwise be used as a drop-in replacement for TrOCR.
Example usage with the TextRecognition step:
- step: TextRecognition
  settings:
    model: WordLevelTrOCR
    model_settings:
      model: Riksarkivet/trocr-base-handwritten-hist-swe-2
      device: cpu
      model_kwargs:
        revision: 6ecbb5d643430385e1557001ae78682936f8747f
    generation_settings:
      batch_size: 8
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | str | Path or name of pretrained VisionEncoderDecoderModel. | required |
processor | str \| None | Optional path or name of pretrained TrOCRProcessor. If not given, the model path or name is used. | None |
model_kwargs | dict[str, Any] \| None | Model initialization kwargs which are forwarded to VisionEncoderDecoderModel.from_pretrained. | None |
processor_kwargs | dict[str, Any] \| None | Processor initialization kwargs which are forwarded to TrOCRProcessor.from_pretrained. | None |
kwargs | | Additional kwargs which are forwarded to BaseModel's init. | {} |
Source code in src/htrflow/models/huggingface/trocr.py
Satrn
Bases: BaseModel
HTRflow adapter of OpenMMLab's Satrn model.
Example usage with the TextRecognition pipeline step:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | str | Path to a local .pth model weights file or to a huggingface repo which contains a .pth file, for example 'Riksarkivet/satrn_htr'. | required |
config | str \| None | Path to a local config.py file or to a huggingface repo which contains a config.py file, for example 'Riksarkivet/satrn_htr'. | None |
kwargs | | Additional kwargs which are forwarded to BaseModel's | {} |
Source code in src/htrflow/models/openmmlab/satrn.py
_predict
Satrn-specific prediction method.
This method is used by predict() and should typically not be called directly.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
images | list[NumpyImage] | Input images | required |
kwargs | | Additional keyword arguments that are forwarded to | {} |
Source code in src/htrflow/models/openmmlab/satrn.py
PyLaia
Bases: BaseModel
A minimal HTRflow-style model wrapper around PyLaia.
Uses Teklia's implementation of PyLaia. For further information, see: https://atr.pages.teklia.com/pylaia/usage/prediction/#decode-arguments
Example usage with the TextRecognition step:
- step: TextRecognition
  settings:
    model: PyLaia
    model_settings:
      model: Teklia/pylaia-belfort
      device: cuda
      revision: d35f921605314afc7324310081bee55a805a0b9f
    generation_settings:
      batch_size: 8
      temperature: 1
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | str | The Hugging Face Hub repository ID or a local path with PyLaia artifacts: weights.ckpt, syms.txt, and optionally language_model.arpa.gz, lexicon.txt, and tokens.txt. | required |
revision | str \| None | Optional revision of the Huggingface repository. | None |
use_binary_lm | bool | Whether to use binary language model format (default: False), see | False |
kwargs | | Additional kwargs passed to BaseModel.__init__ (e.g., 'device'). | {} |
Source code in src/htrflow/models/teklia/pylaia.py
_predict
PyLaia-specific prediction method: runs text recognition.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
images | list[ndarray] | List of images as NumPy arrays (e.g., shape [H, W, C]). | required |
batch_size | int | Batch size for decoding. Defaults to 1. | required |
reading_order | str | Reading order for text recognition. Defaults to "LTR". | required |
num_workers | int | Number of workers for parallel processing. Defaults to | required |
Returns:
Type | Description |
---|---|
list[Result] | A list of Result objects containing recognized text and optionally confidence scores. |
Source code in src/htrflow/models/teklia/pylaia.py
Segmentation models
RTMDet
Bases: BaseModel
HTRflow adapter of OpenMMLab's RTMDet model.
This model can be used for region and line segmentation. Riksarkivet provides two pre-trained RTMDet models:
- https://huggingface.co/Riksarkivet/rtmdet_lines
- https://huggingface.co/Riksarkivet/rtmdet_regions
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | str | Path to a local .pth model weights file or to a huggingface repo which contains a .pth file, for example 'Riksarkivet/rtmdet_lines'. | required |
config | str \| None | Path to a local config.py file or to a huggingface repo which contains a config.py file, for example 'Riksarkivet/rtmdet_lines'. | None |
revision | str \| None | A specific model revision, as a commit hash of the model's huggingface repo. If None, the latest available revision is used. | None |
kwargs | | Additional kwargs which are forwarded to BaseModel's init. | {} |
Source code in src/htrflow/models/openmmlab/rtmdet.py
_predict
RTMDet-specific prediction method.
This method is used by predict() and should typically not be called directly.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
images | list[NumpyImage] | List of input images | required |
nms_downscale | float | If < 1, all masks will be downscaled by this factor before applying NMS. This leads to faster NMS at the expense of accuracy. | 1.0 |
nms_threshold | float | Score threshold for segments to keep after NMS. | 0.4 |
nms_sigma | float | NMS parameter that affects the score calculation. | 2.0 |
**kwargs | | Additional arguments that are passed to DetInferencer.__call__. | {} |
Source code in src/htrflow/models/openmmlab/rtmdet.py
YOLO
Bases: BaseModel
HTRflow adapter of Ultralytics' YOLO model.
Example usage with the Segmentation step:
- step: Segmentation
  settings:
    model: YOLO
    model_settings:
      model: Riksarkivet/yolov9-regions-1
      revision: 7c44178d85926b4a096c55c89bf224855a201fbf
      device: cpu
    generation_settings:
      batch_size: 8
generation_settings accepts the same arguments as YOLO.predict(). See the Ultralytics documentation for a list of supported arguments.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | str | Path to a YOLO model. The path can be a path to a local .pt model file (for example, | required |
revision | str \| None | Optional revision of the Huggingface repository. | None |
Source code in src/htrflow/models/ultralytics/yolo.py
_predict
Run inference.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
images | list[ndarray] | Input images | required |
use_polygons | bool | Whether to include output polygons (if available), default True. | True |
polygon_approx_level | float | A parameter which controls the maximum distance between the original polygon and the approximated low-resolution polygon, as a fraction of the original polygon arc length. Example: With | 0.005 |
**kwargs | | Keyword arguments forwarded to the inner YOLO model instance. | {} |
Source code in src/htrflow/models/ultralytics/yolo.py
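The polygon_approx_level parameter expresses a distance tolerance as a fraction of the polygon's arc length. The sketch below shows one way such a tolerance can drive a simplification, using a plain-Python Ramer-Douglas-Peucker routine. This is an assumption made for illustration: the page does not state which approximation algorithm HTRflow uses, and the function names here are hypothetical.

```python
import math


def perimeter(points):
    """Arc length of a closed polygon given as a list of (x, y) tuples."""
    return sum(math.dist(points[i], points[(i + 1) % len(points)]) for i in range(len(points)))


def _point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b (or to a if the segment is degenerate)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    if dx == 0 and dy == 0:
        return math.dist(p, a)
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.dist(p, (a[0] + t * dx, a[1] + t * dy))


def _rdp(points, epsilon):
    """Ramer-Douglas-Peucker simplification of an open polyline."""
    if len(points) < 3:
        return list(points)
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], points[0], points[-1])
        if d > dmax:
            index, dmax = i, d
    if dmax <= epsilon:
        return [points[0], points[-1]]
    left = _rdp(points[: index + 1], epsilon)
    right = _rdp(points[index:], epsilon)
    return left[:-1] + right


def approximate_polygon(points, level):
    """Simplify a closed polygon with tolerance = level * arc length."""
    epsilon = level * perimeter(points)
    closed = list(points) + [points[0]]  # close the ring so the last edge is considered
    return _rdp(closed, epsilon)[:-1]    # drop the duplicated start point again


# A 100 x 100 square with one nearly collinear extra point on the bottom edge.
polygon = [(0, 0), (50, 0.1), (100, 0), (100, 100), (0, 100)]
print(approximate_polygon(polygon, 0.005))   # tolerance of about 2 px: extra point removed
print(approximate_polygon(polygon, 0.0001))  # tolerance of about 0.04 px: extra point kept
```

A larger level gives a coarser polygon with fewer points; a level close to zero keeps nearly every vertex.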
Other models
DiT
Bases: BaseModel
HTRflow adapter of DiT for image classification.
Uses huggingface's implementation of DiT. For further information about the model, see https://huggingface.co/docs/transformers/model_doc/dit.
Initialize a DiT model
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | str | Path or name of pretrained AutoModelForImageClassification. | required |
processor | str \| None | Optional path or name of a pretrained AutoImageProcessor. If not given, the given | None |
model_kwargs | dict \| None | Model initialization kwargs that are forwarded to AutoModelForImageClassification.from_pretrained(). | None |
processor_kwargs | dict \| None | Processor initialization kwargs that are forwarded to AutoImageProcessor.from_pretrained(). | None |
kwargs | | Additional kwargs that are forwarded to BaseModel's init. | required |
Source code in src/htrflow/models/huggingface/dit.py
_predict
Perform inference on images.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
images | list[ndarray] | List of input images. | required |
return_format | Literal['argmax', 'softmax'] | Decides the format of the output. Options are "softmax" (default), which returns the confidence scores for each class label and image, and "argmax", which returns the most probable class label for each image. | 'softmax' |
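The difference between the two return_format options can be shown with a small, dependency-free sketch. The function format_output, the example labels and the dictionary-shaped result are assumptions made for this illustration; the exact structure of the results that DiT returns is not specified on this page.

```python
import math


def format_output(logits, labels, return_format="softmax"):
    """Turn raw class logits for one image into the requested output format.

    Hypothetical helper: "softmax" gives a score per label, "argmax" gives
    only the most probable label.
    """
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]  # subtract the peak for stability
    total = sum(exps)
    scores = {label: e / total for label, e in zip(labels, exps)}
    if return_format == "softmax":
        return scores
    if return_format == "argmax":
        return max(scores, key=scores.get)
    raise ValueError(f"Unknown return_format: {return_format!r}")


labels = ["handwritten", "printed"]  # illustrative class labels
logits = [2.0, 0.0]
print(format_output(logits, labels, "softmax"))
print(format_output(logits, labels, "argmax"))  # handwritten
```

The "softmax" form is useful when downstream steps need to threshold on confidence, while "argmax" is enough when only the predicted class matters.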