Collection
Warning
Experimental - Syntax or Docs might change
The Collection class is the core data structure in HTRFlow, designed to manage and process document pages in a hierarchical structure. It provides a flexible and intuitive way to handle document analysis tasks, from page segmentation to text recognition.
Overview
The Collection class maintains a tree structure where:
- Root level contains PageNode objects (individual document pages)
- Pages can contain regions (text blocks, tables, etc.)
- Regions can contain paragraphs or lines of text
- Lines contain individual words
Each node in the tree has associated attributes like position (coordinates), dimensions, and potentially recognized text.
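The hierarchy described above can be pictured with a small stand-in class. This is only a sketch: `SketchNode` and its fields are hypothetical names chosen for illustration and are not HTRFlow's own node implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchNode:
    """Hypothetical stand-in for a tree node; not HTRFlow's own class."""
    label: str
    height: int
    width: int
    x: int = 0
    y: int = 0
    text: Optional[str] = None
    children: List["SketchNode"] = field(default_factory=list)

    def depth(self) -> int:
        """Number of levels in the subtree rooted at this node."""
        return 1 + max((child.depth() for child in self.children), default=0)


# Build page -> region -> line, mirroring the levels listed above
line = SketchNode("img_node0_node0", 166, 965, 358, 224, text="text0")
region = SketchNode("img_node0", 2123, 1444, 54, 218, children=[line])
page = SketchNode("img", 2413, 1511, children=[region])

print(page.depth())  # 3
```

Each level only needs to know its own geometry, its optional text, and its children, which is what makes the structure easy to extend as processing steps add detail.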
Note that, underneath, the Collection consists of three main components that work together:
- Collection: The root container managing document pages and their hierarchy
- Result: Processing outputs that update the Collection's structure
- Geometry: Spatial utilities used throughout the Collection tree
Here's how they interact:
Collection
├── Manages PageNodes
│   ├── Updated by Results
│   └── Uses Geometry for spatial operations
Basic Usage
Here's a typical workflow using Collection with a pipeline:
from htrflow.pipeline.pipeline import Pipeline
from htrflow.volume.volume import Collection
import yaml
# Create collection from images
collection = Collection(['image1.jpg', 'image2.jpg'])
# Define pipeline configuration
config = yaml.safe_load("""
steps:
- step: Segmentation
  settings:
    model: yolo
    model_settings:
      model: Riksarkivet/yolov9-lines-within-regions-1
- step: TextRecognition
  settings:
    model: TrOCR
    model_settings:
      model: Riksarkivet/trocr-base-handwritten-hist-swe-2
""")
# Process with pipeline
pipe = Pipeline.from_config(config)
collection = pipe.run(collection)
print(collection)
output:
collection tree: # Root
img_h x img_w node (image0) at (origo) # Image 0 (parent)
└──node0_h x node0_w (image0_node0) at (px_0, py_0) # (child)
    ├──node00_h x node00_w (image0_node0_node0) at (px_00, py_00): text0 # (child's child)
    ├──node01_h x node01_w (image0_node0_node1) at (px_01, py_01): text1
    └──node02_h x node02_w (image0_node0_node2) at (px_02, py_02): text2
...
img_h x img_w node (image1) at (origo) # Image 1
└──...
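The pipeline configuration in this workflow is meant to parse into a nested mapping with one entry per step. As a sketch, the same structure can be written directly as a Python dict, which makes the nesting explicit. It assumes that `Pipeline.from_config` only needs the parsed mapping, as the workflow above suggests.

```python
# Nested mapping that the two-step YAML configuration is intended to describe.
config = {
    "steps": [
        {
            "step": "Segmentation",
            "settings": {
                "model": "yolo",
                "model_settings": {
                    "model": "Riksarkivet/yolov9-lines-within-regions-1",
                },
            },
        },
        {
            "step": "TextRecognition",
            "settings": {
                "model": "TrOCR",
                "model_settings": {
                    "model": "Riksarkivet/trocr-base-handwritten-hist-swe-2",
                },
            },
        },
    ]
}

# Quick sanity check of the step order before handing the config to a pipeline
step_names = [step["step"] for step in config["steps"]]
print(step_names)  # ['Segmentation', 'TextRecognition']
```

Steps run in list order, so segmentation comes first and text recognition then works on the lines it found.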
Working with Collection
Navigation
The Collection class uses intuitive indexing for accessing nodes:
page = collection[0] # First page
region = collection[0][0] # First region in first page
line = collection[0][0][0] # First line in first region
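or, using the tuple-style indexing that also appears in the node-type examples below:
page = collection[0] # First page
region = collection[0, 0] # First region in first page
line = collection[0, 0, 0] # First line in first region
For instance, given a populated collection that looks like this: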
collection label: Col_output
collection tree:
2413x1511 node (img) at (0, 0)
└──2123x1444 node (img_node0) at (54, 218)
    ├──166x965 node (img_node0_node0) at (358, 224): text0
    ├──222x1138 node (img_node0_node1) at (331, 450): text1
    ├──156x1045 node (img_node0_node2) at (437, 702): text2
    └──119x1191 node (img_node0_node3) at (238, 888): text3
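Printing the first page, its first region, and that region's first line (for example with print(collection[0]), print(collection[0][0]), and print(collection[0][0][0]))
outputs: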
2413x1511 node (img) at (0, 0)
2123x1444 node (img_node0) at (54, 218)
166x965 node (img_node0_node0) at (358, 224): text0
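To make the two indexing styles concrete, here is a minimal sketch of how chained indexing and tuple indexing can resolve to the same node. `SketchNode` is a hypothetical stand-in; HTRFlow's actual `__getitem__` may be implemented differently.

```python
class SketchNode:
    """Hypothetical stand-in for a tree node; not HTRFlow's own class."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = list(children or [])

    def __getitem__(self, index):
        # An int selects a direct child; a tuple walks down one level per entry.
        if isinstance(index, tuple):
            node = self
            for i in index:
                node = node.children[i]
            return node
        return self.children[index]


line = SketchNode("img_node0_node0")
region = SketchNode("img_node0", [line])
page = SketchNode("img", [region])
collection = SketchNode("collection", [page])

# Both styles reach the very same object
same_node = collection[0][0][0] is collection[0, 0, 0]
print(same_node)  # True
print(collection[0, 0, 0].label)  # img_node0_node0
```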
Node Types
The tree structure consists of different types of nodes that can be identified using these methods:
- is_region(): True for nodes containing lines/text blocks
- is_line(): True for nodes containing words/text lines
- is_word(): True for nodes containing single words
Example:
# Check node types
collection[0].is_region() # True - Page is a region
collection[0, 0].is_region() # True - First child is a region
collection[0, 0, 0].is_line() # True - First grandchild is a line
collection[0].is_line() # False
collection[0, 0].is_line() # False
collection[0, 0, 0].is_region() # False
Traversing the Tree
You can traverse nodes using filters:
# Get specific node types
lines = collection.traverse(filter=lambda node: node.is_line())
regions = collection.traverse(filter=lambda node: node.is_region())
text_nodes = collection.traverse(filter=lambda node: node.contains_text())
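Filtered traversal can be thought of as a tree walk that keeps only the nodes a predicate accepts. The sketch below uses a hypothetical `SketchNode` class and a pre-order, depth-first walk; the visiting order HTRFlow actually uses is an assumption here.

```python
class SketchNode:
    """Hypothetical stand-in for a tree node; not HTRFlow's own class."""

    def __init__(self, label, text=None, children=None):
        self.label = label
        self.text = text
        self.children = list(children or [])


def traverse(node, filter=None):
    """Pre-order, depth-first walk; keeps nodes for which `filter` returns True.

    The parameter is named `filter` only to mirror the keyword used above,
    even though it shadows the Python built-in inside this function.
    """
    matches = [node] if filter is None or filter(node) else []
    for child in node.children:
        matches.extend(traverse(child, filter))
    return matches


page = SketchNode("img", children=[
    SketchNode("img_node0", children=[
        SketchNode("img_node0_node0", text="text0"),
        SketchNode("img_node0_node1", text="text1"),
    ]),
])

with_text = traverse(page, filter=lambda node: node.text is not None)
print([node.label for node in with_text])  # ['img_node0_node0', 'img_node0_node1']
print(len(traverse(page)))  # 4
```

Passing the predicate as a callable keeps the walk generic: the same function serves lines, regions, or any custom condition.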
Saving and Serialization
The Collection class supports various serialization formats:
# ALTO XML
collection.save(directory="output", serializer="alto")
# Other supported formats:
# - PAGE XML
# - Plain text (txt)
# - JSON
Creating a Collection
from htrflow.volume.volume import Collection
# From individual image files
collection = Collection(['page1.jpg', 'page2.jpg'])
# From a directory
collection = Collection.from_directory('path/to/images')
# From a previously saved collection
collection = Collection.from_pickle('saved_collection.pkl')
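If you want explicit control over which files end up in a collection, one option is to build the list of image paths yourself and pass it to the constructor. The helper below is a hypothetical sketch using only the standard library; `list_image_paths` and the set of extensions are assumptions, not part of HTRFlow.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions; adjust to the formats you actually use.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def list_image_paths(directory):
    """Return sorted paths (as strings) of image files directly inside `directory`."""
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


# Small self-contained demo using a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("page2.jpg", "page1.JPG", "notes.txt"):
        (Path(tmp) / name).touch()
    image_names = [Path(p).name for p in list_image_paths(tmp)]

print(image_names)  # ['page1.JPG', 'page2.jpg']
```

The resulting list of strings has the same shape as the `['page1.jpg', 'page2.jpg']` argument shown above, so it could be passed to `Collection(...)` in the same way.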
Updating Collection (without pipeline)
Collection nodes are updated through model results, with each update potentially modifying the tree structure. Here's a complete example with actual output:
Python:
Output:
Tree after line detection:
collection label: img
collection tree:
5168x6312 node (img) at (0, 0)
├──1293x1579 node (img_node0) at (3020, 831)
│   ├──167x395 node (img_node0_node0) at (4204, 1957)
│   └──323x378 node (img_node0_node1) at (3020, 1086)
└──1293x1579 node (img_node1) at (2880, 1376)
    ├──323x395 node (img_node1_node0) at (2971, 2208)
    └──323x243 node (img_node1_node1) at (4216, 2272)
Final tree with text:
collection label: img.jpg
collection tree:
5168x6312 node (img) at (0, 0)
├──1293x1579 node (img_node0) at (3020, 831)
│   ├──167x395 node (img_node0_node0) at (4204, 1957): "Magnam sit est ut dolorem consectetur."
│   └──323x378 node (img_node0_node1) at (3020, 1086): "Dolorem dolore consectetur porro voluptatem eius quaerat dolore."
└──1293x1579 node (img_node1) at (2880, 1376)
    ├──323x395 node (img_node1_node0) at (2971, 2208): "Sit est velit numquam modi adipisci dolorem ut."
    └──323x243 node (img_node1_node1) at (4216, 2272): "Quiquia quiquia modi modi consectetur sit numquam."
βββ323x243 node (img_node1_node1) at (4216, 2272): "Quiquia quiquia modi modi consectetur sit numquam."
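The update pattern shown in these trees (segmentation adds child nodes, text recognition then attaches text to the leaves) can be sketched without any models. Everything below is a hypothetical stand-in: `SketchNode`, `add_segments`, and the placeholder texts are illustrative assumptions and do not use HTRFlow's Result or Geometry classes.

```python
class SketchNode:
    """Hypothetical stand-in for a tree node; not HTRFlow's own class."""

    def __init__(self, label, height, width, x=0, y=0):
        self.label, self.height, self.width = label, height, width
        self.x, self.y = x, y
        self.text = None
        self.children = []

    def add_segments(self, boxes):
        """Attach one child per (x, y, width, height) box, like a segmentation step."""
        for i, (x, y, width, height) in enumerate(boxes):
            self.children.append(SketchNode(f"{self.label}_node{i}", height, width, x, y))

    def leaves(self):
        """Return childless nodes in left-to-right order."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

    def __str__(self):
        s = f"{self.height}x{self.width} node ({self.label}) at ({self.x}, {self.y})"
        return s if self.text is None else f'{s}: "{self.text}"'


page = SketchNode("img", 5168, 6312)
# Step 1: region detection adds two children to the page
page.add_segments([(3020, 831, 1579, 1293), (2880, 1376, 1579, 1293)])
# Step 2: line detection adds children to each region
page.children[0].add_segments([(4204, 1957, 395, 167), (3020, 1086, 378, 323)])
page.children[1].add_segments([(2971, 2208, 395, 323), (4216, 2272, 243, 323)])
# Step 3: text recognition attaches placeholder text to every leaf
for leaf, text in zip(page.leaves(), ["line 0", "line 1", "line 2", "line 3"]):
    leaf.text = text

print(page.children[0].children[0])
# 167x395 node (img_node0_node0) at (4204, 1957): "line 0"
```

Each step only touches the current leaves of the tree, which is why results from successive models can be applied one after another.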