Bug 707558 - Support page segmentation mode (PSM) in Tesseract OCR
Status: UNCONFIRMED
Alias: None
Product: MuPDF
Classification: Unclassified
Component: mupdf
Version: unspecified
Hardware: All All
Importance: P2 enhancement
Assignee: MuPDF bugs
Reported: 2024-02-08 12:12 UTC by Jorj
Modified: 2024-02-08 12:12 UTC

Description Jorj 2024-02-08 12:12:34 UTC
Following user enhancement request #3122 in PyMuPDF (https://github.com/pymupdf/PyMuPDF/issues/3122), would it be possible to add an int member "psm" (page segmentation mode) to the OCR options structure and pass its value through to Tesseract-OCR?

Choosing a suitable PSM can improve Tesseract's recognition rate very significantly, for instance when the image is known to contain just a single line or a single word, or when the image background has large patches of different colors.

The default PSM value is
"3    Fully automatic page segmentation, but no OSD. (Default)"

OSD: orientation and script detection

Frequently useful PSM options:
7    Treat the image as a single text line.
8    Treat the image as a single word.
10   Treat the image as a single character.
11   Sparse text. Find as much text as possible in no particular order.
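
To illustrate the request, here is a minimal sketch of an options record that carries a "psm" value and hands it on to Tesseract. It is not the MuPDF or PyMuPDF API: the names OcrOptions and tesseract_args are hypothetical, and the hand-off is shown via the tesseract command-line tool (its --psm and -l options) rather than the library interface MuPDF actually uses. The accepted range 0..13 is an assumption that matches Tesseract 4 and later.

```python
from dataclasses import dataclass
from typing import List

# The PSM values quoted in this report, with Tesseract's own descriptions.
PSM_DESCRIPTIONS = {
    3: "Fully automatic page segmentation, but no OSD (default)",
    7: "Treat the image as a single text line",
    8: "Treat the image as a single word",
    10: "Treat the image as a single character",
    11: "Sparse text, in no particular order",
}

# Assumed valid range for --psm (Tesseract 4 and later).
PSM_MIN, PSM_MAX = 0, 13


@dataclass
class OcrOptions:
    """Hypothetical options record; NOT the actual MuPDF structure."""
    language: str = "eng"
    psm: int = 3  # Tesseract's default mode

    def __post_init__(self):
        # Reject values Tesseract would not accept.
        if not PSM_MIN <= self.psm <= PSM_MAX:
            raise ValueError(
                f"psm must be in {PSM_MIN}..{PSM_MAX}, got {self.psm}"
            )


def tesseract_args(image_path: str, options: OcrOptions) -> List[str]:
    """Build an argument list for the tesseract command-line tool.

    "stdout" as the output base makes tesseract print the recognized text.
    The list is only built here, not executed.
    """
    return [
        "tesseract", image_path, "stdout",
        "-l", options.language,
        "--psm", str(options.psm),
    ]
```

For an image known to hold a single text line, tesseract_args("line.png", OcrOptions(psm=7)) yields an argument list ending in "--psm", "7", which could be passed to subprocess.run if tesseract is installed.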