707558 – Support page segmentation mode (PSM) in Tesseract OCR

Status:	UNCONFIRMED

Alias:	None

Product:	MuPDF
Classification:	Unclassified
Component:	mupdf
Version:	unspecified
Hardware:	All All

Importance:	P2 enhancement
Assignee:	MuPDF bugs

Reported:	2024-02-08 12:12 UTC by Jorj
Modified:	2024-11-22 12:29 UTC

Description Jorj 2024-02-08 12:12:34 UTC

As per user enhancement request #3122 (https://github.com/pymupdf/PyMuPDF/issues/3122) in PyMuPDF, is it possible to include an additional int member "psm" (page segmentation mode) in the OCR options structure and pass its value to Tesseract-OCR?

PSM can improve Tesseract's recognition rate very (!) significantly, for instance when the image is known to contain just a single line or a single word, or when the image has large patches of different background colors.

The default PSM value is
"3    Fully automatic page segmentation, but no OSD. (Default)"

OSD: orientation and script detection

Frequent desirable PSM options:
7    Treat the image as a single text line.
8    Treat the image as a single word.
10   Treat the image as a single character.
11   Sparse text. Find as much text as possible in no particular order.
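
To illustrate the request, here is a minimal sketch of an options record with an extra int member "psm" whose value is handed on to Tesseract. It goes through the tesseract command line (the "--psm" and "-l" options) rather than through MuPDF. The names OcrOptions and tesseract_argv and the file name line.png are hypothetical and are not part of MuPDF or PyMuPDF; the accepted range 0..13 is the set of modes Tesseract documents.

```python
from dataclasses import dataclass

# PSM values quoted in this report; Tesseract documents modes 0 through 13.
PSM_AUTO = 3          # fully automatic page segmentation, no OSD (default)
PSM_SINGLE_LINE = 7   # treat the image as a single text line
PSM_SINGLE_WORD = 8   # treat the image as a single word
PSM_SINGLE_CHAR = 10  # treat the image as a single character
PSM_SPARSE_TEXT = 11  # sparse text, in no particular order


@dataclass
class OcrOptions:
    """Hypothetical options record; not the actual MuPDF/PyMuPDF structure."""
    language: str = "eng"
    psm: int = PSM_AUTO  # the proposed additional int member


def tesseract_argv(image_path: str, opts: OcrOptions) -> list[str]:
    """Build a tesseract CLI invocation that writes recognized text to stdout."""
    if not 0 <= opts.psm <= 13:
        raise ValueError(f"psm must be in 0..13, got {opts.psm}")
    return [
        "tesseract", image_path, "stdout",
        "-l", opts.language,
        "--psm", str(opts.psm),
    ]


if __name__ == "__main__":
    # Hypothetical input: an image known to hold exactly one line of text.
    print(tesseract_argv("line.png", OcrOptions(psm=PSM_SINGLE_LINE)))
```

The resulting list can be passed to subprocess.run on a machine where Tesseract is installed. Keeping 3 as the default means callers that never set psm see no change in behaviour.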

Comment 1 mister_torn 2024-11-22 09:57:58 UTC

Tesseract 5.0 and above does not support PSM. It doesn't make sense to add this feature for the sake of Tesseract 4.0.

Comment 2 Robin Watts 2024-11-22 12:29:07 UTC

(In reply to mister_torn from comment #1)
> Tesseract 5.0 and above does not support PSM. It doesn't make sense to add
> this feature for the sake of Tesseract 4.0.

Can you provide a reference for this, please?

The current Tesseract 5 documentation still describes the psm option as supported.

e.g. https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html