689772 – text extraction device wanted

Bug 689772 - text extraction device wanted

Status:	RESOLVED FIXED

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	Other Driver
Version:	master
Hardware:	PC All

Importance:	P3 enhancement
Assignee:	Ken Sharp

URL:
Keywords:	bountiable

Duplicates (1):	687492
Depends on:
Blocks:

Reported:	2008-03-28 02:14 UTC by leonardo
Modified:	2011-11-03 15:50 UTC
CC List:	2 users

See Also:
Customer:
Word Size:	---

Description leonardo 2008-03-28 02:14:36 UTC

Alex's commit 8347 provides a starting point for this enhancement project. We 
need to study what lib\ps2ascii does and implement the same functionality in 
txtwrite. Assigning to Ken because he knows how text is processed in pdfwrite. 
In general, this project should be done by copying some code fragments from 
pdfwrite to txtwrite.

Comment 1 Ray Johnston 2010-04-25 22:33:27 UTC

*** Bug 687492 has been marked as a duplicate of this bug. ***

Comment 2 Shailesh Mistry 2011-07-25 21:50:55 UTC

This enhancement is still missing in Ghostscript 9.03.

Comment 3 Ken Sharp 2011-10-25 09:12:33 UTC

The txtwrite device now exists and is built into Ghostscript. The code will output the text in one of three ways:

In an 'XML' (not really XML) format which is simply a list of all the pieces of text as they are encountered along with some positional and other information. This is intended for use by developers wanting to perform their own analysis.

As simple UCS2 or UTF-8 text. In this case the code attempts to identify contiguous text (i.e. words broken up in the original file) and text on the same line (paragraphs as well as superscripts and subscripts), and tries to approximate the layout of the original file by using spaces to position the text in the output file.
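
For readers who want a feel for the kind of processing described for the plain-text modes, here is a minimal Python sketch. It is not the txtwrite implementation: the Fragment type, the layout_text function and the three tolerance parameters are all hypothetical, and a fixed character width is assumed purely for illustration. It joins fragments that touch, groups fragments with nearly equal baselines onto one line, and pads with spaces to approximate horizontal position.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical positioned piece of text (units: points)."""
    x: float      # left edge
    y: float      # baseline, larger values are higher on the page
    width: float  # advance width
    text: str


def layout_text(fragments, char_width=6.0, line_tol=3.0, gap_tol=1.0):
    """Approximate the page layout in plain text (illustrative only)."""
    # Group fragments into lines, top of the page first. A fragment joins
    # the current line when its baseline is within line_tol of the line's
    # first fragment, so small shifts (super/subscripts) stay on the line.
    lines = []
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tol:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))

    out = []
    for _, frags in lines:
        frags.sort(key=lambda f: f.x)
        line = ""
        end_x = None
        for frag in frags:
            if end_x is not None and frag.x - end_x <= gap_tol:
                # Contiguous with the previous fragment: rejoin the word.
                line += frag.text
            else:
                # Pad with spaces up to the column implied by x, keeping
                # at least one space between separate pieces of text.
                column = int(round(frag.x / char_width))
                minimum = len(line) + (1 if line else 0)
                line = line.ljust(max(column, minimum)) + frag.text
            end_x = frag.x + frag.width
        out.append(line)
    return "\n".join(out)
```

With the fragments "He" and "llo" touching at x=12, "world" starting at x=60, and an "x" followed by a slightly raised "2" on a lower line, this sketch rejoins "Hello", places "world" at column 10, and emits "x2" on the second line.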

Still need to unify the format used by MuPDF and txtwrite for the 'XML' output.

Other enhancement requests will be considered.

Comment 4 Ken Sharp 2011-11-03 15:50:05 UTC

The txtwrite device now outputs XML in a form 'compatible' with MuPDF (but not using the 'blocks') and also in a form broadly the same as MuPDF (including the blocks).
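
As a rough illustration of selecting between these output forms, the Python sketch below builds a Ghostscript command line for the txtwrite device. The -dTextFormat numbering used here (0 and 1 for the two XML forms, 2 for UCS2, 3 for UTF-8) is taken from later Ghostscript documentation and is an assumption for the build discussed in this bug; the helper names and the format labels are hypothetical.

```python
import subprocess

# Assumed mapping, based on later Ghostscript documentation; it may not
# match the build discussed in this bug. The labels are made up here.
TEXT_FORMATS = {
    "xml": 0,         # MuPDF-compatible XML, no block detection
    "xml-blocks": 1,  # XML broadly the same as MuPDF, with blocks
    "ucs2": 2,        # UCS2 text approximating the layout
    "utf8": 3,        # UTF-8 text approximating the layout
}


def txtwrite_command(input_path, output_path, text_format="utf8", gs="gs"):
    """Return the argument list for a txtwrite run (nothing is executed)."""
    return [
        gs,
        "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=txtwrite",
        f"-dTextFormat={TEXT_FORMATS[text_format]}",
        f"-sOutputFile={output_path}",
        input_path,
    ]


def run_txtwrite(input_path, output_path, text_format="utf8", gs="gs"):
    """Run Ghostscript with the command built above; raises on failure."""
    subprocess.run(
        txtwrite_command(input_path, output_path, text_format, gs),
        check=True,
    )
```

For example, txtwrite_command("in.pdf", "out.xml", "xml-blocks") yields a command that asks for the block-structured XML form, assuming the numbering above holds for the installed version.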

This completes the basic work on the text extraction device. There are undoubtedly bugs to be fixed and features to be added, but I'm closing this issue.