Bug 689772

Summary:	text extraction device wanted
Product:	Ghostscript	Reporter:	leonardo <leonardo>
Component:	Other Driver	Assignee:	Ken Sharp <ken.sharp>
Status:	RESOLVED FIXED
Severity:	enhancement	CC:	alex, shailesh.mistry
Priority:	P3	Keywords:	bountiable
Version:	master
Hardware:	PC
OS:	All
Customer:		Word Size:	---

Description leonardo 2008-03-28 02:14:36 UTC

Alex's commit 8347 provides a starting point for this enhancement project. Need to 
study what lib\ps2ascii does, and implement the same functionality in txtwrite. 
Assigning to Ken because he knows how text is processed in pdfwrite. In general, 
this project should be done by copying some code fragments from pdfwrite to 
txtwrite.

Comment 1 Ray Johnston 2010-04-25 22:33:27 UTC

*** Bug 687492 has been marked as a duplicate of this bug. ***

Comment 2 Shailesh Mistry 2011-07-25 21:50:55 UTC

Enhancement still missing in Ghostscript 9.03

Comment 3 Ken Sharp 2011-10-25 09:12:33 UTC

The txtwrite device now exists and is built into Ghostscript. The code will output the text in one of three ways:

In an 'XML' (not really XML) format which is simply a list of all the pieces of text as they are encountered along with some positional and other information. This is intended for use by developers wanting to perform their own analysis.

As simple UCS2 or UTF-8 text. In this case the code attempts to identify contiguous text (i.e. words broken up in the original file) and text on the same line (paragraphs as well as superscripts and subscripts), and attempts to approximate the layout of the original file by using spaces to position the text in the output file.
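
To illustrate the idea of positioning text with spaces, here is a minimal Python sketch. It is not the txtwrite code: the function name `layout_line`, the `(x, text)` fragment tuples, and the fixed `char_width` are all hypothetical, chosen only to show how fragments on one line could be mapped to character columns and padded with spaces.

```python
def layout_line(fragments, char_width=6.0):
    """Join (x, text) fragments from one line, padding with spaces.

    Hypothetical illustration only: each fragment's x position is
    converted to a character column using an assumed fixed glyph width.
    """
    line = ""
    for x, text in sorted(fragments):
        column = int(round(x / char_width))
        # Pad up to the target column. If the text emitted so far already
        # reaches that column, the fragment is treated as contiguous and
        # appended with no space.
        if column > len(line):
            line += " " * (column - len(line))
        line += text
    return line
```

With these assumptions, `layout_line([(0, "wo"), (12, "rd")])` rejoins a word that was split into two fragments, while fragments further apart are separated by a run of spaces roughly proportional to the gap.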

Still need to unify the format used by MuPDF and txtwrite for the 'XML' output.

Other enhancement requests will be considered.

Comment 4 Ken Sharp 2011-11-03 15:50:05 UTC

The txtwrite device now outputs XML in a form 'compatible' with MuPDF (but not using the 'blocks') and also in a form broadly the same as MuPDF (including the blocks).

This completes the basic work on the text extraction device. There are undoubtedly bugs to fix and features to be added, but I'm closing this issue.