Summary: | text extraction device wanted | ||
---|---|---|---|
Product: | Ghostscript | Reporter: | leonardo <leonardo> |
Component: | Other Driver | Assignee: | Ken Sharp <ken.sharp> |
Status: | RESOLVED FIXED | ||
Severity: | enhancement | CC: | alex, shailesh.mistry |
Priority: | P3 | Keywords: | bountiable |
Version: | master | ||
Hardware: | PC | ||
OS: | All | ||
Customer: | Word Size: | --- |
Description
leonardo
2008-03-28 02:14:36 UTC
*** Bug 687492 has been marked as a duplicate of this bug. ***

Enhancement still missing in Ghostscript 9.03.

The txtwrite device now exists and is built into Ghostscript. The code will output the text in one of three ways:

In an 'XML' (not really XML) format, which is simply a list of all the pieces of text as they are encountered, along with some positional and other information. This is intended for use by developers wanting to perform their own analysis.

As simple UCS2 or UTF-8 text. In this case the code attempts to identify contiguous text (i.e. words broken up in the original file) and text on the same line (paragraphs as well as super- and subscripts), and tries to approximate the layout of the original file by using spaces to position the text in the output file.

Still need to unify the format used by MuPDF and txtwrite for the 'XML' output. Other enhancement requests will be considered.

The txtwrite device now outputs XML in a form 'compatible' with MuPDF (but not using the 'blocks') and also in a form broadly the same as MuPDF (including the blocks).

This completes the basic work on the text extraction device. There are undoubtedly bugs and features to be added, but I'm closing this issue.
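For anyone wanting to try the output modes described above, here is a minimal sketch of driving the device from a script. It assumes a recent Ghostscript release in which txtwrite accepts `-dTextFormat` with the values 0 (XML-style output), 1 (MuPDF-like output with blocks), 2 (UCS2) and 3 (UTF-8); older builds may differ, so check the documentation for the installed version. The helper names `build_txtwrite_command` and `extract_text`, and the keys of `TEXT_FORMATS`, are hypothetical and only exist for this illustration.

```python
import subprocess

# Assumed mapping from a readable name to the -dTextFormat value.
# These numbers follow the current Ghostscript documentation for txtwrite;
# verify them against the version you actually have installed.
TEXT_FORMATS = {
    "xml": 0,         # per-fragment output with positional information
    "xml-blocks": 1,  # MuPDF-like output, including blocks
    "ucs2": 2,        # plain UCS2 text
    "utf8": 3,        # plain UTF-8 text
}


def build_txtwrite_command(input_path, output_path, text_format="utf8", gs_binary="gs"):
    """Return the argument list for a txtwrite run (hypothetical helper)."""
    if text_format not in TEXT_FORMATS:
        raise ValueError(f"unknown text format: {text_format!r}")
    return [
        gs_binary,
        "-q",          # suppress startup messages
        "-dNOPAUSE",   # do not prompt between pages
        "-dBATCH",     # exit once the input has been processed
        "-sDEVICE=txtwrite",
        f"-dTextFormat={TEXT_FORMATS[text_format]}",
        f"-sOutputFile={output_path}",
        input_path,
    ]


def extract_text(input_path, output_path, text_format="utf8"):
    """Run Ghostscript; raises CalledProcessError if gs exits with an error."""
    subprocess.run(build_txtwrite_command(input_path, output_path, text_format), check=True)
```

Calling `extract_text("input.pdf", "output.txt")` would request the UTF-8 layout-approximating mode, while passing `text_format="xml"` would request the developer-oriented fragment list. Building the argument list separately from running it keeps the command easy to inspect or log before Ghostscript is invoked.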