Writer/ToDo/PDF Import
Motivation
PDF is a widely used format for exchanging documents containing text and graphics between different applications and platforms. OpenOffice.org can currently create PDF documents via export filters that are available in every major OpenOffice.org application. Unfortunately, OpenOffice.org cannot import PDF documents back again, although this is one of the more frequently requested features.
Goals for a PDF import
The document created by importing a PDF file should resemble the original as closely as possible. PDF, however, does not lend itself to that end: most PDF files contain no information about layout or document structure at all, so a PDF file can never be imported on a 1:1 basis. We therefore have to define goals that state what level of similarity must be achieved, based on what is feasible.
These goals should be treated as paramount:
- all text that is visible in the original PDF document should be imported.
- text attributes: font family, font size, weight (bold, not bold), style (italic, not italic) should be imported together with the respective text.
- all drawing elements (images, vector graphics) should be imported.
- if the implementation has to choose between layout fidelity and editability, lean towards layout.
Additionally, there are some goals that would greatly enhance the import result. By their nature, all of these features can only be implemented with heuristic methods, since PDF (unless the file uses tagged PDF) does not contain structural information. The following text features should be detected (in descending order of importance):
- Paragraphs
- Enumerations
- Titles
- Underlined text
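To illustrate what such a heuristic might look like, here is a minimal Python sketch of line and paragraph detection. It is not the planned implementation: the `Portion` record, the function names, and the `tolerance` and `max_gap` thresholds are all assumptions made for this example, and a real detector would also have to consider columns, indentation, and font changes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Portion:
    """A run of text with a position, as a PDF parser might report it (hypothetical)."""
    text: str
    x: float     # left edge of the portion
    y: float     # baseline position, growing downwards on the page
    size: float  # font size in points


def group_lines(portions: List[Portion], tolerance: float = 0.2) -> List[List[Portion]]:
    """Group portions whose baselines differ by at most tolerance * font size."""
    lines: List[List[Portion]] = []
    for p in sorted(portions, key=lambda p: (p.y, p.x)):
        if lines and abs(lines[-1][0].y - p.y) <= tolerance * p.size:
            lines[-1].append(p)
        else:
            lines.append([p])
    # Within each line, order the portions from left to right.
    for line in lines:
        line.sort(key=lambda p: p.x)
    return lines


def group_paragraphs(lines: List[List[Portion]], max_gap: float = 1.5) -> List[List[List[Portion]]]:
    """Start a new paragraph when the baseline distance exceeds max_gap * font size."""
    paragraphs: List[List[List[Portion]]] = []
    prev = None
    for line in lines:
        if prev is not None and line[0].y - prev[0].y <= max_gap * line[0].size:
            paragraphs[-1].append(line)
        else:
            paragraphs.append([line])
        prev = line
    return paragraphs


def paragraph_text(paragraph: List[List[Portion]]) -> str:
    """Join all portions of a paragraph with single spaces."""
    return " ".join(" ".join(p.text for p in line) for line in paragraph)


if __name__ == "__main__":
    sample = [
        Portion("world", 50, 100.5, 12),
        Portion("Hello", 10, 100, 12),
        Portion("second line", 10, 114, 12),
        Portion("New paragraph", 10, 150, 12),
    ]
    for para in group_paragraphs(group_lines(sample)):
        print(paragraph_text(para))
```

Expressing both thresholds relative to the font size keeps the heuristic independent of the absolute text size, but the concrete values would have to be tuned against real documents.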
Implementation
We will try to come up with a first prototype soon, most probably using an out-of-process xpdf instance to do the parsing (due to license issues). Here's a list of things to do:
Area | Title | State | CWS |
---|---|---|---|
Parser | Wrap pdf parser with UNO | 100% | picom |
Parser | Connect to xpdf out-of-process | 10% | picom |
Tooling | Enhance rendering API to provide truly generic bitmap access | 100% | picom |
Canvas | Adapt Canvas implementations to the new API | 100% | picom |
Tooling | Adapt VCL's canvastools to be able to import XBitmap generically to VCL bitmap | 100% | picom |
Tooling | Enable GraphicImporter to use rendering::XBitmap | 90% | picom |
Import | Read content via UNO | 50% | picom |
Import | Combine low-level structure (like stroke and fill) | 90% | picom |
Import | Generate SAX events | 90% | picom |
Import | Generate ODF stream | 0% | picom |
Import | Detect text flow: portions | 0% | picom |
Import | Detect text flow: lines | 0% | picom |
Import | Detect text flow: paragraphs | 0% | picom |
Import | Detect text style | 0% | picom |
Import | Detect shape style (e.g. shadow) | 0% | picom |
Parser | Replacement for xpdf | 0% | picom |
CVS | Move pdf import to OOo CVS | 0% | picom |
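As a rough illustration of the "Generate SAX events" and "Generate ODF stream" items in the table above, the following Python sketch pushes detected paragraphs through a SAX content handler. It uses Python's standard `xml.sax` classes only as a stand-in for the UNO SAX interfaces; the function names and the `(style name, text)` input format are assumptions for this example, and the output is an ODF-like fragment without namespace declarations, not a complete ODF document.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def emit_paragraphs(paragraphs, handler):
    """Send one text:p element per (style name, text) pair to a SAX handler."""
    handler.startDocument()
    handler.startElement("office:text", AttributesImpl({}))
    for style_name, text in paragraphs:
        attrs = AttributesImpl({"text:style-name": style_name})
        handler.startElement("text:p", attrs)
        handler.characters(text)  # the handler escapes &, < and >
        handler.endElement("text:p")
    handler.endElement("office:text")
    handler.endDocument()


def to_xml(paragraphs):
    """Serialize the SAX events to a string, for inspection only."""
    buffer = io.StringIO()
    emit_paragraphs(paragraphs, XMLGenerator(buffer, encoding="utf-8"))
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_xml([("Heading", "Motivation"), ("Standard", "Text & graphics")]))
```

Because the importer only talks to a handler interface, the same event sequence could feed either a serializer, as here, or a component that builds the document directly.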