Genii Weblog

Perils of PDF 4: Missing and obscured data

Mon 9 Sep 2019, 10:21 AM

by Ben Langhinrichs
Last week, I started a series about certain constraints on exporting or archiving Notes data to PDF. There has been a lot of chatter recently about exporting to PDF, a feature that may be supported natively in HCL Notes V11, and that is offered as an archiving solution by some consultants and vendors. 
The PDF format itself is great for certain use cases, but has certain limitations by its very nature, and other limitations due to expectations. As I said in my first post, PDF is seen as being a little like an image. This is wrong on two counts. The first is that the data is not pixel perfect by any means. The second is that, to the extent that it shows what is visible, an image does a lousy job of revealing what is not visible. In this post, I show three different instances where data is lost or obscured in different ways. The first part shows missing data. With caption tables, the caption (or title) is missing, thus losing context. With sections that have not been expanded, the title is there but the contents are missing. And of course, all the attachments and doclinks are non-functional in any case. The second part shows obscured data. With a numbered list in a table (not tweaked in any way to try to get this result), parts of the numbers are missing. Please note, these examples are only representative of other missing or obscured data with PDF rendering.
This is the fourth of eight primary issues. Depending on what vendor or driver you use, a few of these may have at least a partial solution, but they are good items to check when validating your approach. The table of contents of all issues will be at the bottom of this post.
4a) Missing data
Portion of a rich text field with a caption table and a section.
Inline JPEG image
PDF rendering. Note that the caption titles are missing, and the section title is all that remains.
Inline JPEG image
Rendered by the Midas LSX to HTML. The first image shows it as it opens, while in the second I have clicked on the Q3 caption and section title to show them open.
Inline JPEG image
After clicking the Q3 caption and section title (both are clickable with our HTML rendering).
Inline JPEG image
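If you want to check your own export for this kind of loss, one rough approach is to compare text you know is in the rich text field against the text extracted from the rendering. Here is a minimal Python sketch, assuming you have already pulled the text out of the PDF with whatever tool you prefer (that step is not shown). The function name and the sample strings are invented for illustration and are not part of any product.

```python
def find_missing_phrases(expected_phrases, rendered_text):
    """Return the expected phrases that do not appear in the rendered text.

    Both sides are lowercased and have runs of whitespace collapsed,
    so line wrapping in the extracted text does not cause false alarms.
    """
    normalized = " ".join(rendered_text.split()).lower()
    missing = []
    for phrase in expected_phrases:
        needle = " ".join(phrase.split()).lower()
        if needle not in normalized:
            missing.append(phrase)
    return missing


# Hypothetical sample data: caption titles and section contents taken from
# the source document, and text extracted from a rendering of it.
expected = ["Q3", "Regional breakdown", "Section title"]
extracted = "Section title\nSome table cells\nmore table cells"

for phrase in find_missing_phrases(expected, extracted):
    print("Not found in rendering:", phrase)
```

Anything reported is worth a closer look by eye. A phrase interrupted by a page break or a repeated header may be reported as missing even when it is present, so treat the result as a prompt for inspection rather than proof.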
4b) Obscured data
Portion of a rich text field with a numbered list inside a table (same wide table as before, though I don't show as much).
Inline JPEG image
PDF rendering. Note that the second page looks like it starts a new numbered list, but the first item is really number 11. I also marked the data cut off on the left, as shown in the previous post.
Inline JPEG image
Rendered by the Midas LSX to HTML. 
Inline JPEG image
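A clipped list number is easy to miss by eye, but it does leave a trace: the numbering stops counting up by one. Here is a minimal Python sketch of a check for that, again assuming the text has already been extracted from the PDF by some other tool, and assuming list items come out one per line in the form "11. text". Both the function name and that layout are my assumptions, so adjust the pattern to whatever your extraction actually produces.

```python
import re

# Assumed item layout: optional indent, digits, a period, then a space or tab.
ITEM_NUMBER = re.compile(r"^[ \t]*(\d+)\.[ \t]", re.MULTILINE)


def find_numbering_breaks(rendered_text):
    """Return (previous, found) pairs where a list number is not previous + 1."""
    numbers = [int(m.group(1)) for m in ITEM_NUMBER.finditer(rendered_text)]
    return [(prev, cur) for prev, cur in zip(numbers, numbers[1:]) if cur != prev + 1]


# Hypothetical extracted text where item 11 has lost its leading digit.
sample = "9. Ninth item\n10. Tenth item\n1. Eleventh item\n2. Twelfth item\n"
print(find_numbering_breaks(sample))  # [(10, 1)]
```

A document that legitimately starts a second list will be flagged too, so each reported break still needs to be compared against the original.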
Table of Contents (will be updated as the blog series continues)
Perils of PDF 1: Attachments
Perils of PDF 2: Doclinks
Perils of PDF 3: Wide Tables and data loss
Perils of PDF 4: Missing and obscured data
Perils of PDF 5: Data Confusion
Want to try out our Midas LSX export for yourself? Simply fill out the online evaluation request, and we'll get you started. There's no cost to seeing it for yourself.

Copyright 2019 Genii Software Ltd.

What has been said:

1110.1. Stephan Wissel
(09/09/2019 10:28 AM)

One way to make archival to PDF "revision robust" is to use PDF's native capability to store custom XML properties. Nothing will stop you (short of double storage requirements) from storing the DXL representation as metadata, eventually stripping the attachments out.

1110.2. Ben Langhinrichs
(09/09/2019 05:01 PM)

Stephan - That could be done, but that only addresses the issue of getting back to where you started. People who want to archive and see the archive without Notes, whether because they are moving to web only or are a customer or whatever, won't be able to tell what the original data was even if it is stored as DXL. It needs to be visible in a non-Notes context or it is fairly useless for most use cases. - Ben