Genii Weblog

OpenSesame: How fast is fast enough?

Fri 4 May 2007, 12:40 PM

by Ben Langhinrichs
In my quest to provide true integration between Notes and ODF (a.k.a. tilting at windmills), I have put together code that builds and modifies spreadsheets, presentations and word processing documents, but I have also put in a fair amount of time converting rich text to ODF and back.  While it is certainly not finished, I have enough done to test performance a bit.  As many regular readers here know, I am a bit of a fanatic about performance, believing that it is better to build scalability in from the beginning.

But what is the milestone or goal at which I should aim?  Right now, I can export about twenty reasonably diverse documents a second to ODF files (albeit with weird memory leaks and all the other fun factors of early development).  The fidelity is good, but is the speed reasonable?  If I convert rich text to HTML, I can do over 400 a second, but that is not saving anything to disk, and I haven't tested that way with ODF yet, since it is a less likely scenario.  Obviously, if this is a one-at-a-time export to get a document into a productivity editor, the speed is fine.  But is there a scenario where companies will export hundreds of thousands of documents?  Or is there a scenario where on-the-fly conversion will make sense, as in CoexEdit?

I don't know.  I don't even know how to know.  Sigh!  I guess I'll just keep making it faster until I run out of patience with that process.  Anybody have an opinion?

Copyright 2007 Genii Software Ltd.

What has been said:

587.1. Nathan T. Freeman
(05/04/2007 10:11 AM)

Well, I can certainly envision scenarios where it's going to be important to do this on a massive scale. Conversion of official records to an open standard platform is a good one. I could easily see a need to dump a million documents to individual ODFs for long-term archiving. (I have no idea if that would create OS-level problems, though.)

587.2. Ben Langhinrichs
(05/04/2007 10:50 AM)

So, Nathan, what would be a goal? 100/second? 200/second? It is going to take some amount of time, but I can't quite figure out the threshold under which it will be an issue. Of course, until somebody else comes out with a way to export to ODF, it won't be as much of an issue, but I have found the best way to avoid competition is to stay so far beyond what can be done easily that the door never opens to competitors. Even 20/second is probably beyond what could be done with DXL/XSLT from Java or copy/paste and office automation with UNO and such, so the potential competition has to be either faster or more flexible or better at rendering.

587.3. Richard Schwartz
(05/04/2007 08:22 PM)

Shouldn't your benchmark be in pages per second, not documents per second?

587.4. Ben Langhinrichs
(05/05/2007 10:34 AM)

Richard - No, ironically, because it only seems to make sense until you think harder about it. Document conversion is very different from document printing. For example, use Copy Selected as Table with about thirty documents and paste the result into a Body field. That table, with its links and icons, will probably take longer than ten pages of regular text, because complexity is what takes time. Simply copying text in is incredibly fast on a modern computer. Additionally, there is no easy way for me or for anyone else to measure pages, so nobody can look at a database and say "Hmm, about 70,000 pages, so that should take x amount of time", whereas they can look at the number of documents. There is also an overhead per document, especially with zipping up packages like ODF, that would make 10,000 one-page documents a very different effort than 1,000 ten-page documents. All that leads me to measure in documents and to describe these as "reasonably diverse documents" in my post.

My actual testing involves three types of databases. The first is the partner forums, and I use the ones from 2000 to 2006 to get a diversity of rich text from different releases of Notes. The second is the Designer Help database, which tends to be very complex in terms of links and images. The third is a Use Case database which I use to test all sorts of round tripping, including RT/HTML/RT, RT/MIME/RT and now RT/ODF/RT. That tests edge cases and specific types of complexity, such as certain merged table scenarios that are implemented quite differently in rich text than in HTML or ODF. Phew, long answer, but I hope that explains the metrics a bit better.
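The per-document overhead argument above can be illustrated with a toy cost model. This is only a sketch: the two cost constants are made-up placeholders, not measurements from OpenSesame, and the model deliberately ignores the complexity effects (tables, links, icons) that the comment says dominate in practice.

```python
# Toy model: total time = fixed cost per document + cost per page.
# Both constants are HYPOTHETICAL placeholders chosen for illustration,
# not timings taken from any real converter.
PER_DOCUMENT_OVERHEAD_MS = 40  # e.g. opening the note, zipping the package
PER_PAGE_COST_MS = 1           # e.g. copying plain text


def estimated_ms(document_count: int, pages_per_document: int) -> int:
    """Estimate total conversion time in milliseconds under the toy model."""
    total_pages = document_count * pages_per_document
    return (document_count * PER_DOCUMENT_OVERHEAD_MS
            + total_pages * PER_PAGE_COST_MS)


# Same total page count (10,000 pages), very different document counts.
many_small = estimated_ms(10_000, 1)
few_large = estimated_ms(1_000, 10)

print(f"10,000 one-page documents: {many_small / 1000:.0f} s")
print(f" 1,000 ten-page documents: {few_large / 1000:.0f} s")
```

With these placeholder constants the two jobs contain the same number of pages but differ by roughly a factor of eight in estimated time, which is why a pages-per-second figure alone would not predict how long a given database takes.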

587.5. Richard Schwartz
(05/05/2007 12:21 PM)

Well, it explains it better; and it's somewhat helpful in terms of telling me whether or not 20 is a good number. At 20 documents per second, and 17000-ish documents in the 2006 forum database, that's somewhat more than 13 minutes. That seems like it might be a little too slow. On my laptop, a File - Database - New Copy on that forum took about 3:10. My laptop is probably slower (due to disk i/o speed) than whatever system you used to arrive at your 20 docs/second measure, so take that into account or do your own timing on a database copy. While I realize that there are many big differences between a database copy and a document conversion, I think it's not a bad standard of comparison. Pick an N, and make your target N-times-database-copy. My suggestion is N=2. If you achieve that, nobody will be able to question your speed.
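The N-times-database-copy target can be worked out with a few lines of arithmetic. The sketch below uses the figures quoted in the comment (roughly 17,000 documents, a 3:10 database copy, N=2); the function names are illustrative only, and the document count is the approximate "17000-ish" figure, so the results are estimates.

```python
def conversion_seconds(document_count: int, docs_per_second: float) -> float:
    """Time to convert a database at a given sustained rate."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return document_count / docs_per_second


def target_rate(document_count: int, copy_seconds: float, n: float) -> float:
    """Docs/second needed to finish within n times the database-copy time."""
    if copy_seconds <= 0 or n <= 0:
        raise ValueError("copy_seconds and n must be positive")
    return document_count / (n * copy_seconds)


DOCUMENTS = 17_000            # "17000-ish", approximate
COPY_SECONDS = 3 * 60 + 10    # a 3:10 database copy = 190 seconds

minutes_at_20 = conversion_seconds(DOCUMENTS, 20) / 60
rate_for_n2 = target_rate(DOCUMENTS, COPY_SECONDS, 2)
rate_for_n1 = target_rate(DOCUMENTS, COPY_SECONDS, 1)

print(f"At 20 docs/second: {minutes_at_20:.1f} minutes")
print(f"Target for N=2:    {rate_for_n2:.1f} docs/second")
print(f"Parity with copy:  {rate_for_n1:.1f} docs/second")
```

On these approximate inputs, the N=2 target works out to about 45 documents per second, and matching the database copy itself would take roughly 89 documents per second.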

587.6. Ben Langhinrichs
(05/05/2007 01:30 PM)

That is an interesting comparison. You'll be glad to hear that I am up to about 85 a second, so pretty close to the same speed as a database copy. That seems pretty good, given that I am doing a whole lot of mapping and conversion and the database copy is probably not doing anywhere near as much.

587.7. Richard Schwartz
(05/05/2007 08:57 PM)