Johann Burkard



Nov 11 2010

Java PDF Open Source/Free Library Error Rates

Data quality is always a huge challenge when working with file formats. I wanted to find out the error rate of open source/free PDF libraries for Java, i.e. the percentage of PDF files they cannot read.

I downloaded 3583 PDF files from the intertubes and here are the results:

  1. ICEpdf

    ICEpdf couldn’t read 1.981579682 % of the files, though I’m not absolutely sure of the number because exceptions were horribly intermingled with logging output.

  2. jPod

    jPod couldn’t read 2.093217974 % of the files.

  3. Apache PDFBox

    Apache PDFBox couldn’t read 2.121127547 % of the files.

  4. PDFRenderer

    PDFRenderer (or PDF Renderer?) couldn’t read 8.317052749 % of the files.
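
As a cross-check, the percentages above can be turned back into whole-number file counts. The post does not state how many files failed per library, so the counts computed here are inferred from the reported percentages and the total of 3583 files, under the assumption that each percentage is simply failed files divided by total files. A minimal sketch:

```groovy
// Total number of PDF files, as stated in the post.
int total = 3583

// Reported error rates in percent, copied from the list above.
def reported = [
  'ICEpdf'       : 1.981579682,
  'jPod'         : 2.093217974,
  'Apache PDFBox': 2.121127547,
  'PDFRenderer'  : 8.317052749
]

// Back-compute the failure count behind each percentage.
// Assumption: percentage = failed / total * 100, so the count is a whole number.
def failures = reported.collectEntries { name, percent ->
  [(name): Math.round((percent * total / 100) as double) as int]
}

failures.each { name, failed ->
  // Recompute the percentage from the inferred count as a sanity check.
  println "${name}: ${failed} of ${total} files (${failed * 100 / total} %)"
}
```

Under that assumption the inferred counts are 71, 75, 76 and 298 files, so the first three libraries are only a handful of files apart.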

Here are the code snippets I used to parse the files. They are all written in Groovy, which is really good.

PDFBox

@Grab(group='org.apache.pdfbox', module='pdfbox', version='1.3.1')
import org.apache.pdfbox.pdmodel.*

(new File('/home/johann/Desktop/pdf/').listFiles() as List).each {
 try {
  PDDocument.load(it)?.close()
 }
 catch (Throwable t) { println "${it}: ${t.message}" }
}

PDFRenderer

import com.sun.pdfview.*
import java.io.*
import java.nio.*
import java.nio.channels.*

(new File('/home/johann/Desktop/pdf/').listFiles() as List).each {
 RandomAccessFile raf = null
 try {
  raf = new RandomAccessFile(it, 'r')
  FileChannel channel = raf.channel
  ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
  PDFFile pdf = new PDFFile(buf)
 }
 catch (Throwable t) {
  println "${it}: ${t.message}"
 }
 finally {
  raf?.close()
 }
}

ICEpdf

import org.icepdf.core.exceptions.*
import org.icepdf.core.pobjects.*

(new File('/home/johann/Desktop/pdf/').listFiles() as List).each {
 Document document = new Document()
 try {
  document.file = it.absolutePath
 }
 catch (Throwable t) { println "${it}: ${t.message}" }
 finally {
  document.dispose()
 }
}

jPod

import de.intarsys.pdf.pd.*
import de.intarsys.tools.locator.*

(new File('/home/johann/Desktop/pdf/').listFiles() as List).each {
 try {
  FileLocator locator = new FileLocator(it.absolutePath)
  PDDocument.createFromLocator(locator)?.close()
 }
 catch (Throwable t) { println "${it}: ${t.message}" }
}
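
The four snippets share the same loop and differ only in the call that opens the file. The following is a minimal sketch of a shared harness that counts failures instead of printing them, which avoids having to pick exceptions out of logging output. The names `errorRate`, `open` and `headerCheck` are made up for this example and are not part of any of the libraries. `headerCheck` is only a stand-in that looks for the `%PDF-` marker at the start of a file so the sketch runs without any PDF library on the classpath.

```groovy
import java.nio.file.Files

// Hypothetical helper: runs `open` on every file and returns the
// percentage of files for which it threw.
def errorRate = { List<File> files, Closure open ->
  int failed = 0
  files.each { file ->
    try {
      open(file)
    }
    catch (Throwable t) {
      failed++
    }
  }
  files ? failed * 100 / files.size() : 0
}

// Stand-in "parser" for demonstration only: rejects files that do not
// start with the %PDF- marker. A real run would pass one of the library
// calls above instead, e.g. { PDDocument.load(it)?.close() }.
def headerCheck = { File file ->
  def header = new byte[5]
  file.withInputStream { it.read(header) }
  if (new String(header, 'US-ASCII') != '%PDF-') {
    throw new IOException("${file}: no PDF header")
  }
}

// Two throwaway files in a temporary directory: one accepted, one rejected.
def dir = Files.createTempDirectory('pdf-error-rate').toFile()
new File(dir, 'good.pdf').text = '%PDF-1.4\n'
new File(dir, 'bad.pdf').text = 'not a pdf'

println "${errorRate(dir.listFiles() as List, headerCheck)} % of the files could not be read"
```

Because the helper catches `Throwable`, like the snippets above, it also counts errors such as `OutOfMemoryError` as unreadable files.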


Johann Burkard, @johannburkard