Code snippets, quotes, BASH scripts and nonsense. Not surprisingly, my main site is called Johann Burkard, too.
Nov 11 2010
Data quality is always a huge challenge when working with file formats. I wanted to find out the error rate of open source/free PDF libraries for Java, i.e. what percentage of PDF files they cannot read.
I downloaded 3583 PDF files from the intertubes and here are the results:
ICEpdf couldn’t read 1.981579682 % of the files, though I’m not absolutely sure of the number because exceptions were horribly intermingled with logging output.
jPod couldn’t read 2.093217974 % of the files.
Apache PDFBox couldn’t read 2.121127547 % of the files.
PDFRenderer (or PDF Renderer?) couldn’t read 8.317052749 % of the files.
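Each percentage is the number of files a library failed on divided by the 3583 files tested. Here is a minimal sketch of that arithmetic, in plain Java since these are Java libraries. The `ErrorRate` class is hypothetical, and the count of 71 failed files is an assumption back-calculated from ICEpdf's percentage, not a number stated in this post.

```java
// Hypothetical helper: class and method names are illustrative only.
class ErrorRate {
    /** Percentage of files that could not be read. */
    static double percent(int failed, int total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return 100.0 * failed / total;
    }

    public static void main(String[] args) {
        // 71 is an assumption, inferred from the 1.981579682 % reported for ICEpdf.
        System.out.printf("%.9f %%%n", percent(71, 3583));
    }
}
```

Keeping the raw counts around would make the results easier to check than percentages with nine decimal places.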
Here are the code snippets I used to parse the files. They are all written in Groovy, which is really good.
// Apache PDFBox
@Grab(group='org.apache.pdfbox', module='pdfbox', version='1.3.1')
import org.apache.pdfbox.pdmodel.*
(new File('/home/johann/Desktop/pdf/').listFiles() as List).each {
    try { PDDocument.load(it)?.close() }
    catch (Throwable t) { println "${it}: ${t.message}" }
}
// PDFRenderer
import com.sun.pdfview.*
import java.io.*
import java.nio.*
import java.nio.channels.*
(new File('/home/johann/Desktop/pdf/').listFiles() as List).each {
    RandomAccessFile raf = null
    try {
        raf = new RandomAccessFile(it, 'r')
        FileChannel channel = raf.channel
        // Map the whole file into memory; PDFFile parses it from the buffer
        ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
        PDFFile pdf = new PDFFile(buf)
    }
    catch (Throwable t) {
        println "${it}: ${t.message}"
    }
    finally {
        raf?.close()
    }
}
// ICEpdf
import org.icepdf.core.exceptions.*
import org.icepdf.core.pobjects.*
(new File('/home/johann/Desktop/pdf/').listFiles() as List).each {
    Document document = new Document()
    try {
        // Groovy property syntax for document.setFile(path), which opens the file
        document.file = it.absolutePath
    }
    catch (Throwable t) { println "${it}: ${t.message}" }
    finally {
        document.dispose()
    }
}
// jPod
import de.intarsys.pdf.pd.*
import de.intarsys.tools.locator.*
(new File('/home/johann/Desktop/pdf/').listFiles() as List).each {
    try {
        FileLocator locator = new FileLocator(it.absolutePath)
        PDDocument.createFromLocator(locator)?.close()
    }
    catch (Throwable t) { println "${it}: ${t.message}" }
}
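The four scripts differ only in the call that opens a file, and they print failures to the same stream the libraries log to, which is what made the ICEpdf number hard to read off. One way around that, sketched here in plain Java, is to collect the failing files in a list instead of printing them. `ParseHarness` and its `Parser` interface are hypothetical names for illustration and are not part of any of the libraries above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical harness: none of these names come from the libraries above.
class ParseHarness {
    /** Stand-in for one library's "open and close this file" call. */
    interface Parser {
        void parse(File file) throws Throwable;
    }

    /** Runs the parser on every entry in dir and returns the ones that threw. */
    static List<File> failures(File dir, Parser parser) {
        File[] files = dir.listFiles();
        if (files == null) {
            throw new IllegalArgumentException(dir + " is not a readable directory");
        }
        Arrays.sort(files); // listFiles() makes no promise about order
        List<File> failed = new ArrayList<>();
        for (File file : files) {
            try {
                parser.parse(file);
            } catch (Throwable t) {
                // Same catch-all as the scripts above, but collected, not printed.
                failed.add(file);
            }
        }
        return failed;
    }
}
```

With PDFBox 1.3.1 on the classpath, the parser could presumably be `f -> PDDocument.load(f).close()`, mirroring the Groovy call above. The error rate is then `100.0 * failures.size()` divided by the number of files, no matter what the library writes to its log.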