Monday, January 19, 2009

Using PDFBox in C#

Last week I was trying to extract text from PDF files in an automated fashion. After some searching I found a CodeProject.com article describing how to use PDFBox in C#. As of this writing, the DLLs needed for C# ship only with the old SourceForge release, which seems to lag behind the current Java version at the Apache Incubator.

For the SourceForge version, I created a console application and added references to FontBox-0.1.0-dev, ICSharpCode.SharpZipLib, IKVM.AWT.WinForms, IKVM.GNU.Classpath, IKVM.Runtime, and PDFBox-0.7.3.

Here's the sample code I used:

...
using System;    // Environment and Console
using System.IO; // Path, File, StreamWriter
using org.pdfbox.pdmodel;
using org.pdfbox.util;
...
static void Main(string[] args)
{
    // All the error catching is left for you to do.
    string[] files = {
        @"C:\9780791093757.pdf",
        @"C:\9780791095850.pdf",
        @"C:\9780816048526.pdf"
    };
    bool UseIndividualPages = false;

    foreach (string s in files)
    {
        string textFileDocument = Path.GetDirectoryName(s) + Path.DirectorySeparatorChar +
            Path.GetFileNameWithoutExtension(s) + ".txt";

        PDDocument pdfDocument = PDDocument.load(s);
        PDFTextStripper pdfStripper = new PDFTextStripper();
        pdfStripper.setPageSeparator(Environment.NewLine + Environment.NewLine);

        if (UseIndividualPages) // Extracts one file per page in the PDF
        {
            // Page numbers passed to setStartPage()/setEndPage() are 1-based.
            for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
            {
                pdfStripper.setStartPage(i);
                pdfStripper.setEndPage(i);
                string textFilePage = Path.GetDirectoryName(s) + Path.DirectorySeparatorChar +
                    Path.GetFileNameWithoutExtension(s) + " " + i.ToString() + ".txt";
                ExtractText(pdfStripper, pdfDocument, textFilePage);
            }
        }
        else // Extracts one file for the entire PDF
        {
            ExtractText(pdfStripper, pdfDocument, textFileDocument);
        }

        pdfDocument.close();
    }

    Console.ReadLine();
}

static void ExtractText(PDFTextStripper textStripper, PDDocument document,
    string outputFile)
{
    if (File.Exists(outputFile)) File.Delete(outputFile);
    using (StreamWriter sw = new StreamWriter(outputFile))
    {
        sw.Write(textStripper.getText(document));
        Console.WriteLine(outputFile);
    }
}


This works well enough except for one substantial problem: end-of-line hyphens are included. In the PDF, the text is flowed, so long words at the end of a line are hyphenated. When I copy and paste the text out of Adobe Reader, the hyphens are left out and the words are whole. When I use PDFBox, the hyphens appear. I expected those characters to be special characters of some sort or another, but they are regular hyphens, which means I can't reliably tell them apart from genuine hyphens with a regular expression. Unfortunate.

I haven't found a way around this problem yet.
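
The closest thing I can offer is a rough heuristic, sketched here rather than a real fix. It assumes the extracted text keeps the PDF's line breaks, and it only joins a word when a hyphen sits at the very end of a line and the next line starts with a lowercase letter. HyphenFixer and Dehyphenate are made-up names for illustration, not part of PDFBox. It will still wrongly join a genuinely hyphenated word (say, "well-known") if the line happens to break at its hyphen.

```csharp
using System;
using System.Text.RegularExpressions;

static class HyphenFixer
{
    // Matches a hyphen that directly follows a letter, ends its line
    // (optionally followed by spaces or tabs), and is continued on the
    // next line by a lowercase letter.
    private static readonly Regex EndOfLineHyphen =
        new Regex(@"(?<=\p{L})-[ \t]*\r?\n[ \t]*(?=\p{Ll})");

    // Removes the hyphen and the line break so the two halves of the
    // word are joined. Everything else is left untouched.
    public static string Dehyphenate(string text)
    {
        return EndOfLineHyphen.Replace(text, "");
    }
}
```

In ExtractText, this would mean writing HyphenFixer.Dehyphenate(textStripper.getText(document)) instead of the raw text. Requiring a lowercase letter on the next line is a deliberate choice: it leaves names like "Jean-Paul" alone when they break across lines, at the cost of missing some real end-of-line splits.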

1 comment:

Michael said...

Hi Mattio

Just a tip, where you have:

Path.GetDirectoryName(s) + Path.DirectorySeparatorChar + Path.GetFileNameWithoutExtension(s) + ".txt";

I think you could use:

Path.ChangeExtension(s, ".txt");

Regards
Michael