Rendition Protocol: Using PDFBox in C#

Last week I was trying to extract text from PDF files automatically. After some searching I found a CodeProject.com article describing how to use PDFBox from C#. As of this writing, the DLLs needed for the C# version ship only with the old SourceForge release, which appears to lag behind the current Java version in the Apache Incubator.

For the SourceForge version, I created a console application and added references to FontBox-0.1.0-dev, ICSharpCode.SharpZipLib, IKVM.AWT.WinForms, IKVM.GNU.Classpath, IKVM.Runtime, and PDFBox-0.7.3.

Here's the sample code I used:


...
using org.pdfbox.pdmodel;
using org.pdfbox.util;
using System.IO;
...
static void Main(string[] args)
{
  // All the error catching is left for you to do.
  string[] files = {
   @"C:\9780791093757.pdf",
   @"C:\9780791095850.pdf",
   @"C:\9780816048526.pdf"
  };
  bool UseIndividualPages = false;

  foreach (string s in files)
  {
    // Same folder and base name as the PDF, with a .txt extension.
    string textFileDocument = Path.ChangeExtension(s, ".txt");
  
    PDDocument pdfDocument = PDDocument.load(s);   
    PDFTextStripper pdfStripper = new PDFTextStripper();
    pdfStripper.setPageSeparator(Environment.NewLine + Environment.NewLine);
   
    if (UseIndividualPages) // Extracts one file per page in the PDF
    {
      // PDFTextStripper page numbers are 1-based, so count from 1.
      for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
      {
        pdfStripper.setStartPage(i);
        pdfStripper.setEndPage(i);
        string textFilePage = Path.Combine(Path.GetDirectoryName(s),
          Path.GetFileNameWithoutExtension(s) + " " + i + ".txt");
        ExtractText(pdfStripper, pdfDocument, textFilePage);
      }
    }
    else // Extracts one file for the entire PDF
    {
      ExtractText(pdfStripper, pdfDocument, textFileDocument);
    }

    pdfDocument.close();   
  }

  Console.ReadLine();
}

static void ExtractText(PDFTextStripper textStripper, PDDocument document,
  string outputFile)
{
  if (File.Exists(outputFile)) File.Delete(outputFile);
  using (StreamWriter sw = new StreamWriter(outputFile))
  {
    sw.Write(textStripper.getText(document));
    Console.WriteLine(outputFile);
  }
}

This works well enough except for one substantial problem: end-of-line hyphens are included. In the PDF, the text is flowed, so long words that fall at the end of a line are hyphenated. When I copy and paste the text out of Adobe Reader, the hyphens are left out and the words are whole. When I use PDFBox, the hyphens appear. I expected those characters to be special characters of some sort (a soft hyphen, say), but they are regular hyphens, which means I can't reliably strip them with a regular expression without also risking words that are genuinely hyphenated. Unfortunate.

I haven't found a way around this problem yet.
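
A partial, heuristic workaround might look like the sketch below. It is not a real fix and it is not part of PDFBox: `Dehyphenator` and `JoinHyphenatedLineBreaks` are names made up for illustration. It assumes that a hyphen sitting directly after a letter at the end of a line, with a lowercase letter starting the next line, marks a word broken by text flow. That assumption is wrong for genuine compounds that happen to break at the hyphen ("well-" / "known" would come out as "wellknown"), so the output would still need checking.

```csharp
using System;
using System.Text.RegularExpressions;

static class Dehyphenator
{
    // Matches: a hyphen preceded by a letter, optional trailing spaces or tabs,
    // a line break, optional leading spaces or tabs, and then (as a lookahead)
    // a lowercase letter at the start of the next line.
    private static readonly Regex LineEndHyphen =
        new Regex(@"(?<=\p{L})-[ \t]*\r?\n[ \t]*(?=\p{Ll})");

    // Heuristic only: removes the hyphen and the line break so the two
    // halves of the word are joined. Genuine compounds split at the
    // hyphen are joined too, which is the known weakness of this approach.
    public static string JoinHyphenatedLineBreaks(string text)
    {
        return LineEndHyphen.Replace(text, string.Empty);
    }
}

static class Program
{
    static void Main()
    {
        string sample = "The docu-\r\nment was long.";
        Console.WriteLine(Dehyphenator.JoinHyphenatedLineBreaks(sample));
        // prints "The document was long."
    }
}
```

In the sample code above, this would be applied to the string returned by `textStripper.getText(document)` before it is written to the output file. Hyphens in the middle of a line and hyphens followed by a capitalized word on the next line are left alone.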

Monday, January 19, 2009
