Monday, January 19, 2009

Using PDFBox in C#

Last week I was trying to extract text from PDF files in an automated fashion. After some searching I found a CodeProject.com article describing how to use PDFBox in C#. As of this writing, the DLLs needed for C# ship only with the older SourceForge release, which appears to lag behind the current Java version in the Apache Incubator.

For the SourceForge version, I created a console application and added references to FontBox-0.1.0-dev, ICSharpCode.SharpZipLib, IKVM.AWT.WinForms, IKVM.GNU.Classpath, IKVM.Runtime, and PDFBox-0.7.3.

Here's the sample code I used:

...
using System;
using System.IO;
using org.pdfbox.pdmodel;
using org.pdfbox.util;
...
static void Main(string[] args)
{
    // All the error catching is left for you to do.
    string[] files = {
        @"C:\9780791093757.pdf",
        @"C:\9780791095850.pdf",
        @"C:\9780816048526.pdf"
    };
    bool useIndividualPages = false;

    foreach (string s in files)
    {
        // Output files go next to the source PDF.
        string directory = Path.GetDirectoryName(s);
        string baseName = Path.GetFileNameWithoutExtension(s);

        PDDocument pdfDocument = PDDocument.load(s);
        try
        {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            pdfStripper.setPageSeparator(Environment.NewLine + Environment.NewLine);

            if (useIndividualPages) // Writes one text file per page in the PDF
            {
                // Page numbers passed to setStartPage()/setEndPage() are 1-based.
                for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
                {
                    pdfStripper.setStartPage(i);
                    pdfStripper.setEndPage(i);
                    string textFilePage = Path.Combine(directory, baseName + " " + i + ".txt");
                    ExtractText(pdfStripper, pdfDocument, textFilePage);
                }
            }
            else // Writes one text file for the entire PDF
            {
                string textFileDocument = Path.Combine(directory, baseName + ".txt");
                ExtractText(pdfStripper, pdfDocument, textFileDocument);
            }
        }
        finally
        {
            // Release the PDF even if extraction throws.
            pdfDocument.close();
        }
    }

    Console.ReadLine();
}

static void ExtractText(PDFTextStripper textStripper, PDDocument document,
    string outputFile)
{
    // StreamWriter overwrites an existing file by default, so no delete is needed.
    using (StreamWriter sw = new StreamWriter(outputFile))
    {
        sw.Write(textStripper.getText(document));
    }
    Console.WriteLine(outputFile);
}


This works well enough except for one substantial problem: end-of-line hyphens are included. In the PDF, the text is flowed, so long words at the ends of lines are hyphenated. When I copy and paste the text out of Adobe Reader, the hyphens are left out and the words are whole. When I use PDFBox, the hyphens appear. I expected those characters to be special characters of some sort or another, but they are regular hyphens, which means a regular expression can't reliably tell them apart from hyphens that belong in the word. Unfortunate.

I haven't found a way around this problem yet.
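
One partial workaround that might be worth trying is a heuristic: treat a hyphen as a line-break hyphen only when it sits between a letter and a line break that is followed by a lowercase letter. This is a minimal sketch, not something tested against these PDFs. The class name Dehyphenator and the method JoinLineEndHyphens are hypothetical names made up for illustration, and only the standard .NET Regex class is used.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of PDFBox.
static class Dehyphenator
{
    // Matches a hyphen that directly follows a letter, sits at the end of a
    // line, and is followed by a lowercase letter on the next line.
    static readonly Regex LineEndHyphen =
        new Regex(@"(?<=\p{L})-\r?\n(?=\p{Ll})");

    // Removes the hyphen and the line break so the two word halves are joined.
    public static string JoinLineEndHyphens(string text)
    {
        return LineEndHyphen.Replace(text, "");
    }
}
```

It could be applied to the string returned by getText() before writing it out. The catch is the one described above: a genuine compound that happens to break at its hyphen (say, "well-" at the end of one line and "known" at the start of the next) gets merged too, so this trades one kind of error for a rarer one.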

Sunday, January 18, 2009

Optimum WiFi on Metro-North's New Haven Line

I'm on the Metro-North New Haven line quite a bit these days. I thought having access to Optimum WiFi would be a boon, but that hasn't been the case. In nearly all cases, you have to be right on the platform of each station to get access, which is of little use on the train itself.

Here are some other general notes:
  • If you allow automatic connections, the network will connect by itself at each station. I found it reconnected faster when I refreshed the network list.
  • You do not seem to need to log in at each hotspot after you've logged in the first time.
  • The connection is not secure, so if you're not on a secured site, all the network traffic can be sniffed easily.
  • At most stations, I found I have just enough time to connect and do one Google search or refresh a page before the train starts moving and the signal cuts out.

Here are some comments on when you can connect at various stations:

  • Greens Farms, on the platform only.
  • Westport, on the platform only.
  • East Norwalk, on the platform only, but signal very poor.
  • South Norwalk, immediately approaching platform.
  • Rowayton, on the platform only.
  • Noroton Heights, on the platform only.
  • Stamford, immediately approaching platform, but then signal cuts out under the building.
  • Riverside, approaching platform and past it, but the signal is weak.
  • Greenwich, immediately approaching platform.

Saturday, January 17, 2009

You'll Get the Idea

I'll refine this over time.


  • CCI Mini-Mag HP HV (36 grain): Very inconsistent report, but no jams.

  • Federal Game Shok HV (40 grain): Clean and consistent. No jams.

  • Remington Yellow Jacket HP (33 grain): Clean and consistent. No jams.

  • Remington 22 Thunderbolt (40 grain): ?

  • Winchester Super X Power Point HP (40 grain): ?