For the SourceForge version, I created a console application and added references to FontBox-0.1.0-dev, ICSharpCode.SharpZipLib, IKVM.AWT.WinForms, IKVM.GNU.Classpath, IKVM.Runtime, and PDFBox-0.7.3.
Here's the sample code I used:
...
using org.pdfbox.pdmodel;
using org.pdfbox.util;
using System.IO;
...
static void Main(string[] args)
{
    // All the error catching is left for you to do.
    string[] files = {
        @"C:\9780791093757.pdf",
        @"C:\9780791095850.pdf",
        @"C:\9780816048526.pdf"
    };
    bool useIndividualPages = false;
    foreach (string s in files)
    {
        string textFileDocument = Path.GetDirectoryName(s) + Path.DirectorySeparatorChar +
            Path.GetFileNameWithoutExtension(s) + ".txt";
        PDDocument pdfDocument = PDDocument.load(s);
        PDFTextStripper pdfStripper = new PDFTextStripper();
        pdfStripper.setPageSeparator(Environment.NewLine + Environment.NewLine);
        if (useIndividualPages) // Extracts one file per page in the PDF
        {
            // setStartPage() and setEndPage() take 1-based page numbers,
            // so the loop runs from 1 through getNumberOfPages() inclusive.
            for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++)
            {
                pdfStripper.setStartPage(i);
                pdfStripper.setEndPage(i);
                string textFilePage = Path.GetDirectoryName(s) + Path.DirectorySeparatorChar +
                    Path.GetFileNameWithoutExtension(s) + " " + i.ToString() + ".txt";
                ExtractText(pdfStripper, pdfDocument, textFilePage);
            }
        }
        else // Extracts one file for the entire PDF
        {
            ExtractText(pdfStripper, pdfDocument, textFileDocument);
        }
        pdfDocument.close();
    }
    Console.ReadLine(); // Keeps the console window open until Enter is pressed
}

// Writes the text of the stripper's current page range to outputFile,
// replacing any existing file, and echoes the file name to the console.
static void ExtractText(PDFTextStripper textStripper, PDDocument document,
    string outputFile)
{
    if (File.Exists(outputFile)) File.Delete(outputFile);
    using (StreamWriter sw = new StreamWriter(outputFile))
    {
        sw.Write(textStripper.getText(document));
        Console.WriteLine(outputFile);
    }
}
This works well enough except for one substantial problem: end-of-line hyphens are included. In the PDF, the text is flowed, so long words at the ends of lines are hyphenated. When I copy and paste the text out of Adobe Reader, the hyphens are left out and the words are whole. When I use PDFBox, the hyphens appear. I expected those characters to be special characters of some sort, but they are regular hyphens, which means a regular expression can't reliably tell them apart from the hyphens that belong in the text. Unfortunate.
I haven't found a way around this problem yet.
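For what it's worth, here is a rough heuristic sketch, not a tested fix. It assumes PDFBox emits a line break immediately after each end-of-line hyphen, which I have not verified against these PDFs. The class and method names (Dehyphenator, JoinHyphenatedLines) are made up for illustration. It removes a hyphen only when it sits between a letter and a line break followed by a lowercase letter:

```csharp
using System.Text.RegularExpressions;

static class Dehyphenator
{
    // Matches a hyphen that is preceded by a letter, followed by a line break
    // (LF or CRLF), and then by a lowercase letter. The lookarounds keep the
    // surrounding letters out of the match, so only "-" plus the line break
    // is removed.
    static readonly Regex LineEndHyphen =
        new Regex(@"(?<=\p{L})-\r?\n(?=\p{Ll})");

    // Heuristic only: joins the two halves of a word split across lines.
    // Assumption: the extracted text has a line break right after the hyphen.
    public static string JoinHyphenatedLines(string text)
    {
        return LineEndHyphen.Replace(text, "");
    }
}
```

Calling Dehyphenator.JoinHyphenatedLines on the result of getText() before writing it out would apply this. The heuristic is lossy: a genuine compound that happens to break at its hyphen ("well-" at the end of one line, "known" at the start of the next) comes out as "wellknown", so it trades one kind of error for another.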
1 comment:
Hi Mattio
Just a tip, where you have:
Path.GetDirectoryName(s) + Path.DirectorySeparatorChar + Path.GetFileNameWithoutExtension(s) + ".txt";
I think you could use:
Path.ChangeExtension(s, ".txt");
Regards
Michael
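To make the tip above concrete, here is a small sketch of both path helpers. The class and method names (OutputPaths, TextPath, PageTextPath) are hypothetical; only the System.IO.Path calls come from the framework. Path.ChangeExtension covers the one-file-per-PDF case, while the per-page name still needs the parts reassembled, and Path.Combine is one way to do that without adding the separator by hand:

```csharp
using System.IO;

static class OutputPaths
{
    // "book.pdf" -> "book.txt", keeping whatever directory the path has.
    public static string TextPath(string pdfPath)
    {
        return Path.ChangeExtension(pdfPath, ".txt");
    }

    // "book.pdf", page 3 -> "book 3.txt" in the same directory.
    // GetDirectoryName can return null (for a root path, for example),
    // so fall back to an empty string before combining.
    public static string PageTextPath(string pdfPath, int page)
    {
        string directory = Path.GetDirectoryName(pdfPath) ?? "";
        string fileName = Path.GetFileNameWithoutExtension(pdfPath) + " " + page + ".txt";
        return Path.Combine(directory, fileName);
    }
}
```

For a file that sits directly in a drive root, such as the C:\ paths in the post, GetDirectoryName already returns the root with its trailing separator, so the hand-built concatenation ends up with a doubled separator. Both helpers here avoid that.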