Sunday, September 21, 2008

Extracting text from scanned documents and copy-protected PDFs

Consider a situation where you need to copy text from a PDF document and you can't, because it is copy-protected. Worse still, the text that you need is in a scanned document or image! Ever wondered how you can do that?

Well, there is a way! Read on to find out how.

A prerequisite for this is that you have the Microsoft Office Document Imaging feature installed on your computer (it is typically installed by default, along with Microsoft Office).
  1. Open the PDF or image file from which you need to extract the text.
  2. Print the file using the Microsoft Office Document Image Writer, selecting it from the printer name drop-down.
  3. This prints the document to a .mdi file (Microsoft Document Imaging format). At the save prompt, give a file name and save the file.
  4. The file will open automatically. From the Tools menu, select the Send Text to Word option.
  5. Choose a preferred location for saving the Word file and click OK.
  6. Click OK at the prompt "You must re-run OCR before performing this operation. This may take a while".
  7. This extracts the text from the .mdi file and saves it to the Word file. The Word file will also open automatically at the end of the operation.
Simple, right? But remember that the formatting may not be correct, and the accuracy of the text extraction depends on the quality of the scanned image. Still, it serves the purpose when it comes to getting text out of irritating copy-protected PDFs and scanned text documents.
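
Since the formatting of the extracted text may not be correct, a little cleanup afterwards can help. Below is a minimal sketch in Python, assuming you have saved the extracted text as plain text. The function name tidy_ocr_text is my own invention for illustration and is not part of Microsoft Office or any OCR tool. It rejoins words that were hyphenated across line breaks and turns each block of hard-wrapped lines back into a single paragraph.

```python
import re


def tidy_ocr_text(raw: str) -> str:
    """Clean up common layout artifacts in OCR-extracted text (hypothetical helper)."""
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split by a hyphen at the end of a line: "exam-\nple" -> "example".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat one or more blank lines as a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, collapse line breaks and repeated spaces into one space.
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    # Drop empty paragraphs and separate the rest with a single blank line.
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = "This is an exam-\nple of   scanned\ntext.\n\n\nSecond  para-\r\ngraph."
    print(tidy_ocr_text(sample))
```

One caveat with this approach: a genuinely hyphenated word that happens to fall at the end of a line (say "copy-" followed by "protected") will also be joined, so it is worth skimming the result.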

I found this useful tip on the internet when I had to type out 4 text-filled pages because the PDF was copy-protected. It took me just 30 minutes to google the information, understand the technique and get the extracted Word file. (Typing would have taken me 30 minutes per page.)

Credit goes to the original poster.

Earlier, there was another way for PDF files: use Gmail's "View as HTML" option to display the copy-protected PDF in HTML format, and then simply copy the text. Though I have not used it myself, I saw many posts and blogs suggesting this. Unfortunately, the feature no longer works in Gmail.