When dealing with scanned documents in PDF, the output file should not be much bigger than the input
I OCR'd a multi-page scanned document in PDF that is 119 pages long. The original file is 16MB; the output is 41MB. I doubt the text accounts for the additional 25MB... Perhaps it would be possible to add the text to the PDF without re-processing the images?
Thanks for the suggestion. We currently re-encode the image into the new PDF, which may change its size. Was the original black and white or greyscale rather than colour?
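To put the numbers in this thread in perspective, here is a rough back-of-the-envelope sketch. The per-page figures come straight from the sizes reported above; the page geometry (US Letter at 300 dpi) is purely an assumption for illustration, since the thread does not state the scan settings, and the helper names are hypothetical.

```python
PAGES = 119  # page count reported in the original post


def kib_per_page(total_mb: float, pages: int = PAGES) -> float:
    """Average size per page in KiB, treating 1 MB as 1024 KiB."""
    return total_mb * 1024 / pages


def raw_page_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed bitmap size of one page, with each row padded to a whole byte."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


if __name__ == "__main__":
    print(f"input : {kib_per_page(16):.0f} KiB per page")
    print(f"output: {kib_per_page(41):.0f} KiB per page")
    # Assumed scan settings (not stated in the thread): US Letter, 300 dpi.
    for label, bpp in (("1-bit black and white", 1), ("8-bit greyscale", 8), ("24-bit colour", 24)):
        print(f"{label}: {raw_page_bytes(8.5, 11, 300, bpp):,} bytes uncompressed")
```

Under those assumptions, an 8-bit greyscale page holds roughly eight times the raw data of a 1-bit page before any compression is applied, so a re-encode that changes the bit depth or the compression scheme could plausibly account for growth of this order.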
-
Anonymous commented
I also have a black-and-white document. It started at 8 MB and grew to 66 MB after OCR.
-
fcchambers commented
I use a ScanSnap, and your product is sooooooo close to astonishingly awesome. (Sorry, for now it's just regular "awesome"...)
Are you actively working on the "files get kinda big after OCR" issue? Any timelines?
Also, I haven't purchased a license yet. (I'm just double-checking the implications of the file size issue, i.e. if I'd need to purchase more storage and backup, the "way too expensive" Acrobat might actually be cheaper for me. <Sniff>, because I like your product much better for my purposes...)
Also, it would be kinda cool if license holders got more votes for features... :)
-
Matthew Green commented
The original was black and white (probably compressed TIFF files within the PDF file). Perhaps you need a way for the program to take note of the original format of the images and re-compress them using the same algorithm, if recompress you must (i.e. if you can't transfer the original images from the original PDF to the new one without reprocessing them).
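One way to "take note of the original format" is to look at the `/Filter` entry on each image object in the PDF. The sketch below is a crude byte scan using only the Python standard library; it is not a real PDF parser and not how the product works, and the function name `image_filters` is hypothetical. The filter names it reports (`CCITTFaxDecode`, `JBIG2Decode`, `DCTDecode`, `FlateDecode` and so on) are the standard PDF stream filter names. It assumes an unencrypted file.

```python
import re
from collections import Counter

# An image XObject dictionary declares /Subtype /Image.
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")
# /Filter is either a single name or an array of names.
FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)")
NAME_RE = re.compile(rb"/([A-Za-z0-9]+)")


def image_filters(pdf_bytes: bytes) -> Counter:
    """Count the compression filters declared on image objects (crude scan)."""
    counts = Counter()
    for chunk in pdf_bytes.split(b"endobj"):
        # Only inspect the dictionary that precedes the stream data.
        header = chunk.split(b"stream", 1)[0]
        if not IMAGE_RE.search(header):
            continue
        match = FILTER_RE.search(header)
        if match is None:
            counts["(no filter)"] += 1
            continue
        names = NAME_RE.findall(match.group(1))
        counts["+".join(name.decode("ascii") for name in names)] += 1
    return counts


if __name__ == "__main__":
    # Tiny hand-written fragment standing in for a real file.
    sample = (
        b"1 0 obj\n<< /Type /XObject /Subtype /Image /BitsPerComponent 1 "
        b"/Filter /CCITTFaxDecode /Length 4 >>\nstream\nabcd\nendstream\nendobj\n"
    )
    print(image_filters(sample))
```

Running this on both the input and the OCR'd output (for example `image_filters(open(path, "rb").read())`) would show whether the images changed scheme. If the input reports a bilevel filter such as `CCITTFaxDecode` or `JBIG2Decode` and the output reports `DCTDecode` or `FlateDecode`, the re-encode is the likely source of the growth, and copying the original image streams across unchanged would avoid it.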