
Fully searchable PDF check (OCR)

Discussion in 'Programming & Software Development' started by smack01, Jan 21, 2022.

  1. smack01

    smack01 Member

    Joined:
    Aug 19, 2002
    Messages:
    3,582
    Location:
    Brisbane
    I am trying to find a solution for an issue that has popped up at work. It centers on being able to confirm that a PDF is 100% fully searchable.

    We have a team that is responsible for submitting statutory body annual reports. There is a mandated requirement that these PDF reports be 100% searchable. This includes the title page, all tables, all graphs, etc. Previously the team has done spot searches on the documents, concentrating on the areas where the OCR process is most likely to miss text (e.g. tables and graphs). They have been caught out a few times where one or two sentences in the middle of a page were, for some reason, not searchable (even after running multiple OCR passes).

    Does anyone know of any programs that could automate this work? Something that scans a PDF and picks up any areas where it believes searchable text is missing? The accessibility check in Adobe doesn't help, unfortunately.

    Hope that makes sense. Thanks in advance.
     
  2. neRok

    neRok Member

    Joined:
    Aug 19, 2006
    Messages:
    4,316
    Location:
    Perth NOR
    You should print a text searchable PDF directly from the source (Word, Excel, whatever). Printing to paper, scanning that back to PDF, and doing OCR is for noobs.
     
  3. OP
    OP
    smack01

    smack01 Member

    Joined:
    Aug 19, 2002
    Messages:
    3,582
    Location:
    Brisbane
    Just an FYI, that's not what they do. Nor are they responsible for creating the document in the first place. They act in an auditor role and collate the submission to government.
     
  4. neRok

    neRok Member

    Joined:
    Aug 19, 2006
    Messages:
    4,316
    Location:
    Perth NOR
    I was waiting to see if anyone else responded, but I guess they are protecting their trade secrets :p

    Edit: TLDR: No, I don't know of anything that does what you want, because I just print better PDFs to begin with.

    No OCR/AI/ML system is going to be infallible at this task. If you have absolutely no control over the supplied PDF and want a system that is as automated as possible, I would suggest running the PDF through 2 or more different OCR systems, and comparing their output. Any difference in the outputs means human review is required. But of course you could still end up in the situation where neither system picks up certain characters, or you get many "false positive" errors like - vs — or ' vs ´.
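
A rough sketch of the "compare two OCR outputs" idea, assuming you already have the plain text from each OCR system as a string. The function names and the look-alike table are made up for illustration, and the table is nowhere near exhaustive. It folds the dash and apostrophe variants into one form first so they don't show up as differences, then reports the word runs where the two outputs still disagree.

```python
import difflib
import unicodedata

# Look-alike characters folded to a single form before comparing.
# Illustrative assumption only: extend this for your own documents.
LOOKALIKES = str.maketrans({
    "\u2014": "-",   # em dash
    "\u2013": "-",   # en dash
    "\u00b4": "'",   # acute accent used as an apostrophe
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})


def normalize(text):
    """Fold look-alike punctuation, unify ligatures, collapse whitespace."""
    # Translate first: NFKC would otherwise turn the acute accent into
    # a space plus a combining mark.
    text = unicodedata.normalize("NFKC", text.translate(LOOKALIKES))
    return " ".join(text.split())


def differing_words(ocr_a, ocr_b):
    """Return (words_from_a, words_from_b) pairs where the outputs disagree."""
    a = normalize(ocr_a).split()
    b = normalize(ocr_b).split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

An empty result only means the two systems agree with each other. It says nothing about text that both of them missed, so it reduces the manual review rather than replacing it.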

    And you didn't say whether these PDFs are scanned ("analogue") or "digital" (printed directly to PDF). If they are scanned, the result will never be perfect, because you get small cumulative errors from the printer, the paper flatness, the scanner itself, the compression algorithm used by the scanner, etc. At that point the OCR might even start recognising U as V.

    If you can get the PDF creators to change their ways, you will have a better time. If the PDFs you are receiving are already digital, I would be asking why they aren't already text-searchable. You might find they are using some antiquated process like "printing" PostScript files and feeding them into Adobe Distiller to get a PDF out. Just changing to a "print to PDF" printer might be all that is required.

    Or if there is some other reason the PDF has to be that way, there are other creative processes you could implement. You could request that they use a monospace font with a set font size and line height, and then divide the PDF page into rows (or even "cells") for the OCR. Because you know how many characters each line should contain, you can use that structure as part of the checking process. Or you could get them to give you a .txt output of the same document, and compare the text in that file to the OCR output. Many things you could do.
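
For the monospace/row idea, a minimal sketch, assuming you already have the OCR text split into rows and you know how many characters each row should hold (for example, counted from the supplied .txt). `rows_to_review` is a hypothetical helper name, not part of any OCR tool.

```python
def rows_to_review(ocr_rows, expected_counts):
    """Return 1-based row numbers whose OCR character count is not as expected.

    ocr_rows        -- list of strings, one per OCR'd row of the page
    expected_counts -- list of ints, expected characters in each row
    Rows missing from either list count as zero characters.
    """
    flagged = []
    for index in range(max(len(ocr_rows), len(expected_counts))):
        # Trailing whitespace is ignored because OCR engines pad rows inconsistently.
        got = len(ocr_rows[index].rstrip()) if index < len(ocr_rows) else 0
        want = expected_counts[index] if index < len(expected_counts) else 0
        if got != want:
            flagged.append(index + 1)
    return flagged


# Example: derive the expected counts from a reference text, then check a page.
reference = "ANNUAL REPORT\nTotal   1,234\nOther     567"
expected = [len(line.rstrip()) for line in reference.splitlines()]
print(rows_to_review(["ANNUAL REPORT", "Total   1,23", "Other     567"], expected))
```

A matching count doesn't prove the characters are right (U vs V has the same length), so this catches dropped or extra characters only.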

    I've had to push for changes like this many times over the years. AutoCAD, for example, has a font system (probably the original system from the 80s) that draws the letters out of tiny lines. So if you use these fonts and print directly to PDF, the result isn't text. To work around this, recent versions have a built-in PDF printer that also puts invisible digital text over the top to make it text-searchable. These days you can also use Windows fonts in AutoCAD, which PDF-print to digital text, but because a lot of the "CAD systems" big companies use have been continually evolving since the 80s, they are still using line-fonts a lot. It's ridiculous!
    But then it's all made redundant anyway, because the whole process is outdated and every drawing still has to be wet-signed. So you get printed drawings with signatures scanned back in and transmitted to the client, and then you get that one client that is a bit modern and wants text-searchable PDFs, and so I've seen graduate engineers running around trying to OCR shit like you are. It's lolz.
     
    Last edited: Jan 26, 2022
    elvis and grs1961 like this.
  5. mtma

    mtma Member

    Joined:
    Aug 12, 2009
    Messages:
    6,178
    Sounds like the document as submitted is the subject of the audit, so they're probably not going to make two versions. That would break the basis of the audit, because the second version is not the actual material and has been altered before it reached the auditor(s). The auditor is there to do what they have to do.

    Seems like the checking process at least could be automated in part to help, given that the data is what it is. A simple possibility: once you've done the OCR conversion, if you delete everything the OCR recognised as text, whatever remains is either not text or text that escaped the OCR. On this basis a scripted process could be developed to bring these areas to the reviewer's attention, with a higher degree of coverage than manual spot checks can give.
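
One way to sketch that "delete the OCR text and look at what's left" check, assuming you can get two lists of rectangles per page from whatever tools you use: boxes around the visible marks on the page ("ink") and boxes of the searchable text layer. How you obtain those boxes is outside this sketch, and the function names and the 0.5 threshold are assumptions for illustration.

```python
def area(box):
    """Area of an (x0, y0, x1, y1) box; zero if the box is empty or inverted."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def overlap(a, b):
    """Area of the intersection of two (x0, y0, x1, y1) boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))


def uncovered_regions(ink_boxes, text_boxes, min_coverage=0.5):
    """Return ink boxes that the searchable text layer mostly fails to cover.

    Coverage is the summed overlap with all text boxes divided by the ink
    box area, which assumes the text boxes do not overlap each other.
    """
    flagged = []
    for ink in ink_boxes:
        if area(ink) == 0:
            continue  # degenerate box, nothing visible to check
        covered = sum(overlap(ink, text) for text in text_boxes) / area(ink)
        if covered < min_coverage:
            flagged.append(ink)
    return flagged


# Two lines of visible marks, but only the first has searchable text over it.
ink = [(0, 0, 10, 2), (0, 5, 10, 7)]
text = [(0, 0, 10, 2)]
print(uncovered_regions(ink, text))
```

Everything flagged still needs a person to look at it, since a logo or a chart line will be flagged just like a missed sentence.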
     
  6. elvis

    elvis OCAU's most famous and arrogant know-it-all

    Joined:
    Jun 27, 2001
    Messages:
    46,809
    Location:
    Brisbane
    Cracking post, and this sentence is the winner.

    Sounds like it's yet another case of "trying to fix bad process with clever/magic software". This never works. Fix the process instead.
     
    grs1961 and Dilbery like this.
