Google Cloud DLP does not support redacting documents (PDF, DOC, TIFF, etc.) directly; the Google Cloud APIs can only redact images. This Python program works around that limitation and redacts documents using the DLP API.
[You are always welcome to collaborate or suggest changes; I'll be thankful.]
Approach -
- Convert every page of the document into an image using PyMuPDF (in the Output folder).
Output/<PDF_FILENAME>/Page0001.png
Output/<PDF_FILENAME>/Page0002.png
Output/<PDF_FILENAME>/Page0003.png
...
- Redact every page from that folder and generate the output in the Redacted Images folder.
Output/<PDF_FILENAME>/Redacted Images/Redacted-Page0001.png
Output/<PDF_FILENAME>/Redacted Images/Redacted-Page0002.png
Output/<PDF_FILENAME>/Redacted Images/Redacted-Page0003.png
...
- Create a PDF from all images in the Redacted Images folder and store it in the base of the Output folder.
Output/<PDF_FILENAME>/Redacted-
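The three steps above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the actual redaction.py: the function names, the `dpi` value, the default `info_types` and the `output_pdf` argument are all hypothetical, and `project_id` is assumed to be the Google Cloud project ID. PyMuPDF and the DLP client are imported inside the functions that need them, so the path helpers can be used on their own.

```python
from pathlib import Path


def page_image_path(out_dir: Path, page_number: int) -> Path:
    """Image path for a 1-based page number, e.g. Page0001.png."""
    return out_dir / f"Page{page_number:04d}.png"


def redacted_image_path(out_dir: Path, page_image: Path) -> Path:
    """Path of the redacted copy inside the 'Redacted Images' folder."""
    return out_dir / "Redacted Images" / f"Redacted-{page_image.name}"


def pdf_to_images(pdf_path: Path, out_dir: Path, dpi: int = 150) -> list[Path]:
    """Step 1: render every page of the PDF to a PNG with PyMuPDF."""
    import fitz  # PyMuPDF

    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with fitz.open(str(pdf_path)) as doc:
        for number, page in enumerate(doc, start=1):
            target = page_image_path(out_dir, number)
            page.get_pixmap(dpi=dpi).save(str(target))
            paths.append(target)
    return paths


def redact_image(project_id: str, source: Path, target: Path,
                 info_types=("EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON_NAME")) -> None:
    """Step 2: send one PNG to the DLP image redaction API and save the result."""
    from google.cloud import dlp_v2

    # A real program would create the client once and reuse it for every page.
    client = dlp_v2.DlpServiceClient()
    names = [{"name": name} for name in info_types]
    response = client.redact_image(
        request={
            "parent": f"projects/{project_id}",  # assumption: project ID, not display name
            "inspect_config": {"info_types": names},
            "image_redaction_configs": [{"info_type": name} for name in names],
            "byte_item": {
                "type_": dlp_v2.ByteContentItem.BytesType.IMAGE_PNG,
                "data": source.read_bytes(),
            },
        }
    )
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(response.redacted_image)


def images_to_pdf(images: list[Path], output_pdf: Path) -> None:
    """Step 3: turn each redacted image into a one-page PDF and merge them."""
    import fitz  # PyMuPDF

    merged = fitz.open()  # new, empty PDF
    for image in images:
        with fitz.open(str(image)) as img:
            pdf_bytes = img.convert_to_pdf()
        with fitz.open("pdf", pdf_bytes) as page_pdf:
            merged.insert_pdf(page_pdf)
    merged.save(str(output_pdf))
    merged.close()
```

Calling `pdf_to_images`, then `redact_image` once per page image, then `images_to_pdf` on the redacted images reproduces the folder layout listed above. The name of the final PDF is left to the caller through `output_pdf`.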
Requirements -
- PyMuPDF
- google.cloud.dlp
- Project Credentials (you have to download the JSON key file from the Google Cloud Console)
- Project Name (name of the project in your Google Cloud Console)
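One way the last two requirements could be wired up is sketched below. It relies on the fact that the Google Cloud client libraries pick up a service account key from the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. The helper names `configure_credentials` and `dlp_parent` are hypothetical, and the sketch assumes the value passed to DLP is the project ID shown in the console.

```python
import os
from pathlib import Path


def configure_credentials(key_file: str) -> str:
    """Point the Google client libraries at the downloaded JSON key file."""
    path = Path(key_file).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Credentials file not found: {path}")
    # Application Default Credentials read the key file from this variable.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(path)
    return str(path)


def dlp_parent(project_id: str) -> str:
    """Build the 'parent' resource name that DLP requests expect."""
    project_id = project_id.strip()
    if not project_id:
        raise ValueError("A Google Cloud project ID is required")
    return f"projects/{project_id}"
```

Both helpers would run once at startup, before any DLP client is created.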
Installation -
- PyMuPDF
pip install PyMuPDF
- Google Cloud DLP
pip install google-cloud-dlp
Executing the Program -
python redaction.py <PDF_FILE_PATH>
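For illustration, the command line above could be parsed along these lines. This is only a sketch: `parse_args` and `output_dir_for` are hypothetical names, and it assumes that <PDF_FILENAME> in the output paths means the file name without its extension.

```python
import argparse
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the single positional argument: the path of the PDF to redact."""
    parser = argparse.ArgumentParser(
        prog="redaction.py",
        description="Redact a PDF page by page using the Google Cloud DLP API.",
    )
    parser.add_argument("pdf_file_path", type=Path, help="path to the PDF file")
    return parser.parse_args(argv)


def output_dir_for(pdf_path: Path, base: Path = Path("Output")) -> Path:
    """Folder for this PDF's images, e.g. Output/report for report.pdf."""
    # Assumption: the folder is named after the file name without extension.
    return base / pdf_path.stem
```

With no arguments, `parse_args()` reads `sys.argv`, so `python redaction.py report.pdf` would yield `Output/report` as the working folder.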
That's It. Enjoy !!!