We have requirement to find if there is any blank/empty pages in a PDF files. Actually there are 4 million PDF files which needs to be validated for above condition and also there will be 10k-12k pages in a PDF. Hence need a script to automate this work.
12.7k 11 11 gold badges 41 41 silver badges 49 49 bronze badges asked Jul 27, 2015 at 6:41 1 2 2 silver badges 4 4 bronze badgesWhat OS / environment / script language? This question is too broad. Also, what things did you try yourself?
Commented Jul 27, 2015 at 6:44we will have to validate PDF files in Windows OS. I have not tried any scripting language yet. I am still trying to find best way to do my requirement.
Commented Jul 27, 2015 at 6:48It's a broad requirement but I would convert the pages with ghostscrip, or any other rasterizer, to an image and than would check if the page had any pixel on. To check the image use powershell by importing the System.Drawing assembly.
Commented Jul 27, 2015 at 8:32There is also software that already does this. Search for preflight software, you'll find applications such as callas pdfToolbox (I'm affiliated with this tool) and PitStop. Checking for empty pages is a common thing in graphic arts (you want to avoid people sending in empty PDF files when they're supplying advertisements for example).
Commented Jul 27, 2015 at 11:12 Thank you all for your suggestion. But I need a scripting language to do the same Commented Jul 28, 2015 at 5:34You could check the size of each page. It is the easiest solution I have found so far:
from reportlab.pdfgen.canvas import Canvas import os from PyPDF2 import PdfFileWriter, PdfFileReader, PdfFileMerger output = PdfFileWriter() tempoutput = PdfFileWriter() input1 = PdfFileReader(open("document4.pdf", "rb")) print ("document4.pdf has %d pages." % input1.getNumPages()) numPages1=input1.getNumPages() def getSize(filename): st = os.stat(filename) return st.st_size for i in range(numPages1): canv1 = Canvas("paginatemporal.pdf") canv1.showPage() canv1.save() archivotemp=open("paginatemporal.pdf", "rb") temporal = PdfFileReader(archivotemp) page=input1.getPage(i) page.mergePage(temporal.getPage(0)) tempoutput.addPage(page) outputStreamTemp = open("paginasize.pdf", "wb") tempoutput.write(outputStreamTemp) page=input1.getPage(i) pdfsize1= getSize("paginasize.pdf") if pdfsize1=60000: print("Page number "+ str(i+1) + " is not blank.") print(pdfsize1) archivotemp.close() outputStreamTemp.close() os.remove("paginatemporal.pdf") os.remove("paginasize.pdf") tempoutput = PdfFileWriter()
I was just trying somethings out, so it is not finished, I needed to find each page because I have to add the label :"No text" on blank pages and put sequential page numbers on every page of each topic (which can have multiple files). That is the reason I am using canvas and page merge.
I used too many temp files, but will clean the code soon.
Hope this helps you. It is in Python 3. The number 60,000 is the size I put because all blank pages on my files were less than 50,000 and all pages that had information were over 100,000, but it could change if your files are different. Try it with some of them and adjust the number to your needs.