Related: Text extraction from PDF.

2023-04-07

  • It’s been five years since I last put this together, so I asked GPT4 for recommendations.
  • GPT4’s answer to cut-and-scan PDF: PyPDF2.
    • This will be KeyError: '/XObject' for some PDFs.
    • It would be an error if the image was not embedded, as in the 2017 summary below.
  • GPT4 answers about General PDF: pdf2image.
    • $ pip install pdf2image
    • It uses pdftocairo internally, pdftocairo is a tool of [poppler
    • $ brew install poppler python
    from pdf2image import convert_from_path
    images = convert_from_path(pdf_path)
    for i, image in enumerate(tqdm(images)):
        image.save(os.path.join(output_dir, f"{i+1}.png"), "PNG")
- Example of output
    - ![image](https://gyazo.com/e475b087b9ff628e9d78baa62758e604/thumb/1000)
    - 72dpi by default

2017-08-23

  • summary

    • gs has terrible image quality, ImageMagick(convert) also uses gs internally. pdftoppm is better, but pdftocairo is the best.
      • pdftocairo -r 200 -f 0 -png mybook.pdf prefix
    • As for the cut scan PDF, the cleanest way is to extract the -r 300 equivalent image embedded in pdfimages.
      • The problem is that we need to decide if it is that kind of PDF or not.
    • The quality of pdftocairo is almost the same as pdfimages. But 48 times slower.
  • gs

    • $ time gs -q -dBATCH -dNOPAUSE -sDEVICE=png16m -r100 -sOutputFile=pages_%04d.png mybook.pdf
    • gs -q -dBATCH -dNOPAUSE -sDEVICE=png16m -r100 -sOutputFile=pages_%04d.png 219.36s user 5.22s system 95% cpu 3:54.68 total
    • 552x823
    • imageimage
    • Terrible jaggies.
  • pdftoppm

    • $ time pdftoppm -r 100 -png mybook.pdf mybook
    • pdftoppm -r 100 -png mybook.pdf mybook 464.95s user 6.77s system 96% cpu 8:07.62 total
    • 552x823
    • imageimage
    • much better
  • pdftoppm Output at twice the resolution

    • $ time pdftoppm -r 200 -png mybook.pdf mybook
    • pdftoppm -r 200 -png mybook.pdf mybook 1104.28s user 12.22s system 96% cpu 19:14.59 total
    • 1104x1646
    • imageimage
    • Has the atmosphere of the title changed a lot? (Thinner? Sharper impression)
    • After doubling the size, the thumbnail above is reduced to twice the size. Does the text become thinner in the process?
  • So what would happen if you put it out at twice the resolution on the gs?

    • $ time gs -q -dBATCH -dNOPAUSE -sDEVICE=png16m -r200 -sOutputFile=pages_%04d.png mybook.pdf
    • gs -q -dBATCH -dNOPAUSE -sDEVICE=png16m -r200 -sOutputFile=pages_%04d.png 619.65s user 13.83s system 93% cpu 11:14.36 total
    • imageimage
    • dirty
    • Mincho font “Hen” horizontal strokes, etc. are missing.
    • Enlargement (left: gs, right: pdftoppm)
    • image
    • gs seems to sample only once per pixel. pdftoppm looks like a behavior of sampling and mixing several times.
  • ImageMagick(convert)

    • $ time convert -verbose -density 200 mybook.pdf pages_%04d.png
    • "gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r200x200" "-sOutputFile=/tmp/magick-le3ab9PT-%08d" "-f/tmp/magick-s7CciX7v" "-f/tmp/magick-6huEpaq8"
    • imageimage
    • As you can see from the command line output, it is hitting gs internally. The image quality is also similar to gs.
  • pdfimages

    • $ time pdfimages -j mybook.pdf ./pages
    • pdfimages -j mybook.pdf ./pages 6.14s user 2.51s system 59% cpu 14.569 total
    • Extract image files in the target PDF
    • Since the experimental PDF is a cut scan of a paper book, the scan result is contained in an image file.
    • image
    • 1656x2469
    • The -r option for gs and pdftoppm was 552x823 when the -r option was 100, so -r 300 equivalent.
    • Try to reduce it to the same resolution
      • $ convert -thumbnail 1104x1646 ex7/pages-002.jpg t.png
      • convert -thumbnail 1104x1646 ex7/pages-002.jpg t.png 3.86s user 0.09s system 98% cpu 4.002 total
    • imageimage
    • Pretty, but choose PDF origin
  • pdftocairo

    • $ time pdftocairo -r 200 -f 0 -l 10 -png mybook.pdf pages
    • pdftocairo -r 200 -f 0 -l 10 -png mybook.pdf pages 27.02s user 0.29s system 79% cpu 34.260 total
    • imageimage
    • Enlarge (top left gs, top right pdftoppm, bottom left pdfimages and shrink, bottom right pdftocairo)
      • image
      • Output almost equivalent to reduced from pdfimages
      • Not equal, but close enough that you can see the difference when you cut out one line and look at them closely side by side.
    • It took 34 seconds for 10 pages, which is relatively slow (19 seconds to produce 270 pages with pdftoppm, so roughly 48 times slower).

2017-08-23

This page is auto-translated from /nishio/PDFからPNGへの変換 using DeepL. If you looks something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I’m very happy to spread my thought to non-Japanese readers.