User:SnpoSuwan/Extracting images from PDFs

From sona pona, the Toki Pona wiki

This page is a how-to guide on extracting images from PDFs, especially from large publications such as lipu tenpo. Thank you very much to jan Pensa and his programmer friend Robin for teaching me how to do all this.


Extracting a single PDF


To use pdfimages, open the command line and type the following commands, replacing original_pdf and image with the desired file paths.

cd C:\Program Files\Xpdf\bin64
pdfimages.exe -j original_pdf.pdf image

It will output several images, with the colour and alpha channels saved as separate files.
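
The same call can also be scripted. Below is a minimal Python sketch, assuming Xpdf is installed at the path used above. The helper name pdfimages_command is hypothetical and not part of Xpdf. It only builds the argument list, so the command can be checked before it is run.

```python
import subprocess

# Assumed install location, copied from the commands above; adjust as needed.
PDFIMAGES = r"C:\Program Files\Xpdf\bin64\pdfimages.exe"

def pdfimages_command(pdf_path, image_root, exe=PDFIMAGES):
    """Build the argument list for the pdfimages call shown above (hypothetical helper)."""
    # -j is the same flag as in the command above.
    return [exe, "-j", pdf_path, image_root]

if __name__ == "__main__":
    cmd = pdfimages_command("original_pdf.pdf", "image")
    print(cmd)
    # Uncomment to actually run pdfimages (requires Xpdf to be installed):
    # subprocess.run(cmd, check=True)
```

Because the arguments are passed as a list, there is no need to cd into the Xpdf directory first.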


To use Inkscape, open the desired PDF and deselect "Incorporate images" on the panel. Inkscape will then extract all the images, split into their colour and alpha channels.


To merge each colour image with its alpha mask using ImageMagick, run the command below for every image, replacing image, mask and output with the desired names. The quotation marks are important in this case.

magick "image" "mask" -alpha off -compose copy-opacity -composite output.png
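
The merge step can be wrapped in Python as well. This is a sketch only, assuming the magick executable is on the PATH. The helper name composite_command is hypothetical. Since each path is passed as its own list element, names containing spaces need no extra quoting.

```python
import subprocess

def composite_command(image, mask, output, magick="magick"):
    """Build the ImageMagick call shown above (hypothetical helper).

    image  -- file holding the colour channels
    mask   -- file holding the alpha channel mask
    output -- file to write the merged image to
    """
    return [
        magick,
        image, mask,
        "-alpha", "off",
        "-compose", "copy-opacity",
        "-composite",
        output,
    ]

if __name__ == "__main__":
    cmd = composite_command("image.png", "mask.png", "output.png")
    print(cmd)
    # Uncomment to actually run ImageMagick (must be installed):
    # subprocess.run(cmd, check=True)
```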


For extracting images from multiple PDFs, thank you dearly to jan Kita for writing this script, which, when given multiple PDFs, outputs the images into a separate directory for each PDF. If a PDF yields an odd number of images, images may be joined up with the wrong alpha channel mask. For that reason, the script detects this and asks the user to manually remove the problematic images from the directory.

# Credit to jan Kita / .hecko (CC0) 2024

from glob import glob
from os import path
import re
import subprocess
import os

# Only edit the following lines.
in_dir = r"C:/.../INPUT_DIRECTORY"
out_dir = r"C:/.../OUTPUT_DIRECTORY"

def natural_sort(l):
    # CREDIT:
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
    return sorted(l, key=alphanum_key)

# Creates output directory if it does not exist.
os.makedirs(out_dir, exist_ok = True)

# Splits the PDFs into images.
# pdfimages splits these into the colour and alpha channels.
for pdf_path in glob(path.join(in_dir, "*.pdf")):
    path_name = path.splitext(pdf_path)[0]

    os.makedirs(path_name, exist_ok = True)
    # Runs pdfimages on the PDF, writing files prefixed "img" into its directory.
    subprocess.run([
        "C:/Program Files/Xpdf/bin64/pdfimages.exe",
        "-png", pdf_path,
        path.splitext(pdf_path)[0] + "/img"
    ])

    # The number of files in the directory must be divisible by 2.
    # Otherwise, there is an imbalance in the number of masks and images
    # which causes problems when applied blindly.
    if len(os.listdir(path_name)) % 2 != 0:
        input(f"There is an uneven number of files in the folder {path_name}.\n"
                "Please remove problematic image(s) and press Enter when ready.")

# Creates subdirectories in the output directory for each PDF inputted.
subdirs = glob(path.join(in_dir, "**"))
subdirs = [i for i in subdirs if path.isdir(i)]
for d in subdirs:
    subdir = path.split(d)[1]
    print(f"Processing {subdir}...")
    os.makedirs(path.join(out_dir, subdir), exist_ok = True)
    files = natural_sort(glob(path.join(in_dir, subdir, "*.png")))

    # For each pair of image files, the former serves as the image's colour channel
    # and the latter as the alpha channel mask.
    for num, i in enumerate(zip(files[::2], files[1::2])):
        print(f"Files processed: {num + 1}/{len(files) // 2}")
        # Same ImageMagick call as the single-image command above.
        subprocess.run([
            "magick",
            i[0], i[1],
            "-alpha", "off",
            "-compose", "copy-opacity",
            "-composite",
            path.join(out_dir, subdir, path.basename(i[0]))
        ])

    print(f"Finished processing {subdir}...")
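
To see how the pairing works in isolation, here is a small self-contained sketch of the natural sort and the image/mask pairing used in the script. The file names are made up for illustration, and pair_images_with_masks is a hypothetical helper that raises an error instead of prompting the user.

```python
import re

def natural_sort(names):
    """Sort names so that embedded numbers compare numerically."""
    def key(name):
        return [int(part) if part.isdigit() else part.lower()
                for part in re.split(r"([0-9]+)", name)]
    return sorted(names, key=key)

def pair_images_with_masks(names):
    """Pair each colour image with the mask that follows it (hypothetical helper)."""
    ordered = natural_sort(names)
    # An odd count means at least one image has no matching mask.
    if len(ordered) % 2 != 0:
        raise ValueError("odd number of files; images and masks cannot be paired")
    return list(zip(ordered[::2], ordered[1::2]))

# Made-up file names; a plain sorted() would place img-10 before img-2.
names = ["img-10.png", "img-2.png", "img-1.png", "img-9.png"]
print(pair_images_with_masks(names))
# → [('img-1.png', 'img-2.png'), ('img-9.png', 'img-10.png')]
```

The pairing relies entirely on file order, which is why a single stray image shifts every later pair and has to be removed by hand.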


For extracting SVGs, see Wikipedia:Graphics Lab/Resources/PDF conversion to SVG for an explanation of how to do so.