User:SnpoSuwan/Extracting images from PDFs
This page is a how-to guide on extracting images from PDFs, especially from large publications such as lipu tenpo. Thank you very much to jan Pensa and his programmer friend Robin for teaching me how to do all this.
Download
Extracting a single PDF
pdfimages
Using pdfimages, open the command line and type the following commands, substituting original_pdf and image with the desired file paths.
cd C:\Program Files\Xpdf\bin64
pdfimages.exe -j original_pdf.pdf image
It will spit out several images separated into the colour and alpha channels.
Inkscape
Using Inkscape, open the desired PDF and, on the panel, deselect "Incorporate images". Inkscape will then extract all the images, split into the colour and alpha channels.
Merging
Using ImageMagick, run the command below for each image and its mask, substituting image, mask and output with the desired names. Quotation marks are important in this case.
magick "image" "mask" -alpha off -compose copy-opacity -composite output.png
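As a rough illustration of what this merge does, the sketch below combines colour pixels with a greyscale mask so that each output pixel takes its colour from the image and its opacity from the mask. It works on plain lists of tuples rather than real image files, the function name apply_mask is hypothetical, and it is only a conceptual model, not how ImageMagick is implemented.

```python
def apply_mask(colour_pixels, mask_pixels):
    """Combine RGB pixels with greyscale mask values into RGBA pixels.

    Hypothetical helper for illustration only: the mask value of each
    pixel (0 = transparent, 255 = opaque) becomes its alpha channel.
    """
    if len(colour_pixels) != len(mask_pixels):
        raise ValueError("image and mask must have the same number of pixels")
    return [(r, g, b, alpha)
            for (r, g, b), alpha in zip(colour_pixels, mask_pixels)]


# Three made-up pixels: opaque red, half-transparent green, invisible blue.
colour = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
mask = [255, 128, 0]
rgba = apply_mask(colour, mask)
print(rgba)
```

This is also why a mismatched mask gives a wrong result: the colours stay intact, but the transparency comes from a different picture.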
Script
For extracting multiple PDFs, thank you dearly to jan Kita for writing this script, which, when given multiple PDFs, outputs the images into separate directories, one per PDF. If a PDF yields an odd number of images, images may be joined with the wrong alpha channel mask. For that reason, the script detects this and asks the user to manually remove the problematic images from the directory.
# Credit to jan Kita / .hecko (CC0) 2024
from glob import glob
from os import path
import re
import subprocess
import os

# Only edit the following lines.
in_dir = r"C:/.../INPUT_DIRECTORY"
out_dir = r"C:/.../OUTPUT_DIRECTORY"


def natural_sort(l):
    # CREDIT: https://stackoverflow.com/a/4836734
    # Sorts so that numbers inside names are compared as numbers (2 before 10).
    convert = lambda text: int(text) if text.isdigit() else text.lower()
    alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
    return sorted(l, key=alphanum_key)


# Creates output directory if it does not exist.
os.makedirs(out_dir, exist_ok = True)

# Splits the PDFs into images.
# pdfimages splits these into the colour and alpha channels.
for pdf_path in glob(path.join(in_dir, "*.pdf")):
    # Each PDF gets a directory next to it, named after the PDF.
    path_name = path.splitext(pdf_path)[0]
    os.makedirs(path_name, exist_ok = True)
    subprocess.run([
        "C:/Program Files/Xpdf/bin64/pdfimages.exe",
        "-png", pdf_path,
        path.splitext(pdf_path)[0] + "/img"
    ])
    # The number of files in the directory must be divisible by 2.
    # Otherwise, there is an imbalance in the number of masks and images
    # which causes problems when applied blindly.
    if len(os.listdir(path_name)) % 2 != 0:
        input(f"There is an uneven number of files in the folder {path_name}.\n"
              "Please remove problematic image(s) and press Enter when ready.")

# Creates subdirectories in the output directory for each PDF inputted.
subdirs = glob(path.join(in_dir, "**"))
subdirs = [i for i in subdirs if path.isdir(i)]
for d in subdirs:
    subdir = path.split(d)[1]
    print(f"Processing {subdir}...")
    os.makedirs(path.join(out_dir, subdir), exist_ok = True)
    files = natural_sort(glob(path.join(in_dir, subdir, "*.png")))
    # For each pair of image files, the former serves as the image's colour channel
    # and the latter as the alpha channel mask.
    for num, i in enumerate(zip(files[::2], files[1::2])):
        print(f"Files processed: {num + 1}/{len(files) // 2}")
        subprocess.run([
            "magick",
            i[0], i[1],
            "-alpha", "off",
            "-compose", "copy-opacity",
            "-composite",
            path.join(out_dir, subdir, path.basename(i[0]))
        ])
    print(f"Finished processing {subdir}...")
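The pairing step and the odd-count check can be tried on their own, without running pdfimages or ImageMagick. The sketch below is a minimal stand-alone version that works on a list of file names. The helper names natural_key and pair_images and the example file names are made up for illustration, and the names pdfimages actually writes may differ. Note that an even count only makes pairing possible; it does not prove that every image sits next to its own mask.

```python
import re


def natural_key(name):
    # Split into digit and non-digit runs so "img-10" sorts after "img-9".
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"([0-9]+)", name)]


def pair_images(filenames):
    """Return (colour, mask) pairs, or raise if the count is odd.

    Hypothetical helper: assumes each colour image is directly
    followed by its mask once the names are in natural order.
    """
    ordered = sorted(filenames, key=natural_key)
    if len(ordered) % 2 != 0:
        raise ValueError(f"odd number of files ({len(ordered)}); cannot pair")
    return list(zip(ordered[::2], ordered[1::2]))


# Made-up file names, deliberately out of order.
names = ["img-10.png", "img-2.png", "img-1.png", "img-9.png"]
pairs = pair_images(names)
print(pairs)
```

Unlike the script, which pauses and waits for the user to tidy the folder, this sketch raises an error, which is easier to handle when the function is called from other code.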
SVGs
For extracting SVGs, see Wikipedia:Graphics Lab/Resources/PDF conversion to SVG for an explanation of how to do so.