> Some Scratch Notes on extracting PDF with poppler

 
post Jul 1 2022, 05:48
Post #1
Lady_Slayer



Member of the Bal'masqué
*********
Group: Catgirl Camarilla
Posts: 5,629
Joined: 20-December 16
Level 500 (Ponyslayer)


My VPS for uploading runs Ubuntu 20.04, so these notes are based on Ubuntu. I don't upload from home because my home upload bandwidth isn't very high.

Poppler is a good tool for converting PDFs into images without rescaling them. It is based on xpdf-3.0 and provides a PDF rendering library.

For installation instructions, see these pages:

[installati.one]
How To Install Popplers

[askubuntu.com]
Install Poppler on Ubuntu


Special thanks to @genl, who gave us the inspiration for dealing with PDF tankoubons.

This post has been edited by dongmian: Sep 28 2022, 04:54

 
post Jul 1 2022, 05:54
Post #2
Lady_Slayer





QUOTE
Installation setup


Download and unpack the source package. Here I use 22.06.0, which is the latest version at the time of writing.

CODE

wget https://poppler.freedesktop.org/poppler-22.06.0.tar.xz
tar -xvf poppler-22.06.0.tar.xz


Before you install, make sure the apt package index is up to date:

CODE

sudo apt update


Then make sure the following dependencies are installed:

CODE

sudo apt-get install libnss3 libnss3-dev
sudo apt-get install libcairo2-dev libjpeg-dev libgif-dev
sudo apt-get install cmake libblkid-dev e2fslibs-dev libboost-all-dev libaudit-dev
sudo apt install libopenjp2-7-dev -y


Then use the following commands to configure, build, and install Poppler:

CODE

cd poppler-22.06.0/
mkdir build
cd build/
cmake  -DCMAKE_BUILD_TYPE=Release   \
       -DCMAKE_INSTALL_PREFIX=/usr  \
       -DTESTDATADIR=$PWD/testfiles \
       -DENABLE_UNSTABLE_API_ABI_HEADERS=ON \
       ..
make
sudo make install
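
To confirm which Poppler build the shell actually picks up after installing, a minimal sketch like the following may help. It assumes each tool prints a banner of the form "<tool> version X.Y.Z" when run with -v (the form pdftoppm shows later in this thread). The name poppler_tool_version is a hypothetical helper, not part of Poppler.

```shell
# Hypothetical helper: print the version a Poppler tool reports.
# Assumes the "-v" banner contains "<tool> version X.Y.Z".
poppler_tool_version() {
    # The banner may go to stderr, so merge it into stdout before parsing.
    "$1" -v 2>&1 | sed -n 's/.* version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Example: poppler_tool_version pdftocairo
```

If this prints the distribution's version rather than 22.06.0, an older copy of the tool is probably earlier on PATH.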


This post has been edited by dongmian: Jul 1 2022, 06:12

 
post Jul 1 2022, 06:04
Post #3
Lady_Slayer






Here are the commonly used commands.

For example, suppose you want to extract a PDF file called RJ.pdf located in the directory /home/user/RJ.

To convert it into JPEG files in the same directory, the command is:
CODE

cd /home/user/RJ
sudo pdftocairo -jpeg RJ.pdf


P.S. The optional [<output-file>] argument is simply the file name prefix you want to use. The output files are named after it, for example <output-file>-001.jpg, <output-file>-002.jpg, and so on.

CODE

Usage: pdftocairo [options] <PDF-file> [<output-file>]
  -png                     : generate a PNG file
  -jpeg                    : generate a JPEG file
  -jpegopt <string>        : jpeg options, with format <opt1>=<val1>[,<optN>=<valN>]*
  -tiff                    : generate a TIFF file
  -tiffcompression <string>: set TIFF compression: none, packbits, jpeg, lzw, deflate
  -ps                      : generate PostScript file
  -eps                     : generate Encapsulated PostScript (EPS)
  -pdf                     : generate a PDF file
  -svg                     : generate a Scalable Vector Graphics (SVG) file
  -f <int>                 : first page to print
  -l <int>                 : last page to print
  -o                       : print only odd pages
  -e                       : print only even pages
  -singlefile              : write only the first page and do not add digits
  -r <fp>                  : resolution, in PPI (default is 150)
  -rx <fp>                 : X resolution, in PPI (default is 150)
  -ry <fp>                 : Y resolution, in PPI (default is 150)
  -scale-to <int>          : scales each page to fit within scale-to*scale-to pixel box
  -scale-to-x <int>        : scales each page horizontally to fit in scale-to-x pixels
  -scale-to-y <int>        : scales each page vertically to fit in scale-to-y pixels
  -x <int>                 : x-coordinate of the crop area top left corner
  -y <int>                 : y-coordinate of the crop area top left corner
  -W <int>                 : width of crop area in pixels (default is 0)
  -H <int>                 : height of crop area in pixels (default is 0)
  -sz <int>                : size of crop square in pixels (sets W and H)
  -cropbox                 : use the crop box rather than media box
  -mono                    : generate a monochrome image file (PNG, JPEG)
  -gray                    : generate a grayscale image file (PNG, JPEG)
  -transp                  : use a transparent background instead of white (PNG)
  -antialias <string>      : set cairo antialias option
  -icc <string>            : ICC color profile to use
  -level2                  : generate Level 2 PostScript (PS, EPS)
  -level3                  : generate Level 3 PostScript (PS, EPS)
  -origpagesizes           : conserve original page sizes (PS, PDF, SVG)
  -paper <string>          : paper size (letter, legal, A4, A3, match)
  -paperw <int>            : paper width, in points
  -paperh <int>            : paper height, in points
  -nocrop                  : don't crop pages to CropBox
  -expand                  : expand pages smaller than the paper size
  -noshrink                : don't shrink pages larger than the paper size
  -nocenter                : don't center pages smaller than the paper size
  -duplex                  : enable duplex printing
  -opw <string>            : owner password (for encrypted files)
  -upw <string>            : user password (for encrypted files)
  -q                       : don't print any messages or errors
  -v                       : print copyright and version info
  -h                       : print usage information
  -help                    : print usage information
  --help                   : print usage information
  -?                       : print usage information
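
As a sketch of how the options above combine, the wrapper below renders only a page range at a chosen resolution. It uses only flags from the usage listing (-jpeg, -r, -f, -l); render_jpeg_range and the file names in the example are made up for illustration.

```shell
# Hypothetical wrapper around pdftocairo: render a page range to JPEG.
# Arguments: PDF file, output prefix, first page, last page, resolution in PPI.
render_jpeg_range() {
    pdf=$1 prefix=$2 first=$3 last=$4 ppi=${5:-150}
    pdftocairo -jpeg -r "$ppi" -f "$first" -l "$last" "$pdf" "$prefix"
}

# Example: pages 1-10 of RJ.pdf at 300 PPI, written with the prefix "page"
# render_jpeg_range RJ.pdf page 1 10 300
```

Leaving out the last argument falls back to 150 PPI, which matches the documented default of -r.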

 
post Jul 1 2022, 06:26
Post #4
Lady_Slayer





Using pdfimages

For example (the last argument is the image root, which the usage below lists as required):
CODE

sudo pdfimages -j rj.pdf rj


CODE

Usage: pdfimages [options] <PDF-file> <image-root>
  -f <int>       : first page to convert
  -l <int>       : last page to convert
  -png           : change the default output format to PNG
  -tiff          : change the default output format to TIFF
  -j             : write JPEG images as JPEG files
  -jp2           : write JPEG2000 images as JP2 files
  -jbig2         : write JBIG2 images as JBIG2 files
  -ccitt         : write CCITT images as CCITT files
  -all           : equivalent to -png -tiff -j -jp2 -jbig2 -ccitt
  -list          : print list of images instead of saving
  -opw <string>  : owner password (for encrypted files)
  -upw <string>  : user password (for encrypted files)
  -p             : include page numbers in output file names
  -q             : don't print any messages or errors
  -v             : print copyright and version info
  -h             : print usage information
  -help          : print usage information
  --help         : print usage information
  -?             : print usage information
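
A possible workflow with these options, sketched under the assumption that you want to inspect the embedded images before writing anything: list first, then extract. It uses only -list, -all and -p from the usage above; extract_native_images is a hypothetical name.

```shell
# Hypothetical wrapper: list embedded images, then extract them unchanged.
extract_native_images() {
    pdf=$1 root=$2
    # -list prints a table of the embedded images and writes no files.
    pdfimages -list "$pdf" || return 1
    # -all is equivalent to -png -tiff -j -jp2 -jbig2 -ccitt;
    # -p adds the page number to each output file name.
    pdfimages -all -p "$pdf" "$root"
}

# Example: extract_native_images rj.pdf rj
```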


This post has been edited by dongmian: Jul 1 2022, 08:34

 
post Jul 1 2022, 06:28
Post #5
Lady_Slayer





Using the pdftoppm command:

pdftoppm version 0.86.1
Copyright 2005-2020 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
CODE

Usage: pdftoppm [options] [PDF-file [PPM-file-prefix]]
  -f <int>                                 : first page to print
  -l <int>                                 : last page to print
  -o                                       : print only odd pages
  -e                                       : print only even pages
  -singlefile                              : write only the first page and do not add digits
  -scale-dimension-before-rotation         : for rotated pdf, resize dimensions before the rotation
  -r <fp>                                  : resolution, in DPI (default is 150)
  -rx <fp>                                 : X resolution, in DPI (default is 150)
  -ry <fp>                                 : Y resolution, in DPI (default is 150)
  -scale-to <int>                          : scales each page to fit within scale-to*scale-to pixel box
  -scale-to-x <int>                        : scales each page horizontally to fit in scale-to-x pixels
  -scale-to-y <int>                        : scales each page vertically to fit in scale-to-y pixels
  -x <int>                                 : x-coordinate of the crop area top left corner
  -y <int>                                 : y-coordinate of the crop area top left corner
  -W <int>                                 : width of crop area in pixels (default is 0)
  -H <int>                                 : height of crop area in pixels (default is 0)
  -sz <int>                                : size of crop square in pixels (sets W and H)
  -cropbox                                 : use the crop box rather than media box
  -hide-annotations                        : do not show annotations
  -mono                                    : generate a monochrome PBM file
  -gray                                    : generate a grayscale PGM file
  -sep <string>                            : single character separator between name and page number, default -
  -forcenum                                : force page number even if there is only one page
  -png                                     : generate a PNG file
  -jpeg                                    : generate a JPEG file
  -jpegcmyk                                : generate a CMYK JPEG file
  -jpegopt <string>                        : jpeg options, with format <opt1>=<val1>[,<optN>=<valN>]*
  -overprint                               : enable overprint
  -tiff                                    : generate a TIFF file
  -tiffcompression <string>                : set TIFF compression: none, packbits, jpeg, lzw, deflate
  -freetype <string>                       : enable FreeType font rasterizer: yes, no
  -thinlinemode <string>                   : set thin line mode: none, solid, shape. Default: none
  -aa <string>                             : enable font anti-aliasing: yes, no
  -aaVector <string>                       : enable vector anti-aliasing: yes, no
  -opw <string>                            : owner password (for encrypted files)
  -upw <string>                            : user password (for encrypted files)
  -q                                       : don't print any messages or errors
  -v                                       : print copyright and version info
  -h                                       : print usage information
  -help                                    : print usage information
  --help                                   : print usage information
  -?                                       : print usage information

 
post Jul 1 2022, 07:09
Post #6
Moonlight Rambler



Let's dance.
*********
Group: Gold Star Club
Posts: 6,497
Joined: 22-August 12
Level 373 (Dovahkiin)


Don't you just need
CODE
pdfimages file.pdf imageprefix

?

I will admit I rarely extract manga PDF's.

This post has been edited by dragontamer8740: Jul 1 2022, 07:10

 
post Jul 1 2022, 08:38
Post #7
Lady_Slayer





QUOTE(dragontamer8740 @ Jun 30 2022, 23:09) *

Don't you just need
CODE
pdfimages file.pdf imageprefix

?

I will admit I rarely extract manga PDF's.


Sure, that works. It's also a good example of exactly what post #4 describes. And don't forget sudo, since we usually don't log in as root.

 
post Jul 1 2022, 11:58
Post #8
Scumbini



C O C K INJURED
******
Group: Gold Star Club
Posts: 913
Joined: 2-December 15
Level 461 (Dovahkiin)


Poppler is already in the Ubuntu repos (packages.ubuntu.com), so you don't need to build it. Also, if you pass "-all" to pdfimages instead of "-j", it'll extract all images in their native format instead of just JPEGs.
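
On Ubuntu the command-line tools are shipped in the poppler-utils package. A small sketch for checking whether they are already present before deciding to build anything; check_poppler_tools is a hypothetical helper name.

```shell
# Hypothetical helper: report Poppler tools that are missing from PATH and
# suggest the Ubuntu package that ships them (poppler-utils).
check_poppler_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool (try: sudo apt install poppler-utils)"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check_poppler_tools pdftocairo pdfimages pdftoppm
```

The function prints nothing and returns 0 when every listed tool is found.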

 
post Jul 1 2022, 12:23
Post #9
Moonlight Rambler





QUOTE(dongmian @ Jul 1 2022, 06:38) *

Sure, that works. It's also a good example of exactly what post #4 describes. And don't forget sudo, since we usually don't log in as root.

You don't need to be root to run pdfimages.

 
post Jul 1 2022, 15:07
Post #10
Lady_Slayer





QUOTE(Scumbini @ Jul 1 2022, 03:58) *

Poppler is already in the Ubuntu repos (packages.ubuntu.com), so you don't need to build it. Also, if you pass "-all" to pdfimages instead of "-j", it'll extract all images in their native format instead of just JPEGs.


I should say that in most cases the contents of the PDF are just JPEG pictures, so going with JPEG should be fine, but I find them somehow upscaled by 4x. This doesn't make my gallery look weird, since 3000px still fits inside the "good" range.
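
One possible reason for the size difference (an assumption, not something confirmed in this thread): pdftocairo and pdftoppm render each page at a chosen resolution, 150 PPI by default, instead of copying the embedded image, so the output size follows the page size rather than the original picture. A PDF point is 1/72 inch, so the rendered size can be estimated as below. rendered_px is a hypothetical helper, and the real tools may round differently.

```shell
# Estimate rendered pixels for a page dimension given in PDF points.
# 1 point = 1/72 inch, so pixels = points * PPI / 72 (rounded to nearest).
rendered_px() {
    points=$1 ppi=${2:-150}
    echo $(( (points * ppi + 36) / 72 ))
}

# Example: a 612 x 792 pt (US Letter) page at the default 150 PPI
# rendered_px 612 150
# rendered_px 792 150
```

With these numbers a Letter-sized page comes out at roughly 1275 x 1650 pixels, whatever the size of the picture stored inside the PDF.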

This post has been edited by dongmian: Jul 1 2022, 15:16

 
post Jul 2 2022, 06:44
Post #11
Moonlight Rambler





QUOTE(dongmian @ Jul 1 2022, 13:07) *

I should say that in most cases the contents of the PDF are just JPEG pictures, so going with JPEG should be fine, but I find them somehow upscaled by 4x. This doesn't make my gallery look weird, since 3000px still fits inside the "good" range.

Boo. Don't assume JPEG, on the off chance that it isn't. Avoid re-encoding.

Pixel count isn't the only thing that matters.

This post has been edited by dragontamer8740: Jul 2 2022, 06:44

 
post Sep 28 2022, 04:59
Post #12
Lady_Slayer





Yesterday, when I used pdfimages -all on a magazine, it just gave me .ppm files.

But pdfimages -png solved it.
pdftocairo gives images with white borders. That one really sucks.

BTW, I didn't notice this at first because Ubuntu opens PPM images directly. I only found out when I ended up with the error message saying that this site doesn't support PPM XD.

 
post Sep 28 2022, 06:13
Post #13
Moonlight Rambler





In that situation I'd just use ImageMagick's 'convert' program:
CODE
for file in *.ppm; do convert "$file" "$file".png; done
rm *.ppm


Alternatively if you want to clean up the filenames (so you don't end up with files named 'file.ppm.png'):
CODE
for file in *.ppm; do convert "$file" "$(basename "$file" .ppm)".png; done
rm *.ppm
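
A slightly more defensive variant of the same idea, as a sketch: it removes each .ppm only after that file's conversion has succeeded, so a failed convert does not cost you the original. ppm_to_png is a hypothetical function name; the convert call is the same one used above.

```shell
# Hypothetical helper: convert every .ppm in the current directory to .png
# and delete each .ppm only after its conversion succeeded.
ppm_to_png() {
    for file in *.ppm; do
        [ -e "$file" ] || continue      # no .ppm files: the glob stays literal
        if convert "$file" "${file%.ppm}.png"; then
            rm -- "$file"
        else
            echo "conversion failed, keeping $file" >&2
        fi
    done
}

# Example: cd into the directory with the extracted images, then run ppm_to_png
```

The "${file%.ppm}" expansion strips the extension in the shell itself, which does the same job as the basename call.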


This post has been edited by dragontamer8740: Sep 28 2022, 06:16

 
post Feb 5 2023, 04:42
Post #14
Lady_Slayer





dongmian is so stupid. Why Poppler? We have calibre, which handles everything properly. It is available on Windows! It does exactly the same thing as Poppler, and somehow on Ubuntu I have to use calibre first.

For non-DRM-protected PDF and EPUB files, simply load the file in calibre, click it, and select convert to ZIP. The ripped images can then be found in your calibre library path at their original quality. The cover image may appear in the folder one level up.

