Better Convert pdf to jpg using ghost script

Problem

While working on Wigoo, I needed to split an uploaded .pdf document into multiple .jpg/.png images.

I didn’t try to solve it in pure PHP, for a few reasons:

  • The system has to handle huge .pdf files, larger than 50MB or longer than 50 pages. Converting such a file (.pdf to multiple .jpg) takes a long time, and it is not nice to show a blank screen to the user/visitor while it runs. Moreover, the client’s browser will close the connection if it doesn’t receive a response in time. Therefore, the conversion has to be started as a background process.
  • Reliable applications for this already exist, so I don’t need to reinvent the wheel. Using them makes coding easier and faster.
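
A minimal sketch of starting a long-running command in the background from PHP, assuming a Unix-like shell. The helper name `backgroundCommand` and the worker script `convert-worker.php` are hypothetical and only illustrate the idea; they are not part of the original setup.

```php
<?php
// Hypothetical helper: wraps a shell command so that it runs detached.
// Redirecting output and appending "&" lets exec() return immediately
// instead of waiting for the command to finish.
function backgroundCommand(string $command, string $logFile = '/dev/null'): string
{
    return sprintf('nohup %s > %s 2>&1 &', $command, escapeshellarg($logFile));
}

// Usage (convert-worker.php is an assumed script that performs the conversion):
// exec(backgroundCommand('php convert-worker.php ' . escapeshellarg('book.pdf')));
```

The request that handles the upload can then respond right away, while the conversion continues on the server.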


In order to convert .pdf to .jpg, ImageMagick has to be installed on the system. The installation process is not the subject of this post. After installing ImageMagick, run a first conversion test:


[user@ve pdf-test] convert book.pdf book.jpg

After the first test with a real .pdf (87MB / 237 pages), the server crashed.

[Figure: Server overload]

The test was performed on a Media Temple VPS, so memory usage was able to burst above 400%.

Evidently, convert decompresses the whole .pdf into memory.

Solution

The best solution is to convert page by page. Converting a .pdf page by page is possible with Ghostscript:


[user@ve pdf-test] gs -dNOPAUSE -sDEVICE=jpeg -dFirstPage=1 -dLastPage=237 -sOutputFile=image%d.jpg -dJPEGQ=100 -r300x300 -q book.pdf -c quit

Average execution time is slightly more than 500 seconds.
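
The command above can also be assembled from PHP. The following is a minimal sketch that uses only the Ghostscript flags already shown; the function name `gsPageRangeCommand` and its parameters are made up for illustration, and a Unix-like shell is assumed for the quoting.

```php
<?php
// Hypothetical helper: builds the Ghostscript command for one page range.
// escapeshellarg() protects file names that contain spaces or shell characters.
function gsPageRangeCommand(string $pdf, int $firstPage, int $lastPage, string $outputPattern): string
{
    return sprintf(
        'gs -dNOPAUSE -sDEVICE=jpeg -dFirstPage=%d -dLastPage=%d -sOutputFile=%s -dJPEGQ=100 -r300x300 -q %s -c quit',
        $firstPage,
        $lastPage,
        escapeshellarg($outputPattern), // e.g. image%d.jpg, one file per page
        escapeshellarg($pdf)
    );
}

// Usage: convert only pages 1 to 5 of book.pdf
// system(gsPageRangeCommand('book.pdf', 1, 5, 'image%d.jpg'), $exitCode);
```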

[Figure: CPU usage. Resources used while a single process is running]

As we can see, only one processor was fully used. It is wasteful to spend so much time on the conversion while the rest of the hardware sits idle.

Execution time can be decreased if we start multiple processes.
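
When several processes run at once, each one needs its own page range. A small sketch of that split follows; the function name `splitPageRanges` is hypothetical. It rounds the pages per worker up and stops as soon as all pages are assigned, so it can return fewer ranges than workers.

```php
<?php
// Hypothetical helper: divides pages 1..$totalPages into consecutive
// [firstPage, lastPage] ranges, at most $workers of them.
function splitPageRanges(int $totalPages, int $workers): array
{
    if ($totalPages < 1 || $workers < 1) {
        return [];
    }

    // Round up so that all pages are covered
    $pagesPerWorker = (int) ceil($totalPages / $workers);

    $ranges = [];
    for ($start = 1; $start <= $totalPages; $start += $pagesPerWorker) {
        // The last range may be shorter than the others
        $ranges[] = [$start, min($start + $pagesPerWorker - 1, $totalPages)];
    }

    return $ranges;
}

// Usage: 237 pages over 16 workers gives 15 pages per worker;
// the first range is [1, 15] and the last is [226, 237].
$ranges = splitPageRanges(237, 16);
```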

By default, PHP has no built-in support for multithreading, but Tudor Barbu has developed a multithreading class for PHP.

Execution time is measured using David Walsh’s timer class.

Here is the script:


<?php
// Define number of threads
define('THREADS', 16);

// Define total PDF pages
$totalPages = 237;


include('timer.php');
require_once('thread.php');

if( ! Thread::available() ) {
	die( 'Threads not supported' );
}

function processImage( $params ) {
	// execute ghost script conversion
	system('gs -dNOPAUSE -sDEVICE=jpeg -dFirstPage='.$params['startPage'].' -dLastPage='.$params['endPage'].' -sOutputFile=image'.$params['thread'].'-%d.jpg -dJPEGQ=100 -r300x300 -q book.pdf -c quit', $result);

}

// Calculate number of pages per thread
$pages = ceil($totalPages / THREADS);


$threads = array();
$index = 0;


$timer = new timer(1);


for ($i=1; $i<=THREADS; $i++) {
	// Calculate start page for each thread
	$startPage = $i*$pages - ($pages-1);
	// Calculate end page for each thread
	$endPage = $i*$pages;

	// Rounding up can leave no pages for the last threads; stop in that case
	if ($startPage>$totalPages) {
		break;
	}

	// The last thread gets whatever pages remain
	if ($endPage>$totalPages) {
		$endPage = $totalPages;
	}

	$params = array(
		'startPage' => $startPage,
		'endPage' => $endPage,
		'thread' => $index,
	);

	$threads[$index] = new Thread('processImage');
	$threads[$index]->start($params);
	++$index;
}

while( !empty( $threads ) ) {
	foreach( $threads as $index => $thread ) {
		if( ! $thread->isAlive() ) {
			unset( $threads[$index] );
		}
	}
	// let the CPU do its work
	sleep( 1 );
}


echo $timer->get();

[Figure: Multithreading. Resources used while multithreading]

Average execution time dropped to 60 seconds.
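
The script above depends on a third-party thread class. As an alternative, not the approach measured here, the same idea can be sketched with PHP's built-in `proc_open()`: start every command first, then wait for all of them. The function name `runInParallel` is hypothetical, and the exit codes assume a Unix-like system.

```php
<?php
// Hypothetical helper: starts all commands at once, then waits for each.
// Returns the exit code of every command, keyed like the input array.
function runInParallel(array $commands): array
{
    $processes = [];
    foreach ($commands as $key => $command) {
        $pipes = [];
        // An empty descriptor spec lets the child inherit our stdin/stdout/stderr
        $process = proc_open($command, [], $pipes);
        if ($process === false) {
            throw new RuntimeException("Could not start: $command");
        }
        $processes[$key] = $process;
    }

    $exitCodes = [];
    foreach ($processes as $key => $process) {
        // proc_close() blocks until that process has finished
        $exitCodes[$key] = proc_close($process);
    }

    return $exitCodes;
}

// Usage (page ranges and output names are illustrative):
// $codes = runInParallel([
//     'gs -dNOPAUSE -sDEVICE=jpeg -dFirstPage=1 -dLastPage=15 -sOutputFile=image0-%d.jpg -dJPEGQ=100 -r300x300 -q book.pdf -c quit',
//     'gs -dNOPAUSE -sDEVICE=jpeg -dFirstPage=16 -dLastPage=30 -sOutputFile=image1-%d.jpg -dJPEGQ=100 -r300x300 -q book.pdf -c quit',
// ]);
```

Because every process is started before the first wait, the commands run side by side, and a non-zero exit code shows which page range failed.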

Conclusion




Dalibor Sojic

WEB Developer, freelancer.


8 Responses to “Better Convert pdf to jpg using ghost script”

  1. Philippe

    Nov 02. 2011

    GS is the soft behind http://pdf2jpg.net/ , a PDF to JPG converter as the name suggests. I agree with your figures, ImageMagick uses too much memory.

  2. Rane Bowen

    Oct 24. 2012

    Thanks for this, very helpful. Do you know if there is an easy way to extract metadata using ghostscript?

    Thanks, Rane

  3. Andrew B.

    Feb 26. 2013

    If you need to convert a web page to a PDF file, I suggest using this online converter: http://www.kitpdf.com/web_to_pdf/. It is simple to use and efficient.

    • Dalibor Sojic

      Feb 26. 2013

      Thanks for the link. Such services are useful for endusers. WEB developers are looking for “how to do it”.

      For WEB to pdf conversion, I have tried several solutions: phantomjs, wkhtmltopdf, http://www.princexml.com/. Most suitable is wkhtmltopdf.

  4. Ali

    Mar 03. 2013

    Hello!
    I test on EasyPHP-5.3.9, but i receive the following error:

    Warning: include(timer.php) [function.include]: failed to open stream: No such file or directory in C:\Program Files (x86)\EasyPHP-5.3.9\www\PDF2IMAGE.php on line 9

    Warning: include() [function.include]: Failed opening ‘timer.php’ for inclusion (include_path=’.;C:\php\pear’) in C:\Program Files (x86)\EasyPHP-5.3.9\www\PDF2IMAGE.php on line 9

    Warning: require_once(thread.php) [function.require-once]: failed to open stream: No such file or directory in C:\Program Files (x86)\EasyPHP-5.3.9\www\PDF2IMAGE.php on line 10

    Fatal error: require_once() [function.require]: Failed opening required ‘thread.php’ (include_path=’.;C:\php\pear’) in C:\Program Files (x86)\EasyPHP-5.3.9\www\PDF2IMAGE.php on line 10

    Please help me.

  5. patricksamson

    Jun 05. 2013

    thanks for the wonderful article..can you suggest other ways through which pdf can be converted to jpg?


Trackbacks/Pingbacks

  1. Faster conversions from PDF to PNG/JPEG, imagemagick vs ghostscript | Bertan Guven's Blog - August 8, 2013

    [...] has the capability of processing 1 page at time which reduced the load on hardware by a lot. Here is a great article about that i used as a reference in this [...]
