How To Create A PDF From Your Web Application
Many web applications have the requirement of giving the user the ability to download something in PDF format. In the case of applications (such as e-commerce stores), those PDFs have to be created using dynamic data, and be available immediately to the user.
In this article, I’ll explore ways in which we can generate a PDF directly from a web application on the fly. It isn’t a comprehensive list of tools, but instead I am aiming to demonstrate the different approaches. If you have a favorite tool or any experiences of your own to share, please add them to the comments below.
Starting With HTML And CSS
Our web application is likely to be already creating an HTML document using the information that will be added to our PDF. In the case of an invoice, the user might be able to view the information online, then click to download a PDF for their records. You might be creating packing slips; once again, the information is already held within the system. You want to format that in a nice way for download and printing. Therefore, a good place to start would be to consider if it is possible to use that HTML and CSS to generate a PDF version.
CSS does have a specification which deals with CSS for print, and this is the Paged Media module. I have an overview of this specification in my article “Designing For Print With CSS”, and CSS is used by many book publishers for all of their print output. Therefore, as CSS itself has specifications for printed materials, surely we should be able to use it?
The simplest way a user can generate a PDF is via their browser. By choosing to print to PDF rather than a printer, a PDF will be generated. Sadly, this PDF is usually not altogether satisfactory! To start with, it will have the headers and footers which are automatically added when you print something from a webpage. It will also be formatted according to your print stylesheet — assuming you have one.
The problem we run into here is the poor support of the fragmentation specification in browsers; this may mean that the content of your pages breaks in unusual ways. Support for fragmentation is patchy, as I discovered when I researched my article, “Breaking Boxes With CSS Fragmentation”. This means that you may be unable to prevent suboptimal breaking of content, with headers being left as the last item on the page, and so on.
In addition, we have no ability to control the content in the page margin boxes, e.g. adding a header of our choosing to each page or page numbering to show how many pages a complex invoice has. These things are part of the Paged Media spec, but have not been implemented in any browser.
My article “A Guide To The State Of Print Stylesheets In 2018” is still accurate in terms of the type of support that browsers have for printing directly from the browser, using a print stylesheet.
Printing Using Browser Rendering Engines
There are ways to print to PDF using browser rendering engines, without going through the print menu in the browser, and ending up with headers and footers as if you had printed the document. The most popular options in response to my tweet were wkhtmltopdf, and printing using headless Chrome and Puppeteer.
A solution that was mentioned a number of times on Twitter is a commandline tool called wkhtmltopdf. This tool takes an HTML file or multiple files, along with a stylesheet and turns them into a PDF. It does this by using the WebKit rendering engine.
We use wkhtmltopdf. It’s not perfect, although that was probably user error, but easily good enough for a production application.
— Paul Cardno (@pcardno) February 15, 2019
Essentially, therefore, this tool does the same thing as printing from the browser, however, you will not get the automatically added headers and footers. On this positive side, if you have a working print stylesheet for your content then it should also nicely output to PDF using this tool, and so a simple layout may well print very nicely.
Unfortunately, however, you will still run into the same problems as when printing directly from the web browser in terms of lack of support for the Paged Media specification and fragmentation properties, as you are still printing using a browser rendering engine. There are some flags that you can pass into wkhtmltopdf in order to add back some of the missing features that you would have by default using the Paged Media specification. However, this does require some extra work on top of writing good HTML and CSS.
Another interesting possibility is that of using Headless Chrome and Puppeteer to print to PDF.
Puppeteer. It’s amazing for this.
— Alex Russell (@slightlylate) February 15, 2019
However once again you are limited by browser support for Paged Media and fragmentation. There are some options which can be passed into the
page.pdf() function. As with wkhtmltopdf, these add in some of the functionality that would be possible from CSS should there be browser support.
It may well be that one of these solutions will do all that you need, however, if you find that you are fighting something of a battle, it is probable that you are hitting the limits of what is possible with current browser rendering engines, and will need to look for a better solution.
Yes. For simple docs, like course certificates, we can use Chrome, which has minimal @ page support. For anything else, we use PrinceXML or the paged.js polyfill in Chrome. Here’s a WIP proof-of-concept using paged.js for books: https://t.co/AZ9fO94PT2
— Electric Book Works (@electricbook) February 15, 2019
Using A Print User Agent
If you want to stay with an HTML and CSS solution then you need to look to a User Agent (UA) designed for printing from HTML and CSS, which has an API for generating the PDF from your files. These User Agents implement the Paged Media specification and have far better support for the CSS Fragmentation properties; this will give you greater control over the output. Leading choices include:
A print UA will format documents using CSS — just as a web browser does. As with browser support for CSS, you need to check the documentation of these UAs to find out what they support. For example, Prince (which I am most familiar with) supports Flexbox but not CSS Grid Layout at the time of writing. When sending your pages to the tool that you are using, typically this would be with a specific stylesheet for print. As with a regular print stylesheet, the CSS you use on your site will not all be appropriate for the PDF version.
Creating a stylesheet for these tools is very similar to creating a regular print stylesheet, making the kind of decisions in terms of what to display or hide, perhaps using a different font size or colors. You would then be able to take advantage of the features in the Paged Media specification, adding footnotes, page numbers, and so on.
In terms of using these tools from your web application, you would need to install them on your server (having bought a license to do so, of course). The main problem with these tools is that they are expensive. That said, given the ease with which you can then produce printed documents with them, they may well pay for themselves in developer time saved.
It is possible to use Prince via an API, on a pay per document basis, via a service called DocRaptor. This would certainly be a good place for many applications to start as if it looked as if it would become more cost effective to host your own, the development cost of switching would be minimal.
A free alternative, which is not quite as comprehensive as the above tools but may well achieve the results you need, is WeasyPrint. It doesn’t fully implement all of Paged Media, however, it implements more than a browser engine does. Definitely, one to try!
Moving Away From HTML And CSS
Headless browser saving to PDF was once my first choice but always produced subpar results for anything other than a single page document. We switched over to https://t.co/3o8Ce23F1t for multi-page reports which took quite a lot more effort but well worth it in the end!
— JimmyJoy (@jimle_uk) February 15, 2019
In the course of writing this article, I also discovered a Python wrapper which can run a number of different tools. (Note that you need to already have the tools themselves installed, however, this could be a good way to test out the various tools on a sample document.)
For support of Paged Media and fragmentation, Prince, Antenna House, and PDFReactor are going to come out top. As commercial products, they also come with support. If you have a budget, complex pages to print to PDF, and your limitation is developer time, then you would most likely find these to be the quickest route to have your PDF creation working well.
However, in many cases, the free tools will work well for you. If your requirements are very straightforward then wkhtmltopdf, or a basic headless Chrome and Puppeteer solution may do the trick. It certainly seemed to work for many of the people who replied to my original tweet.
If you find yourself struggling to get the output you want, however, be aware that it may be a limitation of browser printing, and not anything you are doing wrong. In the case that you would like more Paged Media support, but are not in a position to go for a commercial product, perhaps take a look at WeasyPrint.
I hope this is a useful roundup of the tools available for creating PDFs from your web application. If nothing else, it demonstrates that there are a wide variety of choices, if your initial choice isn’t working well.
Please add your own experiences and suggestions in the comments, this is one of those things that a lot of us end up dealing with, and personal experience shared can be incredibly helpful.
A roundup of the various resources and tools mentioned in this article, along with some other useful resources for working with PDF files from web applications.
Articles and Resources
- Designing For Print With CSS
- Breaking Boxes With CSS Fragmentation
- A Guide To The State Of Print Stylesheets In 2018
- Getting Started with Headless Chrome and Puppeteer
- Antenna House
- Produce & Publish Server