Generate PDFs and ePub with wkhtmltopdf and Calibre

In a previous post, I wrote about how I use GNU make to manage dependencies and generate html files from markdown source. In this post, I’ll build on that and use the html to generate PDFs and ePub files.

MultiMarkdown can generate PDF files using LaTeX. For some reason, I never got that to work. I tried on multiple Macs with clean installs of MultiMarkdown and a variety of LaTeX apps like mmd2tex, the LaTeX support files for MultiMarkdown, etc. I failed each time. Plus, I don’t really want to re-learn LaTeX. And I hate typing in LaTeX.

I know HTML. I know CSS. So I sought out tools to help me leverage those skills.

Leveraging HTML

Obviously, I didn’t want to load the HTML into the browser and use the OS to Save As PDF. That would be lame. I wanted to generate this stuff at the command-line, and in scripts, and automate things. I also had an intuition that WebKit would be exposed in more ways than simply being embedded into Chrome and Safari. So, I searched for “webkit pdf.” What’s the first link I found? wkhtmltopdf.

Jackpot! This is an awesome set of commands that allow you to feed it an HTML input and have it generate a crisp PDF. Click on the HTML and resulting PDF:

The command-line to generate this is as follows:

wkhtmltopdf --page-width 5.5in --page-height 8.5in --margin-top 0.25in \
    --margin-bottom 0.25in --margin-left 0.25in --margin-right 0.25in \
    --load-error-handling ignore lorem_ipsum.html lorem_ipsum.pdf



Generating an ePub format was also a goal. I googled all over and found a bunch of tools. The one that looked nicest was Calibre. Calibre is a complete e-book management app. It has crazy features:

  • Understands a gazillion e-book formats
  • Can import PDF books, HTML books, etc.
  • Can edit e-book metadata
  • Can add cover graphics
  • and lots more

But I just wanted an ePub file. I first tried to see if I could generate an ePub myself. The format is a zip file with HTML inside and a bunch of crazy metadata to create a table of contents, chapters, etc. I didn’t want to do all of that work so I’m glad I found Calibre. But as with PDFs, I didn’t want to load the GUI and manually generate ePub files. So I inspected the .app package and inside I found SOLID GOLD. There are a bunch of command-line apps that the GUI uses to do all of the work. Now that’s a programmer who knows what he or she is doing! I was able to add the ebook-convert program to my path and invoke it as follows:

ebook-convert tmp.html tmp.epub --no-default-epub-cover --base-font-size 12 \
    --keep-ligatures --margin-top 10.0 --margin-bottom 10

Now I have lorem_ipsum.epub!
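
Both conversions can be wrapped in one small shell function so a single call produces the PDF and the ePub. This is only a rough sketch: build_artifacts and the WKHTMLTOPDF and EBOOK_CONVERT override variables are names made up for this example, while the flags are the ones used in this post.

```shell
#!/bin/sh
# Sketch: build a PDF and an ePub from one HTML file.
# build_artifacts, WKHTMLTOPDF and EBOOK_CONVERT are hypothetical names;
# the two variables only exist so a different binary can be swapped in.
build_artifacts() {
    html=$1
    base=${html%.html}    # lorem_ipsum.html -> lorem_ipsum

    "${WKHTMLTOPDF:-wkhtmltopdf}" \
        --page-width 5.5in --page-height 8.5in \
        --margin-top 0.25in --margin-bottom 0.25in \
        --margin-left 0.25in --margin-right 0.25in \
        --load-error-handling ignore \
        "$html" "$base.pdf" || return 1

    # ebook-convert takes the input and output files before the options.
    "${EBOOK_CONVERT:-ebook-convert}" "$html" "$base.epub" \
        --no-default-epub-cover --base-font-size 12 --keep-ligatures \
        --margin-top 10.0 --margin-bottom 10
}
```

Calling build_artifacts lorem_ipsum.html would write lorem_ipsum.pdf and lorem_ipsum.epub next to the source, assuming both tools are on the path.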

I like to use the Adobe Digital Editions e-reader on the Mac because I don’t need to add an ePub file to any kind of library. I just open an ePub file and a viewer displays it. There is no “import” step, which would be stupid since my ePub files change so often as I write. You can of course use Calibre, Kindle, and a zillion other apps to view ePub files (albeit with the extra import step).

Width Problems

But there is a very bad problem. The formatting is horrendous! Check this out:

[Image: the page text is cut off at the right edge]

The whole thing is cut off at the right. Why? Because my HTML specifically set a width so it would be narrow like a real book. This was not really necessary, but it makes the HTML easy to read when your browser is maximized. If I hadn’t done this, text would wrap to 100% of the width of the browser and lines would be too long to read comfortably. But e-readers don’t want you to specify widths. Users have lots of different devices. Users play with font sizes and orientations and nothing can be easily predicted. So, you want an HTML file that doesn’t specify any width.

The offending code was in the CSS:

body {
width: 6in;
margin-left: auto;
margin-right: auto;
}

So I just removed the width setting. Unfortunately, either Calibre or wkhtmltopdf doesn’t respect multiple STYLE tags, so I could not simply override the width with a second stylesheet when generating ePub (the way CSS was designed!). I guess I should file a bug report. Anyways, I punted: I cloned the CSS and use the ePub version when I want to generate ePub. This is lame but cpress handles it. At some point, I’ll create a facility in cpress to merge CSS streams so I can have one master CSS and an ePub version that simply gets rid of the width. For now, two CSS files. Here is an image of the resulting ePub as viewed on my iPhone in iBooks Reader:


How fucking beautiful is that?!
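
Until cpress can merge CSS streams, the ePub stylesheet could be derived from the master one with a small filter instead of being cloned by hand. Here is a minimal sketch, not what cpress does: strip_body_width is a made-up name, and it assumes the body rule opens with "body {" on its own line and keeps one declaration per line, as in the snippet above.

```shell
#!/bin/sh
# Sketch: print a stylesheet without the width declaration of the body rule.
# strip_body_width is a hypothetical helper, not part of cpress.
# Assumes "body {" starts a line and each declaration sits on its own line.
strip_body_width() {
    awk '
        /^body[ \t]*[{]/                 { in_body = 1 }
        in_body && /^[ \t]*width[ \t]*:/ { next }
        /[}]/                            { in_body = 0 }
        { print }
    ' "$1"
}

# Usage (file names are examples only):
#   strip_body_width novel-style.css > novel-style-epub.css
```

Only the width inside the body rule is dropped; widths in other rules, such as the one on hr, pass through untouched.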

I think that’s all I’ll say for now. I have lots more to share with you. In my next post, I’ll talk about how I aggregate multiple markdown files into a single markdown file, how a table-of-contents gets auto-generated, and how I script the generation of my html, pdf, and ePub artifacts with crontab and sync everything with Dropbox.

Using MultiMarkdown and GNU Make to generate HTML

In a previous post, I said I was going to start talking about how I do my writing and how I generate html, PDFs, and ePub files. For me, it all starts with the html and MultiMarkdown is the tool I use to turn Markdown into html. From that html, I generate the other final formats.

The MultiMarkdown website does a good job of describing what the tool does. Here is an excerpt:

Writing with MultiMarkdown allows you to separate the content and structure of your document from the formatting. You focus on the actual writing, without having to worry about making the styles of your chapter headers match, or ensuring the proper spacing between paragraphs. And with a little forethought, a single plain text document can easily be converted into multiple output formats without having to rewrite the entire thing or format it by hand. Even better, you don’t have to write in “computer-ese” to create well formatted HTML or LaTeX commands. You just write, MultiMarkdown takes care of the rest.

I diverge from MultiMarkdown’s full feature set because I do not use it to generate PDFs or ePub files. I only use it to generate html. The main reason for this is I could not figure out how to get LaTeX to work! When I installed LaTeX on my Mac by way of MacTeX, I constantly got errors when I tried to generate LaTeX documents. I am sure I could figure it out eventually, but I didn’t want to. Not really. In my head, I knew CSS really well and I knew I could make the html look exactly the way I wanted. Using MultiMarkdown’s LaTeX route meant that the pdf would not look like the html; it would look like the default LaTeX styles that come with MultiMarkdown. These styles are nice, but they’re not what I want and I didn’t want to learn LaTeX to figure it all out. So, my goal was to generate html and from that I would generate the other formats.

Using Make

Now that my goal was to use MultiMarkdown to generate html, I wanted to use GNU Make to automatically build html when Markdown files change. The simplest way to do this is to author a very simple Makefile:

%.html: %.md	# the .md source extension is an assumption
	multimarkdown -o $@ $<

The $@ represents the output filename and the $< represents the input file in Make parlance. This rule says that any X.html file depends on a Markdown source file with the same base name, and the way to create it is multimarkdown -o $@ $<.

I also added a clean rule:

clean:
	rm -rf *.html

MultiMarkdown Headers

MultiMarkdown extends standard Markdown with some attributes you can set in your header. These attributes can define the CSS file to use, insert arbitrary html into the html’s <head> element, set the author, title, etc. Lots of these directives are used for LaTeX formatting as well, but I largely ignore these. Here is a sample header:

Title: Avonia
Language: en
Author: Nick Cody
LaTeX XSLT: manuscript-novel.xslt
Surname: Cody
Base Header Level: 1
Comment: This is a work-fragment; it is the middle of a story. It is destined to be trashed.

When this is compiled to html, it looks like this:



The CSS was a bit trickier. You can use a MultiMarkdown CSS: directive, but that would link to a file. I wanted the CSS to be embedded so the html file could be e-mailed to someone and it would have everything they needed. I tried uploading the CSS to my website and used that absolute url as the CSS location, but accessing a remote server when trying to look at a local html file made me feel dirty.

So, instead, I used the HTML Header: MultiMarkdown directive. I use make to take a standard CSS file and remove all newline characters so the CSS could be embedded. The enhanced rule for that is as follows:

%.mdcss: %.css Makefile
	echo HTML Header: \>> $@

%.html: novel-style.mdcss Makefile
	cat > tmp
	cat novel-style.mdcss >> tmp
	cat $< >> tmp
	multimarkdown -o $@ tmp
	rm -f tmp

A few things are happening here. First, I take the regular css file and create a new file type, .mdcss. This is the single-line MultiMarkdown directive which has the whole CSS on a single line. This is very much like css and JavaScript minification. Notice I use the tr command to strip out newlines.

Then, I have an enhanced html rule, which takes my original MultiMarkdown header, concatenates that with the mdcss, and then concatenates that with the actual writing content. The result is an html file that can be viewed directly. I have a sample file you can look at here:
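
The two steps described above can be sketched in plain shell. This is an approximation under stated assumptions, not the exact rule from my Makefile: css_to_header and assemble are hypothetical names, and wrapping the CSS in style tags is my guess at what the HTML Header: line has to carry.

```shell
#!/bin/sh
# Sketch of the two steps described above.
# css_to_header and assemble are hypothetical names; the <style> wrapper
# around the CSS is an assumption.
css_to_header() {
    printf 'HTML Header: <style>'
    tr -d '\n' < "$1"          # collapse the stylesheet onto one line
    printf '</style>\n'
}

# Metadata header first, then the CSS directive, then the writing itself.
# The blank line ends the MultiMarkdown metadata block.
assemble() {
    cat "$1"                   # file holding Title:, Author:, ...
    css_to_header "$2"
    printf '\n'
    cat "$3"                   # the Markdown content
}
```

The output of assemble is what would be handed to multimarkdown -o in place of the tmp file. The header file must not end in a blank line, or the HTML Header: line would land in the body instead of the metadata.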

You can look at the Markdown source, here:

Enhancing <hr> with fancy awesomeness

Notice that in Markdown, *** gets turned into <hr>. In my CSS, I don’t show the standard rule, I display some Unicode character I turned into a 300dpi png. This png has enough pixels to look good on the screen and on the printer. I make sure it’s the same size by using the background-size CSS attribute, along with specifying width and height in inches and not in pixels:

hr {
	background-image: url(0F05.png);
	background-size: 100%;
	margin-left: auto;
	margin-right: auto;
	margin-top: 1em;
	margin-bottom: 1em;
	width: 0.33in;
	height: 0.33in;
	border: 0px;
}

Notice the 0F05.png. That image weighs in at 396 pixels square and I render it at 0.33in. This yields 1200dpi… good enough for printing, and the stylesheet I created prints awesome.
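
The arithmetic behind that figure is just pixels divided by rendered inches. A tiny sketch to check it for any image size (dpi is a made-up helper name):

```shell
#!/bin/sh
# Sketch: effective print resolution = pixel size / rendered size in inches.
# dpi is a hypothetical helper name.
dpi() {
    awk -v px="$1" -v inches="$2" 'BEGIN { printf "%.0f\n", px / inches }'
}

dpi 396 0.33    # prints 1200
```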

But I don’t really want to reference that image as a file. I already embedded the CSS, so I figure it would be best to embed the image, too. You can do this by base64 encoding the image data. That turns my stylesheet into this:

hr {
	background-image: url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAYwAAAGMCAYAAADJOZVKAAAACXBIWXMAAL

The ellipsis is a big ellipsis. Lots of data in base64 encoding follows, but I omitted it for brevity. I created the encoding using the Mac’s built-in base64 command-line program and I created a helper rule:

$(CPRESS_DIR)images/%.base64: $(CPRESS_DIR)%.png $(CPRESS_DIR)cpress.mak
	base64 $< | sed -e "s/.\{76\}/&~/g" | tr '~' '\n' | tr -d ' ' > $@

That breaks the continuous stream of bytes into another stream with newlines every 76 characters. Some editors cry when you put too many characters on one line.
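
The placeholder trick with ~ works, but the standard fold utility can do the same wrapping in one step. A sketch of that alternative, not the rule I actually use (wrap_base64 is a made-up name):

```shell
#!/bin/sh
# Sketch: base64-encode a file and wrap the output at 76 columns.
# wrap_base64 is a hypothetical helper name, not part of cpress.
wrap_base64() {
    # Reading from stdin works with both the macOS and GNU base64 tools.
    # Line breaks added by the encoder are removed first so that fold
    # alone decides where lines end.
    base64 < "$1" | tr -d '\n' | fold -w 76
    # fold does not terminate the last partial line, so add the newline.
    echo
}
```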

If I were more clever, I’d awk the css file and replace the image url with the data uri, on the fly. Unfortunately, I’m not that clever, at least not yet anyway. I’m an awk n00b.
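
For what it’s worth, sed alone might be enough for the on-the-fly replacement, no awk required. A rough sketch, assuming the stylesheet references the image exactly as url(name.png) and the image is small enough for its encoding to fit on a command line; inline_png is a made-up name:

```shell
#!/bin/sh
# Sketch: print a stylesheet with one PNG reference turned into a data URI.
# inline_png is a hypothetical helper, not part of cpress.
inline_png() {
    css_file=$1
    png_file=$2
    # Encode the image and drop any line breaks the encoder adds.
    data=$(base64 < "$png_file" | tr -d '\n')
    # "|" and "&" never appear in base64 output, so "|" is a safe sed
    # delimiter and the replacement needs no escaping.
    sed "s|url($png_file)|url(data:image/png;base64,$data)|" "$css_file"
}

# Usage (file names are examples only):
#   inline_png novel-style.css 0F05.png > novel-style-inline.css
```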

The advantage here is the stylesheet is completely contained in the html, including the image. This is awesome!

Printing Background Images

When you print html docs, background-images don’t typically print. This has been the default behavior in browsers since as long as I can remember. In my case, I wanted the default to print background graphics since I use them for the horizontal rule elements. That’s easy, so I added this to my CSS:

@media print {
	* {-webkit-print-color-adjust: exact;}
}

That probably only works in Safari and Chrome since they use webkit, but for now that was good enough for me.

Wrapping it up

So, that’s all for now. In another post, I’ll talk about how I used wkhtmltopdf to generate a pdf that looks identical to the HTML (as rendered in Chrome or Safari). I’ll also talk about how I use Calibre to generate ePub format. On the surface, Calibre is a GUI program and it would appear to violate my UNIX-style approach of using Make and command-line scripts. But inside the Calibre package is a set of powerful command-line utilities that I bent to my will. Stay tuned for more on that ’cause it’s so exciting!

Markdown for Writing Projects

I use Markdown for writing because it’s simple, vendor neutral, and easy to process. Using Markdown, I’m not locked into a particular word processor or proprietary format. I work with text and text is awesome. I want to describe how I use Markdown to write and generate artifacts such as html, pdf, and various e-book formats like ePub.

The writing solution I wanted had these requirements:

  • I want to edit in plain-text, Markdown
  • I want to control how the HTML looks by writing the CSS myself
  • I want the PDF to look like the HTML
  • I want the e-book format to look like the HTML and PDF
  • I want the HTML, PDF, and e-book formats to be built from the Markdown source, automatically
  • I want to edit remotely on my iPad and have my local and remote work synchronized
  • I want to retrieve past revisions in case I paint myself into a corner and I need to get back to the place I was before

I’ll write a few posts over the course of the next few weeks that will serve as a general introduction to how I used MultiMarkdown, make, git, and Dropbox to address these requirements. For now, let’s talk about Markdown.

Why Markdown?

Most people I know are already well versed in the beauty of markdown and plain text editing. Markdown is used all over the place. GitHub uses their own flavor to power all of their READMEs, messages, comments, and more. There is blogging software that uses it. Lots of editors can do some basic coloring and bolding of markdown text to make it look pretty without a full conversion to HTML markup. It’s also just nice to read as plain text, since being readable as plain text is one of the primary features of Markdown.

You may ask what software is available to convert the plain-text markdown into something fancier. There are tons of options here. Here are just a few:

  • Jekyll – Takes markdown files and can generate a static website, like a blog
  • Marked – A Markdown editor with HTML preview and PDF generation capabilities
  • Scrivener – A full-fledged writer’s tool that uses Markdown, manages characters, to-dos, scenes, etc.
  • Byword – A simple iPhone/iPad/Mac editor
  • TextMate – An awesome text editor for the Mac. Unlike vim, easily allows wrapping margins.

Oh how I wish WordPress would allow me to use Markdown as the native editor format! There is a wp-markdown plugin, but I haven’t had the guts to try it out yet. I’m so afraid of being disappointed. It works by taking Markdown and converting it to WordPress HTML and it converts it back to Markdown when you edit a post. That scares the crap out of me.

Most of the software options listed previously allow you to write in Markdown and they can convert to something like HTML or a PDF with some canned stylesheets. And they probably do this through the GUI. For my purposes, I wanted to use TextMate and I wanted to write the stylesheet myself. I originally tried to use vim as my Markdown editor, but I found out that Vim sucks noodles at Markdown editing.

I didn’t want to use a complex tool that has a lot of features. Scrivener might be nice, I never tried it, but it looks awfully complex. I didn’t want to use a GUI. I didn’t want to pull down a menu to generate my HTML. I wanted to use the command line because the command-line is awesome. In a nutshell, I wanted Markdown to be my code and I wanted a build system that produced my programs: the HTML, PDF, and e-book formats.

I wanted to use make to notice when my Markdown files are modified and have my HTML generated automatically. I wanted to write chapters as individual Markdown files and have them auto-magically aggregated into a book as a post-process. Just like a linker!

In my next post, I’ll talk about MultiMarkdown and why it’s awesome and how I use it to generate HTML with my own CSS. I even tricked the HTML generation to insert javascript to produce automatic hyphenation since hyphenation is still something that HTML5/CSS3 don’t seem to do well in most browsers I tried.