Convert Rust Programming Language books to EPUB

The Rust Programming Language book is available in HTML format. However, no official EPUB version is provided.

The idea of this post is to convert the HTML website into an ebook. Some solutions already exist, such as WebToEpub, papeer, and pandoc. However, none of them formatted the book the way I wanted, and they were not easy to customize for my needs. So I decided to write my own scraper in Python and share the methodology in this post.

First of all, I installed the following packages:

  • ebooklib: to create EPUB files.
  • lxml: to manipulate XHTML files.
  • scrapy: to scrape the website.
  • w3lib: to normalize URLs.

This can be done using pipenv:

pipenv install ebooklib lxml scrapy w3lib

Now, let’s create the rust_book_spider.py script and import the required libraries.

import ebooklib.epub
import lxml.html
import os
import scrapy
import uuid
import w3lib.url

from urllib.parse import urljoin, urlparse

Then I create the RustBookSpider class, starting with the book metadata.

# Crawler for building the Rust Book in epub format
# @see https://idpf.org/epub/30/spec/epub30-publications.html
class RustBookSpider(scrapy.Spider):

    # Book data
    book_title = "The Rust Programming Language"
    book_language = "en"
    book_authors = [
        "Steve Klabnik",
        "Carol Nichols",
    ]
    book_cover = "cover.jpg"
    book_css = "style.css"
    book_nav_title = "Table of Contents"
    book_filename = "rust_book.epub"

The ebook will have a cover page built from a cover.jpg file that I provide next to the script. I also have to supply a stylesheet, style.css, which I will detail later.

Let’s add the crawler data:

    # Crawler data
    name = "rust_book_spider"
    start_url = "https://doc.rust-lang.org/book/"

In addition, we need some temporary data during the scraping:

    # Temporary data
    _toc = {}  # canonical URL -> TOC item (title, filename, parent, children)
    _chapters = {}  # canonical URL -> EpubHtml chapter
    _images = {}  # canonical image URL -> {"filename": ..., "content": ...}
    _filenames = set()  # filenames already used in the book

The first method to be called in this class is start_requests(). It requests the page located at the URL defined by start_url in the crawler data, registering parse_toc() as the callback to run when the download succeeds.

    # Initialize the crawler
    # @see https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.start_requests
    def start_requests(self):
        # The crawler starts by parsing the table of contents of the start page
        yield scrapy.Request(self.start_url, callback=self.parse_toc)

The parse_toc() method parses the table of contents (TOC) from the HTML code of the homepage, then requests every chapter page it found. Each of these requests uses parse_chapter() as its callback.

    # Parse the table of contents of a given page
    def parse_toc(self, response):
        # Get the `<ol>` element containing the TOC
        toc_element = response.css("#sidebar .chapter")
        # Extract data from the TOC
        self._toc = {
            item["url"]: item for item in self.parse_toc_list(response, toc_element)
        }
        # Setup parent-children relationships in the TOC tree
        for key, item in self._toc.items():
            if item["parent"] is not None:
                self._toc[item["parent"]]["children"].append(key)
        # The crawler continues by visiting all the URLs from the TOC
        for item in self._toc.values():
            yield scrapy.Request(item["url"], callback=self.parse_chapter)

The parse_toc_list() method recursively parses a given <ol> HTML element and yields the TOC items it contains.

    # Parse an `<ol>` element from the table of contents
    def parse_toc_list(self, response, element, parent=None):
        # Memorize previous TOC item
        previous = None
        for item in element.xpath("./li"):
            # Child element of `<li>` is either `<ol>` (sub-tree) or `<a>` (link)
            section = item.xpath("./ol")
            link = item.xpath("./a")
            if section.get() is not None:
                # Children of the previous TOC item
                yield from self.parse_toc_list(response, section, previous)
            elif link.get() is not None:
                item = self.parse_toc_item(response, link, parent)
                yield item
                previous = item["url"]
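
For reference, the sidebar markup being parsed looks roughly like this (simplified from the generated mdBook output). A sub-list lives in its own <li>, right after the item it belongs to, which is why the previous item becomes the parent:

<ol class="chapter">
  <li><a href="ch01-00-getting-started.html"><strong>1.</strong> Getting Started</a></li>
  <li>
    <ol class="section">
      <li><a href="ch01-01-installation.html"><strong>1.1.</strong> Installation</a></li>
      <li><a href="ch01-02-hello-world.html"><strong>1.2.</strong> Hello, World!</a></li>
    </ol>
  </li>
</ol>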

The parse_toc_item() method parses a given <a> HTML element and returns a TOC item.

    # Parse an `<a>` element from the table of contents
    def parse_toc_item(self, response, element, parent=None):
        # The element is like `<a href="page.html"><strong>1.2.</strong> Title</a>`
        number = element.xpath("./strong/text()").get(default="").strip()
        title = element.xpath("./text()").get().strip()
        href = element.xpath("./@href").get()
        # Normalize URL
        url = self.normalize_url(href, base=response.url)
        basename = os.path.splitext(os.path.basename(urlparse(url).path))[0]
        # Chapters are in XHTML
        filename = self.create_filename(f"{basename}.xhtml")
        return {
            "uid": basename,
            "title": title,
            # Alternative: include the section number in the title
            # "title": f"{number} {title}".strip(),
            "url": url,
            "filename": filename,
            "parent": parent,
            "children": [],
        }
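
For instance, for a link such as <a href="ch01-01-installation.html"><strong>1.1.</strong> Installation</a> on the start page, the returned item would look like this (values illustrative):

{
    "uid": "ch01-01-installation",
    "title": "Installation",
    "url": "https://doc.rust-lang.org/book/ch01-01-installation.html",
    "filename": "ch01-01-installation.xhtml",
    "parent": "https://doc.rust-lang.org/book/ch01-00-getting-started.html",
    "children": [],
}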

When a chapter page is successfully downloaded, scrapy calls the parse_chapter() method, which uses CSS and XPath selectors to locate the content and metadata. In addition, I schedule a download request for each image found in the chapter, with parse_image() as the callback. Finally, I call process_html() to perform some post-processing.

    # Parse a chapter page
    def parse_chapter(self, response):
        # Get the `<main>` element containing the page contents
        main = response.css("#content > main")
        # The TOC item contains useful metadata such as the title
        toc_item = self._toc[response.url]
        # @see http://docs.sourcefabric.org/projects/ebooklib/en/latest/ebooklib.html#ebooklib.epub.EpubHtml
        html = ebooklib.epub.EpubHtml(
            title=toc_item["title"],
            file_name=toc_item["filename"],
            lang=response.xpath("/html/@lang").get(),
        )
        # Download images
        for image in main.xpath(".//img"):
            src = image.xpath("@src").get()
            url = self.normalize_url(src, base=response.url)
            if url not in self._images:
                basename = os.path.basename(urlparse(url).path)
                filename = self.create_filename(os.path.join("images", basename))
                self._images[url] = {
                    "filename": filename,
                    "content": None,
                }
                yield scrapy.Request(url, callback=self.parse_image)
        # The HTML must be processed for the epub format
        content = self.process_html(response, main.get())
        html.set_content(content)
        self._chapters[response.url] = html

The parse_image() method simply stores the image bytes in the _images property.

    # Parse an image
    def parse_image(self, response):
        self._images[response.url]["content"] = response.body

The process_html() method uses lxml to rewrite the website links into EPUB-internal links, with the help of the replace_link() and normalize_url() methods.

    # Process the HTML of a chapter
    def process_html(self, response, content):
        # Parse as HTML
        # @see https://lxml.de/api/lxml.html-module.html#fragment_fromstring
        doc = lxml.html.fragment_fromstring(
            content,
            base_url=response.url,
            # @see https://lxml.de/api/lxml.etree.HTMLParser-class.html
            parser=lxml.html.HTMLParser(remove_blank_text=True, remove_comments=True),
        )
        # Remove links in titles
        for link in doc.cssselect("a.header"):
            link.drop_tag()
        # Call `self.replace_link()` on every link
        doc.rewrite_links(
            lambda link: self.replace_link(link),
            resolve_base_href=True,
            base_href=response.url,
        )
        # Replace `src` attributes in images
        for image in doc.xpath("//img"):
            old_src = image.get("src")
            url = self.normalize_url(old_src, base=response.url)
            if url in self._images:
                # XHTML files and images may be stored in different folders
                image_path = self._images[url]["filename"]
                text_dir = os.path.dirname(self._toc[response.url]["filename"])
                new_src = os.path.relpath(image_path, start=text_dir)
                image.set("src", new_src)
        # Serialize as XHTML
        # @see https://lxml.de/api/lxml.html-module.html#tostring
        return lxml.html.tostring(doc, method="xml")

    # Transform a link to match epub filenames
    def replace_link(self, link):
        url = self.normalize_url(link)
        # Replace the link if found in the table of contents
        if url in self._toc:
            filename = self._toc[url]["filename"]
            # Restore the URL fragment if needed
            parts = urlparse(link)
            return filename if parts.fragment == "" else f"{filename}#{parts.fragment}"
        return link

    # Canonicalize a URL
    def normalize_url(self, url, base=None):
        # Make the URL absolute
        if base is not None:
            netloc = urlparse(url).netloc
            if netloc == "" or netloc == urlparse(base).netloc:
                url = urljoin(base, url)
        return w3lib.url.canonicalize_url(url)
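
To make these two helpers concrete, here is a quick standalone sketch (URLs illustrative). Note that w3lib’s canonicalize_url() strips URL fragments by default, which is precisely why replace_link() has to restore them afterwards:

from urllib.parse import urljoin

import w3lib.url

base = "https://doc.rust-lang.org/book/ch03-00-common-programming-concepts.html"
# A relative link found in a chapter becomes an absolute, canonical URL...
print(w3lib.url.canonicalize_url(urljoin(base, "ch03-02-data-types.html")))
# https://doc.rust-lang.org/book/ch03-02-data-types.html
# ...and its fragment is dropped during canonicalization
print(w3lib.url.canonicalize_url(urljoin(base, "ch03-02-data-types.html#integer-types")))
# https://doc.rust-lang.org/book/ch03-02-data-types.html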

The create_filename() method ensures that no two files in the EPUB share the same filename.

    # Create an unused filename from a desired one
    def create_filename(self, filename):
        (basename, ext) = os.path.splitext(filename)
        i = 0
        # Loop until we find an unused filename
        while filename in self._filenames:
            filename = f"{basename}_{i}{ext}"
            i += 1
        self._filenames.add(filename)
        return filename
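
For example, asking three times for the same name yields numbered variants:

spider = RustBookSpider()
print(spider.create_filename("print.xhtml"))  # print.xhtml
print(spider.create_filename("print.xhtml"))  # print_0.xhtml
print(spider.create_filename("print.xhtml"))  # print_1.xhtml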

The create_toc() method recursively builds the TOC tree for ebooklib.

    # Recursively build a TOC tree made of epub objects
    def create_toc(self, parent=None):
        # Get children of `parent` item, or root items if parent is None
        children = {
            key: item for key, item in self._toc.items() if item["parent"] == parent
        }
        return [
            # Create a section if the current item has children
            [
                ebooklib.epub.Section(item["title"], item["filename"]),
                # Recursive call
                self.create_toc(key),
            ]
            if len(item["children"]) > 0
            # Create a link if the current item has no children
            else ebooklib.epub.Link(item["filename"], item["title"], item["uid"])
            for key, item in children.items()
        ]
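
With the Rust book, the resulting structure mixes Link objects for standalone pages and [Section, children] pairs for items that have sub-pages, roughly like this (truncated):

[
    ebooklib.epub.Link("foreword.xhtml", "Foreword", "foreword"),
    [
        ebooklib.epub.Section("Getting Started", "ch01-00-getting-started.xhtml"),
        [
            ebooklib.epub.Link("ch01-01-installation.xhtml", "Installation", "ch01-01-installation"),
            ebooklib.epub.Link("ch01-02-hello-world.xhtml", "Hello, World!", "ch01-02-hello-world"),
        ],
    ],
]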

Finally, the scrapy engine calls the closed() method once all the content has been scraped. This is where I build the full ebook from the temporary data.

    # Terminate the crawler and create the ebook
    # @see https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.closed
    def closed(self, reason):
        # @see http://docs.sourcefabric.org/projects/ebooklib/en/latest/tutorial.html
        book = ebooklib.epub.EpubBook()

        # Add metadata
        book.set_identifier(f"urn:uuid:{uuid.uuid1()}")
        book.set_title(self.book_title)
        book.set_language(self.book_language)

        # Add authors
        for author in self.book_authors:
            book.add_author(author)

        # Add stylesheet
        css = None
        if self.book_css is not None:
            css = ebooklib.epub.EpubItem(
                uid="css",
                file_name="style.css",
                media_type="text/css",
                content=open(self.book_css, "r").read(),
            )
            book.add_item(css)

        # Add cover
        cover_image = None
        cover_html = None
        if self.book_cover is not None:
            cover_image_uid = "cover-img"
            cover_image_filename = "cover.jpg"
            cover_html_filename = "cover.xhtml"
            cover_image = ebooklib.epub.EpubCover(
                uid=cover_image_uid,
                file_name=cover_image_filename,
            )
            cover_image.set_content(open(self.book_cover, "rb").read())
            cover_html = ebooklib.epub.EpubHtml(
                uid="cover",
                title="Cover",
                file_name=cover_html_filename,
                content=f"""
                    <img src="{cover_image_filename}" alt="Cover" class="cover-img"/>
                """,
            )
            book.add_item(cover_image)
            book.add_item(cover_html)
            book.add_metadata(
                None,
                "meta",
                "",
                {
                    "name": "cover",
                    "content": cover_image_uid,
                },
            )

        # Add chapters
        for key in self._toc.keys():
            html = self._chapters[key]
            # Add stylesheet to all chapters
            if css is not None:
                html.add_item(css)
            book.add_item(html)

        # Add images
        for image in self._images.values():
            item = ebooklib.epub.EpubImage(
                file_name=image["filename"], content=image["content"]
            )
            book.add_item(item)

        # Add table of contents
        book.toc = self.create_toc()

        # Add NCX
        book.add_item(ebooklib.epub.EpubNcx())

        # Add navigation
        nav = ebooklib.epub.EpubNav(title=self.book_nav_title)
        if css is not None:
            nav.add_item(css)
        book.add_item(nav)

        # Add spine (skip the cover entry if no cover was provided)
        book.spine = [
            *(["cover"] if cover_html is not None else []),
            "nav",
            *(self._chapters[key] for key in self._toc.keys()),
        ]

        ebooklib.epub.write_epub(self.book_filename, book)

As I said before, the script requires a stylesheet named style.css in the same folder. I extracted a subset of the CSS declarations from the website and adapted them to my needs.

@namespace epub "http://www.idpf.org/2007/ops";

:root {
  --links: #606060;

  --block-code-color: #000;
  --block-code-bg: #e0e0e0;
  --block-code-alternate-bg: #c0c0c0;

  --boring-code-color: #909090;

  --inline-code-color: #000;
  --inline-code-bg: #d0d0d0;

  --quote-color: #000;
  --quote-bg: #fff;
  --quote-border: #c0c0c0;

  --table-border-color: #d0d0d0;
  --table-header-bg: #c0c0c0;
  --table-alternate-bg: #e0e0e0;
}

html,
body {
  font-family: sans-serif;
}

.cover-img {
  text-align: center;
  height: 100%;
}

a,
a:visited,
a:active,
a:hover {
  color: var(--links);
  text-decoration: none;
}

img {
  max-width: 100%;
}

table {
  margin: 0 auto;
  border-collapse: collapse;
}

table td {
  padding: 3px 20px;
  border: 1px solid var(--table-border-color);
}

table thead {
  background: var(--table-header-bg);
}

table thead td {
  font-weight: 700;
  border: none;
}

table thead th {
  padding: 3px 20px;
}

table thead tr {
  border: 1px solid var(--table-header-bg);
}

table tbody tr:nth-child(2n) {
  background: var(--table-alternate-bg);
}

blockquote {
  margin: 20px 0;
  padding: 0 20px;
  color: var(--quote-color);
  background-color: var(--quote-bg);
  border-top: 0.1em solid var(--quote-border);
  border-bottom: 0.1em solid var(--quote-border);
}

pre,
code {
  font-family: monospace;
  white-space: pre-wrap;
}

pre > code {
  display: block;
  position: relative;
  padding: 0.5rem;
  font-size: 0.875em;
  color: var(--block-code-color);
  background: var(--block-code-bg);
  word-break: break-all;
  line-break: anywhere;
}

:not(pre) > code {
  padding: 0.1em 0.3em;
  border-radius: 3px;
  color: var(--inline-code-color);
  background: var(--inline-code-bg);
}

.boring {
  display: none;
  color: var(--boring-code-color);
}

.does_not_compile,
.panics,
.not_desired_behavior {
  border: 0.25rem dotted var(--block-code-alternate-bg);
  padding: 0.25rem;
}

.ferris-explain {
  width: 100px;
}

span.caption {
  font-size: 0.8em;
  font-weight: 600;
}

span.caption code {
  font-weight: 400;
}

.fa {
  display: none;
}

I look for a cover.jpg image to put in the folder, then run the script using pipenv:

pipenv run scrapy runspider rust_book_spider.py

Note that this scraper should work for any Rust book that shares the same structure. For example, you can build the French version, Le langage de programmation Rust, by inheriting from the RustBookSpider class and overriding the metadata.

from rust_book_spider import RustBookSpider

class RustBookFrSpider(RustBookSpider):

    book_title = "Le langage de programmation Rust"
    book_language = "fr"
    book_authors = [
        "Steve Klabnik",
        "Carol Nichols",
    ]
    book_cover = "cover.jpg"
    book_css = "style.css"
    book_nav_title = "Table des Matières"
    book_filename = "rust_book_fr.epub"

    name = "rust_book_fr_spider"
    start_url = "https://jimskapt.github.io/rust-book-fr/"
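
Assuming this subclass is saved as rust_book_fr_spider.py, it runs the same way:

pipenv run scrapy runspider rust_book_fr_spider.py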

And voilà! The full scripts can be downloaded here. Note that you need to provide your own cover.jpg file to make them work.