Convert Rust Programming Language books to EPUB
The Rust Programming Language book is available in HTML format, but no official EPUB version is provided.
The idea of this post is to convert the HTML website into an ebook. Some solutions already exist, such as WebToEpub, papeer, and pandoc. However, they did not format the book the way I wanted, or were not easy enough to customize for my needs. So I decided to write my own scraper in Python and share the methodology here.
First of all, I installed the following packages:
- ebooklib: to create EPUB files.
- lxml: to manipulate XHTML files.
- scrapy: to scrape the website.
- w3lib: to normalize URLs.
This can be done using pipenv:
pipenv install ebooklib lxml scrapy w3lib
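If you prefer not to use pipenv, installing the same packages in a plain virtual environment with pip should work just as well:
python -m venv .venv
source .venv/bin/activate
pip install ebooklib lxml scrapy w3lib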
Now, let's create the rust_book_spider.py script and import the libraries in it.
import ebooklib.epub
import lxml.html
import os
import scrapy
import uuid
import w3lib.url
from urllib.parse import urljoin, urlparse
Then I create the RustBookSpider class containing the book metadata.
# Crawler for building the Rust Book in epub format
# @see https://idpf.org/epub/30/spec/epub30-publications.html
class RustBookSpider(scrapy.Spider):
    # Book data
    book_title = "The Rust Programming Language"
    book_language = "en"
    book_authors = [
        "Steve Klabnik",
        "Carol Nichols",
    ]
    book_cover = "cover.jpg"
    book_css = "style.css"
    book_nav_title = "Table of Contents"
    book_filename = "rust_book.epub"
The ebook will have a cover page illustrated by a provided cover.jpg file. Moreover, I have to provide a stylesheet, style.css, that I will detail later.
Let's add the crawler data:
    # Crawler data
    name = "rust_book_spider"
    start_url = "https://doc.rust-lang.org/book/"
In addition, we need some temporary data during the scraping:
    # Temporary data
    _toc = {}
    _chapters = {}
    _images = {}
    _filenames = set()
The first method to be called in this class is start_requests(). It requests the page located at the URL defined by start_url in the crawler data, with parse_toc() as the callback to run when the request succeeds.
    # Initialize the crawler
    # @see https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.start_requests
    def start_requests(self):
        # The crawler starts by parsing the table of contents of the start page
        yield scrapy.Request(self.start_url, callback=self.parse_toc)
The parse_toc() method parses the table of contents (TOC) from the HTML code of the homepage in order to request all the chapters. Each chapter URL is then requested with parse_chapter() as its callback.
    # Parse the table of contents of given page
    def parse_toc(self, response):
        # Get the `<ol>` element containing the TOC
        toc_element = response.css("#sidebar .chapter")
        # Extract data from the TOC
        self._toc = {
            item["url"]: item for item in self.parse_toc_list(response, toc_element)
        }
        # Setup parent-children relationships in the TOC tree
        for key, item in self._toc.items():
            if item["parent"] is not None:
                self._toc[item["parent"]]["children"].append(key)
        # The crawler continues by visiting all the URLs from the TOC
        for item in self._toc.values():
            yield scrapy.Request(item["url"], callback=self.parse_chapter)
The parse_toc_list() method recursively parses a given <ol> HTML element to get a list of URLs from the TOC.
    # Parse a `<ol>` element from the table of contents
    def parse_toc_list(self, response, element, parent=None):
        # Memorize previous TOC item
        previous = None
        for item in element.xpath("./li"):
            # Child element of `<li>` is either `<ol>` (sub-tree) or `<a>` (link)
            section = item.xpath("./ol")
            link = item.xpath("./a")
            if section.get() is not None:
                # Children of the previous TOC item
                yield from self.parse_toc_list(response, section, previous)
            elif link.get() is not None:
                item = self.parse_toc_item(response, link, parent)
                yield item
                previous = item["url"]
The parse_toc_item() method parses a given <a> HTML element and returns a TOC item.
    # Parse a `<a>` element from the table of contents
    def parse_toc_item(self, response, element, parent=None):
        # The element is like `<a href="page.html"><strong>1.2.</strong> Title</a>`
        number = element.xpath("./strong/text()").get(default="").strip()
        title = element.xpath("./text()").get().strip()
        href = element.xpath("./@href").get()
        # Normalize URL
        url = self.normalize_url(href, base=response.url)
        basename = os.path.splitext(os.path.basename(urlparse(url).path))[0]
        # Chapters are in XHTML
        filename = self.create_filename(f"{basename}.xhtml")
        return {
            "uid": basename,
            "title": title,
            # "title": f"{number} {title}".strip(),
            "url": url,
            "filename": filename,
            "parent": parent,
            "children": [],
        }
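To make the data shape more concrete, here is roughly what a TOC item could look like for one of the sub-chapters. The exact values depend on the live site, so take it as an illustration only:
# Illustrative TOC item for a sub-chapter (values are approximate)
{
    "uid": "ch01-01-installation",
    "title": "Installation",
    "url": "https://doc.rust-lang.org/book/ch01-01-installation.html",
    "filename": "ch01-01-installation.xhtml",
    "parent": "https://doc.rust-lang.org/book/ch01-00-getting-started.html",
    "children": [],
}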
When a chapter page is successfully downloaded, scrapy calls the parse_chapter() method. It uses CSS and XPath selectors to locate the content and metadata. In addition, I request the download of the chapter's images, with parse_image() as the callback for each of them. Finally, I call process_html() to perform some post-processing.
    # Parse a chapter page
    def parse_chapter(self, response):
        # Get the `<main>` element containing the page contents
        main = response.css("#content > main")
        # The TOC item contains useful metadata such as the title
        toc_item = self._toc[response.url]
        # @see http://docs.sourcefabric.org/projects/ebooklib/en/latest/ebooklib.html#ebooklib.epub.EpubHtml
        html = ebooklib.epub.EpubHtml(
            title=toc_item["title"],
            file_name=toc_item["filename"],
            lang=response.xpath("/html/@lang").get(),
        )
        # Download images
        for image in main.xpath(".//img"):
            src = image.xpath("@src").get()
            url = self.normalize_url(src, base=response.url)
            if url not in self._images:
                basename = os.path.basename(urlparse(url).path)
                filename = self.create_filename(os.path.join("images", basename))
                self._images[url] = {
                    "filename": filename,
                    "content": None,
                }
                yield scrapy.Request(url, callback=self.parse_image)
        # The HTML must be processed for the epub format
        content = self.process_html(response, main.get())
        html.set_content(content)
        self._chapters[response.url] = html
The parse_image() method simply stores the content in the _images property.
    # Parse an image
    def parse_image(self, response):
        self._images[response.url]["content"] = response.body
The process_html() method uses lxml to replace the website links with EPUB-internal links, using the replace_link() and normalize_url() methods.
    # Process the HTML of a chapter
    def process_html(self, response, content):
        # Parse as HTML
        # @see https://lxml.de/api/lxml.html-module.html#fragment_fromstring
        doc = lxml.html.fragment_fromstring(
            content,
            base_url=response.url,
            # @see https://lxml.de/api/lxml.etree.HTMLParser-class.html
            parser=lxml.html.HTMLParser(remove_blank_text=True, remove_comments=True),
        )
        # Remove links in titles
        for link in doc.cssselect("a.header"):
            link.drop_tag()
        # Call `self.replace_link()` on every link
        doc.rewrite_links(
            lambda link: self.replace_link(link),
            resolve_base_href=True,
            base_href=response.url,
        )
        # Replace `src` attributes in images
        for image in doc.xpath("//img"):
            old_src = image.get("src")
            url = self.normalize_url(old_src, base=response.url)
            if url in self._images:
                # XHTML files and images may be stored in different folders
                image_path = self._images[url]["filename"]
                text_dir = os.path.dirname(self._toc[response.url]["filename"])
                new_src = os.path.relpath(image_path, start=text_dir)
                image.set("src", new_src)
        # Serialize as XHTML
        # @see https://lxml.de/api/lxml.html-module.html#tostring
        return lxml.html.tostring(doc, method="xml")
    # Transform a link to match epub filenames
    def replace_link(self, link):
        url = self.normalize_url(link)
        # Replace the link if found in the table of contents
        if url in self._toc:
            filename = self._toc[url]["filename"]
            # Restore the URL fragment if needed
            parts = urlparse(link)
            return filename if parts.fragment == "" else f"{filename}#{parts.fragment}"
        return link
    # Canonicalize a URL
    def normalize_url(self, url, base=None):
        # Make the URL absolute
        if base is not None:
            netloc = urlparse(url).netloc
            if netloc == "" or netloc == urlparse(base).netloc:
                url = urljoin(base, url)
        return w3lib.url.canonicalize_url(url)
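One detail worth noting: canonicalize_url() drops URL fragments by default, which is exactly why replace_link() has to re-attach them. For instance, a chapter link with an anchor should end up rewritten like this (illustrative values):
# Illustration (assuming the chapter is present in the TOC):
# normalize_url() strips the fragment...
#   "https://doc.rust-lang.org/book/ch03-02-data-types.html#integer-types"
#   -> "https://doc.rust-lang.org/book/ch03-02-data-types.html"
# ...and replace_link() maps the link to the epub file, restoring the fragment:
#   -> "ch03-02-data-types.xhtml#integer-types"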
The create_filename() method is useful to ensure that no two files have the same filename.
    # Create an unused filename from a desired one
    def create_filename(self, filename):
        (basename, ext) = os.path.splitext(filename)
        i = 0
        # Loop until we find an unused filename
        while filename in self._filenames:
            filename = f"{basename}_{i}{ext}"
            i += 1
        self._filenames.add(filename)
        return filename
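For example, if two chapters happened to resolve to the same name, successive calls would behave like this:
# create_filename("print.xhtml")  # -> "print.xhtml"
# create_filename("print.xhtml")  # -> "print_0.xhtml"
# create_filename("print.xhtml")  # -> "print_1.xhtml"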
The create_toc() method recursively builds the TOC tree for ebooklib.
    # Recursively build a TOC tree made of epub objects
    def create_toc(self, parent=None):
        # Get children of `parent` item, or root items if parent is None
        children = {
            key: item for key, item in self._toc.items() if item["parent"] == parent
        }
        return [
            # Create a section if the current item has children
            [
                ebooklib.epub.Section(item["title"], item["filename"]),
                # Recursive call
                self.create_toc(key),
            ]
            if len(item["children"]) > 0
            # Create a link if the current item has no children
            else ebooklib.epub.Link(item["filename"], item["title"], item["uid"])
            for key, item in children.items()
        ]
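For the Rust book, the result is a nested structure mixing ebooklib Link and Section objects, roughly like this (the exact titles and filenames depend on the live TOC, so treat this as an illustration):
# Roughly the structure assigned to book.toc (illustrative):
# [
#     Link("foreword.xhtml", "Foreword", "foreword"),
#     ...
#     [
#         Section("Getting Started", "ch01-00-getting-started.xhtml"),
#         [
#             Link("ch01-01-installation.xhtml", "Installation", "ch01-01-installation"),
#             Link("ch01-02-hello-world.xhtml", "Hello, World!", "ch01-02-hello-world"),
#             ...
#         ],
#     ],
#     ...
# ]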
Finally, the scrapy engine calls the closed() method when all the content has been scraped. Then I can build the full ebook using the temporary data.
    # Terminate the crawler and create the ebook
    # @see https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.closed
    def closed(self, reason):
        # @see http://docs.sourcefabric.org/projects/ebooklib/en/latest/tutorial.html
        book = ebooklib.epub.EpubBook()
        # Add metadata
        book.set_identifier(f"urn:uuid:{uuid.uuid1()}")
        book.set_title(self.book_title)
        book.set_language(self.book_language)
        # Add authors
        for author in self.book_authors:
            book.add_author(author)
        # Add stylesheet
        css = None
        if self.book_css is not None:
            css = ebooklib.epub.EpubItem(
                uid="css",
                file_name="style.css",
                media_type="text/css",
                content=open(self.book_css, "r").read(),
            )
            book.add_item(css)
        # Add cover
        cover_image = None
        cover_html = None
        if self.book_cover is not None:
            cover_image_uid = "cover-img"
            cover_image_filename = "cover.jpg"
            cover_html_filename = "cover.xhtml"
            cover_image = ebooklib.epub.EpubCover(
                uid=cover_image_uid,
                file_name=cover_image_filename,
            )
            cover_image.set_content(open(self.book_cover, "rb").read())
            cover_html = ebooklib.epub.EpubHtml(
                uid="cover",
                title="Cover",
                file_name=cover_html_filename,
                content=f"""
                    <img src="{cover_image_filename}" alt="Cover" class="cover-img"/>
                """,
            )
            book.add_item(cover_image)
            book.add_item(cover_html)
            book.add_metadata(
                None,
                "meta",
                "",
                {
                    "name": "cover",
                    "content": cover_image_uid,
                },
            )
        # Add chapters
        for key in self._toc.keys():
            html = self._chapters[key]
            # Add stylesheet to all chapters
            if css is not None:
                html.add_item(css)
            book.add_item(html)
        # Add images
        for image in self._images.values():
            item = ebooklib.epub.EpubImage(
                file_name=image["filename"], content=image["content"]
            )
            book.add_item(item)
        # Add table of contents
        book.toc = self.create_toc()
        # Add NCX
        book.add_item(ebooklib.epub.EpubNcx())
        # Add navigation
        nav = ebooklib.epub.EpubNav(title=self.book_nav_title)
        nav.add_item(css)
        book.add_item(nav)
        # Add spine
        book.spine = [
            "cover",
            "nav",
            *(self._chapters[key] for key in self._toc.keys()),
        ]
        ebooklib.epub.write_epub(self.book_filename, book)
As I said before, the script requires a stylesheet named style.css in the same folder. I extracted a subset of the CSS declarations from the website and adapted them to my needs.
@namespace epub "http://www.idpf.org/2007/ops";
:root {
--links: #606060;
--block-code-color: #000;
--block-code-bg: #e0e0e0;
--block-code-alternate-bg: #c0c0c0;
--boring-code-color: #909090;
--inline-code-color: #000;
--inline-code-bg: #d0d0d0;
--quote-color: #000;
--quote-bg: #fff;
--quote-border: #c0c0c0;
--table-border-color: #d0d0d0;
--table-header-bg: #c0c0c0;
--table-alternate-bg: #e0e0e0;
}
html,
body {
font-family: sans-serif;
}
.cover-img {
text-align: center;
height: 100%;
}
a,
a:visited,
a:active,
a:hover {
color: var(--links);
text-decoration: none;
}
img {
max-width: 100%;
}
table {
margin: 0 auto;
border-collapse: collapse;
}
table td {
padding: 3px 20px;
border: 1px solid var(--table-border-color);
}
table thead {
background: var(--table-header-bg);
}
table thead td {
font-weight: 700;
border: none;
}
table thead th {
padding: 3px 20px;
}
table thead tr {
border: 1px solid var(--table-header-bg);
}
table tbody tr:nth-child(2n) {
background: var(--table-alternate-bg);
}
blockquote {
margin: 20px 0;
padding: 0 20px;
color: var(--quote-color);
background-color: var(--quote-bg);
border-top: 0.1em solid var(--quote-border);
border-bottom: 0.1em solid var(--quote-border);
}
pre,
code {
font-family: monospace;
white-space: pre-wrap;
}
pre > code {
display: block;
position: relative;
padding: 0.5rem;
font-size: 0.875em;
color: var(--block-code-color);
background: var(--block-code-bg);
word-break: break-all;
line-break: anywhere;
}
:not(pre) > code {
padding: 0.1em 0.3em;
border-radius: 3px;
color: var(--inline-code-color);
background: var(--inline-code-bg);
}
.boring {
display: none;
color: var(--boring-code-color);
}
.does_not_compile,
.panics,
.not_desired_behavior {
border: 0.25rem dotted var(--block-code-alternate-bg);
padding: 0.25rem;
}
.ferris-explain {
width: 100px;
}
span.caption {
font-size: 0.8em;
font-weight: 600;
}
span.caption code {
font-weight: 400;
}
.fa {
display: none;
}
I then looked for a cover.jpg file to put in the folder, and ran the script using pipenv:
pipenv run scrapy runspider rust_book_spider.py
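Scrapy's usual command-line options apply here too; for instance, you can quiet the logs or enable auto-throttling with the -s flag:
pipenv run scrapy runspider rust_book_spider.py -s LOG_LEVEL=INFO -s AUTOTHROTTLE_ENABLED=True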
Note that this scraper should work for any Rust book that shares the same structure. For example, you can build the French version, Le langage de programmation Rust, by inheriting from the RustBookSpider class; you just need to override the metadata.
from rust_book_spider import RustBookSpider
class RustBookFrSpider(RustBookSpider):
    book_title = "Le langage de programmation Rust"
    book_language = "fr"
    book_authors = [
        "Steve Klabnik",
        "Carol Nichols",
    ]
    book_cover = "cover.jpg"
    book_css = "style.css"
    book_nav_title = "Table des Matières"
    book_filename = "rust_book_fr.epub"
    name = "rust_book_fr_spider"
    start_url = "https://jimskapt.github.io/rust-book-fr/"
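Assuming this subclass is saved as rust_book_fr_spider.py (any filename works, as long as the import above resolves), it runs exactly the same way:
pipenv run scrapy runspider rust_book_fr_spider.py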
And voilà! The full scripts can be downloaded here. Note that you need to provide your own cover.jpg file to make them work.