Convert Rust Programming Language books to EPUB
The Rust Programming Language book is available in HTML format, but no official EPUB version is provided.
The idea of this post is to convert the HTML website into an ebook. Some solutions already exist, such as WebToEpub, papeer, and pandoc, but they did not format the book the way I wanted, or were not easy enough to customize for my needs. So I decided to create my own scraper in Python and share the methodology here.
First of all, I installed the following packages:
- ebooklib: to create EPUB files.
- lxml: to manipulate XHTML files.
- scrapy: to scrape the website.
- w3lib: to normalize URLs.
This can be done using pipenv:
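pipenv install ebooklib lxml scrapy w3lib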
Now, let’s create the rust_book_spider.py script and import the libraries in it.
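Something along these lines covers everything the spider uses, including a few standard-library helpers for URL and path handling:

import posixpath
from urllib.parse import urldefrag, urljoin

import scrapy
from ebooklib import epub
from lxml import html as lxml_html
from w3lib.url import canonicalize_url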
Then I create the RustBookSpider class containing the book metadata.
# Crawler for building the Rust Book in epub format
# @see https://idpf.org/epub/30/spec/epub30-publications.html
class RustBookSpider(scrapy.Spider):
    # Book data (attribute names reconstructed; adapt as needed)
    title = "The Rust Programming Language"
    authors = ["Steve Klabnik", "Carol Nichols"]
    language = "en"
    identifier = "rust-book-en"
    description = "The Rust Programming Language, converted from the official HTML version"
    cover_path = "cover.jpg"
    style_path = "style.css"
The ebook will have a cover page illustrated by a provided cover.jpg file. Moreover, I have to provide a stylesheet, style.css, that I will detail later.
Let’s add the crawler data:
    # Crawler data
    name = "rust_book"
    start_url = "https://doc.rust-lang.org/book/"
In addition, we need some temporary data during the scraping:
    # Temporary data
    _toc = []  # TOC items, in reading order
    _chapters = {}  # epub.EpubHtml objects, keyed by filename
    _images = {}  # image filenames and contents, keyed by URL
    _filenames = set()  # already-used filenames
The first method called in this class is start_requests(). It requests the page located at the URL defined by start_url in the crawler data, and registers parse_toc() as the callback for the successful response.
    # Initialize the crawler
    # @see https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.start_requests
    def start_requests(self):
        # The crawler starts by parsing the table of contents of the start page
        yield scrapy.Request(self.start_url, callback=self.parse_toc)
The parse_toc() method parses the table of contents (TOC) from the HTML code of the homepage in order to request all the chapters. Each chapter URL is then requested with parse_chapter() as its callback.
    # Parse the table of contents of the given page
    def parse_toc(self, response):
        # Get the `<ol>` element containing the TOC
        toc_ol = response.css("ol.chapter")[0]
        # Extract data from the TOC
        # Setup parent-children relationships in the TOC tree
        # (both are handled by the recursion in `parse_toc_list()`)
        self._toc = list(self.parse_toc_list(toc_ol))
        # The crawler continues by visiting all the URLs from the TOC
        for item in self._toc:
            yield scrapy.Request(item["url"], callback=self.parse_chapter)
The parse_toc_list() method recursively parses a given <ol> HTML element to get a list of URLs from the TOC.
    # Parse a `<ol>` element from the table of contents
    def parse_toc_list(self, ol, parent=None):
        # Memorize previous TOC item
        previous = None
        for li in ol.xpath("./li"):
            # Child element of `<li>` is either `<ol>` (sub-tree) or `<a>` (link)
            sub_ol = li.xpath("./ol")
            link = li.xpath("./a")
            if sub_ol:
                # Children of the previous TOC item
                yield from self.parse_toc_list(sub_ol[0], parent=previous)
            elif link:
                item = self.parse_toc_item(link[0], parent)
                yield item
                previous = item
The parse_toc_item() method parses a given <a> HTML element and returns a TOC item.
    # Parse an `<a>` element from the table of contents
    def parse_toc_item(self, link, parent):
        # The element is like `<a href="page.html"><strong>1.2.</strong> Title</a>`
        href = link.attrib["href"]
        number = (link.css("strong::text").get() or "").strip()
        text = " ".join(t.strip() for t in link.css("::text").getall() if t.strip())
        title = text[len(number):].strip()
        # Normalize URL
        url = self.normalize_url(href)
        filename = posixpath.basename(href) or "index.html"
        # Chapters are in XHTML
        filename = self.create_filename(filename.replace(".html", ".xhtml"))
        return {
            "number": number,
            "title": title,
            "url": url,
            "filename": filename,
            "parent": parent,
        }
When a chapter page is successfully downloaded, scrapy calls the parse_chapter() method. It uses CSS and XPath selectors to locate the content and metadata. In addition, I request the chapter's images, with parse_image() called for each of them. Finally, I call process_html() to perform some post-processing.
    # Parse a chapter page
    def parse_chapter(self, response):
        # Get the `<main>` element containing the page contents
        main = response.css("main").get()
        # The TOC item contains useful metadata such as the title
        url = self.normalize_url(response.url)
        item = next(i for i in self._toc if i["url"] == url)
        # @see http://docs.sourcefabric.org/projects/ebooklib/en/latest/ebooklib.html#ebooklib.epub.EpubHtml
        chapter = epub.EpubHtml(
            title=item["title"], file_name=item["filename"], lang=self.language
        )
        # Download images
        for src in response.css("main img::attr(src)").getall():
            img_url = self.normalize_url(src, base=response.url)
            if img_url not in self._images:
                filename = self.create_filename("images/" + posixpath.basename(img_url))
                self._images[img_url] = {"filename": filename, "content": None}
                yield scrapy.Request(img_url, callback=self.parse_image)
        # The HTML must be processed for the epub format
        chapter.content = self.process_html(main, item, response.url)
        self._chapters[item["filename"]] = chapter
The parse_image() method simply stores the content in the _images property.
    # Parse an image
    def parse_image(self, response):
        self._images[self.normalize_url(response.url)]["content"] = response.body
The process_html() method uses lxml to replace the website links with EPUB-internal links, via the replace_link() and normalize_url() methods.
    # Process the HTML of a chapter
    def process_html(self, content, item, base_url):
        # Parse as HTML
        # @see https://lxml.de/api/lxml.html-module.html#fragment_fromstring
        tree = lxml_html.fragment_fromstring(content, create_parent="div")
        # Remove links in titles
        for a in tree.xpath(".//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5]/a"):
            a.drop_tag()
        # Call `self.replace_link()` on every link
        tree.rewrite_links(lambda link: self.replace_link(link, base_url))
        # Replace `src` attributes in images
        for img in tree.xpath(".//img[@src]"):
            image = self._images.get(self.normalize_url(img.get("src"), base=base_url))
            if image is not None:
                # XHTML files and images may be stored in different folders
                folder = posixpath.dirname(item["filename"])
                img.set("src", posixpath.relpath(image["filename"], folder or "."))
        # Serialize as XHTML
        # @see https://lxml.de/api/lxml.html-module.html#tostring
        return lxml_html.tostring(tree, method="xml")

    # Transform a link to match epub filenames
    def replace_link(self, link, base_url):
        url = self.normalize_url(link, base=base_url)
        # Replace the link if found in the table of contents
        for item in self._toc:
            if item["url"] == url:
                # Restore the URL fragment if needed
                fragment = urldefrag(link).fragment
                return item["filename"] + ("#" + fragment if fragment else "")
        return link

    # Canonicalize a URL
    def normalize_url(self, url, base=None):
        # Make the URL absolute
        url = urljoin(base or self.start_url, url)
        # `canonicalize_url()` also drops the fragment, hence the need to
        # restore it in `replace_link()`
        url = canonicalize_url(url)
        return url
The create_filename() method ensures that no two files share the same filename.
    # Create an unused filename from a desired one
    def create_filename(self, desired):
        filename = desired
        counter = 0
        # Loop until we find an unused filename
        while filename in self._filenames:
            filename = f"{counter}_{desired}"
            counter += 1
        self._filenames.add(filename)
        return filename
The create_toc() method recursively builds the TOC tree for ebooklib.
    # Recursively build a TOC tree made of epub objects
    def create_toc(self, parent=None):
        # Get children of `parent` item, or root items if parent is None
        children = [item for item in self._toc if item["parent"] is parent]
        return [(self._chapters[i["filename"]], self.create_toc(i)) for i in children]
Finally, the scrapy engine calls the closed() method once all the content has been scraped. Then I can build the full ebook using the temporary data.
    # Terminate the crawler and create the ebook
    # @see https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.closed
    # @see http://docs.sourcefabric.org/projects/ebooklib/en/latest/tutorial.html
    def closed(self, reason):
        book = epub.EpubBook()
        # Add metadata
        book.set_identifier(self.identifier)
        book.set_title(self.title)
        book.set_language(self.language)
        book.add_metadata("DC", "description", self.description)
        # Add authors
        for author in self.authors:
            book.add_author(author)
        # Add stylesheet
        with open(self.style_path, "rb") as file:
            style = epub.EpubItem(uid="style", file_name="style.css",
                                  media_type="text/css", content=file.read())
        book.add_item(style)
        # Add cover
        with open(self.cover_path, "rb") as file:
            book.set_cover("cover.jpg", file.read())
        # Add chapters
        chapters = [self._chapters[item["filename"]] for item in self._toc]
        for chapter in chapters:
            # Add stylesheet to all chapters
            chapter.add_item(style)
            book.add_item(chapter)
        # Add images
        for image in self._images.values():
            img = epub.EpubImage()
            img.file_name = image["filename"]
            img.content = image["content"]
            book.add_item(img)
        # Add table of contents
        book.toc = self.create_toc()
        # Add NCX
        book.add_item(epub.EpubNcx())
        # Add navigation
        book.add_item(epub.EpubNav())
        # Add spine
        book.spine = ["cover", "nav"] + chapters
        epub.write_epub(f"{self.name}.epub", book)
As I said before, the script requires a stylesheet named style.css in the same folder. I extracted a subset of the CSS declarations from the website and adapted them to my needs.
(The stylesheet itself is a couple dozen short rules; the full file ships with the downloadable scripts at the end of this post.)
I search for a cover.jpg file to put in the folder, and run the script using pipenv:
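pipenv run scrapy runspider rust_book_spider.py

scrapy runspider executes a standalone spider file directly, without needing a full Scrapy project.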
Note that this scraper should work for all the Rust books having the same structure. For example, you can build the French version, Le langage de programmation Rust, by inheriting from the RustBookSpider class. You just need to override the metadata.
class RustBookFrenchSpider(RustBookSpider):
    # Book data
    title = "Le langage de programmation Rust"
    language = "fr"
    identifier = "rust-book-fr"
    description = "La traduction française du livre The Rust Programming Language"
    # Crawler data
    name = "rust_book_fr"
    # URL of the community translation (assumed here; double-check it is still current)
    start_url = "https://jimskapt.github.io/rust-book-fr/"
And voilà! The full scripts can be downloaded here. Note that you need to provide your own cover.jpg file to make them work.