totext.py - Convert URL or RSS feed to text with readability
Published: 18-04-2019 | Author: Remy van Elst
Love plaintext? This script downloads a URL, parses it with readability and returns the plaintext (as markdown). It supports RSS feeds (it will convert every article in the feed) and saves every article.
My use case is twofold: the first is to convert RSS feeds to a Gopher site, the second is to get full text in my RSS reader.
The script contains a few workarounds for so-called cookiewalls. It also pauses between RSS feed articles to avoid making excessive requests.
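To illustrate the pause between feed articles, here is a minimal Python 3 sketch. It is not the code from totext.py: the function name `convert_feed_items` and its parameters are hypothetical, and the list of links is assumed to come from a feed parser (with feedparser, for example, each entry exposes a `link` attribute).

```python
import time


def convert_feed_items(links, convert, sleep_seconds=1.0, sleep=time.sleep):
    """Convert every link in turn, pausing between requests.

    `convert` is any callable that takes a URL and returns text.
    `sleep` is injectable so the pause can be replaced in tests.
    All names here are illustrative assumptions, not totext.py's own.
    """
    results = []
    for index, link in enumerate(links):
        # Pause before every item except the first, so the remote
        # server is not hit with a burst of requests.
        if index > 0:
            sleep(sleep_seconds)
        results.append(convert(link))
    return results
```

Pausing before each item rather than after it avoids a pointless wait once the last article is done.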
The readability part is handled in Python; no external services are used.
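The download, parse and convert steps can be sketched with the libraries this article installs (requests, readability-lxml and html2text). This is a simplified Python 3 illustration, not the script itself: the function names and the 30 second timeout are assumptions, and the cookiewall workarounds are left out.

```python
def html_to_markdown(html):
    """Extract the main article from raw HTML and return it as markdown."""
    # Third-party imports are local so this module can be loaded
    # even where the packages are not installed.
    from readability import Document
    import html2text

    doc = Document(html)
    title = doc.title()
    article_html = doc.summary()  # HTML of the main content only
    body = html2text.html2text(article_html)
    return "# {}\n\n{}".format(title, body)


def url_to_markdown(url, timeout=30):
    """Download a page and convert it (timeout value is an assumption)."""
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return html_to_markdown(response.text)
```

Splitting the HTML conversion from the download keeps the parsing step usable on pages that were already fetched some other way.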
Here's an example of a news article. On the left, the text-only parsed version; on the right, the webpage:
First install the required libraries.
apt-get install python python-pip #python2
pip install html2text requests readability-lxml feedparser
On other distros, use the pip command above.
Clone the repository:
git clone https://github.com/RaymiiOrg/to-text.py
usage: totext.py [-h] -u URL [-s SLEEP] [-r] [-n]

Convert HTML page to text using readability and html2text.

arguments:
  -h, --help            show this help message and exit
  -u URL, --url URL     URL to convert (Required)
  -s SLEEP, --sleep SLEEP
                        Sleep X seconds between URLs (only in rss)
  -r, --rss             URL is RSS feed. Parse every item in feed
  -n, --noprint         Dont print converted contents
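For readers who want to see how such a command line could be declared, here is a hedged Python 3 sketch with argparse that accepts the same options as the usage text. It is a reconstruction from the help output only: the `build_parser` name, the integer type for `--sleep` and its default of 1 second are assumptions, not values taken from the script.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options listed in the usage text."""
    parser = argparse.ArgumentParser(
        prog="totext.py",
        description="Convert HTML page to text using readability and html2text.",
    )
    parser.add_argument("-u", "--url", required=True,
                        help="URL to convert (Required)")
    # Type and default are assumptions; the help text does not state them.
    parser.add_argument("-s", "--sleep", type=int, default=1,
                        help="Sleep X seconds between URLs (only in rss)")
    parser.add_argument("-r", "--rss", action="store_true",
                        help="URL is RSS feed. Parse every item in feed")
    parser.add_argument("-n", "--noprint", action="store_true",
                        help="Dont print converted contents")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```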
If you want to run the script via a cronjob, use the -n option to suppress output.
If the parsing failed, the article will contain the text:
Example invocations:
python totext.py --rss --url https://raymii.org/s/feed.xml
python totext.py --url https://www.rd.nl/vandaag/binnenland/grootste-stijging-verkeersdoden-in-jaren-1.1562067
Every file converted will also be saved to the folder saved/$hostname. The filenames are sorted by date.
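As an illustration of the saved/$hostname layout, the following Python 3 sketch derives an output path from a URL. Only the `saved/<hostname>` folder comes from this article; the `saved_path` helper, the date-prefixed filename and the `.txt` extension are assumptions chosen so that files sort by date.

```python
import os
from datetime import date
from urllib.parse import urlparse


def saved_path(url, title, day, base_dir="saved"):
    """Return a path of the form saved/<hostname>/<date>-<title>.txt.

    The filename scheme is hypothetical; only the per-hostname
    folder is described in the article.
    """
    hostname = urlparse(url).hostname or "unknown-host"
    # Replace anything that is not a letter or digit, so the title
    # is safe to use as a filename.
    safe_title = "".join(c if c.isalnum() else "-" for c in title).strip("-")
    # An ISO date prefix (YYYY-MM-DD) sorts chronologically by name.
    filename = "{}-{}.txt".format(day.isoformat(), safe_title)
    return os.path.join(base_dir, hostname, filename)


if __name__ == "__main__":
    print(saved_path("https://raymii.org/s/feed.xml", "Hello World", date(2019, 4, 18)))
```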
License: GNU GPLv2.
Tags: bash, gopher, logs, monitoring, pygopherd, python, software