Generating org-mode Outlines for wikiHow Articles

Recently I found some great articles on wikiHow, and I wanted to keep notes on them in org-mode files.

At first, I copied each article's ToC (table of contents) by hand, but that quickly became tedious and time-consuming. Today I wrote a requests-based Python script that extracts the ToC into an org-mode outline. It takes two arguments: the first is the article's URL, and the second is the level of the org-mode heading that will contain the generated outline.

For example, run C-u M-! python 1 from within an org-mode file under a heading, say * Test. With the C-u prefix, M-! inserts the command's output at point, so the outline ends up under the * Test heading.
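
To make the effect of the second argument concrete, here is a minimal sketch of how the heading stars are computed. The helper name org_heading and the sample titles are hypothetical, for illustration only; the script itself builds the same prefixes inline.

```python
def org_heading(containing_level, depth, text):
    """Return an org heading nested `depth` levels below the containing heading.

    Hypothetical helper: depth 1 is the article title, 2 a section,
    3 a subsection, mirroring the prefixes the script prints.
    """
    return "{}{} {}".format("*" * containing_level, "*" * depth, text)


# Run under "* Test" (containing level 1); the titles are made up.
lines = [
    org_heading(1, 1, "Article title"),
    org_heading(1, 2, "Section"),
    org_heading(1, 3, "Subsection"),
]
print("\n".join(lines))
```

With a containing level of 1, the article title becomes a second-level heading, sections third-level, and subsections fourth-level, so the whole outline nests under * Test.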

Here is the source code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import requests
import sys

def gen_wikihow_org_outline(url, containing_heading_level=0):
    '''Generate an org outline for the specified wikiHow article at URL.'''
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    # Send a browser-like User-Agent header with the request.
    headers = {"User-Agent": user_agent}
    html = requests.get(url, headers=headers).content

    soup = BeautifulSoup(html, "html.parser")
    level_prefix = '*' * containing_heading_level

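    # The article title is the link inside the page's <h1>.
    title = soup.select("h1 > a")[0].get_text()
    print("{}* [[{}][{}]]".format(level_prefix, url, title))

    # Each section title is a <span> directly inside an <h3>.
    sections = soup.select("h3 > span")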
    for section in sections:
        print("{}** {}".format(level_prefix, section.get_text()))
        subsections = section.parent.parent.find_all("b", class_="whb")
        for subsection in subsections:
            subsection_title = subsection.get_text()
            if len(subsection_title) < 3:
                continue # it may contain only ".", just work around it.
            print("{}*** {}".format(level_prefix, subsection_title))


if len(sys.argv) < 2:
    print("Usage: {} <url> [containing heading level]".format(sys.argv[0]))
    sys.exit(1)

url = sys.argv[1]
# Default to a top-level outline when no heading level is given.
containing_heading_level = 0
if len(sys.argv) >= 3:
    containing_heading_level = int(sys.argv[2])

gen_wikihow_org_outline(url, containing_heading_level)
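
As an alternative to indexing sys.argv by hand, the same two arguments could be parsed with argparse from the standard library. This is only a sketch of that option, not what the script above does; the function name parse_args and the example URL are hypothetical.

```python
import argparse


def parse_args(argv):
    """Parse <url> and an optional containing heading level (default 0)."""
    parser = argparse.ArgumentParser(
        description="Generate an org-mode outline for a wikiHow article.")
    parser.add_argument("url", help="URL of the wikiHow article")
    parser.add_argument("containing_heading_level", nargs="?", type=int,
                        default=0,
                        help="level of the org heading the outline goes under")
    return parser.parse_args(argv)


# Hypothetical URL, used only to show the parsed values.
args = parse_args(["https://www.wikihow.com/Example-Article", "1"])
print(args.url, args.containing_heading_level)
```

argparse prints a usage message and exits on its own when the URL is missing, and it rejects a heading level that is not an integer.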
