Recently I found some great articles on wikiHow and wanted to keep notes on them in org-mode files.
At first I copied each article's table of contents (ToC) by hand, but that soon proved tedious and time-consuming. Today I wrote a Python script, based on requests and BeautifulSoup, that extracts the ToC into an org-mode outline. It takes two arguments: the first is the article's URL, and the second (optional) is the level of the heading that will contain the generated outline.
For example, run C-u M-! python wikihow-org-outline.py https://www.wikihow.com/Improve-Your-English 1 from within an org-mode file under a heading, say * Test, and the outline will be inserted under the * Test heading.
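To make the effect of the second argument concrete, here is a minimal sketch of how the heading level turns into org-mode stars. It runs without any network access; the function name format_outline and the article data are hypothetical and exist only for illustration, though the format strings mirror the ones used in the script.

```python
def format_outline(title, url, sections, containing_heading_level=0):
    """Return org-mode outline lines nested under a heading of the given level."""
    # One extra star per level of the containing heading.
    prefix = "*" * containing_heading_level
    lines = ["{}* [[{}][{}]]".format(prefix, url, title)]
    for section_title, subsection_titles in sections:
        lines.append("{}** {}".format(prefix, section_title))
        for subsection_title in subsection_titles:
            lines.append("{}*** {}".format(prefix, subsection_title))
    return lines


# Hypothetical article data, for illustration only.
example = format_outline(
    "Example Article",
    "https://www.example.com/Example-Article",
    [("First Section", ["Step one.", "Step two."])],
    containing_heading_level=1,
)
print("\n".join(example))
```

With a containing level of 1, the article title becomes a level-2 heading, so it sits directly under a level-1 heading such as * Test.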
Here is the source code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import sys


def gen_wikihow_org_outline(url, containing_heading_level=0):
    '''Generate an org outline for the wikiHow article at URL.'''
    # Send a browser-like User-Agent header with the request.
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    headers = {"User-Agent": user_agent}
    html = requests.get(url, headers=headers).content
    soup = BeautifulSoup(html, "html.parser")
    # Every generated heading is nested under the containing heading.
    level_prefix = '*' * containing_heading_level
    # The article title becomes a link back to the article.
    title = soup.select("h1 > a")[0].get_text()
    print("{}* [[{}][{}]]".format(level_prefix, url, title))
    sections = soup.select("h3 > span")
    for section in sections:
        print("{}** {}".format(level_prefix, section.get_text()))
        # Subsection titles are the bold "whb" elements in the same section.
        subsections = section.parent.parent.find_all("b", class_="whb")
        for subsection in subsections:
            subsection_title = subsection.get_text()
            if len(subsection_title) < 3:
                continue  # it may contain only ".", just work around it.
            print("{}*** {}".format(level_prefix, subsection_title))


containing_heading_level = 0
if len(sys.argv) < 2:
    print("Usage: {} <url> [containing heading level]".format(sys.argv[0]))
    sys.exit(1)
url = sys.argv[1]
# The heading level is optional and defaults to 0.
if len(sys.argv) >= 3:
    containing_heading_level = int(sys.argv[2])
gen_wikihow_org_outline(url, containing_heading_level)
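The manual sys.argv handling at the end of the script could also be written with the standard argparse module, which generates the usage message and validates the integer for you. This is only a sketch of an alternative, not what the script does; the helper name parse_args is hypothetical.

```python
import argparse


def parse_args(argv):
    """Parse <url> and an optional containing heading level (default 0)."""
    parser = argparse.ArgumentParser(
        description="Print an org-mode outline of a wikiHow article.")
    parser.add_argument("url", help="URL of the wikiHow article")
    parser.add_argument(
        "containing_heading_level", nargs="?", type=int, default=0,
        help="level of the org heading that will contain the outline")
    return parser.parse_args(argv)


# Same arguments as in the example invocation earlier in the post.
args = parse_args(["https://www.wikihow.com/Improve-Your-English", "1"])
print(args.url, args.containing_heading_level)
```

In the script itself you would call parse_args(sys.argv[1:]) and pass args.url and args.containing_heading_level to gen_wikihow_org_outline.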