Nano Banana's visual interpretation of this workflow.
AI systems generally parse Markdown far more reliably than raw HTML. This guide shows you how to automatically generate clean Markdown versions of your HTML pages every time you push to GitHub, resulting in more accurate AI responses and fewer hallucinations.
No build system, no local scripts, no technical overhead. Set it and forget it.
AI systems struggle with raw HTML because most webpages include layers of markup that have nothing to do with the content: navigation, scripts, styling, and tracking code. AI doesn't need any of that. It needs the headings, paragraphs, lists, and links that carry the actual information.
The solution: automatically generate a clean Markdown mirror of each page.
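To make the idea concrete, here is a minimal sketch of the conversion using BeautifulSoup (the same library the workflow below installs). The page content is invented for illustration; note how the nav link never reaches the Markdown output:

```python
from bs4 import BeautifulSoup

# A hypothetical page: only the <main> elements carry meaning.
html = """
<html><head><title>About Us</title></head>
<body>
  <nav><a href="/">Home</a></nav>
  <main>
    <h2>Our Story</h2>
    <p>We build small, fast websites.</p>
  </main>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
lines = [f"# {soup.title.get_text(strip=True)}", ""]
for tag in soup.main.find_all(["h2", "p"]):
    prefix = "## " if tag.name == "h2" else ""
    lines.extend([prefix + tag.get_text(strip=True), ""])

print("\n".join(lines))
```

The full workflow below handles many more cases (links, lists, bold/italic text), but the shape is the same: walk the content elements, skip the chrome, emit Markdown.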
This script runs automatically on GitHub's servers every time you push HTML changes. It does three things:
1. Converts every HTML page in your repo into a clean Markdown file under ai/
2. Adds a <link rel="alternate"> tag to each HTML file (so bots know there's a Markdown version)
3. Commits the generated files back to your repo
You don't need to manually edit your HTML files, and you don't need to run anything locally. GitHub Actions handles the conversion automatically.
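As a sketch of what the injected tag looks like, here is the same kind of alternate link built with BeautifulSoup (the page title and href are illustrative):

```python
from bs4 import BeautifulSoup

html = "<html><head><title>About</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Build the same alternate link the workflow injects into <head>.
link = soup.new_tag("link", rel="alternate", href="/ai/about.md")
link["type"] = "text/markdown"
soup.head.append(link)

print(soup.head)
```

Crawlers that understand rel="alternate" can follow that href to the Markdown mirror instead of parsing the full HTML page.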
Create this folder if it doesn't already exist:
.github/workflows/
Create a file named:
html-to-md.yml
Common excludes to consider: drafts, archive, vendor, dist, build. The workflow always excludes node_modules and hidden folders.
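The exclusion check inside the workflow skips any path that contains a hidden folder or an excluded folder name. A standalone sketch of that check, with an expanded example exclude set (the folder names are only examples):

```python
from pathlib import Path

# Example exclude set; add your own folder names as needed.
EXCLUDE_FOLDERS = {'node_modules', 'drafts', 'archive', 'vendor', 'dist', 'build'}

def is_excluded(path_str):
    """Mirror the workflow's check: skip hidden folders and excluded names."""
    return any(part.startswith('.') or part in EXCLUDE_FOLDERS
               for part in Path(path_str).parts)

print(is_excluded("drafts/post.html"))    # excluded folder name
print(is_excluded(".github/index.html"))  # hidden folder
print(is_excluded("blog/post.html"))      # kept
```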
Paste the code block below. (IMPORTANT: after pasting, update the line that reads BASE_URL = "https://yourdomain.com" to your own domain.)
name: Generate Markdown from HTML

on:
  push:
    branches:
      - main
    paths:
      - "*.html"
      - "**/*.html"

permissions:
  contents: write

jobs:
  html_to_md:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repo
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: |
          pip install beautifulsoup4 lxml

      - name: Generate Markdown files
        run: |
          mkdir -p ai
          python - << 'PY'
          from bs4 import BeautifulSoup, NavigableString, Tag, Comment
          from pathlib import Path

          BASE_URL = "https://yourdomain.com"  # ← YOU MUST CHANGE THIS
          EXCLUDE_FOLDERS = {'node_modules'}   # ← Folders to skip

          def get_md_path(html_path):
              """Determine the markdown file path for an HTML file."""
              parts = html_path.parts
              if html_path.name == "index.html":
                  if len(parts) == 1:
                      return "ai/index.md"
                  else:
                      return f"ai/{parts[-2]}.md"
              else:
                  stem = html_path.stem
                  if len(parts) > 1:
                      prefix = "-".join(parts[:-1])
                      return f"ai/{prefix}-{stem}.md"
                  return f"ai/{stem}.md"

          def add_link_tag_if_missing(html_path, md_path):
              """Add the alternate link tag to HTML if missing."""
              content = html_path.read_text(encoding="utf-8")
              soup = BeautifulSoup(content, "lxml")
              existing = soup.find("link", {"rel": "alternate", "type": "text/markdown"})
              if existing:
                  return False
              head = soup.find("head")
              if not head:
                  return False
              new_link = soup.new_tag("link")
              new_link["rel"] = "alternate"
              new_link["type"] = "text/markdown"
              new_link["href"] = f"/{md_path}"
              comment = Comment(" Markdown version for AI bots ")
              title = head.find("title")
              if title:
                  title.insert_after("\n ")
                  title.insert_after(new_link)
                  title.insert_after("\n ")
                  title.insert_after(comment)
                  title.insert_after("\n\n ")
              else:
                  head.append("\n ")
                  head.append(comment)
                  head.append("\n ")
                  head.append(new_link)
                  head.append("\n")
              html_path.write_text(str(soup), encoding="utf-8")
              return True

          # Find all HTML files and process them
          files = []
          for html_path in Path(".").rglob("*.html"):
              if any(part.startswith('.') or part in EXCLUDE_FOLDERS for part in html_path.parts):
                  continue
              try:
                  md_path = get_md_path(html_path)
                  add_link_tag_if_missing(html_path, md_path)
                  files.append((str(html_path), md_path))
              except Exception:
                  continue

          def normalize_space(text):
              return " ".join(text.split())

          def inline_to_md(node):
              pieces = []
              for child in getattr(node, "children", []):
                  if isinstance(child, NavigableString):
                      pieces.append(str(child))
                  elif isinstance(child, Tag):
                      name = child.name.lower()
                      if name == "a":
                          text = normalize_space(child.get_text(" ", strip=True))
                          href = child.get("href", "").strip()
                          if not text:
                              continue
                          if not href:
                              pieces.append(text)
                              continue
                          if href.startswith("http") or href.startswith("mailto:") or href.startswith("#"):
                              resolved = href
                          else:
                              resolved = f"{BASE_URL}{href}" if href.startswith("/") else f"{BASE_URL}/{href}"
                          pieces.append(f"[{text}]({resolved})")
                          continue
                      if name in ("strong", "b"):
                          pieces.append(f"**{normalize_space(inline_to_md(child))}**")
                          continue
                      if name in ("em", "i"):
                          pieces.append(f"*{normalize_space(inline_to_md(child))}*")
                          continue
                      pieces.append(inline_to_md(child))
              return "".join(pieces)

          def html_to_markdown(html_path, md_path):
              path = Path(html_path)
              if not path.exists():
                  return
              soup = BeautifulSoup(path.read_text(encoding="utf-8"), "lxml")
              title_tag = soup.find("title")
              title = title_tag.get_text(strip=True) if title_tag else ""
              root = soup.find("main") or soup.body or soup
              allowed = ["h1","h2","h3","h4","h5","h6","p","li"]
              elements = [tag for tag in root.find_all(allowed) if not tag.find_parent("nav")]
              lines = []
              if title:
                  lines.append(f"# {title}")
                  lines.append("")
              for el in elements:
                  name = el.name.lower()
                  text = normalize_space(inline_to_md(el))
                  if not text:
                      continue
                  if name.startswith("h"):
                      lines.append(f"{'#' * int(name[1])} {text}")
                      lines.append("")
                  elif name == "p":
                      lines.append(text)
                      lines.append("")
                  elif name == "li":
                      parent = el.find_parent(["ol", "ul"])
                      if parent and parent.name == "ol":
                          siblings = [s for s in parent.find_all("li", recursive=False)]
                          try:
                              idx = siblings.index(el) + 1
                          except ValueError:
                              idx = 1
                          lines.append(f"{idx}. {text}")
                      else:
                          lines.append(f"- {text}")
                      lines.append("")
              Path(md_path).write_text("\n".join(lines).rstrip() + "\n", encoding="utf-8")

          for src, dst in files:
              html_to_markdown(src, dst)
          PY
      - name: Commit and push changes
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add -A
          if git diff --staged --quiet; then
            echo "No changes to commit."
          else
            git commit -m "Auto-generate Markdown from HTML"
            git push
          fi
Done. Your Markdown files now stay in sync with your HTML automatically.
Any time you push updates to your HTML, the Action regenerates the matching ai/*.md files and commits them back. Your repo will now include:
ai/
  index.md
  about.md
  contact.md
  etc...
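The file names follow the mapping rules inside the workflow's get_md_path function. Here is a standalone sketch of that mapping, using illustrative page paths:

```python
from pathlib import Path

def md_path_for(html_file):
    """Mirror the workflow's naming: index.html maps to its folder name;
    other pages join their folder path and stem with dashes."""
    p = Path(html_file)
    parts = p.parts
    if p.name == "index.html":
        return "ai/index.md" if len(parts) == 1 else f"ai/{parts[-2]}.md"
    if len(parts) > 1:
        prefix = "-".join(parts[:-1])
        return f"ai/{prefix}-{p.stem}.md"
    return f"ai/{p.stem}.md"

print(md_path_for("index.html"))        # ai/index.md
print(md_path_for("about/index.html"))  # ai/about.md
print(md_path_for("blog/post.html"))    # ai/blog-post.md
```

Flattening everything into one ai/ folder keeps the Markdown mirror easy to crawl, at the cost of dash-joined names for nested pages.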
If you want the generated Markdown files and updated HTML on your local machine too, just run git pull after pushing your HTML changes.
This approach works for any static site hosted in any environment, provided the source lives on GitHub.
Burton Rast is a designer, a photographer, and a public speaker who loves to make things.