use beautifulsoup to parse the description from the first paragraph

Signed-off-by: Brian S. Stephan <bss@incorporeal.org>
use beautifulsoup to derive title from HTML h1
10 changed files with 54 additions and 132 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,53 +2,6 @@
 Included is a summary of changes to the project, by version. Details can be found in the commit history.
 ## v2.1.2
 ### Features
 * An optional license declaration can be added to the footer, with a config "LICENSE" directive.
 ### Improvements
 * Style changes in footnotes, hrs, table colors, footnote links, full width figures.
 * Have floats clear their side, to not have a waterfall/ratchet effect when too many floating things are next to each
  other.
 * Add borders to the plain style tables.
 ### Miscellaneous
 * One HTML tweak to make the W3C CSS validator happy.
 * Some old code from the pre-SSG days has been removed.
 ## v2.1.1
 ### Improvements
 * Use the h1-as-name feature from v2.1.0 also to generate the page name in breadcrumbs. This changes the behavior on
  pages with an h1 but no Title: meta tag to have a better name, of course, but also changes the behavior on pages with
  neither a h1 nor a Title: meta tag to have a leading slash (e.g. /page-filename) where there previously was not one
  (e.g. just page-filename). This seems like an acceptable trade-off.
 ### Miscellaneous
 * With the minor breadcrumb change, a method used to finagle the breadcrumb no-name name is no longer necessary.
 ## v2.1.0
 ### Features
 * The page title (also used in the `og:title` header) and the optional description used in the `og:description` header
  can be derived from the contents of the page content, if the markdown meta tags are not supplied. The first `h1` is
  used for the title, and the first `p` is used for the description. This is largely to save some time writing pages
  that one wants to look nice, especially in a social media card, and removes some repetition.
 ### Miscellaneous
 * Requirements bumped, which led to...
 * Python 3.9 has been removed from the supported versions.
 * Added some miscellaneous unit tests and coverage changes to keep us at 95% (which only dropped for a library reason I
  don't understand).
 ## v2.0.5
 ### Features
--- a/README.md
+++ b/README.md
@@ -7,8 +7,8 @@ A lightweight static site generator for Markdown-based sites.
 Something like the following should suffice:
 ```
-% virtualenv --python=python3.10 env-py3.10
+% virtualenv --python=python3.9 env-py3.9
-% source env-py3.10/bin/activate
+% source env-py3.9/bin/activate
 % pip install -U pip
 % pip install incorporeal-cms
 % incorporealcms-build ./path/to/instance ./path/to/output/www/root
--- a/incorporealcms/markdown.py
+++ b/incorporealcms/markdown.py
@@ -84,21 +84,6 @@ def parse_md(path: str, pages_root: str):
    rel_path = os.path.relpath(path, pages_root)
    page_name, page_description = _get_metadata_from_parsed_page(md, content, rel_path)
    page_title = f'{page_name} - {Config.TITLE_SUFFIX}' if page_name else Config.TITLE_SUFFIX
    logger.debug("title (potentially derived): %s", page_title)
    return content, md, page_name, page_title, page_description, mtime
 def _get_metadata_from_parsed_page(md, content, path: str):
    """Get the page name and description from a Markdown object and/or HTML output of a page.
    Args:
        md: the parsed Markdown object, potentially including Meta tags
        content: the Markdown page content converted to HTML, to run through BeautifulSoup
        path: path of the page, to derive the name from as a fallback
    """
    soup = BeautifulSoup(content, features='lxml')
    # get the page title first from the markdown tags, second from the first h1, last from the path
@@ -108,7 +93,7 @@ def _get_metadata_from_parsed_page(md, content, path: str):
    elif h1_tag := soup.find('h1'):
        page_name = h1_tag.string
    elif not page_name:
-        page_name = instance_resource_path_to_request_path(path)
+        page_name = instance_resource_path_to_request_path(rel_path)
    # get the page description from the markdown tags or first paragraph
    page_description = None
@@ -118,7 +103,10 @@ def _get_metadata_from_parsed_page(md, content, path: str):
        if page_description := p_tag.string:
            page_description = page_description.replace('\n', ' ')
-    return page_name, page_description
+    page_title = f'{page_name} - {Config.TITLE_SUFFIX}' if page_name else Config.TITLE_SUFFIX
    logger.debug("title (potentially derived): %s", page_title)
    return content, md, page_name, page_title, page_description, mtime
 def handle_markdown_file_path(path: str, pages_root: str) -> str:
@@ -135,6 +123,11 @@ def handle_markdown_file_path(path: str, pages_root: str) -> str:
    extra_footer = get_meta_str(md, 'footer') if md.Meta.get('footer') else None
    template_name = get_meta_str(md, 'template') if md.Meta.get('template') else 'base.html'
    # check if this has a HTTP redirect
    redirect_url = get_meta_str(md, 'redirect') if md.Meta.get('redirect') else None
    if redirect_url:
        raise NotImplementedError("redirects in markdown are unsupported!")
    template = jinja_env.get_template(template_name)
    return template.render(title=page_title,
                           config=Config,
@@ -182,8 +175,16 @@ def generate_parent_navs(path, pages_root: str):
        try:
            with open(os.path.join(pages_root, path), 'r') as entry_file:
                entry = entry_file.read()
-            content = Markup(md.convert(entry))  # nosec B704
+            _ = Markup(md.convert(entry))       # nosec B704
-            page_name, _ = _get_metadata_from_parsed_page(md, content, os.path.relpath(path, parent_resource_dir))
+            page_name = (" ".join(md.Meta.get('title')) if md.Meta.get('title')
                         else request_path_to_breadcrumb_display(request_path))
            return generate_parent_navs(parent_resource_path, pages_root) + [(page_name, request_path)]
        except FileNotFoundError:
            return generate_parent_navs(parent_resource_path, pages_root) + [(request_path, request_path)]
 def request_path_to_breadcrumb_display(path):
    """Given a request path, e.g. "/foo/bar/baz/", turn it into breadcrumby text "baz"."""
    undired = path.rstrip('/')
    leaf = undired[undired.rfind('/'):]
    return leaf.strip('/')
--- a/incorporealcms/static/css/base.css
+++ b/incorporealcms/static/css/base.css
@@ -116,26 +116,19 @@ img {
    max-width: 75% !important;
 }
 .full-width {
    max-width: 100%;
 }
 .img-center {
    display: block;
    clear: both;
    margin-left: auto;
    margin-right: auto;
 }
 .img-left {
    float: left;
    clear: left;
    margin-right: 1em;
 }
 .img-right {
    float: right;
    clear: right;
    margin-left: 1em;
 }
@@ -151,14 +144,12 @@ figure {
 figure.right {
    float: right;
    clear: right;
    margin-left: 10px;
    display: block;
 }
 figure.left {
    float: left;
    clear: left;
    margin-right: 10px;
    display: block;
 }
@@ -173,19 +164,14 @@ figcaption {
    font-size: 0.9em;
 }
-div.content .footnote {
+.footnote {
    font-size: 0.8em;
 }
-div.content .footnote p {
+.footnote p {
    margin: 0;
 }
 .footnote-ref:link, .footnote-ref:visited, .footnote-ref:hover, .footnote-ref:active {
    font-weight: normal;
 }
 .footnote-ref {
    font-size: 0.75em;
    margin-left: 1px;
 }
--- a/incorporealcms/static/css/dark.css
+++ b/incorporealcms/static/css/dark.css
@@ -14,15 +14,11 @@ body {
    background: #111;
 }
 hr {
    color: #333;
 }
 h1, h2, h3, h4, h5, h6 {
    color: #B31D15;
 }
-p a, ul a, ol a, sup a {
+p a, ul a, ol a {
    color: #DDD;
 }
@@ -30,7 +26,7 @@ footer a {
    color: #999;
 }
-p a:hover, ul a:hover, ol a:hover, footer a:hover, sup a:hover {
+p a:hover, ul a:hover, ol a:hover, footer a:hover {
    color: #B31D15;
 }
@@ -48,7 +44,7 @@ table, th, td {
 }
 th {
-    background: #111;
+    background: #333;
 }
 blockquote {
--- a/incorporealcms/static/css/light.css
+++ b/incorporealcms/static/css/light.css
@@ -14,15 +14,11 @@ body {
    background: #EEE;
 }
 hr {
    color: #CCC;
 }
 h1, h2, h3, h4, h5, h6 {
    color: #811610;
 }
-p a, ul a, ol a, sup a {
+p a, ul a, ol a {
    color: #222;
 }
@@ -30,7 +26,7 @@ footer a {
    color: #999;
 }
-p a:hover, ul a:hover, ol a:hover, footer a:hover, sup a:hover {
+p a:hover, ul a:hover, ol a:hover, footer a:hover {
    color: #811610;
 }
@@ -48,7 +44,7 @@ table, th, td {
 }
 th {
-    background: #EEE;
+    background: #CCC;
 }
 blockquote {
--- a/incorporealcms/static/css/plain.css
+++ b/incorporealcms/static/css/plain.css
@@ -9,10 +9,6 @@ div.header {
    justify-content: space-between;
 }
 table, th, td {
    border: 1px solid;
 }
 .img-25 {
    max-width: 25% !important;
 }
--- a/incorporealcms/templates/base.html
+++ b/incorporealcms/templates/base.html
@@ -18,7 +18,7 @@ SPDX-License-Identifier: GPL-3.0-or-later
 <link rel="icon" href="{{ config.FAVICON }}">
 <link rel="alternate" type="application/atom+xml" href="/feed/atom">
 <link rel="alternate" type="application/rss+xml" href="/feed/rss">
-<script src="/static/js/style_switcher.js"></script>
+<script type="text/javascript" src="/static/js/style_switcher.js"></script>
 <div {% block site_class %}class="site-wrap site-wrap-normal-width"{% endblock %}>
    {% block header %}
@@ -44,11 +44,7 @@ SPDX-License-Identifier: GPL-3.0-or-later
    </div>
    <footer>
        {% if extra_footer %}<div class="extra-footer"><i>{{ extra_footer|safe }}</i></div>{% endif %}
-        <div class="footer">
+        <div class="footer"><i>Last modified: {{ mtime }}</i></div>
            <i>Last modified: {{ mtime }}.<br />
               {% if config.LICENSE %} Available via {{ config.LICENSE|safe }}{% endif %}.
            </i>
        </div>
    </footer>
    {% endblock %}
 </div>
--- a/tests/instance/broken/redirect.md
+++ b/tests/instance/broken/redirect.md
@@ -0,0 +1 @@
 Redirect:           http://www.google.com/
--- a/tests/test_markdown.py
+++ b/tests/test_markdown.py
@@ -10,7 +10,8 @@ import pytest
 from incorporealcms import init_instance
 from incorporealcms.markdown import (generate_parent_navs, handle_markdown_file_path,
-                                     instance_resource_path_to_request_path, parse_md)
+                                     instance_resource_path_to_request_path, parse_md,
                                     request_path_to_breadcrumb_display)
 HERE = os.path.dirname(os.path.abspath(__file__))
 INSTANCE_DIR = os.path.join(HERE, 'instance')
@@ -25,21 +26,14 @@ def test_generate_page_navs_index():
    assert generate_parent_navs('index.md', PAGES_DIR) == [('example.org', '/')]
 def test_generate_page_navs_title_from_h1():
    """Test that the index page has navs to the root (itself)."""
    assert generate_parent_navs('no-title.md', PAGES_DIR) == [('example.org', '/'),
                                                              ('this page doesn\'t have a title!', '/no-title')]
 def test_generate_page_navs_subdir_index():
    """Test that dir pages have navs to the root and themselves."""
-    assert generate_parent_navs('subdir/index.md', PAGES_DIR) == [('example.org', '/'), ('another page', '/subdir/')]
+    assert generate_parent_navs('subdir/index.md', PAGES_DIR) == [('example.org', '/'), ('subdir', '/subdir/')]
 def test_generate_page_navs_subdir_real_page():
    """Test that real pages have navs to the root, their parent, and themselves."""
-    assert generate_parent_navs('subdir/page.md', PAGES_DIR) == [('example.org', '/'),
+    assert generate_parent_navs('subdir/page.md', PAGES_DIR) == [('example.org', '/'), ('subdir', '/subdir/'),
                                                                 ('another page', '/subdir/'),
                                                                 ('Page', '/subdir/page')]
@@ -48,7 +42,7 @@ def test_generate_page_navs_subdir_with_title_parsing_real_page():
    assert generate_parent_navs('subdir-with-title/page.md', PAGES_DIR) == [
        ('example.org', '/'),
        ('SUB!', '/subdir-with-title/'),
-        ('/page', '/subdir-with-title/page')
+        ('page', '/subdir-with-title/page')
    ]
@@ -57,7 +51,7 @@ def test_generate_page_navs_subdir_with_no_index():
    assert generate_parent_navs('no-index-dir/page.md', PAGES_DIR) == [
        ('example.org', '/'),
        ('/no-index-dir/', '/no-index-dir/'),
-        ('/page', '/no-index-dir/page')
+        ('page', '/no-index-dir/page')
    ]
@@ -103,6 +97,12 @@ def test_render_with_default_style_override():
                in handle_markdown_file_path('index.md', PAGES_DIR)
 def test_redirects_error_unsupported():
    """Test that we throw a warning about the barely-used Markdown redirect tag, which we can't support via SSG."""
    with pytest.raises(NotImplementedError):
        handle_markdown_file_path('redirect.md', os.path.join(INSTANCE_DIR, 'broken'))
 def test_instance_resource_path_to_request_path_on_index():
    """Test index.md -> /."""
    assert instance_resource_path_to_request_path('index.md') == '/'
@@ -123,6 +123,15 @@ def test_instance_resource_path_to_request_path_on_subdir_and_page():
    assert instance_resource_path_to_request_path('subdir/page.md') == '/subdir/page'
 def test_request_path_to_breadcrumb_display_patterns():
    """Test various conversions from request path to leaf nodes for display in the breadcrumbs."""
    assert request_path_to_breadcrumb_display('/foo') == 'foo'
    assert request_path_to_breadcrumb_display('/foo/') == 'foo'
    assert request_path_to_breadcrumb_display('/foo/bar') == 'bar'
    assert request_path_to_breadcrumb_display('/foo/bar/') == 'bar'
    assert request_path_to_breadcrumb_display('/') == ''
 def test_parse_md_metadata():
    """Test the direct results of parsing a markdown file."""
    content, md, page_name, page_title, page_desc, mtime = parse_md(
@@ -211,15 +220,3 @@ def test_index_in_source_link_is_stripped():
    assert '<a href=".#anchor">Anchored This Index</a>' in content
    assert '<a href="../">Parent</a>' in content
    assert '<a href="../#anchor">Anchored Parent</a>' in content
 def test_license_link():
    """Test that the config's license HTML is displayed in the footer."""
    with patch('incorporealcms.Config.LICENSE',
               '<a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>', create=True):
        assert 'Available via <a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>.'\
            in handle_markdown_file_path('index.md', PAGES_DIR)
    # default, no config
    assert '<a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>'\
        not in handle_markdown_file_path('index.md', PAGES_DIR)
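The breadcrumb naming in the `generate_parent_navs` hunk above comes down to a two-step fallback: use the `Title:` meta value when the page has one, otherwise show the leaf of the request path. Below is a minimal sketch of that fallback. It assumes the meta mapping holds lists of strings, as `md.Meta` appears to in the diff. `breadcrumb_name` is a hypothetical helper name used only for illustration; `request_path_to_breadcrumb_display` is copied from the diff.

```python
def request_path_to_breadcrumb_display(path):
    """Given a request path, e.g. "/foo/bar/baz/", return the leaf "baz"."""
    undired = path.rstrip('/')
    leaf = undired[undired.rfind('/'):]
    return leaf.strip('/')


def breadcrumb_name(meta, request_path):
    """Hypothetical helper: prefer the Title meta tag, else the path leaf."""
    title = meta.get('title')
    if title:
        # Assumption: meta values are lists of strings, joined with spaces.
        return " ".join(title)
    return request_path_to_breadcrumb_display(request_path)


# A directory index with a Title: tag keeps its title in the breadcrumb.
print(breadcrumb_name({'title': ['SUB!']}, '/subdir-with-title/'))
# A page without a Title: tag falls back to its leaf, with no leading slash.
print(breadcrumb_name({}, '/subdir-with-title/page'))
```

This matches the expectation changes in `tests/test_markdown.py`, where `'/page'` becomes `'page'`.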
Author	SHA1	Message	Date
Brian S. Stephan	229f78a5e2	use beautifulsoup to parse the description from the first paragraph Signed-off-by: Brian S. Stephan <bss@incorporeal.org>	2026-01-28 14:27:21 -06:00
Brian S. Stephan	3159bdc29a	use beautifulsoup to derive title from HTML h1 Signed-off-by: Brian S. Stephan <bss@incorporeal.org>	2026-01-28 14:23:41 -06:00
Brian S. Stephan	12c0b2eae8	requirements bump Signed-off-by: Brian S. Stephan <bss@incorporeal.org>	2026-01-28 13:37:40 -06:00
Brian S. Stephan	ede9056b1e	test the high level SSG build command Signed-off-by: Brian S. Stephan <bss@incorporeal.org>	2026-01-28 13:37:35 -06:00
Brian S. Stephan	4ec63c2f8d	remove python 3.9 from supported versions Signed-off-by: Brian S. Stephan <bss@incorporeal.org>	2026-01-28 13:37:12 -06:00
Brian S. Stephan	9a19a90cfd	Changelog for v2.0.5 Signed-off-by: Brian S. Stephan <bss@incorporeal.org>	2026-01-28 13:37:07 -06:00
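The two most recent commits derive the page title from the first `h1` and the description from the first `p`, with the Markdown meta tags taking precedence. Below is a rough sketch of that precedence, not the project's code. It swaps BeautifulSoup for the standard library's `html.parser` so it runs without third-party packages. `FirstH1AndP` and `derive_metadata` are hypothetical names, and the `description` meta key and the list-of-strings shape of the meta values are assumptions. Unlike BeautifulSoup's `.string`, this version concatenates text from nested inline tags.

```python
from html.parser import HTMLParser


class FirstH1AndP(HTMLParser):
    """Collect the text of the first <h1> and the first <p> in a document."""

    def __init__(self):
        super().__init__()
        self.found = {'h1': None, 'p': None}
        self._open = None  # tag currently being captured, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        # Start capturing only for the first occurrence of each wanted tag.
        if self._open is None and tag in self.found and self.found[tag] is None:
            self._open = tag
            self._buf = []

    def handle_data(self, data):
        if self._open is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            self.found[tag] = ''.join(self._buf)
            self._open = None


def derive_metadata(meta, html, fallback_name):
    """Return (page_name, page_description) using meta tags, then HTML, then the fallback."""
    parser = FirstH1AndP()
    parser.feed(html)
    parser.close()

    # Title: meta tag first, then the first h1, then the path-derived fallback.
    if meta.get('title'):
        page_name = ' '.join(meta['title'])
    elif parser.found['h1']:
        page_name = parser.found['h1']
    else:
        page_name = fallback_name

    # Description: meta tag first, then the first paragraph with newlines flattened.
    if meta.get('description'):
        page_description = ' '.join(meta['description'])
    elif parser.found['p']:
        page_description = parser.found['p'].replace('\n', ' ')
    else:
        page_description = None

    return page_name, page_description


html = '<h1>Hello</h1>\n<p>first\nparagraph</p>\n<p>second</p>'
print(derive_metadata({}, html, '/page'))
```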