Archived May, 2026.

A basic Discourse archival tool

mcmcclur

Archival tool updated with Codex May 2026

It seems that it’s pretty tricky to save an entire discourse site to a static version. According to this post by Jeff Atwood, it’s “much harder than you’d think”. It doesn’t appear that this is a priority for the Discourse team, either, which is perfectly understandable.

For my purposes, though, I found that I really needed some way to generate basic, static HTML versions of my Discourse sites. I’ve been using Discourse for a couple of years now as a discussion board when teaching my college math classes so, every few months, I retire one or two sites and start one or two more. Obviously, the discussions on the retiring sites have value so I really needed some way to save them. Ultimately, I figured I’d build my own tool.

The basic idea is simple: Use the Discourse API to crawl the site, grab the cooked version of each post, and massage that into HTML. The tool focuses largely on my own needs as a college math professor who uses small Discourse forums to support my math classes. As such, mathematical content, like f(x)=e^{-x^2}, should be automatically typeset with MathJax V4 and fenced code blocks tagged as sage are translated to active Sage Cells.
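
The "grab the cooked version and massage it into HTML" step can be sketched roughly as follows. This is only an illustration, not the archival tool's actual code: the function name render_topic and the page layout are made up, and it assumes a topic's JSON has the shape Discourse returns from /t/<id>.json (a title plus post_stream.posts entries carrying username and cooked).

```python
import html


def render_topic(topic: dict) -> str:
    """Build one static HTML page from a topic's JSON (hypothetical helper)."""
    title = html.escape(topic["title"])
    parts = [
        "<!DOCTYPE html>",
        "<html><head><meta charset='utf-8'>",
        f"<title>{title}</title></head><body>",
        f"<h1>{title}</h1>",
    ]
    for post in topic["post_stream"]["posts"]:
        user = html.escape(post["username"])
        # "cooked" is already HTML rendered by Discourse, so it is inserted as-is.
        parts.append(f"<div class='post'><h3>{user}</h3>{post['cooked']}</div>")
    parts.append("</body></html>")
    return "\n".join(parts)


# Offline sample standing in for a real API response.
sample = {
    "title": "Week 1 & 2 questions",
    "post_stream": {
        "posts": [
            {"username": "alice", "cooked": "<p>First post</p>"},
            {"username": "bob", "cooked": "<p>A reply</p>"},
        ]
    },
}
page = render_topic(sample)
```

A real run would fetch each topic's JSON over HTTP and write the returned string to a file; MathJax and Sage Cell support would be extra script tags in the page head.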

If interested, you can view

Note

The update of the archival tool was performed largely with Codex.

codinghorror

We’re definitely interested in this, because sometimes you want to turn off all the fancy hosting and databases and render out a set of static HTML pages for permanent long term archiving with zero security risk.

With the meta topic, others can follow along and edit / contribute as needed.

Falco

You can also use our basic HTML version for archiving: this topic in HTML.

You can get this version using a crawler user agent.

Maybe this + recursive wget or similar can help you.

mcmcclur

Yes, those links are gone, but it’s all summarized on this new page. Also, the output of the code as applied to this DiscourseMeta is now here. I even put it up on GitHub so maybe someone will get interested.

I’d like to edit the original post, but I seem to be past the edit window.

Incidentally, I do think that httrack works much better than I originally thought but I still strongly prefer my version for two main reasons:

  • My code explicitly supports MathJax, which is essential for my work.
    (I’ll probably need to update my code to work with the new MathPlugin sometime)
  • I’ve got much more control over what gets downloaded and how it’s displayed. For example, I don’t like the way that httrack output points to user links, even if not downloaded.

Silvanus

I’m hosting a forum that is currently, in its third iteration, running Discourse. Our last two forums were (I think) phpbb2 or something like that. I have resolved to archive them using Discourse, so that:

  1. I scan the phpbb2 database into Discourse (there’s a migration tool)
  2. I create a static HTML archive using Discourse.
  3. I put up the static HTML archive into public use (preferably in the same place where our dynamic forum running Discourse is).

According to the first message

There are no user pages or category pages

Could it be somehow advanced so that creating category views would be also possible?

Also, any help on how to use the Jupyter notebook thing? First time I hear of this…

mcmcclur

@Silvanus Can you indicate a live discourse site that you want to archive? I’d be glad to try it out.

Also, have you tried httrack? I think that a command as simple as httrack yoursiteurl might work quite well.

Silvanus

I’m still in phase 1 (phpbb2 > phpbb3 > discourse) of my archival, so no site yet. After I’ve managed the phpbb conversion, I’ll get back to this. It feels very, very hard. I’ve been trying to install phpbb3 for a while now, but I get some weird problems all the time. :frowning:

I’ll have to try that httrack, thanks.

mcmcclur

@Silvanus Well, I noticed that you point to the forum at https://uskojarukous.fi/ on your Profile page; I went ahead and created a couple of archives of that. You can (temporarily) take a look at the results here:

Here are a few comments:

  • I definitely like my version better; no surprise there because I designed it the way I want it to look.
  • The front page of the httrack version doesn’t look so great simply because that’s what the escaped fragment version looks like.
  • I think it might make sense to start httrack at a subpage to generate something like this.
  • It wouldn’t be too hard to make my archival tool grab the categories; I might do that for the next iteration.
  • My code adds MathJax to every page because my forums are mathematical. I should probably try to detect if MathJax is necessary. I’m guessing your forum doesn’t require it.

The httrack command

The httrack version was generated with a command that looks like so:

httrack https://uskojarukous.fi -https://uskojarukous.fi/users* -*.rss -O uskojarukous_arxiv -x -o -M10000000 --user-agent "Googlebot"

  • The -https://uskojarukous.fi/users* -*.rss prevents httrack from downloading files matching those patterns.
  • The -x -o combo replaces both external links and errors with a local file indicating the error. So, for example, we don’t link to user profiles on the original that weren’t downloaded locally.
  • The -M10000000 restricts the total amount downloaded to 10MB. There appears to be some post processing and downloading of supplemental files that makes the total larger than this anyway.
  • The --user-agent "Googlebot" should not be necessary if the forum is powered by a recent version of Discourse.

The archival tool code

For the most part, the archival tool should run with minimal changes. I run it within a Jupyter notebook but the exact same code could be run from a Python script with the appropriate libraries installed. Of course, you need to tell it what forum you want to download. The first few lines of my first input look like so:

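import os
from datetime import date

# Forum to archive and the local directory the archive is written to
base_url = 'https://uskojarukous.fi/'
path = os.path.join(os.getcwd(), 'uskojarukous')
archive_blurb = "A partial archive of uskojarukous.fi as of " + \
  date.today().strftime("%A %B %d, %Y") + '.'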

Later, in input 6, I define max_more_topics = 2. Essentially, that defines a bound on k in this code here:

'/latest.json?page=k'
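
Read that way, the listing pages the crawler visits could be enumerated as in the sketch below. It assumes page numbering starts at 0 and that max_more_topics is an inclusive upper bound on k; the helper name latest_page_urls is made up for illustration and is not taken from the notebook.

```python
def latest_page_urls(base_url: str, max_more_topics: int) -> list:
    """List the /latest.json pages to crawl, for k = 0 .. max_more_topics."""
    root = base_url.rstrip("/")  # avoid a double slash when joining
    return [f"{root}/latest.json?page={k}" for k in range(max_more_topics + 1)]


urls = latest_page_urls("https://uskojarukous.fi/", 2)
```

Raising max_more_topics makes the crawl reach further back into older topics, at the cost of more requests.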

But again, there should be some changes made to the code to get it to work for non-mathematical forums.

Silvanus

Very cool, thank you for all the clarifications. Just a quick note, it seems that your tool can’t handle sub-categories (which is why many of the messages seem to be without a category).

mcmcclur

@Silvanus Yes, I think you’re absolutely right about the sub-category thing. Thanks - I had wondered about that.

Silvanus

@mcmcclur: as you already realized, I’m the admin of said forum, which is the third of our forums. When we did technological jumps, we didn’t migrate, but started from scratch, and the older forum was archived. The last two forums are in SMF format - but I finally managed to start converting them into Discourse format! :slight_smile:

So, our forum had a public area and a closed area. I’m thinking that the closed area (a few categories) should be archived, but closed off via a password gate. I noticed that the static paths are something like /t/TITLE/MESSAGEID/. This, of course, lends itself to thread-by-thread gating, but is slightly cumbersome - but, heh, I guess that’s what you get when archiving huge loads of stuff from a dynamic forum to a static archive… :slight_smile:

Antroden

Just a few tidbits for anyone else looking for some httrack tips (which works great for my purposes).

  • A complete list of command line flags: HTTrack Website Copier - Offline Browser
  • Using the -s0 flag ignores the robots.txt (if you have a non-spider-able account)
  • If your site is behind a login, you can download a .txt file of the cookie (once logged in) using a chrome extension like cookies.txt and place that in the directory you’re running httrack from.

I’m using httrack via cron to create an offline archive of our Discourse site. However, the user that is logging in under httrack gets marked as a “view” for each topic, giving super-inflated numbers of views for each topic (the cron runs every hour).

Is there a way to exclude a certain user from being recorded in the statistics / view stats for the site as a whole?

codinghorror

Good point, where would this be intercepted @sam?

sam

We have this method for tracking page views:

We have additional methods for tracking user visits which would be even harder to override.

We only store one page view per day per user, but I get that it can add up.

Hacking this out so certain users are not tracked would either require a plugin or some sort of daily query that nukes all the views by the user and remembers to also reduce views count from the topics table.

kamcc

Hi all – just jumping in here to say that @mcmcclur’s code was exactly what I was looking for! So thank you very much for sharing :slight_smile:

I made a few small modifications (mainly additional code that makes sure to grab all posts in a topic, not just the first twenty) and the code is here: GitHub - kitsandkats/ArchiveDiscourse: Code for archiving a Discourse site into static HTML., forked from @mcmcclur’s original repo and stored as a python file instead of a Jupyter notebook.
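
For readers wondering what "grab all posts in a topic" involves, here is one possible approach, not necessarily how the fork does it. It assumes the topic JSON lists every post id under post_stream.stream while post_stream.posts holds only the first batch, and that the remaining posts can be requested from /t/<id>/posts.json with post_ids[] parameters. The names all_posts and fetch_json are hypothetical; fetch_json stands for whatever function turns a path into parsed JSON.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def all_posts(topic_id, fetch_json, chunk_size=20):
    """Collect every post of a topic, not just the first batch."""
    stream = fetch_json(f"/t/{topic_id}.json")["post_stream"]
    posts = list(stream["posts"])
    seen = {p["id"] for p in posts}
    missing = [pid for pid in stream["stream"] if pid not in seen]
    # Request the remaining posts in chunks to keep URLs short.
    for i in range(0, len(missing), chunk_size):
        chunk = missing[i:i + chunk_size]
        query = urlencode([("post_ids[]", pid) for pid in chunk])
        more = fetch_json(f"/t/{topic_id}/posts.json?{query}")
        posts.extend(more["post_stream"]["posts"])
    return posts


def fake_fetch(path):
    """Offline stand-in for an HTTP call: a 45-post topic with id 7."""
    if path == "/t/7.json":
        return {"post_stream": {"stream": list(range(1, 46)),
                                "posts": [{"id": i} for i in range(1, 21)]}}
    ids = parse_qs(urlparse(path).query)["post_ids[]"]
    return {"post_stream": {"posts": [{"id": int(i)} for i in ids]}}


collected = all_posts(7, fake_fetch)
```

Passing the fetch function in as an argument keeps the paging logic testable without touching a live forum.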

I’m very happy with how it turned out. Thanks again!

johnnyboi5858

Hi, I just read through this whole thread and wanted to check whether this tool works if the Discourse forum is behind a login and password. How would I edit the code so that it allows me to archive the site?

mcmcclur

As it is currently written, the code is not designed to access any material that requires a login. It should be pretty easy to set that up, though. The code interacts with the Discourse site via the Python Requests library which does offer authentication. It’s feasible that adding an auth=('user', 'pass') to the code at the appropriate points is all that’s required. I’m not currently running a Discourse site so I can’t test that at the moment.
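
As an untested illustration of what such credentials amount to on the wire: passing auth=('user', 'pass') to Requests sends an HTTP Basic Authorization header, and the sketch below builds the same header with only the standard library so that it runs without extra packages. The helper name authed_request and the example URL are made up. Whether a given forum accepts Basic credentials depends on how it is hosted; Discourse's own API documentation describes Api-Key and Api-Username request headers instead, so this may well need adapting.

```python
import base64
import urllib.request


def authed_request(url, user, password):
    """Build a request carrying HTTP Basic credentials (hypothetical helper)."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(url)
    req.add_header("Authorization", f"Basic {token}")
    return req


# Nothing is sent here; the request object is only constructed.
req = authed_request("https://forum.example.org/latest.json", "user", "pass")
```

The request would then be sent with urllib.request.urlopen(req), ideally with the real credentials read from the environment rather than written into the script.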

adrelanos

httrack does not work for me. Using:

httrack https://my-forums.org --user-agent "Googlebot"

httrack is quite promising, but long forum threads with multiple pages are incomplete. Once I click on “page 2” it does not work. I.e.

  • file:///home/user/My%20Web%20Sites/my-forums/my-forum.org/t/forum-thread-title/83394658.html looks really good (does not fetch from external resources), but
  • file:///home/user/My%20Web%20Sites/my-forums/my-forum.org/t/forum-thread-title/83394658.html?page=2 is broken.

Any suggestions?

Perhaps httrack can be told somehow to “use print mode”?

Perhaps httrack can be told to “append /print at the end”?

Is there a user agent setting which shows the whole forum thread on a single page? If not, could you please add this feature? You already implemented print mode, so most of it is already there. What’s left is a user agent that results in the contents generated for “print mode” being served to the crawler. Alternatively, if you don’t like the idea of a custom user agent for this purpose, what about an HTTP header or cookie that could be used for this purpose?


ArchiveDiscourse improved/forked by @kitsandkats is also broken for me.


Could you please consider also implementing /print for front page / category pages?


Quote myself in https://meta.discourse.org/t/i-dont-like-infinite-scrolling-and-want-to-disable-it/104660/3

(Temporarily) disabling infinite scroll (for some user agents) would make it possible to archive discourse with the httrack web archive tool.

saper

Python requests will automatically use .netrc for authentication if required (but it needs to get 401 HTTP response).

brechtm

I’ve gotten good results with wget, including authentication. Described here:

https://meta.discourse.org/t/archive-an-old-forum-in-place-to-start-a-new-discourse-forum/13433/14

kyle315a

A Discourse forum that I use is being taken offline in a couple weeks, so I set out to archive the site. I did a lot of research, trial and error, and I found a simple solution with HTTrack. Here’s everything I learned.

Archive a Discourse site with HTTrack
For Windows users, the best solution appears to be HTTrack. This worked great and it archived the site to HTML files. All categories, threads, and posts were archived including all pages with relative navigation links.

A basic tutorial on HTTrack is here. I left the settings on default with the following custom settings.

  • Web Addresses:
    • https://forums.gearboxsoftware.com/c/homeworld/
    • https://forums.gearboxsoftware.com/c/homeworld-dok/
  • Scan Rules:
    • -gearboxsoftware.com/* -forums.gearboxsoftware.com/* +forums.gearboxsoftware.com/c/homeworld/* +forums.gearboxsoftware.com/c/homeworld-dok/* +forums.gearboxsoftware.com/t/* +forums.gearboxsoftware.com/user_avatar/* +sea2.discourse-cdn.com/*
  • Browser ID (aka User Agent):
    • Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Note: There’s a CSS issue preventing category links from working, however that can easily be fixed as described below.

CSS Issue
When viewing Category pages as googlebot, the thread links don’t work. An example is [here](https://web.archive.org/web/20220731051419/https://forums.gearboxsoftware.com/c/homeworld/57).

This makes navigation impossible on category pages in HTTrack, archive.org and Google cache. This appears to be a Discourse issue in a CSS file. To fix the links, simply block/delete the following CSS file:

  • stylesheets/desktop_theme_10_1965d1d398092f2d9f956b36e08b127e00f53b70.css?__ws=forums.gearboxsoftware.com

@codinghorror - Can you guys address this?

Challenges
I ran into the following challenges and eventually overcame them after much trial and error.

  • Discourse pages are dynamically generated with JavaScript. This makes for poor results with most archive/crawler tools.
  • Most threads only load with the first ~20? posts, the rest of the posts don’t appear until you scroll down. Pressing Ctrl+P loads a /print page with all posts visible. Users are limited to printing 5 pages an hour with print mode, but this limit can be increased by a Discourse site admin.
  • Adrelanos noted that multi-page threads weren’t being archived properly by HTTrack, however I suspect this issue was due to his HTTrack settings, as I did not have this issue.
  • Saving a page to PDF won’t include any collapsed details sections.
  • Pages can be loaded in basic HTML by adding ?_escaped_fragment_ to the end of a URL, but this trick only works for threads not categories.

The above challenges aren’t a concern once you learn that all Discourse pages/content can be rendered properly as HTML for crawlers. To do this, you must change your crawler / browser’s user agent to googlebot to get the HTML version of pages.
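
For anyone scripting this rather than using a GUI tool, requesting a page with that user agent might look like the sketch below. It uses only the Python standard library; the function names are made up for illustration, and how a particular forum responds to crawler user agents may vary with its Discourse version and settings.

```python
import urllib.request

# User agent string quoted in this post.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def crawler_request(url, user_agent=GOOGLEBOT_UA):
    """Build a request that identifies itself with a crawler user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_crawler_html(url):
    """Download the page as served to crawlers (performs a network call)."""
    with urllib.request.urlopen(crawler_request(url)) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Only the request object is built here; nothing is downloaded.
req = crawler_request("https://forums.example.org/t/some-topic/123")
```

Calling fetch_crawler_html on a topic URL should then return the plain HTML rendering, which can be saved to disk as-is.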

Archive.org
If you use the “Save Page Now” feature on web.archive.org, it will archive the javascript version of Discourse with poor results. Archive.org uses the user agent of the person requesting the archive. So you must change your user agent to googlebot. You can get a Chrome extension called “User-Agent Switcher for Chrome”. In the options add:

  • Name: Googlebot
  • String: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  • Group: Chrome
  • Indicator Flag: 1

Alternative Archive Tools
Many tools are listed here: Archive an old forum "in place" to start a new Discourse forum
I also briefly tested GUI tools like Cyotek WebCopy, A1 Website Download, and WAIL.
Command line tools include mcmcclur’s tool and wget. A tutorial on wget is [here](https://letswp.justifiedgrid.com/download-discourse-forum-wget/).
However for Windows users, the best solution appears to be HTTrack.

Note: Since I’m a new user, I’m limited to two links in a post. Hence I turned some links into preformatted text.

kyle315a

I’ve now identified the root cause. Turned out the background image is conflicting with the links!

Within this file:
stylesheets/desktop_theme_10_1965d1d398092f2d9f956b36e08b127e00f53b70.css

Within this code:

body:before {
    backface-visibility: hidden;
    -webkit-backface-visibility: hidden;
    content: "";
    display: block;
    background-color: #000000;
    background-image: url("data:image/svg+xml,%3Csvg width='6' height='6' viewBox='0 0 6 6' xmlns='http://www.w3.org/2000/svg'%3E%3Cg fill='%23adadad' fill-opacity='0.4' fill-rule='evenodd'%3E%3Cpath d='M5 0h1L0 6V5zM6 5v1H5z'/%3E%3C/g%3E%3C/svg%3E");
    position: fixed;
    height: 100vh;
    width: 100vw;
    top: 0;
    left: 0;
    z-index: 0;
    opacity: 0.03;
    background-size: 70%;
}

CSS Issue Fix
Move the background image down a layer to fix the links.

  • Open stylesheets/desktop_theme_10_1965d1d398092f2d9f956b36e08b127e00f53b70.css and replace all three “z-index:-1;” with “z-index:-2;”. Then replace “z-index:0;” with “z-index:-1;”.
  • Then open desktop_32713c1b6551369eb391868f3d4e3f2ac9c38cf1.css and simply replace all three “z-index:-1;” with “z-index:-2;”. The links will now work.
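
The two replacements can also be scripted, which helps if the archive is regenerated often. The sketch below is one way to do it, with a made-up function name. It matches the exact strings quoted above (no space after the colon), so a stylesheet written as "z-index: 0;" would need the patterns adjusted.

```python
def lower_background_layer(css: str) -> str:
    """Apply the z-index replacements described above to a stylesheet's text."""
    # Order matters: push the existing -1 layers down first, so that the
    # layer moved from 0 to -1 is not pushed down to -2 as well.
    css = css.replace("z-index:-1;", "z-index:-2;")
    return css.replace("z-index:0;", "z-index:-1;")


sample_css = "a{z-index:-1;}body:before{z-index:0;}"
fixed_css = lower_background_layer(sample_css)
```

For the second stylesheet, which only needs the first replacement, the same function is harmless as long as that file contains no "z-index:0;" rule.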

awesomerobot

Thanks for letting us know. Really, since this is a crawler/archive view, these images shouldn’t be displayed anyway, so I’ve opened a PR to remove them.

jamesob

For what it’s worth, I’ve written a minimum viable Python script that performs simple backup of post content using the API: GitHub - jamesob/discourse-archive: Provides a simple archive of Discourse content

It’s pretty barebones, but should give someone a rough idea of how to generate a suitable-for-public archive.

Diego_Rivera_Buendia

Just came up with a way to scrape all the content of a discourse site in a modular way (choose which categories/topics/limits to scrape). Have a look. So far it is just a tool to get a local/static version of a site: GitHub - Diegorb1329/broad_listening_eth