Funny timing. Just yesterday I was looking for an easy Windows tool to do a simple stress test on a website (legally, ofc). A requirement of mine was to just give it the root URL and have the tool discover the rest automatically (staying on the same domain). Parameters like parallelism also had to be easily manageable.

After trying some crawlers/copiers and other tools I went back to a simple one I already knew from saving static copies of websites in the past: HTTrack. It fit the bill perfectly!

You can give it the root URL, set it to "scan only" (so it doesn't download everything) and tweak settings like connections and speed (and even change some settings mid-run, save settings, pause, ...). So thanks xroche for HTTrack! :)
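For reference, the command-line version can do roughly the same thing as the GUI. Going from memory of the httrack manual (so double-check the exact flags), a scan-only crawl of one domain with limited parallelism looks something like:

    httrack "https://example.com/" -O ./scan -p0 -c8 -A100000 "+*.example.com/*"

where -p0 should be the "just scan, don't save files" priority mode, -c8 allows eight simultaneous connections, -A caps the transfer rate in bytes per second, and the + filter keeps the crawl on the domain (example.com is a placeholder). For a stress test you'd raise -c and drop the -A cap.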
I’ve used it a few times to “secure” an old but still relevant dynamic website. Think of a site for a mature project that shouldn’t disappear from the internet, but where it’s not worth upgrading 5-year-old code that won’t pass our “cyber security audit” due to unsupported versions of PHP or Rails, so we just convert it to a static site and delete the database. Everything pretty much works fine on the front end, and the CMS functionality is no longer needed. It’s great for that niche use case.
You could also do that with plain wget.
wget won't convert hrefs / src to a correct relative location.
Doesn't it do that with a flag, or am I just reading this wrong? From the man page:

-k
--convert-links

After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.

Each link will be changed in one of the two ways:

The links to files that have been downloaded by Wget will be changed to refer to the file they point to as a relative link.

Example: if the downloaded file /foo/doc.html links to /bar/img.gif, also downloaded, then the link in doc.html will be modified to point to ‘../bar/img.gif’. This kind of transformation works reliably for arbitrary combinations of directories.

The links to files that have not been downloaded by Wget will be changed to include host name and absolute path of the location they point to.

Example: if the downloaded file /foo/doc.html links to /bar/img.gif (or to ../bar/img.gif), then the link in doc.html will be modified to point to http://hostname/bar/img.gif.

Because of this, local browsing works reliably: if a linked file was downloaded, the link will refer to its local name; if it was not downloaded, the link will refer to its full Internet address rather than presenting a broken link. The fact that the former links are converted to relative links ensures that you can move the downloaded hierarchy to another directory.

Note that only at the end of the download can Wget know which links have been downloaded. Because of that, the work done by ‘-k’ will be performed at the end of all the downloads.
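In practice, the combination that gives a cleanly browsable local copy is something like this (all standard wget flags; example.com is a placeholder):

    wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example.com/

--page-requisites also pulls in the CSS/images each page needs, and --adjust-extension appends .html to pages served without an extension so they open properly from disk.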
Oh interesting, I hadn't realized that option existed.
Not sure of the context for why this is on HN, but it surely put a smile on my face. I used to use it during the 56K era, when I'd just download everything and read it offline. Basically using it as RSS before RSS was a thing.
Interestingly, the most recent commit and release on their GitHub are from March 11, 2025, so it's clearly still maintained.
I remember using it, it must have been in 2012 or 2013, to automatically make static sites out of WordPress. We had a bank department as a client with a non-negotiable requirement that they be able to use WordPress to manage their site, along with an IT policy that absolutely forbade running WordPress (or PHP or MySQL or even Linux) on public-facing servers. So we had an intranet-only WordPress site that got scraped 6 times a day and published as static HTML to an IT-approved public Windows webserver.
Used Teleport Pro myself back then.
Oh wow, I had forgotten all about it, but I used it too! That, and pavuk with its regular expressions on the command line. https://tenmax.wordpress.com/
Yeah that memory.
I've been using HTTrack for almost two decades to create static archives of the website for an annual event.
It doesn't do the job 100% but it's a start. In particular, HTTrack does not support srcset, so only the default (1x) pixel-density images were archived (though I manually edited the archives to inject the high pixel-density images, as well as numerous other necessary fix-ups).
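(If anyone needs to do the same fix-up: a quick way to locate the srcset attributes HTTrack leaves untouched is a recursive grep over the archive, e.g.

    grep -rn 'srcset=' ./archive --include='*.html'

with ./archive standing in for wherever the mirror landed.)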
The benefit of the tool is fine control over the crawling process as well as over which files are included. Included files have their URLs rewritten in the archived HTML (and CSS) to account for querystrings, absolute vs. relative URLs, external paths, etc.; non-included files have their URLs rewritten from relative to absolute links. Thus, you can browse the static archive, and non-included assets still work if they are online at their original URL, even if the static archive is on local storage or hosted at a different domain than the original site.
It was more work each year as the website gradually used script in more places, leading to more and more spots where I would need to manually touch up the archive to make it browsable. The website was not itself an SPA, but it contained SPAs on certain pages; my goal was to capture a snapshot of the initial HTML paint of these SPAs, but not to have them functional beyond that. This was (expectedly) beyond HTTrack's capabilities.
At least one other team member wanted to investigate https://github.com/Y2Z/monolith as a potential modern alternative.
When we set up the PyCon India website back in 2009, the webmaster was very insistent on it being archived properly throughout the years. The sites were maintained using various applications (I think 2009 was by a Django app called fossmeet, 2010 and 11 were using infogami etc.).
However, after the conference was completed, the entire site was downloaded and the HTML files were uploaded statically at the same URLs. This preserved the sites from 2009 till now. You can actually see the old talks and discussions e.g. https://in.pycon.org/2009/, https://in.pycon.org/2010 etc.
I came across httrack around that time but we used wget to mirror the website. I found it interesting. IIRC, it used to refresh itself to copy recursively but I could be wrong. It's been a long time.
Or just do:
Or use a much more user friendly GUI tool like HTTrack.
Something something Dropbox.
I used this all the time twenty years ago. Tried it out again for some reason recently, I think at the suggestion of ChatGPT (!), for some archiving, and it actually did some damage.
I do wish there was a modern version of this that could embed the videos in some of my old blog posts so I could save them in their entirety locally, as something other than an HTML mystery blob. None of the archive sites preserve video, and neither do extensions like SingleFile. If you're lucky, they'll embed a link to the original file, but that won't help later when the original posts go offline.
ArchiveBox is your friend. (When it doesn't hate you :)
It's pretty good at archiving most web pages - it relies on SingleFile and other tools to get the job done. Depends on how you saved the video, but in general it works decently well.
I can't recall the details, but this tool had quite some friction the last time I tried downloading a site with it. Too many new definitions to learn, too many knobs it asks you to tweak. I opted to use `wget` with the `--recursive` flag, which just did what I expected it to do out of the box: crawl all the links it can find and download them. No tweaking needed, and nothing new to learn.
I think I had a similar experience with HTTrack. However, wget also needs some tweaking to make relatively robust crawls, e.g. https://stackoverflow.com/a/65442746/464590
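For what it's worth, the extra tweaking usually amounts to politeness and identification flags layered on top of the recursive ones; the general shape (standard wget flags, waits tuned per site) is something like:

    wget --recursive --level=inf --no-parent --page-requisites --convert-links --adjust-extension --wait=1 --random-wait --user-agent="Mozilla/5.0" https://example.com/

which is roughly the point where wget stops being "no tweaking needed".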
Related: https://news.ycombinator.com/item?id=27789910
I still remember Netvampire https://web.archive.org/web/19990125091054/http://netvampire... , an old application I used to do the same in winNT days.
A few years ago my workplace got rid of our on-premise install of FogBugz. I tried to clone the site with HTTrack, but it did not work due to client-side JavaScript and authentication issues.

I was familiar with C#/WebView2 and used that: generate the URLs, load the pages one by one, wait for each page to build its HTML, and then save the final page. Intercept and save the CSS/image requests.

If you have ever integrated a browser view in a desktop or mobile app, you already know how to do this.
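A rough command-line analogue of that approach, for a single page, is to let a headless browser render the page and dump the resulting DOM (this captures the post-JavaScript HTML but not the assets; the binary name varies by platform, and the URL here is a placeholder):

    chromium --headless --dump-dom "https://example.com/case/1234" > case-1234.html

Loop that over a list of generated URLs and you have the core of the same technique, minus the request interception.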
How does it compare to SiteSucker (<https://ricks-apps.com/osx/sitesucker/index.html>)?
Never found a great alternative of this for Mac.
https://formulae.brew.sh/formula/httrack - used it a couple months ago
Have you seen SiteSucker? https://ricks-apps.com/osx/sitesucker/
wget
Back in the old days, I used to download entire websites using HTTrack and read them later.

Nostalgia!!
This doesn't really work with most sites anymore, does it? It can't run JavaScript (unlike headless browsers with Playwright/Puppeteer, for example), has limited support for more modern protocols, etc.?

Any suggestions for an easy way to mirror modern web content, like an HTTrack for the enshittified web?
ArchiveBox works decently with JavaScript, uses a headless browser, and can be deployed with Docker.
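Going from memory of the ArchiveBox docs (so double-check the image name and paths), the Docker route is roughly:

    docker run -v "$PWD/archive:/data" -it archivebox/archivebox init
    docker run -v "$PWD/archive:/data" -it archivebox/archivebox add 'https://example.com/'

init sets up the collection directory once, and add snapshots a URL using the extractors it has configured (headless Chromium, SingleFile, etc.).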
Which content/information sites that depend on JavaScript have you actually found? What I run into is marketing-oriented sites or app-like interactions that lean heavily on JS and thus can't be archived... but otherwise...
Literally any modern social media, YouTube comments, anything using client-side rendering frameworks
Is this
wget --mirror
?
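More or less: per the wget manual, --mirror is shorthand for

    wget -r -N -l inf --no-remove-listing https://example.com/

so for a locally browsable copy you would typically add --convert-links and --page-requisites on top, which covers most of what HTTrack does out of the box.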
I never really understood the appeal of httrack over wget; it seems wget can do almost everything and is almost always already installed.
Time to add AI mode to this :).
You jest, but AI could look up the site's IP on Shodan and a few other services, then get a shell, elevate to root, and just use GNU tar to back up the site, including daemon configurations, assuming it's not well hidden behind a CDN and the origin server is exposed.
htcrack.ai is available...
I saw this on Twitter, and it came to mind when I read the title: https://same.new/
Of course, it does a from-scratch reimplementation of a single web page, but it might be related enough to be interesting here.