Funny timing. Just yesterday I was looking for an easy Windows tool to do a simple stress test on a website (legally, ofc). A requirement of mine was to just give it the root URL and have the tool discover the rest automatically (staying on the same domain). Parameters like parallelism also had to be easily manageable.

After trying some crawlers/copiers and other tools I went back to a simple one I already knew from saving static copies of websites in the past: HTTrack. It fit the bill perfectly!

You can give it the root URL, set it to "scan only" (so it doesn't download everything) and tweak settings like connections and speed (and even change some settings mid-run, save settings, pause, ...). So thanks xroche for HTTrack! :)
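For reference, the command-line version can do roughly the same thing as the GUI. Going from memory of the httrack manual (so double-check the exact flags), a scan-only crawl of one domain with limited parallelism looks something like:

    httrack "https://example.com/" -O ./scan -p0 -c8 -A100000 "+*.example.com/*"

where -p0 should be the "just scan, don't save files" priority mode, -c8 allows eight simultaneous connections, -A caps the transfer rate in bytes per second, and the + filter keeps the crawl on the domain (example.com is a placeholder). For a stress test you'd raise -c and drop the -A cap.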
I’ve used it a few times to “secure” an old but still relevant dynamic website. Think of a site for a mature project that shouldn’t disappear from the internet, but where it’s not worth upgrading 5-year-old code that won’t pass our “cyber security audit” due to unsupported versions of PHP or Rails, so we just convert it to a static site and delete the database. Everything pretty much works fine on the front end, and the CMS functionality is no longer needed. It’s great for that niche use case.
You could also do that with plain wget.
wget won't convert hrefs / src to a correct relative location.
Doesn't it do that with a flag, or am I just reading this wrong? From the man page:

-k
--convert-links

After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.

Each link will be changed in one of the two ways:

The links to files that have been downloaded by Wget will be changed to refer to the file they point to as a relative link.

Example: if the downloaded file /foo/doc.html links to /bar/img.gif, also downloaded, then the link in doc.html will be modified to point to ‘../bar/img.gif’. This kind of transformation works reliably for arbitrary combinations of directories.

The links to files that have not been downloaded by Wget will be changed to include host name and absolute path of the location they point to.

Example: if the downloaded file /foo/doc.html links to /bar/img.gif (or to ../bar/img.gif), then the link in doc.html will be modified to point to http://hostname/bar/img.gif.

Because of this, local browsing works reliably: if a linked file was downloaded, the link will refer to its local name; if it was not downloaded, the link will refer to its full Internet address rather than presenting a broken link. The fact that the former links are converted to relative links ensures that you can move the downloaded hierarchy to another directory.

Note that only at the end of the download can Wget know which links have been downloaded. Because of that, the work done by ‘-k’ will be performed at the end of all the downloads.
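In practice, the combination that gives a cleanly browsable local copy is something like this (all standard wget flags; example.com is a placeholder):

    wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example.com/

--page-requisites also pulls in the CSS/images each page needs, and --adjust-extension appends .html to pages served without an extension so they open properly from disk.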
Oh interesting, I hadn't realized that option existed.
Not sure of the context for why this is on HN, but it surely put a smile on my face. I used to use it during the 56K era, when I'd just download everything and read it offline. Basically using it as RSS before RSS was a thing.
Interestingly, the most recent commit and release on their GitHub are from March 11, 2025, so it's clearly still maintained.
I remember using it, it must have been in 2012 or 2013, to automatically make static sites out of WordPress. We had a bank department as a client with a non-negotiable requirement that they be able to use WordPress to manage their site, along with an IT policy that absolutely forbade running WordPress (or PHP or MySQL or even Linux) on public-facing servers. So we had an intranet-only WordPress site that got scraped 6 times a day and published as static HTML to an IT-approved public Windows webserver.
Used Teleport Pro myself back then.
Oh wow, I had forgotten all about it, but I used it too! That, and pavuk with its regular expressions on the command line. https://tenmax.wordpress.com/
Yeah that memory.
I've been using HTTrack for almost two decades to create static archives of the website for an annual event.
It doesn't do the job 100% but it's a start. In particular, HTTrack does not support srcset, so only the default (1x) pixel-density images were archived (though I manually edited the archives to inject the high pixel-density images, as well as numerous other necessary fix-ups).
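(If anyone needs to do the same fix-up: a quick way to locate the srcset attributes HTTrack leaves untouched is a recursive grep over the archive, e.g.

    grep -rn 'srcset=' ./archive --include='*.html'

with ./archive standing in for wherever the mirror landed.)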
The benefit of the tool is fine control over the crawling process as well as over which files are included. Included files have their URLs rewritten in the archived HTML (and CSS) to account for querystrings, absolute vs. relative URLs, external paths, etc.; non-included files have their URLs rewritten from relative to absolute links. Thus, you can browse the static archive, and non-included assets still work if they are online at their original URL, even if the static archive is on local storage or hosted at a different domain than the original site.
It was more work each year as the website gradually used script in more places, leading to more and more spots where I would need to manually touch up the archive to make it browsable. The website was not itself an SPA, but it contained SPAs on certain pages; my goal was to capture a snapshot of the initial HTML paint of these SPAs, but not to have them functional beyond that. This was (expectedly) beyond HTTrack's capabilities.
At least one other team member wanted to investigate https://github.com/Y2Z/monolith as a potential modern alternative.
When we set up the PyCon India website back in 2009, the webmaster was very insistent on it being archived properly throughout the years. The sites were maintained using various applications (I think 2009 was by a Django app called fossmeet, 2010 and 11 were using infogami etc.).
However, after the conference was completed, the entire site was downloaded and the HTML files were uploaded statically at the same URLs. This preserved the sites from 2009 till now. You can actually see the old talks and discussions e.g. https://in.pycon.org/2009/, https://in.pycon.org/2010 etc.
I came across httrack around that time but we used wget to mirror the website. I found it interesting. IIRC, it used to refresh itself to copy recursively but I could be wrong. It's been a long time.
Or just do:
Or use a much more user friendly GUI tool like HTTrack.
Something something Dropbox.
I used this all the time twenty years ago. Tried it out again for some reason recently, I think at the suggestion of ChatGPT (!), for some archiving, and it actually did some damage.
I do wish there was a modern version of this that could embed the videos in some of my old blog posts so I could save them in their entirety locally, as something other than an HTML mystery blob. None of the archive sites preserve video, and neither do extensions like SingleFile. If you're lucky, they'll embed a link to the original file, but that won't help later when the original posts go offline.
ArchiveBox is your friend. (When it doesn't hate you :)
It's pretty good at archiving most web pages - it relies on SingleFile and other tools to get the job done. Depends on how you saved the video, but in general it works decently well.
I can't recall the details, but this tool had quite some friction the last time I tried downloading a site with it. Too many new definitions to learn, too many knobs it asks you to tweak. I opted to use `wget` with the `--recursive` flag, which just did what I expected it to do out of the box: crawl all the links it can find and download them. No tweaking needed, and nothing new to learn.
I think I had a similar experience with HTTrack. However, wget also needs some tweaking to make relatively robust crawls, e.g. https://stackoverflow.com/a/65442746/464590
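For what it's worth, the extra tweaking usually amounts to politeness and identification flags layered on top of the recursive ones; the general shape (standard wget flags, waits tuned per site) is something like:

    wget --recursive --level=inf --no-parent --page-requisites --convert-links --adjust-extension --wait=1 --random-wait --user-agent="Mozilla/5.0" https://example.com/

which is roughly the point where wget stops being "no tweaking needed".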
Related: https://news.ycombinator.com/item?id=27789910
I still remember Netvampire https://web.archive.org/web/19990125091054/http://netvampire... , an old application I used to do the same in winNT days.
A few years ago my workplace got rid of our on-premise install of FogBugz. I tried to clone the site with HTTrack, but it did not work due to client-side JavaScript and authentication issues.

I was familiar with C#/WebView2 and used that: generate the URLs, load the pages one by one, wait for each page to build its HTML, and then save the final page. Intercept and save the CSS/image requests.

If you have ever integrated a browser view in a desktop or mobile app, you already know how to do this.
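A rough command-line analogue of that approach, for a single page, is to let a headless browser render the page and dump the resulting DOM (this captures the post-JavaScript HTML but not the assets; the binary name varies by platform, and the URL here is a placeholder):

    chromium --headless --dump-dom "https://example.com/case/1234" > case-1234.html

Loop that over a list of generated URLs and you have the core of the same technique, minus the request interception.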
How does it compare to SiteSucker (<https://ricks-apps.com/osx/sitesucker/index.html>)?
Never found a great alternative of this for Mac.
https://formulae.brew.sh/formula/httrack - used it a couple months ago
Have you seen SiteSucker? https://ricks-apps.com/osx/sitesucker/
wget
Back in the old days, I used to download entire websites using HTTrack and read them later.

Nostalgia!!
This doesn't really work with most sites anymore, does it? It can't run JavaScript (unlike headless browsers with Playwright/Puppeteer, for example), has limited support for more modern protocols, etc.?

Any suggestions for an easy way to mirror modern web content, like an HTTrack for the enshittified web?
ArchiveBox works decently with JavaScript, uses a headless browser, and can be deployed with Docker.
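Going from memory of the ArchiveBox docs (so double-check the image name and paths), the Docker route is roughly:

    docker run -v "$PWD/archive:/data" -it archivebox/archivebox init
    docker run -v "$PWD/archive:/data" -it archivebox/archivebox add 'https://example.com/'

init sets up the collection directory once, and add snapshots a URL using the extractors it has configured (headless Chromium, SingleFile, etc.).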
Which content/information sites that depend on JavaScript have you actually found? What I run into is marketing-oriented sites or app-like interactions that lean heavily on JS and thus can't be archived... but otherwise...
Literally any modern social media, YouTube comments, anything using client-side rendering frameworks
Is this
wget --mirror
?
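More or less: per the wget manual, --mirror is shorthand for

    wget -r -N -l inf --no-remove-listing https://example.com/

so for a locally browsable copy you would typically add --convert-links and --page-requisites on top, which covers most of what HTTrack does out of the box.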
I never really understood the appeal of httrack over wget; it seems wget can do almost everything and is almost always already installed.
Time to add AI mode to this :).
You jest, but AI could look up the site's IP on Shodan and a few other services, then get a shell, elevate to root, and just use GNU tar to back up the site, including daemon configurations, assuming it's not well hidden behind a CDN and the origin server is exposed.
htcrack.ai is available...
I saw this on Twitter, and it came to mind when I read the title: https://same.new/
Of course, it does a from-scratch reimplementation of a single web page, but it might be related enough to be interesting here.