What is Website Ripper? SiteSucker is an iOS app that automatically downloads Web sites from the Internet.
First written: 2016-2017. Last nontrivial update: 2019 Jan 13.
SiteSucker 是一款只需输入网址即可自动将网站下载到本地的应用。SiteSucker 可以将目标站点的目录结构、html 网页、图像、PDF、样式表、音视频等文件异步复制到本地。SiteSucker 可用于制作网站的本地副本使您可以脱机浏览站点（需要注意的是：对于动态网站以及使用CDN的网站下载后的效果可能不是. SiteSucker is an aptly named app that downloads an entire website onto your local system. With SiteSucker, you can get text, video, and all other files onto your hard drive, giving you a complete.
One way to back up a website—whether your own or someone else's—is to use a tool that downloads the website. Then you can back up the resulting files to the cloud, optical media, etc. This page gives some information on downloading websites using tools like HTTrack and SiteSucker.
Note: Here's a list of the domains I have downloaded. Let me know if you don't want your site to be downloaded.
On Windows, HTTrack is commonly used to download websites, and it's free. Once you download a site, you can zip its folder and then back that up the way you would any of your other files.
I'm still a novice at HTTrack, but from my experience so far, I've found that it captures only ~90% of a website's individual pages on average. For some websites (like the one you're reading now), HTTrack seems to capture everything, but for other sites, it misses some pages. Maybe this is because of complications with redirects? I'm not sure. Still, ~90% backup is much better than 0%.
You can verify which pages got backed up by opening the domain's index.html file from HTTrack's download folder and browsing around using the files on your hard drive. It's best if you disconnect from the Internet when doing this because I found that if I was online when browsing around the downloaded file contents, some pages got loaded from the Internet, not from the local files that I was testing.
Pictures don't seem to load offline, but you can check that they're still being downloaded. For example, for WordPress site downloads, look at the wp-contentuploads folder.
I won't explain the full how-to steps of using HTTrack, but below are two problems that I ran into.
When I tried to use HTTrack to download a single website using the program's default settings (as of Nov. 2016), I downloaded the website but also got some other random files from other domains, presumably from links on the main domain. In some cases, the number of links that the program tried to download grew without limit, and I had to cancel. In order to download files only from the desired domain, I had to do the following.
Step 1: Specify the domain(s) to download (as I had already been doing).
Step 2: Add a Scan Rules pattern like this: +https://*animalcharityevaluators.org/* . This way, only links on that domain will be downloaded.
Including a * before the main domain name is useful in case the site has subdomains. For example, the site https://animalcharityevaluators.org/ has a subdomain http://researchfund.animalcharityevaluators.org/ , which would be missed if you only used the pattern +https://animalcharityevaluators.org/* .
Some pages gave me a 'Forbidden' error, which prevented any content from being downloaded. I was able to fix this by clicking on 'Set options...', choosing the 'Browser ID' tab, and then changing 'Browser 'Identity' from the default of 'Mozilla/4.5 (compatible: HTTrack 3.0x; Windows 98)' to 'Java1.1.4'. I chose the Java identity because it didn't contain the substring 'HTTrack', which may have been the reason I was being blocked.
On Mac, I download websites using SiteSucker. This page gives configuration details that I use when downloading certain sites.
I think website downloads using the above methods don't include the redirects that a site may be using. A redirect ensures that an old link doesn't break when you move a page to a new url. If you back up your website, it's nice to include the redirects in the backup, in case you need to regenerate your website in the future.
I'm not sure if there's a way to download the redirects of a site you don't own; let me know if there is. For a site you do own, sometimes you can back up the redirects by saving the relevant
.htaccess file. In my case, I use the 'Redirection' plugin in WordPress, and its menu has an 'Import/Export' option; I find that the 'Nginx rewrite rules' export format is concise and readable.
HTTrack and SiteSucker are web crawlers, which means they identify pages on your site by following links. If you have content on your site that's not linked from the starting page you provide, then I assume these programs won't download it. (I've verified that this is true at least for SiteSucker.) If you want a page or file on your website to be downloaded, make sure there's at least one link to it. If you don't want the link to your content to be noticeable, you can add a hyperlink with no anchor text, like this:
I use this trick for files that I store on my sites as backups. In particular, whenever I publish a substantive article on a website that I don't control, such as an interview published on someone else's site, I create a PDF backup of the page because I can't guarantee that the other person will keep the content online indefinitely. On my page I add a visible hyperlink that points to the other person's site, but I also upload the PDF backup to my own site in case the original content ever disappears. Since I want these PDF files to be included in backups of my website content, I create hyperlinks to these PDF files with no anchor text.
[Update in 2018: For content of mine that's published on other people's sites, I've decided to stop storing PDF backups on my website, since these backup files could theoretically still show up in Google results. Plus, if someone else takes his copy of the content down, there's a chance he did so deliberately, and I'd want to check with him before having a copy available on my site. My new approach is to back up my interviews and other content that's hosted on another person's site just to my private files—both a print-to-PDF copy and the raw HTML. If the content on the other person's site ever goes away, I can ask that person for permission to upload it to my own site.]
Images that are only used in the context of
meta property='og:image' (for Facebook image previews) and that aren't actually linked from the body of your HTML also won't be captured by crawlers. Again, you can add an
a href link to the image with no anchor text to make sure the image gets downloaded.
If your site has images that aren't hosted on your own domain, then crawlers won't download those images when only downloading same-domain content. For example, if you use the WordPress Jetpack plugin with the Photon module, then an image that would normally be hosted at
http://yoursitehere.com/wp-content/uploads/2014/04/myimage.jpg will instead be hosted at something like
https://i0.wp.com/yoursitehere.com/wp-content/uploads/2014/04/myimage.jpg?w=642 . As a result, a crawler won't download this image.
At least in SiteSucker, I think you can work around this problem by downloading
http://yoursitehere.com/wp-content/uploads/in addition to
http://yoursitehere.com . The download of
http://yoursitehere.com/wp-content/uploads/ seems to pick up the images for some reason. (In fact, it picks up multiple sizes of each image.)
Once you've downloaded a website using HTTrack or similar software, should you compress the website folder before backing it up to the cloud? I'm uncertain and would appreciate reader feedback, but here are some considerations.
My impression is that plain text files (such as raw HTML files) are more secure against format rot and bit rot, because 'They avoid some of the problems encountered with other file formats, such as endianness, padding bytes, or differences in the number of bytes in a machine word. Further, when data corruption occurs in a text file, it is often easier to recover and continue processing the remaining contents.' A Reddit comment says: 'Straight up txt files have a very low structural scope / over head, so unless you're doing something funky, a bit error is limited to a character byte.'
As a result, I plan to back up my own websites and other important sites mostly as uncompressed files (with some compressed copies thrown into the mix too). However, when backing up lots of other websites that are less essential, compression may make sense. This is especially so if the website download has a lot of redundancy. Following is an example.
In 2017, I downloaded www.mattball.org using SiteSucker. The download had a huge amount of redundancy using the default SiteSucker download settings, because each blog comment on a blog post had its own url and thus downloaded the blog post again. For example, on a blog post with 7 comments, I got 8 copies of the blog HTML: 1 from the original post, and 7 from each of the 7 comment urls. The website download also included an enormous number of search pages. Probably I could prevent these copies from downloading with some jiggering of the settings, but I want to be able to download lots of sites with minimal per-site configuration, and I'm not sure that url-exclusion rules that I might apply in this case would work elsewhere.
In principle, compression can minimize the burden of duplicate content. Does it in practice? During the www.mattball.org download, I checked to see that the raw content downloaded so far occupied ~450 MB. Applying 'Normal' zip compression using Keka software gave a zip archive of 88 MB, which is about 1/5 the uncompressed size. Not bad. However, a 'Normal' 7z archive of the raw data was only 1.6 MB—a little more than 1/300th of the uncompressed size!
Using a simple test folder with two copies of a file, I verified that zip compression doesn't detect duplicate files, but 7z compression does. Presumably this explains the dramatic size reduction using 7z. This person found the same: 'You might expect that ZIP is smart enough to figure out this is repeating data and use only one compression object inside the .zip, but this is not the case[....] Basically most such utilities behave similarly (tar.gz, tar.bz2, rar in solid mode) - only 7zip caught me [...].'
Is it dangerous to download websites because you might make a request to a dangerous url? I'm still exploring this topic and would like advice.
My tentative guess is that the risk is low if you only download web pages from a given (trustworthy) domain. If you also download pages on other domains that are linked from the first domain, perhaps there's more risk?
HTTrack's FAQ says: 'You may encounter websites which were corrupted by viruses, and downloading data on these websites might be dangerous if you execute downloaded executables, or if embedded pages contain infected material (as dangerous as if using a regular Browser). Always ensure that websites you are crawling are safe.'
Suppose you want to monitor what changes are done to your website over time, such as to track what revisions your fellow authors are making to articles. While I imagine there are various ways to do this, one relatively low-tech method is as follows. Periodically (say, every few months, or at whatever frequency suits you) download a new copy of your website using HTTrack or SiteSucker. Store at least the previous download as well. Then run
diff -r on your two website-download folders to see what has changed. You could make this more sophisticated by adding logic to ignore trivial changes or changes in files you don't care about.
Of course, you could also do a diff on the website database
.sql file directly if you can download it.