Install and use ArchiveBox self-hosted internet archiving

ArchiveBox is a powerful, self-hosted internet archiving solution written in Python. It lets you collect, save, and view sites you want to keep offline. ArchiveBox can be used as a command-line tool, as a desktop app, or through its web interface. It is cross-platform and runs on Linux, macOS, and Windows.

Some of ArchiveBox's notable features:

  • You can feed it URLs one at a time, or schedule regular imports from your browser's bookmarks, history, feeds, etc.
  • It saves snapshots of the URLs you feed it in several formats: HTML, PDF, PNG screenshots, WARC, etc.

In this guide, we will walk through how to install, configure, and use the ArchiveBox self-hosted internet archiving solution.

Install ArchiveBox self-hosted internet archiving solution

There are several ways to install ArchiveBox. This guide covers two of them:

  • Using PIP3
  • Using Docker

#1. Install ArchiveBox using Pip3

For this method, make sure Python 3.7 or later and Node.js 12 or later are installed on your system. Then install pip using the command for your distribution.

##On Debian/Ubuntu
sudo apt install python3-pip

##On RHEL/CentOS/Rocky Linux 8
sudo yum install epel-release 
sudo yum install python3-pip

##On openSUSE
sudo zypper install python3-pip

##On Arch Linux
sudo pacman -S python-pip
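
Before installing ArchiveBox, it can help to confirm that the Python and Node.js versions meet the minimums stated above. The following is a minimal sketch, not part of ArchiveBox itself: the helper name version_ge is hypothetical, and it assumes a sort that supports the -V (version sort) option, as GNU coreutils does.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Assumption: `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Detect installed versions; fall back to 0 when a tool is missing.
py_ver="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null || echo 0)"
node_ver="$(node --version 2>/dev/null | sed 's/^v//')"
node_ver="${node_ver:-0}"

if version_ge "$py_ver" 3.7; then
    echo "Python $py_ver: OK"
else
    echo "Python 3.7 or later is required (found: $py_ver)"
fi

if version_ge "$node_ver" 12; then
    echo "Node.js $node_ver: OK"
else
    echo "Node.js 12 or later is required (found: $node_ver)"
fi
```

The script only reports what it finds and does not install or change anything.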

With PIP3 installed, you can install ArchiveBox as below.

sudo pip3 install archivebox

Initialize ArchiveBox as below.

mkdir ~/archivebox && cd ~/archivebox
archivebox init --setup

Start the ArchiveBox webserver.

archivebox server 0.0.0.0:8000

This method often runs into dependency problems, so it is not the recommended way to install ArchiveBox.

#2. Install ArchiveBox using Docker Compose (Recommended)

Begin by installing Docker on your Linux system, following the installation guide for your distribution.

Start and enable docker

sudo systemctl enable docker
sudo systemctl start docker

Install docker-compose.

curl -s https://api.github.com/repos/docker/compose/releases/latest | grep browser_download_url  | grep docker-compose-linux-x86_64 | cut -d '"' -f 4 | wget -qi -
chmod +x docker-compose-linux-x86_64
sudo mv docker-compose-linux-x86_64 /usr/local/bin/docker-compose
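
The download command above hardcodes the x86_64 asset name, so it will not match on other CPU architectures such as aarch64. As a rough sketch, the asset name could be derived from the current machine instead. The function name compose_asset_name is hypothetical, and the sketch assumes the release assets keep the docker-compose-<os>-<arch> naming seen in the command above.

```shell
# Build the release asset name for a given OS and CPU architecture.
# With no arguments, both values come from `uname` on the current machine.
# Assumption: assets are named docker-compose-<os>-<arch>, as in the
# x86_64 example used in this guide.
compose_asset_name() {
    os="$(printf '%s' "${1:-$(uname -s)}" | tr '[:upper:]' '[:lower:]')"
    arch="${2:-$(uname -m)}"
    printf 'docker-compose-%s-%s\n' "$os" "$arch"
}

# Show the name that would be used on this machine.
compose_asset_name

# Possible use with the download pipeline from this guide:
#   asset="$(compose_asset_name)"
#   curl -s https://api.github.com/repos/docker/compose/releases/latest \
#     | grep browser_download_url | grep "$asset" | cut -d '"' -f 4 | wget -qi -
```

Check the printed name against the asset list on the releases page before relying on it, since not every OS and architecture pair is published.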

Add your user to the docker group.

sudo usermod -aG docker $USER
newgrp docker
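
To confirm that the group change has taken effect in the current shell, the group list reported by id can be checked. This is a small sketch; the helper name in_group is hypothetical.

```shell
# Succeeds when the current user's active groups include the group named $1.
in_group() {
    id -nG | tr ' ' '\n' | grep -qx -- "$1"
}

if in_group docker; then
    echo "Current shell is in the docker group"
else
    echo "Not in the docker group yet; log out and back in, or run: newgrp docker"
fi
```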

Download the docker-compose YAML file

curl -O 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/docker-compose.yml'

Initialize the ArchiveBox collection.

docker-compose run archivebox init --setup

When prompted, create the admin user for the web UI.

[√] Done. A new ArchiveBox collection was initialized (0 links).

[+] Creating new admin user for the Web UI...
Username (leave blank to use 'archivebox'): admin 
Email address: Enter your email address
Password: Enter your Password
Password (again): Enter the Password again

Start the container.

$ docker-compose up

The server is now up and running.

[+] Running 1/1
 ⠿ Container thor-archivebox-1  Created                                    0.3s
Attaching to thor-archivebox-1
thor-archivebox-1  | [i] [2021-12-20 09:32:05] ArchiveBox v0.6.2: archivebox server --quick-init 0.0.0.0:8000
thor-archivebox-1  |     > /data
thor-archivebox-1  | 
thor-archivebox-1  | [^] Verifying and updating existing ArchiveBox collection to v0.6.2...
.......

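Access the web interface at http://localhost:8000 from the same machine, or at http://IP_Address:8000 from another host. The 0.0.0.0 in the server command means the server listens on all network interfaces.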
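
The container can take a little while to start answering requests. One way to wait for it from a script is to poll the port with curl, as in the sketch below. The function name wait_for_http and its retry defaults are hypothetical, and the URL assumes the server is published on port 8000 of the local machine.

```shell
# Poll a URL until it answers without an HTTP error, or give up.
#   $1: URL to check
#   $2: maximum number of attempts (default 30)
#   $3: seconds to sleep between attempts (default 2)
wait_for_http() {
    url="$1"
    tries="${2:-30}"
    delay="${3:-2}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -f makes curl fail on HTTP 4xx/5xx; --max-time bounds each attempt.
        if curl -fs -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example use (assumes the web UI is reachable on localhost:8000):
#   wait_for_http http://localhost:8000 && echo "ArchiveBox is up"
```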

Use ArchiveBox self-hosted internet archiving solution

Once installed, you are set to start using ArchiveBox on your system to take a backup of sites you want to save offline.

You can add a URL to save as below.

$ archivebox add 'https://example.com'                                    

Using docker-compose.

$ docker-compose run archivebox add 'https://example.com'
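
To add many URLs at once, archivebox add can also read them from standard input. The sketch below cleans a plain-text list before piping it in. The helper name clean_urls and the file name urls.txt are hypothetical, and the -T flag in the Docker Compose variant (which disables pseudo-TTY allocation so that piping works) should be verified against your docker-compose version.

```shell
# Print the unique http(s) URLs from a text file, one per line.
# Blank lines, comments, and any line not starting with http:// or
# https:// are dropped.
clean_urls() {
    grep -E '^https?://' "$1" | sort -u
}

# Example use (assumes a hypothetical urls.txt in the current directory):
#   clean_urls urls.txt | archivebox add
#   clean_urls urls.txt | docker-compose run -T archivebox add
```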

To import URLs automatically on a schedule, use the command:

$ archivebox schedule --every=day --depth=1 https://example.com/rss.xml 

On Docker-compose:

$ docker-compose run archivebox schedule --every=day --depth=1 https://example.com/rss.xml 

View Archived pages.

On ArchiveBox, you can view the saved pages using the CLI or the web as below.

Using the CLI, view archived pages:

$ archivebox list 'https://example.com'

Accessing and Using ArchiveBox Web UI

From the web page, view the archived pages using the URL http://IP_Address:8000

Add more pages and manage ArchiveBox by clicking on the + icon. Provide your login credentials to proceed.

On the ArchiveBox admin dashboard, you can manage users, accounts, snapshots, etc.

Add a URL by clicking on Add +, then provide the list of URLs to archive.

Scroll to the bottom of the page and submit the form to add the URLs.

View the list of added URLs by navigating to the home page.

You can view what is archived by clicking on the snapshot.

That is it!

I hope you enjoyed this guide on how to install and use the ArchiveBox self-hosted internet archiving solution.
