SOSSE: Open-Source, Self-Hosted Digital Archiving & Search Engine Solution

I've found ArchiveBox to be overwhelming, despite its capabilities. SOSSE seems like a more manageable alternative for archiving websites and searching through those archives. Let's take a closer look at the app and its features to see if it's a good fit for your needs.

What is SOSSE?

SOSSE (Selenium Open Source Search Engine) is a self-hosted, open-source tool that crawls, archives, and indexes web pages so you can search them later.

SOSSE Key Features

Web Page Search: Conduct advanced searches of web pages, including dynamically rendered content.
Recurring Crawling: Schedule crawls at fixed intervals or adjust the rate based on content changes.
Web Page Archiving: Store HTML content, update links for local use, download required assets, and handle dynamic content.
File Downloads: Batch download binary files from web pages in a single operation.
Atom Feeds: Generate content feeds for websites that don't have them, or receive notifications when new content containing a specified keyword is published.
Authentication: Automatically authenticate to access private pages and retrieve content.
Permissions: Admins can configure crawlers and view statistics, while authenticated or anonymous users can perform searches.
Search Features: Includes features such as private search history, external search engine shortcuts, etc.
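
To make the "adjust the rate based on content changes" idea from Recurring Crawling a bit more concrete, here is a minimal Python sketch of an adaptive schedule. It is an illustration only, not SOSSE's actual code: the function name next_interval, the doubling and halving rule, and the one-hour and one-week bounds are all assumptions picked for the example.

```python
from datetime import timedelta

# Hypothetical bounds, chosen only for this example.
MIN_INTERVAL = timedelta(hours=1)
MAX_INTERVAL = timedelta(days=7)


def next_interval(current: timedelta, content_changed: bool) -> timedelta:
    """Return how long to wait before crawling a page again.

    Assumed rule: recrawl sooner when the page changed, back off when it
    did not, and always stay within MIN_INTERVAL..MAX_INTERVAL.
    """
    candidate = current / 2 if content_changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, candidate))


if __name__ == "__main__":
    interval = timedelta(days=1)
    # Simulate three crawls: unchanged, unchanged, then changed.
    for changed in (False, False, True):
        interval = next_interval(interval, changed)
        print(interval)
```

The point of the bounds is that a page that never changes doesn't drift off to being crawled once a year, and a page that changes constantly doesn't get hammered every few seconds.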

SOSSE is written in Python and released under the GNU AGPLv3 license. For crawling, it can drive a real browser (Mozilla Firefox or Chromium) through Selenium to handle pages that rely on JavaScript, or use the Requests library for faster crawling. On the technical side, SOSSE is designed to be lightweight and uses PostgreSQL for storage.

Prerequisites for SOSSE

To be honest, SOSSE doesn't quite live up to its "lightweight" claim, though it still gets the job done. This is subjective and depends on your hosting environment and setup, but in practice you'll likely need a fairly robust virtual machine or container to run SOSSE smoothly.

In my testing, RAM usage spiked to around 5.5GB during crawling sessions, and CPU load got quite high too. If you plan on doing a lot of archiving, allocate enough storage space upfront to avoid problems.

Even when SOSSE is idle, it still uses around 2.8GB of RAM, which is quite a chunk for an idle process. On the plus side, CPU usage tends to be pretty low when the app isn't actively crawling or archiving.

Install SOSSE using Docker Compose

If you don't have Docker installed and need help getting started, I recommend checking out our self-hosting guides for beginners. These guides cover the basics of setting up Docker on your server. They're designed to help you get up and running smoothly, even if you're new to self-hosting.

Use the following Docker Compose file to install SOSSE:

services:
    sosse:
        image: biolds/sosse:latest
        ports:
            # host port 8005 -> container port 80
            - "8005:80"
        volumes:
            # host path : container path
            - /docker/sosse/data:/etc/sosse/
            - /docker/sosse/logs:/var/log/sosse/
            - /docker/sosse/archives:/var/lib/sosse/html

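Be sure to change the mounts and port to your liking before deploying the app. Once the file is saved, you can start the container by running "docker compose up -d" from the same directory.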

SOSSE Web UI Tour

When you first launch SOSSE, you will be presented with a login screen.

💡

You may have to place SOSSE behind a reverse proxy with SSL BEFORE trying to archive an external website. This is pretty common practice when using tools like this.

When you log into SOSSE, you'll see a pretty simple page with a search box. Since there's no data to search yet, we need to add some configuration to get things going. To start, click on "Configuration" in the upper right and then click on "Administration".

I recommend creating a new crawl policy from scratch rather than modifying an existing one. If you just want to grab all pages for a specific domain, simply create a new policy with that domain's URL as the regular expression. It's a good idea to create a separate policy for each website you want to make a FULL archive of.
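
Since a crawl policy is picked by a regular expression on the URL, it can be worth testing your pattern before saving it. The sketch below uses Python's re module with a hypothetical domain, example.com. It assumes the pattern is matched from the start of the URL, which may not be exactly how SOSSE applies it, so treat this as a quick sanity check rather than a description of SOSSE's behavior.

```python
import re

# Hypothetical pattern for a full-site policy on example.com.
# Dots are escaped so "." only matches a literal dot, and the trailing
# slash stops lookalike hosts such as example.com.evil.test.
POLICY_PATTERN = re.compile(r"^https://example\.com/")


def matches_policy(url: str) -> bool:
    """Return True if the URL would fall under the example policy."""
    return POLICY_PATTERN.match(url) is not None


if __name__ == "__main__":
    for url in (
        "https://example.com/blog/post-1",
        "https://example.com.evil.test/",
        "https://other.example.org/",
    ):
        print(url, matches_policy(url))
```

Only the first URL matches here. A looser pattern such as a bare "example.com" would also catch the second one, which is probably not what you want when archiving a single site.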

If you're having trouble with HTML archiving, I've found that using the Chromium browser works best for this task. You can find the browser settings in the Browser tab within SOSSE. Once you've set up your preferred browser, be sure to save the crawl policy and then use it to start crawling the website you want to archive.

💡

You can simply save a single URL using the default crawl policy if you want. This walkthrough, however, shows how to grab an entire website so you can browse and search it offline later.

Once you've set up your crawl policy, head back to the Configuration section and click on "Crawl a new URL". Then, enter the URL of the website you want to start crawling. Finally, click the "Check and queue" button to initiate the crawling process.

After setting up your crawl policy and starting the crawling process, it's a good idea to let it finish before you do anything else. Then, head back to the main page of SOSSE. You should now see a thumbnail image on the page, accompanied by a link to the content that's been successfully crawled for each domain under that URL.

Now we're cookin' with peanut oil! With your crawled content at your disposal, you can turn SOSSE into your own personal search engine for the websites you've archived. For example, if you want to find articles or pages containing a specific term like "Ghost", simply type it in and SOSSE will display all relevant results from those archived sites. You can then browse them all locally within the SOSSE app.

Final Notes and Thoughts

This is just the beginning of what you can do with SOSSE. There's much more to explore, and I encourage you to install it and have a look for yourself. Be sure to read the well-written SOSSE documentation if you run into any issues.

If you really wanted to, you could index everything on your network, such as wikis, notes, images, and more, and use SOSSE as a network search engine. Pretty rad, huh?

Be sure to swing by the SOSSE GitHub page and give it a star!
