
Commit 3dde128

Systematically crawl all websites (#32)

Solves #19

* fix: use www for edt website to fix redirect
* feat: cloud service to regularly scrape content
* feat: add action to automatically update cloud using ssh
* chore: update cloud docs
* chore: smarter scripts for crawling and moving crawls around
* fix: working go-ssb-room image
* fix: added missing env var to cloud docker-compose
1 parent 359a961 commit 3dde128

File tree: 8 files changed (+79 −10 lines)

.github/workflows/cloud-push-main.yml (+29)

@@ -0,0 +1,29 @@
+name: Demo Build & Push
+
+on:
+  push:
+    branches: [main]
+    paths:
+      - 'docker/cloud/**'
+      - 'scripts/**'
+      - 'services/**'
+
+jobs:
+  build_and_push:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout the repo
+        uses: actions/checkout@v2
+      - name: Deploy Stack
+        uses: appleboy/ssh-action@master
+        with:
+          host: ${{ secrets.SSH_CLOUD_HOST }}
+          username: ${{ secrets.SSH_CLOUD_USERNAME }}
+          key: ${{ secrets.SSH_CLOUD_SECRET }}
+          port: ${{ secrets.SSH_CLOUD_PORT }}
+          script: |
+            cd ${{ secrets.CLOUD_FOLDER_PATH }}
+            git checkout .
+            git pull origin main
+            docker compose -f docker/cloud/docker-compose.yml down
+            docker compose -f docker/cloud/docker-compose.yml up -d --build
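The deploy step is just a fixed shell script run over SSH, so the same update can be performed by hand when debugging the server. A minimal sketch of the manual equivalent, where the edt-cloud host alias and checkout path are placeholders for your own setup:

# Manual equivalent of the workflow's deploy step (host alias and path are placeholders)
ssh edt-cloud <<'EOF'
cd ~/earth-defenders-toolkit   # whatever the CLOUD_FOLDER_PATH secret points at
git checkout .                 # discard local edits so the pull applies cleanly
git pull origin main
docker compose -f docker/cloud/docker-compose.yml down
docker compose -f docker/cloud/docker-compose.yml up -d --build
EOF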

docker/cloud/.env.example (+5)

@@ -1,6 +1,11 @@
 # General
 DIR=~/data
 
+PGID=1000
+PUID=1000
+# CRAWL
+CRAWL_TIMER=86400 # 86400 = 1 day
+DOMAIN_CRAWLER=crawl.earthdefenderstoolkit.com
 # F-DROID
 DOMAIN_FDROID=repo.earthdefenderstoolkit.com
 # FILE BROWSER
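CRAWL_TIMER is a plain number of seconds that the crawler sleeps between runs (it is consumed by services/crawler/start.sh below). A few illustrative values besides the daily default:

CRAWL_TIMER=3600    # hourly
CRAWL_TIMER=86400   # daily, the example default (24 * 60 * 60)
CRAWL_TIMER=604800  # weekly (7 * 86400)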

docker/cloud/docker-compose.yml (+10 −2)

@@ -71,9 +71,9 @@ services:
       PGID: 1000
       PUID: 1000
     command: -- c
-  # https://hub.docker.com/r/ksinica/go-ssb-room
+  # https://hub.docker.com/r/cooldracula/go-ssb-room
   ssb-room:
-    image: ksinica/go-ssb-room
+    image: cooldracula/go-ssb-room
     restart: always
     ports:
       - "0.0.0.0:8007:8008" # This is the port SSB clients connect to
@@ -138,7 +138,15 @@ services:
       - 21027:21027/udp
   crawler:
     build: ../../services/crawler
+    restart: always
     volumes:
       - ${DIR}/app-data/crawls:/crawls/
+      - ${DIR}/content:/app/content
+    environment:
+      VIRTUAL_HOST: ${DOMAIN_CRAWLER}
+      LETSENCRYPT_HOST: ${DOMAIN_CRAWLER}
+      LETSENCRYPT_EMAIL: ${EMAIL}
+      VIRTUAL_PORT: 8080
+      CRAWL_TIMER: ${CRAWL_TIMER}
   # TODO: Terrastories, Portal
   # V2: Open Balena
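VIRTUAL_HOST, VIRTUAL_PORT, and the LETSENCRYPT_* variables follow the conventions read by nginx-proxy and its letsencrypt companion, so the crawler is presumably routed and certified by a reverse proxy defined elsewhere in this compose file. To rebuild and watch only this service after editing it, the standard docker compose commands suffice:

docker compose -f docker/cloud/docker-compose.yml up -d --build crawler
docker compose -f docker/cloud/docker-compose.yml logs -f crawler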

docs/SETUP_SERVER.md renamed to docs/SETUP_CLOUD_SERVER.md (+8 −2)

@@ -1,4 +1,4 @@
-# Setting up your own Earth Defenders Toolkit Server
+# Setting up your own Earth Defenders Toolkit Cloud
 
 1. **A cloud provider or a computer**: we recommend [Digital Ocean](https://digitalocean.com) or any computer your organization can provide
 2. **Docker and docker-compose**: Some cloud providers (such as Digital Ocean) have a marketplace with [single-click Docker deployment](https://cloud.digitalocean.com/droplets/new?onboarding_origin=marketplace&appId=87786318&image=docker-20-04&activation_redirect=%2Fdroplets%2Fnew%3FappId%3D87786318%26image%3Ddocker-20-04). You can also install Docker and docker-compose on your machine using a single command:
@@ -18,4 +18,10 @@ curl -fsSL https://raw.githubusercontent.com/jinweijie/install-docker-and-compos
 1. Setup a stronger password for FileBrowser and Syncthing apps
 2. Change the description and moderation strategy for the Secure Scuttlebutt Room app
 3. Using the FileBrowser application create folders for your content
-4. Using the Syncthing application share the content folders
+4. Using the Syncthing application share the content folders
+
+## Actions
+
+To automate updates to the cloud you can fork the official repository and add your own Github Action secrets.
+
+Check the [ssh-action](https://github.com/appleboy/ssh-action) repository to understand the different variables.
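The secrets referenced by the workflow can be added in the repository settings or with the GitHub CLI. A sketch with placeholder values (the key file and folder path are hypothetical examples):

gh secret set SSH_CLOUD_HOST --body "203.0.113.10"
gh secret set SSH_CLOUD_USERNAME --body "deploy"
gh secret set SSH_CLOUD_PORT --body "22"
gh secret set SSH_CLOUD_SECRET < ~/.ssh/edt_deploy_key   # private key the server accepts
gh secret set CLOUD_FOLDER_PATH --body "/home/deploy/earth-defenders-toolkit"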

services/crawler/Dockerfile (+3)

@@ -1,2 +1,5 @@
 FROM webrecorder/browsertrix-crawler
 COPY ./crawl-config.yml /app/crawl-config.yml
+COPY ./crawl.sh /app/crawl.sh
+COPY ./start.sh /app/start.sh
+ENTRYPOINT ["sh", "/app/start.sh"]
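With the entrypoint baked in, the image can be smoke-tested locally before deploying; a sketch, where the edt-crawler tag and the short one-minute timer are arbitrary choices:

docker build -t edt-crawler services/crawler
docker run --rm \
  -e CRAWL_TIMER=60 \
  -v "$PWD/crawls:/crawls" \
  -v "$PWD/content:/app/content" \
  edt-crawler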

services/crawler/crawl-config.yml (+1 −6)

@@ -1,13 +1,8 @@
 seeds:
-  - url: https://earthdefenderstoolkit.com
-    include: earthdefenderstoolkit.com/(?:\?lang=)?es
-    depth: 2
+  - url: https://www.earthdefenderstoolkit.com
   - url: https://docs.mapeo.app/
-    depth: 1
   - url: https://docs.terrastories.app/
-    depth: 1
   - url: https://docs.earthdefenderstoolkit.com/
-    depth: 1
 combineWARCs: true
 blockRules:
   - url: googleanalytics.com
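The simplified config drops the per-seed depth limits and the Spanish-language include filter, leaving each seed to browsertrix-crawler's defaults. To test the config once outside the compose stack, something like the following should work (invocation modeled on the browsertrix-crawler README, so treat the exact flags as an assumption):

docker run --rm \
  -v "$PWD/services/crawler/crawl-config.yml:/app/crawl-config.yml" \
  -v "$PWD/crawls:/crawls" \
  webrecorder/browsertrix-crawler crawl --config /app/crawl-config.yml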

services/crawler/crawl.sh (+18)

@@ -0,0 +1,18 @@
+#!/usr/bin/env bash
+
+echo "Crawling content!"
+crawl --config /app/crawl-config.yml
+echo "Crawling done"
+count=$(ls -1 /crawls/collections/crawl-*/archive/*.warc.gz 2>/dev/null | wc -l)
+if [ $count != 0 ]
+then
+  echo "Creating folders if don't exist"
+  mkdir -p /app/content/old-websites
+  mkdir -p /app/content/offline-websites/
+  echo "Moving old content"
+  mv /app/content/offline-websites/* /app/content/old-websites/
+  echo "Copying content to content folder"
+  mv /crawls/collections/crawl-*/archive/* /app/content/offline-websites/
+  echo "Moved!"
+fi
+echo "Sleeping for $CRAWL_TIMER seconds..."
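The script keeps exactly one previous generation: each successful crawl shunts whatever is in offline-websites/ into old-websites/ before the fresh WARCs are moved in. If you wanted to retain every crawl instead, a hypothetical variant could stamp each generation with a date:

# Hypothetical variant: keep every crawl under a dated folder instead of one generation
stamp=$(date +%Y-%m-%d)
mkdir -p "/app/content/old-websites/$stamp"
mv /app/content/offline-websites/* "/app/content/old-websites/$stamp/" 2>/dev/null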

services/crawler/start.sh (+5)

@@ -0,0 +1,5 @@
+#!/usr/bin/env bash
+
+echo "Starting Earth Defenders Toolkit crawler"
+# TODO: also run pywb to show what's latest crawl
+while true; do /app/crawl.sh; sleep "$CRAWL_TIMER"; done
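A foreground sleep loop rather than cron keeps the container's main process alive, which is what Docker expects of a long-running service. For the pywb TODO, serving the finished crawls could look roughly like this (pywb usage here is an assumption, not part of this commit):

pip install pywb
cd /crawls            # browsertrix-crawler writes its collections/ directory here
wayback --port 8080   # would line up with the VIRTUAL_PORT 8080 in docker-compose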
