|
| 1 | +# pylint: disable=import-error |
| 2 | +import requests |
| 3 | +import pandas as pd |
| 4 | +from google.cloud import bigquery |
| 5 | + |
| 6 | + |
def extract_domains_from_file(file_path):
    """Parse an EasyList filter file and return the list of domains.

    Useful lines have the form ``||domain^``; this removes the ``||``
    prefix and the trailing ``^`` to recover the bare domain. Comment
    and header lines (starting with ``!``) are skipped.

    Args:
        file_path: Path to the downloaded EasyList text file.

    Returns:
        List of domain strings; empty if the file cannot be read
        (errors are printed, not raised — best-effort by design).
    """
    domains = []
    try:
        with open(file_path, "r") as file:
            for line in file:
                line = line.strip()
                # Skip blanks and EasyList comment/header lines ("! ...")
                # so they don't end up in the output as fake domains.
                if not line or line.startswith("!"):
                    continue
                # Remove the '||' *prefix* and '^' *suffix* exactly once.
                # (lstrip("||")/rstrip("^") would strip character sets —
                # any run of leading pipes / trailing carets.)
                if line.startswith("||"):
                    line = line[2:]
                if line.endswith("^"):
                    line = line[:-1]
                if line:  # Ensure something is left after stripping
                    domains.append(line)
    except FileNotFoundError:
        print(f"Error: The file {file_path} does not exist.")
    except Exception as e:
        print(f"An error occurred: {e}")
    return domains
| 21 | + |
| 22 | + |
def save_domains_to_csv(domains, csv_file_path):
    """Write *domains* to *csv_file_path* as a one-column CSV.

    The output is a single ``Domain`` header row followed by one domain
    per line. Failures are printed rather than raised (best-effort).
    """
    try:
        # Single-column frame: header "Domain", one row per entry,
        # no index column in the output.
        pd.DataFrame(domains, columns=["Domain"]).to_csv(
            csv_file_path, index=False
        )
    except Exception as e:
        print(f"An error occurred while writing to CSV: {e}")
| 31 | + |
| 32 | + |
def upload_csv_to_bigquery(csv_file_path,
                           table_id="httparchive.almanac.easylist_adservers"):
    """Load a local CSV file into a BigQuery table.

    Requires the GOOGLE_APPLICATION_CREDENTIALS environment variable to
    point at a service-account key with write access to the dataset.

    Args:
        csv_file_path: Path of the local CSV file to upload.
        table_id: Fully-qualified destination table
            (``project.dataset.table``). Defaults to the previously
            hard-coded target, so existing callers are unchanged.

    Raises:
        Exception: whatever ``load_job.result()`` surfaces if the load
            job fails (e.g. google.cloud exceptions).
    """
    client = bigquery.Client()

    # CSV input with one header row; let BigQuery infer the schema.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # Adjust if your CSV doesn't have a header row
        autodetect=True,  # Automatically infer schema
    )

    # Stream the file into a load job against the destination table.
    with open(csv_file_path, "rb") as source_file:
        load_job = client.load_table_from_file(
            source_file, table_id, job_config=job_config
        )

    # Block until the job finishes; raises if the load failed.
    load_job.result()
| 53 | + |
| 54 | + |
# URL to the text file containing the EasyList ad-server rules
url = "https://raw.githubusercontent.com/easylist/easylist/master/" \
    "easylist/easylist_adservers.txt"
file_path = "easylist_adservers.txt"
# Path to the output CSV file
csv_file_path = "easylist_adservers.csv"

# Download the file and save it locally. The timeout keeps a stalled
# connection from hanging the script forever, and raise_for_status()
# prevents an HTTP error page from being silently saved and parsed as
# if it were the filter list.
response = requests.get(url, timeout=30)
response.raise_for_status()
with open(file_path, "wb") as file:
    file.write(response.content)

# Extract domains
domains = extract_domains_from_file(file_path)

# Save domains to CSV
save_domains_to_csv(domains, csv_file_path)

# upload domains to BQ
upload_csv_to_bigquery(csv_file_path)

print(f"Domains have been saved to {csv_file_path}")
0 commit comments