HNygard/valgprotokoller

Script package for downloading and parsing 'valgprotokoll'/'møtebok'

The scripts are run with PHP. They run in sequence, and each step writes its output to files.

All PDFs are cached in this Git repo, so steps 2 and 3 do not require any downloads.

The summary pages are here:

The JSON files can be seen here:

Requirements for running

  • php
  • pdftotext (step 1 / step 1.2 only)

Ubuntu:

apt install php-cli poppler-utils

Commands

php 1-valgprotokoll-download.php

  • Reads URLs from urls.txt, downloads the PDFs, and converts them to txt ()

php 1.2-valgprotokoll-elections-no.php

  • Reads PDFs from the elections.no Git repo. The PHP script updates the Git submodule itself (git submodule update --remote elections-no.github.io)

php 2-valgprotokoll-parser.php

  • Parses all txt files generated by step 1 / step 1.2. Outputs JSON.
  • Ignores any files with errors by default. To turn this off, run: php 2-valgprotokoll-parser.php throw

php 3-valgprotokoll-html-report.php

  • Creates HTML from the JSON output of step 2.

Grabbing URLs from Google

  • Search.
  • Open dev tools and run the following:
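   // Assumption: `a` is meant to hold every link element on the results page.
   var a = document.querySelectorAll('a');
   var list = '';
   for (var i = 0; i < a.length; i++) {
       var that = a[i];
       console.log(that);
       // Skip Google-owned hosts and empty links.
       if (
           that.href.indexOf('google.com') === -1
           && that.href.indexOf('google.no') === -1
           && that.href.indexOf('youtube.com') === -1
           && that.href.indexOf('blogger.com') === -1
           && that.href.indexOf('googleusercontent.com') === -1
           && that.href.length > 2) {
           list += "\n" + that.href;
       }
   }
   // Print the collected URLs, one per line.
   console.log(list + "\n");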
  • Browse to next page and redo.
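
If you collect links from several result pages, the same filter may be easier to reuse as a plain function. The following is a minimal sketch, not part of the repo: the name filterResultLinks is hypothetical, and the de-duplication step is an addition that the snippet above does not perform. It keeps the same substring matching, so a URL that merely contains "google.com" anywhere is also dropped.

```javascript
// Hosts to skip; the same list as in the dev-tools snippet.
const EXCLUDED_HOSTS = [
    'google.com',
    'google.no',
    'youtube.com',
    'blogger.com',
    'googleusercontent.com',
];

// Hypothetical helper: returns the hrefs worth keeping, in order, without duplicates.
function filterResultLinks(hrefs) {
    const seen = new Set();
    const kept = [];
    for (const href of hrefs) {
        // Drop empty or near-empty links, as the original length check does.
        if (typeof href !== 'string' || href.length <= 2) continue;
        // Drop links that mention any excluded host (substring match).
        if (EXCLUDED_HOSTS.some((host) => href.indexOf(host) !== -1)) continue;
        // Drop repeats (not done by the original snippet).
        if (seen.has(href)) continue;
        seen.add(href);
        kept.push(href);
    }
    return kept;
}
```

In the browser console it could be called as `console.log(filterResultLinks(Array.from(document.querySelectorAll('a'), (el) => el.href)).join('\n'));`. Pasting the printed lines into urls.txt assumes that file holds one URL per line, which the README does not state.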

About

Scraping of election protocols (valgprotokoller), 2019
