Skip to content

UB-Mannheim/BeTrial

Repository files navigation

betrial-logo


Build Status Python 2.7 license

Overview

Bernoulli trial generator tool for OCR result validation

This repo is part of the Aktienführer-Datenarchiv DFG project. The DFG recommends the Bernoulli trial to validate OCR results. To reduce the amount of effort to perform the test, a "Bernoulli Trial HTML Generator" was designed. This generator work with Abbyy-XML-Files (*.xml) or hocr-Files (*.hocr) and their JPG-image pendant (*.jpg).

Installation

This installation is tested with Ubuntu and we expect that it should work for other similar environments similarly.

1. Requirements

  • Python 2.7

2. Copy this repository

git clone https://github.com/UB-Mannheim/BeTrial.git
cd BeTrial

3. Dependencies can be installed into a Python Virtual Environment:

$ virtualenv betrial_venv/
$ source betrial_venv/bin/activate
$ pip install -r requirements.txt

Process steps

The whole projects has four major steps:

Loading files from web

Load the files from the web ("filegetter.py").

$ python ./filegetter.py (+ parameters)

Creating a dataset

Create a set of files for the Bernoulli-Trials ("betrialgen.py")

$ python ./betrialgen.py -p /input/dir/*.xml or *.hocr (+ parameters)

Creating a html page with csv export

Create an interactive Bernoulli-Trial html ("betrial.py").

$ python ./betrial.py /betrial/input/dir/*.png (+ parameters)

Evaluating the results

The betrial_eval html page helps to evaluate our results.

> firefox betrial_eval.html

Testcase

Creating a dataset

$ python ./betrialgen.py

Creating the html-page

$ python ./betrial.py ./test/BeTrial/input/*.png

The validation page can be opened with firefox

> firefox out.html

example-page

You see the images of all the text lines from the dataset and below each line there is the recognized text. One of the character is marked with a red rectangle. This character should be manually validated. Therefore you can select one of the radio buttons below. For example above you should check the letter j in the word just. Which is Ok.

The button Counts displays an overview, about the current validation status.

example-count

The results can be stored with Export Gesamtergebnis and Export Einzelergenisse in csv files.

To evaluate the results, open the betrial_evals.html file:

> firefox betrial_eval.html

This page helps to calculate the cumulative distribution function (cdf).

example-count

And the necessary amount of successful events to proof the predicted accuracy considering a certain error-probability.

example-count

Copyright and License

Copyright (c) 2019 Universitätsbibliothek Mannheim

Author:

BeTrial is Free Software. You may use it under the terms of the Apache 2.0 License. See LICENSE for details.

Acknowledgements

The tools are depending on some third party libraries:

  • ocropy is a collection of document analysis programs. One of them is ocropus-gtedit which builds the basis of the "betrial.py" source code. ocropus-gtedit produces an editable html-page, where you can see the images of all the text lines and below each line the recognized text. The recognized text can be updated to produce ground truth data.
  • Export2CSV export the data to csv.
  • Calculating Binom the calculation in the evaluation page are based on the implementation of Terry Ritter.

About

Bernoulli trial generator to validate OCR results

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published