Summer 2020 Outreachy Project Plan

Table of Contents

Objective

Timeline

Project Ranking

Fathom Project Ideas

Appendix

Objective

As an Outreachy alumna, Bianca would like to mentor an Outreachy intern next summer, May - August 2020.

Timeline

Based on Mozilla’s internal Outreachy documentation and the Summer 2019 timeline. Dates to be updated as the internship period draws closer.

Project Ranking

Questions to answer for each project, each scored 0-10. Scores are listed as B / E / D.

| Project | Can we hold off? | Will it fit? | Hard enough? | Variety of work? | Impact | Summary |
|---|---|---|---|---|---|---|
| Computer Vision Prototype | 10 / 10 / 10 | 6 / 2 / 5 | 10 / 10 / 10 | 6 / 9 / 10 | 10 / dunno / 0 | 7 / 7 / 7 |
| Fathom Labeling Server† | 7 / 6 / 4 | 8 / 8 / 10 | 8 / 9 / 10 | 10 / 10 / 10 | 10 / 8 / 9 | 9 / 8 / 8 |
| Improve the Evaluator Tool | 10 / 4 / 3 | 10 / 10 / 10 | 4 / 4 / 3 | 5 / 5 / 3 | 8 / 7 / 10 | 7 / 8 / 6 |
| Make an API around Fathom or the Vectorizer | 10 / 9 / 10 | 3 / 10 / 10 | 8 / 6 / 8 | 7 / 5 / 5 | 5 / 5 / 2 | 5 / 4 / 7 |
| Move FathomFox Corpus Collector UI into the Inspector Pane | 10 / 10 / 10 | 10 / 10 / 10 | 2 / 2 / 2 | 4 / 3 / 3 | 4 / 2 / 2 | 6 / 2 / 6 |
| Develop Fathom as an Active Learning System* | 10 / 8 / 10 | 2 / 6 / 0 | 10 / 10 / 10 | 10 / 10 / 10 | 7 / 9 / 10 | 6… / 5 / 8 |

† Notes: UI work could be repurposed for the active-learning system later.

* Out of scope as an entire project. We could start with an existing ruleset and see whether we can get better accuracy than our current optimizer, as this would mean we may be able to achieve the same accuracy by labeling fewer samples.

CV:

Labeling Server:

Evaluator Tool:

Projects:

Fathom Project Ideas

Computer Vision prototype for Fathom

Project summary

Project breakdown

CV portion
Fathom/Browser tasks

Advantages

Disadvantages

Background information

ScrapingHub has tried to do this. They use an ensemble model: several different models whose outputs are combined to decide the final answer, and one of those models is computer-vision-based.
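To make the ensemble idea concrete, here is a minimal majority-vote sketch. The detector names and element fields are hypothetical; ScrapingHub's actual architecture is not described here beyond "several models combined."

```python
# A minimal majority-vote ensemble. Each detector answers True/False for
# "is this element a price?"; the ensemble takes a simple majority.
# All detector names and element fields below are made up for illustration.

def dom_detector(element):       # stands in for a DOM/ruleset-based model
    return "price" in element.get("class", "")

def text_detector(element):      # stands in for a text-pattern model
    return element.get("text", "").startswith("$")

def vision_detector(element):    # stands in for a computer-vision model
    return element.get("looks_like_price", False)

DETECTORS = [dom_detector, text_detector, vision_detector]

def ensemble_predict(element):
    votes = sum(d(element) for d in DETECTORS)
    return votes > len(DETECTORS) / 2   # strict majority wins

sample = {"class": "price bold", "text": "$19.99", "looks_like_price": False}
print(ensemble_predict(sample))  # two of three detectors vote yes -> True
```

Real ensembles often weight the votes or average model confidences instead of counting booleans, but the mixing step is the same shape.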

Fathom Labeling Server

Project summary

Project breakdown

Design and benchmarking
Build MVP
Evaluate Prototype

Advantages

Disadvantages

Background information

Improve the Evaluator tool of FathomFox

Project summary

Make a tool that visually indicates all Fathom `types` found on a page for a ruleset and whether they are correct. See FathomFox #39 for more information.

Project breakdown

Advantages

Disadvantages

Background information

Make an API around Fathom or the Vectorizer

Project summary

This is the next iteration of the `fathom-vectorize` utility that removes FathomFox’s Vectorizer page’s UI from the equation.

Project breakdown

Advantages

Disadvantages

Background information

Move FathomFox Corpus Collector UI into the inspector pane

Project summary

Project breakdown

Advantages

Disadvantages

Background information

Develop Fathom as an Active Learning System

Project summary

Project breakdown

1. Collect a corpus of labeled samples.
   - See Fathom Labeling Server, which would help with this step.
   - If there is no automatic ruleset generation, write a ruleset for this corpus. This step is not normally part of an Active Learning approach.
2. Train a model on the labeled data.
3. The application starts showing the user unlabeled samples from our set of samples.
   1. The application runs the model on each one and, on uncertain ones, asks the user: did I get it right? This is the Active Learning part.
   2. If the model is certain of its labeling, the page may be fed to an autolearning pipeline to further tune the coefficients. (Autolearning: train on this page as if the human had labeled it.)
   3. If the model is uncertain, the user labels the page correctly. If there is no automatic ruleset generation, then either the rules are fine but we haven’t fed in enough samples for the coefficients to settle, or we need new or different rules.
4. The model is retrained periodically using the pre-labeled samples plus all the samples the application has since shown to the user. (Open question: after how many samples shown to the user do we retrain?)
5. Repeat the labeling and retraining steps (3 and 4) until we reach the desired accuracy threshold.

Advantages

Disadvantages

Background information

TODO

Appendix

Agenda/Changelog

Additional Resources

Good first bugs:

```

fathom-test ../vectors/vectors_test_article.json '{"coeffs": [["hasEnoughParagraphs", -1.0311405658721924],["hasExactlyOneArticleElement", -1.2359271049499512],["paragraphElementsHaveSiblingsWithSameTagName", 12.159211158752441],["mostParagraphElementsAreHorizontallyAligned", 0.5681423544883728],["moreParagraphElementsThanListItemsOrTableRows", -2.6533799171447754],["headerElementIsSiblingToParagraphElements", 12.294110298156738],["hasMultipleArticleElements", -3.300487756729126],["hasMultipleParagraphsWhoseClassNameIncludesArticle", 0.26676997542381287]], "bias": -2.676619291305542}'

```
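As I understand Fathom’s trained rulesets, the coefficients and bias passed to `fathom-test` above parameterize a logistic model: each rule’s output is weighted by its coefficient, summed together with the bias, and passed through a sigmoid to yield a confidence. A rough sketch of that arithmetic, using the coefficients from the command above (the feature values are made up; real ones come from running the ruleset against a page’s DOM):

```python
import math

# Coefficients and bias copied from the fathom-test invocation above.
coeffs = {
    "hasEnoughParagraphs": -1.0311405658721924,
    "hasExactlyOneArticleElement": -1.2359271049499512,
    "paragraphElementsHaveSiblingsWithSameTagName": 12.159211158752441,
    "mostParagraphElementsAreHorizontallyAligned": 0.5681423544883728,
    "moreParagraphElementsThanListItemsOrTableRows": -2.6533799171447754,
    "headerElementIsSiblingToParagraphElements": 12.294110298156738,
    "hasMultipleArticleElements": -3.300487756729126,
    "hasMultipleParagraphsWhoseClassNameIncludesArticle": 0.26676997542381287,
}
bias = -2.676619291305542

# Hypothetical rule outputs for one page: pretend every rule fired with 1.0.
features = {name: 1.0 for name in coeffs}

# Weighted sum plus bias, squashed through a sigmoid -> confidence in (0, 1).
score = sum(coeffs[name] * features[name] for name in coeffs) + bias
confidence = 1 / (1 + math.exp(-score))
```

With every rule firing, the two large positive coefficients dominate and the confidence lands very close to 1; flipping those two rules off would swing it the other way, which is a quick way to see which rules carry the ruleset.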