Summer 2020 Outreachy Project Plan
Table of Contents
Objective
Timeline
Project Ranking
Fathom Project Ideas
Appendix
- Agenda/Changelog: Meeting notes and high-level changes to this document over time.
Objective
As an Outreachy alumna, Bianca would like to mentor an Outreachy intern next summer, May - August 2020.
Timeline
Based on Mozilla’s internal Outreachy documentation and the Summer 2019 timeline. Dates to be updated as internship period draws closer.
- Request manager/team support
- Brainstorm project ideas in advance
- Find a co-mentor
- Per the Outreachy page/@dustin: mentor project submission deadline is Feb. 25, 2020 at 4pm UTC
Project Ranking
Questions to answer for each project:
- Can we hold off?
- Will it fit?
- Hard enough?
- Variety of work?
- Impact?
* Out of scope as an entire project. We could start with an existing ruleset and see if we can get better accuracy than our current optimizer, as this would mean we may be able to achieve the same accuracy by labeling fewer samples.
CV:
- It’s great to keep the intern off the main line, but it’s also good to have them do important things.
- Daniel could mentor this, but he’s on paternity leave for at least part of this.
- Maybe we can work on this idea another time.
Labeling Server:
- A way to invite random people to label things.
- Useful for Mechanical-Turking the work out, letting random non-techies help label.
- Maybe Save Page sends the page to a server instead of saving it locally.
- If we wanted to do cross-referencing, we may need to rethink the `data-fathom` attribute to support multiple people labeling the same page.
- Freezing pages that people are logged in to could be a privacy issue, but we could do something like Containers.
- First milestone: come up with a UI (ideally not a web extension, but…) for picking an element. Don’t think about freezing, sending to a remote server, privacy implications, etc.
Evaluator Tool:
- Much easier than Labeling Server.
- Maybe we tack this onto labeling. Pitch the Outreachy project as improving the Labeling UI, with this work preceding the Labeling Server. And if the team ends up doing the evaluator work before the intern shows up, fine; they can start on the rest.
Fathom Project Ideas
Computer Vision prototype for Fathom
Project summary
- Prototype a computer vision approach to Fathom. Can we do just as well, if not better, using a single screenshot of a page as a sample instead of freezing an entire HTML document and all of its resources?
Project breakdown
CV portion
Identify potential transfer learning models
- Do some research into different existing models (per Erik, https://www.basilica.ai/blog/# has one, though it may not be public) to find one or a few candidates for transfer learning. E.g. Google has a model zoo of image classification models trained on a couple different datasets; the easiest thing is to grab one of those. We could also look for a model someone has already trained to do almost exactly this (e.g. a model trained on screenshots of HTML pages).
- Try to do transfer learning on that model.
- The bulk of the NN is learning the general features of the dataset. ImageNet is everyday pictures and objects. Those pictures have different general features than pictures of webpages, but there’s likely enough overlap in the low-level features for transfer learning to work.
Collect samples and mark them up
- The standard for an image classification model is that the inputs for a labeled image are the raw image and bounding boxes for the different features.
- This person would have to produce those bounding boxes; there are online tools for that. I used one a couple of years ago.
- Depending on someone’s familiarity with Python, would take a little time; hopefully not more than a week.
- There are probably Jupyter notebooks in various online courses that do exactly this. Even the TensorFlow and PyTorch websites (probably) have a section on transfer learning that enumerates all the steps; you do some work to put it together.
- If the model doesn’t perform well, you probably need more data, or you need to retrain more of the neural net’s layers than just the last one. The first thing to try is resetting only the last layer; beyond that, you can reset the weights (not necessarily the structure) of more of the final layers, which teaches the model different, higher-level features.
- NN layers: Starting at the bottom and moving up, the features get higher and higher level. Lower levels detect things like edges and diagonal lines; for something trained on ImageNet, the last layers detect animal-ish or vehicle-ish features.
- If you reset the weights on more layers, the model can learn different high-level features (see the sketch below).
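To make the retraining step concrete, here is a minimal transfer-learning sketch in TensorFlow.js (the JS route is discussed under browser integration below). The model URL and layer name come from the public MobileNet v1 examples and are illustrative only; the real choice depends on the model-zoo research above.
```
// Minimal transfer-learning sketch with TensorFlow.js. Model URL and
// layer name are illustrative, not a final choice.
import * as tf from '@tensorflow/tfjs';

async function buildTransferModel() {
  // Load a pretrained image classifier (MobileNet v1 hosted by the TF.js team).
  const base = await tf.loadLayersModel(
    'https://storage.googleapis.com/tfjs-models/tfjs/mobilenet_v1_0.25_224/model.json');

  // Keep everything up to a late feature layer and freeze those weights.
  const featureLayer = base.getLayer('conv_pw_13_relu');
  const features = tf.model({inputs: base.inputs, outputs: featureLayer.output});
  features.layers.forEach((layer) => { layer.trainable = false; });

  // Replace the old classification head with a fresh one for our
  // (hypothetical) classes, e.g. "price", "image", "button", "other".
  const head = tf.sequential({
    layers: [
      tf.layers.flatten({inputShape: featureLayer.outputShape.slice(1)}),
      tf.layers.dense({units: 64, activation: 'relu'}),
      tf.layers.dense({units: 4, activation: 'softmax'}),
    ],
  });
  head.compile({optimizer: tf.train.adam(1e-4), loss: 'categoricalCrossentropy'});
  // Training: head.fit(features.predict(screenshots), labels, ...)
  return {features, head};
}
```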
Fathom/Browser tasks
Adapt Fathom to use what comes out of the neural network.
Integrate model into browser.
- Executing a neural network in the browser: I’ve never done this. TensorFlow has a JS library (TensorFlow.js), and PyTorch models can be exported to ONNX and run with ONNX.js (see the sketch below).
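A rough sketch of what in-browser inference might look like with TensorFlow.js, assuming we trained and hosted our own model; the URL and label list are hypothetical.
```
// Sketch of in-browser inference. The model URL and LABELS are made up.
import * as tf from '@tensorflow/tfjs';

const LABELS = ['price', 'image', 'button', 'other'];

async function classifyScreenshot(canvas) {
  const model = await tf.loadLayersModel('https://example.com/fathom-cv/model.json');
  const input = tf.browser.fromPixels(canvas)   // screenshot as a tensor
    .resizeBilinear([224, 224])                 // match the model's input size
    .div(255)                                   // normalize to [0, 1]
    .expandDims(0);                             // add the batch dimension
  const scores = await model.predict(input).data();
  input.dispose();
  return LABELS[scores.indexOf(Math.max(...scores))];
}
```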
Advantages
- Can move analysis off the main thread
- Corpus collection easier
- Sample pages are screenshots rather than gigantic frozen HTML pages with subresources.
- Maybe we don’t have to put the corpora in private repos?
Disadvantages
- Reading text is slower because it requires OCR (per Erik: ScrapingHub doesn’t do OCR in their pop-ups model), but text recognition in images is pretty good.
- E.g. Google Translate with Camera
- We also don’t have metadata for an image.
- Page categorization is kind of tough, because you’re trying to see if the pages in a particular category look similar. But sub-items like images, prices, buttons, paragraphs, etc. would work.
- But CV could help you find rating stars, shopping cart symbols.
- Can we crank it out in 3 months? Probably not, unless we had a CV resource or the candidate had CV experience.
- Transfer learning: take a neural network that’s already trained to do something (e.g. a CV model trained to classify some set of images), remove the last layer of the network, and replace it with your own layer.
- Say the model was trained to classify 10 different objects or types of images; its last layer would have 10 nodes. You train just that replacement layer.
- The bulk of a NN just learns general features of an image; the last layer maps those features to outcomes.
- Candidate needs experience maybe with CV and ML, or we find someone with CV and ML experience who can help mentor them. Daniel has CV experience.
- Daniel and Chrissy might stagger their parental leave.
- He’ll definitely be there for some of it, but whether that’s the beginning or not is uncertain.
- Maybe Ian can mentor async in Daniel’s absence? Kind of weird.
Background information
ScrapingHub has tried to do this. They’re doing an ensemble: a bunch of different models whose outputs are combined to decide the real answer. One of those models is a computer vision one.
Fathom Labeling Server
Project summary
- Make labeling sample pages “mechanical turkable” AKA easy for anyone to do.
- It may be useful to split this problem into two phases: identification (IDing the element) and extraction (how to actually get the src value or the price value).
- E.g. say there are 3 wrapping elements around the root price element. They’d each be acceptable.
Project breakdown
Design and benchmarking
- What other projects at Mozilla crowdsource data, and what methods do they use?
- Where do the pages come from?
- Related work: Spider/web crawler to give us these pages.
- Where are they stored?
- How do we open the pages?
- How/where do we save the page?
- What kinds of `type`s do we need to support? E.g.:
- Visible text like price
- Non-visible attributes like `<img>` src (though the context menu has an option to copy the image location)
- Elements like the form element for login form
- This would be the most all-encompassing.
- How do we disambiguate which element they’re pointing to?
- This is something we get from DevTools for free. Or do we need the element at all?
- Using `elementsFromPoint`, we could show the list of elements at that point, visually highlight them, and then let the user select from that list whichever best encapsulates the type (see the sketch after this list).
- For a text input form field: could they type a specific word or some crazy string into the field, and then we find the element that contains that crazy string?
- Other: We can figure out specialized ways of getting those types out.
- How would messaging work in Fission between the main frame of the app and the samples loaded in an iframe?
- Can we ensure labeling is consistent between samples?
- How do labeled samples get uploaded and shared?
- Identify MVP and post-MVP milestones.
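A minimal sketch of the `elementsFromPoint` disambiguation idea from the list above; the highlighting style and UI hand-off are placeholders.
```
// Sketch: list every element under the click and let the labeler pick the
// one that best encapsulates the type.
function candidatesAt(clickEvent) {
  const elements = document.elementsFromPoint(clickEvent.clientX, clickEvent.clientY);
  elements.forEach((el, i) => {
    // Visually highlight each candidate; outline colors are arbitrary.
    el.style.outline = `2px dashed hsl(${(i * 60) % 360}, 90%, 45%)`;
  });
  // Hand the list to the UI (e.g. a sidebar) for the user to choose from.
  return elements;
}

document.addEventListener('click', (e) => console.log(candidatesAt(e)));
```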
Build MVP
- Make a web app that: (Note: just ideas; these can change)
- Prompts the user to select the most appropriate element on the page, providing them with a rubric for the desired label.
- Given a list of URLs to visit, loads the first page in an iframe.
- Instructs the user to click an element on the page to label
- The selected element(s) will be listed (possibly in a sidebar).
- Note: is there a limit to the sidebar width? Erik encountered this before.
- If more than one element is located at that point, display a list of all the elements and allow the user to select a single element from the list.
- When the "Label" button is pressed, the app could, say (see the sketch after this list):
- modify the page in the iframe to add the corresponding `data-fathom` attribute to the element
- Freeze the labeled page
- Download the page
- If labeling is successful, the app could load the next page from the list of URLs into the iframe and reset the UI.
- If labeling is not successful, the app will display the corresponding error message.
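A rough sketch of that "Label" button flow. `freezePage` is a placeholder for whatever freezing approach we settle on (e.g. the freeze-dry library FathomFox uses).
```
// Sketch of the "Label" step from the MVP ideas above.
async function labelAndSave(selectedElement, label, iframeDoc) {
  // 1. Mark the chosen element so training tools can find it later.
  selectedElement.setAttribute('data-fathom', label);

  // 2. Freeze the labeled page into a single self-contained HTML string.
  //    freezePage is a placeholder, not a real API.
  const html = await freezePage(iframeDoc);

  // 3. Trigger a download (post-MVP: upload to a server instead).
  const url = URL.createObjectURL(new Blob([html], {type: 'text/html'}));
  const a = Object.assign(document.createElement('a'),
                          {href: url, download: `${label}-sample.html`});
  a.click();
  URL.revokeObjectURL(url);
}
```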
Evaluate Prototype
- Test the prototype on as many pages and for as many classifications as possible.
- For a given classification and label, have as many different people label the same unlabeled sample pages as possible and compare the results.
- Document all the different types of errors encountered and list ideas for how to resolve or mitigate them.
- Time permitting, iterate on this prototype and begin implementing post-MVP features.
Advantages
- This seems easily reachable in 3 months.
- May be more of a research problem: a longer design phase and shorter development compared to the Improve the Evaluator Tool project.
- This could potentially subsume the Evaluator.
Disadvantages
- Handling the case where we’re NOT looking for a simple string value (e.g. a form element) is a tricky design challenge.
- Moving around large sample files for sharing may be challenging (Where do we store the labeled samples after they’re done?).
- Post-MVP: Host samples on a server?
Background information
- Osmose’s Fathom Training Server.
- Per Erik, Mint.com published an add-on (a while ago) that scrapes banking pages and asks the user to click the “Account” button, etc.
Improve the Evaluator tool of FathomFox
Project summary
Make a tool that visually indicates all Fathom `type`s found on a page for a ruleset and whether they are correct. See FathomFox #39 for more information.
Project breakdown
- Create a design spec/mockup (document-driven development, etc.)
- What are the inputs to the tool?
- What is the output of the tool?
- What are the parameters the user should be able to adjust?
- Scope what the requirements for an MVP are.
- Identify post-MVP features.
- Try out the mockup before writing any code
- Build an MVP (ideas below)
- Read in the provided Fathom ruleset
- Extract the `type`s from the ruleset
- Figure out an activation mode: does it automatically run the ruleset on all the pages in the window? Does it turn on (and off) globally, labeling any new page loaded while on?
- Compare the target element(s) with the elements found by the ruleset. Indicate which elements are correctly and incorrectly identified (see the sketch after this list).
- User testing (even before implementing via mockups or paper prototyping)
- Document results and make recommendations
- Iterate on prototype time permitting
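A sketch of the compare-and-indicate step, assuming the usual Fathom pattern of running a bound ruleset against a document and reading fnodes back out; details of the real API may differ.
```
// Sketch: outline correct hits in green, misses and false positives in red.
// Assumes the ruleset has an out(...) rule for the given Fathom type.
function evaluatePage(rules, fathomType, doc = document) {
  const found = rules.against(doc).get(fathomType).map((fnode) => fnode.element);
  const targets = [...doc.querySelectorAll(`[data-fathom="${fathomType}"]`)];

  for (const el of new Set([...found, ...targets])) {
    const hit = found.includes(el) && targets.includes(el);
    el.style.outline = hit ? '3px solid green' : '3px solid red';
  }
}
```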
Advantages
- This is a nice end-to-end one. There’s design work, there’s UI, there’s implementation.
- There’s prior art for the labeling, see Background Information
- Seems like one of the biggest bangs for the buck of all the projects.
Disadvantages
- Is this too easy? It seems like it could be done in less than three months, but maybe that’s a good thing.
- If the intern finishes early, they can pick up general bugs.
Background information
- FathomFox #39
- Prior art: Fathom Smoot Demo - labeling entire pages
Make an API around Fathom or the Vectorizer
Project summary
This is the next iteration of the `fathom-vectorize` utility that removes FathomFox’s Vectorizer page’s UI from the equation.
Project breakdown
- Enumerate all desired inputs and options (referencing the existing page) that we want to make available to the `fathom-vectorize` CLI.
- E.g. file with list of relative paths to sample pages
- Can we vectorize without FathomFox at all? Do we want to? We somehow still need to execute the ruleset on each page, so how would we do that without an extension? [Probably bundle FathomFox with fathom-vectorize so people don’t have to think about it.]
- How do we get messages (e.g. errors we previously displayed on the Vectorizer page) back to the Python scripts in `fathom-vectorize`? Can we pipe the in-page Console output to stdout? Maybe send it down some socket?
- Could a background script bundle useful log output from all tabs and include it as part of `vectors.json` or a supplementary JSON file? (See the sketch after this list.)
- Modify `fathom-vectorize`
- Details depend on answering the above Design questions.
- Modify FathomFox to remove the Vectorizer UI -- or can we do without FathomFox entirely?
- Details depend on answering the above Design questions.
- User Testing
- Document results and make recommendations
- Iterate on prototype time permitting
- Add post-MVP milestones time permitting
- Like making it automatically re-run when rulesets or samples are changed
- And maybe don’t re-vectorize pages that haven’t changed (if the ruleset also hasn’t changed, that is)
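One possible answer to the messaging question above, sketched with standard WebExtension messaging; the message shape is made up for illustration.
```
// Content script: forward console.error output to the background script.
const realError = console.error;
console.error = (...args) => {
  browser.runtime.sendMessage({kind: 'log', level: 'error', args: args.map(String)});
  realError(...args);
};

// Background script: bundle logs per tab for inclusion in (or alongside)
// vectors.json.
const logsByTab = new Map();
browser.runtime.onMessage.addListener((msg, sender) => {
  if (msg.kind === 'log') {
    const logs = logsByTab.get(sender.tab.id) || [];
    logs.push(msg);
    logsByTab.set(sender.tab.id, logs);
  }
});
```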
Advantages
- They may also be able to make the `fathom-vectorize` tool less experimental.
- This seems achievable in 3 months.
- Bianca can help mentor here.
Disadvantages
- This requires `fathom-vectorize` to land.
- This doesn’t seem like a tremendous improvement over what we have so far, though it does bring us closer to making the Vectorizer a purely command-line tool.
Background information
- Maybe it’s just a different add-on where the content script runs the ruleset against a page and sends a JSON object to a local server, and the local server saves that JSON object to a file (see the sketch below).
- In other words: I have the ability to open a webpage in a browser. I have a browser. I have code I want to run that will return some JSON. Can I load a page, run the JavaScript on that page, and just return that JSON or send it somewhere else?
- If we move vectorization to a server instead of our machines, that would be the thing that would obsolete `fathom-vectorize`; though we’d probably still use the code.
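A sketch of that add-on idea; the port, route, and the `runRulesetAndExtractVectors` helper are hypothetical.
```
// Sketch: run the ruleset in a content script and POST the vectors to a
// local server, which would just append the JSON to a file.
async function sendVectors(rules, pageUrl) {
  const vectors = runRulesetAndExtractVectors(rules);  // placeholder helper
  await fetch('http://localhost:8000/vectors', {       // hypothetical endpoint
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({url: pageUrl, vectors}),
  });
}
```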
Move FathomFox Corpus Collector UI into the inspector pane
Project summary
Move the FathomFox Corpus Collector UI out of its standalone DevTools panel and into a side panel of the DevTools inspector.
Project breakdown
- Identify inputs and outputs to the panel, referring to the existing FathomFox DevTools panel. Is there anything we don’t need, or that we would like to change or add?
- Make a mock up of the UX/UI for this new side panel.
- Use the `devtools` API to create the panel (see the sketch after this list).
- User Testing
- Document results and make recommendations
- Iterate on prototype time permitting
- Add post-MVP milestones time permitting
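A minimal sketch of creating the side panel with the WebExtensions `devtools` API (run from a devtools page declared in `manifest.json`); the panel page is ours to write, and its name here is arbitrary.
```
// Sketch: create an inspector side panel and react to selection changes.
browser.devtools.panels.elements.createSidebarPane('Fathom').then((pane) => {
  // Load the Corpus Collector UI into the pane.
  pane.setPage('corpus-collector.html');

  // $0 is the element currently selected in the inspector.
  browser.devtools.panels.elements.onSelectionChanged.addListener(() => {
    browser.devtools.inspectedWindow.eval('$0.outerHTML')
      .then(([html]) => console.log(html));
  });
});
```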
Advantages
- Bianca can help mentor this project.
Disadvantages
- May not be a 3 month project.
- I think ultimately, we’d like to move away from requiring knowledge and use of the DevTools as part of the labeling process. The Fathom Labeling Server project may subsume this project.
Background information
Develop Fathom as an Active Learning System
Project summary
- With the support of a set of samples that are representative of the web, create an application that trains models through active learning.
- In a meeting 12/09/19: We determined that Improve the Evaluator Tool of FathomFox or the Fathom Labeling Server projects would be the best/most appropriate projects to come out of this idea.
- E.g. Provide a UI for the user to label samples, i.e. the Fathom Labeling Server.
- E.g. We could add model evaluation to the Fathom Labeling Server.
Project breakdown
- Collect a corpus of labeled samples
- See Fathom Labeling Server, which would help with this step.
- If no automatic ruleset generation, write a ruleset for this corpus. This step is not normally a part of an Active Learning approach.
- Train a model on the labeled data
- This application would start showing the user unlabeled samples from our set of samples.
- The application runs the model on each one and, on uncertain ones, asks the user: did I get it right? This is the Active Learning part.
- If the model is certain of its labeling, the page may be fed to an autolearning pipeline to further tune the coefficients.
- (autolearning: I will train on this page as if the human had labeled it)
- If the model is uncertain, the user would correctly label the page. Without automatic ruleset generation, either the rules are right but you haven’t fed in enough samples for the coefficients to converge, or you need new or different rules.
- The model would be retrained periodically using the pre-labeled samples and all the samples that the application has since shown to the user (see the sketch after this list).
- After how many samples shown to the user do you retrain?
- Repeat this process (steps 4 and 5) until we have reached the desired accuracy threshold.
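A schematic of the loop above. Every helper here (`trainModel`, `predictWithConfidence`, `askHumanToLabel`) is hypothetical, and the retrain interval and uncertainty band are the open questions noted above.
```
// Schematic of the active-learning loop; all helpers are hypothetical.
const RETRAIN_EVERY = 50;      // open question: how often to retrain?
const UNCERTAIN_BAND = 0.15;   // |confidence - 0.5| below this => ask a human

async function activeLearningLoop(labeled, unlabeled) {
  let model = trainModel(labeled);
  let shownSinceRetrain = 0;

  for (const sample of unlabeled) {
    const {label, confidence} = predictWithConfidence(model, sample);
    if (Math.abs(confidence - 0.5) < UNCERTAIN_BAND) {
      labeled.push(await askHumanToLabel(sample));   // uncertain: ask the user
    } else {
      labeled.push({sample, label});                 // confident: autolearn
    }
    if (++shownSinceRetrain >= RETRAIN_EVERY) {
      model = trainModel(labeled);                   // periodic retrain
      shownSinceRetrain = 0;
    }
  }
  return model;
}
```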
Advantages
- Really, really powerful and awesome.
- For example, this could reduce the number of samples we need to label up front.
- Other things
Disadvantages
- Based on the high level process described above, this project would be difficult to complete without first having automatic ruleset generation implemented.
- Because at the active learning step, when you evaluate your model against some unlabeled samples and the model gets one wrong, you don’t know whether you just need more samples or you need to modify the rules. If you have to write your own rules and evaluate them iteratively after each active learning step, you’ll never know whether the page you got wrong would now be right with the changes, since it will never be shown to you again.
- Right now, you iterate on your rules until you do a good job on a static training set. In the active learning approach, you never look at the same sample twice (if you get a sample wrong, you’re never shown that sample again). The active learning approach assumes you already have the correct set of rules (or, in an ML context, the right network structure), so your only problem is that you haven’t shown the model enough data yet. In the Fathom sense, you’re just training the coefficients; in an ML context, you’re training the weights and biases of your nodes.
- Erik: Let’s not bet someone’s internship experience on automatic ruleset generation.
- Requires dataset that is representative of the web (A large, currently unsolved problem).
Background information
- Active learning: a type of machine learning in which the model picks the unlabeled samples it is least certain about and asks a human to label them.
- In the Fathom world, a model is a ruleset and a set of coefficients and bias. You can take the same set of rules in Fathom or same set of nodes and features in a neural network and train them for different things. Those are different models.
- So a collection of rules that has not been trained would not be a model.
- What do we mean by automatic ruleset generation?
- Automatic ruleset generation, the way Erik is doing it: he generates some large number of features (e.g. 2500 rules). It could look at every single element (so no `dom` calls) and, for each element, run these 2500 rules. The hope is that these 2500 rules encompass everything there is to care about for interpreting an HTML page.
- How does he generate them? He has a working list of rules that are generally used in all rulesets (see bottom of doc).
- Why does this project require automatic ruleset generation?
- Without automatic ruleset generation, you can still follow the same process, but you’d have to write a ruleset first. And then you’d have to evaluate your rules (write new rules, get rid of some, etc.) between every iteration of steps 4 and 5.
- If you have the same rules, but you’re feeding in more and more samples, all that will do is generalize your coefficients. With automatic ruleset generation, while we’re not adding more rules, the assumption is that some of those rules won’t be important. They will have very small coefficients, so they will effectively not exist. But if you add new samples, perhaps some of those coefficients will change, and some of those rules will become important.
- You can prune the rules after a while with a specialized model.
- By specialized model, I mean the model for e.g. “articles” or for “shopping”. The model includes the coefficients and biases.
- This one ruleset can be trained to identify different entities (or `type`s in Fathom) by changing the coefficients. So “article” versus “shopping” models would share the same rules but have different coefficients.
- For a trained ruleset, some of those coefficients will be 0 or close to 0. You could, if you wanted, remove those rules, since they don’t help you identify your type and they cost compute cycles (see the pruning sketch at the end of this section).
- Different active learning approaches:
- Daniel: RNNs on graphs: there’s a research article showing this working. An HTML page could be seen as a graph. You still have to get features out of each node, which falls back into what Erik is doing.
- Erik: Feed an HTML node into the network. We should be able to train its values or weights if we give it some tasks. Maybe train on the task of aria-labeling all checkboxes, then use checkboxes that already have aria-labels as the ground truth.
- RNNs that are general need a lot of training data.
- This way we may be able to have an agent crawl the web, do transfer learning, combine it with active learning (where it asks us: is this right?).
- Daniel: There are a bunch of different node types for a layer in an RNN. For the most part, they’re supposed to work on a forward feed of sequential nodes. There may be signal in the graph structure of the document; it may be interesting to see whether the node types developed by the Stanford team that worked on HTML documents or sentiment analysis would be useful here.
- Bianca: RNN is NOT what we’re doing right now. We’re doing supervised learning.
- Erik: Correct. We have a fixed-size input (e.g. 3 features). When you have a variable-size input, e.g. a whole webpage, you need some way of supporting that. An RNN deals with variable-size input: it looks at the first 5 things, the second 5 things, etc., looking ahead, looking behind, piping things around in different ways.
- Daniel: We’ll still need to turn each node into a vector of features.
- Maybe we have a database of pages (unlabeled, or maybe with some labels, but different labels). Someone says: I want to recognize prices. That person would get X random pages, manually mark them up, and that would serve as the basis for the model. They’d use that to train some model, then start running it on other pages. This would reduce the amount of labeling someone has to do.
- Maintaining a corpus of relevant pages: a tool for developers as they iterate on a ruleset. But you wouldn’t be writing rulesets; you’d be training some sort of neural network (technically the neural network is like a ruleset). Once you have that, you can use it to make a model of anything you want: when you want a new model, you’re shown some random pages, you label those, the trainer trains a model on them and starts showing you test-set results, and you correct it. It builds up the training set that way.
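To make the pruning idea above concrete, a small sketch. The model shape (`coeffs` as [name, value] pairs plus a `bias`) follows the `fathom-test` example at the bottom of this doc; the threshold is arbitrary.
```
// Sketch: drop rules whose trained coefficients are near zero, since they
// barely affect the score but still cost compute cycles.
function pruneRules(model, epsilon = 0.05) {
  const kept = model.coeffs.filter(([name, value]) => Math.abs(value) >= epsilon);
  const dropped = model.coeffs.length - kept.length;
  console.log(`Pruned ${dropped} rule(s) with |coefficient| < ${epsilon}`);
  return {...model, coeffs: kept};
}
```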
TODO
- Finalize Outreachy project
- Write project proposal
- Submit a Project Proposal mid-February 2020
- Develop the application process
- Prepare the GitHub repos for contributors
- Provide an alternative project contribution that doesn’t require git
- This is also helpful assuming we have more applicants than good first bugs in our Fathom repos
Appendix
Agenda/Changelog
- Add Table of Contents
- First pass fleshing out these projects:
- Finishing pass fleshing out these projects:
- First pass fleshing out these projects:
- Clean up document: Each project idea now has the following subsections:
- Project summary
- Project breakdown
- Advantages
- Disadvantages
- Background information
- Next time: Ensure all new sections are adequately complete for each project idea.
- Discuss possible breakdown of Daniel’s dream project. Is there any part where I would still be able to provide some technical mentorship? Share thoughts with Erik afterwards.
- Maybe all three of us could meet Monday to discuss this particular project idea more depending on the outcome today?
- Ideally we can find some aspect of it where both Erik and I can provide technical mentorship.
- Daniel will be out for a few weeks after Monday, 12/10. Should we meet again next Monday in light of this? Erik also offered to help in Daniel’s absence.
Additional Resources
Good first bugs:
- (TensorBoard) `fathom-train`’s -c option allows you to append a comment to the folder containing data for a particular run. E.g. “V15_article” at the end of “Dec04_10-13-55_MacBook-Pro-81.local.LR=0.01,i=5000,V15_article”. Is it possible to prepend this comment, so that the run files are sorted with the most recent ruleset version at the top?
- (TensorBoard) Can we label each data point in the Histograms section with the file name, so we can tell which samples are not bimodally distributed?
```
fathom-test ../vectors/vectors_test_article.json '{"coeffs": [["hasEnoughParagraphs", -1.0311405658721924],["hasExactlyOneArticleElement", -1.2359271049499512],["paragraphElementsHaveSiblingsWithSameTagName", 12.159211158752441],["mostParagraphElementsAreHorizontallyAligned", 0.5681423544883728],["moreParagraphElementsThanListItemsOrTableRows", -2.6533799171447754],["headerElementIsSiblingToParagraphElements", 12.294110298156738],["hasMultipleArticleElements", -3.300487756729126],["hasMultipleParagraphsWhoseClassNameIncludesArticle", 0.26676997542381287]], "bias": -2.676619291305542}'
```
- Get the status text for each sample in the Vectorizer page to be highlightable (could be extended to have any text on any page highlightable, but that’s probably too much to ask). This is so this information can be copied; e.g. it is convenient to copy the sample page URL and open it in a new tab for debugging.
- #64: Have `fathom-train` use the same relative path as `fathom-list` in its output.
- #69: Supply a Dark Mode browserAction icon for FathomFox.