Verified error fixes and readability improvement #8
Fix typos that made the code fail to run. Eliminate dependency on contrib (because it is not actively maintained) by replacing methods with equivalent core or addon methods (in one case, the logic is the same as the contrib source code).
Fix critical errors, eliminate dependency on contrib
Reformat spacing (including indentation and spaces between operators) and adjust variable names to be more compliant with PEP8 guidelines and improve readability.
This commit fixes critical errors that prevented the program from running, eliminates the dependency on contrib (since contrib is not well maintained), and improves readability by renaming some variables and being more compliant with PEP8 guidelines.
Thanks for this PR, @ahwang16.
I did train on Switchboard, but I don't have much information because my summer internship ended in August and I have not returned to Google yet. Without any parameter search or optimization, my trained model achieved a test accuracy of 0.743 (using 300-dimensional GloVe embeddings). I later used BERT embeddings, which pushed the accuracy up to 0.809.
Thanks, that sounds quite promising, especially using BERT embeddings.
I guess the labels correspond to speech acts. What about the arrays of data? Are they word IDs for each utterance?
For real data, I used the Switchboard Dialogue Act Corpus (SWDA) and an internal dataset (not sure if it is available to the public). The multilayered structure of the data can be pretty confusing. In the sample test data, each integer is a token, a list of tokens is an utterance, and a list of utterances is a dialogue. The labels correspond to dialogue acts at the utterance level. So the data is nested three levels deep: tokens inside utterances inside dialogues, with one label per utterance.
For the two-tier architecture in this code, the data should be parsed at the utterance level for the first layer and the dialogue level for the second layer.
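As a rough illustration of the nesting described above, here is a minimal sketch in plain Python. The token IDs, label values, and the helper names `check_alignment` and `to_utterance_level` are all made up for this example; they are not taken from the repository or from SWDA.

```python
# Hypothetical toy data: every integer below is invented for illustration.
# Each integer is a token, each inner list is an utterance,
# and each outer list is a dialogue.
dialogues = [
    [[4, 17, 9], [23, 5]],           # dialogue 0: two utterances
    [[8], [12, 3, 3, 40], [7, 2]],   # dialogue 1: three utterances
]

# One dialogue-act label per utterance, grouped by dialogue.
labels = [
    [0, 2],
    [1, 1, 3],
]


def check_alignment(dialogues, labels):
    """Return True if every dialogue has exactly one label per utterance."""
    if len(dialogues) != len(labels):
        return False
    return all(len(d) == len(l) for d, l in zip(dialogues, labels))


def to_utterance_level(dialogues, labels):
    """Flatten to (tokens, label) pairs, the view a first (utterance) tier would see."""
    pairs = []
    for dialogue, dialogue_labels in zip(dialogues, labels):
        for tokens, label in zip(dialogue, dialogue_labels):
            pairs.append((tokens, label))
    return pairs


utterance_pairs = to_utterance_level(dialogues, labels)
```

The nested `dialogues` list is the dialogue-level view for the second tier, while `utterance_pairs` is the flattened utterance-level view. A real pipeline would likely also pad utterances and dialogues to fixed lengths, which this sketch leaves out.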
Thanks for the description!
This version is verified to work on Google Colab using a GPU and Python 3. It fixes typos in the original version that prevented the program from running, eliminates the dependency on contrib, and improves readability by changing a couple of variable names and adjusting spacing.