[Demo] Topic Modeling using HDP for Augur Message Table (GSoC 2025) #3106

Open: wants to merge 47 commits into base: dev

@Xiaoha-cloud Xiaoha-cloud commented Apr 4, 2025

Description

This PR adds a self-contained demonstration notebook for topic modeling using Gensim’s HDP model, applied to the augur_data.message table.

Related to #1199

  • Extracts discussion data for repo_id=24441, from 2021-08-03 to 2023-12-31
  • Uses spaCy + NLTK for text preprocessing
  • Uses an HDP model (a non-parametric extension of LDA) to infer the number of topics from the data instead of fixing it in advance
  • Visualizes topic distribution via pyLDAvis
  • Includes a screenshot of the result
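
For reviewers who want a feel for the pipeline without opening the notebook, here is a minimal sketch of the first three steps. It is not the notebook's code. The column names (msg_text, msg_timestamp, repo_id), the psycopg2-style placeholders, the tiny stopword set, and the helper names build_message_query, tokenize, and fit_hdp are all assumptions made for illustration. The regex tokenizer is only a stand-in for the spaCy + NLTK preprocessing.

```python
import re
from datetime import date

# Tiny illustrative stopword set; the notebook relies on spaCy + NLTK instead.
STOPWORDS = {"the", "and", "for", "this", "that", "with"}
TOKEN_RE = re.compile(r"[a-z]{3,}")


def build_message_query(repo_id, start, end):
    """Return (sql, params) selecting message text for one repo and date range.

    Assumption: augur_data.message exposes msg_text, msg_timestamp and repo_id.
    """
    sql = (
        "SELECT msg_text FROM augur_data.message "
        "WHERE repo_id = %(repo_id)s "
        "AND msg_timestamp BETWEEN %(start)s AND %(end)s"
    )
    return sql, {"repo_id": repo_id, "start": start, "end": end}


def tokenize(text):
    """Lowercase, keep alphabetic tokens of 3+ letters, drop stopwords."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOPWORDS]


def fit_hdp(token_lists):
    """Fit a Gensim HDP model on pre-tokenized documents.

    Gensim is imported lazily so the helpers above work without it installed.
    """
    from gensim.corpora import Dictionary
    from gensim.models import HdpModel

    docs = [tokens for tokens in token_lists if tokens]  # skip empty documents
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(tokens) for tokens in docs]
    hdp = HdpModel(corpus=corpus, id2word=dictionary)
    return hdp, corpus, dictionary


# Example wiring (needs a database cursor and gensim, so it is left as comments):
#   sql, params = build_message_query(24441, date(2021, 8, 3), date(2023, 12, 31))
#   cursor.execute(sql, params)
#   hdp, corpus, dictionary = fit_hdp(tokenize(row[0]) for row in cursor.fetchall())
#   print(hdp.print_topics(num_topics=10, num_words=8))
```

Passing the repo ID and dates as query parameters rather than formatting them into the SQL string keeps the extraction step reusable for other repositories.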

This is part of an exploratory effort for GSoC 2025 under the proposed topic of improving conversational topic modeling in Augur.

📂 Files are added under:
notebooks/topic_modeling/
with a screenshot in
notebooks/topic_modeling/assets/

This PR does not modify any production code, but may serve as a reference for improving future integration of repo_topic and topic_words storage logic.


Notes for Reviewers

  • I'd love feedback from maintainers on how topic data is currently stored and if this HDP approach could be explored as a possible improvement.
  • I'm also interested in clarifying where such experimental notebooks should live long-term (under examples, contrib, etc.)
  • Please let me know if this PR should be redirected elsewhere.

Signed commits

  • Yes, I signed my commits.

[Screenshot: hdp_topic_vis_preview]

ABrain7710 and others added 30 commits March 8, 2025 14:08
Signed-off-by: Sean P. Goggins <[email protected]>
Events fix for Augur instance anomalies (One instance with an issue. We think this fixes that, and provides a more performant index going forward).
it seems to be broken and not kept up to date with the docker compose file

Signed-off-by: Adrian Edwards <[email protected]>
add flower section to the docker compose
log stderr in called process for facade commit count
Leave room on github api key rate limit
fixing install-dev command by removal
adjust path that scc gets copied to
Marcel Beyer and others added 16 commits March 21, 2025 11:53
Signed-off-by: Marcel Beyer <[email protected]>
- Move from custom scripting to re-usable actions
- Add workflow_dispatch so build can be triggered manually

Signed-off-by: John Strunk <[email protected]>
Docker mac fix
- tested on OSX
- tested on Ubuntu 22.x
[Docs] Corrected GitLab Public Access Token URL in Documentation
Fix failing CI when not tagging a docker image
@Xiaoha-cloud Xiaoha-cloud requested a review from sgoggins as a code owner April 4, 2025 16:59
@Xiaoha-cloud (Author)

Hi maintainers, this PR is limited to a demo notebook under notebooks/topic_modeling/, and does not touch any core Augur code. The current CI failure seems to stem from existing issues in unrelated files like setup.py and events.py.

Please let me know if there's anything else I should fix on my end. Thank you!

8 participants