Implement a python script to dump alexa top sites to db #1579
Implement a python script which could dump alexa top site to local DB
WIP implementation https://github.com/MDTsai/webcompat.com/tree/Issue_1579 |
We queried 21,000 sites for 52.5 USD and ended up with only 13,221 unique URLs. I will try to map this data to our current issues to give us an idea of how many issues fall into each priority. @zoepage Do you think this is OK? |
Implement a python script which could dump alexa top site to local DB 4 columns: url(primary key), priority, country_code, ranking
👍 |
I am trying to use github3.py and form+helpers from webcompat to get domain_name from the issue body (**URL:**).
|
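As a rough illustration of the domain extraction step, here is a minimal sketch that uses only the standard library instead of github3.py or the webcompat helpers. The function name `extract_domain` and the exact shape of the URL line (`**URL:**` or `**URL**:` followed by the address) are assumptions, not the project's actual code.

```python
import re
from urllib.parse import urlparse

# Assumed format: the issue body has a line such as
# "**URL:** https://example.com/page" or "**URL**: example.com/page".
URL_LINE = re.compile(r"\*\*URL:?\*\*:?\s*(\S+)")


def extract_domain(body):
    """Return the lowercased host name from the URL line, or None."""
    match = URL_LINE.search(body)
    if match is None:
        return None
    url = match.group(1)
    # urlparse only fills in the host when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    return urlparse(url).hostname
```

The returned host name can then be compared against the url column of the top sites table.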
👍 |
@MDTsai just an idea for checking matching of domain names. Still early here so maybe it's a bad idea. :) Trying to match in reverse order.
>>> domain1 = 'www.mozilla.org'
>>> alexa_list = ['mozilla.org', 'mozilla.com']
>>> domain1.split('.')
['www', 'mozilla', 'org']
>>> domain1.split('.')[-1]
'org'
>>> last_part = domain1.split('.')[-1]
>>> # this filter extracts all the domains from the alexa_list
>>> # which end with last_part
>>> list(filter(lambda x: x.endswith(last_part), alexa_list))
['mozilla.org']
Trying to match |
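Filtering on the last label alone would keep every `.org` entry in the list, so the idea probably needs one more step. One possible variation, sketched here as an assumption rather than what was finally implemented, is to try the full domain first and then drop the leftmost label until a top site matches. The name `match_top_site` is hypothetical.

```python
def match_top_site(domain, top_sites):
    """Return the top-site entry that `domain` belongs to, or None.

    `top_sites` is any container of bare domain names (a set is fastest).
    """
    labels = domain.lower().split(".")
    # Try "www.mozilla.org", then "mozilla.org"; never the bare TLD "org".
    for start in range(len(labels) - 1):
        candidate = ".".join(labels[start:])
        if candidate in top_sites:
            return candidate
    return None
```

Comparing whole labels, not string suffixes, avoids treating `notmozilla.org` as a match for `mozilla.org`.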
Using the DB to analyze the current web-bugs issues (7348), we have:
If we use the ranking, that means:
You can visit here to see the results and give me some feedback. Then I can adjust the priority derived from the ranking, or do something more? |
Very cool, @MDTsai! The only thing that might feel a bit odd is a site like fox.com (ranking 1125 in the US) sitting in the same "normal" bucket (I guess rank 3) as makeuseof.com (872 in US). But any system won't be perfect, and it's better than what we have right now. 👍 |
@miketaylr: |
@MDTsai The list looks really good! Thanks! :) |
Implement a python script which could dump alexa top site to local DB 4 columns: url(primary key), priority, country_code, ranking 2017-06-13: Fix style suggested by @karlcow
Implement a python script which could dump alexa top site to local DB 4 columns: url(primary key), priority, country_code, ranking 2017-06-13: Fix style suggested by @karlcow 2017-06-14: Fix some structure suggested by @karlcow 2017-06-20: Drop TopSite object 2017-06-21: Implement unittest for topsites.py
Implement a python script which could dump alexa top site to local DB 4 columns: url(primary key), priority, country_code, ranking 2017-06-13: Fix style suggested by @karlcow 2017-06-14: Fix some structure suggested by @karlcow 2017-06-20: Drop TopSite object 2017-06-21: Implement unittest for topsites.py, move labels.py to tools/
Implement a python script which could dump alexa top site to local DB 4 columns: url(primary key), priority, country_code, ranking 2017-06-13: Fix style suggested by @karlcow 2017-06-14: Fix some structure suggested by @karlcow 2017-06-20: Drop TopSite object 2017-06-21: Implement unittest for topsites.py, move labels.py to tools/ 2017-06-27: Fix test_topsites.py with mock datetime.datetime, add more error handling for request.get
Implement a python script which could dump alexa top site to local DB 4 columns: url(primary key), priority, country_code, ranking 2017-06-13: Fix style suggested by @karlcow 2017-06-14: Fix some structure suggested by @karlcow 2017-06-20: Drop TopSite object 2017-06-21: Implement unittest for topsites.py, move labels.py to tools/ 2017-06-27: Fix test_topsites.py with mock datetime.datetime, add more error handling for request.get 2017-07-07: Archive topsites.db and replace with new one.
This is part of #1533. I decided to cache the Alexa top sites in a SQLite DB rather than query Alexa each time a new issue is opened.
Assuming the ranking doesn't change that frequently, we manually run the script to create a new DB, then deploy it to the production server.
The plan is to have a top sites table with 4 columns: url (primary key), priority (from 1 to 3), country_code, and ranking. We use url and priority for the label. country_code and ranking are used to leave a comment when a new issue is created, letting contributors know the site is important because it is in the top N in a given country.
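The four-column table described above could look roughly like this with the standard `sqlite3` module. The table name `topsites`, the column types, and the function name `create_topsites_db` are assumptions made for illustration; only the column names and the 1 to 3 priority range come from the plan.

```python
import sqlite3


def create_topsites_db(path=":memory:"):
    """Create the (assumed) topsites table and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS topsites (
            url          TEXT PRIMARY KEY,
            priority     INTEGER NOT NULL CHECK (priority BETWEEN 1 AND 3),
            country_code TEXT NOT NULL,
            ranking      INTEGER NOT NULL
        )
        """
    )
    conn.commit()
    return conn
```

Passing a file path instead of `":memory:"` would produce a DB file that can be deployed to the server.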
I use Alexa Top Sites to query:
Priority is defined in #1533 (comment). If the URL doesn't exist, add a record. If the URL exists but the new priority is higher (for example, a site ranked 9000 globally but top 52 in Taiwan), update the priority.
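The insert-or-update rule can be sketched as follows. This assumes that priority 1 is the highest and 3 the lowest, and that country_code and ranking are updated together with the priority; neither detail is confirmed here. The table name `topsites`, the function name `save_site`, and the sample priorities are hypothetical, since the real thresholds live in the #1533 comment.

```python
import sqlite3


def save_site(conn, url, priority, country_code, ranking):
    """Insert a new site, or upgrade an existing one to a higher priority."""
    row = conn.execute(
        "SELECT priority FROM topsites WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO topsites VALUES (?, ?, ?, ?)",
            (url, priority, country_code, ranking),
        )
    elif priority < row[0]:
        # Assumption: a smaller number means a higher priority.
        conn.execute(
            "UPDATE topsites SET priority = ?, country_code = ?, ranking = ? "
            "WHERE url = ?",
            (priority, country_code, ranking, url),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE topsites (url TEXT PRIMARY KEY, priority INTEGER, "
    "country_code TEXT, ranking INTEGER)"
)
# Hypothetical data: ranked 9000 globally, then found at 52 in Taiwan.
save_site(conn, "example.tw", 3, "global", 9000)
save_site(conn, "example.tw", 1, "TW", 52)
```

A later query that reports a lower priority for the same URL leaves the stored row untouched.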