Implement a python script to dump alexa top sites to db #1579

Closed
MDTsai opened this issue Jun 2, 2017 · 11 comments

Comments

@MDTsai
Contributor

MDTsai commented Jun 2, 2017

This is part of #1533. I decided to cache the Alexa top sites in a SQLite DB instead of querying Alexa each time a new issue is opened.
Assuming the ranking doesn't change that frequently, we manually run the script to create a new DB, then deploy it to the production server.
The plan is to have a top-site table with 4 columns: url (primary key), priority (from 1 to 3), country_code and ranking. We use url and priority for the label. country_code and ranking are used to leave a comment when a new issue is created, letting contributors know the site is important because it is in the top N for a given country.
I use Alexa Top Sites to query:

  1. global top 10000 sites
  2. tier 1 top 1000 sites (11 regions, total 11000 sites)

Priority is defined in #1533 (comment) . If the URL doesn't exist in the table, add a record. If the URL exists but the new priority is higher (e.g. ranked 9000 globally but 52 in Taiwan), update the priority.
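
The table and the add-or-update rule above could be sketched roughly like this with Python's built-in sqlite3 module. The table name `topsites`, the function names and the `"global"` placeholder for country_code are hypothetical, not taken from the actual script. Two further assumptions: a numerically smaller priority is the "higher" one, and country_code and ranking are updated together with priority so the comment can cite the ranking that justified it.

```python
import sqlite3


def create_table(conn):
    """Create the 4-column table described above (table name is hypothetical)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS topsites ("
        "url TEXT PRIMARY KEY, "
        "priority INTEGER NOT NULL, "
        "country_code TEXT NOT NULL, "
        "ranking INTEGER NOT NULL)"
    )


def add_or_update(conn, url, priority, country_code, ranking):
    """Insert a new site, or raise the priority of a site already stored."""
    row = conn.execute(
        "SELECT priority FROM topsites WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        # Unknown URL: add a record.
        conn.execute(
            "INSERT INTO topsites (url, priority, country_code, ranking) "
            "VALUES (?, ?, ?, ?)",
            (url, priority, country_code, ranking),
        )
    elif priority < row[0]:
        # Assumption: a smaller number means a higher priority.
        conn.execute(
            "UPDATE topsites SET priority = ?, country_code = ?, ranking = ? "
            "WHERE url = ?",
            (priority, country_code, ranking, url),
        )
    conn.commit()
```

With this sketch, a site first stored from the global list and later seen in a country list keeps a single row, and only a better priority overwrites it.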

@MDTsai MDTsai self-assigned this Jun 2, 2017
MDTsai added a commit to MDTsai/webcompat.com that referenced this issue Jun 2, 2017
Implement a python script which could dump alexa top site to local DB
@MDTsai
Contributor Author

MDTsai commented Jun 2, 2017

@MDTsai
Contributor Author

MDTsai commented Jun 7, 2017

We queried 21000 sites for 52.5 USD and ended up with only 13221 unique URLs.
To answer @softvision-sergiulogigan's question: we updated 366 URLs to a higher priority (low in the global ranking but high in a country ranking).
topsites.zip

I will try to map the data to our current issues to give us an idea of how many issues fall into each priority. @zoepage Do you think this is ok?

MDTsai added a commit to MDTsai/webcompat.com that referenced this issue Jun 7, 2017
Implement a python script which could dump alexa top site to local DB
4 columns: url(primary key), priority, country_code, ranking
@miketaylr
Member

I will try to map the data to our current issues to give us an idea of how many issues fall into each priority. @zoepage Do you think this is ok?

👍

@MDTsai
Contributor Author

MDTsai commented Jun 8, 2017

I'm trying to use github3.py and form+helpers from webcompat to get the domain_name from the issue body (**URL:**).
Now I need to solve the problem that the domain name from the URL doesn't always match what we have from Alexa. Two ideas:

  1. Query Alexa: requesting "www.bing.com" gets a 301 redirect to "bing.com". The Alexa site info API gives the same result.
  2. Shorten the domain name one label at a time, keeping at least the second-level domain + top-level domain, and use the shortened name to look it up in the DB. I prefer this idea because we have a limited number of URLs in the DB.
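
A minimal sketch of the second idea, using only the standard library. The regular expression for the **URL:** line, the function names and the use of a plain set in place of the DB lookup are assumptions for illustration. It is written for Python 3 (`urllib.parse`).

```python
import re
from urllib.parse import urlparse

# Assumption: the issue body has a line like "**URL:** https://..."
# (the variant "**URL**: https://..." is accepted too).
URL_LINE = re.compile(r"\*\*URL:?\*\*:?\s*(\S+)")


def hostname_from_body(body):
    """Return the hostname of the reported URL, or None if there is none."""
    match = URL_LINE.search(body)
    if match is None:
        return None
    # urlparse only fills in .hostname when the URL has a scheme.
    return urlparse(match.group(1)).hostname


def candidate_domains(hostname):
    """List the hostname, then shorter forms, down to two labels."""
    labels = hostname.lower().strip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - 1)]


def find_in_db(hostname, known_domains):
    """Return the longest candidate present in known_domains, else None."""
    for candidate in candidate_domains(hostname):
        if candidate in known_domains:
            return candidate
    return None
```

For "www.bing.com" the candidates are "www.bing.com" and then "bing.com". One caveat: for hosts under a multi-part suffix such as "co.uk", the last candidate is the suffix itself, which is harmless as long as it is not in the DB.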

@zoepage
Member

zoepage commented Jun 8, 2017

I will try to map the data to our current issues to give us an idea of how many issues fall into each priority. @zoepage Do you think this is ok?

👍

@karlcow
Member

karlcow commented Jun 8, 2017

@MDTsai just an idea for checking matching of domain names. Still early here so maybe it's a bad idea. :)

Trying to match in reverse order.

>>> domain1 = 'www.mozilla.org'
>>> alexa_list = ['mozilla.org', 'mozilla.com']
>>> domain1.split('.')
['www', 'mozilla', 'org']
>>> domain1.split('.')[-1]
'org'
>>> last_part = domain1.split('.')[-1]
>>> # keep every domain from alexa_list that ends with last_part
>>> # (a list comprehension gives the same list on Python 2 and Python 3)
>>> [d for d in alexa_list if d.endswith(last_part)]
['mozilla.org']

Trying to match org, then mozilla, etc.
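
One way the label-by-label matching could look as a function, as a rough sketch only. The name `match_reverse` and the choice to return the longest matching entry are assumptions, not something decided in this thread.

```python
def match_reverse(domain, alexa_list):
    """Match domain against alexa_list from the last label backwards.

    Returns the longest entry of alexa_list that is a dot-separated
    suffix of domain, or None when nothing matches.
    """
    candidates = list(alexa_list)
    suffix = ""
    best = None
    for label in reversed(domain.lower().split(".")):
        # Grow the suffix: "org", then "mozilla.org", then "www.mozilla.org".
        suffix = label if not suffix else label + "." + suffix
        # Keep only entries that still agree with the suffix built so far.
        candidates = [
            d for d in candidates
            if d == suffix or d.endswith("." + suffix)
        ]
        if not candidates:
            break
        if suffix in candidates:
            best = suffix
    return best
```

With the list from the example above, `match_reverse('www.mozilla.org', ['mozilla.org', 'mozilla.com'])` narrows to 'mozilla.org' on the first label and stops once 'www' no longer matches anything.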

@karlcow
Member

karlcow commented Jun 8, 2017

@MDTsai
Contributor Author

MDTsai commented Jun 9, 2017

Using the DB to analyze the current web-bugs issues (7348), we have:

  • Critical: 1325
  • Important: 981
  • Normal: 905

If we use the ranking, that means:

  • 43.7% of issues are for sites that rank as important globally or in tier 1 countries.
  • Global top 100 sites account for ~18% (but I guess Oana and Sergiu contributed a lot of Google issues). That makes sense: huge websites have more problems than small ones.
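
For reference, the 43.7% figure follows from the three bucket counts and the issue total given above. A quick check (variable names are only illustrative):

```python
# Counts per priority bucket and total number of issues, as reported above.
counts = {"Critical": 1325, "Important": 981, "Normal": 905}
total_issues = 7348

matched = sum(counts.values())          # issues whose site is in the DB
share = 100.0 * matched / total_issues  # percentage of all issues

print("{} of {} issues ({:.1f}%)".format(matched, total_issues, share))
```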

You can visit here to see the results and give me some feedback. Then I can adjust the priority derived from the ranking, or do something more?
@karlcow @zoepage @softvision-sergiulogigan @softvision-oana-arbuzov @miketaylr @adamopenweb

@miketaylr
Member

Very cool, @MDTsai!

The only thing that might feel a bit odd is a site like fox.com (ranking 1125 in the US) sitting in the same "normal" bucket (I guess rank 3) as makeuseof.com (872 in US).

But any system won't be perfect, and it's better than what we have right now. 👍

@MDTsai
Contributor Author

MDTsai commented Jun 10, 2017

@miketaylr:
fox.com is 5067th globally and 1129th in the US, so only its global ranking applies, which gives priority 3.
makeuseof.com is 1267th globally and 862nd in the US. So the two have the same priority, 3. I will carry on with the next steps. :)

@zoepage
Member

zoepage commented Jun 10, 2017

@MDTsai The list looks really good! Thanks! :)

MDTsai added a commit to MDTsai/webcompat.com that referenced this issue Jun 12, 2017
Implement a python script which could dump alexa top site to local DB
4 columns: url(primary key), priority, country_code, ranking
MDTsai added a commit to MDTsai/webcompat.com that referenced this issue Jun 13, 2017
Implement a python script which could dump alexa top site to local DB
4 columns: url(primary key), priority, country_code, ranking

2017-06-13: Fix style suggested by @karlcow
MDTsai added a commit to MDTsai/webcompat.com that referenced this issue Jun 20, 2017
Implement a python script which could dump alexa top site to local DB
4 columns: url(primary key), priority, country_code, ranking

2017-06-13: Fix style suggested by @karlcow
2017-06-14: Fix some structure suggested by @karlcow
2017-06-20: Drop TopSite object
MDTsai added a commit to MDTsai/webcompat.com that referenced this issue Jun 21, 2017
Implement a python script which could dump alexa top site to local DB
4 columns: url(primary key), priority, country_code, ranking

2017-06-13: Fix style suggested by @karlcow
2017-06-14: Fix some structure suggested by @karlcow
2017-06-20: Drop TopSite object
2017-06-21: Implement unittest for topsites.py
MDTsai added a commit to MDTsai/webcompat.com that referenced this issue Jun 22, 2017
Implement a python script which could dump alexa top site to local DB
4 columns: url(primary key), priority, country_code, ranking

2017-06-13: Fix style suggested by @karlcow
2017-06-14: Fix some structure suggested by @karlcow
2017-06-20: Drop TopSite object
2017-06-21: Implement unittest for topsites.py, move labels.py to tools/
MDTsai added a commit to MDTsai/webcompat.com that referenced this issue Jun 28, 2017
Implement a python script which could dump alexa top site to local DB
4 columns: url(primary key), priority, country_code, ranking

2017-06-13: Fix style suggested by @karlcow
2017-06-14: Fix some structure suggested by @karlcow
2017-06-20: Drop TopSite object
2017-06-21: Implement unittest for topsites.py, move labels.py to tools/
2017-06-27: Fix test_topsites.py with mock datetime.datetime, add more error handling for request.get
MDTsai added a commit to MDTsai/webcompat.com that referenced this issue Jun 29, 2017
Implement a python script which could dump alexa top site to local DB
4 columns: url(primary key), priority, country_code, ranking

2017-06-13: Fix style suggested by @karlcow
2017-06-14: Fix some structure suggested by @karlcow
2017-06-20: Drop TopSite object
2017-06-21: Implement unittest for topsites.py, move labels.py to tools/
2017-06-27: Fix test_topsites.py with mock datetime.datetime, add more error handling for request.get
MDTsai added a commit to MDTsai/webcompat.com that referenced this issue Jul 7, 2017
Implement a python script which could dump alexa top site to local DB
4 columns: url(primary key), priority, country_code, ranking

2017-06-13: Fix style suggested by @karlcow
2017-06-14: Fix some structure suggested by @karlcow
2017-06-20: Drop TopSite object
2017-06-21: Implement unittest for topsites.py, move labels.py to tools/
2017-06-27: Fix test_topsites.py with mock datetime.datetime, add more error handling for request.get
2017-07-07: Archive topsites.db and replace with new one.
@karlcow karlcow closed this as completed in cc7aed7 Jul 7, 2017
karlcow added a commit that referenced this issue Jul 7, 2017