-
Notifications
You must be signed in to change notification settings - Fork 203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #1579 - Alexa Top Site to DB #1591
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MDTsai Thanks a lot.
This is just a first pass mainly on style. I haven't yet digged into the algo part.
Most of my comments would have been catched with pep8 linter. Try to find one for your own code editor, that will save you a lot of time and headaches 😄
if you are using SublimeText I can recommend a couple of things.
I have the feeling that we also need to put our tools into a new directory. tools/
instead of the root of the project.
In my second review once you fixed a bit of the style. I will go further on looking at the algorithm/usability/testing.
Once again thanks a lot.
And if you have any questions on my comments: DO NOT HESITATE. I will happily give more detailed answers.
topsites.py
Outdated
@@ -0,0 +1,166 @@ | |||
#!/usr/bin/python |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
topsites.py
Outdated
import requests | ||
|
||
from urllib import urlencode | ||
from xml.dom.minidom import parseString |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
topsites.py
Outdated
from sqlalchemy.ext.declarative import declarative_base | ||
from sqlalchemy import Column, Integer, String | ||
from sqlalchemy.orm import sessionmaker | ||
|
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
topsites.py
Outdated
ATS_HASH_ALGORITHM = "HmacSHA256" | ||
|
||
# Regions/Countries to dump to topsites.db | ||
regions = { |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
topsites.py
Outdated
|
||
# Regions/Countries to dump to topsites.db | ||
regions = { | ||
"": (10000, 1), # Global |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
topsites.py
Outdated
dig = hmac.new(secret_key, data, hashlib.sha256).digest() | ||
return base64.b64encode(dig) | ||
|
||
def get_text(self, node_list): |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
topsites.py
Outdated
query_string = self.build_query(begin, 100) | ||
toSign = "GET\n" + ATS_SERVICE_HOST + "\n/\n" + query_string | ||
signature = self.gen_sign(toSign, self.secret_key) | ||
uri = ATS_AWS_BASE_URL + query_string + "&" + urlencode({"Signature": signature}, "UTF-8") |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
topsites.py
Outdated
response = requests.get(uri) | ||
|
||
if response.status_code == requests.codes.ok: | ||
# See http://docs.aws.amazon.com/AlexaTopSites/latest/index.html?QUERY_QueryRequests.html |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
topsites.py
Outdated
sites = dom.getElementsByTagName("aws:Site") | ||
for site in sites: | ||
url = self.get_text(site.getElementsByTagName("aws:DataUrl")[0].childNodes) | ||
rank = int(self.get_text(site.getElementsByTagName("aws:Rank")[0].childNodes)) |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
topsites.py
Outdated
def build_query(self, start, count): | ||
now = time.time() | ||
mlsec = repr(now).split('.')[1][:3] | ||
timestamp = datetime.datetime.utcnow().strftime(ATS_DATEFORMAT).format(mlsec) |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
Thanks @karlcow, I use ''Python PEP8 Autoformat'' in my sublime now. Seems there is no PEP8 warning now. I re-push style fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MDTsai
Thanks for the style change. There are still a couple of things, but we will see that later.
Let's get into the meat :) I have made extensive comments. If you need to have a better understanding of some of the things I have said, we could schedule a chat together to go line by line through the code.
Or we could go over the code in San Francisco side by side.
It's cool you are working on this. ❤️
topsites.py
Outdated
"PL": (1000, 2), | ||
"GB": (1000, 2), | ||
"RU": (1000, 2), | ||
} |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
topsites.py
Outdated
from sqlalchemy.ext.declarative import declarative_base | ||
from sqlalchemy.orm import sessionmaker | ||
from urllib import urlencode | ||
from xml.dom.minidom import parseString |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
topsites.py
Outdated
|
||
|
||
class Site(Base): | ||
'''Define a sqlalchemy base object for a alexa top site.''' |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
topsites.py
Outdated
|
||
|
||
class TopSites: | ||
'''Used to query top site info from alexa API to Site object''' |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
topsites.py
Outdated
|
||
response = requests.get(uri) | ||
|
||
if response.status_code == requests.codes.ok: |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
topsites.py
Outdated
# Update priority is new is lower | ||
if site_row.priority > priority: | ||
site_row.priority = priority | ||
site_row.country_code = self.country_code |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
topsites.py
Outdated
now = time.time() | ||
mlsec = repr(now).split('.')[1][:3] | ||
nowtime = datetime.datetime.utcnow() | ||
timestamp = nowtime.strftime(ATS_DATEFORMAT).format(mlsec) |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
topsites.py
Outdated
|
||
# alexa top site only accept request with ordered query parameters | ||
return urlencode(collections.OrderedDict(sorted(query_params.items()))) | ||
|
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
topsites.py
Outdated
dig = hmac.new(secret_key, data, hashlib.sha256).digest() | ||
return base64.b64encode(dig) | ||
|
||
def get_text(self, node_list): |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
topsites.py
Outdated
for country_code in REGIONS: | ||
site.country_code = country_code | ||
end_ranking = REGIONS[country_code][0] | ||
begin_priority = REGIONS[country_code][1] |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
@MDTsai I think we would benefit in flexibility if indeed we were dropping That would also make it easier for you to create tests for it. Currently there are no unit tests at all. Also would you mind create a |
OK, I will move the code to tools, drop class TopSites and write test case :) Could be done tomorrow. |
@MDTsai could you go ahead and make a commit that moves this file to the tools directory? https://github.com/webcompat/webcompat.com/blob/master/labels.py (since I guess it's relevant to a tools directory being created)? |
1e478c8
to
f2d51b9
Compare
r? = @karlcow Drop TopSite class to functions, add unittest. Not sure if I need to move labels.py in this PR, but now it's included. |
@MDTsai thanks. I will make a pass at it. Tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MDTsai this looks pretty good.
See the comments I have made.
Thanks for moving the code to tools/
Two general comments.
- Stick to the same quoting style. We use on webcompat.com
"""
for docstrings and'
for variable values. - Stick to the same formatting style for string concatenation. We decided to use on this project
str.format()
instead ofstr + str
or%
Thanks.
tools/topsites.py
Outdated
uri = build_uri(country_code, index) | ||
response = requests.get(uri) | ||
|
||
if response.status_code == 200: |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
tools/topsites.py
Outdated
|
||
# Ranking between 100 and 1000, add priority 1 | ||
# Ranking between 1000 and 10000, add priority 2 | ||
if rank > 100 and rank <= 1000: |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
tools/topsites.py
Outdated
# Ranking between 100 and 1000, add priority 1 | ||
# Ranking between 1000 and 10000, add priority 2 | ||
if rank > 100 and rank <= 1000: | ||
priority = priority + 1 |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
tools/topsites.py
Outdated
|
||
def build_query_string(country_code, start_ranking, timestamp=None): | ||
"""Build query string for request with start ranking and count.""" | ||
if None == timestamp: |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave an example of mocking the timestamp. :)
tools/topsites.py
Outdated
|
||
def build_query_string(country_code, start_ranking, timestamp=None): | ||
"""Build query string for request with start ranking and count.""" | ||
if None == timestamp: |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
tools/topsites.py
Outdated
import hashlib | ||
import hmac | ||
import os | ||
import requests |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
r? = @karlcow Use Mock to replace datetime.datetime. Print error code and message from ATS result XML now. Change variable quote to single quote. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good to go. Only minor syntax comments.
Thanks a lot @MDTsai
ping me when you did the change and I will merge it.
tests/test_topsites.py
Outdated
from mock import patch | ||
|
||
# Add tools module to import path | ||
sys.path.append(os.path.realpath(os.pardir)) |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
site = self.dom.getElementsByTagName('aws:Site')[0] | ||
self.assertEqual(topsites.node_text(site, 'aws:DataUrl'), 'baidu.com') | ||
self.assertEqual(topsites.node_text(site, 'aws:Rank'), '1') | ||
|
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
tools/topsites.py
Outdated
self.country_code = country_code | ||
self.ranking = ranking | ||
|
||
Base.metadata.create_all(engine) |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
tools/topsites.py
Outdated
|
||
|
||
|
||
def parse_site(site): |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
tools/topsites.py
Outdated
query=query_string, | ||
signature=urlencode({"Signature": signature}, "UTF-8")) | ||
|
||
return uri |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
tools/topsites.py
Outdated
"""Extract the text node in a tree for a specific tag_name. | ||
It will only work with unique text nodes. | ||
""" | ||
node = tree.getElementsByTagName(tag_name)[0] |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
@karlcow: as our discussion in SF, I need to:
|
so we have a devops special repository for the fabric file. I'm not here for the next two weeks, so I guess @miketaylr will take the relay on this one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MDTsai I believe this is ready. Thanks a lot for the code. Feel free to merge after renaming the DB. If we identify issues later we can fix the script.
I propose we deploy this on staging first.
Probably mike will do it given that I'm off for the next two weeks.
tools/topsites.py
Outdated
topsites = {} | ||
|
||
engine = create_engine('sqlite:///' + os.path.join( | ||
app.config['BASE_DIR'], 'topsites.db')) |
This comment was marked as abuse.
This comment was marked as abuse.
Sorry, something went wrong.
Implement a python script which could dump alexa top site to local DB 4 columns: url(primary key), priority, country_code, ranking 2017-06-13: Fix style suggested by @karlcow 2017-06-14: Fix some structure suggested by @karlcow 2017-06-20: Drop TopSite object 2017-06-21: Implement unittest for topsites.py, move labels.py to tools/ 2017-06-27: Fix test_topsites.py with mock datetime.datetime, add more error handling for request.get 2017-07-07: Archive topsites.db and replace with new one.
@karlcow archive implemented, ready to merge |
Thanks a lot @MDTsai |
Ah, missed this... should we let this run on staging before merging to master? That's a bit hard now that it's on master... |
Actually, nothing is calling into this code right now. Nevermind. :) |
@miketaylr we can use crontab or other similar tool to run the script, is there any existing service on stage server? |
Yep, we currently use crontab to delete logs when they are older than 14 days. |
Implement a python script which could dump alexa top site to local DB
4 columns: url(primary key), priority, country_code, ranking
r? @zoepage could you help to arrange reviewer?