Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #1579 - Alexa Top Site to DB #1591

Merged
merged 1 commit into from
Jul 7, 2017
Merged

Conversation

MDTsai
Copy link
Contributor

@MDTsai MDTsai commented Jun 12, 2017

Implement a python script which could dump alexa top site to local DB
4 columns: url(primary key), priority, country_code, ranking

r? @zoepage could you help to arrange reviewer?

@zoepage
Copy link
Member

zoepage commented Jun 12, 2017

@MDTsai I'd say, @karlcow would be the perfect person here?

@zoepage zoepage requested a review from karlcow June 12, 2017 11:11
Copy link
Member

@karlcow karlcow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MDTsai Thanks a lot.

This is just a first pass mainly on style. I haven't yet digged into the algo part.

Most of my comments would have been catched with pep8 linter. Try to find one for your own code editor, that will save you a lot of time and headaches 😄

if you are using SublimeText I can recommend a couple of things.

I have the feeling that we also need to put our tools into a new directory. tools/ instead of the root of the project.

In my second review once you fixed a bit of the style. I will go further on looking at the algorithm/usability/testing.

Once again thanks a lot.

And if you have any questions on my comments: DO NOT HESITATE. I will happily give more detailed answers.

topsites.py Outdated
@@ -0,0 +1,166 @@
#!/usr/bin/python

This comment was marked as abuse.

topsites.py Outdated
import requests

from urllib import urlencode
from xml.dom.minidom import parseString

This comment was marked as abuse.

topsites.py Outdated
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import sessionmaker

This comment was marked as abuse.

topsites.py Outdated
ATS_HASH_ALGORITHM = "HmacSHA256"

# Regions/Countries to dump to topsites.db
regions = {

This comment was marked as abuse.

topsites.py Outdated

# Regions/Countries to dump to topsites.db
regions = {
"": (10000, 1), # Global

This comment was marked as abuse.

This comment was marked as abuse.

This comment was marked as abuse.

topsites.py Outdated
dig = hmac.new(secret_key, data, hashlib.sha256).digest()
return base64.b64encode(dig)

def get_text(self, node_list):

This comment was marked as abuse.

topsites.py Outdated
query_string = self.build_query(begin, 100)
toSign = "GET\n" + ATS_SERVICE_HOST + "\n/\n" + query_string
signature = self.gen_sign(toSign, self.secret_key)
uri = ATS_AWS_BASE_URL + query_string + "&" + urlencode({"Signature": signature}, "UTF-8")

This comment was marked as abuse.

topsites.py Outdated
response = requests.get(uri)

if response.status_code == requests.codes.ok:
# See http://docs.aws.amazon.com/AlexaTopSites/latest/index.html?QUERY_QueryRequests.html

This comment was marked as abuse.

topsites.py Outdated
sites = dom.getElementsByTagName("aws:Site")
for site in sites:
url = self.get_text(site.getElementsByTagName("aws:DataUrl")[0].childNodes)
rank = int(self.get_text(site.getElementsByTagName("aws:Rank")[0].childNodes))

This comment was marked as abuse.

topsites.py Outdated
def build_query(self, start, count):
now = time.time()
mlsec = repr(now).split('.')[1][:3]
timestamp = datetime.datetime.utcnow().strftime(ATS_DATEFORMAT).format(mlsec)

This comment was marked as abuse.

@MDTsai
Copy link
Contributor Author

MDTsai commented Jun 13, 2017

Thanks @karlcow, I use ''Python PEP8 Autoformat'' in my sublime now. Seems there is no PEP8 warning now. I re-push style fixed.

Copy link
Member

@karlcow karlcow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MDTsai
Thanks for the style change. There are still a couple of things, but we will see that later.

Let's get into the meat :) I have made extensive comments. If you need to have a better understanding of some of the things I have said, we could schedule a chat together to go line by line through the code.

Or we could go over the code in San Francisco side by side.

It's cool you are working on this. ❤️

topsites.py Outdated
"PL": (1000, 2),
"GB": (1000, 2),
"RU": (1000, 2),
}

This comment was marked as abuse.

topsites.py Outdated
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from urllib import urlencode
from xml.dom.minidom import parseString

This comment was marked as abuse.

This comment was marked as abuse.

topsites.py Outdated


class Site(Base):
'''Define a sqlalchemy base object for a alexa top site.'''

This comment was marked as abuse.

This comment was marked as abuse.

This comment was marked as abuse.

This comment was marked as abuse.

topsites.py Outdated


class TopSites:
'''Used to query top site info from alexa API to Site object'''

This comment was marked as abuse.

topsites.py Outdated

response = requests.get(uri)

if response.status_code == requests.codes.ok:

This comment was marked as abuse.

This comment was marked as abuse.

topsites.py Outdated
# Update priority is new is lower
if site_row.priority > priority:
site_row.priority = priority
site_row.country_code = self.country_code

This comment was marked as abuse.

This comment was marked as abuse.

topsites.py Outdated
now = time.time()
mlsec = repr(now).split('.')[1][:3]
nowtime = datetime.datetime.utcnow()
timestamp = nowtime.strftime(ATS_DATEFORMAT).format(mlsec)

This comment was marked as abuse.

This comment was marked as abuse.

topsites.py Outdated

# alexa top site only accept request with ordered query parameters
return urlencode(collections.OrderedDict(sorted(query_params.items())))

This comment was marked as abuse.

This comment was marked as abuse.

This comment was marked as abuse.

topsites.py Outdated
dig = hmac.new(secret_key, data, hashlib.sha256).digest()
return base64.b64encode(dig)

def get_text(self, node_list):

This comment was marked as abuse.

topsites.py Outdated
for country_code in REGIONS:
site.country_code = country_code
end_ranking = REGIONS[country_code][0]
begin_priority = REGIONS[country_code][1]

This comment was marked as abuse.

@karlcow
Copy link
Member

karlcow commented Jun 14, 2017

@MDTsai I think we would benefit in flexibility if indeed we were dropping class TopSites: and reorganizing the code into small functions. The def query(): should be probably break down into smaller units.

That would also make it easier for you to create tests for it. Currently there are no unit tests at all.
The unit tests are to be put in tests/test_topsites.py

Also would you mind create a tools/ directory? Or do you want me to do that?
I think topsites.py needs to be there along labels.py

@MDTsai
Copy link
Contributor Author

MDTsai commented Jun 15, 2017

OK, I will move the code to tools, drop class TopSites and write test case :) Could be done tomorrow.

@miketaylr
Copy link
Member

@MDTsai could you go ahead and make a commit that moves this file to the tools directory? https://github.com/webcompat/webcompat.com/blob/master/labels.py (since I guess it's relevant to a tools directory being created)?

@MDTsai MDTsai force-pushed the Issue_1579 branch 3 times, most recently from 1e478c8 to f2d51b9 Compare June 22, 2017 07:02
@MDTsai
Copy link
Contributor Author

MDTsai commented Jun 22, 2017

r? = @karlcow Drop TopSite class to functions, add unittest. Not sure if I need to move labels.py in this PR, but now it's included.

@karlcow
Copy link
Member

karlcow commented Jun 22, 2017

@MDTsai thanks. I will make a pass at it. Tomorrow.

Copy link
Member

@karlcow karlcow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MDTsai this looks pretty good.

See the comments I have made.
Thanks for moving the code to tools/

Two general comments.

  • Stick to the same quoting style. We use on webcompat.com """ for docstrings and ' for variable values.
  • Stick to the same formatting style for string concatenation. We decided to use on this project str.format() instead of str + str or %

Thanks.

uri = build_uri(country_code, index)
response = requests.get(uri)

if response.status_code == 200:

This comment was marked as abuse.

This comment was marked as abuse.


# Ranking between 100 and 1000, add priority 1
# Ranking between 1000 and 10000, add priority 2
if rank > 100 and rank <= 1000:

This comment was marked as abuse.

# Ranking between 100 and 1000, add priority 1
# Ranking between 1000 and 10000, add priority 2
if rank > 100 and rank <= 1000:
priority = priority + 1

This comment was marked as abuse.


def build_query_string(country_code, start_ranking, timestamp=None):
"""Build query string for request with start ranking and count."""
if None == timestamp:

This comment was marked as abuse.

This comment was marked as abuse.

This comment was marked as abuse.

Copy link
Member

@karlcow karlcow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave an example of mocking the timestamp. :)


def build_query_string(country_code, start_ranking, timestamp=None):
"""Build query string for request with start ranking and count."""
if None == timestamp:

This comment was marked as abuse.

import hashlib
import hmac
import os
import requests

This comment was marked as abuse.

This comment was marked as abuse.

@MDTsai
Copy link
Contributor Author

MDTsai commented Jun 28, 2017

r? = @karlcow Use Mock to replace datetime.datetime. Print error code and message from ATS result XML now. Change variable quote to single quote.

Copy link
Member

@karlcow karlcow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good to go. Only minor syntax comments.

Thanks a lot @MDTsai

ping me when you did the change and I will merge it.

from mock import patch

# Add tools module to import path
sys.path.append(os.path.realpath(os.pardir))

This comment was marked as abuse.

site = self.dom.getElementsByTagName('aws:Site')[0]
self.assertEqual(topsites.node_text(site, 'aws:DataUrl'), 'baidu.com')
self.assertEqual(topsites.node_text(site, 'aws:Rank'), '1')

This comment was marked as abuse.

This comment was marked as abuse.

This comment was marked as abuse.

self.country_code = country_code
self.ranking = ranking

Base.metadata.create_all(engine)

This comment was marked as abuse.




def parse_site(site):

This comment was marked as abuse.

query=query_string,
signature=urlencode({"Signature": signature}, "UTF-8"))

return uri

This comment was marked as abuse.

"""Extract the text node in a tree for a specific tag_name.
It will only work with unique text nodes.
"""
node = tree.getElementsByTagName(tag_name)[0]

This comment was marked as abuse.

@MDTsai
Copy link
Contributor Author

MDTsai commented Jul 4, 2017

@karlcow: as our discussion in SF, I need to:

  • Provide AWS app/secret key
  • Modify the script to read app/secret key from env set by fabric.
  • Support archive topsite.db
    Is there any special requirement for the second one?

@karlcow
Copy link
Member

karlcow commented Jul 5, 2017

so we have a devops special repository for the fabric file. I'm not here for the next two weeks, so I guess @miketaylr will take the relay on this one.

Copy link
Member

@karlcow karlcow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MDTsai I believe this is ready. Thanks a lot for the code. Feel free to merge after renaming the DB. If we identify issues later we can fix the script.

I propose we deploy this on staging first.
Probably mike will do it given that I'm off for the next two weeks.

topsites = {}

engine = create_engine('sqlite:///' + os.path.join(
app.config['BASE_DIR'], 'topsites.db'))

This comment was marked as abuse.

Implement a python script which could dump alexa top site to local DB
4 columns: url(primary key), priority, country_code, ranking

2017-06-13: Fix style suggested by @karlcow
2017-06-14: Fix some structure suggested by @karlcow
2017-06-20: Drop TopSite object
2017-06-21: Implement unittest for topsites.py, move labels.py to tools/
2017-06-27: Fix test_topsites.py with mock datetime.datetime, add more error handling for request.get
2017-07-07: Archive topsites.db and replace with new one.
@MDTsai
Copy link
Contributor Author

MDTsai commented Jul 7, 2017

@karlcow archive implemented, ready to merge

@karlcow
Copy link
Member

karlcow commented Jul 7, 2017

Thanks a lot @MDTsai

@karlcow karlcow merged commit cabd006 into webcompat:master Jul 7, 2017
@miketaylr
Copy link
Member

I propose we deploy this on staging first.
Probably mike will do it given that I'm off for the next two weeks.

Ah, missed this... should we let this run on staging before merging to master? That's a bit hard now that it's on master...

@miketaylr
Copy link
Member

@MDTsai @karlcow (realizing it's not a great time to be pinging people in Asia....)

@miketaylr
Copy link
Member

Actually, nothing is calling into this code right now. Nevermind. :)

@MDTsai
Copy link
Contributor Author

MDTsai commented Jul 10, 2017

@miketaylr we can use crontab or other similar tool to run the script, is there any existing service on stage server?

@miketaylr
Copy link
Member

@miketaylr we can use crontab or other similar tool to run the script, is there any existing service on stage server?

Yep, we currently use crontab to delete logs when they are older than 14 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants