-
Notifications
You must be signed in to change notification settings - Fork 37
Updates pipeline enrollment metrics queries to improve performance #226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This commit should dramatically improve the query performance for the enrollment metrics pipeline What was wrong? Queries were very slow because of a 'LMIT 1' issues with MySQL. For a starting point, see here https://stackoverflow.com/questions/15460133/mysql-dramatically-slower-query-execution-if-use-limit-1-instead-of-limit-5 In Django, we were doing a filter query that returns a single record or `None`. Examples: ``` StudentModule.objects.filter(**filter_args).latest('modified') StudentModule.objects.filter(**filter_args).order_by('-modified).first() ``` Query functions such as `latest`, `first`, `last` and so on add a `LIMIT 1` to the underlying SQL query, which has apparent negative performance on the query analyzer To address this, we do two things 1. For the specified course, we filter the StudentModule records 2. For the specifid learner in the course, we filter Also, LearnerCourseGradesMetrics queries are slow as the model needs indexing on fields including site, course, and learner. We address this twofold 1. We will add indexing to the needed fields after we prune old records. This is so we're not indexing records we are just going to delete anyway 2. We filter all LearnerCourseGradeMetrics records for the specified course This commit performs #2 above to then filter from this queryset to find LearnerCourseGradeMetrics records for the specified learner in the course Enrollment Metrics tests have been updated to reflect changes in the production code
Codecov Report
@@ Coverage Diff @@
## master #226 +/- ##
==========================================
- Coverage 91.33% 91.28% -0.05%
==========================================
Files 38 38
Lines 1950 1951 +1
==========================================
Hits 1781 1781
- Misses 169 170 +1
Continue to review full report at Codecov.
|
if student_modules: | ||
most_recent_sm = student_modules.latest('modified') | ||
most_recent_sm = student_modules[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many records are we filtering from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@OmarIthawi The top five counts are 398, 382, 309, 309, 308.
As far as limiting the queries, AFAIK, there's no point in doing a "LIMIT 5" because querysets are lazy evaluated unless you use the "step" parameter of python slicing. Please see here: https://docs.djangoproject.com/en/1.11/topics/db/queries/#limiting-querysets
Here's the query I'm running to capture counts:
from django.db import connection
def student_module_metrics():
SM_METRICS_SQL = """ \
SELECT COUNT(id), course_id, student_id from courseware_studentmodule
GROUP BY course_id, student_id
HAVING COUNT(id) > 200
ORDER BY COUNT(id) DESC;
"""
with connection.cursor() as cursor:
count_records_sql = 'SELECT COUNT(id) from courseware_studentmodule;'
cursor.execute(count_records_sql)
total_records = cursor.fetchone()
cursor.execute(SM_METRICS_SQL)
rows = cursor.fetchall()
# Show the top twenty
for row in rows[:20]:
print('{} - course: {}, user: {}'.format(
row[0], row[1], row[2]))
return rows
This commit should dramatically improve the query performance for the
enrollment metrics pipeline
What was wrong?
Queries were very slow because of a 'LMIT 1' issues with MySQL. For a starting point, see here
In Django, we were doing a filter query that returns a single record or
None
. Examples:Query functions such as
latest
,first
,last
and so on add aLIMIT 1
to the underlying SQL query, which has apparent negative performance on the query analyzerTo address this, we do two things
Also, LearnerCourseGradesMetrics queries are slow as the model needs indexing
on fields including site, course, and learner. We address this twofold
This is so we're not indexing records we are just going to delete anyway
course
This commit performs #2 above to then filter from this queryset to find
LearnerCourseGradeMetrics records for the specified learner in the
course
Enrollment Metrics tests have been updated to reflect changes in the
production code