Updates pipeline enrollment metrics queries to improve performance #226

johnbaldwin · 2020-06-15T19:12:30Z

This commit should dramatically improve query performance for the
enrollment metrics pipeline.

What was wrong?

Queries were very slow because of a 'LIMIT 1' issue with MySQL. For a starting point, see here:

https://stackoverflow.com/questions/15460133/mysql-dramatically-slower-query-execution-if-use-limit-1-instead-of-limit-5

In Django, we were doing a filter query that returns a single record or
None. Examples:

StudentModule.objects.filter(**filter_args).latest('modified')
StudentModule.objects.filter(**filter_args).order_by('-modified').first()

Queryset methods such as latest, first, and last add a LIMIT 1 to the underlying SQL query, which appears to lead the MySQL query planner to choose a much slower execution plan.

To address this, we do two things:

1. For the specified course, we filter the StudentModule records.
2. For the specified learner in the course, we filter those course records.
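
The two filtering steps above can be sketched roughly as follows. This is an illustration only, not the code in this PR: it uses the standard-library sqlite3 module instead of MySQL and the Django ORM, and the table layout, sample rows, and the helper names course_records and most_recent_for_learner are all hypothetical.

```python
import sqlite3

# Hypothetical stand-in for courseware_studentmodule. SQLite is used only so
# the example runs anywhere; the PR itself targets MySQL via the Django ORM.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE studentmodule ("
    "id INTEGER PRIMARY KEY, course_id TEXT, student_id INTEGER, modified TEXT)"
)
conn.executemany(
    "INSERT INTO studentmodule (id, course_id, student_id, modified) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "course-a", 1, "2020-06-01"),
        (2, "course-a", 1, "2020-06-10"),
        (3, "course-a", 2, "2020-06-05"),
        (4, "course-b", 1, "2020-06-12"),
    ],
)


def course_records(conn, course_id):
    """Step 1: fetch every record for the course in one query, without LIMIT."""
    cursor = conn.execute(
        "SELECT id, student_id, modified FROM studentmodule WHERE course_id = ?",
        (course_id,),
    )
    return cursor.fetchall()


def most_recent_for_learner(records, student_id):
    """Step 2: narrow the fetched rows to one learner and pick the newest."""
    learner_rows = [row for row in records if row[1] == student_id]
    if not learner_rows:
        return None
    # ISO-8601 date strings sort chronologically, so max() on them is safe here.
    return max(learner_rows, key=lambda row: row[2])


records = course_records(conn, "course-a")
print(most_recent_for_learner(records, 1))
```

The trade-off in this sketch is that more rows are pulled into memory per course in exchange for avoiding a per-learner LIMIT 1 query. That seems reasonable when per-course record counts are modest, but it is an assumption to check against real data.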

Also, LearnerCourseGradeMetrics queries are slow because the model needs
indexes on fields including site, course, and learner. We address this in two ways:

1. We will add indexing to the needed fields after we prune old records.
   This is so we're not indexing records we are just going to delete anyway.
2. We filter all LearnerCourseGradeMetrics records for the specified
   course.

This commit performs #2 above, then filters that queryset to find
LearnerCourseGradeMetrics records for the specified learner in the
course.

Enrollment Metrics tests have been updated to reflect changes in the
production code.
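
For the deferred indexing step (#1), a rough sketch of what a composite index on site, course, and learner could look like is shown below. This PR does not add the index. The example uses SQLite from the standard library as a stand-in for MySQL, and the table name, column names, index name, and the query_plan helper are hypothetical.

```python
import sqlite3

# Hypothetical stand-in table; the real model is LearnerCourseGradeMetrics.
conn_idx = sqlite3.connect(":memory:")
conn_idx.execute(
    "CREATE TABLE learner_metrics ("
    "id INTEGER PRIMARY KEY, site_id INTEGER, course_id TEXT, "
    "user_id INTEGER, points_earned REAL)"
)
# Composite index covering the three lookup fields named in the description.
conn_idx.execute(
    "CREATE INDEX idx_site_course_user "
    "ON learner_metrics (site_id, course_id, user_id)"
)


def query_plan(conn, sql, params):
    """Return the human-readable detail column of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


plan = query_plan(
    conn_idx,
    "SELECT * FROM learner_metrics "
    "WHERE site_id = ? AND course_id = ? AND user_id = ?",
    (1, "course-a", 42),
)
print(plan)
```

On MySQL the equivalent check would be an EXPLAIN on the generated query. Plan output differs between engines, so treat this only as a way to confirm that an index is being picked up.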

codecov-commenter · 2020-06-15T19:17:53Z

Codecov Report

Merging #226 into master will decrease coverage by 0.04%.
The diff coverage is 87.50%.

@@            Coverage Diff             @@
##           master     #226      +/-   ##
==========================================
- Coverage   91.33%   91.28%   -0.05%     
==========================================
  Files          38       38              
  Lines        1950     1951       +1     
==========================================
  Hits         1781     1781              
- Misses        169      170       +1

Impacted Files	Coverage Δ
figures/pipeline/enrollment_metrics.py	`98.36% <87.50%> (+0.02%)`	⬆️
figures/sites.py	`66.27% <0.00%> (-1.17%)`	⬇️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d96dd9b...d596ffa.

OmarIthawi · 2020-06-15T19:33:45Z

figures/pipeline/enrollment_metrics.py

    if student_modules:
-        most_recent_sm = student_modules.latest('modified')
+        most_recent_sm = student_modules[0]


How many records are we filtering from?

johnbaldwin · 2020-06-15

@OmarIthawi The top five counts are 398, 382, 309, 309, 308.

As far as limiting the queries, AFAIK, there's no point in doing a "LIMIT 5" because querysets are lazily evaluated unless you use the "step" parameter of Python slicing. Please see here: https://docs.djangoproject.com/en/1.11/topics/db/queries/#limiting-querysets

Here's the query I'm running to capture counts:

    from django.db import connection


    def student_module_metrics():
        SM_METRICS_SQL = """\
        SELECT COUNT(id), course_id, student_id from courseware_studentmodule
        GROUP BY course_id, student_id
        HAVING COUNT(id) > 200
        ORDER BY COUNT(id) DESC;
        """
        with connection.cursor() as cursor:
            count_records_sql = 'SELECT COUNT(id) from courseware_studentmodule;'
            cursor.execute(count_records_sql)
            total_records = cursor.fetchone()
            cursor.execute(SM_METRICS_SQL)
            rows = cursor.fetchall()
        # Show the top twenty
        for row in rows[:20]:
            print('{} - course: {}, user: {}'.format(
                row[0], row[1], row[2]))
        return rows

johnbaldwin requested review from melvinsoft, OmarIthawi and thraxil June 15, 2020 19:12

OmarIthawi approved these changes Jun 15, 2020

johnbaldwin merged commit 49ca39c into master Jun 15, 2020

johnbaldwin deleted the john/improve-enroll-metrics-pipeline-perf branch June 15, 2020 23:10
