Skip to content

Updates pipeline enrollment metrics queries to improve performance #226

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 15, 2020

Conversation

johnbaldwin
Copy link
Contributor

This commit should dramatically improve the query performance for the
enrollment metrics pipeline

What was wrong?

Queries were very slow because of a 'LMIT 1' issues with MySQL. For a starting point, see here

https://stackoverflow.com/questions/15460133/mysql-dramatically-slower-query-execution-if-use-limit-1-instead-of-limit-5

In Django, we were doing a filter query that returns a single record or
None. Examples:

StudentModule.objects.filter(**filter_args).latest('modified')
StudentModule.objects.filter(**filter_args).order_by('-modified).first()

Query functions such as latest, first, last and so on add a LIMIT 1 to the underlying SQL query, which has apparent negative performance on the query analyzer

To address this, we do two things

  1. For the specified course, we filter the StudentModule records
  2. For the specifid learner in the course, we filter

Also, LearnerCourseGradesMetrics queries are slow as the model needs indexing
on fields including site, course, and learner. We address this twofold

  1. We will add indexing to the needed fields after we prune old records.
    This is so we're not indexing records we are just going to delete anyway
  2. We filter all LearnerCourseGradeMetrics records for the specified
    course

This commit performs #2 above to then filter from this queryset to find
LearnerCourseGradeMetrics records for the specified learner in the
course

Enrollment Metrics tests have been updated to reflect changes in the
production code

This commit should dramatically improve the query performance for the
enrollment metrics pipeline

What was wrong?

Queries were very slow because of a 'LMIT 1' issues with MySQL. For a starting point, see here

    https://stackoverflow.com/questions/15460133/mysql-dramatically-slower-query-execution-if-use-limit-1-instead-of-limit-5

In Django, we were doing a filter query that returns a single record or
`None`. Examples:

```
StudentModule.objects.filter(**filter_args).latest('modified')
StudentModule.objects.filter(**filter_args).order_by('-modified).first()
```

Query functions such as `latest`, `first`, `last` and so on add a `LIMIT
1` to the underlying SQL query, which has apparent negative performance on the query analyzer

To address this, we do two things

1. For the specified course, we filter the StudentModule records
2. For the specifid learner in the course, we filter

Also, LearnerCourseGradesMetrics queries are slow as the model needs indexing
on fields including site, course, and learner. We address this twofold

1. We will add indexing to the needed fields after we prune old records.
This is so we're not indexing records we are just going to delete anyway
2. We filter all LearnerCourseGradeMetrics records for the specified
course

This commit performs #2 above to then filter from this queryset to find
LearnerCourseGradeMetrics records for the specified learner in the
course

Enrollment Metrics tests have been updated to reflect changes in the
production code
@codecov-commenter
Copy link

codecov-commenter commented Jun 15, 2020

Codecov Report

Merging #226 into master will decrease coverage by 0.04%.
The diff coverage is 87.50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #226      +/-   ##
==========================================
- Coverage   91.33%   91.28%   -0.05%     
==========================================
  Files          38       38              
  Lines        1950     1951       +1     
==========================================
  Hits         1781     1781              
- Misses        169      170       +1     
Impacted Files Coverage Δ
figures/pipeline/enrollment_metrics.py 98.36% <87.50%> (+0.02%) ⬆️
figures/sites.py 66.27% <0.00%> (-1.17%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d96dd9b...d596ffa. Read the comment docs.

if student_modules:
most_recent_sm = student_modules.latest('modified')
most_recent_sm = student_modules[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many records are we filtering from?

Copy link
Contributor Author

@johnbaldwin johnbaldwin Jun 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@OmarIthawi The top five counts are 398, 382, 309, 309, 308.

As far as limiting the queries, AFAIK, there's no point in doing a "LIMIT 5" because querysets are lazy evaluated unless you use the "step" parameter of python slicing. Please see here: https://docs.djangoproject.com/en/1.11/topics/db/queries/#limiting-querysets

Here's the query I'm running to capture counts:

from django.db import connection

def student_module_metrics():

    SM_METRICS_SQL = """ \
        SELECT  COUNT(id), course_id, student_id from courseware_studentmodule
        GROUP BY course_id, student_id
        HAVING COUNT(id) > 200
        ORDER BY COUNT(id) DESC;
    """

    with connection.cursor() as cursor:
        count_records_sql = 'SELECT COUNT(id) from courseware_studentmodule;'
        cursor.execute(count_records_sql)
        total_records = cursor.fetchone()
        cursor.execute(SM_METRICS_SQL)
        rows = cursor.fetchall()

        # Show the top twenty
        for row in rows[:20]:
            print('{} - course: {}, user: {}'.format(
                row[0], row[1], row[2]))

        return rows

@johnbaldwin johnbaldwin merged commit 49ca39c into master Jun 15, 2020
@johnbaldwin johnbaldwin deleted the john/improve-enroll-metrics-pipeline-perf branch June 15, 2020 23:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants