Skip to content
This repository was archived by the owner on Nov 17, 2023. It is now read-only.
This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Gluon RNN memory leaks with extra variables #13951

@yifeim

Description

@yifeim

Note: Providing complete information in the most concise form is the best way to get help. This issue template serves as the checklist for essential information to most of the technical issues and bug reports. For non-technical issues and feature requests, feel free to present the information in what you believe is the best form.

For Q & A and discussion, please start a discussion thread at https://discuss.mxnet.io

Description

Gluon allows one to define extra variables that may not lead to model outcome. However, having them may cause memory leak.

Environment info (Required)

----------Python Info----------
Version      : 3.6.5
Compiler     : GCC 7.2.0
Build        : ('default', 'Apr 29 2018 16:14:56')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 10.0.1
Directory    : /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.3.1
Directory    : /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
Commit Hash   : 19c501680183237d52a862e6ae1dc4ddc296305b
----------System Info----------
Platform     : Linux-4.14.77-70.82.amzn1.x86_64-x86_64-with-glibc2.9
system       : Linux
node         : ip-172-16-95-144
release      : 4.14.77-70.82.amzn1.x86_64
version      : #1 SMP Mon Dec 3 20:01:27 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2706.669
BogoMIPS:              4600.11
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-7
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0020 sec, LOAD: 1.0198 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0912 sec, LOAD: 0.1530 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.5845 sec, LOAD: 0.1434sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0089 sec, LOAD: 0.1170 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0100 sec, LOAD: 0.3888 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0104 sec, LOAD: 0.0782 sec.```

Package used (Python/R/Scala/Julia): Python

Error Message:

If you run watch -n0.1 nvidia-smi, you may observe memory growth every by 2MB every few seconds.

Minimum reproducible example

See mxnet-memory-leak.tar.gz
The main differences between the attachment and examples/gluon/language_model/ are to add extra on Line 56 in model.py add to add mx.nd.array([], ctx=context) on Line 166 and 183 in train.py

Steps to reproduce

(Paste the commands you ran that produced the error.)

1. python train.py --cuda --tied --nhid 200 --emsize 200 --epochs 20 --dropout 0.2 &
2. watch -n0.1 nvidia-smi

What have you tried to solve it?

  1. Add a dummy link between all inputs and outputs. However, this may not always be possible / convenient / readable.
  2. I previously suggested a feature request to allow None input types in the gluon models. Communicated with @szha that this would not be fundamentally challenging. However, this has not been acted upon and may be a low-hanging fruit alongside the memory fix leak.

Related: #13247

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions