Scala Module API resize is leaking memory on the native side. #10867
Description
Create and bind an MXNet Module with batch size N+1, then loop and pass it DataBatches that force the Module to resize before performing the forward pass. Monitor the system resources (with htop, nvidia-smi, jvmtop): the used system memory shown in htop grows steadily, while the JVM heap size and the GPU memory usage do not (system memory usage grows beyond the configured maximum JVM heap size). This continues until the system runs out of memory and either crashes or the JVM is killed, which frees all of the leaked system memory.
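To make the gap between process memory and JVM heap visible from inside the test loop, a small probe along these lines could be logged once per iteration. This is only a sketch: `MemoryProbe`, `residentKb`, `heapUsedKb` and `report` are hypothetical names that are not part of MXNet or the linked reproduction project, and the resident-size reading assumes a Linux `/proc` filesystem.

```scala
import scala.io.Source
import scala.util.Try

// Hypothetical helper (not part of MXNet): compares the process resident set
// size with the JVM heap in use, so native-side growth shows up in the logs.
object MemoryProbe {
  /** Resident set size of this process in kB, read from /proc/self/status.
    * Returns None when /proc is unavailable (non-Linux systems). */
  def residentKb(): Option[Long] = Try {
    val src = Source.fromFile("/proc/self/status")
    try {
      src.getLines()
        .find(_.startsWith("VmRSS:"))
        .map(_.trim.split("\\s+")(1).toLong)
    } finally src.close()
  }.toOption.flatten

  /** JVM heap currently in use, in kB. */
  def heapUsedKb(): Long = {
    val rt = Runtime.getRuntime
    (rt.totalMemory() - rt.freeMemory()) / 1024
  }

  /** One log line per iteration; call this inside the forward-pass loop. */
  def report(iteration: Int): String = {
    val rss = residentKb().map(kb => s"$kb kB").getOrElse("n/a")
    s"iter=$iteration rss=$rss heapUsed=${heapUsedKb()} kB"
  }
}
```

If the leak is on the native side, the `rss` value would be expected to keep climbing across iterations while `heapUsed` stays bounded by the JVM heap limit.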
Environment info (Required)
----------Python Info----------
Version : 3.4.3
Compiler : GCC 4.8.4
Build : ('default', 'Nov 17 2016 01:08:31')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 10.0.1
Directory : /usr/local/lib/python3.4/dist-packages/pip
----------MXNet Info-----------
Version : 1.1.0
Directory : /home/paperspace/src/mxnet/python/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Linux-4.4.0-31-generic-x86_64-with-Ubuntu-14.04-trusty
system : Linux
node : psplnlbg
release : 4.4.0-31-generic
version : #50~14.04.1-Ubuntu SMP Wed Jul 13 01:07:32 UTC 2016
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Stepping: 1
CPU MHz: 2600.072
BogoMIPS: 5200.14
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 10240K
NUMA node0 CPU(s): 0-7
----------Network Test----------
Setting timeout: 10
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0884 sec, LOAD: 0.4182 sec.
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0131 sec, LOAD: 0.4660 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0957 sec, LOAD: 0.6263 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0099 sec, LOAD: 0.3049 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0134 sec, LOAD: 0.6473 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0126 sec, LOAD: 0.0343 sec.
Package used (Python/R/Scala/Julia):
Scala. This seems to be specific to the Scala API. Cannot reproduce in Python.
For Scala user, please provide:
- Java version (java -version):
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
- Maven version (mvn -version):
Apache Maven 3.0.5
Maven home: /usr/share/maven
Java version: 1.8.0_131, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-8-oracle/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.4.0-31-generic", arch: "amd64", family: "unix"
- Scala runtime if applicable (scala -version):
2.11.11
Build info (Required if built from source)
Compiler (gcc/clang/mingw/visual studio): GCC 4.8.4
MXNet commit hash:
07a83a0
Build config:
make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1
make scalainstall -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1
Error Message:
None
Minimum reproducible example
link to simple Scala project/code to reproduce issue
https://github.com/jessebrizzi/MXNet-Bug/blob/master/scala/TestBug.scala
Steps to reproduce
- Pull this project and run it with SBT to test this issue.
What have you tried to solve it?
- Avoid passing in any DataBatch that requires a resize by padding smaller DataBatches with zeros.
- Not using the Scala API
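The padding workaround can be sketched on plain arrays as follows. This is a minimal illustration rather than the code from the linked project: `BatchPadding` and `padToBatchSize` are hypothetical names, the batch is assumed to be stored as a flat row-major `Array[Float]`, and wrapping the result back into a DataBatch is left out.

```scala
// Hypothetical helper (not part of MXNet): pads a short batch with zero rows
// so its size always matches the batch size the Module was bound with.
object BatchPadding {
  /** @param data      flat, row-major batch data (rows * rowSize elements)
    * @param rowSize   number of elements in one example
    * @param batchSize batch size the Module was bound with
    * @return the zero-padded data and the number of padded rows */
  def padToBatchSize(data: Array[Float], rowSize: Int, batchSize: Int): (Array[Float], Int) = {
    require(rowSize > 0 && data.length % rowSize == 0, "data must hold whole rows")
    val rows = data.length / rowSize
    require(rows <= batchSize, "batch is larger than the bound batch size")
    // Arrays.copyOf fills the newly added tail with zeros.
    val padded = java.util.Arrays.copyOf(data, batchSize * rowSize)
    (padded, batchSize - rows)
  }
}
```

For example, `BatchPadding.padToBatchSize(Array(1f, 2f, 3f, 4f), 2, 4)` yields an array of eight floats whose last four are zero, together with a pad count of 2. The pad count tells the caller how many trailing output rows to discard after the forward pass.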