C_API for C++ iterator only supports one data and one label for one sample #12141
Description
I'm currently implementing a C++ iterator for performance reasons.
My iterator (call it MyIterator) provides one data array and multiple labels of different shapes.
These data and labels are all stored in the data attribute of the DataBatch object. The arrangement of data and labels may look like the following:
databatch.data[0] = data // in this case only one data is provided (image data)
databatch.data[1] = label1 // the 1st label, of shape shape<dim>(?)
databatch.data[2] = label2 // the 2nd label, of shape shape<dim>(?)
...
However, when I inspected the C API code, I found that the implementation does not account for the multiple-data, multiple-label situation shown above. Instead, only the 0th element of DataBatch's data (Line767) is taken as data, and only the 1st element (Line745) is taken as the label. Everything else that MyIterator provides is silently dropped. For instance, on the Python side, when I call next, an incomplete batch is returned:
import mxnet as mx

my_iter = mx.io.MyIterator(...)
databatch = my_iter.next()
print(len(databatch.data))   # 1: the data is preserved, since only one data array is provided
print(len(databatch.label))  # 1: only one label is preserved!
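To make the behaviour described above concrete, here is a toy model in plain Python. It is only a sketch: the function name `split_like_c_api` is hypothetical, plain lists stand in for DataBatch's data, and nothing here calls MXNet. It assumes, as described above, that index 0 is read as data, index 1 as the label, and nothing else is read.

```python
# Toy model only: a plain list stands in for DataBatch's data vector.
# This is not MXNet code; it mirrors the indexing described in this issue.
def split_like_c_api(batch_data):
    """Return (data, label, dropped) using the fixed indices 0 and 1."""
    data = batch_data[:1]      # only the 0th entry is exposed as data
    label = batch_data[1:2]    # only the 1st entry is exposed as label
    dropped = batch_data[2:]   # everything else never reaches Python
    return data, label, dropped


data, label, dropped = split_like_c_api(["image", "label1", "label2"])
print(len(data), len(label), dropped)
```

Under that assumption, any entry from index 2 onward ends up in `dropped`, which matches the incomplete batch seen on the Python side.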
Should this issue be filed as a feature request? Perhaps the API could adopt a more extensible design that uses key-value pairs to store multiple data and labels. For example:
struct DataBatchEx {
  // Instead of vector<NDArray>, store data as key-value pairs
  std::vector<std::pair<std::string, NDArray>> data;
  // The API will explicitly use this attribute to construct the labels
  std::vector<std::pair<std::string, NDArray>> labels;
  std::vector<uint64_t> index;
  std::string extra_data;  // may be discarded
  int num_batch_padd;
};
(For now, as a workaround, I could concatenate all my labels into one array and decode it on the Python side.)
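A minimal sketch of that workaround, using NumPy arrays in place of MXNet NDArrays. The label shapes in `LABEL_SHAPES` and the helper names `pack_labels` and `unpack_labels` are hypothetical. `pack_labels` only simulates what the C++ iterator would emit as its single label, and the sketch assumes all labels share one dtype and the same batch size.

```python
import numpy as np

# Hypothetical per-sample label shapes; the real ones depend on the iterator.
LABEL_SHAPES = [(4,), (2, 3)]


def pack_labels(labels):
    """Flatten each (batch, *shape) label and concatenate along axis 1."""
    batch = labels[0].shape[0]
    return np.concatenate([lab.reshape(batch, -1) for lab in labels], axis=1)


def unpack_labels(packed, shapes=LABEL_SHAPES):
    """Split a (batch, total) array back into one array per label shape."""
    sizes = [int(np.prod(s)) for s in shapes]
    if packed.shape[1] != sum(sizes):
        raise ValueError("packed label width does not match the given shapes")
    offsets = np.cumsum(sizes)[:-1]  # column indices where each label ends
    parts = np.split(packed, offsets, axis=1)
    batch = packed.shape[0]
    return [p.reshape((batch,) + tuple(s)) for p, s in zip(parts, shapes)]


# Two labels for a batch of 2 samples.
label1 = np.arange(8, dtype=np.float32).reshape(2, 4)
label2 = np.arange(12, dtype=np.float32).reshape(2, 2, 3)

packed = pack_labels([label1, label2])  # single label of shape (2, 10)
restored = unpack_labels(packed)        # back to shapes (2, 4) and (2, 2, 3)
```

With a real iterator, the packed array would presumably come from something like `databatch.label[0].asnumpy()`, and the shape list would have to be kept in sync with the C++ side by hand, which is the main fragility of this approach.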