Repartition - CPU #526


Merged: 21 commits merged into cylondata:main on Nov 23, 2021

Conversation

kaiyingshan (Collaborator):

Repartition C++ implementation and tests

@nirandaperera changed the title from Repartition to Repartition - CPU on Nov 2, 2021
@nirandaperera (Collaborator) left a comment:

@kaiyingshan Great work! I requested some changes. Let me know if those make sense or if you need any clarification.

Comment on lines 197 to 200
arrow::Result<std::shared_ptr<arrow::Table>> concat_res =
arrow::ConcatenateTables(received_tables);
RETURN_CYLON_STATUS_IF_ARROW_FAILED(concat_res.status());
const auto &final_table = concat_res.ValueOrDie();
Collaborator:

let's use CYLON_ASSIGN_OR_RAISE here

Suggested change
arrow::Result<std::shared_ptr<arrow::Table>> concat_res =
arrow::ConcatenateTables(received_tables);
RETURN_CYLON_STATUS_IF_ARROW_FAILED(concat_res.status());
const auto &final_table = concat_res.ValueOrDie();
CYLON_ASSIGN_OR_RAISE(auto final_table, arrow::ConcatenateTables(received_tables))

Comment on lines 203 to 206
arrow::Result<std::shared_ptr<arrow::Table>> combine_res =
final_table->CombineChunks(cylon::ToArrowPool(ctx));
RETURN_CYLON_STATUS_IF_ARROW_FAILED(concat_res.status());
table_out = combine_res.ValueOrDie();
Collaborator:

Suggested change
arrow::Result<std::shared_ptr<arrow::Table>> combine_res =
final_table->CombineChunks(cylon::ToArrowPool(ctx));
RETURN_CYLON_STATUS_IF_ARROW_FAILED(concat_res.status());
table_out = combine_res.ValueOrDie();
CYLON_ASSIGN_OR_RAISE(table_out, final_table->CombineChunks(cylon::ToArrowPool(ctx)))

static inline Status all_to_all_arrow_tables_preserve_order(const std::shared_ptr<CylonContext> &ctx,
                                                            const std::shared_ptr<arrow::Schema> &schema,
                                                            const std::vector<std::shared_ptr<arrow::Table>> &partitioned_tables,
                                                            std::shared_ptr<arrow::Table> &table_out) {
Collaborator:

shall we use pointers for output here?
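
For illustration, a minimal sketch of what a pointer-based output parameter could look like here (names mirror the snippet above; this is a sketch, not the PR's final code):

// Sketch: same helper, but the output table is passed as a raw pointer
// instead of a non-const reference.
static inline Status all_to_all_arrow_tables_preserve_order(
    const std::shared_ptr<CylonContext> &ctx,
    const std::shared_ptr<arrow::Schema> &schema,
    const std::vector<std::shared_ptr<arrow::Table>> &partitioned_tables,
    std::shared_ptr<arrow::Table> *table_out) {
  // ... body unchanged, except the result is written through the pointer:
  // *table_out = final_table;
  return Status::OK();
}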

Comment on lines 1187 to 1189
std::vector<int64_t> size = { num_row };
std::vector<int64_t> sizes;
mpi::AllGather(size, world_size, sizes);
Collaborator:

A couple of things here. You don't need to allocate a vector to gather; you can simply allocate just an int64_t size variable. My suggestion is:

Suggested change
std::vector<int64_t> size = { num_row };
std::vector<int64_t> sizes;
mpi::AllGather(size, world_size, sizes);
int64_t size = num_row;
std::vector<int64_t> sizes(world_size, 0); // allocate world_size number of slots
int status = mpi::AllGather(&size, world_size, sizes.data());
// this status needs to be checked!!

Collaborator (author):

from the code, it seems like mpi::AllGather can only take vectors as arguments, but MPI_Allgather can take pointers. Should I use it?
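
For reference, a minimal sketch of the raw MPI call being discussed, assuming <mpi.h> is available and the communicator is MPI_COMM_WORLD (cylon's mpi::AllGather wrapper may look different):

// Sketch: gather one int64_t from every rank using pointer buffers.
int64_t size = num_row;
std::vector<int64_t> sizes(world_size, 0);
int rc = MPI_Allgather(&size, 1, MPI_INT64_T,
                       sizes.data(), 1, MPI_INT64_T, MPI_COMM_WORLD);
if (rc != MPI_SUCCESS) {
  // handle the failure, e.g. return a non-OK cylon::Status
}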

Comment on lines +1191 to +1197
if(rows_per_partition.size() != world_size) {
return Status(
cylon::Code::ValueError,
"rows_per_partition size does not align with world size. Received " +
std::to_string(rows_per_partition.size()) + ", Expected " +
std::to_string(world_size));
}
Collaborator:

let's move this before the allgather operation. There's no need to do a comm operation if this check is failing.

std::to_string(acc));
}

std::vector<std::pair<int, int>> send_to = find_mapping(start_idx, num_row, rows_per_partition, dest_sizes_acc);
Collaborator:

I'd rather use a vector<int64> instead of vector<pair<int, int64>> here. We could make the index of the vector correspond to the rank, couldn't we? 🤔
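
For illustration, a rough sketch of the rank-indexed alternative (the variable names here are hypothetical):

// Sketch: index by destination rank instead of storing (rank, count) pairs.
std::vector<int64_t> send_counts(world_size, 0);
// wherever the pair version did send_to.push_back({rank, count}), do:
//   send_counts[rank] += count;
// later, iterate ranks 0..world_size-1 and skip the zero entries.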

Comment on lines 1230 to 1233
for(auto p: send_to) {
std::fill(itr + idx, itr + idx + p.second, (uint32_t) receive_build_rank_order[p.first]);
idx += p.second;
}
Collaborator:

could you please clarify this loop? I didn't understand the receive_build_rank_order[p.first] part

Collaborator (author):

p.first is the destination partition index and p.second is the number of elements.
For one partition, it may need to send, for example, a1 elements to partition receive_build_rank_order[0] and then a2 elements to partition receive_build_rank_order[1]; in that case send_to will be {{0, a1}, {1, a2}}.

std::vector<std::shared_ptr<arrow::Table>> partitioned_tables;
RETURN_CYLON_STATUS_IF_FAILED(Split(table, no_of_partitions, outPartitions, partitioned_tables));

std::shared_ptr<arrow::Schema> schema = table->get_table()->schema();
Collaborator:

Suggested change
std::shared_ptr<arrow::Schema> schema = table->get_table()->schema();
const auto& schema = table->get_table()->schema();

Collaborator:

bump

Comment on lines 1272 to 1275
int total = 0;
for(int n: sizes) {
total += n;
}
Collaborator:

use std::accumulate
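
For illustration, a minimal sketch of the suggested change (assumes <numeric> is included):

#include <numeric>  // std::accumulate

// Sum the gathered per-rank row counts without a hand-written loop.
int64_t total = std::accumulate(sizes.begin(), sizes.end(), static_cast<int64_t>(0));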

Comment on lines +22 to +31
static void verify_test(std::vector<std::vector<std::string>>& expected, std::shared_ptr<Table>& output) {
std::stringstream ss;
output->PrintToOStream(ss);
std::string s;
int i = 0;
while(ss>>s) {
REQUIRE(s == expected[RANK][i++]);
}
REQUIRE(i == expected[RANK].size());
}
Collaborator:

you can use the ARROW_EQUALS macro here. To create the expected arrow table, use TableFromJSON from the arrow_test_utils header.
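
For illustration, a rough sketch of that approach; the exact TableFromJSON and ARROW_EQUALS signatures live in the arrow_test_utils header and may differ, and the schema variable, column names, and JSON rows below are placeholders:

// Sketch: build the expected table from JSON and compare it with the output.
auto expected = TableFromJSON(schema, {R"([{"c0": 1, "c1": 2},
                                           {"c0": 3, "c1": 4}])"});
ARROW_EQUALS(expected, output->get_table());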

Collaborator:

bump

@ahmet-uyar (Collaborator) commented on Nov 3, 2021:

I have implemented this: https://github.com/ahmet-uyar/cylon/blob/repartition/cpp/src/cylon/repartition.hpp
Basically, there are two methods I think you have also implemented:

  • DivideRowsEvenly (this is less significant, but it may remove replicated code)
  • RowIndicesToAll (this calculates the list of indices to send partitions to)

@ahmet-uyar (Collaborator):

While testing, I recommend also testing with empty tables and empty partitions, for example:

        std::vector<int64_t> rows_per_partition = {0, 0, 0, 0};
        std::vector<int64_t> rows_per_partition = {12, 0, 0, 0};
        std::vector<int64_t> rows_per_partition = {6, 0, 6, 0};

I also recommend testing non-numeric data types such as strings and dates. There are data files for that:

  • data/mpiops/sales_nulls_nunascii_x.csv

@nirandaperera (Collaborator):

(quoting @ahmet-uyar's comment above about DivideRowsEvenly and RowIndicesToAll)

Yes, I also saw that there was some duplication between the 2 PRs. The best approach would be to merge one PR first and then reuse the utils from it in the other.

const std::vector<int64_t>& rows_per_partition,
std::shared_ptr<cylon::Table> *output);

Status Repartition(const std::shared_ptr<cylon::Table>& table,
Collaborator:

we may remove this function. The previous function can take rows_per_partition with a default value of an empty vector; if the vector is empty, it can perform even repartitioning.
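
For illustration, a rough sketch of the merged declaration (a sketch only; note that rows_per_partition has to move behind the output pointer so it can take a default value):

// Sketch: one Repartition API; an empty rows_per_partition means
// "repartition evenly across all workers".
Status Repartition(const std::shared_ptr<cylon::Table>& table,
                   std::shared_ptr<cylon::Table>* output,
                   const std::vector<int64_t>& rows_per_partition = {});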

@nirandaperera (Collaborator):

@kaiyingshan I merged #528 now. Ping @ahmet-uyar if you need any help with those utils.

@nirandaperera (Collaborator) left a comment:

@kaiyingshan I made some minor comments. I also saw that there were some previous comments that need to be addressed.
Overall, it looks good to me. Good job @kaiyingshan :-)
Let's address these comments and merge this on green CI!

Comment on lines 1133 to 1135
if(num_row == 0) {
return Status::OK();
}
Collaborator:

I think since we are returning Status::OK, we need to set b_out. We can simply assign a to it.
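
For illustration, a minimal sketch of the suggested early return (a sketch only):

if (num_row == 0) {
  *b_out = a;  // nothing to repartition; simply reuse table a as the output
  return Status::OK();
}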

Comment on lines 1139 to 1141
RETURN_CYLON_STATUS_IF_FAILED(Repartition(b, rows_per_partition, b_out));

return Status::OK();
Collaborator:

nit

Suggested change
RETURN_CYLON_STATUS_IF_FAILED(Repartition(b, rows_per_partition, b_out));
return Status::OK();
return Repartition(b, rows_per_partition, b_out);

@@ -1056,26 +1126,124 @@ Status Equals(const std::shared_ptr<cylon::Table>& a, const std::shared_ptr<cylo
return Status::OK();
}

static Status RepartitionToMatchOtherTable(const std::shared_ptr<cylon::Table> &a, const std::shared_ptr<cylon::Table> &b, std::shared_ptr<cylon::Table> * b_out) {
int world_size = a->GetContext()->GetWorldSize();
int num_row = a->Rows();
Collaborator:

Rows are usually int64

Comment on lines 1180 to 1182
if(num_row == 0) {
return Status::OK();
}
Collaborator:

Suggested change
if(num_row == 0) {
return Status::OK();
}
if(num_row == 0) {
*output = table;
return Status::OK();
}

std::vector<std::shared_ptr<arrow::Table>> partitioned_tables;
RETURN_CYLON_STATUS_IF_FAILED(Split(table, no_of_partitions, outPartitions, partitioned_tables));

std::shared_ptr<arrow::Schema> schema = table->get_table()->schema();
Collaborator:

bump

Comment on lines 1247 to 1249
*output = std::make_shared<cylon::Table>(table->GetContext(), table_out);

return Status::OK();
Collaborator:

bump

int num_row = table->Rows();
std::vector<int64_t> size = { num_row };
std::vector<int64_t> sizes;
mpi::AllGather(size, world_size, sizes);
Collaborator:

Status of this call needs to be checked

Comment on lines +22 to +31
static void verify_test(std::vector<std::vector<std::string>>& expected, std::shared_ptr<Table>& output) {
std::stringstream ss;
output->PrintToOStream(ss);
std::string s;
int i = 0;
while(ss>>s) {
REQUIRE(s == expected[RANK][i++]);
}
REQUIRE(i == expected[RANK].size());
}
Collaborator:

bump

Comment on lines 2 to 17
from utils import create_df,assert_eq
from pycylon.net import MPIConfig
import random

"""
Run test:
>> pytest -q python/pycylon/test/test_repartition.py
"""

def test_repartition():
    env = CylonEnv(config=MPIConfig())
    df1, _ = create_df([random.sample(range(10, 300), 50),
                        random.sample(range(10, 300), 50),
                        random.sample(range(10, 300), 50)])
    df2 = df1.repartition([50], None, env=env)
    assert_eq(df1, df2)
Collaborator:

to run this test, you need to add it under test/test_all.py. Otherwise, it won't run.

And I think this could be extended for multiple workers.

Suggested change
from utils import create_df,assert_eq
from pycylon.net import MPIConfig
import random
"""
Run test:
>> pytest -q python/pycylon/test/test_repartition.py
"""
def test_repartition():
    env = CylonEnv(config=MPIConfig())
    df1, _ = create_df([random.sample(range(10, 300), 50),
                        random.sample(range(10, 300), 50),
                        random.sample(range(10, 300), 50)])
    df2 = df1.repartition([50], None, env=env)
    assert_eq(df1, df2)
from utils import create_df,assert_eq
from pycylon.net import MPIConfig
import random
"""
Run test:
>> pytest -q python/pycylon/test/test_repartition.py
"""
def test_repartition():
    env = CylonEnv(config=MPIConfig())
    world_sz = env.get_world_size()
    df1, _ = create_df([random.sample(range(10, 300), 50),
                        random.sample(range(10, 300), 50),
                        random.sample(range(10, 300), 50)])
    df2 = df1.repartition([50 for _ in range(world_sz)], None, env=env)  # distributed repartition
    assert_eq(df1, df2)  # still the local partitions would be equal

@nirandaperera merged commit 112ea97 into cylondata:main on Nov 23, 2021
@nirandaperera (Collaborator):

@kaiyingshan Thank you very much for doing this! Great work.. 👍

nirandaperera added a commit that referenced this pull request Dec 16, 2021
* repartition with custom rank order

* use all_to_all that preserves rank order & added tests

* input validation & more test

* add C++ apis and corresponding tests

* minor fix

* python apis

* temporarily delete python api

* python api

* improve coding style, add comments, and refactor find mapping

* remove unused code

* use util function

* refined distributed eq with repartition

* use int64

* create a MacOS yml file (#530)

* equal tests

* fixes

* Update test_repartition.py

minor change

* fixing test failures

* adding an additional test

Co-authored-by: Ziyao22 <[email protected]>
Co-authored-by: niranda perera <[email protected]>