1. Use shared memory for the shuffler across threads.
2. Use distributed shuffling (distribute the shuffler) across machines.
3. Precompute the shuffling order for each machine. Parallel version for load balancing?
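
A minimal sketch of how the three ideas could fit together, under assumptions the notes do not state: a single seed is agreed on by all machines, each machine knows its `rank` and the `world_size`, and the work per item is roughly uniform. All function names here (`precompute_order`, `machine_slice`, `process_in_threads`) are hypothetical, chosen only for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def precompute_order(num_items, seed):
    """Build the full shuffle order once.

    Assumption: every machine uses the same seed, so each one derives
    the identical permutation without any communication.
    """
    order = list(range(num_items))
    random.Random(seed).shuffle(order)
    return order


def machine_slice(order, rank, world_size):
    """Give each machine a strided share of the precomputed order.

    Striding keeps the per-machine item counts within one of each other,
    which is a simple form of load balancing.
    """
    return order[rank::world_size]


def process_in_threads(indices, num_threads, work):
    """Let threads read one shared list of indices in memory.

    Each thread walks its own strided positions of the same list object,
    so the order is stored once per machine rather than once per thread.
    """
    def run(tid):
        return [work(indices[k]) for k in range(tid, len(indices), num_threads)]

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(run, range(num_threads)))


if __name__ == "__main__":
    order = precompute_order(num_items=10, seed=0)
    local = machine_slice(order, rank=0, world_size=3)
    print(process_in_threads(local, num_threads=2, work=lambda i: i * 2))
```

Strided assignment only balances item counts. If items differ a lot in cost, a work queue shared by the threads may balance better, at the price of some synchronization.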