Feature faster instance seg #1401
base: main
Conversation
⚡️ Codeflash found optimizations for this PR: 📄 150% (1.50x) speedup for (`…e-faster-instance-seg`)

Here's a highly optimized version of your program, based on your profiling. The key bottlenecks were:

- Expensive per-element operations (`masks[masks < 0.5] = 0`)
- Deep-copying of large arrays
- Inefficient mask slicing in Python loops
- Repeated device transfers in `preprocess_segmentation_masks`

The rewrite:

- Avoids unnecessary `deepcopy` (uses `.copy()` and in-place ops without side effects)
- Fuses `masks[masks < 0.5] = 0` into a more efficient array assignment (either a boolean multiply or `np.where`)
- Moves as much slicing and post-processing as possible to bulk NumPy ops, avoiding Python loops, especially in `slice_masks`
- Reduces data copying/movement, making operations in-place where possible
- Avoids repeated instantiation of tensors where possible (especially for constant mask shapes)
- Vectorizes `scale_bboxes`
- Optionally takes a torch-only path when `gpu_decode` is set (deferring the `.cpu().numpy()` round trip until the end)
- Keeps the functions' external behavior and output compatible

**Key remarks**

* All dense operations are in NumPy or PyTorch and are in-place where possible.
* The expensive for loop in `slice_masks` is unavoidable when boxes differ in shape, but all slicing indices are now extracted once, removing redundant work.
* `masks[masks < 0.5] = 0` is now an in-place multiply, which is an order of magnitude faster than masked assignment.
* `deepcopy` and any redundant copying/allocations are avoided.
* Torch computations use in-place operations for further speedup (e.g., `sigmoid_`).
* No logic or results are changed.

Let me know if you want to attempt full batch slicing (sometimes possible in edge-aligned scenarios), or want separate CUDA/NumPy-only pathways for further speedup.
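As a minimal toy illustration of the in-place-multiply replacement the bot describes (not the PR's actual code), the boolean-indexed assignment and the in-place multiply produce identical results:

```python
import numpy as np

masks = np.random.default_rng(0).random((8, 160, 160)).astype(np.float32)

# Original pattern: boolean indexing builds an index set, then scatters zeros
slow = masks.copy()
slow[slow < 0.5] = 0

# Replacement: multiply in place by the comparison mask in a single fused pass
fast = masks.copy()
fast *= fast >= 0.5

assert np.array_equal(slow, fast)
```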
@@ -80,6 +83,7 @@ def infer(
    disable_preproc_contrast (bool, optional): If true, the auto contrast preprocessing step is disabled for this call. Default is False.
    disable_preproc_grayscale (bool, optional): If true, the grayscale preprocessing step is disabled for this call. Default is False.
    disable_preproc_static_crop (bool, optional): If true, the static crop preprocessing step is disabled for this call. Default is False.
    gpu_decode (bool, optional): Use GPU (cuda or mps) hardware to perform some of the mask decoding steps (processing mode agnostic). Default is True.
On line 65 the default is `False`.
masks[masks < 0.5] = 0
return masks
masks = slice_masks(masks, down_sampled_boxes)
return masks, (mh, mw)
Please update the method signature to indicate the returned type.
Also, this is a breaking change; is it strictly necessary to pass the shape? Can we not infer the shape from masks?
masks[masks < 0.5] = 0
return masks
sliced_masks = slice_masks(masks, down_sampled_boxes)
return sliced_masks, (mh, mw)
Please update the method signature to indicate the returned type.
Also, this is a breaking change; is it strictly necessary to pass the shape? Can we not infer the shape from sliced_masks?
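One hypothetical way to address both points (names here are illustrative, not the PR's actual code): annotate the return type explicitly. Note that if `slice_masks` produces per-box crops of varying size, the prototype resolution `(mh, mw)` can no longer be read off the crops themselves, which may be why it has to travel alongside them:

```python
from typing import List, Tuple

import numpy as np

def decode_masks(masks: np.ndarray) -> Tuple[List[np.ndarray], Tuple[int, int]]:
    # Prototype mask resolution, read off the stacked array before slicing
    mh, mw = masks.shape[-2], masks.shape[-1]
    # Placeholder for slice_masks(masks, down_sampled_boxes); real crops may
    # each have a different shape, so (mh, mw) is not recoverable from them
    sliced = [m for m in masks]
    return sliced, (mh, mw)

sliced, (mh, mw) = decode_masks(np.zeros((3, 160, 160), dtype=np.float32))
```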
@@ -230,6 +241,7 @@ def process_mask_accurate(
    masks_in: np.ndarray,
    bboxes: np.ndarray,
    shape: Tuple[int, int],
    gpu_decode: bool = False,
Please add this parameter to the docstring.
@@ -267,6 +280,7 @@ def process_mask_tradeoff(
    bboxes: np.ndarray,
    shape: Tuple[int, int],
    tradeoff_factor: float,
    gpu_decode: bool = False,
Please add this parameter to the docstring.
# masks is a single numpy array containing multiple masks
masks = (masks * 255.0).astype(np.uint8)
for mask in masks:
    segments.append(mask2poly(mask))
If we are expecting either a list of `np.ndarray` or a single `np.ndarray`, I'd suggest checking whether the argument is an instance of `list`; if not, wrap it in a list, and then have the remaining logic operate on a list.
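A sketch of that suggestion, with a stand-in `mask2poly` since the real one lives in the PR (the stub here just counts nonzero pixels):

```python
import numpy as np

def mask2poly(mask):
    # Stand-in for the PR's real mask2poly; here it just counts nonzero pixels
    return int(np.count_nonzero(mask))

def masks2poly(masks):
    # Normalize the input to a list up front, then run one code path
    if not isinstance(masks, list):
        masks = list(masks)  # iterating a 3-D ndarray yields one 2-D mask per slice
    segments = []
    for mask in masks:
        mask = (mask * 255.0).astype(np.uint8)
        segments.append(mask2poly(mask))
    return segments

arr = np.zeros((2, 4, 4), dtype=np.float32)
arr[0, 1, 1] = 1.0
# Same result whether a stacked array or a list of masks is passed
assert masks2poly(arr) == masks2poly([arr[0], arr[1]])
```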
"""Converts binary masks to polygonal segments. | ||
|
||
Args: | ||
masks (numpy.ndarray): A set of binary masks, where masks are multiplied by 255 and converted to uint8 type. | ||
masks (numpy.ndarray or list of numpy.ndarray): A set of binary masks, where masks are multiplied by 255 and converted to uint8 type. |
It seems we are adding other changes to this PR that are not related to the new parameter `gpu_decode`; can we divide this PR into smaller incremental changes?
[[pt[0] * scale_x + bbox[0], pt[1] * scale_y + bbox[1]] for pt in poly]
for (poly, bbox) in zip(polys, pred[:, :4])
]
If bboxes were not previously applied to polys, we are changing the behavior of this method; can we extract this to a separate PR if this change fixes something not related to `gpu_decode`?
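For reference, the per-point comprehension in the diff above could also be done as one NumPy operation per polygon; a hypothetical sketch (not part of the PR):

```python
import numpy as np

def scale_poly(poly, bbox, scale_x, scale_y):
    # poly: (N, 2) points in mask-crop coordinates; bbox: (x1, y1, x2, y2)
    poly = np.asarray(poly, dtype=np.float32)
    return poly * [scale_x, scale_y] + bbox[:2]  # scale, then offset by the box origin

pts = scale_poly([[0.0, 0.0], [1.0, 2.0]], np.array([10.0, 20.0, 30.0, 40.0]), 2.0, 3.0)
# pts == [[10, 20], [12, 26]]
```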
@MadeWithStone / @roboflowmaxwell thank you for submitting this PR! Can you fix the CLA (@roboflowmaxwell needs to accept the CLA), and can you also split this PR into smaller PRs so that changes solving different issues can be handled separately? Please also fix the formatting and look at the failing tests. Thanks!
⚡️ Speed up function `process_mask_fast` by 150% in PR #1401 (`feature-faster-instance-seg`)
This PR is now faster! 🚀 @grzegorz-roboflow accepted my optimizations from:
Description
Uses mask slices sized to the corresponding bounding boxes and moves tensor processing to the GPU (when enabled with a flag) to allow for faster instance segmentation post-processing on Jetsons. This leads to roughly a 2x speed improvement in mask post-processing on Jetson devices.
Type of change
Please delete options that are not relevant.
How has this change been tested? Please provide a testcase or example of how you tested the change.
Tested using existing inference tests and by running workflows on a gpu dev vm and jetson device to test for changes in processing latency.