
Feature faster instance seg #1401


Open · wants to merge 14 commits into main from feature-faster-instance-seg

Conversation

MadeWithStone

Description

Uses mask slices sized to the corresponding bounding boxes and moves tensor processing to the GPU (when enabled with a flag) to speed up instance segmentation post-processing on Jetsons, yielding roughly a 2x improvement in mask post-processing time on Jetson devices.
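
For context, a minimal sketch of the slicing idea (the helper name and exact crop logic here are illustrative assumptions, not the PR's actual code):

    import numpy as np

    def slice_masks_to_boxes(masks: np.ndarray, boxes: np.ndarray) -> list:
        # Crop each full-resolution mask down to its (x1, y1, x2, y2) box so
        # that thresholding and upsampling touch far fewer pixels per instance.
        sliced = []
        for mask, (x1, y1, x2, y2) in zip(masks, boxes.astype(int)):
            sliced.append(mask[y1:y2, x1:x2])
        return sliced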

Type of change


  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How has this change been tested? Please provide a test case or an example of how you tested the change.

Tested using existing inference tests and by running workflows on a GPU dev VM and a Jetson device to measure changes in processing latency.


CLAassistant commented Jul 2, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
3 out of 4 committers have signed the CLA.

✅ MadeWithStone
✅ grzegorz-roboflow
✅ hansent
❌ roboflowmaxwell
You have signed the CLA already but the status is still pending? Let us recheck it.

MadeWithStone and others added 2 commits July 1, 2025 20:02
…e-faster-instance-seg`)

Here’s a highly optimized version of your program, based on your profiling. The key bottlenecks were:
- Expensive per-element operations (`masks[masks < 0.5] = 0`)
- Deepcopying of large arrays
- Inefficient mask slicing in Python loops  
- Repeated device transfers in `preprocess_segmentation_masks`

The rewrites address them as follows (a minimal sketch of two of these changes follows this list):
- Avoid unnecessary `deepcopy` (a `.copy()` plus in-place operations has no side effects)
- Fuse `masks[masks < 0.5] = 0` into a more efficient array assignment (either a boolean multiply or `np.where`)
- Move as much slicing and post-processing as possible into NumPy bulk ops, avoiding Python loops, especially in `slice_masks`
- Reduce data copying and movement, making operations in-place where possible
- Avoid repeated instantiation of tensors where possible (especially for constant mask shapes)
- Vectorize `scale_bboxes`
- Optionally leverage a torch-only path when `gpu_decode` is set (minimizing the `.cpu().numpy()` round trip until the end)
- Keep the functions' external behavior and output compatible
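
A standalone sketch of two of these rewrites (illustrative only; `masks` is assumed to be a float array in [0, 1], and this `scale_bboxes` is a hypothetical stand-in, not the repository's function):

    import numpy as np

    masks = np.random.rand(8, 160, 160).astype(np.float32)

    # Before: per-element boolean-mask assignment.
    # masks[masks < 0.5] = 0
    # After: a fused in-place multiply by the comparison result.
    masks *= masks >= 0.5

    # Vectorized bbox scaling: one broadcasted multiply instead of a Python loop.
    def scale_bboxes(bboxes: np.ndarray, sx: float, sy: float) -> np.ndarray:
        return bboxes * np.array([sx, sy, sx, sy], dtype=bboxes.dtype)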



**Key Remarks:**
* All dense operations are in NumPy or PyTorch and are in-place where possible.
* The expensive for loop in `slice_masks` is unavoidable when boxes have different shapes, but we now extract all slicing indices once, removing redundant work.
* `masks[masks < 0.5] = 0` is now an in-place multiply, which is an order of magnitude faster than the masked assignment.
* Avoid `deepcopy` and any redundant copying/allocations.
* Torch computations use in-place operations for further speedup (e.g., `sigmoid_`).
* No logic or results are changed.

Let me know if you want to attempt full batch slicing (sometimes possible in edge-aligned scenarios), or want separate CUDA/NumPy-only pathways for further speedup.
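
A rough sketch of what such a torch-only decode path can look like (the function name, tensor shapes, and device-selection logic are assumptions for illustration):

    import numpy as np
    import torch

    def decode_masks_torch(coeffs_np: np.ndarray, protos_np: np.ndarray) -> np.ndarray:
        # Prefer CUDA, then Apple MPS, then fall back to the CPU.
        device = (
            "cuda" if torch.cuda.is_available()
            else "mps" if torch.backends.mps.is_available()
            else "cpu"
        )
        coeffs = torch.from_numpy(coeffs_np).to(device)  # (n, c) mask coefficients
        protos = torch.from_numpy(protos_np).to(device)  # (c, mh * mw) prototypes
        masks = coeffs @ protos  # decode all masks in one matmul
        masks.sigmoid_()         # in-place sigmoid, as noted above
        return masks.cpu().numpy()  # single device-to-host round trip at the end
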
codeflash-ai bot commented Jul 2, 2025

⚡️ Codeflash found optimizations for this PR

📄 150% (1.50x) speedup for process_mask_fast in inference/core/utils/postprocess.py

⏱️ Runtime: 16.4 milliseconds → 6.54 milliseconds (best of 126 runs)

I created a new dependent PR with the suggested changes. Please review:

If you approve, it will be merged into this PR (branch feature-faster-instance-seg).

@@ -80,6 +83,7 @@ def infer(
     disable_preproc_contrast (bool, optional): If true, the auto contrast preprocessing step is disabled for this call. Default is False.
     disable_preproc_grayscale (bool, optional): If true, the grayscale preprocessing step is disabled for this call. Default is False.
     disable_preproc_static_crop (bool, optional): If true, the static crop preprocessing step is disabled for this call. Default is False.
+    gpu_decode (bool, optional): Use GPU (cuda or mps) hardware to perform some of the mask decoding steps. (processing mode agnostic). Default is True.
Contributor:

on line 65 default is False

-    masks[masks < 0.5] = 0
-    return masks
+    masks = slice_masks(masks, down_sampled_boxes)
+    return masks, (mh, mw)
Contributor:

Please update the method signature to indicate the returned type.

Also, this is a breaking change: is it strictly necessary to pass shape? Can we not infer shape from masks?

-    masks[masks < 0.5] = 0
-    return masks
+    sliced_masks = slice_masks(masks, down_sampled_boxes)
+    return sliced_masks, (mh, mw)
Contributor:

Please update the method signature to indicate the returned type.

Also, this is a breaking change: is it strictly necessary to pass shape? Can we not infer shape from sliced_masks? (A sketch of both suggestions follows below.)
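
For illustration, the change requested in both comments could look roughly like this (the function name and a uniform (n, mh, mw) layout are assumptions):

    from typing import Tuple
    import numpy as np

    def process_mask_fast(
        masks_in: np.ndarray,
        bboxes: np.ndarray,
        shape: Tuple[int, int],
    ) -> Tuple[np.ndarray, Tuple[int, int]]:  # return type now annotated
        ...

    # If the returned masks keep a uniform (n, mh, mw) layout, callers could
    # recover the spatial shape from the array itself rather than from a
    # second return value:
    sliced_masks = np.zeros((5, 160, 160), dtype=np.float32)
    mh, mw = sliced_masks.shape[1:3]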

@@ -230,6 +241,7 @@ def process_mask_accurate(
     masks_in: np.ndarray,
     bboxes: np.ndarray,
     shape: Tuple[int, int],
+    gpu_decode: bool = False,
Contributor:

Please add this parameter to the docstring.

@@ -267,6 +280,7 @@ def process_mask_tradeoff(
     bboxes: np.ndarray,
     shape: Tuple[int, int],
     tradeoff_factor: float,
+    gpu_decode: bool = False,
Contributor:

Please add this parameter to the docstring.

# masks is a single numpy array containing multiple masks
masks = (masks * 255.0).astype(np.uint8)
for mask in masks:
    segments.append(mask2poly(mask))
Contributor:

If we are expecting either a list of np.array or a single np.array, I'd suggest checking whether the arg is an instance of list; if not, wrap it in a list, and then have the remaining logic operate on the list (see the sketch below).
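
A minimal sketch of that normalization (the enclosing function name is assumed; `mask2poly` is the existing per-mask helper in inference/core/utils/postprocess.py):

    import numpy as np

    from inference.core.utils.postprocess import mask2poly

    def masks2poly(masks):
        # Accept a single stacked np.ndarray or a list of arrays; normalize
        # to a list so the remaining logic has a single code path.
        if not isinstance(masks, list):
            masks = [masks]
        segments = []
        for mask in masks:
            mask = (mask * 255.0).astype(np.uint8)
            segments.append(mask2poly(mask))
        return segments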

"""Converts binary masks to polygonal segments.

Args:
masks (numpy.ndarray): A set of binary masks, where masks are multiplied by 255 and converted to uint8 type.
masks (numpy.ndarray or list of numpy.ndarray): A set of binary masks, where masks are multiplied by 255 and converted to uint8 type.
Contributor:

It seems we are adding other changes to this PR not related to the new parameter `gpu_decode`; can we divide this PR into smaller incremental changes?

        [[pt[0] * scale_x + bbox[0], pt[1] * scale_y + bbox[1]] for pt in poly]
        for (poly, bbox) in zip(polys, pred[:, :4])
    ]

Contributor:

If bboxes were not previously applied to polys, we are changing the behavior of this method. Can we extract this to a separate PR if this change is fixing something not related to gpu_decode?

@grzegorz-roboflow

@MadeWithStone / @roboflowmaxwell thank you for submitting this PR!

Can you fix the CLA (@roboflowmaxwell needs to accept the CLA), and can you also split this PR into smaller PRs so that changes solving different issues can be handled separately?

Please also fix formatting, and look at failing tests.

Thanks!


codeflash-ai bot commented Jul 8, 2025
