
Feature faster instance seg #1401


Open · wants to merge 14 commits into main from feature-faster-instance-seg

Conversation

MadeWithStone

Description

Uses mask slices sized to the corresponding bounding boxes and moves tensor processing to the GPU (when enabled with a flag) to speed up instance segmentation post-processing on Jetsons, yielding roughly a 2x improvement in mask post-processing time on Jetson devices.
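
For context, a minimal sketch of the slicing idea (the helper name and exact crop logic here are illustrative assumptions, not the PR's actual code):

    import numpy as np

    def slice_masks_to_boxes(masks: np.ndarray, boxes: np.ndarray) -> list:
        # Crop each full-resolution mask down to its (x1, y1, x2, y2) box so
        # that thresholding and upsampling touch far fewer pixels per instance.
        sliced = []
        for mask, (x1, y1, x2, y2) in zip(masks, boxes.astype(int)):
            sliced.append(mask[y1:y2, x1:x2])
        return sliced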

Type of change


  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How has this change been tested? Please provide a test case or an example of how you tested the change.

Tested using existing inference tests and by running workflows on a GPU dev VM and a Jetson device to measure changes in processing latency.


CLAassistant commented Jul 2, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
3 out of 4 committers have signed the CLA.

✅ MadeWithStone
✅ grzegorz-roboflow
✅ hansent
❌ roboflowmaxwell
You have signed the CLA already but the status is still pending? Let us recheck it.

MadeWithStone and others added 2 commits July 1, 2025 20:02
…e-faster-instance-seg`)

Here’s a highly optimized version of your program, based on your profiling. The key bottlenecks were:
- Expensive per-element operations (`masks[masks < 0.5] = 0`)
- Deepcopying of large arrays
- Inefficient mask slicing in Python loops  
- Repeated device transfers in `preprocess_segmentation_masks`

The rewrites address them as follows (a minimal sketch of two of these changes follows this list):
- Avoid unnecessary `deepcopy` (a `.copy()` plus in-place operations has no side effects)
- Fuse `masks[masks < 0.5] = 0` into a more efficient array assignment (either a boolean multiply or `np.where`)
- Move as much slicing and post-processing as possible into NumPy bulk ops, avoiding Python loops, especially in `slice_masks`
- Reduce data copying and movement, making operations in-place where possible
- Avoid repeated instantiation of tensors where possible (especially for constant mask shapes)
- Vectorize `scale_bboxes`
- Optionally leverage a torch-only path when `gpu_decode` is set (minimizing the `.cpu().numpy()` round trip until the end)
- Keep the functions' external behavior and output compatible
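
A standalone sketch of two of these rewrites (illustrative only; `masks` is assumed to be a float array in [0, 1], and this `scale_bboxes` is a hypothetical stand-in, not the repository's function):

    import numpy as np

    masks = np.random.rand(8, 160, 160).astype(np.float32)

    # Before: per-element boolean-mask assignment.
    # masks[masks < 0.5] = 0
    # After: a fused in-place multiply by the comparison result.
    masks *= masks >= 0.5

    # Vectorized bbox scaling: one broadcasted multiply instead of a Python loop.
    def scale_bboxes(bboxes: np.ndarray, sx: float, sy: float) -> np.ndarray:
        return bboxes * np.array([sx, sy, sx, sy], dtype=bboxes.dtype)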



**Key Remarks:**
* All dense operations are in NumPy or PyTorch and are in-place where possible.
* The expensive for loop in `slice_masks` is unavoidable when boxes have different shapes, but we now extract all slicing indices once, removing redundant work.
* `masks[masks < 0.5] = 0` is now an in-place multiply, which is an order of magnitude faster than the masked assignment.
* Avoid `deepcopy` and any redundant copying/allocations.
* Torch computations use in-place operations for further speedup (e.g., `sigmoid_`).
* No logic or results are changed.

Let me know if you want to attempt full batch slicing (sometimes possible in edge-aligned scenarios), or want separate CUDA/NumPy-only pathways for further speedup.
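
A rough sketch of what such a torch-only decode path can look like (the function name, tensor shapes, and device-selection logic are assumptions for illustration):

    import numpy as np
    import torch

    def decode_masks_torch(coeffs_np: np.ndarray, protos_np: np.ndarray) -> np.ndarray:
        # Prefer CUDA, then Apple MPS, then fall back to the CPU.
        device = (
            "cuda" if torch.cuda.is_available()
            else "mps" if torch.backends.mps.is_available()
            else "cpu"
        )
        coeffs = torch.from_numpy(coeffs_np).to(device)  # (n, c) mask coefficients
        protos = torch.from_numpy(protos_np).to(device)  # (c, mh * mw) prototypes
        masks = coeffs @ protos  # decode all masks in one matmul
        masks.sigmoid_()         # in-place sigmoid, as noted above
        return masks.cpu().numpy()  # single device-to-host round trip at the end
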
codeflash-ai bot commented Jul 2, 2025

⚡️ Codeflash found optimizations for this PR

📄 150% (1.50x) speedup for process_mask_fast in inference/core/utils/postprocess.py

⏱️ Runtime: 16.4 milliseconds → 6.54 milliseconds (best of 126 runs)

I created a new dependent PR with the suggested changes. Please review:

If you approve, it will be merged into this PR (branch feature-faster-instance-seg).

@@ -80,6 +83,7 @@ def infer(
     disable_preproc_contrast (bool, optional): If true, the auto contrast preprocessing step is disabled for this call. Default is False.
     disable_preproc_grayscale (bool, optional): If true, the grayscale preprocessing step is disabled for this call. Default is False.
     disable_preproc_static_crop (bool, optional): If true, the static crop preprocessing step is disabled for this call. Default is False.
+    gpu_decode (bool, optional): Use GPU (cuda or mps) hardware to perform some of the mask decoding steps. (processing mode agnostic). Default is True.
Contributor:

on line 65 default is False

-    masks[masks < 0.5] = 0
-    return masks
+    masks = slice_masks(masks, down_sampled_boxes)
+    return masks, (mh, mw)
Contributor:

Please update the method signature to indicate the returned type.

Also, this is a breaking change: is it strictly necessary to pass shape? Can we not infer shape from masks?

-    masks[masks < 0.5] = 0
-    return masks
+    sliced_masks = slice_masks(masks, down_sampled_boxes)
+    return sliced_masks, (mh, mw)
Contributor:

Please update the method signature to indicate the returned type.

Also, this is a breaking change: is it strictly necessary to pass shape? Can we not infer shape from sliced_masks? (A sketch of both suggestions follows below.)
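
For illustration, the change requested in both comments could look roughly like this (the function name and a uniform (n, mh, mw) layout are assumptions):

    from typing import Tuple
    import numpy as np

    def process_mask_fast(
        masks_in: np.ndarray,
        bboxes: np.ndarray,
        shape: Tuple[int, int],
    ) -> Tuple[np.ndarray, Tuple[int, int]]:  # return type now annotated
        ...

    # If the returned masks keep a uniform (n, mh, mw) layout, callers could
    # recover the spatial shape from the array itself rather than from a
    # second return value:
    sliced_masks = np.zeros((5, 160, 160), dtype=np.float32)
    mh, mw = sliced_masks.shape[1:3]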

@@ -230,6 +241,7 @@ def process_mask_accurate(
     masks_in: np.ndarray,
     bboxes: np.ndarray,
     shape: Tuple[int, int],
+    gpu_decode: bool = False,
Contributor:

Please add this parameter to the docstring.

@@ -267,6 +280,7 @@ def process_mask_tradeoff(
     bboxes: np.ndarray,
     shape: Tuple[int, int],
     tradeoff_factor: float,
+    gpu_decode: bool = False,
Contributor:

Please add this parameter to the docstring.

# masks is a single numpy array containing multiple masks
masks = (masks * 255.0).astype(np.uint8)
for mask in masks:
    segments.append(mask2poly(mask))
Contributor:

If we are expecting either a list of np.array or a single np.array, I'd suggest checking whether the arg is an instance of list; if not, wrap it in a list, and then have the remaining logic operate on the list (see the sketch below).
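
A minimal sketch of that normalization (the enclosing function name is assumed; `mask2poly` is the existing per-mask helper in inference/core/utils/postprocess.py):

    import numpy as np

    from inference.core.utils.postprocess import mask2poly

    def masks2poly(masks):
        # Accept a single stacked np.ndarray or a list of arrays; normalize
        # to a list so the remaining logic has a single code path.
        if not isinstance(masks, list):
            masks = [masks]
        segments = []
        for mask in masks:
            mask = (mask * 255.0).astype(np.uint8)
            segments.append(mask2poly(mask))
        return segments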

"""Converts binary masks to polygonal segments.

Args:
masks (numpy.ndarray): A set of binary masks, where masks are multiplied by 255 and converted to uint8 type.
masks (numpy.ndarray or list of numpy.ndarray): A set of binary masks, where masks are multiplied by 255 and converted to uint8 type.
Contributor:

It seems we are adding other changes to this PR not related to the new parameter `gpu_decode`; can we divide this PR into smaller incremental changes?

        [[pt[0] * scale_x + bbox[0], pt[1] * scale_y + bbox[1]] for pt in poly]
        for (poly, bbox) in zip(polys, pred[:, :4])
    ]

Contributor:

If bboxes were not previously applied to polys, we are changing the behavior of this method. Can we extract this to a separate PR if this change is fixing something not related to gpu_decode?

@grzegorz-roboflow

@MadeWithStone / @roboflowmaxwell thank you for submitting this PR!

Can you fix the CLA (@roboflowmaxwell needs to accept the CLA), and can you also split this PR into smaller PRs so that changes solving different issues can be handled separately?

Please also fix formatting, and look at failing tests.

Thanks!


codeflash-ai bot commented Jul 8, 2025
