Polygon tracking feature with XMem tracker #7829
Walkthrough

The recent updates integrate advanced object segmentation capabilities into the CVAT tool, enhancing its ability to handle complex video annotation tasks. This includes the addition of the XMem model for long-term video object segmentation and updates to the UI for better handling of object types during annotation. These changes aim to improve the efficiency and versatility of the annotation process.
Actionable comments posted: 2
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Hello @bsekachev, do you think this MR will be reviewed/merged soon?
I will try to review it this week.
@@ -1358,6 +1387,7 @@ export class ToolsControlComponent extends React.PureComponent<Props, State> {
<Tabs type='card' tabBarGutter={8}>
    <Tabs.TabPane key='interactors' tab='Interactors'>
        {this.renderMasksConvertingBlock()}
        {convertMasksToPolygons ? this.renderObjectTypeBlock() : null}
xmem is a tracker, not an interactor. The difference between them is exactly that interactors produce shapes on one frame, while trackers produce shapes on multiple frames (tracks).
Why are you modifying the block responsible for interactors?
const objectTypes = Object.values(ObjectType);
objectTypes.splice(objectTypes.indexOf(ObjectType.TAG), 1);
If you mean that this code supports only shape and track, you should write exactly that. Otherwise, when we add one more object type, this part will silently break and nobody will know about it.
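An explicit allowlist states the supported types directly, so a future enum addition cannot silently flow into this block. A minimal sketch (the ObjectType enum below is a stand-in for CVAT's actual enum, with assumed values):

```typescript
// Stand-in for CVAT's ObjectType enum (values assumed for illustration).
enum ObjectType {
    SHAPE = 'shape',
    TRACK = 'track',
    TAG = 'tag',
}

// List the supported types explicitly instead of deriving them by
// removing TAG from Object.values(ObjectType): a newly added enum
// member would otherwise appear here automatically and untested.
const objectTypes: ObjectType[] = [ObjectType.SHAPE, ObjectType.TRACK];
```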
- kind: WORKDIR
  value: /opt/nuclio
- kind: RUN
  value: git clone --branch main https://github.com/omerferhatt/XMem xmem
Why not use the original repository?
Also, it is better to use a specific tag. The main branch may change, and a new version may not be compatible with old dependencies, for example.
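For example, the clone could be pinned to a release tag (the tag name below is a placeholder, not an actual release of that repository):

```yaml
- kind: RUN
  value: git clone --depth 1 --branch <release-tag> https://github.com/omerferhatt/XMem xmem
```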
  value: |-
    pip install torch torchvision
- kind: RUN
  value: wget 'https://www.dropbox.com/scl/fi/5m1l747p15qzgq023e0q9/xmem.pth?rlkey=ss2kjaq4qlvvk5juucyvtmrh8&dl=0' -O '/xmem.pth'
I do not think we should put the file in the root directory /. Better to put it in the current working directory.
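For instance, since WORKDIR is set to /opt/nuclio earlier in the config, the checkpoint could be stored there instead (a sketch; the weights URL is elided):

```yaml
- kind: RUN
  value: wget '<weights-url>' -O xmem.pth  # resolves relative to WORKDIR (/opt/nuclio)
```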
I do not think Dropbox is reliable enough. It is also not clear who owns the weights, or under what license they are distributed.
Is there an original source?
workerAvailabilityTimeoutMilliseconds: 10000
attributes:
  # Set value from the calculation of tracking 100 objects at the same time on a 4K image
  maxRequestBodySize: 1073741824 # 1GB
Is there a reason to send a 1 GB body within an HTTP request?
  value: pip install opencv-python-headless jsonpickle
- kind: RUN
  value: |-
    pip install torch torchvision
Usually it is a good idea to pin dependency versions. Otherwise it may turn out later that the image cannot be built anymore.
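A sketch of pinned installs (the version numbers below are illustrative, not verified against this function's code):

```yaml
- kind: RUN
  value: |-
    pip install torch==2.1.2 torchvision==0.16.2
```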
Hello @omerferhatt, do you have any plans regarding the pull request?
I will close the pull request now.
I'm going to try to finish this week, thanks.
@omerferhatt @bsekachev However, I notice that it takes a while to run inference on each frame (in fact you can see it in the demo video omerferhatt posted above). Does anyone know the source of this slowdown? When I run my own XMem inference script (adapted from the XMem author's example in their README), I can rip through inference on video at around 10 frames per second. With CVAT running on that same hardware, it takes several seconds to step through XMem tracking on a single frame. Is there something inherent in running XMem inference as a serverless function that causes the observed slowdown? And, if so, is there a way to speed it up? Otherwise, it will be very tedious to step through long videos frame by frame to track the objects. And this would be true of any model, not just XMem as used in this feature branch.
Hi @jtziebarth,
@omerferhatt thanks for such a quick reply. Can you clarify what you mean by "request time when importing and exporting states"? Where in the code are you referring to? If you can point me in the right direction I can try to take a look at it. Thanks.
Motivation and context
Polygon tracking is a crucial part of computer-vision annotation tasks. It enables the required consecutive-frame annotation, shortens the required annotation time, creates more precise annotations, and moreover simplifies all HITL (human-in-the-loop) processes. Since CVAT doesn't support this feature, adding this functionality with its side benefits will be good for the sake of this project.
It aims to solve the below problems:
- Polygon tracking support for interactors (which allows using SAM with any kind of polygon tracking algorithm)

According to the previous issues, it:
How it looks
initial_demo_video.webm
Resource usage
Known issues and further optimization cases:
- If maxWorker is more than 1 in function.yaml, the entire tracker works again in a duplicate manner. Since the problem is not solved, it is set to 1 for now.

How has this been tested?
Tested and developed according to CVAT documentation
Checklist
- I submit my changes into the develop branch
- I have increased versions of npm packages if necessary (cvat-canvas, cvat-core, cvat-data and cvat-ui)
License
- I submit my code changes under the same MIT License that covers the project. Feel free to contact the maintainers if that's a concern.
Summary by CodeRabbit
- New Features
- Enhancements
- Documentation
- Performance