TransT tracker integration #4886


Closed
wants to merge 16 commits into from

Conversation

dschoerk
Contributor

PR to integrate the single object tracker TransT as an AI tool into CVAT.

Also see: https://github.com/cvat-ai/cvat-opencv/issues/14

@dschoerk dschoerk marked this pull request as ready for review August 31, 2022 07:18
@@ -756,7 +756,7 @@ export class ToolsControlComponent extends React.PureComponent<Props, State> {
                });
                // eslint-disable-next-line no-await-in-loop
                const response = await core.lambda.call(jobInstance.taskId, tracker, {
-                   frame: frame,
+                   frame: frame ,
Contributor Author

To avoid the linter issue.

@sizov-kirill
Contributor

Resolved #4768

@tangy5 commented Sep 17, 2022

Hi @dschoerk, thanks for this really cool integration. This is very inspiring!
Can I ask how you wrote the code for annotating video (MP4) in CVAT? I'm trying to implement a different task with endoscopy videos. I saw in your video that CVAT processes one frame at a time until all frames are tracked/inferred.

How are the images fed in and transformed? In the meantime, how do the requests keep posting to the nuclio server until all frames (images) are done?

Thank you in advance, and it would be super appreciated if there is sample code for this video integration! Thanks!

@dschoerk
Contributor Author

> Can I ask how you wrote the code for annotating video (MP4) in CVAT? [...] How do the requests keep posting to the nuclio server until all frames (images) are done?

At this time CVAT only supports tracking one step at a time. A bounding box (seed) is drawn on the initial frame, and each time you press the "f" key to step to the next frame, the objects are tracked. AFAIK there is no functionality at the moment to track multiple frames at once. In the mentioned video I just keep pressing the "f" key. I hope this answers your question.

@tangy5 commented Sep 17, 2022

> At this time CVAT only supports tracking one step at a time. [...] AFAIK there is no functionality at the moment to track multiple frames at once.

Thank you, this is very helpful and helped me understand CVAT much better. One step further: do you think there could be functionality to automate the "next step" and "prediction" for each frame without pressing "f", until all frames are done?
Thank you again for the reply! Very appreciated. CVAT is a great tool.


@dschoerk
Contributor Author

> Do you think there could be functionality to automate the "next step" and "prediction" for each frame without pressing "f", until all frames are done?

To simplify things a lot: AI tools are integrated as serverless functions, i.e. they are called via a REST interface on the nuclio platform, like a web service. An image is sent from CVAT to the service, and it responds with the tracked location and the state of the tracker. Within this PR I have implemented such a service. From an implementation perspective this is great because of its simplicity, but performance when tracking over multiple frames is not amazing with this approach.

Implementing what you're looking for is not trivial. The simplest approach I can imagine is to call the tracking service repeatedly until the required number of frames is tracked. BUT this is not very performant: each frame is sent in a separate HTTP request, so tracking n frames requires n requests. A better solution would be a service that can track multiple frames, keeps its state internally, and doesn't require the images to be sent in the request, but rather accesses them from a Docker mount. All of that would reduce flexibility but increase performance.
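
To make the described request/response cycle concrete, here is a minimal sketch of such a nuclio handler. The JSON keys ("image", "shape", "state") and the tracker wrapper API are illustrative assumptions, not the exact code of this PR:

```python
import base64, io, json

import numpy as np
from PIL import Image

def handler(context, event):
    # CVAT posts one frame plus the previous tracker state per request
    data = json.loads(event.body)
    buf = io.BytesIO(base64.b64decode(data["image"]))
    image = np.array(Image.open(buf).convert('RGB'))[:, :, ::-1]  # RGB -> BGR

    tracker = context.user_data.tracker   # loaded once in init_context
    shape = data.get("shape")             # seed / previous bounding box
    state = data.get("state")             # opaque state from the last call
    new_shape, new_state = tracker.track(image, shape, state)  # hypothetical wrapper API

    # the new box and the serialized tracker state go back to CVAT,
    # which sends the state along with the next frame
    return context.Response(body=json.dumps({"shape": new_shape, "state": new_state}),
                            content_type='application/json', status_code=200)
```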

@tangy5 commented Sep 17, 2022

> The simplest approach I can imagine is to call the tracking service repeatedly until the required number of frames is tracked. [...]

Thank you for the reply. My AI function also predicts one frame at a time (batch size = 1). I can imagine the simplest approach is to repeatedly send HTTP requests, one after another, automatically until all MP4 frames are sent. Thank you!
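
For reference, that "one HTTP request per frame" loop could look roughly like the sketch below; the endpoint URL and JSON keys are assumptions matching the handler sketch above, and OpenCV reads the MP4:

```python
import base64

import cv2
import requests

URL = "http://localhost:8070"     # hypothetical nuclio function endpoint
shape = [100, 100, 200, 200]      # seed bounding box drawn on the first frame
state = None                      # no tracker state before the first call

cap = cv2.VideoCapture("video.mp4")
while True:
    ok, frame = cap.read()
    if not ok:                    # stop when the video is exhausted
        break
    _, jpg = cv2.imencode(".jpg", frame)
    payload = {"image": base64.b64encode(jpg.tobytes()).decode(),
               "shape": shape, "state": state}
    resp = requests.post(URL, json=payload).json()
    shape, state = resp["shape"], resp["state"]  # feed into the next request
cap.release()
```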

@bsekachev bsekachev self-assigned this Oct 10, 2022
@bsekachev bsekachev requested review from yasakova-anastasia and removed request for bsekachev October 21, 2022 13:13
@yasakova-anastasia (Contributor) left a comment


@dschoerk,
Thanks for your contribution! I tested it and it works great, but I have some small comments. Could you fix them, please?

Comment on lines +48 to +51

def log(msg):
    #with open("/log.log", "a") as logf:
    #    logf.write(msg+'\n')
    pass

I think this function can be removed.

@@ -0,0 +1,145 @@
import json

Please add a license to the beginning of the file.

# Copyright (C) 2022 CVAT.ai Corporation
#
# SPDX-License-Identifier: MIT

Comment on lines +138 to +144

except Exception as e: # cavemen debugging
    logf = open("/error.log", "w")
    logf.write(str(e))
    logf.write(traceback.format_exc())

    return context.Response(headers={},
                            content_type='application/json', status_code=666)

I think this should be removed.
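
For illustration, a leaner replacement for the quoted block might log through nuclio's built-in logger and return a standard HTTP status code; this is an assumption about the intended fix, not code from this PR:

```python
except Exception as e:
    # log through nuclio instead of writing to a file in the container root
    context.logger.error(traceback.format_exc())
    return context.Response(body=json.dumps({"error": str(e)}),
                            content_type='application/json', status_code=500)
```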

    # logf.write(msg+'\n')
    pass

def encode_state(model):

Could you please separate these functions into a separate ModelHandler class (as is done for other serverless functions)?
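
Roughly, the requested structure could look like the sketch below; all names and method bodies are placeholders following the pattern of CVAT's other serverless functions, where such a class lives in its own module and main.py only parses the request:

```python
class ModelHandler:
    def __init__(self, model):
        # the TransT network, built once in init_context and reused per call
        self.model = model

    def encode_state(self):
        # serialize the tracker's internal state for the JSON response
        # (placeholder; the PR's encode_state logic would move here)
        ...

    def decode_state(self, state):
        # restore the tracker's internal state from the previous response
        ...

    def infer(self, image, shape, state):
        if state is None:
            self.model.initialize(image, shape)  # hypothetical seeding call
        else:
            self.decode_state(state)
        new_shape = self.model.track(image)      # hypothetical tracking call
        return new_shape, self.encode_state()
```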

image = Image.open(buf).convert('RGB')
image = np.array(image)[:, :, ::-1].copy()

#cv2.imwrite('/test.jpg', image)

Please remove all useless comments.

@yasakova-anastasia
Contributor

I can't push to this thread. The requested URL returned error: 403. Opened another PR.

@AIWithShrey

> At this time CVAT only supports tracking one step at a time. [...] AFAIK there is no functionality at the moment to track multiple frames at once.

How would you suggest I go about using TransT with multiple frames? Do I have to take the Docker route? I have about 1800 frames (30 fps, 1-minute-long videos). Also, I can't find TransT on Hugging Face to use it on my own dataset without having to use CVAT. Sure, CVAT makes my life easier for annotation, but it takes extremely long to annotate 1800 frames separately.

What would be the most feasible solution to my problem?
