TransT tracker integration #4886


Closed
wants to merge 16 commits into from

Conversation

dschoerk
Contributor

PR to integrate the single object tracker TransT as an AI tool into CVAT.

Also see: https://github.com/cvat-ai/cvat-opencv/issues/14

@dschoerk dschoerk marked this pull request as ready for review August 31, 2022 07:18
@@ -756,7 +756,7 @@ export class ToolsControlComponent extends React.PureComponent<Props, State> {
                });
                // eslint-disable-next-line no-await-in-loop
                const response = await core.lambda.call(jobInstance.taskId, tracker, {
-                   frame: frame,
+                   frame: frame ,
Contributor Author

To avoid the linter issue.

@sizov-kirill
Contributor

Resolved #4768

@tangy5 commented Sep 17, 2022

Hi @dschoerk, thanks for this really cool integration. This is very inspiring!
Can I ask how you wrote the code for annotating video (MP4) in CVAT? I'm trying to implement a different task with endoscopy videos. I saw in your video that CVAT processes one frame at a time until all frames are tracked/inferred.

How are the images fed in and transformed? In the meantime, how do the requests keep posting to the nuclio server until all frames (images) are done?

Thank you in advance, and it would be super appreciated if there is sample code for this video integration! Thanks!

@dschoerk
Contributor Author

> Can I ask how you wrote the code for annotating video (MP4) in CVAT? [...] How do the requests keep posting to the nuclio server until all frames (images) are done?

At this time CVAT only supports tracking one step at a time. A bounding box (seed) is drawn on the initial frame, and each time you press the "f" key to step to the next frame, the objects are tracked. AFAIK there is no functionality at the moment to track multiple frames at once. In the mentioned video I just keep pressing the "f" key. I hope this answers your question.

@tangy5 commented Sep 17, 2022

> At this time CVAT only supports tracking one step at a time. [...] AFAIK there is no functionality at the moment to track multiple frames at once.

Thank you, this is very helpful and helped me understand CVAT much better. One step further: do you think there could be functionality to automate the "next step" and "prediction" for each frame without pressing "f", until all frames are done?
Thank you again for the reply! Very appreciated. CVAT is a great tool.


@dschoerk
Contributor Author

> Do you think there could be functionality to automate the "next step" and "prediction" for each frame without pressing "f", until all frames are done?

To simplify things a lot: AI tools are integrated as serverless functions, i.e. they are called via a REST interface on the nuclio platform, like a web service. An image is sent from CVAT to the service, and it responds with the tracked location and the state of the tracker. Within this PR I have implemented such a service. From an implementation perspective this is great because of its simplicity, but performance when tracking over multiple frames is not amazing with this approach.

Implementing what you're looking for is not trivial. The simplest approach I can imagine is to call the tracking service repeatedly until the required number of frames is tracked. BUT this is not very performant: each frame is sent in a separate HTTP request, so tracking n frames requires n requests. A better solution would be a service that can track multiple frames, keeps its state internally, and doesn't require the images to be sent in the request, but rather accesses them from a Docker mount. All of that would reduce flexibility but increase performance.
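
To make the described request/response cycle concrete, here is a minimal sketch of such a nuclio handler. The JSON keys ("image", "shape", "state") and the tracker wrapper API are illustrative assumptions, not the exact code of this PR:

```python
import base64, io, json

import numpy as np
from PIL import Image

def handler(context, event):
    # CVAT posts one frame plus the previous tracker state per request
    data = json.loads(event.body)
    buf = io.BytesIO(base64.b64decode(data["image"]))
    image = np.array(Image.open(buf).convert('RGB'))[:, :, ::-1]  # RGB -> BGR

    tracker = context.user_data.tracker   # loaded once in init_context
    shape = data.get("shape")             # seed / previous bounding box
    state = data.get("state")             # opaque state from the last call
    new_shape, new_state = tracker.track(image, shape, state)  # hypothetical wrapper API

    # the new box and the serialized tracker state go back to CVAT,
    # which sends the state along with the next frame
    return context.Response(body=json.dumps({"shape": new_shape, "state": new_state}),
                            content_type='application/json', status_code=200)
```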

@tangy5 commented Sep 17, 2022

> The simplest approach I can imagine is to call the tracking service repeatedly until the required number of frames is tracked. [...]

Thank you for the reply. My AI function also predicts one frame at a time (batch size = 1). I can imagine the simplest approach is to repeatedly send HTTP requests, one after another, automatically until all MP4 frames are sent. Thank you!
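
For reference, that "one HTTP request per frame" loop could look roughly like the sketch below; the endpoint URL and JSON keys are assumptions matching the handler sketch above, and OpenCV reads the MP4:

```python
import base64

import cv2
import requests

URL = "http://localhost:8070"     # hypothetical nuclio function endpoint
shape = [100, 100, 200, 200]      # seed bounding box drawn on the first frame
state = None                      # no tracker state before the first call

cap = cv2.VideoCapture("video.mp4")
while True:
    ok, frame = cap.read()
    if not ok:                    # stop when the video is exhausted
        break
    _, jpg = cv2.imencode(".jpg", frame)
    payload = {"image": base64.b64encode(jpg.tobytes()).decode(),
               "shape": shape, "state": state}
    resp = requests.post(URL, json=payload).json()
    shape, state = resp["shape"], resp["state"]  # feed into the next request
cap.release()
```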

@bsekachev bsekachev self-assigned this Oct 10, 2022
@bsekachev bsekachev requested review from yasakova-anastasia and removed request for bsekachev October 21, 2022 13:13
@yasakova-anastasia (Contributor) left a comment


@dschoerk,
Thanks for your contribution! I tested it and it works great, but I have some small comments. Could you fix them, please?

Comment on lines +48 to +51

def log(msg):
    #with open("/log.log", "a") as logf:
    #    logf.write(msg+'\n')
    pass

I think this function can be removed.

@@ -0,0 +1,145 @@
import json

Please add a license to the beginning of the file.

# Copyright (C) 2022 CVAT.ai Corporation
#
# SPDX-License-Identifier: MIT

Comment on lines +138 to +144

except Exception as e: # cavemen debugging
    logf = open("/error.log", "w")
    logf.write(str(e))
    logf.write(traceback.format_exc())

    return context.Response(headers={},
                            content_type='application/json', status_code=666)

I think this should be removed.
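
For illustration, a leaner replacement for the quoted block might log through nuclio's built-in logger and return a standard HTTP status code; this is an assumption about the intended fix, not code from this PR:

```python
except Exception as e:
    # log through nuclio instead of writing to a file in the container root
    context.logger.error(traceback.format_exc())
    return context.Response(body=json.dumps({"error": str(e)}),
                            content_type='application/json', status_code=500)
```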

    # logf.write(msg+'\n')
    pass

def encode_state(model):

Could you please separate these functions into a separate ModelHandler class (as is done for other serverless functions)?
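
Roughly, the requested structure could look like the sketch below; all names and method bodies are placeholders following the pattern of CVAT's other serverless functions, where such a class lives in its own module and main.py only parses the request:

```python
class ModelHandler:
    def __init__(self, model):
        # the TransT network, built once in init_context and reused per call
        self.model = model

    def encode_state(self):
        # serialize the tracker's internal state for the JSON response
        # (placeholder; the PR's encode_state logic would move here)
        ...

    def decode_state(self, state):
        # restore the tracker's internal state from the previous response
        ...

    def infer(self, image, shape, state):
        if state is None:
            self.model.initialize(image, shape)  # hypothetical seeding call
        else:
            self.decode_state(state)
        new_shape = self.model.track(image)      # hypothetical tracking call
        return new_shape, self.encode_state()
```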

image = Image.open(buf).convert('RGB')
image = np.array(image)[:, :, ::-1].copy()

#cv2.imwrite('/test.jpg', image)

Please remove all useless comments.

@yasakova-anastasia
Contributor

I can't push to this thread. The requested URL returned error: 403. Opened another PR.

@AIWithShrey

> At this time CVAT only supports tracking one step at a time. [...] AFAIK there is no functionality at the moment to track multiple frames at once.

How would you suggest I go about using TransT with multiple frames? Do I have to take the Docker route? I have about 1800 frames (30 fps, 1-minute-long videos). Also, I can't find TransT on Hugging Face to use it on my own dataset without having to use CVAT. Sure, CVAT makes my life easier for annotation, but it takes extremely long to annotate 1800 frames separately.

What would be the most feasible solution to my problem?
