[joss] In the statement of need, how does it compare with OSS annotation tools? #7

Closed
kinow opened this issue Feb 9, 2023 · 18 comments
@kinow
Contributor

kinow commented Feb 9, 2023

Hi,

Part of openjournals/joss-reviews#5135. I see you mentioned commercial tools in the statement of need of the JOSS paper. The first item in your list of trade-offs is the cost. However, the statement of need seems to ignore the existence of other OSS tools that could be compared to LaMa.

Could you consider adding other OSS tools, please? For example:

Cheers,
-Bruno

@muctadir
Owner

muctadir commented Mar 1, 2023

Dear @kinow
Thank you very much for your comments. As we explained in the statement of need, LaMa was developed to aid with the thematic analysis process, which is a method for qualitative analysis. Although many of the tools you mentioned are about text annotation, which is a core part of thematic analysis, many of them are ML-based (e.g., https://github.com/dataqa/nlp-labelling, https://github.com/BrikerMan/Kashgari). The use case of LaMa is mostly about manual labeling. Some of the tools you mentioned are about generating/annotating datasets (https://github.com/argilla-io/argilla, https://github.com/RTIInternational/SMART) to be used for different ML algorithms, which is a very different use case from the one LaMa tries to address. The commercial tools that we mentioned in the paper are extremely popular and are developed for qualitative analysis, and that is why we mentioned them. At the same time, we did not want this paper to turn into a tool comparison paper, and therefore we left out other text annotation tools that we investigated. Furthermore, the paper's word limit led us to present only the most important information about the tool itself. Moreover, in the paper, we mentioned 3 points in the statement of need. One of them is indeed cost. However, all three points contributed equally to our motivation for developing LaMa.
I hope this explains our motivation for the content of the statement of need.

@muctadir muctadir closed this as completed Mar 1, 2023
@kinow
Contributor Author

kinow commented Mar 1, 2023

Hi @muctadir !

What about https://web.hypothes.is/? This is one that I have seen added to Open Source tools to annotate text, and also used in previous companies where I worked. I think that's one of the most popular tools used to annotate text, and appears to have many overlapping features with Lama.

Thank you!
-Bruno

@muctadir
Owner

muctadir commented Mar 2, 2023

Dear @kinow

I am looking into the tool that you mentioned and trying to find out the feature set that it provides. For now, I understand that it can annotate texts from various sources and share these annotations across multiple users. I am now trying to find out what other features it provides. Is there a documentation page that you know of? Furthermore, I have a feeling that this tool is about sharing knowledge through something they call "social annotation". If that is the case, I think the use case is still different.
Please, let me know what you think about my observation and if I am missing something.

@muctadir muctadir reopened this Mar 2, 2023
@kinow
Contributor Author

kinow commented Mar 5, 2023

Hi @muctadir,

I think hypothesis could do with a simpler web page that lets users find out more about its features with fewer clicks. I had a look and found these resources that should be useful, I hope, for you to see how to get started with it:

I think hypothesis has similar features, some interesting features that could be useful in lama (moderation, browser extension), but also it could lack features that are important in lama (like conflict resolution). This paper, for example, mentions brat and hypothesis, and explains why KAT is still important.

I think the lama paper is doing a great job explaining that there are commercial tools but the complex collaboration is simpler in lama. However, after reading the paper I am still left with the question whether there are Open Source tools that could be used instead (especially important for lama's paper, IMHO, as it's being published in the JOSS).

Hypothesis should check the boxes for Cost and Data access and privacy, but maybe the collaboration workflow doesn't match your use case? Or maybe there are other Open Source annotations tools that have the complex collaboration, but lack the data access and privacy, or tools that have everything that lama does, but are not maintained, etc. I think a short paragraph about it would be enough for the lama paper.

Cheers
Bruno

@muctadir
Owner

muctadir commented Mar 6, 2023

However, after reading the paper I am still left with the question whether there are Open Source tools that could be used instead

I think a short paragraph about it would be enough for the lama paper.

I agree that this is missing. I will add a paragraph based on what we investigated initially before developing LaMa.

muctadir added a commit that referenced this issue Mar 16, 2023
@muctadir
Owner

In light of the current word count in the paper, I have added a sentence to refer to https://labelstud.io/ which we investigated prior to developing LaMa.

muctadir added a commit that referenced this issue Mar 16, 2023
* Additional detail on prior versions #20

* add opensource alternative to resolve #7
@kinow
Contributor Author

kinow commented Mar 17, 2023

I wasn't aware of Label Studio. Thanks for mentioning it and updating the paper. Looking at this commit, ac348f9, the text below lists the cons of the solutions (including Label Studio). The first being "Cost: As these are commercial tools", which is not correct for Label Studio? They appear to have a commercial SaaS version, but the code is open source (like LaMa's code, also using a permissive license - ALv2).

The second point is about data access and privacy. Label Studio also has a page about security (https://labelstud.io/guide/security.html), but I think in a wider sense, including database access. In their documentation you can also find more about granting permissions to different users (https://labelstud.io/guide/signup.html#Invite-collaborators-to-a-project). So I think they also provide data access and privacy, and I guess it could be well tested since they are in a commercial operation.

Label Studio also seems to offer extra features like image labelling, ML assisted labelling, and other features related to the third point in the list in the paper, about complex collaboration workflow, e.g.

I have not dug into their issues & code, nor signed up for their demo, or tried running it locally. Before doing that, could you elaborate more on how it was compared to LaMa, and how your team determined that Label Studio was not sufficient? Moreover, given that Cost is one of the three items raised as the motivation for LaMa, I think a single Open Source tool is not enough to drive the need for a new tool. It would be better to expand on that in the paper too.

We can also ping the editor to have another opinion here, @fboehm, as well as the other reviewer @luxaritas

muctadir added a commit that referenced this issue Mar 20, 2023
@muctadir
Owner

The first being "Cost: As these are commercial tools", which is not correct for Label Studio?

Somehow I missed fixing this text. I just updated the paper with the correct text here.

To answer the remainder of the comment, I would like to refer to one of my previous comments and quote parts of that reply:

LaMa was developed to aid with the thematic analysis process which is a method for qualitative analysis. Although many of the tools you mentioned are about text annotation, which is a core part of thematic analysis, many of them are ML based.

The use-case of LaMa is mostly about manual labeling. Some of the tools you mentioned are about generating/annotating dataset (https://github.com/argilla-io/argilla, https://github.com/RTIInternational/SMART) to be used for different ML algorithms, which is a very different use-case compared to what LaMa tries to solve.

In light of these two comments I made earlier, you can already see how Label Studio has a different use case, which is about annotating data. You also mentioned ML-assisted labeling, which is not what we wanted for LaMa.

Moreover, in the paper, we mentioned 3 points in the statement of need. One of them is indeed cost. However, all the three points equally contributed to our motivation for developing LaMa.

You focused on cost in your comment and, as I mentioned earlier, all 3 points are equally important. You are indeed, to some extent, correct about the first two points. However, collaboration is also a key motivation, which includes features such as collaborative labeling and conflict detection and resolution. To the best of my knowledge, Label Studio does not have such features.

We can ping also the editor to have another opinion here, @fboehm, as well the other reviewer @luxaritas

I think this might be a good idea.

@luxaritas

I haven't spent a ton of time on this, but after looking a little at hypothes.is and label studio, while they're powerful annotation tools, it does not appear to me that they're well suited for thematic analysis, at least in the context of the intended workflow of LaMa. Those tools are all about "pick out a portion of the data that contains some signal" or "classify this piece of data in some existing categories". LaMa however is focused on "we have these pieces of data, and we want to come up with a taxonomy that describes them, coming to a consensus on this taxonomy with other individuals performing the coding". It's a distinctly different type of "annotation" from my understanding of the process.

So, my position here would be that not only may those tools have some deficiencies in the three primary points listed in the paper, they are also likely not suited for the task in general, so there is still an unmet need here.

@kinow
Contributor Author

kinow commented Mar 27, 2023

Thank you @luxaritas !

it does not appear to me that they're well suited for thematic analysis, at least in the context of the intended workflow of LaMa.

I think you are right that those tools have a different target audience, with similar features but still not identical to LaMa.

So, my position here would be that not only may those tools have some deficiencies in the three primary points listed in the paper,

The three primary points being cost/data access and privacy/complex collaboration workflow, I don't believe either Label Studio or hypothes.is fails at the first two. In fact, one could claim that having a commercial software, label studio could have better privacy and data access for having the source code open and having a commercial service attackers could exploit.

(Digressing a bit from the main discussion, but "With commercial tools, control over the access of the research data of the storage are often unavailable" might depend on the research area, and nowadays many commercial tools are also open source. One tool I worked with recently, Arvados, is open source with commercial support, and the data access/storage location/privacy & security are documented/provided, and certified by HIPAA. But I don't think we need to modify that 👍)

they are also likely not suited for the task in general, so there is still an unmet need here.

I think what other tools lack is the last item, the "complex collaboration workflow", but my first point here was that previously the text had no other Open Source tools being compared, which would still require at least a sentence saying that there are no Open Source tools for doing thematic labelling as LaMa does.

I believe the paper has been updated to address that cost, but IMHO it would be key to express exactly what you said above. That there are other commercial and open source tools that perform similar tasks, but they lack in handling the complexity of certain annotation workflows, or lack support to controlled ontologies/vocabularies/domains for annotations & labelling, or lack in collaborative data curation, or do not handle thematic analysis, etc. (that, without making the text very long).

@luxaritas

Yeah, I think that makes sense.

@fboehm

fboehm commented Mar 29, 2023

hi, @luxaritas @kinow and @muctadir - Thanks for the thoughtful discussion here. I especially appreciate the comment from @kinow:

I believe the paper has been updated to address that cost, but IMHO it would be key to express exactly what you said above. That there are other commercial and open source tools that perform similar tasks, but they lack in handling the complexity of certain annotation workflows, or lack support to controlled ontologies/vocabularies/domains for annotations & labelling, or lack in collaborative data curation, or do not handle thematic analysis, etc. (that, without making the text very long).

Do you all feel that the current version of the manuscript satisfies this request? Thanks again!

@muctadir
Owner

muctadir commented Apr 4, 2023

Hi @luxaritas @kinow @fboehm
Thanks for your comments. I think it indeed makes sense to be explicit about the use case. I have now updated the paper to include an additional point in the statement of need to address this. To answer @fboehm, I believe the current version of the paper satisfies the request.

@kinow
Contributor Author

kinow commented Apr 5, 2023

@muctadir I just had a look at the Markdown source and it's looking better! I was trying to preview the PDF, but I think the bot is not updating it. I'll comment in the other issue, preview the PDF, and update this issue & the checklist after that if it's looking OK (from looking at the PDF it was looking fine to me). Cheers

@muctadir
Owner

muctadir commented Apr 5, 2023

@kinow Thanks already. I was able to get the latest paper from https://github.com/muctadir/lama/actions/runs/4604583613. Is it not accessible for you?

@kinow
Contributor Author

kinow commented Apr 5, 2023

@muctadir I thought it would be re-generated by the bot in the pull request. The latest message in the review PR is from Feb 8 (openjournals/joss-reviews#5135 (comment)), but I can't recall if that's how it worked in the past for JOSS reviews, or if I am confusing with another pull request somewhere... will wait for @fboehm 's reply. Thanks!

@fboehm

fboehm commented Apr 5, 2023

@editorialbot generate pdf

@fboehm

fboehm commented Apr 5, 2023

oops. Sorry about that. I intended to comment in the review thread
