Add an experimental implementation of scitype for text analysis #154

Closed
ablaom opened this issue Aug 27, 2021 · 4 comments

Comments

@ablaom
Member

ablaom commented Aug 27, 2021

Following on from this discussion I propose we add an implementation of scitype that we mark as experimental (changeable without a breaking release).

Initially I thought of implementing something using the TokenDocument from TextAnalysis.jl but I rather think a nicer class of objects is the Document{T} type from CorpusLoaders, which has the added benefit of being much more light-weight than TextAnalysis.jl. It also defines a TaggedWord abstract type, together with a bunch of useful concrete subtypes which are used in all the corpora you can load from that package.

@pazzo83 I wonder what you think of having your transformer take in data as some form of CorpusLoaders.Document{T}. This type is defined here. See also this comment of @oxinabox. I don't think such documents (with suitably restricted T) would be hard to convert to TokenDocument, if that was convenient for you to do internally.

An implementation of scitype along these lines is drafted here. This is from the tests:

    tagged_word = CorpusLoaders.PosTaggedWord("NN", "wheelbarrow")
    tagged_word2 = CorpusLoaders.PosTaggedWord("NN", "soil")
    @test scitype(tagged_word) == Annotated{Textual}
    bag_of_words = Dict("cat"=>1, "dog"=>3)
    @test scitype(bag_of_words) == Multiset{Textual}
    bag_of_tagged_words = Dict(tagged_word => 5)
    @test scitype(bag_of_tagged_words) == Multiset{Annotated{Textual}}
    @test scitype(Document("kadsfkj")) == Unknown
    @test scitype(Document([tagged_word, tagged_word2])) ==
        Annotated{AbstractVector{Annotated{Textual}}}
    nested_tokens = [["dog", "cat"], ["bird", "cat"]]
    @test scitype(Document(nested_tokens)) ==
                  Annotated{AbstractVector{AbstractVector{Textual}}}
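
To make the nesting in these tests easier to follow, here is a minimal, self-contained sketch of how such a `scitype` could dispatch. It is not the drafted implementation: `Textual`, `Unknown`, `Annotated`, `Multiset`, `PosTaggedWord` and `Document` are all local stand-ins defined in the snippet (no packages are loaded), and the container methods assume non-empty, homogeneous collections.

```julia
# Local stand-ins for the scientific types (assumptions, not the real package types).
abstract type Textual end
abstract type Unknown end
abstract type Annotated{S} end
abstract type Multiset{S} end

# Hypothetical stand-ins for CorpusLoaders.PosTaggedWord and CorpusLoaders.Document.
struct PosTaggedWord
    tag::String
    word::String
end
struct Document{T}
    content::T
end

# Fallback: anything not recognized is Unknown.
scitype(::Any) = Unknown
scitype(::AbstractString) = Textual
scitype(::PosTaggedWord) = Annotated{Textual}

# Simplification: take the scitype of the first element or key only.
scitype(v::AbstractVector) = AbstractVector{scitype(first(v))}
scitype(d::AbstractDict{<:Any,<:Integer}) = Multiset{scitype(first(keys(d)))}

# A document wrapping a vector is "annotated" content; other documents fall back to Unknown.
scitype(d::Document{<:AbstractVector}) = Annotated{scitype(d.content)}

tagged_word = PosTaggedWord("NN", "wheelbarrow")
tagged_word2 = PosTaggedWord("NN", "soil")
@assert scitype(Dict("cat" => 1, "dog" => 3)) == Multiset{Textual}
@assert scitype(Dict(tagged_word => 5)) == Multiset{Annotated{Textual}}
@assert scitype(Document("kadsfkj")) == Unknown
@assert scitype(Document([tagged_word, tagged_word2])) ==
    Annotated{AbstractVector{Annotated{Textual}}}
@assert scitype(Document([["dog", "cat"], ["bird", "cat"]])) ==
    Annotated{AbstractVector{AbstractVector{Textual}}}
```

The outer `Annotated` comes from the `Document` wrapper, while the inner `Annotated` (when present) comes from the tagged words.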

@storopoli

@pazzo83

pazzo83 commented Aug 27, 2021

I really like this! I'll update the TFIDF transformer to adopt these types.

@pazzo83

pazzo83 commented Aug 31, 2021

Is the idea for an entire corpus of documents to be represented as one Document object that consists of a vector of Strings? For the TfidfTransformer, the ideal input would basically be a vector of strings or a vector of some Document type.

@ablaom
Member Author

ablaom commented Aug 31, 2021

No, I wouldn't have thought so. Copied from my slack response:

Sounds like you’re thinking about having your transformer process untokenized data. But perhaps tokenization should be left to a separate transformer? There seem to be a few of these about. In that case your allowed input is a vector of tokenized documents; that is, a “document” is something with machine type CorpusLoaders.Document{<:AbstractVector{<:AbstractString}} (ignoring the possibility of tagged words here, for simplicity) and your transformer is sucking in a vector of these (a corpus). You would then articulate this type requirement by declaring input_scitype(::Type{<:YourTransformer}) = AbstractVector{<:Annotated{AbstractVector{Textual}}}. Yes?
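
As a rough illustration of the declaration suggested above, the sketch below checks which scitypes such an `input_scitype` would admit. It is self-contained: `Textual` and `Annotated` are local stand-ins rather than the package types, `YourTransformer` is a hypothetical placeholder, and `input_scitype` is defined locally instead of extending the model interface.

```julia
# Local stand-ins so the snippet runs without any packages (assumptions).
abstract type Textual end
abstract type Annotated{S} end

# Hypothetical transformer type, for illustration only.
struct YourTransformer end

# The declared requirement: a vector (corpus) of tokenized documents.
input_scitype(::Type{<:YourTransformer}) =
    AbstractVector{<:Annotated{AbstractVector{Textual}}}

# Scitype of a corpus of tokenized documents: accepted.
corpus = AbstractVector{Annotated{AbstractVector{Textual}}}
@assert corpus <: input_scitype(YourTransformer)

# Scitype of a single tokenized document: not accepted, a vector is required.
single_document = Annotated{AbstractVector{Textual}}
@assert !(single_document <: input_scitype(YourTransformer))

# Scitype of a corpus whose tokens are tagged words: not accepted by this declaration.
tagged_corpus = AbstractVector{Annotated{AbstractVector{Annotated{Textual}}}}
@assert !(tagged_corpus <: input_scitype(YourTransformer))
```

Because type parameters are invariant, supporting tagged words as well would need a wider declaration, for example a union of the two element scitypes.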

@ablaom
Member Author

ablaom commented Sep 22, 2021

Closed by #153. See also #158.

ablaom closed this as completed Sep 22, 2021