Add an experimental implementation of scitype for text analysis #154

ablaom · 2021-08-27T05:14:47Z

Following on from this discussion I propose we add an implementation of scitype that we mark as experimental (changeable without a breaking release).

Initially I thought of implementing something using the TokenDocument from TextAnalysis.jl but I rather think a nicer class of objects is the Document{T} type from CorpusLoaders, which has the added benefit of being much more light-weight than TextAnalysis.jl. It also defines a TaggedWord abstract type, together with a bunch of useful concrete subtypes which are used in all the corpora you can load from that package.

@pazzo83 I wonder what you think of having your transformer suck in data as some form of CorpusLoaders.Document{T}. This type is defined here. See also this comment of @oxinabox. I don't think it would be hard to convert such documents (with suitably restricted T) to TokenDocument, if that was convenient for you to do internally.

An implementation of scitype along these lines is drafted here. This is from the tests:

    tagged_word = CorpusLoaders.PosTaggedWord("NN", "wheelbarrow")
    tagged_word2 = CorpusLoaders.PosTaggedWord("NN", "soil")
    @test scitype(tagged_word) == Annotated{Textual}
    bag_of_words = Dict("cat"=>1, "dog"=>3)
    @test scitype(bag_of_words) == Multiset{Textual}
    bag_of_tagged_words = Dict(tagged_word => 5)
    @test scitype(bag_of_tagged_words) == Multiset{Annotated{Textual}}
    @test scitype(Document("kadsfkj")) == Unknown
    @test scitype(Document([tagged_word, tagged_word2])) ==
        Annotated{AbstractVector{Annotated{Textual}}}
    nested_tokens = [["dog", "cat"], ["bird", "cat"]]
    @test scitype(Document(nested_tokens)) ==
                  Annotated{AbstractVector{AbstractVector{Textual}}}

@storopoli


pazzo83 · 2021-08-27T15:12:34Z

I really like this! I'll update the TFIDF transformer to adopt these types.

pazzo83 · 2021-08-31T03:43:24Z

Is the idea for an entire Corpus of documents to be represented as one Document object that is comprised of a vector of Strings? For the TfidfTransformer, the ideal input would basically be a vector of strings or vector of some Document type.

ablaom · 2021-08-31T04:50:34Z

No, I wouldn't have thought so. Copied from my Slack response:

Sounds like you're thinking about having your transformer process untokenized data. But perhaps tokenization should be left to a separate transformer? There seem to be a few of these about. In that case your allowed input is a vector of tokenized documents. That is, a "document" is something with machine type CorpusLoaders.Document{<:AbstractVector{<:AbstractString}} (ignoring the possibility of tagged words here, for simplicity) and your transformer is sucking in a vector of these (a corpus). You would then articulate this type requirement by declaring input_scitype(::Type{<:YourTransformer}) = AbstractVector{<:Annotated{AbstractVector{Textual}}}. Yes?
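To make the subtype requirement above concrete, here is a minimal, dependency-free sketch. Textual and Annotated are local stand-ins defined only for this illustration (the real types would come from the scitype implementation drafted in this issue), and YourTransformer, input_scitype and accepts are hypothetical names. The corpus scitype shown is an assumption about what a vector of tokenized documents would have under the draft.

```julia
# Local stand-ins for illustration only; NOT the package definitions.
abstract type Textual end
abstract type Annotated{S} end

# Hypothetical transformer, with the trait declared as suggested above.
struct YourTransformer end
input_scitype(::Type{<:YourTransformer}) =
    AbstractVector{<:Annotated{AbstractVector{Textual}}}

# Assumed scitype of a corpus: a vector of tokenized documents.
corpus_scitype = AbstractVector{Annotated{AbstractVector{Textual}}}

# Scitype of a single tokenized document (not a corpus).
document_scitype = Annotated{AbstractVector{Textual}}

# Hypothetical helper: does data with scitype S satisfy the requirement of T?
accepts(T, S) = S <: input_scitype(T)

@assert accepts(YourTransformer, corpus_scitype)     # a corpus is allowed
@assert !accepts(YourTransformer, document_scitype)  # a lone document is not
```

The `<:` inside AbstractVector{<:...} matters because Julia type parameters are invariant: without it, only vectors whose element scitype is exactly Annotated{AbstractVector{Textual}} would match.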

ablaom · 2021-09-22T03:45:25Z

Closed by #153. See also #158.

This was referenced Aug 31, 2021

Add an experimental implementation of scitype for text analysis #153

Merged

For a 2.2.0 release #156

Merged

ablaom mentioned this issue Sep 5, 2021

initial commit of tfidf transformer JuliaAI/MLJText.jl#1

Merged

ablaom closed this as completed Sep 22, 2021
