Add an experimental implementation of scitype for text analysis #154
I really like this! I'll update the TFIDF transformer to adopt these types.
Is the idea for an entire Corpus of documents to be represented as one …?
No, I wouldn't have thought so. Copied from my Slack response: Sounds like you're thinking about having your transformer process untokenized data. But perhaps tokenization should be left to a separate transformer? There seem to be a few of these about. In that case your allowed input is a vector of tokenized documents, that is, a "document" is something with machine type …
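To make the expected shape concrete, here is a minimal sketch of the "tokenize first, transform second" arrangement, assuming WordTokenizers.jl as the tokenizer (the particular machine type in the truncated sentence above is not recorded in this thread):

```julia
# Tokenization handled by a separate step, so the downstream
# transformer sees a vector of tokenized documents: each "document"
# is itself a vector of tokens.
using WordTokenizers  # assumed choice; any tokenizer producing token vectors works

raw_docs = ["The cat sat on the mat.", "Dogs bark."]
docs = tokenize.(raw_docs)  # one vector of token strings per document
```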
Following on from this discussion, I propose we add an implementation of `scitype` that we mark as experimental (changeable without a breaking release). Initially I thought of implementing something using the `TokenDocument` from TextAnalysis.jl, but I rather think a nicer class of objects is the `Document{T}` type from CorpusLoaders, which has the added benefit of being much more light-weight than TextAnalysis.jl. It also defines a `TaggedWord` abstract type, together with a bunch of useful concrete subtypes which are used in all the corpora you can load from that package.

@pazzo83 I wonder what you think of having your transformer suck in data as some form of `CorpusLoaders.Document{T}`. This type is defined here. See also this comment of @oxinabox. I don't think it would be hard to convert such documents (with suitably restricted `T`) to `TokenDocument` if that were convenient for you to do internally.

An implementation of `scitype` along these lines is drafted here. This is from the tests:
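(The test snippet did not come through here; what follows is only a hedged, self-contained sketch of the general idea. The `Annotated` name, the stand-in `SimpleTaggedWord` struct, and the `toy_scitype` function are illustrative assumptions, not the drafted implementation or the actual ScientificTypes API.)

```julia
# Self-contained sketch: map tokens and tagged words to scientific
# types, and a document of tagged words to a corresponding vector
# scitype. All names below are illustrative, not the real API.
abstract type Scientific end
abstract type Textual <: Scientific end
abstract type Annotated{S} <: Scientific end  # assumed scitype for tagged text

# Stand-in for a CorpusLoaders.TaggedWord subtype (e.g. a POS-tagged word):
struct SimpleTaggedWord
    word::String
    tag::Symbol
end

toy_scitype(::AbstractString) = Textual
toy_scitype(::SimpleTaggedWord) = Annotated{Textual}
toy_scitype(v::AbstractVector) = AbstractVector{Union{unique(toy_scitype.(v))...}}

doc = [SimpleTaggedWord("cat", :NN), SimpleTaggedWord("sat", :VBD)]
toy_scitype(doc)  # AbstractVector{Annotated{Textual}}
```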