-
Notifications
You must be signed in to change notification settings - Fork 1.2k
ArtifactStore implementation for MongoDB #3570
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ArtifactStore implementation for MongoDB #3570
Conversation
0fa9225
to
328b20d
Compare
cac391d
to
02c77ae
Compare
Codecov Report
@@ Coverage Diff @@
## master #3570 +/- ##
==========================================
- Coverage 85.82% 81.47% -4.36%
==========================================
Files 146 151 +5
Lines 7084 7480 +396
Branches 440 451 +11
==========================================
+ Hits 6080 6094 +14
- Misses 1004 1386 +382
Continue to review full report at Codecov.
|
rebase please. |
1d08e0b
to
acc6a6f
Compare
rebase please. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jiangpengcheng a MongoDb support would be a very useful addition so thanks for initiating this PR and detailed writeup 🎉
As part of a POC I had started work on implementing such an implementation at https://github.com/chetanmeh/openwhisk-mongo. This formed the basis for later work in #3517 and #3562. So you may find that implementation useful for current work. The POC is currently lagging the state in Master and I was planning to update it post work related to attachments is completed
Based on my learnings from that work would like to share some feedback wrt approach taken in this PR.
Use of Mongo Views vs Mongo Query
View is a new feature in Mongo which I have not used much. So below observations may not be all valid.
Current PR uses Mongo View support which makes use of aggregation to compute the views. Compared to CouchDB the views in MongoDB are non materialized views and hence are computed for each read query. So conceptually view feature matches CouchDB view support but it may not perform well.
Due to that I suspect that performance of query done via views may result in collection scan as the match
clause used so far cover most of the entries in collection. From what I understand querying a view is essentially a filtering scan over the entries in the view. So if view content is large the query would end up scanning all those entries.
So would be good to check performance of queries done via ArtifactStore
over large data set to understand how views perform. If that is not good I suggest to use std MongoDB query support instead with required indexes configured. This would require to store some computed fields and then construct specific query based on CouchDB view names
MVCC Support via _rev
I liked the use of document hash for revision check. Another approach that can be used is to use an integer field with conditional update support
Streaming Attachments
MongoDbStore
is currently converting the stream to byte[]
val uploadStream = gridFSBucket.openUploadStream(BsonString(doc.id.id + name), name, option)
val inputStream: InputStream = docStream.runWith(StreamConverters.asInputStream())
val b: Array[Byte] = IOUtils.toByteArray(inputStream)
val f = uploadStream.write(ByteBuffer.wrap(b)).toFuture()
It would be better to avoid materializing the stream content and instead stream the Source to GridFS. For an example have a look at attachment supportfrom the POC
acc6a6f
to
fc8728e
Compare
@chetanmeh |
4edc7b6
to
4c84cf5
Compare
af5b933
to
2c25716
Compare
72c2305
to
bbc944d
Compare
ca2a15a
to
7dc6c7c
Compare
@chetanmeh, I followed many idead from your project https://github.com/chetanmeh/openwhisk-mongo, so I added you as the co-author, is it ok? @dubee @chetanmeh could you please re-review this pr? |
@jiangpengcheng Thanks for updating the PR to use query. I would try to review it this week. It may take some time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a first pass of review and provided some initial feedback. Its a good comprehensive implementation for Mongo !!
database = "{{ db.mongodb.database }}" | ||
username = "{{ db.credentials.admin.user }}" | ||
password = "{{ db.credentials.admin.pass }}" | ||
collections { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of making this as a configuration it would be better to make it a function of entity class as Mongo provide 2 level of namespacing server -> db -> collection. So even if multiple DBs for. OpenWhisk are hosted on same server each db can have set of collection name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, but I didn't get you meaning, do you mean not set the collections
option in the configuration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes as this can remain static and not wary per deployment
implicit class StringPurge(val s: String) { | ||
def escapeDollar: String = { | ||
// replace '"$xxx":' to '"_mark_of_dollar_xxx":' as the "$" can not be the first char of field name | ||
s.replaceAll("\"\\$([^}]+?\":)", "\"_mark_of_dollar_$1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does such a case exist on OpenWhisk usage of Mongo i.e. any scala instance having fields first char of $
. RegEx application on json string value looking for field names looks risky!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, there are some examples in the openwhisk-catalog, e.g. https://github.com/apache/incubator-openwhisk-catalog/blob/master/packages/installCombinators.sh#L51, is it ok to ignore them?
ansible/wipeMongoDB.yml
Outdated
--- | ||
# WARNING: This playbook wipes the database. This action is not reversible. Be very careful and know what you are doing. | ||
|
||
- hosts: ansible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would better to initialize the index as part of MongoDBArtifactStore
and avoid requiring this explicit setup for now. USed similar approach in CosmosDBArtifactStore
. We may need such a step later once we decide to introduce a new index and where Mongo is already in use.
} | ||
|
||
// calculate the revision manually, to be compatible with couchdb's _rev field | ||
private def revisionCalculate(doc: JsObject): (String, String) = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
object MongoDBClient { | ||
private var _client: Option[MongoClient] = None | ||
|
||
def client(config: MongoDBConfig): MongoClient = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be better to accept whole MongoURI as config instead of breaking it down. That would allow configuring lots of other setting in same uri. Otherwise you would need to later add support for various settings as defined by MongoClientOptions
a685d18
to
423a2d1
Compare
423a2d1
to
2e25399
Compare
Co-authored-by: Chetan Mehrotra <[email protected]> Co-authored-by: jiangpengcheng <[email protected]>
to keep same with the couchdb.yml
To make the wskadmin-next work with the generated whisk.conf
1. use the whole mongodb uri in mongodb config instead of separated host/port/username/password 2. initialize mongodb indexes in the MongoDBArtifactStore instead of ansible task 3. rename of some variables/classes
2e25399
to
2fe323d
Compare
@@ -87,12 +87,16 @@ dependencies { | |||
compile 'io.reactivex:rxjava-reactive-streams:1.2.1' | |||
compile 'com.microsoft.azure:azure-cosmosdb:2.3.0' | |||
|
|||
//for mongo | |||
compile 'org.mongodb.scala:mongo-scala-driver_2.12:2.6.0' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i rebased this pr to master, and resolved conflicts.
in doing this, i picked up the latest version of the driver which might have been foolish.
please check it.
closed as this is outdated |
This PR provides a MongoDBArtifactStore implementation to enable using MongoDB for storing subjects, whisks and activation in MongoDB
Description
Some users may want to use other database backends instead of CouchDB to store entities, this PR provides MongoDB as an alternative choice for these users
MongoDB driver
MongoDB provides official drivers for many languages, in this case, we pick the mongo-scala-driver, this driver provides an idiomatic Scala API that is built on top of the MongoDB Async Java driver.
Design Considerations
Data Model
The data scheme in MongoDB is just like CouchDB except that MongoDB doesn't have a
_rev
field, below is the data ofwhisk.system/invokerHealthTestAction0
in CouchDB and MongoDB:The MongoDB use BSON to store data,
MongoDbStrore
will convert bson to json, this is transparent to usersRegarding the
_rev
field, there is no alternative way in MongoDB to do the same thing, so theMongoDbStore
calculate a value according to the document's content manually, and make it as the_rev
field.Attachment
MongoDB use GridFS to store and retrieve files that exceed the BSON-document size limit of 16 MB.
Attachment in MongoDB is stored in a separate collection with a independent
_id
, this PR use thedoc._id + doc.file_name
as the attachment's_id
field, then we can find the relative attachment easily.Query
CouchDB use
designDoc
to do the query operation,MongoDB can use:view
feature to achieve the same effect, there are some examples belowwhisks-filters.v2.1.0/activations
As @chetanmeh suggest and I test again with 300,000 activations, the performance of
view
in MongoDB is not good as native query while callwsk activation list
, so I change strategies and follow the openwhisk-mongoReduce Design Doc
Unfortunately, there is no easy way to do the same thing with the CouchDB style, currently, this pr doesn't support thereduce
in MongoDB, because ofreduce
in CouchDB can usestart_key
、end_key
as filters to produce a different result whilereduce
in MongoDB can not, so I do a trade-off to simulate some simple CouchDB stylereduce
Reduce
feature, need investigate it laterMongoDB Deployment
Currently only support deploying MongoDB replica set.
Progress
Finished Work
Basic Usage(create、update、query、delete)
Attachment Support
Automated deployment of MongoDB using Ansible
CI Integration
Documents
FurtherWork
My changes affect the following components
Types of changes
Checklist: