Skip to content

ArtifactStore implementation for MongoDB #3570

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

jiangpengcheng
Copy link
Contributor

@jiangpengcheng jiangpengcheng commented Apr 24, 2018

This PR provides a MongoDBArtifactStore implementation to enable using MongoDB for storing subjects, whisks and activation in MongoDB

Description

Some users may want to use other database backends instead of CouchDB to store entities, this PR provides MongoDB as an alternative choice for these users

MongoDB driver

MongoDB provides official drivers for many languages, in this case, we pick the mongo-scala-driver, this driver provides an idiomatic Scala API that is built on top of the MongoDB Async Java driver.

Design Considerations

Data Model

The data scheme in MongoDB is just like CouchDB except that MongoDB doesn't have a _rev field, below is the data of whisk.system/invokerHealthTestAction0 in CouchDB and MongoDB:

CouchDB MongoDB
{
  "_id": "whisk.system/invokerHealthTestAction0",
  "_rev": "68-e72440f911c64ab11441c09e730e5ab8",
  "name": "invokerHealthTestAction0",
  "publish": false,
  "annotations": [],
  "version": "0.0.1",
  "updated": 1524476933182,
  "entityType": "action",
  "exec": {
    "kind": "nodejs:6",
    "code": "function main(params) { return params; }",
    "binary": false
  },
  "parameters": [],
  "limits": {
    "timeout": 60000,
    "memory": 256,
    "logs": 10
  },
  "namespace": "whisk.system"
}
{
  "_id" : "whisk.system/invokerHealthTestAction0",
  "name" : "invokerHealthTestAction0",
  "publish" : false,
  "annotations" : [ ],
  "version" : "0.0.1",
  "updated" : NumberLong("1524473794826"),
  "entityType" : "action",
  "exec" : {
    "kind" : "nodejs:6",
    "code" : "function main(params) { return params; }",
    "binary" : false
  },
  "parameters" : [ ],
  "limits" : {
    "timeout" : 60000,
    "memory" : 256,
    "logs" : 10
  },
  "namespace" : "whisk.system"
}

The MongoDB use BSON to store data, MongoDbStrore will convert bson to json, this is transparent to users

Regarding the _rev field, there is no alternative way in MongoDB to do the same thing, so the MongoDbStore calculate a value according to the document's content manually, and make it as the _rev field.

Attachment

MongoDB use GridFS to store and retrieve files that exceed the BSON-document size limit of 16 MB.

Attachment in MongoDB is stored in a separate collection with a independent _id, this PR use the doc._id + doc.file_name as the attachment's _id field, then we can find the relative attachment easily.

Query

CouchDB use designDoc to do the query operation, MongoDB can use view feature to achieve the same effect, there are some examples below:

whisks-filters.v2.1.0/activations

db.createView(
    "whisk_local_whisks.whisks-filters.v2.1.0/activations",
    "whisk_local_whisks",
    [
        { "$match": { "activationId": { "$exists": true } } },
        {
            "$addFields": {
                "flag": {
                    "$arrayElemAt": [{ "$filter": { "input": "$annotations", "cond": { "$eq": ["$$this.key", "path"] } } }, 0]
                }
            }
        },
        { "$lookup": { "from": "whisk_local_whisks", "localField": "_id", "foreignField": "_id", "as": "doc" } },
        {
            "$project": {
                "key": [{ "$ifNull": ["$flag.value", { "$concat": ["$namespace", "/", "$name"] }] }, "$start"],
                "value": {
                    "namespace": "$namespace",
                    "name": "$name",
                    "version": "$version",
                    "publish": "$publish",
                    "annotations": "$annotations",
                    "activationId": "$activationId",
                    "start": "$start",
                    "end": "$end",
                    "duration": "$duration",
                    "cause": "$cause",
                    "statusCode": "$response.statusCode"
                },
                "_id": 0,
                "id": "$_id",
                "doc": { "$arrayElemAt": ["$doc", 0] }
            }
        }
    ]
)

As @chetanmeh suggest and I test again with 300,000 activations, the performance of view in MongoDB is not good as native query while call wsk activation list, so I change strategies and follow the openwhisk-mongo

Reduce Design Doc

Unfortunately, there is no easy way to do the same thing with the CouchDB style reduce in MongoDB, because of reduce in CouchDB can use start_keyend_key as filters to produce a different result while reduce in MongoDB can not, so I do a trade-off to simulate some simple CouchDB style reduce, currently, this pr doesn't support the Reduce feature, need investigate it later

MongoDB Deployment

Currently only support deploying MongoDB replica set.

Progress

Finished Work

  • Basic Usage(create、update、query、delete)

  • Attachment Support

  • Automated deployment of MongoDB using Ansible

  • CI Integration

  • Documents

FurtherWork

  • Add MongoDB support for wskadmin

My changes affect the following components

  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • Documentation

Types of changes

  • Enhancement or new feature (adds new functionality).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@jiangpengcheng jiangpengcheng changed the title ArtifactStore implementation for MongoDB ArtifactStore implementation for MongoDB[WIP] Apr 24, 2018
@jiangpengcheng jiangpengcheng force-pushed the add_mongodb_implementation branch 2 times, most recently from 0fa9225 to 328b20d Compare May 1, 2018 07:31
@jiangpengcheng jiangpengcheng changed the title ArtifactStore implementation for MongoDB[WIP] ArtifactStore implementation for MongoDB May 1, 2018
@jiangpengcheng jiangpengcheng force-pushed the add_mongodb_implementation branch 8 times, most recently from cac391d to 02c77ae Compare May 3, 2018 08:22
@codecov-io
Copy link

codecov-io commented May 3, 2018

Codecov Report

Merging #3570 into master will decrease coverage by 4.35%.
The diff coverage is 87.12%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3570      +/-   ##
==========================================
- Coverage   85.82%   81.47%   -4.36%     
==========================================
  Files         146      151       +5     
  Lines        7084     7480     +396     
  Branches      440      451      +11     
==========================================
+ Hits         6080     6094      +14     
- Misses       1004     1386     +382
Impacted Files Coverage Δ
.../scala/src/main/scala/whisk/core/WhiskConfig.scala 94.21% <100%> (+0.04%) ⬆️
...k/core/database/mongodb/MongoDBArtifactStore.scala 80.42% <80.42%> (ø)
.../whisk/core/database/mongodb/AsyncStreamSink.scala 93.54% <93.54%> (ø)
...atabase/mongodb/MongoDBArtifactStoreProvider.scala 94.44% <94.44%> (ø)
...hisk/core/database/mongodb/AsyncStreamSource.scala 95.83% <95.83%> (ø)
...hisk/core/database/mongodb/MongoDBViewMapper.scala 98.85% <98.85%> (ø)
...core/database/cosmosdb/RxObservableImplicits.scala 0% <0%> (-100%) ⬇️
...core/database/cosmosdb/CosmosDBArtifactStore.scala 0% <0%> (-95.1%) ⬇️
...sk/core/database/cosmosdb/CosmosDBViewMapper.scala 0% <0%> (-92.6%) ⬇️
...whisk/core/database/cosmosdb/CosmosDBSupport.scala 0% <0%> (-81.82%) ⬇️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1cc3832...423a2d1. Read the comment docs.

@junoyoon
Copy link

junoyoon commented May 8, 2018

rebase please.

@jiangpengcheng jiangpengcheng force-pushed the add_mongodb_implementation branch 3 times, most recently from 1d08e0b to acc6a6f Compare May 9, 2018 06:03
@junoyoon
Copy link

rebase please.

Copy link
Member

@chetanmeh chetanmeh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiangpengcheng a MongoDb support would be a very useful addition so thanks for initiating this PR and detailed writeup 🎉

As part of a POC I had started work on implementing such an implementation at https://github.com/chetanmeh/openwhisk-mongo. This formed the basis for later work in #3517 and #3562. So you may find that implementation useful for current work. The POC is currently lagging the state in Master and I was planning to update it post work related to attachments is completed

Based on my learnings from that work would like to share some feedback wrt approach taken in this PR.

Use of Mongo Views vs Mongo Query

View is a new feature in Mongo which I have not used much. So below observations may not be all valid.

Current PR uses Mongo View support which makes use of aggregation to compute the views. Compared to CouchDB the views in MongoDB are non materialized views and hence are computed for each read query. So conceptually view feature matches CouchDB view support but it may not perform well.

Due to that I suspect that performance of query done via views may result in collection scan as the match clause used so far cover most of the entries in collection. From what I understand querying a view is essentially a filtering scan over the entries in the view. So if view content is large the query would end up scanning all those entries.

So would be good to check performance of queries done via ArtifactStore over large data set to understand how views perform. If that is not good I suggest to use std MongoDB query support instead with required indexes configured. This would require to store some computed fields and then construct specific query based on CouchDB view names

MVCC Support via _rev

I liked the use of document hash for revision check. Another approach that can be used is to use an integer field with conditional update support

Streaming Attachments

MongoDbStore is currently converting the stream to byte[]

    val uploadStream = gridFSBucket.openUploadStream(BsonString(doc.id.id + name), name, option)
    val inputStream: InputStream = docStream.runWith(StreamConverters.asInputStream())
    val b: Array[Byte] = IOUtils.toByteArray(inputStream)
    val f = uploadStream.write(ByteBuffer.wrap(b)).toFuture()

It would be better to avoid materializing the stream content and instead stream the Source to GridFS. For an example have a look at attachment supportfrom the POC

@jiangpengcheng jiangpengcheng force-pushed the add_mongodb_implementation branch from acc6a6f to fc8728e Compare May 22, 2018 05:30
@jiangpengcheng
Copy link
Contributor Author

@chetanmeh
thanks for your review! I will check it

@jiangpengcheng jiangpengcheng force-pushed the add_mongodb_implementation branch 2 times, most recently from 4edc7b6 to 4c84cf5 Compare May 22, 2018 07:31
@jiangpengcheng jiangpengcheng force-pushed the add_mongodb_implementation branch 4 times, most recently from af5b933 to 2c25716 Compare June 5, 2018 05:50
@jiangpengcheng jiangpengcheng force-pushed the add_mongodb_implementation branch 2 times, most recently from 72c2305 to bbc944d Compare June 5, 2018 08:35
@jiangpengcheng jiangpengcheng force-pushed the add_mongodb_implementation branch 7 times, most recently from ca2a15a to 7dc6c7c Compare August 6, 2018 03:37
@jiangpengcheng
Copy link
Contributor Author

jiangpengcheng commented Aug 16, 2018

@chetanmeh, I followed many idead from your project https://github.com/chetanmeh/openwhisk-mongo, so I added you as the co-author, is it ok?

@dubee @chetanmeh could you please re-review this pr?

@chetanmeh
Copy link
Member

@jiangpengcheng Thanks for updating the PR to use query. I would try to review it this week. It may take some time

Copy link
Member

@chetanmeh chetanmeh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a first pass of review and provided some initial feedback. Its a good comprehensive implementation for Mongo !!

database = "{{ db.mongodb.database }}"
username = "{{ db.credentials.admin.user }}"
password = "{{ db.credentials.admin.pass }}"
collections {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of making this as a configuration it would be better to make it a function of entity class as Mongo provide 2 level of namespacing server -> db -> collection. So even if multiple DBs for. OpenWhisk are hosted on same server each db can have set of collection name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, but I didn't get you meaning, do you mean not set the collections option in the configuration?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes as this can remain static and not wary per deployment

implicit class StringPurge(val s: String) {
def escapeDollar: String = {
// replace '"$xxx":' to '"_mark_of_dollar_xxx":' as the "$" can not be the first char of field name
s.replaceAll("\"\\$([^}]+?\":)", "\"_mark_of_dollar_$1")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does such a case exist on OpenWhisk usage of Mongo i.e. any scala instance having fields first char of $. RegEx application on json string value looking for field names looks risky!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, there are some examples in the openwhisk-catalog, e.g. https://github.com/apache/incubator-openwhisk-catalog/blob/master/packages/installCombinators.sh#L51, is it ok to ignore them?

---
# WARNING: This playbook wipes the database. This action is not reversible. Be very careful and know what you are doing.

- hosts: ansible
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would better to initialize the index as part of MongoDBArtifactStore and avoid requiring this explicit setup for now. USed similar approach in CosmosDBArtifactStore. We may need such a step later once we decide to introduce a new index and where Mongo is already in use.

}

// calculate the revision manually, to be compatible with couchdb's _rev field
private def revisionCalculate(doc: JsObject): (String, String) = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

object MongoDBClient {
private var _client: Option[MongoClient] = None

def client(config: MongoDBConfig): MongoClient = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be better to accept whole MongoURI as config instead of breaking it down. That would allow configuring lots of other setting in same uri. Otherwise you would need to later add support for various settings as defined by MongoClientOptions

@jiangpengcheng jiangpengcheng force-pushed the add_mongodb_implementation branch from a685d18 to 423a2d1 Compare August 29, 2018 06:32
@rabbah rabbah unassigned dubee Feb 9, 2019
@rabbah rabbah force-pushed the add_mongodb_implementation branch from 423a2d1 to 2e25399 Compare February 9, 2019 14:25
jiangpengcheng and others added 7 commits February 9, 2019 09:28
Co-authored-by: Chetan Mehrotra <[email protected]>
Co-authored-by: jiangpengcheng <[email protected]>
to keep same with the couchdb.yml
To make the wskadmin-next work with the generated whisk.conf
1. use the whole mongodb uri in mongodb config instead of separated host/port/username/password
2. initialize mongodb indexes in the MongoDBArtifactStore instead of ansible task
3. rename of some variables/classes
@rabbah rabbah force-pushed the add_mongodb_implementation branch from 2e25399 to 2fe323d Compare February 9, 2019 14:29
@@ -87,12 +87,16 @@ dependencies {
compile 'io.reactivex:rxjava-reactive-streams:1.2.1'
compile 'com.microsoft.azure:azure-cosmosdb:2.3.0'

//for mongo
compile 'org.mongodb.scala:mongo-scala-driver_2.12:2.6.0'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i rebased this pr to master, and resolved conflicts.
in doing this, i picked up the latest version of the driver which might have been foolish.
please check it.

@rabbah rabbah added the review Review for this PR has been requested and yet needs to be done. label Feb 9, 2019
@jiangpengcheng
Copy link
Contributor Author

closed as this is outdated

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
artifactstore review Review for this PR has been requested and yet needs to be done.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants