feat: Initial database support #246

Merged

@ruivieira ruivieira commented Jun 12, 2024

See:

This PR adds a new syntax to TrustyAIService CRs for configuring database storage, while
remaining backwards compatible with the previous PVC mode. Example:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: TrustyAIService
metadata:
  name: trustyai-service-db
spec:
  storage:
    format: "DATABASE"
    databaseConfigurations: db-credentials
  metrics:
    schedule: "5s"

Other changes

  • Add status checking
  • Add better storage flags
  • Add spec.storage.format validation
  • Add HIBERNATE format to DB (test)
  • Update service image
  • Revert identifier to DATABASE
  • Update CR options (remove mandatory data)
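The `spec.storage.format` validation listed above could look roughly like the following. This is a minimal sketch, not the operator's actual code: the function name `validateStorageFormat` is hypothetical, and the `PVC` identifier is an assumption inferred from the "PVC-mode" wording (only `DATABASE` appears in the example CR).

```go
package main

import "fmt"

const (
	formatPVC      = "PVC"      // assumed identifier for the legacy PVC mode
	formatDatabase = "DATABASE" // identifier used in the example CR
)

// validateStorageFormat rejects any spec.storage.format value that is not
// one of the two known storage modes. Matching is case-sensitive.
func validateStorageFormat(format string) error {
	switch format {
	case formatPVC, formatDatabase:
		return nil
	default:
		return fmt.Errorf("unsupported spec.storage.format %q (expected %q or %q)",
			format, formatPVC, formatDatabase)
	}
}

func main() {
	for _, f := range []string{"DATABASE", "PVC", "HIBERNATE"} {
		if err := validateStorageFormat(f); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("valid:", f)
	}
}
```

Rejecting unknown values early lets a reconciler surface a clear status message instead of deploying a service with an unusable storage backend.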

Notes

⚠️ Note
params.env must be reverted from the test settings before merge

A test DSC (DataScienceCluster) is available with:

kind: DataScienceCluster
apiVersion: datasciencecluster.opendatahub.io/v1
metadata:
  labels:
    app.kubernetes.io/created-by: opendatahub-operator
    app.kubernetes.io/instance: default
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: datasciencecluster
    app.kubernetes.io/part-of: opendatahub-operator
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Managed
    dashboard:
      managementState: Managed
    datasciencepipelines:
      managementState: Managed
    kserve:
      managementState: Managed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Managed
    modelregistry:
      managementState: Removed
    ray:
      managementState: Managed
    trainingoperator:
      managementState: Removed
    trustyai:
      devFlags:
        manifests:
          - contextDir: config
            sourcePath: ''
            uri: https://api.github.com/repos/ruivieira/trustyai-service-operator/tarball/RHOAIENG-8178-database
      managementState: Managed
    workbenches:
      managementState: Managed

- Add status checking
- Add better storage flags
- Add spec.storage.format validation
- Add DDL
- Add HIBERNATE format to DB (test)
- Update service image
- Revert identifier to DATABASE
- Update CR options (remove mandatory data)

github-actions bot commented Jun 12, 2024

PR image build and manifest generation completed successfully!

📦 PR image: quay.io/trustyai/trustyai-service-operator-ci:07dc15a9ecc0a2b09d2852a69be12434dc891908

🗂️ CI manifests

…NG-8178-database

# Conflicts:
#	config/crd/bases/trustyai.opendatahub.io_trustyaiservices.yaml
#	controllers/constants.go

openshift-ci bot commented Jul 12, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: RobGeada

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot commented Jul 12, 2024

@ruivieira: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/trustyai-service-operator-e2e | 07dc15a | link | true | `/test trustyai-service-operator-e2e` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ruivieira ruivieira merged commit 4c8550e into trustyai-explainability:main Jul 12, 2024
5 of 7 checks passed
ruivieira added a commit that referenced this pull request Jul 24, 2024
* feat: Initial database support (#246)

* Initial database support

- Add status checking
- Add better storage flags
- Add spec.storage.format validation
- Add DDL
- Add HIBERNATE format to DB (test)
- Update service image
- Revert identifier to DATABASE
- Update CR options (remove mandatory data)

* Remove default DDL generation env var

* Update service image to latest tag

* Add migration awareness

* Add updating pods for migration

* Change JDBC url from mysql to mariadb

* Fix TLS mount

* Revert images

* Remove redundant logic

* Fix comments

* feat: Add TLS certificate mount on ModelMesh (#255)

* feat: Add TLS certificate mount on ModelMesh

* Revert from http to https until kserve/modelmesh#147 is merged

* Add lm-eval-service controller

refactor the existing TrustyAIService controller and
add LMEvalService controller

Signed-off-by: Yihong Wang <[email protected]>

---------

Signed-off-by: Yihong Wang <[email protected]>
Co-authored-by: Rui Vieira <[email protected]>
ruivieira added a commit that referenced this pull request Aug 5, 2024
* feat: Initial database support (#246)

* Initial database support

- Add status checking
- Add better storage flags
- Add spec.storage.format validation
- Add DDL
- Add HIBERNATE format to DB (test)
- Update service image
- Revert identifier to DATABASE
- Update CR options (remove mandatory data)

* Remove default DDL generation env var

* Update service image to latest tag

* Add migration awareness

* Add updating pods for migration

* Change JDBC url from mysql to mariadb

* Fix TLS mount

* Revert images

* Remove redundant logic

* Fix comments

* feat: Add TLS certificate mount on ModelMesh (#255)

* feat: Add TLS certificate mount on ModelMesh

* Revert from http to https until kserve/modelmesh#147 is merged

* Pin oc version, ubi version (#263)

* Restore checkout of trustyai-exp (#265)

* Add operator installation robustness (#266)

* fix: Skip InferenceService patching for KServe RawDeployment (#262)

* feat: ConfigMap key to disable KServe Serverless configuration (#267)

* feat: Add support for custom certificates in database connection (#259)

* Add TLS endpoint for ModelMesh payload processors. (#268)

Keep non-TLS endpoint for KServe Serverless (disabled by default)

---------

Signed-off-by: Yihong Wang <[email protected]>
Co-authored-by: Rui Vieira <[email protected]>
Co-authored-by: Rob Geada <[email protected]>
ruivieira added a commit that referenced this pull request Aug 23, 2024
* feat: Initial database support (#246)

* Initial database support

- Add status checking
- Add better storage flags
- Add spec.storage.format validation
- Add DDL
- Add HIBERNATE format to DB (test)
- Update service image
- Revert identifier to DATABASE
- Update CR options (remove mandatory data)

* Remove default DDL generation env var

* Update service image to latest tag

* Add migration awareness

* Add updating pods for migration

* Change JDBC url from mysql to mariadb

* Fix TLS mount

* Revert images

* Remove redundant logic

* Fix comments

* feat: Add TLS certificate mount on ModelMesh (#255)

* feat: Add TLS certificate mount on ModelMesh

* Revert from http to https until kserve/modelmesh#147 is merged

* Pin oc version, ubi version (#263)

* Restore checkout of trustyai-exp (#265)

* Add operator installation robustness (#266)

* fix: Skip InferenceService patching for KServe RawDeployment (#262)

* feat: ConfigMap key to disable KServe Serverless configuration (#267)

* feat: Add support for custom certificates in database connection (#259)

* Add TLS endpoint for ModelMesh payload processors. (#268)

Keep non-TLS endpoint for KServe Serverless (disabled by default)

* fix: Correct maxSurge and maxUnavailable (#275)

* feat: Add support for custom DB names (#257)

* feat: Add support for custom DB names

* fix: Correct custom DB name

---------

Signed-off-by: Yihong Wang <[email protected]>
Co-authored-by: Rui Vieira <[email protected]>
Co-authored-by: Rob Geada <[email protected]>
ruivieira added a commit that referenced this pull request Oct 22, 2024
* Add lm-eval-service controller (#258)

* feat: Initial database support (#246)

* Initial database support

- Add status checking
- Add better storage flags
- Add spec.storage.format validation
- Add DDL
- Add HIBERNATE format to DB (test)
- Update service image
- Revert identifier to DATABASE
- Update CR options (remove mandatory data)

* Remove default DDL generation env var

* Update service image to latest tag

* Add migration awareness

* Add updating pods for migration

* Change JDBC url from mysql to mariadb

* Fix TLS mount

* Revert images

* Remove redundant logic

* Fix comments

* feat: Add TLS certificate mount on ModelMesh (#255)

* feat: Add TLS certificate mount on ModelMesh

* Revert from http to https until kserve/modelmesh#147 is merged

* Add lm-eval-service controller

refactor the existing TrustyAIService controller and
add LMEvalService controller

Signed-off-by: Yihong Wang <[email protected]>

---------

Signed-off-by: Yihong Wang <[email protected]>
Co-authored-by: Rui Vieira <[email protected]>

* fix: Fix typo in operator's arguments (#261)

Operator's arguments changed from `--eanble-services` to `--enable-services`.
trustyai.opendatahub.io_lmevaljobs.yaml and zz_generated.deepcopy.go regenerated.

* feat: Add LMES driver build to GHA (#272)

* sync: sync dev/lm-eval with main branch (#271)

* feat: Initial database support (#246)

* Initial database support

- Add status checking
- Add better storage flags
- Add spec.storage.format validation
- Add DDL
- Add HIBERNATE format to DB (test)
- Update service image
- Revert identifier to DATABASE
- Update CR options (remove mandatory data)

* Remove default DDL generation env var

* Update service image to latest tag

* Add migration awareness

* Add updating pods for migration

* Change JDBC url from mysql to mariadb

* Fix TLS mount

* Revert images

* Remove redundant logic

* Fix comments

* feat: Add TLS certificate mount on ModelMesh (#255)

* feat: Add TLS certificate mount on ModelMesh

* Revert from http to https until kserve/modelmesh#147 is merged

* Pin oc version, ubi version (#263)

* Restore checkout of trustyai-exp (#265)

* Add operator installation robustness (#266)

* fix: Skip InferenceService patching for KServe RawDeployment (#262)

* feat: ConfigMap key to disable KServe Serverless configuration (#267)

* feat: Add support for custom certificates in database connection (#259)

* Add TLS endpoint for ModelMesh payload processors. (#268)

Keep non-TLS endpoint for KServe Serverless (disabled by default)

---------

Signed-off-by: Yihong Wang <[email protected]>
Co-authored-by: Rui Vieira <[email protected]>
Co-authored-by: Rob Geada <[email protected]>

* Weekly sync up of dev/lm-eval branch (#278)

* feat: Initial database support (#246)

* Initial database support

- Add status checking
- Add better storage flags
- Add spec.storage.format validation
- Add DDL
- Add HIBERNATE format to DB (test)
- Update service image
- Revert identifier to DATABASE
- Update CR options (remove mandatory data)

* Remove default DDL generation env var

* Update service image to latest tag

* Add migration awareness

* Add updating pods for migration

* Change JDBC url from mysql to mariadb

* Fix TLS mount

* Revert images

* Remove redundant logic

* Fix comments

* feat: Add TLS certificate mount on ModelMesh (#255)

* feat: Add TLS certificate mount on ModelMesh

* Revert from http to https until kserve/modelmesh#147 is merged

* Pin oc version, ubi version (#263)

* Restore checkout of trustyai-exp (#265)

* Add operator installation robustness (#266)

* fix: Skip InferenceService patching for KServe RawDeployment (#262)

* feat: ConfigMap key to disable KServe Serverless configuration (#267)

* feat: Add support for custom certificates in database connection (#259)

* Add TLS endpoint for ModelMesh payload processors. (#268)

Keep non-TLS endpoint for KServe Serverless (disabled by default)

* fix: Correct maxSurge and maxUnavailable (#275)

* feat: Add support for custom DB names (#257)

* feat: Add support for custom DB names

* fix: Correct custom DB name

---------

Signed-off-by: Yihong Wang <[email protected]>
Co-authored-by: Rui Vieira <[email protected]>
Co-authored-by: Rob Geada <[email protected]>

* Driver updates job's status periodically (#280)

The driver periodically updates the LMEvalJob.Status.Message field
with the output from lm-eval. The message pattern the driver
captures looks like `Running text generation:  81%|`. Users
can then use this information to check the progress of the job.
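As an illustration of the progress capture described above, here is a minimal sketch of extracting the stage label and percentage from one line of lm-eval output. The regular expression and the `parseProgress` helper are assumptions made for this example; the driver's real pattern may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// progressPattern matches lines such as "Running text generation:  81%|".
// It is an illustrative pattern, not the one used by the driver.
var progressPattern = regexp.MustCompile(`^(.+?):\s+(\d{1,3})%\|`)

// parseProgress returns the stage label and percentage found at the start
// of a progress line, or ok=false when the line is not a progress line.
func parseProgress(line string) (stage string, percent int, ok bool) {
	m := progressPattern.FindStringSubmatch(line)
	if m == nil {
		return "", 0, false
	}
	p, err := strconv.Atoi(m[2])
	if err != nil {
		return "", 0, false
	}
	return m[1], p, true
}

func main() {
	stage, pct, ok := parseProgress("Running text generation:  81%|")
	fmt.Println(stage, pct, ok)
}
```

A controller or driver could write the parsed stage and percentage into a status message on each update, which is cheaper than forwarding the raw progress-bar output.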

Signed-off-by: Yihong Wang <[email protected]>

* Add Dockerfile for LMES job image (#276)

Add Dockerfile for LMES job image and the needed files

Signed-off-by: Yihong Wang <[email protected]>

* feat: Add overlays (#283)

* feat: Add overlays

* Remove redundant lmes-tas overlay. Change job image name.

* Add job image build (#284)

* Change job image use midstream lm-evaluation-harness (#285)

* feat: support batch size (#290)

Add batch size support to the LMEvalJob, which
leverages the `--batch_size` option of `lm-evaluation-harness`.
This only affects local models; `--batch_size` doesn't
work for remote inference APIs.

Signed-off-by: Yihong Wang <[email protected]>

* Add the `openai` package into the lmes job image (#292)

update the LMES job's Dockerfile to include the
`openai` package.

Signed-off-by: Yihong Wang <[email protected]>

* fix: fix dependency error in the job image (#296)

Split up the unitxt and openai dependencies to
avoid the conflict.

Signed-off-by: Yihong Wang <[email protected]>

* feat: add device detection in lmes driver (#298)

Added a new feature in LMES driver to detect the available
devices by using the PyTorch API. This feature can be disabled
by passing the `--detect-device false` option.

Signed-off-by: Yihong Wang <[email protected]>

* feat: support unitxt recipes (#301)

Add new fields in the CRD to support unitxt recipes and
leverage the driver to create corresponding yaml files
of the unitxt recipes.

Signed-off-by: Yihong Wang <[email protected]>

* feat: support custom dataset (#309)

Updated the CRD data struct to allow users to specify a custom Unitxt card in
JSON format. The custom Unitxt card is equivalent to a custom dataset
definition. Also restructured and updated the CRD to support Volumes,
VolumeMounts, Env, Resources, Labels, and Annotations.

Signed-off-by: Yihong Wang <[email protected]>

* feat: new pulling mechanism for job statuses (#314)

Update the driver to keep running even after the user program
finishes. The driver provides two APIs:
- GetStatus(): retrieve the job status
- Shutdown(): properly tear down the driver

On the controller side, the `pod/exec` resource is used
to run the driver command, which invokes the driver APIs
to retrieve the job status and shut down the driver
when the job is done.
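The pull-based flow above can be sketched as follows. This is a simplified illustration under stated assumptions: the `Driver` interface, the `JobStatus` fields, and the in-memory `fakeDriver` are hypothetical, and the real controller reaches the driver through `pod/exec` rather than a direct method call.

```go
package main

import (
	"errors"
	"fmt"
)

// JobStatus is a hypothetical status payload; the real fields may differ.
type JobStatus struct {
	Done    bool
	Message string
}

// Driver mirrors the two operations named in the commit message.
type Driver interface {
	GetStatus() (JobStatus, error)
	Shutdown() error
}

// fakeDriver replays a scripted list of statuses, standing in for the
// driver process that the controller would reach via pod/exec.
type fakeDriver struct {
	statuses []JobStatus
	next     int
	stopped  bool
}

func (d *fakeDriver) GetStatus() (JobStatus, error) {
	if d.stopped {
		return JobStatus{}, errors.New("driver already shut down")
	}
	if len(d.statuses) == 0 {
		return JobStatus{}, errors.New("no status available")
	}
	s := d.statuses[d.next]
	if d.next < len(d.statuses)-1 {
		d.next++ // keep returning the last status once the script ends
	}
	return s, nil
}

func (d *fakeDriver) Shutdown() error {
	d.stopped = true
	return nil
}

// pollUntilDone asks the driver for its status up to maxPolls times and
// shuts the driver down as soon as the job reports completion.
func pollUntilDone(d Driver, maxPolls int) (JobStatus, error) {
	for i := 0; i < maxPolls; i++ {
		s, err := d.GetStatus()
		if err != nil {
			return JobStatus{}, err
		}
		if s.Done {
			return s, d.Shutdown()
		}
	}
	return JobStatus{}, errors.New("job did not finish within the polling budget")
}

func main() {
	d := &fakeDriver{statuses: []JobStatus{
		{Message: "Running text generation:  81%|"},
		{Done: true, Message: "job completed"},
	}}
	s, err := pollUntilDone(d, 5)
	fmt.Println(s.Message, err)
}
```

Keeping the driver alive until an explicit shutdown means the final status stays retrievable after the user program exits, at the cost of requiring the controller to tear the driver down.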

Signed-off-by: Yihong Wang <[email protected]>

* Move operator's cmd/operator/main.go to cmd/main.go to keep operator-sdk compatibility (#295)

* Remove hardcoded job's user ID (#322)

* Fix mkdir command in Job dockerfile (#330)

* Refactor some lmesreconcile methods (#323)

* Refactor lmes reconcile options

Signed-off-by: ted chang <[email protected]>

* Update controllers/lmes/lmevaljob_controller.go

Co-authored-by: Yihong Wang <[email protected]>

* Update controllers/lmes/lmevaljob_controller.go

Co-authored-by: Yihong Wang <[email protected]>
Signed-off-by: ted chang <[email protected]>

---------

Signed-off-by: ted chang <[email protected]>
Co-authored-by: Yihong Wang <[email protected]>

* tidy: clean up lmes-job image (#333)

remove BAM related packages and patch.

Signed-off-by: Yihong Wang <[email protected]>

* Enable job suspend for Kueue (#317)

* Refactor lmes reconcile options

Signed-off-by: ted chang <[email protected]>

* Update controllers/lmes/lmevaljob_controller.go

Co-authored-by: Yihong Wang <[email protected]>

* Update controllers/lmes/lmevaljob_controller.go

Co-authored-by: Yihong Wang <[email protected]>
Signed-off-by: ted chang <[email protected]>

* Enable job suspend for Kueue

Signed-off-by: ted chang <[email protected]>

---------

Signed-off-by: ted chang <[email protected]>
Co-authored-by: Yihong Wang <[email protected]>

* Add overlay placeholders for main merge (#334)

* sync: sync up dev/lm-eval branch with main branch (#336)

* [CI] Run tests from trustyai-tests (#279)

* Change Dockerfile to clone trustyai-tests

* Add PYTEST_MARKERS env and remove TESTS_REGEX

* RHOAIENG-12274: Update operator's overlays (#287)

* Update operator's overlays

* Update kustomization.yaml

* Add devflag printout to GH Action comment (#289)

* Add timeout loop to DSC install (#305)

* RHOAIENG-13625: Add DBAvailable status to CR (#304)

* Add DBAvailable status to CR

* Remove probes

* Add KServe destination rule for Inference Services in the ServiceMesh (#315)

* Add DestinationRule creation for KServe serverless

* Add permissions for destination rules

* Add role for destination rules

* Add missing role for creating destination rules

* Fix spacing in DestinationRule template

* Add check if DestinationRule CRD is present before creating it (#316)

* Add check for DestinationRule CRD

* Add API extensions to operator's scheme

* Add permission for CRD resource

* Fix operator metrics service target port (#320)

* Add readiness probes (#312)

* Enable KServe serverless in the rhoai overlay (#321)

* Update overlay images (#331)

* Add correct CA cert to JDBC (#324)

* Add correct CA cert to JDBC

* Add require SSL

* Support for VirtualServices for InferenceLogger traffic (#332)

* Generate KServe Inference Logger in conformance with DestinationRule and VirtualService

* Add VirtualService creation for models in the mesh

* Add permissions for VirtualServices

* Update manifests for VirtualServices

* Fix VirtualServiceName variable

* fix yaml linter after the sync

Signed-off-by: Yihong Wang <[email protected]>

* tidy the go.mod and go.sum as well

Signed-off-by: Yihong Wang <[email protected]>

---------

Signed-off-by: Yihong Wang <[email protected]>
Co-authored-by: Adolfo Aguirrezabal <[email protected]>
Co-authored-by: Rui Vieira <[email protected]>
Co-authored-by: Rob Geada <[email protected]>
Co-authored-by: Rui Vieira <[email protected]>

---------

Signed-off-by: Yihong Wang <[email protected]>
Signed-off-by: ted chang <[email protected]>
Co-authored-by: Yihong Wang <[email protected]>
Co-authored-by: Rob Geada <[email protected]>
Co-authored-by: ted chang <[email protected]>
Co-authored-by: Adolfo Aguirrezabal <[email protected]>

Successfully merging this pull request may close these issues.

Operator should support database configurations for TrustyAIService CRs
3 participants