-
Notifications
You must be signed in to change notification settings - Fork 767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix https://github.com/kubeflow/training-operator/issues/1706 #1707
Conversation
cmd/training-operator.v1/main.go
Outdated
@@ -74,6 +75,7 @@ func main() { | |||
"If set, it only monitors kubeflow jobs in the given namespace.") | |||
flag.IntVar(&monitoringPort, "monitoring-port", 9443, "Endpoint port for displaying monitoring metrics. "+ | |||
"It can be set to \"0\" to disable the metrics serving.") | |||
flag.IntVar(&controllerThreads, "controller-threads", 10, "Number of worker threads used by the controller.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we keep current defaults here(1)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most users are too "lazy" to change the default values, or they wish the default values can handle most situations, I think the default value set to 10 is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value is highly dependent on deployment environment and user needs. I would recommend to keep it default 1 and user can override it based on their needs. (Ref: https://docs.okd.io/latest/operators/operator_sdk/golang/osdk-golang-tutorial.html#osdk-golang-controller-configs_osdk-golang-tutorial)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HeGaoYuan I agree with @johnugeorge. We should keep it default 1
since it would be best to set values as the minimal requirement one as default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far I am still stick on my opinion.
spark-on-k8s-operator is a great project which is similar to training-operator. It use the default value 10 for controllerThreads.( https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/c261df66a00635509f7d8cb2a7fba4c602c9228e/main.go#L56 )
And I think the robustness is more important than minimal requirement for a production-grade project.
Welcome more programmers to convince me of 1 is better than 10. 😁😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was totally agree that the default value should be 1 here, regards as following reasons and more,
- the default value of
controller.Options.MaxConcurrentReconciles
is 1, cf - if we do not make sure that EVERYTHING now and future are completely thread safe, it's better to keep the minimal working configuration as default.
Anyway, keep default as DEFAULT would be better in my personal point of view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HeGaoYuan Can you update the PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HeGaoYuan Can you update the PR?
@johnugeorge done
Pull Request Test Coverage Report for Build 3781033970
💛 - Coveralls |
Related: #1708 |
cmd/training-operator.v1/main.go
Outdated
@@ -74,6 +75,7 @@ func main() { | |||
"If set, it only monitors kubeflow jobs in the given namespace.") | |||
flag.IntVar(&monitoringPort, "monitoring-port", 9443, "Endpoint port for displaying monitoring metrics. "+ | |||
"It can be set to \"0\" to disable the metrics serving.") | |||
flag.IntVar(&controllerThreads, "controller-threads", 10, "Number of worker threads used by the controller.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we set the controllerThreads
per each framework since I don't think users create many Jobs using all frameworks simultaneously?
This means we introduce --tfjob-threads
, pytorchjob-threads
, and so on here instead of the --controller-threads
.
@johnugeorge wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users now can enable and disable schemes that supported by the operator instance.
I think we can keep the command line options simple at first and update it to a more sophisticated version when real cases pump in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zw0610 I see; we can select schemes in the following. Thanks for clarifying!
@@ -176,9 +176,10 @@ func (jc *MPIJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct | |||
} | |||
|
|||
// SetupWithManager sets up the controller with the Manager. | |||
func (jc *MPIJobReconciler) SetupWithManager(mgr ctrl.Manager) error { | |||
func (jc *MPIJobReconciler) SetupWithManager(mgr ctrl.Manager, controllerThreads int) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we or should we introduce a way to keep the SetupWithManager
API, may be extending the configuration struct is better than extending/changing API.
Is that make sense to introduce MaxConcurrentReconciles
config to common.JobController
?
What's your opinion? @johnugeorge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kuizhiqing Yes. But I think, we need more refactoring. eg: NewReconciler can take common.JobControllerConfiguration as argument and we can add MaxConcurrentReconciles
to common.JobControllerConfiguration
.
We can do this in a separate PR as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johnugeorge yes, I agree with that.
e701c9a
to
600dffe
Compare
Thanks @HeGaoYuan /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: HeGaoYuan, johnugeorge The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* Update sdk to v1.6.0rc0 * Update sdk to v1.6.0rc0 * Update sdk to v1.6.0rc0 * Update to latest Image * [Bugfix][SDK] pod has no metadata attr anymore in the get_job_logs() … (#1760) --------- Co-authored-by: Hongzhi Chen <[email protected]>
Which issue(s) this PR fixes : #1706
Checklist: