-
Notifications
You must be signed in to change notification settings - Fork 87
feat: add storage latency and runs not stored metrics for PipelineRuns and TaskRuns #1065
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Output from the testing: |
|
The following is the coverage report on the affected files.
|
23fc41d to
e22e87a
Compare
|
The following is the coverage report on the affected files.
|
|
/retest |
|
@divyansh42: GitHub didn't allow me to request PR reviews from the following users: aThorp96. Note that only tektoncd members and repo collaborators can review this PR, and authors cannot review their own PRs. |
e22e87a to
72d7f5d
Compare
|
The following is the coverage report on the affected files.
|
72d7f5d to
9c0d74c
Compare
|
The following is the coverage report on the affected files.
|
e34c72d to
fe521aa
Compare
|
The following is the coverage report on the affected files.
|
fe521aa to
68eb24d
Compare
|
The following is the coverage report on the affected files.
|
68eb24d to
4423576
Compare
|
The following is the coverage report on the affected files.
|
4423576 to
92b7a82
Compare
|
The following is the coverage report on the affected files.
|
d222da6 to
101d1a5
Compare
|
The following is the coverage report on the affected files.
|
101d1a5 to
71470c6
Compare
|
The following is the coverage report on the affected files.
|
|
@divyansh42 This requires rebase. Let's merge this. |
1c58af7 to
09d2cb1
Compare
|
It's still showing conflict. @divyansh42 |
Add storage latency tracking to measure time from run completion to successful database storage. This doesn't track if the runs are failed to get stored on the database. Supported tags are: kind (pipelinerun/taskrun), namespace, pipeline, task pipeline and task denotes the name of pipeline and task respectively Signed-off-by: divyansh42 <diagrawa@redhat.com>
09d2cb1 to
c66668f
Compare
|
Need to fix the unit tests, something wrong with the rebase |
c66668f to
ed8b162
Compare
|
/approve |
aThorp96
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This suggestion may be out of scope for this PR but lmk what you think
If we could determine that we failed to delete finalizers on the last storage attempt then we could decrement the metric perhaps?
We can check if the finalizers failed to be updated by listing those events for the pipelinerun:
events, err := r.kubeClientSet.CoreV1().Events(pr.GetNamespace()).List(ctx, metav1.ListOptions{
FieldSelector: "reason=FinalizerUpdateFailed,involvedObject.uid=" + string(pr.GetUID()),
})However there's still two elements missing for this strategy to work:
- We can't currently tell if that event was emitted by TektonResults. In order to do that we'd have to set an
AgentNamein the ControllerOptions when creating the controller (e.g.AgentName: "tekton-results-pipelinerun-controller"), and then reference the agent name in the field selector:reason=FinalizerUpdateFailed,reportingComponent=tekton-results-pipelinerun-controller,involvedObject.uid=" + string(pr.GetUID()) - Since the
FinalierUpdateFailedevent can also be emitted when setting the finalizer, we need to ensure we're only checking for events which occur after the pipelinerun was completed.
Putting it all together I think this becomes a lot more resilient to retries with something like this:
// pipelinerun/controller.go
impl := pipelinerunreconciler.NewImpl(ctx, c, func(_ *controller.Impl) controller.Options {
return controller.Options{
// This results pipelinerun reconciler shouldn't mutate the pipelinerun's status.
SkipStatusUpdates: true,
ConfigStore: configStore,
FinalizerName: "results.tekton.dev/pipelinerun",
AgentName: "tekton-results-pipelinerun-controller",
}
})
// pipelinerun/reconciler.go
events, _ := r.kubeClientSet.CoreV1().Events(pr.GetNamespace()).List(ctx, metav1.ListOptions{
FieldSelector: "reason=FinalizerUpdateFailed,reportingComponent=tekton-results-pipelinerun-controller,involvedObject.uid=" + string(pr.GetUID()),
})
if slices.ContainsFunc(events.Items, func(event corev1.Event) bool { event.LastTimestamp.After(pr.Status.CompletionTime.Time) }) {
if err := metrics.CountRunNotStored(ctx, pr.Namespace, "PipelineRun"); err != nil {
logging.FromContext(ctx).Errorf("error counting PipelineRun as not stored: %w", err)
}
}| | `watcher_taskrun_delete_duration_seconds_[bucket, sum, count]` | The duration that take to delete the TaskRun since completion | Histogram/LastValue(Gauge) | `*pipeline`=<pipeline_name> <br> `status`=<status> <br> `*task`=<task_name> <br> `*taskrun`=<taskrun_name><br> `namespace`=<pipelineruns-taskruns-namespace> | experimental | | ||
| | `watcher_pipelinerun_delete_count` | The total count of deleted PipelineRun | Counter | `status`=<status> <br> `namespace`=<pipelinerun-namespace> | experimental | | ||
| | `watcher_taskrun_delete_count` | The total count of deleted TaskRun | Counter | `status`=<status> <br> `namespace`=<pipelinerun-namespace> | experimental | | ||
| | `watcher_run_storage_latency_seconds_[bucket, sum, count]` | The duration between run completion and successful storage | Histogram | `kind`=<pipelinerun\|taskrun> <br> `namespace`=<run_namespace> | experimental | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we include a note about the expected behavior of watcher_run_storage_latency_seconds when the Run is stored before it is complete?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can do that.
I'll add a note.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a note.
| if err := metrics.CountRunNotStored(ctx, pr.Namespace, "PipelineRun"); err != nil { | ||
| logging.FromContext(ctx).Errorf("error counting PipelineRun as not stored: %w", err) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish we had a way to only run this if the client reconciler succeeded :/
Maybe we could have a TEP for adding support for a PostReconcile hook on the PipelineRunReconciler interface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think we should add.
Thanks @aThorp96 for the detailed explanation. I think this approach makes sense. |
ed8b162 to
ac4679a
Compare
Signed-off-by: divyansh42 <diagrawa@redhat.com>
ac4679a to
4889b26
Compare
aThorp96
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\lgtm
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: aThorp96, enarha The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@divyansh42 I am not opposed to that being handled in separate PR |
khrm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
Changes
This PR add two metrics:
Add storage latency tracking to measure time from run
completion to successful database storage. This doesn't track
If the runs fail to get stored in the database.
Supported tags are: kind (pipelinerun/taskrun), namespace.
Another one is runs_not_stored_count metrics which tracks
number of runs (Pipelineruns/Taskruns) which are deleted before
getting stored into the database.
/kind feature
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you review them:
/kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tepRelease Notes