adding health checks #432
Conversation
| async Task RunDockerContainer(IEnumerable<(int ExternalPort, int Port, int? ContainerPort, string? Protocol)> ports) | ||
| { | ||
| var hasPorts = ports.Any(); | ||
| while (!dockerInfo.StoppingTokenSource.IsCancellationRequested) |
There was a problem hiding this comment.
When I implement the monitor, it might decide to kill a container because it's unhealthy. In that case I'd want the DockerRunner to recreate the replica.
Another way to do it, is to kill the container by calling docker restart, and then there is no need for that loop. what do you think?
There was a problem hiding this comment.
The argument behind the former, is that it's consistent with how I'll do it in ProcessRunner
There was a problem hiding this comment.
The docker runner already restarts the replica albiet with the same name.
There was a problem hiding this comment.
@davidfowl I think that the problem with doing docker restart is that the health monitor won't be able to tell that the container was restarted (because there won't be Stopped and Started events). Unless, I explicitly send ReplicaState.Stopped and ReplicaState.Started events whenever I restart the container, but I don't like that approach.
There was a problem hiding this comment.
From what I can tell with this loop, you would call docker rm and other commands I don't think you want to run:
tye/src/Microsoft.Tye.Hosting/DockerRunner.cs
Line 429 in 65c7ae0
The restart loop for rerunning docker is here:
tye/src/Microsoft.Tye.Hosting/DockerRunner.cs
Line 382 in 65c7ae0
There was a problem hiding this comment.
@jkotalik yeah, makes sense. I'll try doing it that way.
There was a problem hiding this comment.
@jkotalik changed the DockerRunner to do docker restart when StoppingTokenSource is canceled.
|
@davidfowl there are two major things left
Maybe it's best to start reviewing this PR and I can submit the rest later on this PR or another PR? |
# Conflicts: # src/Microsoft.Tye.Hosting/DockerRunner.cs
| replicas: 3 | ||
| bindings: | ||
| - port: 8004 | ||
| liveness: |
There was a problem hiding this comment.
These 2 probes are bigger than the entire yaml, that makes me 😢
There was a problem hiding this comment.
@davidfowl yeah... I thought it's best to go with something flexible like in k8s, but you can get a way with just providing a path, the rest have default values
# Conflicts: # src/Microsoft.Tye.Core/ApplicationFactory.cs
|
I'll probably add the k8s manifest generation tomorrow. |
|
@areller this change is amazing. Got my coffee ☕ and will be reviewing now. |
jkotalik
left a comment
There was a problem hiding this comment.
Great start. I think there is some small cleanup tasks for code readability in the replica state management, but overall looks like the right idea.
|
|
||
| if (builderHttp.Protocol != null) | ||
| { | ||
| httpGet.Add("scheme", builderHttp.Protocol!.ToUpper()); |
There was a problem hiding this comment.
I think you don't need ! here, other similar cases in this file as well.
| httpGet.Add("scheme", builderHttp.Protocol!.ToUpper()); | |
| httpGet.Add("scheme", builderHttp.Protocol.ToUpper()); |
There was a problem hiding this comment.
yep. I removed it and it didn't throw any warnings. probably because it's inside a block that runs if builderHttp.Protocol != null.
It's weird because at times the compiler will throw a warning when not using the ! operator, even though the expression is in a context where it's guaranteed to not be null...
There was a problem hiding this comment.
found an example: if you go to ReplicaMonitor.cs:113 and remove the ! after serviceDesc.Readiness, the compiler will throw a warning, even though there is no way for that variable to be null at that point.
| else | ||
| { | ||
| // If port is not given, we pull default port | ||
| var binding = service.Bindings.First(b => builderHttp.Protocol is null || b.Protocol == builderHttp.Protocol); |
There was a problem hiding this comment.
It does feel a bit awkward to have health checks per service rather than per endpoint, but it seems to be what k8s does. Or maybe we require a port?
There was a problem hiding this comment.
requiring a port will indeed make it more compatible with k8s, but it will make the tye.yaml more verbose and less maintainable in my opinion. one thing that k8s does to help with maintainability is to allow to specify a variable as a port. e.g.
port: liveness-port
| } | ||
|
|
||
| if (service.Description.Replicas == 1) | ||
| if ((service.Description.Readiness is null || service.Description.RunInfo is IngressRunInfo) && service.Description.Replicas == 1) |
There was a problem hiding this comment.
Why add the check for service.Description.RunInfo is IngressRunInfo
There was a problem hiding this comment.
because I forgot to remove when added service.Description.Readiness is null :)
| private void StartLivenessProbe(Probe probe, ReplicaState moveToOnSuccess = ReplicaState.Healthy) | ||
| { | ||
| // currently only HTTP is available | ||
| if (probe.Http is null) |
There was a problem hiding this comment.
| if (probe.Http is null) | |
| if (probe.Http == null) |
There was a problem hiding this comment.
And for the rest of the file
| // currently only HTTP is available | ||
| if (probe.Http is null) | ||
| { | ||
| _logger.LogWarning("cannot start probing replica {name} because probe configuration is not set", _replica.Name); |
There was a problem hiding this comment.
| _logger.LogWarning("cannot start probing replica {name} because probe configuration is not set", _replica.Name); | |
| _logger.LogWarning("Cannot start probing replica {name} because probe configuration is not set", _replica.Name); |
There was a problem hiding this comment.
And for the rest of the file.
|
|
||
| private void MoveToHealthy(ReplicaState from) | ||
| { | ||
| _logger.LogInformation("replica {name} is moving to an healthy state", _replica.Name); |
There was a problem hiding this comment.
These logs should probably be at debug level.
|
|
||
| public override void Start() | ||
| { | ||
| Func<ReplicaBinding, bool> bindingClosure = (_httpProberSettings.Port.HasValue, _httpProberSettings.Protocol != null) switch |
There was a problem hiding this comment.
This probably deserves a comment as well.
jkotalik
left a comment
There was a problem hiding this comment.
I think this overall looks good. There will be some follow up I'd like for us to do for polish, but can be done in a separate PR, including:
- Sample
- Docs
- Recipe
| } | ||
| } | ||
|
|
||
| private static void HandleServiceProbe(YamlMappingNode yamlMappingNode, ConfigProbe probe) |
There was a problem hiding this comment.
nit: return ConfigProbe instead of passing it in s.t you can set service.Liveness/service.Readiness directly.
There was a problem hiding this comment.
Yeah. That would probably make more sense but it will break compatibility with the rest of the methods. What do you think?
|
|
||
| private static void HandleServiceProbe(YamlMappingNode yamlMappingNode, ConfigProbe probe) | ||
| { | ||
| foreach (var child in yamlMappingNode.Children) |
There was a problem hiding this comment.
Can we add some tests for these parsing wise? Much faster for checking these kinds of conditions like making sure initialDelay can't be negative.
There was a problem hiding this comment.
I'll try to come up with something. You want it as part of this PR? or break it up in a different one?
There was a problem hiding this comment.
I think following the structure here: https://github.com/dotnet/tye/blob/master/test/UnitTests/TyeDeserializationTests.cs#L25.
Mostly just parsing/ validation of tye.yaml to make sure we cover these edge cases correctly.
No preference whether we add them in this PR now or follow up, I'd just like them in general 😄
| { | ||
| public class ReplicaMonitor : IApplicationProcessor | ||
| { | ||
| private ILogger _logger; |
Co-authored-by: Justin Kotalik <jukotali@microsoft.com>
Co-authored-by: Justin Kotalik <jukotali@microsoft.com>
|
I'm going to go ahead and merge this, let's do some follow up on top of it 😄 . Again, thanks @areller for all the work you have done here |
|
@jkotalik thank you! Will create a PR in the following days |
adding health checks according to this #19
Changes
liveness,readiness. each can hold these settingsHealthy,Ready(Ready= passed bothlivenessandreadinessprobing.Healthy= passedlivenessbut notreadiness).If neither
livenessnorreadinesswere provided, replica will be upgraded toReadyupon creation. If onlylivenessif provided, replica will be upgraded toReadyupon passing thelivenessprobe. if onlyreadinessis provided, replica will be upgraded toHealthyupon creation.ReplicaMonitorthat reads thelivenessandreadinessprobe configuration, and kills/changes the state of the replica according to probe resultsReadyreplicaslivenessandreadinessconfiguration.