Conversation

@jefferbrecht
Member

Description

sles-12 failed in the nightlies for the sudo issue, for which we had previously added a retry policy (5 times with 5 seconds in between). This change moves the retry to a more general area before tests are run, and tries a general sudo command instead: if it succeeds then later testing code should not have issues running sudo commands anymore. The retry count is also greatly increased, from 5 to 60.

Related issue

b/259122953

How has this been tested?

Tested sles-12-sp2-sap locally, though this is a flake to begin with so we'll have to assume it's working until it fails again.

Checklist:

  • Unit tests
    • Unit tests do not apply.
    • Unit tests have been added/modified and passed for this PR.
  • Integration tests
    • Integration tests do not apply.
    • Integration tests have been added/modified and passed for this PR.
  • Documentation
    • This PR introduces no user visible changes.
    • This PR introduces user visible changes and the corresponding documentation change has been made.
  • Minor version bump
    • This PR introduces no new features.
    • This PR introduces new features, and there is a separate PR to bump the minor version since the last release already.
    • This PR bumps the version.

@jefferbrecht marked this pull request as ready for review November 17, 2022 14:39
// TODO(b/259122953): wait until sudo is ready
backoffPolicy := backoff.WithContext(backoff.WithMaxRetries(backoff.NewConstantBackOff(slesStartupSudoDelay), slesStartupSudoMaxAttempts), ctx)
err := backoff.Retry(func() error {
	_, err := RunRemotely(ctx, logger, vm, "", "sudo ls /root")
Contributor

Is there a specific reason to choose `sudo ls /root` as the sudo command? I like it and I think it's simple enough. My only concern is that something weird could happen, e.g. the `/root` folder takes a while to be created even after `is-system-running` succeeds.

Member Author

I wanted a built-in non-destructive command that's guaranteed to only be runnable with sudo, and `sudo ls /root` was just the first thing that came to mind. I think `/root` gets created pretty early during initialization, but even if there's a delay after `is-system-running` passes I think that's actually a good thing -- the goal is to block on as much startup initialization as possible.

I suppose I could change it to `sudo stat /etc/sudoers.d/google_sudoers`, which would more directly accomplish the goal of "make sure sudo works", but I'm not sure it matters much either way.

Contributor

Ok! Understood, I'm fine with either of the commands you mentioned (`sudo ls /root` or `sudo stat /etc/sudoers.d/google_sudoers`). LGTM!

@jefferbrecht merged commit 0093562 into master Nov 17, 2022
@jefferbrecht deleted the jefferbrecht-sles12-sudo branch November 17, 2022 16:38
avilevy18 added a commit that referenced this pull request Nov 18, 2022
Attempting to force flush feature tracking metrics

prometheus: add receiver for ingesting prometheus metrics using the Ops Agent (#904)

* prometheus: add config generation for the prometheus receiver (#844)

* prometheus: add config generation for the prometheus receiver

This change does the following:
- [  ] Pulls in the googlemanagedprometheus exporter for prometheus
- [  ] Pulls in prometheus so we can use the exact same config structure
- [  ] Adds the config generation for prometheus receivers
- [  ] Refactor some of the pipeline logic so prometheus receivers have
  their own exporter
- [  ] Adds config validation for prometheus receivers
- [  ] Adds basic unit tests for the prometheus receiver

* prometheus: add more unit tests

Signed-off-by: Ridwan Sharif <ridwanmsharif@google.com>

* prometheus: register all service discovery implementations so yaml parsing doesn't fail

Signed-off-by: Ridwan Sharif <ridwanmsharif@google.com>

* prometheus: report error in platform-agnostic way

Signed-off-by: Ridwan Sharif <ridwanmsharif@google.com>

Signed-off-by: Ridwan Sharif <ridwanmsharif@google.com>

* prometheus: add metadata labels to every static config (#872)

* prometheus: add metadata labels to every static config

This change adds the following:
- [] Hooks up the receiver to the metadata detector
- [] Adds labels to every static config
- [] Adds unit tests
- [] Adds integration tests
- [] Disallows updating namespace, location and cluster labels

* integration_test: ignore instance_id label for prom metrics

* prometheus: add groupbyattrs processor so namespace, location and cluster fields can be used

Signed-off-by: Ridwan Sharif <ridwanmsharif@google.com>

Signed-off-by: Ridwan Sharif <ridwanmsharif@google.com>

* prometheus: use prometheus styled regex instead of OTel (#886)

* prometheus: use prometheus styled regex instead of OTel

This is mainly focussed on the `replacement` field and us not using
the otel styled `$` syntax for the user visible prom config.

* prometheus: deep copy using marshal and unmarshal before updating regex

Signed-off-by: Ridwan Sharif <ridwanmsharif@google.com>

* prometheus: simplify deepcopy and escaping of $ in replacement strings

Signed-off-by: Ridwan Sharif <ridwanmsharif@google.com>

Signed-off-by: Ridwan Sharif <ridwanmsharif@google.com>

* Prometheus receiver: add integration test with JSON exporter (#869)

* prometheus: disable receiver by default

* prometheus: presubmit update license and yamlfmt

* prometheus: skip integration test on centos

* prometheus: address PR comments

* prometheus: update golden files

Signed-off-by: Ridwan Sharif <ridwanmsharif@google.com>
Co-authored-by: Lujie Duan <lujieduan@google.com>
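
The `replacement` handling mentioned under #886 above can be illustrated with a small sketch. It assumes the generated OTel collector config treats a bare `$` as the start of a variable reference and `$$` as a literal dollar sign, so a Prometheus-style replacement such as `${1}:9090` has to be escaped before it is written out. The function name `escapeReplacement` is hypothetical, and this is not the code from that change.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeReplacement doubles every "$" so that a Prometheus-style capture
// group reference survives a config layer that expands "$" itself.
// (Hypothetical helper for illustration.)
func escapeReplacement(replacement string) string {
	return strings.ReplaceAll(replacement, "$", "$$")
}

func main() {
	// The user writes the usual Prometheus syntax in their config.
	userValue := "${1}:9090"
	fmt.Println(escapeReplacement(userValue))
}
```

Escaping at config-generation time lets the user-visible config keep plain Prometheus syntax.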

Add caching to Windows build (#939)

* testing windows caching

* comment

* new fancy run

* try moving submodule update into dockerfile

* install git in dockerfile

* productionize flow

* comments resolved

* removed blank lines

* extra space

* remove cache missing logic from dockerfile

Add workaround for Windows 2012 fluent-bit lockups (#952)

* Add workaround for Windows 2012 fluent-bit lockups

See b/240564518 for more background.

* Revert "Testing: Add workaround for windows-2012 flakes"

This reverts commit a0c21e3.

* Revert force-restart workaround for Windows 2012

Testing: Add retries on `sudo sed` command when setting up SUSE test VM. (#956)

* Add retries on `sudo sed` command when setting up SUSE test VM.

* Add SLES-specific constants for max attempts and backoff duration.

Get bison package from team vendor repo (#954)

integration_test: skip prometheus tests on rhel (#959)

Add compatible restart command for sles-15-sap (#960)

Testing: change restart command to work on SLES-15 (#961)

Let's try just removing the `.target` option.

Use docker-credential-gcr instead of gcloud to match kokoro's prefetching logic (#964)

Fix missing opensuse condition (#974)

Fixes `TestPrometheusMetricsWithJSONExporter/opensuse-leap*`.

Add startup delay on SUSE platforms (#976)

Add ZYPP_LOCK_TIMEOUT to reduce flakes (#975)

Vault install and user documentation update (#973)

* add metric policy to script and replace init references

* add cleaner enable script that will guide users if they do not follow the configuration options

* add configure_integration documentation

* update doc nit

Internal: tests install `go` from a GCS bucket (#977)

This is to prevent flakes due to `golang.org` throttling us. :)

Strip out mentions of winrm.par (#925)

Use absolute path for mkswap and swapon (#978)

We're still not sure why `mkswap` is randomly failing on sles-15-sap,
but providing the absolute path does seem to help...

Remove sudo from scripts along with updated docker install (#955)

* Try skipping "update docker" step

* also print out docker version

* remove more obsolete steps

* focal masquerades as hirsute

* try again with jammy

* test removing sudo

* focal

Co-authored-by: Martijn van Schaardenburg <martijnvs@google.com>

Fall back to unqualified mkswap (#979)

Some platforms, e.g. bionic, have mkswap under a different folder.
We haven't had any problems running mkswap unqualified on bionic, or
anywhere outside of SLES really, so add a fallback to the unqualified
version of the command if the absolute version fails.

Testing: Run Oracle DB test in a more normal way (#893)

Update VERSION (#987)

Update minimum_supported_agent_version in metadata.yaml. (#988)

Co-authored-by: Rafael Westphal <westphalrafael@google.com>

Attempting to force flush feature tracking metrics

Try getting sudo access before running tests (#986)

resourcedetector: Get default service account scopes. (#984)

* Add getDefaultScopes() to resourcedetector.

* Add `getSlice()` to testing FakeProvider.

* Verify DefaultScopes in TestGettingResourceWithoutError.
avilevy18 added a commit that referenced this pull request Dec 6, 2022
* Adding support for feature tracking

* Added feature tracking into `CollectOpsAgentSelfMetrics()`

* Added feature tracking metric in `expected_metric` metadata.yaml

* Added confgenerator import to `main_windows`

* Fix bug

* Refactoring tests to include internal metrics

Refactoring tests to include internal metrics

* Refactoring tests to include internal metrics

* Fixed dependencies

* Testing third party integrations - active_directory_ds

* Testing third party integrations - activemq

* Testing third party integrations - apache

* Added extra expected metric for active_directory_ds

* Fixed bug where feature extraction did not properly capture values of pointers

* Testing third party integrations - aerospike

* Merged, and fixed `go.mod`

* Fixed go.sum

* Addressed comments