Tags: MironAtHome/yugabyte-db
[PLAT-17905]: Prevent non-restart upgrades when universe nodes are in-transit.
Summary: Add a check to prevent non-restart upgrades when universe nodes are in-transit.
Test Plan: unit test
Reviewers: anijhawan
Reviewed By: anijhawan
Subscribers: yugaware
Differential Revision: https://phorge.dev.yugabyte.com/D45598
[BACKPORT 2.20][PLAT-17905]: Prevent non-restart upgrades when universe nodes are in-transit.
Summary: Add a check to prevent non-restart upgrades when universe nodes are in-transit.
Original diff/commit: 8c752d8 / D45598
Test Plan: unit test
Reviewers: anijhawan, dkumar
Reviewed By: dkumar
Subscribers: yugaware
Differential Revision: https://phorge.dev.yugabyte.com/D45628
[BACKPORT 2025.1.0][PLAT-18178]: Fix K8s software upgrade rollback after catalog upgrade failure
Summary:
**Problem**: During a Kubernetes software upgrade rollback involving PostgreSQL major version upgrades (e.g., 2024.2 → 2025.1), a critical issue occurs if the catalog upgrade phase fails during the software upgrade. In this scenario, only master nodes are upgraded to the target version (2025.1) while all tservers remain on the previous version (2024.2). When attempting to update PostgreSQL compatibility flags during the rollback process, we face a dilemma:
- We cannot set the compatibility flag to version 2024.2, because rolling back the catalog upgrade requires the masters to be on 2025.1
- We cannot set the compatibility flag to version 2025.1, because the catalog upgrade was never completed
This creates a deadlock situation where the gflags upgrade task cannot proceed safely.
**Root Cause**: The issue is specific to Kubernetes universes because both gflags upgrades and software upgrades are performed through Helm upgrades, which require specifying the software version during gflags operations. This differs from VM deployments, where controlled flag upgrades can be performed in mixed software mode.
**Solution**: Implemented a tracking mechanism to record whether all tservers were successfully upgraded during the software upgrade. This information is captured in prevYBSoftwareConfig and used during rollback to decide whether the PostgreSQL compatibility flags should be updated:
- If all tservers were upgraded, we can safely perform the gflags upgrade during rollback
- If not all tservers were upgraded, we skip the gflags update, since the flags are already set from the failed software upgrade task
Original diff/commit: f5fdf87 / D45616
Test Plan: Tested manually by making a K8s software upgrade fail during catalog upgrade and rolling back the software version successfully. Verified that the rollback also works when the software upgrade has completed.
Reviewers: anijhawan, hsunder, anabaria
Reviewed By: anabaria
Subscribers: yash.priyam, yugaware
Differential Revision: https://phorge.dev.yugabyte.com/D45622
[yugabyte#28084] YSQL: [pg15 upgrade] Use RPC bind address on master
Summary: Use the RPC bind address on the master just as we do on the tserver. This is needed for Kubernetes-like deployments where the node_name and bind addresses differ and the cert name is not the hostname.
Jira: DB-17714
Test Plan: jenkins: urgent
Reviewers: fizaa
Reviewed By: fizaa
Subscribers: ybase
Differential Revision: https://phorge.dev.yugabyte.com/D45608
[BACKPORT 2025.1][yugabyte#26912] YSQL: Fix flaky test PgDDLConcurrencyTest.IndexCreation
Summary: The test PgDDLConcurrencyTest.IndexCreation is flaky. It runs concurrent CREATE INDEX statements to trigger race conditions that can cause some of the statements to fail. The test verifies that when CREATE INDEX aborts, the PG backend's DDL state is properly reset. The test has a set of expected errors that are suppressed; it fails if an unexpected error is encountered, or if the test itself times out after 10 minutes. When the test fails, the following error is found:
```
Bad status: Network error (yb/yql/pgwrapper/libpq_utils.cc:457): Execute of 'CREATE INDEX IF NOT EXISTS t0_v ON t0(v)' failed: 7, message: ERROR: timed out waiting for postgres backends to catch up DETAIL: 2 backends on database 13515 are still behind catalog version 15. HINT: Run the following query on all tservers to find the lagging backends: SELECT * FROM pg_stat_activity WHERE backend_type != 'walsender' AND backend_type != 'yb-conn-mgr walsender' AND catalog_version < 15 AND datid = 13515; (pgsql error XX000) (aux msg ERROR: timed out waiting for postgres backends to catch up
```
I found two reasons for the flakiness:
(1) The test can fail because `WaitForYsqlBackendsCatalogVersion` times out. Because we do not officially support concurrent DDLs, it can happen that a PG backend cannot update its local catalog version because it is calling `WaitForYsqlBackendsCatalogVersion`. If another PG backend gets stuck for the same reason, we can have a deadlock-like situation until `WaitForYsqlBackendsCatalogVersion` times out. By default --ysql_yb_wait_for_backends_catalog_version_timeout=300000ms (5 min). It only takes two `WaitForYsqlBackendsCatalogVersion` timeouts before the test itself times out.
(2) Even when `WaitForYsqlBackendsCatalogVersion` times out, the test checks the returned error to see if it should be suppressed, and PG will usually append the following message to the error message:
```
[ts-1] 2025-07-24 11:58:42.372 GMT [56059] CONTEXT: Catalog Version Mismatch: A DDL occurred while processing this query. Try again.
```
The test has
```
Status SuppressAllowedErrors(const Status& s) {
  if (HasTransactionError(s) || IsRetryable(s)) {
    return Status::OK();
  }
  return s;
}

bool IsRetryable(const Status& status) {
  static const auto kExpectedErrors = {
      "Try again",
      "Catalog Version Mismatch",
      "Restart read required",
      "schema version mismatch for table"
  };
  return HasSubstring(status.message(), kExpectedErrors);
}
```
which means that on seeing "Try again" in the error message, the test continues its execution and does not fail, except by timing out as described in (1). However, PG only appends the "Try again" message in the common case; in other, uncommon situations (e.g., when `need_global_cache_refresh` is false), PG does not append it. When that happens, the error is not suppressed and the test fails with just the error shown above. To fix the test failure, I made two changes:
(1) changed two gflags to smaller values:
--wait_for_ysql_backends_catalog_version_client_master_rpc_timeout_ms from 20s to 2s
--ysql_yb_wait_for_backends_catalog_version_timeout from 300s to 30s
(2) if the error contains "waiting for postgres backends to catch up", suppress it and let the test continue to execute.
Jira: DB-16337
Original commit: 88880a4 / D45593
Test Plan: ./yb_build.sh release --cxx-test pgwrapper_pg_ddl_concurrency-test --gtest_filter PgDDLConcurrencyTest.IndexCreation -n 200
Backport-through: 2025.1
The test seems stable in 2024.2, 2024.1 and 2.20. Probably some code changes have happened that caused the flakiness.
For example, some PG error handling code may have changed so that earlier we always had "Try again" in the error text and the error was suppressed.
Reviewers: jason, sanketh
Reviewed By: jason
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D45604
[BACKPORT 2025.1][yugabyte#27267] Backup, Tests: Test dump_role_checks flag against YBC
Summary: Original commit: 083be7c / D44990
The tests were implemented for yb_backup.py in the commit: 06bb2e5 / D41975
This diff enables the tests against the YBC backup process. All 5 tests are implemented via `TestYbBackup::doTestBackupRestoreRoles`.
Test Plan:
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRolesWithDumpRoleChecks
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRolesWithoutDumpRoleChecks
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRestoreRoles
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRolesWithoutUseRoles
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRolesWithoutRestoreRoles
NOTE: YB_TEST_YB_CONTROLLER=1 by default
Reviewers: mihnea, sanketh
Reviewed By: sanketh
Subscribers: yql, dshubin, vkumar
Differential Revision: https://phorge.dev.yugabyte.com/D45550
[BACKPORT 2025.1][yugabyte#27705, yugabyte#26393, yugabyte#28039] Docdb: Handle out-of-order messages for table locks.
Summary: Original commit: 7912ca0 / D44898
Also included is a follow-up fix: cb345c2 / D45508
If there are out-of-order messages, caused by duplicate/retried acquire-lock messages, that end up being processed after the corresponding release has been processed, we need a mechanism to clean up the locks that may be taken, to prevent lock leakage. Each Acquire request comes in with an `ignore_after_hybrid_time` that specifies the time after which it may be ignored. The main change here is to track the Acquire operations' `ignore_after_hybrid_time`s and ensure that the locks get cleaned up twice:
- once at the end of the transaction; and
- again after we are past the corresponding max of the `ignore_after_hybrid_time`s.
This information is tracked in the ts_local_lock_manager and is used to schedule the duplicate release.
Jira: DB-17306, DB-15754, DB-17662
Test Plan: Added a test to cause out-of-order/duplicated acquire messages between a) LocalTServer -> Master; and b) Master -> Dest TServer.
yb_build.sh fastdebug --cxx-test pg_object_locks-test --test_args --vmodule=tablet_service=0,libpq_utils=2,object_lock_manager=3,object_lock_info_manager=3,ts_local*=3,pg_client_session=2,pg_txn_manager=2,ysql_ddl_*=3,transaction_participant=0,async_rpc_task*=3 --gtest_filter *TestOutOfOrderMessageHandling*/*
Note that the PgPerform call from Pg-backend -> Local TServer does not contribute to out-of-order lock acquires, as it does not retry upon failure(s).
Reviewers: bkolagani, rthallam, zdrudi
Reviewed By: bkolagani
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D45594
[BACKPORT 2024.2][yugabyte#27267] Backup, Tests: Test dump_role_checks flag against YBC
Summary: Original commit: 083be7c / D44990
The tests were implemented for yb_backup.py in the commit: 06bb2e5 / D41975
This diff enables the tests against the YBC backup process. All 5 tests are implemented via `TestYbBackup::doTestBackupRestoreRoles`.
Jira: DB-16748
Test Plan:
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRolesWithDumpRoleChecks
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRolesWithoutDumpRoleChecks
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRestoreRoles
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRolesWithoutUseRoles
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRolesWithoutRestoreRoles
NOTE: YB_TEST_YB_CONTROLLER=1 by default
Reviewers: mihnea, sanketh
Reviewed By: sanketh
Subscribers: yql, dshubin, vkumar
Differential Revision: https://phorge.dev.yugabyte.com/D45551
[BACKPORT 2024.2][yugabyte#27996, yugabyte#28032] Docdb: Fix log spew in transaction.cc
Summary: Original commit: 3aa0773 / D45447
transaction.cc is meant to log the transaction's trace whenever the time taken by the transaction is `> FLAGS_txn_slow_op_threshold_ms`. However, after the refactor in a0e6d55, the log line does not special-case the fact that a flag value of 0 should disable printing the log line. This causes log spew, as `FLAGS_txn_slow_op_threshold_ms` defaults to 0. We should only consider printing the trace if this flag is non-zero.
Also includes a4dc4d3 / D45498
Jira: DB-17617, DB-17653
Test Plan: yb_build.sh fastdebug --cxx-test pg_mini-test --gtest_filter PgMiniTestTracing/PgMiniTestTracing.Tracing/*
Reviewers: rthallam, bkolagani
Reviewed By: bkolagani
Subscribers: yql, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D45571