Tags: MironAtHome/yugabyte-db

2.27.0.0-b376

[PLAT-17905]: Prevent non-restart upgrades when universe nodes are in-transit.

Summary: Add a check to prevent non-restart upgrades when universe nodes are in-transit

Test Plan: unit test

Reviewers: anijhawan

Reviewed By: anijhawan

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D45598

2.20.12.0-b23

[BACKPORT 2.20][PLAT-17905]: Prevent non-restart upgrades when universe nodes are in-transit.

Summary:
Add a check to prevent non-restart upgrades when universe nodes are in-transit
Original diff/commit: 8c752d8 / D45598

Test Plan: unit test

Reviewers: anijhawan, dkumar

Reviewed By: dkumar

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D45628

2025.1.0.1-b1

[BACKPORT 2025.1.0][PLAT-18178]: Fix K8s software upgrade rollback after catalog upgrade failure

Summary:
**Problem**:
During a Kubernetes software upgrade rollback involving a PostgreSQL major version upgrade (e.g., 2024.2 → 2025.1), a critical issue occurs when the catalog upgrade phase fails during the software upgrade. In this scenario, only master nodes are upgraded to the target version (2025.1) while all tservers remain on the previous version (2024.2). When attempting to update PostgreSQL compatibility flags during the rollback process, we face a dilemma:
 - We cannot set the compatibility flag to version 2024.2 because the catalog upgrade needs to be rolled back, for which we need master on 2025.1
 - We cannot set the compatibility flag to version 2025.1 because the catalog upgrade was never completed
This creates a deadlock situation where the gflags upgrade task cannot proceed safely.

**Root Cause:**
The issue is specific to Kubernetes universes because both gflags upgrades and software upgrades are performed through Helm upgrades, which require specifying the software version during gflags operations. This differs from VM deployments, where controlled flag upgrades can be performed in mixed software mode.

**Solution:**
Implemented a tracking mechanism to store whether all tservers were successfully upgraded during the software upgrade process. This information is captured in the prevYBSoftwareConfig and used during rollback to determine if the PostgreSQL compatibility flags should be updated:
- If all tservers were upgraded, we can safely perform the gflags upgrade during rollback
- If not all tservers were upgraded, we skip the gflags update since the flags are already set from the failed software upgrade task
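
The rollback decision described above can be sketched as follows. This is only an illustration of the logic, not the actual YBA code: the struct, field, and function names are hypothetical stand-ins for the state kept in prevYBSoftwareConfig.

```cpp
#include <cassert>

// Hypothetical stand-in for the state recorded in prevYBSoftwareConfig.
struct PrevSoftwareConfig {
  // Assumed to be set only after every tserver reached the target version
  // during the software upgrade.
  bool all_tservers_upgraded = false;
};

// Returns true when the rollback should run the gflags upgrade that updates
// the PostgreSQL compatibility flags. When the upgrade failed before all
// tservers were upgraded, the flags are already set by the failed task, so
// the gflags step is skipped.
bool ShouldUpdatePgCompatFlagsOnRollback(const PrevSoftwareConfig& prev) {
  return prev.all_tservers_upgraded;
}
```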

Original diff/commit: f5fdf87/D45616

Test Plan:
Tested manually by making a K8s software upgrade fail during catalog upgrade and rolling back the software version successfully.
Verified that the rollback upgrade works when the software upgrade is completed.

Reviewers: anijhawan, hsunder, anabaria

Reviewed By: anabaria

Subscribers: yash.priyam, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D45622

2.27.0.0-b375

[yugabyte#28084] YSQL: [pg15 upgrade] Use RPC bind address on master

Summary:
Use RPC bind address on master just like we do on tserver.
This is needed for kubernetes like deployments where the node_name and bind addresses are different, and the cert name is not the hostname.
Jira: DB-17714

Test Plan: jenkins: urgent

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D45608

2025.1.1.0-b88

[BACKPORT 2025.1][yugabyte#26912] YSQL: Fix flaky test PgDDLConcurrencyTest.IndexCreation

Summary:
The test PgDDLConcurrencyTest.IndexCreation is flaky. It runs concurrent
create index statements to trigger race conditions that can cause some of the create
index statements to fail. The test verifies that when create index aborts, the
PG backend's DDL state is properly reset. The test has a set of expected errors that
are suppressed. The test fails if an unexpected error is encountered, or the test
itself times out after 10 minutes.

When the test fails the following error is found:

```
Bad status: Network error (yb/yql/pgwrapper/libpq_utils.cc:457): Execute of 'CREATE INDEX IF NOT EXISTS t0_v ON t0(v)' failed: 7, message: ERROR:  timed out waiting for postgres backends to catch up
DETAIL:  2 backends on database 13515 are still behind catalog version 15.
HINT:  Run the following query on all tservers to find the lagging backends: SELECT * FROM pg_stat_activity WHERE backend_type != 'walsender' AND backend_type != 'yb-conn-mgr walsender' AND catalog_version < 15 AND datid = 13515; (pgsql error XX000) (aux msg ERROR:  timed out waiting for postgres backends to catch up

```

I found two reasons for the test flakiness:

(1) The test can fail because `WaitForYsqlBackendsCatalogVersion` times out.
Because we do not officially support concurrent DDLs, it can happen that a PG
backend cannot update its local catalog version because it is calling
`WaitForYsqlBackendsCatalogVersion`. If another PG backend also gets stuck for the
same reason, we can have a deadlock-like situation until
`WaitForYsqlBackendsCatalogVersion` times out.
By default --ysql_yb_wait_for_backends_catalog_version_timeout=300000ms (5 min).
It only takes two `WaitForYsqlBackendsCatalogVersion` timeouts before the test itself
times out.

(2) Even when `WaitForYsqlBackendsCatalogVersion` times out, the test will check
for the returned error to see if it should be suppressed, and PG will usually append
the following message to the error message:

```
[ts-1] 2025-07-24 11:58:42.372 GMT [56059] CONTEXT:  Catalog Version Mismatch: A DDL occurred while processing this query. Try again.
```

The test has
```
Status SuppressAllowedErrors(const Status& s) {
  if (HasTransactionError(s) || IsRetryable(s)) {
    return Status::OK();
  }
  return s;
}

bool IsRetryable(const Status& status) {
  static const auto kExpectedErrors = {
      "Try again",
      "Catalog Version Mismatch",
      "Restart read required",
      "schema version mismatch for table"
  };
  return HasSubstring(status.message(), kExpectedErrors);
}

```
which means that on seeing "Try again" in the error message, the test continues
its execution and fails only on a timeout as described in (1).

However, PG only appends the "Try again" message in the common case. In other,
uncommon situations (e.g., when `need_global_cache_refresh` is false), PG does
not append the "Try again" message. When that happens, the error is not
suppressed and the test fails with just the error shown above.

To fix the test failure, I made two changes:

(1) changed two gflags to have smaller values:
--wait_for_ysql_backends_catalog_version_client_master_rpc_timeout_ms
from 20s to 2s
--ysql_yb_wait_for_backends_catalog_version_timeout
from 300s to 30s

(2) if the error contains "waiting for postgres backends to catch up",
suppress the error and let the test continue to execute.
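
A minimal sketch of change (2), assuming the new substring is simply added to the list of expected errors. The actual diff may implement the check differently. The function here takes a plain `std::string` instead of a `Status` so that the sketch is self-contained.

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for the test helper: works on the message text only.
bool IsRetryable(const std::string& message) {
  static const char* const kExpectedErrors[] = {
      "Try again",
      "Catalog Version Mismatch",
      "Restart read required",
      "schema version mismatch for table",
      // Assumed addition: suppress the timeout error even when PG does not
      // append the "Try again" context line.
      "waiting for postgres backends to catch up"
  };
  for (const char* expected : kExpectedErrors) {
    if (message.find(expected) != std::string::npos) {
      return true;
    }
  }
  return false;
}
```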

Jira: DB-16337

Original commit: 88880a4 / D45593

Test Plan:
./yb_build.sh release --cxx-test pgwrapper_pg_ddl_concurrency-test --gtest_filter PgDDLConcurrencyTest.IndexCreation -n 200

Backport-through: 2025.1

The test seems stable in 2024.2, 2024.1 and 2.20. Probably some code changes have happened
that caused the flakiness. For example, some PG error handling code may have changed so that
earlier we always had "Try again" in the error text and the error was suppressed.

Reviewers: jason, sanketh

Reviewed By: jason

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D45604

2025.1.1.0-b87

[BACKPORT 2025.1][yugabyte#27267] Backup, Tests: Test dump_role_checks flag against YBC

Summary:
Original commit: 083be7c / D44990

The tests were implemented for yb_backup.py in the commit: 06bb2e5 / D41975

This diff enables the tests against YBC backup process.
All 5 tests are implemented via `TestYbBackup::doTestBackupRestoreRoles`.

Test Plan:
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRolesWithDumpRoleChecks
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRolesWithoutDumpRoleChecks
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRestoreRoles
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRolesWithoutUseRoles
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRolesWithoutRestoreRoles

NOTE: YB_TEST_YB_CONTROLLER=1 by default

Reviewers: mihnea, sanketh

Reviewed By: sanketh

Subscribers: yql, dshubin, vkumar

Differential Revision: https://phorge.dev.yugabyte.com/D45550

2025.1.1.0-b86

[BACKPORT 2025.1][yugabyte#27705, yugabyte#26393, yugabyte#28039] Docdb: Handle out of order messages for table locks.

Summary:
Original commit: 7912ca0 / D44898
Also included is a follow-up fix: cb345c2 / D45508
If there are out-of-order messages, caused by duplicate or retried acquire lock messages
that end up being processed after the corresponding release has been processed, we need
a mechanism to clean up the locks that may be taken, to prevent lock leakage.

Each Acquire request comes in with an `ignore_after_hybrid_time` that specifies the time
after which it may be ignored. The main change here is to track the `ignore_after_hybrid_time`s
of Acquire operations and ensure that their locks get cleaned up twice:
 - Once at the end of the transaction; and
 - again once after we are past the corresponding max of `ignore_after_hybrid_time`s.

This information is tracked in the ts_local_lock_manager, and will be used to schedule
the duplicate release.
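
The bookkeeping described above could look roughly like the sketch below. It is not the ts_local_lock_manager code: the class, its methods, and the use of a plain integer for hybrid time are all hypothetical simplifications.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Simplified stand-in for a hybrid time value (assumption for this sketch).
using HybridTimeValue = uint64_t;

// Hypothetical tracker: remembers, per transaction, the largest
// ignore_after_hybrid_time seen on any Acquire request.
class AcquireDeadlineTracker {
 public:
  void OnAcquire(const std::string& txn_id, HybridTimeValue ignore_after) {
    HybridTimeValue& max_deadline = max_ignore_after_[txn_id];
    max_deadline = std::max(max_deadline, ignore_after);
  }

  // First release, at the end of the transaction. Returns the time after
  // which the duplicate release should run (0 if the txn is unknown).
  HybridTimeValue OnTransactionEnd(const std::string& txn_id) {
    auto it = max_ignore_after_.find(txn_id);
    if (it == max_ignore_after_.end()) {
      return 0;
    }
    const HybridTimeValue deadline = it->second;
    max_ignore_after_.erase(it);
    pending_second_release_[txn_id] = deadline;
    return deadline;
  }

  // True once `now` is past the max ignore_after_hybrid_time, i.e. no late
  // Acquire can be applied anymore and the second release is safe to run.
  bool SecondReleaseDue(const std::string& txn_id, HybridTimeValue now) const {
    auto it = pending_second_release_.find(txn_id);
    return it != pending_second_release_.end() && now > it->second;
  }

 private:
  std::unordered_map<std::string, HybridTimeValue> max_ignore_after_;
  std::unordered_map<std::string, HybridTimeValue> pending_second_release_;
};
```

The second release is keyed on the maximum deadline because any duplicate Acquire that arrives later than that time is ignored, so nothing can be re-acquired after it.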

Jira: DB-17306, DB-15754, DB-17662

Test Plan:
Added test to cause out-of-order/duplication of acquire message between
a) LocalTServer -> Master; and
b) Master -> Dest TServer.

yb_build.sh fastdebug --cxx-test pg_object_locks-test --test_args --vmodule=tablet_service=0,libpq_utils=2,object_lock_manager=3,object_lock_info_manager=3,ts_local*=3,pg_client_session=2,pg_txn_manager=2,ysql_ddl_*=3,transaction_participant=0,async_rpc_task*=3 --gtest_filter *TestOutOfOrderMessageHandling*/*

Note that the PgPerform call from Pg-backend -> Local TServer does not contribute to out-of-order lock acquires, as it does not retry upon failure(s).

Reviewers: bkolagani, rthallam, zdrudi

Reviewed By: bkolagani

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D45594

2024.2.5.0-b31

[BACKPORT 2024.2][yugabyte#27267] Backup, Tests: Test dump_role_checks flag against YBC

Summary:
Original commit: 083be7c / D44990

The tests were implemented for yb_backup.py in the commit: 06bb2e5 / D41975

This diff enables the tests against YBC backup process.
All 5 tests are implemented via `TestYbBackup::doTestBackupRestoreRoles`.
Jira: DB-16748

Test Plan:
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRolesWithDumpRoleChecks
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRolesWithoutDumpRoleChecks
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRestoreRoles
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRolesWithoutUseRoles
YB_TEST_YB_CONTROLLER=1 ./yb_build.sh --java-test org.yb.pgsql.TestYbBackup#testBackupRolesWithoutRestoreRoles

NOTE: YB_TEST_YB_CONTROLLER=1 by default

Reviewers: mihnea, sanketh

Reviewed By: sanketh

Subscribers: yql, dshubin, vkumar

Differential Revision: https://phorge.dev.yugabyte.com/D45551

2024.2.5.0-b30

[BACKPORT 2024.2][yugabyte#27996, yugabyte#28032] Docdb: Fix log spew in transaction.cc

Summary:
Original commit: 3aa0773 / D45447
transaction.cc is meant to log the transaction's trace whenever the time taken by the transaction exceeds `FLAGS_txn_slow_op_threshold_ms`.

However, after the refactor in a0e6d55, the log line does not special-case
a flag value of 0, which should disable printing the log line.

This causes log spew as `FLAGS_txn_slow_op_threshold_ms` defaults to 0.

We should only consider printing the trace if this flag is non-zero.
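
The intended condition can be sketched as a small predicate. The function name is hypothetical and the real check lives inline in transaction.cc. Treating negative values as disabled is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>

// threshold_ms mirrors FLAGS_txn_slow_op_threshold_ms. A value of 0 (the
// default) disables slow-transaction trace logging. Negative values are
// also treated as disabled here, which is an assumption of this sketch.
bool ShouldLogSlowTransaction(int64_t threshold_ms, int64_t elapsed_ms) {
  return threshold_ms > 0 && elapsed_ms > threshold_ms;
}
```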

Also includes a4dc4d3 / D45498
Jira: DB-17617, DB-17653

Test Plan: yb_build.sh fastdebug --cxx-test pg_mini-test --gtest_filter PgMiniTestTracing/PgMiniTestTracing.Tracing/*

Reviewers: rthallam, bkolagani

Reviewed By: bkolagani

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D45571

2.27.0.2337-b1

Bumping version to 2.27.0.2337 on branch 2.27.0.2337