@bashtanov
Contributor

The test is broken: it swallows errors, timeout ones in particular.
Make it propagate errors. To avoid timeouts:

  • make sure verifier offline mode only waits for consumption, not for anything in the querying thread
  • reduce production rate
  • allow more time to complete
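One way to stop a background thread from swallowing errors is to capture the exception in the worker and re-raise it when the caller joins. The sketch below is illustrative only: `PropagatingThread` is a hypothetical helper, not the verifier's actual code, and it assumes the work runs in a plain `threading.Thread`.

```python
import threading


class PropagatingThread(threading.Thread):
    """Hypothetical helper: remembers the exception raised by its target
    and re-raises it in the caller on join()."""

    def __init__(self, target, name=None):
        super().__init__(name=name, daemon=True)
        self._work = target
        self.error = None

    def run(self):
        try:
            self._work()
        except BaseException as e:  # captured here, re-raised in join()
            self.error = e

    def join(self, timeout=None):
        super().join(timeout)
        if self.is_alive():
            # A timeout becomes a visible failure instead of a silent return.
            raise TimeoutError(f"{self.name} did not finish in {timeout}s")
        if self.error is not None:
            raise self.error
```

With this shape, a test that calls `join(timeout)` fails on both a worker exception and a timeout, rather than passing silently.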

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

@vbotbuildovich
Collaborator

vbotbuildovich commented Feb 10, 2025

CI test results

test results on build#61767
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed78-5b46-4f61-a4d1-c0b9f80c8fe8 FLAKY 1/3
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed78-5b46-4a7a-b743-4765ba61ffe9 FLAKY 1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_HADOOP ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-07a0-4a9d-b85f-052e159a33fa FLAKY 1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_JDBC ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079e-47f9-8066-e78270290712 FLAKY 1/2
rptest.tests.datalake.mount_unmount_test.MountUnmountIcebergTest.test_simple_unmount.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079f-4549-a7c5-cf74a8afb220 FLAKY 1/2
rptest.tests.e2e_shadow_indexing_test.ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079f-43ff-85ad-4a407350be3a FLAKY 1/2
rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.delete.key_set_cardinality=1000.storage_compaction_key_map_memory_kb=10 ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079f-43ff-85ad-4a407350be3a FLAKY 1/2
rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.delete.key_set_cardinality=1000.storage_compaction_key_map_memory_kb=3 ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079e-47f9-8066-e78270290712 FLAKY 1/2
rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.key_set_cardinality=1000.storage_compaction_key_map_memory_kb=3 ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079f-43ff-85ad-4a407350be3a FLAKY 1/2
rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic ducktape https://buildkite.com/redpanda/redpanda/builds/61767#0194ed65-079f-43ff-85ad-4a407350be3a FLAKY 1/3
test results on build#61891
test_id test_kind job_url test_status passed
rptest.tests.availability_test.AvailabilityTests.test_recovery_after_catastrophic_failure ducktape https://buildkite.com/redpanda/redpanda/builds/61891#01950c9d-c81b-4cd0-b9c8-8b55cd0d47f5 FLAKY 1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61891#01950c9d-c81d-4026-9b9d-23174d15a298 FLAKY 1/2
rptest.tests.delete_records_test.DeleteRecordsTest.test_delete_records_concurrent_truncations.cloud_storage_enabled=True.truncate_point=start_offset ducktape https://buildkite.com/redpanda/redpanda/builds/61891#01950c9d-c81c-4acb-a996-8092df40b022 FLAKY 1/2
rptest.tests.partition_move_interruption_test.PartitionMoveInterruption.test_cancelling_partition_move.replication_factor=3.unclean_abort=True.recovery=no_recovery.compacted=False ducktape https://buildkite.com/redpanda/redpanda/builds/61891#01950c9d-c81b-4cd0-b9c8-8b55cd0d47f5 FLAKY 1/2
storage_single_thread_rpunit.storage_single_thread_rpunit unit https://buildkite.com/redpanda/redpanda/builds/61891#01950c5a-10e6-4d7d-8288-a8b7815d2517 FLAKY 1/2
test results on build#61893
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61893#01950e4e-95ab-4976-b770-c4685c51f5e9 FLAKY 1/2
rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.delete.key_set_cardinality=1000.storage_compaction_key_map_memory_kb=3 ducktape https://buildkite.com/redpanda/redpanda/builds/61893#01950e68-4471-4097-9721-039f49ff8226 FLAKY 1/5
storage_single_thread_rpunit.storage_single_thread_rpunit unit https://buildkite.com/redpanda/redpanda/builds/61893#01950e09-c5a2-43c5-9445-1a61bce8346c FLAKY 1/2
test results on build#62179
test_id test_kind job_url test_status passed
rptest.tests.audit_log_test.AuditLogTestsAppLifecycle.test_app_lifecycle ducktape https://buildkite.com/redpanda/redpanda/builds/62179#019537ec-114d-456f-a43f-7b040a640bca FLAKY 1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/62179#019537ec-114d-456f-a43f-7b040a640bca FLAKY 1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/62179#01953806-f982-43a8-a429-b1e51cb1a5cc FLAKY 1/2
rptest.tests.e2e_shadow_indexing_test.ShadowIndexingWhileBusyTest.test_create_or_delete_topics_while_busy.short_retention=True.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/62179#01953806-f981-4d64-9151-4adfe2927e56 FLAKY 1/2
rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic ducktape https://buildkite.com/redpanda/redpanda/builds/62179#01953806-f983-4b67-acd1-47fe356b98cd FLAKY 1/2
rptest.tests.write_caching_fi_e2e_test.WriteCachingFailureInjectionE2ETest.test_crash_all_with_consumer_group ducktape https://buildkite.com/redpanda/redpanda/builds/62179#019537ec-114d-456f-a43f-7b040a640bca FLAKY 1/2
test results on build#62203
test_id test_kind job_url test_status passed
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/62203#01953969-87ba-4b89-bd7b-92c922ea417e FLAKY 1/2

@bashtanov bashtanov marked this pull request as draft February 10, 2025 08:52
@bashtanov
Contributor Author

Meh. It worked locally. Will debug.

- wait for consuming all messages regardless of translation state
- avoid race conditions when stopping consumer
sometimes maximum throughput is not desired
@bashtanov bashtanov force-pushed the fix-iceberg-data-migration-test-timeout branch from 162ed94 to 3b4aa86 Compare February 16, 2025 01:20
@bashtanov
Contributor Author

/dt

@bashtanov bashtanov marked this pull request as ready for review February 16, 2025 09:11
  compacted: bool = False,
- table_override: Optional[str] = None):
+ table_override: Optional[str] = None,
+ buffer=5000):
Contributor

nit: maybe call this max_buffered_msgs or somesuch?

  config=self.avro_stream_config(
- self.TOPIC_NAME, "verifier_schema", 1000000))
+ self.TOPIC_NAME, "verifier_schema", 1000000,
+ 1))
Contributor

nit: maybe pull 1 out into some low_interval_ms and explain in a comment why it's necessary?

verifier = DatalakeVerifier(self.redpanda,
self.TOPIC_NAME,
self.dl.spark(),
buffer=50000)
Contributor

nit: maybe pull 50000 out into some high_buffered_msgs and explain in a comment why it's necessary?

  connect.stop_stream("ducky_stream", should_finish=False)
- time.sleep(1) # just it case: let verifier consume remaining messages
- verifier.go_offline()
+ verifier.go_offline(600)
Contributor

nit: comment explaining why this is necessary?

include_query_engines=[QueryEngineType.SPARK])

- def avro_stream_config(self, topic, subject, cnt=3000):
+ def avro_stream_config(self, topic, subject, cnt=3000, interval_ms=None):
Contributor

Production rate reduced as otherwise RPCN produces too much data while unmounting

Hmm I thought we blocked writes while we unmounted. Are we sure we eventually actually quiesce? The tests I saw were hanging around for 12 minutes without completing -- I don't imagine unmount would take that long, would it?

Contributor Author

Yes we do, but it takes some time since we block it.
The 12 minutes you saw was a combination of many problems, swallowed errors in particular -- see 1st commit.

1000 messages per second is not a crazy production rate, but a buffer
of 5000 messages will only keep 5 seconds worth, which is less than
unmount or translation delays
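
The arithmetic behind the buffer size can be checked with a short calculation. `buffered_seconds` is a hypothetical helper name; the figures are the 1000 messages per second and 5000-message buffer quoted above, plus the `buffer=50000` value from the diff.

```python
def buffered_seconds(buffer_msgs: int, rate_msgs_per_s: float) -> float:
    """How many seconds of production a buffer of the given size can hold."""
    return buffer_msgs / rate_msgs_per_s


print(buffered_seconds(5000, 1000))   # 5.0: the old default buffer
print(buffered_seconds(50000, 1000))  # 50.0: the buffer passed in the test
```

At the same rate, the larger buffer holds ten times as many seconds of messages, which leaves more slack for unmount or translation delays.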
Production rate reduced as otherwise RPCN produces too much data while
unmounting, so it takes unreasonable time to complete.

Sleep removed as verifier is more robust.

Time limit increased as sometimes unmount takes more time and it takes
longer to get offline:
- offline mode waits for consumption up to the migration's blocking offset
- consume thread waits for query thread (limited comparison buffer)
- query thread may lag because translation lags
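
The chain of waits listed above can be illustrated with a toy model. This is a sketch under stated assumptions, not the real verifier: `run_pipeline` and its parameters are hypothetical, a `queue.Queue` stands in for the comparison buffer, and `time.sleep` stands in for translation lag in the query thread.

```python
import queue
import threading
import time


def run_pipeline(n_msgs, buffer, query_delay_s):
    """Toy model: a consume thread feeds a bounded buffer that a slower
    query thread drains. Returns (seconds until consumption finished,
    list of verified messages)."""
    buf = queue.Queue(maxsize=buffer)
    consumed = threading.Event()
    verified = []

    def consume():
        for i in range(n_msgs):
            buf.put(i)  # blocks while the buffer is full
        consumed.set()

    def query():
        for _ in range(n_msgs):
            time.sleep(query_delay_s)  # stand-in for translation lag
            verified.append(buf.get())

    threads = [threading.Thread(target=consume),
               threading.Thread(target=query)]
    start = time.monotonic()
    for t in threads:
        t.start()
    consumed.wait()
    consume_done = time.monotonic() - start
    for t in threads:
        t.join()
    return consume_done, verified
```

Calling `run_pipeline(20, 2, 0.01)` and `run_pipeline(20, 20, 0.01)` shows the effect: with the small buffer the consumer finishes only after the query thread has drained most of the messages, while with the large buffer it finishes almost immediately. Anything waiting for consumption to complete therefore inherits the query thread's lag when the buffer is small.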
@bashtanov bashtanov force-pushed the fix-iceberg-data-migration-test-timeout branch from 3b4aa86 to 323b0b9 Compare February 24, 2025 11:09
@bashtanov bashtanov requested a review from andrwng February 24, 2025 15:24
@bashtanov bashtanov force-pushed the fix-iceberg-data-migration-test-timeout branch from 323b0b9 to 87ce98c Compare February 24, 2025 17:36
@bashtanov bashtanov merged commit a42df2c into redpanda-data:dev Feb 24, 2025
16 checks passed
3 participants