Performance issue: Generic.reseed() is slow due to per-provider random generators #1662

@conformist-mw

While working on a data obfuscation tool, I needed to generate deterministic fake data (e.g., based on a real email address) and reseeded the generator before each record to ensure consistent results.
In general, mimesis is noticeably faster than faker when generating random data. However, once I reseeded before each record to get deterministic output, mimesis became much slower than faker because of how reseeding is implemented.

After reviewing the code, I realized that Generic.reseed() iterates over all internal providers and calls reseed() on each of them, since each provider maintains its own random generator. This causes a major slowdown when reseeding frequently (such as once per obfuscated record).
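
To make the cost model concrete, here is a minimal sketch of the pattern described above. It is not the mimesis source: `ToyProvider` and `ToyGeneric` are hypothetical stand-ins, and the provider count of 20 is an arbitrary assumption. It only illustrates that one `reseed()` call on the aggregate fans out into one generator reseed per provider.

```python
import random


class ToyProvider:
    """Hypothetical stand-in for a provider that owns its own generator."""

    def __init__(self, seed=None):
        self.random = random.Random(seed)
        self.reseed_calls = 0  # counts how often this provider was reseeded

    def reseed(self, seed):
        self.reseed_calls += 1
        self.random.seed(seed)


class ToyGeneric:
    """Hypothetical aggregate that reseeds every provider it holds."""

    def __init__(self, provider_count):
        self.providers = [ToyProvider() for _ in range(provider_count)]

    def reseed(self, seed):
        # One reseed on the aggregate becomes one reseed per provider.
        for provider in self.providers:
            provider.reseed(seed)


toy = ToyGeneric(provider_count=20)  # 20 is an arbitrary, assumed count
toy.reseed(42)
total_calls = sum(p.reseed_calls for p in toy.providers)
print(total_calls)  # 20
```

In this model the work per `reseed()` grows linearly with the number of providers, even when only one provider is used afterwards, which matches the slowdown when reseeding once per record.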

The following benchmark reseeds before every call and compares the two libraries:

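import timeit

from faker import Faker
from mimesis import Generic

# Create each generator once so that only reseeding and generation are timed.
generic = Generic()
fake = Faker()


def test_mimesis():
    # Reseeds every provider held by Generic before generating one value.
    generic.reseed(42)
    return generic.person.email()


def test_faker():
    # Reseeds this Faker instance before generating one value.
    fake.seed_instance(42)
    return fake.email()


# Run each function 100,000 times and report the total elapsed time.
mimesis_time = timeit.timeit(test_mimesis, number=100000)
faker_time = timeit.timeit(test_faker, number=100000)

print(f"Mimesis time: {mimesis_time:.4f} seconds")
print(f"Faker time:   {faker_time:.4f} seconds")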

Outputs:

Mimesis time: 18.9220 seconds
Faker time:   7.5092 seconds

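In this benchmark, reseeding before every call makes mimesis about 2.5 times slower than faker (18.9 s versus 7.5 s for 100,000 calls), so deterministic data generation is much less efficient with mimesis.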

Possible solutions:

  • Consider using a shared random generator for all providers in Generic, or
  • At least add a note in the documentation that Generic.reseed() is an expensive operation because it reseeds every provider individually.
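
A rough sketch of what the first option could look like, using only the standard library. `SharedRandomProvider` and `SharedRandomGeneric` are hypothetical names for illustration, not existing mimesis classes, and the real providers would need more than this. The idea is that every provider holds a reference to the same `random.Random` instance, so reseeding touches one generator regardless of how many providers exist.

```python
import random


class SharedRandomProvider:
    """Hypothetical provider that borrows a generator instead of owning one."""

    def __init__(self, rng):
        self.random = rng  # shared reference, not a copy

    def number(self):
        return self.random.randint(0, 999)


class SharedRandomGeneric:
    """Hypothetical aggregate with a single generator for all providers."""

    def __init__(self, seed=None):
        self.random = random.Random(seed)
        self.person = SharedRandomProvider(self.random)
        self.address = SharedRandomProvider(self.random)

    def reseed(self, seed):
        # A single seed() call, independent of the number of providers.
        self.random.seed(seed)


g = SharedRandomGeneric()
g.reseed(42)
first = g.person.number()
g.reseed(42)
second = g.person.number()
print(first == second)  # True: same seed, same first value
```

One trade-off of this design: because all providers draw from the same stream, the value a provider returns depends on how many values other providers have drawn since the last reseed. That would be a behavior change for code that relies on each provider having an independent sequence.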

Thank you for your great work on mimesis! Let me know if any more details are needed.
