[Improvement Suggestions] #33

Open
vincentteyssier opened this issue Nov 6, 2018 · 1 comment

Comments

@vincentteyssier

vincentteyssier commented Nov 6, 2018

Hi Paul,

As discussed on SO, please find below my feedback on what I think would be nice additions to this already great library:

  1. Possibility to merge generated stats: a typical use case is a daily ETL that receives the daily data as a set of multiple files with a common prefix. The stats need to be computed on the whole batch, not on each file, so being able to merge several stats results would be useful. Accepting a prefix as input instead of a single filename would also be very useful for that use case.

  2. Documentation: a list of validation error messages would be very nice to have in the docs. Even more important would be a list of all the feature properties that can be written in the protobuf schema.

  3. RAM overhead: datasets are often big and TFDV requires a large RAM overhead. The possibility to process datasets in mini-batches would allow running TFDV even in low-RAM environments. That could even be combined with my first suggestion.

  4. Schema inference: TFDV generates quite generic schemas when inferring them from a dataset. I understand that domain knowledge is required to define precise schema properties, but maybe adding a suggestion output could help. For example, if the standard deviation is quite high relative to the mean, or the missing value rate is 6% (so maybe 10% would be acceptable), you could output a message saying that such a standard deviation may be out of bounds and that the following code could add this to your schema: code_example.
    This kind of interactivity and suggestion would make it really easy to get better schemas imo.

  5. Performance: like my fellow colleague in this issue list, I experience really low performance on big datasets composed of many files. There must be some tuning that we are missing. For example, when I compute a mean with Apache Beam on the same dataset with a DirectRunner, I get quite good performance. However, I am very subjective here.... :)

  6. Facets Dive: it is very cool to have Facets Overview directly accessible when visualizing stats. A super useful addition would be to also have Dive in the same package.
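
To make suggestions 1, 3 and 4 a bit more concrete, here is a minimal sketch in plain Python. It is not TFDV's API: `RunningStats`, `add_batch`, `merge` and `suggest_min_fraction` are hypothetical names used only for illustration, and the 4% slack is an assumption chosen to match the "6% observed, maybe 10% acceptable" example above. The idea is that each file (or mini-batch) produces a small summary in constant memory, and summaries can be combined afterwards to give the same result as a single pass over the whole batch.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class RunningStats:
    """Mergeable summary of one numeric feature (hypothetical, not a TFDV type)."""
    count: int = 0      # non-missing values seen so far
    missing: int = 0    # missing (None) values seen so far
    mean: float = 0.0
    m2: float = 0.0     # sum of squared deviations from the mean

    def add_batch(self, values: Iterable[Optional[float]]) -> "RunningStats":
        # Welford's update: one value at a time, O(1) memory per feature.
        for v in values:
            if v is None:
                self.missing += 1
                continue
            self.count += 1
            delta = v - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (v - self.mean)
        return self

    def merge(self, other: "RunningStats") -> "RunningStats":
        # Pairwise combination of two partial summaries (Chan et al.).
        n = self.count + other.count
        if n == 0:
            return RunningStats(missing=self.missing + other.missing)
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return RunningStats(n, self.missing + other.missing, mean, m2)

    @property
    def stddev(self) -> float:
        # Population standard deviation.
        return math.sqrt(self.m2 / self.count) if self.count else 0.0

    @property
    def missing_rate(self) -> float:
        total = self.count + self.missing
        return self.missing / total if total else 0.0


def suggest_min_fraction(stats: RunningStats, slack: float = 0.04) -> float:
    """Hypothetical heuristic: tolerate slightly more missing values than observed."""
    return round(max(0.0, 1.0 - stats.missing_rate - slack), 2)


# Two "files" of the same daily batch, summarized independently and then merged.
file_a = RunningStats().add_batch([1.0, 2.0, None, 4.0])
file_b = RunningStats().add_batch([5.0, None, 7.0])
daily = file_a.merge(file_b)
print(daily.count, daily.missing, daily.mean, daily.stddev)
print("suggested minimum presence fraction:", suggest_min_fraction(daily))
```

Because `merge` only needs the two summaries, the same function covers both "merge the stats of several files" and "process one large file in mini-batches". A suggestion like the one printed at the end is the kind of value that could be proposed for a feature's presence constraint in the schema, leaving the final decision to the user.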

Anyway, thanks a lot for the super cool library, it is already very useful as is!!!

@paulgc
Member

paulgc commented Nov 8, 2018

Thanks for the detailed feedback, Vincent. This is really useful. We will continue to address these issues in subsequent releases.

TFDV 0.11.0 will be released within a week and comes with some new features and performance improvements. Regarding further performance improvements, Beam is adding support for the Flink runner, which would allow you to run statistics generation across a cluster or using multiple processes on a single machine.


3 participants