[Improvement Suggestions] #33

vincentteyssier · 2018-11-06T15:29:28Z

Hi Paul,

As discussed on SO, please find below my feedback on what I think would be nice additions to this already great library:

Possibility to merge generated stats: a typical use case is a daily ETL job that receives the day's data as a set of multiple files with a common prefix. The stats need to be computed on the whole batch, not on each file, so being able to merge several stats results would be useful. Accepting a prefix as input instead of a single filename would also be very useful for that use case.
Documentation: a list of validation error messages would be very nice to have in the docs. Even more important is a list of all the feature properties that can be written in the protobuf schema.
RAM overhead: datasets are often big and TFDV requires a large RAM overhead. The possibility to process datasets in mini-batches would make it possible to run TFDV even in lower-RAM environments. That could even be combined with my first suggestion.
Schema inference: TFDV generates quite generic schemas when inferring them from a dataset. I understand that domain knowledge is required to define precise schema properties, but maybe adding a suggestion output could help. For example, if the standard deviation is quite high relative to the mean, or the missing value rate is 6% (then maybe 10% would be acceptable), you could output a message saying that such a standard deviation may be out of bounds and that the following code could add this to your schema: code_example.
This kind of interactivity and suggestion would make it really easy to get better schemas, imo.
Performance: like my fellow colleague in this issue list, I experience really low performance on big datasets composed of many files. There must be some tuning that we are missing. When I compute a mean with Apache Beam on the same dataset with a DirectRunner, I get quite good performance, for example. However, I am very subjective here... :)
Facets Dive: it is very cool to have Facets Overview directly accessible when visualizing stats. A super useful addition would be to also have Dive in the same package.
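To make the first and third suggestions concrete, here is a minimal sketch of what mergeable statistics could look like. It is plain Python and not TFDV's API: `RunningStats`, `stats_for_file`, and `stats_for_prefix` are hypothetical names, and the sketch assumes CSV input with a header row and a single numeric column. Each file is streamed row by row, so memory use stays constant, and the per-file summaries are combined with the standard pairwise formula for count, mean, and variance.

```python
import csv
import glob
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Mergeable summary of one numeric column (hypothetical helper, not TFDV).

    m2 is the sum of squared deviations from the mean.
    """
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def add(self, x: float) -> None:
        # Welford's single-pass update.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        # Pairwise combination of two partial summaries (Chan et al.).
        n = self.count + other.count
        if n == 0:
            return RunningStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return RunningStats(n, mean, m2)

    @property
    def stddev(self) -> float:
        # Population standard deviation.
        return math.sqrt(self.m2 / self.count) if self.count else 0.0


def stats_for_file(path: str, column: str) -> RunningStats:
    """Stream one CSV file row by row; empty cells are skipped."""
    stats = RunningStats()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row[column] != "":
                stats.add(float(row[column]))
    return stats


def stats_for_prefix(prefix: str, column: str) -> RunningStats:
    """Merge the per-file summaries of every file whose path starts with prefix."""
    total = RunningStats()
    for path in sorted(glob.glob(glob.escape(prefix) + "*")):
        total = total.merge(stats_for_file(path, column))
    return total
```

With something like `stats_for_prefix("/data/2018-11-06_", "fare")` (path and column name are made up), the result describes the whole daily batch rather than any single file. The same merge step would also work across mini-batches of one large file.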

Anyway thanks a lot for the super cool library, it is already very useful as is!!!
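As an illustration of the schema suggestion idea, here is a rough sketch of the kind of hint I have in mind for the missing-value case. The function name `suggest_presence_constraint` and the `headroom` default are assumptions made up for this example. The function only builds a message; the TFDV line it prints (`tfdv.get_feature(...).presence.min_fraction`) reflects my understanding of the schema API and should be checked against the docs.

```python
from typing import Optional


def suggest_presence_constraint(
    feature_name: str,
    observed_missing_rate: float,
    headroom: float = 0.04,  # assumed slack added on top of the observed rate
) -> Optional[str]:
    """Return a human-readable schema hint, or None if no values are missing."""
    if not 0.0 <= observed_missing_rate <= 1.0:
        raise ValueError("observed_missing_rate must be within [0, 1]")
    if observed_missing_rate == 0.0:
        return None

    # Allow slightly more missing values than observed, capped at 100%.
    allowed_missing = min(1.0, observed_missing_rate + headroom)
    min_fraction = 1.0 - allowed_missing

    return (
        f"{feature_name}: {observed_missing_rate:.0%} of examples are missing "
        f"this feature. If up to {allowed_missing:.0%} is acceptable, consider:\n"
        f"  tfdv.get_feature(schema, {feature_name!r})"
        f".presence.min_fraction = {min_fraction:.2f}"
    )
```

Calling `print(suggest_presence_constraint("tips", 0.06))` (the feature name is made up) would report the 6% missing rate and propose a 10% tolerance, which is the interaction described above. A similar check could compare the standard deviation to the mean.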


paulgc · 2018-11-08T03:46:03Z

Thanks for the detailed feedback, Vincent. This is really useful. We will continue to address these issues in subsequent releases.

TFDV 0.11.0 will be released within a week and comes with some new features and performance improvements. Regarding further performance improvements, Beam is adding support for a Flink runner, which would allow you to run statistics generation across a cluster or with multiple processes on a single machine.

Harshini-Gadige added the type:feature label Nov 16, 2018

Harshini-Gadige added the stat:awaiting tensorflower label Apr 3, 2019
