Problems with nested data structures' datashapes #750

@chdoig

Problem: I have a Twitter dataset with nested structures in a MongoDB database that I can't load as a blaze Table:

Error:

In [7]: Table(db.ebola)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-6ad34cf7425f> in <module>()
----> 1 Table(db.ebola)

/home/cdoig/blaze/blaze/api/table.py in __init__(self, data, dshape, name, columns, schema)
     70             dshape = var * dshape
     71         if not dshape:
---> 72             dshape = discover(data)
     73             types = None
     74             if isinstance(dshape[1], Tuple):

/home/cdoig/envs/memex/lib/python2.7/site-packages/multipledispatch/dispatcher.pyc in __call__(self, *args, **kwargs)
    161             self._cache[types] = func
    162         try:
--> 163             return func(*args, **kwargs)
    164 
    165         except MDNotImplementedError:

/home/cdoig/blaze/blaze/mongo.py in discover(coll, n)
     31         return coll.count() * ds.subshape[0]
     32     else:
---> 33         raise ValueError("Consistent datashape not found")
     34 
     35 

ValueError: Consistent datashape not found

More information:

Discover on some rows:

In [5]: discover(data[0])
Out[5]: dshape("(null, null, datetime, { hashtags : var * string, symbols : var * string, trends : var * string, urls : var * string, user_mentions : var * string }, int64, bool, string, null, int64, int64, null, null, null, null, null, string, null, bool, int64, bool, string, string, int64, bool, { contributors_enabled : bool, created_at : datetime, default_profile : bool, default_profile_image : bool, description : string, favourites_count : int64, follow_request_sent : null, followers_count : int64, following : null, friends_count : int64, geo_enabled : bool, id : int64, id_str : int64, is_translator : bool, lang : string, listed_count : int64, location : string, name : string, notifications : null, profile_background_color : int64, profile_background_image_url : string, profile_background_image_url_https : string, profile_background_tile : bool, profile_banner_url : string, profile_image_url : string, profile_image_url_https : string, profile_link_color : string, profile_sidebar_border_color : string, profile_sidebar_fill_color : string, profile_text_color : int64, profile_use_background_image : bool, protected : bool, screen_name : string, statuses_count : int64, time_zone : string, url : string, utc_offset : int64, verified : bool })")

In [6]: discover(data[1])
Out[6]: dshape("(null, null, datetime, { hashtags : var * string, symbols : var * string, trends : var * string, urls : var * string, user_mentions : var * string }, int64, bool, string, null, int64, int64, null, null, null, null, null, string, null, bool, int64, bool, string, string, int64, bool, { contributors_enabled : bool, created_at : datetime, default_profile : bool, default_profile_image : bool, description : null, favourites_count : int64, follow_request_sent : null, followers_count : int64, following : null, friends_count : int64, geo_enabled : bool, id : int64, id_str : int64, is_translator : bool, lang : string, listed_count : int64, location : null, name : string, notifications : null, profile_background_color : string, profile_background_image_url : string, profile_background_image_url_https : string, profile_background_tile : bool, profile_banner_url : string, profile_image_url : string, profile_image_url_https : string, profile_link_color : string, profile_sidebar_border_color : string, profile_sidebar_fill_color : string, profile_text_color : int64, profile_use_background_image : bool, protected : bool, screen_name : string, statuses_count : int64, time_zone : string, url : null, utc_offset : int64, verified : bool })")

In [7]: discover(data[2])
Out[7]: dshape("(null, null, datetime, { hashtags : var * string, symbols : var * string, trends : var * string, urls : var * string, user_mentions : var * string }, int64, bool, string, null, int64, int64, null, null, null, null, null, string, null, bool, int64, bool, string, string, int64, bool, { contributors_enabled : bool, created_at : datetime, default_profile : bool, default_profile_image : bool, description : string, favourites_count : int64, follow_request_sent : null, followers_count : int64, following : null, friends_count : int64, geo_enabled : bool, id : int64, id_str : int64, is_translator : bool, lang : string, listed_count : int64, location : string, name : string, notifications : null, profile_background_color : string, profile_background_image_url : string, profile_background_image_url_https : string, profile_background_tile : bool, profile_banner_url : string, profile_image_url : string, profile_image_url_https : string, profile_link_color : int64, profile_sidebar_border_color : string, profile_sidebar_fill_color : string, profile_text_color : int64, profile_use_background_image : bool, protected : bool, screen_name : string, statuses_count : int64, time_zone : string, url : string, utc_offset : int64, verified : bool })")
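Comparing the three outputs, the rows disagree only inside the nested user record: `description`, `location`, and `url` are `string` in rows 0 and 2 but `null` in row 1, `profile_background_color` is `int64` in row 0 but `string` in rows 1 and 2, and `profile_link_color` is `int64` in row 2 but `string` in rows 0 and 1. A minimal pure-Python sketch for spotting such fields automatically is below. It does not use blaze; the helper names `field_types` and `conflicting_fields` and the sample documents are hypothetical and only mimic the mismatch shown above.

```python
from collections import defaultdict


def field_types(doc, prefix=""):
    """Yield (dotted_path, type_name) for every leaf field of a nested dict."""
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            # Recurse into nested records, extending the dotted path.
            for item in field_types(value, path + "."):
                yield item
        else:
            yield path, type(value).__name__


def conflicting_fields(docs):
    """Return {dotted_path: set of type names} for paths seen with more than one type."""
    seen = defaultdict(set)
    for doc in docs:
        for path, type_name in field_types(doc):
            seen[path].add(type_name)
    return {path: names for path, names in seen.items() if len(names) > 1}


# Hypothetical miniature documents, not real rows from the dataset.
sample = [
    {"id": 1, "user": {"description": "hi", "profile_link_color": "0084B4"}},
    {"id": 2, "user": {"description": None, "profile_link_color": "0084B4"}},
    {"id": 3, "user": {"description": "yo", "profile_link_color": 84}},
]

conflicts = conflicting_fields(sample)
for path in sorted(conflicts):
    print(path, sorted(conflicts[path]))
```

Assuming `db.ebola` is a pymongo collection, something like `conflicting_fields(db.ebola.find().limit(100))` should apply the same check to the first hundred tweets.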

Data structure in MongoDB of the first tweet:

[Image: screen shot 2014-10-14 at 6 04 00 pm]

cc: @mrocklin
