optimize memory consumption of idnadata, take 3 #24
Conversation
Let me know if I can clarify, document, or change anything :-)

Thanks! I am at the ICANN meeting in Marrakech this week, so I won't have an opportunity to work on this until next week. I hope then I can push out a new release with this included.
Optimizations to reduce memory footprint
Thanks very much! Just wanted to confirm: the outcome of the discussion was to move to Unicode version 8.0?

That issue is not resolved, but it is possibly moot with respect to these particular patches: the classification of code points into scripts is theoretically purely additive, and this won't add any additional PVALID characters, which are still derived from 6.3. I'll confirm this assumption before pushing a new version to PyPI.
Optimizations to reduce memory footprint
v2.1 has been pushed that contains this. I made some minor changes that relate to how the data is encapsulated so it is easier to diff and debug, but the functionality is the same. Thanks again!
Thanks! The wheel uploaded here is for Python 2 only --- should it be for both 2 and 3? https://pypi.python.org/pypi/idna/2.1

There should be a universal one there now. I am new to using wheels...

Thanks, that worked. You might want to take down the 2-only wheel? Not critical, though.
I'm sorry I keep changing my mind here. I think I finally have an approach with no significant drawbacks:
- Drop the `DISALLOWED` codepoints, as well as any scripts that aren't special-cased by RFC 5892.
- The `PVALID` codepoint lists have long runs in them --- but instead of expanding those runs into a flat hash-set at runtime, leave them as a sequence of `(start, end)` tuples. These tuples can be binary-searched for integer membership with the `bisect` module. The binary search is faster than the one in "optimize memory consumption of idnadata, take 2" #23 because it costs O(log(# runs)), not O(log(# codepoints)).
- Memory consumption is down under 2 MB, the naive benchmark is not significantly affected (the increase is less than 10%, with overall time still dominated by punycode), and there's no need to package a large opaque binary datafile.
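The run-tuple membership test described above can be sketched roughly like this (a minimal illustration, not the actual patch: the ranges shown are made-up sample data, and `in_ranges` is a hypothetical helper name):

```python
import bisect

# Hypothetical sample data: sorted, non-overlapping (start, end) codepoint
# runs with inclusive ends. The real tables would hold the PVALID runs.
RANGES = [(0x61, 0x7A), (0xDF, 0xF6), (0xF8, 0xFF)]
STARTS = [start for start, _ in RANGES]  # parallel list for bisect

def in_ranges(codepoint):
    """Binary-search the run list instead of probing a flat per-codepoint set."""
    # Index of the rightmost run whose start is <= codepoint.
    i = bisect.bisect_right(STARTS, codepoint) - 1
    if i < 0:
        return False  # below the first run
    start, end = RANGES[i]
    return start <= codepoint <= end
```

This is why the cost is O(log(# runs)): the search walks the run boundaries, not the expanded codepoints, so long contiguous runs collapse to a single tuple.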
Caveats about the Unicode version still apply (again, the idnadata rebuild is an isolated commit and can be rebased out).