PyKX pandas conversion offers no speedup with short vs float columns #36
Comments
If you are not already using Pandas 2.0, it's worth upgrading, as you will see a 3x speed improvement for these conversions:

[Timing comparison: Pandas 1.5.3 vs Pandas 2.1.4]
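(As a quick sanity check before benchmarking, not part of the original thread, you can confirm which pandas version PyKX is picking up:

import pandas as pd
print(pd.__version__)  # the ~3x figure above applies to pandas >= 2.0
)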
@rianoc-kx this is on pandas 2.2.1. Could you try going over TCP/IP?
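(For anyone reproducing the TCP/IP test below, a minimal sketch of the kind of IPC setup implied by handle('dat1'); the host and port are placeholders, and it assumes a q process holding dat1 and dat2 is already listening:

import pykx as kx

# Hypothetical setup: a q process on localhost:5000 already has
# dat1 (short columns) and dat2 (float columns) defined in memory.
handle = kx.QConnection('localhost', 5000)
dat1 = handle('dat1')  # pulls the table over IPC, then deserialises it
dat2 = handle('dat2')
handle.close()
)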
Isolating the IPC portion, you can see the larger float data is slower to transfer:

In [3]: %timeit dat1 = handle('dat1')
1.13 s ± 49.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit dat2 = handle('dat2')
1.56 s ± 32.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Removing the IPC element and timing only the in-process conversion:

In [9]: kx.q['N'] = 50000000;
...: dat1 = kx.q('([] date:2000.01.01;q1:N?100h; q2:N?5000h; q3:N?50h)')
...: dat2 = kx.q('([] date:2000.01.01; q1:N?100f; q2:N?5000f; q3:N?50f)')
In [10]: %timeit df1 = dat1.pd()
311 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [11]: %timeit df2 = dat2.pd()
270 ms ± 18.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Going further again and removing the date column:

In [12]: dat1 = kx.q('([] q1:N?100h; q2:N?5000h; q3:N?50h)')
...: dat2 = kx.q('([] q1:N?100f; q2:N?5000f; q3:N?50f)')
In [13]: %timeit df1 = dat1.pd()
47.7 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [14]: %timeit df2 = dat2.pd()
121 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Float arrays can be zero-copied from q into the NumPy arrays that back the dataframe, which makes this conversion effectively a constant-time operation.
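(A minimal way to see the constant-time behaviour yourself; this is my sketch, not from the thread. A zero-copy conversion should stay flat as N grows, while a copying conversion should scale roughly linearly:

import time
import pykx as kx

for n in (1_000_000, 10_000_000):
    kx.q['N'] = n
    shorts = kx.q('([] q1:N?100h)')
    floats = kx.q('([] q1:N?100f)')
    for name, tab in (('short', shorts), ('float', floats)):
        t0 = time.perf_counter()
        tab.pd()
        print(f'{name} N={n}: {time.perf_counter() - t0:.4f} s')
)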
For the integer types, which must be copied, conversion time grows with the byte width of the column:

In [3]: kx.q['N'] = 50000000;
In [4]: dat1 = kx.q('([] q1:N?100h; q2:N?5000h; q3:N?50h)')
In [5]: %timeit df1 = dat1.pd()
43.8 ms ± 701 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: dat2 = kx.q('([] q1:N?100i; q2:N?5000i; q3:N?50i)')
In [7]: %timeit df2 = dat2.pd()
58.3 ms ± 304 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: dat3 = kx.q('([] q1:N?100; q2:N?5000; q3:N?50)')
In [9]: %timeit df3 = dat3.pd()
97.4 ms ± 685 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
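(The growth across widths lines up with a back-of-envelope copy-cost estimate; my arithmetic, not from the thread:

# Bytes that must be copied per table at N = 50,000,000 rows, 3 columns
N, cols = 50_000_000, 3
for name, width in (('short (h)', 2), ('int (i)', 4), ('long (j)', 8)):
    print(f'{name}: {N * cols * width / 1e9:.1f} GB')
# short: 0.3 GB, int: 0.6 GB, long: 1.2 GB -- roughly matching the
# spread in the timings above, on top of a fixed per-conversion overhead.
)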
There seems to be marginal benefit in terms of speed when converting kdb+ tables to pandas via the .pd() method using short ints instead of floats. Although the memory usage of the table drops accordingly, the time spent converting to a dataframe does not improve much.

Indeed, the size of dat1 is 40% of the size of dat2, and yet in Python the .pd() conversion times are comparable. They get closer still the more short/float columns you add.

Is there a way to optimize the call to .pd() when dealing with very large tables whose columns are mostly shorts? Otherwise one can spend forever waiting for the conversion.
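(One workaround to experiment with, sketched under the assumption that the short columns are known by name as in the examples above: cast them to float on the q side so .pd() can take the zero-copy path, trading q-side memory for conversion speed. Whether this nets out faster depends on whether q's cast is cheaper than the Python-side copy, so it is worth benchmarking:

import pykx as kx

kx.q['N'] = 50_000_000
dat1 = kx.q('([] q1:N?100h; q2:N?5000h; q3:N?50h)')

# Cast the shorts to floats in q (8 bytes per value instead of 2),
# then convert; float columns can be zero-copied into the dataframe.
dat1f = kx.q('{update "f"$q1, "f"$q2, "f"$q3 from x}', dat1)
df = dat1f.pd()
)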