Pykx pandas conversion offers no speedup with short vs float columns #36

antipisa opened this issue Oct 9, 2024 · 3 comments

antipisa commented Oct 9, 2024

Python: 3.11.8
Pandas: 2.2.1
Numpy: 1.26.4

There seems to be only a marginal speed benefit when converting kdb+ tables to pandas with the .pd() method when the columns are short ints instead of floats. Although the memory usage of the table drops accordingly, the time spent converting to a DataFrame does not improve much.

N:50000000;
dat1:([] date:2000.01.01; sym:`A; q1:N?100h; q2:N?5000h; q3:N?50h);
dat2:([] date:2000.01.01; sym:`A; q1:N?100f; q2:N?5000f; q3:N?50f);

Indeed, dat1 is 40% of the size of dat2, and yet in Python:


handle = pykx.SyncQConnection(host, port)
%timeit df1 = handle('dat1').pd()  # short
3.48 s per loop
%timeit df2 = handle('dat2').pd()  # float
5.76 s per loop

The timings get closer the more float/short columns you add.

Is there a way to optimize the call to .pd() for very large tables whose columns are mostly shorts? Otherwise one can spend forever waiting on the conversion.

rianoc-kx (Collaborator) commented:

If you are not already on Pandas 2.0+, it's worth upgrading: you will see roughly a 3x speed improvement for these conversions.

Pandas 1.5.3

In [1]: import pykx as kx

In [2]: kx.q['N'] = 50000000;
   ...: dat1 = kx.q('([] date:2000.01.01; sym:`A; q1:N?100h; q2:N?5000h; q3:N?50h)')
   ...: dat2 = kx.q('([] date:2000.01.01; sym:`A; q1:N?100f; q2:N?5000f; q3:N?50f)')
   ...:

In [3]: %timeit df1 = dat1.pd()
2.13 s ± 163 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit df2 = dat2.pd()
2.1 s ± 158 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Pandas 2.1.4

In [1]: import pykx as kx
   ...:
   ...: kx.q['N'] = 50000000;
   ...: dat1 = kx.q('([] date:2000.01.01; sym:`A; q1:N?100h; q2:N?5000h; q3:N?50h)')
   ...: dat2 = kx.q('([] date:2000.01.01; sym:`A; q1:N?100f; q2:N?5000f; q3:N?50f)')
   ...:

In [2]: %timeit df1 = dat1.pd()
738 ms ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit df2 = dat2.pd()
700 ms ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

antipisa (Author) commented:

@rianoc-kx this is on pandas 2.2.1. Could you try going over TCP/IP?


rianoc-kx commented Oct 11, 2024

Isolating the IPC portion, you can see that the larger float data is slower to transfer:

In [3]: %timeit dat1 = handle('dat1')
1.13 s ± 49.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit dat2 = handle('dat2')
1.56 s ± 32.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
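
For reference, the wire-size difference is easy to confirm server-side with q's -22!, which returns the number of bytes a value occupies when serialised for IPC (a quick sketch using the handle from above):

short_size = handle('-22!dat1').py()  # serialised bytes for the short table
float_size = handle('-22!dat2').py()  # serialised bytes for the float table
print(short_size, float_size)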

Removing the sym column, we can see that it is responsible for the majority of the conversion time: ~700 ms --> ~300 ms

In [9]: kx.q['N'] = 50000000;
   ...: dat1 = kx.q('([] date:2000.01.01; q1:N?100h; q2:N?5000h; q3:N?50h)')
   ...: dat2 = kx.q('([] date:2000.01.01; q1:N?100f; q2:N?5000f; q3:N?50f)')

In [10]: %timeit df1 = dat1.pd()
311 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: %timeit df2 = dat2.pd()
270 ms ± 18.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
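
If the symbol conversion dominates, one possible mitigation (a sketch only, not a built-in PyKX shortcut) is to fetch an integer encoding of the column from q and rebuild a pandas Categorical locally, instead of materialising one Python str per row:

import pandas as pd

codes = handle('(distinct s)?s:dat1`sym').np()  # per-row index into the unique symbols
cats = handle('distinct dat1`sym').py()         # the unique symbols as Python strings
sym = pd.Categorical.from_codes(codes, categories=cats)

Whether this wins depends on the symbol cardinality; with few distinct symbols it avoids nearly all of the per-row object creation.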

Going further and removing date as well, we can see it was responsible for most of the remaining conversion time, leaving only the numeric conversions, which are the most direct and therefore fastest:

In [12]:  dat1 = kx.q('([] q1:N?100h; q2:N?5000h; q3:N?50h)')
    ...:  dat2 = kx.q('([] q1:N?100f; q2:N?5000f; q3:N?50f)')

In [13]: %timeit df1 = dat1.pd()
47.7 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [14]: %timeit df2 = dat2.pd()
121 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
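
If the date conversion is the cost you want to avoid, the raw mode of the PyKX converters can defer it. A sketch, assuming the raw keyword PyKX documents for .pd(), with tab standing in for any of the tables above that still has its date column (raw q dates are day counts from 2000.01.01):

import numpy as np

raw = tab.pd(raw=True)  # date arrives as raw day counts rather than datetime64
raw['date'] = np.datetime64('2000-01-01') + raw['date'].to_numpy().astype('timedelta64[D]')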

Float arrays can be zero-copied from q into NumPy arrays when building the DataFrame, which makes the conversion effectively a constant-time operation.
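
One way to observe this from Python is to check whether the converted NumPy array owns its memory; a zero-copy array reports OWNDATA as False and keeps a base reference to the underlying q object (a sketch, outcomes not guaranteed across PyKX versions):

fv = kx.q('1000000?100f').np()
hv = kx.q('1000000?100h').np()
print(fv.flags['OWNDATA'], fv.base is not None)  # expected: borrowed, zero-copy
print(hv.flags['OWNDATA'], hv.base is not None)  # OWNDATA True here would indicate a copy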

short -> int -> long does show conversion time growing with the datatype size:

In [3]: kx.q['N'] = 50000000;

In [4]: dat1 = kx.q('([] q1:N?100h; q2:N?5000h; q3:N?50h)')

In [5]: %timeit df1 = dat1.pd()
43.8 ms ± 701 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: dat2 = kx.q('([] q1:N?100i; q2:N?5000i; q3:N?50i)')

In [7]: %timeit df2 = dat2.pd()
58.3 ms ± 304 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: dat3 = kx.q('([] q1:N?100; q2:N?5000; q3:N?50)')

In [9]: %timeit df3 = dat3.pd()
97.4 ms ± 685 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
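
For scale, those timings correspond to the following copy throughput (a back-of-envelope from the numbers above: 3 columns of 50M elements at 2, 4 and 8 bytes each):

N, cols = 50_000_000, 3
for name, width, secs in [('short', 2, 0.0438), ('int', 4, 0.0583), ('long', 8, 0.0974)]:
    gb = N * cols * width / 1e9
    print(f'{name}: {gb:.1f} GB in {secs * 1e3:.0f} ms -> {gb / secs:.1f} GB/s')

The throughput rises with the wider types, so the time does not scale purely with bytes copied; a fixed per-column overhead appears to account for part of the short and int timings.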
