Skip to content

Conversation

@skcho
Copy link

@skcho skcho commented Sep 25, 2019

map_string copies the full buffer (b.Bi_outbuf.o_s) with calling Bytes.to_string.

  • The size of buffer seems to increase consistently during the reading, which makes the copies expensive, especially when there has been a long-length field data.
  • But the function f is using only a substring of the buffer.

This PR tries to copy less in map_string with Bytes.sub_string.

`map_string` copies the full buffer (`b.Bi_outbuf.o_s`) with calling
`Bytes.to_string`.

* The size of buffer seems to increase consistently during the
  reading, which makes the copies expensive.
* But the function `f` is using only a substring of the buffer.

This PR tries to copy less in `map_string` with `Bytes.sub_string`.
@skcho skcho requested a review from NathanReb as a code owner September 25, 2019 12:41
@skcho
Copy link
Author

skcho commented Sep 25, 2019

But the function f is using only a substring of the buffer.

Please let me know, if this thought is wrong.


A similar issue is also in map_lexeme.

yojson/lib/read.mll

Lines 152 to 154 in 6f8ed3f

let map_lexeme f lexbuf =
let len = lexbuf.lex_curr_pos - lexbuf.lex_start_pos in
f (Bytes.to_string lexbuf.lex_buffer) lexbuf.lex_start_pos len

I believe the last line can be replaced with

f (Bytes.sub_string lexbuf.lex_buffer lexbuf.lex_start_pos len) 0 len

but I am not sure what is the assumption about the mapping function f.

If you agree with this, I will add this change in this PR too.

@skcho
Copy link
Author

skcho commented Oct 11, 2019

@NathanReb Could you check this PR out please?

@NathanReb
Copy link
Collaborator

I was off for a few weeks, I'll take a look ASAP!

@rgrinberg
Copy link
Contributor

@Leonidas-from-XIV mind taking a look at this?

@Leonidas-from-XIV
Copy link
Member

I looked a little bit at this and of course the PR doesn't apply cleanly anymore because we removed biniou, but can easily be fixed with Buffer.sub, but it is also not really necessary since the current code does Buffer.to_string which returns the current content of the buffer, not the underlying length of the buffer.

But in general this code does not seem to be necessary since map_string is only used by map_ident and that is not used by… anything, so it also isn't called as part of reading JSON.

That said, map_string and map_ident are part of the undocumented API and I am not entirely sure what they are used for in the first place since this is pretty much part of of the lexing and does not make that much sense outside of it. I suggest just deleting them altogether. @NathanReb if you agree I'll make a PR dropping this.

@skcho
Copy link
Author

skcho commented Nov 2, 2020

Hello @Leonidas-from-XIV . Thank you for looking into this.

That said, map_string and map_ident are part of the undocumented API and I am not entirely sure what they are used for in the first place since this is pretty much part of of the lexing and does not make that much sense outside of it.

We were using atdgen to transfer soma data from C++ to OCaml. In the atdgen-generated code, there were map_ident calls. I don't know much about recent changes on yojson and atd though.

but it is also not really necessary since the current code does Buffer.to_string

OK, map_string seems to be fixed already. Great! Feel free to close the PR, if you don't mind map_lexeme that calls Bytes.to_string too.

Leonidas-from-XIV added a commit that referenced this pull request Nov 2, 2020
As mentioned in #85 by @skcho, the code in `map_ident`/`map_string` was
changed to not copy the whole buffer and give a start/end point (though
this happened somewhat more on the accidental than deliberate side) thus
subtly changing the semantics of the data `f` was called on. This saves
copying but prevents `f` from peeking beyond the borders it was
assigned. Whether this breaks consumers depends on what they do with it.

Nevertheless, what this PR does is it applies the same logic to
`map_lexeme`, so while the semantics might have changed at least they
change in a consistent way.

Closes #85
@Leonidas-from-XIV
Copy link
Member

In the atdgen-generated code, there were map_ident calls. I don't know much about recent changes on yojson and atd though.

Ahh, okay, that's what it is for. Interesting. I assume this stays the same but you uncovered a potential incompatibility in the 2.0 release, where this is changed to always return 0 and length (to the point of being somewhat silly).

OK, map_string seems to be fixed already. Great! Feel free to close the PR, if you don't mind map_lexeme that calls Bytes.to_string too.

I've opened #108 to at least align the behaviour of the functions so they're at least consistent and of course slightly more efficient.

@skcho
Copy link
Author

skcho commented Nov 4, 2020

and of course slightly more efficient.

As far as I can remember, in map_string, the buffer size of b.Bi_outbuf.o_s monotonely increased whenever it read very long string inputs. So, after the buffer size became very big, it took so long time to copy the full buffer to string. I am not sure whether there is a similar issue in map_lexeme or not though.

in the 2.0 release

I can't wait it! Thank you.

@skcho skcho closed this Nov 30, 2020
@jvillard jvillard mentioned this pull request Nov 4, 2021
Leonidas-from-XIV added a commit to Leonidas-from-XIV/yojson that referenced this pull request Jan 25, 2022
As mentioned in ocaml-community#85 by @skcho, the code in `map_ident`/`map_string` was
changed to not copy the whole buffer and give a start/end point (though
this happened somewhat more on the accidental than deliberate side) thus
subtly changing the semantics of the data `f` was called on. This saves
copying but prevents `f` from peeking beyond the borders it was
assigned. Whether this breaks consumers depends on what they do with it.

Nevertheless, what this PR does is it applies the same logic to
`map_lexeme`, so while the semantics might have changed at least they
change in a consistent way.

Closes ocaml-community#85
panglesd added a commit to panglesd/opam-repository that referenced this pull request Jun 2, 2022
CHANGES:

### Removed

- Removed dependency on easy-format and removed `pretty_format` from
  `Yojson`, `Yojson.Basic`, `Yojson.Safe` and `Yojson.Raw`. (@c-cube, ocaml-community/yojson#90)
- Removed dependency on `biniou`, simplifying the chain of dependencies. This
  changes some APIs:
  * `Bi_outbuf.t` in signatures is replaced with `Buffer.t`
  * `to_outbuf` becomes `to_buffer` and `stream_to_outbuf` becomes
    `stream_to_buffer`
  (@Leonidas-from-XIV, ocaml-community/yojson#74, and @gasche, ocaml-community/yojson#132)
- Removed `yojson-biniou` library
- Removed deprecated `json` type aliasing type `t` which has been available
  since 1.6.0 (@Leonidas-from-XIV, ocaml-community/yojson#100).
- Removed `json_max` type (@Leonidas-from-XIV, ocaml-community/yojson#103)
- Removed constraint that the "root" value being rendered (via either
  `pretty_print` or `to_string`) must be an object or array. (@cemerick, ocaml-community/yojson#121)
- Removed `validate_json` as it only made sense if the type was called `json`.
  (@Leonidas-from-XIV, ocaml-community/yojson#137)

### Add

- Add an opam package `yojson-bench` to deal with benchmarks dependency
  (@tmcgilchrist, ocaml-community/yojson#117)
- Add a benchmark to judge the respective performance of providing a buffer vs
  letting Yojson create an internal (ocaml-community/yojson#134, @Leonidas-from-XIV)
- Add an optional `suf` keyword argument was added to functions that write
  serialized JSON, thus allowing NDJSON output. Most functions default to not
  adding any suffix except for `to_file` (ocaml-community/yojson#124, @panglesd) and functions
  writing sequences of values where the default is `\n` (ocaml-community/yojson#135,
  @Leonidas-from-XIV)

### Change

- The `stream_from_*` and `stream_to_*` functions now use a `Seq.t` instead of a
  `Stream.t`, and they are renamed into `seq_from_*` and `seq_to_*` (@gasche, ocaml-community/yojson#131).

### Fix

- Avoid copying unnecessarily large amounts of strings when parsing (ocaml-community/yojson#85, ocaml-community/yojson#108,
  @Leonidas-from-XIV)
- Fix `stream_to_file` (ocaml-community/yojson#133, @tcoopman and @gasche)
panglesd added a commit to panglesd/opam-repository that referenced this pull request Jun 3, 2022
CHANGES:

### Removed

- Removed dependency on easy-format and removed `pretty_format` from
  `Yojson`, `Yojson.Basic`, `Yojson.Safe` and `Yojson.Raw`. (@c-cube, ocaml-community/yojson#90)
- Removed dependency on `biniou`, simplifying the chain of dependencies. This
  changes some APIs:
  * `Bi_outbuf.t` in signatures is replaced with `Buffer.t`
  * `to_outbuf` becomes `to_buffer` and `stream_to_outbuf` becomes
    `stream_to_buffer`
  (@Leonidas-from-XIV, ocaml-community/yojson#74, and @gasche, ocaml-community/yojson#132)
- Removed `yojson-biniou` library
- Removed deprecated `json` type aliasing type `t` which has been available
  since 1.6.0 (@Leonidas-from-XIV, ocaml-community/yojson#100).
- Removed `json_max` type (@Leonidas-from-XIV, ocaml-community/yojson#103)
- Removed constraint that the "root" value being rendered (via either
  `pretty_print` or `to_string`) must be an object or array. (@cemerick, ocaml-community/yojson#121)
- Removed `validate_json` as it only made sense if the type was called `json`.
  (@Leonidas-from-XIV, ocaml-community/yojson#137)

### Add

- Add an opam package `yojson-bench` to deal with benchmarks dependency
  (@tmcgilchrist, ocaml-community/yojson#117)
- Add a benchmark to judge the respective performance of providing a buffer vs
  letting Yojson create an internal (ocaml-community/yojson#134, @Leonidas-from-XIV)
- Add an optional `suf` keyword argument was added to functions that write
  serialized JSON, thus allowing NDJSON output. Most functions default to not
  adding any suffix except for `to_file` (ocaml-community/yojson#124, @panglesd) and functions
  writing sequences of values where the default is `\n` (ocaml-community/yojson#135,
  @Leonidas-from-XIV)

### Change

- The `stream_from_*` and `stream_to_*` functions now use a `Seq.t` instead of a
  `Stream.t`, and they are renamed into `seq_from_*` and `seq_to_*` (@gasche, ocaml-community/yojson#131).

### Fix

- Avoid copying unnecessarily large amounts of strings when parsing (ocaml-community/yojson#85, ocaml-community/yojson#108,
  @Leonidas-from-XIV)
- Fix `stream_to_file` (ocaml-community/yojson#133, @tcoopman and @gasche)
Leonidas-from-XIV pushed a commit to panglesd/opam-repository that referenced this pull request Jun 9, 2022
CHANGES:

### Removed

- Removed dependency on easy-format and removed `pretty_format` from
  `Yojson`, `Yojson.Basic`, `Yojson.Safe` and `Yojson.Raw`. (@c-cube, ocaml-community/yojson#90)
- Removed dependency on `biniou`, simplifying the chain of dependencies. This
  changes some APIs:
  * `Bi_outbuf.t` in signatures is replaced with `Buffer.t`
  * `to_outbuf` becomes `to_buffer` and `stream_to_outbuf` becomes
    `stream_to_buffer`
  (@Leonidas-from-XIV, ocaml-community/yojson#74, and @gasche, ocaml-community/yojson#132)
- Removed `yojson-biniou` library
- Removed deprecated `json` type aliasing type `t` which has been available
  since 1.6.0 (@Leonidas-from-XIV, ocaml-community/yojson#100).
- Removed `json_max` type (@Leonidas-from-XIV, ocaml-community/yojson#103)
- Removed constraint that the "root" value being rendered (via either
  `pretty_print` or `to_string`) must be an object or array. (@cemerick, ocaml-community/yojson#121)
- Removed `validate_json` as it only made sense if the type was called `json`.
  (@Leonidas-from-XIV, ocaml-community/yojson#137)

### Add

- Add an opam package `yojson-bench` to deal with benchmarks dependency
  (@tmcgilchrist, ocaml-community/yojson#117)
- Add a benchmark to judge the respective performance of providing a buffer vs
  letting Yojson create an internal (ocaml-community/yojson#134, @Leonidas-from-XIV)
- Add an optional `suf` keyword argument was added to functions that write
  serialized JSON, thus allowing NDJSON output. Most functions default to not
  adding any suffix except for `to_file` (ocaml-community/yojson#124, @panglesd) and functions
  writing sequences of values where the default is `\n` (ocaml-community/yojson#135,
  @Leonidas-from-XIV)

### Change

- The `stream_from_*` and `stream_to_*` functions now use a `Seq.t` instead of a
  `Stream.t`, and they are renamed into `seq_from_*` and `seq_to_*` (@gasche, ocaml-community/yojson#131).

### Fix

- Avoid copying unnecessarily large amounts of strings when parsing (ocaml-community/yojson#85, ocaml-community/yojson#108,
  @Leonidas-from-XIV)
- Fix `stream_to_file` (ocaml-community/yojson#133, @tcoopman and @gasche)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants