
Implement generic MW-API endpoints to replace the math endpoints of restbase
Open, Needs Triage, Public

Description

Currently, https://wikimedia.org/api/rest_v1/media/math/ lists the following endpoints:


  • check/{type}: for example, curl -v -d 'q=E=mc^2' https://wikimedia.org/api/rest_v1/media/math/check/tex returns {"success":true,"checked":"E=mc^{2}","requiredPackages":[],"identifiers":["E","m","c"],"endsWithDot":false} with the header x-resource-location: 4c0004393a88f350a93bcef62106d556c7fc827b. The implementation, https://github.com/wikimedia/mediawiki-extensions-Math/blob/master/src/InputCheck/MathoidChecker.php, gets the respective information from mathoid, backed by a WAN cache.
    • Check if the cache key is exactly the same as it used to be and determine if it needs to be exactly the same.
    • Implement tests for successful and failing examples
  • formula/{hash}: for example, curl https://wikimedia.org/api/rest_v1/media/math/formula/4c0004393a88f350a93bcef62106d556c7fc827b returns {"q":"E=mc^{2}","type":"tex"}, so it can be extracted from the same WANCache.
    • Figure out if this endpoint is used
  • render/{format}/{hash}: for example, https://wikimedia.org/api/rest_v1/media/math/render/mml/4c0004393a88f350a93bcef62106d556c7fc827b returns
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block" alttext="E=mc^{2}">
  <semantics>
    <mrow>
      <mi>E</mi>
      <mo>=</mo>
      <mi>m</mi>
      <msup>
        <mi>c</mi>
        <mrow class="MJX-TeXAtom-ORD">
          <mn>2</mn>
        </mrow>
      </msup>
    </mrow>
    <annotation encoding="application/x-tex">E=mc^{2}</annotation>
  </semantics>
</math>

Event Timeline

Yesterday we discussed the problems that arise when check can't store the rendered formula indefinitely. In that case, any requests that try to retrieve the rendered formula based on the hash will fail.

How about, instead of a hash, we use gzdeflate and base64_encode to encode the normalized formula in the "hash"? This will break URLs when the formula gets too big, but anything up to 800 or so bytes compressed and encoded should work. Larger formulas would fail, but those should be very rare.
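
A minimal sketch of what such an encoding could look like, written in Python rather than PHP purely for illustration. The function names are hypothetical. PHP's gzdeflate produces a raw DEFLATE stream, which corresponds to wbits=-15 in Python's zlib. One assumption that goes beyond the proposal: the sketch uses the URL-safe base64 alphabet without padding, because plain base64_encode output can contain "+", "/" and "=", which would need escaping in a URL path.

```python
import base64
import zlib


def encode_formula(tex: str) -> str:
    """Compress a normalized formula and encode it for use in a URL path."""
    # wbits=-15 selects raw DEFLATE (no zlib header), like PHP's gzdeflate.
    compressor = zlib.compressobj(level=9, wbits=-15)
    raw = compressor.compress(tex.encode("utf-8")) + compressor.flush()
    # URL-safe alphabet, padding stripped (an assumption, see above).
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def decode_formula(token: str) -> str:
    """Reverse encode_formula; raises if the token is damaged or truncated."""
    padded = token + "=" * (-len(token) % 4)
    raw = base64.urlsafe_b64decode(padded)
    return zlib.decompress(raw, wbits=-15).decode("utf-8")


if __name__ == "__main__":
    token = encode_formula("E=mc^{2}")
    print(token, len(token))
    print(decode_formula(token))
```

For very short formulas the encoded form is longer than the input, so the size benefit only shows up for longer, more repetitive TeX.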

Yesterday we discussed the problems that arise when check can't store the rendered formula indefinitely. In that case, any requests that try to retrieve the rendered formula based on the hash will fail.

When we replaced the database backend with the cache, I found that one could use a special type of cache (DB) that stores the data indefinitely. See https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Math/+/975432 . Thus, I don't fully understand the problem.

How about, instead of a hash, we use gzdeflate and base64_encode to encode the normalized formula in the "hash"? This will break URLs when the formula gets too big, but anything up to 800 or so bytes compressed and encoded should work. Larger formulas would fail, but those should be very rare.

We could do that, but defining and implementing the edge cases would require some work:

  1. On the server side, there need to be 414 responses if the client sends longer requests than we want to process. Maybe this is already handled by the MW REST API code, but it would need to be checked.
  2. We would need to run the check again on the second request.
  3. We would need to identify when we receive incomplete data.
  4. I am unsure if we can/should define a limit independent of the browser. According to StackOverflow, 2000 might be a good value, but maybe it's best to use the same limit as defined on the server side (1).
  5. We would need to add a new error message. Most examples will be missing closing math tags, e.g. <math> formula1 some long wikitext <math>formula2</math>
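
To make points 1 and 3 more concrete, here is a rough Python sketch of how a handler might classify an incoming encoded formula. Everything in it is an assumption for illustration: the name classify_token, the limit of 2000 characters, the choice of 400 for damaged input, and the raw DEFLATE plus URL-safe base64 encoding. It does not reflect how the MW REST API actually handles long URLs.

```python
import base64
import binascii
import zlib
from typing import Optional, Tuple

# Hypothetical limit; the real value would have to match the server config.
MAX_TOKEN_LENGTH = 2000


def classify_token(token: str) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, decoded formula or None) for an encoded formula."""
    if len(token) > MAX_TOKEN_LENGTH:
        return 414, None  # URI Too Long
    try:
        padded = token + "=" * (-len(token) % 4)
        raw = base64.urlsafe_b64decode(padded)
        # A truncated raw DEFLATE stream makes zlib.decompress raise zlib.error.
        tex = zlib.decompress(raw, wbits=-15).decode("utf-8")
    except (binascii.Error, zlib.error, UnicodeDecodeError):
        return 400, None  # incomplete or damaged data
    return 200, tex
```

The point of the sketch is that a compressed stream has a defined end, so a cut-off URL is detected during decoding instead of silently yielding a shorter formula.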

When we replaced the database backend with the cache, I found that one could use a special type of cache (DB) that stores the data indefinitely. See https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Math/+/975432 . Thus, I don't fully understand the problem.

Setting up a database for permanent persistence adds operational overhead. For a permanent solution, that would be justified. For temporary backwards-compatibility, I don't think it would be.

We could do that, but defining and implementing the edge cases would require some work:

  1. On the server side, there need to be 414 responses if the client sends longer requests than we want to process. Maybe this is already handled by the MW REST API code, but it would need to be checked.
  2. We would need to run the check again on the second request.
  3. We would need to identify when we receive incomplete data.
  4. I am unsure if we can/should define a limit independent of the browser. According to StackOverflow, 2000 might be a good value, but maybe it's best to use the same limit as defined on the server side (1).
  5. We would need to add a new error message. Most examples will be missing closing math tags, e.g. <math> formula1 some long wikitext <math>formula2</math>

That all seems pretty solvable. Encoding the formula (gzdeflate+base64) doesn't save much space (maybe 20%), but it would make it less likely that people just "send their own", and would make it obvious when the data was truncated. If we want to be extra sure, we can add a checksum or even an HMAC (it doesn't need to be full length, since we don't need strong security).
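
A possible shape for the truncated HMAC idea, sketched in Python. The secret, the tag length of 8 hex characters, the "." separator and the function names are all made up for the example. The separator works only under the assumption that the encoded formula uses an alphabet without ".", such as URL-safe base64.

```python
import hashlib
import hmac
from typing import Optional

SECRET_KEY = b"hypothetical-secret"  # placeholder, would come from configuration
TAG_LENGTH = 8  # hex characters kept from the full HMAC-SHA256 digest


def _tag(token: str) -> str:
    digest = hmac.new(SECRET_KEY, token.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:TAG_LENGTH]


def sign(token: str) -> str:
    """Append a short HMAC tag to the encoded formula."""
    return f"{token}.{_tag(token)}"


def verify(signed: str) -> Optional[str]:
    """Return the encoded formula if the tag matches, otherwise None."""
    token, sep, tag = signed.rpartition(".")
    if not sep:
        return None
    # Compare as bytes so that non-ASCII input cannot raise a TypeError.
    if hmac.compare_digest(tag.encode("utf-8"), _tag(token).encode("ascii")):
        return token
    return None
```

A short tag like this would catch truncation and casual hand-crafted input, which is all that is asked for here. It is not meant as strong protection.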

As for running the check again: we only need to do that on a cache miss. The encoded version of the formula still serves as a cache key.
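
Roughly, the lookup could work like the following Python sketch. A plain dict stands in for the WAN cache, and decode and check are hypothetical callables standing in for the formula decoder and the mathoid-backed check. None of these names come from the Math extension.

```python
from typing import Callable, Dict


def get_checked(
    token: str,
    cache: Dict[str, dict],
    decode: Callable[[str], str],
    check: Callable[[str], dict],
) -> dict:
    """Return the check result for an encoded formula, re-checking only on a miss."""
    if token in cache:
        # Cache hit: the encoded formula itself is the key, no decoding needed.
        return cache[token]
    # Cache miss (for example after eviction): recover the formula from the
    # token, run the check again and repopulate the cache.
    result = check(decode(token))
    cache[token] = result
    return result
```

Unlike a hash, the token carries the formula itself, so an evicted entry can always be rebuilt from the request alone.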

Since we have to account for the size of other parts of the URL, I'd go for a limit of 1000 bytes encoded, or even less.