Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[css-syntax] Review requested of new Parsing text #8834

Open
tabatkins opened this issue May 12, 2023 · 21 comments
Open

[css-syntax] Review requested of new Parsing text #8834

tabatkins opened this issue May 12, 2023 · 21 comments

Comments

@tabatkins
Copy link
Member

tabatkins commented May 12, 2023

Since we resolved on the new Nesting behavior (try to parse as a declaration, then parse as a rule if that's invalid), I had to do some decent rewriting of Syntax's algorithms, and went ahead and dove into a larger rewrite to clean it up in general. I've implemented the new text in my CSS parser library and the (fairly limited, admittedly) tests I've run look good, but I'd appreciate a larger review.

Significant changes from the previous version:

  • Algorithm structure generally changed; rather than consuming a token and often reconsuming for another algorithm to deal with, it just always relies on lookahead and doesn't consume tokens until they're actually going to be used for certain. This should better resemble how an actual parser works. (I haven't changed the tokenizer to this structure, but doing so is probably a good idea at some point.)
  • Previously, I had "consume a list of rules" for rule+at-rules and "consume a list of declarations" for declarations+at-rules. Stylesheets and things like @media used "list of rules"; style rules and things like @font-face used "list of declarations". I've shifted all blocks to just use the new "consume a block's contents", and since stylesheets are now the only user of "list of rules", renamed it to "consume a stylesheet's contents" and specialized it to always ignore the CDO/CDC tokens.

Aside from allowing the new nesting behavior, all of these changes should be only editorial, with one exception: blocks that previously only contained rules (@media, @keyframes, etc) previously used the "consume a list of rules", but now use the unified "consume a block's contents", which means their error-recovery in the face of semicolons changes.

For example, @media { garbage; bar {...} } previously would contain a style rule with a garbage; bar selector. (This is what happens at the top-level of a stylesheet, still.) Now the rule's selector will be just bar, since the garbage; part will get dropped as an invalid declaration. This means that rules which were accidentally invalid and dropped due to garbage might now be valid, if there's a semicolon separating them from preceding garbage.

I suspect this is fine, and I'd really like it to be, because it means the overall parsing behavior doesn't need to branch on grammar knowledge (and thus, whether a rule is known or unknown won't change its generic parsing). It used to be the case that parsing depended on this kind of knowledge, and it was super awkward to use in tooling. Also, it means that parsing doesn't change between a top-level @media and a nested one, except for declarations becoming valid instead of invalid; all the rules inside of the @media remain precisely the same.

But if necessary, we can hardcode some at-rules to trigger a different parsing behavior that preserves backwards compatibility more completely.

(Technically parsing in general depends on grammar knowledge anyway, since you need to know whether a declaration is valid in a given context to tell if you should try and redo parsing as a rule. But it turns out there's a simple and reliable rule you can use generically to get approximately the right behavior without having to know anything about grammars.)

@tabatkins
Copy link
Member Author

Agenda+ to get approval on this, and ask for a new publication of Syntax.

@tabatkins
Copy link
Member Author

@cdoublev Please don't delete your comments, it's very confusing reading an email, coming here to answer the questions, and finding there's no comment anymore.

Restoring the comment:

In examples 12 and 13, I do not understand why @keyframes is not defined with <qualified-rule-list>, and why @page is defined with <declaration-rule-list> (in which at-rules - like margin rules - are automatically invalid).

Ah, it's because I'm dumb. @keyframes definitely should be <qualified-rule-list>, and I misdefined <declaration-rule-list> - it should allow declarations and at-rules, and disallow qualified rules. Both fixed now.

I am not sure what type of item represents <block-contents> and its specific sub-productions: is it a <simple-block> (ie. a component) or an undefined structure (containing declarations and/or rules)? The return value of consume a block's contents is a bit handwavey. But this may be intentional...

The return value isn't at all handwavey - it's a list of declarations and a list of rules. <block-contents> represents the contents of a block; it is not the block itself (it doesn't represent the wrapping {}). These productions aren't substantially different from what was there before, I just reshuffled them a little bit.

@cdoublev
Copy link
Collaborator

Please don't delete your comments, it's very confusing reading an email, coming here to answer the questions, and finding there's no comment anymore.

Sorry about that. My intent was to avoid hijacking the thread with a minor mistake and something I was wrong about, indeed.


<declaration-at-rule-list> is perhaps clearer than <declaration-rule-list>:

<declaration-rule-list>: declarations and at-rules are allowed; qualified-rules are automatically invalid.

...and more consistent with <at-rule-list>.


A link to consume a block's contents in the description of <block-contents> and its subproductions might also be usefull, perhaps in this paragraph:

The CSS parser is agnostic as to the contents of blocks—they’re all parsed with the same algorithm, and differentiate themselves solely by what constructs are valid.


5.4.5. Parse a list of declarations

To parse a block’s contents from input:

I think it should be "To parse a list of declarations from input:".

@tabatkins
Copy link
Member Author

My intent was to avoid hijacking the thread with a minor mistake and something I was wrong about, indeed.

If you'd been wrong about it I wouldn't have said anything, but I had to repost your comment because you did catch some mistakes.

A link to consume a block's contents

Done.

I think it should be "To parse a list of declarations from input:".

No, it's the heading that I didn't update. I fixed the algo name and the heading's own ID, but not the heading's text. ^_^ Fixed.

@emilio
Copy link
Collaborator

emilio commented Jun 4, 2023

I think we should treat error recovery in nested and unnested rules the same (#7961 (comment)), but other than that the new text looks good to me.

I've enabled this in Firefox Nightly in bug 1835066, fwiw.

@tabatkins
Copy link
Member Author

@emilio I'm a little confused - the error recovery is the same. I noted that as the only (non-Nesting related) normative change, because switching the algos around meant that I was changing the non-nested error-recovery (to match the nested error recovery).

@tabatkins
Copy link
Member Author

Unless you're referring to wanting to change top-level error-recovery to be the same as what you get inside of rule bodies? (afaict this isn't what you were asking for in your linked comment, but I can't tell what you're asking for otherwise)

@emilio
Copy link
Collaborator

emilio commented Jun 4, 2023

No, I meant that as long as we assure that all rule bodies recover errors the same regardless of whether we're nested on a style rule, we're good I believe.

Which is the case with the current spec text AFAICT, I just wanted to explicitly point out that, since that was one of the breaking changes I noticed while implementing this, sorry if I wasn't clear on that :)

@emilio
Copy link
Collaborator

emilio commented Jun 4, 2023

Sorry, I guess I shouldn't have worded it with "other than that" :)

@css-meeting-bot
Copy link
Member

The CSS Working Group just discussed [css-syntax] Review requested of new Parsing text, and agreed to the following:

  • RESOLVED: Accept parsing behavior change as described in issue
  • RESOLVED: Republish CSS Syntax as CRD
The full IRC log of that discussion <florian> TabAtkins: as part of refactorings for nesting, I realized that there is some common structures we could use to make more sense
<florian> TabAtkins: this results into 1 non editorial change
<fantasai> scribe+
<fantasai> TabAtkins: Rules that previously only contained rules, like @media, previously used an lagorithm that only cared about rules, not declarations
<fantasai> TabAtkins: now that they do care about declarations, at least in nested form -- ideally top-level form parses the same way -- and that causes an error-recovery change
<TabAtkins> @media { garbage; bar {...} }
<fantasai> TabAtkins: specifically ...
<fantasai> TabAtkins: this ^ previously would have resulted in a single rule, "garbage; bar { ... }"
<fantasai> TabAtkins: now it results in two statements, a "garbage;" statement and a "bar { ... }" statement
<fantasai> TabAtkins: only affects when there's a semicolon
<florian> q+
<fantasai> TabAtkins: but changes parsing on existing stylesheets
<emilio> q+
<fantasai> TabAtkins: question is if that's OK
<emilio> ack hober
<Rossen_> ack florian
<fantasai> florian: So you say previously we wouldn't discard "garbage;" and keep "bar { ... }"
<fantasai> florian: that seems better, but is it Web-compatible?
<fantasai> TabAtkins: Suspect so, since only small cases
<fantasai> florian: Doesn't seem unliely ppl will put in a media query and forgot selector and then add full rules with a selector later, and the garbage which used to gobble up next rule and thereby disabled it
<fantasai> florian: nolonger disables it
<fantasai> florian: seems plausible that it could be written by accident
<fantasai> florian: Let's try, but it would be possible it causes problems
<Rossen_> ack emilio
<fantasai> emilio: Fact that nested and unnested parse differently seems super weird, so +1 to try fixing this
<fantasai> emilio: I *think* that's how I implemented in Firefox, but might have been conservative when implementing, don't remember
<fantasai> Rossen_: Hearing support, any reasons to not resolve?
<fantasai> RESOLVED: Accept parsing behavior change as described in issue
<fantasai> TabAtkins: can I republish Syntax?
<fantasai> TabAtkins: all the changes should be captured in Changes list; will double-check before publishing
<fantasai> TabAtkins: still need to do review of tests that might need changing, but that's separate
<fantasai> Rossen_: Any objections to new CR?
<fantasai> RESOLVED: Republish CSS Syntax as CRD

@zstarvit
Copy link

I don't know if this is particularly the place to make a public comment, but I wanted to make sure that y'all are aware for the spec that there is legacy CSS code floating around that leans on certain CSS property declarations being invalid as a load bearing compatibility """feature""".

For example, older versions of Bootstrap contain declarations like *width: 100%; with the full understanding that *width is an invalid property and won't do anything on most browsers, but does do something in some browsers, notably IE7.

@tabatkins
Copy link
Member Author

Yes, that's very specifically still supported. It will fail to parse as both a property and a rule, and stop at the semicolon, exactly as it does today.

@cdoublev
Copy link
Collaborator

cdoublev commented Aug 10, 2023

Can you please clarify (not necessarily in the spec) the meaning of valid in the current context in consume a declaration, consume an at-rule, consume a qualified rule.

I first thought it was about what rules/properties/descriptors are accepted in the style sheet or rule:

The CSS parser is agnostic as to the contents of blocks [...].

Accompanying prose must define what is valid and invalid in this context.

But then I wondered if their grammar must also be valid, and how these two validations would address the new needs for parsing nested style rules.

Does it correspond to the validations in isValidInContext() in your CSS parser?

@tabatkins
Copy link
Member Author

I'm not sure how to clarify it further than what's already there. The current parsing context (some declaration or rule, possible with parent rules, etc) determines whether any given thing is valid, according to specific grammars and/or prose in the relevant specs.

The function in my CSS parser library is just a generic, relatively dumb version of this that will usually correctly tell apart a rule from a declaration, without knowing anything about what is actually allowed. You just need something that can spot that a:hover { color: blue; } is not a valid declaration, so it can get parsed properly as a rule.

@cdoublev
Copy link
Collaborator

You just need something that can spot that a:hover { color: blue; } is not a valid declaration, so it can get parsed properly as a rule.

Understood, thanks.

Assuming that consume an at-rule receives @import following a style rule, it would return nothing if valid in the context includes looking at the rules preceding @import. But CSSStyleSheet.insertRule() involves consume an at-rule and insert a CSS rule, which must throw HierarchyRequestError in this situation.

@tabatkins
Copy link
Member Author

Yeah. It's purposely somewhat handwavy about what "current context" is, because it's obvious in context what the necessary behavior is in any given algo and there's no real need to lock it down more precisely. It just needs a way to fail on invalid stuff.

@cdoublev
Copy link
Collaborator

I am not sure what type of item represents <block-contents> and its specific sub-productions: is it a simple block (ie. a component) or an undefined structure (containing declarations and/or rules)? The return value of consume a block's contents is a bit handwavey. But this may be intentional...

The return value isn't at all handwavey - it's a list of declarations and a list of rules. <block-contents> represents the contents of a block; it is not the block itself (it doesn't represent the wrapping {}). These productions aren't substantially different from what was there before, I just reshuffled them a little bit.

I think my point was that I find a bit inconsistent that the algorithms for consuming rules no longer process one level at a time, leaving lower levels as a list of component values. I would expect to process the block with parse a block’s contents before checking its grammar:

The CSS parser is agnostic as to the contents of blocks—they’re all parsed with the same algorithm

But it expects a list of component values, as it is only intended for an independent block, eg. style.cssText. It cannot handle a block resulting from parse a rule.

When rule blocks were consumed with consume a simple block while consuming a rule, the resulting simple block value was a list of component values to consume with the appropriate algorithm before checking the grammar.

Anyway, it is probably not important.

@cdoublev
Copy link
Collaborator

cdoublev commented Aug 20, 2023

You just need something that can spot that a:hover { color: blue; } is not a valid declaration, so it can get parsed properly as a rule.

Isn't it possible to implement something that can spot this in consume a declaration and consume a list of component values?

The former would run the latter with a flag when the declaration name is not a custom property. When this flag is set, consume a list of component values would return nothing when it processes <{-token>. In turn, consume a declaration would consume the remnants of a bad declaration and return nothing.

edit: 77e9601 now fulfills this requirement, so there may no longer be a need for consume a declaration, consume an at-rule, consume a qualified rule, to require that the declaration or rule be valid in the current context.

@tabatkins
Copy link
Member Author

I think my point was that I find a bit inconsistent that the algorithms for consuming rules no longer process one level at a time, leaving lower levels as a list of component values.

The only reason the old spec did so was because I didn't have a unified parsing model for all possible block contents. You had to know whether to parse a block's contents as "at-rules and declarations" or "at-rules and qualified rules", and that depended on what type of rule the block was in; it was easier to write the spec to have it delay this decision until later, when doing grammar-checking.

The new rules don't actually require having any grammar knowledge, because we've baked some restrictions into the grammar itself, so I was able to unify the two algorithms into one. (Knowing the grammar just lets you parse more efficiently; it doesn't change the end result.) So there was no longer any need to delay the parsing, and we could just get everything done in one pass.

But it expects a list of component values, as it is only intended for an independent block, eg. style.cssText. It cannot handle a block resulting from parse a rule.

Correct, but I'm not sure why that's relevant. The "parse" algorithms are the high-level algos called by other specs, while the "consume" algorithms are the low-level things that actually do the parsing. "Parse" algorithms call into "consume" ones; not vice versa.

so there may no longer be a need for consume a declaration, consume an at-rule, consume a qualified rule, to require that the declaration or rule be valid in the current context.

In theory, you can indeed remove the check from all of these places. You could just check, in "consume a block's contents", whether the returned declaration contains an improperly-used {}-block, and then do all the rest of the validity checking after parsing. But that wouldn't be very efficient; every top-level rule that starts like div:hover would consume the entire rest of the stylesheet as a declaration, then the parser would realize it contains a {}-block in the wrong place, and re-parse a single rule from it. Then do the same thing again, and again, and again. The end result would be identical to one that checks validity as it goes, as the spec currently says, it would just be massively less efficient.

So, it's better to write the spec in a way that guides implementations to do the better thing. This also ensures that if we did accidentally make a change that would cause the "check validity now" and "check validity later" impls to diverge, the "check validity now" impls would stay correct; the spec wouldn't be accidentally requiring impls to change to the less efficient method.

@cdoublev
Copy link
Collaborator

cdoublev commented Jan 5, 2024

[The "parse" algorithms do not expect high level objects as input]

Correct, but I'm not sure why that's relevant. [...]

I thought that the "parse [a CSS]" algos could be used recursively. This is no longer possible with the unified model, because nested contents are already consumed.

You also cannot use parse something according to a CSS grammar with a <block-contents> subproduction anymore, because this entry point consumes a list of component values, which consumes simple blocks, whereas consume a block's contents expects tokens and consumes lists of rules and declarations.


every top-level rule that starts like div:hover would consume the entire rest of the stylesheet as a declaration, then the parser would realize it contains a {}-block in the wrong place

In my comment, I suggested to stop consuming a list of component values (for a non-custom property value) when encoutering the { following hover in the declaration value. However, I was wrong about consuming the remnants of the (bad) declaration: the algo would just return nothing, and consume a block's contents would backtrack to div and parse a rule.


But for both of these concerns, the current spec is fine with me. I understand that this is not a detailed implementation guide. When you try to conform to it as closely as possible, it can just be tricky to put pieces together sometimes.

Thank you for your clarifications.

@AtkinsSJ
Copy link
Contributor

I hope feedback is still welcome! I have a few notes from implementing this myself last week:

  • "Consume a block's contents" loses declarations, see [css-syntax-3] "Consume a block's contents" leaves decls list with  #11017.
  • "Consume a list of component values" has the parameters listed as input, stop-token, nested, but the usage in "Consume a declaration" step 5 has them in input, nested, stop-token order. The other usage in "parse a comma-separated list of component values" is correct.
  • Algorithms are inconsistent between checking if the next token is EOF, or checking if the token stream is empty. These are functionally the same so it would be nice if they all checked for empty.
  • I think it would help if "valid in the current context" had a linked-to definition somewhere, even if that definition is brief and says that it depends on the current rule. (Explicit instructions for keeping track of context would be a nice too. 😉 )
  • The definition for at-rule says it contains a list of declarations and a list of child rules, but "Consume an at-rule" only ever sets its child rules, using "Consume a block". I think this could be clearer. There is a note about materializing into the CSSOM, but really, it either needs handling before assigning to child rules, or child rules needs to optionally hold lists of declarations until the at-rule gets materialized later. I would be tempted to replace the note with an explicit "assign the contents to child rules and declarations, depending on the rule" step.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants