Skip to content

kernel: schedule upcall, bool -> Result#2656

Merged
bors[bot] merged 4 commits into
masterfrom
upcall-schedule-as-result
Jul 13, 2021
Merged

kernel: schedule upcall, bool -> Result#2656
bors[bot] merged 4 commits into
masterfrom
upcall-schedule-as-result

Conversation

@ppannuto

@ppannuto ppannuto commented Jul 7, 2021

Copy link
Copy Markdown
Member

Pull Request Overview

Following up on #2639 (comment), this looks at changing the signature of scheduling upcalls to return a Result rather than a simple bool.

To start, this just looks at the kernel interfaces (i.e. this introduces 110 "unused Result" warnings in capsule code right now).

One interesting thing surfaced is that IPC does use the result of upcall scheduling to report to the IPC sender whether the receiver was scheduled successfully or not. Previously, this treated scheduling the null upcall as a failure. I'm not 100% sure that's right though -- a receiver has every right (although it would be odd) to ignore notifications. I used the new error code granularity to separate out this case right now; though maybe the best solution is to simply bubble up the error all the way to userspace for IPC.


This commit just changes the kernel interfaces, it does not address
all of the new unused Result warnings at all of the capsule scheduling
call sites.

Testing Strategy

N/A yet [compiling]

TODO or Help Wanted

Do these error code assignments make sense?

Should IPC treat scheduling a null upcall as an error or success?

Should I move forward with the rest of the capsules?

Documentation Updated

  • Updated the relevant files in /docs, or no updates are required.

Formatting

  • [] Ran make prepush.

@github-actions github-actions Bot added kernel tock-libraries This affects libraries supported by the Tock project labels Jul 7, 2021
@ppannuto ppannuto force-pushed the upcall-schedule-as-result branch from aad91a9 to 89af0c0 Compare July 7, 2021 17:13
@lschuermann

lschuermann commented Jul 7, 2021

Copy link
Copy Markdown
Member

One interesting thing surfaced is that IPC does use the result of upcall scheduling to report to the IPC sender whether the receiver was scheduled successfully or not. Previously, this treated scheduling the null upcall as a failure.

I'm reasonably confident that we, at least somewhere in Tock 2.0 development, decided to not tell capsules whether an Upcall was properly scheduled at all (that is, whether a process task was enqueued). As far as I remember, the boolean return value depended only on whether the process was still alive, and not on the fact whether the upcall in question is currently a null-upcall. @phil-levis is that correct?

Currently, schedule_upcall will return false if this is a null-upcall. Maybe this confusion around error handling (and it having changed over time) led to these implementation issues?

@bradjc

bradjc commented Jul 7, 2021

Copy link
Copy Markdown
Contributor

For reference, reasons schedule() can fail:

  1. The process is not valid
  2. The upcall is null
  3. The subscribe num is invalid
  4. The tasks array in the process was taken and not replaced
  5. There is no room in the tasks queue

In practice, (1) cannot happen, because the grant.enter() would fail first, and the process has to remain valid while the grant is entered, and (4) won't happen, because that would be a kernel bug that must be fixed and everything would break immediately.

For (3), subscribe_num is almost certainly going to be a hardcoded value (all instances today are hardcoded), and I don't see any value in a runtime check.

For (2) I agree capsules shouldn't do anything in response to a null upcall.

Then (5) is the only case I could see returning an error, but I don't see what a capsule could do with that information.

@ppannuto

ppannuto commented Jul 7, 2021

Copy link
Copy Markdown
Member Author
  1. There is no room in the tasks queue
    Then (5) is the only case I could see returning an error, but I don't see what a capsule could do with that information.

Here again I think IPC might be the interesting kid on the block. In particular, if process A does something, and wants to notify process B via IPC, it does so, enqueues a task and all is happy. If A keeps running, and during this same quanta (or just generally before B gets to run, due to process priorities, etc) and does another something which requires notifying B via IPC, then [eventually] B will need to run and drain IPC events. If A finds it is generating many events that B never responds to, it may wish to yield or otherwise ask the scheduler to lower its priority so that B can run.

@lschuermann

Copy link
Copy Markdown
Member

@bradjc I generally agree with your assessment, although I'm not sure about (3). You're right in that the subscribe number is hardcoded in current capsules, but given we're providing interfaces to downstream developers and this parameter could well be determined at runtime -- regardless of whether there is a good reason to do so -- I'm of the opinion we should return an error in this case.

If we are returning an error in one case, we might as well inform the capsule about all cases except (4), as that would indicate a serious kernel bug and might even justify a panic, and (2), which is to say that a null-callback is defined as a valid callback which is simply never delivered or from the capsules point of view ignored by the process.

@bradjc

bradjc commented Jul 7, 2021

Copy link
Copy Markdown
Contributor

I can certainly see the argument "wait I passed in an invalid subscribe_num and didn't get an error?!?". I propose then returning Result<(), UpcallError>:

enum UpcallError {
  InvalidSubscribeNum,
  QueueFull,
  // Process is inactive (can never happen) or process.tasks was empty (won't ever happen).
  // If either _did_ happen, well that would be a kernel error.
  // We do this in ProcessError to solve this same problem, and make it clear
  // that users of the error type shouldn't try to handle this case.
  KernelError,
}

@lschuermann

lschuermann commented Jul 8, 2021

Copy link
Copy Markdown
Member

@bradjc I like that approach. I've taken at stab at implementing it here: ce9c084. If @ppannuto doesn't mind I can push it to this branch.

I wasn't sure whether UpcallError should be in kernel::grant or kernel::upcall, both seemed equally fitting giving that GrantUpcallTable is in kernel::grant. Furthermore, we need to export this enum, as otherwise user's will have no way to properly use it, for instance in a match expression.

Comment thread kernel/src/process_standard.rs Outdated
@alexandruradovici

Copy link
Copy Markdown
Contributor

Then (5) is the only case I could see returning an error, but I don't see what a capsule could do with that information.

The touch driver receives a lot of callbacks from the underlying hardware driver as a result of the movement of fingers on the touch panel's surface. If the application is not processing callbacks fast enough, the task queue will be filled. The driver counts the dropped upcalls and reports them to the application. In turn, the application can decide to prioritize the callbacks.

@bradjc

bradjc commented Jul 8, 2021

Copy link
Copy Markdown
Contributor

@bradjc I like that approach. I've taken at stab at implementing it here: ce9c084. If @ppannuto doesn't mind I can push it to this branch.

I say push, we can always remove the commit.

I wasn't sure whether UpcallError should be in kernel::grant or kernel::upcall, both seemed equally fitting giving that GrantUpcallTable is in kernel::grant. Furthermore, we need to export this enum, as otherwise user's will have no way to properly use it, for instance in a match expression.

I would say upcall.rs, and that upcall things being in grant.rs is the exception we would otherwise like to avoid.

@bradjc

bradjc commented Jul 8, 2021

Copy link
Copy Markdown
Contributor

I'm marking this 2.0 because I think we either do it now while we are thinking about it and change capsules, or we leave it -> bool.

@bradjc bradjc added the tock-2.0 Issues and PRs related to Tock version 2.0. label Jul 8, 2021
@lschuermann lschuermann force-pushed the upcall-schedule-as-result branch from 89af0c0 to ce9c084 Compare July 8, 2021 21:40
@lschuermann

lschuermann commented Jul 8, 2021

Copy link
Copy Markdown
Member

I say push, we can always remove the commit.

Pushed my changes. This still does not address the unused Results. @ppannuto happy to fix if people are otherwise happy with the interface.

I would say upcall.rs

Makes sense. Will move it there.

Comment thread kernel/src/grant.rs Outdated
Comment on lines +203 to +210
/// A kernel-internal invariant has been violated.
///
/// This error should never happen. It can be returned if the
/// process is inactive (which should be caught by
/// [`Grant::enter`]) or `process.tasks` was taken.
///
/// These cases cannot be reasonably handled.
KernelError,

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this error worth including? I am imaging me as a handler of this code -- what do I do? The options are

  1. Just ignore the error and hope (same as if an error had not been returned)
  2. panic!, the world is broken

I think that making capsule authors choose between these will likely just result in a lot of arbitrary coin flips between these two options.

@lschuermann lschuermann Jul 8, 2021

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the issue is that panic!ing in the kernel is not an option (even if these error cases were to occur, the system might not be in an entirely unrecoverable state), whereas returning Ok(()) as if the callback was scheduled is obviously misleading.

This is the cheap way out, blaming the capsule code for either panic!ing or misbehaving in response to this error. :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's convenient to include the error, because rust forces us to handle cases that can "never" happen. I think it is better to have the error than do something like hardcode panics into the core kernel binary which will never be triggered, but still take up space.

From a capsule author's point of view, I think it's pretty clear that it's not an error they should handle, but could report upstream to us if they ever saw it.

However, we do not want capsule authors to include a debug or panic, because that just re-creates the size issue (and probably worse). But that is also true of handling this return value at all, unless a capsule can do something meaningful it should just ignore the return value.

Comment thread kernel/src/ipc.rs
Comment thread kernel/src/upcall.rs Outdated
Comment thread kernel/src/process_standard.rs Outdated
Comment thread kernel/src/process_standard.rs
@phil-levis

Copy link
Copy Markdown
Contributor

For reference, reasons schedule() can fail:
...
5. There is no room in the tasks queue
...
Then (5) is the only case I could see returning an error, but I don't see what a capsule could do with that information.

I agree. I am worried that making this a Result is going to add a lot of boilerplate. The only real answer a capsule might have is to mark the upcall as pending and try to issue it again, e.g., after a timeout or on next invocation. How many capsules handle this case? If the answer is zero, then I'd suggest changing it to no return value until we have a case to work with. It might be, for example, that now that we count the number of upcalls we create a reserved upcall queue, such that an upcall fails iff there is a copy of that upcall pending. If so, the error cases for UpcallError are going to be different.

@bradjc

bradjc commented Jul 9, 2021

Copy link
Copy Markdown
Contributor

I agree. I am worried that making this a Result is going to add a lot of boilerplate.

Well, the boilerplate is only adding .ok() to the end of the schedule calls. I pushed a commit which changes capsules to prevent the warning for unhandled result so you can see the change.

The only real answer a capsule might have is to mark the upcall as pending and try to issue it again, e.g., after a timeout or on next invocation. How many capsules handle this case? If the answer is zero, then I'd suggest changing it to no return value until we have a case to work with. It might be, for example, that now that we count the number of upcalls we create a reserved upcall queue, such that an upcall fails iff there is a copy of that upcall pending. If so, the error cases for UpcallError are going to be different.

No capsules use the return value today. However, if we make change the return to Result now, then we can update the capsule interface now as part of 2.0, and if we expand the error types later we don't have to go back and change any capsules that do not need to handle schedule errors.

@phil-levis

phil-levis commented Jul 9, 2021

Copy link
Copy Markdown
Contributor

I agree. I am worried that making this a Result is going to add a lot of boilerplate.

Well, the boilerplate is only adding .ok() to the end of the schedule calls. I pushed a commit which changes capsules to prevent the warning for unhandled result so you can see the change.

I think I am ok with that. :)

Is there any increase in code size? I've become a bit wary of flexibility that we might need in the future that increases code size now. If the increase is minimal then I'm on board.

@bradjc bradjc force-pushed the upcall-schedule-as-result branch from 02ab15b to 9359719 Compare July 9, 2021 14:53
@github-actions github-actions Bot removed the tock-libraries This affects libraries supported by the Tock project label Jul 9, 2021
@bradjc bradjc changed the title kernel: schedule upcall, bool -> Result (WIP) kernel: schedule upcall, bool -> Result Jul 9, 2021
@bradjc bradjc marked this pull request as ready for review July 9, 2021 14:57
@bradjc bradjc force-pushed the upcall-schedule-as-result branch from b06daed to 6f7fd27 Compare July 9, 2021 15:01
@bradjc

bradjc commented Jul 9, 2021

Copy link
Copy Markdown
Contributor

Is there any increase in code size? I've become a bit wary of flexibility that we might need in the future that increases code size now. If the increase is minimal then I'm on board.

I think it saves space. Looking through https://github.com/tock/tock/pull/2656/checks?check_run_id=3030271308 a lot of the numbers look smaller.

@lschuermann lschuermann force-pushed the upcall-schedule-as-result branch from 21d1568 to 78af01a Compare July 9, 2021 15:48
@bradjc

bradjc commented Jul 12, 2021

Copy link
Copy Markdown
Contributor

I think this is waiting on @ppannuto to review his own PR with the latest changes.

@ppannuto ppannuto left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You go away for one long weekend and your PR flies on its own :P

Looks good to me overall; one small interface question

Comment thread kernel/src/process_standard.rs Outdated
lschuermann and others added 4 commits July 13, 2021 13:58
Previously this function only returned a boolean, indicating whether
enqueueing the task worked. Users of this function might be interested
in the precise reason why enqueueing the task failed. In particular,
whether the process is no longer alive, the process' task queue is
full, or some kernel-internal error occurred.

Signed-off-by: Leon Schuermann <leon@is.currently.online>
Return a Result on GrantUpcallTable::schedule_upcall, indicating
whether an error occurred and the precise reason for the error. Users
of this API may utilize this information to, for example, adjust the
rate at which Upcalls are scheduled or reschedule an Upcall which has
been lost due to the process' task queue being full.

This commit just changes the kernel interfaces, it does not address
all of the new unused Result warnings at all of the capsule scheduling
call sites.

Co-authored-by: Leon Schuermann <leon@is.currently.online>
No cases where we need to handle a failed schedule.
@bradjc bradjc force-pushed the upcall-schedule-as-result branch from 78af01a to 0f09cd2 Compare July 13, 2021 17:58

@ppannuto ppannuto left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

But I would if I coult

@ppannuto ppannuto added the last-call Final review period for a pull request. label Jul 13, 2021
@bradjc

bradjc commented Jul 13, 2021

Copy link
Copy Markdown
Contributor

bors r+

@bors

bors Bot commented Jul 13, 2021

Copy link
Copy Markdown
Contributor

@bors bors Bot merged commit 9293a7f into master Jul 13, 2021
@bors bors Bot deleted the upcall-schedule-as-result branch July 13, 2021 18:48
sirchnik pushed a commit to sirchnik/tock that referenced this pull request May 12, 2026
2656: kernel: schedule upcall, bool -> Result r=bradjc a=ppannuto

### Pull Request Overview

Following up on tock#2639 (comment), this looks at changing the signature of scheduling upcalls to return a `Result` rather than a simple `bool`.

To start, this just looks at the kernel interfaces (i.e. this introduces 110 "unused Result" warnings in capsule code right now).

One interesting thing surfaced is that IPC does use the result of upcall scheduling to report to the IPC sender whether the receiver was scheduled successfully or not. Previously, this treated scheduling the null upcall as a failure. I'm not 100% sure that's right though -- a receiver has every right (although it would be odd) to ignore notifications. I used the new error code granularity to separate out this case right now; though maybe the best solution is to simply bubble up the error all the way to userspace for IPC.

---

This commit just changes the kernel interfaces, it does not address
all of the new unused Result warnings at all of the capsule scheduling
call sites.

### Testing Strategy

N/A yet [compiling]

### TODO or Help Wanted

Do these error code assignments make sense?

Should IPC treat scheduling a null upcall as an error or success?

Should I move forward with the rest of the capsules?

### Documentation Updated

- [x] Updated the relevant files in `/docs`, or no updates are required.

### Formatting

- [] Ran `make prepush`.


Co-authored-by: Leon Schuermann <leon@is.currently.online>
Co-authored-by: Pat Pannuto <pat.pannuto@gmail.com>
Co-authored-by: Brad Campbell <bradjc5@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kernel last-call Final review period for a pull request. tock-2.0 Issues and PRs related to Tock version 2.0.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants