kernel: schedule upcall, bool -> Result by ppannuto · Pull Request #2656 · tock/tock

ppannuto · 2021-07-07T17:11:20Z

Pull Request Overview

Following up on #2639 (comment), this looks at changing the signature of scheduling upcalls to return a Result rather than a simple bool.

To start, this just looks at the kernel interfaces (i.e. this introduces 110 "unused Result" warnings in capsule code right now).

One interesting thing surfaced is that IPC does use the result of upcall scheduling to report to the IPC sender whether the receiver was scheduled successfully or not. Previously, this treated scheduling the null upcall as a failure. I'm not 100% sure that's right though -- a receiver has every right (although it would be odd) to ignore notifications. I used the new error code granularity to separate out this case right now; though maybe the best solution is to simply bubble up the error all the way to userspace for IPC.

This commit just changes the kernel interfaces, it does not address
all of the new unused Result warnings at all of the capsule scheduling
call sites.

Testing Strategy

N/A yet [compiling]

TODO or Help Wanted

Do these error code assignments make sense?

Should IPC treat scheduling a null upcall as an error or success?

Should I move forward with the rest of the capsules?

Documentation Updated

Updated the relevant files in /docs, or no updates are required.

Formatting

[] Ran make prepush.

lschuermann · 2021-07-07T17:43:00Z

One interesting thing surfaced is that IPC does use the result of upcall scheduling to report to the IPC sender whether the receiver was scheduled successfully or not. Previously, this treated scheduling the null upcall as a failure.

I'm reasonably confident that we, at least somewhere in Tock 2.0 development, decided to not tell capsules whether an Upcall was properly scheduled at all (that is, whether a process task was enqueued). As far as I remember, the boolean return value depended only on whether the process was still alive, and not on the fact whether the upcall in question is currently a null-upcall. @phil-levis is that correct?

Currently, schedule_upcall will return false if this is a null-upcall. Maybe this confusion around error handling (and it having changed over time) led to these implementation issues?

bradjc · 2021-07-07T18:00:57Z

For reference, reasons schedule() can fail:

The process is not valid
The upcall is null
The subscribe num is invalid
The tasks array in the process was taken and not replaced
There is no room in the tasks queue

In practice, (1) cannot happen, because the grant.enter() would fail first, and the process has to remain valid while the grant is entered, and (4) won't happen, because that would be a kernel bug that must be fixed and everything would break immediately.

For (3), subscribe_num is almost certainly going to be a hardcoded value (all instances today are hardcoded), and I don't see any value in a runtime check.

For (2) I agree capsules shouldn't do anything in response to a null upcall.

Then (5) is the only case I could see returning an error, but I don't see what a capsule could do with that information.

ppannuto · 2021-07-07T19:52:07Z

There is no room in the tasks queue
Then (5) is the only case I could see returning an error, but I don't see what a capsule could do with that information.

Here again I think IPC might be the interesting kid on the block. In particular, if process A does something, and wants to notify process B via IPC, it does so, enqueues a task and all is happy. If A keeps running, and during this same quanta (or just generally before B gets to run, due to process priorities, etc) and does another something which requires notifying B via IPC, then [eventually] B will need to run and drain IPC events. If A finds it is generating many events that B never responds to, it may wish to yield or otherwise ask the scheduler to lower its priority so that B can run.

lschuermann · 2021-07-07T20:14:51Z

@bradjc I generally agree with your assessment, although I'm not sure about (3). You're right in that the subscribe number is hardcoded in current capsules, but given we're providing interfaces to downstream developers and this parameter could well be determined at runtime -- regardless of whether there is a good reason to do so -- I'm of the opinion we should return an error in this case.

If we are returning an error in one case, we might as well inform the capsule about all cases except (4), as that would indicate a serious kernel bug and might even justify a panic, and (2), which is to say that a null-callback is defined as a valid callback which is simply never delivered or from the capsules point of view ignored by the process.

bradjc · 2021-07-07T20:56:04Z

I can certainly see the argument "wait I passed in an invalid subscribe_num and didn't get an error?!?". I propose then returning Result<(), UpcallError>:

enum UpcallError {
  InvalidSubscribeNum,
  QueueFull,
  // Process is inactive (can never happen) or process.tasks was empty (won't ever happen).
  // If either _did_ happen, well that would be a kernel error.
  // We do this in ProcessError to solve this same problem, and make it clear
  // that users of the error type shouldn't try to handle this case.
  KernelError,
}

lschuermann · 2021-07-08T07:27:51Z

@bradjc I like that approach. I've taken at stab at implementing it here: ce9c084. If @ppannuto doesn't mind I can push it to this branch.

I wasn't sure whether UpcallError should be in kernel::grant or kernel::upcall, both seemed equally fitting giving that GrantUpcallTable is in kernel::grant. Furthermore, we need to export this enum, as otherwise user's will have no way to properly use it, for instance in a match expression.

alexandruradovici · 2021-07-08T10:33:30Z

Then (5) is the only case I could see returning an error, but I don't see what a capsule could do with that information.

The touch driver receives a lot of callbacks from the underlying hardware driver as a result of the movement of fingers on the touch panel's surface. If the application is not processing callbacks fast enough, the task queue will be filled. The driver counts the dropped upcalls and reports them to the application. In turn, the application can decide to prioritize the callbacks.

bradjc · 2021-07-08T20:15:01Z

@bradjc I like that approach. I've taken at stab at implementing it here: ce9c084. If @ppannuto doesn't mind I can push it to this branch.

I say push, we can always remove the commit.

I wasn't sure whether UpcallError should be in kernel::grant or kernel::upcall, both seemed equally fitting giving that GrantUpcallTable is in kernel::grant. Furthermore, we need to export this enum, as otherwise user's will have no way to properly use it, for instance in a match expression.

I would say upcall.rs, and that upcall things being in grant.rs is the exception we would otherwise like to avoid.

bradjc · 2021-07-08T20:17:55Z

I'm marking this 2.0 because I think we either do it now while we are thinking about it and change capsules, or we leave it -> bool.

lschuermann · 2021-07-08T21:41:59Z

I say push, we can always remove the commit.

Pushed my changes. This still does not address the unused Results. @ppannuto happy to fix if people are otherwise happy with the interface.

I would say upcall.rs

Makes sense. Will move it there.

ppannuto · 2021-07-08T21:51:56Z

+    /// A kernel-internal invariant has been violated.
+    ///
+    /// This error should never happen. It can be returned if the
+    /// process is inactive (which should be caught by
+    /// [`Grant::enter`]) or `process.tasks` was taken.
+    ///
+    /// These cases cannot be reasonably handled.
+    KernelError,


Is this error worth including? I am imaging me as a handler of this code -- what do I do? The options are

Just ignore the error and hope (same as if an error had not been returned)

panic!, the world is broken

I think that making capsule authors choose between these will likely just result in a lot of arbitrary coin flips between these two options.

I suppose the issue is that panic!ing in the kernel is not an option (even if these error cases were to occur, the system might not be in an entirely unrecoverable state), whereas returning Ok(()) as if the callback was scheduled is obviously misleading.

This is the cheap way out, blaming the capsule code for either panic!ing or misbehaving in response to this error. :)

It's convenient to include the error, because rust forces us to handle cases that can "never" happen. I think it is better to have the error than do something like hardcode panics into the core kernel binary which will never be triggered, but still take up space.

From a capsule author's point of view, I think it's pretty clear that it's not an error they should handle, but could report upstream to us if they ever saw it.

However, we do not want capsule authors to include a debug or panic, because that just re-creates the size issue (and probably worse). But that is also true of handling this return value at all, unless a capsule can do something meaningful it should just ignore the return value.

phil-levis · 2021-07-09T02:54:19Z

For reference, reasons schedule() can fail:
...
5. There is no room in the tasks queue
...
Then (5) is the only case I could see returning an error, but I don't see what a capsule could do with that information.

I agree. I am worried that making this a Result is going to add a lot of boilerplate. The only real answer a capsule might have is to mark the upcall as pending and try to issue it again, e.g., after a timeout or on next invocation. How many capsules handle this case? If the answer is zero, then I'd suggest changing it to no return value until we have a case to work with. It might be, for example, that now that we count the number of upcalls we create a reserved upcall queue, such that an upcall fails iff there is a copy of that upcall pending. If so, the error cases for UpcallError are going to be different.

bradjc · 2021-07-09T12:30:43Z

I agree. I am worried that making this a Result is going to add a lot of boilerplate.

Well, the boilerplate is only adding .ok() to the end of the schedule calls. I pushed a commit which changes capsules to prevent the warning for unhandled result so you can see the change.

The only real answer a capsule might have is to mark the upcall as pending and try to issue it again, e.g., after a timeout or on next invocation. How many capsules handle this case? If the answer is zero, then I'd suggest changing it to no return value until we have a case to work with. It might be, for example, that now that we count the number of upcalls we create a reserved upcall queue, such that an upcall fails iff there is a copy of that upcall pending. If so, the error cases for UpcallError are going to be different.

No capsules use the return value today. However, if we make change the return to Result now, then we can update the capsule interface now as part of 2.0, and if we expand the error types later we don't have to go back and change any capsules that do not need to handle schedule errors.

phil-levis · 2021-07-09T14:47:59Z

I agree. I am worried that making this a Result is going to add a lot of boilerplate.

Well, the boilerplate is only adding .ok() to the end of the schedule calls. I pushed a commit which changes capsules to prevent the warning for unhandled result so you can see the change.

I think I am ok with that. :)

Is there any increase in code size? I've become a bit wary of flexibility that we might need in the future that increases code size now. If the increase is minimal then I'm on board.

bradjc · 2021-07-09T15:23:15Z

Is there any increase in code size? I've become a bit wary of flexibility that we might need in the future that increases code size now. If the increase is minimal then I'm on board.

I think it saves space. Looking through https://github.com/tock/tock/pull/2656/checks?check_run_id=3030271308 a lot of the numbers look smaller.

bradjc · 2021-07-12T21:26:27Z

I think this is waiting on @ppannuto to review his own PR with the latest changes.

ppannuto

You go away for one long weekend and your PR flies on its own :P

Looks good to me overall; one small interface question

Previously this function only returned a boolean, indicating whether enqueueing the task worked. Users of this function might be interested in the precise reason why enqueueing the task failed. In particular, whether the process is no longer alive, the process' task queue is full, or some kernel-internal error occurred. Signed-off-by: Leon Schuermann <leon@is.currently.online>

Return a Result on GrantUpcallTable::schedule_upcall, indicating whether an error occurred and the precise reason for the error. Users of this API may utilize this information to, for example, adjust the rate at which Upcalls are scheduled or reschedule an Upcall which has been lost due to the process' task queue being full. This commit just changes the kernel interfaces, it does not address all of the new unused Result warnings at all of the capsule scheduling call sites. Co-authored-by: Leon Schuermann <leon@is.currently.online>

No cases where we need to handle a failed schedule.

ppannuto

But I would if I coult

bradjc · 2021-07-13T18:33:21Z

bors r+

bors · 2021-07-13T18:48:06Z

Build succeeded:

2656: kernel: schedule upcall, bool -> Result r=bradjc a=ppannuto ### Pull Request Overview Following up on tock#2639 (comment), this looks at changing the signature of scheduling upcalls to return a `Result` rather than a simple `bool`. To start, this just looks at the kernel interfaces (i.e. this introduces 110 "unused Result" warnings in capsule code right now). One interesting thing surfaced is that IPC does use the result of upcall scheduling to report to the IPC sender whether the receiver was scheduled successfully or not. Previously, this treated scheduling the null upcall as a failure. I'm not 100% sure that's right though -- a receiver has every right (although it would be odd) to ignore notifications. I used the new error code granularity to separate out this case right now; though maybe the best solution is to simply bubble up the error all the way to userspace for IPC. --- This commit just changes the kernel interfaces, it does not address all of the new unused Result warnings at all of the capsule scheduling call sites. ### Testing Strategy N/A yet [compiling] ### TODO or Help Wanted Do these error code assignments make sense? Should IPC treat scheduling a null upcall as an error or success? Should I move forward with the rest of the capsules? ### Documentation Updated - [x] Updated the relevant files in `/docs`, or no updates are required. ### Formatting - [] Ran `make prepush`. Co-authored-by: Leon Schuermann <leon@is.currently.online> Co-authored-by: Pat Pannuto <pat.pannuto@gmail.com> Co-authored-by: Brad Campbell <bradjc5@gmail.com>

github-actions Bot added kernel tock-libraries This affects libraries supported by the Tock project labels Jul 7, 2021

ppannuto mentioned this pull request Jul 7, 2021

Tock 2.0: implement Upcall swapping restrictions (v4) #2639

Merged

6 tasks

ppannuto force-pushed the upcall-schedule-as-result branch from aad91a9 to 89af0c0 Compare July 7, 2021 17:13

lschuermann reviewed Jul 8, 2021

View reviewed changes

Comment thread kernel/src/process_standard.rs Outdated

bradjc added the tock-2.0 Issues and PRs related to Tock version 2.0. label Jul 8, 2021

lschuermann force-pushed the upcall-schedule-as-result branch from 89af0c0 to ce9c084 Compare July 8, 2021 21:40

ppannuto commented Jul 8, 2021

View reviewed changes

bradjc reviewed Jul 8, 2021

View reviewed changes

Comment thread kernel/src/process_standard.rs Outdated

Comment thread kernel/src/process_standard.rs

bradjc force-pushed the upcall-schedule-as-result branch from 02ab15b to 9359719 Compare July 9, 2021 14:53

github-actions Bot removed the tock-libraries This affects libraries supported by the Tock project label Jul 9, 2021

bradjc changed the title ~~kernel: schedule upcall, bool -> Result (WIP)~~ kernel: schedule upcall, bool -> Result Jul 9, 2021

bradjc marked this pull request as ready for review July 9, 2021 14:57

bradjc force-pushed the upcall-schedule-as-result branch from b06daed to 6f7fd27 Compare July 9, 2021 15:01

lschuermann force-pushed the upcall-schedule-as-result branch from 21d1568 to 78af01a Compare July 9, 2021 15:48

bradjc added the needs-rebase label Jul 12, 2021

ppannuto commented Jul 13, 2021

View reviewed changes

Comment thread kernel/src/process_standard.rs Outdated

lschuermann and others added 4 commits July 13, 2021 13:58

capsules: suppress schedule_upcall warning

57c20bb

No cases where we need to handle a failed schedule.

kernel: ipc: supress warning from schedule()

0f09cd2

bradjc force-pushed the upcall-schedule-as-result branch from 78af01a to 0f09cd2 Compare July 13, 2021 17:58

bradjc removed the needs-rebase label Jul 13, 2021

bradjc approved these changes Jul 13, 2021

View reviewed changes

ppannuto commented Jul 13, 2021

View reviewed changes

ppannuto added the last-call Final review period for a pull request. label Jul 13, 2021

bors Bot merged commit 9293a7f into master Jul 13, 2021

bors Bot deleted the upcall-schedule-as-result branch July 13, 2021 18:48

Uh oh!

Conversation

ppannuto commented Jul 7, 2021

Pull Request Overview

Testing Strategy

TODO or Help Wanted

Documentation Updated

Formatting

Uh oh!

lschuermann commented Jul 7, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bradjc commented Jul 7, 2021

Uh oh!

ppannuto commented Jul 7, 2021

Uh oh!

lschuermann commented Jul 7, 2021

Uh oh!

bradjc commented Jul 7, 2021

Uh oh!

lschuermann commented Jul 8, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

alexandruradovici commented Jul 8, 2021

Uh oh!

bradjc commented Jul 8, 2021

Uh oh!

bradjc commented Jul 8, 2021

Uh oh!

lschuermann commented Jul 8, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ppannuto Jul 8, 2021

Choose a reason for hiding this comment

Uh oh!

lschuermann Jul 8, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bradjc Jul 8, 2021

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

phil-levis commented Jul 9, 2021

Uh oh!

bradjc commented Jul 9, 2021

Uh oh!

phil-levis commented Jul 9, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bradjc commented Jul 9, 2021

Uh oh!

bradjc commented Jul 12, 2021

Uh oh!

ppannuto left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ppannuto left a comment

Choose a reason for hiding this comment

Uh oh!

bradjc commented Jul 13, 2021

Uh oh!

bors Bot commented Jul 13, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

lschuermann commented Jul 7, 2021 •

edited

Loading

lschuermann commented Jul 8, 2021 •

edited

Loading

lschuermann commented Jul 8, 2021 •

edited

Loading

lschuermann Jul 8, 2021 •

edited

Loading

phil-levis commented Jul 9, 2021 •

edited

Loading