16:29:32 #startmeeting fedora_coreos_meeting
16:29:32 Meeting started Wed May 5 16:29:32 2021 UTC.
16:29:32 This meeting is logged and archived in a public location.
16:29:32 The chair is dustymabe. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:29:32 Useful Commands: #action #agreed #halp #info #idea #link #topic.
16:29:32 The meeting name has been set to 'fedora_coreos_meeting'
16:29:36 #topic roll call
16:29:42 .hi
16:29:43 cyberpear: cyberpear 'James Cassell'
16:29:47 .hi
16:29:47 slowrie: slowrie 'Stephen Lowrie'
16:30:06 .hello sohank2602
16:30:07 skunkerk: sohank2602 'Sohan Kunkerkar'
16:30:25 .hello jasonbrooks
16:30:26 jbrooks: jasonbrooks 'Jason Brooks'
16:30:47 .hello siosm
16:30:47 travier: siosm 'Timothée Ravier'
16:31:07 .hello2
16:31:08 jlebon: jlebon 'None'
16:31:19 .hi
16:31:20 lorbus: lorbus 'Christian Glombek'
16:31:20 .hello2 jaimelm
16:31:22 jaimelm: jaimelm 'Jaime Magiera'
16:31:53 .hi
16:31:54 bgilbert: bgilbert 'Benjamin Gilbert'
16:31:57 #chair cyberpear slowrie skunkerk jbrooks travier jlebon lorbus jaimelm bgilbert
16:31:57 Current chairs: bgilbert cyberpear dustymabe jaimelm jbrooks jlebon lorbus skunkerk slowrie travier
16:33:02 #topic Action items from last meeting
16:33:08 * bgilbert to investigate updating the Ignition type registration
16:33:10 * jaimelm bring nftables changes to attention of OKD WG/developers for feedback
16:34:22 still not done :-(, but:
16:34:25 #info bgilbert filed https://github.com/coreos/ignition/issues/1203
16:34:54 bgilbert: re-action or.. ?
16:35:06 nah, let's track it in the bug instead
16:35:24 OKD WG has started using the discussion functionality of our documentation repo. I've been putting placeholders there to bring up at meetings. Placeholder is at the link below. OKD WG will be discussing it Tuesday.
16:35:33 #link https://github.com/openshift/okd/discussions/613
16:36:08 #info jaimelm started a discussion ticket with OKD to discuss nftables implications on OKD. https://github.com/openshift/okd/discussions/613
16:36:10 Tangentially I also have a discussion stub for cgroups v2
16:36:13 https://github.com/openshift/okd/discussions/611
16:36:18 nice
16:36:55 ok I'll move on to meeting topics
16:37:07 #topic Cannot upgrade from N-2 releases due to missing GPG key
16:37:12 #link https://github.com/coreos/fedora-coreos-tracker/issues/749
16:37:18 jlebon: want to explain this one?
16:37:38 sure
16:38:20 so, previously we had a policy where we would put down update barriers before each major rebase so that older nodes have a GPG-covered path to the tip
16:39:07 (because the GPG keys required for the latest might not be on those old nodes)
16:39:43 but in fact, the way rpm-ostree works is that it wants to verify that the commit hashes provided truly do belong on the branch, and so fetches the tip and goes backwards from there up the chain
16:40:19 that first fetch of the tip will fail on old nodes because they don't have the public key it's signed with
16:40:34 and so updating fails
16:41:01 that's the problem statement. maybe we can pause here for clarifications before going to potential solutions
16:42:16 no clarification needed for me. issue understood.
16:42:22 Who decides which key signs the repo?
16:42:29 so basically now that we're on f34, the tip commit is signed with the f34 key
16:42:48 and f32 (and earlier) will fail updating
16:42:55 so the actual "no interaction" update window is quite small
16:43:04 F33 without the F34 key too
16:43:10 ^
16:43:29 Isn't it implicit that Fedora releases have N+1's signing key in them at launch?
16:43:41 hmm. I think f33 should have had the f34 key early enough
16:43:45 https://getfedora.org/security/ > no F35 key here
16:44:12 dustymabe: more recent f32s should have the f34 key
16:44:24 the N+2 key comes in some time in the lifecycle of N
16:44:26 Which would suggest that you can not go more than 5-6 months out of date without manual intervention (right now)
16:45:27 If the N+1 key isn't shipped in the original release then we would also have to rethink the single upgrade barrier per fedora major path
16:45:37 f34 key got added to fedora-gpg-keys rpm in 2020-08-06
16:46:08 Scratch that, I have F35 & F36 keys on my F34 right now
16:46:19 https://src.fedoraproject.org/rpms/fedora-repos/c/49d0933
16:46:27 It's just not displayed on the website
16:46:31 right yup N GA definitely includes N+1
16:46:43 so..
16:46:50 potential solutions?
16:47:08 well first, one thing worth clarifying maybe is:
16:47:53 this is never going to be an issue for nodes which just update with our releases. the more common way this could happen is if you always start from an old bootimage when reprovisioning and then let it auto-update to the latest
16:48:36 specifically, it'd have to be a bootimage older or equal to N-3
16:48:37 That can happen in OKD-land.
16:48:49 right
16:49:07 anyway, potential solutions
16:49:24 dustymabe and I discussed this and some possibilities were:
16:49:43 - drop the validation check in rpm-ostree and just have it trust that the commit is on the same branch
16:49:57 - add a switch to skip the validation check and make zincati use it
16:49:57 option A ^^
16:50:04 and option B ^^
16:50:37 I like "--skip-checksum-validation" because it's yum-like, and will be familiar to users as an option, as opposed to dropping what is generally a good check I think.
16:50:38 N-3 for bootimages in OKD wouldn't be something we'd really want to do though
16:50:40 - sign the commits with multiple keys, going back to how far we care about old bootimages
16:50:56 option C ^^
16:51:02 For clarification rpm-ostree would still validate the hash in these scenarios for the commit it's updating to (the update from the barrier) correct?
16:51:40 slowrie: in option A and B it wouldn't be able to be 100% sure that the commit actually lives on that branch
16:51:51 it'd just have to trust the update driver
16:51:52 just not which branch, right
16:52:02 jlebon: but it would be sure that the commit itself was signed with a valid key
16:52:05 all commits are still GPG verified
16:52:20 and the commit that is being deployed came from zincati anyway (trusted)
16:52:54 right, what rpm-ostree is trying to guard against there isn't really a security thing, but more a semantic thing
16:53:03 I don't like option C; Between A & B I'd probably lean towards B.
16:53:22 * jaimelm likes B
16:53:23 I like B
16:53:31 but wait
16:53:32 What if rpm-ostree read the log until it could find a commit it could verify and kept it from there?
16:53:40 there's more!
16:53:45 :)
16:53:45 C has other benefits
16:54:06 first, we get to reduce the actual number of keys in the chain we use to verify commits
16:54:31 right now, we're just importing everything from /etc/pki/rpm-gpg, which includes keys for all previous fedora releases
16:55:05 second, less keys has an impact on latency and I/O: https://github.com/coreos/fedora-coreos-tracker/issues/761
16:55:52 this was noticed by a user who was seeing I/O spikes at regular intervals
16:56:15 which... i failed to link to from that issue
16:56:33 jlebon: seems like there are other ways to solve that problem?
16:56:58 hmm, let me reframe C
16:57:00 Sure, but what does the cost on the releng side look like for implementing & performing the signing of each commit with multiple keys. Also is there any other costs required to get that off the ground?
16:57:13 C allows us to *keep* the rpm-ostree sanity-check, *and* has other side benefits
16:57:16 should we change the rpm-ostreed shutdown timeout if we're going to be polling it regularly?
16:57:37 with C, three years from now, we'd be signing each release with 9 keys, right?
16:57:58 I really like how we rotate our keys naturally right now
16:58:00 #link https://github.com/coreos/zincati/issues/137
16:58:00 i think we can draw a line somewhere
16:58:04 bgilbert: ^
16:58:30 travier: +1
16:58:41 let's put this out there since travier mentioned it:
16:58:52 option D: What if rpm-ostree read the log until it could find a commit it could verify and kept it from there?
16:59:21 semantically it seems strange to me that we'd sign an F35 release with an F32 key
16:59:34 agreed
16:59:36 hmm, i think we'd be hesitant in libostree to even parse the commit object if it's not verifiable
16:59:39 we'd also need to discuss this with releng ^^
16:59:56 and you need to parse it to get the parent commit
17:00:06 indeed
17:00:22 final thing about the sanity-check, and then I'll shut up:
17:01:14 imagine if we somehow screw up the graph, and the node goes from testing to stable. now, from that point on it's extremely difficult to rectify that situation because the node will now be checking the wrong update graph
17:01:29 it's super unlikely, but super bad
17:02:01 hum, zincati is indeed the real source of truth
17:02:33 jlebon: what if option A/B are fallback modes
17:02:38 also re C, if a key gets compromised, we'd have to continue signing new releases with it. that doesn't affect the security properties (it only affects nodes that only trust the compromised key) but again, feels weird
17:02:53 anyway, if everyone still prefers disabling the check, SGTM
17:03:03 Could we keep refs specifically for update barriers?
17:03:12 Maybe let's work backward: Does anyone support C?
17:03:19 If not, let's take it off the table.
17:03:27 jaimelm: jlebon does :)
17:03:55 if we have refs just for barrier releases, then rpm-ostree can fetch them and they will verify
17:04:07 zincati can ask to move to them
17:04:18 jlebon: Say we were to move forward with C; what actions would we have to start taking beyond just signing the builds multiple times?
17:04:25 this also feels extremely linked to the ostree in container discussion
17:04:33 we would not have this issue with that
17:04:43 travier: yeah, separate refs is the libostree-native way of implementing update barriers
17:04:53 travier: I suspect the ostree that is inside the container has a commit that is signed
17:05:03 slowrie: nothing
17:05:05 yeah
17:05:39 it'd be a patch to robosignatory to have it use not just the key for N, but also N-1, N-2, ... to some integer X we decide on
17:05:40 dustymabe: sure, but we just fetch the release barrier commit which is signed by a key we have
17:05:48 jlebon: well, we'd have to request the signatures get created and also get approval from releng to use them this way
17:06:10 If we go C then I could agree on N-2, but more would be weird
17:06:15 travier: ahh, yeah I guess the validation check wouldn't exist then
17:06:42 can we discuss option B just a bit more?
17:06:49 sure
17:06:56 so let's say we go with option B
17:07:12 then it completely destroy the signature process
17:07:19 and there is a `--skip-checksum-validation` option
17:07:26 that zincati knows how to use
17:07:49 could we default to not using `--skip-checksum-validation` but only for certain barrier releases, use it?
17:08:08 i.e. encode some info in the stream metadata that tells zincati it can use that lever if it needs to?
17:08:33 Oh, nice
17:08:46 or maybe we have it only use --skip-checksum-validation if it detects that the node is super old?
17:08:47 that way we never have this problem that jlebon mentioned about branches getting messed up in zincati
17:08:55 jlebon: yeah, that's another option
17:08:58 less precise
17:09:12 I think it would be used in the worst case: you're not validating the latest commit so your option is to trust zincati, but you're so old that you might be doing something wrong
17:09:32 Feels like exactly the time where update checks should be the most strict
17:10:35 hmm actually, one other option is to have rpm-ostree verify that the fetched commit has the same stream encoded into its metadata
17:10:58 since that's the source of truth for which stream a node is on
17:11:19 yeah I thought we talked about that as an option (at least that's what I was thinking "OSTree ref bindings" were
17:11:39 no, it's a special key, `fedora-coreos.stream`
17:11:51 ok, then yeah. that sounds reasonable
17:11:54 https://github.com/coreos/fedora-coreos-config/blob/113682404839f7fa727f79c27d43fe412f03f2fa/manifest.yaml#L25
17:12:00 yeah, i'd be ok with that
17:12:13 so.. would that be good enough on its own?
17:12:23 or would that be best as a fallback mode?
17:12:30 we'd still want the switch to skip validation
17:12:47 cool
17:12:48 and another switch to force validation using commit metadata
17:12:55 I'm now in favor of option C with N-2, which would put us back to N-4 releases in support. Older nodes can always manually import GPG keys from a trusted source to update.
17:13:18 "Older nodes can always manually import GPG keys from a trusted source to update."
17:13:25 ^^ that's an argument for doing nothing IMO
17:13:30 Big Picture Question: Do we have a sense of how much this issue impacts the community? - as in estimated users/nodes (given the amount of effort being put into a solution)
17:13:43 dustymabe: indeed
17:13:56 In all other cases we parse commit data we can not validate, which means that we are essentially giving up on signing
17:14:17 travier: no, we still validate the commit we are deploying is signed with a key we trust
17:14:26 jaimelm: we're just starting to hit this issue, so no. it'll get worse over time, but we don't know how much.
17:14:37 i think one thing we should nail down is how much we should care about really old bootimages
17:14:38 right
17:14:46 e.g. do we want to support them forever?
17:14:54 jlebon: yes. that's where I'm going.
17:15:06 jaimelm: ack
17:15:15 i agree that's worth ironing out
17:15:16 i honestly think that is a larger topic than just what we're discussing right now
17:15:28 dustymabe: it directly drives the answer though
17:15:34 definitely related - and we can table this until we go through that
17:15:51 We have 15 minutes. So, might be good to table.
17:15:54 because if we say "N-2", then to me it's not worth losing the check
17:16:07 jlebon: right
17:16:10 maybe we can chat further in the issue?
17:16:16 it's not just old bootimages, it's old nodes
17:17:00 same premise though, which ultimately falls on users to update with care (e.g. download keys from trusted source)
17:17:37 "I haven't booted this test VM in a year and now it won't update" isn't great UX
17:17:44 dustymabe told me to take over the meeting because he had to drop
17:17:48 Same thing in the HPC cluster world. If you're running old nodes, you generally have to make some manual changes to get it updated.
17:18:01 let's put the highlights of this discussion in the ticket and move on for now?
17:18:40 (especially since updates are supposed to be automatic)
17:18:59 looks like that was the only issue on the table, so we can go to open floor and if there's nothing else to discuss, we can keep going on this topic
17:19:07 agreed?
17:19:11 +1
17:19:18 +1
17:19:21 #topic Open Floor
17:19:54 announcement: we've now shipped f34 in testing! thanks travier for running the releases!
17:20:07 +1
17:20:09 #info in the next few weeks, we're going to be renaming the `master` branches of CoreOS Git repos to `main`
17:20:49 bgilbert: do i understand correctly the strategy there is to just rename everything in one shot?
17:21:04 there will be a symref from `master` to `main`, so existing checkouts will still be able to pull
17:21:11 cool
17:21:21 (but it's a bit cleaner if you update your local repo afterward)
17:21:40 jlebon: more-or-less one shot. we're not aiming for One Big Flag Day.
17:21:42 bgilbert: ahhh ok cool
17:22:33 anyone have anything else?
17:23:05 doesn't look like it
17:23:32 do anyone want to keep discussing the update issue or just continue in the tickets?
17:23:38 does*
17:23:39 return to the previous issue?
17:23:56 I think we need to summarize the options in the ticket to clarify them
17:24:05 ^^
17:24:06 and weight the risks for each one
17:24:22 and the exact support we want, which issue exactly to fix
17:24:47 (sorry for giving you work jlebon)
17:25:04 I'll try to summarize my options too
17:25:08 travier: heh, no worries. was already planning to do that anyway
17:25:54 i think this is kinda related to the zincati <-> rpm-ostree integration too
17:26:52 if we had zincati fully own updates like the MCO does, then it could pin to commits, and e.g. `rpm-ostree upgrade` would pass through zincati
17:27:32 right now, we're in this middle world where we use zincati, but you can also use `rpm-ostree upgrade --bypass-driver`
17:28:39 ok, going to call the meeting in 30s :)
17:29:09 #endmeeting
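
Annotation (not part of the logged conversation): the key-window problem raised under the GPG key topic can be sketched as a small toy model. This is only an illustration under stated assumptions, not how rpm-ostree or zincati actually decide anything. It assumes a node on Fedora release N trusts every release key up to N + `lookahead` (the log says N GA includes the N+1 key and that the N+2 key arrives at some point during N's lifecycle), and that the tip commit is signed with the tip release's key plus, for option C, the keys of `extra_back` earlier releases. All function and parameter names here are hypothetical.

```python
"""Toy model of the GPG key window from the meeting (illustrative only)."""


def newest_key_on_node(node_release: int, lookahead: int = 1) -> int:
    """Newest Fedora release key assumed to be trusted by a node.

    Assumption: a node on release N trusts keys for all releases up to
    N + lookahead (1 at GA, 2 for nodes that picked up a later keys update).
    """
    return node_release + lookahead


def tip_signing_keys(tip_release: int, extra_back: int = 0) -> range:
    """Release keys assumed to sign the tip commit.

    extra_back=0 models the current situation (tip signed with its own key);
    extra_back=X models option C (also sign with the keys for N-1 .. N-X).
    """
    return range(tip_release - extra_back, tip_release + 1)


def can_verify_tip(node_release: int, tip_release: int,
                   lookahead: int = 1, extra_back: int = 0) -> bool:
    """True if the node trusts at least one key that signed the tip commit."""
    newest = newest_key_on_node(node_release, lookahead)
    return any(key <= newest for key in tip_signing_keys(tip_release, extra_back))


if __name__ == "__main__":
    # Compare the current situation with option C (extra_back=2) for a tip on f34.
    for node in (30, 31, 32, 33):
        print(node, can_verify_tip(node, 34), can_verify_tip(node, 34, extra_back=2))
```

Under these assumptions, an early f32 node cannot verify an f34 tip while a later f32 node (`lookahead=2`) can, which matches the window described in the log. Setting `extra_back=2` moves the cutoff back two more releases, in line with the "N-2 puts us back to N-4" remark.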