16:29:32 <dustymabe> #startmeeting fedora_coreos_meeting
16:29:32 <zodbot> Meeting started Wed May  5 16:29:32 2021 UTC.
16:29:32 <zodbot> This meeting is logged and archived in a public location.
16:29:32 <zodbot> The chair is dustymabe. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:29:32 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
16:29:32 <zodbot> The meeting name has been set to 'fedora_coreos_meeting'
16:29:36 <dustymabe> #topic roll call
16:29:42 <cyberpear> .hi
16:29:43 <zodbot> cyberpear: cyberpear 'James Cassell' <fedoraproject@cyberpear.com>
16:29:47 <slowrie> .hi
16:29:47 <zodbot> slowrie: slowrie 'Stephen Lowrie' <slowrie@redhat.com>
16:30:06 <skunkerk> .hello sohank2602
16:30:07 <zodbot> skunkerk: sohank2602 'Sohan Kunkerkar' <skunkerk@redhat.com>
16:30:25 <jbrooks> .hello jasonbrooks
16:30:26 <zodbot> jbrooks: jasonbrooks 'Jason Brooks' <jbrooks@redhat.com>
16:30:47 <travier> .hello siosm
16:30:47 <zodbot> travier: siosm 'Timothée Ravier' <travier@redhat.com>
16:31:07 <jlebon> .hello2
16:31:08 <zodbot> jlebon: jlebon 'None' <jonathan@jlebon.com>
16:31:19 <lorbus> .hi
16:31:20 <zodbot> lorbus: lorbus 'Christian Glombek' <cglombek@redhat.com>
16:31:20 <jaimelm> .hello2 jaimelm
16:31:22 <zodbot> jaimelm: jaimelm 'Jaime Magiera' <jaimelm@umich.edu>
16:31:53 <bgilbert> .hi
16:31:54 <zodbot> bgilbert: bgilbert 'Benjamin Gilbert' <bgilbert@backtick.net>
16:31:57 <dustymabe> #chair cyberpear slowrie skunkerk jbrooks travier jlebon lorbus jaimelm bgilbert
16:31:57 <zodbot> Current chairs: bgilbert cyberpear dustymabe jaimelm jbrooks jlebon lorbus skunkerk slowrie travier
16:33:02 <dustymabe> #topic Action items from last meeting
16:33:08 <dustymabe> * bgilbert to investigate updating the Ignition type registration
16:33:10 <dustymabe> * jaimelm bring nftables changes to attention of OKD WG/developers for feedback
16:34:22 <bgilbert> still not done :-(, but:
16:34:25 <bgilbert> #info bgilbert filed https://github.com/coreos/ignition/issues/1203
16:34:54 <dustymabe> bgilbert: re-action or.. ?
16:35:06 <bgilbert> nah, let's track it in the bug instead
16:35:24 <jaimelm> OKD WG has started using the discussion functionality of our documentation repo. I've been putting placeholders there to bring up at meetings. Placeholder is at the link below. OKD WG will be discussing it Tuesday.
16:35:33 <jaimelm> #link https://github.com/openshift/okd/discussions/613
16:36:08 <dustymabe> #info jaimelm started a discussion ticket with OKD to discuss nftables implications on OKD. https://github.com/openshift/okd/discussions/613
16:36:10 <jaimelm> Tangentially I also have a discussion stub for cgroups v2
16:36:13 <jaimelm> https://github.com/openshift/okd/discussions/611
16:36:18 <dustymabe> nice
16:36:55 <dustymabe> ok I'll move on to meeting topics
16:37:07 <dustymabe> #topic Cannot upgrade from N-2 releases due to missing GPG key
16:37:12 <dustymabe> #link https://github.com/coreos/fedora-coreos-tracker/issues/749
16:37:18 <dustymabe> jlebon: want to explain this one?
16:37:38 <jlebon> sure
16:38:20 <jlebon> so, previously we had a policy where we would put down update barriers before each major rebase so that older nodes have a GPG-covered path to the tip
16:39:07 <jlebon> (because the GPG keys required for the latest might not be on those old nodes)
16:39:43 <jlebon> but in fact, the way rpm-ostree works is that it wants to verify that the commit hashes provided truly do belong on the branch, and so fetches the tip and goes backwards from there up the chain
16:40:19 <jlebon> that first fetch of the tip will fail on old nodes because they don't have the public key it's signed with
16:40:34 <jlebon> and so updating fails
16:41:01 <jlebon> that's the problem statement.  maybe we can pause here for clarifications before going to potential solutions
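
The failure mode described above can be modeled in a few lines. This is a toy sketch, not rpm-ostree's actual code: the `Commit` class, the key names, and `verify_on_branch` are hypothetical, and real GPG verification is reduced to a set-membership check. It also assumes every commit visited during the walk must be verifiable before it is parsed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    checksum: str
    signed_by: str                      # name of the key that signed this commit
    parent: Optional["Commit"] = None   # previous commit on the branch


class UntrustedKeyError(Exception):
    """Raised when a commit is signed by a key the node does not have."""


def verify_on_branch(tip: Commit, target: str, trusted_keys: set[str]) -> bool:
    """Walk from the branch tip back through parents looking for `target`.

    Every commit visited must be signed by a trusted key, so an old node
    already fails on the tip, before it ever reaches the barrier commit.
    """
    current: Optional[Commit] = tip
    while current is not None:
        if current.signed_by not in trusted_keys:
            raise UntrustedKeyError(
                f"{current.checksum} is signed by {current.signed_by}"
            )
        if current.checksum == target:
            return True
        current = current.parent
    return False


# Hypothetical three-commit branch: an f32-era barrier, then f33, then the f34 tip.
barrier = Commit("aaa", "f32")
middle = Commit("bbb", "f33", parent=barrier)
tip = Commit("ccc", "f34", parent=middle)
```

A node that trusts `{"f32", "f33", "f34"}` can confirm that `"aaa"` is on the branch, while a node that only trusts `{"f32", "f33"}` raises `UntrustedKeyError` on the tip even though the barrier commit itself is signed with a key it has.
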
16:42:16 <jaimelm> no clarification needed for me. issue understood.
16:42:22 <travier> Who decides which key signs the repo?
16:42:29 <dustymabe> so basically now that we're on f34, the tip commit is signed with the f34 key
16:42:48 <dustymabe> and f32 (and earlier) will fail updating
16:42:55 <travier> so the actual "no interaction" update window is quite small
16:43:04 <travier> F33  without the F34 key too
16:43:10 <jaimelm> ^
16:43:29 <slowrie> Isn't it implicit that Fedora releases have N+1's signing key in them at launch?
16:43:41 <dustymabe> hmm. I think f33 should have had the f34 key early enough
16:43:45 <travier> https://getfedora.org/security/ > no F35 key here
16:44:12 <jlebon> dustymabe: more recent f32s should have the f34 key
16:44:24 <jlebon> the N+2 key comes in some time in the lifecycle of N
16:44:26 <travier> Which would suggest that you can not go more than 5-6 months out of date without manual intervention (right now)
16:45:27 <slowrie> If the N+1 key isn't shipped in the original release then we would also have to rethink the single upgrade barrier per fedora major path
16:45:37 <dustymabe> f34 key got added to fedora-gpg-keys rpm in 2020-08-06
16:46:08 <travier> Scratch that, I have F35 & F36 keys on my F34 right now
16:46:19 <dustymabe> https://src.fedoraproject.org/rpms/fedora-repos/c/49d0933
16:46:27 <travier> It's just not displayed on the website
16:46:31 <jlebon> right yup N GA definitely includes N+1
16:46:43 <dustymabe> so..
16:46:50 <dustymabe> potential solutions?
16:47:08 <jlebon> well first, one thing worth clarifying maybe is:
16:47:53 <jlebon> this is never going to be an issue for nodes which just update with our releases.  the more common way this could happen is if you always start from an old bootimage when reprovisioning and then let it auto-update to the latest
16:48:36 <jlebon> specifically, it'd have to be a bootimage older or equal to N-3
16:48:37 <jaimelm> That can happen in OKD-land.
16:48:49 <jlebon> right
16:49:07 <jlebon> anyway, potential solutions
16:49:24 <jlebon> dustymabe and I discussed this and some possibilities were:
16:49:43 <jlebon> - drop the validation check in rpm-ostree and just have it trust that the commit is on the same branch
16:49:57 <jlebon> - add a switch to skip the validation check and make zincati use it
16:49:57 <dustymabe> option A ^^
16:50:04 <dustymabe> and option B ^^
16:50:37 <jaimelm> I like "--skip-checksum-validation" because it's yum-like, and will be familiar to users as an option, as opposed to dropping what is generally a good check I think.
16:50:38 <lorbus> N-3 for bootimages in OKD wouldn't be something we'd really want to do though
16:50:40 <jlebon> - sign the commits with multiple keys, going back to how far we care about old bootimages
16:50:56 <dustymabe> option C ^^
16:51:02 <slowrie> For clarification rpm-ostree would still validate the hash in these scenarios for the commit it's updating to (the update from the barrier) correct?
16:51:40 <jlebon> slowrie: in option A and B it wouldn't be able to be 100% sure that the commit actually lives on that branch
16:51:51 <jlebon> it'd just have to trust the update driver
16:51:52 <jaimelm> just not which branch, right
16:52:02 <dustymabe> jlebon: but it would be sure that the commit itself was signed with a valid key
16:52:05 <jlebon> all commits are still GPG verified
16:52:20 <dustymabe> and the commit that is being deployed came from zincati anyway (trusted)
16:52:54 <jlebon> right, what rpm-ostree is trying to guard against there isn't really a security thing, but more a semantic thing
16:53:03 <slowrie> I don't like option C; Between A & B I'd probably lean towards B.
16:53:22 * jaimelm likes B
16:53:23 <dustymabe> I like B
16:53:31 <jlebon> but wait
16:53:32 <travier> What if rpm-ostree read the log until it could find a commit it could verify and kept it from there?
16:53:40 <jaimelm> there's more!
16:53:45 <dustymabe> :)
16:53:45 <jlebon> C has other benefits
16:54:06 <jlebon> first, we get to reduce the actual number of keys in the chain we use to verify commits
16:54:31 <jlebon> right now, we're just importing everything from /etc/pki/rpm-gpg, which includes keys for all previous fedora releases
16:55:05 <jlebon> second, fewer keys has an impact on latency and I/O: https://github.com/coreos/fedora-coreos-tracker/issues/761
16:55:52 <jlebon> this was noticed by a user who was seeing I/O spikes at regular intervals
16:56:15 <jlebon> which... i failed to link to from that issue
16:56:33 <dustymabe> jlebon: seems like there are other ways to solve that problem?
16:56:58 <jlebon> hmm, let me reframe C
16:57:00 <slowrie> Sure, but what does the cost on the releng side look like for implementing & performing the signing of each commit with multiple keys? Also, are there any other costs required to get that off the ground?
16:57:13 <jlebon> C allows us to *keep* the rpm-ostree sanity-check, *and* has other side benefits
16:57:16 <bgilbert> should we change the rpm-ostreed shutdown timeout if we're going to be polling it regularly?
16:57:37 <bgilbert> with C, three years from now, we'd be signing each release with 9 keys, right?
16:57:58 <dustymabe> I really like how we rotate our keys naturally right now
16:58:00 <travier> #link https://github.com/coreos/zincati/issues/137
16:58:00 <jlebon> i think we can draw a line somewhere
16:58:04 <jlebon> bgilbert: ^
16:58:30 <bgilbert> travier: +1
16:58:41 <dustymabe> let's put this out there since travier mentioned it:
16:58:52 <dustymabe> option D: What if rpm-ostree read the log until it could find a commit it could verify and kept it from there?
16:59:21 <bgilbert> semantically it seems strange to me that we'd sign an F35 release with an F32 key
16:59:34 <jaimelm> agreed
16:59:36 <jlebon> hmm, i think we'd be hesitant in libostree to even parse the commit object if it's not verifiable
16:59:39 <dustymabe> we'd also need to discuss this with releng ^^
16:59:56 <jlebon> and you need to parse it to get the parent commit
17:00:06 <travier> indeed
17:00:22 <jlebon> final thing about the sanity-check, and then I'll shut up:
17:01:14 <jlebon> imagine if we somehow screw up the graph, and the node goes from testing to stable. now, from that point on it's extremely difficult to rectify that situation because the node will now be checking the wrong update graph
17:01:29 <jlebon> it's super unlikely, but super bad
17:02:01 <travier> hum, zincati is indeed the real source of truth
17:02:33 <dustymabe> jlebon: what if option A/B are fallback modes
17:02:38 <bgilbert> also re C, if a key gets compromised, we'd have to continue signing new releases with it.  that doesn't affect the security properties (it only affects nodes that only trust the compromised key) but again, feels weird
17:02:53 <jlebon> anyway, if everyone still prefers disabling the check, SGTM
17:03:03 <travier> Could we keep refs specifically for update barriers?
17:03:12 <jaimelm> Maybe let's work backward: Does anyone support C?
17:03:19 <jaimelm> If not, let's take it off the table.
17:03:27 <dustymabe> jaimelm: jlebon does :)
17:03:55 <travier> if we have refs just for barrier releases, then rpm-ostree can fetch them and they will verify
17:04:07 <travier> zincati can ask to move to them
17:04:18 <slowrie> jlebon: Say we were to move forward with C; what actions would we have to start taking beyond just signing the builds multiple times?
17:04:25 <travier> this also feels extremely linked to the ostree in container discussion
17:04:33 <travier> we would not have this issue with that
17:04:43 <jlebon> travier: yeah, separate refs is the libostree-native way of implementing update barriers
17:04:53 <dustymabe> travier: I suspect the ostree that is inside the container has a commit that is signed
17:05:03 <jlebon> slowrie: nothing
17:05:05 <jaimelm> yeah
17:05:39 <jlebon> it'd be a patch to robosignatory to have it use not just the key for N, but also N-1, N-2, ... to some integer X we decide on
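
Option C as described here can be sketched as follows. This is an illustration only: `signing_keys`, `node_can_verify`, and the `fNN` key names are hypothetical, not robosignatory's interface, and it assumes a commit verifies when at least one of its signatures comes from a key the node trusts.

```python
def signing_keys(release: int, window: int) -> list[str]:
    """Keys used to sign a release: N, N-1, ..., N-window (hypothetical naming)."""
    return [f"f{release - i}" for i in range(window + 1)]


def node_can_verify(signatures: list[str], trusted_keys: set[str]) -> bool:
    """A commit verifies if at least one signature is from a trusted key."""
    return any(sig in trusted_keys for sig in signatures)


# An f34 release signed back to N-2 carries three signatures.
f34_signatures = signing_keys(34, 2)
```

With a window of 2, a node holding only the f31 and f32 keys can still verify the f34 tip through the f32 signature, while a node holding only f30 and f31 cannot. The window is the integer X mentioned above, and it also bounds how many keys each release is signed with.
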
17:05:40 <travier> dustymabe: sure, but we just fetch the release barrier commit which is signed by a key we have
17:05:48 <dustymabe> jlebon: well, we'd have to request the signatures get created and also get approval from releng to use them this way
17:06:10 <travier> If we go C then I could agree on N-2, but more would be weird
17:06:15 <dustymabe> travier: ahh, yeah I guess the validation check wouldn't exist then
17:06:42 <dustymabe> can we discuss option B just a bit more?
17:06:49 <jlebon> sure
17:06:56 <dustymabe> so let's say we go with option B
17:07:12 <travier> then it completely destroys the signature process
17:07:19 <dustymabe> and there is a `--skip-checksum-validation` option
17:07:26 <dustymabe> that zincati knows how to use
17:07:49 <dustymabe> could we default to not using `--skip-checksum-validation` but only for certain barrier releases, use it?
17:08:08 <dustymabe> i.e. encode some info in the stream metadata that tells zincati it can use that lever if it needs to?
17:08:33 <jaimelm> Oh, nice
17:08:46 <jlebon> or maybe we have it only use --skip-checksum-validation if it detects that the node is super old?
17:08:47 <dustymabe> that way we never have this problem that jlebon mentioned about branches getting messed up in zincati
17:08:55 <dustymabe> jlebon: yeah, that's another option
17:08:58 <dustymabe> less precise
17:09:12 <travier> I think it would be used in the worst case: you're not validating the latest commit so your option is to trust zincati, but you're so old that you might be doing something wrong
17:09:32 <travier> Feels like exactly the time where update checks should be the most strict
17:10:35 <jlebon> hmm actually, one other option is to have rpm-ostree verify that the fetched commit has the same stream encoded into its metadata
17:10:58 <jlebon> since that's the source of truth for which stream a node is on
17:11:19 <dustymabe> yeah I thought we talked about that as an option (at least that's what I was thinking "OSTree ref bindings" were)
17:11:39 <jlebon> no, it's a special key, `fedora-coreos.stream`
17:11:51 <dustymabe> ok, then yeah. that sounds reasonable
17:11:54 <jlebon> https://github.com/coreos/fedora-coreos-config/blob/113682404839f7fa727f79c27d43fe412f03f2fa/manifest.yaml#L25
17:12:00 <jlebon> yeah, i'd be ok with that
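
The metadata-based check proposed here might look like the sketch below. Only the `fedora-coreos.stream` key name comes from the discussion. The function name and the plain-dict representation of commit metadata are assumptions for illustration, not rpm-ostree's API.

```python
STREAM_KEY = "fedora-coreos.stream"


def commit_matches_stream(commit_metadata: dict[str, str], node_stream: str) -> bool:
    """Accept a fetched commit only if it declares the stream the node is on.

    A commit with no stream key at all is rejected rather than trusted.
    """
    return commit_metadata.get(STREAM_KEY) == node_stream


# Hypothetical metadata for a commit built for the stable stream.
stable_commit = {STREAM_KEY: "stable", "version": "34.20210427.3.0"}
```

A node on `testing` that is handed `stable_commit` would reject it, which covers the "graph mix-up moves a node from testing to stable" scenario without needing to fetch and verify the branch tip.
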
17:12:13 <dustymabe> so.. would that be good enough on its own?
17:12:23 <dustymabe> or would that be best as a fallback mode?
17:12:30 <jlebon> we'd still want the switch to skip validation
17:12:47 <dustymabe> cool
17:12:48 <jlebon> and another switch to force validation using commit metadata
17:12:55 <travier> I'm now in favor of option C with N-2, which would put us back to N-4 releases in support. Older nodes can always manually import GPG keys from a trusted source to update.
17:13:18 <dustymabe> "Older nodes can always manually import GPG keys from a trusted source to update."
17:13:25 <dustymabe> ^^ that's an argument for doing nothing IMO
17:13:30 <jaimelm> Big Picture Question: Do we have a sense of how much this issue impacts the community? - as in estimated users/nodes (given the amount of effort being put into a solution)
17:13:43 <jaimelm> dustymabe: indeed
17:13:56 <travier> In all other cases we parse commit data we can not validate, which means that we are essentially giving up on signing
17:14:17 <dustymabe> travier: no, we still validate the commit we are deploying is signed with a key we trust
17:14:26 <bgilbert> jaimelm: we're just starting to hit this issue, so no.  it'll get worse over time, but we don't know how much.
17:14:37 <jlebon> i think one thing we should nail down is how much we should care about really old bootimages
17:14:38 <jaimelm> right
17:14:46 <jlebon> e.g. do we want to support them forever?
17:14:54 <jaimelm> jlebon: yes. that's where I'm going.
17:15:06 <jlebon> jaimelm: ack
17:15:15 <jlebon> i agree that's worth ironing out
17:15:16 <dustymabe> i honestly think that is a larger topic than just what we're discussing right now
17:15:28 <jlebon> dustymabe: it directly drives the answer though
17:15:34 <dustymabe> definitely related - and we can table this until we go through that
17:15:51 <jaimelm> We have 15 minutes. So, might be good to table.
17:15:54 <jlebon> because if we say "N-2", then to me it's not worth losing the check
17:16:07 <dustymabe> jlebon: right
17:16:10 <jlebon> maybe we can chat further in the issue?
17:16:16 <bgilbert> it's not just old bootimages, it's old nodes
17:17:00 <jaimelm> same premise though, which ultimately falls on users to update with care (e.g. download keys from trusted source)
17:17:37 <bgilbert> "I haven't booted this test VM in a year and now it won't update" isn't great UX
17:17:44 <jlebon> dustymabe told me to take over the meeting because he had to drop
17:17:48 <jaimelm> Same thing in the HPC cluster world. If you're running old nodes, you generally have to make some manual changes to get it updated.
17:18:01 <jlebon> let's put the highlights of this discussion in the ticket and move on for now?
17:18:40 <bgilbert> (especially since updates are supposed to be automatic)
17:18:59 <jlebon> looks like that was the only issue on the table, so we can go to open floor and if there's nothing else to discuss, we can keep going on this topic
17:19:07 <jlebon> agreed?
17:19:11 <travier> +1
17:19:18 <bgilbert> +1
17:19:21 <jlebon> #topic Open Floor
17:19:54 <jlebon> announcement: we've now shipped f34 in testing! thanks travier for running the releases!
17:20:07 <travier> +1
17:20:09 <bgilbert> #info in the next few weeks, we're going to be renaming the `master` branches of CoreOS Git repos to `main`
17:20:49 <jlebon> bgilbert: do i understand correctly the strategy there is to just rename everything in one shot?
17:21:04 <bgilbert> there will be a symref from `master` to `main`, so existing checkouts will still be able to pull
17:21:11 <jaimelm> cool
17:21:21 <bgilbert> (but it's a bit cleaner if you update your local repo afterward)
17:21:40 <bgilbert> jlebon: more-or-less one shot.  we're not aiming for One Big Flag Day.
17:21:42 <jlebon> bgilbert: ahhh ok cool
17:22:33 <jlebon> anyone have anything else?
17:23:05 <jlebon> doesn't look like it
17:23:32 <jlebon> do anyone want to keep discussing the update issue or just continue in the tickets?
17:23:38 <jlebon> does*
17:23:39 <jaimelm> return to the previous issue?
17:23:56 <travier> I think we need to summarize the options in the ticket to clarify them
17:24:05 <jaimelm> ^^
17:24:06 <travier> and weigh the risks for each one
17:24:22 <travier> and the exact support we want, which issue exactly to fix
17:24:47 <travier> (sorry for giving you work jlebon)
17:25:04 <travier> I'll try to summarize my options too
17:25:08 <jlebon> travier: heh, no worries. was already planning to do that anyway
17:25:54 <jlebon> i think this is kinda related to the zincati <-> rpm-ostree integration too
17:26:52 <jlebon> if we had zincati fully own updates like the MCO does, then it could pin to commits, and e.g. `rpm-ostree upgrade` would pass through zincati
17:27:32 <jlebon> right now, we're in this middle world where we use zincati, but you can also use `rpm-ostree upgrade --bypass-driver`
17:28:39 <jlebon> ok, going to call the meeting in 30s :)
17:29:09 <jlebon> #endmeeting