16:30:02 <jlebon> #startmeeting fedora_coreos_meeting
16:30:02 <zodbot> Meeting started Wed Jan 18 16:30:02 2023 UTC.
16:30:02 <zodbot> This meeting is logged and archived in a public location.
16:30:02 <zodbot> The chair is jlebon. Information about MeetBot at https://fedoraproject.org/wiki/Zodbot#Meeting_Functions.
16:30:02 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
16:30:02 <zodbot> The meeting name has been set to 'fedora_coreos_meeting'
16:30:09 <jlebon> #topic roll call
16:30:13 <bgilbert> .hi
16:30:14 <zodbot> bgilbert: bgilbert 'Benjamin Gilbert' <bgilbert@backtick.net>
16:30:29 <dustymabe> .hi
16:30:31 <zodbot> dustymabe: dustymabe 'Dusty Mabe' <dusty@dustymabe.com>
16:30:38 <jlebon> #chair bgilbert dustymabe
16:30:38 <zodbot> Current chairs: bgilbert dustymabe jlebon
16:31:04 <jmarrero> .hi
16:31:05 <zodbot> jmarrero: jmarrero 'Joseph Marrero' <jmarrero@redhat.com>
16:31:10 <jbrooks> .hello jasonbrooks
16:31:11 <zodbot> jbrooks: jasonbrooks 'Jason Brooks' <jbrooks@redhat.com>
16:31:34 <jlebon> #chair jmarrero jbrooks
16:31:34 <zodbot> Current chairs: bgilbert dustymabe jbrooks jlebon jmarrero
16:32:16 <aaradhak[m]> .hi
16:32:17 <zodbot> aaradhak[m]: Sorry, but user 'aaradhak [m]' does not exist
16:32:21 <jlebon> #chair aaradhak[m]
16:32:21 <zodbot> Current chairs: aaradhak[m] bgilbert dustymabe jbrooks jlebon jmarrero
16:32:34 <jlebon> let's wait another minute :)
16:33:34 <jlebon> alrighty, let's get started!
16:33:39 <walters> .hi
16:33:40 <zodbot> walters: walters 'Colin Walters' <walters@redhat.com>
16:33:45 <jlebon> #topic Action items from last meeting
16:33:48 <jlebon> #chair walters
16:33:48 <zodbot> Current chairs: aaradhak[m] bgilbert dustymabe jbrooks jlebon jmarrero walters
16:33:55 <jlebon> last meeting was a video meeting
16:34:20 <jlebon> i don't *think* we had any action items come out of it, did we?
16:34:58 <fifofonix> .hi
16:35:01 <zodbot> fifofonix: fifofonix 'Fifo Phonics' <fifofonix@gmail.com>
16:35:10 <jlebon> i'll take that as a no :)
16:35:14 <jlebon> #chair fifofonix
16:35:14 <zodbot> Current chairs: aaradhak[m] bgilbert dustymabe fifofonix jbrooks jlebon jmarrero walters
16:35:28 <jlebon> #topic "Kernel Errors w/latest next/testing likely CIFS related."
16:35:31 <jlebon> #link https://github.com/coreos/fedora-coreos-tracker/issues/1381
16:35:43 <jlebon> fifofonix: just in time :)
16:36:04 <jlebon> dustymabe, fifofonix: either of you want to introduce this one?
16:36:15 <fifofonix> sure.
16:37:11 <fifofonix> basically, on next/testing, for my oldest nodes (which happen to be docker swarm mgmt nodes) i'm getting potentially cifs related issues.
16:37:35 <fifofonix> machine soft crashes.  can be rebooted.  will survive with mounted cifs drives for hours but then kernel error again.
16:38:05 <fifofonix> not captured in issue to date is that i have successfully migrated all next/testing machines including other docker swarms without issue (and these too use cifs in the same way)
16:38:39 <fifofonix> clarification: so what i'm saying is this only affects my very oldest machines in my dev/test environments running next/testing.
16:39:11 <fifofonix> right now i'm trying to re-create these master nodes from scratch (which i should be doing anyway) but this requires some engineering on my part
16:39:21 <dustymabe> the reason I tagged the issue for meeting is because (I think) our ad-hoc release fixed some CIFS issues, but fifofonix was still reporting some failures.
16:39:47 <dustymabe> to me the open question is if we should pin the kernel in `stable` when we do the next `stable` release to whatever the current kernel is in `stable`
16:39:48 <fifofonix> yes, i tried latest next and still affected on this very small number of machines.
16:40:50 <dustymabe> fifofonix: re: old machines versus new machines.. I wonder if this could somehow be cgroupsV1 related
16:41:00 <dustymabe> ^^ random shot in the dark
16:41:02 <fifofonix> from my perspective i've decided to pin my production clusters for now from receiving any updates.  i'm not asking for you to hold off but obviously some others may be affected.
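(For context, "pinning" a Fedora CoreOS node so it stops taking automatic updates is typically done by disabling updates in a Zincati config dropin. A minimal sketch, with the dropin file name chosen for illustration and the schema per the Zincati/FCOS docs:)

    # Disable automatic updates so the node stays on its current release.
    sudo tee /etc/zincati/config.d/90-disable-auto-updates.toml <<'EOF'
    [updates]
    enabled = false
    EOF
    sudo systemctl restart zincati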
16:41:43 <fifofonix> running cgroups2 but i'm just worried about some accumulated lint in the machine.
16:42:06 <dustymabe> so the old machines that are observing the problem are confirmed to be on cgroupsV2 ?
16:42:13 <fifofonix> anyway, there is a BZ for this although I'm not sure that anyone has looked at it?
16:42:20 <fifofonix> confirmed
16:42:28 <fifofonix> (i just checked)
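(A quick way to confirm which cgroups hierarchy a node is on, for anyone following along: on cgroups v2 the unified hierarchy is mounted as cgroup2fs. A general check, not taken from the meeting:)

    # Prints "cgroup2fs" on cgroups v2; "tmpfs" indicates the legacy v1 layout.
    stat -fc %T /sys/fs/cgroup/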
16:42:36 <dustymabe> fifofonix: +1
16:42:57 <jlebon> hmm to confirm, are you saying that *new* testing/next machines hit the CIFS issue, but old machines upgraded to the same next/testing release do not?
16:44:03 <fifofonix> not exactly.  i'm saying the very oldest vintage machines are experiencing the issue.  pre-existing machines re-created within the last 6 months are fine.
16:44:21 <fifofonix> and brand new clusters ok too.
16:44:55 <fifofonix> (we have a separate vmware networking issue affecting swarm - not to do with cifs as far as i know - on new clusters only)
16:44:57 <dustymabe> it's not great but in the absence of a known root cause at least we have a workaround (re-deploy)
16:45:01 <jlebon> ok so it's the opposite. new testing/next machines do not hit the CIFS issue, but very old machines upgraded to the same next/testing release do
16:45:12 <jlebon> interesting. this adds a new dimension to the issue
16:45:18 <fifofonix> jlebon: +1
16:46:02 <dustymabe> is there any chance the older machines have a different environment (I'm thinking network) that could make them behave different than the new machines?
16:46:03 <jlebon> have you done an /etc diff between the two?
16:46:42 <fifofonix> obviously what i want to do is re-create these machines asap to see whether it is something specific to their workloads or due to unknown issues somehow carried forward from long ago.
16:47:12 <dustymabe> fifofonix: +1 that information would be good to have
16:47:30 <fifofonix> issue is replicable across my clustered-NUC environment housed in an office, and a datacenter (running the same version of vsphere)
16:48:29 <fifofonix> and the machines that do work are on the same hardware/networks.
16:48:52 <fifofonix> I will get on the /etc diff.  anyway, we've spent a lot of time on this already, and i don't want to derail you.
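(On an OSTree-based host like FCOS, one way to produce that /etc diff is ostree's built-in config diff, which lists files in /etc that differ from the deployment defaults and can be compared between an affected and an unaffected node:)

    # Lists added (A), modified (M), and deleted (D) files in /etc relative to the defaults in /usr/etc.
    sudo ostree admin config-diff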
16:49:21 <jlebon> dustymabe: sounds good re. not holding stable, but i wonder if we should send an email ahead of time
16:50:04 <dustymabe> #proposed It appears this issue may affect older machines that are upgraded but we are still investigating to get more details. Since currently this issue only has one reported affected user/environment and they have pinned on a known working version we will release the next `testing` as usual.
16:50:09 <fifofonix> the downside effect is quite bad.  meaning the failure mode can create clusterwide issues.
16:50:11 <dustymabe> jlebon: re: communication - maybe?
16:50:32 <fifofonix> especially if you're aggressive on rollout.
16:50:51 <dustymabe> bgilbert: jmarrero: anyone else.. WDYT?
16:51:44 <jlebon> dustymabe: i mean, "there's a known issue and it's coming to you" isn't great, but is better than affected users finding out after the fact
16:51:45 <bgilbert> it'd be nice to get that /etc diff.  "we've had one report of an issue" makes for a strange coreos-status post.
16:51:46 <jmarrero> +1 to proposed
16:52:01 <bgilbert> +1 to proposed
16:52:38 <fifofonix> just to re-iterate on the failure mode point in case it should somehow be incorporated simply (!)
16:53:23 <fifofonix> the problem is that if you are affected, you upgrade one machine, all looks good for 2 hours, another machine in a 3-node cluster upgrades, and then they both crash overnight and you have lost cluster coherence.
16:53:32 <dustymabe> We could pin stable, but what do we do next time if the situation hasn't improved? I wish it was clear what the problem was and there was an upstream fix (or even upstream reports of the same problem)
16:53:57 <bgilbert> fifofonix: I agree that we should fix this, but I think we need more data
16:54:00 <fifofonix> (and i did also replicate this on rawhide)
16:54:14 <bgilbert> ideally a reproducer
16:54:26 <dustymabe> fifofonix: ^^ implies it can be reproduced on new nodes and not just upgrading ones
16:54:26 <jlebon> bgilbert: i guess adding details like "we encourage you to test your workload on testing/next to find out if you're affected" would make it more actionable
16:54:56 <fifofonix> dustymabe: well, not sure, i switched streams, i didn't reprovision the machine.
16:55:07 <walters> Yeah this one feels to me like it's in a rather common scenario of there's a likely bug/regression but to get traction it needs more diagnosis and potentially bisection, like a lot of other kernel bugs
16:55:17 <dustymabe> fifofonix: ahh I see.
16:56:08 <jlebon> ok, we can discuss whether to send out a communication in #fedora-coreos maybe
16:56:18 <jlebon> +1 to proposed
16:56:29 <jlebon> want to #agreed it?
16:56:32 <dustymabe> we *can* reverse course on the proposed if new details come in
16:56:59 <dustymabe> I think the key here is "if no substantial new information comes in" then we know what to do on Monday when we start the releases
16:57:17 <dustymabe> jlebon: go for it
16:57:17 <bgilbert> dustymabe: +1
16:58:19 <jlebon> dustymabe: i think it's easier if you do it. i'd have to copy paste and fix newlines
16:58:41 <dustymabe> #agreed It appears this issue may affect older machines that are upgraded but we are still investigating to get more details. Since currently this issue only has one reported affected user/environment and they have pinned on a known working version we will release the next `testing` as usual.
16:58:54 <dustymabe> before we move on
16:58:56 <dustymabe> The other open question: does this warrant a status post if we do decide to ship what's in `testing` right now to `stable`? We've touched on it above, but is there agreement we should?
16:59:23 <dustymabe> I'm mixed, but I guess we probably should
16:59:25 <jlebon> dustymabe: i suggested discussing it in #fedora-coreos
16:59:48 <dustymabe> ok
16:59:58 <jlebon> just to get to some of the other tickets :)
17:00:16 <jlebon> #topic [CfP] Container Plumbing Days 2023
17:00:20 <jlebon> #link https://github.com/coreos/fedora-coreos-tracker/issues/1378
17:00:30 <jlebon> travier[m]: around?
17:00:46 <bgilbert> dustymabe: I think we should not
17:01:07 <bgilbert> with the current state of our knowledge
17:01:24 <jlebon> this is an FYI that the CfP for Container Plumbing Days 2023 is open
17:02:01 <jlebon> maybe add a comment to the ticket if you submit something
17:02:14 <jlebon> like walters just did :)
17:03:11 <jlebon> dustymabe: do we have enough context on https://github.com/coreos/fedora-coreos-tracker/issues/1374 to discuss it now or should we wait for travier[m]?
17:03:38 <dustymabe> jlebon: I think we can probably start the discussion
17:03:44 <jlebon> ok cool
17:03:49 <jlebon> #topic Podman begins CNI plugins deprecation
17:03:52 <jlebon> #link https://github.com/coreos/fedora-coreos-tracker/issues/1374
17:03:58 <jlebon> want to introduce it?
17:04:09 <dustymabe> I'll do my best :)
17:04:51 <dustymabe> My takeaway from this ticket is that there will be a time in the future (the hard deadline will be quite some time away) when people's existing containers on older systems will stop working
17:06:03 <dustymabe> maybe some of the details are more nuanced than that (which we should tease out). Like for example, maybe it only applies to containers where a `podman network` was used. But either way a set of our users will be affected
17:06:24 <dustymabe> as we have done in the past with other "migrations" we should try to determine the best and least disruptive path forward for our users
17:07:40 <jlebon> it sounds like the "migration" is to do a full containers reset
17:07:51 <dustymabe> right
17:08:29 <dustymabe> but.. since we know about the problem now we can start making noise on the communication channels and also on the nodes themselves (via CLHM)
17:08:53 <dustymabe> the users can either re-deploy from scratch OR `podman system reset` existing systems
17:09:19 <jlebon> yeah, agreed
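(Roughly what the two options above look like on an existing node; the podman info field name assumes podman 4.x, and note that a system reset wipes all containers, images, networks, and volumes:)

    # Check which network backend this node currently uses ("cni" or "netavark").
    podman info --format '{{.Host.NetworkBackend}}'
    # The reset path mentioned above: wipes all podman state so containers are recreated on the new backend.
    sudo podman system reset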
17:09:28 <dustymabe> it's certainly not great that
17:09:52 <dustymabe> we'd require manual intervention like this, but given that we don't own podman and this is the pattern that team came up with, I don't think we have much choice
17:10:09 <jlebon> maybe let's revisit this when the removal timeline in Fedora is clearer
17:10:35 <jlebon> because then we can work on communications that use actual dates
17:11:08 <dustymabe> jlebon: I agree with that ^^, but I think we should push on the podman team to make that clearer in the next month. I don't want to wait until we are 3 or even 6 months out to start warning our users.
17:11:39 <dustymabe> I also want to make this CLHM warning `red` in color
17:11:43 <jlebon> dustymabe: indeed. i'll ping them in the issue
17:11:54 <dustymabe> jlebon: +1
17:12:02 <jlebon> ok cool, moving on
17:12:08 <jlebon> #topic Create container repo tags for each FCOS release
17:12:09 <dustymabe> on a side note.. I assume RHCOS is going to have to deal with this downstream too
17:12:10 <jlebon> #link https://github.com/coreos/fedora-coreos-tracker/issues/1367
17:12:28 <dustymabe> since it does still rely on podman for something
17:12:47 <jlebon> dustymabe: all of OCP will have to deal with it :)
17:13:05 <walters> OpenShift nodes don't usually use networked podman containers
17:13:36 <walters> (I mean they're all --net=host)
17:13:50 <dustymabe> walters: +1 - that's good info
17:14:17 <jlebon> so we discussed this ticket in an earlier meeting. we came to some conclusions in a follow-up chat we had.
17:14:30 <jlebon> that proposal is in https://github.com/coreos/fedora-coreos-tracker/issues/1367#issuecomment-1372870700
17:14:37 <walters> (further bootstrap containers are usually --rm, just like kubernetes pods; i.e. only persisted by configuration, not implicitly)
17:15:30 <jlebon> walters: did you want to discuss the last comments added there?
17:16:40 <dustymabe> jlebon: my takeaway is that we can't really prune production tags (at least not for a long while to consider the update graph)
17:16:43 <walters> I just added a 👍️ to the last comment but will only feel really confident when I try to actually dig in to the code and testing things
17:17:18 <dustymabe> ahh - yeah from what bgilbert posted - we can prune, but just make sure we don't prune the barrier ones
17:17:36 <dustymabe> so we'd modify the proposal slightly
17:17:48 <walters> (The goal of "avoid bespoke things by reusing container infrastructure directly" clashing with the "hey we made up this cincinnati thing" is painful)
17:19:14 <jlebon> ok right, that seems reasonable to me. it does mean though that users will now be able to pin on tags for a while, which was something we were trying to avoid
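(For readers following along, "pinning on a tag" here means rebasing a node to a specific versioned tag of the FCOS container image rather than the moving stream tag; the version in this sketch is purely illustrative:)

    # Rebase to a hypothetical versioned tag instead of the moving "stable" tag.
    sudo rpm-ostree rebase ostree-unverified-registry:quay.io/fedora/fedora-coreos:37.20230110.3.1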
17:19:41 <bgilbert> not arbitrary ones, though
17:19:43 <jlebon> we don't really have wording around what the oldest starting version we support is. we should probably firm that up at some point
17:20:01 <jlebon> indeed
17:20:01 <bgilbert> I'm comfortable with the idea of keeping barrier releases around, given the value they provide
17:20:36 <jlebon> the fact that we eventually do get rid of them i think is important
17:20:53 <bgilbert> AIUI we'd need to prune not only the barrier releases, but the release list used to build the Cincinnati graph
17:20:55 <walters> I wouldn't want to try to scope in removing them right now
17:20:59 <bgilbert> walters: +1
17:21:07 <dustymabe> sounds good. jlebon I guess you can update the ticket with that new info and then we can mark it as ready for action?
17:21:29 <jlebon> yup, SGTM
17:22:00 <jlebon> bgilbert: i guess it depends how we decide to handle barrier releases in the larger GC discussion
17:22:13 <jlebon> but agreed that discussion can wait
17:22:21 <walters> does the ostree pruner take that into account?
17:22:34 <jlebon> the ostree pruner currently doesn't prune prod refs at all
17:23:12 <jlebon> ok cool, we're making good progress. let's see if we can squeeze one more in :)
17:23:26 <walters> it's not about the refs though, but the barriers including older refs
17:23:46 <jlebon> actually, i think all the other ones are larger discussions, so maybe we should stop here
17:24:13 <jlebon> walters: can you clarify?
17:24:31 <jlebon> dustymabe: did you want to talk about f38 changes or good to tackle that again next week?
17:25:46 <dustymabe> next week
17:25:47 <jlebon> ok, not much time left at this point, so let's just move to open floor
17:25:54 <jlebon> #topic Open Floor
17:26:39 <jlebon> walters: i'm not sure i follow. which older refs are you *ref*erring to?
17:27:13 <jlebon> anything anyone wants to bring up?
17:29:08 <dustymabe> nothing here
17:29:43 <jlebon> will end meeting in 30s
17:30:13 <jlebon> #endmeeting