16:30:02 #startmeeting fedora_coreos_meeting
16:30:02 Meeting started Wed Jan 18 16:30:02 2023 UTC.
16:30:02 This meeting is logged and archived in a public location.
16:30:02 The chair is jlebon. Information about MeetBot at https://fedoraproject.org/wiki/Zodbot#Meeting_Functions.
16:30:02 Useful Commands: #action #agreed #halp #info #idea #link #topic.
16:30:02 The meeting name has been set to 'fedora_coreos_meeting'
16:30:09 #topic roll call
16:30:13 .hi
16:30:14 bgilbert: bgilbert 'Benjamin Gilbert'
16:30:29 .hi
16:30:31 dustymabe: dustymabe 'Dusty Mabe'
16:30:38 #chair bgilbert dustymabe
16:30:38 Current chairs: bgilbert dustymabe jlebon
16:31:04 .hi
16:31:05 jmarrero: jmarrero 'Joseph Marrero'
16:31:10 .hello jasonbrooks
16:31:11 jbrooks: jasonbrooks 'Jason Brooks'
16:31:34 #chair jmarrero jbrooks
16:31:34 Current chairs: bgilbert dustymabe jbrooks jlebon jmarrero
16:32:16 .hi
16:32:17 aaradhak[m]: Sorry, but user 'aaradhak [m]' does not exist
16:32:21 #chair aaradhak[m]
16:32:21 Current chairs: aaradhak[m] bgilbert dustymabe jbrooks jlebon jmarrero
16:32:34 let's wait another minute :)
16:33:34 alrighty, let's get started!
16:33:39 .hi
16:33:40 walters: walters 'Colin Walters'
16:33:45 #topic Action items from last meeting
16:33:48 #chair walters
16:33:48 Current chairs: aaradhak[m] bgilbert dustymabe jbrooks jlebon jmarrero walters
16:33:55 last meeting was a video meeting
16:34:20 i don't *think* we had any action items come out of it, did we?
16:34:58 .hi
16:35:01 fifofonix: fifofonix 'Fifo Phonics'
16:35:10 i'll take that as a no :)
16:35:14 #chair fifofonix
16:35:14 Current chairs: aaradhak[m] bgilbert dustymabe fifofonix jbrooks jlebon jmarrero walters
16:35:28 #topic "Kernel Errors w/latest next/testing likely CIFS related."
16:35:31 #link https://github.com/coreos/fedora-coreos-tracker/issues/1381
16:35:43 fifofonix: just in time :)
16:36:04 dustymabe, fifofonix: either of you want to introduce this one?
16:36:15 sure.
16:37:11 basically, on next/testing, for my oldest nodes (which happen to be docker swarm mgmt nodes) i'm getting potentially CIFS-related issues.
16:37:35 machine soft crashes. can be rebooted. will survive with mounted cifs drives for hours but then kernel error again.
16:38:05 not captured in the issue to date is that i have successfully migrated all next/testing machines including other docker swarms without issue (and these too use cifs in the same way)
16:38:39 clarification: so what i'm saying is this only affects my very oldest machines in my dev/test environments running next/testing.
16:39:11 right now i'm trying to re-create these master nodes from scratch (which i should be doing anyway) but this requires some engineering on my part
16:39:21 the reason I tagged the issue for the meeting is because (I think) our ad-hoc release fixed some CIFS issues, but fifofonix was still reporting some failures.
16:39:47 to me the open question is whether we should pin the kernel in `stable` to whatever the current `stable` kernel is when we do the next `stable` release
16:39:48 yes, i tried latest next and still affected on this very small number of machines.
16:40:50 fifofonix: re: old machines versus new machines.. I wonder if this could somehow be cgroupsV1 related
16:41:00 ^^ random shot in the dark
16:41:02 from my perspective i've decided to pin my production clusters for now so they don't receive any updates. i'm not asking for you to hold off but obviously some others may be affected.
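(For reference, a minimal sketch of the two things mentioned just above: confirming which cgroup hierarchy a node is on, and pinning a node so it stops taking automatic updates. This assumes a standard Fedora CoreOS node with Zincati managing updates; the drop-in file name is illustrative.)

```
# Check which cgroup hierarchy the node is running:
# "cgroup2fs" means cgroups v2, "tmpfs" means the legacy v1 hierarchy.
stat -fc %T /sys/fs/cgroup

# Pin the node by disabling Zincati auto-updates via a config drop-in
# (the file name is arbitrary).
sudo mkdir -p /etc/zincati/config.d
cat <<'EOF' | sudo tee /etc/zincati/config.d/90-disable-auto-updates.toml
[updates]
enabled = false
EOF
sudo systemctl restart zincati.service
```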
16:41:43 running cgroups2 but i'm just worried about some accumulated lint in the machine.
16:42:06 so the old machines that are observing the problem are confirmed to be on cgroupsV2?
16:42:13 anyway, there is a BZ for this although I'm not sure that anyone has looked at it?
16:42:20 confirmed
16:42:28 (i just checked)
16:42:36 fifofonix: +1
16:42:57 hmm, to confirm, are you saying that *new* testing/next machines hit the CIFS issue, but old machines upgraded to the same next/testing release do not?
16:44:03 not exactly. i'm saying the very oldest vintage machines are experiencing the issue. pre-existing machines that were re-created within the last 6 months are fine.
16:44:21 and brand new clusters are ok too.
16:44:55 (we have a separate vmware networking issue affecting swarm - not to do with cifs as far as i know - on new clusters only)
16:44:57 it's not great but in the absence of a known root cause at least we have a workaround (re-deploy)
16:45:01 ok so it's the opposite. new testing/next machines do not hit the CIFS issue, but very old machines upgraded to the same next/testing release do
16:45:12 interesting. this adds a new dimension to the issue
16:45:18 jlebon: +1
16:46:02 is there any chance the older machines have a different environment (I'm thinking network) that could make them behave differently than the new machines?
16:46:03 have you done an /etc diff between the two?
16:46:42 obviously what i want to do is re-create these machines asap to see whether it is something specific to their workloads or due to unknown issues somehow carried forward from long ago.
16:47:12 fifofonix: +1 that information would be good to have
16:47:30 issue is replicable across my clustered-NUC environment housed in an office, and a datacenter (running the same version of vsphere)
16:48:29 and the machines that do work are on the same hardware/networks.
16:48:52 I will get on the /etc diff. anyway, we've spent a lot of time on this, and i don't want to derail you
16:49:21 dustymabe: sounds good re. not holding stable, but i wonder if we should send an email ahead of time
16:50:04 #proposed It appears this issue may affect older machines that are upgraded but we are still investigating to get more details. Since currently this issue only has one reported affected user/environment and they have pinned on a known working version we will release the next `testing` as usual.
16:50:09 the downside effect is quite bad, meaning the failure mode can create clusterwide issues.
16:50:11 jlebon: re: communication - maybe?
16:50:32 especially if you're aggressive on rollout.
16:50:51 bgilbert: jmarrero: anyone else.. WDYT?
16:51:44 dustymabe: i mean, "there's a known issue and it's coming to you" isn't great, but is better than affected users finding out after the fact
16:51:45 it'd be nice to get that /etc diff. "we've had one report of an issue" makes for a strange coreos-status post.
16:51:46 +1 to proposed
16:52:01 +1 to proposed
16:52:38 just to re-iterate on the failure mode point in case it should somehow be incorporated simply (!)
16:53:23 the problem is if you are affected, you upgrade one machine, all looks good for 2 hours, another machine in a 3-node cluster upgrades, and then they both crash overnight and you have lost cluster coherence.
16:53:32 We could pin stable, but what do we do next time if the situation hasn't improved? I wish it was clear what the problem was and there was an upstream fix (or even upstream reports of the same problem)
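(A sketch of the /etc comparison requested above, assuming OSTree-based hosts; `ostree admin config-diff` lists the files in /etc that differ from the deployment defaults, so comparing its output from an affected node and a healthy one is one way to spot configuration drift. The hostnames and paths below are illustrative.)

```
# On each node, capture local modifications to /etc relative to the deployed tree.
sudo ostree admin config-diff | sort > /tmp/etc-diff-$(hostname).txt

# Collect the files in one place and compare an affected (old) node
# against an unaffected (recently re-created) one.
diff /tmp/etc-diff-old-node.txt /tmp/etc-diff-new-node.txt
```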
16:53:57 fifofonix: I agree that we should fix this, but I think we need more data
16:54:00 (and i did also replicate this on rawhide)
16:54:14 ideally a reproducer
16:54:26 fifofonix: ^^ implies it can be reproduced on new nodes and not just upgrading ones
16:54:26 bgilbert: i guess adding details like "we encourage you to test your workload on testing/next to find out if you're affected" makes it more actionable
16:54:56 dustymabe: well, not sure, i switched streams, i didn't reprovision the machine.
16:55:07 Yeah, this one feels to me like a rather common scenario: there's a likely bug/regression, but to get traction it needs more diagnosis and potentially bisection, like a lot of other kernel bugs
16:55:17 fifofonix: ahh I see.
16:56:08 ok, we can discuss whether to send out a communication in #fedora-coreos maybe
16:56:18 +1 to proposed
16:56:29 want to #agreed it?
16:56:32 we *can* reverse course on the proposed if new details come in
16:56:59 I think the key here is "if no substantial new information comes in" then we know what to do on Monday when we start the releases
16:57:17 jlebon: go for it
16:57:17 dustymabe: +1
16:58:19 dustymabe: i think it's easier if you do it. i'd have to copy-paste and fix newlines
16:58:41 #agreed It appears this issue may affect older machines that are upgraded but we are still investigating to get more details. Since currently this issue only has one reported affected user/environment and they have pinned on a known working version we will release the next `testing` as usual.
16:58:54 before we move on
16:58:56 The other open question: does this warrant a status post if we do decide to ship what's in `testing` right now to `stable`? We've touched on it above, but is there agreement we should?
16:59:23 I'm mixed, but I guess we probably should
16:59:25 dustymabe: i suggested discussing it in #fedora-coreos
16:59:48 ok
16:59:58 just to get to some of the other tickets :)
17:00:16 #topic [CfP] Container Plumbing Days 2023
17:00:20 #link https://github.com/coreos/fedora-coreos-tracker/issues/1378
17:00:30 travier[m]: around?
17:00:46 dustymabe: I think we should not
17:01:07 with the current state of our knowledge
17:01:24 this is an FYI that the CfP for Container Plumbing Days 2023 is open
17:02:01 maybe add a comment to the ticket if submit something
17:02:04 if you*
17:02:14 like walters just did :)
17:03:11 dustymabe: do we have enough context on https://github.com/coreos/fedora-coreos-tracker/issues/1374 to discuss it now or should we wait for travier[m]?
17:03:38 jlebon: I think we can probably start the discussion
17:03:44 ok cool
17:03:49 #topic Podman begins CNI plugins deprecation
17:03:52 #link https://github.com/coreos/fedora-coreos-tracker/issues/1374
17:03:58 want to introduce it?
17:04:09 I'll do my best :)
17:04:51 My takeaway from this ticket is that there will be a time in the future (the hard deadline will be quite some time away) where people's existing containers on older systems will stop working
17:06:03 maybe some of the details are more nuanced than that (which we should tease out). Like for example, maybe it only applies to containers where a `podman network` was used. But either way a set of our users will be affected
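(For context, a minimal sketch of how a node could check whether it is still on the deprecated CNI backend, assuming podman 4.x where `podman info` exposes the network backend; the reset shown after it is the destructive migration path discussed below and wipes all existing container state.)

```
# Report which network backend podman is configured with:
# "cni" is the deprecated stack, "netavark" is its replacement.
podman info --format '{{.Host.NetworkBackend}}'

# Migrating off CNI means starting over: this removes ALL containers, images,
# pods, networks, and volumes so they can be recreated under netavark.
sudo podman system reset
```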
17:06:24 as we have done in the past with other "migrations" we should try to determine the best and least disruptive path forward for our users
17:07:40 it sounds like the "migration" is to do a full container reset
17:07:51 right
17:08:29 but.. since we know about the problem now we can start making noise on the communication channels and also on the nodes themselves (via CLHM)
17:08:53 the users can either re-deploy from scratch OR `podman system reset` existing systems
17:09:19 yeah, agreed
17:09:28 it's certainly not great that we'd require manual intervention like this, but being that we don't own podman and this is the pattern that team came up with, I don't think we have much choice
17:10:09 maybe let's revisit this when the removal timeline in Fedora is clearer
17:10:35 because then we can work on communications that use actual dates
17:11:08 jlebon: I agree with that ^^, but I think we should push on the podman team to make that clearer in the next month. I don't want to wait until we are 3 or even 6 months out to start warning our users.
17:11:39 I also want to make this CLHM warning `red` in color
17:11:43 dustymabe: indeed. i'll ping them in the issue
17:11:54 jlebon: +1
17:12:02 ok cool, moving on
17:12:08 #topic Create container repo tags for each FCOS release
17:12:09 on a side note.. I assume RHCOS is going to have to deal with this downstream too
17:12:10 #link https://github.com/coreos/fedora-coreos-tracker/issues/1367
17:12:28 since it does still rely on podman for something
17:12:47 dustymabe: all of OCP will have to deal with it :)
17:13:05 OpenShift nodes don't usually use networked podman containers
17:13:36 (I mean they're all --net=host)
17:13:50 walters: +1 - that's good info
17:14:17 so we discussed this ticket in an earlier meeting. we came to some conclusions in a follow-up chat we had.
17:14:30 that proposal is in https://github.com/coreos/fedora-coreos-tracker/issues/1367#issuecomment-1372870700
17:14:37 (furthermore, bootstrap containers are usually --rm, just like kubernetes pods; i.e. only persisted by configuration, not implicitly)
17:15:30 walters: did you want to discuss the last comments added there?
17:16:40 jlebon: my takeaway is that we can't really prune production tags (at least not for a long while, to account for the update graph)
17:16:43 I just added a 👍️ to the last comment but will only feel really confident when I try to actually dig into the code and test things
17:17:18 ahh - yeah from what bgilbert posted - we can prune, but just make sure we don't prune the barrier ones
17:17:36 so we'd modify the proposal slightly
17:17:48 (The goal of "avoid bespoke things by reusing container infrastructure directly" clashing with the "hey we made up this cincinnati thing" is painful)
17:19:14 ok right, that seems reasonable to me. it does mean though that users will now be able to pin on tags for a while, which was something we were trying to avoid
17:19:41 not arbitrary ones, though
17:19:43 we don't really have wording around what the oldest starting version we support is. we should probably firm that up at some point
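(For reference, a sketch of how the tags under discussion get consumed on the client side, assuming the quay.io/fedora/fedora-coreos repository and the OSTree native container flow; at this point the per-release tags were still only a proposal, so only the stream tags shown below are known to exist.)

```
# List the tags currently published for the FCOS container repo.
skopeo list-tags docker://quay.io/fedora/fedora-coreos

# Rebase a node onto the stream tag; a per-release tag (if/when published)
# would be pinned the same way by substituting it for "stable".
sudo rpm-ostree rebase ostree-unverified-registry:quay.io/fedora/fedora-coreos:stable
```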
17:20:01 indeed
17:20:01 I'm comfortable with the idea of keeping barrier releases around, given the value they provide
17:20:36 the fact that we eventually do get rid of them i think is important
17:20:53 AIUI we'd need to prune not only the barrier releases, but the release list used to build the Cincinnati graph
17:20:55 I wouldn't want to try to scope in removing them right now
17:20:59 walters: +1
17:21:07 sounds good. jlebon I guess you can update the ticket with that new info and then we can mark it as ready for action?
17:21:29 yup, SGTM
17:22:00 bgilbert: i guess it depends on how we decide to handle barrier releases in the larger GC discussion
17:22:13 but agreed that discussion can wait
17:22:21 does the ostree pruner take that into account?
17:22:34 the ostree pruner currently doesn't prune prod refs at all
17:23:12 ok cool, we're making good progress. let's see if we can squeeze one more in :)
17:23:26 it's not about the refs though, but the barriers including older refs
17:23:46 actually, i think all the other ones are larger discussions, so maybe we should stop here
17:24:13 walters: can you clarify?
17:24:31 dustymabe: did you want to talk about f38 changes, or is it good to tackle that again next week?
17:25:46 next week
17:25:47 ok, not much time left at this point, so let's just move to open floor
17:25:54 #topic Open Floor
17:26:39 walters: i'm not sure i follow. which older refs are you *ref*erring to?
17:27:13 anything anyone wants to bring up?
17:29:08 nothing here
17:29:43 will end meeting in 30s
17:30:13 #endmeeting