16:29:22 #startmeeting fedora_coreos_meeting
16:29:22 Meeting started Wed Nov 8 16:29:22 2023 UTC.
16:29:22 This meeting is logged and archived in a public location.
16:29:22 The chair is dustymabe. Information about MeetBot at https://fedoraproject.org/wiki/Zodbot#Meeting_Functions.
16:29:22 Useful Commands: #action #agreed #halp #info #idea #link #topic.
16:29:23 The meeting name has been set to 'fedora_coreos_meeting'
16:29:31 #topic roll call
16:29:33 .hi
16:29:40 hi :)
16:29:40 dustymabe: dustymabe 'Dusty Mabe'
16:30:14 👋 Nemric
16:30:41 #chair Nemric
16:30:41 Current chairs: Nemric dustymabe
16:30:54 might be a few minutes - some people are finishing up another meeting
16:31:08 dustymabe: (can't see pictures :( )
16:31:28 .hello siosm
16:31:36 travier: siosm 'Timothée Ravier'
16:31:39 #chair travier
16:31:39 Current chairs: Nemric dustymabe travier
16:32:15 .hi
16:32:22 marmijo: marmijo 'Michael Armijo'
16:32:48 I don't see any issues marked for discussion today
16:33:05 .hello c4rt0
16:33:06 travier: indeed. we need to be more proactive about this in the future.
16:33:13 apiaseck: c4rt0 'Adam Piasecki'
16:33:13 .hi
16:33:17 https://github.com/coreos/fedora-coreos-tracker/issues/1608 maybe?
16:33:20 jmarrero: jmarrero 'Joseph Marrero'
16:33:35 .hello spresti
16:33:38 apiaseck: that email address.. it's like password complexity
16:33:43 spresti: spresti 'Steven Presti'
16:33:44 .hello ydesouza
16:33:50 ydesouza: ydesouza 'Yasmin Valim de Souza'
16:33:52 #chair marmijo apiaseck jmarrero spresti ydesouza
16:33:52 Current chairs: Nemric apiaseck dustymabe jmarrero marmijo spresti travier ydesouza
16:34:11 .hello c4rt0
16:34:18 apiaseck: c4rt0 'Adam Piasecki'
16:34:42 did I miss anyone with #chair?
16:35:04 dustymabe: trust me, I hate spelling it over the phone
16:35:13 #topic Action items from last meeting
16:35:30 * travier to create a change proposal for F40 for switching away from nss-altfiles for OSTree based systems
16:35:41 and then I think we have one that I'll make some text up for now:
16:36:11 * travier to schedule meeting with Assisted installer team to understand the use case around https://github.com/coreos/fedora-coreos-tracker/issues/1595
16:36:32 travier: :) - welcome back from a week away
16:36:44 dustymabe: :D
16:36:52 we'll have to re-action those
16:36:59 still haven't been able to do them
16:37:09 ok, i only threw that last one in there because I didn't want it to drop
16:37:14 it wasn't actually an action item from last time
16:37:26 #action travier to create a change proposal for F40 for switching away from nss-altfiles for OSTree based systems
16:37:37 #action travier to schedule meeting with Assisted installer team to understand the use case around https://github.com/coreos/fedora-coreos-tracker/issues/1595
16:37:53 #topic Nodes Fail To Update (Zincati Reports libsystemd errors regarding EMFILE: Too many open files)
16:38:00 #link https://github.com/coreos/fedora-coreos-tracker/issues/1608
16:38:32 so we have an issue that is/will cause nodes to not be able to upgrade
16:39:12 It appears a new zincati RPM (just made it to our stable stream) leaks open files and eventually hits its limit and then can't continue to function
16:39:55 it's worth doing a mini retrospective on this here maybe in 5 minutes.. but for now - anything we should bring up?
16:40:08 has the dependency been updated in Fedora RPMs?
16:40:30 travier: you mean the zincati dep that caused the issue ?
16:40:37 yes
16:41:02 a new release just came out and was updated upstream in: https://github.com/coreos/zincati/pull/1118
16:41:19 so we should now be able to do a new zincati release/build
16:41:33 Jonathan just did one: https://github.com/coreos/zincati/pull/1119
16:41:39 to answer your question, though. I think at least right now (we want to change it) zincati is built with bundled deps
16:41:59 so the dependent rust libsystemd package in fedora isn't important (to us at least)
16:42:06 hmm, missed the start. did we not announce it in matrix?
16:42:07 oh, right, we bundle, so this should be fine
16:42:18 jlebon: sigh
16:42:23 I typed it but didn't press enter
16:42:34 ahh heh
16:42:56 i was multi-tasking :)
16:43:02 guessing the context: yeah, working on cutting a new zincati releast with the fix
16:43:09 release*
16:43:37 for now I have pinned to the old package in the stable stream in https://github.com/coreos/fedora-coreos-config/pull/2720
16:44:17 if we get a new stable release out today then stable stream nodes should not encounter a problem (as they wouldn't have run out of open files yet, it takes a few days)
16:44:40 I'm ready to hit ok-to-promote
16:44:48 on the stable: new release on 2023-11-08 (38.20231027.3.2)
16:45:03 apiaseck: that part is already done. the build has already happened, just need the part where we check that all tests passed etc..
16:45:18 btw, it's easier to reproduce this with `steady_interval_secs = 1`: https://github.com/coreos/fedora-coreos-tracker/issues/1608#issuecomment-1802126860
16:45:19 that'll make zincati emit sd_notify updates much more frequently
16:45:29 dustymabe: All test did passed at this stage
16:45:36 pass **
16:45:43 ahh, then yeah. it's ready for the release job then
16:46:17 dustymabe: I'll do that now then
16:46:18 anyhow I think the problem is pretty well known at this point and the path forward is hopefully not too complicated
16:46:31 1. pin to old zincati on stable - do ad-hoc release
16:46:47 2. get new zincati into testing/next - do ad-hoc releases
16:47:03 3. send communication telling people how to get their nodes unstuck
16:47:30 yeah, agreed
16:47:33 manual intervention should only be required for testing/next nodes
16:48:29 ideally also
16:48:29 4. add CI testing that would have caught this. it shouldn't focus on leaking fds, but be more general
16:48:36 4. next time look at fifofonix reported issues sooner
16:48:57 :)
16:49:07 :)
16:49:08 jlebon: yeah let's talk about that in a minute
16:49:11 yeah, that was a mistake on our part
16:49:17 does anyone disagree with the course of action here?
16:49:44 I'm good. Feel like the Zincati refresh interval is too small but meh
16:50:12 ok let's move to a mini-retro then
16:50:24 i.e. how did we get here? what can we do to improve in the future
16:50:25 travier: that can certainly be discussed too
16:50:47 If we had zincati triggered from a systemd timer and then exit, we would not have that issue
16:50:57 i.e. zincati does not really need to run 100% of the time
16:51:02 there are a few things that caused us to miss this issue
16:51:18 1. adjusted release schedule based on the Fedora GA
16:51:52 -> this meant we did many releases in a short time span, including doing a stable release in two consecutive weeks
16:52:51 I don't necessarily think we should change our current policy/schedule around GA, just highlighting it as something that contributed here. If we were on the normal two week cadence I think we would have caught it before we promoted that content to GA.
16:53:37 i think also we should've been more wary of the new zincati. it hadn't been updated in a long time, and had *a lot* of dependency bumps
16:53:43 any comments on 1. before I move to another point?
16:53:53 so one strategy could've been to have it bake in next only to start
16:54:13 jlebon: maybe, but in general we try to let our CI be the arbiter of "good" or "bad"
16:54:34 having different policies per package makes things harder to maintain unfortunately :(
16:54:36 dustymabe: updates specifically are harder to test in realistic scenarios in CI
16:54:44 jlebon: agree
16:54:55 what is crazy is that we did learn from something like this in the past
16:55:15 we created an extended upgrade test that literally tests from a matrix of starting points in the past
16:55:36 but... that test runs upgrades consecutively one after the other
16:55:41 .hi
16:55:42 fifofonix: fifofonix 'Fifo Phonics'
16:55:52 so it would have never hit the time based problem - the nodes were never up long enough
16:56:22 but I think there is some positive takeaways her
16:56:24 This kind of leak is hard to find in CI
16:56:26 here*
16:56:27 exactly. it being time-based means it would've been very unlikely a CI test would've covered this
16:56:42 the good news is that our "stream model" works!
16:57:08 The leak is also here in stable as far as I can see
16:57:08 +1
16:57:11 we were able to find and fix this issue, even though we caught it very late, we still caught it soon enough that stable nodes should not need intervention. This is a huge win IMO
16:57:30 travier: yes, it's in current latest released `stable`
16:58:02 👍
16:58:04 we stopped the rollout and will do a new release today
16:58:22 ah yes, I'm already on 38.20231027.3.1 on my systems
16:58:44 I think one thing we can do in the future is possibly add some monitoring to our persistent nodes that our team has running
16:59:17 the monitoring would have helped us ID stuff like this sooner, which is I think how fifofonix found the anomaly in his environment
16:59:42 monitoring/alerting
17:00:02 jlebon: any other ideas on things we should do to catch something like this?
17:00:13 FD count was at 248 for 18h of uptime so we have about 3 days to push the fix before stable nodes that upgraded need manual intervention
17:00:59 I also had a suggestion in https://github.com/coreos/fedora-coreos-tracker/issues/1608#issuecomment-1802038890 of making it so that zincati gets restarted periodically, which is a less extreme version of what travier proposed above of making zincati a oneshot service
17:01:07 travier: just for context: the release job on the latest stable is already running
17:01:15 does the countme data tell a story? ie. can you see materially less next nodes upgraded by when you would expect?
17:01:23 apiaseck: ++
17:01:36 dustymabe: i think we got a good bunch. we should collate them somewhere and see which ones we want to pursue
17:01:46 fifofonix: not really :( - countme data is really a poor man's statistics
17:01:52 fifofonix: countme data does not report node versions, only major fedora versions
17:01:54 countme database only gets updated once a week
17:02:08 travier: yeah, that too
17:02:27 yeah, countme is very broad, not precise at all by design (privacy, etc.)
17:02:59 restarting zincati periodically sounds ok to me
17:03:31 though it makes prod even more unlike CI
17:03:59 here is a hackmd where we can collate ideas on how to avoid this in the future
17:04:03 https://hackmd.io/xPcq8EvxTH6TQhFFx_Rfug?edit
17:04:39 one path is also that we move away from Zincati as it is
17:04:45 but we're not there yet
17:05:39 #info we have a path forward where we think stable nodes won't need manual intervention, but testing/next nodes will. We will put out ad-hoc releases for all streams and send a communication to everyone about necessary steps.
17:05:49 ^^ let me know if anything needs to be changed
17:06:10 Really think that running some instance with monitoring should do the trick, I'm running a "next" node with logs/metrics and a filter for error/warning logs
17:06:58 this node boot on pxe so zincati is not running ^^ don't have logs for that
17:07:07 yep
17:07:20 +1 for info
17:07:37 ok we've spent a long time on this. let's move on
17:08:03 #topic open floor
17:08:09 anyone with anything for open floor?
17:08:24 https://github.com/coreos/fedora-coreos-tracker/issues/1296 ?
17:09:29 Reproduction steps
17:09:30 I did run tests about this on the last test-week for FCOS39 and it's always true
17:09:31 Start a diskless coreos from PXE boot
17:09:33 Expected behavior
17:09:35 No failed units at startup
17:09:53 i'm wondering why our CI doesn't catch this? jlebon don't we have a PXE diskless test?
17:10:41 we do, yup
17:10:52 well
17:11:09 cheese say : https://github.com/coreos/fedora-coreos-tracker/issues/1296#issuecomment-1306747053
17:11:23 we have pxe installs, but the unit state checking might happen only on the installed system
17:12:01 ahh, so we don't have any tests that don't run an install?
17:12:07 "any pxe tests"
17:12:13 maybe PXE boots are full in memory so systemd does not remount a tmpfs on top of /tmp?
17:12:23 we do for the iso, but don't *think* for pxe
17:12:43 yeah, we need a `pxe-live-login` test
17:13:24 maybe something for a volunteer :)
17:13:54 jlebon: that should be pretty easy? we should have all the scaffolding, just need the test case to be defined?
17:14:12 should be, yeah
17:14:43 Nemric: if you volunteer to add the test then you'd know your use case was tested in the future :)
17:15:22 I could give it a try, just let me know where to begin ^^
17:15:22 any other topics for open floor?
17:15:35 Nemric: catch us in Fedora CoreOS matrix after the meeting?
17:15:53 matrix ?
17:16:02 Nemric: yeah :(
17:16:21 https://github.com/coreos/fedora-coreos-tracker/issues/1566
17:16:30 bullet time ! :D
17:16:46 if you need help getting that set up you can ping me in the IRC channel
17:16:53 i'm still there, just don't monitor it
17:17:06 oh I actually had a topic to bring up from baude today
17:17:09 almost forgot it
17:18:44 that's it ? https://matrix.org/docs/chat_basics/matrix-for-im/
17:18:49 baude was trying to solve this pain point/papercut problem for a use case for podman machine: https://github.com/coreos/rpm-ostree/issues/337 (Support empty toplevel mount points) and he implemented it poorly and caused some rework: https://github.com/containers/podman/pull/20612#discussion_r1384846241
17:19:30 his plea is something along the lines of "can we please fix a longstanding issue that many people have hit and been open since 2016?"
17:20:15 jlebon: has an open PR in https://github.com/ostreedev/ostree/pull/2681 but there was some rework needed after code review
17:20:38 jlebon: could we maybe mentor someone on picking up that work and pushing it over the finish line?
17:20:39 nit: it's an RFE not an issue :)
17:21:32 true, but it borders into "issue" when it violates principle of least surprise
17:21:33 the workaround isn't hard, which is why it hasn't been super prioritized outside of a hack day project
17:21:33 i would like to pick it up again though
17:22:01 the limitations of ostree in that respect is pretty known at this point
17:22:07 to be clear I'm bringing this up at the request of baude, who had a water heater leak and couldn't make the meeting
17:22:39 and that's all I had
17:22:44 +1
17:22:53 I can have my own matrix server !? yeahh ! one more workload for FCOS ! :D
17:23:03 :)
17:23:04 Nemric: :)
17:23:05 :)
17:23:18 Nemric: https://github.com/travier/fedora-coreos-matrix :)
17:23:56 travier: thanks !
17:24:14 I did see your nomad repo too ;)
17:24:17 jlebon: on that topic, would it be something easy enough that someone could use it as a learning opportunity? if so maybe we can solicit volunteers
17:24:28 travier: I will definitely look into that, TY!
17:24:34 any other topics for open floor?
17:25:12 there is a fedora release party I think this friday and saturday
17:25:33 unfortunately with being so busy lately I didn't submit anything
17:25:46 and I'll also not be able to attend (will be AFK Fri-Sun)
17:25:52 dustymabe: it could be, yeah. if someone wants to learn more about libostree
17:25:55 https://fedoraproject.org/wiki/Fedora_Linux_39_Release_Party_Schedule
17:26:19 opportunity for learning ^^
17:26:42 a few minutes left in the meeting. I'll close it out soon unless new topics come up
17:26:58 reminder: please add the meeting label to issues we should discuss in future meetings
17:27:07 Wondering if we should document a set of units to do that properly
17:28:30 travier: like https://github.com/ostreedev/ostree/pull/2681#issuecomment-1481906472 ?
17:29:08 it's more complex than that if you have multiple mounts and want a unit for each one
17:29:24 yeah, either way would be better to get the RFE implemented
17:29:31 closing out the meeting soon
17:29:56 have a unit do the chattr -i, order the mkdir units after, have another unit do the chattr +i with the right order
17:30:28 idea: systemd unit directive that takes a semaphore (ability to serialize with other units without having to name them)
17:30:40 and I'll let that bad idea end the meeting
17:30:42 #endmeeting