16:29:22 <dustymabe> #startmeeting fedora_coreos_meeting
16:29:22 <zodbot> Meeting started Wed Nov 8 16:29:22 2023 UTC.
16:29:22 <zodbot> This meeting is logged and archived in a public location.
16:29:22 <zodbot> The chair is dustymabe. Information about MeetBot at https://fedoraproject.org/wiki/Zodbot#Meeting_Functions.
16:29:22 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
16:29:23 <zodbot> The meeting name has been set to 'fedora_coreos_meeting'
16:29:31 <dustymabe> #topic roll call
16:29:33 <dustymabe> .hi
16:29:40 <Nemric> hi :)
16:29:40 <zodbot> dustymabe: dustymabe 'Dusty Mabe' <dusty@dustymabe.com>
16:30:14 <dustymabe> 👋 Nemric
16:30:41 <dustymabe> #chair Nemric
16:30:41 <zodbot> Current chairs: Nemric dustymabe
16:30:54 <dustymabe> might be a few minutes - some people are finishing up another meeting
16:31:08 <Nemric> dustymabe: (can't see pictures :( )
16:31:28 <travier> .hello siosm
16:31:36 <zodbot> travier: siosm 'Timothée Ravier' <travier@redhat.com>
16:31:39 <dustymabe> #chair travier
16:31:39 <zodbot> Current chairs: Nemric dustymabe travier
16:32:15 <marmijo> .hi
16:32:22 <zodbot> marmijo: marmijo 'Michael Armijo' <marmijo@redhat.com>
16:32:48 <travier> I don't see any issues marked for discussion today
16:33:05 <apiaseck> .hello c4rt0
16:33:06 <dustymabe> travier: indeed. we need to be more proactive about this in the future.
16:33:13 <zodbot> apiaseck: c4rt0 'Adam Piasecki' <c4rt0gr4ph3r@gmail.com>
16:33:13 <jmarrero> .hi
16:33:17 <travier> https://github.com/coreos/fedora-coreos-tracker/issues/1608 maybe?
16:33:20 <zodbot> jmarrero: jmarrero 'Joseph Marrero' <jmarrero@redhat.com>
16:33:35 <spresti> .hello spresti
16:33:38 <dustymabe> apiaseck: that email address..
it's like password complexity
16:33:43 <zodbot> spresti: spresti 'Steven Presti' <spresti@redhat.com>
16:33:44 <ydesouza> .hello ydesouza
16:33:50 <zodbot> ydesouza: ydesouza 'Yasmin Valim de Souza' <ydesouza@redhat.com>
16:33:52 <dustymabe> #chair marmijo apiaseck jmarrero spresti ydesouza
16:33:52 <zodbot> Current chairs: Nemric apiaseck dustymabe jmarrero marmijo spresti travier ydesouza
16:34:11 <apiaseck> .hello c4rt0
16:34:18 <zodbot> apiaseck: c4rt0 'Adam Piasecki' <c4rt0gr4ph3r@gmail.com>
16:34:42 <dustymabe> did I miss anyone with #chair?
16:35:04 <apiaseck> dustymabe: trust me, I hate spelling it over the phone
16:35:13 <dustymabe> #topic Action items from last meeting
16:35:30 <dustymabe> * travier to create a change proposal for F40 for switching away from nss-altfiles for OSTree based systems
16:35:41 <dustymabe> and then I think we have one that I'll make some text up for now:
16:36:11 <dustymabe> * travier to schedule meeting with Assisted installer team to understand the use case around https://github.com/coreos/fedora-coreos-tracker/issues/1595
16:36:32 <dustymabe> travier: :) - welcome back from a week away
16:36:44 <travier> dustymabe: :D
16:36:52 <travier> we'll have to re-action those
16:36:59 <travier> still haven't been able to do them
16:37:09 <dustymabe> ok, i only threw that last one in there because I didn't want it to drop
16:37:14 <dustymabe> it wasn't actually an action item from last time
16:37:26 <dustymabe> #action travier to create a change proposal for F40 for switching away from nss-altfiles for OSTree based systems
16:37:37 <dustymabe> #action travier to schedule meeting with Assisted installer team to understand the use case around https://github.com/coreos/fedora-coreos-tracker/issues/1595
16:37:53 <dustymabe> #topic Nodes Fail To Update (Zincati Reports libsystemd errors regarding EMFILE: Too many open files)
16:38:00 <dustymabe> #link https://github.com/coreos/fedora-coreos-tracker/issues/1608
16:38:32 <dustymabe> so we
have an issue that is/will cause nodes to not be able to upgrade
16:39:12 <dustymabe> It appears a new zincati RPM (just made it to our stable stream) leaks open files and eventually hits its limit and then can't continue to function
16:39:55 <dustymabe> it's worth doing a mini retrospective on this here maybe in 5 minutes.. but for now - anything we should bring up?
16:40:08 <travier> has the dependency been updated in Fedora RPMs?
16:40:30 <dustymabe> travier: you mean the zincati dep that caused the issue ?
16:40:37 <travier> yes
16:41:02 <dustymabe> a new release just came out and was updated upstream in: https://github.com/coreos/zincati/pull/1118
16:41:19 <dustymabe> so we should now be able to do a new zincati release/build
16:41:33 <jmarrero> Jonathan just did one: https://github.com/coreos/zincati/pull/1119
16:41:39 <dustymabe> to answer your question, though. I think at least right now (we want to change it) zincati is built with bundled deps
16:41:59 <dustymabe> so the dependent rust libsystemd package in fedora isn't important (to us at least)
16:42:06 <jlebon> hmm, missed the start. did we not announce it in matrix?
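For readers who want to check whether a node is affected, the leak described above can be observed by counting a process's open file descriptors. The following is only a minimal sketch, assuming a Linux system with `/proc` mounted; the helper name `count_open_fds` is hypothetical and is not part of zincati or any Fedora CoreOS tooling.

```python
import os


def count_open_fds(pid="self"):
    """Count the entries in /proc/<pid>/fd (Linux only).

    Hypothetical helper for illustration. Reading another process's
    fd directory normally requires the same user or root. When pid is
    "self", the count includes the descriptor that os.listdir() holds
    briefly while it reads the directory.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    # Pass a real PID (for example the zincati service's main PID) to watch
    # a daemon; sampled repeatedly, a steadily growing number suggests a leak.
    print(count_open_fds())
```

Sampling this value a few hours apart gives a rough growth rate, which is the kind of signal the monitoring discussed later in the meeting would alert on.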
16:42:07 <travier> oh, right, we bundle, so this should be fine
16:42:18 <dustymabe> jlebon: sigh
16:42:23 <dustymabe> I typed it but didn't press enter
16:42:34 <jlebon> ahh heh
16:42:56 <dustymabe> i was multi-tasking :)
16:43:02 <jlebon> guessing the context: yeah, working on cutting a new zincati releast with the fix
16:43:09 <jlebon> release*
16:43:37 <dustymabe> for now I have pinned to the old package in the stable stream in https://github.com/coreos/fedora-coreos-config/pull/2720
16:44:17 <dustymabe> if we get a new stable release out today then stable stream nodes should not encounter a problem (as they wouldn't have run out of open files yet, it takes a few days)
16:44:40 <apiaseck> I'm ready to hit ok-to-promote
16:44:48 <apiaseck> on the stable: new release on 2023-11-08 (38.20231027.3.2)
16:45:03 <dustymabe> apiaseck: that part is already done. the build has already happened, just need the part where we check that all tests passed etc..
16:45:18 <jlebon> btw, it's easier to reproduce this with `steady_interval_secs = 1`: https://github.com/coreos/fedora-coreos-tracker/issues/1608#issuecomment-1802126860
16:45:19 <jlebon> that'll make zincati emit sd_notify updates much more frequently
16:45:29 <apiaseck> dustymabe: All test did passed at this stage
16:45:36 <apiaseck> pass **
16:45:43 <dustymabe> ahh, then yeah. it's ready for the release job then
16:46:17 <apiaseck> dustymabe: I'll do that now then
16:46:18 <dustymabe> anyhow I think the problem is pretty well known at this point and the path forward is hopefully not too complicated
16:46:31 <dustymabe> 1. pin to old zincati on stable - do ad-hoc release
16:46:47 <dustymabe> 2. get new zincati into testing/next - do ad-hoc releases
16:47:03 <dustymabe> 3. send communication telling people how to get their nodes unstuck
16:47:30 <jlebon> yeah, agreed
16:47:33 <dustymabe> manual intervention should only be required for testing/next nodes
16:48:29 <jlebon> ideally also
16:48:29 <jlebon> 4.
add CI testing that would have caught this. it shouldn't focus on leaking fds, but be more general
16:48:36 <dustymabe> 4. next time look at fifofonix reported issues sooner
16:48:57 <dustymabe> :)
16:49:07 <travier> :)
16:49:08 <dustymabe> jlebon: yeah let's talk about that in a minute
16:49:11 <jlebon> yeah, that was a mistake on our part
16:49:17 <dustymabe> does anyone disagree with the course of action here?
16:49:44 <travier> I'm good. Feel like the Zincati refresh interval is too small but meh
16:50:12 <dustymabe> ok let's move to a mini-retro then
16:50:24 <dustymabe> i.e. how did we get here? what can we do to improve in the future
16:50:25 <jlebon> travier: that can certainly be discussed too
16:50:47 <travier> If we had zincati triggered from a systemd timer and then exit, we would not have that issue
16:50:57 <travier> i.e. zincati does not really need to run 100% of the time
16:51:02 <dustymabe> there are a few things that caused us to miss this issue
16:51:18 <dustymabe> 1. adjusted release schedule based on the Fedora GA
16:51:52 <dustymabe> -> this meant we did many releases in a short time span, including doing a stable release in two consecutive weeks
16:52:51 <dustymabe> I don't necessarily think we should change our current policy/schedule around GA, just highlighting it as something that contributed here. If we were on the normal two week cadence I think we would have caught it before we promoted that content to GA.
16:53:37 <jlebon> i think also we should've been more wary of the new zincati. it hadn't been updated in a long time, and had *a lot* of dependency bumps
16:53:43 <dustymabe> any comments on 1. before I move to another point?
16:53:53 <jlebon> so one strategy could've been to have it bake in next only to start
16:54:13 <dustymabe> jlebon: maybe, but in general we try to let our CI be the arbiter of "good" or "bad"
16:54:34 <dustymabe> having different policies per package makes things harder to maintain unfortunately :(
16:54:36 <jlebon> dustymabe: updates specifically are harder to test in realistic scenarios in CI
16:54:44 <dustymabe> jlebon: agree
16:54:55 <dustymabe> what is crazy is that we did learn from something like this in the past
16:55:15 <dustymabe> we created an extended upgrade test that literally tests from a matrix of starting points in the past
16:55:36 <dustymabe> but... that test runs upgrades consecutively one after the other
16:55:41 <fifofonix> .hi
16:55:42 <zodbot> fifofonix: fifofonix 'Fifo Phonics' <fifofonix@gmail.com>
16:55:52 <dustymabe> so it would have never hit the time based problem - the nodes were never up long enough
16:56:22 <dustymabe> but I think there is some positive takeaways her
16:56:24 <travier> This kind of leak is hard to find in CI
16:56:26 <dustymabe> here*
16:56:27 <jlebon> exactly. it being time-based means it would've been very unlikely a CI test would've covered this
16:56:42 <dustymabe> the good news is that our "stream model" works!
16:57:08 <travier> The leak is also here in stable as far as I can see
16:57:08 <Nemric> +1
16:57:11 <dustymabe> we were able to find and fix this issue, even though we caught it very late, we still caught it soon enough that stable nodes should not need intervention.
This is a huge win IMO
16:57:30 <dustymabe> travier: yes, it's in current latest released `stable`
16:58:02 <travier> 👍
16:58:04 <dustymabe> we stopped the rollout and will do a new release today
16:58:22 <travier> ah yes, I'm already on 38.20231027.3.1 on my systems
16:58:44 <dustymabe> I think one thing we can do in the future is possibly add some monitoring to our persistent nodes that our team has running
16:59:17 <dustymabe> the monitoring would have helped us ID stuff like this sooner, which is I think how fifofonix found the anomaly in his environment
16:59:42 <dustymabe> monitoring/alerting
17:00:02 <dustymabe> jlebon: any other ideas on things we should do to catch something like this?
17:00:13 <travier> FD count was at 248 for 18h of uptime so we have about 3 days to push the fix before stable nodes that upgraded need manual intervention
17:00:59 <dustymabe> I also had a suggestion in https://github.com/coreos/fedora-coreos-tracker/issues/1608#issuecomment-1802038890 of making it so that zincati gets restarted periodically, which is a less extreme version of what travier proposed above of making zincati a oneshot service
17:01:07 <apiaseck> travier: just for context: the release job on the latest stable is already running
17:01:15 <fifofonix> does the countme data tell a story? ie. can you see materially less next nodes upgraded by when you would expect?
17:01:23 <travier> apiaseck: ++
17:01:36 <jlebon> dustymabe: i think we got a good bunch. we should collate them somewhere and see which ones we want to pursue
17:01:46 <dustymabe> fifofonix: not really :( - countme data is really a poor man's statistics
17:01:52 <travier> fifofonix: countme data does not report node versions, only major fedora versions
17:01:54 <dustymabe> countme database only gets updated once a week
17:02:08 <dustymabe> travier: yeah, that too
17:02:27 <travier> yeah, countme is very broad, not precise at all by design (privacy, etc.)
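The "about 3 days" estimate above can be reproduced with a short back-of-the-envelope calculation. This is a sketch under stated assumptions, not a figure confirmed in the meeting: the leak is treated as linear, all 248 descriptors are attributed to the leak, and the limit is taken to be 1024 (a common default soft `RLIMIT_NOFILE`, assumed here). The function name and parameters are hypothetical.

```python
def hours_until_fd_limit(fd_count, uptime_hours, fd_limit=1024):
    """Estimate the hours left before a linearly leaking process hits its fd limit.

    fd_limit=1024 is an assumed soft limit, not a value stated in the meeting.
    The rate is simply fd_count / uptime_hours, i.e. every open descriptor is
    attributed to the leak, which slightly overestimates the leak rate.
    """
    if fd_count <= 0 or uptime_hours <= 0:
        raise ValueError("fd_count and uptime_hours must be positive")
    leak_rate_per_hour = fd_count / uptime_hours
    return (fd_limit - fd_count) / leak_rate_per_hour


if __name__ == "__main__":
    remaining = hours_until_fd_limit(fd_count=248, uptime_hours=18)
    # Total uptime at exhaustion is the 18h already elapsed plus the remainder.
    print(f"remaining: {remaining:.1f}h, total uptime: {(18 + remaining) / 24:.1f} days")
```

With the numbers quoted in the meeting this gives roughly 56 more hours, or about 3.1 days of total uptime, which is consistent with the estimate given in the discussion.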
17:02:59 <travier> restarting zincati periodically sounds ok to me
17:03:31 <jlebon> though it makes prod even more unlike CI
17:03:59 <dustymabe> here is a hackmd where we can collate ideas on how to avoid this in the future
17:04:03 <dustymabe> https://hackmd.io/xPcq8EvxTH6TQhFFx_Rfug?edit
17:04:39 <travier> one path is also that we move away from Zincati as it is
17:04:45 <travier> but we're not there yet
17:05:39 <dustymabe> #info we have a path forward where we think stable nodes won't need manual intervention, but testing/next nodes will. We will put out ad-hoc releases for all streams and send a communication to everyone about necessary steps.
17:05:49 <dustymabe> ^^ let me know if anything needs to be changed
17:06:10 <Nemric> Really think that running some instance with monitoring should do the trick, I'm running a "next" node with logs/metrics and a filter for error/warning logs
17:06:58 <Nemric> this node boot on pxe so zincati is not running ^^ don't have logs for that
17:07:07 <dustymabe> yep
17:07:20 <travier> +1 for info
17:07:37 <dustymabe> ok we've spent a long time on this. let's move on
17:08:03 <dustymabe> #topic open floor
17:08:09 <dustymabe> anyone with anything for open floor?
17:08:24 <Nemric> https://github.com/coreos/fedora-coreos-tracker/issues/1296 ?
17:09:29 <dustymabe> Reproduction steps
17:09:30 <Nemric> I did run tests about this oin the last test-week for FCOS39 and it's always true
17:09:31 <dustymabe> Start a diskless coreos from PXE boot
17:09:33 <dustymabe> Expected behavior
17:09:35 <dustymabe> No failed units at startup
17:09:53 <dustymabe> i'm wondering why our CI doesn't catch this? jlebon don't we have a PXE diskless test?
17:10:41 <jlebon> we do, yup
17:10:52 <jlebon> well
17:11:09 <Nemric> cheese say : https://github.com/coreos/fedora-coreos-tracker/issues/1296#issuecomment-1306747053
17:11:23 <jlebon> we have pxe installs, but the unit state checking might happen only on the installed system
17:12:01 <dustymabe> ahh, so we don't have any tests that don't run an install?
17:12:07 <dustymabe> "any pxe tests"
17:12:13 <travier> maybe PXE boots are full in memory so systemd does not remount a tmpfs on top of /tmp?
17:12:23 <jlebon> we do for the iso, but don't *think* for pxe
17:12:43 <jlebon> yeah, we need a `pxe-live-login` test
17:13:24 <dustymabe> maybe something for a volunteer :)
17:13:54 <dustymabe> jlebon: that should be pretty easy? we should have all the scaffolding, just need the test case to be defined?
17:14:12 <jlebon> should be, yeah
17:14:43 <dustymabe> Nemric: if you volunteer to add the test then you'd know your use case was tested in the future :)
17:15:22 <Nemric> I could give it a try, just let me know where to begin ^^
17:15:22 <dustymabe> any other topics for open floor?
17:15:35 <dustymabe> Nemric: catch us in Fedora CoreOS matrix after the meeting?
17:15:53 <Nemric> matrix ?
17:16:02 <dustymabe> Nemric: yeah :(
17:16:21 <dustymabe> https://github.com/coreos/fedora-coreos-tracker/issues/1566
17:16:30 <Nemric> bullet time ! :D
17:16:46 <dustymabe> if you need help getting that set up you can ping me in the IRC channel
17:16:53 <dustymabe> i'm still there, just don't monitor it
17:17:06 <dustymabe> oh I actually had a topic to bring up from baude today
17:17:09 <dustymabe> almost forgot it
17:18:44 <Nemric> that's it ?
https://matrix.org/docs/chat_basics/matrix-for-im/
17:18:49 <dustymabe> baude was trying to solve this pain point/papercut problem for a use case for podman machine: https://github.com/coreos/rpm-ostree/issues/337 (Support empty toplevel mount points) and he implemented it poorly and caused some rework: https://github.com/containers/podman/pull/20612#discussion_r1384846241
17:19:30 <dustymabe> his plea is something along the lines of "can we please fix a longstanding issue that many people have hit and been open since 2016?"
17:20:15 <dustymabe> jlebon: has an open PR in https://github.com/ostreedev/ostree/pull/2681 but there was some rework needed after code review
17:20:38 <dustymabe> jlebon: could we maybe mentor someone on picking up that work and pushing it over the finish line?
17:20:39 <jlebon> nit: it's an RFE not an issue :)
17:21:32 <dustymabe> true, but it borders into "issue" when it violates principle of least surprise
17:21:33 <jlebon> the workaround isn't hard, which is why it hasn't been super prioritized outside of a hack day project
17:21:33 <jlebon> i would like to pick it up again though
17:22:01 <jlebon> the limitations of ostree in that respect is pretty known at this point
17:22:07 <dustymabe> to be clear I'm bringing this up at the request of baude, who had a water heater leak and couldn't make the meeting
17:22:39 <dustymabe> and that's all I had
17:22:44 <jlebon> +1
17:22:53 <Nemric> I can have my own matrix server !? yeahh ! one more workload for FCOS ! :D
17:23:03 <travier> :)
17:23:04 <dustymabe> Nemric: :)
17:23:05 <apiaseck> :)
17:23:18 <travier> Nemric: https://github.com/travier/fedora-coreos-matrix :)
17:23:56 <Nemric> travier: thanks !
17:24:14 <Nemric> I did see your nomad repo too ;)
17:24:17 <dustymabe> jlebon: on that topic, would it be something easy enough that someone could use it as a learning opportunity? if so maybe we can solicit volunteers
17:24:28 <apiaseck> travier: I will definitely look into that, TY!
17:24:34 <dustymabe> any other topics for open floor?
17:25:12 <dustymabe> there is a fedora release party I think this friday and saturday
17:25:33 <dustymabe> unfortunately with being so busy lately I didn't submit anything
17:25:46 <dustymabe> and I'll also not be able to attend (will be AFK Fri-Sun)
17:25:52 <jlebon> dustymabe: it could be, yeah. if someone wants to learn more about libostree
17:25:55 <dustymabe> https://fedoraproject.org/wiki/Fedora_Linux_39_Release_Party_Schedule
17:26:19 <dustymabe> opportunity for learning ^^
17:26:42 <dustymabe> a few minutes left in the meeting. I'll close it out soon unless new topics come up
17:26:58 <dustymabe> reminder: please add the meeting label to issues we should discuss in future meetings
17:27:07 <travier> Wondering if we should document a set of units to do that properly
17:28:30 <dustymabe> travier: like https://github.com/ostreedev/ostree/pull/2681#issuecomment-1481906472 ?
17:29:08 <travier> it's more complex than that if you have multiple mounts and want a unit for each one
17:29:24 <dustymabe> yeah, either way would be better to get the RFE implemented
17:29:31 <dustymabe> closing out the meeting soon
17:29:56 <travier> have a unit do the chattr -i, order the mkdir units after, have another unit do the chattr +i with the right order
17:30:28 <dustymabe> idea: systemd unit directive that takes a semaphore (ability to serialize with other units without having to name them)
17:30:40 <dustymabe> and I'll let that bad idea end the meeting
17:30:42 <dustymabe> #endmeeting