16:29:22 <dustymabe> #startmeeting fedora_coreos_meeting
16:29:22 <zodbot> Meeting started Wed Nov  8 16:29:22 2023 UTC.
16:29:22 <zodbot> This meeting is logged and archived in a public location.
16:29:22 <zodbot> The chair is dustymabe. Information about MeetBot at https://fedoraproject.org/wiki/Zodbot#Meeting_Functions.
16:29:22 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
16:29:23 <zodbot> The meeting name has been set to 'fedora_coreos_meeting'
16:29:31 <dustymabe> #topic roll call
16:29:33 <dustymabe> .hi
16:29:40 <Nemric> hi :)
16:29:40 <zodbot> dustymabe: dustymabe 'Dusty Mabe' <dusty@dustymabe.com>
16:30:14 <dustymabe> 👋 Nemric
16:30:41 <dustymabe> #chair Nemric
16:30:41 <zodbot> Current chairs: Nemric dustymabe
16:30:54 <dustymabe> might be a few minutes - some people are finishing up another meeting
16:31:08 <Nemric> dustymabe: (can't see pictures :( )
16:31:28 <travier> .hello siosm
16:31:36 <zodbot> travier: siosm 'Timothée Ravier' <travier@redhat.com>
16:31:39 <dustymabe> #chair travier
16:31:39 <zodbot> Current chairs: Nemric dustymabe travier
16:32:15 <marmijo> .hi
16:32:22 <zodbot> marmijo: marmijo 'Michael Armijo' <marmijo@redhat.com>
16:32:48 <travier> I don't see any issues marked for discussion today
16:33:05 <apiaseck> .hello c4rt0
16:33:06 <dustymabe> travier: indeed. we need to be more proactive about this in the future.
16:33:13 <zodbot> apiaseck: c4rt0 'Adam Piasecki' <c4rt0gr4ph3r@gmail.com>
16:33:13 <jmarrero> .hi
16:33:17 <travier> https://github.com/coreos/fedora-coreos-tracker/issues/1608 maybe?
16:33:20 <zodbot> jmarrero: jmarrero 'Joseph Marrero' <jmarrero@redhat.com>
16:33:35 <spresti> .hello spresti
16:33:38 <dustymabe> apiaseck: that email address.. it's like password complexity
16:33:43 <zodbot> spresti: spresti 'Steven Presti' <spresti@redhat.com>
16:33:44 <ydesouza> .hello ydesouza
16:33:50 <zodbot> ydesouza: ydesouza 'Yasmin Valim de Souza' <ydesouza@redhat.com>
16:33:52 <dustymabe> #chair marmijo apiaseck jmarrero spresti ydesouza
16:33:52 <zodbot> Current chairs: Nemric apiaseck dustymabe jmarrero marmijo spresti travier ydesouza
16:34:11 <apiaseck> .hello c4rt0
16:34:18 <zodbot> apiaseck: c4rt0 'Adam Piasecki' <c4rt0gr4ph3r@gmail.com>
16:34:42 <dustymabe> did I miss anyone with #chair?
16:35:04 <apiaseck> dustymabe: trust me, I hate spelling it over the phone
16:35:13 <dustymabe> #topic Action items from last meeting
16:35:30 <dustymabe> * travier to create a change proposal for F40 for switching away from nss-altfiles for OSTree based systems
16:35:41 <dustymabe> and then I think we have one that I'll make some text up for now:
16:36:11 <dustymabe> * travier to schedule meeting with Assisted installer team to understand the use case around https://github.com/coreos/fedora-coreos-tracker/issues/1595
16:36:32 <dustymabe> travier: :) - welcome back from a week away
16:36:44 <travier> dustymabe: :D
16:36:52 <travier> we'll have to re-action those
16:36:59 <travier> still haven't been able to do them
16:37:09 <dustymabe> ok, i only threw that last one in there because I didn't want it to drop
16:37:14 <dustymabe> it wasn't actually an action item from last time
16:37:26 <dustymabe> #action travier to create a change proposal for F40 for switching away from nss-altfiles for OSTree based systems
16:37:37 <dustymabe> #action travier to schedule meeting with Assisted installer team to understand the use case around https://github.com/coreos/fedora-coreos-tracker/issues/1595
16:37:53 <dustymabe> #topic Nodes Fail To Update (Zincati Reports libsystemd errors regarding EMFILE: Too many open files)
16:38:00 <dustymabe> #link https://github.com/coreos/fedora-coreos-tracker/issues/1608
16:38:32 <dustymabe> so we have an issue that is/will cause nodes to not be able to upgrade
16:39:12 <dustymabe> It appears a new zincati RPM (just made it to our stable stream) leaks open files and eventually hits its limit and then can't continue to function
16:39:55 <dustymabe> it's worth doing a mini retrospective on this here maybe in 5 minutes.. but for now - anything we should bring up?
16:40:08 <travier> has the dependency been updated in Fedora RPMs?
16:40:30 <dustymabe> travier: you mean the zincati dep that caused the issue ?
16:40:37 <travier> yes
16:41:02 <dustymabe> a new release just came out and was updated upstream in: https://github.com/coreos/zincati/pull/1118
16:41:19 <dustymabe> so we should now be able to do a new zincati release/build
16:41:33 <jmarrero> Jonathan just did one: https://github.com/coreos/zincati/pull/1119
16:41:39 <dustymabe> to answer your question, though. I think at least right now (we want to change it) zincati is built with bundled deps
16:41:59 <dustymabe> so the dependent rust libsystemd package in fedora isn't important (to us at least)
16:42:06 <jlebon> hmm, missed the start. did we not announce it in matrix?
16:42:07 <travier> oh, right, we bundle, so this should be fine
16:42:18 <dustymabe> jlebon: sigh
16:42:23 <dustymabe> I typed it but didn't press enter
16:42:34 <jlebon> ahh heh
16:42:56 <dustymabe> i was multi-tasking :)
16:43:02 <jlebon> guessing the context: yeah, working on cutting a new zincati release with the fix
16:43:37 <dustymabe> for now I have pinned to the old package in the stable stream in https://github.com/coreos/fedora-coreos-config/pull/2720
16:44:17 <dustymabe> if we get a new stable release out today then stable stream nodes should not encounter a problem (as they wouldn't have run out of open files yet, it takes a few days)
16:44:40 <apiaseck> I'm ready to hit ok-to-promote
16:44:48 <apiaseck> on the stable: new release on 2023-11-08 (38.20231027.3.2)
16:45:03 <dustymabe> apiaseck: that part is already done. the build has already happened, just need the part where we check that all tests passed etc..
16:45:18 <jlebon> btw, it's easier to reproduce this with `steady_interval_secs = 1`: https://github.com/coreos/fedora-coreos-tracker/issues/1608#issuecomment-1802126860
16:45:19 <jlebon> that'll make zincati emit sd_notify updates much more frequently
16:45:29 <apiaseck> dustymabe: All tests did pass at this stage
16:45:43 <dustymabe> ahh, then yeah. it's ready for the release job then
16:46:17 <apiaseck> dustymabe: I'll do that now then
16:46:18 <dustymabe> anyhow I think the problem is pretty well known at this point and the path forward is hopefully not too complicated
16:46:31 <dustymabe> 1. pin to old zincati on stable - do ad-hoc release
16:46:47 <dustymabe> 2. get new zincati into testing/next - do ad-hoc releases
16:47:03 <dustymabe> 3. send communication telling people how to get their nodes unstuck
16:47:30 <jlebon> yeah, agreed
16:47:33 <dustymabe> manual intervention should only be required for testing/next nodes
16:48:29 <jlebon> ideally also
16:48:29 <jlebon> 4. add CI testing that would have caught this. it shouldn't focus on leaking fds, but be more general
16:48:36 <dustymabe> 4. next time look at fifofonix reported issues sooner
16:48:57 <dustymabe> :)
16:49:07 <travier> :)
16:49:08 <dustymabe> jlebon: yeah let's talk about that in a minute
16:49:11 <jlebon> yeah, that was a mistake on our part
16:49:17 <dustymabe> does anyone disagree with the course of action here?
16:49:44 <travier> I'm good. Feel like the Zincati refresh interval is too small but meh
16:50:12 <dustymabe> ok let's move to a mini-retro then
16:50:24 <dustymabe> i.e. how did we get here? what can we do to improve in the future
16:50:25 <jlebon> travier: that can certainly be discussed too
16:50:47 <travier> If we had zincati triggered from a systemd timer and then exit, we would not have that issue
16:50:57 <travier> i.e. zincati does not really need to run 100% of the time
16:51:02 <dustymabe> there are a few things that caused us to miss this issue
16:51:18 <dustymabe> 1. adjusted release schedule based on the Fedora GA
16:51:52 <dustymabe> -> this meant we did many releases in a short time span, including doing a stable release in two consecutive weeks
16:52:51 <dustymabe> I don't necessarily think we should change our current policy/schedule around GA, just highlighting it as something that contributed here. If we were on the normal two week cadence I think we would have caught it before we promoted that content to GA.
16:53:37 <jlebon> i think also we should've been more wary of the new zincati. it hadn't been updated in a long time, and had *a lot* of dependency bumps
16:53:43 <dustymabe> any comments on 1. before I move to another point?
16:53:53 <jlebon> so one strategy could've been to have it bake in next only to start
16:54:13 <dustymabe> jlebon: maybe, but in general we try to let our CI be the arbiter of "good" or "bad"
16:54:34 <dustymabe> having different policies per package makes things harder to maintain unfortunately :(
16:54:36 <jlebon> dustymabe: updates specifically are harder to test in realistic scenarios in CI
16:54:44 <dustymabe> jlebon: agree
16:54:55 <dustymabe> what is crazy is that we did learn from something like this in the past
16:55:15 <dustymabe> we created an extended upgrade test that literally tests from a matrix of starting points in the past
16:55:36 <dustymabe> but... that test runs upgrades consecutively one after the other
16:55:41 <fifofonix> .hi
16:55:42 <zodbot> fifofonix: fifofonix 'Fifo Phonics' <fifofonix@gmail.com>
16:55:52 <dustymabe> so it would have never hit the time based problem - the nodes were never up long enough
16:56:22 <dustymabe> but I think there are some positive takeaways here
16:56:24 <travier> This kind of leak is hard to find in CI
16:56:27 <jlebon> exactly. it being time-based means it would've been very unlikely a CI test would've covered this
16:56:42 <dustymabe> the good news is that our "stream model" works!
16:57:08 <travier> The leak is also here in stable as far as I can see
16:57:08 <Nemric> +1
16:57:11 <dustymabe> we were able to find and fix this issue, even though we caught it very late, we still caught it soon enough that stable nodes should not need intervention. This is a huge win IMO
16:57:30 <dustymabe> travier: yes, it's in current latest released `stable`
16:58:02 <travier> 👍
16:58:04 <dustymabe> we stopped the rollout and will do a new release today
16:58:22 <travier> ah yes, I'm already on 38.20231027.3.1 on my systems
16:58:44 <dustymabe> I think one thing we can do in the future is possibly add some monitoring to our persistent nodes that our team has running
16:59:17 <dustymabe> the monitoring would have helped us ID stuff like this sooner, which is I think how fifofonix found the anomaly in his environment
16:59:42 <dustymabe> monitoring/alerting
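The monitoring idea above could start with a check along the following lines. This is only a sketch, not something the team agreed on: it reads the standard Linux `/proc/<pid>/fd` and `/proc/<pid>/limits` files, and the function names and the 50% alert threshold are assumptions made for illustration. Inspecting another user's process (such as the zincati service) generally requires root.

```python
import os


def parse_soft_nofile_limit(limits_text):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None if the limit is 'unlimited' or the line is missing.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: 'Max' 'open' 'files' <soft> <hard> 'files'
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    return None


def count_open_fds(pid):
    """Count the open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def should_alert(open_fds, soft_limit, threshold=0.5):
    """Flag a process once it uses `threshold` of its soft FD limit.

    The 0.5 default is an arbitrary illustrative value.
    """
    return soft_limit is not None and open_fds / soft_limit >= threshold


def check_pid(pid):
    """Return (open_fds, soft_limit, alert) for one process."""
    with open(f"/proc/{pid}/limits") as f:
        soft_limit = parse_soft_nofile_limit(f.read())
    open_fds = count_open_fds(pid)
    return open_fds, soft_limit, should_alert(open_fds, soft_limit)
```

Calling `check_pid()` with the PID of the service on a schedule and alerting when the third value is true would flag a slow leak well before the limit is reached.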
17:00:02 <dustymabe> jlebon: any other ideas on things we should do to catch something like this?
17:00:13 <travier> FD count was at 248 for 18h of uptime so we have about 3 days to push the fix before stable nodes that upgraded need manual intervention
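The estimate above can be reproduced with a short calculation. This is a sketch under two assumptions that are not stated in the meeting: the leak grows linearly from roughly zero at boot, and the soft open-files limit is 1024 (a common default, but not confirmed here for the zincati service).

```python
def hours_from_boot_to_exhaustion(fd_count, uptime_hours, soft_limit=1024):
    """Hours from boot until the FD limit is hit, assuming a linear leak.

    soft_limit=1024 is an assumed value, not taken from the meeting.
    """
    leak_rate = fd_count / uptime_hours  # file descriptors leaked per hour
    return soft_limit / leak_rate


def hours_remaining(fd_count, uptime_hours, soft_limit=1024):
    """Hours left from now until the FD limit is hit, same assumptions."""
    return hours_from_boot_to_exhaustion(fd_count, uptime_hours, soft_limit) - uptime_hours


total = hours_from_boot_to_exhaustion(248, 18)
left = hours_remaining(248, 18)
print(f"total: {total:.1f} h (~{total / 24:.1f} days), remaining: {left:.1f} h")
```

With 248 FDs after 18 hours, this gives roughly three days from boot to exhaustion under the assumed limit, which is consistent with the figure quoted in the meeting.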
17:00:59 <dustymabe> I also had a suggestion in https://github.com/coreos/fedora-coreos-tracker/issues/1608#issuecomment-1802038890 of making it so that zincati gets restarted periodically, which is a less extreme version of what travier proposed above of making zincati a oneshot service
17:01:07 <apiaseck> travier: just for context: the release job on the latest stable is already running
17:01:15 <fifofonix> does the countme data tell a story? i.e. can you see materially fewer next nodes upgraded by when you would expect?
17:01:23 <travier> apiaseck: ++
17:01:36 <jlebon> dustymabe: i think we got a good bunch. we should collate them somewhere and see which ones we want to pursue
17:01:46 <dustymabe> fifofonix: not really :( - countme data is really a poor man's statistics
17:01:52 <travier> fifofonix: countme data does not report node versions, only major fedora versions
17:01:54 <dustymabe> countme database only gets updated once a week
17:02:08 <dustymabe> travier: yeah, that too
17:02:27 <travier> yeah, countme is very broad, not precise at all by design (privacy, etc.)
17:02:59 <travier> restarting zincati periodically sounds ok to me
17:03:31 <jlebon> though it makes prod even more unlike CI
17:03:59 <dustymabe> here is a hackmd where we can collate ideas on how to avoid this in the future
17:04:03 <dustymabe> https://hackmd.io/xPcq8EvxTH6TQhFFx_Rfug?edit
17:04:39 <travier> one path is also that we move away from Zincati as it is
17:04:45 <travier> but we're not there yet
17:05:39 <dustymabe> #info we have a path forward where we think stable nodes won't need manual intervention, but testing/next nodes will. We will put out ad-hoc releases for all streams and send a communication to everyone about necessary steps.
17:05:49 <dustymabe> ^^ let me know if anything needs to be changed
17:06:10 <Nemric> Really think that running some instance with monitoring should do the trick, I'm running a "next" node with logs/metrics and a filter for error/warning logs
17:06:58 <Nemric> this node boots from pxe so zincati is not running ^^ don't have logs for that
17:07:07 <dustymabe> yep
17:07:20 <travier> +1 for info
17:07:37 <dustymabe> ok we've spent a long time on this. let's move on
17:08:03 <dustymabe> #topic open floor
17:08:09 <dustymabe> anyone with anything for open floor?
17:08:24 <Nemric> https://github.com/coreos/fedora-coreos-tracker/issues/1296 ?
17:09:29 <dustymabe> Reproduction steps
17:09:30 <Nemric> I did run tests about this on the last test-week for FCOS39 and it's always true
17:09:31 <dustymabe> Start a diskless coreos from PXE boot
17:09:33 <dustymabe> Expected behavior
17:09:35 <dustymabe> No failed units at startup
17:09:53 <dustymabe> i'm wondering why our CI doesn't catch this? jlebon don't we have a PXE diskless test?
17:10:41 <jlebon> we do, yup
17:10:52 <jlebon> well
17:11:09 <Nemric> cheese say : https://github.com/coreos/fedora-coreos-tracker/issues/1296#issuecomment-1306747053
17:11:23 <jlebon> we have pxe installs, but the unit state checking might happen only on the installed system
17:12:01 <dustymabe> ahh, so we don't have any tests that don't run an install?
17:12:07 <dustymabe> "any pxe tests"
17:12:13 <travier> maybe PXE boots are full in memory so systemd does not remount a tmpfs on top of /tmp?
17:12:23 <jlebon> we do for the iso, but don't *think* for pxe
17:12:43 <jlebon> yeah, we need a `pxe-live-login` test
17:13:24 <dustymabe> maybe something for a volunteer :)
17:13:54 <dustymabe> jlebon: that should be pretty easy? we should have all the scaffolding, just need the test case to be defined?
17:14:12 <jlebon> should be, yeah
17:14:43 <dustymabe> Nemric: if you volunteer to add the test then you'd know your use case was tested in the future :)
17:15:22 <Nemric> I could give it a try, just let me know where to begin ^^
17:15:22 <dustymabe> any other topics for open floor?
17:15:35 <dustymabe> Nemric: catch us in Fedora CoreOS matrix after the meeting?
17:15:53 <Nemric> matrix ?
17:16:02 <dustymabe> Nemric: yeah :(
17:16:21 <dustymabe> https://github.com/coreos/fedora-coreos-tracker/issues/1566
17:16:30 <Nemric> bullet time ! :D
17:16:46 <dustymabe> if you need help getting that set up you can ping me in the IRC channel
17:16:53 <dustymabe> i'm still there, just don't monitor it
17:17:06 <dustymabe> oh I actually had a topic to bring up from baude today
17:17:09 <dustymabe> almost forgot it
17:18:44 <Nemric> that's it ? https://matrix.org/docs/chat_basics/matrix-for-im/
17:18:49 <dustymabe> baude was trying to solve this pain point/papercut problem for a use case for podman machine: https://github.com/coreos/rpm-ostree/issues/337 (Support empty toplevel mount points) and he implemented it poorly and caused some rework: https://github.com/containers/podman/pull/20612#discussion_r1384846241
17:19:30 <dustymabe> his plea is something along the lines of "can we please fix a longstanding issue that many people have hit and that has been open since 2016?"
17:20:15 <dustymabe> jlebon: has an open PR in https://github.com/ostreedev/ostree/pull/2681 but there was some rework needed after code review
17:20:38 <dustymabe> jlebon: could we maybe mentor someone on picking up that work and pushing it over the finish line?
17:20:39 <jlebon> nit: it's an RFE not an issue :)
17:21:32 <dustymabe> true, but it borders into "issue" when it violates principle of least surprise
17:21:33 <jlebon> the workaround isn't hard, which is why it hasn't been super prioritized outside of a hack day project
17:21:33 <jlebon> i would like to pick it up again though
17:22:01 <jlebon> the limitations of ostree in that respect are pretty well known at this point
17:22:07 <dustymabe> to be clear I'm bringing this up at the request of baude, who had a water heater leak and couldn't make the meeting
17:22:39 <dustymabe> and that's all I had
17:22:44 <jlebon> +1
17:22:53 <Nemric> I can have my own matrix server !? yeahh ! one more workload for FCOS ! :D
17:23:03 <travier> :)
17:23:04 <dustymabe> Nemric: :)
17:23:05 <apiaseck> :)
17:23:18 <travier> Nemric: https://github.com/travier/fedora-coreos-matrix :)
17:23:56 <Nemric> travier: thanks !
17:24:14 <Nemric> I did see your nomad repo too ;)
17:24:17 <dustymabe> jlebon: on that topic, would it be something easy enough that someone could use it as a learning opportunity? if so maybe we can solicit volunteers
17:24:28 <apiaseck> travier: I will definitely look into that, TY!
17:24:34 <dustymabe> any other topics for open floor?
17:25:12 <dustymabe> there is a fedora release party I think this friday and saturday
17:25:33 <dustymabe> unfortunately with being so busy lately I didn't submit anything
17:25:46 <dustymabe> and I'll also not be able to attend (will be AFK Fri-Sun)
17:25:52 <jlebon> dustymabe: it could be, yeah. if someone wants to learn more about libostree
17:25:55 <dustymabe> https://fedoraproject.org/wiki/Fedora_Linux_39_Release_Party_Schedule
17:26:19 <dustymabe> opportunity for learning ^^
17:26:42 <dustymabe> a few minutes left in the meeting. I'll close it out soon unless new topics come up
17:26:58 <dustymabe> reminder: please add the meeting label to issues we should discuss in future meetings
17:27:07 <travier> Wondering if we should document a set of units to do that properly
17:28:30 <dustymabe> travier: like https://github.com/ostreedev/ostree/pull/2681#issuecomment-1481906472 ?
17:29:08 <travier> it's more complex than that if you have multiple mounts and want a unit for each one
17:29:24 <dustymabe> yeah, either way would be better to get the RFE implemented
17:29:31 <dustymabe> closing out the meeting soon
17:29:56 <travier> have a unit do the chattr -i, order the mkdir units after, have another unit do the chattr +i with the right order
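The ordering described above (drop the immutable bit, create the directories, restore the immutable bit) can be sketched as a dry-run plan. This is illustrative only: the function name and the example paths are hypothetical, the sketch only builds the command list rather than running anything, and in practice each step would be a separate systemd unit with explicit ordering, as discussed.

```python
def plan_toplevel_mkdir(mount_points, root="/"):
    """Build the ordered command list for creating top-level mount points.

    Hypothetical helper: it returns commands as argv lists and executes nothing.
    """
    steps = [["chattr", "-i", root]]  # 1. make the root directory mutable
    steps += [["mkdir", "-p", mp] for mp in mount_points]  # 2. one mkdir per mount point
    steps.append(["chattr", "+i", root])  # 3. restore the immutable bit last
    return steps


# Example with made-up mount points
for argv in plan_toplevel_mkdir(["/data", "/srv-extra"]):
    print(" ".join(argv))
```

Keeping the two `chattr` calls as the first and last steps is the part that needs care when there are multiple mounts, since every `mkdir` must run between them.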
17:30:28 <dustymabe> idea: systemd unit directive that takes a semaphore (ability to serialize with other units without having to name them)
17:30:40 <dustymabe> and I'll let that bad idea end the meeting
17:30:42 <dustymabe> #endmeeting