16:30:19 #startmeeting fedora_coreos_meeting
16:30:19 Meeting started Wed May 13 16:30:19 2020 UTC.
16:30:19 This meeting is logged and archived in a public location.
16:30:19 The chair is dustymabe. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:30:19 Useful Commands: #action #agreed #halp #info #idea #link #topic.
16:30:19 The meeting name has been set to 'fedora_coreos_meeting'
16:30:23 .hello2
16:30:24 cyberpear: cyberpear 'James Cassell'
16:30:26 .hello2
16:30:27 jdoss: jdoss 'Joe Doss'
16:30:29 .hello2
16:30:30 jlebon: jlebon 'None'
16:30:32 .hello2
16:30:33 lorbus: lorbus 'Christian Glombek'
16:30:36 #topic roll call
16:30:52 #chair cyberpear jdoss jlebon lorbus
16:30:52 Current chairs: cyberpear dustymabe jdoss jlebon lorbus
16:30:58 glad to see the ol' jdoss
16:31:11 * jdoss waves
16:31:13 .hello2
16:31:14 bgilbert: bgilbert 'Benjamin Gilbert'
16:31:34 I am still alive. Just been getting firehosed at my new job.
16:32:01 * gilliard__ listening (satellit)
16:33:21 jdoss: it happens :)
16:33:39 #chair bgilbert
16:33:39 Current chairs: bgilbert cyberpear dustymabe jdoss jlebon lorbus
16:33:47 #topic Action items from last meeting
16:33:55 no action items to speak of specifically :)
16:34:07 #topic topics for this meeting
16:34:08 .hello2
16:34:10 lucab: lucab 'Luca Bruno'
16:34:29 any topics anyone would like to discuss during this meeting? we have one meeting ticket and then we can discuss other topics
16:34:39 otherwise we'll skip to open floor after the meeting ticket
16:34:41 #chair lucab
16:34:41 Current chairs: bgilbert cyberpear dustymabe jdoss jlebon lorbus lucab
16:35:28 * dustymabe waits another minute for topic suggestions
16:36:11 #topic F32 rebase tracker for changes discussion
16:36:15 #link https://github.com/coreos/fedora-coreos-tracker/issues/372
16:36:38 ok I'm re-using this ticket for "let's talk about the mechanics of switching to f32 for our testing/stable streams"
16:36:48 I updated the ticket with our proposed timeline for switching to f32
16:37:16 which IIUC means our next `testing` release is when we switch to f32
16:37:30 +1
16:37:51 i just linked https://github.com/coreos/fedora-coreos-config/pull/394 to that ticket
16:38:23 any other things we need to consider?
16:38:34 lorbus: how do we look from an OKD perspective ?
16:39:15 It should work out of the box as long as podman doesn't break :) We're still explicitly using cgroupsv1 there, too
16:39:50 yes, we're still on cgroups v1
16:39:53 That explicit config will go away soon, but as long as we don't switch to cgroupsv2 with FCOS now, that won't be an issue
16:40:01 lorbus: there is a iptables/nft change
16:40:24 I'd feel much more comfortable if we could get you or vadim to try an OKD cluster on `next`
16:40:44 do you have a link? I thought there is a compat layer
16:40:45 dustymabe: I still need to close on https://github.com/coreos/fedora-coreos-tracker/issues/468, I'll find some time before the end of the week
16:41:00 with our currently proposed schedule we've got 3 weeks to fix any bugs we find
16:41:20 lorbus: https://github.com/coreos/fedora-coreos-tracker/issues/372#issuecomment-588368597
16:41:28 that links to the change proposal
16:42:02 +1 to the idea to test OKD on next
16:42:05 I'll see to that
16:42:33 #action lorbus to try out OKD on our `next` stream so we can work out any kinks before switching `stable` to f32
16:43:00 lorbus: vadim may have tried it already, so maybe work together with him on that Action Item
16:43:31 yep, definitely
16:43:55 lucab: thanks! I think I need to respond upstream on that fstrim issue as well
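
[Editorial aside: the cgroups and iptables points above are easy to verify directly on an F32-based `next` node before the OKD test run. A minimal sketch follows, assuming a stock FCOS node with SSH access; the commands are standard utilities, but treat the expected output described in the comments as something to double-check on the node itself rather than as authoritative.]

    # Which cgroup hierarchy is mounted? "tmpfs" => cgroups v1, "cgroup2fs" => v2.
    stat -fc %T /sys/fs/cgroup/

    # FCOS stays on cgroups v1 via a kernel argument; confirm it survived the F32 rebase.
    rpm-ostree kargs | tr ' ' '\n' | grep cgroup_hierarchy

    # Which iptables backend is active? The version string reports "(nf_tables)"
    # or "(legacy)", which is the compat-layer question raised above.
    sudo iptables --version
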
16:44:18 any other things we need to do or mechanics we need to discuss regarding the switch to f32?
16:44:36 anyone here running next? did you rebase to it or did you start fresh?
16:45:05 I'm running it for my IRC server
16:46:27 I think we've also said this before.. we need to create an update barrier so that all upgrades from f31 -> f32 go through the same path
16:46:29 correct ?
16:46:40 dustymabe: upgraded from f31? re. IRC server
16:46:48 yes, indeed
16:47:06 https://github.com/coreos/fedora-coreos-streams/issues/99#issuecomment-625291969
16:47:29 jlebon: I think my irc server started as rebased from stable to next (f32), but I've redeployed it since
16:48:08 jlebon: lucab: ok so we agreed to that barrier.. should we also limit the paths to f32 ?
16:48:08 yeah, that's probably the bit we should test the most -- upgrade testing
16:48:31 so let's say the final f31 release on stable is A
16:48:50 we know we will filter all previous releases of f31 to A before allowing them to upgrade to f32
16:49:07 I did a bunch of rebases (including stable-testing-next), I only spotted the Zincati fragment out of place
16:49:08 but will we allow for A->B A->C A->D
16:49:23 or will we only allow one path from f31 to f32
16:49:28 i.e. only A->B exists
16:49:40 and then you can update B->C or B->D etc..
16:49:48 dustymabe: only one, that's the barrier
16:50:31 lucab: and if B isn't the latest release we still make them go through B?
16:50:58 lucab: hmm, interesting. is there a way to get the other behaviour if we wanted?
16:51:22 not sure I follow
16:51:38 lucab: what i'm talking about is a double barrier essentially
16:52:11 let's say A is the barrier release (the last release of stable f31)
16:52:14 lucab: does a barrier affect both inbound and outbound edges, or just inbound?
16:52:31 what paths are available to systems currently on A?
16:52:34 jlebon: only inbound
16:53:19 dustymabe: whatever are defined by further rollouts and barriers
16:53:24 lucab: right
16:53:33 that's what dustymabe is asking :) i'm not sure if we *need* to have that
16:53:39 so i think jlebon and I are asking if we can control it such that there is only one available path
16:53:52 so A->B is the only upgrade path that exists for A
16:54:05 it would limit our upgrade testing matrix
16:54:30 but not sure how practical it is
16:54:34 we can, with two barriers
16:54:42 lucab: right. ok that's what I was thinking
16:55:02 whether we want to do that or not can be another discussion probably
16:55:18 your other idea/approach makes a system who has to guess about future nodes
16:55:57 the end result is what I was looking for.. creating two barriers achieves the goal
16:56:07 ok any other things to discuss for moving to f32 ?
16:56:14 (I can think about it a bit more, but I am not thrilled)
16:56:35 lucab: yeah we may not want to do that, was an idea
16:57:00 dustymabe: in general we shouldn't have barriers, unless we know we have to force a chokepoint for a migration script
16:57:25 even this F31->F32 is not really needed
16:57:53 lucab: but it's useful in that it allows us to have users follow a more tested path?
16:57:55 it just makes the model easier to reason about for us humans
16:58:21 i.e. we probably aren't going to spin up FCOS from january to test upgrading to f32
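
[Editorial aside: to make the "two barriers" idea above concrete: barriers live in the per-stream update metadata, and because a barrier only constrains inbound edges, marking both the last F31 release (A) and the first F32 release (B) as barriers leaves A->B as the only path across the major-version boundary. The sketch below is purely illustrative; the version numbers are placeholders and the file path and field names are from memory, not copied from the real fedora-coreos-streams schema.]

    # Illustrative double-barrier sketch; versions, path, and field names are placeholders.
    cat <<'EOF' > updates/stable.json
    {
      "stream": "stable",
      "releases": [
        {
          "version": "31.20200601.3.0",
          "metadata": {
            "barrier": {
              "reason": "last F31 stable release; all older nodes must update here first"
            }
          }
        },
        {
          "version": "32.20200601.3.0",
          "metadata": {
            "barrier": {
              "reason": "first F32 stable release; makes A -> B the only cross-major edge"
            }
          }
        }
      ]
    }
    EOF
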
16:58:45 * jlebon has to drop off for other meeting, but overall i think i lean more towards not having a double barrier
16:58:50 it might make sense for upgrade _testing_ to specifically test an F31 origin
16:59:00 dustymabe: right, it trims the space of unknown things
16:59:04 cool
16:59:05 and then avoid barriers on the user side
16:59:05 we need to trust our CI testing, and expand it if we feel nervous
16:59:16 i think we are all saying the same thing :)
16:59:23 I don't think so?
16:59:40 oh, k let me dig
16:59:53 I tend to agree with Luca that barriers should exist for specific technical reasons
17:00:09 bgilbert: "F31 origin" meaning the original gangsta ?
17:00:14 the first release?
17:00:17 bgilbert: the next CI run on `testing` will test that (but only from last F31 release)
17:00:28 "origin" in the sense of the starting point for the upgrade test
17:00:36 so probably one of the last F31 releases
17:00:59 right, which is what I thought we were proposing.. introduce an update barrier so that we force users to follow the path that our CI tests
17:01:21 I'm proposing that every future release should be upgrade-tested from F31
17:01:26 and then we don't need a barrier
17:01:50 bgilbert: I agree, but I think what you are saying is that we don't need a double barrier
17:01:59 I'm saying we don't need a single barrier either
17:01:59 we still need the single barrier
17:02:15 bgilbert: do you want every F31 release, even from Jan to be able to upgrade to every single F32 release?
17:02:21 yup!
17:02:23 it's the more aggressive option, to be sure :-) and if it turns out to be a disaster, we can use barriers for future releases
17:02:45 but if we set the precedent, it'll only get harder to do barrierless upgrades later
17:03:03 bgilbert: hmm. but what is the benefit of doing a barrierless upgrade in this case ?
17:03:14 that's fair
17:03:26 I guess the number of test runs would be "number of F31 FCOS releases in the wild" for each release
17:03:32 barriers add extra friction
17:03:38 cyberpear: I don't think we need to test everything -> everything
17:03:48 dustymabe: less reboots for a new node created on an old release
17:03:49 pick a representative F31 release and test upgrading from it
17:03:50 just everything -> proposed
17:03:54 I feel like it makes me feel better to know this upgrade path was tested for the f31 to f32 rebase
17:03:55 we're talking about unknown unknowns, but my instinct says that there aren't any bugs that would happen only when upgrading between an earlier version to latest
17:04:14 cyberpear: no, F31 -> proposed, forever
17:04:15 or "oldest F31" -> current and "newest F31" -> current
17:04:25 cyberpear: sure
17:04:48 bgilbert: I think the idea was more or less "representative == barrier"
17:04:56 in my experience, upgrade bugs are path-dependent anyway
17:04:59 the cases that exist are likely in things like moby/podman stored containers&images but I think they need to deal with even old images in general anyways
17:05:21 "this range of F31 releases wrote this file that we can't read anymore" type of thing
17:06:05 so I guess what I'm saying is, I'm +1 to more CI, but think the barrier is too cautious
17:06:20 the things that will break are not the things we think will break
17:07:03 yeah I'm not sure where I stand
17:07:39 seemed like a no-brainer to me. we've never had the ability to force an upgrade path before and match a known tested major upgrade path
17:07:55 for this specific F31->F32, I do not disagree
17:08:16 last time we used a barrier was because we needed a real migration script
17:08:17 but I think you're probably right that it's not absolutely necessary and we'd probably be fine
17:08:51 this time we are not aware of any, so there wouldn't be any strict need to
17:09:02 we will hopefully never have fewer users than right now who might be broken if we try being aggressive
17:09:15 so now's the time :-P
17:09:24 normal Fedora upgrades don't go thru any barrier
17:09:39 cyberpear: because they don't have that ability
17:09:41 we do
17:10:06 bgilbert: OTOH, I'm still scared by the "let's go back and fingerprint every grub since day 0"
17:10:36 it worked, didn't it? :-D
17:10:38 but yeah
17:10:45 ...or, put a different way, not using a barrier won't create a precedent, but using one will.
17:10:48 (grub is a bad example here)
17:11:11 bgilbert: i don't think we'll have to create a barrier in the future (say f32->f33) if we do use one now
17:11:45 dustymabe: there will be psychological pressure to. it'll be the safe, conservative approach.
17:11:57 we'll have more users then than we did during 31->32, etc.
17:12:14 we'll be considered more stable then than now
17:12:53 I've talked too much here, I'll stop
17:12:56 I vote for "no barrier unless known to be technically required"
17:13:24 ok maybe let's pick up this discussion again here in the next week (or maybe we start a separate tracker issue to capture the discussion)
17:13:40 bgilbert: I think I'm personally fine with establishing a "barrier between majors" rule
17:14:07 let's move to GH, we have till next week to come up with a decision
17:14:27 bgilbert: does "let's continue to have the discussion" sound good? so we can move on to other topics?
17:14:51 sure
17:14:53 one other thought I just had
17:15:01 (we can also retroactively put a barrier if we need to)
17:15:03 barriers, by their nature, force us to live with destination-side bugs forever
17:15:24 i.e., if the target side of the barrier has a kernel bug that causes boot problems on some boxes
17:15:30 we're stuck with it, or have to retarget the barrier
17:15:35 can we re-write that barrier later?
17:15:39 yes
17:16:13 but in principle the barrier becomes an artifact in its own right that might require maintenance
17:16:17 EOF
17:16:39 #action dustymabe to create a ticket where we discuss appropriate update barrier approach for major upgrades
17:17:01 #topic podman corner case bug: when do we backport fix problems ?
17:17:13 #link https://github.com/containers/libpod/issues/5950
17:17:18 ok so mini story here
17:17:24 i'm running `next` on my irc server
17:17:39 it starts a rootless podman container via systemd on boot
17:17:53 that spawns weechat inside of tmux (too much detail)
17:18:20 anywho - after the last upgrade to `next` my podman containers stopped working
17:18:35 https://github.com/containers/libpod/issues/5950#issuecomment-625450333
17:18:48 it only applies to running a rootless container via systemd
17:18:55 so it's a bit of a corner case
17:19:15 but i'm wondering if that is something we should consider backporting a fix for in the future
17:19:24 so far we haven't had any users other than me report the issue
17:19:36 does it only apply to `next`?
17:19:36 * dustymabe notes we should add a CI test for this specific problem
17:19:53 bgilbert: good question. I think it's a podman 1.9 bug - so I think it applies to all our current streams
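
[Editorial aside: for context on the setup that hit this bug, the pattern is a rootless container run by a user-level systemd unit at boot. A minimal sketch follows, assuming the default `core` user and using a placeholder unit name and image; it illustrates the general pattern (and the shape a CI test for it could take), not the actual weechat/tmux service described above.]

    # Let the user's systemd instance start at boot without an interactive login.
    sudo loginctl enable-linger core

    # A user unit that runs a rootless container (unit name and image are placeholders).
    mkdir -p ~/.config/systemd/user
    cat <<'EOF' > ~/.config/systemd/user/irc-container.service
    [Unit]
    Description=Rootless podman container started at boot

    [Service]
    ExecStartPre=-/usr/bin/podman rm -f irc
    ExecStart=/usr/bin/podman run --name irc registry.fedoraproject.org/fedora:32 sleep infinity
    ExecStop=/usr/bin/podman stop -t 10 irc
    Restart=on-failure

    [Install]
    WantedBy=default.target
    EOF

    systemctl --user daemon-reload
    systemctl --user enable --now irc-container.service
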
17:20:12 is it a regression?
17:20:15 but I need to confirm that
17:20:26 (for us)
17:20:33 bgilbert: yes, my system was working.. went down for upgrade and then stopped working
17:20:44 whoops, sorry, so you said
17:21:09 has it actually landed in the other streams?
17:21:22 IIUC podman 1.9 is in all our streams right now
17:21:41 it took us 3 weeks to do a new release of `next`
17:21:58 what do you mean by backport? -- can't we just push a regular update and include it in the next release?
17:21:59 so I didn't catch the bug when it was in `testing`, but not yet in `stable`
17:23:08 IMO container runtime regressions are the sort of fix we should prioritize
17:23:28 bgilbert: right, i agree
17:23:39 but it does slightly depend on the case
17:23:50 so in this case it's only a problem if you're starting a rootless container via systemd
17:23:50 with the current FCOS model (less priority on stability) I agree with cyberpear that we should tend to pick up new packages
17:24:03 maaaaaybe an actual backport to `stable` but meh
17:24:15 dustymabe: understood
17:24:17 if I was running `testing` I would have caught this before it hit stable
17:24:27 but I was trying to get some coverage on `next` so i missed it :(
17:24:38 in CL I think we would have rolled out-of-cycle releases on all channels
17:25:03 so our options:
17:25:04 (streams)
17:25:18 my understanding is that podman is going to do a new release very soon with the fix in
17:25:50 should we try to respin current testing with that new podman so that we can get stable fixed in the next round of releases (next week)?
17:26:09 or do we just not do anything since no one has reported an issue ?
17:26:33 it's hard for me to gauge if it's a real problem for people without having anyone complain about it
17:26:55 dustymabe: if you give me a second I can solve your dilemma :-P
17:27:00 I count 1 person who complained :P
17:27:06 in the very least we *should* make sure the fix goes into the testing release that is cut next week
17:28:14 bgilbert: :)
17:28:25 all I can offer is my CL experience, which says: these things are judgment calls, and this feels like a case we care about
17:28:26 complaints or no
17:28:37 so I'd vote respin
17:28:47 bgilbert: ok so you'd be a fan of respinning testing to get the new release into it?
17:28:48 YMMV
17:28:49 +1
17:28:50 I'd also vote respin
17:28:56 dustymabe: yup
17:29:16 ok. i'll ping the podman guys on the release and try to respin testing
17:29:44 #action dustymabe to get new podman release into testing release so we can fix stable in next weeks releases
17:29:54 +1
17:29:58 bgilbert: as part of that I will also 100% confirm that the bug does affect stable and testing
17:30:04 cool
17:30:09 #topic open floor
17:30:09 also, better to exercise the machinery
17:30:37 I think for F33, we should rebase "next" in time for F33 Beta, then have the first F33-based "stable" based on exactly F33 final content
17:30:55 +1 to rebasing earlier, that was always the plan
17:31:10 what's the benefit to stable based on exactly F33?
17:31:13 +1 to that
17:31:18 it'd be two weeks after F33 lands, of course
17:31:44 two+ weeks
17:31:54 hmmm
17:32:10 understood, just think would be good to have a very-well defined "point-in-time" snapshot, and that one's been pre-defined
17:32:14 I think the delay we have could be shortened slightly but probably not by much (my opinion)
17:32:27 +1 to having a next stream earlier
17:32:29 "+1 to rebasing earlier" = the next stream, not testing/stable
17:32:33 yup
17:32:34 based on f33
17:32:38 we're lagging behind rawhide quite a bit, so the earlier we move the next stream to it, the better imo
17:32:58 cyberpear: we're effectively a rolling distro, though?
17:33:36 yes, but if we eventually needed a barrier or double-barrier, the GA content would make a good place for it, IMO
17:33:37 if the exact package set matters, I feel like we're doing something wrong
17:33:47 there are a lot of 0day fixes that land after f33
17:34:02 honestly the GA content is mostly about "does what's delivered on the media work right"
17:34:09 so let's define some FCOS release criteria and have them part of GA content?
17:34:55 also, part of our artifacts are ISOs...
17:35:11 cyberpear: we can certainly get some more hooks into the releng processes such that bugs that affect us are considered with higher priority
17:35:43 I don't think what happened this cycle is representative
17:35:49 we were running to catch up
17:35:53 yep
17:35:57 we'll get better at this
17:36:04 yep, just trying to plan for the future
17:36:10 +1 to hooks into releng where it makes sense. also better CI.
17:36:16 but: cyberpear, I'm not clear on what problem you're trying to solve
17:36:20 cyberpear: yep, and you can help us with that too
17:36:35 haphazard release process? something else?
17:37:46 I gotta drop. thanks all, and thanks for hosting dustymabe!
17:37:52 "in 3 years, I want to go back and reproduce my system as it was on F33 FCOS" -- I know it's not a priority for most here, but having a release based on GA content would give it a good chance of succeeding, even w/ overlaid content
17:38:05 lorbus++
17:38:05 dustymabe: Karma for lorbus changed to 1 (for the current release cycle): https://badges.fedoraproject.org/tags/cookie/any
17:38:40 cyberpear: the 33.20201105.3.0 release artifacts will still exist
17:38:46 the GA content RPM set is kept forever; everything in between until EOL is discarded once there's an update made
17:38:47 presuming you've saved the URLs :-/
17:38:56 cyberpear: and the git history in the fcos configs have the exact rpm NVRAs
17:39:29 cyberpear: and we protect those NVRAs from GC for some period of time
17:39:38 good to know
17:39:52 anyway, nothing actionable on this today, I think
17:39:54 honestly I wouldn't trust that the release can be rebuilt from parts in 3 years
17:39:59 (and the source is always saved)
17:40:22 even if you had the RPMs. you'd pin to an old cosa, which might have who-knows-what bugs with your 3-year-newer kernel etc.
17:40:31 * dustymabe notes time
17:40:56 bgilbert: that's why I'd like to see FCOS become part of the compose process so the GA artifacts are also preserved along w/ the RPMs
17:41:02 * cyberpear also sees we're over time
17:41:05 cyberpear: thanks for bringing it up though. release processes _always_ need improvement :-)
17:41:05 will end meeting in two minutes
17:41:28 cyberpear: outside the FCOS bucket, you mean?
17:42:00 I mean, have the F33-GA-based FCOS be sent to the mirror network, as if it were part of the GA compose
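
[Editorial aside: on the "exact rpm NVRAs" point above, two hedged ways to recover a release's package set after the fact. The first relies only on standard tooling on a running node; for the second, the lockfile name is from memory and may differ per release or architecture, so treat it as illustrative.]

    # 1. On a running node: record every package in the booted deployment.
    rpm -qa | sort

    # 2. From the config repo history: each release pins its NVRAs in a lockfile.
    git clone https://github.com/coreos/fedora-coreos-config && cd fedora-coreos-config
    git log --oneline -- manifest-lock.x86_64.json | head
    git show <commit-or-tag>:manifest-lock.x86_64.json | head
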
17:42:04 the problem here is mostly that we don't have a specific GC contract for FCOS artifacts that users can rely on
17:42:10 dustymabe: true
17:42:27 cyberpear: I'm still really really hoping no one ever references our artifacts except from stream metadata
17:42:37 continued use of old releases = bad
17:42:43 meanwhile, in the real world...
17:42:47 (hence, sending it to the mirror network, so it can take advantage of the existing processes in place)
17:42:59 (which is why I wouldn't be happy about sending it out to mirrors)
17:43:12 that one might be a losing battle though
17:43:21 (yeah, real world I find myself occasionally needing a RHEL 5 VM or container, to test something out for someone who's stuck on it for some reason)
17:43:26 yeah :-(
17:43:41 #endmeeting