16:31:36 #startmeeting fedora_coreos_meeting
16:31:36 Meeting started Wed Jan 12 16:31:36 2022 UTC.
16:31:36 This meeting is logged and archived in a public location.
16:31:36 The chair is dustymabe. Information about MeetBot at https://fedoraproject.org/wiki/Zodbot#Meeting_Functions.
16:31:36 Useful Commands: #action #agreed #halp #info #idea #link #topic.
16:31:36 The meeting name has been set to 'fedora_coreos_meeting'
16:31:42 #topic roll call
16:31:43 .hi
16:31:44 bgilbert: bgilbert 'Benjamin Gilbert'
16:31:52 o/
16:32:33 .hi
16:32:34 dustymabe: dustymabe 'Dusty Mabe'
16:32:34 .hello2
16:32:37 jlebon: jlebon 'None'
16:32:45 .hi
16:32:46 davdunc: davdunc 'David Duncan'
16:32:48 woah cool, new room
16:32:56 jlebon: fresh paint
16:32:58 test TEST Test
16:33:03 acoustics seem nice
16:33:09 :D
16:33:36 no fridge yet ?
16:34:23 no fridge, but there is a pony keg with beer in it
16:34:23 .hello siosm
16:34:26 travier: siosm 'Timothée Ravier'
16:34:50 .hello
16:34:50 jdoss: (hello ) -- Alias for "hellomynameis $1".
16:34:55 .hello
16:34:55 .hello sohank2602
16:34:58 saqali: (hello ) -- Alias for "hellomynameis $1".
16:34:58 .hello miabbott
16:35:01 skunkerk: sohank2602 'Sohan Kunkerkar'
16:35:04 miabbott: miabbott 'Micah Abbott'
16:35:14 .hello2
16:35:15 jdoss: jdoss 'Joe Doss'
16:35:17 .hello saqali
16:35:19 saqali: saqali 'Saqib Ali'
16:36:29 #chair bgilbert nemric jlebon davdunc travier jdoss saqali skunkerk miabbott
16:36:29 Current chairs: bgilbert davdunc dustymabe jdoss jlebon miabbott nemric saqali skunkerk travier
16:36:36 #chair lorbus
16:36:36 Current chairs: bgilbert davdunc dustymabe jdoss jlebon lorbus miabbott nemric saqali skunkerk travier
16:36:48 .hi
16:36:49 lorbus: lorbus 'Christian Glombek'
16:37:41 #topic Action items from last meeting
16:37:50 There were no action items from last meeting!
16:38:13 other than the usual `cat everything > /dev/jlebon`
16:38:22 \o/
16:38:33 :)
16:38:40 oops
16:38:42 meant
16:38:54 `cat everything >> /dev/jlebon` - can't overwrite the backlog
16:39:02 :D
16:39:14 ha
16:39:23 -ENOSPC
16:39:32 Just redirect it all to stdtravier
16:39:37 :)
16:39:56 (Can we do https://github.com/coreos/fedora-coreos-tracker/issues/194 ? But only if nothing else is more pressing?)
16:40:22 travier: can try
16:40:29 let's start with something else first
16:40:36 =1
16:40:39 +1
16:40:39 #topic FYI: some xen instance types might fail to boot on latest testing and next streams
16:40:44 #link https://github.com/coreos/fedora-coreos-tracker/issues/1066
16:41:05 This one is mostly an FYI to raise awareness (we'll probably be sending out a communication about it as well)
16:41:31 some xen instance types that take advantage of enhanced networking via the ixgbevf driver are failing to boot
16:41:44 hi.
16:41:48 .hi
16:41:49 jmarrero: jmarrero 'Joseph Marrero'
16:41:55 the failure rate is ~95% but somehow our tests passed when `testing` and `next` were cut last week
16:42:20 (for that instance type we only launch one instance and run a `basic` test)
16:42:23 #chair jmarrero
16:42:23 Current chairs: bgilbert davdunc dustymabe jdoss jlebon jmarrero lorbus miabbott nemric saqali skunkerk travier
16:42:43 Should we report that to AWS folks? davdunc?
16:42:46 The current proposal is to revert the kernel back to a known good and pursue a fix upstream
16:42:50 travier: already done! :)
16:42:53 thanks dustymabe for that. I am investigating..
16:42:54 great!
16:43:03 we have a kernel ticket in for it.
16:43:14 I'll add that for reference in the issue.
16:43:35 thanks davdunc
16:43:56 AFAIK there isn't any way to work around the issue other than reverting the kernel
16:44:15 and we'll have to come up with some steps for people to recover their instances if they've fallen into this trap :(
16:44:50 I have some ideas for improvements on how we can not hit this again, but I'll leave them for later
16:45:05 wow how lucky were we that both runs were in those 5%. that's a ...0.25% probability
16:45:16 likely some other factor at play?
16:45:31 this is a good example though of how CI will never really catch everything
16:45:57 Maybe something else changed in AWS between that time and now?
16:46:00 jlebon: yeah there is definitely something else going on underneath the covers. Maybe some changes on AWS backend?
16:46:14 that make it more consistently failing
16:46:36 +1
16:46:45 there is a nitro wrapper for specifically older instance types, like the m2 and m3 instances.
16:46:53 there were other contributing factors to why we either didn't see this or ignored it for a period of time: https://github.com/coreos/fedora-coreos-tracker/issues/1066#issuecomment-1009978326
16:47:14 i did a deep dive in mantle last night and found some skeletons we need to address
16:47:26 a lot of isolation was required after spectre/meltdown
16:47:28 which would have given us a much clearer red X failure
16:47:38 for the testing-devel runs
16:47:46 It's always Jenkins fault!
16:48:26 anywho I've spent too much time on this already..
FYI bgilbert looks like you're up in the ad-hoc release rotation: https://hackmd.io/WCA8XqAoRvafnja01JG_YA
16:48:35 will collaborate with you
16:48:37 this is one of the cve's fixed in the kernel https://bugzilla.redhat.com/show_bug.cgi?id=2031199
16:48:39 yup
16:49:06 ok next topic
16:49:27 #topic networking: consider the effects of BOOTIF kernel argument on nm-initrd-generator
16:49:34 #link https://github.com/coreos/fedora-coreos-tracker/issues/1048
16:50:14 There is a downstream issue I'm working on that this has implications on so I prioritized it
16:50:53 Basically there is a gap we've had in the past where we didn't consider the change in behavior a BOOTIF kernel argument would have on nm-initrd-generator
16:51:03 or rather a gap we "have"
16:51:04 #chair jbrooks
16:51:04 Current chairs: bgilbert davdunc dustymabe jbrooks jdoss jlebon jmarrero lorbus miabbott nemric saqali skunkerk travier
16:52:17 the question is, should we start to tell nm-initrd-generator to ignore that argument or not. Ignoring it gets us back into our happy place that we thought we were in to begin with.
16:53:04 i think we should ignore it, but it'd be nice if there was a nicer way to do that than changing the kargs defaults
16:53:28 though, ignoring it could have implications (behavior change) for some (i.e. if there are a ton of NICs on a machine or something)
16:53:29 like some flag to nm-initrd-generator or something
16:54:42 i think it's worth highlighting that BOOTIF usually is not something users provide
16:55:02 true, it's usually provided by the PXE executable
16:55:08 it's meant as a way for a PXE boot to know from which interface it booted for informational purposes
16:55:26 Could someone be relying on that right now and we would break it by changing how this behaves?
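For context on the BOOTIF discussion above, here is a minimal sketch of what the argument carries. It assumes the PXELINUX convention in which the bootloader appends `BOOTIF=01-aa-bb-cc-dd-ee-ff`, that is, a hardware-type byte (01 for Ethernet) followed by the MAC address of the boot interface. The function name `bootif_mac` and the sample command line are hypothetical, and this is not a description of how nm-initrd-generator itself parses the argument.

```python
def bootif_mac(cmdline):
    """Return the MAC from the last well-formed BOOTIF= karg, or None."""
    mac = None
    for arg in cmdline.split():
        if arg.startswith("BOOTIF="):
            parts = arg[len("BOOTIF="):].lower().split("-")
            # Expect a hardware-type byte plus six MAC bytes; 01 is Ethernet.
            if len(parts) == 7 and parts[0] == "01":
                mac = ":".join(parts[1:])
    return mac


# Hypothetical kernel command line, for illustration only.
print(bootif_mac("ip=dhcp BOOTIF=01-52-54-00-12-34-56"))
```

The point of the sketch is only that BOOTIF identifies one specific NIC, which is why honoring or ignoring it changes which interfaces get configured in the initrd.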
16:55:57 in theory, yes
16:56:21 jlebon: another option for us could be to enhance the code that attempts to determine if the user supplied any networking configuration or not to consider `BOOTIF`
16:56:29 I would prefer we improve our detection logic but that might not be ideal (and would require more work)
16:57:16 I guess we need to game out all of the scenarios
16:57:18 dustymabe: i.e. and not propagate?
16:57:23 jlebon: correct
16:57:36 but still let it have an effect on initrd networking
16:57:56 jlebon: right it still would have an effect on initrd networking
16:58:12 * dustymabe needs to read the bug again to see if that would actually help or not
16:58:17 yeah, could make sense. it's a more conservative change
16:58:56 the problem with that scenario arises when someone has their ignition config on a different network/NIC than they PXE booted from
16:59:27 which I guess in that case we tell them to add the `rd.bootif=0` arg?
17:00:00 there's also a knob they could use to have it not inject BOOTIF=
17:00:03 i'll take this back to investigation and add more info to the ticket and circle back next meeting with alternative options/implications
17:00:14 right `rd.bootif=0` ?
17:00:28 it looks like the nm-initrd-generator glue respects the dracut cmdline glue
17:00:37 dustymabe: no on the pxe configuration side
17:00:43 so we could drop rd.bootif=0 in /etc/cmdline.d/foo.conf
17:00:52 oh, yeah, that could be a nice option jlebon
17:01:16 and /proc/cmdline can still override it
17:01:18 bgilbert: right, that's the same thing as adding it to our default kargs (the original proposal)
17:01:36 dustymabe: yes, but without having to change the user-visible kargs
17:02:02 right, sorry that wasn't written down (that was the implementation I was thinking of in my head)
17:02:03 UX wise though, apply the principle of least surprise, i still think it'd make more sense for us to ignore it
17:02:24 bgilbert: would it still be possible to override it from kargs then?
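The override question can be sketched as follows. This is a minimal illustration that assumes arguments from drop-in files under /etc/cmdline.d are read before /proc/cmdline and that the last occurrence of an argument wins; whether nm-initrd-generator actually behaves this way is an assumption here. The helper `effective_karg` and the sample values are hypothetical.

```python
def effective_karg(name, dropin_args, proc_cmdline_args):
    """Resolve one karg with last-argument-wins semantics."""
    value = None
    # Drop-in arguments come first, so /proc/cmdline can override them.
    for arg in list(dropin_args) + list(proc_cmdline_args):
        key, sep, val = arg.partition("=")
        if key == name and sep:
            value = val
    return value


# Hypothetical default shipped in a cmdline.d drop-in.
dropin = ["rd.bootif=0"]
print(effective_karg("rd.bootif", dropin, ["ip=dhcp"]))
print(effective_karg("rd.bootif", dropin, ["rd.bootif=1"]))
```

Under those assumptions the shipped default applies unless the user sets the same argument on the kernel command line, without the default ever appearing in the user-visible kargs.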
17:03:03 travier: I haven't traced nmi-cmdline-reader.c, but if nm-i-g respects last-arg-wins, yes
17:03:38 It's both nice but more convoluted thus harder to figure out
17:03:50 here is where I was originally going to update: ./overlay.d/05core/usr/lib/dracut/modules.d/35coreos-network/50-afterburn-network-kargs-default.conf
17:03:54 https://github.com/coreos/fedora-coreos-config/blob/testing-devel/overlay.d/05core/usr/lib/dracut/modules.d/35coreos-network/50-afterburn-network-kargs-default.conf#L7
17:04:22 travier: update: yes it would
17:04:52 either way let me try to come back to this next time with more information
17:05:08 there's a semantic difference between changing the afterburn fallback and shipping a cmdline.d dropin
17:05:10 so the decision is easier to make
17:05:59 +1
17:06:29 #topic Release notes
17:06:50 #link https://github.com/coreos/fedora-coreos-tracker/issues/194#issuecomment-992334650
17:06:50 #link https://github.com/coreos/fedora-coreos-tracker/issues/194
17:07:16 So the subject is broad. I'm suggesting we scope it to a smaller subset first
17:07:41 Improving the way we track and display what issues are fixed in which releases
17:08:23 This is partly inspired by the layout at https://www.flatcar.org/releases/
17:09:17 travier: I say run with it.
17:09:26 i feel like we should be able to avoid any manual work to get this. e.g. just a label on tracker issues which marks it to be added to the notes
17:09:33 The idea is to list issues fixed in a release in a json file
17:09:36 Once we have that we could use the same logic to make a job that generates lists for CVE too
17:10:26 jlebon: agree, there could be a bot acting on it
17:11:07 i like the idea of using labels, but I wonder if we can get away with not creating a label for every released version ID
17:11:23 so e.g.
the release job runs a script which collects all the issues with a certain label and auto-generates the notes to push to s3 and then drops the labels
17:11:27 We can do it manually until we make a bot. It's not that painful to update.
17:11:35 dustymabe: i think we'd just need one per stream
17:12:19 The idea behind a separated json stream and not just adding that to the main one is that we can update the list at any time
17:12:24 and correct things
17:12:32 jlebon: yeah, I was just thinking about automation and going back later to correct things (i.e. we thought we fixed an issue, but we didn't and it's still broken)
17:12:53 is there a way to link issues to the fedora-coreos-streams issue (checklist)
17:12:58 and then pull the information from there?
17:13:26 We could create per-release milestones on Github
17:13:35 at least in that case we have a single issue that represents a release, if we can find a way to associate other issues with it and pull that information then we'd be set
17:14:24 https://github.com/coreos/fedora-coreos-tracker/milestones
17:14:47 dustymabe: hmm yeah could work. we could have the job that pulls that info and converts to JSON be separate.
it gets triggered by the release job, but could be rerun if we changed something
17:14:59 travier: feels heavy, but maybe it could work
17:15:17 either way I think we're very much in favor of the intermediate proposal (we have nothing right now)
17:15:31 +1
17:15:34 but we're just pining about how to achieve it with least effort (which we can talk about later)
17:15:41 I don't know if we can have a bug in multiple milestones
17:15:59 last time I checked, a bug can only have one milestone
17:16:15 linking from the streams issue seems pretty heavy, since we'd need to switch to another repo each time
17:16:19 https://github.com/isaacs/github/issues/797
17:16:22 we can not
17:16:26 so this is not an option
17:17:01 ok let's agree to discuss implementation details further outside of the meeting
17:17:22 I don't think we need a #proposed #agreed for this
17:17:32 agreed
17:17:35 trying to pick over the remaining meeting items
17:17:41 https://github.com/coreos/fedora-coreos-tracker/issues?q=is%3Aissue+is%3Aopen+label%3Ameeting
17:17:44 anything time pressing?
17:18:04 if not we're going to discuss "Large and growing PXE RAM requirement kind/bug meeting "
17:19:11 SGTM
17:19:26 #topic Large and growing PXE RAM requirement
17:19:40 #link https://github.com/coreos/fedora-coreos-tracker/issues/1055
17:19:44 bgilbert: you have the stage
17:20:15 I discovered that the documented 3 GiB RAM requirement for PXE appended rootfs is no longer enough
17:20:27 we never really understood why it needed to be that large
17:21:07 I did some digging. part of the issue, for both coreos.live.rootfs_url and appended rootfs, is that initrd / is on tmpfs and tmpfs will only use 50% of RAM by default
17:21:34 but that doesn't completely explain the memory requirements
17:22:02 bgilbert: how much memory of the system is reserved for the kernel?
could that explain some of it
17:22:22 dustymabe: not very much
17:22:26 we're missing hundreds of MB
17:22:51 I don't have a single concrete proposal here
17:22:54 I've a server that only run in live env .... is there a command line to get these info for you now ?
17:23:28 nemric: we have testing environments; the problem is figuring out what's going on
17:23:41 ok
17:23:56 one major piece of this is: how much do we care about the PXE appended rootfs case?
17:24:03 we went to some effort to support it
17:24:16 but from what I've seen, I suspect everyone just uses the rootfs_url karg
17:24:26 which is also faster and more RAM-efficient
17:25:14 if we deemphasize that case, we have more control, since e.g. the fetcher script could remount the tmpfs
17:25:19 hmm
17:25:33 Last option: bump the requirement for the concatenate option and mention in the doc the rootfs url option as faster and more efficient?
17:25:40 `rootfstype=` is a kernel argument that can be set on the pxe server side/
17:25:43 side?
17:26:00 travier: I mean, yeah. it's also possible to improve the UX, so we actually tell the user they're OOM rather than failing on something random
17:26:01 do we know if the downstream consumers (i.e. assisted installer) are using the appended rootfs case?
17:26:44 dustymabe: yeah, but see the notes in the bug. we can't do exactly what we want, and the workaround has potential unknown consequences
17:26:49 miabbott: AFAIK they are
17:26:51 *not
17:27:18 bgilbert: that seems like it'd be a nice improvement (OOM) without too much work
17:27:48 personally ok with just requiring more RAM and keep supporting it
17:27:50 jlebon: if we completely drop appended mode, it's not worth doing, but otherwise yes
17:27:54 If we think there are better options, I think we should emphasize those and do the minimum to keep the one we have sane?
17:28:09 one trivial change we can make
17:28:10 I lean towards the second half of 1.
(obviously we need to test some more to see if there are side effects) and then maybe we poke the upstream PR for tmpfs to see if that can gain any traction
17:28:29 is to allowlist TFTP in rootfs_url. that allows using rootfs_url without setting up HTTP
17:28:54 dustymabe: meh, the risk doesn't seem worth it IMO
17:29:05 what are the risks?
17:29:08 we omitted that to cut down on the support matrix, since appended initrds exist
17:29:34 I don't think it's worth optimising for low ram if we have another option more ram efficient already. If you want less use of RAM, use the other option
17:29:49 travier: fair
17:29:54 dustymabe: at runtime we'd be reading our rootfs out of a squashfs out of a minimal ramfs that no one uses
17:30:07 dustymabe: our initramfs is already special and extremely complex. this would further increase the gap between what we do and everyone else does
17:30:20 *ramfs implementation
17:31:13 ok so current proposal is to:
17:31:14 thoughts about allowlisting TFTP in rootfs_url? it would close a functionality gap in the preferred path
17:31:34 update docs to mention higher RAM reqs if you're going to concatenate and make OOM reporting better?
17:31:37 it's not 100% equivalent, since you have to repeat your TFTP server address in kargs instead of leaving it implicit
17:32:00 but it might help people migrate away from appending
17:32:04 bgilbert: sounds sane, but can you expand on the UX for that in the issue?
17:32:16 sure. it's just rootfs_url=tftp://
17:32:24 :)
17:32:43 working on a #proposed
17:32:54 ahhh, well in that case SGTM :)
17:32:58 (FYI we're past time)
17:33:10 bgilbert: were the things I mentioned part of that?
17:33:13 yup
17:33:16 +1
17:34:08 #proposed We will update our docs for the apparent new RAM requirements for PXE appended rootfs, and we'll improve OOM reporting for the appended case. We'll also pursue supporting TFTP in the rootfs_url karg, and have the docs encourage people to use that karg when possible.
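A minimal sketch of what allowlisting TFTP in rootfs_url could look like, assuming the fetcher checks the URL scheme against a fixed set before downloading. The scheme set, the function `rootfs_url_allowed`, and the sample addresses are hypothetical and not taken from the actual fetcher script.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; "tftp" is the proposed addition.
ALLOWED_SCHEMES = {"http", "https", "tftp"}


def rootfs_url_allowed(url):
    """Return True if the URL's scheme is on the allowlist."""
    return urlsplit(url).scheme.lower() in ALLOWED_SCHEMES


# Sample URLs use a documentation-only address range.
print(rootfs_url_allowed("tftp://192.0.2.10/rootfs.img"))
print(rootfs_url_allowed("ftp://192.0.2.10/rootfs.img"))
```

As noted in the discussion, this is not fully equivalent to appending the rootfs, because the TFTP server address has to be repeated in the kargs.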
17:34:42 s/use/prefer/
17:34:52 "use that karg when possible" as a possible alternative to appending if they have limited RAM?
17:35:07 regardless of RAM
17:35:14 it's also faster and more debuggable
17:35:36 bgilbert: hmm, i wonder if there's a way to auto-query the IP of the server served us somehow so the UX could be even simpler
17:35:48 the server that* served us
17:35:51 we can weaken that last part of the #proposed if desired
17:36:04 jlebon: the bootloader would need to pass that info on
17:36:39 I don't have strong opinions but it feels like we should just encourage people to use rootfs_url and then the rootfs_url docs can mention tftp or http
17:36:42 +1
17:36:58 either way I think:
17:37:00 we could get it from DHCP next-server, assuming the DHCP response doesn't change based on the client ID. which it very well might.
17:37:01 +1
17:37:05 bgilbert: yeah. wonder if it already does somehow. maybe some obscure ethtool knob against the interface
17:37:19 anyway, we can discuss this elsewhere :)
17:37:20 +1
17:38:05 #agreed We will update our docs for the apparent new RAM requirements for PXE appended rootfs, and we'll improve OOM reporting for the appended case. We'll also pursue supporting TFTP in the rootfs_url karg, and have the docs encourage people to prefer that karg when possible.
17:38:12 thanks all
17:38:24 thanks
17:38:27 #topic open floor
17:38:32 sorry for the long meeting
17:38:36 any topics for open floor?
17:38:59 #info dustymabe updated the f36 changes list: https://github.com/coreos/fedora-coreos-tracker/issues/918
17:39:15 i'm thinking maybe we should do a video meeting soon to go through the list
17:39:15 are we due for another video meeting soon?
17:39:20 ha
17:39:21 woah :)
17:39:32 same second
17:39:46 we could just ad-hoc schedule one for next week
17:40:04 we'll make jdoss run it
17:40:08 thoughts?
17:40:41 SGTM
17:40:43 Oh crap.
17:40:59 dustymabe: maybe not the whole meeting.
leave some time for something more fun
17:41:01 jdoss: don't worry,
17:41:19 jlebon: IOW we should have something else on the agenda?
17:41:31 yeah. unless you meant it as a meeting separate from the community meeting
17:41:44 was going to use the same timeslot
17:41:58 ok let's discuss more offline
17:42:01 #endmeeting