16:32:37 #startmeeting fedora_coreos_meeting
16:32:37 Meeting started Wed May 11 16:32:37 2022 UTC.
16:32:37 This meeting is logged and archived in a public location.
16:32:37 The chair is travier. Information about MeetBot at https://fedoraproject.org/wiki/Zodbot#Meeting_Functions.
16:32:37 Useful Commands: #action #agreed #halp #info #idea #link #topic.
16:32:37 The meeting name has been set to 'fedora_coreos_meeting'
16:32:41 #topic roll call
16:32:44 .hello2
16:32:45 jlebon: jlebon 'None'
16:32:48 .hi
16:32:49 saqali: saqali 'Saqib Ali'
16:32:53 .hello miabbott
16:32:54 miabbott_: miabbott 'Micah Abbott'
16:33:02 .hello siosm
16:33:03 travier: siosm 'Timothée Ravier'
16:33:15 .hi
16:33:15 Sid__: Sorry, but user 'Sid__' does not exist
16:33:21 #chair jlebon saqali mikelo_ Sid__
16:33:21 Current chairs: Sid__ jlebon mikelo_ saqali travier
16:33:55 .hi
16:33:56 lucab: lucab 'Luca BRUNO'
16:35:01 .hi
16:35:02 ravanell_: Sorry, but user 'ravanell_' does not exist
16:35:02 #chair lucab
16:35:02 Current chairs: Sid__ jlebon lucab mikelo_ saqali travier
16:35:14 .hi
16:35:15 dustymabe: dustymabe 'Dusty Mabe'
16:35:20 #chair lucab
16:35:20 Current chairs: Sid__ jlebon lucab mikelo_ saqali travier
16:35:27 #chair ravanell_ dustymabe
16:35:27 Current chairs: Sid__ dustymabe jlebon lucab mikelo_ ravanell_ saqali travier
16:36:17 Let's start
16:36:24 #topic Action items from last meeting
16:36:47 No actions
16:36:50 let's move on
16:37:11 hum
16:37:21 was there a meeting last week?
16:37:34 the notes are not there
16:37:55 travier: it was a video meeting
16:37:57 oh, it was a video one
16:37:59 ok
16:38:22 #topic coreos autoinstall creates huge number of xfs allocation groups #1183
16:38:29 #link https://github.com/coreos/fedora-coreos-tracker/issues/1183
16:38:36 who wants to introduce this one?
16:39:04 I can
16:39:22 .hi
16:39:23 aaradhak: aaradhak 'Aashish Radhakrishnan'
16:39:24 we started covering it last week in the video meeting
16:39:29 #chair aaradhak
16:39:29 Current chairs: Sid__ aaradhak dustymabe jlebon lucab mikelo_ ravanell_ saqali travier
16:39:47 * cmurf is lurking
16:39:50 * dustymabe hopes his summary was reflective of the meeting discussion: https://github.com/coreos/fedora-coreos-tracker/issues/1183#issuecomment-1119861324
16:40:07 .hi
16:40:08 bgilbert: bgilbert 'Benjamin Gilbert'
16:40:15 yes, I think it contains all the things we said
16:41:16 #chair bgilbert
16:41:16 Current chairs: Sid__ aaradhak bgilbert dustymabe jlebon lucab mikelo_ ravanell_ saqali travier
16:41:26 we stopped before deciding which path to take for a solution
16:42:10 bgilbert: we're currently contemplating the options in https://github.com/coreos/fedora-coreos-tracker/issues/1183#issuecomment-1119861324
16:42:20 dustymabe: thanks for writing that!
16:42:35 walters: np - i wrote it a day later so my memory could have missed some details
16:43:27 I think I brought up C but in retrospect I think it was a bad idea
16:43:39 A or F are the most appealing to me
16:43:41 yeah, -1 to C
16:44:36 let's do process of elimination
16:44:41 looks like C is out
16:44:45 any others we can disqualify?
16:44:52 I prefer 1 then F
16:44:56 A then F
16:45:03 F isn't really a solution though
16:45:22 bgilbert: it is if it's combined with docs
16:45:35 "sorry, your disk is too large, boot failed"?
16:45:38 D is too non-deterministic. you don't want to be testing your Ignition config in e.g. QEMU and then suddenly it does something else on the real thing
16:45:39 bgilbert: yeah..
I think F is more a "the tools should not have let us do this in the first place" kind of thing
16:45:56 i'd disqualify D
16:45:59 jlebon: fair re D
16:46:22 I don't like D because "magic" and Ignition is usually pretty explicit
16:46:29 +1
16:46:30 bgilbert: "see this link at $url" where we describe the different options (e.g. reprovision root or /var partition)
16:47:02 all in favor of throwing out D?
16:47:09 dustymabe: +1
16:47:10 most of the rest of these options, other than A, seem like us just punting on the issue
16:47:11 once we have detection for D, making A is maybe not that far off
16:47:13 on F I think there are actually two sub-options: F1) hard-fail when trying to growfs F2) soft-fail the growfs, keep booting with a small rootfs, surface a failed service unit
16:47:39 this is a technical detail that doesn't seem like it should be the user's problem
16:47:50 i think even if we do A, we should probably add an MOTD (for lack of a better mechanism) so they can choose to handle it differently
16:48:10 jlebon: what concrete recommendation would we want to give in the MOTD?
16:48:15 jlebon: how would you detect that?
16:48:35 bgilbert: to create a /var partition rather than pay the cost of reprovisioning
16:49:00 fair
16:49:02 A would be an obvious choice IMO if it weren't for its cost
16:49:24 jlebon: right.. do we have any idea how long it would take
16:49:38 considering the case where we want to do this is large disks essentially
16:49:54 if you're spinning up an instance with a 1 TB disk, I wonder if you care about a minute or two of copying data around
16:49:58 it might be worth investigating with XFS SMEs whether there are cheaper ways to re-mkfs
16:50:11 If we make an MOTD for A, how do you detect that it happened vs something else in the Ignition config making it happen?
16:50:24 travier: we know why we decided to reprovision
16:50:37 bgilbert: you might if frequent reprovisioning is a key part of your setup
16:51:07 jlebon: if you're not already using a separate /var, then each reprovision is deleting all of your storage
16:51:15 ^^
16:51:20 maybe you need multiple TB of scratch space but I don't think that's as likely
16:51:39 fair
16:51:54 honestly I'm cool with the MOTD - but maybe we could make it more generic than this problem
16:52:09 dustymabe: how?
16:52:15 i.e. "we see you have a large disk, have you considered a separate /var?"
16:52:25 jlebon: mkfs.xfs then xfs_copy perhaps
16:52:27 something along those lines ^^
16:52:39 hi, this is fedora coreos clippy...it looks like you have a large disk...
16:52:46 :D
16:52:48 miabbott: i'm bad at marketing
16:52:51 it still feels a bit weird to make this costly choice for them, but it's likely the least bad option
16:52:57 I think we'd need to loop in the XFS folks soon anyway if E or F are still on the table after today
16:53:41 honestly I think we should talk to the XFS folks anyway about F (regardless of what we decide to do here)
16:53:59 cmurf: ughh, TIL about xfs_copy. that's interesting
16:54:08 s/ughhh/huh/
16:54:47 ok C and D are out
16:54:54 jlebon: it's the least surprising default. the user can still override it
16:54:58 I've seen some "A or F"
16:54:58 F should happen anyway, yeah
16:55:14 anyone want to advocate for or against E ?
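(For reference on the detection idea above: the allocation-group count is readable from xfs_info. Below is a minimal sketch of the kind of check options A/D would need; the /sysroot mount point and the 512-AG threshold are illustrative assumptions, not values vetted by XFS developers.)

```sh
# Minimal sketch: warn when the root filesystem has "too many" XFS
# allocation groups. The 512 threshold is an illustrative assumption.
agcount=$(xfs_info /sysroot | sed -n 's/.*agcount=\([0-9]*\),.*/\1/p')
if [ -n "$agcount" ] && [ "$agcount" -gt 512 ]; then
    echo "warning: rootfs has $agcount allocation groups; consider reprovisioning or a separate /var" >&2
fi
```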
16:55:16 bgilbert: yeah, agreed
16:55:21 we've previously collab'ed with the xfs folks, should just ping them on the ticket (see https://github.com/coreos/fedora-coreos-config/pull/1650#issuecomment-1090260286)
16:55:41 i guess the issue with F is that you've got one resize happening from an original size the user has no idea of
16:56:46 cmurf: i'm looking at F from a perspective outside of this specific issue.. just as a non-FCOS user (any user really), why should the tool allow me to get into a bad state
16:56:49 cmurf: the original size is more or less the only known variable in the equation as it comes from our build manifests
16:56:59 dustymabe: E is surprising. we only autogrow at all if the user has decided not to take control of their storage layout, and we should do what the user expects in that case
16:57:14 and thus they have no practical way of knowing the max size for the fs, unless xfs_growfs has a way of understanding this 1-order-of-magnitude growfs "limit". it's not a hard limit, it will grow it, but a warning about performance degradation would be appropriate, given that upstream tells users their configuration is suboptimal anytime large fs resizes happen
16:57:22 bgilbert: so your vote is to disqualify E?
16:57:29 dustymabe: yeah
16:57:34 (yes it can be advertised better, but at least it's known upfront and versioned)
16:57:53 lucab: known by coreos developers, not the user
16:58:28 cmurf: yeah i think what I'm advocating for with F is that the tools disallow getting into this state, but they could choose to give some sort of --but-i-really-want-this CLI flag or something
16:58:42 I'd be content with a warning
16:59:04 i think it's a valid ask
16:59:14 whether XFS developers will agree, no idea
16:59:32 bgilbert: a warning wouldn't help us much here (in this situation) since it's all automated
16:59:41 bgilbert: E is similar to F, as in it tries to acknowledge that the resource as an upper bound that can't grow arbitrarily
16:59:45 dustymabe: right, and I don't think F is a solution for us anyway
16:59:46 hell there may even be a warning today that we haven't seen
16:59:55 *has an upper bound
17:00:16 lucab: i expect it'll be documented at least (if it isn't already) before it'd get baked as a warning into xfs_growfs
17:00:16 well F could be a solution - if the tool fails then won't we surface that to the user in some way?
17:00:46 what you need an XFS developer to answer is, can the tool effectively estimate the original size of the fs in order to provide a useful warning?
17:00:55 lucab: ...and then make it the user's problem, which I don't like. growfs is an implementation detail and it's not great to punt its arbitrary limits to the user
17:01:14 because the current size is not the size it needs to make the assessment, but the original mkfs size
17:01:37 dustymabe: "disk too large" makes us look old and creaky. we're not MS-DOS.
17:01:46 and a possible problem with that is it depends on the xfsprogs version, since the mkfs-time decisions differ by version
17:01:54 Agree that A is the best option. It's more work but we have time, as we have an easy workaround for this issue (make a separate /var)
17:01:56 and the version isn't included in the on-disk metadata AFAIK
17:02:01 bgilbert: hehe
17:02:32 ehh
17:02:35 I don't know
17:02:40 cmurf: I don't understand the concern about the original disk size
17:02:43 we already warn the user if their disk is too small
17:02:54 cmurf: if the problem is caused by too many allocation groups, then we warn if there are too many allocation groups, no?
17:02:58 and tell them they need to do something different
17:03:03 bgilbert: the original disk size at mkfs time is the limiting factor for growfs
17:03:04 I don't really think this is that much different
17:03:10 not the current size
17:03:42 cmurf: right, but we don't need to identify the root cause in numerical detail. "this filesystem has been grown too much, see <url>" would be sufficient
17:04:07 cmurf: are you concerned with multiple/subsequent growfs?
17:04:12 #proposed Given that we have a valid and recommended workaround for this issue, we will investigate option A (adding auto-detection and auto re-provisioning). We will reach out to XFS folks to get a better understanding of our options and to see if F is also doable.
17:04:21 right, but in order to say "it's been grown too much" since the original size, you need to know the original size or a way to infer it
17:04:47 cmurf: I'm saying, isn't it sufficient to detect that the new FS will have more than a constant N allocation groups?
17:04:48 and the way XFS developers do that is by asking what xfsprogs made the fs, and then they can infer what the original size was at mkfs time from xfs_info
17:05:08 bgilbert: yeah that was my question
17:05:24 #chair +cmurf
17:05:24 Current chairs: +cmurf Sid__ aaradhak bgilbert dustymabe jlebon lucab mikelo_ ravanell_ saqali travier
17:05:32 i.e. i already ran growfs, now look at the state and see a million allocation groups, and I know I've grown too much
17:05:42 (since you're more than lurking around :))
17:05:45 a million?
17:05:49 hyperbole
17:06:28 dustymabe: too small makes intuitive sense; too large doesn't
17:06:40 bgilbert: i'm not sure, XFS dev question - I know at least the journal is not resized with xfs_growfs
17:06:40 cmurf: i'm trying to oversimplify something I don't fully understand
17:07:02 travier: ack +1 to #proposed
17:07:09 travier: +1 to #proposed
17:07:29 also +1 for me
17:07:39 +1
17:07:40 cmurf: fair. in general I'd think there could be heuristics for "the properties of this filesystem will lead to slowness"
17:07:40 from the bug report, an agcount ~20k seems to already fail the default timeout
17:07:41 +1
17:07:55 (and i already pinged sandeen on the ticket)
17:08:03 miabbott++
17:08:22 #agreed Given that we have a valid and recommended workaround for this issue, we will investigate option A (adding auto-detection and auto re-provisioning). We will reach out to XFS folks to get a better understanding of our options and to see if F is also doable.
17:08:25 i have a followup subtopic of this
17:08:25 bgilbert: from a user standpoint I agree, because this is esoteric information and I'm going to push back hard on developers who say users are supposed to know this
17:08:34 dustymabe: go ahead
17:08:37 but the history of the file system is that it's a file system for experts who know things
17:08:56 which is.. I think we agree we'd like people to start using /var partitions more often.. how do we encourage that behavior (especially for large disks)?
17:09:11 hence why i'm strongly in the Btrfs-as-`/` camp: you can legit default to XFS for /var based on the actual size of that volume rather than an estimate or placeholder size
17:09:58 cmurf: i think it's better to keep that convo in the tickets. we're straying a bit from the topic
17:10:25 dustymabe: docs maybe?
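(For reference, the separate-/var workaround recommended throughout this topic looks roughly like the following Butane config. This is a sketch modeled on the FCOS docs; the /dev/vda device name and the partition sizes are illustrative assumptions.)

```yaml
# Sketch: cap the root partition and give the rest of the disk to /var.
variant: fcos
version: 1.4.0
storage:
  disks:
    - device: /dev/vda          # assumed boot disk (e.g. on a QEMU guest)
      partitions:
        # Keep root at a bounded size instead of letting growfs fill the disk
        - number: 4
          label: root
          size_mib: 16384
          resize: true
        # size_mib: 0 means "use all remaining space" for /var
        - label: var
          size_mib: 0
  filesystems:
    - path: /var
      device: /dev/disk/by-partlabel/var
      format: xfs
      with_mount_unit: true
```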
17:10:35 maybe :)
17:10:50 just didn't know if anyone else had ideas other than my "clippy" suggestion
17:10:50 Let's move on or we won't have enough time
17:10:54 move on ++
17:11:12 #topic New Package Request: nmstate-libs and nmstate #1175
17:11:18 #link https://github.com/coreos/fedora-coreos-tracker/issues/1175
17:11:19 dustymabe: is it objectively better? i'm not sure
17:11:37 travier: sorry I think we should have removed the meeting label from this one
17:11:44 ok, next one
17:11:48 the NM guys were going to come back at some point
17:11:53 we'll need to schedule that with them
17:12:02 unless someone here does want to discuss this one today
17:12:02 #topic use internal qcow2 compression for nutanix image #1191
17:12:09 #link https://github.com/coreos/fedora-coreos-tracker/issues/1191
17:12:53 (I can undo depending on whether folks want to talk about it or not)
17:13:08 So, Nutanix doesn't support direct download/upload of a gzipped image
17:13:19 we want a qcow2 artifact that we can pass on to the Nutanix Prism APIs for direct download of the image. The gz requires downloading, extracting to a regular qcow2, and uploading, which is more failure prone.
17:13:33 i.e. non-gz qcow2
17:13:51 We can also use internal qcow2 compression (the -c flag), which will solve the problem of image size (which is why I guess we were using gzip for compression to begin with) while still having an image we can directly upload.
17:14:14 There's a PR I created to that effect
17:14:17 #link https://github.com/coreos/coreos-assembler/pull/2848
17:14:45 +1 to direct download/upload. we usually try to accommodate platform requirements when the platform supports that, e.g. by changing the compression algo
17:15:08 Sid__: what other formats does the API support?
17:15:42 iso afaik
17:16:08 Raw too. Internally, Nutanix unpacks the image to raw before mounting
17:16:22 but it doesn't support any compressed raw, i suppose?
17:16:49 not sure on that one.
17:17:14 jlebon: I don't think compressed qcow is too scary. the nutanix artifact is only meant to be used by nutanix ingest
17:17:29 it sounds like the qcow2 is more used as "transport". if there's a better matching artifact, we should use that. if not, the current proposal SGTM
17:17:51 I'm completely fine with making this change.. I do wish more platforms would support importing a compressed artifact from a URL, though. Most of the platforms don't, so you have to decompress the image (large file) and upload that directly (eating bandwidth).
17:18:02 openstack has this problem, ibmcloud, azure
17:18:11 azurestack
17:18:19 I'm +1 to making the change, but there's potentially a compat concern
17:18:26 longer term, we do want a single internally-compressed qcow for Nutanix, right?
17:18:40 I had originally proposed adding a new artifact, rather than changing the existing one, to avoid breaking any user scripts
17:18:48 dustymabe: heh, flashbacks of https://github.com/jlebon/osp-utils#upload
17:18:51 lucab: I think that is shorter term
17:19:03 but actually that's not trivial, because both artifacts would have the same name in the filesystem before cmd-compress gets to them
17:19:16 do we have stats on how many FCOS nutanix users there are?
17:19:20 Indeed, a single internally-compressed qcow for Nutanix is exactly what we want.
17:19:24 or image downloads
17:19:54 I assume not many?
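(For context before the thread continues: the internal compression under discussion is qemu-img's convert -c flag, which compresses qcow2 clusters while the file remains a plain qcow2 that platforms can ingest directly. A minimal sketch, with illustrative file names:)

```sh
# Convert a qcow2 to an internally (per-cluster) compressed qcow2.
# File names are illustrative; the output is still a valid qcow2.
qemu-img convert -O qcow2 -c \
    fedora-coreos-nutanix.qcow2 \
    fedora-coreos-nutanix.compressed.qcow2
```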
in which case, I think we should send a coreos-status announcement of a breaking change, and cut over ASAP
17:20:15 bgilbert: I assume you are correct
17:20:16 (which would imply doing the following FCOS `stable` release with a previous cosa snapshot)
17:20:20 and I agree with your assessment
17:20:41 if there are any, they'd probably be happy with this change anyway
17:20:59 bgilbert: lucky for you I tagged COSA in quay and the aarch64 builder on Saturday when I kicked off the testing build
17:21:01 jlebon: fair. we'd be breaking their scripts that they don't want to have
17:21:08 Given that this is an image upload that has to be done on the platform, I don't think that's a big breaking change for existing users. It will also simplify their usage.
17:21:47 +1
17:21:57 can anyone do a quick countme or cincinnati stats query?
17:22:22 i don't think those expose platform information, do they?
17:22:23 we cannot query the platform with countme (by design)
17:22:29 I think we don't count/track by platform on either
17:22:37 jlebon: I thought countme did, at least?
17:22:38 * jlebon sheds a tear for pinger
17:22:40 huh
17:23:42 jlebon: your tear has been recorded in the pinger backend in the sky
17:23:48 #proposed After a short deprecation period, we will switch the Nutanix artifacts to an internally compressed qcow2 image
17:23:56 bgilbert: haha
17:24:05 travier: let's define "short"
17:24:07 tear_count: 1
17:24:12 yeah AFAIU countme only tracks fedora major release and architecture
17:24:14 jlebon: hah
17:24:47 Making a backend for the pinger is a big and complex task, especially regarding privacy & user data
17:25:09 travier: yup, I'm not complaining
17:25:12 I propose: send coreos-status post now, switch next/testing for next release cycle
17:25:19 and stable for the cycle after that
17:25:23 WFM
17:25:35 #proposed After a 2-week deprecation period, we will switch the Nutanix artifacts to an internally compressed qcow2 image
17:25:39 assuming our stack can handle `testing` and `stable` being different
17:25:49 i.e. websites and such
17:25:53 noob question: How long is a release cycle?
17:25:59 maybe make it 4 weeks? to get things sorted out?
17:26:01 travier: if we agree to a two-week minimum, that'll push us one more cycle out
17:26:04 Sid__: two weeks
17:26:16 Cool. Sounds good to me.
17:26:19 3 then :)
17:26:32 dustymabe: it should. everything just walks the format map
17:26:38 AFAIK
17:26:45 cool
17:26:48 * bgilbert crosses fingers
17:26:49 was worth asking
17:26:56 we may learn something
17:27:03 we always learn something!
17:27:13 it'll require some fun cosa tag handling
17:27:21 jlebon: already covered
17:27:31 #proposed After a 3-week deprecation period (announced ASAP), we will switch the Nutanix artifacts to an internally compressed qcow2 image
17:27:35 travier: I'm arguing for "effective the next cycle". I don't think there's much reason to delay further
17:28:08 dustymabe: sweet
17:28:10 #proposed We will switch the Nutanix artifacts to an internally compressed qcow2 image for the next cycle. This breaking change will be announced ASAP.
17:28:24 "effective next cycle" means `testing` and `next` in the next release and `stable` in 4 weeks, essentially
17:28:30 IIUC ^^
17:28:30 right
17:28:39 SGTM
17:28:41 could be convinced to wait longer if there's a need
17:29:06 travier: +1 to proposed
17:29:12 travier: +1
17:29:12 I'm not sure we should keep using different artifact types for different streams. Let's change them all at once?
17:29:31 What do we benefit from switching only some?
17:29:42 travier: for the same reason we benefit from promotion for anything else
17:29:51 (time check, we're mostly done)
17:30:11 proposal SGTM too
17:30:14 get testing, shake out issues, avoid surprising users
17:30:18 ok
17:30:22 +1
17:30:52 #agreed We will switch the Nutanix artifacts to an internally compressed qcow2 image for the next cycle for testing & next and the following one for stable. This breaking change will be announced ASAP.
17:30:52 but it's a known breaking change anyway, at least users can change their scripts in a single way
17:31:28 #undo
17:31:28 Removing item from minutes: AGREED by travier at 17:30:52 : We will switch the Nutanix artifacts to an internally compressed qcow2 image for the next cycle for testing & next and the following one for stable. This breaking change will be announced ASAP.
17:31:36 went too fast here.
17:31:48 Let's get more +1s or another suggestion
17:32:28 any other votes? lucab miabbott aaradhak gursewak_ saqali ravanell_
17:32:35 +1; in terms of breaking changes this seems like a small one
17:32:36 (timing out in 60 secs)
17:33:09 +1
17:33:18 +1
17:33:30 I'm happy to defer to whoever is going to handle the changes. I think we agree that we won't impact many users, so the details are probably not very pressing.
17:33:45 #agreed We will switch the Nutanix artifacts to an internally compressed qcow2 image for the next cycle for testing & next and the following one for stable. This breaking change will be announced ASAP.
17:33:49 lucab: +1
17:33:56 #topic Open Floor
17:34:12 (5 min and then I'll close :))
17:35:10 #info Fedora 36 has been released to our `testing` stream!
17:35:43 woohoo!
17:36:49 I don't have anything else, I don't think
17:36:56 I know there is a release party starting on Friday
17:37:04 which I encourage people to attend
17:37:23 🎉
17:37:28 https://hopin.com/events/fedora-linux-36-release-party/registration
17:37:38 actually while we are here
17:37:47 does anyone want to work on switching cosa to f36?
17:38:06 dustymabe: I can help with that
17:38:08 I vaguely remember some people testing it out already, but could be wrong
17:38:30 ravanell_++
17:38:39 ravanell_++
17:38:56 ravanell_: sweet, thanks!
17:38:57 that's all I had for today
17:39:27 upstream devs: be on the lookout for potential CI regressions. testing-devel and buildroot images will soon move to f36
17:39:53 xref: https://github.com/coreos/fedora-coreos-config/pull/1732
17:41:33 #endmeeting