2024-02-29 17:30:54 <@tflink:fedora.im> !startmeeting fedora-pytorch
2024-02-29 17:30:55 <@meetbot:fedora.im> Meeting started at 2024-02-29 17:30:54 UTC
2024-02-29 17:30:55 <@meetbot:fedora.im> The Meeting name is 'fedora-pytorch'
2024-02-29 17:31:01 <@tflink:fedora.im> !topic welcome and roll call
2024-02-29 17:31:05 <@tflink:fedora.im> !hello
2024-02-29 17:31:06 <@zodbot:fedora.im> Tim Flink (tflink)
2024-02-29 17:31:37 <@tflink:fedora.im> !link agenda document https://board.net/p/fedora-pytorch-meeting
2024-02-29 17:31:55 <@tflink:fedora.im> who all's here for some pytorch discussion fun times?
2024-02-29 17:32:34 <@kaitlynabdo:fedora.im> im here
2024-02-29 17:33:06 <@kaitlynabdo:fedora.im> i only have one thing today which is just to update everyone on what fesco said (:
2024-02-29 17:33:55 <@trix:fedora.im> meep meep
2024-02-29 17:35:17 <@tflink:fedora.im> !topic pytorch 2.3
2024-02-29 17:35:27 <@tflink:fedora.im> !chair Tom Rix
2024-02-29 17:35:43 <@trix:fedora.im> yes, i am sitting..
2024-02-29 17:35:45 <@tflink:fedora.im> I still don't know if chair does anything with the new meetbot
2024-02-29 17:35:57 <@tflink:fedora.im> chair used to let you set links, change topics etc
2024-02-29 17:36:15 <@conan_kudo:matrix.org> chair does nothing
2024-02-29 17:36:19 <@conan_kudo:matrix.org> it's not implemented yet
2024-02-29 17:36:19 <@trix:fedora.im> well.. lets get started.
2024-02-29 17:36:30 <@tflink:fedora.im> good to know, thanks Conan Kudo
2024-02-29 17:37:05 <@tflink:fedora.im> Tom Rix this was your topic, I'll let you run with it since I'm only partially familiar with what you listed on the agenda
2024-02-29 17:37:05 <@trix:fedora.im> i am chasing 2.3 for rawhide and _maybe_ epel.
2024-02-29 17:37:26 <@trix:fedora.im> epel has some maybe deep python runtime needs.
2024-02-29 17:37:45 <@trix:fedora.im> what's others interest in epel at this time ?
2024-02-29 17:37:50 <@tflink:fedora.im> you're a braver man than I if you do epel, that's a long support commitment unless epel guidelines have changed since the last time I looked
2024-02-29 17:38:37 <@trix:fedora.im> i am asking if folks really care about epel, i would rather not own another 30 packages if no one cares
2024-02-29 17:38:52 <@tflink:fedora.im> I don't personally care, no
2024-02-29 17:39:19 <@tflink:fedora.im> and it doesn't seem like many other people are here so I doubt that you'll get much feedback in this forum
2024-02-29 17:39:42 <@trix:fedora.im> no worries.
2024-02-29 17:39:58 <@trix:fedora.im> that was the softball feature.
2024-02-29 17:40:07 <@tflink:fedora.im> !info trix is considering epel packages for pytorch and is looking for feedback on whether those are desired
2024-02-29 17:40:31 <@trix:fedora.im> the hardball is rocm and splitting pytorch / having multiple installs
2024-02-29 17:40:51 <@trix:fedora.im> i am not sure how _not_ to do this and support all the rocm gpus
2024-02-29 17:40:59 <@tflink:fedora.im> shall we save rocm for the next topic with splitting or discuss it now with the other 2.3 features?
2024-02-29 17:41:17 <@trix:fedora.im> rocm is the big 2.3 feature
2024-02-29 17:41:46 <@tflink:fedora.im> sure but isn't splitting the bigger problem?
2024-02-29 17:41:59 <@davide:cavalca.name> !hi
2024-02-29 17:42:00 <@zodbot:fedora.im> Davide Cavalca (dcavalca) - he / him / his
2024-02-29 17:42:39 <@tflink:fedora.im> yeah, the previous time was a bit early for me as well :)
2024-02-29 17:42:49 <@trix:fedora.im> ok.. general rocm before splitting.
its building
2024-02-29 17:43:03 <@trix:fedora.im> only tested on mi210
2024-02-29 17:43:10 <@trix:fedora.im> cuz that is what i have
2024-02-29 17:43:24 <@tflink:fedora.im> !info pytorch 2.3 with rocm support can be built but for the moment, support is limited to a few cards and has only been tested on mi210
2024-02-29 17:44:10 <@tflink:fedora.im> no idea if it'll work since I bet that upstream designed support for the MI cards, though
2024-02-29 17:44:27 <@trix:fedora.im> i think its too early to say 'a few cards' .. its not like i disabled the building of any of the cards
2024-02-29 17:45:11 <@tflink:fedora.im> when I attempted to build pytorch yesterday, it required specifying gfx906 or gfx90a - is there a way around that?
2024-02-29 17:45:36 <@trix:fedora.im> eventually i will get around to checking it out on another card, that is just a time thing
2024-02-29 17:46:06 <@trix:fedora.im> check today. the general building has been pushed to the rhel-test branch
2024-02-29 17:46:24 <@tflink:fedora.im> will do, thanks
2024-02-29 17:47:14 <@tflink:fedora.im> Tom Rix are the other features you listed in the agenda announcements or is there discussion to be had around them?
2024-02-29 17:47:35 <@tflink:fedora.im> IIRC, there are some potential issues with distributed
2024-02-29 17:47:44 <@trix:fedora.im> another feature on the fence is distributed.
2024-02-29 17:48:06 <@trix:fedora.im> to do that rccl needs to be fedora and gloo needs to use it
2024-02-29 17:48:30 <@trix:fedora.im> and the hardest .. i do not really have a way to test distributed. i have 1 machine.
2024-02-29 17:48:45 <@trix:fedora.im> same for basic rccl.
2024-02-29 17:48:48 <@tflink:fedora.im> what do you need to test distributed?
2024-02-29 17:49:08 <@trix:fedora.im> i'm not sure. i assume at least 2 machines
2024-02-29 17:49:44 <@tflink:fedora.im> I have 2 cards in two machines right now (gfx906 and gfx1100) with a third on the way at some point.
I can do testing if that'd be helpful
2024-02-29 17:49:59 <@tflink:fedora.im> but the cards are not identical
2024-02-29 17:50:17 <@trix:fedora.im> just seeing if rccl works would be help.
2024-02-29 17:50:29 <@tflink:fedora.im> ok, I'll sync up after the meeting to figure out how to do that
2024-02-29 17:50:40 <@trix:fedora.im> groovy
2024-02-29 17:50:57 <@tflink:fedora.im> !info more testing for rccl and pytorch distributed is needed, those features may be enabled for pytorch 2.3
2024-02-29 17:51:21 <@trix:fedora.im> some features that seem to _just build_ are caffe2 and openmp.
2024-02-29 17:51:36 <@trix:fedora.im> so adding them because i like easy
2024-02-29 17:52:03 <@tflink:fedora.im> sounds good to me :)
2024-02-29 17:52:23 <@trix:fedora.im> anyone have a feature they want/need ?
2024-02-29 17:52:24 <@tflink:fedora.im> !info caffe2 and openmp support are planned to be added for the pytorch 2.3 package
2024-02-29 17:52:39 <@tflink:fedora.im> other than getting rocm acceleration to work, none for me :)
2024-02-29 17:52:51 <@tflink:fedora.im> gotta ask for the most difficult feature
2024-02-29 17:53:08 <@tflink:fedora.im> anyone else have discussion to add for pytorch features?
2024-02-29 17:53:29 <@tflink:fedora.im> otherwise, we'll move on to the next topic
2024-02-29 17:53:59 <@tflink:fedora.im> we're almost at the scheduled end of the meeting - does anyone have hard stops at the half hour?
2024-02-29 17:54:39 <@kaitlynabdo:fedora.im> only the fesco update
2024-02-29 17:55:00 <@tflink:fedora.im> ok, lets go into that quick before getting into the rocm split discussion which I expect will take a bit
2024-02-29 17:55:04 <@tflink:fedora.im> !topic fesco update
2024-02-29 17:55:35 <@kaitlynabdo:fedora.im> just in case anyone isnt aware, tim and i reached out to fesco about pre trained weights when packaging AI models
2024-02-29 17:55:37 <@tflink:fedora.im> I assume this has to do with the question about packaging pre-trained weights
2024-02-29 17:55:41 <@tflink:fedora.im> !link https://pagure.io/fesco/issue/3175
2024-02-29 17:56:05 <@kaitlynabdo:fedora.im> and they got back to us and said that they dont handle stuff like that and to post it publicly on fedora legal
2024-02-29 17:56:10 <@kaitlynabdo:fedora.im> https://lists.fedoraproject.org/archives/list/legal@lists.fedoraproject.org/thread/PIPILJCMDEO67ORL4SAKB3NPHHVMFDJE/#PIPILJCMDEO67ORL4SAKB3NPHHVMFDJE
2024-02-29 17:56:21 <@kaitlynabdo:fedora.im> so this is the thread tim started
2024-02-29 17:56:35 <@tflink:fedora.im> !link https://lists.fedoraproject.org/archives/list/legal@lists.fedoraproject.org/thread/PIPILJCMDEO67ORL4SAKB3NPHHVMFDJE/#PIPILJCMDEO67ORL4SAKB3NPHHVMFDJE
2024-02-29 17:57:35 <@tflink:fedora.im> comments on the fesco ticket indicate some general acceptance with the idea of treating pre-trained weights as regular non-code content but we'll re-open the fesco ticket once the discussion with fedora-legal is concluded
2024-02-29 17:58:28 <@tflink:fedora.im> !info FESCo said that the issue on including pre-trained weights is an issue for legal so we started a public conversation with Fedora legal. once that conversation is concluded, we will re-open the FESCo issue
2024-02-29 17:58:32 <@tflink:fedora.im> anything else on this?
2024-02-29 17:59:03 <@kaitlynabdo:fedora.im> nope thats it from me on that
2024-02-29 17:59:36 <@tflink:fedora.im> we should have answers before much longer, the fedora legal conversation is currently waiting on me to respond which I will do later today
2024-02-29 17:59:50 <@tflink:fedora.im> if that's it, moving on to the next topic
2024-02-29 18:00:10 <@tflink:fedora.im> !topic rocm splitting and pytorch
2024-02-29 18:00:24 <@tflink:fedora.im> as I understand it, the problem can be summarized as thus:
2024-02-29 18:00:54 <@tflink:fedora.im> 1. many of the rocm components generate libraries which are too large to package if all of the card variants are enabled
2024-02-29 18:01:33 <@tflink:fedora.im> 2. to address this, those rocm components have been packaged as multiple separate libraries, separated into card families
2024-02-29 18:01:37 <@tflink:fedora.im> bah
2024-02-29 18:01:50 <@tflink:fedora.im> oh, something is re-numbering my text. huh
2024-02-29 18:02:25 <@trix:fedora.im> and this ripples to applications that use these libs
2024-02-29 18:02:35 <@tflink:fedora.im> but the problem is that pytorch needs a single library to link against which leads to the question of whether we need to have multiple pytorchs to enable rocm acceleration for all (or many) of the amd gpu families
2024-02-29 18:02:57 <@tflink:fedora.im> is my summary accurate?
2024-02-29 18:03:38 <@trix:fedora.im> yes. so do we care about all the gpus or just some ?
2024-02-29 18:04:36 <@tflink:fedora.im> !info some rocm components generate libraries which are too large to package if all supported gpus are enabled so they have been split up into multiple libraries by gpu family.
this causes problems with downstream applications (like pytorch) which need to link against a single library
2024-02-29 18:04:57 <@tflink:fedora.im> I want to say all the gpus but then there's something about wishes and horses
2024-02-29 18:05:33 <@tflink:fedora.im> one idea that I've had kicking around that would be more of a temporary bandaid - what about using COPR repositories and have some out-of-band repos for each arch?
2024-02-29 18:06:26 <@tflink:fedora.im> we could have a gfx906 repo that contains rocblas906 etc. and pytorch-rocm906 or some similarly named packages which conflict with the main fedora packages
2024-02-29 18:06:38 <@tflink:fedora.im> I don't like the solution but I'm not sure I have much of a better idea for the moment
2024-02-29 18:07:14 <@tflink:fedora.im> obviously, we'd need more packaging automation to make all that even close to being sane
2024-02-29 18:07:28 <@trix:fedora.im> a reason to use fedora/rocm is it is all in one place, _the_ place.
2024-02-29 18:09:02 <@tflink:fedora.im> yeah, I don't really love the solution but it sounds like we're stuck with a few choices - choose a (or maybe a few) gpu families to support and get rid of the library split, figure out how to package the giant libraries like upstream rocm does or have multiple packages to support the gpu families separately
2024-02-29 18:09:18 <@trix:fedora.im> i have no interest in doing both fedora and copr for all the rocm stack i hold.
2024-02-29 18:09:39 <@tflink:fedora.im> so what is your proposed solution?
2024-02-29 18:10:23 <@trix:fedora.im> see rhel-test, it does that split for pytorch. building it for the families like the other rocm packages
2024-02-29 18:10:39 <@tflink:fedora.im> even if there was a way to automate most of the pain away?
2024-02-29 18:11:09 <@tflink:fedora.im> !link https://src.fedoraproject.org/rpms/python-torch/tree/rhel-test
2024-02-29 18:11:37 <@trix:fedora.im> the split is for the corner case of not the default gpus
2024-02-29 18:11:52 <@trix:fedora.im> default at the moment is gfx10 and gfx11
2024-02-29 18:12:33 <@tflink:fedora.im> I'm trying to read the specfile quickly so I might be wrong but doesn't that change all the import statements for using pytorch?
2024-02-29 18:13:04 <@tflink:fedora.im> would I have to 'import torch-gfx906' or something like that instead of 'import torch'?
2024-02-29 18:13:12 <@trix:fedora.im> it would mean adding a PYTHONPATH to the rocm module logic
2024-02-29 18:13:30 <@trix:fedora.im> the names are the same
2024-02-29 18:14:18 <@trix:fedora.im> so picking up torch from /usr/lib64/rocm/lib64/python<>/site-packages/torch
2024-02-29 18:14:59 <@tflink:fedora.im> that feels really ugly
2024-02-29 18:15:13 <@trix:fedora.im> this is why i bring it up
2024-02-29 18:15:20 <@tflink:fedora.im> but it's not like any of these solutions are great
2024-02-29 18:15:28 <@trix:fedora.im> right.
2024-02-29 18:16:01 <@tflink:fedora.im> the only other option would be to limit support to one of the gpu families?
2024-02-29 18:16:44 <@trix:fedora.im> if you have a gfx10 or gfx11 you don't need to worry about this (for now)
2024-02-29 18:17:34 <@tflink:fedora.im> if you're unwilling to consider my proposal of using COPRs for the short term, I'm not sure there is another solution that supports more than one gpu family
2024-02-29 18:17:56 <@tflink:fedora.im> other than hope that amd fixes the library size issue at some point
2024-02-29 18:18:59 <@tflink:fedora.im> out of curiosity, what is the limit on library size?
2024-02-29 18:19:38 <@trix:fedora.im> 2G i think
2024-02-29 18:20:11 <@tflink:fedora.im> is that set by Fedora? would it be possible to apply for a packaging exception somewhere to get rid of the split?
2024-02-29 18:20:37 <@trix:fedora.im> this is more a rocm question, than a pytorch question.
2024-02-29 18:21:00 <@trix:fedora.im> i believe the real problem is on the linking side, not the packaging
2024-02-29 18:21:15 <@tflink:fedora.im> yeah, I'm just trying to figure out any way to not require custom PYTHONPATH for not-current gpus
2024-02-29 18:21:46 <@tflink:fedora.im> why isn't it a problem for amd and the binaries they distribute?
2024-02-29 18:21:59 <@tflink:fedora.im> that's something I've never quite understood
2024-02-29 18:22:00 <@trix:fedora.im> they do not build all the gpus
2024-02-29 18:22:05 <@tflink:fedora.im> oh
2024-02-29 18:22:12 <@tflink:fedora.im> that would do it
2024-02-29 18:22:22 <@trix:fedora.im> that line 'we only support pro and mi'..
2024-02-29 18:23:01 <@tflink:fedora.im> yeah, that does make up most of the rocm supported gpus
2024-02-29 18:23:19 <@tflink:fedora.im> the only non-pro cards on the rocm support list are gfx1100
2024-02-29 18:23:21 <@trix:fedora.im> pro and mi are gfx9
2024-02-29 18:23:50 <@tflink:fedora.im> oh, for pytorch. yeah, that's all vega stuff, I think - that doesn't even support newer pro cards
2024-02-29 18:23:58 <@trix:fedora.im> these are current but not likely to be what average person has
2024-02-29 18:24:00 <@tflink:fedora.im> unless I'm missing something
2024-02-29 18:24:28 <@tflink:fedora.im> the radeon pro vii was the last gfx9 pro card, I think
2024-02-29 18:24:42 <@trix:fedora.im> when jeremy comes back, lets poke him with this
2024-02-29 18:25:42 <@tflink:fedora.im> yeah, I don't see any other choice than to do what you're talking about or to just eliminate support for certain families like amd does
2024-02-29 18:25:48 <@tflink:fedora.im> for now, anyways
2024-02-29 18:26:01 <@trix:fedora.im> give it a look. when you build stuff.
2024-02-29 18:26:11 <@tflink:fedora.im> yeah, will do
2024-02-29 18:26:22 <@trix:fedora.im> any other topics ?
2024-02-29 18:28:22 <@tflink:fedora.im> !info the current strategy is to build pytorch multiple times within the package to support multiple gpu families - this does mean that for not-current-gen GPUs (outside of gfx10 and gfx11 at the moment), users would have to use a custom PYTHONPATH to get rocm accelerated pytorch to work
2024-02-29 18:29:21 <@tflink:fedora.im> !info this solution isn't ideal but until things that are outside of our control change, it's either this or to further restrict support for gpu families like amd does for the binaries that they distribute
2024-02-29 18:29:28 <@tflink:fedora.im> does that seem accurate?
2024-02-29 18:29:33 <@trix:fedora.im> yup
2024-02-29 18:29:41 <@tflink:fedora.im> cool
2024-02-29 18:29:54 <@tflink:fedora.im> I'm sure that we'll have more discussion around this in the future :)
2024-02-29 18:29:59 <@tflink:fedora.im> but moving on to ...
2024-02-29 18:30:03 <@tflink:fedora.im> !topic open floor
2024-02-29 18:30:17 <@tflink:fedora.im> any other topics for today's meeting?
2024-02-29 18:30:50 <@trix:fedora.im> _crickets_
2024-02-29 18:31:08 <@tflink:fedora.im> yeah, it has been mostly the two of us but you never know
2024-02-29 18:33:04 <@tflink:fedora.im> eh, I changed my mind. not waiting the 5 minutes
2024-02-29 18:33:13 <@tflink:fedora.im> thanks for coming everyone. I'll send out the minutes shortly
2024-02-29 18:33:17 <@tflink:fedora.im> !endmeeting
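
Note on the rocm splitting topic: the custom PYTHONPATH idea from the discussion could look roughly like the sketch below. This is only an illustration and is not taken from the rhel-test specfile. The function names are hypothetical, the family rule assumes the usual gfx naming where the last two characters are the minor version and stepping, and the `python<>` part of the path mentioned in the meeting is assumed to stand for the interpreter's major.minor version. The `prefix` argument is left open because the meeting did not spell out the per-family layout.

```python
import sys
from pathlib import Path

# From the discussion: gfx10 and gfx11 are the default families "at the moment".
DEFAULT_FAMILIES = {"gfx10", "gfx11"}


def gpu_family(gfx_target):
    """Map a target such as 'gfx906' or 'gfx1100' to its family ('gfx9', 'gfx11').

    Assumption: the last two characters are the minor version and stepping,
    so dropping them leaves 'gfx' plus the major version.
    """
    if not gfx_target.startswith("gfx") or len(gfx_target) < 6:
        raise ValueError(f"unexpected gfx target: {gfx_target!r}")
    return gfx_target[:-2]


def needs_custom_pythonpath(gfx_target):
    """True when the card is outside the default families."""
    return gpu_family(gfx_target) not in DEFAULT_FAMILIES


def rocm_site_packages(prefix="/usr/lib64/rocm", version=None):
    """Build <prefix>/lib64/python<version>/site-packages.

    Assumption: <version> is the running interpreter's major.minor version
    unless the caller passes one explicitly.
    """
    if version is None:
        version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return Path(prefix) / "lib64" / f"python{version}" / "site-packages"


def prefer_rocm_torch(search_path, prefix="/usr/lib64/rocm", version=None):
    """Put the rocm site-packages directory first in search_path.

    The package name stays 'torch', so a later 'import torch' resolves to
    whichever copy is found first on the path.
    """
    entry = str(rocm_site_packages(prefix, version))
    if entry in search_path:
        search_path.remove(entry)
    search_path.insert(0, entry)
    return entry
```

Calling `prefer_rocm_torch(sys.path)` before `import torch`, and only when `needs_custom_pythonpath(...)` is true for the installed card, would have the same effect as exporting PYTHONPATH with that directory in front.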