#meeting-2:fedoraproject.org log

<@tflink:fedora.im>

17:30:54

!startmeeting fedora-pytorch

<@meetbot:fedora.im>

17:30:55

Meeting started at 2024-02-29 17:30:54 UTC

<@meetbot:fedora.im>

17:30:55

The Meeting name is 'fedora-pytorch'

<@tflink:fedora.im>

17:31:01

!topic welcome and roll call

<@tflink:fedora.im>

17:31:05

!hello

<@zodbot:fedora.im>

17:31:06

Tim Flink (tflink)

<@tflink:fedora.im>

17:31:37

!link agenda document https://board.net/p/fedora-pytorch-meeting

<@tflink:fedora.im>

17:31:55

who all's here for some pytorch discussion fun times?

<@kaitlynabdo:fedora.im>

17:32:34

im here

<@kaitlynabdo:fedora.im>

17:33:06

i only have one thing today which is just to update everyone on what fesco said (:

<@trix:fedora.im>

17:33:55

meep meep

<@tflink:fedora.im>

17:35:17

!topic pytorch 2.3

<@tflink:fedora.im>

17:35:27

!chair Tom Rix

<@trix:fedora.im>

17:35:43

yes, i am sitting..

<@tflink:fedora.im>

17:35:45

I still don't know if chair does anything with the new meetbot

<@tflink:fedora.im>

17:35:57

chair used to let you set links, change topics etc

<@conan_kudo:matrix.org>

17:36:15

chair does nothing

<@conan_kudo:matrix.org>

17:36:19

it's not implemented yet

<@trix:fedora.im>

17:36:19

well.. lets get started.

<@tflink:fedora.im>

17:36:30

good to know, thanks Conan Kudo

<@tflink:fedora.im>

17:37:05

Tom Rix this was your topic, I'll let you run with it since I'm only partially familiar with what you listed on the agenda

<@trix:fedora.im>

17:37:05

i am chasing 2.3 for rawhide and _maybe_ epel.

<@trix:fedora.im>

17:37:26

epel has some maybe deep python runtime needs.

<@trix:fedora.im>

17:37:45

what's others interest in epel at this time ?

<@tflink:fedora.im>

17:37:50

you're a braver man than I if you do epel, that's a long support commitment unless epel guidelines have changed since the last time I looked

<@trix:fedora.im>

17:38:37

i am asking if folks really care about epel, i would rather not own another 30 packages if no one cares

<@tflink:fedora.im>

17:38:52

I don't personally care, no

<@tflink:fedora.im>

17:39:19

and it doesn't seem like many other people are here so I doubt that you'll get much feedback in this forum

<@trix:fedora.im>

17:39:42

no worries.

<@trix:fedora.im>

17:39:58

that was the softball feature.

<@tflink:fedora.im>

17:40:07

!info trix is considering epel packages for pytorch and is looking for feedback on whether those are desired

<@trix:fedora.im>

17:40:31

the hardball is rocm and splitting pytorch / having multiple installs

<@trix:fedora.im>

17:40:51

i am not sure how _not_ to do this and support all the rocm gpus

<@tflink:fedora.im>

17:40:59

shall we save rocm for the next topic with splitting or discuss it now with the other 2.3 features?

<@trix:fedora.im>

17:41:17

rocm is the big 2.3 feature

<@tflink:fedora.im>

17:41:46

sure but isn't splitting the bigger problem?

<@davide:cavalca.name>

17:41:59

!hi

<@zodbot:fedora.im>

17:42:00

Davide Cavalca (dcavalca) - he / him / his

<@tflink:fedora.im>

17:42:39

yeah, the previous time was a bit early for me as well :)

<@trix:fedora.im>

17:42:49

ok.. general rocm before splitting. its building

<@trix:fedora.im>

17:43:03

only tested on mi210

<@trix:fedora.im>

17:43:10

cuz that is what i have

<@tflink:fedora.im>

17:43:24

!info pytorch 2.3 with rocm support can be built but for the moment, support is limited to a few cards and has only been tested on mi210

<@tflink:fedora.im>

17:44:10

no idea if it'll work since I bet that upstream designed support for the MI cards, though

<@trix:fedora.im>

17:44:27

i think it's too early to say 'a few cards' .. it's not like i disabled the building of any of the cards

<@tflink:fedora.im>

17:45:11

when I attempted to build pytorch yesterday, it required specifying gfx906 or gfx90a - is there a way around that?

<@trix:fedora.im>

17:45:36

eventually i will get around to checking it out on another card, that is just a time thing

<@trix:fedora.im>

17:46:06

check today. the general building has been pushed to the rhel-test branch

<@tflink:fedora.im>

17:46:24

will do, thanks

<@tflink:fedora.im>

17:47:14

Tom Rix are the other features you listed in the agenda announcements or is there discussion to be had around them?

<@tflink:fedora.im>

17:47:35

IIRC, there are some potential issues with distributed

<@trix:fedora.im>

17:47:44

another feature on the fence is distributed.

<@trix:fedora.im>

17:48:06

to do that rccl needs to be in fedora and gloo needs to use it

<@trix:fedora.im>

17:48:30

and the hardest .. i do not really have a way to test distributed. i have 1 machine.

<@trix:fedora.im>

17:48:45

same for basic rccl.

<@tflink:fedora.im>

17:48:48

what do you need to test distributed?

<@trix:fedora.im>

17:49:08

i'm not sure. i assume at least 2 machines

<@tflink:fedora.im>

17:49:44

I have 2 cards in two machines right now (gfx906 and gfx1100) with a third on the way at some point. I can do testing if that'd be helpful

<@tflink:fedora.im>

17:49:59

but the cards are not identical

<@trix:fedora.im>

17:50:17

just seeing if rccl works would be a help.

<@tflink:fedora.im>

17:50:29

ok, I'll sync up after the meeting to figure out how to do that

<@trix:fedora.im>

17:50:40

groovy

<@tflink:fedora.im>

17:50:57

!info more testing for rccl and pytorch distributed is needed, those features may be enabled for pytorch 2.3

<@trix:fedora.im>

17:51:21

some features that seem to _just build_ are caffe2 and openmp.

<@trix:fedora.im>

17:51:36

so adding them because i like easy

<@tflink:fedora.im>

17:52:03

sounds good to me :)

<@trix:fedora.im>

17:52:23

anyone have a feature they want/need ?

<@tflink:fedora.im>

17:52:24

!info caffe2 and openmp support are planned to be added for the pytorch 2.3 package

<@tflink:fedora.im>

17:52:39

other than getting rocm acceleration to work, none for me :)

<@tflink:fedora.im>

17:52:51

gotta ask for the most difficult feature

<@tflink:fedora.im>

17:53:08

anyone else have discussion to add for pytorch features?

<@tflink:fedora.im>

17:53:29

otherwise, we'll move on to the next topic

<@tflink:fedora.im>

17:53:59

we're almost at the scheduled end of the meeting - does anyone have hard stops at the half hour?

<@kaitlynabdo:fedora.im>

17:54:39

only the fesco update

<@tflink:fedora.im>

17:55:00

ok, lets go into that quick before getting into the rocm split discussion which I expect will take a bit

<@tflink:fedora.im>

17:55:04

!topic fesco update

<@kaitlynabdo:fedora.im>

17:55:35

just in case anyone isnt aware, tim and i reached out to fesco about pre trained weights when packaging AI models

<@tflink:fedora.im>

17:55:37

I assume this has to do with the question about packaging pre-trained weights

<@tflink:fedora.im>

17:55:41

!link https://pagure.io/fesco/issue/3175

<@kaitlynabdo:fedora.im>

17:56:05

and they got back to us and said that they dont handle stuff like that and to post it publicly on fedora legal

<@kaitlynabdo:fedora.im>

17:56:10

https://lists.fedoraproject.org/archives/list/legal@lists.fedoraproject.org/thread/PIPILJCMDEO67ORL4SAKB3NPHHVMFDJE/#PIPILJCMDEO67ORL4SAKB3NPHHVMFDJE

<@kaitlynabdo:fedora.im>

17:56:21

so this is the thread tim started

<@tflink:fedora.im>

17:56:35

!link https://lists.fedoraproject.org/archives/list/legal@lists.fedoraproject.org/thread/PIPILJCMDEO67ORL4SAKB3NPHHVMFDJE/#PIPILJCMDEO67ORL4SAKB3NPHHVMFDJE

<@tflink:fedora.im>

17:57:35

comments on the fesco ticket indicate some general acceptance with the idea of treating pre-trained weights as regular non-code content but we'll re-open the fesco ticket once the discussion with fedora-legal is concluded

<@tflink:fedora.im>

17:58:28

!info FESCo said that the issue on including pre-trained weights is an issue for legal so we started a public conversation with Fedora legal. once that conversation is concluded, we will re-open the FESCo issue

<@tflink:fedora.im>

17:58:32

anything else on this?

<@kaitlynabdo:fedora.im>

17:59:03

nope thats it from me on that

<@tflink:fedora.im>

17:59:36

we should have answers before much longer, the fedora legal conversation is currently waiting on me to respond which I will do later today

<@tflink:fedora.im>

17:59:50

if that's it, moving on to the next topic

<@tflink:fedora.im>

18:00:10

!topic rocm splitting and pytorch

<@tflink:fedora.im>

18:00:24

as I understand it, the problem can be summarized thus:

<@tflink:fedora.im>

18:00:54

1. many of the rocm components generate libraries which are too large to package if all of the card variants are enabled

<@tflink:fedora.im>

18:01:33

2. to address this, those rocm components have been packaged as multiple separate libraries, separated into card families

<@tflink:fedora.im>

18:01:37

bah

<@tflink:fedora.im>

18:01:50

oh, something is re-numbering my text. huh

<@trix:fedora.im>

18:02:25

and this ripples to applications that use these libs

<@tflink:fedora.im>

18:02:35

but the problem is that pytorch needs a single library to link against which leads to the question of whether we need to have multiple pytorchs to enable rocm acceleration for all (or many) of the amd gpu families

<@tflink:fedora.im>

18:02:57

is my summary accurate?

<@trix:fedora.im>

18:03:38

yes. so do we care about all the gpus or just some ?

<@tflink:fedora.im>

18:04:36

!info some rocm components generate libraries which are too large to package if all supported gpus are enabled so they have been split up into multiple libraries by gpu family. this causes problems with downstream applications (like pytorch) which need to link against a single library

<@tflink:fedora.im>

18:04:57

I want to say all the gpus but then there's something about wishes and horses

<@tflink:fedora.im>

18:05:33

one idea that I've had kicking around that would be more of a temporary bandaid - what about using COPR repositories and have some out-of-band repos for each arch?

<@tflink:fedora.im>

18:06:26

we could have a gfx906 repo that contains rocblas906 etc. and pytorch-rocm906 or some similarly named packages which conflict with the main fedora packages

<@tflink:fedora.im>

18:06:38

I don't like the solution but I'm not sure I have much of a better idea for the moment

<@tflink:fedora.im>

18:07:14

obviously, we'd need more packaging automation to make all that even close to being sane

<@trix:fedora.im>

18:07:28

a reason to use fedora/rocm is it is all in one place, _the_ place.

<@tflink:fedora.im>

18:09:02

yeah, I don't really love the solution but it sounds like we're stuck with a few choices - choose one (or maybe a few) gpu families to support and get rid of the library split, figure out how to package the giant libraries like upstream rocm does, or have multiple packages to support the gpu families separately

<@trix:fedora.im>

18:09:18

i have no interest in doing both fedora and copr for all the rocm stack i hold.

<@tflink:fedora.im>

18:09:39

so what is your proposed solution?

<@trix:fedora.im>

18:10:23

see rhel-test, it does that split for pytorch. building it for the families like the other rocm packages

<@tflink:fedora.im>

18:10:39

even if there was a way to automate most of the pain away?

<@tflink:fedora.im>

18:11:09

!link https://src.fedoraproject.org/rpms/python-torch/tree/rhel-test

<@trix:fedora.im>

18:11:37

the split is for the corner case of not the default gpus

<@trix:fedora.im>

18:11:52

default at the moment is gfx10 and gfx11

<@tflink:fedora.im>

18:12:33

I'm trying to read the specfile quickly so I might be wrong but doesn't that change all the import statements for using pytorch?

<@tflink:fedora.im>

18:13:04

would I have to 'import torch-gfx906' or something like that instead of 'import torch'?

<@trix:fedora.im>

18:13:12

it would mean adding a PYTHONPATH to the rocm module logic

<@trix:fedora.im>

18:13:30

the names are the same

<@trix:fedora.im>

18:14:18

so picking up torch from /usr/lib64/rocm/lib64/python<>/site-packages/torch
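
A minimal sketch of the mechanism described above, assuming nothing about the actual package layout: two same-named packages live in different directories, and whichever directory is searched first wins the import. The package name `faketorch`, the `BUILD` marker and both temporary directories are hypothetical stand-ins, and inserting at the front of `sys.path` only approximates what a leading PYTHONPATH entry does.

```python
import importlib
import os
import sys
import tempfile


def make_fake_package(root, name, marker):
    """Create a minimal importable package under root that records a marker."""
    pkg_dir = os.path.join(root, name)
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write(f"BUILD = {marker!r}\n")


def import_with_path_first(name, path):
    """Import `name` with `path` searched first, roughly like a leading PYTHONPATH entry."""
    sys.modules.pop(name, None)      # forget any previously imported copy
    importlib.invalidate_caches()    # make sure newly created files are seen
    sys.path.insert(0, path)
    try:
        return importlib.import_module(name)
    finally:
        sys.path.remove(path)


with tempfile.TemporaryDirectory() as default_dir, \
        tempfile.TemporaryDirectory() as family_dir:
    # Hypothetical stand-ins for the default install and a per-family install.
    make_fake_package(default_dir, "faketorch", "default")
    make_fake_package(family_dir, "faketorch", "gfx9")

    default_build = import_with_path_first("faketorch", default_dir).BUILD
    family_build = import_with_path_first("faketorch", family_dir).BUILD
    print(default_build, family_build)
```

The import statement stays the same in both cases, which matches the point that the names do not change; only the search path decides which build is loaded.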

<@tflink:fedora.im>

18:14:59

that feels really ugly

<@trix:fedora.im>

18:15:13

this is why i bring it up

<@tflink:fedora.im>

18:15:20

but it's not like any of these solutions are great

<@trix:fedora.im>

18:15:28

right.

<@tflink:fedora.im>

18:16:01

the only other option would be to limit support to one of the gpu families?

<@trix:fedora.im>

18:16:44

if you have a gfx10 or gfx11 you don't need to worry about this (for now)

<@tflink:fedora.im>

18:17:34

if you're unwilling to consider my proposal of using COPRs for the short term, I'm not sure there is another solution that supports more than one gpu family

<@tflink:fedora.im>

18:17:56

other than hope that amd fixes the library size issue at some point

<@tflink:fedora.im>

18:18:59

out of curiosity, what is the limit on library size?

<@trix:fedora.im>

18:19:38

2G i think

<@tflink:fedora.im>

18:20:11

is that set by Fedora? would it be possible to apply for a packaging exception somewhere to get rid of the split?

<@trix:fedora.im>

18:20:37

this is more a rocm question, than a pytorch question.

<@trix:fedora.im>

18:21:00

i believe the real problem is on the linking side, not the packaging

<@tflink:fedora.im>

18:21:15

yeah, I'm just trying to figure out any way to not require custom PYTHONPATH for not-current gpus

<@tflink:fedora.im>

18:21:46

why isn't it a problem for amd and the binaries they distribute?

<@tflink:fedora.im>

18:21:59

that's something I've never quite understood

<@trix:fedora.im>

18:22:00

they do not build all the gpus

<@tflink:fedora.im>

18:22:12

that would do it

<@trix:fedora.im>

18:22:22

that line 'we only support pro and mi'..

<@tflink:fedora.im>

18:23:01

yeah, that does make up most of the rocm supported gpus

<@tflink:fedora.im>

18:23:19

the only non-pro cards on the rocm support list are gfx1100

<@trix:fedora.im>

18:23:21

pro and mi are gfx9

<@tflink:fedora.im>

18:23:50

oh, for pytorch. yeah, that's all vega stuff, I think - that doesn't even support newer pro cards

<@trix:fedora.im>

18:23:58

these are current but not likely to be what average person has

<@tflink:fedora.im>

18:24:00

unless I'm missing something

<@tflink:fedora.im>

18:24:28

the radeon pro vii was the last gfx9 pro card, I think

<@trix:fedora.im>

18:24:42

when jeremy comes back, lets poke him with this

<@tflink:fedora.im>

18:25:42

yeah, I don't see any other choice than to do what you're talking about or to just eliminate support for certain families like amd does

<@tflink:fedora.im>

18:25:48

for now, anyways

<@trix:fedora.im>

18:26:01

give it a look. when you build stuff.

<@tflink:fedora.im>

18:26:11

yeah, will do

<@trix:fedora.im>

18:26:22

any other topics ?

<@tflink:fedora.im>

18:28:22

!info the current strategy is to build pytorch multiple times within the package to support multiple gpu families - this does mean that for not-current-gen GPUs (outside of gfx10 and gfx11 at the moment), users would have to use a custom PYTHONPATH to get rocm accelerated pytorch to work
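
A rough sketch of the rule in the summary above, assuming a GPU target name has the form `gfx` + major version + two trailing characters for minor version and stepping (so `gfx906` and `gfx90a` are both gfx9, and `gfx1100` is gfx11). The helper names and the `DEFAULT_FAMILIES` set are hypothetical, and the defaults are only the ones named in this meeting, which may change.

```python
# Families that work without a custom PYTHONPATH, per the discussion (assumption: may change).
DEFAULT_FAMILIES = {"gfx10", "gfx11"}


def gpu_family(arch):
    """Return the family of a target name, e.g. 'gfx906' -> 'gfx9'.

    Assumes the last two characters are minor version and stepping.
    """
    major = arch[3:-2]
    if not arch.startswith("gfx") or not major.isdigit():
        raise ValueError(f"unrecognized gpu target: {arch!r}")
    return "gfx" + major


def needs_custom_pythonpath(arch):
    """True when the target's family is outside the default families."""
    return gpu_family(arch) not in DEFAULT_FAMILIES


for target in ("gfx906", "gfx90a", "gfx1100"):
    print(target, gpu_family(target), needs_custom_pythonpath(target))
```

With the cards mentioned in the meeting, the two gfx9 targets would fall into the custom PYTHONPATH case and gfx1100 would not.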

<@tflink:fedora.im>

18:29:21

!info this solution isn't ideal but until things that are outside of our control change, it's either this or to further restrict support for gpu families like amd does for the binaries that they distribute

<@tflink:fedora.im>

18:29:28

does that seem accurate?

<@trix:fedora.im>

18:29:33

yup

<@tflink:fedora.im>

18:29:41

cool

<@tflink:fedora.im>

18:29:54

I'm sure that we'll have more discussion around this in the future :)

<@tflink:fedora.im>

18:29:59

but moving on to ...

<@tflink:fedora.im>

18:30:03

!topic open floor

<@tflink:fedora.im>

18:30:17

any other topics for today's meeting?

<@trix:fedora.im>

18:30:50

_crickets_

<@tflink:fedora.im>

18:31:08

yeah, it has been mostly the two of us but you never know

<@tflink:fedora.im>

18:33:04

eh, I changed my mind. not waiting the 5 minutes

<@tflink:fedora.im>

18:33:13

thanks for coming everyone. I'll send out the minutes shortly

<@tflink:fedora.im>

18:33:17

!endmeeting