<@tflink:fedora.im>
16:31:25
!startmeeting fedora-pytorch
<@meetbot:fedora.im>
16:31:25
Meeting started at 2024-06-06 16:31:25 UTC
<@meetbot:fedora.im>
16:31:25
The Meeting name is 'fedora-pytorch'
<@tflink:fedora.im>
16:31:34
!topic welcome
<@tflink:fedora.im>
16:31:39
!hello
<@zodbot:fedora.im>
16:31:40
Tim Flink (tflink)
<@jsteffan:fedora.im>
16:32:00
!hi
<@tflink:fedora.im>
16:32:00
who all's here for some pytorch and ai-ml stuff?
<@zodbot:fedora.im>
16:32:01
Jonathan Steffan (jsteffan)
<@tflink:fedora.im>
16:32:13
<@tflink:fedora.im>
16:32:27
meeting agenda and minutes are in that document
<@tflink:fedora.im>
16:32:52
we'll wait a couple of minutes for folks to filter in and then we can get started
<@trix:fedora.im>
16:35:10
filter filter.. am i a low or high pass filter ?!?
<@tflink:fedora.im>
16:35:55
hard for us to tell unless we get feedback - are things going over your head or are they beneath you ✌️
<@tflink:fedora.im>
16:36:18
where did that emoji come from? it was supposed to be :-D
<@tflink:fedora.im>
16:36:45
anyhow, we're 5 minutes after start so let's get this party started
<@tflink:fedora.im>
16:37:04
!topic HW need/wants for Fedora infra
<@man2dev:fedora.im>
16:37:51
So it's just a text-based meeting?
<@tflink:fedora.im>
16:38:08
there has been a question posed on discourse about whether there is HW that would be good to add for fedora infra for AI/ML purposes
<@tflink:fedora.im>
16:38:30
for today, yes. we've been alternating between google and matrix and AFAIK, the last meeting was on google
<@tflink:fedora.im>
16:38:55
<@man2dev:fedora.im>
16:39:46
for what use case? training or running models?
<@tflink:fedora.im>
16:40:04
I've been working on this behind the scenes for a bit and am planning to ask for some hosts and GPUs to make progress on automated testing for AI/ML use cases
<@tflink:fedora.im>
16:40:46
Mohammadreza Hendiani: it's an open ended question so long as there is a good justification and realistic use case for the HW
<@trix:fedora.im>
16:41:03
my testing is completely manual and only one card at a time.. so i'm interested in what you want
<@tflink:fedora.im>
16:41:36
I have a research project that will require GPU time if it goes into production so that is one of the workloads/usecases that I have in the back of my mind but I know I'm not the only one with ideas :)
<@tflink:fedora.im>
16:43:13
My interest is primarily in HW coverage for AMD GPUs because the progress and focus as of late has been on ROCm. Since it'll be a year or so before we see any of this HW, I also want to keep intel in mind as they make progress on their stack
<@tflink:fedora.im>
16:43:57
I know that nvidia is popular and I'm not against looking at how we can have testing for that but it's a difficult problem to solve due to the distribution restrictions and proprietary nature of nvidia's platform
<@tflink:fedora.im>
16:44:32
access to nvidia accelerators is also much easier due to their current ubiquity in public clouds
<@man2dev:fedora.im>
16:44:54
most of the hardware optimization tests I have done are with Vulkan, so in my opinion one of Nvidia's Vulkan-supported GPUs would be useful for Vulkan testing and optimization https://developer.nvidia.com/vulkan-driver
<@trix:fedora.im>
16:45:34
is there a way to distributively test ?
<@tflink:fedora.im>
16:46:05
WRT ROCm, I'd like to see automated coverage for gfx1100 and maybe gfx1103 in addition to planning for 2 or so more GPUs once rdna4 is a thing
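For a concrete sense of what automated coverage for a gfx target could look like, here is a minimal smoke-test sketch, assuming a ROCm build of PyTorch (which reuses the torch.cuda API for AMD GPUs); the script and the EXPECTED_TARGETS list are illustrative only, not an existing Fedora test.
```python
# Minimal ROCm/PyTorch smoke-test sketch (illustrative, not an existing Fedora CI test).
import sys
import torch

EXPECTED_TARGETS = {"gfx1100", "gfx1103"}  # hypothetical coverage list

def main() -> int:
    # torch.version.hip is only set on ROCm builds of PyTorch
    if torch.version.hip is None:
        print("not a ROCm build of PyTorch")
        return 1
    if not torch.cuda.is_available():
        print("no usable AMD GPU detected")
        return 1
    props = torch.cuda.get_device_properties(0)
    # gcnArchName reports the gfx target on ROCm builds (e.g. "gfx1100");
    # availability of the attribute depends on the PyTorch version, so hedge with getattr.
    target = getattr(props, "gcnArchName", "unknown").split(":")[0]
    print(f"device: {props.name}, target: {target}")
    # tiny computation to confirm the GPU actually executes kernels
    x = torch.ones(1024, device="cuda")
    assert float(x.sum()) == 1024.0
    return 0 if target in EXPECTED_TARGETS else 2

if __name__ == "__main__":
    sys.exit(main())
```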
<@tflink:fedora.im>
16:46:31
I need to write things up but I do have an idea for that since it's the most practical option for the near future
<@man2dev:fedora.im>
16:47:15
look, we are very behind on Vulkan support in Fedora. On my own device I'm setting my environment variables manually because it hasn't been packaged
<@tflink:fedora.im>
16:47:33
in an ideal world, I'd like to see coverage for gfx10 and gfx9 but I don't think those are available new anymore or for anything even close to a sane price for the newer end of gfx9
<@man2dev:fedora.im>
16:47:38
some of the most important parts of the sdk are packaged
<@man2dev:fedora.im>
16:47:48
but we don't have the full sdk
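As an illustration of the manual environment setup mentioned above when the full Vulkan SDK is not packaged, here is a hedged sketch that mimics what the SDK's setup-env.sh does before launching a tool; all paths assume a hypothetical SDK unpacked under ~/vulkan-sdk and are not statements about what Fedora ships.
```python
# Hedged illustration: manually pointing the environment at an unpackaged Vulkan SDK.
import os
import subprocess

sdk = os.path.expanduser("~/vulkan-sdk/x86_64")  # hypothetical unpacked SDK location
env = dict(os.environ)
env["VULKAN_SDK"] = sdk
env["PATH"] = f"{sdk}/bin:" + env.get("PATH", "")
env["LD_LIBRARY_PATH"] = f"{sdk}/lib:" + env.get("LD_LIBRARY_PATH", "")
env["VK_LAYER_PATH"] = f"{sdk}/share/vulkan/explicit_layer.d"

# vulkaninfo reports which driver and layers the Vulkan loader picked up
subprocess.run(["vulkaninfo"], env=env, check=False)
```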
<@trix:fedora.im>
16:47:57
i need these samples for triaging problems, but unless i have multiple servers and there is some widget that makes them shareable, it's not much help .. i can _ask_ for more servers, but they will ask me to just swap cards.
<@jsteffan:fedora.im>
16:48:42
Mohammadreza Hendiani: I'd be interested in defining what support we do have and what we are missing, and getting things packaged. we can discuss that outside of the meeting
<@man2dev:fedora.im>
16:48:53
i don't understand the question?
<@tflink:fedora.im>
16:49:04
yeah, testing is one thing, access to the machine for debug is another. to be honest, I'm not sure how to approach that but it'll be a point of discussion once plans start forming
<@jsteffan:fedora.im>
16:49:10
do we have access to responsive remote hands from the NOC?
<@tflink:fedora.im>
16:49:11
or once I get around to writing stuff up
<@man2dev:fedora.im>
16:49:14
I have a list if you want, but it's not complete yet
<@jsteffan:fedora.im>
16:49:42
can we take advantage of eGPUs here? i know tflink mentioned that even finding a suitable chassis has been painful
<@trix:fedora.im>
16:49:45
ideally i plug in server+card and start up a testing service that tim or anyone can run fedora tests on
<@tflink:fedora.im>
16:49:57
eGPU? external GPUs?
<@tflink:fedora.im>
16:50:32
!info the immediate priority for AI/ML hardware is AMD for ROCm since that's what we've been making progress with
<@trix:fedora.im>
16:51:00
yup rocm all the wayyy babyyyy
<@jsteffan:fedora.im>
16:51:32
i would suspect having a few host rack-mount servers that have plenty of CPU/RAM (as many expansion cards as we can fit) and then having a reserved quarter cabinet (or less) with some sort of shelving/etc. where we could have stacks of eGPUs for testing. worst case, a NOC ticket could be opened to swap which eGPU is attached
<@tflink:fedora.im>
16:51:34
!info intel is another longer term priority since we're looking at a year minimum before any of this is online
<@jsteffan:fedora.im>
16:51:45
yeah
<@man2dev:fedora.im>
16:51:53
ok, but this was about future hardware needs. let's face it, NVIDIA is a major player
<@mystro256:fedora.im>
16:52:18
Note: I looked into Nvidia out of interest and there are also out-of-tree kernel module issues to worry about with Fedora
<@mystro256:fedora.im>
16:52:40
Nvidia's focus has never really been bleeding edge kernels
<@tflink:fedora.im>
16:52:45
!info nvidia is something we can also look at but it might be better (and more easily) handled with public cloud resources. it's also an issue due to its proprietary nature and distribution restrictions
<@man2dev:fedora.im>
16:53:40
I haven't tested the work that's been done with openapi, but yeah, they are invested. the question is what the performance is like compared to something like ROCm and Vulkan
<@trix:fedora.im>
16:53:41
it comes down to who wants to do what. if someone really wants to do nvidia there is nothing stopping them, other than the OOT kernel driver and third-party CUDA challenges.. everything has challenges.
<@tflink:fedora.im>
16:53:50
have I summed things up reasonably well or am I missing anything WRT what kind of datacenter-living HW we'd ask for?
<@mystro256:fedora.im>
16:54:12
My point is just Intel and AMD are the most viable for Fedora due to fast moving kernels
<@mystro256:fedora.im>
16:54:25
assuming intel isn't doing OOT kernel modules
<@man2dev:fedora.im>
16:54:33
to my understanding, all of these updates are coming only to the open-source driver
<@trix:fedora.im>
16:54:52
datacenter is very expensive and more on the RHEL side of things. I have access but cannot really share outside of Red Hat.
<@man2dev:fedora.im>
16:54:53
no argument there
<@mystro256:fedora.im>
16:55:08
Out of curiosity, are they planning to merge that into upstream Linux anytime soon?
<@tflink:fedora.im>
16:55:09
FWIW, I'm not trying to stop anyone from working on nvidia. I just recognize that we have very limited resources and believe that focusing on the more open solutions first is the best route forward
<@mystro256:fedora.im>
16:55:39
bit off topic, but I haven't heard much recently
<@man2dev:fedora.im>
16:55:40
other than the work being done in nova and nvk, no
<@man2dev:fedora.im>
16:56:54
but idk, NVIDIA has been weird for some time. I mean, who thought Red Hat could convince them to go open source
<@trix:fedora.im>
16:57:22
oh jeez, hand grenade.
<@mystro256:fedora.im>
16:57:27
is nova related to the official driver nv put on github?
<@trix:fedora.im>
16:57:33
back to content of meeting.
<@mystro256:fedora.im>
16:57:38
haha sorry
<@trix:fedora.im>
16:58:14
so how about we get HW for all the gfx11s and whatever new things are coming on the commercial side of AMD
<@man2dev:fedora.im>
16:58:30
it is the easier and faster choice due to a lot of the work being done upstream, but I'm just saying not to put all our eggs in one basket
<@tflink:fedora.im>
16:59:33
getting back to the topic at hand, I'm planning to ask for 8 pcie slots worth of servers for testing, assuming that a suitable server can be found. In addition, I want to plan for about that many GPUs - a mix of gfx1100 and whatever is coming after gfx1100 for amd, an intel card or two and maybe an nvidia card
<@tflink:fedora.im>
16:59:51
I don't expect to get all that but you never know if you don't ask :)
<@man2dev:fedora.im>
17:00:07
to my understanding it's trying to implement the open-source nvidia driver inside of it and replace nova
<@jsteffan:fedora.im>
17:00:26
so if i were to summarize it would be:
1) rely on public cloud resources to provision nvidia hardware
2) design a base compute platform that is in the common RHIT procurement pool
3) determine best way to acquire target consumer GPUs that will work on the selected compute platform (e.g. how are they attached to the compute platform)
4) define target AMD cards
5) define target intel cards
<@tflink:fedora.im>
17:00:26
am I missing anything in terms of HW that we would need/want?
<@trix:fedora.im>
17:00:31
tflink: you can reuse a crypto mining rig! franken-miner
<@man2dev:fedora.im>
17:00:56
what about AMD CPUs?
<@tflink:fedora.im>
17:01:12
yeah, that's a good summary, I think
<@trix:fedora.im>
17:01:17
see above .. gfx1103 is the iGPU in the CPU (an APU).
<@man2dev:fedora.im>
17:01:52
doesn't ROCm work like CUDA in the sense that it handles both GPU and CPU instructions? so wouldn't an AMD CPU get the best results
<@tflink:fedora.im>
17:02:12
generally out of scope for what we want to do, I think. I'm struggling to find a suitable rackmount solution; if I can find one, I'm not going to be picky about the CPU for the host systems
<@tflink:fedora.im>
17:02:35
I thought that the CPU portion of ROCm was pretty vendor neutral
<@jsteffan:fedora.im>
17:02:37
yeah, item #3 is to determine what is the best strategy ... there is iGPU (integrated) dGPU (discrete) and eGPU (external) and there might need to be a mix
<@trix:fedora.im>
17:02:56
we are running out of time.. so next topic?
<@trix:fedora.im>
17:03:18
anyone want gmeet ?
<@tflink:fedora.im>
17:03:23
yeah, i could go on for a while on the restrictions for the HW
<@jsteffan:fedora.im>
17:03:25
the most common is dGPU, but we (the SIG) should also pay attention to enabling the iGPU stuff next, and then finally any eGPU stuff
<@tflink:fedora.im>
17:03:45
but we are kinda out of time. we can continue this conversation outside the meeting
<@tflink:fedora.im>
17:04:00
!topic future meeting location and name
<@man2dev:fedora.im>
17:04:12
Tom would know more about this
<@tflink:fedora.im>
17:04:37
up until now, we've had a pytorch meeting that happens every 2 weeks alternating between matrix and google meet
<@tflink:fedora.im>
17:05:16
I'd argue that it has become a more general ai/ml meeting rather than a pytorch specific meeting and it might make sense to change the name if we're making other changes
<@trix:fedora.im>
17:05:36
yup i agree
<@tflink:fedora.im>
17:05:41
but the bigger point is that alternating between two meeting methods is a bit of a pain and I'd like to choose one over the other
<@man2dev:fedora.im>
17:05:52
i agree
<@tflink:fedora.im>
17:06:24
my personal preference is for matrix but I would prefer consistency even if that means doing google meet
<@trix:fedora.im>
17:06:48
i'm fine with matrix
<@tflink:fedora.im>
17:07:03
I think this will need to go to discourse since there are likely interested folks who aren't here
<@trix:fedora.im>
17:07:13
that's fine too
<@man2dev:fedora.im>
17:07:45
how about a poll
<@trix:fedora.im>
17:08:17
do this on discourse and see.. so tflink you want to get that going ?
<@jsteffan:fedora.im>
17:08:25
yeah, a discourse poll sounds great
<@tflink:fedora.im>
17:08:29
I propose the following: we continue the current schedule for the next meeting in 2 weeks which will be on google meet. in the mean time, tflink will start a topic on discourse to get more input on which meeting method is preferred and if we make other changes
<@trix:fedora.im>
17:08:44
groovy
<@man2dev:fedora.im>
17:09:00
i agree
<@tflink:fedora.im>
17:09:44
I would prefer to not use a poll unless discourse has features I'm not aware of - I generally care more about the opinions of the folks who are showing up regularly and a poll at least implies that all votes are equal
<@tflink:fedora.im>
17:10:56
it might not be a popular opinion, but I think that using a meeting method that's popular with folks who aren't showing up, when it's contrary to what the people who do show up want, is counter-productive
<@trix:fedora.im>
17:11:53
yes, i give more weight to people that show up.
<@tflink:fedora.im>
17:12:49
any other thoughts on the proposal?
<@man2dev:fedora.im>
17:13:17
its fine
<@tflink:fedora.im>
17:13:35
!info we will continue the current schedule for the next meeting in 2 weeks which will be on google meet. in the mean time, tflink will start a topic on discourse to get more input on which meeting method is preferred and if we make other changes
<@tflink:fedora.im>
17:14:21
also, be aware that the link for the google meeting in two weeks has likely changed. the ownership of the meeting in google moved to me and I think that will change the link
<@tflink:fedora.im>
17:14:42
I think that's it and we're 15 minutes over time so ...
<@tflink:fedora.im>
17:14:47
!topic open floow
<@tflink:fedora.im>
17:14:50
!undo
<@tflink:fedora.im>
17:15:04
!topic open floor ( now with fewer typos)
<@tflink:fedora.im>
17:15:51
it sounds like there may be more to discuss around the HW question, we can continue that in the #ai-ml:fedoraproject.org channel
<@tflink:fedora.im>
17:15:58
are there any other topics that folks want to bring up?
<@man2dev:fedora.im>
17:16:07
I think the CPU situation should be researched a bit more to see if there are any better results with AMD CPU & GPU
<@man2dev:fedora.im>
17:16:23
but we can talk about that in the SIG
<@tflink:fedora.im>
17:16:45
if there are no more topics, I'll close out the meeting.
<@tflink:fedora.im>
17:17:01
thank you for coming, everyone. I'll post minutes to the discourse topic shortly
<@tflink:fedora.im>
17:17:03
!endmeeting