2024-06-06 16:31:25 <@tflink:fedora.im> !startmeeting fedora-pytorch
2024-06-06 16:31:25 <@meetbot:fedora.im> Meeting started at 2024-06-06 16:31:25 UTC
2024-06-06 16:31:25 <@meetbot:fedora.im> The Meeting name is 'fedora-pytorch'
2024-06-06 16:31:34 <@tflink:fedora.im> !topic welcome
2024-06-06 16:31:39 <@tflink:fedora.im> !hello
2024-06-06 16:31:40 <@zodbot:fedora.im> Tim Flink (tflink)
2024-06-06 16:32:00 <@jsteffan:fedora.im> !hi
2024-06-06 16:32:00 <@tflink:fedora.im> who all's here for some pytorch and ai-ml stuff?
2024-06-06 16:32:01 <@zodbot:fedora.im> Jonathan Steffan (jsteffan)
2024-06-06 16:32:13 <@tflink:fedora.im> !link https://board.net/p/fedora-pytorch-meeting
2024-06-06 16:32:27 <@tflink:fedora.im> meeting agenda and minutes are in that document
2024-06-06 16:32:52 <@tflink:fedora.im> we'll wait a couple of minutes for folks to filter in and then we can get started
2024-06-06 16:35:10 <@trix:fedora.im> filter filter.. am i a low or high pass filter ?!?
2024-06-06 16:35:55 <@tflink:fedora.im> hard for us to tell unless we get feedback - are things going over your head or are they beneath you ✌️
2024-06-06 16:36:18 <@tflink:fedora.im> where did that emoji come from? it was supposed to be :-D
2024-06-06 16:36:45 <@tflink:fedora.im> anyhow, we're 5 minutes after start so let's get this party started
2024-06-06 16:37:04 <@tflink:fedora.im> !topic HW need/wants for Fedora infra
2024-06-06 16:37:51 <@man2dev:fedora.im> so it's just a text-based meeting?
2024-06-06 16:38:08 <@tflink:fedora.im> there has been a question posed on discourse about whether there is HW that would be good to add to fedora infra for AI/ML purposes
2024-06-06 16:38:30 <@tflink:fedora.im> for today, yes. we've been alternating between google and matrix and AFAIK, the last meeting was on google
2024-06-06 16:38:55 <@tflink:fedora.im> !link https://discussion.fedoraproject.org/t/datacenter-hardware-needs-for-ai-in-fedora/119116
2024-06-06 16:39:46 <@man2dev:fedora.im> for what use case - training? or running models?
2024-06-06 16:40:04 <@tflink:fedora.im> I've been working on this behind the scenes for a bit and am planning to ask for some hosts and GPUs to make progress on automated testing for AI/ML use cases
2024-06-06 16:40:46 <@tflink:fedora.im> Mohammadreza Hendiani: it's an open-ended question so long as there is a good justification and a realistic use case for the HW
2024-06-06 16:41:03 <@trix:fedora.im> my testing is completely manual and only one card at a time.. so i'm interested in what you want
2024-06-06 16:41:36 <@tflink:fedora.im> I have a research project that will require GPU time if it goes into production, so that is one of the workloads/usecases that I have in the back of my mind, but I know I'm not the only one with ideas :)
2024-06-06 16:43:13 <@tflink:fedora.im> My interest is primarily in HW coverage for AMD GPUs because the progress and focus as of late has been on ROCm.
Since it'll be a year or so before we see any of this HW, I also want to keep intel in mind as they make progress on their stack
2024-06-06 16:43:57 <@tflink:fedora.im> I know that nvidia is popular and I'm not against looking at how we can have testing for that, but it's a difficult problem to solve due to the distribution restrictions and proprietary nature of nvidia's platform
2024-06-06 16:44:32 <@tflink:fedora.im> access to nvidia accelerators is also much easier due to their current ubiquity in public clouds
2024-06-06 16:44:54 <@man2dev:fedora.im> most of the hardware optimization tests I have done are with vulkan, so in my opinion one of nvidia's vulkan-supported GPUs would be useful for vulkan testing and optimization https://developer.nvidia.com/vulkan-driver
2024-06-06 16:45:34 <@trix:fedora.im> is there a way to test distributively?
2024-06-06 16:46:05 <@tflink:fedora.im> WRT ROCm, I'd like to see automated coverage for gfx1100 and maybe gfx1103, in addition to planning for 2 or so more GPUs once rdna4 is a thing
2024-06-06 16:46:31 <@tflink:fedora.im> I need to write things up but I do have an idea for that since it's the most practical option for the near future
2024-06-06 16:47:15 <@man2dev:fedora.im> look, we are very behind on vulkan support in Fedora. on my own device I'm setting my environment variables manually because it hasn't been packaged
2024-06-06 16:47:33 <@tflink:fedora.im> in an ideal world, I'd like to see coverage for gfx10 and gfx9, but I don't think those are available new anymore, or for anything even close to a sane price for the newer end of gfx9
2024-06-06 16:47:38 <@man2dev:fedora.im> some of the most important parts of the sdk are packaged
2024-06-06 16:47:48 <@man2dev:fedora.im> but we don't have the full sdk
2024-06-06 16:47:57 <@trix:fedora.im> i need these samples for triaging problems but unless i have multiple servers and there is some widget that makes them sharable, it's not much help .. i can _ask_ for more servers, they will ask me to just swap cards.
2024-06-06 16:48:42 <@jsteffan:fedora.im> Mohammadreza Hendiani: i'd be interested in defining what support we do have and what we are missing, and getting things packaged. we can discuss that outside of the meeting
2024-06-06 16:48:53 <@man2dev:fedora.im> i don't understand the question?
2024-06-06 16:49:04 <@tflink:fedora.im> yeah, testing is one thing, access to the machine for debug is another. to be honest, I'm not sure how to approach that but it'll be a point of discussion once plans start forming
2024-06-06 16:49:10 <@jsteffan:fedora.im> do we have access to responsive remote hands from the NOC?
2024-06-06 16:49:11 <@tflink:fedora.im> or once I get around to writing stuff up
2024-06-06 16:49:14 <@man2dev:fedora.im> i have a list if you want, but it's not complete yet
2024-06-06 16:49:42 <@jsteffan:fedora.im> can we take advantage of eGPUs here? i know tflink mentioned that even finding a suitable chassis has been painful
2024-06-06 16:49:45 <@trix:fedora.im> ideally i plug in server+card and start up a testing service that tim or anyone can run fedora tests on
2024-06-06 16:49:57 <@tflink:fedora.im> eGPU? external GPUs?
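[A minimal sketch of the kind of automated ROCm smoke test discussed above, assuming the ROCm build of PyTorch is installed on the test host; the script itself is hypothetical and not an existing Fedora CI job, and the gcnArchName attribute is only present on newer PyTorch/ROCm builds:]

    #!/usr/bin/env python3
    # Illustrative ROCm smoke test for automated GPU coverage (e.g. gfx1100).
    # Assumes the ROCm build of PyTorch, which exposes AMD GPUs through the
    # torch.cuda API and reports the HIP version in torch.version.hip.
    import sys
    import torch

    def main() -> int:
        if not torch.cuda.is_available():
            print("no ROCm-visible GPU found")
            return 1
        props = torch.cuda.get_device_properties(0)
        # gcnArchName (e.g. "gfx1100") exists on recent ROCm builds; fall back if missing
        arch = getattr(props, "gcnArchName", "unknown")
        print(f"device: {torch.cuda.get_device_name(0)} ({arch}), HIP {torch.version.hip}")
        # tiny matmul on the GPU as a functional check
        a = torch.randn(512, 512, device="cuda")
        b = torch.randn(512, 512, device="cuda")
        checksum = (a @ b).sum().item()
        print(f"matmul checksum: {checksum:.3f}")
        return 0

    if __name__ == "__main__":
        sys.exit(main())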
2024-06-06 16:50:32 <@tflink:fedora.im> !info the immediate priority for AI/ML hardware is AMD for ROCm since that's what we've been making progress with
2024-06-06 16:51:00 <@trix:fedora.im> yup rocm all the wayyy babyyyy
2024-06-06 16:51:32 <@jsteffan:fedora.im> i would suspect having a few rack mount host servers that have plenty of cpu/ram (and as many expansion cards as we can fit) and then having a reserved quarter cabinet (or less) with some sort of shelving/etc, we could have stacks of eGPUs for testing. worst case, a NOC ticket could be opened to swap which eGPU is attached
2024-06-06 16:51:34 <@tflink:fedora.im> !info intel is another longer term priority since we're looking at a year minimum before any of this is online
2024-06-06 16:51:45 <@jsteffan:fedora.im> yeah
2024-06-06 16:51:53 <@man2dev:fedora.im> ok, but this was about future hardware needs - let's face it, NVIDIA is a major player
2024-06-06 16:52:18 <@mystro256:fedora.im> Note: I looked into Nvidia out of interest and there are also out-of-tree kernel module issues to worry about with Fedora
2024-06-06 16:52:40 <@mystro256:fedora.im> Nvidia's focus has never really been bleeding edge kernels
2024-06-06 16:52:45 <@tflink:fedora.im> !info nvidia is something we can also look at but it might be better (and more easily) handled with public cloud resources. it's also an issue due to its proprietary nature and distribution restrictions
2024-06-06 16:53:40 <@man2dev:fedora.im> I haven't tested the work that's been done with openapi but yeah they are invested. the question is what the performance is like compared to something like ROCm and vulkan
2024-06-06 16:53:41 <@trix:fedora.im> it comes down to who wants to do what. if someone really wants to do nvidia there is nothing stopping them, other than the oot kernel driver and third party cuda challenges.. everything has challenges.
2024-06-06 16:53:50 <@tflink:fedora.im> have I summarized things reasonably well or am I missing things WRT what kind of datacenter-living HW we'd ask for?
2024-06-06 16:54:12 <@mystro256:fedora.im> My point is just that Intel and AMD are the most viable for Fedora due to fast moving kernels
2024-06-06 16:54:25 <@mystro256:fedora.im> assuming intel isn't doing OOT kernel modules
2024-06-06 16:54:33 <@man2dev:fedora.im> to my understanding, all of these updates are coming only to the open source driver
2024-06-06 16:54:52 <@trix:fedora.im> datacenter is very expensive and more on the rhel side of things. i have access but can not really share outside of redhat.
2024-06-06 16:54:53 <@man2dev:fedora.im> no argument there
2024-06-06 16:55:08 <@mystro256:fedora.im> Out of curiosity, are they planning to merge that into upstream linux anytime soon?
2024-06-06 16:55:09 <@tflink:fedora.im> FWIW, I'm not trying to stop anyone from working on nvidia. I just recognize that we have very limited resources and believe that focusing on the more open solutions first is the best route forward
2024-06-06 16:55:39 <@mystro256:fedora.im> bit off topic, but I haven't heard much recently
2024-06-06 16:55:40 <@man2dev:fedora.im> other than the work being done in nova and nvk, NO
2024-06-06 16:56:54 <@man2dev:fedora.im> but idk, NVIDIA has been weird for some time. i mean, who thought redhat could convince them to become open source
2024-06-06 16:57:22 <@trix:fedora.im> oh jeeze, hand grenade.
2024-06-06 16:57:27 <@mystro256:fedora.im> is nova related to the official driver nv put on github?
2024-06-06 16:57:33 <@trix:fedora.im> back to the content of the meeting.
2024-06-06 16:57:38 <@mystro256:fedora.im> haha sorry
2024-06-06 16:58:14 <@trix:fedora.im> so how about we get hw for all the gfx11's and whatever new things are coming on the commercial side of amd
2024-06-06 16:58:30 <@man2dev:fedora.im> it is the easier and faster choice due to a lot of the work being done by upstream, but i'm just saying not to put all our eggs in one basket
2024-06-06 16:59:33 <@tflink:fedora.im> getting back to the topic at hand, I'm planning to ask for 8 pcie slots worth of servers for testing, assuming that a suitable server can be found. In addition, I want to plan for about that many GPUs - a mix of gfx1100 and whatever is coming after gfx1100 for amd, an intel card or two and maybe an nvidia card
2024-06-06 16:59:51 <@tflink:fedora.im> I don't expect to get all that but you never know if you don't ask :)
2024-06-06 17:00:07 <@man2dev:fedora.im> to my understanding it's trying to implement the open source nvidia driver inside of it and replace nova
2024-06-06 17:00:26 <@jsteffan:fedora.im> so if i were to summarize it would be:
2024-06-06 17:00:26 <@jsteffan:fedora.im> 1) rely on public cloud resources to provision nvidia hardware
2024-06-06 17:00:26 <@jsteffan:fedora.im> 2) design a base compute platform that is in the common RHIT procurement pool
2024-06-06 17:00:26 <@jsteffan:fedora.im> 3) determine the best way to acquire target consumer GPUs that will work on the selected compute platform (e.g. how they are attached to the compute platform)
2024-06-06 17:00:26 <@jsteffan:fedora.im> 4) define target AMD cards
2024-06-06 17:00:26 <@jsteffan:fedora.im> 5) define target intel cards
2024-06-06 17:00:26 <@tflink:fedora.im> am I missing anything in terms of HW that we would need/want?
2024-06-06 17:00:31 <@trix:fedora.im> tflink: you can reuse a crypto mining rig! franken-miner
2024-06-06 17:00:56 <@man2dev:fedora.im> what about AMD CPUs?
2024-06-06 17:01:12 <@tflink:fedora.im> yeah, that's a good summary, I think
2024-06-06 17:01:17 <@trix:fedora.im> see above .. gfx1103 is a cpu.
2024-06-06 17:01:52 <@man2dev:fedora.im> doesn't ROCm work like cuda in the sense that it interprets gpu and cpu instructions? so wouldn't an AMD CPU get the best results
2024-06-06 17:02:12 <@tflink:fedora.im> generally out of scope for what we want to do, I think. I'm struggling to find a suitable rackmount solution; if I can find one, I'm not going to be picky about the CPU for the host systems
2024-06-06 17:02:35 <@tflink:fedora.im> I thought that the CPU portion of ROCm was pretty vendor neutral
2024-06-06 17:02:37 <@jsteffan:fedora.im> yeah, item #3 is to determine the best strategy ... there is iGPU (integrated), dGPU (discrete) and eGPU (external), and there might need to be a mix
2024-06-06 17:02:56 <@trix:fedora.im> we are getting out of time.. so next topic ?
2024-06-06 17:03:18 <@trix:fedora.im> anyone want gmeet ?
2024-06-06 17:03:23 <@tflink:fedora.im> yeah, i could go on for a while on the restrictions for the HW
2024-06-06 17:03:25 <@jsteffan:fedora.im> the most common is dGPU, but we (the SIG) should also pay attention to enabling the iGPU stuff next, and then finally any eGPU stuff
2024-06-06 17:03:45 <@tflink:fedora.im> but we are kinda out of time.
we can continue this conversation outside the meeting
2024-06-06 17:04:00 <@tflink:fedora.im> !topic future meeting location and name
2024-06-06 17:04:12 <@man2dev:fedora.im> Tom would know more about this
2024-06-06 17:04:37 <@tflink:fedora.im> up until now, we've had a pytorch meeting that happens every 2 weeks, alternating between matrix and google meet
2024-06-06 17:05:16 <@tflink:fedora.im> I'd argue that it has become a more general ai/ml meeting rather than a pytorch-specific meeting, and it might make sense to change the name if we're making other changes
2024-06-06 17:05:36 <@trix:fedora.im> yup i agree
2024-06-06 17:05:41 <@tflink:fedora.im> but the bigger point is that alternating between two meeting methods is a bit of a pain and I'd like to choose one over the other
2024-06-06 17:05:52 <@man2dev:fedora.im> i agree
2024-06-06 17:06:24 <@tflink:fedora.im> my personal preference is for matrix but I would prefer consistency even if that means doing google meet
2024-06-06 17:06:48 <@trix:fedora.im> i'm fine with matrix
2024-06-06 17:07:03 <@tflink:fedora.im> I think this will need to go to discourse since there are likely interested folks who aren't here
2024-06-06 17:07:13 <@trix:fedora.im> that's fine too
2024-06-06 17:07:45 <@man2dev:fedora.im> how about a poll
2024-06-06 17:08:17 <@trix:fedora.im> do this on discourse and see.. so tflink, you want to get that going ?
2024-06-06 17:08:25 <@jsteffan:fedora.im> yeah, a discourse poll sounds great
2024-06-06 17:08:29 <@tflink:fedora.im> I propose the following: we continue the current schedule for the next meeting in 2 weeks, which will be on google meet. in the meantime, tflink will start a topic on discourse to get more input on which meeting method is preferred and whether we make other changes
2024-06-06 17:08:44 <@trix:fedora.im> groovy
2024-06-06 17:09:00 <@man2dev:fedora.im> i agree
2024-06-06 17:09:44 <@tflink:fedora.im> I would prefer not to use a poll unless discourse has features I'm not aware of - I generally care more about the opinions of the folks who are showing up regularly, and a poll at least implies that all votes are equal
2024-06-06 17:10:56 <@tflink:fedora.im> it might not be a popular opinion, but I think that using a meeting method that's popular with folks who aren't showing up, if it's contrary to what the people showing up want, is counter-productive
2024-06-06 17:11:53 <@trix:fedora.im> yes, i give more weight to people that show up.
2024-06-06 17:12:49 <@tflink:fedora.im> any other thoughts on the proposal?
2024-06-06 17:13:17 <@man2dev:fedora.im> it's fine
2024-06-06 17:13:35 <@tflink:fedora.im> !info we will continue the current schedule for the next meeting in 2 weeks, which will be on google meet. in the meantime, tflink will start a topic on discourse to get more input on which meeting method is preferred and whether we make other changes
2024-06-06 17:14:21 <@tflink:fedora.im> also, be aware that the link for the google meeting in two weeks has likely changed. the ownership of the meeting in google moved to me and I think that will change the link
2024-06-06 17:14:42 <@tflink:fedora.im> I think that's it and we're 15 minutes over time so ...
2024-06-06 17:14:47 <@tflink:fedora.im> !topic open floow
2024-06-06 17:14:50 <@tflink:fedora.im> !undo
2024-06-06 17:15:04 <@tflink:fedora.im> !topic open floor (now with fewer typos)
2024-06-06 17:15:51 <@tflink:fedora.im> it sounds like there may be more to discuss around the HW question, we can continue that in the #ai-ml:fedoraproject.org channel
2024-06-06 17:15:58 <@tflink:fedora.im> are there any other topics that folks want to bring up?
2024-06-06 17:16:07 <@man2dev:fedora.im> I think the cpu situation should be researched a bit more to see if there are any better results with an AMD CPU & GPU
2024-06-06 17:16:23 <@man2dev:fedora.im> but we can talk about that in the sig
2024-06-06 17:16:45 <@tflink:fedora.im> if there are no more topics, I'll close out the meeting.
2024-06-06 17:17:01 <@tflink:fedora.im> thank you for coming, everyone. I'll post minutes to the discourse topic shortly
2024-06-06 17:17:03 <@tflink:fedora.im> !endmeeting