2024-06-06 16:31:25 <@tflink:fedora.im> !startmeeting fedora-pytorch
2024-06-06 16:31:25 <@meetbot:fedora.im> Meeting started at 2024-06-06 16:31:25 UTC
2024-06-06 16:31:25 <@meetbot:fedora.im> The Meeting name is 'fedora-pytorch'
2024-06-06 16:31:34 <@tflink:fedora.im> !topic welcome
2024-06-06 16:31:39 <@tflink:fedora.im> !hello
2024-06-06 16:31:40 <@zodbot:fedora.im> Tim Flink (tflink)
2024-06-06 16:32:00 <@jsteffan:fedora.im> !hi
2024-06-06 16:32:00 <@tflink:fedora.im> who all's here for some pytorch and ai-ml stuff?
2024-06-06 16:32:01 <@zodbot:fedora.im> Jonathan Steffan (jsteffan)
2024-06-06 16:32:13 <@tflink:fedora.im> !link https://board.net/p/fedora-pytorch-meeting
2024-06-06 16:32:27 <@tflink:fedora.im> meeting agenda and minutes are in that document
2024-06-06 16:32:52 <@tflink:fedora.im> we'll wait a couple of minutes for folks to filter in and then we can get started
2024-06-06 16:35:10 <@trix:fedora.im> filter filter.. am i a low or high pass filter ?!?
2024-06-06 16:35:55 <@tflink:fedora.im> hard for us to tell unless we get feedback - are things going over your head or are they beneath you ✌️
2024-06-06 16:36:18 <@tflink:fedora.im> where did that emoji come from? it was supposed to be :-D
2024-06-06 16:36:45 <@tflink:fedora.im> anyhow, we're 5 minutes after start so let's get this party started
2024-06-06 16:37:04 <@tflink:fedora.im> !topic HW need/wants for Fedora infra
2024-06-06 16:37:51 <@man2dev:fedora.im> so it's just a text-based meeting?
2024-06-06 16:38:08 <@tflink:fedora.im> there has been a question posed on discourse about whether there is HW that would be good to add to fedora infra for AI/ML purposes
2024-06-06 16:38:30 <@tflink:fedora.im> for today, yes. we've been alternating between google and matrix and AFAIK, the last meeting was on google
2024-06-06 16:38:55 <@tflink:fedora.im> !link https://discussion.fedoraproject.org/t/datacenter-hardware-needs-for-ai-in-fedora/119116
2024-06-06 16:39:46 <@man2dev:fedora.im> for what use case - training? or running models?
2024-06-06 16:40:04 <@tflink:fedora.im> I've been working on this behind the scenes for a bit and am planning to ask for some hosts and GPUs to make progress on automated testing for AI/ML use cases
2024-06-06 16:40:46 <@tflink:fedora.im> Mohammadreza Hendiani: it's an open-ended question so long as there is a good justification and a realistic use case for the HW
2024-06-06 16:41:03 <@trix:fedora.im> my testing is completely manual and only one card at a time.. so i'm interested in what you want
2024-06-06 16:41:36 <@tflink:fedora.im> I have a research project that will require GPU time if it goes into production, so that is one of the workloads/usecases that I have in the back of my mind, but I know I'm not the only one with ideas :)
2024-06-06 16:43:13 <@tflink:fedora.im> My interest is primarily in HW coverage for AMD GPUs because the progress and focus as of late has been on ROCm.
Since it'll be a year or so before we see any of this HW, I also want to keep intel in mind as they make progress on their stack
2024-06-06 16:43:57 <@tflink:fedora.im> I know that nvidia is popular and I'm not against looking at how we can have testing for that, but it's a difficult problem to solve due to the distribution restrictions and proprietary nature of nvidia's platform
2024-06-06 16:44:32 <@tflink:fedora.im> access to nvidia accelerators is also much easier due to their current ubiquity in public clouds
2024-06-06 16:44:54 <@man2dev:fedora.im> most of the hardware optimization tests I have done are with vulkan, so in my opinion one of nvidia's vulkan-supported GPUs would be useful for vulkan testing and optimization https://developer.nvidia.com/vulkan-driver
2024-06-06 16:45:34 <@trix:fedora.im> is there a way to test distributively?
2024-06-06 16:46:05 <@tflink:fedora.im> WRT ROCm, I'd like to see automated coverage for gfx1100 and maybe gfx1103, in addition to planning for 2 or so more GPUs once rdna4 is a thing
2024-06-06 16:46:31 <@tflink:fedora.im> I need to write things up but I do have an idea for that since it's the most practical option for the near future
2024-06-06 16:47:15 <@man2dev:fedora.im> look, we are very behind on vulkan support in Fedora. on my own device I'm setting my environment variables manually because it hasn't been packaged
2024-06-06 16:47:33 <@tflink:fedora.im> in an ideal world, I'd like to see coverage for gfx10 and gfx9, but I don't think those are available new anymore, or for anything even close to a sane price for the newer end of gfx9
2024-06-06 16:47:38 <@man2dev:fedora.im> some of the most important parts of the sdk are packaged
2024-06-06 16:47:48 <@man2dev:fedora.im> but we don't have the full sdk
2024-06-06 16:47:57 <@trix:fedora.im> i need these samples for triaging problems but unless i have multiple servers and there is some widget that makes them sharable, it's not much help .. i can _ask_ for more servers, they will ask me to just swap cards.
2024-06-06 16:48:42 <@jsteffan:fedora.im> Mohammadreza Hendiani: i'd be interested in defining what support we do have and what we are missing, and getting things packaged. we can discuss that outside of the meeting
2024-06-06 16:48:53 <@man2dev:fedora.im> i don't understand the question?
2024-06-06 16:49:04 <@tflink:fedora.im> yeah, testing is one thing, access to the machine for debug is another. to be honest, I'm not sure how to approach that but it'll be a point of discussion once plans start forming
2024-06-06 16:49:10 <@jsteffan:fedora.im> do we have access to responsive remote hands from the NOC?
2024-06-06 16:49:11 <@tflink:fedora.im> or once I get around to writing stuff up
2024-06-06 16:49:14 <@man2dev:fedora.im> i have a list if you want, but it's not complete yet
2024-06-06 16:49:42 <@jsteffan:fedora.im> can we take advantage of eGPUs here? i know tflink mentioned that even finding a suitable chassis has been painful
2024-06-06 16:49:45 <@trix:fedora.im> ideally i plug in server+card and start up a testing service that tim or anyone can run fedora tests on
2024-06-06 16:49:57 <@tflink:fedora.im> eGPU? external GPUs?
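[A minimal sketch of the kind of automated ROCm smoke test discussed above, assuming the ROCm build of PyTorch is installed on the test host; the script itself is hypothetical and not an existing Fedora CI job, and the gcnArchName attribute is only present on newer PyTorch/ROCm builds:]

    #!/usr/bin/env python3
    # Illustrative ROCm smoke test for automated GPU coverage (e.g. gfx1100).
    # Assumes the ROCm build of PyTorch, which exposes AMD GPUs through the
    # torch.cuda API and reports the HIP version in torch.version.hip.
    import sys
    import torch

    def main() -> int:
        if not torch.cuda.is_available():
            print("no ROCm-visible GPU found")
            return 1
        props = torch.cuda.get_device_properties(0)
        # gcnArchName (e.g. "gfx1100") exists on recent ROCm builds; fall back if missing
        arch = getattr(props, "gcnArchName", "unknown")
        print(f"device: {torch.cuda.get_device_name(0)} ({arch}), HIP {torch.version.hip}")
        # tiny matmul on the GPU as a functional check
        a = torch.randn(512, 512, device="cuda")
        b = torch.randn(512, 512, device="cuda")
        checksum = (a @ b).sum().item()
        print(f"matmul checksum: {checksum:.3f}")
        return 0

    if __name__ == "__main__":
        sys.exit(main())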
2024-06-06 16:50:32 <@tflink:fedora.im> !info the immediate priority for AI/ML hardware is AMD for ROCm since that's what we've been making progress with
2024-06-06 16:51:00 <@trix:fedora.im> yup rocm all the wayyy babyyyy
2024-06-06 16:51:32 <@jsteffan:fedora.im> i would suspect having a few rack mount host servers that have plenty of cpu/ram (and as many expansion cards as we can fit) and then having a reserved quarter cabinet (or less) with some sort of shelving/etc, we could have stacks of eGPUs for testing. worst case, a NOC ticket could be opened to swap which eGPU is attached
2024-06-06 16:51:34 <@tflink:fedora.im> !info intel is another longer term priority since we're looking at a year minimum before any of this is online
2024-06-06 16:51:45 <@jsteffan:fedora.im> yeah
2024-06-06 16:51:53 <@man2dev:fedora.im> ok, but this was about future hardware needs - let's face it, NVIDIA is a major player
2024-06-06 16:52:18 <@mystro256:fedora.im> Note: I looked into Nvidia out of interest and there are also out-of-tree kernel module issues to worry about with Fedora
2024-06-06 16:52:40 <@mystro256:fedora.im> Nvidia's focus has never really been bleeding edge kernels
2024-06-06 16:52:45 <@tflink:fedora.im> !info nvidia is something we can also look at but it might be better (and more easily) handled with public cloud resources. it's also an issue due to its proprietary nature and distribution restrictions
2024-06-06 16:53:40 <@man2dev:fedora.im> I haven't tested the work that's been done with openapi but yeah they are invested. the question is what the performance is like compared to something like ROCm and vulkan
2024-06-06 16:53:41 <@trix:fedora.im> it comes down to who wants to do what. if someone really wants to do nvidia there is nothing stopping them, other than the oot kernel driver and third party cuda challenges.. everything has challenges.
2024-06-06 16:53:50 <@tflink:fedora.im> have I summarized things reasonably well or am I missing things WRT what kind of datacenter-living HW we'd ask for?
2024-06-06 16:54:12 <@mystro256:fedora.im> My point is just that Intel and AMD are the most viable for Fedora due to fast moving kernels
2024-06-06 16:54:25 <@mystro256:fedora.im> assuming intel isn't doing OOT kernel modules
2024-06-06 16:54:33 <@man2dev:fedora.im> to my understanding, all of these updates are coming only to the open source driver
2024-06-06 16:54:52 <@trix:fedora.im> datacenter is very expensive and more on the rhel side of things. i have access but can not really share outside of redhat.
2024-06-06 16:54:53 <@man2dev:fedora.im> no argument there
2024-06-06 16:55:08 <@mystro256:fedora.im> Out of curiosity, are they planning to merge that into upstream linux anytime soon?
2024-06-06 16:55:09 <@tflink:fedora.im> FWIW, I'm not trying to stop anyone from working on nvidia. I just recognize that we have very limited resources and believe that focusing on the more open solutions first is the best route forward
2024-06-06 16:55:39 <@mystro256:fedora.im> bit off topic, but I haven't heard much recently
2024-06-06 16:55:40 <@man2dev:fedora.im> other than the work being done in nova and nvk, NO
2024-06-06 16:56:54 <@man2dev:fedora.im> but idk, NVIDIA has been weird for some time. i mean, who thought redhat could convince them to become open source
2024-06-06 16:57:22 <@trix:fedora.im> oh jeeze, hand grenade.
2024-06-06 16:57:27 <@mystro256:fedora.im> is nova related to the official driver nv put on github?
2024-06-06 16:57:33 <@trix:fedora.im> back to the content of the meeting.
2024-06-06 16:57:38 <@mystro256:fedora.im> haha sorry
2024-06-06 16:58:14 <@trix:fedora.im> so how about we get hw for all the gfx11's and whatever new things are coming on the commercial side of amd
2024-06-06 16:58:30 <@man2dev:fedora.im> it is the easier and faster choice due to a lot of the work being done by upstream, but i'm just saying not to put all our eggs in one basket
2024-06-06 16:59:33 <@tflink:fedora.im> getting back to the topic at hand, I'm planning to ask for 8 pcie slots worth of servers for testing, assuming that a suitable server can be found. In addition, I want to plan for about that many GPUs - a mix of gfx1100 and whatever is coming after gfx1100 for amd, an intel card or two and maybe an nvidia card
2024-06-06 16:59:51 <@tflink:fedora.im> I don't expect to get all that but you never know if you don't ask :)
2024-06-06 17:00:07 <@man2dev:fedora.im> to my understanding it's trying to implement the open source nvidia driver inside of it and replace nova
2024-06-06 17:00:26 <@jsteffan:fedora.im> so if i were to summarize it would be:
2024-06-06 17:00:26 <@jsteffan:fedora.im> 1) rely on public cloud resources to provision nvidia hardware
2024-06-06 17:00:26 <@jsteffan:fedora.im> 2) design a base compute platform that is in the common RHIT procurement pool
2024-06-06 17:00:26 <@jsteffan:fedora.im> 3) determine the best way to acquire target consumer GPUs that will work on the selected compute platform (e.g. how they are attached to the compute platform)
2024-06-06 17:00:26 <@jsteffan:fedora.im> 4) define target AMD cards
2024-06-06 17:00:26 <@jsteffan:fedora.im> 5) define target intel cards
2024-06-06 17:00:26 <@tflink:fedora.im> am I missing anything in terms of HW that we would need/want?
2024-06-06 17:00:31 <@trix:fedora.im> tflink: you can reuse a crypto mining rig! franken-miner
2024-06-06 17:00:56 <@man2dev:fedora.im> what about AMD CPUs?
2024-06-06 17:01:12 <@tflink:fedora.im> yeah, that's a good summary, I think
2024-06-06 17:01:17 <@trix:fedora.im> see above .. gfx1103 is a cpu.
2024-06-06 17:01:52 <@man2dev:fedora.im> doesn't ROCm work like cuda in the sense that it interprets gpu and cpu instructions? so wouldn't an AMD CPU get the best results
2024-06-06 17:02:12 <@tflink:fedora.im> generally out of scope for what we want to do, I think. I'm struggling to find a suitable rackmount solution; if I can find one, I'm not going to be picky about the CPU for the host systems
2024-06-06 17:02:35 <@tflink:fedora.im> I thought that the CPU portion of ROCm was pretty vendor neutral
2024-06-06 17:02:37 <@jsteffan:fedora.im> yeah, item #3 is to determine the best strategy ... there is iGPU (integrated), dGPU (discrete) and eGPU (external), and there might need to be a mix
2024-06-06 17:02:56 <@trix:fedora.im> we are getting out of time.. so next topic ?
2024-06-06 17:03:18 <@trix:fedora.im> anyone want gmeet ?
2024-06-06 17:03:23 <@tflink:fedora.im> yeah, i could go on for a while on the restrictions for the HW
2024-06-06 17:03:25 <@jsteffan:fedora.im> the most common is dGPU, but we (the SIG) should also pay attention to enabling the iGPU stuff next, and then finally any eGPU stuff
2024-06-06 17:03:45 <@tflink:fedora.im> but we are kinda out of time.
we can continue this conversation outside the meeting
2024-06-06 17:04:00 <@tflink:fedora.im> !topic future meeting location and name
2024-06-06 17:04:12 <@man2dev:fedora.im> Tom would know more about this
2024-06-06 17:04:37 <@tflink:fedora.im> up until now, we've had a pytorch meeting that happens every 2 weeks, alternating between matrix and google meet
2024-06-06 17:05:16 <@tflink:fedora.im> I'd argue that it has become a more general ai/ml meeting rather than a pytorch-specific meeting, and it might make sense to change the name if we're making other changes
2024-06-06 17:05:36 <@trix:fedora.im> yup i agree
2024-06-06 17:05:41 <@tflink:fedora.im> but the bigger point is that alternating between two meeting methods is a bit of a pain and I'd like to choose one over the other
2024-06-06 17:05:52 <@man2dev:fedora.im> i agree
2024-06-06 17:06:24 <@tflink:fedora.im> my personal preference is for matrix but I would prefer consistency even if that means doing google meet
2024-06-06 17:06:48 <@trix:fedora.im> i'm fine with matrix
2024-06-06 17:07:03 <@tflink:fedora.im> I think this will need to go to discourse since there are likely interested folks who aren't here
2024-06-06 17:07:13 <@trix:fedora.im> that's fine too
2024-06-06 17:07:45 <@man2dev:fedora.im> how about a poll
2024-06-06 17:08:17 <@trix:fedora.im> do this on discourse and see.. so tflink, you want to get that going ?
2024-06-06 17:08:25 <@jsteffan:fedora.im> yeah, a discourse poll sounds great
2024-06-06 17:08:29 <@tflink:fedora.im> I propose the following: we continue the current schedule for the next meeting in 2 weeks, which will be on google meet. in the meantime, tflink will start a topic on discourse to get more input on which meeting method is preferred and whether we make other changes
2024-06-06 17:08:44 <@trix:fedora.im> groovy
2024-06-06 17:09:00 <@man2dev:fedora.im> i agree
2024-06-06 17:09:44 <@tflink:fedora.im> I would prefer not to use a poll unless discourse has features I'm not aware of - I generally care more about the opinions of the folks who are showing up regularly, and a poll at least implies that all votes are equal
2024-06-06 17:10:56 <@tflink:fedora.im> it might not be a popular opinion, but I think that using a meeting method that's popular with folks who aren't showing up, if it's contrary to what the people showing up want, is counter-productive
2024-06-06 17:11:53 <@trix:fedora.im> yes, i give more weight to people that show up.
2024-06-06 17:12:49 <@tflink:fedora.im> any other thoughts on the proposal?
2024-06-06 17:13:17 <@man2dev:fedora.im> it's fine
2024-06-06 17:13:35 <@tflink:fedora.im> !info we will continue the current schedule for the next meeting in 2 weeks, which will be on google meet. in the meantime, tflink will start a topic on discourse to get more input on which meeting method is preferred and whether we make other changes
2024-06-06 17:14:21 <@tflink:fedora.im> also, be aware that the link for the google meeting in two weeks has likely changed. the ownership of the meeting in google moved to me and I think that will change the link
2024-06-06 17:14:42 <@tflink:fedora.im> I think that's it and we're 15 minutes over time so ...
2024-06-06 17:14:47 <@tflink:fedora.im> !topic open floow
2024-06-06 17:14:50 <@tflink:fedora.im> !undo
2024-06-06 17:15:04 <@tflink:fedora.im> !topic open floor (now with fewer typos)
2024-06-06 17:15:51 <@tflink:fedora.im> it sounds like there may be more to discuss around the HW question, we can continue that in the #ai-ml:fedoraproject.org channel
2024-06-06 17:15:58 <@tflink:fedora.im> are there any other topics that folks want to bring up?
2024-06-06 17:16:07 <@man2dev:fedora.im> I think the cpu situation should be researched a bit more to see if there are any better results with an AMD CPU & GPU
2024-06-06 17:16:23 <@man2dev:fedora.im> but we can talk about that in the sig
2024-06-06 17:16:45 <@tflink:fedora.im> if there are no more topics, I'll close out the meeting.
2024-06-06 17:17:01 <@tflink:fedora.im> thank you for coming, everyone. I'll post minutes to the discourse topic shortly
2024-06-06 17:17:03 <@tflink:fedora.im> !endmeeting