#meeting-2:fedoraproject.org log

<@tflink:fedora.im>

17:30:22

!startmeeting fedora-ai-ml-sig

<@meetbot:fedora.im>

17:30:22

Meeting started at 2024-12-05 17:30:22 UTC

<@meetbot:fedora.im>

17:30:23

The Meeting name is 'fedora-ai-ml-sig'

<@tflink:fedora.im>

17:30:30

!hello

<@zodbot:fedora.im>

17:30:31

Tim Flink (tflink)

<@man2dev:fedora.im>

17:30:34

!hi

<@zodbot:fedora.im>

17:30:36

Mohammadreza Hendiani (man2dev)

<@trix:fedora.im>

17:30:38

!hi

<@zodbot:fedora.im>

17:30:39

Tom Rix (trix)

<@tflink:fedora.im>

17:30:45

Who all is here for the AI-ML SIG meeting?

<@trix:fedora.im>

17:30:59

just us chickens

<@man2dev:fedora.im>

17:31:26

🤣

<@tflink:fedora.im>

17:32:25

ok, let's get started

<@tflink:fedora.im>

17:32:46

the only topic on the agenda is going over the test ideas/proposals/requirements that came out of the last meeting

<@trix:fedora.im>

17:32:57

old business.

<@tflink:fedora.im>

17:33:06

I assume that this is likely to take up plenty of time, is there anything else quick that folks want to bring up before we dive in?

<@wbclark:fedora.im>

17:33:17

/wave

<@trix:fedora.im>

17:33:23

Jeremy Newton: opened this ticket https://pagure.io/fesco/issue/3291

<@tflink:fedora.im>

17:33:30

oh yeah, that's a good one

<@tflink:fedora.im>

17:33:50

!topic FESCo ticket about using AMD's llvm fork for hipcc

<@tflink:fedora.im>

17:33:57

!link https://pagure.io/fesco/issue/3291

<@trix:fedora.im>

17:34:44

so i think we are good to go on the bundling

<@wbclark:fedora.im>

17:35:10

another quick announcement - All Things Open conference has a new spin off conference for open source AI. March 17-18 in Raleigh-Durham. call for papers closes tomorrow (Dec 6th)

<@tflink:fedora.im>

17:35:15

yeah, it sounds like there are still some reservations but I don't see any hard "no"s

<@zodbot:fedora.im>

17:35:59

man2dev gave a cookie to wbclark. They now have 1 cookie, 1 of which was obtained in the Fedora 41 release cycle

<@tflink:fedora.im>

17:36:02

William Clark: do you have a link?

<@wbclark:fedora.im>

17:36:09

yes, https://allthingsopen.ai/

<@tflink:fedora.im>

17:37:06

William Clark: it sounds like an in-person thing without provisions for remote presentations?

<@tflink:fedora.im>

17:37:34

sorry, I let this get side tracked

<@wbclark:fedora.im>

17:38:01

not sure, didn't check as I'm local to that area.. sorry!

<@tflink:fedora.im>

17:38:31

!info the FESCo ticket wasn't so much for permission as for guidance and so far, the guidance seems to be somewhat OK, with requests for more justification and for submission of the bundling change as a feature

<@tflink:fedora.im>

17:38:41

does that summary seem right for the FESCo ticket?

<@trix:fedora.im>

17:40:58

i would have to defer to Jeremy on this.

<@tflink:fedora.im>

17:41:09

the amount of typing start/stop makes me think that I missed something :)

<@trix:fedora.im>

17:41:30

I believe we have some cleanup in the spec file and we are taking care of that.

<@tflink:fedora.im>

17:42:00

but it sounds like the plan, for now at least, is to start bundling AMD's llvm fork for hipcc for F42?

<@trix:fedora.im>

17:42:22

with the 6.3 change coming up.

<@trix:fedora.im>

17:42:43

in 6.2 it is already bundled for rhel and suse.

<@tflink:fedora.im>

17:42:53

!info at this moment, the plan is to start bundling AMD's llvm fork with hipcc for ROCm 6.3

<@tflink:fedora.im>

17:43:05

cool, anything else on this topic?

<@trix:fedora.im>

17:43:10

nope.

<@tflink:fedora.im>

17:44:08

!topic HW testing ideas/proposals/requirements

<@tflink:fedora.im>

17:44:45

!info at the last meeting, there was some conversation around automated testing and after some chaos and confusion, the decision was to start gathering ideas, requirements etc. for later discussion

<@tflink:fedora.im>

17:44:48

!link https://board.net/p/fedora-ai-ml-sig-testing-brainstorm

<@tflink:fedora.im>

17:45:27

from what I see, there are roughly 3 ideas that were submitted

<@tflink:fedora.im>

17:45:50

1. kernel testing

<@tflink:fedora.im>

17:45:50

2. ROCm component testing

<@tflink:fedora.im>

17:45:50

3. End to end AI/ML testing

<@tflink:fedora.im>

17:46:07

in the hopes of making this time productive, let's go over them 1 by 1

<@tflink:fedora.im>

17:46:22

!info the following topics are the proposals that were part of that document

<@trix:fedora.im>

17:46:27

ok me me me

<@tflink:fedora.im>

17:46:28

!topic Basic Kernel Testing

<@tflink:fedora.im>

17:46:39

Tom Rix: this was yours, go for it

<@trix:fedora.im>

17:47:07

i am looking at how to hook up kernelci, maybe package up 'lava' and get something going locally

<@trix:fedora.im>

17:47:27

but really want to get it going in a lab, maybe one at AMD.

<@tflink:fedora.im>

17:47:32

lava?

<@trix:fedora.im>

17:47:50

it's a Linaro thing

<@tflink:fedora.im>

17:48:04

ah, is that part of the automated upstream DRM setup?

<@trix:fedora.im>

17:48:05

i know little, i am looking into it.

<@man2dev:fedora.im>

17:48:25

llava the image processing model?

<@trix:fedora.im>

17:48:37

not sure if drm uses that, but that would be the subsystem we need to test.

<@tflink:fedora.im>

17:48:45

I think that the biggest hurdle here is that kernel testing like this requires bare metal and automating it requires automating bare metal

<@trix:fedora.im>

17:48:54

build a kernel, run drm tests on it

<@trix:fedora.im>

17:49:12

yup, that's why i need to see it running locally.

<@tflink:fedora.im>

17:49:13

why not use the built kernels?

<@trix:fedora.im>

17:49:27

yes, can do that too.

<@trix:fedora.im>

17:49:48

the hard part is the hw setup to use it

<@tflink:fedora.im>

17:49:49

automating bare metal is quite a task, in my experience. tons and tons of corner cases

<@trix:fedora.im>

17:50:16

yes, i am hoping i only have to do the 80% that is easy.

<@tflink:fedora.im>

17:50:43

not saying that it can't or shouldn't be done, just that it tends to be a much more difficult problem than some folks give it credit for

<@trix:fedora.im>

17:50:44

i'll see if i get traction this month.

<@tflink:fedora.im>

17:50:50

cool

<@trix:fedora.im>

17:51:05

next!

<@trix:fedora.im>

17:51:08

you are up

<@tflink:fedora.im>

17:51:24

!info the proposal is to have a setup for automating kernel testing on bare metal, focusing on the driver bits that affect the ai-ml bits we're working on

<@tflink:fedora.im>

17:51:54

!topic ROCm Component Testing

<@tflink:fedora.im>

17:52:14

this kind of testing can be done without automating bare metal

<@trix:fedora.im>

17:52:37

containers on some host os

<@tflink:fedora.im>

17:52:41

the basic idea is to run tests against the ROCm components whenever they or their dependencies change

<@trix:fedora.im>

17:53:05

👍️

<@tflink:fedora.im>

17:53:27

yeah, close attention needs to be paid to the host kernel and driver versions but it's the kind of thing that can be changed less frequently and have the container handle the per-task setup

<@tflink:fedora.im>

17:53:59

so it doesn't avoid all the HW management problems, just makes them less frequent :)

<@trix:fedora.im>

17:54:25

does sanity checking of our copr's

<@tflink:fedora.im>

17:54:26

!info the proposal is pretty simple, start with containers and run ROCm component tests whenever the components or their dependencies change

<@tflink:fedora.im>

17:55:16

yeah, COPR would be part of this. most of the ROCm self-tests don't make it through koji so things will have to be rebuilt in COPR with the tests enabled

<@tflink:fedora.im>

17:55:54

rpath problems are the most common but at the end of the day, it is test code and AFAIK, isn't really meant to be distributed like the main deliverables

<@tflink:fedora.im>

17:56:06

any other questions or comments on this?

<@trix:fedora.im>

17:56:39

i am working on a proposal to get hw at amd for this.

<@trix:fedora.im>

17:56:59

so we solve the place to put hw

<@trix:fedora.im>

17:57:07

and the $$

<@trix:fedora.im>

17:57:30

i'll follow up with folks about this later, we don't need to rat hole on this.

<@tflink:fedora.im>

17:57:45

yeah, it's a consistent theme :)

<@tflink:fedora.im>

17:57:54

!topic End to End ML Testing

<@tflink:fedora.im>

17:58:33

One issue is that ML stacks tend to have the resiliency of wet tissue paper - they collapse if you look at them wrong

<@tflink:fedora.im>

17:59:00

the idea here is to have some e2e workflows to have some idea when the end environments break

<@tflink:fedora.im>

17:59:17

i.e. have pytorch run through some training, have ollama serve a model

<@trix:fedora.im>

17:59:21

so build llama-cpp and set it go ?

<@trix:fedora.im>

17:59:49

yes, just running the pytorch example is 90% of my ai testing.

<@trix:fedora.im>

18:00:09

building all the bits to do that is really a pain.

<@tflink:fedora.im>

18:00:25

the secondary benefit here is that we would have containers that could be shared to start managing the pace of change better for the end environments

<@man2dev:fedora.im>

18:00:29

true llama-cpp does download and test running by default

<@tflink:fedora.im>

18:01:06

at least that's how I envisioned things but I'm open to other ideas

<@tflink:fedora.im>

18:01:38

basically, have containers for the environments that we are interested in and run pseudo-production workflows through those containers on a regular basis

<@trix:fedora.im>

18:01:41

sounds fine. if we could at least do cpu x86 and aarch64 that would be a win

<@trix:fedora.im>

18:02:10

i do 0 testing on aarch64

<@man2dev:fedora.im>

18:02:22

I want to propose building some models as well, which does kind of relate to this if you need building of models as well

<@trix:fedora.im>

18:02:23

and 0 on x86

<@tflink:fedora.im>

18:02:31

yeah, I don't have much access to aarch64 hardware either

<@tflink:fedora.im>

18:03:06

Mohammadreza Hendiani: do you mean train them from scratch or fine-tune existing models?

<@man2dev:fedora.im>

18:03:30

from scratch

<@tflink:fedora.im>

18:04:04

!info the proposal is to help hedge against the fragility of ML stacks by creating Fedora-based containers for those environments and run pseudo-production workloads through those containers on a regular basis

<@tflink:fedora.im>

18:04:29

I don't have any philosophical issues with training models from scratch but that's a lot of HW we don't have

<@man2dev:fedora.im>

18:05:09

yeah i know my main idea on making this work is if aws access is granted

<@tflink:fedora.im>

18:05:14

a lot of the fine-tuning methods that I know of for LLMs are beyond the HW and budget we have now

<@tflink:fedora.im>

18:05:44

much less training from scratch

<@man2dev:fedora.im>

18:06:00

i wanted to avoid bigger models so as to try and make it scale

<@tflink:fedora.im>

18:06:24

I'm not saying that it can't be done or not to do it, just trying to express the cost and size of a project like that

<@man2dev:fedora.im>

18:06:57

yeah i get it.

<@trix:fedora.im>

18:07:53

i think any sort of smoke test would be better than what we have now, 0

<@tflink:fedora.im>

18:08:05

also true. perfect is the enemy of good (enough)

<@tflink:fedora.im>

18:09:06

!topic LLM Training

<@tflink:fedora.im>

18:09:28

!info the idea here is to train some LLM model from scratch using the tools we have available to us

<@trix:fedora.im>

18:10:05

would that LLM become part of the release or not ?

<@tflink:fedora.im>

18:10:07

!info this poses many challenges, including access to sufficient hardware and potential licensing issues

<@tflink:fedora.im>

18:10:20

I assume not or at least not at first

<@tflink:fedora.im>

18:10:45

there have been plenty of proposals about how Fedora could or shouldn't use LLMs :-D

<@tflink:fedora.im>

18:11:05

Mohammadreza Hendiani: did I summarize things reasonably?

<@man2dev:fedora.im>

18:11:40

i'm not going to need raw data, so as to avoid `potential licensing issues`

<@wbclark:fedora.im>

18:11:47

fine tuning doesn't have to be expensive. for a smoke test, you could undertrain. since the purpose is to validate that the artifact produced by training can be served

<@man2dev:fedora.im>

18:11:55

i'm talking about already processed and licensed data

<@tflink:fedora.im>

18:12:16

yeah, there are plenty of ways to go about doing it. Some of them are cheaper than others

<@tflink:fedora.im>

18:13:00

yeah, part of what I meant by licensing issues is making sure that all inputs are OK. I didn't think you were talking about gathering all the data from scratch

<@tflink:fedora.im>

18:13:09

anything else on this?

<@tflink:fedora.im>

18:14:15

ok, moving on

<@tflink:fedora.im>

18:14:26

!topic next steps for testing projects/conversations

<@tflink:fedora.im>

18:15:03

actually, it might be worth going over some of the general things I proposed

<@tflink:fedora.im>

18:15:04

!undo

<@tflink:fedora.im>

18:15:14

!topic Proposed General Requirements

<@tflink:fedora.im>

18:15:34

!info the idea here was to start describing some of the general things that we want to encourage or avoid

<@tflink:fedora.im>

18:15:52

I didn't mean this to be a unilateral imposition

<@tflink:fedora.im>

18:16:49

most of the items I listed come from mistakes that I've made or seen over the years

<@tflink:fedora.im>

18:17:04

the rest of it is things that I'd like to see

<@tflink:fedora.im>

18:18:03

For example, we're not going to get a ton of HW in a datacenter to start off. Having a setup that is flexible enough to support the idea of having HW distributed around when possible will help get the testing matrix started, at least

<@tflink:fedora.im>

18:18:31

any questions, concerns, additions to the list I started?

<@tflink:fedora.im>

18:19:42

several of the items I listed come from mistakes that I've made or seen over the years

<@tflink:fedora.im>

18:20:13

ok, this can be revisited later. as I said, I didn't mean the list to be "tflink has dictated that solutions must do X"

<@tflink:fedora.im>

18:20:18

moving on

<@tflink:fedora.im>

18:20:21

!topic next steps

<@tflink:fedora.im>

18:20:41

we have about 10 minutes left and I don't think that's enough time to make meaningful progress on what to do from here

<@trix:fedora.im>

18:20:46

tflink: has basement and hw.. please continue 😊

<@tflink:fedora.im>

18:21:18

I haven't been talking about it much but I have been working on a PoC for the ROCm component testing stuff

<@tflink:fedora.im>

18:21:49

I know that Mohammadreza Hendiani and Tom Rix have ideas about how to implement things as well :)

<@trix:fedora.im>

18:22:16

i am thinking about distributed too, for home-tested stuff with results posted somewhere that fills in the matrix.

<@tflink:fedora.im>

18:22:34

I don't sense much, if any, disagreement on the target testing proposals

<@tflink:fedora.im>

18:22:44

we didn't touch on where to start, though

<@trix:fedora.im>

18:22:58

i do not have a lot of experience here, so happy to lean on folks

<@tflink:fedora.im>

18:23:20

I'm just not quite sure what the next step should be or if there should be a step between here and "here is a solution"

<@tflink:fedora.im>

18:23:49

any thoughts on what the next conversation should be?

<@man2dev:fedora.im>

18:24:03

no, it's a great idea to have targeted testing since we are rebuilding packages

<@tflink:fedora.im>

18:24:05

if there's a step between here and presentation of potential solutions?

<@tflink:fedora.im>

18:24:22

not for distribution, unless I've misunderstood things

<@tflink:fedora.im>

18:24:57

maybe for the model but none of the things I proposed will have bits that get distributed other than reports

<@man2dev:fedora.im>

18:25:26

my idea basically boils down to firstly upstreaming .fmf testing for very big packages like llvm for rocm

<@tflink:fedora.im>

18:25:41

let me be clear, we don't have time to get into details today

<@man2dev:fedora.im>

18:25:47

or just setting them in packages

<@man2dev:fedora.im>

18:25:59

so it can integrate with test farm

<@tflink:fedora.im>

18:26:15

at least I don't. I'm not going to stop conversations from happening outside of the meeting :)

<@trix:fedora.im>

18:26:21

next meeting is close to holidays, should we skip ?

<@tflink:fedora.im>

18:26:32

that's another good point

<@tflink:fedora.im>

18:26:57

but before we get there - let's talk about what the next step is for figuring out test setups

<@tflink:fedora.im>

18:27:09

it sounds like the next step is to get into specific proposals?

<@trix:fedora.im>

18:27:16

yes.

<@trix:fedora.im>

18:27:24

i am writing up proposals now.

<@tflink:fedora.im>

18:27:45

i.e. me, Tom Rix and Mohammadreza Hendiani write up at least summaries of what we're talking about for broader discussion

<@tflink:fedora.im>

18:28:15

Tom Rix and Mohammadreza Hendiani , any objection to this?

<@trix:fedora.im>

18:28:21

nope

<@man2dev:fedora.im>

18:28:28

nope

<@tflink:fedora.im>

18:28:38

shall we post the proposals to discourse?

<@tflink:fedora.im>

18:28:57

I don't think that matrix is the best place for the raw proposals

<@tflink:fedora.im>

18:29:01

or wiki?

<@tflink:fedora.im>

18:29:05

how about this:

<@trix:fedora.im>

18:29:12

wiki .

<@man2dev:fedora.im>

18:29:39

we can just post the wiki to discourse

<@tflink:fedora.im>

18:29:59

!info tflink, trix and man2dev will write up brief proposals detailing what they want to see happen before the next meeting where we can discuss the proposals more as a group

<@tflink:fedora.im>

18:30:23

!info proposals need to be in a non-matrix format; wiki or discourse is encouraged

<@tflink:fedora.im>

18:30:39

does that make sense and mesh with what y'all were understanding?

<@trix:fedora.im>

18:31:04

sure, and if it doesn't we know each other

<@tflink:fedora.im>

18:31:49

yeah, it doesn't have to be formal - I just think it's easier to read more detailed proposals in a less fluid medium :)

<@tflink:fedora.im>

18:32:02

ok. one last quick topic. I know we're over time

<@tflink:fedora.im>

18:32:06

!topic next meeting date

<@trix:fedora.im>

18:32:25

skip to new year

<@tflink:fedora.im>

18:32:35

as Tom Rix mentioned, the next two meeting times are close to holidays for some of us

<@tflink:fedora.im>

18:32:49

the next meetings would be on 2024-12-19 and 2025-01-02

<@trix:fedora.im>

18:32:58

skip

<@tflink:fedora.im>

18:33:12

do we want to skip one of them? both? it sounds like both

<@trix:fedora.im>

18:33:14

both, is my vote

<@tflink:fedora.im>

18:33:41

do we want to schedule something that isn't on the regular "every 2 weeks on Thursday" schedule or just wait for 2025-01-16?

<@man2dev:fedora.im>

18:33:48

both seems fine. if we really need to talk something out, the group is there

<@tflink:fedora.im>

18:33:58

yeah, that makes sense

<@tflink:fedora.im>

18:34:44

!info the next two ai-ml-sig meetings (2024-12-19 and 2025-01-02) are canceled. The next ai-ml-sig meeting will be on 2025-01-16

<@tflink:fedora.im>

18:35:05

!info if there are topics to discuss, folks are around on discourse and matrix

<@tflink:fedora.im>

18:35:10

!topic open floor

<@tflink:fedora.im>

18:35:18

anything else before we close out the last meeting of the year?

<@trix:fedora.im>

18:35:46

been a great year, thanks guys!

<@man2dev:fedora.im>

18:36:28

👌 have a great new year

<@tflink:fedora.im>

18:36:43

alrighty, then

<@tflink:fedora.im>

18:36:53

have a great new year, everyone

<@tflink:fedora.im>

18:37:14

thanks for attending and participating

<@tflink:fedora.im>

18:37:16

!endmeeting