18:03:28 #startmeeting
18:03:28 Meeting started Fri May 3 18:03:28 2013 UTC. The chair is davej. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:03:28 Useful Commands: #action #agreed #halp #info #idea #link #topic.
18:03:28 #meetingname Fedora Kernel meeting
18:03:28 #meetingtopic Fedora Kernel meeting
18:03:28 #addchair jforbes jwb
18:03:28 The meeting name has been set to 'fedora_kernel_meeting'
18:03:39 woo woo.
18:03:48 is there anybody out there ?
18:03:51 no.
18:04:02 yup
18:04:16 * brunowolff is here
18:04:35 alrighty. let's start as usual with the state of the trees
18:04:53 let's do f17/f18 as one, because they're pretty much the same thing still
18:05:01 start with them or rawhide?
18:05:09 the former
18:05:14 ok
18:05:21 * nirik is lurking around.
18:05:30 Not much to speak of in F17/F18 at the moment, we need to move to 3.9 soonish
18:05:34 currently on 3.8.11. 3.9 got released last weekend. we'll start on a rebase next week.
18:05:43 probably coinciding with 3.9.1
18:05:57 I suppose it might be worth noting that by some miracle F17 got enough karma to push a kernel without manual intervention
18:06:14 So thanks to those of you testing and giving karma there
18:07:12 lots of old 17/18 bugs starting to get closed out through inactivity. I'm not thrilled at closing some of them without resolution, but if the reporters have gone away, we don't have a lot of choice
18:08:11 anything else on 17/18 ?
18:08:18 nothing else here
18:08:31 ok. next up.. 19
18:08:49 * nirik does have a f17 vm for testing, but I'm usually too busy to remember
18:08:54 so f19 is on 3.9.0 at the moment. it will continue on the 3.9.y series until F19 GAs
18:08:56 if you need karma feel free to ping me
18:09:06 fairly stable at this point, so not a ton of worries
18:09:29 Beta should include some 3.9.y stable kernel
18:09:42 and debugging will remain off for good now
18:10:06 i think that's about it on f19
18:10:35 any questions?
18:10:39 19 could use a bz sweep to clear out some of the older bugs by the looks of things. 129 open right now
18:10:52 I'll have a look through some of those this afternoon
18:10:56 yeah, i haven't looked at bugzilla yet. on tap for next week
18:11:20 though a lot of those 129 were actually bugs moved to f19 from rawhide. that... wasn't helpful
18:11:59 onto rawhide?
18:12:04 probably should tag more of the rawhide bugs with the whiteboard tag I can never remember that prevents that
18:12:11 FutureFeature
18:12:22 which is a lie, but makes the scripts stop messing with things
18:12:28 that's probably why I can't remember it, it's badly named
18:12:32 yeah
18:12:36 maybe we can request a better one ?
18:12:40 and it's a keyword i think
18:13:05 probably. we can bug jreznik about it iirc
18:13:21 is that who runs the scripts to do the migration ?
18:13:39 he did this past time
18:13:48 in the past it was bugzappers, which no longer exists afaik
18:13:52 brb, doorbell.
18:14:19 any other comments/questions on f19?
18:15:04 ok, rawhide
18:15:41 rawhide is in the middle of the 3.10 merge window. i've been building quite a few kernels per day for this, mostly so that if things break we have granular snapshots of the merge window as it was progressing
18:15:52 that will hopefully make it easier to narrow down where something broke
18:16:21 the latest build is the one right before the DRM tree was merged. i have that done locally, but it breaks at least one of my machines. bisecting at the moment
18:16:25 rawhide nodebug tends to build about 1 kernel per day, I was starting builds with every rawhide build and they weren't finishing before the next was ready
18:16:45 yeah. the dedicated build machines are speedy now
18:18:07 until the DRM merge, things have been looking OK.
i know davej has seen some boot issues in the clocksource code, but i haven't on any of the machines i have
18:20:00 * jwb idles until davej gets back
18:20:58 oh, in case someone was wondering, the secure boot patchsets are not in the upstream 3.10 kernel.
18:21:01 sorry about that, had to re-sign my lease paperwork..
18:21:15 your landlord has great timing
18:21:23 indeed :)
18:21:36 * nirik has had no issues with the latest rawhide kernels here so far.
18:22:00 any other comments/questions on rawhide?
18:22:02 I'm pretty amazed I'm the only one seeing those clock bugs (apart from Yinghai)
18:22:12 every machine I try hits it (or a variant of it)
18:22:30 * nirik has a few items for open floor or whatever.
18:22:50 davej, are they old? weird bios? all one kind of CPU?
18:23:08 couple years old. intel and amd.
18:23:18 the amd is maybe 3 years old
18:23:20 strange
18:23:25 the oldest intel is from 2007
18:23:47 I just ooze gamma rays or something
18:24:03 is your .config different from fedoras?
18:24:14 it's a cut down version
18:24:26 so fedora (debug) without the drivers
18:24:42 shouldn't be vastly different then, unless you and i answered a question differently
18:24:50 actually maybe I tweaked a few other things too. I should double check
18:25:20 DEBUG_PAGEALLOC maybe
18:25:46 anyway.. that's pretty much it for the release overview I guess ?
18:26:06 think so
18:26:31 ok. let's talk a little about the writeup you did at http://fedoraproject.org/wiki/KernelBugTriage
18:27:00 for those who haven't seen this yet, Josh put some work into codifying the sort of triage activities that would be useful to us.
18:27:21 right. and spot has the aliases there to CC now
18:27:51 basically, all of the bugs need to be triaged. it will take quite a bit of effort
18:27:53 yeah, that's probably the biggest change. as you can see there are a whole bunch of new aliases, and we'll add more as necessary
18:28:34 if you need any of them changed, anyone in sysadmin-* can change them...
or I can get some or all of you added to do that too if you prefer.
18:28:49 i think i'm in sysadmin-*
18:28:58 * nirik nods. probably are.
18:29:12 do you get tons of nagios emails you filter to /dev/null ?
18:29:12 yes, i am
18:29:14 :)
18:29:17 i do!
18:30:12 if anyone has questions about anything on that triage doc, follow up on the fedora-kernel-list, and we'll try to expand on anything that's unclear
18:30:40 anything more to say about that for now ?
18:30:46 yes. because i'm sure there are things that are unclear.
18:31:50 i think that's it on triage then
18:31:55 ok. let's talk a little about the automated testing.
18:32:11 Okay
18:32:31 We have hardware, hopefully being racked soon.
18:32:44 that was one of the things I wanted to mention. ;)
18:32:46 The plan is to get things set up the week of May 13th
18:32:54 nirik: I heard there was some confusion about the drac cards or something ?
18:32:56 we have the hw, are working on getting console access so we can install them. ;)
18:33:04 Oh?
18:33:12 yeah, it seems they didn't ship with mgmt...
18:33:18 or it was unclear if they did.
18:33:45 ok, so who is chasing that up, you or spot ?
18:33:46 we will get it sorted out. We have a kvm thing that is supposed to let us have console on things with no mgmt, but its java is busted.
18:34:02 Smooge was working with the datacenter folks.
18:34:09 That's unpleasant. Console access is kind of important in this setup
18:34:12 * nirik will ask him for an update.
18:34:43 nirik: let us know if they won't be ready by week after next so we can plan accordingly
18:35:05 ok. can do.
18:35:18 feel free to ping me anytime if you want me to scare up status too.
18:35:38 you folks will need access to the mgmt if it exists?
18:35:39 jforbes: they aren't essential to the setup though right ?
18:35:48 they're only really needed once we're in production
18:36:17 davej: well, they need some sort of console access to install them
18:36:48 * nirik nods.
18:36:52 oh, duh
18:37:18 once installed tho, will they often need reinstall from mgmt?
18:37:35 nirik: no, but we need a way to get console on crashes
18:37:53 because we expect things to crash. :)
18:37:55 ok. we will have serial too.
18:38:25 nirik: yeah, serial is really what we need. Also, is there a way we can script a remote power toggle to force a reboot?
18:38:28 and power of course.
18:38:57 possibly...
18:39:18 power is 2 apc units you have to ssh into... and serial is a ssh or web login.
18:39:31 nirik: it would be good to have each machine monitor the other and reset it if it dies without intervention. Then it can send us what was on the console
18:39:39 Oh, ssh could work
18:39:55 as long as it's ok with passwordless keys
18:40:05 davej: worst case, expect
18:40:13 gross, but yeah
18:40:15 it's not sadly.
18:40:25 hmm. ok. that's something else for the TODO
18:40:27 like I said, worst case.
18:40:37 and also other machines will be on those power units, so probably we could get you guys access, but I wouldn't want a script to contain the passwords.
18:40:48 there's also always watchdog. ;)
18:40:56 jforbes, wait, expect like 'expect(1)' ?
18:41:35 jwb: expect like tcl
18:41:42 worst case, the 2nd machine can notice the 1st isn't up and send a mail so we manually intervene
18:41:44 anyhow, once we get console we will ping you on how they are really configured and such
18:41:56 jforbes, yeah, that's what i meant. giving me LTP flashbacks
18:42:00 nirik: excellent
18:42:38 jwb: yeah, I last wrote an expect script in 1998, but I think I still have it around, and it was specifically to ssh into something and run commands, so it is reusable if I can find it
18:43:37 So once all of that is in place, we should have automated testing working within the week of May 13th for every kernel build
18:44:09 so that's installing and rebooting to new kernel ?
18:44:14 and running some tests?
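[Editor's note: the mutual-watchdog scheme discussed above, where each test box pings its peer, power-cycles it through the APC unit if it stays down, and resets it so testing resumes, could look roughly like the sketch below. This is hypothetical: the hostnames, outlet number, and the APC command are invented placeholders, and since the APC units don't currently allow passwordless keys, the ssh step might end up as an expect script instead.]

```python
#!/usr/bin/env python
# Hypothetical sketch of the mutual-watchdog idea from the meeting: ping the
# peer test machine, and if it has been unreachable for too long, power-cycle
# it via ssh to the APC power unit. All names below are made-up placeholders.
import subprocess
import time

PEER = "kerneltest02.example.org"   # hypothetical peer hostname
APC = "apc1.example.org"            # hypothetical APC power unit
OUTLET = "8"                        # hypothetical outlet for the peer
DEAD_AFTER = 3600                   # seconds of silence before we act

def peer_is_up(host):
    """Single ICMP probe; True if the peer answered within 5 seconds."""
    return subprocess.call(["ping", "-c", "1", "-W", "5", host]) == 0

def should_power_cycle(last_seen, now, dead_after=DEAD_AFTER):
    """Pure decision logic: act only once the peer has been silent too long."""
    return (now - last_seen) >= dead_after

def power_cycle(outlet):
    # Assumes passwordless ssh to the APC unit and a hypothetical "reboot"
    # command; nirik noted passwordless keys are NOT currently allowed, so
    # in practice this might have to drive an interactive login via expect.
    subprocess.call(["ssh", "apc@" + APC, "reboot", outlet])

def watch():
    last_seen = time.time()
    while True:
        now = time.time()
        if peer_is_up(PEER):
            last_seen = now
        elif should_power_cycle(last_seen, now):
            power_cycle(OUTLET)
            last_seen = now  # reset so we don't cycle the peer repeatedly
        time.sleep(60)
```

Only `should_power_cycle()` carries the decision logic; everything around it is glue that would need adapting to the real serial-console and power setup once nirik reports how the machines are actually configured.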
18:44:27 https://fedoraproject.org/wiki/KernelTestingInitiative has the details, though I will be updating that page with current status shortly
18:44:46 nirik: Yes
18:44:54 cool. :)
18:44:55 jforbes: we can probably just share the work-in-progress plan that we've been doing the last couple days once we've got that filled out a bit more
18:45:17 that should give nirik the background of what we're needing
18:45:23 nirik: also using several guests, but the hosts install and test too
18:45:23 we can also monitor the machines with nagios, but if they reboot a lot we will need to make sure the timeouts are high
18:45:32 davej: correct
18:45:50 nirik: yeah, they will reboot almost daily, sometimes more than once a day. high timeout is fine
18:46:29 yeah, so that might be another option... just have a ping check, if they don't respond for an hour or something high, power cycle or alert you
18:47:58 nirik: sure, it would just be nice to have something catch the console and alert us. Even better if it can do that, then reboot it. So we get the relevant info, but it can go back to testing
18:48:12 * nirik nods.
18:48:58 Overall I am rather excited to get this working, I hope it will catch bugs before users ever see them
18:49:51 yeah, sounds good.
18:51:27 jforbes: how different is the cloud testing stuff going to be to this ?
18:51:46 is that just the same thing but single-host ?
18:52:28 davej: none at all, the regression test harness box will be dynamically starting an EC2 instance to run the regression suite just like any other virtual machine
18:53:10 ok, nice
18:54:05 anything else on this for now ?
18:54:14 The EC2 instance runs the test then shuts down, if there are problems with the boot, we should get a message from the harness, if there are problems with the actual tests, we should get an email from the EC2 instance
18:54:18 nothing else here
18:54:46 if you like... we could possibly get you access to fire off an openstack instance too in our private cloud.
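[Editor's note: the ping-check idea nirik floats above, not alerting unless a box has been unreachable for around an hour, might look something like this as a Nagios service definition. Host and contact-group names are placeholders, and the thresholds are a guess at "something high"; with the default 60-second interval_length, 6 attempts at 10-minute retries give roughly an hour of unreachability before anyone is notified.]

```cfg
# Hypothetical Nagios check for a kernel test box that reboots daily.
define service {
    host_name             kerneltest01           ; placeholder host
    service_description   PING
    check_command         check_ping!3000.0,80%!5000.0,100%
    check_interval        10                     ; minutes between checks
    retry_interval        10                     ; minutes between retries
    max_check_attempts    6                      ; ~1 hour down before HARD state
    notification_interval 120
    contact_groups        kernel-test-admins     ; placeholder contact group
}
```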
18:55:02 that's using KVM, right?
18:55:12 nirik: wouldn't hurt
18:55:26 yep
18:55:29 kvm...
18:55:40 it wouldn't, no, but we have KVM covered with the existing setup. i suggested AWS/EC2 because it's xen guests
18:55:45 nirik: we can look at that after the rest is set up, we are testing kvm locally
18:55:57 sure. just something to drop on the todo list...
18:56:30 something I'd like to get to eventually would be specialised guest images too
18:56:50 davej, meaning?
18:57:02 davej: certainly possible, we have the storage, just not the memory for running too many concurrent guests
18:57:07 for eg, we have one bug open right now with a user of a virt machine running hadoop or something. I know nothing about how that sort of thing is set up, but if we have prepackaged reproducer cases like that, adding them to the mix would be useful I think
18:57:24 ah, yeah. good idea
18:57:52 I do my best to ignore "misc java bonghits", but we seem to get quite a few bugs of that sort of thing
18:58:14 That kind of stuff is a logical extension, easy to work out
18:58:16 might just be that those things are just pushing the vm a lot more when they suck up lots of ram
18:58:44 jforbes: yeah, I've deliberately not added that to the plan for now, just to limit scope
18:58:58 Anything else on the testing?
18:59:38 think we're done
18:59:52 filled the whole hour, good meeting everyone.
18:59:59 #endmeeting