18:00:51 #startmeeting kernel
18:00:51 Meeting started Fri Mar 2 18:00:51 2012 UTC. The chair is jforbes. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:00:51 Useful Commands: #action #agreed #halp #info #idea #link #topic.
18:01:07 #chair djones
18:01:07 Current chairs: djones jforbes
18:01:10 alright, let's get started, because I think this is going to be a full hour.
18:01:44 #topic common bugs
18:01:58 Shall we start with hibernate?
18:02:09 yeah, sure
18:02:33 so couple things here.
18:03:01 the number of ways this fails is impressive. We've got a bunch of different problems, of varying severity
18:03:16 I'm actually surprised it works for anyone at all judging by the bugs we've been getting
18:03:35 * nirik doesn't think too many people use it anymore.
18:03:54 of biggest concern I think is the apparent memory corruption bugs that seem to appear in the cases where it "works".
18:04:10 works on hibernate, doom on resume
18:04:15 nirik: Even the people who do use it seem to report only occasional success
18:05:02 Eric Sandeen has been staring at a bunch of those bugs for a while (because it usually manifests as ext* failures)
18:05:26 but I think that's just because most of memory is dcache, so a scribble is just more likely to land there.
18:05:57 Stanislav found one case where it's solved if you disable modesetting on i915
18:06:06 too bad we have reports of it happening on !915 hardware
18:06:24 could be similar problems across varying hardware though. basically dma
18:06:53 so I have three hibernate specific changes in mind, which I don't think will be too controversial.
18:07:40 1. we disable the recent change to make compression threaded. Unrelated to the corruption, but might make some of the other "it stopped working" bugs go away.
18:08:17 davej, disable compression entirely, or just the threaded aspect?
18:08:28 if that does make it work again for anyone, we'll take that upstream and let them figure it out.
(Though the hazard here is that they'll hit other hibernate bugs instead)
18:08:34 jwb: threaded
18:08:39 hm, k
18:09:10 jwb: Might be good to start one at a time, threaded is the higher risk of fail, so disable it. If that doesn't fix it for anyone, we can try to disable compression altogether maybe?
18:09:16 disabling compression entirely will likely expose other fun bugs if everything doesn't fit in swap
18:09:31 davej, there's checks for that and it'll bail on hibernate
18:09:36 theoretically
18:09:36 ok
18:09:41 jforbes, makes sense, yes
18:09:46 well, let's do what jforbes suggested
18:09:50 sure
18:10:10 #action disable threaded compression in hibernate code
18:10:47 second suggestion, is that we add a taint flag for when we hibernate, so we can tell from weird bugs without having to do a round-trip to the reporter
18:10:55 I think this is a no-brainer.
18:11:14 yes, just need to make sure abrt doesn't auto-block it
18:11:23 good point
18:11:41 It also has the advantage of letting us see if a class of bugs that looks completely unrelated only appears when tainted with hibernate
18:11:57 #action check if abrt will still file bugs if we taint on hibernate
18:12:24 final suggestion was to disable hibernate if tainted with proprietary modules.
18:12:36 given we have enough problems on untainted kernels, I think this makes sense.
18:13:07 i agree, but with the abrt stuff we shouldn't be getting reports on those machines anyway
18:13:32 we don't get oopses, but we get "hibernate doesn't work"
18:13:45 then five comments in we find out they're using vbox and nvidia
18:13:51 I don't see a problem with this, though honestly we will probably get just as many "hibernate doesn't work" bugs
18:14:02 true. i guess we'll find out if they change it to "hibernate is disabled" ;)
18:14:30 davej, similarly disable it if taint W or D (if it doesn't already block that) ?
18:14:58 maybe a boot-option to re-enable it. so they're aware there may be problems after reading dmesg..
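The taint proposals above (mark the kernel once it has hibernated, refuse to hibernate when tainted by proprietary modules or by W/D, and offer an override) can be modelled in a few lines. This is only a user-space sketch of the policy, not kernel code: the letters P, W and D follow the kernel's taint letters, while the `H` flag, the function names and the `ignore_taint` parameter are hypothetical names chosen for illustration.

```python
# Illustrative model only. P, W and D mirror kernel taint letters; the "H"
# flag, the function names and ignore_taint are hypothetical.
PROPRIETARY = "P"   # a proprietary module has been loaded
OOPS = "D"          # the kernel has oopsed
WARNING = "W"       # a warning has been issued
HIBERNATED = "H"    # hypothetical flag: the system has hibernated at least once

# Taints that would block hibernate under the proposal discussed above.
BLOCKING = {PROPRIETARY, OOPS, WARNING}


def may_hibernate(taints, ignore_taint=False):
    """Return (allowed, reason) for a hibernate request."""
    blockers = sorted(set(taints) & BLOCKING)
    if blockers and not ignore_taint:
        return False, "hibernate is disabled: tainted " + "".join(blockers)
    return True, "ok"


def hibernate(taints, ignore_taint=False):
    """Attempt to hibernate; on success, record the hibernate taint."""
    allowed, reason = may_hibernate(taints, ignore_taint)
    new_taints = set(taints)
    if allowed:
        new_taints.add(HIBERNATED)
    return allowed, reason, new_taints
```

With this shape, a bug report carrying the hypothetical `H` flag shows at a glance that the machine had hibernated, which is the round-trip the proposal is trying to avoid.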
18:15:07 i'm guessing the only plausible taint we'd allow is G
18:15:30 I think so
18:15:36 hibernate=ignore-taint
18:16:43 ok, any further thoughts on this ?
18:16:57 yes
18:17:02 so for tracking purposes
18:17:09 Who's going to implement?
18:17:24 do we want to use the "disable hibernate" bug as a tracker for any bugs that turn out to be hibernate related?
18:17:24 jforbes: I'll do the tainting stuff
18:17:38 figured. davej takes the fun stuff
18:17:41 ;)
18:17:48 heh
18:17:53 jforbes, you want to do the disable threaded change?
18:18:04 jwb: happy to
18:18:05 k
18:18:08 the threaded thing is just a define change afaik
18:18:13 yeah
18:18:19 jwb: re tracker, yeah, I think that makes sense
18:18:38 ok. i can go through bugzilla and make them block that
18:18:43 just need to find it again
18:20:10 longer term, we need to figure out what exactly is causing this corruption, which I think is a nice lead into the next topic.
18:21:36 #action davej to taint hibernate
18:21:50 #action jforbes to disable threading in hibernate compression
18:22:20 So moving on to page table corruption bugs...
18:22:33 #action jwb to troll through bugzilla and add bugs to hibernate blocker
18:22:34 yeah, those are really 'fun'.
18:22:55 I've been staring at those most of this week, and not really coming to any conclusion other than something is horribly wrong. (duh)
18:23:16 the only suggestion I have right now is that we enable CONFIG_DEBUG_VM in the production kernels.
18:23:26 it's pretty low impact (mostly just a ton more BUG_ONs)
18:23:39 davej: to clarify, the production debug kernels only right?
18:23:45 no, the non-debug ones
18:24:13 it's already on in the debug builds
18:25:08 I posted about these problems upstream, and got no response at all.
18:25:36 I'm going to dig a little deeper, and see if I can get my head around some of the locking rules in mm/, and maybe I'll bring it up directly with Linus/Andrew.
18:25:37 davej: I saw that.
I don't think the performance hit is too bad on it
18:26:01 yeah, it's nowhere near as bad as slab debug or lockdep
18:26:24 it might show up on benchmarks or something, but at this point, I think reliability is a bigger concern
18:26:54 in everyday use, I doubt people will even notice it's on (unless they trip the BUG_ON's)
18:27:11 jwb: any thoughts ?
18:27:17 People who really care about performance can continue to run older kernels if everything was working for them. And we can turn it back off once some of this is figured out?
18:27:49 yeah, sounds ok
18:27:50 i'm certainly OK with it
18:28:13 if we get a flood of reports from those BUG_ONs though, we'll leave it on until it's sorted out
18:28:48 this is to hopefully catch the 'weird root name' bugs, or the bad state bugs, or both?
18:28:48 Sure, as long as we are getting reports from it, we need to keep it on
18:29:22 both, and maybe other unexplained things.
18:29:28 we have a lot of linked list corruptor bugs too
18:29:48 many of those are also (surprise) dcache lists
18:30:00 ok, cool. we kinda merged the 2 topics in the agenda then
18:30:23 yeah, a little. mostly because we don't really have a handle on what's going on.
18:30:31 it could all be the same issue for all we know right now
18:31:10 I'm going to follow up with viro too on his mmap locking spree, and see what he's turned up. who knows, might be something related.
18:31:49 the thing that really bothers me though, is that none of us are able to reproduce these bugs.
18:32:08 yes :\
18:32:28 i tried looking for really oddball modules or something, and nothing seems all that wacky
18:32:31 I put some work into my syscall fuzzing tool to make it focus on vm related operations, but it's not turned up anything yet (though it still needs improvement)
18:33:10 after noticing so many reports had 'chrome' as their process, I even tried running that for a while instead of firefox.
18:33:53 there's still a part of me wondering if any of these bugs are crappy hardware. but there's so many of them..
18:34:37 do we autoload the edac stuff?
18:34:50 davej: actually I run chromium and haven't seen that here either
18:35:26 jwb: should do, though doesn't that need ECC memory ?
18:35:52 well, for the memory controllers, yes. i thought there were other modules for various pieces of hardware though
18:36:12 cpu, pci bridges, etc
18:36:23 anyway, probably only viable on higher end hardware anyway
18:37:13 something else that I thought of this week.. a lot of times, we can scribble over something in memory, and it could be a while before something walks that list or whatever, and hits it. I wonder if it'd be worthwhile adding some kind of thread that does periodic integrity checks to the debug build.
18:37:48 ie, every N minutes, wake up, and walk various lists end to end
18:38:50 davej: not a bad idea; sort of akin to forcing ecc refresh, but you don't need to randomly walk all of ram for it.
18:39:05 pjones: right
18:39:10 davej: hardest bit is that you're effectively talking about walking lists lockless
18:39:19 which means you're going to take some faults sometimes.
18:39:51 you'd have to know all of the lists too, right?
18:40:16 jwb: you could make them register when they're initialized, but yeah.
18:40:17 just some of the important ones.
18:40:32 Or some of the largest
18:40:37 right
18:41:01 I suppose if you're registering them you can register locks as well
18:42:25 so what are you going to find with this thread other than the same corruption earlier?
18:42:44 jwb: earlier is the idea; hopefully that'll help narrow it down as to what causes it
18:42:48 I think it makes sense in that it might find things closer to the cause and
18:42:55 how?
18:43:07 if nothing else it gives a better answer to "what were you doing just before the crash?"
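The periodic integrity check floated above (lists register themselves together with their lock, and a background thread wakes up every so often and walks each one end to end) could look roughly like the following. It is a user-space Python model under stated assumptions, not kernel code: the `Node` class imitates a circular doubly linked list in the style of the kernel's `list_head`, and every name here (`register_list`, `check_list`, `check_all`, `start_periodic_check`) is hypothetical.

```python
import threading


class Node:
    """Element of a circular doubly linked list (modelled on list_head)."""
    def __init__(self, name):
        self.name = name
        self.next = self
        self.prev = self


def list_add(new, head):
    """Insert new immediately after head."""
    new.next = head.next
    new.prev = head
    head.next.prev = new
    head.next = new


_registry = []  # (label, head, lock) tuples, filled in by register_list()


def register_list(label, head, lock):
    """Register a list and the lock that protects it for later checking."""
    _registry.append((label, head, lock))


def check_list(head, limit=1_000_000):
    """Walk the list once; describe the first bad link, or return None."""
    node = head
    for _ in range(limit):
        nxt = node.next
        # In a healthy list, the successor always points back at us.
        if nxt is None or nxt.prev is not node:
            return "bad link after %s" % node.name
        node = nxt
        if node is head:
            return None
    return "list does not return to head"


def check_all():
    """Check every registered list under its own lock; return problems found."""
    problems = []
    for label, head, lock in _registry:
        with lock:
            err = check_list(head)
        if err:
            problems.append((label, err))
    return problems


def start_periodic_check(interval_seconds, report):
    """Run check_all() every interval_seconds on a daemon timer thread."""
    def tick():
        for problem in check_all():
            report(problem)
        start_periodic_check(interval_seconds, report)
    timer = threading.Timer(interval_seconds, tick)
    timer.daemon = True
    timer.start()
    return timer
```

Registering the lock alongside the list sidesteps the lockless-walk concern raised above, at the cost of the checker contending with real users of the list. The checker only reports where a link is broken, not who broke it.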
18:43:19 if you're walking it in a kthread and dump the stack trace, it's not going to point at all to what the list was being used for...
18:43:59 jwb: the idea here is that you're seeing corruption infrequently, which implies that it's caused by some infrequent, and thus probably manually triggered, action.
18:43:59 jwb: the assumption is the user of the list is not the one trashing it
18:44:22 if we wanted to be really fancy, you could theoretically log the last operation to the list, and dump that.
18:44:43 yes, i know that. but again, if you're just walking the lists you aren't going to have any idea who corrupted it
18:44:50 davej, yeah, that's what i'm getting at
18:44:54 davej: unlikely to be the problem. more likely the last transaction to whatever the /previous/ thing to get that address from the allocator.
18:45:22 man, that didn't really parse as english.
18:45:34 I get your point
18:46:52 might be worth trying to categorize the /kind/ of corruption better.
18:47:18 It might be more interesting to get a full process list if we see corruption, perhaps there is something that all reporters are doing which we are not, that trips it
18:47:37 jforbes: or even a crash dump
18:47:38 that'd be easy to add
18:47:43 davej, that'd be hard :\
18:47:49 kdump is busted on f16
18:47:54 jwb: yeah, for shame.
18:47:59 f17 is the new target, so there's hope
18:48:18 these sorts of bugs I think are exactly the sort of case that we need that stuff working
18:48:18 jforbes: yeah, I'm thinking along the lines of the ECC... analogy? What if all the reporters buy even crappier ram than what's in an SDV, etc.
18:49:04 pjones: possible, but it seems unlikely that hardware is the cause
18:49:32 anyway, i'm not saying walking the lists isn't a good idea. i'm just not sure we're going to glean a bunch of insight from it.
walk it, dump stack, abrt/user reports bug, we ask "what were you doing when this hit" and most of the time the comment will still be "i don't know"
18:49:38 sure. more likely some bad driver that's rarely used or some code that uses some obscure kernel feature that davej will turn off just as soon as we ask him to ;)
18:50:00 turn them all off. enable only when they've been proven correct mathematically
18:50:06 heh
18:51:48 so I think we're more or less done with the agenda
18:51:58 #topic open floor
18:52:14 Anyone?
18:52:17 we briefly talked about this rcu/tracing lockdep/panic earlier
18:52:18 the only other thing I have was a mail I got an hour or so ago asking if we have any ideas for GSoC stuff for the kernel
18:52:31 I might toss some suggestions on https://fedoraproject.org/wiki/Summer_coding_ideas_for_2012 later
18:52:53 jwb: ah yes, thanks for the reminder
18:53:00 what to do about that..
18:53:16 so backporting 50 patches to 15/16 seems kinda crazy
18:53:33 the warning only hits in 3.3
18:53:34 davej: these are patches queued for 3.4 right?
18:53:45 jforbes: yeah
18:53:59 duh, yes. 17.
18:54:23 the problem is in 15/16, but it won't spew. it won't spew in 17 after debugging is disabled either
18:54:24 I'm leaning towards saying we just ignore it for now, and then when we get to beta the warning goes away anyway (except for people running -debug)
18:55:02 maybe by the time we rebase 15/16 to 3.3 something will have made its way back to -stable
18:55:06 That seems a more acceptable solution, we aren't dealing with a regression here, just more information
18:55:09 pick one of the bugs and close it as UPSTREAM, dupe the rest to that?
18:55:19 jwb: yeah
18:55:23 k. i'll do that too
18:55:31 jforbes: right, it's just bugzilla spam if we do nothing
18:55:47 #action jwb to clean up the "powertop rcu spew bugs"
18:56:02 ok, anything else ?
18:56:17 oh, yeah
18:56:26 jforbes, welcome to the team officially and junk ;)
18:56:36 heh, thanks
18:56:40 oh yeah :)
18:57:04 Okay, leave it open for 60 seconds to see if anyone has anything else....
18:57:19 * gholms raises hand
18:57:25 go ahead gholms
18:57:53 Any suggestions on how an average user like me can help diagnose issues with wireless connections dropping?
18:58:23 * gholms appears to be seeing https://bugzilla.redhat.com/show_bug.cgi?id=767855
18:59:20 only suggestion I have is to try mailing the driver maintainers.
18:59:46 Ok
18:59:50 going forward, it should be less confusing to tell them which version you're using
19:00:01 the compat-wireless stuff is going to be dropped
19:00:15 Oh, that's good to know.
19:00:34 * rbergeron yawns
19:00:40 What is the time frame for that? F17?
19:00:43 Okay, so that's about all the time we have before rbergeron starts her cloud burst
19:00:50 hey look, jforbes is here for the cloud meeting
19:00:56 rbergeron: lulz
19:00:56 OH GOD SO PUNNY
19:01:02 gholms, it's off in f17 already. f16 will drop it when it rebases to 3.3
19:01:05 * rbergeron hands over her Chief-Pun-Master badge to jforbes
19:01:12 Thanks, everyone!
19:01:19 Thanks for showing up everyone
19:01:22 #endmeeting