#fedora-meeting log

18:00:51 <jforbes> #startmeeting kernel
18:00:51 <zodbot> Meeting started Fri Mar  2 18:00:51 2012 UTC.  The chair is jforbes. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:00:51 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
18:01:07 <jforbes> #chair djones
18:01:07 <zodbot> Current chairs: djones jforbes
18:01:10 <davej> alright, lets get started, because I think this is going to be a full hour.
18:01:44 <jforbes> #topic common bugs
18:01:58 <jforbes> Shall we start with hibernate?
18:02:09 <davej> yeah, sure
18:02:33 <davej> so couple things here.
18:03:01 <davej> the number of ways this fails is impressive. We've got a bunch of different problems, of varying severity
18:03:16 <davej> I'm actually surprised it works for anyone at all judging by the bugs we've been getting
18:03:35 * nirik doesn't think too many people use it anymore.
18:03:54 <davej> of biggest concern I think is the apparent memory corruption bugs that seem to appear in the cases where it "works".
18:04:10 <jwb> works on hibernate, doom on resume
18:04:15 <jforbes> nirik: Even the reports of people who do use it, seem to report occasional success
18:05:02 <davej> Eric Sandeen has been staring at a bunch of those bugs for a while (because it usually manifests as ext* failures)
18:05:26 <davej> but I think that's just because most of memory is dcache, so a scribble is just more likely to land there.
18:05:57 <davej> Stanislav found one case where it's solved if you disable modesetting on i915
18:06:06 <davej> too bad we have reports of it happening on !915 hardware
18:06:24 <jwb> could be similar problems across varying hardware though.  basically dma
18:06:53 <davej> so I have three hibernate specific changes in mind, which I don't think will be too controversial.
18:07:40 <davej> 1. we disable the recent change to make compression threaded.  Unrelated to the corruption, but might make some of the other "it stopped working" bugs go away.
18:08:17 <jwb> davej, disable compression entirely, or just the threaded aspect?
18:08:28 <davej> if that does make it work again for anyone, we'll take that upstream and let them figure it out.  (Though the hazard here is that they'll hit other hibernate bugs instead)
18:08:34 <davej> jwb: threaded
18:08:39 <jwb> hm, k
18:09:10 <jforbes> jwb: Might be good to start one at a time, threaded is the higher risk of fail, so disable it.  If that doesn't fix it for anyone, we can try to disable compression altogether maybe?
18:09:16 <davej> disabling compression entirely will likely expose other fun bugs if everything doesn't fit in swap
18:09:31 <jwb> davej, there's checks for that and it'll bail on hibernate
18:09:36 <jwb> theoretically
18:09:36 <davej> ok
18:09:41 <jwb> jforbes, makes sense, yes
18:09:46 <davej> well, lets do what jforbes suggested
18:09:50 <jwb> sure
18:10:10 <davej> #action disable threaded compression in hibernate code
18:10:47 <davej> second suggestion, is that we add a taint flag for when we hibernate, so we can tell from weird bugs without having to do a round-trip to the reporter
18:10:55 <davej> I think this is a no-brainer.
18:11:14 <jwb> yes, just need to make sure abrt doesn't auto-block it
18:11:23 <davej> good point
18:11:41 <jforbes> It also has the advantage of letting us see if a class of bugs that looks completely unrelated only appears when tainted with hibernate
18:11:57 <davej> #action check if abrt will still file bugs if we taint on hibernate
18:12:24 <davej> final suggestion was to disable hibernate if tainted with proprietary modules.
18:12:36 <davej> given we have enough problems on untainted kernels, I think this makes sense.
18:13:07 <jwb> i agree, but with the abrt stuff we shouldn't be getting reports on those machines anyway
18:13:32 <davej> we don't get oopses, but we get "hibernate doesn't work"
18:13:45 <davej> then five comments in we find out they're using vbox and nvidia
18:13:51 <jforbes> I don't see a problem with this, though honestly we will probably get just as many "hibernate doesn't work" bugs
18:14:02 <jwb> true.  i guess we'll find out if they change it to "hibernate is disabled" ;)
18:14:30 <jwb> davej, similarly disable it if taint W or D (if it doesn't already block that) ?
18:14:58 <davej> maybe a boot-option to re-enable it. so they're aware there may be problems after reading dmesg..
18:15:07 <jwb> i'm guessing the only plausible taint we'd allow is G
18:15:30 <davej> I think so
18:15:36 <jwb> hibernate=ignore-taint
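The taint policy discussed above can be sketched roughly as follows. This is a hypothetical userspace illustration, not kernel code: the function name `hibernate_allowed` is made up here, `hibernate=ignore-taint` is only the option name floated in the meeting, and the separate "tainted by hibernate" flag proposed earlier is not modeled.

```python
# Hypothetical policy sketch, not kernel code.
# Kernel taint letters used in the discussion:
#   G = only GPL modules loaded, P = proprietary module loaded,
#   W = a warning was issued,    D = the kernel oopsed.
# The taint string pads unset positions with spaces, so a space is harmless.
ALLOWED_TAINTS = frozenset("G ")


def hibernate_allowed(taint_flags: str, cmdline: str = "") -> bool:
    """Return True if hibernation would be permitted under the proposed policy."""
    # Assumed override: the boot option name suggested in the meeting.
    if "hibernate=ignore-taint" in cmdline.split():
        return True
    # Only G (or no taint at all) is acceptable; P, W, D and anything else block.
    return set(taint_flags) <= ALLOWED_TAINTS


if __name__ == "__main__":
    for flags in ("G", "P", "G W", "PD"):
        print(repr(flags), hibernate_allowed(flags))
```

Treating the override as an explicit opt-in keeps the default conservative while still leaving a way back for users who have read dmesg and accept the risk.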
18:16:43 <davej> ok, any further thoughts on this ?
18:16:57 <jwb> yes
18:17:02 <jwb> so for tracking purposes
18:17:09 <jforbes> Who's going to implement?
18:17:24 <jwb> do we want to use the "disable hibernate" bug as a tracker for any bugs that turn out to be hibernate related?
18:17:24 <davej> jforbes: I'll do the tainting stuff
18:17:38 <jwb> figured.  davej takes the fun stuff
18:17:41 <jwb> ;)
18:17:48 <davej> heh
18:17:53 <jwb> jforbes, you want to do the disable threaded change?
18:18:04 <jforbes> jwb: happy to
18:18:05 <jwb> k
18:18:08 <davej> the threaded thing is just a define change afaik
18:18:13 <jwb> yeah
18:18:19 <davej> jwb: re tracker, yeah, I think that makes sense
18:18:38 <jwb> ok.  i can go through bugzilla and make them block that
18:18:43 <jwb> just need to find it again
18:20:10 <davej> longer term, we need to figure out what exactly is causing this corruption, which I think is a nice lead into the next topic.
18:21:36 <jforbes> #action davej to taint hibernate
18:21:50 <jforbes> #action jforbes to disable threading in hibernate compression
18:22:20 <jforbes> So moving on to page table corruption bugs...
18:22:33 <jwb> #action jwb to troll through bugzilla and add bugs to hibernate blocker
18:22:34 <davej> yeah, those are really 'fun'.
18:22:55 <davej> I've been staring at those most of this week, and not really coming to any conclusion other than something is horribly wrong. (duh)
18:23:16 <davej> the only suggestion I have right now is that we enable CONFIG_DEBUG_VM in the production kernels.
18:23:26 <davej> it's pretty low impact (mostly just a ton more BUG_ONs)
18:23:39 <jforbes> davej: to clarify, the production debug kernels only right?
18:23:45 <davej> no, the non-debug ones
18:24:13 <davej> it's already on in the debug builds
18:25:08 <davej> I posted about these problems upstream, and got no response at all.
18:25:36 <davej> I'm going to dig a little deeper, and see if I can get my head around some of the locking rules in mm/, and maybe I'll bring it up directly with Linus/Andrew.
18:25:37 <jforbes> davej: I saw that.  I don't think the performance hit is too bad on it
18:26:01 <davej> yeah, it's nowhere as bad as slab debug or lockdep
18:26:24 <davej> it might show up on benchmarks or something, but at this point, I think reliability is a bigger concern
18:26:54 <davej> in everyday use, I doubt people will even notice it's on (unless they trip the BUG_ON's)
18:27:11 <davej> jwb: any thoughts ?
18:27:17 <jforbes> People who really care about performance can continue to run older kernels if everything was working for them. And we can turn it back off once some of this is figured out?
18:27:49 <davej> yeah, sounds ok
18:27:50 <jwb> i'm certainly OK with  it
18:28:13 <davej> if we get a flood of reports from those BUG_ONs though, we'll leave it on until it's sorted out
18:28:48 <jwb> this is to hopefully catch the 'weird root name' bugs, or the bad state bugs, or both?
18:28:48 <jforbes> Sure, as long as we are getting reports from it, we need to keep it on
18:29:22 <davej> both, and maybe other unexplained things.
18:29:28 <davej> we have a lot of linked list corruptor bugs too
18:29:48 <davej> many of those are also (surprise) dcache lists
18:30:00 <jwb> ok, cool.  we kinda merged the 2 topics in the agenda then
18:30:23 <davej> yeah, a little. mostly because we don't really have a handle on what's going on.
18:30:31 <davej> it could all be the same issue for all we know right now
18:31:10 <davej> I'm going to follow up with viro too on his mmap locking spree, and see what he's turned up. who knows, might be something related.
18:31:49 <davej> the thing that really bothers me though, is that none of us are able to reproduce these bugs.
18:32:08 <jwb> yes :\
18:32:28 <jwb> i tried looking for really odd ball modules or something, and nothing seems all that wacky
18:32:31 <davej> I put some work into my syscall fuzzing tool to make it focus on vm related operations, but it's not turned up anything yet (though it still needs improvement)
18:33:10 <davej> after noticing so many reports had 'chrome' as their process, I even tried running that for a while instead of firefox.
18:33:53 <davej> there's still a part of me wondering if any of these bugs are crappy hardware. but there's so many of them..
18:34:37 <jwb> do we autoload the edac stuff?
18:34:50 <jforbes> davej: actually I run chromium and haven't seen that here either
18:35:26 <davej> jwb: should do, though doesn't that need ECC memory ?
18:35:52 <jwb> well, for the memory controllers, yes.  i thought there were other modules for various pieces of hardware though
18:36:12 <jwb> cpu, pci bridges, etc
18:36:23 <jwb> anyway, probably only viable on higher end hardware anyway
18:37:13 <davej> something else that I thought of this week.. a lot of times, we can scribble over something in memory, and it could be a while before something walks that list or whatever, and hits it. I wonder if it'd be worthwhile adding some kind of thread that does periodic integrity checks to the debug build.
18:37:48 <davej> ie, every N minutes, wake up, and walk various lists end to end
18:38:50 <pjones> davej: not a bad idea; sortof akin to forcing ecc refresh, but you don't need to randomly walk all of ram for it.
18:39:05 <davej> pjones: right
18:39:10 <pjones> davej: hardest bit is that you're effectively talking about walking lists lockless
18:39:19 <pjones> which means you're going to take some faults sometimes.
18:39:51 <jwb> you'd have to know all of the lists too, right?
18:40:16 <pjones> jwb: you could make them register when they're initialized, but yeah.
18:40:17 <davej> just some of the important ones.
18:40:32 <jforbes> Or some of the largest
18:40:37 <davej> right
18:41:01 <pjones> I suppose if you're registering them you can register locks as well
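The idea of registering lists (with their locks) and walking them periodically could look something like the sketch below. It is a userspace Python analogy under stated assumptions, not a kernel implementation: `Node`, `list_add_tail`, `register_list`, `check_list` and `check_all` are illustrative names, and the list is a circular doubly linked list in the style of the kernel's `list_head`.

```python
import threading


class Node:
    """Element of a circular doubly linked list; a lone node points at itself."""

    def __init__(self, value):
        self.value = value
        self.prev = self
        self.next = self


def list_add_tail(head, node):
    """Insert node just before head, i.e. at the tail of the list."""
    node.prev = head.prev
    node.next = head
    head.prev.next = node
    head.prev = node


def check_list(head, max_nodes=1_000_000):
    """Walk the list end to end; return None if consistent, else a description."""
    node = head
    for _ in range(max_nodes):
        nxt = node.next
        # A healthy link satisfies node.next.prev is node.
        if nxt is None or nxt.prev is not node:
            return f"broken link after node {node.value!r}"
        node = nxt
        if node is head:
            return None
    return "walk did not return to head"


_registry = []  # (name, head, lock) tuples registered at list init time


def register_list(name, head, lock):
    _registry.append((name, head, lock))


def check_all():
    """Check every registered list under its own lock; return the problems found."""
    problems = []
    for name, head, lock in _registry:
        with lock:
            err = check_list(head)
        if err:
            problems.append((name, err))
    return problems
```

A background thread would call `check_all()` every N minutes. As raised in the discussion, a hit only shows that a list was corrupted since the previous pass, not who corrupted it.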
18:42:25 <jwb> so what are you going to find with this thread other than the same corruption earlier?
18:42:44 <pjones> jwb: earlier is the idea; hopefully that'll help narrow it down as to what causes it
18:42:48 <jforbes> I think it makes sense in that it might find things closer to the cause and
18:42:55 <jwb> how?
18:43:07 <davej> if nothing else it gives a better answer to "what were you doing just before the crash?"
18:43:19 <jwb> if you're walking it in a kthread and dump the stack trace, it's not going to point at all to what the list was being used for...
18:43:59 <pjones> jwb: the idea here is that you're seeing corruption infrequently, which implies that it's caused by some infrequent, and thus probably manually triggered, action.
18:43:59 <jforbes> jwb: the assumption is the user of the list is not the one trashing it
18:44:22 <davej> if we wanted to be really fancy, you could theoretically log the last operation to the list, and dump that.
18:44:43 <jwb> yes, i know that.  but again, if you're just walking the lists you aren't going to have any idea who corrupted it
18:44:50 <jwb> davej, yeah, that's what i'm getting at
18:44:54 <pjones> davej: unlikely to be the problem. more likely the last transaction to whatever the /previous/ thing to get that address from the allocator.
18:45:22 <pjones> man, that didn't really parse as english.
18:45:34 <davej> I get your point
18:46:52 <pjones> might be worth trying to categorize the /kind/ of corruption better.
18:47:18 <jforbes> It might be more interesting to get a full process list if we see corruption, perhaps there is something that all reporters are doing which we are not, that trips it
18:47:37 <davej> jforbes: or even a crash dump
18:47:38 <jwb> that'd be easy to add
18:47:43 <jwb> davej, that'd be hard :\
18:47:49 <jwb> kdump is busted on f16
18:47:54 <davej> jwb: yeah, for shame.
18:47:59 <jwb> f17 is the new target, so there's hope
18:48:18 <davej> these sorts of bugs I think are exactly the sort of case that we need that stuff working
18:48:18 <pjones> jforbes: yeah, I'm thinking along the lines of the ECC... analogy?  What if all the reporters buy even crappier ram than what's in an SDV, etc.
18:49:04 <jforbes> pjones: possible, but it seems unlikely that hardware is the cause
18:49:32 <jwb> anyway, i'm not saying walking the lists isn't a good idea.  i'm just not sure we're going to glean a bunch of insight from it.  walk it, dump stack, abrt/user reports bug, we ask "what were you doing when this hit" and most of the time the comment will still be "i don't know"
18:49:38 <pjones> sure.  more likely some bad driver that's rarely used or some code that uses some obscure kernel feature that davej will turn off just as soon as we ask him to ;)
18:50:00 <jwb> turn them all off.  enable only when they've been proven correct mathematically
18:50:06 <davej> heh
18:51:48 <davej> so I think we're more or less done with the agenda
18:51:58 <jforbes> #topic open floor
18:52:14 <jforbes> Anyone?
18:52:17 <jwb> we briefly talked about this rcu/tracing lockdep/panic earlier
18:52:18 <davej> the only other thing I have was a mail I got an hour or so ago asking if we have any ideas for GSoC stuff for the kernel
18:52:31 <davej> I might toss some suggestions on https://fedoraproject.org/wiki/Summer_coding_ideas_for_2012 later
18:52:53 <davej> jwb: ah yes, thanks for the reminder
18:53:00 <davej> what to do about that..
18:53:16 <davej> so backporting 50 patches to 15/16 seems kinda crazy
18:53:33 <jwb> the warning only hits in 3.3
18:53:34 <jforbes> davej: these are patches queued for 3.4 right?
18:53:45 <davej> jforbes: yeah
18:53:59 <davej> duh, yes. 17.
18:54:23 <jwb> the problem is in 15/16, but it won't spew.  it won't spew in 17 after debugging is disabled either
18:54:24 <davej> I'm leaning towards saying we just ignore it for now, and then when we get to beta the warning goes away anyway (except for people running -debug)
18:55:02 <davej> maybe by the time we rebase 15/16 to 3.3 something will have made its way back to -stable
18:55:06 <jforbes> That seems a more acceptable solution, we aren't dealing with a regression here, just more information
18:55:09 <jwb> pick one of the bugs and close it as UPSTREAM, dupe the rest to that?
18:55:19 <davej> jwb: yeah
18:55:23 <jwb> k.  i'll do that too
18:55:31 <davej> jforbes: right, it's just bugzilla spam if we do nothing
18:55:47 <jwb> #action jwb to clean up the "powertop rcu spew bugs"
18:56:02 <davej> ok, anything else ?
18:56:17 <jwb> oh, yeah
18:56:26 <jwb> jforbes, welcome to the team officially and junk ;)
18:56:36 <jforbes> heh, thanks
18:56:40 <davej> oh yeah :)
18:57:04 <jforbes> Okay, leave it open for 60 seconds to see if anyone has anything else....
18:57:19 * gholms raises hand
18:57:25 <jforbes> go ahead gholms
18:57:53 <gholms> Any suggestions on how an average user like me can help diagnose issues with wireless connections dropping?
18:58:23 * gholms appears to be seeing https://bugzilla.redhat.com/show_bug.cgi?id=767855
18:59:20 <davej> only suggestion I have is to try mailing the driver maintainers.
18:59:46 <gholms> Ok
18:59:50 <jwb> going forward, it should be less confusing to tell them which version you're using
19:00:01 <jwb> the compat-wireless stuff is going to be dropped
19:00:15 <gholms> Oh, that's good to know.
19:00:34 * rbergeron yawns
19:00:40 <gholms> What is the time frame for that? F17?
19:00:43 <jforbes> Okay, so that's about all the time we have before rbergeron starts her cloud burst
19:00:50 <rbergeron> hey look, jforbes is here for the cloud meeting
19:00:56 <gholms> rbergeron: lulz
19:00:56 <rbergeron> OH GOD SO PUNNY
19:01:02 <jwb> gholms, it's off in f17 already.  f16 will drop it when it rebases to 3.3
19:01:05 * rbergeron hands over her Chief-Pun-Master badge to jforbes
19:01:12 <gholms> Thanks, everyone!
19:01:19 <jforbes> Thanks for showing up everyone
19:01:22 <jforbes> #endmeeting