20:01:31 <mmcgrath> #startmeeting Infrastructure
20:01:31 <zodbot> Meeting started Thu Apr 29 20:01:31 2010 UTC. The chair is mmcgrath. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:01:33 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
20:01:34 <mmcgrath> #topic Who's here?
20:01:49 * nirik is lurking in the back.
20:02:04 <jokajak> i'm here
20:02:07 <jokajak> but i'm a nobody
20:02:26 <mmcgrath> well, I know toshio's out
20:02:29 <mmcgrath> seth's recovering
20:02:31 <mmcgrath> smooge: you around?
20:02:36 * skvidal is here
20:02:40 <skvidal> oh, meeting. what's up
20:02:44 <skvidal> mmcgrath: where's toshio?
20:02:49 <mmcgrath> skvidal: PTO
20:02:52 <skvidal> mmcgrath: ah, good
20:02:54 <skvidal> that's right
20:02:57 <skvidal> family visiting
20:03:10 <mmcgrath> Well, let's get started.
20:03:27 <mmcgrath> #topic Final Freeze
20:03:33 <mmcgrath> Just a reminder, the final freeze starts on the 4th.
20:03:44 <mmcgrath> Does anyone have any major changes they're pushing out or planning on pushing out?
20:04:15 * mmcgrath doesn't think any major things are coming out
20:04:17 <mmcgrath> I know MM has one
20:04:22 <mmcgrath> oh, that's another one, mdomsch is on a plane :)
20:04:34 <smooge> here
20:04:47 <mmcgrath> smooge: do you know of any major changes before the freeze?
20:05:12 <smooge> rsyslog is partially implemented. It will finish after the freeze
20:05:20 <mmcgrath> k
20:05:24 <smooge> when does the freeze end?
20:05:30 <mmcgrath> smooge: the day after the release
20:05:39 <smooge> speculated to be June?
20:06:29 <mmcgrath> May 18th
20:06:30 <mmcgrath> http://fedoraproject.org/wiki/Schedule
20:06:38 <mmcgrath> so the 19th would be the unfreeze date.
20:07:48 <mmcgrath> Ok, so that's really all there is on that for now
20:07:52 <mmcgrath> #topic func yum
20:07:59 <mmcgrath> skvidal: where did we leave the security updates thing?
20:08:12 <skvidal> mmcgrath: yesterday morning I said
20:08:19 <skvidal> "I think I can get to it today"
20:08:28 <skvidal> then my day took a turn for the not-gonna-happen
20:08:33 <mmcgrath> heheheh
20:08:40 <skvidal> so - here's the deal - we can exec yum update --security
20:08:46 <skvidal> on all the machines
20:08:49 <skvidal> no problem
20:08:50 <skvidal> using func
20:08:54 <mmcgrath> <nod>
20:08:56 * mmcgrath is fine with that.
20:09:20 <skvidal> then I guess we should do that
20:09:31 <mmcgrath> K, I'll look at doing that this afternoon or soon.
20:09:34 <skvidal> wait
20:09:40 <skvidal> I can make it faster/better I think
20:09:51 <skvidal> lemme finish closing some more yum bugs and I'll see if I can make it suck less for you
20:10:02 <skvidal> give me an hour or so at least to hack something up
20:10:05 <skvidal> ok?
20:10:11 <mmcgrath> sounds good, thanks
20:10:37 <smooge> ok, will help after you are ready
20:10:45 * sijis is here.. late
20:10:47 <mmcgrath> Ok, anyone have anything else on that?
20:11:48 <mmcgrath> alllrighty :)
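For reference, the func-driven security update discussed above can be scripted through func's Python overlord API as well as the func command line. The sketch below is illustrative only and is not the tooling skvidal went on to write: it assumes certmaster trust is already in place between the overlord and the minions, that the yum security plugin is installed on the targets so --security is honored, and the "*" glob and the output handling are placeholders.

    #!/usr/bin/python
    # Sketch: run a security-only yum update on every func minion.
    # Assumes certmaster trust is established and the yum security
    # plugin is installed on the target hosts.
    import func.overlord.client as fc

    # "*" targets every registered minion; a narrower glob works too.
    client = fc.Client("*")

    # command.run returns a dict of hostname -> [exit_code, stdout, stderr]
    results = client.command.run("yum -y update --security")

    for host in sorted(results):
        ret = results[host]
        print "%s: exit %s" % (host, ret[0])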
20:11:53 <mmcgrath> #topic PHX2 outage
20:12:02 <mmcgrath> so yeah, the sky fell last night.
20:12:17 <mmcgrath> I'm pretty happy with the results. For the most part everything came back up on its own.
20:12:30 <mmcgrath> iscsi was the biggest bump; some hosts booted before the netapp was available.
20:13:12 <mmcgrath> I still haven't heard the root cause, but the rumor going around at the moment is that during some electrical work, one of the electricians flipped off the wrong circuit.
20:13:24 <smooge> ooops
20:13:44 <mmcgrath> I just hope that same electrician didn't then go work on circuits that were live without knowing it.
20:13:59 <mmcgrath> but yeah, there's bound to be hell to pay somewhere.
20:14:23 <mmcgrath> The biggest ongoing concern we have is why redundancy didn't work. If someone flipped a switch there's probably nothing we can do about that.
20:14:39 <mmcgrath> The next one is: when everything was powered back up, why didn't the network come back on its own?
20:14:41 * skvidal expects the root cause is 'squirrel in transformer'
20:14:52 <skvidal> this appears to be electrician-speak for 'umm, I have no earthly idea'
20:14:56 <mmcgrath> now that one's not on us, but it's a concern I'm going to bring up with RHIT during our next meeting.
20:14:58 <smooge> suicide squirrels taking over the world
20:14:59 <mmcgrath> skvidal: :-D
20:15:37 <mmcgrath> Our major outage time was about 3 and a half hours.
20:15:43 <mmcgrath> Most services were back online after 2.5 hours
20:15:47 <smooge> yeah it was interesting that cnode and sourceware seemed to stay up or available.. and we were out for a while
20:15:47 <mmcgrath> pkgdb being the big outstanding one.
20:16:06 <mmcgrath> I've already started talking with abadger1999 about how to make pkgdb more redundant.
20:16:18 <mmcgrath> it actually seems like if haproxy hadn't flagged it down, pkgdb would have been partially available.
20:16:34 <smooge> you know what networking is probably kicking themselves over.... they could have put that hairpin code in place
20:16:45 <mmcgrath> so we need to figure out if a partially working pkgdb is better (less risky) than no pkgdb at all.
20:16:50 <mmcgrath> smooge: hehehe that's true.
20:16:52 <mmcgrath> I forgot about that.
20:17:22 <smooge> what can we put in other locations?
20:17:35 <mmcgrath> well, I want to put another openvpn server somewhere.
20:17:42 <mmcgrath> that way some things would have gracefully recovered.
20:17:51 <mmcgrath> like the connection between proxy servers and the backup app servers.
20:17:54 <mmcgrath> and done so securely.
20:18:05 <mmcgrath> but the rest of it that relies on a data layer we're kind of stuck with.
20:18:17 <mmcgrath> I'm not convinced that remote database replication will be a net win for us.
20:19:15 <mmcgrath> and even with remote db replication we'd only keep...
20:19:18 <mmcgrath> fas up
20:19:20 <mmcgrath> maybe pkgdb.
20:19:21 <mmcgrath> smolt
20:19:26 <mmcgrath> the wiki requires the nfs mount.
20:19:45 <mmcgrath> so unless we fork up some serious cash to do replication at the data layer to another remote site, I just don't think it's feasible for us.
20:19:56 <mmcgrath> at least not at this point in time in our growth.
20:19:59 <mmcgrath> everyone following me on that?
20:20:53 <sijis> somewhat :)
20:21:09 <smooge> yes
20:21:10 <skvidal> mmcgrath: so
20:21:11 <skvidal> lemme ask
20:21:24 <skvidal> is it even remotely worth thinking about drbd or such things?
20:21:32 <skvidal> as a poor-man's data-layer replication
20:21:37 <skvidal> for things like the wiki nfs?
20:21:39 <mmcgrath> skvidal: it is actually, for some things.
20:21:47 <mmcgrath> but for us it won't be a drop-in replacement.
20:21:49 <skvidal> nod
20:21:51 <skvidal> of course
20:21:55 <mmcgrath> but that is something we could architect for.
20:22:16 <mmcgrath> and drbd (or similar) is something I'd like to look at for live replication of some of our critical hosts like fedorahosted
20:22:31 * nirik notes drbd is not available in rhel directly. ;)
20:22:42 <mmcgrath> in our case though, nfs is on netapp
20:22:51 <mmcgrath> you get the idea. But long term that is something I think we should investigate.
20:23:05 <mmcgrath> I'm not totally sure how it would work or how well.
20:23:10 <mmcgrath> but certainly worth a few experiments.
20:23:14 <mmcgrath> ok, that's really all I have on that.
20:23:20 <mmcgrath> anyone have anything else they'd like to discuss on that?
20:23:43 <mmcgrath> alllllll righty :)
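For context on the drbd idea raised above, a two-node drbd resource definition has roughly the shape shown below (drbd 8.x configuration syntax). Everything here is hypothetical -- the resource name, hostnames, backing LVM volume, and addresses are placeholders rather than a proposal for any particular Fedora host -- and block-level replication like this would still need a filesystem plus an NFS or application failover layer on top of it.

    # /etc/drbd.conf (or a file under /etc/drbd.d/) -- illustrative only
    resource hosted {
      protocol C;                    # synchronous replication
      syncer { rate 40M; }           # cap resync bandwidth
      on nodea.example.com {
        device    /dev/drbd0;
        disk      /dev/vg0/hosted;   # hypothetical backing LVM volume
        address   192.0.2.10:7789;
        meta-disk internal;
      }
      on nodeb.example.com {
        device    /dev/drbd0;
        disk      /dev/vg0/hosted;
        address   192.0.2.11:7789;
        meta-disk internal;
      }
    }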
20:23:52 <mmcgrath> #topic New security stuff
20:24:00 <mmcgrath> I've been working on some new security policies and procedures.
20:24:05 <mmcgrath> everyone see my note to the list?
20:24:08 * dgilmore thinks they looked fine
20:24:39 <mmcgrath> Basically the idea is to assign every host a security category that will help us plan for that category group.
20:24:44 <mmcgrath> as well as better document what each host does.
20:25:17 <dgilmore> and how hosts depend on other hosts
20:25:29 <mmcgrath> dgilmore: yeah, that was something I saw and thought was a good idea.
20:25:30 <sijis> i haven't read it.. but is it categorized per host, not per app or role?
20:25:45 <mmcgrath> sijis: well, it'll mostly be by role
20:25:59 <mmcgrath> so it's not like app1 would be given a different security category than app2
20:26:11 <mmcgrath> but you do it by host because for the most part you consider a 'host' getting compromised.
20:26:31 <sijis> true. we've categorized here by role, not a specific host.
20:26:33 <mmcgrath> apps get that too, but usually you assume the host has been compromised if the app has.
20:26:38 <sijis> but i get what you mean
20:26:39 <mmcgrath> correct.
20:26:48 <mmcgrath> although I've only done one specific host at the moment
20:27:01 <mmcgrath> you'll see some new info in the motd on fedorapeople.org if you ssh there.
20:27:32 <mmcgrath> Ok, so that's all I have on that
20:27:35 <mmcgrath> .any a-k
20:27:35 <zodbot> mmcgrath: a-k was last seen in #fedora-meeting 11 weeks, 6 days, 23 hours, 23 minutes, and 20 seconds ago: *** a-k has parted #fedora-meeting ("Bye")
20:27:48 <mmcgrath> no a-k, and I forgot zodbot fails to write to disk very often.
20:27:50 <mmcgrath> So with that
20:27:52 <mmcgrath> #topic Open Floor
20:27:57 <mmcgrath> anyone have anything they'd like to discuss?
20:27:59 <mmcgrath> anything at all?
20:28:26 <mmcgrath> If not we'll close in 30
20:28:58 <gholms|work> Cloud SIG meeting in 30, if anyone's interested.
20:29:05 <gholms|work> (minutes, that is)
20:29:14 <mmcgrath> gholms|work: thanks :)
20:29:17 <mmcgrath> #endmeeting