20:01:31 <mmcgrath> #startmeeting Infrastructure
20:01:31 <zodbot> Meeting started Thu Apr 29 20:01:31 2010 UTC.  The chair is mmcgrath. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:01:33 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
20:01:34 <mmcgrath> #topic Who's here?
20:01:49 * nirik is lurking in the back.
20:02:04 <jokajak> i'm here
20:02:07 <jokajak> but i'm a nobody
20:02:26 <mmcgrath> well, I know toshio's out
20:02:29 <mmcgrath> seth's recovering
20:02:31 <mmcgrath> smooge: you around?
20:02:36 * skvidal is here
20:02:40 <skvidal> oh, meeting! what's up
20:02:44 <skvidal> mmcgrath: where's toshio?
20:02:49 <mmcgrath> skvidal: PTO
20:02:52 <skvidal> mmcgrath: ah, good
20:02:54 <skvidal> that's right
20:02:57 <skvidal> family visiting
20:03:10 <mmcgrath> Well, let's get started.
20:03:27 <mmcgrath> #topic Final Freeze
20:03:33 <mmcgrath> Just a reminder the final freeze starts on the 4th.
20:03:44 <mmcgrath> Does anyone have any major changes they're pushing out or planning on pushing out?
20:04:15 * mmcgrath doesn't think any major things are coming out
20:04:17 <mmcgrath> I know MM has one
20:04:22 <mmcgrath> oh that's another one, mdomsch is on a plane :)
20:04:34 <smooge> here
20:04:47 <mmcgrath> smooge: do you know of any major changes before the freeze?
20:05:12 <smooge> rsyslog is partially implemented. It will finish after the freeze
20:05:20 <mmcgrath> k
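A minimal sketch of the central-logging piece being rolled out, assuming stock rsyslog forwarding and a hypothetical log-host name:

    # client-side /etc/rsyslog.conf fragment: forward all messages to a
    # central log host (hostname hypothetical); @@ = TCP, @ = UDP
    *.* @@log01.example.org:514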
20:05:24 <smooge> when does the freeze end?
20:05:30 <mmcgrath> smooge: the day after the release
20:05:39 <smooge> speculated to be June?
20:06:29 <mmcgrath> May 18th
20:06:30 <mmcgrath> http://fedoraproject.org/wiki/Schedule
20:06:38 <mmcgrath> so the 19th would be the unfreeze date.
20:07:48 <mmcgrath> Ok, so that's really all there is on that for now
20:07:52 <mmcgrath> #topic func yum
20:07:59 <mmcgrath> skvidal: where did we leave the security updates thing?
20:08:12 <skvidal> mmcgrath: yesterday morning I said
20:08:19 <skvidal> "I think I can get to it today"
20:08:28 <skvidal> then my day took a turn for the not-gonna-happen
20:08:33 <mmcgrath> heheheh
20:08:40 <skvidal> so - here's the deal - we can exec yum update --security
20:08:46 <skvidal> on all the machines
20:08:49 <skvidal> no problem
20:08:50 <skvidal> using func
20:08:54 <mmcgrath> <nod>
20:08:56 * mmcgrath is fine with that.
20:09:20 <skvidal> then I guess we should do that
20:09:31 <mmcgrath> K, I'll look at doing that this afternoon or soon.
20:09:34 <skvidal> wait
20:09:40 <skvidal> I can make it faster/better I think
20:09:51 <skvidal> lemme finish closing some more yum bugs and I'll see if I can make it suck less for you
20:10:02 <skvidal> give me an hour or so at least to hack something up
20:10:05 <skvidal> ok?
20:10:11 <mmcgrath> sounds good, thanks
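For reference, a minimal sketch of the func invocation under discussion, using func's stock "command" module; the host glob is illustrative, and the --security flag requires the yum security plugin on the target machines:

    # preview which security errata apply across all func-managed minions
    func "*" call command run "yum --security check-update"
    # then apply them
    func "*" call command run "yum -y update --security"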
20:10:37 <smooge> ok will help after you are ready
20:10:45 * sijis is here.. late
20:10:47 <mmcgrath> Ok, anyone have anything else on that?
20:11:48 <mmcgrath> alllrighty :)
20:11:53 <mmcgrath> #topic PHX2 outage
20:12:02 <mmcgrath> so yeah, the sky fell last night.
20:12:17 <mmcgrath> I'm pretty happy with the results.  For the most part everything came back up on its own.
20:12:30 <mmcgrath> iSCSI being the biggest bump; some hosts booted before the netapp was available.
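One common guard against that boot-ordering bump, sketched with a hypothetical device and mount point: marking iSCSI-backed filesystems _netdev so init mounts them only after the network and iscsid are up:

    # /etc/fstab entry (device and mount point hypothetical)
    /dev/mapper/netapp-vol1  /srv/data  ext3  _netdev,defaults  0 0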
20:13:12 <mmcgrath> I still haven't heard the root cause, but the rumor going around at the moment is that during some electrical work, one of the electricians flipped off the wrong circuit.
20:13:24 <smooge> ooops
20:13:44 <mmcgrath> I just hope that same electrician didn't then go work on circuits that were live and he didn't know it.
20:13:59 <mmcgrath> but yeah, there's bound to be hell to pay somewhere.
20:14:23 <mmcgrath> The biggest ongoing concern we have is why redundancy didn't work.  If someone flipped a switch there's probably nothing we can do about that.
20:14:39 <mmcgrath> The next one is: when everything was powered back up, why didn't the network come back on its own?
20:14:41 * skvidal expects the root cause is 'squirrel in transformer'
20:14:52 <skvidal> this appears to be electrician-speak for 'umm, I have no earthly idea'
20:14:56 <mmcgrath> now that one's not on us, but it's a concern I'm going to bring up with RHIT during our next meeting.
20:14:58 <smooge> suicide squirrels' taking over the world
20:14:59 <mmcgrath> skvidal: :-D
20:15:37 <mmcgrath> Our major outage time was about 3 and a half hours.
20:15:43 <mmcgrath> Most services were back online after 2.5 hours
20:15:47 <smooge> yeah it was interesting that cnode and sourceware seemed to stay up or available.. and we were out for a while
20:15:47 <mmcgrath> pkgdb being the big outlier.
20:16:06 <mmcgrath> I've already started talking with abadger1999 about how to make pkgdb more redundant.
20:16:18 <mmcgrath> it actually seems like if haproxy hadn't flagged it down, pkgdb would have been partially available.
20:16:34 <smooge> you know what networking is probably kicking themselves over.... they could have put that hairpin code in place
20:16:45 <mmcgrath> so we need to figure out if a partially working pkgdb is better (less risky) than no pkgdb at all.
20:16:50 <mmcgrath> smooge: hehehe that's true.
20:16:52 <mmcgrath> I forgot about that.
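A sketch of one way to express that "partially up beats fully down" choice in haproxy, with hypothetical server names, addresses, and health-check URL: the backup keyword keeps a degraded instance out of rotation until every primary fails its check, so haproxy still serves something:

    backend pkgdb
        option httpchk GET /pkgdb/
        server app1 10.0.0.11:80 check
        # degraded/partial instance: used only when all primaries are down
        server app2 10.0.0.12:80 check backup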
20:17:22 <smooge> what can we put in other locations?
20:17:35 <mmcgrath> well, I want to put another openvpn server somewhere.
20:17:42 <mmcgrath> that way some things would have gracefully recovered.
20:17:51 <mmcgrath> like the connection between proxy servers and the backup app servers.
20:17:54 <mmcgrath> and done so securely.
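The client side of that second-server idea is mostly free, sketched here with illustrative hostnames: OpenVPN tries remote entries in order and fails over when the first stops answering:

    # client config fragment: fail over between two VPN servers
    client
    remote vpn1.example.org 1194
    remote vpn2.example.org 1194    # hypothetical second server
    resolv-retry infinite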
20:18:05 <mmcgrath> but the rest of it that relies on a data layer we're kind of stuck with.
20:18:17 <mmcgrath> I'm not convinced that remote database replication will be a net win for us.
20:19:15 <mmcgrath> and even with remote db replication we'd only keep...
20:19:18 <mmcgrath> fas up
20:19:20 <mmcgrath> maybe pkgdb.
20:19:21 <mmcgrath> smolt
20:19:26 <mmcgrath> the wiki requires the nfs mount.
20:19:45 <mmcgrath> so unless we fork up some serious cash to do replication at the data layer to another remote site, I just don't think it's feasible for us.
20:19:56 <mmcgrath> at least not at this point in time in our growth.
20:19:59 <mmcgrath> everyone following me on that?
20:20:53 <sijis> somewhat :)
20:21:09 <smooge> yes
20:21:10 <skvidal> mmcgrath: so
20:21:11 <skvidal> lemme ask
20:21:24 <skvidal> is it even remotely worth thinking about drbd or such things?
20:21:32 <skvidal> as a poor-man's data-layer replication
20:21:37 <skvidal> for things like the wiki nfs?
20:21:39 <mmcgrath> skvidal: it is actually, for some things.
20:21:47 <mmcgrath> but for us it won't be a drop in replacement.
20:21:49 <skvidal> nod
20:21:51 <skvidal> of course
20:21:55 <mmcgrath> but that is something we could architect for.
20:22:16 <mmcgrath> and drbd (or similar) is something I'd like to look at for live replication of some of our critical hosts like fedorahosted
20:22:31 * nirik notes drbd is not available in rhel directly. ;)
20:22:42 <mmcgrath> in our case though, nfs is on netapp
20:22:51 <mmcgrath> you get the idea.  But long term that is something I think we should investigate.
20:23:05 <mmcgrath> I'm not totally sure how it would work or how well.
20:23:10 <mmcgrath> but certainly worth a few experiments.
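For the record, a minimal drbd resource sketch of the sort of live replication being floated, with hypothetical hostnames, devices, and addresses; protocol C acknowledges a write only once it has landed on both nodes:

    resource hosted-data {
        protocol C;                    # synchronous replication
        on primary.example.org {
            device    /dev/drbd0;
            disk      /dev/vg0/hosted;
            address   192.0.2.10:7788;
            meta-disk internal;
        }
        on standby.example.org {
            device    /dev/drbd0;
            disk      /dev/vg0/hosted;
            address   192.0.2.11:7788;
            meta-disk internal;
        }
    }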
20:23:14 <mmcgrath> ok, that's really all I have on that.
20:23:20 <mmcgrath> anyone have anything else they'd like to discuss on that?
20:23:43 <mmcgrath> alllllll righty :)
20:23:52 <mmcgrath> #topic New security stuff
20:24:00 <mmcgrath> I've been working on some new security policies and procedures.
20:24:05 <mmcgrath> everyone see my note to the list?
20:24:08 * dgilmore thinks they looked fine
20:24:39 <mmcgrath> Basically the idea is to assign every host a security category that will help us plan for that category group.
20:24:44 <mmcgrath> as well as better document what each host does.
20:25:17 <dgilmore> and how hosts depend on other hosts
20:25:29 <mmcgrath> dgilmore: yeah, that was something I saw and thought was a good idea.
20:25:30 <sijis> i haven't read it.. but it's categorized per host, not per app or role?
20:25:45 <mmcgrath> sijis: well, it'll mostly be by role
20:25:59 <mmcgrath> so it's not like app1 would be given a different security category than app2
20:26:11 <mmcgrath> but you do it by host because for the most part you consider a 'host' getting compromised.
20:26:31 <sijis> true. we've categorized here by role not a specific host.
20:26:33 <mmcgrath> apps get that too but usually you assume the host has been compromised if the app has been.
20:26:38 <sijis> but i get what you mean
20:26:39 <mmcgrath> correct.
20:26:48 <mmcgrath> although I've only done one specific host at the moment
20:27:01 <mmcgrath> you'll see some new info in the motd on fedorapeople.org if you ssh there.
20:27:32 <mmcgrath> Ok, so that's all I have on that
20:27:35 <mmcgrath> .any a-k
20:27:35 <zodbot> mmcgrath: a-k was last seen in #fedora-meeting 11 weeks, 6 days, 23 hours, 23 minutes, and 20 seconds ago: *** a-k has parted #fedora-meeting ("Bye")
20:27:48 <mmcgrath> no a-k, and I forgot zodbot doesn't flush its seen data to disk very often.
20:27:50 <mmcgrath> So with that
20:27:52 <mmcgrath> #topic Open Floor
20:27:57 <mmcgrath> anyone have anything they'd like to discuss?
20:27:59 <mmcgrath> anything at all?
20:28:26 <mmcgrath> If not we'll close in 30
20:28:58 <gholms|work> Cloud SIG meeting in 30, if anyone's interested.
20:29:05 <gholms|work> (minutes, that is)
20:29:14 <mmcgrath> gholms|work: thanks :)
20:29:17 <mmcgrath> #endmeeting