20:01:31 #startmeeting Infrastructure
20:01:31 Meeting started Thu Apr 29 20:01:31 2010 UTC. The chair is mmcgrath. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:01:33 Useful Commands: #action #agreed #halp #info #idea #link #topic.
20:01:34 #topic Who's here?
20:01:49 * nirik is lurking in the back.
20:02:04 i'm here
20:02:07 but i'm a nobody
20:02:26 well, I know toshio's out
20:02:29 seth's recovering
20:02:31 smooge: you around?
20:02:36 * skvidal is here
20:02:40 oh meeting, what's up
20:02:44 mmcgrath: where's toshio?
20:02:49 skvidal: PTO
20:02:52 mmcgrath: ah, good
20:02:54 that's right
20:02:57 family visiting
20:03:10 Well, let's get started.
20:03:27 #topic Final Freeze
20:03:33 Just a reminder the final freeze starts on the 4th.
20:03:44 Does anyone have any major changes they're pushing out or planning on pushing out?
20:04:15 * mmcgrath doesn't think any major things are coming out
20:04:17 I know MM has one
20:04:22 oh that's another one, mdomsch is on a plane :)
20:04:34 here
20:04:47 smooge: do you know of any major changes before the freeze?
20:05:12 rsyslog is partially implemented. It will finish after the freeze
20:05:20 k
20:05:24 when does the freeze end?
20:05:30 smooge: the day after the release
20:05:39 speculated to be June?
20:06:29 May 18th
20:06:30 http://fedoraproject.org/wiki/Schedule
20:06:38 so the 19th would be the unfreeze date.
20:07:48 Ok, so that's really all there is on that for now
20:07:52 #topic func yum
20:07:59 skvidal: where did we leave the security updates thing?
20:08:12 mmcgrath: yesterday morning I said
20:08:19 "I think I can get to it today"
20:08:28 then my day took a turn for the not-gonna-happen
20:08:33 heheheh
20:08:40 so - here's the deal - we can exec yum update --security
20:08:46 on all the machines
20:08:49 no problem
20:08:50 using func
20:08:56 * mmcgrath is fine with that.
20:09:20 then I guess we should do that
20:09:31 K, I'll look at doing that this afternoon or soon.
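[Editor's note: the plan above — pushing security-only updates to every machine over func — might look roughly like the sketch below. The `func "*" call command run ...` invocation and the `--security` flag come from the discussion itself; the `build_update_cmd` dry-run helper and its output format are hypothetical, added only so the command can be inspected before it touches real hosts. `yum update --security` also assumes the yum security plugin is installed on each target.]

```shell
# Hypothetical sketch: compose the func invocation that would run
# security-only updates on all minions.  The helper only echoes the
# command as a dry run; nothing is executed against real hosts here.
build_update_cmd() {
    # $1 is a func host glob, e.g. "*" for every registered minion
    printf "func '%s' call command run 'yum -y update --security'\n" "$1"
}

# Inspect the command before running it for real:
build_update_cmd "*"
```

Running the printed command for real would require func and certmaster to already be configured, with each target host enrolled as a minion.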
20:09:34 wait
20:09:40 I can make it faster/better I think
20:09:51 lemme finish closing some more yum bugs and I'll see if I can make it suck less for you
20:10:02 give me an hour or so at least to hack something up
20:10:05 ok?
20:10:11 sounds good, thanks
20:10:37 ok will help after you are ready
20:10:45 * sijis is here.. late
20:10:47 Ok, anyone have anything else on that?
20:11:48 alllrighty :)
20:11:53 #topic PHX2 outage
20:12:02 so yeah, the sky fell last night.
20:12:17 I'm pretty happy with the results. For the most part everything came back up on its own.
20:12:30 iscsi being the biggest bump, some hosts booted before the netapp was available.
20:13:12 I still haven't heard the root cause but the rumor going around at the moment is during some electrical work, one of the electricians flipped off the wrong circuit.
20:13:24 ooops
20:13:44 I just hope that same electrician didn't then go work on circuits that were live and he didn't know it.
20:13:59 but yeah, there's bound to be hell to pay somewhere.
20:14:23 The biggest ongoing concern we have is why redundancy didn't work. If someone flipped a switch there's probably nothing we can do about that.
20:14:39 The next one is: when everything was powered back up, why didn't the network come back on its own?
20:14:41 * skvidal expects the root cause is 'squirrel in transformer'
20:14:52 this appears to be electrician-speak for 'umm, I have no earthly idea'
20:14:56 now that one's not on us, but it's a concern I'm going to bring up with RHIT during our next meeting.
20:14:58 suicide squirrels taking over the world
20:14:59 skvidal: :-D
20:15:37 Our major outage time was about 3 and a half hours.
20:15:43 Most services were back online after 2.5 hours
20:15:47 yeah it was interesting that cnode and sourceware seemed to stay up or available.. and we were out for a while
20:15:47 pkgdb being the big outlier.
20:16:06 I've already started talking with abadger1999 about how to make pkgdb more redundant.
20:16:18 it actually seems like if haproxy hadn't flagged it down, pkgdb would have been partially available.
20:16:34 you know what networking is probably kicking themselves over.... they could have put that hairpin code in place
20:16:45 so we need to figure out if a partially working pkgdb is better (less risky) than no pkgdb at all.
20:16:50 smooge: hehehe that's true.
20:16:52 I forgot about that.
20:17:22 what can we put in other locations?
20:17:35 well, I want to put another openvpn server somewhere.
20:17:42 that way some things would have gracefully recovered.
20:17:51 like the connection between proxy servers and the backup app servers.
20:17:54 and done so securely.
20:18:05 but the rest of it that relies on a data layer we're kind of stuck with.
20:18:17 I'm not convinced that remote database replication will be a net win for us.
20:19:15 and even with remote db replication we'd only keep...
20:19:18 fas up
20:19:20 maybe pkgdb.
20:19:21 smolt
20:19:26 the wiki requires the nfs mount.
20:19:45 so unless we fork up some serious cash to do replication at the data layer to another remote site, I just don't think it's feasible for us.
20:19:56 at least not at this point in time in our growth.
20:19:59 everyone following me on that
20:20:02 ?
20:20:53 somewhat :)
20:21:09 yes
20:21:10 mmcgrath: so
20:21:11 lemme ask
20:21:24 is it even remotely worth thinking about drbd or such things?
20:21:32 as a poor-man's data-layer replication
20:21:37 for things like the wiki nfs?
20:21:39 skvidal: it is actually, for some things.
20:21:47 but for us it won't be a drop-in replacement.
20:21:49 nod
20:21:51 of course
20:21:55 but that is something we could architect for.
20:22:16 and drbd (or similar) is something I'd like to look at for live replication of some of our critical hosts like fedorahosted
20:22:31 * nirik notes drbd is not available in rhel directly. ;)
20:22:42 in our case though, nfs is on netapp
20:22:51 you get the idea.
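[Editor's note: the drbd idea floated above could be sketched as a minimal `drbd.conf` resource like the one below. Everything here is hypothetical — the resource name, hostnames, device paths, and addresses are illustrative, not from the meeting. Protocol C is fully synchronous; replicating to a remote site would more realistically use the asynchronous protocol A, at the cost of possible data loss on failover.]

```
# Hypothetical drbd.conf resource: mirror one block device between two hosts.
resource hosted {
  protocol C;                # synchronous replication; use A for long-distance links
  device    /dev/drbd0;      # the replicated block device applications mount
  disk      /dev/vg0/hosted; # backing storage on each node (illustrative path)
  meta-disk internal;
  on host1.example.org {
    address 192.0.2.10:7788;
  }
  on host2.example.org {
    address 192.0.2.11:7788;
  }
}
```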
But long term that is something I think we should investigate.
20:23:05 I'm not totally sure how it would work or how well.
20:23:10 but certainly worth a few experiments.
20:23:14 ok, that's really all I have on that.
20:23:20 anyone have anything else they'd like to discuss on that?
20:23:43 alllllll righty :)
20:23:52 #topic New security stuff
20:24:00 I've been working on some new security policies and procedures.
20:24:05 everyone see my note to the list?
20:24:08 * dgilmore thinks they looked fine
20:24:39 Basically the idea is to assign every host a security category that will help us plan for that category group.
20:24:44 as well as better document what each host does.
20:25:17 and how hosts depend on other hosts
20:25:29 dgilmore: yeah, that was something I saw and thought was a good idea.
20:25:30 i haven't read it.. but it's categorized per host, not per app or role?
20:25:45 sijis: well, it'll mostly be by role
20:25:59 so it's not like app1 would be given a different security category than app2
20:26:11 but you do it by host because for the most part you consider a 'host' getting compromised.
20:26:31 true. we've categorized here by role not a specific host.
20:26:33 apps get that too but usually you assume the host has been compromised if the app has been.
20:26:38 but i get what you mean
20:26:39 correct.
20:26:48 although I've only done one specific host at the moment
20:27:01 you'll see some new info in motd on fedorapeople.org if you ssh there.
20:27:32 Ok, so that's all I have on that
20:27:35 .any a-k
20:27:35 mmcgrath: a-k was last seen in #fedora-meeting 11 weeks, 6 days, 23 hours, 23 minutes, and 20 seconds ago: *** a-k has parted #fedora-meeting ("Bye")
20:27:48 no a-k, and I forgot zodbot fails to write to disk very often.
20:27:50 So with that
20:27:52 #topic Open Floor
20:27:57 anyone have anything they'd like to discuss?
20:27:59 anything at all?
20:28:26 If not we'll close in 30
20:28:58 Cloud SIG meeting in 30, if anyone's interested.
20:29:05 (minutes, that is)
20:29:14 gholms|work: thanks :)
20:29:17 #endmeeting