20:00:39 #startmeeting infrastructure
20:00:39 Meeting started Thu Dec 16 20:00:39 2010 UTC. The chair is smooge. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:00:39 Useful Commands: #action #agreed #halp #info #idea #link #topic.
20:00:51 #meetingname Infrastructure
20:00:51 The meeting name has been set to 'infrastructure'
20:01:04 #chair skvidal dgilmore
20:01:04 Current chairs: dgilmore skvidal smooge
20:01:11 yah
20:01:12 * skvidal is here
20:01:13 #topic roll call
20:01:13 * ricky
20:01:17 * sgallagh lurks
20:01:18 * CodeBlock here
20:01:19 * ricky
20:01:21 * ke4qqq
20:01:22 * waltJ skulks around
20:01:23 * goozbach here
20:01:28 * dgilmore is preset and accounted for
20:01:29 * rfelsburg is here
20:01:34 * jsmith lurks
20:01:41 * rbergeron is impressed!
20:01:51 * sijis is around
20:01:51 * deathwing01 is here!
20:02:11 * nirik is around
20:02:15 rbergeron: how so?
20:02:22 * skvidal is here
20:02:42 Lots of people here today!
20:02:44 that was a very rapid bullet-pointy list of people all being present and accounted for :)
20:03:00 rbergeron: Sysadmins are efficient!
20:03:01 rbergeron: i see
20:03:05 rbergeron: except dgilmore, who is preset, not present :)
20:03:13 #topic introductions
20:03:18 rbergeron: we're all bots and scripts
20:03:21 * gholms resets dgilmore
20:04:06 * Elwell_ is lurking as normal
20:04:37 real quick I would like to introduce our new volunteers who have been helping: goozbach has helped with schedules, deathwing01 is working on trac testing for EL6, and hvivani has been helping on smokeping
20:05:20 rfelsburg, and some others have been helpful also.
20:05:27 but I have not had much time to mentor
20:06:09 if there are people who are waiting for sponsorship, shoot me an email and a ticket you want to look at and I will try to get you into the appropriate groups by after break
20:06:20 anything anyone wants to say real quick?
20:06:45 Thanks for volunteering, folks!
20:06:55 gday all
20:06:55 Welome!
20:06:59 our pleassure :)
20:07:07 That's one word for it. :)
20:07:24 ok next topic
20:07:24 * deathwing01 kills the extra s
20:07:32 #topic slushy freeze
20:07:37 deathwing01: So it's not just a clever name
20:08:19 * skvidal keeps thinking of darkwing duck whenever he sees deathwing01
20:08:33 skvidal: That makes two of us.
20:08:35 we are going to start a slushy freeze starting this friday afternoon. Basically any changes to puppet or servers need to get a review on irc/mail and a +1
20:08:47 * goozbach bows belatedly
20:09:52 smooge: this friday as in tomorrow, or next friday?
20:10:03 this friday as in tomorrow
20:10:28 the fewer changes that creep in over the break without someone knowing about them, the better.
20:10:44 worksforme
20:11:01 that way when people are drinking eggnog/tofunog with rum/everclear they aren't doing other things.
20:11:37 I will be away from the 26th->2nd. skvidal is similarly gone. nirik says he will be around and I think some others will be around every now and then
20:11:53 I will be around pretty much that entire time
20:11:55 * dgilmore will be around
20:11:56 * nirik nods. Should be around.
20:11:57 * skvidal will be pageable/callable
20:12:00 but likely distracted
20:12:13 * waltJ will be around too
20:12:13 * ricky will be around
20:12:14 * deathwing01 will probably be around a lot after Dec 24th
20:12:16 * CodeBlock has nothing to do, so likely won't even be distracted :)
20:12:20 * jsmith will not be around
20:12:37 * mdomsch will be offline most of 12/17-1/4
20:12:47 jsmith: slacker
20:13:25 dgilmore: It's not like I'm going to get a break -- trust me, it would be less stressful to stick around here :-)
20:13:31 * goozbach will be offline from 12/24 to 12/29
20:13:34 ah family
20:13:57 ok next topic?
20:14:07 yup
20:14:08 #topic Current Outage
20:14:33 * goozbach taps his wristwatch to keep meeting rolling
20:14:39 :)
20:14:43 Ok we are currently going through a 'degradation of services' with some items more degraded than others.
20:15:12 There may be several causes going on and no one factor.
20:15:42 1) our netapp filer is shared with other community projects and is being used more by all.
20:16:10 2) we ran into an issue with EL6 NFS (nfs-utils) that caused background mounting to fail.
20:16:43 3) DNS/host name changes caused the filer to not like most of fedora as various ACL caches aged out.
20:17:07 4) it's right before I finally go to disneyland for the first time in my life.
20:18:09 5) and someone (me) said "hey we had a quiet weekend on the pager..."
20:18:43 so what was impacted: mirror manager, and parts of release engineering
20:18:52 puppet and new servers being brought up.
20:18:59 Surprisingly wiki images :-)
20:19:16 wiki attachments, too
20:19:16 some app servers trying to mount scratch space
20:19:32 Er, surprisingly not
20:19:43 I never saw the wiki images fail, but they may have at some point
20:19:55 ricky: proxies could have been holding them
20:20:03 True
20:20:03 so it's an overloaded netapp
20:20:13 I think the work people put into haproxy and varnish stopped some things.
20:20:17 goozbach: yes and some dns pain
20:20:20 that isn't owned exclusively by infra?
20:20:35 correct
20:21:59 ok I don't have much else to say on this other than I hate SATA drive arrays.
20:22:19 like I hates the hobbitses
20:22:40 any other issues? I missed skvidal or dgilmore or ricky?
20:23:14 nothing leaps to mind for me
20:23:25 we have a fair bit of clean up to do once the dust settles
20:23:38 Probably good to mention the future plans/new netapp next year
20:23:54 yeah I thought I was doing well just cleaning up old lvms last week.
20:24:00 skvidal: don't think so
20:24:22 i guess we could mention that I'm moving the lookaside cache to the equalogics
20:24:57 ok so according to plan, we will be moving our sata arrays to a new netapp cluster that should be just us.
20:25:28 that will happen in Feb/Mar next year depending on how the schedules break
20:25:48 then we will see how things shape up.
20:26:35 so based on earlier comments you indicated that in addition to dns changes there was an io capacity issue - what changed there, and who changed it, and can they stop until we get stuff moved off or?
20:26:57 do we need to add more to a caching layer above?
20:27:45 ke4qqq: jboss merged with xo
20:27:48 err exo
20:27:54 in terms of their repos
20:27:57 and added a lot of use
20:28:01 that's on the same netapp
20:28:15 they also grew their testing and such.
20:28:16 the plan is to split them off - that's what the new netapp stuff is about
20:29:08 there are some other parts.. and I can after meeting because I have to turn off my Brian Blessed mode in doing so
20:29:31 I have no idea what that means and can't even guess
20:29:42 so what other questions about this clusterfuck do y'all have?
20:29:57 s/clusterfuck/series of unfortunate events/
20:30:00 * dgilmore has none
20:30:21 Any approximate ETA until we're 100% again?
20:30:25 skvidal: were there any indications that we were going to have a problem before it happened? usage stats etc. can we add monitoring to look for this stuff in the future
20:30:29 wow - ok - so I read that as no short term fix - continue degraded til feb/mar?
20:30:32 rfelsburg: yes
20:30:32 Hopefully within the hour
20:30:54 100% = things mount and can ls
20:31:11 ricky, I was going for that to be 75%
20:31:20 s/the hour/nowish/, actually :-)
20:31:26 the dns issue is fixed
20:31:30 so we have the hosts back
20:31:36 but the performance issue may not be fixed
20:31:45 100% will be that ^^^
20:31:46 Can we quantify the performance issue at all?
20:31:53 How much slower is it?
20:32:16 ricky: the timeouts on app## are one of the issues we're talking about in performance
20:32:17 ricky, a good guess will be that app07 does not see drops on /vol/fedora every 3 minutes
20:32:21 app03 and app07
20:32:27 y/win 20
20:32:30 Ah, didn't know about those.
20:32:43 Jeff_S: nice password there
20:33:00 * CodeBlock assumes he was just switching irc windows
20:33:11 skvidal: yeah, that's for root@baseurl.org
20:33:18 Jeff_S: nice
20:33:21 sorry for the noise :)
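The monitoring question raised above (whether usage stats or checks could have flagged the NFS stalls before the pager did) could be answered with a simple mount-responsiveness probe. The sketch below is purely illustrative: the mount point, timeout, and Nagios-style exit codes are assumptions, not the project's actual check configuration.

    #!/usr/bin/env python3
    # Hypothetical probe: flag an NFS mount that stops answering.
    # MOUNT and TIMEOUT are placeholders, not Fedora Infrastructure's real values.
    import subprocess
    import sys

    MOUNT = "/mnt/fedora"   # placeholder mount point
    TIMEOUT = 10            # seconds before the mount is considered hung

    try:
        # "stat -f" has to touch the filesystem, so a hung NFS server makes it block.
        subprocess.run(["stat", "-f", MOUNT], check=True,
                       stdout=subprocess.DEVNULL, timeout=TIMEOUT)
    except subprocess.TimeoutExpired:
        print("CRITICAL: %s did not respond within %ss" % (MOUNT, TIMEOUT))
        sys.exit(2)   # Nagios CRITICAL
    except subprocess.CalledProcessError:
        print("WARNING: stat on %s failed (not mounted?)" % MOUNT)
        sys.exit(1)   # Nagios WARNING

    print("OK: %s is responding" % MOUNT)
    sys.exit(0)

Run from cron or as a Nagios plugin, a check along these lines would surface the app03/app07 mount drops described above without waiting for user reports.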
20:34:04 ok I think we can go to meeting tickets
20:34:09 let's do that
20:34:12 ok
20:34:51 #topic Tickets
20:35:03 .ticket 2502
20:35:03 smooge: #2502 (Retrace Server) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2502
20:35:30 Ok this is a project for development/qa on analyzing coredumps from willing participants.
20:35:57 I think dennis and I know about the same amount on it... which is not much
20:36:21 I spent a good portion of last week trying to find disk space for them and put a temp/test/oh-god server on telia1
20:36:40 smooge: my understanding is that they plan to make it so debuginfo is available without needing to install the debuginfo rpms
20:36:55 at this point I consider it to be not much different from a publictest instance.
20:37:10 so that abrt/coredump reports etc will all be useful
20:37:25 dgilmore, oh I thought it was that you uploaded your core files and they did the analysis there.
20:37:38 smooge: not my understanding
20:37:42 but i could be wrong
20:37:56 it does stuff with debugging - is it important that we know?
20:38:18 it's not
20:38:19 long term it has security implications and throughput implications.
20:38:28 let's move on to the next ticket
20:38:30 yay
20:38:34 and uses a buttload of diskspace
20:38:44 .ticket 2501
20:38:45 ie all debuginfo rpms
20:38:46 smooge: #2501 (What will it take to upgrade fedorahosted to RHEL6, new trac, new git?) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2501
20:39:10 ok ke4qqq and deathwing01 (there is no darkwing here)
20:39:14 ke4qqq and deathwing01 have been working on that on publictest03 a bit
20:39:48 yeah - so right now we are focused on the test plan
20:40:01 https://fedoraproject.org/wiki/User:Ke4qqq/Trac_test_plan
20:40:08 #link https://fedoraproject.org/wiki/User:Ke4qqq/Trac_test_plan
20:41:18 yup
20:41:37 ke4qqq: how much testing has been done, and how much still needs to be done?
20:41:51 I want to thank you guys on that.. and hope we can extend those plans onto other apps/systems.
20:42:01 how's it looking so far?
20:42:15 * deathwing01 thinks there's still a lot to do
20:42:15 so far it's not bad - still lots of testing to go
20:42:20 that way when we are doing an update to a server class we can test a checklist versus my current "well the links worked and I could log in"
20:42:41 right - and hopefully have it adopted by QA
20:42:47 for trac updates in fedora
20:43:44 ok thanks on that. any more questions?
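The Trac test plan linked above is a manual checklist; the "well the links worked and I could log in" baseline could also be scripted as a quick smoke check after an upgrade. A minimal sketch, with an illustrative URL list rather than the plan's real test matrix:

    #!/usr/bin/env python3
    # Hypothetical smoke check: confirm a few hosted Trac pages still answer.
    # The URLs are examples only, not the actual test plan contents.
    import urllib.request

    URLS = [
        "https://fedorahosted.org/fedora-infrastructure/",
        "https://fedorahosted.org/fedora-infrastructure/report/1",
    ]

    for url in URLS:
        try:
            with urllib.request.urlopen(url, timeout=15) as resp:
                result = "HTTP %s" % resp.status
        except Exception as exc:   # keep the sketch short; report any failure
            result = "FAILED: %s" % exc
        print("%s -> %s" % (url, result))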
20:44:10 .ticket 2275 CodeBlock et al
20:44:10 smooge: Error: '2275 CodeBlock et al' is not a valid integer.
20:44:14 .ticket 2275
20:44:15 smooge: #2275 (Upgrade Nagios) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2275
20:44:21 CodeBlock, and others?
20:44:35 Nagios is running on noc01.stg (which is EL6)...
20:44:47 I still need to move zodbot and others over to noc01.stg before we can kill noc01
20:45:23 i accessed the stg nagios 3 system
20:45:25 jds2001 told me last night that he moved supybot-fedora to EPEL6, so ... I should be able to do that now
20:45:36 Why not just rebuild noc01 entirely instead of moving stuff over
20:45:58 ricky: +1
20:46:07 indeed
20:46:11 I would like us to rebuild when we have tested
20:46:21 not move. sorry if I miscommunicated that
20:46:41 smooge: oh? We're not just going to rename noc01.stg to noc01 later?
20:46:57 no .. noc01.stg will be around for testing changes in the future and such
20:47:09 oh.. heh, I didn't know that
20:47:10 no - if the install of nagios doesn't work from a reinstall then we can't really use it
20:47:38 CodeBlock: This might also be a good opportunity to make nagios into a puppet module if you're interested :-)
20:47:47 +1 on sustainability
20:47:49 ricky: It was a thought ;)
20:47:59 oooooooh
20:48:00 +100 on puppet module
20:48:18 I have a skeleton of one in my own puppet setup if that'd help
20:48:20 actually it would be useful for the many people who wanted to help out to see about doing that
20:48:49 document it as an SOP as well :)
20:49:04 12 mins left
20:49:09 smooge: I could maybe work with phuzion on it
20:49:26 well first let's work on getting it into a proper module in staging and then we can move to the next stage of a rebuild of noc01.stg to make sure it all works and then a rebuild of noc01
20:49:42 exactly what I was thinking :-)
20:49:44 alright
20:49:45 ok next ticket
20:49:58 .ticket 2481
20:49:59 smooge: #2481 (Fedora switching from the CLA to FPCA) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2481
20:50:04 Ok this is a MAJOR ONE
20:50:30 I think abadger1999 was working on this?
20:51:20 anyway.. the point of this is that the current CLA will be replaced with new 'paperwork' and everyone will have to re-agree
20:51:36 this requires fas changes, and some flag days
20:51:51 So all of our auth plugins hardcoding cla_done need to be fixed - this is what we get for hardcoding configuration :-)
20:52:01 we are needing to get this done by the F15 release
20:52:18 That soon?
20:52:28 I don't think files are a huge deal - we don't delete on inactivation for fedorapeople, which is the only system that should really be affected
20:52:32 so I would like to have all the hard stuff done at/by end of FudCon
20:53:20 i think just did the check against cla_*.. i *may* be OK
20:54:14 then after that people (FPL and such) can announce the more political flag days
20:54:22 does that sound good?
20:54:33 ricky: content would get moved away
20:54:44 On fedorapeople, it just gets chmodded
20:55:00 we move it to /home/fedora.bak
20:55:07 Not anymore
20:55:08 at least we did
20:55:14 ok
20:55:19 news to me
20:55:25 guess i did not pay attention
20:55:53 we need to document better :). I thought we did an rm --real --fast
20:56:17 ok anything else on this? skvidal ricky ?
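The cla_done point above is essentially a configuration problem: each auth plugin hardcodes the group name, so swapping the CLA for the FPCA means touching every consumer. A tiny hypothetical sketch of the shape of the fix — the config key and group names are made up, not the real FAS/TurboGears plugin code:

    # Hypothetical: read the required agreement group from configuration
    # instead of hardcoding "cla_done", so a CLA -> FPCA flag day becomes
    # a config change rather than a code change in every auth plugin.
    CONFIG = {"required_agreement_group": "cla_done"}   # flips to the FPCA group on flag day

    def user_has_signed_agreement(user_groups, config=CONFIG):
        """True if the user is in whichever agreement group is currently required."""
        return config["required_agreement_group"] in set(user_groups)

    print(user_has_signed_agreement(["cla_done", "packager"]))   # True
    print(user_has_signed_agreement(["packager"]))               # False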
20:56:34 .ticket 2277
20:56:35 smooge: #2277 (Figure out how to upgrade transifex on a regular schedule) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2277
20:56:40 not from me
20:56:55 four mins till cloud guys kick us in the shins under the table
20:56:57 this is something that comes up every release.. and would be nice for people to try and figure out
20:57:06 * gholms grins evilly
20:57:07 I brought a 2x4 this time
20:57:15 There has been talk about getting a representative of l10n on the sysadmin team
20:57:19 (a long time ago)
20:57:27 Someone more familiar with the internals of transifex
20:57:39 ok we still need that. I will put it on my list to find out and talk with them after break
20:57:46 then one last ticket
20:57:49 .ticket 2500
20:57:50 smooge: #2500 (Discuss possibility of FreeIPA as FAS replacement) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2500
20:57:58 sorry for just 3 minutes guys
20:57:58 This next version will be the hardest because of a big architecture change
20:58:11 oh you mean transifex
20:58:12 Heh, this was the one I was most interested in :-)
20:58:22 sorry I will make it higher next week
20:58:57 what is the timeline on FAS->FreeIPA migration
20:58:58 anyone?
20:58:59 ?
20:59:00 So we've been talking about kerberos to avoid typing passwords for su and stuff.
20:59:06 what do we need to change?
20:59:15 goozbach: We're not that far yet, no timeline yet.
20:59:32 My main question is - what does freeipa give us over openldap + kerberos?
20:59:48 ricky: from what I can tell, ease of administration
20:59:55 freeipa seems pretty heavy to me, at least - I know it has a great python API which we could use, but I think there's slightly less flexibility with custom schema
21:00:17 * rbergeron eeks in for a cloud meeting
21:00:25 2 minutes please sorry
21:00:28 np
21:00:55 freeipa mainly gets us a local upstream to help on issues.
21:00:59 ricky: i think the benefit of using freeipa over bare ldap + kerberos is that we could interact with a python api
21:01:15 so we need a feature list of FAS and a feature list of FreeIPA written up
21:01:16 mmcgrath looked into this before. (and sgallagh wants to get us to freeipa now).
21:01:18 rather than having to develop tools to interact with each service separately
21:01:25 and a cost/benefit analysis
21:01:39 I think that we still have issues with using kerberos for two domains so I'm not sure if we can implement kerb yet.
21:01:39 So openldap would give us LDAP, which people could query directly against
21:01:56 abadger1999: I'm on multiple realms fine, I just have a script which switches my credentials cache
21:02:07 I don't think that's a huge issue anymore.
21:02:12 freeIPA also does host management
21:02:15 So, who wants to discuss in #fedora-admin? :-)
21:02:19 ease of admin on that side
21:02:21 ricky: k. Does that work with firefox too?
21:02:28 ok will move to #fedora-admin
21:02:31 +1 for moving to fedora-admin
21:02:39 Yeah, we should move and let the cloud sig get on with their meeting
21:02:41 thank you for the extra 3 minutes
21:02:43 #endmeeting
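On the openldap-versus-FreeIPA question at the end of the meeting: ricky's point that plain openldap gives you a directory people can query directly might look like the sketch below (using the python-ldap module; the server URI, base DN, and attributes are placeholders, not Fedora's directory layout), while the FreeIPA python API argument is about replacing this kind of per-service plumbing with one library.

    # Hypothetical direct LDAP lookup against a plain openldap + kerberos setup.
    # Requires the python-ldap package; all names below are placeholders.
    import ldap

    conn = ldap.initialize("ldap://ldap.example.org")
    conn.simple_bind_s()   # anonymous bind for a read-only query

    results = conn.search_s(
        "ou=People,dc=example,dc=org",   # placeholder base DN
        ldap.SCOPE_SUBTREE,
        "(uid=someuser)",
        ["cn", "memberOf"],
    )
    for dn, attrs in results:
        print(dn, attrs)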