19:00:02 #startmeeting Infrastructure (2011-08-04)
19:00:02 Meeting started Thu Aug 4 19:00:02 2011 UTC. The chair is nirik. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:02 Useful Commands: #action #agreed #halp #info #idea #link #topic.
19:00:02 #meetingname infrastructure
19:00:02 The meeting name has been set to 'infrastructure'
19:00:02 #topic Robot Roll Call
19:00:03 #chair smooge skvidal codeblock ricky nirik abadger1999
19:00:03 Current chairs: abadger1999 codeblock nirik ricky skvidal smooge
19:00:11 here
19:00:17 giggity
19:00:18 * abadger1999 here
19:00:18 morning smooge
19:00:39 * nirik waves to all
19:01:26 * nirik will start the meeting at :03
19:01:45 * CodeBlock waves
19:02:44 oh I thought I was late again
19:03:02 #topic New folks introductions and apprentice tasks/feedback
19:03:06 smooge: not at all. ;)
19:03:28 so, any new folks like to introduce themselves? any apprentice folks like to talk about specific items or questions?
19:04:16 I added another apprentice / easyfix ticket yesterday...
19:04:24 move/convert SOPs over from wiki to git.
19:04:53 I've also gotten several replies to my August fi-apprentice ping email. A number of people had busy summers but hope to dig back in soon.
19:05:18 I'll be doing the group cleanup next week.
19:05:46 #topic F16 Alpha Freeze reminder and tickets
19:05:57 Reminder that we are in a pre-release freeze right now.
19:06:22 https://fedorahosted.org/fedora-infrastructure/browser/architecture/Environments.png
19:06:28 lists what's included and what's not.
19:06:40 Anything that's included, you MUST post to the list and get 2 +1s on.
19:06:56 It looks like we will slip 1 or more weeks if I read the email correctly
19:07:06 yeah, seeming likely. ;(
19:07:08 It's entirely possible
19:07:18 we also have f16 alpha tickets all filed:
19:07:20 Not for sure yet, but somewhat likely, given the late TC
19:07:27 .ticket 2894
19:07:28 nirik: #2894 (F16Alpha: websites) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2894
19:07:33 .ticket 2895
19:07:36 nirik: #2895 (F16Alpha: Verify mirror space) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2895
19:07:40 .ticket 2896
19:07:41 nirik: #2896 (F16Alpha: Release day ticket) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2896
19:07:42 .ticket 2897
19:07:45 nirik: #2897 (F16Alpha: Verify mirror permissions) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2897
19:07:47 .ticket 2898
19:07:50 nirik: #2898 (F16Alpha: Verify mirrormanager redirects) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2898
19:08:08 so, we should make sure we have these under control before Alpha.
19:08:43 sorry, sorry
19:08:45 * skvidal is here
19:08:50 hey skvidal. No worries.
19:09:12 so, does anyone wish to take on any of those alpha tickets for their very own? ;)
19:09:47 in any case we will make sure they get done before alpha.
19:10:19 Anything more on alpha tickets?
19:10:37 #topic Upcoming Tasks/Items
19:10:50 Anyone have upcoming items they wish to plan/schedule or discuss?
19:11:09 We can't affect any of the machines in the freeze, but we can work on other machines and also plan/document things. ;)
19:11:52 I'm planning on sending out a straw man plan for upgrading hosted for people to poke holes in.
19:12:37 I'm working on a little web app for ambassadors to be able to run a raffle.
19:12:52 Plan to deploy it to production after alpha freeze.
19:13:04 Not sure if it'll become a permanent fixture or will be a one-shot.
19:13:07 abadger1999: cool. ;)
19:13:30 is that likely to need to follow the dev -> stg -> prod chain? or so simple it can just test in stg?
19:14:07 nirik: I can test in stg since it's not deployed yet.
19:14:17 nirik: But I can start in dev w/ a dev instance if you'd rather.
19:14:22 up to you :-)
19:14:23 I will take mirror space and permissions
19:14:47 I am planning one or two things that I will need +1s for
19:14:54 abadger1999: don't care too much on a simple app I don't think. If it can be safely tested in stg that's fine. Especially if it doesn't use a different framework, etc.
19:15:05 smooge: thanks on the tickets.
19:15:34 It'll be TG2. I'll plan on testing in stg; I'll holler if I need something else b/c it's not safe to test there.
19:15:50 ok
19:16:21 #topic List items / random info
19:16:33 So, I thought I would bring up a few things I posted on list for discussion...
19:16:43 but of course replies to the list are fine too.
19:17:12 First one was: the sysadmin group requirement for sysadmin-qa. I was thinking we might drop that requirement for them since they don't care about sysadmin emails.
19:17:24 I don't know if there's some other reason sysadmin-foo groups require sysadmin.
19:18:01 Second one was access to log02 for apprentices. ;)
19:18:09 nirik: one thing about that was that they needed to go through bastion to get to their boxes I think... we could add sysadmin-qa to the list of groups that can shell into bastion, though.
19:18:33 abadger1999: I made them a bastion-comm01... so they should be able to use that for access.
19:18:40 Okay
19:18:43 That works too :-)
19:18:48 does sysadmin get shell on bastion?
19:19:06 I think that's the way we set it up.
19:19:08 * abadger1999 checks
19:19:31 doesn't seem to.
19:19:38 you have to be in sysadmin-noc or above
19:19:40 * nirik thinks that's just the emails
19:19:41 to get into bastion
19:20:15 ah, looks like we explicitly list all the sysadmin-* groups. Misrecollection on my part.
19:20:40 for groups other than sysadmin-qa I think it makes sense... if you are sysadmin-resource you should still be in the loop on commits and outages so you can know about changes that affect your resource.
19:21:48 anyhow, we can see if there's a historical reason and just change it if there's not.
19:23:14 so, do chime in on list. ;)
19:23:33 some info items:
19:23:48 #info there was a short unplanned outage yesterday. Sent details to list.
19:24:04 #info infra-docs is live and ready for SOPs to be converted to it.
19:25:18 #info DNS glue records are now fixed.
19:25:23 1) I need to update our wildcard certificate. 2) I am going to remove ns1/ns2 from the DNS for fedoraproject.org and other zones that have been fixed
19:25:34 #info backup03 sees its tape drive, so we can set it up now.
19:25:40 actually the files aren't fixed. I realized that I needed +1s to do so
19:25:52 #info new wildcard cert is ready to go (28 days to spare)
19:26:48 * nirik thinks of other things pending.
19:27:04 #info new ibiblio02 machine should be ready soon.
19:27:50 #topic Meeting tagged tickets:
19:27:50 https://fedorahosted.org/fedora-infrastructure/query?status=new&status=assigned&status=reopened&group=milestone&keywords=~Meeting&order=priority
19:28:00 any meeting tickets folks would like to note or talk about?
19:28:06 or any other tickets for that matter?
19:28:18 not me at the moment
19:28:28 nothing leaps to mind
19:28:32 nope
19:28:45 cool.
19:28:51 #topic Open Floor
19:28:55 anything for open floor?
19:29:17 just waiting for the hardware to be finished racking
19:29:25 smooge: any news on that?
19:29:42 nothing beyond that it was what caused our outage yesterday :)
19:30:03 yeah, I figured. ;(
19:30:28 Once those are in place, I'd like to build up the bvirthostwhatever and put a new releng03 on it.
19:30:53 ok
19:30:54 smooge: oh, can you talk about that IMM/RSA reset thing a bit?
19:31:15 ok so for some reason a bunch of our IMM boxes went "dead" to the world after we left PHX2
19:31:17 * skvidal stabs imm/rsa in the face
19:31:26 oh, sorry, bitter
19:31:51 the only fix I have found is to install an IBM tool which talks to the hidden controller between the IMM and the box
19:32:00 there are 4 machines where the management interface is not working currently.
19:32:08 and tell it to give an IP address and reset
19:32:21 s/not working/not working at all. no ping, no ssh, no nothing/
19:33:06 unfortunately, those machines contain 'important' guests.
19:33:09 the issue is.. all the systems which are down are critical
19:33:19 so it can't happen until after the freeze
19:33:46 * nirik nods.
19:34:02 Also, many of our machines have older versions of the IMM firmware. Updating that might be a good thing too.
19:34:13 not that the new one is too much better. ;)
19:34:34 So, the boxes are up and the guests are running but the management interface is down?
19:34:48 #info need to update IMM/RSA on machines, as well as reset it on 4 of them.
19:34:48 correct. if something happens to the box.. we are SOL
19:34:50 abadger1999: yep
19:34:53 Okay.
19:35:44 * nirik tries to think of anything else to discuss...
19:35:53 any other topics? Or shall we call it a short meeting?
19:36:20 one minor thing
19:36:23 the infra-hosts git repo
19:36:30 if anyone wants to start adding notes to servers
19:36:32 please do so
19:36:44 * nirik nods. Good plan.
19:36:46 hell, anytime you remember something 'odd' that's quasi-specific to that server, do it
19:36:51 it can be anything
19:37:07 look at log02 for an example
19:37:41 * nirik has an idea. Not sure it will be useful or work tho.
19:38:04 nirik: ?
19:38:05 could we put something in that repo to mark what hosts are in which update group? A B C ?
19:38:12 absolutely
19:38:29 then, somehow generate func lists or whatever from that...
19:38:31 put it in the 'notes' file
19:38:35 hmmm...
19:38:44 sure
19:38:45 or perhaps that's best as separate groups in func
19:38:46 we could do that
19:38:47 no
19:38:55 I think we could do that
19:39:01 I can write a script to mine that data out
19:39:05 don't put it in 'notes' then
19:39:14 maybe make a 'servertype' item or something like that
19:39:28 I'd like a 'func-yum --hosts-from-list=group-a check update' or whatever.
19:39:34 yeah, or 'updategroup' or something.
19:39:39 it would probably be
19:39:46 func-yum --hosts=@group-a update
19:39:54 * nirik nods, that's fine.
19:39:56 since func-yum should handle that group syntax now
19:40:14 anyhow, we can figure that out out of band...
19:40:25 yep
19:40:55 app => rhel6; lmacken thinks that fedoracommunity should be pretty easy to fix once he gets the last packages built for EPEL6.
19:41:08 So that just leaves mediawiki slowness.
19:41:24 cool. I keep meaning to look at that, but never get to it. ;)
19:41:29 Do we want to put out a cattle call to find a new fi-apprentice to look at that?
19:41:31 yeah, I'm working on the moksha EL6 thing... dealing with odd issues with the TG2 stack atm.
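
A minimal sketch of the group-mining script skvidal offers to write above, assuming the infra-hosts repo keeps one directory per host with a one-line 'updategroup' file in it; the file name and repo layout here are assumptions, not the repo's actual conventions:

    #!/usr/bin/env python
    # Sketch: build host lists from update-group markers kept in an
    # infra-hosts checkout. Assumes <repo>/<hostname>/updategroup holds
    # a single group letter such as 'A'; adjust to the real layout.
    import os
    import sys

    def hosts_by_group(repo_dir):
        groups = {}
        for host in sorted(os.listdir(repo_dir)):
            path = os.path.join(repo_dir, host, 'updategroup')
            if not os.path.isfile(path):
                continue
            group = open(path).read().strip().upper()
            groups.setdefault(group, []).append(host)
        return groups

    if __name__ == '__main__':
        repo_dir, wanted = sys.argv[1], sys.argv[2]
        hosts = hosts_by_group(repo_dir).get(wanted.upper(), [])
        # One host per line; a wrapper could join these into whatever
        # --hosts syntax func-yum settles on (e.g. the @group-a form).
        print('\n'.join(hosts))

Run as, say, 'genhosts.py /path/to/infra-hosts A' (script name hypothetical); the output could feed the 'func-yum --hosts=@group-a update' invocation discussed above once the group syntax is settled out of band.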
19:41:33 might see if ricky or ianweller can look at some point.
19:41:41 abadger1999: that would be cool too.
19:42:35 nirik: Do we have a ticket about the slowness issue?
19:42:40 abadger1999: once we have a rhel6 app server working, would bapp01 be hard to do? or is it mostly distro independent?
19:42:54 nope. I can file one tho...
19:43:36 nirik: I'll write a call for volunteers; if you get a ticket open with some numbers/testing it'll be a good place for me to send people to get started.
19:43:47 nirik: I'd say do bapp01 last.
19:44:08 nirik: bapp01 has a bunch of stuff running that's not on the other app servers.
19:44:14 yeah.
19:44:15 cron jobs and such.
19:44:42 ok.
19:44:45 things that interface with rh bugzilla, koji... not everything on there is easy to test in stg for those reasons :-(
19:45:00 ok.
19:45:32 I can file a ticket on the mediawiki thing.
19:45:35 Probably we need to update the other app servers, then look through puppet for what's running on bapp01.
19:45:44 (and not on the other app servers)
19:45:55 does bapp01 need to be in phx2? (for bugzilla access, etc?)
19:46:00 and the people responsible for those (mdomsch, I, maybe lmacken)
19:46:14 it needs bugzilla, mounting of the netapps
19:46:16 sit down and make sure all of those work...
19:46:23 and various other things
19:46:27 maybe in production since they might be hard to test.
19:46:42 (without having side effects on bugzilla/koji/etc)
19:46:49 nirik, it is probably the most critical box that needs to be in phx :/
19:47:20 ok.
19:48:20 If we think that multiple small, targeted servers are more scalable than one beefier server, bapp01 might be a good candidate.
19:48:46 It doesn't truly need to be an app server and it doesn't need to be load balanced.
19:49:08 well, the reason I asked if it needs to be in phx2, was thinking that it would be nice if it could be 'floating'... i.e., have app server setup in puppet and a bapp thing and we could move bapp to whatever app server we wanted to run those things.
19:49:21 but it sounds like that's not possible.
19:49:37 nirik: the mount points make it tricky, I suspect
19:49:43 abadger1999: https://fedorahosted.org/fedora-infrastructure/ticket/2908
19:49:54 though I've often wondered about that... is it actually MOVING or accessing files on those mount points?
19:49:59 or is it mostly acquiring directory indexes?
19:50:42 Some of everything I believe
19:51:35 not sure.
19:52:10 I can't think of anything offhand that would be writing to the mount points at least, but bapp01 is very... eclectic so I don't know everything that's running on it.
19:52:40 I guess I was wondering
19:52:44 nirik: thanks. I'll send a message about that.
19:52:49 could we dump the nfs mounts
19:53:00 and use file-indexes of the rpms generated on the boxes
19:53:11 skvidal: yeah, I think that might be for bodhi to complete package names...
19:53:11 or even repometadata
19:53:20 on the other apps at least
19:53:34 nirik: that's what I was thinking - I'm sure bodhi can read a list from a file faster than a dir glob.glob()
19:53:56 I'll see about looking at the code for bodhi to see if I can make that work
19:53:57 https://fedorahosted.org/fedora-infrastructure/ticket/2836
19:54:09 lmacken: ^ is that for package name completion?
19:54:12 skvidal: cool.
19:54:23 It would be nice to not have to have mounts on the app servers.
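
To make the glob-versus-flat-file point above concrete: a glob over an NFS-mounted package tree pays a network round trip per directory operation, while a pre-generated index file is a single sequential read from local disk. A sketch under assumed paths (neither function is bodhi's actual code):

    import glob
    import os

    # Approach under discussion today: enumerate package names by globbing
    # the koji NFS mount; every directory entry costs NFS traffic.
    def names_from_mount(mount='/mnt/koji/packages'):
        return sorted(os.path.basename(p)
                      for p in glob.glob(os.path.join(mount, '*')))

    # Proposed alternative: read the same names from a flat index file
    # regenerated out of band (say, by a cron job running near koji).
    def names_from_index(index='/var/cache/pkg-names.txt'):
        with open(index) as f:
            return sorted(line.strip() for line in f if line.strip())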
19:55:00 indeed
19:55:10 and it would make those boxes less 'special'
19:55:41 also, currently we have app05 and app06 that are not in phx2, but they are not in the base load (only backups) I think due to this reason.
19:56:20 (well, and possibly db latency)
19:56:40 I think the writing is from mirrormanager
19:56:59 nirik, a lot of db latency
19:57:32 smooge: mirrormanager is writing to nfs? or do you mean writing to the db?
19:58:01 skvidal, I thought there was something in mirrormanager that writes to the disks.. but I could be wrong
19:58:25 smooge: I know it writes out its mirror metalink files and what-not
19:58:29 but that's not big
19:58:55 oh I was thinking you were wondering about ro access versus rw. I misread something
19:59:02 nirik: db latency was why they were backups originally.
19:59:03 np
19:59:05 nirik, skvidal: it used to be for the build auto-completion, but I think we may not need /mnt/koji on the app servers anymore. I'll look into it and follow up in the ticket.
19:59:13 lmacken: thank you
19:59:17 abadger1999: yeah. ;(
19:59:21 lmacken: cool. Thanks.
19:59:56 in any case I think we all agree on bapp01: a) identify and document the 'specialness' it has and b) try and reduce that so it's less complex/SPOF. ;)
20:00:42 +1
20:00:49 ok, any last items from anyone? if not will close out soon here...
20:01:24 lmacken: just did some searches through the code
20:01:43 lmacken: looks like it is _fetch_candidate_builds() which does the autocompletion and that looks like direct koji calls to get those lists
20:01:56 lmacken: so - I suspect you are correct about /mnt/koji being a legacy mount
20:02:56 cool.
20:03:55 ok, thanks for coming everyone!
20:03:57 #endmeeting
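
As a postscript, the "direct koji calls" pattern skvidal found in _fetch_candidate_builds() looks roughly like the following against koji's hub API; the hub URL and tag name below are illustrative guesses, not values taken from bodhi's source:

    import koji

    # Ask the koji hub directly for candidate builds instead of globbing
    # a local /mnt/koji mount; no NFS mount is needed for this call.
    session = koji.ClientSession('http://koji.fedoraproject.org/kojihub')
    builds = session.listTagged('dist-f16-updates-candidate', latest=True)
    for build in builds:
        print(build['nvr'])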