21:03:44 #startmeeting infrastructure2
21:03:44 Meeting started Thu Jan 6 21:03:44 2011 UTC. The chair is smooge. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:03:44 Useful Commands: #action #agreed #halp #info #idea #link #topic.
21:03:52 #meetingname infrastructure
21:03:52 The meeting name has been set to 'infrastructure'
21:04:01 #chair skvidal ricky
21:04:01 Current chairs: ricky skvidal smooge
21:04:15 I think we are done with fas01
21:04:34 next ticket?
21:04:34 .ticket 2543
21:04:35 smooge: #2543 (upgrade internetx01 to rhel6) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2543
21:05:00 Ok this will basically be a reinstall with possibly remote hands.
21:05:16 where can I move proxy02?
21:05:19 mmcgrath did this one himself
21:05:22 I can look at that today if you're interested
21:05:26 ricky: +1
21:05:30 I think we have console there
21:05:33 I would just turn it off and take it out of dns
21:05:34 And proxy02 can just be down for a while, just take it out of DNS
21:05:41 we would just reinstall afterwards
21:05:50 ricky: nod
21:05:54 okay
21:05:55 it is mainly for IPv6
21:06:07 the main issue is that there are 2 different routes
21:06:19 the main hardware is on one and the guests have a different one
21:06:29 but that seems similar to the boxes at other colos
21:07:16 sounds reasonable, though
21:07:20 * nirik ventures to this cold and desolate corner of freenode.
21:07:24 ricky: if you want to nuke proxy02 and do it, it's cool
21:08:00 OK, will start noting down the configs for those machines after meeting
21:08:04 just make sure it's documented.. I remember mmcgrath practically swearing on this or bodhost at one point but I can't remember why
21:08:40 thanks ricky .
let me know how I can help
21:08:49 .ticket 2543
21:08:50 smooge: #2543 (upgrade internetx01 to rhel6) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2543
21:08:58 Will probably bother you with a bunch of ipv6 questions :-)
21:09:09 .ticket 2531
21:09:10 smooge: #2531 (DB03 update) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2531
21:09:47 ok this one is that we have db03 on a locally compiled version of postgres83
21:09:56 and EL5 is with 84 now.
21:09:56 * skvidal twitches
21:10:17 so every 30 minutes puppet says "Hey I tried to update these rpms for you but you ated them."
21:10:41 Is there an el6 upgrade planned for db03 as well? Should we bother doing one dump/load for 84 and another one for the el6 upgrade?
21:10:42 so we need to figure out how to dump and reload like we did with db0[12?]
21:10:49 we need to move it to el6
21:11:11 are we confident that the db in el6 is as stable/performant as it is currently in el5?
21:11:16 dgilmore, ok cool. I didn't want to add more makework to it so was going for lowest change
21:11:17 smooge: we need to take a koji outage
21:11:20 dump the db
21:11:26 build an el6 box
21:11:32 load the backup
21:11:37 and away we go
21:11:52 smooge: it's the road of greatest pain
21:12:01 well we are going to have all kinds of outages coming up :).
21:12:18 * dgilmore notes that db03 is not a virtual machine
21:12:21 I have the feeling I won't be doing much at Fudcon but will be at the colo shooting things
21:12:37 dgilmore, yeah.. I was wondering if you wanted to try it as a virtual machine again?
21:13:20 The hardware for db03 is to be renewed this coming year
21:14:18 smooge: we can. the reason that it got its own box is gone
21:14:54 well if you have time to help me rebuild bvirthost01 to your needs we could put it there..
it has vast tracts of disk
21:15:21 or you could go with what is there now if it meets them
21:16:10 how about this for a plan of action:
21:16:49 1) build a db03-06 on bvirthost01 with EL6. Do a dump on db03 and do an import in db03-06 to see if it shits bricks or not.
21:17:17 2) rebuild db03-06 (if needed) and do the koji outage with a dump.
21:17:37 3) rename db03-06 to db03 and put into production... see what poops bricks then.
21:17:47 4) go back or continue on.
21:18:02 skvidal, ricky? overly complicated or missing something?
21:18:13 doesn't seem overly complicated to me
21:18:31 seems like it would let us test out the basics
21:18:35 and shorten the outage time
21:18:36 Sounds good
21:18:49 the hw that db03 is on
21:18:54 is it out of warranty, too?
21:19:22 skvidal, it will be in June
21:19:30 okay
21:19:33 it will be replaced in our first order list
21:20:19 okay
21:20:33 dgilmore: does the above sound okay to you?
21:21:34 skvidal: seems fine
21:21:40 okay
21:21:48 I'll update the ticket
21:21:50 next!
21:22:03 .ticket 2501
21:22:04 smooge: #2501 (What will it take to upgrade fedorahosted to RHEL6, new trac, new git?) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2501
21:22:24 do we have a testing rhel6 instance?
21:22:28 i thought we did
21:22:42 for trac, I thought jkeating did
21:22:42 several. Oxf13 put new git and such on one
21:22:44 and tested it
21:22:51 pt3, I think
21:22:56 or pt7
21:23:18 * dgilmore has been trying to find time to look at glusterfs for use on the backend storage in an architecture redesign
21:23:55 hmm I wonder what it would take to build/test that at fudcon so we can "see" and break things in the same room
21:24:10 * nirik notes that sheepdog looks interesting.
http://www.osrg.net/sheepdog/ (but would require machines be on the same net to share backend storage)
21:24:32 I think currently hosted01/02 are on the same network
21:24:58 they are
21:25:53 smooge: right now - my fudcon schedule is completely chock-a-block
21:26:05 i believe all our serverbeach stuff is in one datacentre
21:26:08 or at least it was
21:26:09 mine is looking to be in the colo :/
21:26:14 There are two serverbeach datacenters
21:26:20 how about this
21:26:20 hosted* are in texas, I think
21:26:23 The rest in virginia
21:26:30 do we have a deadline for the hosted update?
21:26:35 can we put this one off just a bit?
21:26:49 ricky: hrrm ok. i thought they were all in the same one. even though sb has multiple
21:26:56 if it will involve so many infrastructural changes - do we want to wait until the new boss shows up?
21:27:00 skvidal: it's a nice-to-have thing
21:27:00 I think there was some breakage that this was to "fix" by introducing new breakage
21:27:09 but I think we could wait til February
21:27:12 dgilmore: right - but not critical
21:27:34 skvidal: we can do the rhel6 migration without architectural changes also
21:28:00 Would we have to do extra work to avoid getting the new trac packages?
21:28:17 ricky: new trac is only in el6
21:28:22 stay on el5
21:28:25 if it becomes critical: 1) rebuild hosted03 as EL6 and do the same items as db03 (shutdown, dump, load, lather, rinse, repeat)
21:28:26 old trac
21:28:32 OK.
21:28:41 2) rebuild hosted02 as EL6 and put in sync with the renamed 03
21:28:45 smooge: I think that is always going to be the preferred path
21:29:02 then deal with gluster and multiple sites
21:29:04 the only boxes we should take down to update to rhel6 are those with multiple boxes supporting the service
21:29:23 now here is the thing
21:29:40 hosted02 is as far as I can tell an oh-shit backup versus any sort of failover
21:29:47 yes
21:29:56 it is, at best, a warm copy
21:30:46 ok so something for february then
21:31:20 So next?
:-)
21:32:03 .ticket 2517
21:32:04 smooge: #2517 (Need mod_evasive for EL6) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2517
21:33:07 ok have we seen any issues with git that require us to have mod_evasive on it?
21:33:09 So has pkg01 been falling over without it? :-)
21:33:18 I don't think it has
21:33:24 Won't deny that gitweb is pretty heavy :-(
21:33:35 the only fall overs I have seen have been weirder stuff
21:33:37 But -caching seems to be doing the job
21:35:17 when mod_evasive was installed
21:35:25 were we dealing with a problem?
21:35:33 It was viewvc
21:35:33 or was it entirely "this might be bad"?
21:35:39 on cvs01?
21:35:43 but nothing else?
21:35:47 (it's always the evil VCS web frontend, isn't it?)
21:36:16 Pretty sure it was just viewvc shelling out to CVS/RCS
21:36:28 And getting hit by robots
21:36:38 we do have the snapshot stuff turned off in gitweb still.
21:36:38 okay
21:36:43 there's a request to open that up again.
21:37:04 so.. - maybe do this
21:37:14 move the mod_evasive issue over to the snapshot enablement ticket
21:37:24 .ticket 2123
21:37:25 nirik: #2123 (Please enable snapshot link in gitweb) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2123
21:38:29 but let's move on
21:38:46 but I think tying mod_evasive to snapshots - only if needed - seems like a good plan
21:38:55 .ticket 2539
21:38:56 smooge: #2539 (decom xb-01 and reallocate bxen01) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2539
21:39:13 ok I will work on getting mod_evasive into epel
21:39:14 so bxen01 reallocating is not gonna happen if it is out of warranty :(
21:39:37 well for a bubble sort box to be used just to move crap onto and off of?
21:39:46 fine
21:40:50 next?
21:40:55 but we could do it with xen07 also. it is the next one to go out of warranty.
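[Editor's note: the dump-and-reload path agreed for db03 under ticket 2531 above is the stock PostgreSQL major-version upgrade. A minimal dry-run sketch that only prints the commands rather than running them; the hostnames (db03, db03-06) come from the discussion, while the user, paths, and the choice of pg_dumpall are illustrative assumptions, not the actual procedure:]

```shell
# Dry-run sketch of the db03 dump/reload plan (ticket 2531).
# Prints the commands instead of executing them; everything here
# other than the hostnames is a hypothetical example.
plan_upgrade() {
    old=$1 new=$2
    # 1) full dump on the old (EL5 / postgres 8.3) box
    echo "pg_dumpall -h $old -U postgres > /tmp/${old}-full.sql"
    # 2) load into the freshly built EL6 box
    echo "psql -h $new -U postgres -f /tmp/${old}-full.sql postgres"
}

plan_upgrade db03 db03-06
```

[The actual run would also need the koji outage window around step 1, per the plan of action earlier in the meeting.]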
21:41:12 .ticket 2544
21:41:13 smooge: #2544 (migrate autoqa01 elsewhere) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2544
21:41:24 autoqa01 is living on cnode01
21:41:29 cnode01 belongs to the cloud group now
21:41:45 ok this one is a redesign of networks and such that I was putting together for RHIT before break
21:42:20 basically I would like to build a 4th network in PHX2 where the secondary architectures (s390/ppc/arm) can go live and also QA boxes
21:43:05 this network would have limited access to the production/devel networks to cut down "oh shit" moments.
21:43:39 I will ping mgalgoci/ebrown after the meeting to figure out where this is and if it's not we go to plan b
21:43:45 which we need to figure out.
21:44:03 this feels like a ways off, then
21:44:05 next?
21:44:13 as it turns out it's more than just autoqa01 that moves from cnode. there are 4-8 qa boxes that need to move too
21:44:33 .ticket 2545
21:44:34 smooge: #2545 (SOP and best practices for publictest## boxes) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2545
21:44:45 abadger1999 and I were talking about this
21:44:57 specifically we have a number of publictest boxes which seem to be idle but still running
21:45:07 should we make the rule that if they're not in use we shut them down
21:45:36 and we need to make a way for people to request them to be turned back on
21:45:58 otherwise we end up with "oh I just saw pt03 running so I installed there.. sorry it fucked up your project"
21:46:18 yah
21:46:25 but I have no problem with dropping boxes that aren't in use.
21:46:30 abadger1999: ?
21:46:32 you around?
21:46:44 * ricky feels like we can just solve these conflicts when they come up
21:47:08 yeah. That would seem like better practice than what we do now.
21:47:09 Can't remember one happening once yet.
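[Editor's note: the "shut down anything unclaimed" rule proposed for ticket 2545 could be scripted roughly as below. This is a sketch only; the KEEP list and guest names are hypothetical examples, not the real publictest inventory:]

```shell
#!/bin/sh
# Sketch of the proposed publictest SOP: shut down any guest nobody has
# claimed. KEEP and the guest names fed in below are hypothetical.
KEEP="publictest03 publictest07"        # boxes documented as in use

# Read guest names on stdin; print only the unclaimed ones.
filter_unclaimed() {
    while read -r guest; do
        case " $KEEP " in
            *" $guest "*) ;;                 # claimed: leave it running
            *) printf '%s\n' "$guest" ;;     # unclaimed: shutdown candidate
        esac
    done
}

# Real use would be something along the lines of:
#   virsh list --name | grep '^publictest' | filter_unclaimed |
#       while read -r g; do virsh shutdown "$g"; done
printf 'publictest01\npublictest03\npublictest05\n' | filter_unclaimed
```

[Pairing this with an RFR-style "turn my box back on" request path would cover the reactivation side discussed above.]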
21:47:21 ricky, it happened twice last summer
21:47:39 (Removing boxes that aren't in use; creating them as they're used again)
21:47:41 Oh, ignore me then
21:48:15 but it was something where "put in an RFR and get someone in sysadmin-main to spin you up a fresh box" would have covered it.
21:48:19 Easiest way is: any unlabelled machines get xm shutdown
21:48:37 ricky: virsh destroy - the new xm shutdown :)
21:48:43 And then they get erased as soon as we need to build a new one and overwrite it
21:49:12 so if you have notes/thoughts
21:49:15 add them to the ticket
21:49:25 we'll compile that into an SOP and maybe into a script
21:49:30 ok will do so.
21:50:12 next?
21:51:22 .ticket 2546
21:51:23 smooge: #2546 (bnfs01) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2546
21:52:04 I think a side point on this is "Should hyperthreading be turned on on our systems?"
21:52:40 * skvidal has no opinion - I always turned it off before
21:52:56 but even turning it off this is an 8-core box with 16GB of ram
21:53:00 that's been 99% idle
21:53:08 * nirik always leaves it on, but not sure why.
21:53:15 Many of them have it off, but a couple have it on (mostly new ones).
21:53:34 nirik: right - I'm w/you on 'not sure why'
21:54:06 Well depending on the system hyperthreading can be faster (databases like oracle) or slower (VMs)
21:54:27 so I usually turn it off because I don't do oracle
21:54:33 or similar tools.
21:54:38 okay
21:54:40 but back to the main question.
21:54:51 this is the belt and suspenders box for nfs01
21:55:11 if nfs01 goes kablooey this is meant to be its replacement.
21:55:17 do we have a doc on what the 'cold failover' procedure looks like?
21:55:49 Just a guess - probably something like: Check mount, change IP
21:56:06 Oh, it's cold. Never mind.
21:56:33 not that I know of. dgilmore I think purchased/set it up
21:57:22 the history I got from dgilmore and mmcgrath was this
21:57:32 1.
it was intended to be snapshotted regularly
21:57:36 2. it was set up that way
21:57:39 smooge: mmcgrath did the purchase/setup
21:57:51 3. bad things happened in the db when that happened - where you had to manually intervene
21:58:04 ouch
21:58:07 ok
21:58:08 4. so it was shelved until someone got back to it? - that last bit is a bit fuzzy
21:58:27 dgilmore: does the above sound right? or am I misremembering?
21:58:41 skvidal: thats pretty spot on
21:59:04 we went that route because backing up /mnt/koji to tape took days
21:59:14 gotcha
21:59:18 and blocked all other jobs
21:59:39 this way we would have something that we could back up to
21:59:49 but also do a cold failover if it came to it
22:00:23 but like with hosted ive been thinking of ways to redo it
22:00:50 and i think that we could use gluster to keep the data realtime replicated to it
22:01:17 so that could be our gluster test case?
22:01:21 we could honestly make bnfs01 be a vm on the host
22:01:35 smooge: well i want to test it at home first
22:02:54 so rebuild the box to be a "virthost" and then create a vm on it. ok
22:03:05 right
22:03:26 make the disk available to it via some method
22:04:29 sounds like a plan - dgilmore, smooge: would one of y'all be willing to update the ticket with this?
22:04:54 doing so
22:05:12 .ticket 2540
22:05:13 smooge: #2540 (find all no longer running xen/kvm instances with disk space still allocated) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2540
22:05:48 smooge: sounds like you did this one - wanna close it?
22:06:13 I am still working on this one
22:06:15 okay
22:06:17 great
22:06:36 we have http://fpaste.org/oUkw/
22:06:41 a lot of dirty partitions
22:07:05 not counting what is unused on the iscsi box
22:07:21 which I haven't finished with yet.
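[Editor's note: the ticket 2540 scan described just below (looking for logical volumes whose attr string lacks the 'o' open flag, i.e. not in use by any running guest) might look like this. The volume group and LV names in the here-document are made-up sample data; a real run would pipe in lvs output instead:]

```shell
# Sketch of the "find dirty partitions" scan for ticket 2540.
# Real input would come from:
#   lvs --noheadings -o vg_name,lv_name,lv_attr
# An attr like "-wi-ao" means the LV is open (in use); "-wi-a-" means
# active but not open, so it's a candidate for cleanup.
awk '$3 !~ /o/ { print $1 "/" $2 }' <<'EOF'
GuestVolGroup00 proxy02     -wi-ao
GuestVolGroup00 oldtest01   -wi-a-
GuestVolGroup00 db03backup  -wi-a-
EOF
```

[Anything this prints would still need a human check against the host's guest list before removal, per the caution about the bxen boxes below.]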
22:07:48 ok
22:07:59 how I determined it: did a pvs and looked for partitions that were -wi-a-
22:08:13 lvs sorry
22:08:32 I think all but the bxen boxes are pretty safe to remove
22:09:12 * skvidal didn't know about looking for 'o' in lvs
22:09:13 good move
22:09:14 bxen02 had a lot of stuff on it that seemed special
22:09:47 yeah I figured it out when playing with kpartx to look at the age of old images
22:10:40 dgilmore, the partitions on bxen02 like mpmtest koji2
22:11:49 anyway I think I can leave bxenXX til later and clean up the rest
22:12:13 last ticket
22:12:17 .ticket 2530
22:12:18 smooge: #2530 (Selinux issues on PPC servers) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2530
22:12:20 smooge: mpmtest is mmcgrath
22:12:31 ok will ask him
22:13:00 the other ones we can go over when you have more bandwidth and rest from this everlong meeting :)
22:13:05 dgilmore, thanks
22:13:58 the last issue looks to have been fixed over break I will close that one.
22:14:33 skvidal, ricky we are done.
22:14:38 kewl
22:14:40 thank you
22:14:53 now to move another 30 of our open tickets to meeting :)
22:14:54 smooge: the koji2 on xenGuests I think was put there to migrate it from one host to another
22:14:56 Yay
22:15:07 #endmeeting