21:03:44 <smooge> #startmeeting infrastructure2
21:03:44 <zodbot> Meeting started Thu Jan  6 21:03:44 2011 UTC.  The chair is smooge. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:03:44 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
21:03:52 <smooge> #meetingname infrastructure
21:03:52 <zodbot> The meeting name has been set to 'infrastructure'
21:04:01 <smooge> #chair skvidal ricky
21:04:01 <zodbot> Current chairs: ricky skvidal smooge
21:04:15 <smooge> I think we are done with fas01
21:04:34 <skvidal> next ticket?
21:04:34 <smooge> .ticket 2543
21:04:35 <zodbot> smooge: #2543 (upgrade internetx01 to rhel6) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2543
21:05:00 <smooge> Ok this will basically be a reinstall with possibly remote hands.
21:05:16 <skvidal> where can I move proxy02?
21:05:19 <smooge> mmcgrath did this one himself
21:05:22 <ricky> I can look at that today if you're interested
21:05:26 <skvidal> ricky: +1
21:05:30 <ricky> I think we have console there
21:05:33 <smooge> I would just turn it off and take out of dns
21:05:34 <ricky> And proxy02 can just be down for a while, just take out of DNS
21:05:41 <smooge> we would just reinstall afterwards
21:05:50 <skvidal> ricky: nod
21:05:54 <skvidal> okay
21:05:55 <smooge> it is mainly for IPv6
21:06:07 <smooge> the main issue is that there are 2 different routes
21:06:19 <smooge> the main hardware is on one and the guests have a different one
21:06:29 <smooge> but that seems similar to the boxes at other colos
21:07:16 <skvidal> sounds reasonable, though
21:07:20 * nirik ventures to this cold and desolate corner of freenode.
21:07:24 <skvidal> ricky: if you want to nuke proxy02 and do it - it's cool
21:08:00 <ricky> OK, will start noting down the configs for those machines after meeting
21:08:04 <smooge> just make sure it's documented.. I remember mmcgrath practically swearing on this or bodhost at one point but I can't remember why
21:08:40 <smooge> thanks ricky . let me know how I can help
21:08:49 <smooge> .ticket 2543
21:08:50 <zodbot> smooge: #2543 (upgrade internetx01 to rhel6) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2543
21:08:58 <ricky> Will probably bother you with a bunch of ipv6 questions :-)
21:09:09 <smooge> .ticket 2531
21:09:10 <zodbot> smooge: #2531 (DB03 update) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2531
21:09:47 <smooge> ok this one is that we have db03 on a local compiled version of postgres83
21:09:56 <smooge> and EL5 is with 84 now.
21:09:56 * skvidal twitches
21:10:17 <smooge> so every 30 minutes puppet says "Hey I tried to update these rpms for you but you ated them."
21:10:41 <ricky> Is there an el6 upgrade planned for db03 as well?  Should we bother doing one dump/load for 84 and another one for the el6 upgrade?
21:10:42 <smooge> so we need to figure out how to dump and reload like we did with db0[12?]
21:10:49 <dgilmore> we need to move it to el6
21:11:11 <skvidal> are we confident that the db in el6 is stable/performant as it is currently in el5?
21:11:16 <smooge> dgilmore, ok cool. I didn't want to add more makework to it so was going for lowest change
21:11:17 <dgilmore> smooge: we need to take a koji outage
21:11:20 <dgilmore> dump the db
21:11:26 <dgilmore> build a el6 box
21:11:32 <dgilmore> load the backup
21:11:37 <dgilmore> and away we go
21:11:52 <dgilmore> smooge: its the road of greatest pain
21:12:01 <smooge> well we are going to have all kinds of outages coming up :).
21:12:18 * dgilmore notes that db03 is not a virtual machine
21:12:21 <smooge> I have the feeling I won't be doing much at Fudcon but will be at the colo shooting things
21:12:37 <smooge> dgilmore, yeah.. I was wondering if you wanted to try it as a virtual machine again?
21:13:20 <smooge> The hardware for db03 is to be renewed this coming year
21:14:18 <dgilmore> smooge: we can. the reason that it got its own box is gone
21:14:54 <smooge> well if you have time to help me rebuild bvirthost01 to your needs we could put it there.. it has vast tracts of disk
21:15:21 <smooge> or you could go with what is there now if it meets them
21:16:10 <smooge> how about this for a plan of action:
21:16:49 <smooge> 1) build a db03-06 on bvirthost01 with EL6. Do a dump on db03 and do an import in db03-06 to see if it shits bricks or not.
21:17:17 <smooge> 2) rebuild db03-06 (if needed) and do the koji outage with a dump.
21:17:37 <smooge> 3) rename db03-06 to db03 and put into production... see what poops bricks then.
21:17:47 <smooge> 4) go back or continue on.
21:18:02 <smooge> skvidal, ricky? overly complicated or missing something?
21:18:13 <skvidal> doesn't seem overly complicated to me
21:18:31 <skvidal> seems like it would let us test out the basics
21:18:35 <skvidal> and shorten the outage time
21:18:36 <ricky> Sounds good
21:18:49 <skvidal> the hw that db03 is on
21:18:54 <skvidal> is it out of warranty, too?
21:19:22 <smooge> skvidal, it will be in June
21:19:30 <skvidal> okay
21:19:33 <smooge> it will be replaced in our first order list
21:20:19 <skvidal> okay
21:20:33 <skvidal> dgilmore: does the above sound okay to you?
21:21:34 <dgilmore> skvidal: seems fine
21:21:40 <skvidal> okay
21:21:48 <skvidal> I'll update the ticket
21:21:50 <skvidal> next!
21:22:03 <smooge> .ticket 2501
21:22:04 <zodbot> smooge: #2501 (What will it take to upgrade fedorahosted to RHEL6, new trac, new git?) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2501
21:22:24 <dgilmore> do we have a testing rhel6 instance?
21:22:28 <dgilmore> i thought we did
21:22:42 <skvidal> for trac, I thought jkeating did
21:22:42 <smooge> several. Oxf13 put new git and such on one
21:22:44 <smooge> and tested it
21:22:51 <ricky> pt3, I think
21:22:56 <smooge> or pt7
21:23:18 * dgilmore has been trying to find time to look at glusterfs for use on the backend storage in a architecture redesign
21:23:55 <smooge> hmm I wonder what it would take to build/test that at fudcon so we can "see" and break things in the same room
21:24:10 * nirik notes that sheepdog looks interesting. http://www.osrg.net/sheepdog/ (but would require machines be on the same net to share backend storage)
21:24:32 <smooge> I think currently hosted01/02 are on the same network
21:24:58 <dgilmore> they are
21:25:53 <skvidal> smooge: right now - my fudcon schedule is completely chock-a-block
21:26:05 <dgilmore> i believe all our serverbeach stuff is in one datacentre
21:26:08 <dgilmore> or at least it was
21:26:09 <smooge> mine is looking to be in the colo :/
21:26:14 <ricky> There are two serverbeach datacenters
21:26:20 <skvidal> how about this
21:26:20 <ricky> hosted* are in texas, I think
21:26:23 <ricky> The rest in virginia
21:26:30 <skvidal> do we have a deadline for the hosted update?
21:26:35 <skvidal> can we put this one off just a bit?
21:26:49 <dgilmore> ricky: hrrm ok. i thought they were all in the same one. even though sb has multiple
21:26:56 <skvidal> if it will involve so many infrastructural changes - do we want to wait until the new boss shows up?
21:27:00 <dgilmore> skvidal: its a nice to have thing
21:27:00 <smooge> I think there was some breakage that this was to "fix" by introducing new breakage
21:27:09 <smooge> but I think we could wait til February
21:27:12 <skvidal> dgilmore: right - but not critical
21:27:34 <dgilmore> skvidal: we can do the rhel6 migration without architectural changes also
21:28:00 <ricky> Would we have to do extra work to avoid getting the new trac packages?
21:28:17 <dgilmore> ricky: new trac is only in el6
21:28:22 <dgilmore> stay on el5
21:28:25 <smooge> if it becomes critical: 1) rebuild hosted03 as EL6 and do the same items as db03 (shutdown, dump, load, lather, rinse, repeat)
21:28:26 <dgilmore> old trac
21:28:32 <ricky> OK.
21:28:41 <smooge> 2) rebuild hosted02 as EL6 and put in sync with renamed 03
21:28:45 <skvidal> smooge: I think that is always going to be the preferred path
21:29:02 <smooge> then deal with gluster and multiple sites
21:29:04 <skvidal> the only boxes we should take down to update to rhel6 are those with multiple boxes supporting the service
21:29:23 <smooge> now here is the thing
21:29:40 <smooge> hosted02 is as far as I can tell a oh shit backup versus any sort of failover
21:29:47 <skvidal> yes
21:29:56 <skvidal> it is, at best, a warm copy
21:30:46 <smooge> ok so something for february then
21:31:20 <ricky> So next?  :-)
21:32:03 <smooge> .ticket 2517
21:32:04 <zodbot> smooge: #2517 (Need mod_evasive for EL6) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2517
21:33:07 <smooge> ok have we seen any issues with git that requires us to have mod_evasive on it?
21:33:09 <ricky> So has pkg01 been falling over without it?  :-)
21:33:18 <smooge> I don't think it has
21:33:24 <ricky> Won't deny that gitweb is pretty heavy :-(
21:33:35 <smooge> the only fall overs I have seen have been weirder stuff
21:33:37 <ricky> But -caching seems to be doing the job
21:35:17 <skvidal> when mod_evasive was installed
21:35:25 <skvidal> were we dealing with a problem?
21:35:33 <ricky> It was viewvc
21:35:33 <skvidal> or was it entirely "this might be bad"?
21:35:39 <skvidal> on cvs01?
21:35:43 <skvidal> but nothing else?
21:35:47 <ricky> (it's always the evil VCS web frontend, isn't it?)
21:36:16 <ricky> Pretty sure it was just viewvc shelling out to CVS/RCS
21:36:28 <ricky> And getting hit by robots
21:36:38 <nirik> we do have the snapshot stuff turned off in gitweb still.
21:36:38 <skvidal> okay
21:36:43 <nirik> there's a request to open that up again.
21:37:04 <skvidal> so.. - maybe do this
21:37:14 <skvidal> move the mod_evasive issue over to the snapshot enablement ticket
21:37:24 <nirik> .ticket 2123
21:37:25 <zodbot> nirik: #2123 (Please enable snapshot link in gitweb) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2123
21:38:29 <skvidal> but let's move on
21:38:46 <skvidal> but I think tying mod_evasive to snapshots - only if needed seems like a good plan
21:38:55 <smooge> .ticket 2539
21:38:56 <zodbot> smooge: #2539 (decom xb-01 and reallocate bxen01) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2539
21:39:13 <smooge> ok I will work on getting mod_evasive into epel
21:39:14 <skvidal> so bxen01 reallocating is not gonna happen if it is out of warranty :(
21:39:37 <smooge> well for a bubble sort box to be used just to move crap onto and off of?
21:39:46 <skvidal> <shrug> fine
21:40:50 <skvidal> next?
21:40:55 <smooge> but we could do it with xen07 also. it is the next one to go out of warranty.
21:41:12 <smooge> .ticket 2544
21:41:13 <zodbot> smooge: #2544 (migrate autoqa01 elsewhere) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2544
21:41:24 <skvidal> autoqa01 is living on cnode01
21:41:29 <skvidal> cnode01 belongs to the cloud group now
21:41:45 <smooge> ok this one is a redesign of networks and such that I was putting together for RHIT before break
21:42:20 <smooge> basically I would like to build a 4th network in PHX2 where the secondary architectures (s390/ppc/arm) can go live and also QA boxes
21:43:05 <smooge> this network would have limited access to the product/devel networks to cut down "oh shit" moments.
21:43:39 <smooge> I will ping mgalgoci/ebrown after the meeting to figure out where this is and if its not we go to plan b
21:43:45 <smooge> which we need to figure out.
21:44:03 <skvidal> this feels like a ways off, then
21:44:05 <skvidal> next?
21:44:13 <smooge> as it turns out that its more than just autoqa01 that moves from cnode. there are 4-8 qa boxes that need to move too
21:44:33 <smooge> .ticket 2545
21:44:34 <zodbot> smooge: #2545 (SOP and best practices for publictest## boxes) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2545
21:44:45 <skvidal> abadger1999 and I were talking about this
21:44:57 <skvidal> specifically we have a number of publictest boxes which seem to be idle but still running
21:45:07 <skvidal> should we make the rule that if they're not in use we shut them down
21:45:36 <smooge> and we need to make a way for people to request them to be turned back on
21:45:58 <smooge> otherwise we end up with "oh I just saw pt03 running so I installed there.. sorry it fucked up your project"
21:46:18 <skvidal> yah
21:46:25 <smooge> but I have no problem with dropping boxes that aren't in use.
21:46:30 <skvidal> abadger1999: ?
21:46:32 <skvidal> you around?
21:46:44 * ricky feels like we can just solve these conflicts when they come up
21:47:08 <abadger1999> yeah.  That would seem like better practice than what we do now.
21:47:09 <ricky> Can't remember one happening once yet.
21:47:21 <smooge> ricky, it happened twice last summer
21:47:39 <abadger1999> (Removing boxes that aren't in use; creating them as they're used again)
21:47:41 <ricky> Oh, ignore me then
21:48:15 <smooge> but it was something where "put in an RFR and get someone in sysadmin-main to spin you up a fresh box" would have covered it.
21:48:19 <ricky> Easiest way is: any unlabelled machines get xm shutdown
21:48:37 <skvidal> ricky: virsh destroy - the new xm shutdown :)
21:48:43 <ricky> And then they get erased as soon as we need to build a new one and overwrite it
21:49:12 <skvidal> so if you have notes/thoughts
21:49:15 <skvidal> add them to the ticket
21:49:25 <skvidal> we'll compile that into an SOP and maybe into a script
21:49:30 <smooge> ok will do so.
21:50:12 <skvidal> next?
21:51:22 <smooge> .ticket 2546
21:51:23 <zodbot> smooge: #2546 (bnfs01) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2546
21:52:04 <smooge> I think a side point on this is "Should hyperthreading be turned on on our systems?"
21:52:40 * skvidal has no opinion - I always turned it off before
21:52:56 <skvidal> but even turning it off this is an 8-core box with 16GB of ram
21:53:00 <skvidal> that's been 99% idle
21:53:08 * nirik always leaves it on, but not sure why.
21:53:15 <smooge> Many of them have them off, but a couple have them on (mostly new ones).
21:53:34 <skvidal> nirik: right - I'm w/you on 'not sure why'
21:54:06 <smooge> Well depending on the system hyperthreading can be faster (databases like oracle) or slower (VM's)
21:54:27 <smooge> so I usually turn it off because I don't do oracle
21:54:33 <smooge> or similar tools.
21:54:38 <skvidal> okay
21:54:40 <smooge> but back to the main question.
21:54:51 <smooge> this is the belt and suspender box for nfs01
21:55:11 <smooge> if nfs01 goes kablooey this is meant to be its replacement.
21:55:17 <skvidal> do we have a doc on what the 'cold failover' procedure looks like?
21:55:49 <ricky> Just a guess -  probably something like: Check mount, change IP
21:56:06 <ricky> Oh, it's cold.  Never mind.
21:56:33 <smooge> not that I know of. dgilmore I think purchased/set it up
21:57:22 <skvidal> the history I got from dgilmore and mmcgrath was this
21:57:32 <skvidal> 1. it was intended to be snapshotted regularly
21:57:36 <skvidal> 2. it was setup that way
21:57:39 <dgilmore> smooge: mmcgrath did the purchase/setup
21:57:51 <skvidal> 3. bad things happened in the db when that happened - where you had to manually intervene
21:58:04 <smooge> ouch
21:58:07 <smooge> ok
21:58:08 <skvidal> 4. so it was shelved until someone got back to it? - that last bit is bit fuzzy
21:58:27 <skvidal> dgilmore: does the above sound right? or am I misremembering?
21:58:41 <dgilmore> skvidal: thats pretty spot on
21:59:04 <dgilmore> we went that route because backing up /mnt/koji to tape took days
21:59:14 <skvidal> gotcha
21:59:18 <dgilmore> and blocked all other jobs
21:59:39 <dgilmore> this way we would have something that we could backup to
21:59:49 <dgilmore> but also do a cold failover if it came to it
22:00:23 <dgilmore> but like with hosted ive been thinking of ways to redo it
22:00:50 <dgilmore> and i think that we could use gluster to keep the data realtime replicated to it
22:01:17 <smooge> so that could be our gluster test case?
22:01:21 <dgilmore> we could honestly make bnfs01 be a vm on the host
22:01:35 <dgilmore> smooge: well i want to test it at home first
22:02:54 <smooge> so rebuild the box to be a "virthost" and then create a vm on it. ok
22:03:05 <dgilmore> right
22:03:26 <dgilmore> make the disk available to it via some method
22:04:29 <skvidal> sounds like a plan - dgilmore, smooge: would one of y'all be willing to update the ticket with this?
22:04:54 <smooge> doing so
22:05:12 <smooge> .ticket 2540
22:05:13 <zodbot> smooge: #2540 (find all no longer running xen/kvm instances with disk space still allocated) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2540
22:05:48 <skvidal> smooge: sounds like you did this one - wanna close it?
22:06:13 <smooge> I am still working on this one
22:06:15 <skvidal> okay
22:06:17 <skvidal> great
22:06:36 <smooge> we have http://fpaste.org/oUkw/
22:06:41 <smooge> a lot of dirty partitions
22:07:05 <smooge> not counting what is unused on the iscsi box
22:07:21 <smooge> which I haven't finished with yet.
22:07:48 <skvidal> ok
22:07:59 <smooge> how I determined was did a pvs and looked for partitions that were: -wi-a-
22:08:13 <smooge> lvs sorry
22:08:32 <smooge> I think all but the bxen boxes are pretty safe to remove
22:09:12 * skvidal didn't know about looking for 'o' in lvs
22:09:13 <skvidal> good move
22:09:14 <smooge> bxen02 had a lot of stuff on it that seemed special
22:09:47 <smooge> yeah I figured it out when playing with the kpartx to look at the age of old images
22:10:40 <smooge> dgilmore, the partitions on bxen02 like mpmtest koji2
22:11:49 <smooge> anyway I think I can leave bxenXX til later and clean up the rest
22:12:13 <smooge> last ticket
22:12:17 <smooge> .ticket 2530
22:12:18 <zodbot> smooge: #2530 (Selinux issues on PPC servers) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2530
22:12:20 <dgilmore> smooge: mpmtest is mmcgrath
22:12:31 <smooge> ok will ask him
22:13:00 <smooge> the other ones we can go over when you have more bandwidth and rest from this everlong meeting :)
22:13:05 <smooge> dgilmore, thanks
22:13:58 <smooge> the last issue looks to have been fixed over break I will close that one.
22:14:33 <smooge> skvidal, ricky we are done.
22:14:38 <skvidal> kewl
22:14:40 <skvidal> thank you
22:14:53 <smooge> now to move another 30 of our open tickets to meeting :)
22:14:54 <dgilmore> smooge: the koji2 on xenGuests i think was put there to migrate it from one host to another
22:14:56 <abadger1999> Yay
22:15:07 <ricky> #endmeeting