19:00:02 #startmeeting Infrastructure (2011-03-24)
19:00:02 Meeting started Thu Mar 24 19:00:02 2011 UTC. The chair is nirik. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:02 Useful Commands: #action #agreed #halp #info #idea #link #topic.
19:00:03 #meetingname infrastructure
19:00:03 The meeting name has been set to 'infrastructure'
19:00:03 #topic Robot Roll Call
19:00:03 #chair goozbach smooge skvidal codeblock ricky nirik
19:00:03 Current chairs: codeblock goozbach nirik ricky skvidal smooge
19:00:21 * CodeBlock is short-time/will not be here for long.
19:01:02 my head feels like puffy sausages so will be a bit slow
19:01:39 * lmacken
19:01:43 * jsmith lurks
19:01:43 I didn't have much for agenda today... so hopefully a shortish meeting. ;)
19:02:13 yeah
19:02:17 * waltJ is here.
19:02:27 ok... let's dive in...
19:02:50 #topic new folks introduction
19:03:01 Any new folks around who would like to say hi and get more involved?
19:04:04 ok, we have had a few chime in on list. I would encourage them to show up to meetings and say hi in #fedora-admin
19:04:25 #topic pkgs01 iscsi issues
19:04:44 so pkgs.fedoraproject.org has had periodic issues with its lookaside storage of late.
19:05:00 dgilmore suspects it might be related to it being a rhel6 guest on a rhel5 xen host.
19:05:07 * skvidal is here
19:05:19 so, we might try and move it to a rhel6 kvm host if we can find a place.
19:05:48 or open to other ideas.
19:06:17 we might want to do this move at the same time as the possible db03 move in april, since that might be downtime for packagers anyhow.
19:06:17 do we have any new kernels
19:06:23 we can try on rhel6 under rhel5 xen?
19:06:32 skvidal: it's running the newest I think.
19:06:41 nirik: 6.1b got announced the other day
19:06:41 I rebooted it last time this happened to get it up to the latest.
19:06:49 dir drjones give us anything more recently?
19:06:52 true... we could look for fixes in that kernel.
19:06:53 s/dir/did/
19:07:12 the problem is still outstanding
19:07:23 it is only seen here on our stuff
19:07:29 what are the reqs for pkgs01?
19:07:38 also, a quick task for someone: nagios check for read-only happening there, so we notice it fast instead of having user reports. ;)
19:07:40 it looks like it is using a lot of disk space in /?
19:07:42 but bvirthost01 might be able to take it
19:08:29 ah - I see /srv/git is mounted locally
19:08:36 for some reason I figured that would be elsewhere
19:08:40 skvidal: yeah...
19:08:55 but even so it's only 60GB
19:09:01 (for all of /)
19:09:16 smooge: is bvirthost01 hw we can rely on?
19:09:23 it is new hardware
19:09:25 yah - it's virthost13 I'm thinking of which is 'odd'
19:09:28 skvidal: i believe its new hardware
19:09:29 smooge: did you have a bug on the rhel6 on rhel5 xen thing?
19:09:31 it has only run 1 server
19:09:47 nirik, yes. I have several bugs on that
19:10:11 ok, if you get a chance to shoot me the #'s, I can cc myself...
19:10:30 nirik, I am at the point of when we can decommission something in PHX2 we are shipping the IBM hardware to a kernel admin with a blood sacrifice
19:10:48 :)
19:11:10 which db's are on db03?
19:11:51 koji I think
19:11:57 that's all iirc
19:11:59 lemme find my notes
19:12:33 .bug 632802
19:12:34 smooge: Access Denied - https://bugzilla.redhat.com/show_bug.cgi?id=632802
19:12:41 nirik: yes, just koji
19:12:43 only koji is on db02
19:12:50 dgilmore: db02?
19:12:52 03
19:12:55 ah, whew
19:12:55 typo
19:12:56 yeah, so, if db03 is koji, packagers would have an outage there anyhow for moving db03... so we could do pkgs01 in the same window.
19:13:02 nod
19:13:10 its only there because we had rpmfiles and one other table
19:13:16 nirik, it is 632802. Not sure if you can join it
19:13:19 they were massive
19:13:29 millions of rows
19:13:36 and frequently hit
19:13:38 smooge: ok, not sure either, but will note it.
19:13:51 but kojis db needs have changed since
19:14:02 it doesn't need gobs of ram and cpu like it did
19:14:19 nirik, once I get an idea of what we would like to replace db03 with even if its a virtual box to run it and other stuff.. I will get a quote from CDW next week
19:14:40 that could become bvirthost02 and that would allow for all kinds of moves/cleans etc
19:14:48 ok.
19:15:16 it could be a virt host
19:15:28 we may want to look at our current db's and needs. I know we wanted to think about how to replicate/move SPOF for them if possible.
19:15:40 so, for now I just wanted to get the pkgs01 thing on everyone's radar.
19:16:35 I think moving it would be good when we have a good window and place.
19:17:15 nod
19:17:17 #idea possibly move pkgs01 when there's a db03 move outage.
19:17:38 #action need to add a check for detecting the r/o condition.
19:17:40 * skvidal is curious
19:18:02 if we had one piece of hw could db03 and pkgs01 run on the same virt host?
19:18:05 seems like they could
19:18:05 I can add that check if no one beats me to it.
19:18:26 skvidal: probably
19:19:17 ok, anything more on this? or shall we move on?
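A minimal sketch of the read-only check mentioned above, written as a Nagios-style plugin; the mount point below is a placeholder and would need to be the actual lookaside path on pkgs01:

    #!/usr/bin/env python
    # Minimal Nagios-style check: flag a filesystem that has gone read-only.
    # The mount point is an assumption -- substitute the real lookaside mount.
    import sys

    MOUNT_POINT = "/srv/cache/lookaside"  # placeholder path

    OK, CRITICAL, UNKNOWN = 0, 2, 3

    def mount_options(path):
        """Return the mount option list for the given mount point from /proc/mounts."""
        with open("/proc/mounts") as mounts:
            for line in mounts:
                fields = line.split()
                if len(fields) >= 4 and fields[1] == path:
                    return fields[3].split(",")
        return None

    def main():
        opts = mount_options(MOUNT_POINT)
        if opts is None:
            print("UNKNOWN: %s is not mounted" % MOUNT_POINT)
            return UNKNOWN
        if "ro" in opts:
            print("CRITICAL: %s is mounted read-only" % MOUNT_POINT)
            return CRITICAL
        print("OK: %s is read-write" % MOUNT_POINT)
        return OK

    if __name__ == "__main__":
        sys.exit(main())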
19:20:26 #topic puppet reorg
19:20:51 skvidal has been redoing puppet stuff. ;) Care to give an overview of the changes for folks?
19:20:59 sure
19:21:11 first: puppet is no longer running as a daemon for us on our systems
19:21:25 it runs out of cron every half hour - at 23 minutes after the hour and 53 minutes after the hour
19:21:32 it randomly waits up to 10m
19:21:33 before running
19:21:45 so we don't get the stampeding herd problem
19:22:06 so
19:22:16 we have gotten rid of the 'is puppet running' nagios check
19:22:26 actually the pkgs item is pretty 'simple' it is on iscsi and bvirthost01 should have a mount for it (if not I will put it in the queue). Just undefine on one box and define on another and we are golden
19:22:29 I've written a script to check to see if puppet has checked in to the puppetmaster
19:22:31 oh sorry
19:23:01 and it will emit a notice to the nagios server using func + nsca to let us know a box has not run puppet in N amount of time
19:23:08 finally we've been working on the puppet error reports
19:23:24 and we've trimmed them down - but it's still an issue
19:23:30 b/c some of them won't go away for quite a while
19:23:52 I'm actively working on a new script to parse the reports, per host, and generate a checksum of the errors/warnings
19:24:00 if the checksum changes, then it will send us a notice about that host
19:24:07 if it doesn't then it won't so we should see errors only once
19:24:14 to make sure we don't forget about them
19:24:22 I'm going to have the checksums nulled out once a month
19:24:22 skvidal, thank you very much for doing all that work.
19:24:24 so we get notices
19:24:36 it's interesting - parsing the puppet report yaml
19:24:39 skvidal: yeah, thanks a lot. It's great stuff to get fixed up.
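A rough sketch of the report-checksum idea skvidal describes (not the actual script): hash the err/warning lines of each host's newest puppet report and only send a notice when the hash changes. The paths and the notify() hook are assumptions, and the monthly "null the checksums" step would just be a cron job that removes the state file. Since puppet reports are Ruby-tagged YAML, the sketch hashes the error/warning lines as text instead of fully deserializing the report:

    #!/usr/bin/env python
    # Notify only when a host's set of puppet errors/warnings changes.
    # Paths and the notify() hook are assumptions, not the real infrastructure script.
    import hashlib
    import json
    import os

    REPORT_DIR = "/var/lib/puppet/reports"                 # assumption: one subdir per host
    STATE_FILE = "/var/lib/puppet/report-checksums.json"   # hypothetical state file

    def error_checksum(report_path):
        """Hash the error/warning lines of one puppet report file."""
        wanted = []
        with open(report_path) as fh:
            for line in fh:
                text = line.strip()
                if "err:" in text or "warning:" in text:
                    wanted.append(text)
        return hashlib.sha1("\n".join(sorted(wanted)).encode("utf-8")).hexdigest()

    def notify(host, checksum):
        # Placeholder: the real script would push a passive result via func + nsca.
        print("errors changed on %s (checksum %s)" % (host, checksum))

    def main():
        state = {}
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as fh:
                state = json.load(fh)
        for host in sorted(os.listdir(REPORT_DIR)):
            host_dir = os.path.join(REPORT_DIR, host)
            reports = sorted(os.listdir(host_dir))
            if not reports:
                continue
            checksum = error_checksum(os.path.join(host_dir, reports[-1]))
            if state.get(host) != checksum:
                notify(host, checksum)        # new or changed error set
                state[host] = checksum
        with open(STATE_FILE, "w") as fh:
            json.dump(state, fh)

    if __name__ == "__main__":
        main()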
19:24:53 it'll be nice to trim out a lot of crap
19:24:58 something that someone could take on
19:25:01 if they wanted to
19:25:21 smooge: its iscsi is on the equalogic not netapp
19:25:22 smooge: we still would need an outage to sync over the / data, but yeah... I guess we can do it sooner if people think. It's not been super often it does tho
19:25:24 1. we need someone to add aliases to our nagios hosts to correspond with their real host name, their vpn hostname and others
19:25:55 2. it would be worthwhile, I think, to go through our puppet config and remove projects which are never coming back - ISTR seeing zabbix in there
19:26:13 3. I would love to have some volunteers for audits on systems and the pkgs/files they have installed
19:26:20 dgilmore, on bxen03 it is running from the netapp. It mounts other stuff from the equalogic via iscsi
19:26:20 sorry that took so long to get out
19:26:33 smooge: huh
19:26:48 skvidal: no worries.
19:26:48 i volunteer to do audits, if appropriate
19:26:49 smooge: there are 2 volumes on the equalogics units
19:26:55 /mnt/koji and lookaside cache
19:26:58 marchant: excellent!
19:27:21 i do not have appropriate access priv's though, more than likely
19:27:27 marchant: good - a lot of the auditing is going through sets of rpms on machines and assuming nothing belongs on there
19:27:29 dgilmore, all I am saying is on bxen03 doing an lvs shows that the virtual machine pkgs01 is on xenGuests and open which is from the netapp
19:27:45 marchant: for what I've just suggested for pkg auditing - you don't need any other privs than you have
19:28:05 skvidal, I really appreciate the work. it was needed
19:28:07 smooge: pkgs01 and nfs01 are the only things accessing the equalogics
19:28:08 that sounds like a good fit then
19:28:22 marchant: I can generate a list of every pkg on every system for you
19:28:31 smooge: or are we confusing things
19:28:33 marchant: and we can weed out the crazy
19:28:52 OK, tell me what you need and I will be happy to help
19:28:53 marchant: another important thing (to me) is going through machines to find places where machinetype01 doesn't match machinetype02
19:28:55 smooge: are you saying that pkgs raw disk is a netapp iscsi volume?
19:29:07 dgilmore, yes. sorry for not being clear
19:29:23 hey folks, I have to step away for a bit. would someone be able to take over running things? we have meeting tickets if anyone has any they want to address and open floor...
19:29:26 skvidal: do you mean as far as the installed packages
19:29:30 marchant: yes
19:29:31 dgilmore, so it should be very simple to get it to work on something else.
19:29:36 marchant: I'll email you with more
19:29:38 smooge: ah, that makes it easier. I see what you are saying.
19:29:44 skvidal: totally makes sense
19:29:54 skvidal: thanks
19:29:59 marchant: nod
19:30:16 smooge: so, outage would be minutes, not long...
19:30:25 yes.
19:30:36 just an idea - let's assume the outage involves rebuilding the box
19:30:36 :)
19:30:38 and could go back quickly if it didn't work at all
19:30:41 and plan for 2 hours
19:30:44 just b/c
19:31:00 skvidal, I always plan via the Scotty method.
19:31:06 skvidal: always good to plan for disaster, and deliver only minutes of downtime. ;)
19:31:14 smooge: I'm givin' her all I can capt.
19:31:21 * nirik has to go visit sick down now.
19:31:27 dog?
19:31:29 by
19:31:37 unless your dog is named 'down' which would be funny
19:31:42 'come, down'
19:31:46 'down, heel'
19:31:49 up, down
19:31:50 ha. dog, yes.
19:31:54 * nirik can't type
19:31:58 down down
19:32:12 if someone could take over that would be great.
19:32:15 * nirik leaves for a bit.
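For the package audits discussed above, the starting point is a per-host package list. A minimal sketch, assuming plain ssh access and two hypothetical hosts; the real run would more likely go through func across every machine:

    #!/usr/bin/env python
    # Collect per-host package lists and flag packages present on only one host --
    # the "machinetype01 doesn't match machinetype02" case mentioned above.
    # Hostnames are placeholders.
    import subprocess

    HOSTS = ["app01.example.org", "app02.example.org"]  # hypothetical hosts

    def package_set(host):
        """Return the set of installed package NVRs on a host via rpm -qa over ssh."""
        out = subprocess.check_output(["ssh", host, "rpm", "-qa"])
        return set(out.decode("utf-8").split())

    def main():
        listings = {host: package_set(host) for host in HOSTS}
        for pkg in sorted(listings[HOSTS[0]] - listings[HOSTS[1]]):
            print("only on %s: %s" % (HOSTS[0], pkg))
        for pkg in sorted(listings[HOSTS[1]] - listings[HOSTS[0]]):
            print("only on %s: %s" % (HOSTS[1], pkg))

    if __name__ == "__main__":
        main()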
19:32:16 ok skvidal what is your view of what to happen on puppet
19:32:24 'happen'?
19:32:26 skvidal: i think that once we declare beta gold
19:32:26 for the next week? And what can we help with
19:32:36 ill take a buildsys outage
19:32:39 do db03
19:32:44 and pkgs01
19:33:12 puppet is working as normal right now
19:33:25 I should be able to finish the script to handle the error reports today
19:33:26 if I do
19:33:30 I'll disable tagmail for everyone
19:33:38 and start sending out this mail instead
19:33:56 one thing to keep in mind
19:34:02 these error reports will come out half-hourly
19:34:17 I have not found a way to make them driven by when puppet runs on the host
19:34:38 only by the change of state in the reports
19:35:20 smooge: oh!
19:35:25 oh?
19:35:26 one thing I completely blanked on is bxen01
19:35:31 the box is reinstalled
19:35:38 and needs to be moved to the community network/rack
19:35:48 then I'll install the autoqa02/bastion## box over there
19:35:50 yes.. the box for the hardware move that we will someday do when I GET A FING NETWORK
19:35:56 ah
19:35:59 so I'm not holding anything up
19:36:01 fabulous
19:36:28 there are some other tasks to clean up our hosts which I wouldn't mind some other eyeballs on
19:36:32 if anyone is interested in them
19:36:35 some of them are DULL
19:36:44 but shouldn't take very long
19:36:57 I added them to our FI clean up 2011 list
19:37:01 https://fedoraproject.org/wiki/Infrastructure_Cleanup_Tasks_2011#Fix_all_the_things_that_we_have
19:37:03 at the bottom
19:37:35 no no.. just keeping me from reaching for the battle axe and woad
19:37:47 the cron entries in /var/spool/cron and /var/spool/mail
19:37:57 I know it's not interesting but imo we shouldn't have ANYTHING in either of those directories
19:38:03 and if we do it damn well better be known about in puppet
19:38:09 if it is not then SOMETHING IS WRONG
19:38:13 I am going to start on those on Monday. I hope I get a weekend off without systems having drama issues
19:38:40 no kidding
19:38:54 speaking of which... log02
19:39:15 if we can move to the next item?
19:39:31 yah
19:39:37 #topic logging
19:40:04 we had a bit of a scare on Saturday with log02's xen box dropping one of its disks and everything going into degraded mode.
19:40:09 s/log02/log01/
19:40:27 so I started working on a log02 and CodeBlock picked up what I broke and started cleaning things
19:40:56 I am hoping that after this meeting I will start an rsync of the data from one to the other
19:41:06 and we should be able to 'move' over to log02 next week
19:41:42 do we need to run over to log02?
19:41:56 is the box that log01 is on in that bad of shape?
19:42:06 skvidal, well xen10 goes out of service in early June.
19:42:21 so I want to get ahead of the curve
19:42:26 okay - I'm just wondering - can we take some time to make log02 'better'
19:42:30 and get it configured
19:42:39 before we start inundating it with data
19:42:41 sure..
19:43:00 I am not sure what better is so I guess we can talk about that and then do
19:43:12 1. I'd like to pursue epylog reports
19:43:33 2. I'd like to break our hosts out into report groups - so we can send log reports from the proxy servers to one group of people
19:43:36 (for example)
19:44:06 3. it might be nice to consider a real log-rotation/expiration policy that isn't 'keep everything forever'
19:44:29 both to protect ourselves from issues and to just limit the sheer size of things.
19:44:35 does 3 seem ridiculous to anyone?
19:44:51 skvidal: Not to me
19:44:54 hmmm pretty much every place I have ever worked has always found that in the end it comes down to "keep everything forever". So I just figured to assume that first :)
19:45:05 smooge: everywhere I have worked has had specific policies
19:45:09 with expiration timeouts
19:45:15 for CYA against lawsuits/subpoenas
19:45:32 yeah. we had policies and every time they came up to be deleted some law came in and we had to revise and keep things longer.
19:45:34 if you have a policy of 'we keep logs for 6 months and kill them' then a subpoena gets nothing
19:45:35 compress and archive to off-site storage only. that way, legal can get the logs if required.
19:45:50 fenrus02: and if you have an official policy you can hand that back to legal
19:45:56 PROVIDED YOU FOLLOW THE POLICY
19:46:06 point
19:46:18 "keep everything forever" is simply not sustainable.
19:46:42 smooge: we don't work for the gov't of any kind
19:46:48 we have no obligation to keep everything forever
19:47:06 skvidal, heck I was thinking of startups and .edu
19:47:28 so was I - and all the lawyers and infosec people said "have a policy and keep to it"
19:47:32 duke was militant about it
19:47:35 but we don't service anything medical
19:47:55 you know what
19:48:00 I don't want an argument about it
19:48:02 I'll work on epylog
19:48:07 if y'all decide if you want a policy
19:48:08 I am not arguing geez
19:48:08 great
19:48:09 the "have a policy" bit is the trick
19:48:39 if not, then provided we can continue to poop infinite disk space, fantastic
19:49:28 but before we deploy log02 it might be a good idea to know what reports we want and what reports we actually are GETTING and READING
19:49:38 do we know this? is it written up anywhere?
19:49:44 poop disk space?
19:50:06 * marchant that's amazing
19:50:37 marchant: we have a goose which instead of laying golden eggs
19:50:40 it poops diskspace
19:50:54 it's fabulous, actually, but sheesh cleaning the disk space before you use it is usually advised
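A sketch of what a retention pass could look like if a policy along the lines floated above were adopted: compress after 30 days, expire after roughly the "6 months and kill them" mark. The path and ages are placeholders, and any off-site archiving would have to happen before removal:

    #!/usr/bin/env python
    # Retention sweep for a central log tree: gzip older files, drop expired ones.
    # LOG_ROOT and the age thresholds are assumptions, not an agreed policy.
    import gzip
    import os
    import shutil
    import time

    LOG_ROOT = "/var/log/hosts"          # hypothetical per-host log tree on log02
    COMPRESS_AFTER = 30 * 86400          # seconds
    EXPIRE_AFTER = 180 * 86400           # roughly six months

    def main():
        now = time.time()
        for dirpath, _dirnames, filenames in os.walk(LOG_ROOT):
            for name in filenames:
                path = os.path.join(dirpath, name)
                age = now - os.path.getmtime(path)
                if age > EXPIRE_AFTER:
                    os.remove(path)                      # past the retention window
                elif age > COMPRESS_AFTER and not name.endswith(".gz"):
                    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
                        shutil.copyfileobj(src, dst)
                    os.remove(path)                      # keep only the compressed copy

    if __name__ == "__main__":
        main()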
19:51:14 smooge: what's next?
19:52:02 https://fedorahosted.org/fedora-infrastructure/query?status=new&status=assigned&status=reopened&group=milestone&keywords=~Meeting&order=priority
19:53:22 so we've covered a number of these in the earlier conversations
19:53:37 which is good
19:53:47 2591 is curious - what's our status on blogs?
19:55:31 i put a list together of what i found on the usage of blogs
19:55:42 i attached it to the ticket.
19:56:11 nod - so it's a larger number than I expected
19:56:20 a lot larger number
19:56:21 do we have any concern/reticence with moving them?
19:56:31 nod
19:56:35 it is a lot more
19:57:40 ugh I am not sure.
19:58:31 a lot of them don't seem to be updated.. which is normal for blogs.. but it's a lot to archive/move OR deal with otherwise
19:58:56 Has the blog infrastructure been difficult to manage in general?
19:59:09 jsmith: yes
19:59:17 wordpress requires pretty much constant attention
19:59:28 it is about the same case as our transifex issue
19:59:43 * jsmith doesn't know of a web application that doesn't require a lot of attention, unfortunately
20:00:00 it is definitely behind the curve.
20:00:26 Is it just a matter of updating to the latest wordpress package, or are there other items that keep you up at night?
20:01:06 well updating it is part of the problem
20:01:18 we are i believe running on stuff that is dead software now
20:01:36 jsmith: the problem is that wordpress is frequently vulnerable
20:01:41 and updating it is going to be an act of faith and a large focus
20:01:45 jsmith: and the -mu we're using is a dead branch
20:02:02 Right... the -mu functionality got rolled into the main branch
20:02:30 jsmith: the issue in my mind is that the time we spend babysitting this is expensive
20:02:30 I'm pretty good friends with some of the automattic folks -- is it worth asking if they'd be interested in helping with a migration?
20:02:37 Or is it just not worth the effort?
20:02:43 I dunno
20:02:46 the questions I'd ask are
20:02:59 For me, I see it as a valuable resource for the community, but don't have any way to gauge the sysadmin drain it causes
20:02:59 1. why do we have 100 blogs hosted?
20:03:10 2. what do we get out of having it run locally?
20:03:37 3. how many hours have been spent on it since it was installed
20:03:44 #3 might be a question for ricky
20:03:49 since I know he's done a lot of work on it
20:04:18 4. who do we go to fix it when it's not working.
20:04:35 which usually ends up being ricky or nb (I think)
20:05:43 nod
20:06:48 i'd love to see us switch to a flat-file git based blog system, like blog-o-file or pyblosxom :)
20:07:17 lmacken: pyblosxom --- haven't heard that in years
20:07:29 * nirik returns, reads up
20:07:31 lmacken: is it still being maintained?
20:07:33 skvidal: I had lunch with the creator of it at PyCon :) then worked with him to upgrade lewk.org to the latest code
20:07:37 skvidal: yup
20:07:42 lmacken: wow, that's cool
20:07:48 lmacken: I used it a looooooong time ago
20:07:52 yeah, he's based out of Boston and hacks on Miro
20:07:57 lmacken: I liked editing in my editor and rsyncing the files up :)
20:08:07 lmacken: (this was all pre-git)
20:08:25 yeah, I do like the concept of blog-o-file too... git push hooks that compile your entries down with Mako templates, so your blog is purely static files
20:08:44 lmacken: makes comments systems harder
20:08:50 sorry, this is off-subject a bit
20:08:52 yeah, most people just throw disqus in there
20:09:07 sorry for de-railing :) I hate wordpress.
20:09:26 ok, we are over an hour too. ;) should we open floor/close out?
20:09:43 * sijis sorry. had to step away
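A toy illustration of the flat-file idea lmacken describes: entries kept as plain files in git and compiled to static HTML with a Mako template, e.g. from a post-receive hook. The directory layout and template are invented for the example and are not Blogofile or pyblosxom themselves:

    #!/usr/bin/env python
    # Compile plain-text entries to static HTML pages with a Mako template.
    # Layout and template are made up for this sketch.
    import os

    from mako.template import Template

    ENTRY_DIR = "entries"     # one plain-text file per post, first line is the title
    OUTPUT_DIR = "html"       # static output, ready to rsync to a web server

    PAGE = Template("""<html>
      <head><title>${title}</title></head>
      <body><h1>${title}</h1><pre>${body}</pre></body>
    </html>""")

    def main():
        if not os.path.isdir(OUTPUT_DIR):
            os.makedirs(OUTPUT_DIR)
        for name in sorted(os.listdir(ENTRY_DIR)):
            with open(os.path.join(ENTRY_DIR, name)) as fh:
                title = fh.readline().strip()
                body = fh.read()
            out = os.path.join(OUTPUT_DIR, os.path.splitext(name)[0] + ".html")
            with open(out, "w") as fh:
                fh.write(PAGE.render(title=title, body=body))

    if __name__ == "__main__":
        main()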
20:10:54 #topic Open Floor
20:11:03 anything for open floor?
20:11:28 uhmmm... not in official space, just wanted to say hi to everyone and apologize for my tardiness :)
20:11:34 hi maxamillion
20:11:36 $dayjob got in the way of the meeting
20:12:00 no problem. welcome maxamillion
20:12:19 I'm currently trying to learn my way around the infrastructure layout a bit and then I'll be taking on a couple of the line items from https://fedoraproject.org/wiki/Infrastructure_Cleanup_Tasks_2011 ... time and perms permitting anyways
20:12:31 excellent.
20:12:56 I also have no idea what all I have permissions to ssh into (if anything at all)
20:13:08 we can sort that out...
20:13:10 I assume that's handled by FAS groups
20:13:13 nirik: ok, sounds good
20:13:15 yep.
20:13:39 you have permission to log into any host you are allowed to. all systems you aren't will initiate a dd if=/dev/zero of=/dev/sda
20:13:46 of your system
20:13:54 smooge: lol
20:14:14 sounds like fun
20:14:17 ok, thanks for coming everyone...
20:14:52 #endmeeting