19:00:00 #startmeeting Infrastructure (2011-08-11)
19:00:00 Meeting started Thu Aug 11 19:00:00 2011 UTC. The chair is nirik. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:00 Useful Commands: #action #agreed #halp #info #idea #link #topic.
19:00:01 #meetingname infrastructure
19:00:01 The meeting name has been set to 'infrastructure'
19:00:01 #topic Robot Roll Call
19:00:01 #chair smooge skvidal codeblock ricky nirik abadger1999
19:00:01 Current chairs: abadger1999 codeblock nirik ricky skvidal smooge
19:00:16 here
19:00:22 * dgilmore is here
19:00:23 Crowbot here
19:00:32 * athmane is here
19:00:42 hola
19:00:52 welcome everyone.
19:01:06 #topic New folks introductions and apprentice tasks/feedback
19:01:28 Any new folks like to introduce themselves? or any apprentices like to talk about tickets or other issues?
19:01:58 * pingou around
19:02:13 I did clean out the apprentice group the other day. A few more folks let me know they should be around more at some point... at which time we can re-add them.
19:02:43 Does everyone think the apprentice program is helpful?
19:02:49 * skvidal is here
19:03:00 The feedback I have gotten is that it is, but I'm interested in other thoughts too.
19:03:03 morning skvidal
19:03:16 nirik: I was an apprentice :)
19:03:19 feedback on what? (sorry for being late)
19:03:24 Does everyone think the apprentice program is helpful?
19:03:34 athmane: did you find the apprentice group useful?
19:04:10 nirik: sure, having access to puppet is very useful
19:04:50 yeah. I think it might help some, but we still have the issue of getting people over the 'where do I contribute' thing. ;) Oh well, we can keep trying and adjust as we come up with more ideas.
19:05:18 * nirik will move on in a sec then.
19:05:47 #topic F16 Alpha Freeze reminder and tickets
19:06:01 reminder that we are still pre-release frozen for f16 alpha.
19:06:04 I'm here
19:06:12 and since it's slipped a week, we are frozen for an extra week.
19:06:16 morning LoKoMurdoK.
19:06:27 hi nirik
19:07:06 looks like we have the tickets all assigned (thanks smooge!)
19:07:33 #topic Upcoming Tasks/Items
19:07:42 I will be working on them this week so we are ready for Alpha
19:07:45 Any upcoming tasks or items folks would like to talk about?
19:07:54 thanks smooge
19:08:21 I'll shout out a few tasks that anyone can work on (especially during a freeze):
19:08:27 smooge: let me know if I can help at all, finally starting to have a bit more time again, seems like a good place to get back into the game
19:08:43 .infra 2906
19:09:22 .ticket 2906
19:09:26 nirik: #2906 (Migrate SOP documents to infra-docs git repo) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2906
19:10:26 and any of the easyfix ones:
19:10:31 https://fedorahosted.org/fedora-infrastructure/report/14
19:10:36 is my ticket
19:10:59 LoKoMurdoK: yeah, you mentioned you wanted to work on that. ;) If we can assist or help you get started, please let us know.
19:11:32 we can discuss in #fedora-admin after meeting too.
19:11:54 I'll also note that our new machines have been racked in phx2... so we will be installing new machines hopefully before too long.
19:11:58 the 2 weeks had problems in the office, very busy
19:12:10 Then we will look at migrating some old instances to new instances on those new machines.
19:12:15 LoKoMurdoK: no problems.
19:12:17 but I'm free
19:12:49 I can now migrate the infra-docs
19:13:03 ok
19:13:05 abadger1999: is everything looking ok for rhel6 app servers? (this would be a post freeze item)
19:13:09 LoKoMurdoK: great.
19:13:29 nirik: I think that fedoracommunity is still a problem
19:13:40 nirik: but everything else looks okay.
19:13:54 lmacken: Is that accurate? Or do you have fixes for that?
19:14:01 (fedoracommunity on rhel6)
19:14:12 * nirik recalls someone working on it the other day, not sure if for rhel6 specifically tho.
19:14:38 yeah, comphappy was working on fedoracommunity... I think it was general bugfixing, though.
19:14:44 ok.
19:14:54 at least, the questions he asked me were general bugfixing :-)
19:15:22 nirik: I'd love for us to migrate to rhel6... means half the work for apps when testing.
19:15:30 (vs mixed rhel5 and rhel6)
19:15:37 I posted proposed post freeze changes to log02 as well to the list. If everyone could review and tell me if they have any problems/reservations with those changes that would be good.
19:15:45 abadger1999: yeah, agreed.
19:16:33 * nirik looks over his list for any other upcoming things to discuss.
19:17:14 Oh, we have 3 items (that I know of) in Request for resource process: ask, paste and nitrate. All are moving forward as they can...
19:17:37 * skvidal has something for openfloor when we have time
19:17:43 s/when/if/
19:18:05 ok, should get there soonish. ;)
19:18:21 abadger1999: I'm working on getting a new moksha & fcomm release prepped (including comphappy's patches), then I'll be able to test it on RHEL6.
19:18:36 lmacken: cool. ;)
19:18:55 lmacken: k any time estimate? Think we'll be able to deploy rhel6 app servers after alpha freeze?
19:19:17 * nirik notes the current freeze ends the 23rd unless we slip again.
19:19:35 then we have 3 weeks until beta freeze.
19:20:32 abadger1999: I'll try and shoot for after the alpha freeze
19:20:57 I can try and have a rhel6 app server instance prepped...
19:21:03 lmacken: Excellent :-)
19:21:35 well, we have app02.stg for testing, but I mean for adding into prod mix. ;)
19:22:02 For converting all app servers... are we waiting on more hardware?
19:22:29 waiting on re-installing hosts as rhel6 and freeze to be over.
19:22:45 we have new hardware racked now, we will be setting that up in the next few weeks.
19:23:08 Cool. So we control our destiny ;-)
19:23:24 yeah, sadly we don't control 48 hour days. :)
19:23:36 yum install tardis
19:23:39 there we go
19:23:52 "just yum it"©
19:24:01 I also posted a plan for hosted... more feedback on that would be great. It sounds like no one minded my short term plans at least.
19:24:43 are we still looking at a complete stg purge and build?
19:24:54 I will look at the hosted after the meeting.
19:25:14 yeah, staging is still needing dealing with.
19:25:29 by the way which fedoracommunity were we talking about above?
19:25:35 I'm not sure the best way forward on it... perhaps I will try and send out a plan on that too for people to poke holes in.
19:25:59 smooge: this one: https://admin.fedoraproject.org/community/
19:26:07 ok just wanted to confirm
19:26:08 thanks
19:26:49 ok, anything else upcoming? or shall we move to open floor?
19:27:13 retrace build
19:27:21 oh yeah, thats another new machine...
19:27:24 that was my only focus after I dug out of email hell
19:27:44 that would be a nice one to get going. I assume we just install it, and then hand it off to retrace folks?
19:28:23 yeah.. I am just going to have it set up minimally so it can be updated/rebuilt quickly and then have them tell us what sections to back up.
19:28:32 once they are ready to cut over we change dns and recommission our retrace01...
19:28:43 after that its getting it a public ip address/port and poof they can actually run Fedora 15/16 dumps
19:29:01 smooge: you might ask them if they want config management on it. I still need to setup a qa bcfg2 instance, but it could go in that if they wanted.
19:30:20 ok, on to open floor...
19:30:26 #topic Open Floor
19:30:33 I've been working on the epylog reports some more
19:30:49 for those people receiving them - I'd like to know what you find useful and not useful
19:31:00 and what things you'd like to see change
19:31:40 * skvidal hears the echoes
19:31:53 skvidal: so, one question... who should we allow to be in sysadmin-logs? apprentice folks? or sysadmin group people only? or just based on our judgement...
19:32:09 I don't think it is any different than who can login to log02, is it?
19:32:15 right.
19:32:19 so
19:32:45 so, if I change still so apprentice can look at logs, they should probably be allowed in sysadmin-logs. Or for that matter we could just change log02 to allow sysadmin-logs
19:33:03 indeed
19:33:13 so logs are only readable by root, so you need sudo there
19:33:14 anyhow, side topic.
19:33:21 s/so/some/
19:33:26 athmane: no
19:33:29 athmane: nope.
19:33:34 athmane: our central log server allows log reading
19:33:37 on its merged logs
19:33:50 and the non merged ones are currently open to 'sysadmin'
19:33:52 skvidal, I find them useful. I need a page so I can see old logs
19:34:02 but only main and noc can login there.
19:34:07 smooge: so you want an index of some kind?
19:34:37 yeah. because I see a report.. then I have to go searching through email for the link to the previous one etc
19:34:41 * nirik finds the logs very useful. We have fixed a number of issues and noted problems or things we should investigate from them. It's very nice to be proactive sometimes.
19:34:51 what are folks' thoughts on auth for log access via the webserver... I'll admit to not LOVING the idea of adding the mod_auth_pg to those boxes
19:35:43 is there a way for us to require the auth from admin.fp.o BEFORE it redirects to log02 for the actual content?
19:35:44 well how do we use it to get to other parts like nagios, collectd and such
19:35:58 skvidal: i dont like it
19:36:02 but thats just me
19:36:06 dgilmore: which part?
19:36:12 dgilmore: the log report in general?
19:36:37 dgilmore: or the mod_auth_pg thing?
19:36:38 collectd is all open I think...
19:36:45 but nagios uses the auth_pg thing.
19:36:55 indeed
19:36:57 nirik: and it is installed on the noc server isn't it?
19:37:04 skvidal: yep
19:37:05 skvidal: mod_auth_pg
19:37:08 nirik: not at the admin.fp. level
19:37:14 apache and haproxy status are open too
19:37:15 skvidal: and accessing logs via browser
19:37:18 skvidal: right.
19:37:32 dgilmore: well we're not talking about accessing the logs by browser - just the log report
19:37:40 we could do ssh tunnels, but thats a bit ugly. ;)
19:38:06 skvidal: can we do it using lynx on log02?
19:38:27 dgilmore: of course you can - but I think lynx isn't installed but elinks is
19:38:36 skvidal: that works
19:38:40 lets just do that
19:38:44 the log report is just prettier in a real web browser
19:38:55 and it is much more convenient to access it from your mail that way
19:39:50 brb
19:40:37 so what are the issues with mod_auth_pg? just more deps?
19:40:59 nirik: more deps - and more tightly coupled to the db server
19:41:01 so...
19:41:07 if we need to look at the logs/log reports
19:41:07 yeah, if db is down... right.
19:41:11 and the db server is down
19:41:12 right
19:41:24 there is that OAUTH plugin... but it's not packaged...
19:41:26 if there was a sane way to do mod_auth_pam against nss_db
19:41:30 I'd be all over it
19:41:47 this is one of those cases where one-off service passwords would be wonderful
19:42:41 skvidal: yubikey auth?
19:42:50 huh... as an alternative... how about htaccess with a shared password. You get that when you join sysadmin-log? another password and could be grabbed around...
19:43:00 dgilmore: still requires the db I think...
19:43:22 nirik: it requires a db
19:43:39 in that you need the yubikey validation server up and running
19:43:43 yeah.
19:44:52 anyhow, I guess lets ponder on it more and try and come up with the least annoying alternative? ;)
19:46:35 I figure an .htaccess pushed from puppet is not ok?
19:47:35 smooge: well, we could put the pass in private... the issues would be: password could get leaked by someone and no easy way to account who was reading them (except possibly by IP).
19:47:45 smooge: and we'd have to keep updating it
19:47:52 nirik: oh - nm
19:47:53 sorry
19:47:55 I just caught up
19:47:56 shared pw
19:48:00 not individual ones
19:48:12 hahah, I know what we could do
19:48:21 I could have the epylog mail send out the password ;)
19:48:23 yeah, I was talking shared.
19:48:32 * skvidal kids - that's no better than how we are now
19:48:35 oh I figured it would just be like the password files. puppet runs on lockbox, gets the passwords out of the db, builds an htaccess and pushes that out
19:48:52 I suppose it could be a per user pass in there too.
19:49:10 actually lockbox wouldn't do it.. log02 would
19:49:11 smooge: oh - so you mean puppet generates an htpasswd file with all sysadmin-log members in it?
19:49:12 duh
19:49:22 skvidal, yeah.
19:49:23 hmm
19:49:26 or a python script
19:49:34 can apache deal with the passwd crypt
19:49:35 that is cron run or something
19:49:38 we have in the nssdb files?
19:49:54 smooge: I like it, in principle...
19:50:03 skvidal: possibly.
19:50:25 reality will probably intrude and say no
19:50:29 nirik: we run this risk of having these files more available so people could compromise them
19:50:31 dump crypted pass from nssdb -> htaccess file -> bobs your uncle.
19:50:36 ?
19:50:44 nirik: well the files have to be readable by apache
19:50:52 which makes them slightly more vulnerable than shadow nss_db
19:50:56 true.
19:51:08 * skvidal hmms
19:51:23 but anyone can get crypted pass from fas
19:51:26 or wait.
19:51:29 no they can't.
19:52:03 no, they can't
19:52:15 getent won't do what you want
19:52:23 it's not like nis
19:52:38 and you'd need the fas host password to grab it from fas
19:52:44 I was thinking of someone remotely running fasClient, but it doesn't allow non priv accounts that info
19:53:19 hey how about we add a field into fas that keeps that password?
19:53:31 it should be all of what 5 minutes of abadger1999's time :)
19:53:47 smooge: umm......
19:53:57 anyway.. I think I have derailed this
19:54:02 smooge: if we were going to do it - I'd say we add an arbitrary service password field
19:54:16 and then make it so you can pull down that service password
19:54:22 well yes. that was what I meant
19:54:23 and that will not be a 5m project
19:54:27 * nirik wonders if there's a clever way to have apache auth with ssh keys. ;)
19:54:30 I figured that also
19:54:36 and to be fair to abadger1999 I doubt he has the 5m to spare
19:54:53 hehe
19:55:10 abadger1999: if I'm wrong about that then GO GET TO WORK!
19:55:10 :)
19:55:59 okay
19:56:04 anyhow, I think we keep thinking...
19:56:08 now
19:56:12 about the log reports
19:56:16 is there any CONTENT in the log report
19:56:20 that folks like or don't like
19:56:23 or would like to see
19:56:44 I have been thinking about how we could make apache logs happen and it is just not frelling obvious to me
19:56:53 it will involve, I suspect, a lot of work
19:57:20 yeah, on that, could we switch them to syslog? or is that too much work/load/network problem?
19:57:38 * nirik also wondered about getting audit setup to log to a central place.
19:57:41 skvidal: Okay, there's currently a config table in fas that you can access via json.
19:57:42 nirik: given our web log volume I'm worried how it would beat up log02
19:57:49 That allows you to keep arbitrary data in fas.
19:57:57 abadger1999: orly?
19:58:01 asterisk for instance was kept there.
19:58:10 abadger1999: huh
19:58:24 skvidal: yeah, would need some testing.
19:58:28 Let me see if I can find that.
19:58:38 nirik, too much load when I tried apache logs a year or so ago
19:58:48 I've also wondered about having a log03 somewhere. backs up log02 read-only in case something happens to it.
19:58:56 smooge: ok.
19:59:03 skvidal: https://fedorahosted.org/releases/p/y/python-fedora/doc/existing.html#fedora.client.AccountSystem.get_config
19:59:08 and get_config_like()
19:59:13 skvidal, I would go with a separate epylog report for apache logs..
19:59:15 get_configs_like()
19:59:25 smooge: oh of course
19:59:32 smooge: I wouldn't want it mixed with the other logs at all
19:59:40 smooge: but I mean just handling the data at all
20:00:13 * smooge is confused then. I thought epylog dealt with it per domain
20:00:20 or did I misremember
20:00:26 epylog works on a merged log of all hosts.
20:00:44 we could merge the apache logs too and run on them, but I bet it's going to be frigging gigantic.
20:00:49 smooge: epylog doesn't deal with apache logs at all right now
20:00:53 smooge: it has no module to handle them
20:01:00 ah I thought for apache it did something different. I must be thinking of splunk or something
20:01:02 nirik: no betting about it
20:01:04 yeah, so something else operating on them might be good.
20:01:17 the other issue is this
20:01:26 epylog is more about reporting and pointing up issues
20:01:26 we do have awstats, but it's just hits and such, not errors.
20:01:35 apache logs often are about stat reporting
20:01:39 less about error reporting
20:01:44 nirik: 404 errors are in iirc
20:01:45 and I _think_ we want to know more about errors
20:01:45 * nirik nods.
20:01:52 however, there are errors.
20:02:04 which currently we ignore. ;)
20:02:09 apache errorlogs are an abomination
20:02:10 that's where I would like to be
20:02:13 to be fair
20:02:15 I would love it
20:02:17 pingou: yeah, but not tracebacks and such.
20:02:23 if we could get just apache error logs and app traces
20:02:25 via syslog
20:02:32 which should, in theory, be much less
20:02:39 oooooh
20:02:42 hmmm
20:03:10 now we could modify our apache configs
20:03:21 sadly, I think it's the bulk of apache logs.
20:03:23 to have the error logs be spat out via syslog to a local# facility
20:03:31 nirik: you think most of our apache logs are errors?
20:03:37 nirik: that feels..... bad
20:03:45 -rw-r--r-- 1 root root 2320563738 Aug 11 20:03 error_log
20:03:57 how OLD is that?
20:04:02 head -1 error_log
20:04:24 [Thu Aug 11 04:02:22 2011] [error] /usr/sbin/pkgdb.wsgi:19: DeprecationWarning: fedora.tg.util is deprecated. Switch to one of these instead: TG1 apps: fedora.tg.tg1utils TG2 apps: fedora.tg.tg2utils. This file will disappear in 0.4
20:04:41 1 day
20:04:42 wait
20:04:44 no way
20:04:49 what the hell?
20:04:59 way. ;)
20:05:00 it's 2.2GB in ONE DAY?!
20:05:03 app01
20:05:17 I think all our monitoring is showing up in there.
20:05:28 [Thu Aug 11 20:05:17 2011] [error] - - "GET /pkgdb/collections/ HTTP/1.0" 200 22395 "" ""
20:05:43 wow
20:05:47 or perhaps pkgdb is just doing everything to error. ;)
20:05:49 like 5 logs a second
20:05:53 christ on a crutch
20:05:54 wow that has gotten bad
20:06:05 I don't think it was that bad last fall.
20:06:08 right. So, we need to clean this up some before we can even look at syslog.
20:06:15 nirik: no kidding
20:06:20 [Thu Aug 11 20:05:51 2011] [error] - - "GET /voting/ HTTP/1.0" 200 14378 "" ""
20:06:23 that's our monitoring
20:06:30 also
20:06:38 why is a 200 from that an 'error'?
20:06:45 no idea.
20:07:01 well it could be buried in a line below that
20:07:15 okay
20:07:17 you know what
20:07:24 I'll take a good long look at that in the coming week
20:07:30 skvidal: thanks.
20:07:35 #action skvidal to curse at apache error logs
20:07:41 any other items? or shall we close out the meeting?
20:07:55 not me I think I broke things enough
20:08:01 wow
20:08:02 I'm wrong
20:08:07 101 error log lines per second
20:08:23 on app01
20:08:24 That's something we need lmacken to look into -- it's the curse of the TG1 logging system that we've never figured out completely.
20:08:24 ONLY
20:08:29 yeah, app01 alerted the other night because / was more than 85% full. ;(
20:08:51 I'm going to see if I can do the user outlier log report today and tomorrow
20:08:55 and then get to the apache logs
20:08:57 and be angry
20:09:21 sed -i -e "s|200|d" error_log ? :d
20:09:28 We either seem to not get tracebacks in the logs or way too much information or both.
20:10:20 ok, thanks for coming everyone!
20:10:24 abadger1999: yeah, our TG1 logging setup is inconsistent, and awful.
20:10:37 abadger1999: todo: check the tg2 version
20:10:52 * nirik waits a min more since we have more info. ;)
20:11:20 pingou: is voting changed to tg2? or not in prod yet?
20:11:34 nirik: afaik not yet
20:11:42 ok.
20:11:54 ok, will close out then....
20:11:58 #endmeeting
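
Follow-up note on the error_log discussion (20:03:45 onward): the questions raised there were how many lines per second app01 writes to error_log, and which successful (200) requests are being logged at [error] level. A minimal sketch of how that could be measured is below. It assumes the line format shown in the excerpts pasted during the meeting; the function name `summarize` and the two regular expressions are illustrative and are not part of epylog or any existing Fedora Infrastructure tooling. The timestamp parsing also assumes an English/C locale.

```python
import re
from collections import Counter
from datetime import datetime

# Prefix of the error_log lines quoted in the meeting, e.g.
# "[Thu Aug 11 20:05:17 2011] [error] ..."
PREFIX = re.compile(r'^\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] (?P<msg>.*)$')

# Access-log-style request embedded in the message, e.g.
# '- - "GET /voting/ HTTP/1.0" 200 14378 "" ""'
REQUEST = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def summarize(lines):
    """Return line count, average lines/second, and 200s logged as errors."""
    per_second = Counter()   # timestamp -> number of lines in that second
    ok_paths = Counter()     # request path -> count of 200s at [error] level
    for line in lines:
        m = PREFIX.match(line)
        if not m:
            continue  # continuation lines (tracebacks etc.) have no prefix
        ts = datetime.strptime(m.group('ts'), '%a %b %d %H:%M:%S %Y')
        per_second[ts] += 1
        req = REQUEST.search(m.group('msg'))
        if req and m.group('level') == 'error' and req.group('status') == '200':
            ok_paths[req.group('path')] += 1
    total = sum(per_second.values())
    if per_second:
        # Inclusive span from the first to the last second seen.
        span = (max(per_second) - min(per_second)).total_seconds() + 1
        rate = total / span
    else:
        rate = 0.0
    return {'lines': total,
            'lines_per_second': rate,
            'ok_logged_as_error': ok_paths}
```

Something like `with open('error_log', errors='replace') as f: print(summarize(f))` would run it over a real file. The per-path counter is meant to show whether the noise comes mostly from monitoring checks such as /voting/ and /pkgdb/collections/, as suspected in the meeting, before any move to syslog is considered.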