18:59:59 #startmeeting Infrastructure (2013-03-07)
18:59:59 Meeting started Thu Mar  7 18:59:59 2013 UTC. The chair is nirik. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:59:59 Useful Commands: #action #agreed #halp #info #idea #link #topic.
18:59:59 #meetingname infrastructure
18:59:59 The meeting name has been set to 'infrastructure'
18:59:59 #topic welcome everyone
18:59:59 #chair smooge skvidal CodeBlock ricky nirik abadger1999 lmacken dgilmore mdomsch threebean
19:00:00 Current chairs: CodeBlock abadger1999 dgilmore lmacken mdomsch nirik ricky skvidal smooge threebean
19:00:21 who all is around for an infrastructure meeting?
19:00:23 * lmacken
19:00:35 * cyberworm54 here
19:00:37 * abadger1999 here
19:00:40 here
19:00:41 * maayke here
19:00:48 KasumiNinja here
19:01:11 * threebean is here
19:01:24 * nirik will wait another min for folks to wander in.
19:02:03 * pingou
19:02:22 ok, I guess we can go ahead and dive in. ;)
19:02:28 #topic New folks introductions and Apprentice tasks
19:02:42 any new folks want to introduce themselves? or apprentices with questions?
19:02:50 me
19:02:52 I'm new
19:03:06 here
19:03:12 * SmootherFrOgZ is here
19:03:21 hola
19:03:27 I work as a sysadmin and would like to help Fedora with sysadmin tasks
19:03:34 blackdeerranger / KasumiNinja: welcome. ;)
19:03:39 :)
19:03:47 :-)
19:04:03 blackdeerranger: are you also interested in the sysadmin side of things? or more application development?
19:04:07 I have no programming experience
19:04:21 I typed my introduction here: http://paste.fedoraproject.org/4574/80565136/
19:04:45 I'm interested in sysadmin tasks
19:04:59 one thing I'd like to mention here... I sent out my regular fi-apprentice feedback email and got some email back that highlighted how self-driven we require people to be...
19:05:02 I'm not a software developer
19:05:37 cool.
19:06:13 http://paste.fedoraproject.org/4574/80565136/
19:06:19 oops sorry :)
19:06:22 I'm wondering if we want to try and change that focus any, or if there are ways to better communicate it to new folks so they don't get confused when no one is assigning them specific tasks and asking for updates all the time.
19:07:16 perhaps I will start a discussion on the list around that, just wanted to mention it here.
19:07:18 As I understand it, joining "fi-apprentice" is a good starting point
19:07:22 nirik: I will send you email tomorrow :)
19:07:28 blackdeerranger: yeah.
19:07:33 good
19:08:09 the thing is that we are not set up to do heavy mentoring... so joining usually means the new person has to be very focused on just going and doing things and bugging people.
19:08:22 works fine for me
19:08:25 * samkottler is just getting here
19:09:00 something worth mentioning
19:09:06 anyhow, will post a thread to the list about it. ;) just brainstorming.
19:09:08 given where fedora infra is going with config mgmt and c&c
19:09:15 anyone interested in the sysadmin side of things
19:09:27 would be best served by going to ansible.cc
19:09:35 and trying it out for themselves to be familiar
19:09:44 absolutely. ;)
19:09:54 we have lots of examples at: http://infrastructure.fedoraproject.org/cgit/ansible.git/tree/
19:10:08 but our examples are going to increasingly get more complicated, I suspect :)
19:10:28 they always do.
19:10:29 so KasumiNinja, blackdeerranger: probably worth familiarizing yourself with this tool
19:10:55 sounds interesting - thanks
19:11:09 #info new folks would be advised to look into ansible.
19:11:16 #topic Applications status / discussion
19:11:29 great I'll look into it
19:11:29 any exciting application side news this week or upcoming?
19:11:38 * pingou has mainly be bothering threebean
19:11:42 #info fas-openid is now in production. Thanks puiterwijk. :)
19:11:43 been*
19:11:51 KasumiNinja, blackdeerranger: great
19:11:53 \ó/ well done puiterwijk
19:12:06 we *just* got our first fedmsg message out of the secondary arch compose process
19:12:10 Three cheers for puiterwijk :-)
19:12:16 puiterwijk++
19:12:30 #info fedmsg starting to be enabled for secondary arch compose processes.
19:12:48 puiterwijk and I also discussed oauth at some length. I'm organizing that into a message to send to the list.
19:12:57 I announced fedmsg-notify today, so we'll hopefully see a lot more consumers
19:13:08 An oauth server is what I think his next project is going to be.
19:13:19 I also fixed some long-standing push ordering bugs in the bodhi masher last week
19:13:36 lmacken: yeah. Great post. ;)
19:13:41 thanks ☺
19:13:48 * Adran is here (late)
19:14:03 #info fedmsg-notify blog post out there, hopefully more consumers
19:14:07 nice lmacken
19:14:12 lmacken: do we have any way to tell how many consumers there are?
19:14:18 https://admin.fedoraproject.org/haproxy/proxy01
19:14:19 and 02
19:14:19 #link http://lewk.org/blog/fedmsg-notify.html
19:14:28 fedmsg-raw-zmq-outbound
19:14:29 cool.
19:14:48 lmacken: where is your blog plugin?
19:14:56 pingou: hrm?
19:15:02 abadger1999: related. Have we given any thought to further expanding our 2-factor stuff to web apps or other places? Or have we not really explored that yet?
19:15:07 mdomsch: Question -- is python-GeoIP still needed for mirrormanager? I noticed that it was orphaned in Fedora/EPEL.
19:15:11 lmacken: I saw your twitter feed on your blog and I went... :)
19:15:45 nirik: yes -- SmootherFrOgZ has been working up a patch to do 2-factor login to fas and support in python-fedora to take advantage of that.
19:15:53 abadger1999: yep - largely used. I'll have to go adopt it
19:15:58 nirik: I think that'll give us the first step.
19:16:28 abadger1999: ah ha. I was confused as to what that work was about. Good.
19:16:38 mdomsch: Cool. If you need help, feel free to add me as a comaintainer and I'll put it on my once-a-cycle, look-for-updates list.
19:16:43 note that also fas-openid has that PAPE plugin we could use for openid-consuming applications.
19:16:49 nirik: His work will be a first step.
19:17:03 There's lots of other things that need to happen before it's "real"
19:17:21 ie: at first we'll have both the single-factor and 2-factor login.
19:17:28 sure, yeah
19:17:36 so you could circumvent it by simply going to the other login page.
19:17:58 But first we get it working, then we make it mandatory for some people.
19:18:08 * nirik nods.
19:18:16 So -- continuing to make progress :-)
19:18:20 not sure if it was talked about last week, but we hit yet another issue with our tg1 apps where they would just silently block all requests. It happened after some dns change, but hopefully this change will fix it in the future. https://github.com/fedora-infra/python-fedora/pull/18
19:18:20 also, that would only be yubikey?
19:18:49 nirik: yeah -- since fas only knows about yubikey at this point.
19:18:54 lmacken: ah yes, thanks for finding that. ;) We should try and get that fix rolled out to production...
19:18:58 It's another thing we'll have to do to get this good to go.
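For readers following up on the fedmsg-notify announcement above, a minimal sketch of a fedmsg consumer, assuming the fedmsg library and a standard /etc/fedmsg.d/ configuration are in place; the topic filter is illustrative, not a specific production topic.

```python
# Minimal sketch of listening to the fedmsg bus (assumes the fedmsg package
# and a working /etc/fedmsg.d/ configuration; the "compose" filter below is
# only an example).
import fedmsg

# tail_messages() yields (name, endpoint, topic, message) tuples as messages
# arrive on the bus.
for name, endpoint, topic, msg in fedmsg.tail_messages():
    if "compose" in topic:  # e.g. watch compose-related messages
        print(topic, msg.get("msg", {}))
```

A consumer like this is roughly what fedmsg-notify does behind the desktop notifications mentioned in the blog post linked above.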
19:19:13 abadger1999: I might poke you later, I'd be interested in seeing if Google Authenticator could work, maybe I can work on something.
19:19:44 Adran: cool. It should. We just have to integrate it more tightly with fas (we have basically a separate google authenticator setup right now)
19:19:54 Ah.
19:20:01 Adran: could work where? we have support for google auth in totpcgi - it's available - just not integrated into fas
19:20:05 Adran: yes - what abadger1999 said :)
19:20:22 so, next week is fudcon... how many of you will be out there?
19:20:24 skvidal: Right. Maybe it can be integrated? :)
19:20:51 nirik: s/fudcon/pycon/
19:20:58 I'll be there :)
19:21:00 yeah, sorry, brain failure. ;)
19:21:01 me too
19:21:50 so, suggestion: we may want to be cautious about application changes tomorrow/the weekend... just so things are not possibly unstable with all you folks gone. ;)
19:22:37 and hopefully you all have safe travels. :)
19:22:55 any other application news?
19:23:10 pingou started a really good conversation with hughsie about integrating tagger with AppStream
19:23:30 and he's already written some code for it today.
19:23:38 Pretty exciting. :)
19:23:40 cool. I saw some of that.
19:23:46 I haven't looked at appstream much yet...
19:24:17 #topic Sysadmin status / discussion
19:24:25 I'll be there
19:25:03 so, let's see... not too much on the sysadmin side that I can recall in detail. ;)
19:25:14 nirik: a few things
19:25:18 #info ssmoogen reinstalled proxy01 as x86_64...
19:25:27 (which is good, since it now matches all the other proxies)
19:25:32 and I think it worked
19:25:32 1. we've been moving ahead on the ansible conversion
19:25:46 * mdomsch is going to need mmbapp01 to be x86_64 too due to s3cmd memory usage
19:25:59 2. I've added a path lookup plugin to ansible that will allow us to have lookups for staging then production like we do now in puppet
19:26:19 nice!
19:26:31 very nice
19:26:36 skvidal: yep. :) we need to test out some workflow there/make some simple playbooks for simple hosts to test things out.
19:26:48 mdomsch: :( oh well... we can do that.
19:26:51 3. more things moving into openstack and we're already running into limits of our available resources
19:27:11 oh :(
19:27:14 I have 4 more instances to finish transitioning and then we can start moving systems over to increase the available resources
19:27:30 which is great b/c that should double our available resources
19:28:01 nirik: for later - we might consider tinkering with multiple availability zones and osuosl02
19:28:04 and add cinder volumes from each of the compute nodes.
19:28:15 yep. That would be great.
19:29:02 so, that bug that lmacken mentioned earlier... I was wondering, should we make app servers depend on their local proxy?
19:29:19 right now they hit admin.fedoraproject.org, which is dns round robin for all of them.
19:29:28 but if they talked to their local proxy it might be faster.
19:29:36 and also if that proxy is down, then likely they are too.
19:29:59 so appX->proxyX?
19:30:34 yeah.
19:30:38 #info smooge is waiting to find out if physical memory arrived in PHX2 so we can give it to systems
19:30:42 app01/02/03/04 would hit proxy01
19:31:32 anyhow, it's a thought. I don't think it's urgent.
19:31:46 nirik: query on this
19:32:04 would it make more sense for us to pursue the above? or for us to pursue breaking all the apps out?
19:32:36 well, the above is pretty trivial if we want to do it. ;) Breaking apps out is still a longer-term thing we really must do, IMHO.
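As context for the Google Authenticator / totpcgi exchange above, a small sketch of how an RFC 6238 TOTP code (the scheme Google Authenticator implements) is computed; this is purely illustrative and not code from totpcgi or FAS, and it assumes Python 3 and a made-up base32 secret.

```python
# Illustrative RFC 6238 (TOTP) code generation, the scheme used by Google
# Authenticator. Not taken from totpcgi or FAS; Python 3 assumed, and the
# secret below is a placeholder.
import base64
import hashlib
import hmac
import struct
import time


def totp(secret_b32, digits=6, time_step=30):
    """Return the current TOTP code for a base32-encoded shared secret."""
    key = base64.b32decode(secret_b32.upper())
    counter = int(time.time()) // time_step            # 30-second time window
    msg = struct.pack(">Q", counter)                   # 8-byte big-endian counter
    digest = hmac.new(key, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                         # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % (10 ** digits)).zfill(digits)


# Both the server and the user's phone derive the same 6-digit code from the
# shared secret, so the server can check what the user types in.
print(totp("JBSWY3DPEHPK3PXP"))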
19:32:54 nirik: fair enough..
19:33:02 if the proxies were running next to the app servers
19:33:07 like on the same hw
19:33:10 I'm inclined to say yes
19:33:22 and I think breaking apps out kinda wants ansible to be ready to handle those apps.
19:33:25 but we don't really want an outage on proxy01 to kill the app servers
19:33:35 yeah, true.
19:33:50 well, let's leave it for now...
19:34:17 #topic Private Cloud status update / discussion
19:34:20 About the private cloud -- does this mean we're going to just use openstack going forward?
19:34:24 we already hit on some of this above...
19:34:33 abadger1999: so right now here is what we have discussed
19:34:43 abadger1999: 1. moving the instances we have over to openstack
19:34:54 2. moving 2 of the compute nodes over to openstack for additional resources
19:35:04 3. taking the remaining 2 machines for other prototyping/testing
19:35:22 whether 3 is of openstack or of $something_else is really up to later discussion/decision
19:35:32
19:36:03 would adding another proxy to phx2 help?
19:36:07 Okay, so it seems like our "production" services are moving to openstack but we're still testing out the alternatives?
19:36:13 #info moving things into openstack cloudlet
19:36:53 #info will move some compute nodes from the other cloudlet over to it, and then have 2 nodes to continue testing other things with.
19:37:08 abadger1999: yeah, or possibly we will use the other 2 to test the next openstack version...
19:37:14
19:38:08 I think we can knock a bunch of things off our 'need to figure out before production' list now too.
19:38:24 I'll look at cleaning up the wiki page on that, since I think we solved or decided many of them
19:38:50 anything else on cloudlets?
19:39:04 no - but I had something for openfloor when that happens
19:39:28 ok
19:39:33 #topic Upcoming Tasks/Items
19:39:38 #info 2013-03-07 remove inactive apprentices.
19:39:39 #info 2013-03-12 to 2013-03-21 pycon
19:39:39 #info 2013-03-19 to 2013-03-26 - koji update
19:39:39 #info 2013-03-29 - spring holiday.
19:39:39 #info 2013-04-02 to 2013-04-16 ALPHA infrastructure freeze
19:39:40 #info 2013-04-16 F19 alpha release
19:39:41 #info 2013-05-07 to 2013-05-21 BETA infrastructure freeze
19:39:43 #info 2013-05-21 F19 beta release
19:39:45 #info 2013-05-31 end of 1st quarter
19:39:47 #info 2013-06-11 to 2013-06-25 FINAL infrastructure freeze.
19:39:49 #info 2013-06-25 F19 FINAL release
19:39:51 anything anyone would like to schedule or note?
19:40:11 we have a little less than a month until alpha freeze.
19:40:38 #topic Open Floor
19:40:42 skvidal: you had something?
19:40:45 yah
19:40:53 something I was thinking about
19:40:53 I sent something about jenkins, feedback welcome :)
19:41:01 in any given week we're all working on a billion things
19:41:06 pingou: yeah, I have it marked to reply to. ;) Thanks.
19:41:15 and I was wondering if there was any thought to doing something like a virtual fad week
19:41:26 where we intend to focus on a few tasks and get them done
19:41:31 skvidal: focusing on one area?
19:41:33 sounds nice
19:41:34 yeah.
19:41:35 right
19:41:39 we all are on irc
19:41:42 and most of the time all day long
19:41:55 I think that kind of thing is very effective if we plan exactly what we want to try and do.
19:41:55 it seems like it would be very possible to schedule a vfad
19:41:57 and most of us are in a close timezone :)
19:42:00 and make sure the needed people are available
19:42:09 hell, tie us together using a google hangout if need be
19:42:18 * threebean nods
19:42:20 but announce that we will be focused on one thing
19:42:28 and unavailable for random-ass pings
19:42:41 #info idea: do some vfads and focus on specific areas to get them done.
19:42:48 I liked that we were able to knock out a specific problem last year for the 2fa fad
19:42:57 and I think we should try it w/o the relocation
19:43:10 so what subjects would be things we could knock out?
19:43:20 application logging?
19:43:27 * lmacken was just about to say that :P
19:43:30 2fa app-wise
19:43:30 perfect
19:43:53 breaking apps out
19:43:53 what else?
19:44:13 ansible migration
19:44:43 CLI logins for our web-app
19:44:47 fedorahosted-ng
19:44:58 but that comes back to the discussion abadger1999 wants to start :)
19:45:15 IDS
19:45:26 IDS?
19:45:31 intrusion detection system
19:45:46 ah, cool
19:45:48 * skvidal is making a list
19:45:51 identify and plan how to get rid of our SPOFs.
19:45:56 the db?
19:46:04 db-replication
19:46:11 db's are a big one.
19:46:18 spof?
19:46:19 there might be other things tho.
19:46:24 single point of failure
19:46:26 I'd love to have some FADs on 2-fa and oauth.
19:46:27 sorry, single point of failure.
19:46:34 oauth needs some discussion first.
19:47:08 2-fa is closer to having a solid plan where a FAD-type setting would really help
19:47:42 okay, that's a good start to a list
19:47:43 lmacken, for app logging -- do we know how to fix the problems we have?
19:48:04 * abadger1999 still hasn't found any reason we aren't getting all tracebacks in our logs.
19:48:35 abadger1999: I didn't know tracebacks were not appearing. Could be a few things.
19:48:49 I'd love to experiment with https://github.com/ryanpetrello/canary
19:49:12 lmacken: that's like my number 1 problem with app logging since 2007 or so :-)
19:49:18 spof?
19:49:20 sorry
19:49:25 hmm, probably a simple ini or wsgi config tweak honestly
19:49:36 getting all the data is a good first step, then to fix/reduce so it only tells us about real errors...
19:49:36 single point of failure
19:49:41 pingou: places where if that server/service dies then everything dies
19:49:52 nirik: I'll add another item to our list
19:49:53 nirik: nagios
19:50:01 I know some apps really send a ton of stuff... I think fas sends to error_log on every successful login.
19:50:18 yeah. I have done some work on nagios, but there is more to plan out and do.
19:50:25 skvidal: I hit arrow up/enter in the wrong window, sorry for the noise
19:50:37 pingou: oh - I thought you were still wondering what that means
19:50:57 lmacken: I would think so... but I know you've looked at it a ton of times and we've only succeeded in over-logging things that we don't care about :-/
19:51:47 I'd love to get to the point where someone says "hey, I just hit $app and got a 500" and we can easily see a traceback to attach to a ticket about it. ;)
19:52:02 so
19:52:05 abadger1999: potential pycon hackfest item :)
19:52:05 wrt app logging
19:52:10 it seems, to me, somewhat obvious
19:52:14 that if we break out app servers
19:52:18 logging becomes A LOT simpler
19:52:21 yep.
19:52:22 doesn't it?
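On the question above of tracebacks not reaching the logs ("probably a simple ini or wsgi config tweak"), a minimal sketch of one generic approach: a WSGI wrapper that logs unhandled exceptions with their tracebacks before re-raising. This is not the fix discussed in the meeting, and the logger and class names are illustrative.

```python
# Generic sketch: make sure every unhandled exception in a WSGI app ends up
# in the application log with a full traceback. Names here are illustrative,
# not Fedora Infrastructure code.
import logging

log = logging.getLogger("webapp.errors")


class TracebackLogger(object):
    """Wrap a WSGI application so unhandled exceptions are always logged."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        try:
            return self.app(environ, start_response)
        except Exception:
            # log.exception() records the message plus the full traceback.
            log.exception("unhandled error for %s", environ.get("PATH_INFO"))
            raise


# Usage (hypothetical): application = TracebackLogger(application) in the
# app's .wsgi file, so a "500 on $app" report always has a traceback to match.
```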
19:52:34 if the log is from foobarapp01 it's likely caused by foobarapp
19:52:34 yup
19:52:35 since isolating the logs for tagger won't involve sifting through a bunch of other logs
19:52:41 skvidal: for some definition of a lot.
19:52:44 yeah
19:52:45 (tagger was just an example)
19:52:51 that's about 1/3 of the problem I think.
19:53:09 abadger1999: what are the other 2/3rds?
19:53:12 centralizing
19:53:13 analyzing
19:53:30 another 1/3 is getting logs consolidated per service rather than per host.
19:53:32 realtime notifications of Bad Errors
19:53:46 trending
19:53:53 ie: fas is its on boxes but it's still hard to search for the traceback because it could be on fas1,2,3
19:54:08 *on its own boxes
19:54:20 okay - a couple of things to note - with our existing logging configuration on log02
19:54:26 right now we have 2 major logging groups
19:54:27 load balancing makes it harder for sure
19:54:27 per-host
19:54:28 and merged
19:54:38 there is nothing stopping us from grouping those logs, too
19:54:43 ie: fas
19:54:46 * nirik nods
19:54:47 apps
19:55:02 hooking up the SyslogHandler for each app is still on the TODO
19:55:04 so you'd end up with consolidated syslogs/app logs in /var/log/groups/fas/
19:55:49 skvidal: It's syslog-based for the apache logs? So we can have a single log file for a service? Because that would be really nice.
19:55:59 abadger1999: well we have the app log now
19:56:34 which is only for apps which are using it
19:56:42 which isn't many
19:56:42 bodhi in stg atm, iirc
19:56:43 istr it is local4
19:56:54 (that's the log facility)
19:56:57 * lmacken hasn't had cycles to wrap that up
19:57:09 yes local4
19:57:16 abadger1999: so we have 2 options there
19:57:25 (documented here: https://fedoraproject.org/wiki/Infrastructure/AppBestPractices#Centralized_logging)
19:57:41 1. set up apache on the app servers to emit all error logs to logger on local4
19:57:52 2. figure out a nicer way to set up apache/our apps
19:58:00 (or something in between)
19:58:08 I'd like to suggest one more avenue
19:58:11 that will require testing
19:58:13 and discussion
19:58:24 but it is this - if there is a non-syslog mechanism for getting apache logs off of systems
19:58:33 let's hear about it
19:59:05 we could also work on reducing our syslogs....
19:59:21 for example if there is a way to use 0mq to emit logs sanely
19:59:27 and to integrate it at the apache layer
19:59:37 I'd be inclined to pay attention to it, personally.
19:59:50 but
19:59:52 1. it needs to work
19:59:57 where sanely == reliably w/o risk of losing messages
19:59:58 * puiterwijk is finally home and online
19:59:58 2. be fairly reliable under load
20:00:05 lmacken: :)
20:00:09 yeah.
20:00:23 lmacken: it can lose a few - but ideally the ring buffer that rsyslog uses would be the most desirable
20:01:00 sounds like a logging vfad would be a good idea ☺
20:01:01 skvidal: so, can you post your vfad list and thoughts around those to the list and we can look at picking one and scheduling it?
20:01:13 * nirik nods. logging seems popular
20:01:17 * lmacken going to Monitorama after PyCon, so may have more ideas later this month
20:01:22 nirik: yes
20:01:34 thanks.
20:01:34 nirik: careful though -- logging is only popular because it's such a pain :-)
20:01:39 indeed.
20:01:55 it's only a pain for you crazy kids and your new-fangled webapps ;)
20:02:00 :)
20:02:00 ok, any other open floor items before we close out?
20:02:23 hm, I'd put a vote in for splitting appservers first. It might make fixing logging easier.
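For reference, a minimal sketch of the SysLogHandler hookup mentioned above, sending an app's Python logging to the local4 facility as described in the AppBestPractices wiki page linked in the discussion; the logger name, message format, and /dev/log socket path are assumptions, not the production configuration.

```python
# Sketch: route an application's Python logging to syslog on the LOCAL4
# facility so rsyslog can forward it to the central log host. The "myapp"
# logger name and the /dev/log socket path are assumptions for illustration.
import logging
import logging.handlers

handler = logging.handlers.SysLogHandler(
    address="/dev/log",                                  # local syslog socket
    facility=logging.handlers.SysLogHandler.LOG_LOCAL4,  # local4 facility
)
handler.setFormatter(logging.Formatter("myapp: %(levelname)s %(message)s"))

log = logging.getLogger("myapp")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.error("example error that rsyslog can now group per service")
```

With rsyslog matching on local4, messages like this can be split out into per-service files (e.g. under a group directory on the log host) instead of being mixed into the per-host logs.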
20:02:26 I think for logging -- we should have a plan (like: reconfigure all apps to log to syslog local4). Then the vfad concentrates on doing that plan.
20:02:57 yeah.
20:02:58 +
20:03:38 ok, thanks for coming everyone. Do continue on #fedora-admin, #fedora-apps, and #fedora-noc.
20:03:41 #endmeeting