18:01:02 #startmeeting Infrastructure (2012-05-24)
18:01:02 #meetingname infrastructure
18:01:02 #topic Robot Roll Call
18:01:02 Meeting started Thu May 24 18:01:02 2012 UTC. The chair is nirik. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:01:02 Useful Commands: #action #agreed #halp #info #idea #link #topic.
18:01:02 The meeting name has been set to 'infrastructure'
18:01:02 #chair smooge skvidal CodeBlock ricky nirik abadger1999 lmacken dgilmore mdomsch threebean
18:01:02 Current chairs: CodeBlock abadger1999 dgilmore lmacken mdomsch nirik ricky skvidal smooge threebean
18:01:30 * abadger1999 here
18:01:36 * lmacken
18:01:38 * threebean is here
18:01:42 * skvidal is here
18:01:47 hola
18:01:51 * jds2001 around
18:02:13 here
18:02:32 hello everybody
18:02:57 welcome everyone.
18:03:33 ok, let's go ahead and dive in then
18:03:37 #topic New folks introductions and Apprentice tasks.
18:03:37 If any new folks want to give a quick one line bio or any apprentices
18:03:37 would like to ask general questions, they can do so now. Anyone?
18:03:56 i will
18:04:09 * marcdeop is sorry to be late
18:04:20 no worries.
18:04:32 My name is Ivan Garcia; I've been a Linux KVM admin for 4 years now
18:04:50 living in the USA
18:05:18 cyberworm54: welcome. :) Are you most interested in the sysadmin side of things? or application devel/programming?
18:06:10 thank you :) I am more on the infrastructure/sysadmin side; I have programming and scripting skills too, such as python and bash
18:06:27 * codemaniac is here
18:06:35 great. Come see me in #fedora-admin later this afternoon and we can see about getting you set up in the apprentice group...
18:06:42 welcome codemaniac
18:07:00 \o nirik. Good to see you.
18:07:03 so, any other general questions from apprentices or new folks? Or shall we move on?
18:07:08 that would be excellent, I am really excited about this opportunity :) thank you
18:07:37 * marcdeop welcomes the new fellow!
18:08:09 :)
18:08:17 thanks marcdeop
18:08:17 ok, I'd like to diverge from the agenda I sent out for a bit and talk about issues we have been hitting this week...
18:08:33 #topic bugzilla and database issues this week
18:08:53 So, the first issue we ran into was that bugzilla.redhat.com was upgraded last Saturday.
18:09:14 We (in retrospect foolishly) updated python-bugzilla on our servers.
18:09:47 The new python-bugzilla has some behavior and cases where the same query as before generates a LOT of additional queries with the new interface.
18:10:11 We have now downgraded back to the old stable python-bugzilla and (almost) everything should be back to normal.
18:10:28 What can we do to better avoid this kind of thing moving forward?
18:11:07 One thing I can think of is to test better in stg with partner-bugzilla.
18:11:21 nirik: to be fair
18:11:25 Red Hat runs a bugzilla instance at partner-bugzilla.redhat.com that has email turned off.
18:11:30 nirik: the limited scale of tests in staging
18:11:30 wasn't here. testing bugzilla for some weeks?
18:11:39 might not have tipped off the issues
18:11:47 and the load issues may not have been visible
18:12:03 skvidal: true. Also, we didn't know until RHIT contacted us and blocked us out for causing too much load. ;)
18:12:08 nirik: indeed
18:12:37 marcdeop: they updated the testing one a while back, but only updated the production one (bugzilla.redhat.com) last Saturday
18:13:12 Anyhow, I will probably send an email on this to the list too with more info about what happened, etc.
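
[Editor's note: a minimal sketch of the kind of staging smoke test against partner-bugzilla that nirik suggests above. The product/component values and the dict-based query call are illustrative assumptions about the python-bugzilla interface of this era, not a confirmed reproduction of the problem query.]

    # Rough smoke test against partner-bugzilla (Red Hat's mail-disabled
    # test instance) before rolling a python-bugzilla update to production.
    import time

    import bugzilla

    bz = bugzilla.Bugzilla(url="https://partner-bugzilla.redhat.com/xmlrpc.cgi")

    start = time.time()
    bugs = bz.query({"product": "Fedora", "component": "kernel",
                     "bug_status": "NEW"})
    elapsed = time.time() - start

    # A big jump in wall-clock time after a client upgrade is a hint that
    # one logical query is fanning out into many XML-RPC calls server-side.
    print("%d bugs in %.1fs" % (len(bugs), elapsed))
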
18:13:31 Right on the heels of fixing the bugzilla issues, we ran into some database problems.
18:13:39 * pingou (late)
18:13:55 This was caused, we think, by a fedora-tagger update that hit the database hard every few hours.
18:14:25 * fcami__ is here (sorta)
18:14:26 nirik: were those performance problems?
18:14:50 marcdeop: ultimately, that's what we _think_ it was
18:14:51 marcdeop: yeah, worse, it would fill up all our database connections so our account system couldn't authenticate anyone.
18:15:10 nirik: we could check our database connections
18:15:15 with nagios
18:15:21 and warn us at a threshold
18:15:37 marcdeop: yeah, but sadly when the account system goes down, that means ALL our web apps that need it to authenticate go down.
18:15:50 So, nagios was sending us about 150-200 emails anytime this happened.
18:16:01 another nagios check isn't going to help, unless it reduces that. :)
18:16:17 well, I was thinking more about *identifying* the problem
18:16:24 as you remarked the work _think_
18:16:32 s/work/word
18:16:37 * nirik is still not 100% sure that was the problem, but reverting that seems to have fixed it.
18:16:56 marcdeop: the issue looks like a tagger/packages update beating up db connections
18:17:10 marcdeop: we're taking multiple steps right now to help solve this for the future
18:17:16 1. the rollback is a temp solution
18:17:29 2. moving fas' db to its own instance - away from everything else
18:17:54 when I get MM 1.4 solidified, it'll be much nicer to its database. Still needs a boatload of connections for the crawler though
18:17:57 yeah, currently our account system db is on the same db host with other things that may be less important
18:18:14 marcdeop: I agree that figuring out what was triggering it is a good idea - but moving the fas db is important independent of any failures from today
18:18:24 today just makes it acute, rather than a chronic concern
18:18:30 3. increasing what is in the freeze, so we don't make changes like this that might affect the larger picture.
18:18:32 skvidal: that's right
18:19:03 skvidal: +1
18:19:13 nirik: we can't control RH IT with our freezes
18:19:15 skvidal: indeed
18:19:28 mdomsch: very true...
18:19:39 but we could have not been updating packages/tagger now.
18:20:04 I just posted a freeze break request to infra list - +1's for the general process and I'll post more for the puppet changes
18:20:26 #action nirik to file outage tickets for these two things so we have a record of them.
18:20:40 #info plan to move fas db to its own db server.
18:20:44 nirik: have we actually had an outage?
18:20:51 #info adjust freeze document
18:20:57 or were we quick enough to reboot the pg server every two hours?
18:20:59 pingou: well, depends on how you define it.
18:21:18 there's a short window where things were not authenticating. (less than a minute I'd guess)
18:21:20 nirik: let's say an outage longer than a few min
18:21:24 ok
18:21:34 also, for the bugzilla thing, maintainers couldn't add bugs to their updates...
18:21:42 and bugs.fedoraproject.org didn't work
18:22:34 so, it was sorta kinda an outage of not so important things.
18:22:53 anyhow, just wanted to talk about those items real quick... will move along now.
18:22:56 * marcdeop wonders if bugs works as of right now
18:23:04 marcdeop: it should, yeah
18:23:09 #topic two factor auth status
18:23:18 any news on the two factor front this week?
18:23:40 nope
18:23:42 next
18:23:47 actually..
18:23:51 I haven't seen wolfkit in a bit
18:24:21 yeah, everyone is busy. :) No worries...
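
[Editor's note: a minimal sketch of the connection-threshold check marcdeop suggests above, in Nagios plugin form. The hostname, credentials, and thresholds are hypothetical.]

    #!/usr/bin/env python
    # Warn/alert when PostgreSQL connection usage nears max_connections,
    # using the standard Nagios exit codes (0=OK, 1=WARNING, 2=CRITICAL).
    import sys

    import psycopg2

    WARN, CRIT = 80, 95  # percent of max_connections; illustrative values

    def main():
        conn = psycopg2.connect("host=db-fas01 user=nagios dbname=postgres")
        cur = conn.cursor()
        cur.execute("SELECT count(*) FROM pg_stat_activity")
        used = cur.fetchone()[0]
        cur.execute("SHOW max_connections")
        limit = int(cur.fetchone()[0])
        pct = 100.0 * used / limit

        print("%d/%d connections (%.0f%%)" % (used, limit, pct))
        if pct >= CRIT:
            sys.exit(2)
        elif pct >= WARN:
            sys.exit(1)
        sys.exit(0)

    if __name__ == "__main__":
        main()
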
18:24:25 #topic Fedora 17 Release tickets
18:24:36 we have release tickets lined up for the release on Tuesday.
18:24:45 #info release 2012-05-29
18:25:07 .ticket 3285
18:25:08 nirik: #3285 (Fedora17 Final - new website) – Fedora Infrastructure - https://fedorahosted.org/fedora-infrastructure/ticket/3285
18:25:11 .ticket 3286
18:25:16 nirik: #3286 (Fedora17 Final: verify mirror space) – Fedora Infrastructure - https://fedorahosted.org/fedora-infrastructure/ticket/3286
18:25:21 .ticket 3287
18:25:23 nirik: #3287 (Fedora17 Final - release day ticket) – Fedora Infrastructure - https://fedorahosted.org/fedora-infrastructure/ticket/3287
18:25:29 .ticket 3288
18:25:30 nirik: #3288 (Fedora17 Final - verify permissions on content) – Fedora Infrastructure - https://fedorahosted.org/fedora-infrastructure/ticket/3288
18:25:34 .ticket 3289
18:25:35 nirik: #3289 (Fedora17 Final - modify Template:FedoraVersion on wiki) – Fedora Infrastructure - https://fedorahosted.org/fedora-infrastructure/ticket/3289
18:25:43 .ticket 3290
18:25:45 nirik: #3290 (Fedora17 Final - update stats gathering scripts for new release) – Fedora Infrastructure - https://fedorahosted.org/fedora-infrastructure/ticket/3290
18:25:56 I think we are pretty much on track with them all.
18:26:08 as usual, we can't check bits until rel-eng stages them
18:26:28 nirik: any guess when that is?
18:26:35 today or tomorrow?
18:26:40 dgilmore: ^ ? :)
18:26:53 nirik: I'll be staging today
18:26:55 * mdomsch needs to be sure the move-to-release script doesn't blow up again like it did last time
18:26:59 and disabling buildbranched
18:27:08 great.
18:27:50 likely I'll flip the bits for the mirrors today, but it might be in the morning
18:27:57 smooge: are you set to take the verify ones you usually do? or would you like someone else to?
18:28:08 going to triple-check that hardlinking etc. is right before opening up
18:28:29 I am ready
18:28:34 dgilmore: are we nuking drpms? I think there was a request to do that last cycle?
18:28:48 I will start checking after he has staged. We have plenty of disk space still
18:29:13 nirik: I don't know
18:29:22 nirik: it feels like we shouldn't nuke it
18:29:47 they don't do much good
18:30:08 https://fedorahosted.org/rel-eng/ticket/4963
18:30:11 no they don't
18:30:14 anyhow, just thought I would mention it.
18:30:20 anything else for release?
18:30:22 and skvidal swears that yum won't break if they don't exist
18:30:30 what did I do?
18:30:33 drpms?
18:30:41 oh
18:30:42 yah
18:30:47 that'd save a few GB
18:30:47 skvidal: if we drop drpms from the Everything tree
18:30:57 pretty confident in the opportunistic nature of yum on that
18:31:05 well
18:31:07 mostly confident
18:31:08 :)
18:31:18 skvidal: it's the mostly that scares me
18:31:21 :)
18:31:25 LIVE IN FEAR
18:31:33 die in peace!
18:31:38 :P
18:31:47 anyhow, shall we move along...
18:31:53 yes
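
[Editor's note: a minimal sketch of the hardlink sanity check dgilmore describes above ("triple-check that hardlinking etc. is right"): staged files with a link count of 1 are not sharing storage with existing trees and will consume fresh disk space. The path is illustrative.]

    # Count staged release files that are not hardlinked anywhere else,
    # and how much disk they would consume if opened up as-is.
    import os

    TREE = "/pub/fedora/linux/releases/17"  # illustrative mirror path

    singles, wasted = 0, 0
    for dirpath, dirnames, filenames in os.walk(TREE):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if st.st_nlink == 1:  # no other name points at this inode
                singles += 1
                wasted += st.st_size

    print("%d unlinked files, %.1f GB not shared" % (singles, wasted / 1e9))
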
18:31:57 #topic Applications status / discussion
18:32:09 ooh me me pick me :)
18:32:13 So the fedorahosted app I've started on is coming along well; threebean has helped a lot too :) It's moved to Fedora Hosted now (still mirrored to github): https://fedorahosted.org/fedorahosted/. I did my first test of things on a RHEL 6 vm last night; the CLI side of the app works fine, but python-flask isn't packaged for RHEL.
18:32:15 lmacken / abadger1999 / threebean / pingou / relrod: any apps news?
18:32:18 *so* my question is this. The web app has no tie to anything Fedora Hosted right now. It's just a web form and a db that stores stuff, that may end up having FAS auth so people can't create requests for $random_fas_username. Do we want to explore putting and leaving the web frontend on openshift? Or do we want to get Flask + deps in EPEL? I'm not sure what the general FI stance is re: "production-ish" stuff on openshift.
18:32:26 * relrod may have had that typed out beforehand ;)
18:32:41 * lmacken did lots of fedora-packages/moksha work this week... nothing visible though.
18:32:59 I'd prefer that we host things like that ourselves... just to avoid an external dependency.
18:33:20 apps.fp.o/{packages,tagger} are pretty much done though.
18:33:28 * threebean prepares for the python-flask-* packaging party
18:33:35 nirik: ok. It was just a thought
18:33:40 * dgilmore has tasks for a webapp developer if they want something to do
18:33:51 relrod: it looks like it has an epel branch, but it was just never built for some reason.
18:34:27 * skvidal has been working on a coprs-submission front end using flask and openshift, too - it's just a test/wip
18:34:34 lmacken: that's cool. Do we want to do some kind of announcement? perhaps after release?
18:34:40 I got busy last week-end on mongodb vs postgresql for HyperKitty, but not much since
18:34:46 nirik: yeah, that sounds good.
18:35:22 nirik: I still need to track down why some new packages aren't getting indexed properly, so holding off on the apps.fp.o announcement until after the release is probably a good call.
18:35:25 ah, and kinda fix fedora-active-user for the new bz :)
18:35:51 lmacken: yeah, and that way we can doublecheck over things to make sure it's all set.
18:35:56 without freeze issues, etc
18:36:22 then to that end (re: packaging python-flask-*), if someone is bored and wants to get these in EPEL, I would love you: python-flask, python-flask-sqlalchemy, python-flask-wtf, python-wtforms
18:36:43 relrod: thinking -- we should hammer out the FAS auth for flask but separate it out once we're done into python-fedora-flask
18:37:00 * threebean wonders if ianweller's fudcon app has already done this
18:37:07 his app uses openid
18:37:08 I think it uses openid
18:37:09 threebean: I was planning on just using openid + limiting to fedoraproject openid only
18:37:11 relrod: they appear to have epel branches. I'd try and contact the maintainer and see if they mind you managing those since they never built for it.
18:37:22 abadger1999: is there a good reason to NOT do that?
18:37:50 skvidal: I like using openid rather than dedicated fas auth in general. Seems to make things more flexible
18:38:07 on the openid front... we may want to look at mod_auth_openid. It looks like it might work to replace our mod_auth_pg thing if we wanted to.
18:38:07 the one issue is that I don't believe we have an active member of our dev team that knows openid server-side
18:38:18 so if it's broken in fas, we won't be able to fix it easily.
18:38:21 abadger1999: good point
18:38:38 * mdomsch is making _slow_ progress on MM 1.4. Bug fixes and perf improvements in the last couple weeks. I need to do more testing before I unleash it on the world - well after the F17 launch
18:38:41 But we're already depending on it for several things without issue.
18:39:07 so... more doesn't hurt a whole lot and it does give other sorts of flexibility.
18:39:34 abadger1999: you can learn in 2 hours, right? :)
18:39:55 jds2001: Yeah. two hours on the beach in the bahamas, you foot the bill ;-)
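
[Editor's note: a minimal sketch, assuming the Flask-OpenID extension, of the "openid, limited to fedoraproject openid only" approach relrod describes above. The identity-URL prefix check and view layout are the editor's assumptions, not the fedorahosted app's actual code.]

    # Restrict logins to Fedora's OpenID provider; no free-form OpenID field.
    from flask import Flask, redirect, session
    from flask_openid import OpenID

    app = Flask(__name__)
    app.secret_key = "change-me"  # placeholder; load from config in practice
    oid = OpenID(app)

    FEDORA_OPENID = "https://id.fedoraproject.org/"

    @app.route("/login")
    @oid.loginhandler
    def login():
        # Always hand the user straight to the Fedora provider.
        return oid.try_login(FEDORA_OPENID, ask_for=["nickname"])

    @oid.after_login
    def after_login(resp):
        # Reject identities that did not come from the Fedora provider.
        if not resp.identity_url.startswith(FEDORA_OPENID):
            return "sorry, Fedora accounts only", 403
        session["openid"] = resp.identity_url
        session["nickname"] = resp.nickname
        return redirect("/")
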
18:40:09 sounds like a plan
18:40:12 one advantage of mod_auth_openid is that you can run a script after the openid auth to determine if the auth is allowed or not; we could possibly hook that into the 2 factor stuff.
18:40:26 nirik: Mmm that sounds nice.
18:40:38 * marcdeop joins that plan
18:40:51 http://findingscience.com/mod_auth_openid/ if anyone wants to look it over.
18:41:03 it's under review (stalled), so we would need to finish that before we use it.
18:41:41 anyhow, any more app news? or shall we move on?
18:41:58 nothing else from me
18:42:08 ok, upcoming items:
18:42:16 #topic Upcoming Tasks/Items
18:42:16 #info 2012-05-08 to 2012-05-29 FINAL FREEZE
18:42:16 #info 2012-05-29 - F17 release
18:42:16 #info 2012-06-01 - nag fi-apprentices.
18:42:16 #info 2012-06-03 - gitweb-cache removal day.
18:42:16 #info 2012-06-04 - class B reboots?
18:42:17 #info 2012-06-05 - class A reboots?
18:42:19 #info 2012-06-08 OOW: osuosl01.fedoraproject.org
18:42:20 #info 2012-06-17 OOW: sign-vault02.phx2.fedoraproject.org
18:42:22 #info 2012-06-21 to 2012-07-04 Kevin is off on trains and boats.
18:42:24 I was looking at scheduling some mass rebooting for after the freeze.
18:42:30 get us up on the current kernel and updates.
18:43:11 nirik: Historically the final freeze goes until one day after the release, not day-of
18:43:21 yeah, sorry, you are right.
18:43:45 * nirik adjusts
18:43:53 anyhow, anything folks want to schedule or note?
18:44:26 trip to AZ in July is the only thing on my radar.
18:44:29 voting will be happening June 1-7
18:44:44 #info sometime in June hardware will show up in PHX for new systems
18:44:47 nirik: going to look at doing some koji changes soon
18:44:49 there are also town hall meetings coming up for the elections... everyone should go and ask hard questions. ;)
18:45:11 smooge: cool. yeah, we should coordinate that with RH IT folks and get it scheduled.
18:45:16 dgilmore: what's on the slate?
18:45:23 * inode0 is hoping all the rebooting doesn't affect voting, as we needed to extend once before for interruptions
18:45:45 inode0: I could push them off a week...
18:45:48 nirik: move to mod_wisgi
18:45:49 and avoid that.
18:45:58 wsgi
18:46:04 mediawiki upgrade
18:46:24 nirik: there have been some biggish changes in koji
18:46:29 upstream
18:46:35 inode0: I'll do that. Thanks for the note.
18:46:53 going to look at setting up some policies to limit certain builds to certain builders
18:46:53 smooge: yeah, we need to get staging all updated so we can test... but it would be good to get that done too.
18:47:18 dgilmore: cool. Should we schedule that for when we do reboots? or will it be further down the road?
18:47:33 nirik: maybe further down the road
18:47:38 ok
18:47:52 koji with mod_wsgi has not had a lot of testing yet
18:48:08 it's a pretty big change from mod_python
18:48:15 yeah, but mod_python is dead.
18:48:24 yerp
18:48:42 and we should secure our apps from hash collission attacks soon :)
18:49:01 **collision
18:49:10 lmacken: yeah. Possibly we could do that at the same time as a reboot outage...
18:49:16 although it shouldn't be much downtime
18:49:35 cool.
18:49:51 if it's ready by then I guess. ;)
18:50:22 we should probably just take the environment variable approach first, since mod_wsgi upstream has been far from useful with regard to this patch
18:50:34 yeah. seems easiest.
18:51:02 #info mediawiki 119 upgrade coming.
18:51:09 #info koji upgrades coming
18:51:25 #info will apply fix for mod_wsgi hash stuff soon
18:51:41 #info trip out onsite to phx2 sometime in the next few months.
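
[Editor's note: a minimal sketch of verifying the "environment variable approach" to hash-collision hardening discussed above, assuming a Python build (>= 2.6.8 / 2.7.3) that honors PYTHONHASHSEED. With randomization on, string hashes differ between interpreter runs, which is what defeats precomputed collision attacks against dict-backed web apps.]

    # Spawn two child interpreters with PYTHONHASHSEED=random and confirm
    # they hash the same string differently.
    import os
    import subprocess
    import sys

    cmd = [sys.executable, "-c", "print(hash('collision-test'))"]
    env = dict(os.environ, PYTHONHASHSEED="random")

    h1 = subprocess.check_output(cmd, env=env)
    h2 = subprocess.check_output(cmd, env=env)
    print("randomized" if h1 != h2 else "NOT randomized - hashes are stable")
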
18:51:47 Anything else?
18:51:59 MM 1.4 when I get around to it
18:52:23 #info mirrormanager 1.4 coming too
18:52:40 * nirik wonders if we can get all this done before the f18 freezes start. ;)
18:52:45 time will tell
18:52:56 #topic Open Floor
18:53:02 S3 mirrors
18:53:03 * marcdeop encourages positive attitude!
18:53:07 ok, anything for open floor? questions, comments?
18:53:13 I've started a new bucket for us-west-1
18:53:15 mdomsch: oh yeah, what's the status there?
18:53:35 and bapp02 is doing an s3cmd sync from s3://us-east-1 to s3://us-west-1
18:53:47 so in reality s3 is doing the copy for us
18:53:50 nice
18:53:56 well... S3 is slow...
18:54:03 it's been running for 3-4 days now
18:54:06 since Monday
18:54:06 how many regions should we/can we support?
18:54:23 we could do several, but the copying is _really_ slow
18:54:47 and arguably we should move archive (f8) content into its own set of buckets, so s3cmd list doesn't have to walk that content
18:54:58 on every sync run
18:55:27 yeah.
18:55:33 I presume we prefer to let s3 do the copying (initial copy costs the OpenShift team about $11)
18:55:43 into each bucket, rather than us pushing from phx2 into each bucket
18:55:47 is this something we want to announce/shout out? either as part of release or separate?
18:55:50 so it becomes a 2-stage sync
18:56:01 I don't know if it'll be done in time for the release...
18:56:08 yeah, I'd say let them do it unless it's just too slow to work
18:56:16 phx2 -> us-east-1; us-east-1 -> us-west-1
18:56:32 it's taking days to move 300GB
18:56:44 mdomsch: I'd say phx2 to us-west-1
18:56:46 but I forget how long the initial upload from phx2 -> us-east-1 was
18:56:51 just because it's physically closer
18:57:05 do we have any amazon contacts we could ping and ask?
18:57:16 * nirik looks at spevack.
18:57:16 nirik: spevack
18:57:43 anyhow, I can bring that online whenever; it's just a DNS change once the content is in place
18:57:52 and an s3sync change to copy to a second destination
18:58:07 mdomsch: can you ping spevack and see if he can provide any suggestions on how we can better do it?
18:58:17 * nirik can too if you like
18:58:22 mdomsch: do we know what kind of hits the ec2 mirror gets?
18:58:37 dgilmore: we have the logs; I don't have a log parsing tool to count them
18:58:51 mdomsch: ok, mostly just curious
18:59:14 many many many small files with 1-50 hits in each file
18:59:52 that's all I have
18:59:55 mdomsch: ok, mostly I'm curious as there has been zero feedback on the AMIs for f17
19:00:05 makes me wonder if they have even been used
19:00:33 amis?
19:00:46 marcdeop: the fedora images in EC2
19:00:48 amazon images
19:00:58 marcdeop: Amazon Machine Images
19:01:02 https://aws.amazon.com/amis
19:01:03 so someone can say: "make me a new instance... Fedora 17 please"
19:01:04 ;)
19:01:10 ohh yeah, thanks
19:01:22 :)
19:01:30 EC2
19:01:47 ok, any other open floor items? shall we call it a meeting?
19:02:34 ok, thanks for coming everyone!
19:02:37 nirik: think we are good
19:02:48 as usual, hang out in #fedora-admin, #fedora-noc and #fedora-apps.
19:02:51 #endmeeting
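
[Editor's postscript: mdomsch notes above that the S3 mirror logs exist but no counting tool does; below is a minimal sketch of one, assuming the standard S3 server access-log format in which the request line appears as a double-quoted field. Run as: python count_hits.py access.log]

    # Tally per-key hit counts from an S3 access log and print the top 20.
    import collections
    import re
    import sys

    # Matches the quoted request line, e.g. "GET /bucket/some/key HTTP/1.1"
    REQUEST = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

    hits = collections.Counter()
    for line in open(sys.argv[1]):
        m = REQUEST.search(line)
        if m:
            hits[m.group(1)] += 1

    for key, count in hits.most_common(20):
        print("%8d %s" % (count, key))
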