20:00:29 #startmeeting
20:00:32 gday mmcgrath
20:00:40 * ricky
20:00:45 #topic Infrastructure -- Who's here?
20:00:51 * johe|home takes a seat
20:00:52 dgilmore: how's it going?
20:00:59 * SmootherFrOgZ is
20:01:08 * sijis sijis is here.
20:01:09 * ke4qqq is
20:01:12 mmcgrath: 2 builders to go
20:01:36 dgilmore: for stg ?
20:01:37 dgilmore: excellent, happy to hear it.
20:01:38 hello
20:01:42 Well lets get started
20:01:48 #topic Infrastructure -- Tickets
20:01:55 .tiny https://fedorahosted.org/fedora-infrastructure/query?status=new&status=assigned&status=reopened&group=milestone&keywords=~Meeting&order=priority
20:01:56 mmcgrath: http://tinyurl.com/47e37y
20:02:03 .ticket 1503
20:02:04 abadger1999: take it
20:02:07 mmcgrath: #1503 (Licensing Guidelines for apps we write) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/1503
20:02:17 SmootherFrOgZ: nope
20:02:26 So we've had a new license pop up in apps we've written recently
20:02:29 AGPLv3+
20:02:47 That's incompatible with GPLv2 which is what the majority of our apps use presently.
20:03:12 After looking over the situation with spot, it seems like it would be good to move everything to AGPLv3+.
20:03:26 im ok with the move
20:03:31 (With libraries going to LGPLv2+)
20:03:35 abadger1999, when you say we use.. do you mean we right or other stuff
20:03:42 We write.
20:03:46 s/right/write/
20:03:49 thanks
20:04:01 smooge: This would not affect code that we don't write.
20:04:12 And it's a recommendation rather than a hard and fast rule.
20:04:31 abadger1999: have you run into anyone saying "ehh, I don't think we should do this." ?
20:04:35 ie: mdomsch wants mirrormanager to be MIT; mediawiki plugins should follow mediawiki's license
20:04:47 mmcgrath: So far everyone's been positive.
20:04:58 abadger1999: ok, so how do we actually _do_ it?
20:05:03 sed?
20:05:37 yeah, we have to replace COPYING files with AGPL/LGPL and then change the headers in source files.
20:05:40 well you need to look at each app and see if its something we wrote or pulled in from somewhere else
20:06:04 do you need to get written proof from author before changing?
20:06:23 How urgent is this time-wise?
20:06:24 if its pulled in we need to deal with it.. if its something we wrote 100% we should be able to replace COPYING/headers
20:06:33 sijis: for the majority of things no, but I am going to notify authors of pkgdb and python-fedora before I make changes.
20:06:44 ricky: I'd say not real urgent, but the longer we wait... the longer we're going to wait I suspect.
20:06:47 For example, with FAS, I'd like to eventually rewrite the OpenID provider part instead of dealing with licensing pain because of samadhi or anything.
20:06:53 sijis: The CLA gives us the ability to do a relicense if the contribution was made without an explicit license.
20:07:21 abadger1999: some seemed timid about that on f-a-b. I'm less timid.
20:07:22 ricky the other option is to find out what jcollie thinks about AGPLv3+
20:07:28 but we should ask
20:07:43 abadger1999: lets take an app like fas first.
20:07:44 just see how it goes.
20:08:14 yeah, it's common courtesy and also gives people a chance to holler "Oh wait, I actually didn't own the copyright to that code.. sorry."
20:08:33 abadger1999: are you going to lead the effort on this?
20:08:37 I'd like to do python-fedora soon. It's moving to LGPLv2+ which is more permissive
20:08:42 should we open a ticket for each app?
20:08:43 how many apps are we talking about for this? +/-15?
20:08:46 mmcgrath: I can. Yes, each app.
20:08:48 sijis: less than 15
20:08:52 sijis: Less than 15
20:09:25 abadger1999: sounds good, so anything else?
20:09:39 A ticket for each app will let us come back next week and say -- half of our app authors like a licensing policy but don't want to change *their* app.
20:09:45 Which would mean we need to rethink.
20:10:09 I think that's all unless someone wants to shout that it's a bad idea now :-)
20:10:25 anyone have anything to say? If not now, take it to the list.
20:10:27 and do it sooner, not later.
20:10:45 Ok, so next topic
20:10:54 #topic Infrastructure -- The merge, outages and issues.
20:11:00 So we had a merge last week.
20:11:07 and since the merge we've had some issues
20:11:13 and it's not something obvious.
20:11:16 define merge for me?
20:11:21 and, in fact, could be completely unrelated.
20:11:29 smooge: merge from staging to master branches in puppet.
20:11:37 smooge: We made a ton of changes in the staging branch and merged them to production :-)
20:11:46 Which basically involved refactoring a bunch of puppet code, cleaning things up, creating some new modules, etc, etc.
20:11:54 I've not seen a wiki outage since yesterday.
20:12:00 I need to go through the logs and look.
20:12:21 while doing some digging we, just in general, found strange issues in our environment.
20:13:04 mmcgrath, ricky thanks..
20:13:20 what have been the strange ones
20:13:27 for example - http://mmcgrath.fedorapeople.org/proxy-errors.html
20:13:51 200,000+ 502's per day.
20:13:55 just seems massive to me.
20:14:00 In terms of the big outages, they've all seemed to happen during mysql database backups (which lock tables) or smolt render stats jobs.
20:14:22 The proxy errors and 500s seem to be something else though.
20:14:28
20:14:40 and our current lead on the 500's errors for fas is a new mod_wsgi
20:14:44 Have the 500 errors stayed normal?
20:14:44 jbowes is working on that.
20:15:03 (As in, have they gone up after the merge or not?)
20:15:31 ricky: hard to say
20:15:46 http://mmcgrath.fedorapeople.org/JuneErrors.html
20:15:54 I'll re-check today now that it's been a few more days.
20:15:57 clearly we had a major spike
20:16:10 mmcgrath: the first graph shows it being mostly proxy2
20:16:14 but it seems to have gone back down.
20:16:21 Strange.
20:16:22 sijis: yeah, and proxy2 is an odd beast.
20:16:31 proxy2 is load balanced with proxy1 behind the PHX balancer.
20:16:35 _however_
20:16:45 anything in phx uses proxy2 directly to get to the account system.
20:16:51 which not only includes shell accounts.
20:17:03 but also includes our web applications contacting fas for session, auth, etc.
20:17:09 which is a significant amount of traffic.
20:17:18 interesting.. is there a reason for just proxy2?
20:17:21 Funny that proxy1 seems fine.
20:17:30 ricky: well it does get a lot less traffic.
20:17:34 Like it didn't jump significantly at all.
20:17:39 I guess.
20:17:41 smooge: the network team won't let us contact the balancer IP directly.
20:18:03 so you are forced to pick a proxy?
20:18:05 ah ok could we setup another proxy?
20:18:22 smooge: we have two of them there.
20:18:28 but no good way to balance between the two of them.
20:18:47 we could put a load balancer in there, but it'd be just another box, and would need to be rebooted as often as proxy2 is anyway
20:18:58 Is the problem really coming from our PHX admin.fp.o setup though?
20:19:06 mmcgrath, no what I meant was one that was just for that so we could cut down on what might be causing the errors?
20:19:12 The 502s really jumped everywhere, so that's what I want to know the root cause of.
20:20:04 so if its a bruteforce attack on stuff we could get an idea of what app is being targeted or something
20:20:26 I think the errors are on our end, I need to do more log checking to know for sure though
20:20:29 But the brute force shouldn't be causing 502, it should be working :-)
20:20:41 but yeah we can add and remove more proxy servers in PHX if we want to
20:20:58 mmcgrath: Can we separate that graph into apache 502s and haproxy 502s?
20:21:10 Right now they're lumped together in the source where you're getting it from, right?
20:21:37 ricky: I don't think so, because if haproxy or the app server returned a 502, apache would log a 502.
20:21:46 so proxyX will always have our largest number of 502's
20:21:55 then haproxy (if we're logging that, not even sure)
20:21:58 then the app server
20:22:14 although the app servers probably don't throw 502
20:22:17 mmcgrath: But some 502s are coming from apache, as in proxy1 couldn't contact localhost:10009
20:22:32 Those are the strangest ones to me.
20:22:42 I'll have to look closer then.
20:22:47 firewall?
20:23:08 sijis: I don't think so - it definitely works a large percent of the time
20:23:20 sijis: I'd actually think that's the app server not responding to haproxy, and thus not responding to the proxy server.
20:23:41 But that should strictly cause haproxy 502s not apache 502s, correct?
20:23:41 and I'm not seeing us hitting our haproxy limit.
20:23:46 and we've seen both :-(
20:23:55 ricky: when looking at the logs, how can you tell the difference?
20:24:15 oh from it saying it couldn't contact localhost:10009
20:24:21 I'm not sure. I'd expect the apache 502s to show up in the apache error log and both types of 502s to show up in the error log.
20:24:31 I'll have to verify that tohugh.
20:24:32 **though
20:24:38 hm
20:24:39 hm
20:24:40 hmmmm
20:24:59 Was your source for these graphs the error log or the access log?
20:25:09 access I believe
20:25:11 is haproxy on a different server or on proxy2?
20:25:12 * mmcgrath looks
20:25:26 sijis: each proxy server has its own haproxy service on the same host
20:25:51 ricky: access.log
20:26:09 perhaps we should continue discussing this after the meeting.
20:26:12 Ah, OK.
20:26:15 any objections?
20:26:20 Sure thing
20:26:42 nope.
20:27:12 # topic Infrastructure -- Eye in know db. - INNODB
20:27:22 #topic Infrastructure -- Eye in know db. - INNODB
20:27:33 ricky: this one's you. Talk about your plans, what's going on, what's going wrong, etc.
20:27:35 is that a rock band?
20:27:40 Any MySQL experts around, by the way? :-)
20:27:50 ricky: abadger1999 is a mysql expert
20:27:52 ricky: for some definition of expert
20:27:53 :-P
20:28:17 Part of the big outages we've seen since the merge seems to be due to mysql backups (and smolt's stats refresh script, which might be a separate problem)
20:28:36 We've seen this behavior with the zabbix database, where the backup would lock entire tables
20:28:43 ricky: Yep, of the yum erase '*ysql' ; yum install 'postgres*' variety
20:28:46 abadger1999: Hehe
20:28:47 * mmcgrath notes we've always had a small problem with backups and outages. But they've been tiny blips. Lately they've been throwing nagios alerts.
20:29:33 how many mysql databases do we have?
20:29:35 We'd like to move to using the --single-transaction option to mysqldump, which combined with InnoDB, should make backups not lock the entire table
20:30:02 ricky: yes!
20:30:03 The main mysql usage we have is mediawiki, smolt, and zabbix
20:30:18 Although we have a few others for stuff like cacti, prelude/prewikka, etc.
20:30:20 ricky: FWIW, we've also had good luck with http://www.zmanda.com/backup-mysql.html (community edition)
20:30:31 ricky, are they separate servers or one single one
20:30:49 Jeff_S: Thanks, I'll take a look at that later
20:30:54 smooge: They're all on db1
20:30:57 smooge: all mysql db's are on db1
20:31:06 So far, the biggest pain we've had so far is the host_links table in smolt
20:31:16 ricky: and how big is it?
20:31:19 O:-)
20:31:25 It has above 70M rows, and I haven't gotten a single successful conversion to InnoDB yet.
20:31:45 And the thing with --single-transaction is that the tables need to be InnoDB to be sure that everything gets dumped in a consistent state
20:31:59 but single-transaction will probably solve your main problem of locking the table(s)
20:32:00 ricky: We're able to dump that table? Are we able to reload it except as innodb?
20:32:02 wow thats quite a bit
20:32:07 ricky: and what are the downsides to innodb? (space, etc, etc)
20:32:16 slower
20:32:17 mmcgrath: slower at certain operations
20:32:22 So the approaches that we've tried so far are: converting using alter table, and sedding a dump to change the table type, and loading it.
20:32:26 how much slower?
20:32:38 The first didn't finish after some large number of hours, and the second is going now.
20:32:59 mmcgrath: I'm actually not that sure about the downsides yet. Apparently loading huge tables is a huge pain.
20:33:02 ricky: I'm going to want render-stats metrics too
20:33:05 mmcgrath: depends on the dataset & queries. the locking though more than makes up for it IMO
20:33:22 Also, some tables needed MyISAM for full text search - the only table affected by this is mediawiki's searchindex tables
20:33:32 :-(
20:33:34 (Which is just a copy of another InnoDB table, I believe)
20:33:45 ricky: and, in theory, we'll be able to get rid of that when we have a fedora search engine.
20:33:52 Hopefully.
20:34:23 Anyway, we'll probably have a mysql outage some time in the future once we get a successful test in staging.
20:34:39 mmcgrath: one of our past employees wrote this, I think it explains the reasons for using InnoDB pretty well http://tag1consulting.com/MySQL_Engines_MyISAM_vs_InnoDB
20:34:39 ricky: yeah, how have the other conversions gone?
20:34:44 what might be the case now is that maybe our configs aren't tuned for large innodb tables.
20:34:50 ok what books/sites should I read to catch up how to help this. (DB's are not my specialty :/)
20:35:08 mmcgrath: All of the other tables in the smolt db other than host_links have finished in <20 minutes
20:35:35 Apart from the smolt db, most of the mediawiki db is already innodb
20:36:03 The other databases that need conversions are: cacti, prelude._format, prewikka, and transifex (which isn't used anymore anyway)
20:36:04 ricky: I believe I went through and did some innodb conversions back in the day on some of those.
20:36:48 prelude and prewikka are pretty much dispensable since that stuff is still being tested (lmacken even purged and recreated some of those dbs recently)
20:37:05 ricky: how big were those dumps?
20:37:32 So smolt is basically the big hurdle - although I have some questions about the smolt upgrade and the db changes there
20:37:45 The dump of the smolt database is 2.5G
20:37:52 * lmacken looks at the time, and rolls in late
20:38:00 ricky:
20:38:00 alter table host modify column cpu_model varchar(80);
20:38:01 alter table host add column cpu_stepping int(11) DEFAULT NULL;
20:38:01 alter table host add column cpu_family int(11) DEFAULT NULL;
20:38:01 alter table host add column cpu_model_num int(11) DEFAULT NULL;
20:38:06 that's the smolt upgrade.
20:38:16 mmcgrath: Oh, OK - that's no problem at all then.
20:38:38 The host table took <20 minutes, so we can do that before or after, and it's fine
20:38:39 * mmcgrath doesn't really even know what "int(11)" means
20:38:44 have you guys been using SQLAlchemy-migrate for that stuff? or doing it by hand?
20:38:46 I need to look that up :)
20:39:05 lmacken: honestly I can't stand alchemy-migrate so I've been doing it by hand.
20:39:20 mmcgrath: heh. I've never used it before
20:39:41 :)
20:39:47 ricky: ok, so anything else on the db front?
20:40:17 Nope, but if anybody knows a lot about MySQL, let us know about your experiences with stuff like this
20:40:20 Jeff_S: Thanks again for the links!
20:40:46 k
20:40:50 ricky: np. I'm glad to have our current DBA lend a hand if needed
20:40:53 #topic Infrastructure -- Posse
20:41:05 So I haven't been as transparent with this as I should be
20:41:09 It's basically this
20:41:12 #link http://teachingopensource.org/index.php/POSSE_2009
20:41:21 we're providing some guests for a week for them to use.
20:41:42 +1 to open source :)
20:41:43 Is it going to be on fasClient? :-)
20:41:55 ricky: nope, they're completely disconnected atm.
20:42:06 this is their first time through this.
20:42:10 Ah, OK
20:42:11 maybe next year.
20:42:16 but all of these guests are on cnode1
20:42:20 part of the cloud stuff.
20:42:22 what servers are their guest on
20:42:27 Hehe
20:42:29 ah
20:42:31 I ended up not using osuosl1
20:42:46 since it's RHEL5 and for some reason xen+fedora 11 seems to be my white whale.
20:42:54 but cnode1 was F10, and using KVM worked just fine
20:43:05 Anyone have any other questions on that?
20:43:48 Ok
20:43:52 #topic Infrastructure -- Open Floor
20:43:58 anyone have anything they'd like to discuss?
20:44:16 I'm going to be deploying a new version of bodhi tonight/tomorrow to support EPEL :)
20:44:37 hopefully we'll be able to start queueing updates up tonight
20:44:38 yeah
20:44:43 and ideally mashing repos tomorrow
20:45:26 lmacken: sounds good
20:45:33 and on a related note, I need to rebuild relepel1
20:45:43 * mmcgrath fail built it
20:45:57 anyone have anything else?
20:45:57 smooge: ?
20:46:20 sorry
20:46:27 keyboard problems
20:46:47 I am checking to see what boxes need updates and I am working on seeing what ones I can do
20:46:56 I should have that done by tonight/tomorrow.
20:47:16 After that I am checking to see that func and puppet are working on the boxes
20:47:32 and then finding out all the secret handshakes and such
20:47:41 heheh
20:47:43 fun times
20:47:56 I should have the func done by friday and then it will be time to work on zabbix
20:48:03 smooge: excellent.
20:48:16 Ok, and with that if no one has anything else we'll close in 30
20:48:16 zabbix will be next weeks project
20:48:16 done
20:49:09 ok everyone, thanks for coming!
20:49:12 #endmeeting
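
A minimal sketch of the dump-rewriting approach mentioned at 20:32:22 ("sedding a dump to change the table type"), assuming the dump follows the usual mysqldump layout in which each CREATE TABLE statement ends with a line beginning `) ENGINE=MyISAM`. The log does not show the actual sed command, so Python stands in for it here; the function name `convert_engine` and the `keep_myisam` parameter are hypothetical. Tables named in `keep_myisam` (for example mediawiki's searchindex, which needed MyISAM for full text search) are left untouched.

```python
import re
from typing import Collection, Iterable, Iterator

# Matches the first line of a CREATE TABLE statement as mysqldump writes it.
CREATE_RE = re.compile(r"^CREATE TABLE `([^`]+)`")
# Matches the storage engine clause on the closing line of the statement.
ENGINE_RE = re.compile(r"\bENGINE=MyISAM\b")


def convert_engine(
    lines: Iterable[str], keep_myisam: Collection[str] = ()
) -> Iterator[str]:
    """Yield dump lines with MyISAM table definitions switched to InnoDB.

    Only the closing ") ENGINE=..." line of a CREATE TABLE statement is
    rewritten, so row data that happens to contain the same text is not
    modified. Tables listed in keep_myisam are passed through unchanged.
    """
    current_table = None
    for line in lines:
        match = CREATE_RE.match(line)
        if match:
            current_table = match.group(1)
        elif current_table is not None and line.startswith(")"):
            if current_table not in keep_myisam:
                line = ENGINE_RE.sub("ENGINE=InnoDB", line, count=1)
            current_table = None
        yield line
```

Because it works line by line on any iterable, it could be fed an open dump file and written straight back out, which matters for a dump in the 2.5G range mentioned above; whether the reload itself finishes in reasonable time is a separate question the sketch does not address.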