19:59:45 #startmeeting Infrastructure
19:59:45 Meeting started Thu Jul 29 19:59:45 2010 UTC. The chair is mmcgrath. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:59:45 Useful Commands: #action #agreed #halp #info #idea #link #topic.
19:59:51 #meetingname infrastructure
19:59:51 The meeting name has been set to 'infrastructure'
19:59:52 hi there
20:00:01 'ello.
20:00:03 Yo!
20:00:09 * CodeBlock
20:00:10 :-) just in the nick of time
20:00:24 #topic who's here?
20:00:50 here
20:00:51 I'm here according to what I see.
20:00:53 * CodeBlock ... again ;)
20:01:15 skvidal is blogging about bikers
20:01:16 * sijis is around
20:01:19 Ok, well let's get started.
20:01:22 #topic Meeting tickets
20:01:24 * skvidal is here
20:01:27 .tiny https://fedorahosted.org/fedora-infrastructure/query?status=new&status=assigned&status=reopened&group=milestone&keywords=~Meeting&order=priority
20:01:27 mmcgrath: http://tinyurl.com/47e37y
20:01:30 .ticket 2275
20:01:32 mmcgrath: #2275 (Upgrade Nagios) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2275
20:01:40 CodeBlock: what's the word here?
20:02:33 well, we have a noc3 running now, and have moved nagios-external to it, which is now nagios3
20:02:34 so... for now if anyone sees any problems with it, please let me know
20:02:45 CodeBlock: are we getting alerts from noc3 yet?
20:03:34 mmcgrath: I think we should be, I don't really have a way to test that - I might add a fake check just to see
20:03:38 but theoretically, we should be
20:03:49 k
20:03:58 kill a server and find out
20:04:04 CodeBlock: well, let's let it run until next week, if all's good we'll rename and have at it.
20:04:08 no wait, not a good idea
20:04:11 CodeBlock: anything else?
20:04:15 smooge: hehe
20:04:19 mmcgrath: Don't think so
20:04:27 k
20:04:29 next ticket
20:04:31 .ticket 2277
20:04:32 mmcgrath: #2277 (Figure out how to upgrade transifex on a regular schedule) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/2277
20:04:44 abadger1999: I think you created this one.
20:04:50 AFAIK we still have no one maintaining it in EPEL
20:04:53 any translators here?
20:05:13 mmcgrath: Yeah -- talked with translators at FUDCon Santiago and we thought that it might be good to get this on a regular schedule
20:05:34 That way we don't disrupt translator workflow at a critical point in the Fedora release schedule.
20:05:42 abadger1999: the funny thing here is I don't think this really has much to do with us beyond a yum update.
20:05:55 if someone from the translators team wants to build the rpm at specific times, we do monthly updates anyway
20:05:56 mmcgrath: Well.. don't we want to test the upgrades too?
20:06:01 and we could certainly do others
20:06:03 we do in staging first
20:06:11 but without any updates coming, we've got nothing to test :(
20:06:26 and without a test plan...
20:06:27 And last time... the update required we do a bit of db work before/during/after the code was updated.
20:06:33 oops.
20:07:02 Also... the way transifex works keeps changing -- so it's probably not going to see many updates in EPEL.
20:07:22 but the updates bring new features that our translators will want so we probably have to stay on the treadmill.
20:07:44 We'll make whatever they want us to run work, but someone still needs to package it
20:07:52 and I don't think that's us since it's just an upstream now.
20:08:11 we're too far removed from that group and workflow to know when they want what, what's important, etc.
20:08:21 surely some translator can take it.
it's orphaned in EPEL
20:09:00 mmcgrath: Well -- otoh they're too far away from us to know what happens when you try to upgrade from 0.6 to 12.15 in one smooth go.
20:09:00 I think because it didn't fit well with EPEL no broken updates
20:09:19 abadger1999: nothing's going to fix that though
20:09:29 we have staging, it's not hard for us to upgrade and give it a look, let them know it's ready
20:10:01 mmcgrath: Sure. But we need to coordinate schedules and such.
20:10:11 what is there to coordinate though?
20:10:34 if that team decides they want a new transifex, they can package it and make it ready. They can email us if they want to update it.
20:10:38 Remember last time? with stickster asking if we should move our translation infrastructure to indifex.org for F14?
20:10:50 I remember a package not being ready.
20:10:54 mmcgrath: Does docs package mediawiki?
20:11:12 the translations team doesn't have to package transifex if they don't want but someone does.
20:11:41 and that someone's not us. Stuff like this is a partnership. They gotta pony up and do some work too.
20:11:50 I think packaging is our job.
20:11:58 I think it's absolutely not.
20:12:11 since we don't know anything about it, nor the needs, nor what translations needs are
20:12:16 we're not upstream for transifex.
20:12:19 There's certainly a partnership here but why would it be translators' job to package something?
20:12:20 we're not the users of transifex
20:12:35 because they're the team that wants to use it.
20:12:37 There's no guarantee that they're system admins and know the first things about how to install software.
20:12:52 then it's their job to find a packager.
20:13:14 we don't run around looking for apache packagers
20:13:32 mmcgrath: If RHEL dropped it, wouldn't we?
20:13:45 maybe
20:13:51 mmcgrath: The EPEL maintainer dropped the ball with mod_wsgi so we package it in infra.
20:13:53 I'm just pissed that this work got dropped in our lap.
20:14:00 It makes me not want to accept any hosting without a written agreement.
20:14:14 mmcgrath: Didn't it get dropped in our lap when we chose to install transifex in our environment?
20:14:19 Teams cannot, ever, come to us just with uses in mind. they need to take part ownership.
20:14:25 .whoowns transifex
20:14:25 mmcgrath: ivazquez (orphan in Fedora EPEL)
20:14:31 orphan in epel.
20:14:36 you're suggesting we now do that work too
20:14:42 Which is why I support you when you say we shouldn't be accepting new apps without more manpower coming with it :-)
20:15:23 but just like with docs, the websites team, etc. They all have major ownership over those services.
20:15:26 we just host it.
20:15:53 have we even asked the translations team to find a packager?
20:16:26 mmcgrath: I doubt it. But it can't go into EPEL -- it needs to be a packager that adds it into the infrastructure repo.
20:16:35 why can't it go in epel?
20:16:45 mmcgrath: Because it changes too much.
20:16:51 isn't that for the packager to decide?
20:16:59 [13:09:00] I think because it didn't fit well with EPEL no broken updates
20:17:13 do we know an upgrade would break updates or are we assuming it?
20:17:14 mmcgrath: No. EPEL policy dictates what can go into EPEL.
20:17:30 since the current upgrade hasn't even been tested.
20:17:42 I don't think we should just forbid transifex from being in EPEL
20:17:43 I don't care where the RPM goes, but someone on that team (who knows their schedule) needs to package it.
20:17:48 mmcgrath: yes, upgrades from transifex violate the epel policies on updates quite frequently.
20:18:12 mmcgrath: Okay, so what can we give them? Sponsor them into sysadmin-web so they can use staging?
20:18:27 naw, sysadmin-test so they can use dev.
20:18:30 unless they want web.
20:18:39 and want to take ownership of hosting it as well. I'm fine with that.
20:18:45 but that's a much bigger commitment than just packaging.
20:18:51 and I don't think they'd need it.
20:18:59 sure, if they want it we can help them. But I wouldn't think that's a requirement.
20:19:10 Okay -- I just think that only having control over packaging doesn't help much -- packaging is part of deployment.
20:19:28 * mmcgrath asks ignacio why no one's maintaining it in EPEL.
20:19:43 I'm looking at the Transifex download page
20:19:55 Soon, Tx will land in a yum repo near you, and you'll be able to install it with something like yum install transifex.
20:20:01 in fairness, packaging is the only thing standing between upstream and us at the moment. It can be part of anything in that process, at the moment it's the only thing missing.
20:20:42 ignacio is going to ask diegobz if he'll continue to maintain it.
20:20:52 it's quite possible we've made a problem where there isn't one, just miscommunication.
20:21:24 abadger1999: I have no problem with someone from this team who wants to do the packaging for it. but I'm just doubtful anyone will step up since that's not what we do.
20:21:28 upstream clearly is pointing their potential users to the yum repositories
20:22:09 and I don't think it should be required of us.
20:22:17 anywho, anyone have any additional questions on that?
20:23:01 I don't think we have to do it, but if others are unable to find packagers, we should help make a few packagers out of some translators
20:23:03 mmcgrath: I would tend to disagree. I think that packaging is absolutely something that we do... but we can talk about that in some other, bigger arena.
20:23:12 abadger1999: k
20:24:06 #topic updates
20:24:15 smooge: how'd this go, what went wrong, what needs to be done still and how can we avoid it next time?
20:24:27 seems like the last 4 months of updates something has gone not right or it's taken longer than the outage window, etc.
20:24:31 mmcgrath: abadger1999: Tuning in just now to this conversation -- As a data point in your conversation: http://lists.fedoraproject.org/pipermail/trans/2010-July/007819.html
20:24:43 ok we are working on 144 servers and 96 are still needing to be rebooted/final updates
20:25:19 so two outage windows and less than half of the servers actually got updated and rebooted?
20:25:22 what happened?
20:25:54 ok a couple of issues. We wanted to use func-yum and I found some bugs for skvidal
20:26:09 Second we had a conflict between git branching and outage
20:26:50 so nothing inside PHX2 was going to be rebooted because of that. Last week I ran into an issue with TG2 and moksha that took my time downgrading
20:27:39 the func-yum issue was not with func-yum but with func and the groups setup
20:27:39 yeah that was a bummer, we had dmalcolm's and Oxf13's outage scheduled over the top of ours
20:27:46 I fixed it late yesterday
20:27:49 sorry for the hassle
20:28:26 skvidal, it was a small thing actually once it worked it was pretty quick
20:28:28 it's fine, I'm just trying to figure out what's going on because I know we've not had a clean upgrade process in some months
20:28:29 I would think we would be able to avoid scheduling outages over each other
20:28:48 then I ran into a couple of issues where a box that had been set up to xen shutdown decided to do a xen save
20:28:49 onekopaka_laptop: well dmalcolm's was just a fat finger in the scheduling
20:28:55 this meant I had to go and reboot again and such
20:29:35 mmcgrath: and Oxf13's outage?
20:29:47 massive.
20:29:50 mmcgrath: what part of the update process has been unclean? I'm not being defensive - I want to make sure i've gotten all the shit fixed :)
20:30:13 Oxf13: was the problem that your outage was larger than you thought?
20:30:18 onekopaka_laptop: you'd have to ask smooge there, I'm not sure he even scheduled a second outage
20:30:21 no, in fact it may be shorter.
20:30:23 skvidal: it's just not worked right.
20:30:27 most boxes were updated last week. It's just about 8 updates and then trying to figure out why some boxes in the cnode and virtweb do not show up in puppet/func
20:30:33 skvidal: we've always ended up going way past the outage window or not getting boxes updated or rebooted.
20:30:34 then I need to get vpn working to a couple others
20:30:35 onekopaka_laptop: but I was late in requesting the outage
20:30:45 mmcgrath, I scheduled a second outage over the weekend
20:31:01 mmcgrath: ah - gotcha.
20:31:22 smooge: which weekend, this last one or the next one?
20:31:27 * mmcgrath might have just missed it
20:31:38 last weekend
20:31:58 July 23rd
20:31:59 I got one of three dates wrong STILL
20:32:04 I saw an email
20:32:14 so that didn't overlap with jesse's update right?
20:32:20 err jesse's outage
20:32:22 or rather I went back now and saw
20:32:35 smooge: so looking forward, what needs to happen in the future?
20:33:56 ok we need to better advertise and work with our customers/partners on scheduling down time
20:34:29 Wednesday had been picked because Mon/Fri are usually bad and Thursday is meeting day for many of us
20:34:39 Tuesday was causing issues too
20:34:39 mmcgrath: from what I see, Jesse's update, quoted at 48 hours, would have been happening at the same time as updates
20:34:43 I was also bad about communicating how much outage I would need
20:35:15 smooge: how long would a total upgrade and reboot take?
20:35:49 well it takes me about 30 minutes per xen server and clients to update and then a reboot is usually adding in 10-15 minutes
20:35:58 not counting ping/irqs
20:36:26 some of that can be parceled out to more people but other parts can't
20:36:27 would having additional hands help?
20:36:39 we can't reboot all the app or proxy servers at the same time
20:36:57 and then you have bapp01/db02 which have been now missed through 2-3 reboot cycles
20:36:59 smooge: I agree, that'd cause massive amounts of panic
20:36:59 30 minutes per xen server?
20:37:04 that seems way too high.
20:37:23 smooge: so what does that mean in total outage window?
20:37:28 mmcgrath: that's including the guests
20:37:35 yeah that still seems way too high.
20:37:43 I'd think we could do all the external machines at once.
20:37:51 or pretty close to that.
20:37:59 I don't see any reason to do updates in serial.
20:38:02 mmcgrath, the func-yum speeds it up some but then there is the "oh why is postgres83 yelling that 84 showed up." or other things that require a little bit of hand holding
20:38:15 smooge: okay I have a couple of ideas here
20:38:51 1. we should be able to use func-list-vms-per-host to know which hosts are where
20:38:52 mmcgrath, I did that once :). I remember being advised to be less adventurous :)
20:39:13 2. then dump the hosts back out to func-yum for doing updates on a set of items under a vm at a time - in parallel
20:39:22 smooge: I think doing that should be a goal of ours. let's fix things that are broken (like the postgres clusterfuck)
20:39:29 though I think that only impacted 2 publictest servers
20:39:32 I could be wrong.
20:39:59 mostly pt. I ran into something on app01.stg and bapp1 I think similar
20:40:02 no that's nagios
20:40:04 I can add a test-run func-yum run
20:40:15 skvidal, a test run would be nice
20:40:46 at 30 minutes per xen server we're looking at something like 12-15 hours of outage per month for upgrades for less than 200 hosts and that doesn't seem reasonable to me.
20:40:46 I end up thinking I remember all the crap hosts and then end up with "oh wait that needs something"
20:41:38 mmcgrath, I agree. I need to come up with something better
20:42:20 ideally it should be func-yum upgrade and find out what needs to be restarted but they keep updating the kernel and glibc
20:42:22 would upgrading staging first give some indication on what needs to be done to the prod boxes?
20:42:30 sijis: only sometimes
20:42:38 speaking of upgrading stuff, mmcgrath any idea when the fas update is going to go live?
20:42:46 smooge: restarted I can make happen using needs-restarting
20:42:54 smooge: just gotta deploy it
20:43:05 CodeBlock: depends on when the sqlalchemy bits are fixed, abadger1999's got them on his todo but he's a busy dood. we need like 8 more toshios
20:43:08 skvidal, I 'deployed' a version in my home directory on a bunch of boxes.
20:43:24 mmcgrath: hehe, alright
20:43:43 smooge: I have a couple of other items like
20:43:48 'is running kernel latest'
20:43:52 mmcgrath, my 30 minutes is also me being a bit more cautious than I probably should be
20:44:15 mmcgrath, smooge: let me ask a question
20:44:17 I just keep knocking off someone important when I start doing it faster
20:44:20 shoot
20:44:30 why do we target updates as monthly?
20:44:44 because it has to be some interval and that seemed reasonable.
20:44:46 why not weekly - to keep the sheer overwhelmingness of the change to a smaller amount
20:44:48 automatic updates == teh fail.
20:44:54 I wasn't suggesting automatic
20:44:59 I understand wanting to watch them
20:45:07 I think the number of packages isn't the problem. it's getting the updates done.
20:45:51 skvidal, I tried doing that last year I think. I ended up with issues that func-yum should fix now (not getting all app servers in sync etc).
20:46:29 the big one comes out that kernel updates end up being the big time killer
20:46:37 b/c of the reboots?
20:46:42 yeah.
20:47:06 okay - we can also do update targets
20:47:22 take it out of DNS; shutdown all the domU's; reboot the dom0; make sure the domU's come up; put it back in DNS
20:47:47 'take it out of dns'
20:47:51 take _what_ out of dns?
20:47:56 skvidal: 'dig fedoraproject.org'
20:48:01 we have had an issue where domU's sometimes come up perfectly and other times the dom0 says "Oh I know I just started but you can't have 2 GB of ram"
20:48:09 wildcard and @
20:48:13 oh
20:48:15 you mean the app server
20:48:17 I'm sorry
20:48:24 most of the outside servers have 1 proxy on them somewhere
20:48:25 well the proxy servers
20:48:34 i thought you were talking about the xen server
20:48:36 I do wish we had automated proxy recovery
20:48:41 mmcgrath, CodeBlock: btw, what did you think about just having fas conflict with the earlier sqlalchemy?
20:48:42 then check to make sure that haproxy says the app box came up ok
20:48:50 skvidal, sorry wasn't clear
20:48:54 mmcgrath: define "automated proxy recovery"
20:49:01 Since fas is on its own servers, it should work.
20:49:08 But it's icky icky.
20:49:11 abadger1999: I'm fine with that if we won't run into any issues there.
20:49:27 abadger1999: we don't have any need for fas0X to actually have the older alchemy right?
20:49:33 python-fedora doesn't need it for fasClients or anything?
20:49:47 agreed on the automated proxy recovery - otherwise all the dns dreck would be simpler
20:49:50 Try it out. I think that just having the sqlalchemy that fas needs installed will work. python-fedora doesn't need it.
20:49:51 onekopaka_laptop: right now if app1 goes down, no one notices because the load balancers take it out. When it's back they add it back in. We don't have to do anything
20:50:03 onekopaka_laptop: with the proxy servers it's a lot more complicated since it's dns balance based.
20:50:09 mmcgrath: okay.
20:50:12 hmm... although the python-fedora package seems to require sqlalchemy.
20:50:24 * abadger1999 will figure out why that is.
20:50:28 abadger1999: k, I didn't realize it would be that easy.
so the upgrade process would just be to remove the old version and upgrade
20:50:35 abadger1999: oh yeah that is a little weird :)
20:50:37 mmcgrath: so treating DNS round-robin same as haproxy
20:50:52 onekopaka_laptop: ehh, sort of. we also have geodns in there but that's something else.
20:51:00 mmcgrath: yeah.
20:51:02 the big thing is when a proxy server goes down, dns keeps sending people there.
20:51:03 ?
20:51:13 mmcgrath: which is very bad.
20:51:18 yeppers
20:51:41 holy moly the meeting has 10 minutes left :)
20:51:47 anyone mind us moving on for now?
20:51:59 mmcgrath, smooge: I'd like to talk more - later about updates
20:52:00 I'm pretty sure nobody wants to rewrite BIND to have that functionality too
20:52:04 but I'm fine moving on for now
20:52:07 skvidal: sure, I'm around :)
20:52:18 onekopaka_laptop: I bet there are options I just don't know of any
20:52:19 ok
20:52:22 #topic Open Floor
20:52:23 mmcgrath: okay
20:52:25 anyone have anything to discuss?
20:52:36 blogs.fp.o, I guess
20:52:39 I'm back
20:52:52 and I'm actually going to get the documentation done
20:52:54 ok
20:52:58 Just a heads up, I'm working on a personal repo setup thing on people - http://repos.fedorapeople.org/
20:53:00 I have all these nice screenshots
20:53:03 onekopaka_laptop: yeah we need SOP and stuff.
20:53:04 perfect
20:53:21 so people will be able to easily make their own blogs!
20:53:21 mmcgrath: different than throwing repos in ~/public_html ?
20:53:22 I am building a fakefas for the Insight project. Then I need to work on how to deal with updates better
20:53:46 mdomsch: a little yeah, take a look - https://fedoraproject.org/wiki/User:Mmcgrath/Fedorapeople_repos
20:54:02 mdomsch: if you have anything you can host or you'd like to please run through the steps. I'd like to announce it soon but it could use testing :)
20:54:18 onekopaka_laptop: i did notice a bug.. when you get an error in control panel when you first login.
i think that should only be displayed after you login.
20:54:53 sijis: do you have anything like a screenshot?
20:55:04 sijis: because I know there's one ugly ugly ugly bug
20:55:05 you can reproduce it everytime
20:55:20 as long as you aren't logged in
20:55:25 where if you go to log into the admin
20:55:36 it throws you to a non-existent page
20:55:45 we could be talking about the same one
20:55:51 its a redirect problem, ibeliee
20:55:55 *believe
20:56:03 yeah
20:56:03
20:56:13 k, we've got 5 minutes left. Anyone have anything else they'd like to discuss?
20:56:22 I did forget to mention varnish is all in place now for smolt and the wiki
20:56:23 I think it's because WordPress is all unaware of the SSL fun
20:56:25 it's been working fine
20:56:27 anyone seen any issues?
20:56:34 mmcgrath: nope.
20:56:39 however
20:56:51 addendum to my email about assets
20:57:06 last I checked, /static is served by Apache on the proxy
20:57:12 not me
20:57:19 and it never goes past that
20:57:33 onekopaka_laptop: correct, that actually never gets as far as the varnish server.
20:57:35 so we have nothing to worry about there.
20:57:40 it gets served directly from the proxy servers
20:58:07 * gholms invites everyone interested to the Cloud SIG meeting right after this
20:58:12 so just so nobody gets hung up on that typo
20:58:17 not typo
20:58:19 topic*
20:58:30 Nice
20:58:35 * mdomsch needs to roll a new MM release out
20:58:45 coolz
20:58:49 not critical, but geppetto wanted it
20:58:55
20:58:55 mdomsch: is it feature packed?
20:59:03 anyone have anything else to discuss? If not we'll close the meeting in 30
20:59:15 mdomsch: showing up Apple with over 9000 new features? ;-)
20:59:19 it's mostly bugfixes, but one feature (marking private=True on private mirrors in metalinks)
20:59:34 which rawhide yum uses to let people use only private mirrors if that's their policy
20:59:41 Ok I'm going to close so the cloud guys can get going :)
20:59:46 #endmeeting
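
Editor's note on the "updates" topic: the discussion weighs updating Xen hosts serially (about 30 minutes per host plus a 10-15 minute reboot) against doing them in parallel, with the constraint that all app or proxy servers must not be rebooted at the same time. The following is a minimal sketch of that trade-off, not the team's tooling. It does not call func or func-yum; the host names, the role inventory, and the 45-minute round length are hypothetical values chosen for illustration.

```python
# Hypothetical inventory: Xen host -> roles of the guests it carries.
# These names are illustrative and do not come from the meeting log.
HOSTS = {
    "xen01": {"proxy", "app"},
    "xen02": {"proxy", "db"},
    "xen03": {"app"},
    "xen04": {"build"},
    "xen05": {"app", "build"},
}

# Roles that should never be down on two hosts in the same round
# (assumption drawn from "we can't reboot all the app or proxy
# servers at the same time").
SPREAD_ROLES = {"proxy", "app"}


def plan_batches(hosts, spread_roles):
    """Greedily group hosts into rounds that can be updated in parallel.

    A host joins the first round in which none of its spread roles is
    already taken; otherwise it opens a new round.
    """
    batches = []  # list of (host names, spread roles already in the round)
    for name in sorted(hosts):
        critical = hosts[name] & spread_roles
        for members, used in batches:
            if not (critical & used):
                members.append(name)
                used |= critical  # in-place update of the round's role set
                break
        else:
            batches.append(([name], set(critical)))
    return [members for members, _ in batches]


def estimate_window_minutes(rounds, minutes_per_round=45):
    """Rough outage estimate: 30 min update + up to 15 min reboot per round."""
    return rounds * minutes_per_round


if __name__ == "__main__":
    batches = plan_batches(HOSTS, SPREAD_ROLES)
    print("rounds:", batches)
    print("serial minutes:", estimate_window_minutes(len(HOSTS)))
    print("batched minutes:", estimate_window_minutes(len(batches)))
```

With this made-up inventory the five hosts fit into three rounds, so the estimate drops from 225 to 135 minutes. A real plan would also need the per-host hand-holding mentioned in the log (for example the postgres package conflict), which this sketch ignores.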