20:00:47 #startmeeting Infrastructure
20:00:47 Meeting started Thu Feb 11 20:00:47 2010 UTC. The chair is mmcgrath. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:00:48 oh so much
20:00:48 Useful Commands: #action #agreed #halp #info #idea #link #topic.
20:01:20 #topic who's here?
20:01:23 * mmcgrath is
20:01:31 * lmacken
20:01:33 * a-k is
20:01:34 * heffer is too, but just by chance
20:01:35 * sijis
20:02:07 * hiemanshu
20:02:19 * ricky
20:02:34 * skvidal is
20:02:47 I've got 3 main things I want to talk about. The first two should be short; the third one is about updates and will likely be longer
20:02:51 So I'll just get started
20:02:56 actually 4 things, 3 are short
20:03:01 #topic VPN issues
20:03:19 We've been seeing strange vpn issues. we saw a cluster of like 5 outages over the span of an hour this morning.
20:03:33 I poked around a bit, did a couple of restarts and have generally been keeping an eye on things.
20:03:41 I thought they were fixed except that we had another one about 5 minutes ago.
20:04:06 There's lots of things this could be, but the biggest vpn change we've made was yesterday we were running on bastion2, which was xen. Now we're running on bastion1 which is kvm.
20:04:28 I can't say for sure that's what is going on, but we've seen performance issues before with misconfigured vms
20:04:35 anyone have any questions or concerns on that?
20:04:45 could it be network itself?
20:04:50 sijis: it could be
20:05:00 the outages are short lived and unpredictable
20:05:10 so it's been difficult to troubleshoot
20:05:24 Ok, next topic
20:05:28 #topic Equallogic
20:05:45 It's in, it's powered up and Dgilmore has even logged into it so he can be imprinted as its father.
20:05:54 :-)
20:05:56 but we don't think the network ports are actually configured.
20:06:04 so, like I said, short topic on that.
20:06:10 we'll keep working on it and see how it goes.
20:06:19 * dgilmore is here
20:06:22 any questions or comments on that?
20:06:39 please give me multiple gig ports
20:06:43 pretty please
20:06:49 dgilmore: well, you should have 8 of them there.
20:06:57 and we can do whatever bonding we desire.
20:07:08 WANT
20:07:16 Ok, nothing else on that?
20:07:22 nothing
20:08:06 buhhh
20:08:13 I forgot what the third thing was so we'll go right on to the 4th
20:08:15 #topic Updates
20:08:28 So we did a group of updates yesterday and, needless to say, things didn't go well.
20:08:36 There's a number of complicated issues here.
20:08:52 1) We have latest versions of things in our repos that aren't to be updated
20:08:58 2) actually getting a list of things that are to be updated
20:09:02 3) actually doing the updates.
20:09:17 okay
20:09:21 can I jump in here?
20:09:30 Unfortunately system updates scale horribly. Restarting httpd on one server isn't that different from restarting it on 100 servers. But doing updates and restarts... completely different story.
20:09:41 skvidal: absolutely, have at it
20:09:43 okay
20:09:51 so something we originally wrote func for was this case
20:09:58 being able to get a lot of info and act on it
20:10:02 but we never implemented this
20:10:08 b/c we got off on other things
20:10:18 so I decided to work on it this week and I have a really simple script
20:10:27 skvidal: you're talking specifically about 3) or also 2?
20:10:35 2 and 2
20:10:36 err
20:10:37 2 and 3
20:10:40
20:10:42 so here's the gist
20:10:43 get all updates via yumcmd.check_update via func
20:10:43 • store timestamp of check and list of updates in a dir/db with name of host
20:10:43 • store complete list of installed pkgs for each host
20:10:43 • cmd should
20:10:43 ∘ list hosts needing updates
20:10:44 ∘ list hosts needing a certain pkg updated
20:10:46 • apply updates - glob or all
20:10:50 ∘ report results of this
20:11:14 right now I'm storing things really simply so we can search it trivially
20:11:34 what's with the unicode bullets?
20:11:37 /some/path/$hostname/[installed|updates|updated-$TIMESTAMP|orphans]
20:11:43 Oxf13: from my gnote notes - sorry
20:11:46 s'ok
20:11:58 Oxf13: I use it to brainstorm then paste it in places
20:12:05 skvidal: ditoo
20:12:11 -o+t
20:12:35 the idea would be to have the script run using func, async, at regular intervals (maybe only once a day is enough)
20:12:38 skvidal: so let's flash forward to where all this work is done and is in place. What would we do come update day?
20:12:40 to know what's on the boxes and their status
20:12:59 func-yum -h hostname --pkg pkgname --update
20:13:00 or
20:13:03 func-yum --update
20:13:08 which hits all the hosts
20:13:18 or func-yum -h hostglob --pkg pkgglob --update
20:13:43 will we get any output or feedback from that?
20:13:47 then the results of those runs will be stored in /some/path/$hostname/updated-YYYY-MM-DD-HH:MM:SS
20:14:07 mmcgrath: so you can see what the results are explicitly
20:14:12 w/o having to chase all over the place
20:14:26 does that make sense?
20:14:39 yeah. I like that, pssh does something similar for ssh commands.
20:14:54 so I've got the storing info
20:14:57 and updates part working
20:15:04 I need to update func and certmaster for our hosts
20:15:07 b/c we're running an old one
20:15:12 which doesn't support the --timeout option :)
20:15:16 which is important here
20:15:21 and then one more thing I'm working on is
20:15:25 func-yum --status
20:15:32 which spits out the status of the hosts as it last knew it
20:15:40 so things like:
20:15:46 Last Checked: timestamp
20:15:52 Last Updated: timestamp
20:16:00 updates available: #of pkgs
20:16:05 installed pkgs: #of pkgs
20:16:09 orphans: #of pkgs
20:16:18 which seems like a reasonable set of things to list out
20:16:27 skvidal: do you need any help with that?
20:16:34 sure - it's just a single script
20:16:39 smooge: you around? we haven't heard from you yet? :)
20:16:42 I'm hoping to post a draft of it this afternoon
20:16:47 skvidal: excellent.
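For readers following the func-yum discussion, here is a minimal sketch of how a status summary could be derived from the per-host directory layout described above. It is not skvidal's script. The function names, the one-package-per-line file format, and the use of the `updates` file's mtime as the "Last Checked" time are all assumptions made for illustration; only the directory layout and the list of status fields come from the log.

```python
import os
import time


def count_lines(path):
    """Return the number of non-empty lines in path, or 0 if it is missing."""
    if not os.path.isfile(path):
        return 0
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def host_status(root, hostname):
    """Summarise <root>/<hostname>/; one package per line is an assumption."""
    host_dir = os.path.join(root, hostname)
    updates = os.path.join(host_dir, "updates")
    # Assumption: the mtime of the "updates" file marks the last check.
    last_checked = os.path.getmtime(updates) if os.path.isfile(updates) else None
    # Results are named updated-YYYY-MM-DD-HH:MM:SS, so the lexicographically
    # largest name is also the most recent run.
    runs = sorted(n for n in os.listdir(host_dir) if n.startswith("updated-"))
    return {
        "last_checked": last_checked,
        "last_updated": runs[-1][len("updated-"):] if runs else None,
        "updates": count_lines(updates),
        "installed": count_lines(os.path.join(host_dir, "installed")),
        "orphans": count_lines(os.path.join(host_dir, "orphans")),
    }


def format_status(hostname, status):
    """Render the fields listed in the meeting as plain text."""
    checked = status["last_checked"]
    if checked is None:
        checked_text = "never"
    else:
        checked_text = time.strftime("%Y-%m-%d-%H:%M:%S", time.gmtime(checked))
    return "\n".join([
        hostname,
        "  Last Checked: %s" % checked_text,
        "  Last Updated: %s" % (status["last_updated"] or "never"),
        "  updates available: %d" % status["updates"],
        "  installed pkgs: %d" % status["installed"],
        "  orphans: %d" % status["orphans"],
    ])
```

Calling `format_status(name, host_status(root, name))` for each directory under the storage root would give a report of roughly the shape described for `func-yum --status`. Keeping everything in plain files means the same data stays searchable with ordinary tools such as grep.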
20:16:47 yes
20:16:48 sorry
20:16:54 one place where I do need help
20:16:55 I have this meeting an hour from now
20:16:58 smooge: :)
20:17:00 changing
20:17:10 is the error reporting/catching
20:17:18 there are lots of things that get in the way here
20:17:27 skvidal: yeah, and we've had some bad luck with conflicts in the past.
20:17:27 and I want to make sure I catch and report all the errors sanely
20:17:33 mmcgrath: mmm conflicts
20:17:43 mmcgrath: so, something we should consider doing
20:17:47 even though it is a pain in the arse
20:17:55 is running yum transactions for updates with tsflags=test
20:18:03 which does EVERYTHING but nothing actually gets written out
20:18:07 ok catching up.. the big issue that I had was that about 1/3 of systems required manual flag changes to yum to work
20:18:09 and no scriptlets are actually run
20:18:17 smooge: manual flag changes like what?
20:18:27 --exclude --disablerepo
20:18:46 hmm, disablerepo?
20:18:54 I sort of get 'exclude'
20:18:57 skvidal: would that do a full download of the package? because I was thinking about doing that as part of a pre-update thing so we don't pound puppet1 with updates and so when the actual time comes it takes less time.
20:19:18 if what you want does download the package, we could kill two birds with one stone.
20:19:21 mmcgrath: yes - it does everything including run the transaction but it runs it in rpm's test mode which does nothing
20:19:46 skvidal, there are a couple of boxes that have outside repositories and updates will come up squirrely unless I turn off the repos.
Thankfully disable repo only occurs on .stg and publictest boxes normally
20:19:49 mmcgrath: for a good time set tsflags=test in yum.conf under [main] and forget about it
20:19:57 mmcgrath: it's great fun trying to figure out why you ALWAYS have new updates
20:20:03 heheheh
20:20:18 smooge: if we know the set of updates we mandate we could only explicitly enable those
20:20:19 smooge: so what were some of the biggest issues you ran into with this last round of updates?
20:20:39 ok slowness of updates.
20:20:51 smooge: taking too long to download or too long to install?
20:21:02 (or too long between update sessions)
20:21:09 smooge: the actual 'yum -y update' part?
20:21:09 1) slowness of updates. some boxes sit for 2-3 minutes on installation of rpm glibc and such..
20:21:12 * nirik notes doing them more regularly would help with that.
20:21:29 smooge: yah - that's rpm fingerprinting - and there's nothing we can do until rhel6
20:21:31 2) slowness of updates. slow network to outside. ibiblio was slower than telia1
20:21:33 nirik: so would downloading the packages earlier. We already do them monthly.
20:21:45 Do we ever not want an update available from the RHEL updates?
20:22:01 3) errors in updates. various packages would spew scriptlet %post errors I wanted to make sure they were ok
20:22:02 well, that would help with the download part, but not the applying part.
20:22:04 If not, could that just be automated so we just need to think about rebooting?
20:22:22 4) conflicting packages.
20:22:39 5) systems not coming back due to rawhide+xen
20:22:54 yeah rawhide + xen is an absolute bitch
20:23:02 I wonder if we moved our rawhide boxes to KVM if we'd have a better go at them.
20:23:04 6) updating 8 boxes at once on a xen box cause slowness.
20:23:36 nirik: how often do you think is good to do updates?
20:23:47 * mmcgrath thinks this is a good discussion to have
20:24:03 we currently do them monthly?
20:24:10 nirik the locality of a 'proxy' for the remote boxes would make some of the delays easier to know. I can deal with 10 minute wait on install.. but watching a package stop downloading for that long gets me wondering
20:24:12 sijis: yeah, unless there's security updates.
20:24:17 mmcgrath: we'd have a much better go with rawhide on kvm
20:24:24 well, for our customers we do them daily if they are not requiring a reboot. ;) If they are, we schedule a day and/or time to do them and do reboots.
20:24:30 mmcgrath: but any rawhide host has an inherent risk of not coming back after a change
20:24:43 most rhel updates are security updates.
20:24:48 nirik: how are you doing them?
20:25:33 nirik: sadly, there has been more and more of non-security updates in the EL channels as of late
20:25:44 Oxf13: and we're still averaging 1 kernel update / month.
20:25:48 which has also been a PITA.
20:25:59 We may want to be more careful about the kernel updates and determine if we really need to reboot.
20:26:08 I typically use 'mussh'... run a check-update over a group (different host lists/groups) and make sure they are all things we know what they are, then use mussh with 'yum -y' and apply them. Then go back and restart anything that needs restarting.
20:26:21 yeah, kernel updates have gone way up in frequency it seems like. ;(
20:26:30 I don't know wtf that's about but it is very annoying
20:26:39 smooge: ok, so back to the issues you saw
20:26:51 those are all generally things I see when I do updates
20:27:03 and I think with some work much of it can be automated.
20:27:21 some of the kernel updates however we have applied and not rebooted for.
20:27:29 and while it can be paralleled I didn't get to the part where I wasn't dealing with potential races til way after the window for updates should have finished
20:27:31 nirik: yeah
20:28:00 so we are about 1/2 updated
20:28:25 we still have most remote locations to do
20:28:29 okay so test transacting would help find systems which are more likely to die
20:28:29 skvidal: just curious, how long do you think it'll be before you're ready to actually test?
20:28:42 because it sounds like smooge still has some to do, but we freeze next week for the alpha.
20:29:00 I need func updated on some boxes - so I could test on the ones I update
20:29:04 mmcgrath, I am wanting to postmortem yesterday since I felt I was just shit-canning our infrastructure
20:29:08 I was going to start by testing people1
20:29:18 I haven't updated that box at all
20:29:25 skvidal, so it should be good for a test
20:29:50 smooge: naw, you did fine, the only bad ones were that xen4-mgmt's RSA-II decided to stop working (which made the shutdown -h a problem)
20:29:58 the next issue I ran into was that things like transifex should not have been updated ..
20:30:04 and the other one was just waiting for db3 to come back online, lvm + large shares is annoying.
20:30:12 smooge: yea, and that's the last thing I want to talk about
20:30:16 I think xen4 is having real issues
20:30:16 brb
20:30:18 Basically we need to have a test repo
20:30:21 and not enable it anywhere.
20:30:42 ricky: you're working on transifex now right?
20:30:44 is epel-testing enabled everywhere?
20:30:51 nirik yes
20:30:54 nirik: at the moment it is and we have very few problems with it
20:30:55 Yeah, I wasn't aware there was a new package in EPEL
20:31:01 smooge: what's the puppet epel-test thing you ran into?
20:31:08 puppet is the usual one
20:31:16 ricky: oh the new transifex is in epel?
20:31:20 Did you guys get issues with puppet?
I've been testing the latest version without any pain
20:31:26 yeah, just another source of package updates... if you could reduce the need for that it would help make updates easier.
20:31:34 ricky: I didn't think so but I've heard people complaining about it so I must have missed it.
20:31:34 a couple of php packages on some box a while back.
20:31:36 Er, I'm not sure, maybe it came from the infra repo
20:31:43 smooge: did we have a puppet update go bad recently?
20:31:48 and one time a bad scriptlet that left me two packages on the box
20:31:48 Always make sure to update the puppetmaster first on puppet updates
20:31:51 ricky: can you check real quick?
20:31:55 mmcgrath, 3x last month
20:32:06 It's from infra, my mistake
20:32:09 smooge: we had 3 puppet updates? or we had 3 of them go bad?
20:32:11 what happened?
20:32:20 Maybe we need an infrastructure-test for this special staging stuff :-)
20:32:24 mmcgrath, I did the updates in sections last month
20:32:26 ricky: yeah that's what I'm proposing
20:32:34 Otherwise, if we decide to rebuild app1, we need to special case a bunch of stuff
20:32:36 smooge: but what happened?
20:32:45 **appX
20:32:48 mmcgrath, so there were 2-3 pushes of puppet packages and each time I seemed to get some boxes updated to the new stuff
20:32:59 which broke puppet1 so I had to then update it and the boxes I had done before
20:33:06 what broke though?
20:33:11 like what were the errors?
20:33:16 puppet couldn't talk to them.
20:33:32 I didn't find the error.. ricky let me know 2-3 days after I had done the updates when he caught it
20:33:34 the new versions of puppet couldn't talk to the old puppetmaster or the other way around?
20:33:41 ricky: do you remember what happened there?
20:33:44 The server is generally backwards compatible
20:33:58 I think it was the clients weren't getting updates
20:33:59 So if you accidentally update a client, update the server and check if stuff works - no need to rush on updating clients
20:34:25 so various boxes were in lala land for a couple of days.
20:34:26 I don't remember what happened :-/
20:34:35 yeah
20:34:39 The only thing that should cause pain is a client update without the corresponding server one though
20:34:43 So it must have been that if anything, I guess.
20:34:44 ricky: are you still getting errors sent to you?
20:34:44 but I am trying to piece from xchatlogs
20:35:02 I'm still getting a ton of errors, but most are an unrelated SELinux thing (and lack of mount ACLs in staging)
20:35:15 I think we can reenable puppet email to everybody once that SELinux thing gets fixed
20:35:37 ricky: k
20:36:15 Ok, so I'll create a new testing repo, put it on all the servers but make it so you have to explicitly enable it to use it.
20:36:46 mmcgrath, I am working on a short blurb for what I have done in the past and what we could see if it works for us
20:36:57 it's longer than IRC level so will send to infrastructure list later today
20:37:13 smooge: k, is it vastly different from what we've generally agreed upon here?
20:37:17 ricky can I get them right now even with the selinux stuff
20:37:27 OH! that reminds me, another thing we didn't do this time around...
20:37:27 So any thoughts about automating updates that come from RHEL as opposed to EPEL/Infra repo?
20:37:29 I am not sure.. it could be :)
20:37:37 we didn't update in staging first.
20:37:49 smooge: Really? As in emails in the form of "Puppet Report for XXX" ?
20:37:50 or if we did staging didn't function well for us.
20:37:55 ricky please
20:38:00 ricky best way for me to learn
20:38:13 mmcgrath, the issues I had with updating staging was a couple
20:38:16 Oh, sorry - I thought you said you were getting them, not that you wanted to get them
20:38:19 Sure thing
20:38:22 1) stuff wasn't exactly the same as in production
20:38:25 ricky: share the pain :)
20:38:47 2) boxes are spread out over many xen servers which needed to be rebooted due to xen changes
20:38:56 3) which affected boxes that weren't staging
20:39:23 I'm more specifically wondering how we missed the transifex and fedoracommunity updates, because neither of those rpms are capable of working in our environment at the moment.
20:39:39 I mean, once we have the testing repo in place, that might be fixed, but it'd still be good to have a way to catch it
20:40:10 I can go check the logs, but I do not think they had been updated on those boxes til I got to them
20:40:26 smooge: that's what I mean, once you updated them did you check to see they were still working?
20:40:30 so yes they had not been properly tested
20:40:37 I get it slowly
20:41:06 smooge: one thing I had started working on but need to get back to is this:
20:41:09 http://git.fedorahosted.org/git/fedora-infrastructure.git/?p=fedora-infrastructure.git;a=tree;f=scripts/site-tests;h=148a785193f868a280d27b61adea7af2bcb61c85;hb=HEAD
20:41:12 .tiny http://git.fedorahosted.org/git/fedora-infrastructure.git/?p=fedora-infrastructure.git;a=tree;f=scripts/site-tests;h=148a785193f868a280d27b61adea7af2bcb61c85;hb=HEAD
20:41:14 mmcgrath, no I had not.. to be honest I didn't grok that it was breaking things.
20:41:14 mmcgrath: http://tinyurl.com/yeocsvz
20:41:16 sorry
20:41:23 ah yeah.
20:41:39 one thing I usually try to do is update staging first and make sure they're all still working before moving on
20:41:43 that's a good step to add to our SOP
20:42:00 is stg done a day or so prior to prod?
20:42:00 smooge: but that link has some scripts I was working on to basically go out and hit our environment, doing tests for 200's, things like that.
20:42:01 I thought changes to transifex would have been tested before I got to them... I am quite guilty of Somebody Else's Problem field
20:42:13 sijis, it will be
20:42:38 ah ok. good
20:42:45 smooge: well, there's multiple types of tests involved, but it's always up to us to verify things are working when we're the ones making the change.
20:42:54 sijis, I will add that to my self-flagellation email I am writing
20:43:34 mmcgrath, yes. I agree I got caught up in trying to get everything done by window and didn't do my job properly.
20:43:54 We don't exactly make it easy :)
20:44:05 admitting you screwed up is the first step in screw-a-holics anonymous
20:44:06 hopefully after skvidal's work is done updates won't be such a big deal.
20:44:17 we'll see
20:45:36 smooge: but yeah, take a look at those fedora-infrastructure.git/scripts/site-tests/ scripts
20:45:39 they're nifty :)
20:45:50 ok, anyone have anything else on this topic before we move on?
20:46:06 another repo I need to check out. is that ok for my office box or should it stay inside the colo?
20:46:18 I am done
20:46:37 It's public
20:46:45 smooge: that one's ok to do whatever with, it's on fedorahosted.org
20:46:47 (As in, git://git.fedorahosted.org/git/fedora-infrastructure.git)
20:47:08 ok
20:47:11 smooge, mmcgrath: Staging is a hybrid environment though.... I think fedoracommunity and transifex are both updated beyond production in staging.
20:47:29 they're both in some weird state for sure.
20:48:05 abadger1999, my writeup covers a possible fix. BY ADDING MORE BUREAUCRACY. No not really.. wanted to see if skvidal was awake yet
20:48:12 Ok, well let's all think on this some more and re-group next week.
20:48:17 smooge: thanks, you're a prince
20:48:25 #topic search engine
20:48:29 a-k: any update on the search engine?
20:48:33 The new repo will go a long ways.
20:48:40 * mmcgrath is trying to speed things up since we've only got 10 minutes or so left
20:48:52 skvidal, you are welcome. I see you get enough ribbing as it is so I owe you a lunch at a cafe next time I am in NC
20:48:52 Really fast update... No progress to report this week
20:49:06 a-k: no worries
20:49:17 #topic Freeze
20:49:24 * ricky shivers
20:49:27 Just a reminder, we freeze for two weeks starting next tuesday
20:49:43 smooge: remember, I'm one of your followers :)
20:49:52 ricky: funny (not) :)
20:49:55 YOU ARE AN INDIVIDUAL
20:49:59 skvidal: One thing I'm anticipating -- new pkgdb won't go into production in time for this freeze. There's just too many outstanding issues.
20:50:01 ok freeze tag
20:50:11 abadger1999: :(
20:50:19 Just a heads up, we may try to get a change request in for transifex 0.7
20:50:19 That means, tags from the pkgdb and critpath won't be there until after we unfreeze.
20:50:23 ok when are we freezing exactly
20:50:27 abadger1999: fooey
20:50:28 Docs needs this badly for their translations
20:50:37 smooge: the 16th
20:50:37 brrr, it's cold in here :P
20:50:55 abadger1999, can we go for a change request for the change?
20:51:04 Welll...
20:51:22 ricky: no way to get it in before the freeze?
20:51:26 Oxf13: Under the new no frozen rawhide, when are we doing mass branching?
20:51:37 abadger1999: alpha freeze
20:51:41 so... tuesday
20:52:11 That might happen as well - I'll try to get some test repos setup and tested by this weekend
20:52:22 Okay... smooge, If mass branching is done, I might do it via change request.
20:52:32 But I'm very hesitant.
20:52:38 abadger1999: what's the worry?
20:52:51 technically if the mass branch is part of the release, it's not actually frozen.
20:52:53 I'm not sure if we need specific testing for docs' use case though, since they're apparently the big consumers for this update
20:53:07 mmcgrath: Lots of changes, lots of bugs I noticed and squashed, sync script is slow, db is huge.
20:53:30 abadger1999: oh, this is all related to the work you're doing with pkgdb?
20:53:43 abadger1999, ok thanks
20:54:05 mmcgrath: Yep. And a little part of it is just that I didn't do the majority of the code this time so my gut doesn't trust all of the changes that went in yet.
20:54:34 abadger1999: well as that comes let me know how I can help
20:54:42 Some time in staging will let me know what to expect.
20:55:05 ricky, mmcgrath, skvidal: So here's a question -- is tx update more important than new pkgdb?
20:55:21 new pkgdb gets us tags and critpath which we need.
20:55:34 But it sounds like the tx update needs some love and is important as well.
20:55:36 The tx update is a blocker for docs, so it's pretty important
20:56:00 Right now, we have it running in staging - we need test repos (ideally test repos that test docs workflow) and also some config file cleanup.
20:56:23 ricky, it's just documentation.. i mean next we will be worrying about quality assurance :)
20:56:24 Do you guys want me to switch over to working on tx instead of pkgdb since I already am sure pkgdb is going to slip?
20:56:27 (This is why puppet is currently disabled on app01.stg, sorry for hogging it :-))
20:57:09 abadger1999: I don't really know I have the knowledge to answer that.
20:57:16 I don't know what tx not making it in would mean
20:57:35 mmcgrath: no French/German/etc translations?
20:57:49 .ticket 1455
20:57:51 ricky: #1455 (transifex upgrade) - Fedora Infrastructure - Trac - https://fedorahosted.org/fedora-infrastructure/ticket/1455
20:57:57 G: for what?
we had German and French translations for F12
20:57:59 that's my confusion
20:58:21 My info is what sparks said on the second-to-last comment
20:58:31 Apparently docs translations need certain features from tx 0.7
20:58:42 "This will adversely affect the Release Notes and all other Docs Guides if not completed by Mar 11."
20:58:55 huh? why
20:58:55 Looking at that comment again though, the date is past the freeze, so not as much of a rush as I thought
20:59:51 ricky: k
20:59:58 well since we're about done I'm going to open the floor real quick
21:00:01 #topic open floor
21:00:06 anyone have anything they'd like to quickly discuss?
21:00:10 i had something.. sneezed and forgot it
21:00:28 don't turn 40.. it's the new 80
21:00:33 hahaha
21:00:36 Ok, and with that
21:00:37 :-)
21:00:37 #endmeeting