19:00:01 #startmeeting Infrastructure (2013-02-21)
19:00:01 Meeting started Thu Feb 21 19:00:01 2013 UTC. The chair is nirik. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:01 Useful Commands: #action #agreed #halp #info #idea #link #topic.
19:00:01 #meetingname infrastructure
19:00:01 The meeting name has been set to 'infrastructure'
19:00:01 #topic welcome y'all
19:00:01 #chair smooge skvidal CodeBlock ricky nirik abadger1999 lmacken dgilmore mdomsch threebean
19:00:01 Current chairs: CodeBlock abadger1999 dgilmore lmacken mdomsch nirik ricky skvidal smooge threebean
19:00:13 * skvidal is here
19:00:15 hello everyone. who's around for an infrastructure meeting?
19:00:15 not guilty
19:00:23 * cyberworm54 is here
19:00:25 * lmacken
19:00:26 * threebean is kinda here
19:00:28 * maayke is here
19:00:33 * abadger1999 here
19:00:40 * pingou here
19:00:52 * SmootherFrOgZ here
19:02:08 ok, I guess let's go ahead and dive in...
19:02:15 #topic New folks introductions and Apprentice tasks.
19:02:30 any new folks like to introduce themselves? or apprentices with questions or comments?
19:03:04 Hi, I am an apprentice and hopefully I can learn and contribute as much as I can
19:03:31 welcome (back) cyberworm54
19:03:57 Thanks!
19:04:01 to digress a bit... do folks think our apprentice setup is working well? or is there anything we can do to improve it?
19:04:20 I think the biggest problem is new people getting up to speed and finding things they can work on.
19:04:52 nirik: also - we have a fair amount more code-related tasks than general admin tasks that newcomers can get into
19:04:56 we are also low on new easyfix tickets, particularly on the sysadmin side.
19:05:02 yeah.
19:05:14 it is a bit... confusing, but once you get to the docs and actually read them you have a starting point
19:05:28 #info new easyfix tasks welcome, team members are encouraged to try and file tickets for them.
19:06:06 ok, moving on then I guess.
19:06:17 #topic Applications status / discussion
19:06:27 any application / development news this week or upcoming?
19:06:46 I've been doing some cleanup on the pkgdb db schema
19:06:49 before: http://ambre.pingoured.fr/public/pkgdb.png
19:06:57 after: http://ambre.pingoured.fr/public/pkgdb2.png
19:07:25 that's with the help of abadger1999 :)
19:07:29 wow. nice!
19:07:29 nice ☺
19:07:42 #info pingou has vastly simplified the pkgdb db.
19:07:46 * abadger1999 just reviews and makes suggestions on what pingou writes ;-)
19:07:54 pushed a new version of pkgdb-cli (waiting to arrive in testing) and pushed upstream a new version of copr-cli
19:08:16 #info new pkgdb-cli pushed out as well as copr-cli
19:08:19 New fas release is finally out the door. Planning to upgrade production on Feb 28.
19:08:29 abadger1999 and I have started to think about pkgdb2, basically; the schema update is the first step
19:08:56 pkgdb -- yeah, and the pkgdb2 api is probably going to be the second step
19:08:57 #info fas release being tested in staging, for 2013-02-28 release to prod.
19:09:19 as a note for admins -- the fas release that introduced fedmsg introduced a bug that you should know about
19:09:40 btw, there's a bunch of locale fixes in the new fas release
19:09:41 email verification when people change their email address was broken.
19:09:50 that's the one we have in prod, but we have hotfixed it, right?
19:10:00 it would be good to test fas with different languages
19:10:32 cool.
19:10:39 it would change the email when the user first entered the updated email in the form instead of waiting for them to confirm that they received the verification email.
19:10:45 I saw in stg that it also has the 'no longer accept just yubikey for password' fix in.
19:11:37 askbot got fedmsg hooks in production this week. there are some new bugs to chase down regarding invalid sigs and busted links..
19:11:41 any other application news? oh...
19:11:56 #info askbot is now sending fedmsgs.
19:11:58 Latest status -> http://www.fedmsg.com/en/latest/status/
19:12:08 fedmsg.com? wow
19:12:25 Has anyone had a chance to test patrick's fas-openid dev instance? any feedback for him?
19:12:26 nirik: Hmm... looks like production isn't hotfixed.
19:12:30 threebean: what's the status on fedmsg emitters from outside of the vpn?
19:12:35 nirik: but the next fas release will have the fix.
19:12:40 abadger1999: :( I thought we did. ok.
19:12:47 skvidal: no material progress yet, but I've been thinking it over.
19:12:50 Can we wait until Thursday?
19:13:01 threebean: okay thanks
19:13:04 skvidal: I have some janitorial work to do.. then that's next on my list.
19:13:21 threebean: that's the limiting factor for adding notices from coprs, I think
19:13:29 abadger1999: I suppose
19:14:12 * threebean nods
19:14:18 I've used fas-openid but not tested it heavily. It has worked and looks nice. puiterwijk has a flask-fas-openid auth plugin that he's tested, and converted fedocal, IIRC, to use it.
19:14:41 yeah, it's worked for me for a small set of sites I tested.
19:15:22 speaking of fedocal, I need to tag 0.1.0 and put it up for review
19:15:29 #info more fas-openid testing welcome. Has worked for those folks that have tried it so far.
19:15:41 the current feature requests will have to wait for the next release...
19:15:57 pingou: yeah. Will be good to get it set up. :)
19:16:15 Oh, fchiulli has a new version of elections that's ready for some light testing
19:16:16 #info fedocal ready for 0.1.0 tag and review process.
19:16:24 abadger1999: oh cool!
19:16:30 http://elections-dev.cloud.fedoraproject.org/
19:16:31 abadger1999: cool. Is there an instance up?
19:16:34 nice.
19:16:44 nirik: should be
19:16:48 You need to make an account on fakefas in order to try it out.
19:17:04 #info testing on new elections version welcome: http://elections-dev.cloud.fedoraproject.org/ (make account in fakefas)
19:17:06 Please do try it out.
19:17:06 abadger1999: is elections switching to fas-openid, too?
19:17:24 abadger1999: and the code is ?
19:17:45 skvidal: I believe it is using flask-fas right now because flask-fas-openid isn't in a released python-fedora yet.
19:18:03 pingou: https://github.com/fedora-infra/elections
19:18:14 abadger1999: got it
19:18:19 abadger1999: great
19:18:20 abadger1999: thx
19:18:36 np
19:18:47 I have one more application-type thing to discuss... dunno if abompard is still awake, but we should discuss mailman3. ;)
19:18:51 I am all for moving more things over to the flask-fas-openid plugin though.
19:19:15 * nirik is too.
19:19:33 anyhow, we are looking at setting up a mailman3 staging instance to do some more testing and shake things out.
19:19:41 however, mailman3 needs python 2.7
19:19:43 nirik: yeah
19:20:06 so, it seems: a) rhel6 + a bunch of python rpms we build and maintain against python 2.7
19:20:12 or b) a fedora 18 instance
19:20:30 abadger1999, congrats on the elections stuff
19:20:38 yes, and MM3 really does not work on python 2.6, sadly
19:20:47 * pingou question: which one will be out first: EL7 or MM3? :-p
19:20:55 we are starting to have more fedora in our infra (for example, the arm builders are all f18)
19:21:09 so, we might want to come up with some policy/process around them. Like when to do updates, etc.
19:21:09 smooge: thanks. It was all fchiulli though :-) I told him he can be the new owner of the code too :-)
19:21:13 I've already rebuilt an application for a non-system python, and it's not much fun
19:21:33 bwahahahah
19:21:33 as in non-scriptable
19:21:58 yeah, it's a pain either way...
19:21:59 * abadger1999 thinks fedora boxes are going to be preferable to non-system python.
19:22:07 +1
19:22:09 nirik: an idea
19:22:11 I'm leaning that way as well.
19:22:16 by the way, Debian has a strange but nifty packaging policy for python packages that makes them work with all the installed versions of python
19:22:21 I think we should make a bunch of servers rawhide
19:22:40 abompard: I assume the db/data for mm3 is all separate from where it needs to run, right?
19:22:46 abompard: yeah -- I've looked at the policy but not the implementation. But every time I've run it by dmalcolm, he's said he doesn't like it.
19:23:04 abompard: i think some of that might be because he has looked at the implementation :-)
19:23:05 abadger1999: understandably, it's symlink-based
19:23:17 skvidal: yeah, to some extent
19:23:23 nirik: I wonder if we could have 2 instances - talking to the same db - so we could update f18 to latest - run mm3 on it in r/o mode - to make sure it is working
19:23:27 skvidal: it has local spool directories
19:23:30 nirik: then just pass the ip over to the other one
19:23:40 in the past we have been shy of fedora instances because of the massive updates flow, I think, as well as possible bugs around those updates. I think it's gotten much better in the last few years (I like to think due to the updates policy, but it's hard to say)
19:23:59 nirik: which is why I was thinking we don't do updates to the RUNNING instance
19:24:07 we just swap out the instance that is in use/has that ip
19:24:08 ... or fewer contributors? /me ducks and runs
19:24:16 :)
19:24:22 nirik: so we test the 'install'
19:24:24 skvidal: right, so an extra level of staging?
19:24:31 nirik: one level, really
19:24:32 skvidal: I don't know how MM3 will handle a read-only DB
19:24:37 prod and staging
19:25:02 well, right now we are talking about a staging instance only, but yeah, I see what you mean. we could do something along those lines.
19:25:17 I also think for some use cases it's not as likely to break...
19:25:36 ie, for mailman, postfix and mailman and httpd all need to work, but it doesn't need super leaf nodes, right?
19:25:39 abompard: understood
19:26:02 nirik: anyway - just an idea
19:26:04 nirik: ooo - actually
19:26:13 nirik: I just had a second idea that you will either hate or love
19:26:14 whereas for something like a pyramid app, it would be a much more complex stack
19:26:16 nirik: snapshots
19:26:16 skvidal: we may get bugs because of that, not because of the upgrade
19:26:30 nirik: we snapshot the running instance in the cloud
19:26:32 yeah, we could do that too.
19:26:32 nirik: upgrade it
19:26:36 and if it dies - roll it back
19:27:04 for the moment it will only be low-traffic lists anyway
19:27:22 and I must check that, but if MM is not running, I think postfix keeps the message
19:27:30 skvidal: how would that work in terms of data? would we keep the db and local spool directory separate from the snapshots?
19:27:33 and re-delivers when MM starts
19:27:34 abompard: yes
19:27:35 FWIW, I run f18 servers at home here, and they have been pretty darn stable. (as they were when f17... earlier releases had more breakage from my standpoint)
19:27:41 err
19:27:41 abadger1999: yes
19:27:44 Cool.
19:28:11 abadger1999: no reason we can't have a mm3-db server in the cloud :)
19:28:12 * abadger1999 kinda likes that. although possibly he just doesn't know all the corner cases there :-)
19:28:16 yeah. I'm sure we could do something with snapshots.
19:28:21 anyway - just an idea
19:28:23 nothing in stone
19:28:27 yeah.
19:29:06 also, for updates, we may just do them on the same schedule as the rhel ones, unless something security-related comes up in an exposed part... ie, just look at httpd, etc., not the entire machine.
19:29:42 anyhow, all to be determined; we can feel out a policy.
19:29:49 anything else on the applications side?
19:29:56 I have two more
19:30:00 Do we have a schedule for getting fas-openid into production?
19:30:28 abadger1999: I think it's ready for stg for sure now... but not sure when prod...
19:30:58 I'm fine with rolling it out as fast as we are comfortable with.
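[Editor's aside] skvidal's snapshot-then-upgrade idea from the mailman3 discussion above can be sketched roughly as follows. This is a dry-run illustration only: the instance name `lists01` and the snapshot label are hypothetical, and the commands are printed rather than executed. `nova image-create` (snapshot a running instance) and `nova rebuild` (re-image an instance from a snapshot) are the relevant OpenStack client commands.

```python
# Dry-run sketch of the snapshot-before-upgrade idea: snapshot the
# running instance, upgrade it, and rebuild from the snapshot if the
# upgrade goes sideways. "lists01" is a hypothetical instance name;
# nothing here executes any command.

def snapshot_upgrade_plan(instance, snap_suffix='pre-upgrade'):
    """Return the ordered shell commands for a snapshot-guarded upgrade."""
    snap = '%s-%s' % (instance, snap_suffix)
    return [
        'nova image-create %s %s' % (instance, snap),  # snapshot the running instance
        'ssh %s yum -y update' % instance,             # do the upgrade on the instance
        # rollback step, only needed if the upgraded instance dies:
        'nova rebuild %s %s' % (instance, snap),
    ]

for cmd in snapshot_upgrade_plan('lists01'):
    print(cmd)
```

In practice the rollback step would be run conditionally, after checking the upgraded instance's health.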
19:31:03 I'd like to see it get more use. ;)
19:31:04 I think we're coming along great. But if we're going to start migrating apps to use fas-openid/telling people to use it when developing their apps (like elections), then we need to have a plan for getting it into prod
19:31:19 nirik: it's set up to replace the current fas urls?
19:31:34 abadger1999: not fully sure on that. I think so...
19:31:36 * abadger1999 was wondering if we could deploy it and just not announce it for a few weeks
19:31:46 that's a thought.
19:32:22 alright -- I guess let's talk about this more on Friday after our classroom session with puiterwijk :-)
19:32:26 Oddly, I have noticed that for things like askbot you get two different "users" with different urls.
19:32:28 yeah
19:33:05 Other thing is, for all the devs here, how's the "review all changes" idea working out?
19:33:27 #info will try out an f18 server for mm3 staging testing and feel out an updates policy, etc. Possibly using snapshots more.
19:33:39 I've liked how it works with pingou, puiterwijk, and SmootherFrogZ for fas, python-fedora, and packagedb.
19:33:46 #info will look at moving fas-openid to prod as soon as is feasible.
19:33:55 abompard: how much space do you need on the mm server itself - if you are not storing the db there?
19:33:59 lmacken: Is it working okay for bodhi and such too?
19:34:07 anything that's falling through the cracks?
19:34:14 skvidal: I need to check that
19:34:21 skvidal: if we are doing this as a real staging, we might want to just make a real 'lists01.stg.phx2' virthost instead of cloud?
19:34:26 abompard: I definitely like it
19:34:53 Do we want to say that certain things are okay to push without review?
(making a release would be a candidate... I was going to suggest documentation earlier, but pingou found a number of problems with my documentation patch :-)
19:34:53 abadger1999: ^ :)
19:34:54 nirik: okay - I didn't know if we wanted to be cloud-er-fic about it or not
19:35:01 nirik: thx
19:35:31 skvidal: yeah, I'm open to either, but I think right now until we have less fog in our clouds, a real one might be better for this... but either way
19:35:53 nirik: well - with attached persistent volumes - using one of the qcow imgs is non-harmful
19:35:55 abadger1999: I like seeing the extra review. I've not done much reviewing myself. ;)
19:36:06 skvidal: not much, a few hundred MB
19:36:08 nirik: but I agree about fog
19:36:17 * abadger1999 notes that threebean is in another meeting but said he still likes the idea but hasn't done it consistently all the time. So more experimentation with it is needed.
19:36:43 * abadger1999 liked that nb reviewed a documentation update the other day :-)
19:37:02 I think it can bring us new contributors
19:37:21 some of them are easyfix
19:37:31 others are bigger and then might need more experienced reviewers
19:37:57 yeah
19:38:21 welcome mdomsch
19:38:41 Yeah. I agree. it's nice to have someone else's eyes on the bigger fixes even if they're relatively new too, though. It's better than before, where I would have committed it without any review at all.
19:38:49 better late than never
19:38:51 that reminds me, mdomsch was going to look at updating mm in prod to 1.4 on friday... if not then, then sometime soon. ;)
19:39:04 #info feedback on github reviews of all commits welcome.
19:39:11 anyone have any grief with doing a major MM upgrade tomorrow afternoon?
19:39:11 #info mirrormanager update to 1.4 soon.
19:39:47 mdomsch: If you're around in case it goes sideways it would be very nice.
19:39:51 everything I know I've broken, I've fixed. Now it's time to test in production. :-)
19:39:52 I think it should be fine.
We can be somewhat paranoid and not touch one of the apps so we have an easy fallback.
19:40:11 get the fixes in that you've had pending and get us onto a single codebase for development.
19:40:13 k
19:40:25 (until we are sure the others are all working right, I mean)
19:40:31 right
19:40:34 so bapp02, then app01
19:40:47 * nirik nods.
19:40:48 and I'll stop the automatic push from bapp02 to app*
19:40:58 sounds good.
19:41:00 until we're comfortable. Worst case, we have slightly stale data for a few hours
19:41:21 * nirik nods.
19:41:28 instead of "if you're around" it would've been clearer for me to say "as long as you're around" :-)
19:41:43 mdomsch: you've picked up all the hotfixes into 1.4, right?
19:41:46 abadger1999: naturally; I'm not around nearly as much
19:41:56 Yeah. we miss you ;-)
19:42:16 abadger1999: +1 :)
19:42:40 anyhow, any other application news? or shall we move on?
19:43:00 #topic Sysadmin status / discussion
19:43:06 nirik: yes, I pulled them all in while at FUDCon
19:43:17 let's see... this week smooge was out at phx2 for a whirlwind tour.
19:43:22 mdomsch: cool.
19:43:45 #info smooge got out bnfs01 server's disks working again.
19:43:51 #undo
19:43:51 Removing item from minutes:
19:43:56 #info smooge got our bnfs01 server's disks working again.
19:44:09 kind of sort of
19:44:19 I've been tweaking nagios of late... hopefully making it better.
19:44:30 #info nagios adjustments in progress
19:44:56 We should have net for the rest of the arm boxes friday.
19:45:07 #info arm boxes will get new net friday, hopefully
19:45:14 I had a discussion with the author of pynag this morning
19:45:49 cool. Worth using as a tool for us to runtime-manage nagios?
19:45:50 if we have people willing to spend some time - we could easily build a query tool/cli-tool for nagios downtimes/acknowledgements/etc
19:46:07 that would be quite handy, IMHO
19:46:12 nirik: it needs some code to make it work - but I think the basic functionality is available
19:46:41 for some things the ansible nagios module would do, but for others it would be nice to have a command line.
19:47:15 I'd like to look at doing a mass reboot next wed or so... upgrade everything to rhel 6.4.
19:47:17 skvidal: interesting!
19:47:37 Might do staging today/tomorrow to let it soak there and see if any of our stuff breaks. ;)
19:47:52 #info mass reboot next wed (tentative) for rhel 6.4 upgrades.
19:47:59 nirik: right - I'd like to be able to enhance the ansible nagios module to be more idempotent and 'proper'
19:48:04 nirik: pynag _could_ do that
19:48:17 yeah, it looks very bare-bones right now.
19:48:35 in particular we could use a 'downtime for host and all dependent hosts' type thing
19:48:52 nirik: we could also use a 'give me the state of this host'
19:48:58 without having to go to the webpage
19:49:14 according to palli (a pynag developer) it can read status.dat
19:49:15 from nagios
19:49:18 I am looking at lldpd for our PHX2 systems http://vincentbernat.github.com/lldpd/ Mainly to get a better idea of where things are
19:49:20 to determine ACTUAL state
19:49:26 finally, in the sysadmin world, I'd really like to poke ansible more and get it to where we can use it for more hosts. Keep getting sidetracked, but it will happen! :)
19:51:04 smooge: another thing we could look at there is http://linux-ha.org/source-doc/assimilation/html/index.html (it uses lldpd type stuff). They are about to have their first release... so very early days.
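[Editor's aside] The cli-tool ideas above ("give me the state of this host" without the webpage, and scheduling downtimes) boil down to two operations: reading nagios's status.dat, and writing a SCHEDULE_HOST_DOWNTIME external command into the nagios command pipe. A minimal sketch, under stated assumptions: the tiny parser here is illustrative (pynag's Parsers module provides a fuller implementation), and file paths and hostnames are hypothetical.

```python
# Minimal sketch of a nagios query/downtime cli-tool.
# status.dat consists of blocks like:  hoststatus {\n key=value ...\n }
import time

def parse_status(text):
    """Parse status.dat text into a list of (blocktype, dict) tuples."""
    blocks, current, btype = [], None, None
    for line in text.splitlines():
        line = line.strip()
        if line.endswith('{'):
            btype, current = line[:-1].strip(), {}
        elif line == '}' and current is not None:
            blocks.append((btype, current))
            current = None
        elif current is not None and '=' in line:
            key, _, val = line.partition('=')
            current[key] = val
    return blocks

def host_state(text, host):
    """Return current_state for a host (0=UP, 1=DOWN, 2=UNREACHABLE)."""
    for btype, data in parse_status(text):
        if btype == 'hoststatus' and data.get('host_name') == host:
            return int(data['current_state'])
    return None

def downtime_cmd(host, minutes, author, comment, now=None):
    """Format a SCHEDULE_HOST_DOWNTIME line for the nagios command pipe."""
    now = int(now if now is not None else time.time())
    end = now + minutes * 60
    return '[%d] SCHEDULE_HOST_DOWNTIME;%s;%d;%d;1;0;%d;%s;%s' % (
        now, host, now, end, minutes * 60, author, comment)
```

Writing the returned command line into the nagios command FIFO (e.g. /var/spool/nagios/cmd/nagios.cmd; the path varies by install) is what actually schedules the downtime.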
19:51:33 ah cool
19:51:38 will look at that also
19:51:55 oh, on nagios, I set an option: soft_state_dependencies=1
19:52:22 this hopefully will help us not get the flurry of notices when a machine is dropping on and off the net, or has too high a load to answer, then answers again.
19:52:50 #topic Private Cloud status update / discussion
19:53:01 skvidal: want to share your pain / where we are with cloudlets? :)
19:53:08 sure
19:53:23 last week I did the euca upgrade and the wheels came right off
19:53:29 and then it plunged over a cliff
19:53:31 into a volcano
19:53:41 sounds like a lot of fun
19:53:42 where it was eaten by a volcano monster
19:53:54 who was riding a yak
19:53:56 anyway, the euca instance is limping along at the moment with not-occasional failures :(
19:54:04 smooge: and the yak had to be shaven
19:54:17 brought back some pictures >
19:54:19 ?
19:54:21 so...
19:54:35 I've been working on porting our imgs/amis/etc over to openstack
19:54:44 and getting things more production-y in the openstack instance -
19:54:58 I got ssl working around the ec2 api for openstack
19:55:11 #info euca cloudlet limping along after upgrade.
19:55:12 working on ssl'ing the other items
19:55:18 for the past couple of days
19:55:26 #info work ongoing to bring the openstack cloudlet up to more production readiness
19:55:28 I've been in a fist fight with openstack and qcow images
19:55:33 and resizing disks
19:55:47 I just got confirmation from someone that what we want to do is just not possible at the moment :)
19:55:54 lovely. ;(
19:56:10 nirik: not until we get the initramdisk to resize the partitions :(
19:56:17 so - I'm punting on this
19:56:24 I just put in a new ami and kernel/ramdisk combo
19:56:29 that's rhel6.4 latest
19:56:30 sometimes that is best
19:56:35 yeah. I think that could work, but needs some time to get working right. Hopefully by the cloud-utils maintainer.
;)
19:56:38 and since it is an AMI it resizes the disks
19:56:50 what it DOES NOT DO is follow the kernel on the disk - it uses the one(s) in the cloud
19:56:54 which is suck
19:57:00 but at least it is known/obvious suck
19:57:08 but it should also get us moving past it for now.
19:57:11 I've also just built a new qcow from rhel6.4
19:57:27 so for systems that don't need to be made on-the-fly - we can spin them up
19:57:31 growpart the partition
19:57:33 reboot
19:57:35 resize
19:57:37 and go
19:57:47 and I'm working on a playbook to handle all of the above for you
19:57:51 and, yes, it makes me cry inside
19:58:08 ;(
19:58:12 that's where we are at the moment
19:58:26 I am making new keys/accounts/tenants/whatever
19:58:35 for our lockbox 'admin' user
19:58:40 for making persistent instances
19:58:53 * nirik nods.
19:58:57 the next step is to start making use of the resource tags in openstack
19:59:02 so we can more easily track all this shit
19:59:15 also I have to make a bunch of volumes and rsync over all the data from the euca volumes :(
19:59:30 I fully expect that last part to be a giant example of suffering
19:59:46 yeah. we should probably move one set of instances first and sort out if there's any doom
19:59:57 if I sound kinda 'bleah' there's a reason
20:00:02 nirik: I thought I'd start with the fartboard
20:00:07 heh. ok
20:00:37 nirik: also - now that we have instance tags - it should be doable to write a simple 'start me up' script using ansible to spin out the instances
20:00:40 and KNOW where they are
20:00:48 ok, we are running over time... let me quickly do upcoming and open floor. ;)
20:00:52 sorry
20:00:55 thx
20:00:57 that's fine.
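[Editor's aside] The grow-then-resize steps skvidal lists above for a qcow-backed guest amount to roughly the following. This is a dry-run sketch: `/dev/vda` and partition 1 are assumed device names, the commands are only printed, and `growpart` comes from cloud-utils (with `resize2fs` growing the ext filesystem afterward).

```python
# Dry-run sketch of the qcow disk-grow flow described above.
# "/dev/vda" is an assumed device name; nothing here executes commands.

def grow_disk_plan(disk='/dev/vda', partnum=1):
    """Return the shell steps to grow a partition and its filesystem."""
    return [
        'growpart %s %d' % (disk, partnum),   # cloud-utils: grow the partition to fill the disk
        'reboot',                             # pick up the new partition table
        'resize2fs %s%d' % (disk, partnum),   # grow the ext filesystem into the new space
    ]

for step in grow_disk_plan():
    print(step)
```

This mirrors what the ansible playbook mentioned above would automate.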
;) all good info
20:01:04 one last thing
20:01:08 if anyone wants to get involved
20:01:08 ping me
20:01:29 #info please see skvidal if you want to get involved in our private cloud setup
20:01:33 #topic Upcoming Tasks/Items
20:01:42 (big paste)
20:01:44 #info 2013-02-28 end of 4th quarter
20:01:44 #info 2013-03-01 nag fi-apprentices
20:01:44 #info 2013-03-07 remove inactive apprentices.
20:01:44 #info 2013-03-19 to 2013-03-26 - koji update
20:01:44 #info 2013-03-29 - spring holiday.
20:01:46 #info 2013-04-02 to 2013-04-16 ALPHA infrastructure freeze
20:01:48 #info 2013-04-16 F19 alpha release
20:01:50 #info 2013-05-07 to 2013-05-21 BETA infrastructure freeze
20:01:52 #info 2013-05-21 F19 beta release
20:01:54 #info 2013-05-31 end of 1st quarter
20:01:56 #info 2013-06-11 to 2013-06-25 FINAL infrastructure freeze.
20:01:58 #info 2013-06-25 F19 FINAL release
20:02:00 anything people want to schedule/note etc?
20:02:07 I'll add the fas update and the mass reboot.
20:02:20 Sounds good.
20:02:49 #topic Open Floor
20:02:54 Anyone have items for open floor?
20:03:32 I have a series of 'Fedora-Infra: Did you know?' blog posts coming, like once a week for the coming 4 weeks
20:03:32 ok.
20:03:42 pingou: wow
20:03:46 pingou: awesome. More blog posts would be great.
20:03:49 short stuff, speaking about some cool features/ideas
20:03:52 pingou: looking forward to seeing those
20:04:10 Thanks for coming everyone. Do continue over on our regular channels. :)
20:04:14 #endmeeting