18:00:27 #startmeeting Infrastructure (2015-01-15)
18:00:27 Meeting started Thu Jan 15 18:00:27 2015 UTC. The chair is nirik. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:00:27 Useful Commands: #action #agreed #halp #info #idea #link #topic.
18:00:27 #meetingname infrastructure
18:00:27 #topic aloha
18:00:27 #chair smooge relrod nirik abadger1999 lmacken dgilmore mdomsch threebean pingou puiterwijk
18:00:27 The meeting name has been set to 'infrastructure'
18:00:27 Current chairs: abadger1999 dgilmore lmacken mdomsch nirik pingou puiterwijk relrod smooge threebean
18:00:32 * pingou here
18:00:58 here
18:01:04 o/
18:01:19 * threebean
18:01:39 * relrod here
18:02:55 * michel_slm here
18:03:22 * sborza here
18:03:24 alright, I guess let's get started then...
18:03:26 * herlo here
18:03:34 #topic New folks introductions and Apprentice tasks.
18:03:42 any new folks like to introduce themselves?
18:03:48 or apprentices with questions or comments?
18:04:10 * sborza raises hand
18:04:23 * fale is here too
18:04:28 * abompard here too :-)
18:04:29 my ssh access may not be working now that i've moved accounts?
18:04:50 sborza: it should have activated on the new account... we can debug after the meeting in #fedora-admin?
18:04:55 actually, management has moved from puppet to ansible, correct?
18:04:56 seems like the pub key is getting denied...i've updated the config based on: http://infrastructure.fedoraproject.org/infra/docs/sshaccess.txt
18:05:04 nirik: sure thing, thanks
18:05:27 mhurron: is moving, yes
18:05:34 mhurron: we are in the process of doing so yeah...
18:05:43 we have ~50 machines left in puppet...
18:05:45 the apprentice wiki page has a workflow with puppet, is there one for ansible? or a page that details what is in puppet and what is in ansible?
18:06:09 * tflink shows up late :-/
18:06:13 there's a migration page for puppet->ansible move, but yeah we should update things for ansible
18:06:44 * herlo blames tflink
18:06:48 :)
18:07:26 mhurron: I'll try and update that page, or provide info to someone else who wants to. ;)
18:07:39
18:07:47 herlo: penalty for showing up late is being blamed for ... everything?
18:07:53 tflink: exactly. :D
18:07:59 tflink: it shows we love you.
18:08:03 ha
18:08:13 thanks ... i'll do it if I can get the information
18:08:29 huh, didn't realize it worked that way
18:08:37 mhurron: excellent. Happy to answer questions in #fedora-admin, #fedora-noc, etc. ;)
18:08:45 tflink: who better than those that care to give you a hard time?
18:09:01 #topic Applications status / discussion
18:09:11 any application news, status or info this week?
18:09:12 * tflink will have to remember that one :)
18:09:31 the last el7 builds we needed for blockerbugs have been done and are in epel7-testing
18:09:40 nirik: would new application suggestions go here?
18:09:57 waiting on final review for application changes to support el7 but that should be done this week
18:10:19 tflink: awesome. Happy to work on migrating it.
18:10:21 herlo: sure.
18:10:35 #info blockerbugs nearing rhel7/ansible migration ready
18:10:42 I wanted to propose we have a look at d-note. https://pthree.org/2014/06/09/officially-announcing-d-note-version-1-0/
18:10:54 This is what it does: https://secrets.xmission.com/
18:11:06 and it's on github here: https://github.com/atoponce/d-note
18:11:15 I'm working on packaging it for Fedora now.
18:11:39 I've been mostly working on pkgdb2 this week
18:11:55 herlo: interesting.
18:11:58 * herlo is willing to deploy and maintain it...
18:12:07 to adjust it as discussed with rel-eng to migrate the new package / new branch processes to pkgdb (instead of bugzilla)
18:12:17 of course, I'll need some ansible guidance. :)
18:12:34 herlo: well, you need a few more folks.
I am very much not willing to deploy anything that only one person works on.
18:12:43 and we need to decide if it would be of use to us.
18:12:49 nirik: sure... which is why I am bringing it up now.
18:12:51 but yeah, packaging it up would be first step
18:13:12 in that line, i'd like to propose we look at graphite-web for metrics visualization: https://github.com/graphite-project/graphite-web
18:13:21 nirik: the conversation of usefulness is in the concept of being able to share secrets in a place we trust.
18:13:21 as well as updating our (way out of date) collectd configs
18:13:29 sborza: we use graphite at LF. It's pretty nice.
18:13:33 herlo: your secrets.xmission link doesn't work here
18:13:49 pingou: really?
18:13:51 weird...
18:13:57 herlo: a lot of changes happening now, would love to get this added to the EPEL repos too
18:13:57 it works on my pc
18:13:58 pingou: it's https.
18:14:02 yes
18:14:06 sborza: indeed...
18:14:17 herlo: can you also use that to make open ones that can stay around? or are they always destroyed on one view?
18:14:27 * nirik is wondering if it could replace paste.
18:14:34 pingou: well, you can stand up an instance quickly. from my repo at https://github.com/herlo/d-note
18:14:42 nirik: would be happy to work with someone to get this tested/integrated (re: graphite-web)
18:14:43 nirik: no. It's not intended for that.
18:14:49 ok.
18:14:58 nirik: he does have plans for user profiles and such.
18:15:09 sborza: I'm interested in it, but can't volunteer too much time. there's lots to consider.
18:15:18 * nirik notes we have a RFR process for these things. ;)
18:15:25 nirik: and there's a possibility of allowing multiple people to see the note. Only each instance would be deleted upon viewing, I think.
18:15:26 sborza: most of the work would probably be tied up in porting our custom plugins over.
18:15:39 nirik: I just wanted to get a feeling if it would be useful first.
:)
18:15:52 threebean: fair enough, doesn't have to be asap but at some point soon
18:16:00 herlo: sure.
18:16:10 It can create both private and public notes, but it is missing the code highlight feature
18:16:27 I'm not saying no to anything, just noting that we are very busy. :)
18:16:39 sborza: first thing to look into would be requesting that all the components/packages get branched for EPEL7 (they already exist in EPEL6 which is nice)
18:17:08 threebean: sounds good re: collectd...if we want the same for graphite-web should I do that myself?
18:17:21 threebean: full disclosure, i'm part of the graphite-web steering committee :)
18:17:32 nirik: as usual. Busy people are busy. :)
18:17:48 sborza: cool ;p. yeah, I actually just meant graphite-web needs the branch requests. we already have collectd on el7.
18:17:55 * herlo is done proposing. The interest level is enough to get me started for sure.
18:18:19 threebean: awesome, let's talk more in #fedora-admin (or #fedora-devel?)
18:18:40 sborza: -admin works for this
18:18:58 herlo: it looks like a cool thing in any case. Even if we don't have an instance.
18:18:59 sborza: can it handle any apps, other than web app?
18:19:26 sborza: are you planning to also package a prettified front-end like Team Dashboard or Graphene?
18:19:36 nirik: right. The point I was hoping to make was that we could have 'locations' of trustiness.
18:19:49 sure, could be handy
18:20:47 ok, any other application type news?
18:21:04 pingou: not sure what you're asking, example?
18:21:20 sborza: we currently use collectd to monitor the load on fedmsg
18:21:39 and on the queue of some application relying on fedmsg
18:21:45 michel_slm: can do...graphene is awesome, metrilyx is decent and grafana is phenomenal
18:22:40 sborza, pingou: those are the custom plugins I was talking about. I'm almost certain graphite's tools can handle those cases.
18:22:47 pingou: graphite-web can handle a variety of metrics collectors, including anything custom
18:22:56 cool :)
18:23:04 http://play.grafana.org/ does look nice
18:23:07 sborza: awesome, let me know if you need reviewers
18:23:34 * mirek-hm is late but here
18:23:40 #topic Sysadmin status / discussion
18:23:49 ok, on the sysadmin side... the usual stuff. ;)
18:24:06 We got a few bad drives replaced yesterday. Thanks puiterwijk for working with folks to get that done
18:24:26 I've been working on getting the new rhel7/ansible koji hubs done so we can switch to them.
18:24:32 michel_slm: fo sho, will do
18:24:41 smooge has been buried in hardware ordering and budget fun. ;)
18:25:07 hardware is ordered
18:25:13 s390 box is being racked
18:25:16 we also got a gigantic spike in web requests this morning that saturated our proxies for about 10min or so. Not sure what caused it...
18:25:20 qa boxes will be installed this afternoon
18:25:22 I broke DNS
18:25:42 smooge: that's a new s390 koji hub?
18:25:54 yes. I am getting it installed in our new racks
18:26:05 I really would like to set it up with ansible and in our normal machine setup
18:27:32 I have no idea how the current one is setup
18:27:35 Nagios has been very noisy of late... hopefully we can settle it down some.
18:27:47 since the s390 is somewhere not in redhat
18:27:57 smooge: well, when you get the new one installed, let me know. I can talk to sharkcz and we can get it sorted.
18:28:15 it's up in boston I think...
18:28:40 anyhow, I want to get all the secondary stuff setup in ansible and in our normal processes.
18:28:42 probably. I thought at one time it was at IBM but not sure
18:28:48 so they have 2fa and get regular updates and all that
18:29:01 I am fine with that
18:29:44 Oh, I made a db01.stg instance... and moved some databases from the old db02.stg to it.
18:30:03 there's still a bunch more to move. If any sysadmin-mainer folks want to do that, please feel free.
;)
18:30:42 #topic Upcoming Tasks/Items
18:30:42 https://apps.fedoraproject.org/calendar/list/infrastructure/
18:30:52 anyone have upcoming tasks they would like to schedule or note?
18:31:26 fosdem and devconf in early Feb
18:32:13 also smooge and I will be at our main datacenter in feb
18:32:27 likely we will need some outages then... to add memory to machines.
18:32:57 So, we have some time left in the meeting and I wanted to see about trying something new...
18:33:15 oh?
18:33:20 suspens :)
18:33:38 I thought it would be nice when we have extra time to talk about one specific app of ours.
18:33:45 how it's setup, what it does, etc.
18:33:55 +1
18:33:58 This week I could talk about koji or ansible... anyone have a preference?
18:35:20 nirik: seems like you get to decide
18:35:20 hm. cool.
18:35:27 ok, I guess I will pick one... koji, since I have been working on it.
18:35:34 #topic An App overview - koji
18:35:54 so, koji is our buildsystem. It has several parts...
18:36:18 a hub that runs a wsgi / httpd application - right now this is koji03
18:36:48 a secondary hub (could take over if primary died) that right now runs some cron jobs and kojira
18:36:52 this is koji04
18:37:17 kojira is a process that launches things like buildroot rebuilds and the like. You can see it in the web interface... jobs owned by it.
18:37:36 There's also a pile of builders. They all talk to the hub to get jobs and report on status and such.
18:38:06 koji uses a self signed cert with its own CA... because it uses certs to identify builders and people
18:38:33 we also have a squid proxy - kojipkgs02 currently that sits in front of the koji packages urls...
18:38:50 builders and other things that download from koji hit that squid where it's cached.
18:39:00 (for packages and most things)
18:39:27 builders run the 'kojid' process. This talks to the hub, and identifies with a cert.
18:40:13 Current koji (rhel6/puppet) uses heartbeat to keep an application ip on the active hub (usually koji03)
18:40:28 new koji I am working on (rhel7/ansible) will use keepalived
18:40:51 Storage for koji is a nfs mounted volume from the netapp.
18:41:12 It has to be mounted on the hub and any builders that do newrepo tasks (but not on builders that don't)
18:41:35 question
18:41:35 That's kind of the high level... questions?
18:41:57 those builders that do newrepo tasks
18:42:01 do they also do other tasks?
18:42:04 yep.
18:42:08 * threebean nods
18:42:12 koji has the concept of 'channels'
18:42:28 you can setup channels for things and then tell it specific builders do those channels.
18:42:32 so, a job submitted by joe-user can potentially get run on a builder that has write-access to the netapp mount?
18:42:55 yep. It would be in a mock chroot tho.
18:43:02 * threebean nods
18:43:08 cool.
18:43:10 that chroot shouldn't have any access...
18:43:34 !
18:43:48 We do have some special channels/rules... there's some special builders for secure boot for example that only do builds of packages that need secure boot signing.
18:44:10 pingou: yes?
18:44:20 Are all the koji hosts el6 still? (can the builders be moved to el7 separately from the masters? (Although I imagine heartbeat/keepalived might be the problem))
18:45:02 pingou: most of the builders are now f21. ;)
18:45:07 oh true :)
18:45:36 all the arm ones, the buildvm and buildhw ones are all f21.
18:45:49 the bkernel ones are still f20. I need to upgrade them.
18:46:09 the buildppc ones are rhel6 I think... but we only need ppc for epel
18:46:53 koji03/04 and kojipkgs02 are all rhel6/puppet. I am working now to move them to koji01/02, kojipkgs01 that is ansible and rhel7
18:48:13 oh, there's also a db host... I already moved it to rhel7
18:48:54 ok, any other questions? does someone want to do some other app next week? :)
18:49:47 oh - here's a question.
the new koschei service seems to be hitting koji pretty hard.
18:50:13 is koji handling that new load OK?
18:50:18 yeah, it has a limiter... I think it only ever does 40 jobs at once or something.
18:50:25 .load
18:50:25 pingou: Error: You don't have the owner capability. If you think that you should have this capability, be sure that you are identified before trying again. The 'whoami' command can tell you if you're identified.
18:50:28 yeah, perfectly fine.
18:50:31 30*
18:50:54 it's actually been nice when I had a problem with the f21 buildvm's... I could count on it submitting jobs so I could see if we fixed it or not.
18:51:01 mizdebsk: sorry, my mistake. ;)
18:51:01 ;p
18:51:08 and koji load must be <50% at the time of executing new build by koschei
18:51:19 all that is configurable, feedback is welcome
18:51:42 .buildload
18:51:43 nirik: Load: 116.9 Total: 272.0 Use: 43.0% (Medium Load)
18:52:16 mizdebsk: looks fine here. just curious about how it's playing out.
18:52:16 I think it's been fine. I haven't seen any issues with it.
18:52:41 During mass rebuilds we get a nice backlog (submitting faster than it can build), but even then it churns thru them pretty fast.
18:52:48 we have 91 builders currently enabled.
18:53:06 .builders
18:53:06 pingou: Enabled: 91 Ready: 78 Disabled: 69
18:53:26 being used == Enabled - Ready?
18:53:42 46 arm, 27 buildvm, 12 buildhw, 2 bkernel, 2 buildppc
18:53:58 ready means load is low and builder is accepting tasks
18:54:12 pingou: koji uses a loading setup... so you can tell it how much load a builder can handle
18:54:22 when it gets that load or more it stops being ready to accept new jobs.
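[Editor's note: the load figures quoted in this part of the log can be checked with a few lines of Python. This is only a rough sketch of the rules as described in the meeting, not koji's or koschei's actual code. The `Builder` class, the function names, and the builder names are hypothetical; the 50% threshold and the 116.9 / 272.0 figures are the ones quoted in the log, and the 6.0/2.0 load/capacity pair is an illustrative value.]

```python
from dataclasses import dataclass


@dataclass
class Builder:
    """Hypothetical stand-in for one builder's state (not a koji class)."""
    name: str
    enabled: bool
    load: float      # current task load on the builder
    capacity: float  # how much load the builder is configured to handle

    def is_ready(self) -> bool:
        # Simplified rule from the discussion: once the load reaches the
        # configured capacity, the builder stops accepting new jobs.
        return self.enabled and self.load < self.capacity


def use_percent(total_load: float, total_capacity: float) -> float:
    """Overall utilisation as a percentage, rounded to one decimal place."""
    return round(100.0 * total_load / total_capacity, 1)


def koschei_may_submit(total_load: float, total_capacity: float,
                       threshold: float = 50.0) -> bool:
    # Rule as stated in the meeting: overall koji load must be below 50%
    # when koschei submits a new build (the threshold is configurable).
    return use_percent(total_load, total_capacity) < threshold


if __name__ == "__main__":
    # Figures from the .buildload output in the log.
    print(use_percent(116.9, 272.0))
    print(koschei_may_submit(116.9, 272.0))
    # A builder carrying more load than its capacity is not ready.
    print(Builder("builder-a", True, 6.0, 2.0).is_ready())
```

[With the quoted figures, 116.9 / 272.0 works out to the 43.0% shown by the bot, which is under the 50% threshold, so under this reading koschei would have been allowed to submit at that moment.]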
18:54:35 ok
18:54:54 % koji list-hosts
18:54:54 Hostname                               Enb Rdy Load/Cap Arches  Last Update
18:54:54 arm02-builder00.arm.fedoraproject.org  Y   N   6.0/2.0  armhfp  2015-01-15 18:53:06
18:55:09 it's obvious that koschei needs hardware - there were plans to add new hardware to koji instead of having separate pool just for koschei
18:55:13 so that builder can handle a load of 2.0, but it has a 6.0 job... so it's not ready currently
18:55:36 also during mass rebuilds i'm suspending koschei
18:55:51 nirik: cool, thanks
18:56:28 mizdebsk: I wonder if we could do something with our staging koji... (which is currently not working). If it got regular builds it would help us know it's working along ok
18:57:08 just an idle thought.
18:57:30 pingou / threebean: either of you have an app you might want to talk about next week? ;)
18:57:34 maybe staging koschei could use it? i started playing with ansible setup for koschei, but i don't have access to bastion and such
18:57:35 sure :P
18:57:40 mizdebsk, you said there were plans ...
18:57:45 have they changed?
18:57:47 nirik: I might be able to find one, maybe two ;-)
18:58:17 mizdebsk: might work. I need to fix the issue where it's not starting builders, but then it would be good to look into
18:58:20 cool.
18:58:26 #topic Open Floor
18:58:32 Anyone have anything for open floor?
18:58:33 smooge, currently i've been told that there will be budget for new hardware if it's needed
18:58:56 but so far it wasn't necessary - afaict koji is handling current load just fain
18:59:12 fine*
18:59:14 mizdebsk: would you like a 'koschei' cert for koschei, so it can submit builds as itself instead of msimacek ?
18:59:27 threebean, yes, i've filed ticket for that
18:59:33 cool, cool.
18:59:53 mizdebsk, thanks.
I just wanted to make sure if I needed to put in an order before the end of the year or not (or prepare for boxes to arrive)
19:00:00 threebean, https://fedorahosted.org/rel-eng/ticket/5941
19:00:14 * pingou gtg
19:00:20 we really need to get thru more releng tickets in the releng meetings. ;) we will get there
19:00:25 thanks for the meeting everyone, thanks for chairing nirik
19:00:25 i assume cert will be generated as part of RFR
19:00:58 mizdebsk: yeah, we can get it sorted.
19:01:05 ok, thanks for coming everyone...
19:01:11 A couple things I forgot to mention in the apps section
19:01:12 will close in a min if nothing else.
19:01:12 1) A new release of FMN got deployed as promised last week. Lots of bugfixes and enhancements, but mostly a new set of nice defaults:
19:01:14 https://apps.fedoraproject.org/notifications
19:01:15 oh, go ahead
19:01:16 There's a button on each messaging context that lets you reset your filters to the new defaults if you'd like to test it out.
19:01:18 2) A new release of the-new-hotness got deployed to stg that follows up on bugs about real builds, not just scratch builds. There's a little more work to be done before we can put it in production. Should be done with that in a week or two.
19:01:20 3) We had some issues with FAS in fedmsg the past few weeks. Two of the fas servers weren't publishing any messages. Should be all fixed now.
19:02:21 threebean: thanks for tracking that down.
19:02:21 (heh, that's all.. ;p)
19:02:33 np
19:02:36 mizdebsk: it's causing issues
19:02:39 I really need to redo fas01 too... but I want to try and confirm the cert stuff is all ok in asible
19:02:41 ansible
19:02:53 mizdebsk: we did some things to have java builds mostly go to x86
19:03:19 mizdebsk: and because of the load of the tasks it's not happening, causing complaints from the java folks
19:03:45 dgilmore: huh... was this eclipse folks? or ?
19:04:01 nirik: no.
the people doing other java builds
19:04:04 dgilmore, koschei runs some tasks with --arch-override x86_64 because arm builders are not usable for java imho
19:04:10 that may be the reason
19:04:36 mizdebsk: it's causing issues
19:04:39 ok, because one of the arm builders in the eclipse channel is down with a dead drive... so it's only got 1 left I think, making it slow
19:05:06 java on arm is missing a jit so it's slow
19:05:08 dgilmore, but this is really an issue related to arm and java, not koschei itself
19:05:21 are there any other issues with koschei?
19:05:33 it is something that is going to need some planning and coordination
19:05:48 perhaps we could continue this over in #fedora-releng? we are over time on our meeting. ;)
19:05:56 mizdebsk: yes but koschei is breaking what we are doing to deal with it
19:06:18 * mizdebsk is fine with moving to different channel
19:06:49 nirik: sure
19:06:50 Thanks for coming folks.
19:06:54 #endmeeting