18:00:05 #startmeeting Infrastructure (2015-10-08)
18:00:05 Meeting started Thu Oct 8 18:00:05 2015 UTC. The chair is nirik. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:00:05 Useful Commands: #action #agreed #halp #info #idea #link #topic.
18:00:06 #meetingname infrastructure
18:00:06 The meeting name has been set to 'infrastructure'
18:00:06 #topic aloha
18:00:06 #chair smooge relrod nirik abadger1999 lmacken dgilmore mdomsch threebean pingou puiterwijk pbrobinson
18:00:06 Current chairs: abadger1999 dgilmore lmacken mdomsch nirik pbrobinson pingou puiterwijk relrod smooge threebean
18:00:06 #topic New folks introductions / Apprentice feedback
18:00:12 morning everyone.
18:00:14 * roshi is here
18:00:16 Morning
18:00:24 morning
18:00:43 Will give a few minutes for any new folks to give a short one-line introduction and for any apprentices with questions or comments.
18:01:26 * dotEast2015 is here
18:01:29 Here
18:01:43 * aikidouke here
18:01:43 just wanted to let you guys know I should be getting back active
18:01:55 hey pcreech. Welcome back.
18:02:07 * threebean is here
18:02:07 .hellomynameis smdeep
18:02:08 had an external commitment that took up a lot of extra time
18:02:09 smdeep: smdeep 'Sudeep Mukherjee'
18:02:17 thanks nirik!
18:02:27 I'm Brian, looking to join in with the Fedora infra team, focusing on more of the sysadmin type stuff. Hope to get involved with some system performance work.
18:02:50 puffi: that would be great :)
18:02:59 Hi, abhiii5459_ from India here. Newbie. This is my first fedora-infra meeting. Looking to contribute to Fedora (abhi.darkness@gmail.com | Github- abhiii5459)
18:03:02 My name is Justin - for now, I'm quietly observing. Not sure if I can contribute much now, but I'm interested in learning more.
18:03:32 cool. :) Welcome all the new folks. ;)
18:03:52 do see: https://fedoraproject.org/wiki/Infrastructure/GettingStarted
18:04:12 and feel free to ask questions as we go, or after in #fedora-admin, #fedora-apps or #fedora-noc
18:04:31 Great, Thanks nirik
18:04:33 Will do :) Thank you!
18:04:50 Any sysadmin types interested in our apprentice group can see me after the meeting over in #fedora-admin and I can get you added. ;)
18:05:07 For developers we have #fedora-apps folks able to help point you to easyfix tickets.
18:05:51 Great, any other new folks? or apprentices with questions?
18:06:32 alright, let's go on to status/info
18:06:41 #topic announcements and information
18:06:41 #info Some more inconclusive nfsv4 testing, ticket open with netapp now - kevin
18:06:42 #info Moved virthost15 to rhel7 without too much pain - kevin
18:06:42 #info Moved a smtp-mm host from coloamer01 to osuosl - kevin
18:06:42 #info Mass update/reboot cycle done - kevin, smooge, patrick
18:06:43 #info Cloud updated/rebooted - patrick
18:06:44 #info Fixed mirrorlist servers to not log all hits to log01 - kevin
18:06:46 #info Lots of debugging on proxy load issues, still in progress - kevin, patrick
18:06:48 #info migrated taskotron-dev to f22, stg is soon to follow - tflink
18:06:53 anything in there anyone would like to expand on or add?
18:07:13 I'll note that I (and today tflink) have been the only ones updating... ;) Nag nag. :)
18:07:43 ok, I had a discussion item:
18:07:45 #info Ipsilon backporting busy. Hopefully lots of bugs fixed after updating later today
18:07:46 #topic high cpu load on proxies - kevin / patrick
18:08:06 so, let me describe the issue and see if folks have any ideas to track it down. ;)
18:08:08 :)
18:08:27 We have a bunch of proxies. Some of them are "bigger" than others (more cpus, memory, etc).
18:08:43 Some of the smaller ones have started having very high load of late.
18:09:06 In particular proxy08. It has 4 cpus and they pretty much get pegged.
18:09:14 It's not i/o related.
18:09:18 It's not memory related.
18:09:29 It's apache taking up too much cpu.
18:09:53 https://admin.fedoraproject.org/collectd/bin/graph.cgi?hostname=proxy08.fedoraproject.org;plugin=load;type=load;begin=-50000
18:09:59 You can see the load there.
18:10:18 We thought it might be related to ssl handling, so we tried some things to mitigate that... but they didn't seem to help much
18:10:21 nirik: Is it a case of "bullying" from a load perspective, or how are the connections distributed between the proxies?
18:10:34 puffi: it's round robin dns.
18:10:40 By geoip region
18:10:56 so if you do a 'host fedoraproject.org' you can see all the proxies in your region
18:11:07 https://admin.fedoraproject.org/collectd/bin/index.cgi?hostname=proxy01.phx2.fedoraproject.org&hostname=proxy02.fedoraproject.org&hostname=proxy03.fedoraproject.org&hostname=proxy04.fedoraproject.org&hostname=proxy05.fedoraproject.org&hostname=proxy06.fedoraproject.org&hostname=proxy07.fedoraproject.org&hostname=proxy08.fedoraproject.org&hostname=proxy10.phx2.fedoraproject.org&hostname=proxy11.fedoraproject.org&plugin=load&timespan=86400&action=show_selection&ok_button=OK
18:11:14 (sorry that got cut off)
18:11:34 but you can see the load on them all there... and when we dropped proxy08 out, the others saw a load spike
18:11:46 (which makes sense as they would take over the connections)
18:12:23 so, we need some way to figure out why apache is so cpu bound on those hosts.
18:13:20 would swapping out apache for something like nginx help? (I imagine that's non-trivial though)
18:13:53 nirik: does load increase/decrease with respect to increased/decreased traffic?
18:14:01 nirik: I've heard that it does, but as you said, it's very difficult since all of our config needs to be rewritten
18:14:48 pcreech|work: well, I think so, but we don't have a fine tool there. It's either (no load / out of dns) or (all load / in dns)
18:15:00 .host fedoraproject.org
18:15:05 we did actually have it in with just an ipv6 ip... and it handled that just fine.
18:15:13 but there's a lot less ipv6 traffic in general
18:15:13 would running strace on the apache pid tell us anything?
18:15:37 aikidouke: I tried that... it just seems like they are handling requests...
18:15:47 I didn't see much stalling there...
18:15:53 hmm
18:16:13 does the change in cpu load correlate to a time when some package was updated on the proxy? I'm thinking of some sort of a regression somewhere.
18:16:40 avij: doesn't seem to, or if so, it was long ago... this has been happening for a while now.
18:17:13 https://admin.fedoraproject.org/collectd/bin/index.cgi?hostname=proxy08.fedoraproject.org&plugin=load&timespan=31622400&action=show_selection&ok_button=OK
18:17:15 ok. I'm browserless at the moment, I wasn't able to see the graphs.
18:17:19 hum, it does seem to go up in aug
18:17:51 when was the last httpd update?
18:18:25 is it a rhel 6 or 7 box?
18:18:28 but it was spiking up before then
18:18:33 7
18:19:11 anyhow, I don't want to take the entire meeting on this, but wanted to throw it out there if anyone has ideas. ;)
18:19:29 so, if you do, let me know in #fedora-admin / #fedora-noc
18:20:16 does anyone want to look at some tickets today? Or should we move on to tflink talking about taskotron? ;)
18:21:07 What testing are you guys doing with NFSv4? Is it kerberos/ACL/perf or
18:21:23 puffi: we are currently using almost 0 nfsv4.
18:21:35 we are trying to switch things to it, but hit a netapp issue...
18:21:36 nirik: Ah, it was just an agenda item
18:21:45 oh, info from before? yeah...
18:22:09 nfsv4 mounts hang for us. :) We are going to work with netapp folks to provide a tcpdump of it...
18:22:24 I want to say that you guys did some great detective work yesterday on the proxies
18:22:36 smooge: yeah, but sadly none of it helped.
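The round-robin DNS setup described above is why there is no fine-grained knob for proxy load: every client in a geoip region walks the same rotated A-record list, so a proxy is either fully in rotation or pulled out of DNS entirely. A minimal sketch of that behavior, using hypothetical placeholder IPs rather than the real fedoraproject.org A records:

```python
import itertools

def round_robin(addresses):
    """Yield addresses in round-robin order, roughly how rotated DNS
    A records spread clients across the proxies in one region."""
    return itertools.cycle(addresses)

# Hypothetical proxy IPs standing in for one region's A records:
proxies = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
rr = round_robin(proxies)
first_six = [next(rr) for _ in range(6)]
# Each proxy gets every third connection; there's no way to shift load
# away from a single overloaded host (like proxy08) short of removing
# it from DNS, at which point the others absorb its whole share.
```

This is a simplification (real resolvers cache and reorder records), but it captures the "all load or no load" property nirik describes.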
;)
18:22:46 nirik: well, it didn't help enough for proxy08 :)
18:23:06 It did increase the security of our connections, so I wouldn't say it didn't do anything at all :-)
18:23:26 sure, true.
18:23:50 tflink: ok, you ready?
18:23:59 yep
18:24:01 #topic Learn about: taskotron - tflink
18:24:05 take it away... ;)
18:24:33 I'm planning to talk a bit about Taskotron, but more from an infra POV - how it's deployed, etc.
18:25:24 Taskotron is our system for running automated tasks. We use "tasks" instead of "tests" purposely - the system is designed to be more flexible than just a test automation system, even if that is the primary driver right now
18:25:33 At its core, Taskotron is a system that does stuff and reports results in response to trigger events. For the Fedora case, this starts with fedmsgs and will start ending with fedmsgs signalling results shortly after F23 Final freeze ends.
18:26:18 A high-level conceptual diagram of flow in a full Taskotron deployment: http://tirfa.com/images/taskbot-overview.png
18:27:35 That diagram helps some to reinforce the concept that Taskotron isn't a single thing - it's many moving parts working together in a coherent system that's capable of the end-to-end "run automated tasks at scale"
18:27:56 At the core of this is libtaskotron: it is the code which runs tasks and the part of Taskotron which users will use most often, even if indirectly. By design, users can use libtaskotron to run tasks on their local systems without setting up the rest of Taskotron, but this is not the system as a whole.
18:28:10 upstream git: https://bitbucket.org/fedoraqa/libtaskotron
18:28:18 docs: https://docs.qadevel.cloud.fedoraproject.org/libtaskotron/latest/index.html
18:29:09 for the other major components: resultsdb is a restful results database used in taskotron and a couple other projects. execdb tracks task execution status from trigger, through execution and results reporting. It provides a single reference point from which all information about a task is available.
18:29:20 taskotron-trigger sits on fedmsg-hub and looks for specific fedmsgs. The code is rather simplistic, but a rewrite is planned once a few more deliverables are in place.
18:30:02 We make heavy use of buildbot - it is the part of a Taskotron deployment which is responsible for delegating tasks to worker nodes and retrieving results. From a high conceptual level, Taskotron isn't that different from many forms of CI, and it would have been silly for us to reinvent that wheel when there were good projects already in place.
18:30:15 before I get into deployment, any questions?
18:30:16 The RuleEngine, is it something that needs periodic addition of "rules"? Is it fairly exhaustive? Alternatively, does it add new rules to itself?
18:30:54 are the nodes all vms? or is there a mix of vms and bare metal? it doesn't deploy nodes, right? just uses existing ones?
18:30:57 abhiii5459_: for now, yes. we chose to keep it simple for now until we had a better idea of what folks would use it for and had a definite need for it
18:31:08 er, yes, it does need periodic maintenance
18:31:23 tflink : okie :)
18:31:27 it does not add rules to itself - right now, all rules are hardcoded, but that'll be changing in the medium future
18:31:42 https://bitbucket.org/fedoraqa/taskotron-trigger
18:31:57 nirik: for the moment, everything is a VM, but that will be changing soon
18:32:46 * nirik nods. ok
18:33:46 Also - one of the key points of Taskotron is that we're looking to make automation easier for contributors - I don't pretend to be an expert on the kernel, libvirt, gnome etc., so the idea is to make it easier for contributors to write their own tasks, and we make sure those tasks are run smoothly
18:34:01 * tflink assumes no questions, moves on to deployment
18:34:13 In terms of deployment, there are 4 basic "machine types" for our Taskotron deployments: master, client, resultsdb and virthost.
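The hardcoded trigger rules tflink mentions could look roughly like the sketch below. This is an illustration only, not taskotron-trigger's actual code (see the bitbucket repo above for that); the topic strings and task names are examples, and real fedmsg payloads carry more fields:

```python
# Hypothetical sketch of hardcoded fedmsg-topic -> tasks rules, in the
# spirit of what taskotron-trigger does on fedmsg-hub.
RULES = {
    "org.fedoraproject.prod.buildsys.build.state.change": ["rpmlint", "depcheck"],
    "org.fedoraproject.prod.bodhi.update.request.testing": ["upgradepath"],
}

def tasks_for_message(msg):
    """Return the task names to queue for an incoming fedmsg dict.

    Messages whose topic matches no rule are simply ignored.
    """
    return RULES.get(msg.get("topic"), [])

msg = {"topic": "org.fedoraproject.prod.buildsys.build.state.change",
       "msg": {"name": "bash"}}
```

The planned rewrite would presumably replace the hardcoded dict with rules loaded from configuration or from per-task repos.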
The master contains the buildbot buildmaster, task artifacts and taskotron-trigger. resultsdb machines contain resultsdb and execdb. virthosts are (surprise, surprise) bare metal machines whose primary responsibility is to run virtual machines.
18:35:25 When we get down to clients, I want to describe 2 different things - the way that we have been doing clients and the way that we will be doing clients by the end of the year.
18:35:36 Currently all of our clients are long-running virtual machines, each with a single buildslave which takes tasks from the master. However, that is about to change as we get disposable clients into place.
18:36:15 That being said, our next big feature is something that we're calling 'disposable clients' to spawn new VMs for each task executed in production - this guarantees a clean and reproducible execution environment for each and every task, regardless of what was executed previously on that client.
18:36:40 however, it also increases complexity by at least an order of magnitude
18:36:45 nice. Yeah.
18:37:31 all those VMs will be relatively plain vanilla VMs - we chose not to use anything like openstack
18:38:47 for 2 reasons - the primary being the scary complexity of openstack. the other gets back to one of our design philosophies - users need to be able to execute tasks locally and have that execution process match our production instances as closely as we can reasonably make it
18:38:55 Looking purely at the changes we'll see in deployment, the change is relatively straightforward - instead of having VMs with buildslave processes, we'll have multiple users on dedicated bare-metal hardware. Each of those users will run a buildslave process and have permissions to manage virtual machines on the local host - when a new task comes in, that client will spin up a new VM, execute the task and extract execution information from that VM before destroying it.
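The disposable-client lifecycle described above can be sketched as an ordered plan: clone a fresh VM, start it, run the task inside, extract the artifacts, then destroy and undefine the VM so the next task gets a clean environment. The commands below are illustrative placeholders (virt-clone/virsh-style), not the actual Taskotron tooling, and the function only builds the plan rather than executing anything:

```python
# Hedged sketch: the steps one buildslave user on a virthost would
# perform per task under the disposable-clients design. Command names
# and paths are hypothetical.
def disposable_client_plan(task, vm_name="task-vm"):
    """Return the ordered command steps for running one task in a
    throwaway VM. Nothing is executed; this just models the flow."""
    return [
        ["virt-clone", "--original", "base-image",
         "--name", vm_name, "--auto-clone"],        # fresh VM from a base image
        ["virsh", "start", vm_name],                # boot it
        ["ssh", vm_name, "runtask", task],          # execute the task inside
        ["scp", vm_name + ":/var/log/taskotron/*",
         "./artifacts/"],                           # extract execution info
        ["virsh", "destroy", vm_name],              # tear down...
        ["virsh", "undefine", vm_name],             # ...so nothing leaks to the next task
    ]

steps = disposable_client_plan("rpmlint")
```

The order-of-magnitude complexity increase tflink mentions comes from everything around this happy path: boot timeouts, hung VMs, log shipping, and cleanup after failures.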
18:39:26 the other change is that we'll likely be running Fedora on bare metal
18:39:54 fair enough. this is after f23 release?
18:40:03 yeah
18:40:22 we were talking about moving the koji builders to 23 after f23 is out also
18:40:36 we're getting close to merging in all the code we need, but this all needs to have the ever living crap tested out of it before the feature makes it into production
18:41:06 I know we're going to hit problems once we start using the disposable client code more, but I'm not sure what those issues will be :)
18:41:17 nirik: assuming we can test dnf :(
18:41:29 dgilmore: right. ;)
18:41:45 tflink: if you knew, you would fix them before. ;)
18:42:17 that's pretty much what I had planned to talk about, though
18:42:24 nirik: exactly
18:42:52 we also still need to sort out image building sometime...
18:43:08 yeah, I need to respond to that email chain - it was on my todo list for today
18:43:27 it doesn't sound like the copr folks are all that interested, though
18:43:29 * nirik ponders. I wonder if ostree would help any here...
18:43:41 yeah, they don't seem to mind much with the slower spin up times.
18:43:50 I don't know enough about ostree to say, but I suspect not
18:44:23 one reason why we're starting with VMs instead of docker is to provide a close-ish simulation of the end running environment
18:44:27 yeah, you install things to test on the instances, right?
18:44:31 and I suspect that ostree would be a change
18:44:46 that's one of the features coming with disposable clients, yes
18:45:56 any other questions or comments?
18:46:21 so, adding disposable clients is the big upcoming thing...
18:46:24 are there plans after that?
18:46:29 dist-git style tasks
18:46:45 * tflink looks for a link
18:46:56 https://phab.qadevel.cloud.fedoraproject.org/w/taskotron/roadmap/
18:47:27 that might be nice. We should also fold that into discussions about a frontend or dist-git change that lets people do PRs etc. (like pagure)
18:47:39 the idea there is to enable something like dist-git for tasks to live in, so that package maintainers can just drop some task formulas into a repo and have those run @ every commit, every build etc.
18:48:24 yep. :)
18:48:26 sounds great
18:48:34 after that, not sure yet
18:49:16 at some point, i want to be able to let folks submit their own tasks to be run on a fedmsg event of their choosing
18:49:31 I suspect a lot of work will go into documentation and helping maintainers get tasks written
18:49:40 agreed
18:50:08 oooh... that could be handy/cool.
18:50:46 tflink: just wondering how those tasks will be sandboxed then.
18:51:14 puiterwijk: the user-submitted ones? some form of VM, not 100% sure of the details
18:51:36 Right, just saying that that should be considered before opening it up to user-submitted tasks :)
18:52:00 sure, and also it should have logging and such... trust but verify
18:52:02 there's been some debate on how much those VMs need to be locked down if we restrict FAS group access
18:52:36 * tflink is hoping to get elasticsearch deployed for disposable clients so that all the system and task logs are shipped to a single, searchable place
18:52:57 but that's a whole different discussion, i think :)
18:53:04 yeah, we can look when that part of things is implemented...
18:53:05 yeah
18:53:36 at some point, there are some plans to start running some cloud tests in taskotron, but I have no idea what the real plans on that front are
18:54:00 that would have it hooking into openstack? or I guess it's too early to say
18:54:24 possible, but I don't think that's required for the kinds of testing that I'm thinking of for an initial setup
18:54:30 ok.
18:54:36 Thanks for all the info tflink!
18:54:43 nirik: at first it'll be moving autocloud into taskotron, to run the same tests
18:54:44 i figure we'll cross that bridge if/when we get there
18:54:48 for cloud, anyways
18:54:50 aiui
18:55:08 ok, that needs virt at least...
but I guess you could do nested, perhaps.
18:55:21 if anyone has questions, feel free to ask - I'm usually around and am happy to help
18:56:01 thanks again tflink
18:56:04 #topic Open Floor
18:56:12 anyone have anything for open floor?
18:56:15 thank you tflink
18:56:16 nirik: that's been part of our design from the get-go for disposable clients
18:56:18 https://phab.qadevel.cloud.fedoraproject.org/w/taskotron/planning/disposable_clients_high_level_design/
18:56:40 question - when does the final freeze for infra start?
18:56:42 nirik, dotEast2015: no problem
18:56:45 tuesday, no?
18:57:18 I wanted to mention one thing: if I can find time, I am hoping to install a new gobby instance with the new/current version. Just a slight heads up - if I do, you will all need to install gobby-0.5
18:57:22 threebean: tuesday
18:57:37 2015-10-13 f23 final freeze
18:57:46 cool, cool.
18:57:56 wow, getting chilly already
18:58:11 * nirik shivers
18:58:19 nirik: is gobby-0.5 in F22/23 repos already, or will we need to ping the maintainer?
18:58:49 puiterwijk: it's already in. But we should talk to tyll and lmacken and see if they can drop the old one and rename the new one sometime
18:59:05 Okay
18:59:19 The new one can also do ssl cert auth... if we wanted, we could use our koji certs to auth? or is that too weird / not a good use for them?
18:59:41 I think that's going to confuse a lot of people
18:59:55 also the new one has plugins and a bunch of better features... undo, logs of changes you can play back, etc.
19:00:00 I would personally not be in favor of using cert auth with these kinds of tools
19:00:21 ok. I think the password is kinda silly, but we need something to keep spammers out
19:01:19 well, I'll sort something out...
19:01:33 perhaps we could try leaving it open and see if it's abused before locking it down again
19:01:45 I don't know how many spammers would install a client
19:02:16 we could also write an openid/auth plugin perhaps, but plugins are all c++ :)
19:02:36 Well, that would be doable I guess
19:02:49 anyhow, can investigate out of meeting. ;)
19:02:53 Thanks for coming everyone!
19:02:57 #endmeeting