18:00:05 <nirik> #startmeeting Infrastructure (2015-10-08)
18:00:05 <zodbot> Meeting started Thu Oct  8 18:00:05 2015 UTC.  The chair is nirik. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:00:05 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
18:00:06 <nirik> #meetingname infrastructure
18:00:06 <zodbot> The meeting name has been set to 'infrastructure'
18:00:06 <nirik> #topic aloha
18:00:06 <nirik> #chair smooge relrod nirik abadger1999 lmacken dgilmore mdomsch threebean pingou puiterwijk pbrobinson
18:00:06 <zodbot> Current chairs: abadger1999 dgilmore lmacken mdomsch nirik pbrobinson pingou puiterwijk relrod smooge threebean
18:00:06 <nirik> #topic New folks introductions / Apprentice feedback
18:00:12 <nirik> morning everyone.
18:00:14 * roshi is here
18:00:16 <puffi> Morning
18:00:24 <smdeep> morning
18:00:43 <nirik> Will give a few minutes for any new folks to give a short one line introduction and for any apprentices with questions or comments.
18:01:26 * dotEast2015 is here
18:01:29 <puiterwijk> Here
18:01:43 * aikidouke here
18:01:43 <pcreech|work> just wanted to let you guys know I should be getting active again
18:01:55 <nirik> hey pcreech. Welcome back.
18:02:07 * threebean is here
18:02:07 <smdeep> .hellomynameis smdeep
18:02:08 <pcreech|work> had an external commitment that took up a lot of extra time
18:02:09 <zodbot> smdeep: smdeep 'Sudeep Mukherjee' <smdeep@gmail.com>
18:02:17 <pcreech|work> thanks nirik!
18:02:27 <puffi> I'm Brian, looking to join in with the Fedora infra team, focusing more on the sysadmin type stuff. Hope to get involved with some system performance work.
18:02:50 <threebean> puffi: that would be great :)
18:02:59 <abhiii5459_> Hi, abhiii5459_ from India here. Newbie. This is my first fedora-infra meeting. Looking to contribute to Fedora (abhi.darkness@gmail.com | Github- abhiii5459)
18:03:02 <jflory7> My name is Justin - for now, I'm quietly observing, not sure if I can contribute much now but I'm interested in learning more.
18:03:32 <nirik> cool. :) Welcome all the new folks. ;)
18:03:52 <nirik> do see: https://fedoraproject.org/wiki/Infrastructure/GettingStarted
18:04:12 <nirik> and feel free to ask questions as we go or after in #fedora-admin #fedora-apps or #fedora-noc
18:04:31 <puffi> Great, Thanks nirik
18:04:33 <abhiii5459_> Will do :) Thank you!
18:04:50 <nirik> Any sysadmin types interested in our apprentice group can see me after the meeting over in #fedora-admin and I can get you added. ;)
18:05:07 <nirik> For developers we have #fedora-apps folks able to help point you to easyfix tickets.
18:05:51 <nirik> Great, any other new folks? or apprentices with questions?
18:06:32 <nirik> alright, lets go on to status/info
18:06:41 <nirik> #topic announcements and information
18:06:41 <nirik> #info Some more inconclusive nfsv4 testing, ticket open with netapp now - kevin
18:06:42 <nirik> #info Moved virthost15 to rhel7 without too much pain - kevin
18:06:42 <nirik> #info Moved a smtp-mm host from coloamer01 to osuosl - kevin
18:06:42 <nirik> #info Mass update/reboot cycle done - kevin, smooge, patrick
18:06:43 <nirik> #info Cloud updated/rebooted - patrick
18:06:44 <nirik> #info Fixed mirrorlist servers to not log all hits to log01 - kevin
18:06:46 <nirik> #info Lots of debugging on proxy load issues, still in progress - kevin, patrick
18:06:48 <nirik> #info migrated taskotron-dev to f22, stg is soon to follow - tflink
18:06:53 <nirik> anything in there anyone would like to expand on or add?
18:07:13 <nirik> I'll note that I (and today tflink) have been the only ones updating... ;) Nag nag. :)
18:07:43 <nirik> ok, I had a discussion item:
18:07:45 <puiterwijk> #info Busy backporting Ipsilon fixes. Hopefully lots of bugs fixed after updating later today
18:07:46 <nirik> #topic high cpu load on proxies - kevin / patrick
18:08:06 <nirik> so, let me describe the issue and see if folks have any ideas to track it down. ;)
18:08:08 <threebean> :)
18:08:27 <nirik> We have a bunch of proxies. Some of them are "bigger" than others (more cpus, memory, etc).
18:08:43 <nirik> Some of the smaller ones have started having very high load of late.
18:09:06 <nirik> In particular proxy08. It has 4 cpus and they pretty much get pegged.
18:09:14 <nirik> It's not i/o related.
18:09:18 <nirik> It's not memory related.
18:09:29 <nirik> It's apache taking up too much cpu.
18:09:53 <nirik> https://admin.fedoraproject.org/collectd/bin/graph.cgi?hostname=proxy08.fedoraproject.org;plugin=load;type=load;begin=-50000
18:09:59 <nirik> You can see the load there.
18:10:18 <nirik> We thought it might be related to ssl handling, so we tried some things to mitigate that... but they didn't seem to help much
18:10:21 <puffi> nirik: Is it a case of "bullying" from a load perspective, or how are the connections distributed between the proxies?
18:10:34 <nirik> puffi: it's round robin dns.
18:10:40 <nirik> By geoip region
18:10:56 <nirik> so if you do a 'host fedoraproject.org' you can see all the proxies in your region
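[A small sketch of the round-robin-within-a-geoip-region behavior nirik describes above. The region names and proxy pools here are invented for illustration, not the real fedoraproject.org zone data, which is generated elsewhere.]

```python
# Toy model of geoip round-robin DNS: clients in a region are handed
# the proxies for that region in rotating order, spreading load.
import itertools

# Made-up mapping of geoip region -> proxy pool (illustrative only).
REGION_PROXIES = {
    "NA": ["proxy01", "proxy04", "proxy10"],
    "EU": ["proxy02", "proxy03", "proxy08"],
}

def proxy_cycle(region):
    """Hand out a region's proxies round-robin, mimicking what a
    rotating DNS answer does for successive clients."""
    return itertools.cycle(REGION_PROXIES[region])

eu = proxy_cycle("EU")
first_three = [next(eu) for _ in range(3)]
```

[This is also why dropping one proxy from DNS spikes the others, as noted below: the same pool is simply cycled among fewer members.]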
18:11:07 <nirik> https://admin.fedoraproject.org/collectd/bin/index.cgi?hostname=proxy01.phx2.fedoraproject.org&hostname=proxy02.fedoraproject.org&hostname=proxy03.fedoraproject.org&hostname=proxy04.fedoraproject.org&hostname=proxy05.fedoraproject.org&hostname=proxy06.fedoraproject.org&hostname=proxy07.fedoraproject.org&hostname=proxy08.fedoraproject.org&hostname=proxy10.phx2.fedoraproject.org&hostname=proxy11.fedoraproject.org&plugin=load&timespan=86400&action=show_selection&ok_button=OK
18:11:34 <nirik> but you can see the load on them all there... and when we dropped proxy08 out the others saw a load spike
18:11:46 <nirik> (which makes sense as they would take over the connections)
18:12:23 <nirik> so, we need some way to figure out why apache is so cpu bound on those hosts.
18:13:20 <roshi> would swapping out apache to something like nginx help? (I imagine that's non-trivial though)
18:13:53 <pcreech|work> nirik: does load increase/decrase with respect to increased/decreased traffic?
18:14:01 <puiterwijk> nirik: I've heard that it does, but as you said, it's very difficult since all of our config needs to be rewritten
18:14:48 <nirik> pcreech|work: well, I think so, but we don't have a fine-grained tool there. It's either (no load / out of dns) or (all load / in dns)
18:15:00 <smdeep> .host fedoraproject.org
18:15:05 <nirik> we did actually have it in with just an ipv6 ip... and it handled that just fine.
18:15:13 <nirik> but there's a lot less ipv6 traffic in general
18:15:13 <aikidouke> would running strace on the apache pid tell us anything?
18:15:37 <nirik> aikidouke: I tried that... it just seems like they are handling requests...
18:15:47 <nirik> I didn't see much stalling there...
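[One way to get more out of strace than eyeballing the request stream is `strace -c -p <apache pid>`, which prints a per-syscall time summary. Below is a hedged sketch of digesting that summary; the sample text and numbers are invented for illustration, not real proxy08 data.]

```python
# Parse an `strace -c` style summary and return the syscalls that ate
# the most time. SAMPLE is a fabricated example of the output format.
SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
 61.02    0.421000          12     35083           read
 22.10    0.152500           9     16944           writev
 16.88    0.116500           7     16640           poll
"""

def top_syscalls(summary, n=2):
    """Return the n (syscall, % time) pairs with the highest share."""
    rows = []
    for line in summary.splitlines()[1:]:     # skip the header row
        parts = line.split()
        if parts:
            rows.append((parts[-1], float(parts[0])))
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows[:n]
```

[If user-space CPU rather than syscalls dominates, a profiler like `perf top -p <pid>` would likely tell more than strace, since strace only sees kernel crossings.]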
18:15:53 <aikidouke> hmm
18:16:13 <avij> does the change in cpu load correlate to a time when some package was updated on the proxy? I'm thinking of some sort of a regression somewhere.
18:16:40 <nirik> avij: doesn't seem to, or if so, it was long ago... this has been happening a while now.
18:17:13 <nirik> https://admin.fedoraproject.org/collectd/bin/index.cgi?hostname=proxy08.fedoraproject.org&plugin=load&timespan=31622400&action=show_selection&ok_button=OK
18:17:15 <avij> ok. I'm browserless at the moment, I wasn't able to see the graphs.
18:17:19 <nirik> hum, it does seem to go up in aug
18:17:51 <avij> when was the last httpd update?
18:18:25 <aikidouke> is it a rhel 6 or 7 box?
18:18:28 <nirik> but it was spiking up before then
18:18:33 <nirik> 7
18:19:11 <nirik> anyhow, I don't want to take the entire meeting on this, but wanted to throw it out there if anyone has ideas. ;)
18:19:29 <nirik> so, if you do, let me know in #fedora-admin / #fedora-noc
18:20:16 <nirik> does anyone want to look at some tickets today? Or should we move on to tflink talking about taskotron? ;)
18:21:07 <puffi> What testing are you guys doing with NFSv4? Is it kerberos/ACL/perf or
18:21:23 <nirik> puffi: we are currently using almost 0 nfsv4.
18:21:35 <nirik> we are trying to switch things to it, but hit a netapp issue...
18:21:36 <puffi> nirik: Ah it was just an agenda item
18:21:45 <nirik> oh, info from before? yeah...
18:22:09 <nirik> nfsv4 mounts hang for us. :) We are going to work with netapp folks to provide a tcpdump of it...
18:22:24 <smooge> I want say that you guys did some great detective work yesterday on the proxies
18:22:36 <nirik> smooge: yeah, but sadly none of it helped. ;)
18:22:46 <puiterwijk> nirik: well, it didn't help enough for proxy08 :)
18:23:06 <puiterwijk> It did increase the security of our connections, so I wouldn't say it didn't do anything at all :-)
18:23:26 <nirik> sure, true.
18:23:50 <nirik> tflink: ok, you ready?
18:23:59 <tflink> yep
18:24:01 <nirik> #topic Learn about: taskotron - tflink
18:24:05 <nirik> take it away... ;)
18:24:33 <tflink> I'm planning to talk a bit about Taskotron but more from an infra POV - how it's deployed, etc.
18:25:24 <tflink> Taskotron is our system for running automated tasks. We use tasks instead of tests purposely  - the system is designed to be more flexible than just a test automation system, even if that is the primary driver right now
18:25:33 <tflink> At its core, taskotron is a system that does stuff and reports results in response to trigger events. For the fedora case, this starts with fedmsgs and will also end with fedmsgs signalling results, shortly after F23 Final freeze ends.
18:26:18 <tflink> A high-level conceptual diagram of flow in a full Taskotron deployment: http://tirfa.com/images/taskbot-overview.png
18:27:35 <tflink> That diagram helps some to reinforce the concept that Taskotron isn't a single thing - it's many moving parts working together in a coherent system that's capable of the end-to-end "run automated tasks at scale"
18:27:56 <tflink> At the core of this is libtaskotron: it is the code which runs tasks and the part of Taskotron which users will use most often, even if indirectly. By design, users can use libtaskotron to run tasks on their local systems without setting up the rest of Taskotron, but this is not the system as a whole.
18:28:10 <tflink> upstream git: https://bitbucket.org/fedoraqa/libtaskotron
18:28:18 <tflink> docs: https://docs.qadevel.cloud.fedoraproject.org/libtaskotron/latest/index.html
18:29:09 <tflink> for the other major components; resultsdb is a RESTful results database used in taskotron and a couple other projects.  execdb tracks task execution status from trigger, through execution and results reporting. It provides a single reference point from which all information about a task is available.
18:29:20 <tflink> taskotron-trigger sits on fedmsg-hub and looks for specific fedmsgs. The code is rather simplistic but a rewrite is planned once a few more deliverables are in place.
18:30:02 <tflink> We make heavy use of buildbot - it is the part of a Taskotron deployment which is responsible for delegating tasks to worker nodes and retrieving results. From a high conceptual level, Taskotron isn't that different from many forms of CI and it would have been silly for us to reinvent that wheel when there were good projects already in place.
18:30:15 <tflink> before I get into deployment, any questions?
18:30:16 <abhiii5459_> The RuleEngine, is it something that needs periodic addition of "rules"? Is it fairly exhaustive? Alternatively, does it add new rules to itself?
18:30:54 <nirik> are the nodes all vms? or there's a mix of vms and bare metal? it doesn't deploy nodes right? just uses existing ones?
18:30:57 <tflink> abhiii5459_: for now, yes. we chose to keep it simple for now until we had a better idea of what folks would use it for and had a definite need for it
18:31:08 <tflink> er, yes, it does need periodic maintenance
18:31:23 <abhiii5459_> tflink : okie :)
18:31:27 <tflink> it does not add rules to itself - right now, all rules are hardcoded but that'll be changing in the medium term
18:31:42 <tflink> https://bitbucket.org/fedoraqa/taskotron-trigger
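[Conceptually, a hardcoded trigger rule is just "this fedmsg topic means run that task". The sketch below is an invented toy version of that idea; the real rules live in the taskotron-trigger repo linked above and look different.]

```python
# Toy trigger rules: match a fedmsg topic suffix, schedule a task.
# Topic and task names are illustrative, not the production rule set.
RULES = [
    ("buildsys.build.state.change", "rpmlint"),
    ("bodhi.update.request.testing", "depcheck"),
]

def tasks_for(msg):
    """Return the tasks a fedmsg would trigger under these toy rules."""
    topic = msg.get("topic", "")
    return [task for suffix, task in RULES if topic.endswith(suffix)]
```

[Making rules data like this, rather than code, is roughly what the planned rewrite would enable: new rules without patching the trigger.]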
18:31:57 <tflink> nirik: for the moment, everything is a VM but that will be changing soon
18:32:46 * nirik nods. ok
18:33:46 <tflink> Also - one of the key points of Taskotron is that we're looking to make automation easier for contributors - I don't pretend to be an expert on the kernel, libvirt, gnome etc., so the idea is to make it easier for contributors to write their own tasks, and we make sure those tasks run smoothly
18:34:01 * tflink assumes no questions, moves on to deployment
18:34:13 <tflink> In terms of deployment, there are 4 basic "machine types" for our Taskotron deployments: master, client, resultsdb and virthost. The master contains the buildbot buildmaster, task artifacts and taskotron-trigger. resultsdb machines contain resultsdb and execdb. virthosts are (surprise, surprise) bare metal machines whose primary responsibility is to run virtual machines.
18:35:25 <tflink> When we get down to clients, I want to describe 2 different things - the way that we have been doing clients and the way that we will be doing clients by the end of the year.
18:35:36 <tflink> Currently all of our clients are long-running virtual machines, each with a single buildslave which takes tasks from the master. However, that is about to change as we get disposable clients into place.
18:36:15 <tflink> That being said, our next big feature is something that we're calling 'disposable clients' to spawn new VMs for each task executed in production - this guarantees a clean and reproducible execution environment for each and every task, regardless of what was executed previously on that client.
18:36:40 <tflink> however, it also increases complexity by at least an order of magnitude
18:36:45 <nirik> nice. Yeah.
18:37:31 <tflink> all those VMs will be relatively plain vanilla VMs - we chose not to use anything like openstack
18:38:47 <tflink> for 2 reasons - the primary being the scary complexity of openstack. the other gets back to one of our design philosophies - users need to be able to execute tasks locally and have that execution process match our production instances as closely as we can reasonably make it
18:38:55 <tflink> Looking purely at the changes we'll see in deployment, the change is relatively straightforward - instead of having VMs with buildslave processes, we'll have multiple users on dedicated bare-metal hardware. Each of those users will run a buildslave process and have permissions to manage virtual machines on the local host - when a new task comes in, that client will spin up a new VM, execute the task and extract execution information from that VM
18:38:55 <tflink> before destroying it.
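[The spin-up / execute / extract / destroy lifecycle described above maps naturally onto a context manager, which guarantees teardown even when the task fails. This is a sketch under stated assumptions: the spawn/destroy calls are stubs standing in for whatever libvirt-level tooling the disposable-client code actually uses.]

```python
# Sketch of the disposable-client lifecycle: one fresh VM per task,
# always destroyed afterwards so no state leaks into the next run.
import contextlib

@contextlib.contextmanager
def disposable_vm(spawn, destroy):
    vm = spawn()          # e.g. boot a clean cloud image (stubbed here)
    try:
        yield vm
    finally:
        destroy(vm)       # always reclaim, even if the task blew up

def run_task(task, spawn, destroy):
    """Run one task inside a throwaway VM and return its result."""
    with disposable_vm(spawn, destroy) as vm:
        return task(vm)
```

[The `finally` is the whole point: a crashed task can't leave a dirty VM behind to contaminate the next execution environment.]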
18:39:26 <tflink> the other change is that we'll likely be running Fedora on bare metal
18:39:54 <nirik> fair enough. this is after f23 release?
18:40:03 <tflink> yeah
18:40:22 <nirik> we were talking about moving the koji builders to 23 after f23 is out also
18:40:36 <tflink> we're getting close to merging in all the code we need but this all needs to have the ever living crap tested out of it before the feature makes it into production
18:41:06 <tflink> I know we're going to hit problems once we start using the disposable client code more but I'm not sure what those issues will be :)
18:41:17 <dgilmore> nirik: assuming we can test dnf :(
18:41:29 <nirik> dgilmore: right. ;)
18:41:45 <nirik> tflink: if you knew you would fix them before. ;)
18:42:17 <tflink> that's pretty much what I had planned to talk about, though
18:42:24 <tflink> nirik: exactly
18:42:52 <nirik> we also still need to sort out image building sometime...
18:43:08 <tflink> yeah, I need to respond to that email chain - was on my todo list for today
18:43:27 <tflink> it doesn't sound like the copr folks are all that interested, though
18:43:29 * nirik ponders. I wonder if ostree would help any here...
18:43:41 <nirik> yeah, they don't seem to mind much with the slower spin up times.
18:43:50 <tflink> I don't know enough about ostree to say, but I suspect not
18:44:23 <tflink> one reason why we're starting with VMs instead of docker is to provide a close-ish simulation of the end running environment
18:44:27 <nirik> yeah, you install things to test on the instances right?
18:44:31 <tflink> and I suspect that ostree would be a change
18:44:46 <tflink> that's one of the features coming with disposable clients, yes
18:45:56 <tflink> any other questions or comments?
18:46:21 <nirik> so, adding disposable clients is the big upcoming thing...
18:46:24 <nirik> are there plans after that?
18:46:29 <tflink> dist-git style tasks
18:46:45 * tflink looks for a link
18:46:56 <tflink> https://phab.qadevel.cloud.fedoraproject.org/w/taskotron/roadmap/
18:47:27 <nirik> that might be nice. We should also fold that into discussions about a frontend or dist-git change that lets people do PR's etc. (like pagure)
18:47:39 <tflink> the idea there is to enable something like dist-git for tasks to live in so that package maintainers can just drop some task formulas into a repo and have those run @ every commit, every build etc.
18:48:24 <nirik> yep. :)
18:48:26 <nirik> sounds great
18:48:34 <tflink> after that, not sure yet
18:49:16 <tflink> at some point, i want to be able to let folks submit their own tasks to be run on a fedmsg event of their choosing
18:49:31 <roshi> I suspect a lot of work will go into documentation and helping maintainers get tasks written
18:49:40 <tflink> agreed
18:50:08 <nirik> oooh... that could be handy/cool.
18:50:46 <puiterwijk> tflink: just wondering how those tasks will be sandboxed then.
18:51:14 <tflink> puiterwijk: the user submitted ones? some form of VM, not 100% sure of the details
18:51:36 <puiterwijk> Right, just saying that that should be considered before opening it up to user-submitted tasks :)
18:52:00 <nirik> sure, and also it should have logging and such... trust but verify
18:52:02 <tflink> there's been some debate on how much those VMs need to be locked down if we restrict FAS group access
18:52:36 * tflink is hoping to get elasticsearch deployed for disposable clients so that all the system and task logs are shipped to a single, searchable place
18:52:57 <tflink> but that's a whole different discussion, i think :)
18:53:04 <nirik> yeah, we can look when that part of things is implemented...
18:53:05 <nirik> yeah
18:53:36 <tflink> at some point, there are some plans to start running some cloud tests in taskotron but I have no idea what the real plans on that front are
18:54:00 <nirik> that would have it hooking to openstack? or I guess it's too early to say
18:54:24 <tflink> possible but I don't think that's required for the kinds of testing that I'm thinking of for an initial setup
18:54:30 <nirik> ok.
18:54:36 <nirik> Thanks for all the info tflink!
18:54:43 <roshi> nirik: at first it'll be moving autocloud into taskotron, to run the same tests
18:54:44 <tflink> i figure we'll cross that bridge if/when we get there
18:54:48 <roshi> for cloud, anyways
18:54:50 <roshi> aiui
18:55:08 <nirik> ok, that needs virt at least... but I guess you could do nested perhaps.
18:55:21 <tflink> if anyone has questions, feel free to ask - I'm usually around and am happy to help
18:56:01 <nirik> thanks again tflink
18:56:04 <nirik> #topic Open Floor
18:56:12 <nirik> anyone have anything for open floor?
18:56:15 <dotEast2015> thank you tflink
18:56:16 <tflink> nirik: that's been part of our design from the get-go for disposable clients
18:56:18 <tflink> https://phab.qadevel.cloud.fedoraproject.org/w/taskotron/planning/disposable_clients_high_level_design/
18:56:40 <threebean> question - when does the final freeze for infra start?
18:56:42 <tflink> nirik, dotEast2015: no problem
18:56:45 <tflink> tuesday, no?
18:57:18 <nirik> I wanted to mention one thing: If I can find time, I am hoping to install a new gobby instance. with the new/current version. Just a slight heads up if I do you will all need to install gobby-0.5
18:57:22 <nirik> threebean: tuesday
18:57:37 <nirik> 2015-10-13 f23 final freeze
18:57:46 <threebean> cool, cool.
18:57:56 <stickster> wow, getting chilly already
18:58:11 * nirik shivers
18:58:19 <puiterwijk> nirik: is gobby-0.5 in F22/23 repos already, or will we need to ping the maintainer?
18:58:49 <nirik> puiterwijk: it's already in. But we should talk to tyll and lmacken and see if they can drop the old one and rename the new one sometime
18:59:05 <puiterwijk> Okay
18:59:19 <nirik> The new one also can do ssl cert auth... if we wanted we could use our koji certs to auth? or is that too weird/not a good use for them?
18:59:41 <puiterwijk> I think that's going to confuse a lot of people
18:59:55 <nirik> also the new one has plugins and a bunch of better features... undo, logs of changes you can play back, etc.
19:00:00 <puiterwijk> I would personally not be in favor of using cert auth with these kinds of tools
19:00:21 <nirik> ok. I think the password is kinda silly, but we need something to keep spammers out
19:01:19 <nirik> well, I'll sort something out...
19:01:33 <nirik> perhaps we could try leaving it open and see if it's abused before locking it down again
19:01:45 <nirik> I don't know how many spammers would install a client
19:02:16 <nirik> we could also write an openid/auth plugin perhaps, but plugins are all c++ :)
19:02:36 <puiterwijk> Well, that would be doable I guess
19:02:49 <nirik> anyhow, can investigate out of meeting. ;)
19:02:53 <nirik> Thanks for coming everyone!
19:02:57 <nirik> #endmeeting