#fedora-meeting log

18:00:02 <nirik> #startmeeting Infrastructure (2016-10-20)
18:00:02 <zodbot> Meeting started Thu Oct 20 18:00:02 2016 UTC.  The chair is nirik. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:00:02 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
18:00:02 <zodbot> The meeting name has been set to 'infrastructure_(2016-10-20)'
18:00:02 <nirik> #meetingname infrastructure
18:00:02 <zodbot> The meeting name has been set to 'infrastructure'
18:00:02 <nirik> #topic aloha
18:00:02 <nirik> #chair smooge relrod nirik abadger1999 lmacken dgilmore threebean pingou puiterwijk pbrobinson
18:00:02 <zodbot> Current chairs: abadger1999 dgilmore lmacken nirik pbrobinson pingou puiterwijk relrod smooge threebean
18:00:03 <nirik> #topic New folks introductions
18:00:15 <clime> Hey
18:00:17 <puiterwijk> hi
18:00:18 <pingou> o/
18:00:22 <jcline> Hello
18:00:25 <wind85__> hi
18:00:30 <pcreech> hi
18:00:36 * threebean \ó/
18:01:12 <nirik> morning everyone
18:01:13 <athos> hello :)
18:01:35 <marc84> hi
18:02:16 <nirik> any new folks want to give a short one line introduction?
18:03:15 <nirik> ok, will head on to status info...
18:03:27 <smooge> hello
18:03:30 <nirik> #topic announcements and information
18:03:30 <nirik> #info starting to migrate dbs in stg to new pgbdr pair of servers - kevin
18:03:31 <nirik> #info closed/moved/upstreamed a bunch of old tickets - kevin
18:03:31 <nirik> #info askbot load issues, due to a session db table - kevin/patrick
18:03:31 <nirik> #info sign-vault03 re-installed, some more sigul config to come - kevin/patrick
18:03:32 <nirik> #info kevin out next week 2016-10-26 to 2016-10-30 - kevin
18:03:33 <nirik> #info jenkins outage friday - kevin
18:03:35 <nirik> #info bodhi 2.3 in staging - randy
18:03:40 <nirik> anything there folks want to discuss more, or add to?
18:04:16 <smooge> when is our next outage?
18:04:28 <nirik> probably after f25 release...
18:04:33 * doteast here
18:04:34 <smooge> ok no problem
18:04:40 <nirik> I don't see anything urgent pending currently
18:04:47 <clime> #info copr-keygen upgraded to f24 - clime
18:04:58 <pingou> #info FAS3 plans/roadmap sent to the infra list
18:05:14 <nirik> both nice. ;)
18:05:14 <pingou> #info new git hook for alternative-arch people up for review/comment on the infra list
18:05:53 <wind85__> cool :) ...
18:06:02 <nirik> I'd like to mail all users who are TRAC_ADMIN in any trac instances and tell them to look at migrating... but I think I'll wait for the next pagure-importer release before doing so
18:06:40 * pingou looks forward to the next pagure release
18:07:27 <nirik> me too. ;) (dunno whats in it, but they all have been nice... rapid development)
18:08:01 <pingou> well, I have a few PRs I'd like updated/merged but I was hoping to do it this week, seems less likely today :(
18:08:15 <nirik> so this next topic came up the other day in fedora-noc, and I thought I would see if we could discuss it more in meeting, but not sure everyone we might want is around...
18:08:21 <nirik> #topic change fedmsg setup to handle more hosts - patrick
18:08:30 <nirik> puiterwijk: can you give some background on this issue?
18:09:01 <puiterwijk> Sure
18:09:07 <puiterwijk> so, currently we have all hosts listening to all hosts
18:09:24 <puiterwijk> This was to avoid brokers (which would be a SPOF without clustering)
18:09:57 <puiterwijk> The problem however is that for all these connections, processes have LOTS of sockets open. We're now to the point where for some services that's making services touch (but just not cross) the max-file-descriptors limit
18:10:09 <nirik> and each host has N endpoints too right? so it's hosts X endpoints X hosts ?
18:10:16 <puiterwijk> Yep
18:10:52 <puiterwijk> So, we hit upon this with bodhi-backend03, where together with another bug, it crossed the file descriptor limit, and had to be restarted every day
18:11:13 <nirik> that default is 1024?
18:11:15 <puiterwijk> I fixed the other bug, but still, the number of open FDs is concerning, as it will cross the threshold if we keep increasing the number of hosts (which we will)
18:11:17 <jcline> So this is a question of ignorance, but would the solution be to be less aggressive with the meshing? That is, services only connect to other services they need messages from rather than everyone?
18:11:30 <puiterwijk> nirik: yes. We could increase that, but that's only temporary until we get to the next limit
18:12:19 <puiterwijk> jcline: I think that that's what we should be doing yes, but that would be tricky to get right. There's still some hosts that will need a full mesh from them, but we can just raise the limit and keep a watch on those
18:12:23 <nirik> sure, but could be a short term option while we think of a longer term plan. ;)
18:12:28 <pingou> puiterwijk: what would be the maximum maximum we could put?
18:12:34 <puiterwijk> (those are the datanommer boxes. They need a full mesh, since they need to collect data from all boxes)
18:12:55 <jcline> puiterwijk, makes sense.
18:12:59 <threebean> for info:  looks like we have 451 tcp endpoints now (by checking on batcave01).
18:13:04 <puiterwijk> pingou: 4096 is the default hard limit. We could theoretically increase that to 2^31 (if I recall correctly), but above some point the kernel is not always stable
18:13:24 <puiterwijk> threebean: right. The problem is that for every socket, there's at least 2 file handles: the socket handle and an eventfd
18:13:29 * threebean nods
18:13:30 <threebean> correct.
18:13:33 <puiterwijk> So that needs to be doubled
18:13:39 <puiterwijk> Which brings us to 902.
18:13:41 <pingou> 2^31=2147483648 so that gives us some space b/w that and 4096
18:13:43 <puiterwijk> Which is... getting close
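A small sketch of the arithmetic in the lines above: 451 endpoints at two file handles each (socket handle plus eventfd), compared with the 1024 limit mentioned earlier in the discussion. The function names are hypothetical, and the estimate ignores any other files a process keeps open.

```python
def estimated_fds(endpoints, handles_per_socket=2):
    """Estimate descriptors used by fedmsg connections alone.

    Assumes one socket handle plus one eventfd per endpoint, as
    described in the discussion; other open files are not counted.
    """
    return endpoints * handles_per_socket


def headroom(endpoints, soft_limit=1024):
    """Descriptors left before the soft limit is reached."""
    return soft_limit - estimated_fds(endpoints)


print(estimated_fds(451))   # 902
print(headroom(451))        # 122
print(headroom(451, 2048))  # 1146
```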
18:13:45 <nirik> is there a common case also with a service/thing that only listens to itself, doesn't care about others?
18:13:51 <puiterwijk> pingou: correct. So we could do that.
18:14:02 <puiterwijk> nirik: no, I don't think we have a lot of that
18:14:27 <puiterwijk> pingou: but at some point we're also going to hit other limits. For the time being it's not pressing, just wanted to make sure we realize this
18:14:40 <pingou> puiterwijk: thanks for that
18:14:46 <puiterwijk> I think we should maybe add a nagios check on the fedmsg-hub processes
18:14:50 <nirik> yeah, we don't need to solve this today... but be thinking about it. ;)
18:14:58 <pingou> most apps are more sending than listening no?
18:15:06 <puiterwijk> pingou: yep
18:15:25 <puiterwijk> nirik: the thing we do need sooner rather than later, in my opinion, is the nagios checks for FD limit in fedmsg-hub :)
18:15:44 <puiterwijk> That shouldn't be difficult, but would at least warn us before it becomes a real problem
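A rough sketch of how such a check could work on Linux, assuming /proc is readable for the target process. It is not the check the team deployed. The function names and the 80%/90% thresholds are assumptions; the return codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
import os

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def count_open_fds(pid):
    """Count the entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def soft_nofile_limit(pid):
    """Read the soft 'Max open files' value from /proc/<pid>/limits.

    Returns None if the limit is unlimited or the line is missing.
    """
    with open(f"/proc/{pid}/limits") as fh:
        for line in fh:
            if line.startswith("Max open files"):
                soft = line.split()[3]
                return None if soft == "unlimited" else int(soft)
    return None


def classify(open_fds, limit, warn=0.80, crit=0.90):
    """Map descriptor usage to a Nagios status code and message."""
    usage = open_fds / limit
    message = f"{open_fds}/{limit} file descriptors in use ({usage:.0%})"
    if usage >= crit:
        return CRITICAL, "CRITICAL: " + message
    if usage >= warn:
        return WARNING, "WARNING: " + message
    return OK, "OK: " + message
```

With the numbers from the meeting, classify(902, 1024) lands in the warning band, while the same count against a 2048 limit is reported as OK. A real plugin would look up the fedmsg-hub PID, call count_open_fds and soft_nofile_limit for it, print the message, and exit with the status code.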
18:16:00 <nirik> sure, and perhaps bump it to 2048 too?
18:16:09 <puiterwijk> Sure.
18:16:52 <puiterwijk> That would make sure that we avoid the problem for a while, as I think it'll be some time before we double our infra
18:17:04 <puiterwijk> (unless we move everything to containers in the next month)
18:17:07 * puiterwijk hides
18:17:17 <nirik> right. and it should be easy to just do in the fedmsg-hub systemd unit...
18:17:38 <puiterwijk> I think it is, yes. Otherwise I have two lines of python that do that we can insert into fedmsg-hub
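The log does not show the Python snippet puiterwijk mentions. As a minimal sketch of what raising the limit from inside a process could look like, the following uses the standard-library resource module. The function name is hypothetical and the 2048 target is taken from the discussion above.

```python
import resource


def raise_nofile_soft_limit(target=2048):
    """Raise the soft open-files limit towards `target`.

    The soft limit is never lowered and never set above the hard
    limit. Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft, hard  # already unlimited, nothing to do
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # unprivileged processes cannot exceed hard
    if soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print(raise_nofile_soft_limit(2048))
```

Calling this early in process startup, before connections are opened, would apply the higher limit on hosts where the service manager cannot set it.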
18:17:58 <nirik> of course we still have some rhel6 stuff. ;( so yeah, whatever works
18:18:21 <puiterwijk> I think just adding to fedmsg-hub might be easiest to get it everywhere
18:18:46 <nirik> sure.
18:19:02 <nirik> #info everyone will be thinking on longer term solutions when/if we need them.
18:19:04 <puiterwijk> So, anyone opposed to 1. increasing the limit in fedmsg-hub and 2. adding nagios, speak up now
18:19:11 <jcline> So fedmsg-hub connects to every fedora service and services that want to consume any message start fedmsg-hub and get everything?
18:19:15 <nirik> #info in the mean time we will add a nagios check and increase limit to 2048
18:19:30 <nirik> jcline: yeah.
18:19:48 <nirik> every endpoint of every service... some have just 1 or 2, some have more
18:19:55 <jcline> Okay, thanks. That's something I had wondered about.
18:20:08 <puiterwijk> jcline: look at our endpoints.py file on hosts :)
18:20:59 <nirik> so, related to this... I see a bunch of connections that are for fedmsg, but not connecting... perhaps we should clean those up too.
18:20:59 <jcline> So longer term apps need to be choosier about where they consume from rather than asking for the firehose
18:21:25 <puiterwijk> jcline: probably, yes, but as said, that's tricky to get setup
18:22:11 <smooge> I think it is 'easy' to do in the simple case but quickly becomes hard/impossible as you end up with dependencies between services
18:23:28 <nirik> I guess the things I see not connecting are bodhi03/04 (which shouldn't be running fedmsg-hub, so not sure why they are being connected to)
18:23:57 <puiterwijk> nirik: I think they should not be running -hub, but they should be running -relay.
18:24:03 <puiterwijk> Since they do fire off new messages
18:24:15 <nirik> they do?
18:24:24 <puiterwijk> Bodhi fires off messages for new updates etc..
18:24:38 <nirik> thats the backend tho...
18:24:50 <bowlofeggs> i think the frontend does updates right?
18:24:54 <puiterwijk> No? New update submissions is issued by frontend
18:24:55 <bowlofeggs> doesn't the frontend run the API?
18:24:59 <puiterwijk> Yes
18:25:09 <bowlofeggs> also, comments on bodhi tickets probably send messages, which is also frontend
18:25:18 <puiterwijk> the backend only does the pushing etc and handles messages. frontend fires the actions from in its UI
18:25:23 <nirik> well, they are not running fedmsg-hub or relay currently
18:25:39 <puiterwijk> nirik: right. Because this probably uses the other stuff where the relay runs right in the apache process
18:26:00 <bowlofeggs> that sounds correct to me, but i'm not 100% sure
18:26:02 <nirik> ok, but then things shouldn't try and connect to its endpoints... since nothing is listening there
18:26:17 <bowlofeggs> yeah i don't think the frontend listens
18:26:26 <puiterwijk> Yes, it does, because that's how fedmsg works..
18:26:41 <bowlofeggs> oh just bidirectional?
18:26:42 <puiterwijk> fedmsg consumers connect to producers
18:26:47 <puiterwijk> and bodhi web is a producer
18:26:56 <bowlofeggs> right
18:27:04 <bowlofeggs> so lots 'o connections
18:27:13 <nirik> well, something seems wrong there to me, but we can poke at it out of meeting
18:27:17 <puiterwijk> However, I have not yet checked which producing system bodhi uses
18:27:28 <puiterwijk> nirik: I'll look how it is sending messages.
18:27:51 <nirik> thanks.
18:28:00 <nirik> ok, anything more on fedmsg for now?
18:28:48 <nirik> #topic Apprentice Open office hours
18:28:59 <nirik> any apprentices with questions, comments, looking for work, etc? :)
18:29:22 <odin2016> Not yet.... ;)
18:29:33 <wind85__> Yes I was starting to look at the easyfix tickets and there are a few SOP to be made...
18:30:02 <doteast> I'm looking to do something fun and new
18:30:17 <wind85__> They are quite easyfix really... Though at the same time, a bit hard to get the documentation for the service in question...
18:30:42 <wind85__> Well mainly cos' this is my third meeting and I can't access the servers as well...
18:30:50 <nirik> wind85__: yeah... do ask in #fedora-admin or post to the list if you need more info... someone can answer and provide the info you see
18:30:52 <nirik> seek
18:31:12 <wind85__> nirik: alright then...
18:31:40 <nirik> doteast: fun _and_ new? hummm.... :)
18:32:03 <doteast> or old and boring :) works either way :)
18:32:13 <wind85__> :)
18:32:39 <saunind> i have a question about inventory/group_vars and patches from git
18:32:52 <nirik> doteast: ha. ok, will try and think of something for you....
18:33:01 <nirik> saunind: fire away.
18:33:10 <doteast> nirik, thank... I'm sure it will be fun
18:33:18 <doteast> *thanks
18:33:22 <saunind> Should i git clone https://infrastructure.fedoraproject.org/cgit/ansible.git/ this git repo
18:33:45 <saunind> and then git format-patch when i edit group_var files? right?
18:33:58 <nirik> yep. And then send them to the list. ;)
18:34:20 <nirik> someday before too long we hopefully will have a way for people to submit PR's... which might be faster/easier.
18:34:28 <saunind> ok thx nirik
18:35:03 <saunind> what is PR's?
18:35:06 <nirik> np. :)
18:35:10 <nirik> Pull Request.
18:35:18 <saunind> I have one more question )
18:35:20 <nirik> It's how github type projects (and pagure) do changes
18:35:25 <saunind> And one more)
18:35:43 <nirik> https://help.github.com/articles/about-pull-requests/
18:35:47 <nirik> sure, ask away
18:36:05 <saunind> How to find a mentor/sponsor?))
18:36:50 <wind85__> saunind: thanks I wanted to ask as well :) ...
18:36:55 <doteast> PRs are not working?! I thought this is one of the kills of pagure...
18:37:22 <doteast> oh, sorry, you want easier/faster
18:37:42 <nirik> saunind: well, we don't usually do direct mentoring... but you can ask anyone in the group or list or #fedora-admin your questions and someone whoever is around will answer...
18:37:53 <nirik> doteast: we don't have PR's for the ansible repo (yet)
18:38:30 <doteast> yeah, sorry... I was thinking of something else... as I recall it was the git syncing issue as I suppose
18:39:16 <nirik> yeah. we want our copy to be the main one, so we don't depend on another location...
18:39:22 <nirik> but we will try and work something out...
18:39:27 <nirik> #topic Open Floor
18:39:55 <nirik> anyone have anything for open floor? questions, comments, favorite bluetooth internet of things device?
18:40:34 <puiterwijk> nirik: all of the devices. They're all so broken :)
18:41:14 <clime> quantum computer pad
18:41:24 <clime> is pretty cool
18:41:44 <nirik> I liked the candle one. ;)
18:42:05 <clime> it gives the correct answer for 5+8 in 1/20 cases
18:42:14 <nirik> ha
18:42:15 <clime> but it is a quantum computer!!! :)
18:42:22 <doteast> smart dust, not bluetooth tho
18:43:35 <nirik> oh, one other thing. I will be gone next week... smooge / puiterwijk: can one of you run the meeting? or someone else if they want...
18:43:51 <smooge> I can run meeting next week
18:44:03 <smooge> it will be very quick
18:44:10 <nirik> :)
18:44:11 <nirik> thanks
18:44:13 <puiterwijk> Thanks smooge :)
18:44:20 <doteast> :)
18:44:29 <nirik> ok, thanks for coming everyone. Do continue over in #fedora-admin, #fedora-apps and #fedora-noc
18:44:30 <smooge> openmeeting/heylookniriksnothere/closemeeting
18:44:31 <puiterwijk> and, thanks nirik for chairing today
18:44:32 <nirik> #endmeeting