18:00:02 #startmeeting Infrastructure (2016-10-20)
18:00:02 Meeting started Thu Oct 20 18:00:02 2016 UTC. The chair is nirik. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:00:02 Useful Commands: #action #agreed #halp #info #idea #link #topic.
18:00:02 The meeting name has been set to 'infrastructure_(2016-10-20)'
18:00:02 #meetingname infrastructure
18:00:02 The meeting name has been set to 'infrastructure'
18:00:02 #topic aloha
18:00:02 #chair smooge relrod nirik abadger1999 lmacken dgilmore threebean pingou puiterwijk pbrobinson
18:00:02 Current chairs: abadger1999 dgilmore lmacken nirik pbrobinson pingou puiterwijk relrod smooge threebean
18:00:03 #topic New folks introductions
18:00:15 Hey
18:00:17 hi
18:00:18 o/
18:00:22 Hello
18:00:25 hi
18:00:30 hi
18:00:36 * threebean \ó/
18:01:12 morning everyone
18:01:13 hello :)
18:01:35 hi
18:02:16 any new folks want to give a short one line introduction?
18:03:15 ok, will head on to status info...
18:03:27 hello
18:03:30 #topic announcements and information
18:03:30 #info starting to migrate dbs in stg to new pgbdr pair of servers - kevin
18:03:31 #info closed/moved/upstreamed a bunch of old tickets - kevin
18:03:31 #info askbot load issues, due to a session db table - kevin/patrick
18:03:31 #info sign-vault03 re-installed, some more sigul config to come - kevin/patrick
18:03:32 #info kevin out next week 2016-10-26 to 2016-10-30 - kevin
18:03:33 #info jenkins outage friday - kevin
18:03:35 #info bodhi 2.3 in staging - randy
18:03:40 anything there folks want to discuss more, or add to?
18:04:16 when is our next outage?
18:04:28 probably after f25 release...
18:04:33 * doteast here
18:04:34 ok no problem
18:04:40 I don't see anything urgent pending currently
18:04:47 #info copr-keygen upgraded to f24 - clime
18:04:58 #info FAS3 plans/roadmap sent to the infra list
18:05:14 both nice. ;)
18:05:14 #info new git hook for alternative-arch people up for review/comment on the infra list
18:05:53 cool :) ...
18:06:02 I'd like to mail all users who are TRAC_ADMIN in any trac instances and tell them to look at migrating... but I think I'll wait for the next pagure-importer release before doing so
18:06:40 * pingou looks forward to the next pagure release
18:07:27 me too. ;) (dunno what's in it, but they all have been nice... rapid development)
18:08:01 well, I have a few PRs I'd like updated/merged but I was hoping to do it this week, seems less likely today :(
18:08:15 so this next topic came up the other day in fedora-noc, and I thought I would see if we could discuss it more in meeting, but not sure everyone we might want is around...
18:08:21 #topic change fedmsg setup to handle more hosts - patrick
18:08:30 puiterwijk: can you give some background on this issue?
18:09:01 Sure
18:09:07 so, currently we have all hosts listening to all hosts
18:09:24 This was to avoid brokers (which would be a SPOF without clustering)
18:09:57 The problem however is that for all these connections, processes have LOTS of sockets open. We're now to the point where for some services that's making services touch (but just not cross) the max-file-descriptors limit
18:10:09 and each host has N endpoints too right? so it's hosts X endpoints X hosts ?
18:10:16 Yep
18:10:52 So, we hit upon this with bodhi-backend03, where together with another bug, it crossed the file descriptor limit, and had to be restarted every day
18:11:13 that default is 1024?
18:11:15 I fixed the other bug, but still, the number of open FDs is concerning, as it will cross the threshold if we keep increasing the number of hosts (which we will)
18:11:17 So this is a question of ignorance, but would the solution be to be less aggressive with the meshing? That is, services only connect to other services they need messages from rather than everyone?
18:11:30 nirik: yes. We could increase that, but that's only temporary until we get to the next limit
18:12:19 jcline: I think that that's what we should be doing yes, but that would be tricky to get right. There's still some hosts that will need a full mesh from them, but we can just raise the limit and keep a watch on those
18:12:23 sure, but could be a short term option while we think of a longer term plan. ;)
18:12:28 puiterwijk: what would be the maximum maximum we could put?
18:12:34 (those are the datanommer boxes. They need a full mesh, since they need to collect data from all boxes)
18:12:55 puiterwijk, makes sense.
18:12:59 for info: looks like we have 451 tcp endpoints now (by checking on batcave01).
18:13:04 pingou: 4096 is the default hard limit. We could theoretically increase that to 2^31 (if I recall correctly), but above some point the kernel is not always stable
18:13:24 threebean: right. The problem is that for every socket, there's at least 2 file handles: the socket handle and an eventfd
18:13:29 * threebean nods
18:13:30 correct.
18:13:33 So that needs to be doubled
18:13:39 Which brings us to 902.
18:13:41 2^31=2147483648 so that gives us some space b/w that and 4096
18:13:43 Which is... getting close
18:13:45 is there a common case also with a service/thing that only listens to itself, doesn't care about others?
18:13:51 pingou: correct. So we could do that.
18:14:02 nirik: no, I don't think we have a lot of that
18:14:27 pingou: but at some point we're also going to hit other limits. For the time being it's not pressing, just wanted to make sure we realize this
18:14:40 puiterwijk: thanks for that
18:14:46 I think we should maybe add a nagios check on the fedmsg-hub processes
18:14:50 yeah, we don't need to solve this today... but be thinking about it. ;)
18:14:58 most apps are more sending than listening no?
18:15:06 pingou: yep
18:15:25 nirik: the thing we do need sooner rather than later, in my opinion, is the nagios checks for FD limit in fedmsg-hub :)
18:15:44 That shouldn't be difficult, but would at least warn us before it becomes a real problem
18:16:00 sure, and perhaps bump it to 2048 too?
18:16:09 Sure.
18:16:52 That would make sure that we avoid the problem for a while, as I think it'll be some time before we double our infra
18:17:04 (unless we move everything to containers in the next month)
18:17:07 * puiterwijk hides
18:17:17 right. and it should be easy to just do in the fedmsg-hub systemd unit...
18:17:38 I think it is, yes. Otherwise I have two lines of python that do that we can insert into fedmsg-hub
18:17:58 of course we still have some rhel6 stuff. ;( so yeah, whatever works
18:18:21 I think just adding to fedmsg-hub might be easiest to get it everywhere
18:18:46 sure.
18:19:02 #info everyone will be thinking on longer term solutions when/if we need them.
18:19:04 So, anyone opposed to 1. increasing the limit in fedmsg-hub and 2. adding nagios, speak up now
18:19:11 So fedmsg-hub connects to every fedora service and services that want to consume any message start fedmsg-hub and get everything?
18:19:15 #info in the mean time we will add a nagios check and increase limit to 2048
18:19:30 jcline: yeah.
18:19:48 every endpoint of every service... some have just 1 or 2, some have more
18:19:55 Okay, thanks. That's something I had wondered about.
18:20:08 jcline: look at our endpoints.py file on hosts :)
18:20:59 so, related to this... I see a bunch of connections that are for fedmsg, but not connecting... perhaps we should clean those up too.
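The "two lines of python" mentioned above for raising the limit from inside the process were not pasted into the channel. A minimal sketch of what such a bump could look like, using only the standard `resource` module, is below. The function name `raise_nofile_limit` is hypothetical, the 2048 target comes from the #info line above, and nothing here is the actual fedmsg-hub change.

```python
import resource


def raise_nofile_limit(target=2048):
    """Raise this process's soft open-file limit toward `target`.

    Hypothetical helper: an unprivileged process may raise its soft
    limit only up to the hard limit, so the request is capped there.
    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough; never lower the limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

On hosts where the service runs under systemd, the `LimitNOFILE=` directive in the unit file is the usual way to get the same effect without touching the code; the in-process call is the variant that would also cover the rhel6 boxes mentioned above.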
18:20:59 So longer term apps need to be choosier about where they consume from rather than asking for the firehose
18:21:25 jcline: probably, yes, but as said, that's tricky to get setup
18:22:11 I think it is 'easy' to do in the simple case but quickly becomes hard/impossible as you end up with dependencies between services
18:23:28 I guess the things I see not connecting are bodhi03/04 (which shouldn't be running fedmsg-hub, so not sure why they are being connected to)
18:23:57 nirik: I think they should not be running -hub, but they should be running -relay.
18:24:03 Since they do fire off new messages
18:24:15 they do?
18:24:24 Bodhi fires off messages for new updates etc..
18:24:38 that's the backend tho...
18:24:50 i think the frontend does updates right?
18:24:54 No? New update submissions is issued by frontend
18:24:55 doesn't the frontend run the API?
18:24:59 Yes
18:25:09 also, comments on bodhi tickets probably send messages, which is also frontend
18:25:18 the backend only does the pushing etc and handles messages. frontend fires the actions from in its UI
18:25:23 well, they are not running fedmsg-hub or relay currently
18:25:39 nirik: right. Because this probably uses the other stuff where the relay runs right in the apache process
18:26:00 that sounds correct to me, but i'm not 100% sure
18:26:02 ok, but then things shouldn't try and connect to its endpoints... since nothing is listening there
18:26:17 yeah i don't think the frontend listens
18:26:26 Yes, it does, because that's how fedmsg works..
18:26:41 oh just bidirectional?
18:26:42 fedmsg consumers connect to producers
18:26:47 and bodhi web is a producer
18:26:56 right
18:27:04 so lots 'o connections
18:27:13 well, something seems wrong there to me, but we can poke at it out of meeting
18:27:17 However, I have not yet checked which producing system bodhi uses
18:27:28 nirik: I'll look how it is sending messages.
18:27:51 thanks.
18:28:00 ok, anything more on fedmsg for now?
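The nagios check agreed on in this topic was not specified in the meeting. As a rough sketch only, a check could count a process's open descriptors through Linux `/proc`, compare that with the process's soft limit, and report using the standard Nagios plugin exit codes. All function names and the 80%/90% thresholds below are assumptions made for illustration. With the figures from the discussion (451 endpoints, two handles each, so 902 descriptors), such a check would warn at a 1024 limit and be comfortable at 2048.

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(open_fds, limit, warn=0.80, crit=0.90):
    """Map descriptor usage to a Nagios status (thresholds are illustrative)."""
    ratio = open_fds / float(limit)
    if ratio >= crit:
        return CRITICAL
    if ratio >= warn:
        return WARNING
    return OK


def count_open_fds(pid):
    """Count open descriptors via /proc (Linux only; needs same user or root)."""
    return len(os.listdir("/proc/%d/fd" % pid))


def soft_nofile_limit(pid):
    """Read the soft 'Max open files' value from /proc/<pid>/limits."""
    with open("/proc/%d/limits" % pid) as handle:
        for line in handle:
            if line.startswith("Max open files"):
                # Columns: Max open files <soft> <hard> files
                return int(line.split()[3])
    raise ValueError("no 'Max open files' line for pid %d" % pid)


def main(argv):
    """Check the pid given as argv[1]; print one status line, return exit code."""
    try:
        pid = int(argv[1])
        used, limit = count_open_fds(pid), soft_nofile_limit(pid)
    except (IndexError, ValueError, OSError) as err:
        print("FD UNKNOWN - %s" % err)
        return UNKNOWN
    status = classify(used, limit)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    print("FD %s - %d of %d descriptors in use" % (label, used, limit))
    return status
```

A wrapper script would pass the fedmsg-hub pid and exit with the returned code, for example `sys.exit(main(sys.argv))`. How the pid is found (pidfile, `pgrep`, and so on) is left open here because the meeting did not settle it.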
18:28:48 #topic Apprentice Open office hours
18:28:59 any apprentices with questions, comments, looking for work, etc? :)
18:29:22 Not yet.... ;)
18:29:33 Yes I was starting to look at the easyfix tickets and there are a few SOP to be made...
18:30:02 I'm looking to do something fun and new
18:30:17 They are quite easyfix really... Though at the same time, a bit hard to get the documentation for the service in question...
18:30:42 Well mainly cos' this is my third meeting and I can't access the servers as well...
18:30:50 wind85__: yeah... do ask in #fedora-admin or post to the list if you need more info... someone can answer and provide the info you see
18:30:52 seek
18:31:12 nirik: alright then...
18:31:40 doteast: fun _and_ new? hummm.... :)
18:32:03 or old and boring :) works either way :)
18:32:13 :)
18:32:39 i have a question about inventory/group_vars and patches from git
18:32:52 doteast: ha. ok, will try and think of something for you....
18:33:01 saunind: fire away.
18:33:10 nirik, thank... I'm sure it will be fun
18:33:18 *thanks
18:33:22 Should i git clone https://infrastructure.fedoraproject.org/cgit/ansible.git/ this git repo
18:33:45 and then git format-patch when i edit group_var files? right?
18:33:58 yep. And then send them to the list. ;)
18:34:20 someday before too long we hopefully will have a way for people to submit PR's... which might be faster/easier.
18:34:28 ok thx nirik
18:35:03 what is PR's?
18:35:06 np. :)
18:35:10 Pull Request.
18:35:18 I one more question )
18:35:20 It's how github type projects (and pagure) do changes
18:35:25 And one more)
18:35:43 https://help.github.com/articles/about-pull-requests/
18:35:47 sure, ask away
18:36:05 How to find a mentor/sponsor?))
18:36:50 saunind: thanks I wanted to ask as well :) ...
18:36:55 PRs are not working?! I thought this is one of the kills of pagure...
18:37:22 oh, sorry, you want easier/faster
18:37:42 saunind: well, we don't usually do direct mentoring... but you can ask anyone in the group or list or #fedora-admin your questions and someone whoever is around will answer...
18:37:53 doteast: we don't have PR's for the ansible repo (yet)
18:38:30 yeah, sorry... I was thinking of something else... as I recall it was the git syncing issue as I suppose
18:39:16 yeah. we want our copy to be the main one, so we don't depend on another location...
18:39:22 but we will try and work something out...
18:39:27 #topic Open Floor
18:39:55 anyone have anything for open floor? questions, comments, favorite bluetooth internet of things device?
18:40:34 nirik: all of the devices. They're all so broken :)
18:41:14 quantum computer pad
18:41:24 is pretty cool
18:41:44 I liked the candle one. ;)
18:42:05 it gives correct answer for 5+8 correct in 1/20 cases
18:42:14 ha
18:42:15 but it is a quantum computer!!! :)
18:42:22 smart dust, not bluetooth tho
18:43:35 oh, one other thing. I will be gone next week... smooge / puiterwijk: can one of you run the meeting? or someone else if they want...
18:43:51 I can run meeting next week
18:44:03 it will be very quick
18:44:10 :)
18:44:11 thanks
18:44:13 Thanks smooge :)
18:44:20 :)
18:44:29 ok, thanks for coming everyone. Do continue over in #fedora-admin, #fedora-apps and #fedora-noc
18:44:30 openmeeting/heylookniriksnothere/closemeeting
18:44:31 and, thanks nirik for chairing today
18:44:32 #endmeeting