16:05:14 <zlopez> #startmeeting Infrastructure (2023-06-22)
16:05:14 <zodbot> Meeting started Thu Jun 22 16:05:14 2023 UTC.
16:05:14 <zodbot> This meeting is logged and archived in a public location.
16:05:14 <zodbot> The chair is zlopez. Information about MeetBot at https://fedoraproject.org/wiki/Zodbot#Meeting_Functions.
16:05:14 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
16:05:14 <zodbot> The meeting name has been set to 'infrastructure_(2023-06-22)'
16:05:14 <zlopez> #meetingname infrastructure
16:05:14 <zlopez> #chair nirik zlopez nb bodanel dtometzki jnsamyak lenkaseg
16:05:14 <zodbot> The meeting name has been set to 'infrastructure'
16:05:14 <zodbot> Current chairs: bodanel dtometzki jnsamyak lenkaseg nb nirik zlopez
16:05:20 <zlopez> #info Agenda is at: https://board.net/p/fedora-infra
16:05:20 <zlopez> #info About our team: https://docs.fedoraproject.org/en-US/cpe/
16:05:20 <zlopez> #info Fedora Infra documentation: https://docs.fedoraproject.org/en-US/infra
16:05:20 <zlopez> #topic greetings!
16:05:36 <phsmoura> .hi
16:05:37 <zodbot> phsmoura: phsmoura 'Pedro Moura' <pmoura@redhat.com>
16:05:38 <zlopez> Hi everyone, it seems that the matrix bridge is not OK today
16:05:54 <zlopez> So I will run the meeting from libera.chat
16:06:02 <nirik99> morning
16:06:23 <zlopez> There is a slight change in plan today, as I will be the host instead of lenkaseg
16:06:32 <zlopez> .hello zlopez
16:06:33 <zodbot> zlopez: zlopez 'Michal Konecny' <michal.konecny@pacse.eu>
16:07:07 <eddiejenningsjr> .hello eddiejennings
16:07:08 <zodbot> eddiejenningsjr: eddiejennings 'Eddie Jennings' <eddie@eddiejennings.net>
16:09:07 <zlopez> Let's see if there is somebody new
16:09:13 <zlopez> #topic New folks introductions
16:09:13 <zlopez> #info This is a place where people who are interested in Fedora Infrastructure can introduce themselves
16:09:13 <zlopez> #info Getting Started Guide: https://fedoraproject.org/wiki/Infrastructure/GettingStarted
16:09:26 <eddiejenningsjr> I just hopped into IRC.  Matrix bridge seems OK for me.
16:09:32 <darknao> .hi
16:09:33 <zodbot> darknao: darknao 'Francois Andrieu' <darknao@drkn.ninja>
16:09:51 <nirik99> It seems to be sporadic... or perhaps it's better now?
16:10:00 <jnsamyak> .hello jnsamyak
16:10:03 <zodbot> jnsamyak: jnsamyak 'Samyak Jain' <samyak.jn11@gmail.com>
16:10:52 <zlopez> I didn't see many messages from matrix arriving here, but it worked fine the other way around
16:11:22 <zlopez> So anybody new here today?
16:12:03 <zlopez> It doesn't seem so
16:12:16 <zlopez> So let's continue with the chair
16:12:24 <zlopez> #topic Next chair
16:12:24 <zlopez> #info magic eight ball says:
16:12:24 <zlopez> #info chair 2023-06-29 - phsmoura
16:12:24 <zlopez> #info chair 2023-07-06 - dtometzki
16:12:48 <zlopez> #info chair 2023-07-13 - ???
16:13:15 <zlopez> Does anybody want to take the chair for 2023-07-13?
16:13:33 <zlopez> You are obligated to return it after usage :-)
16:13:42 <lenkaseg> Me!
16:13:51 <eddiejenningsjr> For anyone new or on-the-fence, it's a fun, easy way to be involved with fedora-infra!
16:13:58 <lenkaseg> The chair?
16:14:25 <zlopez> lenkaseg: it's yours
16:14:34 <zlopez> #info chair 2023-07-13 - lenkaseg
16:14:35 <dtometzki> .hi
16:14:35 <zodbot> dtometzki: dtometzki 'Damian Tometzki' <damian@riscv.tometzki.de>
16:15:13 <zlopez> It's enough to have chairs for next three weeks, so let's look at the oncall
16:15:25 <zlopez> Sorry, the news will go first :-)
16:15:31 <zlopez> #topic announcements and information
16:15:31 <zlopez> #info CPE Infra&Releng EU-hours team has a Monday through Thursday 30 minute meeting going through tickets at 0730 UTC in #centos-meeting
16:15:31 <zlopez> #info CPE Infra&Releng NA-hours team has a Monday through Thursday 30 minute meeting going through tickets at 1800 UTC in #fedora-meeting-3
16:15:31 <zlopez> #info we had a lovely DDoS of our dns servers yesterday. Should be in better shape the next time something like that happens.
16:15:33 <zlopez> #info flock call for papers/talks is open... https://cfp.fedoraproject.org/
16:15:52 <zlopez> Anything else to announce?
16:16:03 <nirik99> we can remove the ddos line now. :) I still have on my list to write up what happened...
16:17:18 <zlopez> I wasn't here last week, so I wasn't sure if this is an old announcement or if it really happened yesterday
16:18:03 <eddiejenningsjr> It was the Canonical people wasn't it? ;)
16:19:36 <zlopez> I don't think so :-D
16:20:11 <nirik99> ha. I don't think so either.
16:20:26 <nirik99> it was before the last meeting... last tues?
16:20:32 <zlopez> Let's continue with oncall
16:20:45 <zlopez> #topic Oncall
16:20:45 <zlopez> #info https://fedoraproject.org/wiki/Infrastructure/Oncall
16:20:45 <zlopez> #info https://docs.fedoraproject.org/en-US/cpe/day_to_day_fedora/
16:20:52 <zlopez> #info eddiejennings is on call from 2023-06-16 to 2023-06-22
16:20:52 <zlopez> #info nirik is on call from 2023-06-23 to 2023-06-29
16:20:52 <zlopez> #info ??? is on call from 2023-06-30 to 2023-07-06
16:21:17 <nirik> .oncalltakeus
16:21:17 <zodbot> nirik: Kneel before zod!
16:21:37 <zlopez> Anybody interested to take 2023-06-30 to 2023-07-06?
16:22:48 <darknao> I can take it
16:22:55 <zlopez> Sold!
16:23:13 <zlopez> #info darknao is on call from 2023-06-30 to 2023-07-06
16:23:32 <zlopez> #info Summary of last week: (from current oncall )
16:23:33 <eddiejenningsjr> You can put me down for the week after.  I'm on-call for my job, so I'll be on-call for fedora-infra as well :)
16:24:04 <zlopez> eddiejenningsjr: Thanks for volunteering :-)
16:24:12 <zlopez> Did you have any pings this week?
16:24:47 <zlopez> #info eddiejenningsjr is on call from 2023-07-07 to 2023-07-13
16:25:06 <eddiejenningsjr> If I did, they were when I was asleep.  This week was thankfully quiet :D
16:25:59 <zlopez> That's good to hear :-)
16:26:14 <zlopez> #topic Monitoring discussion [nirik]
16:26:14 <zlopez> #info https://nagios.fedoraproject.org/nagios
16:26:14 <zlopez> #info Go over existing out items and fix
16:26:21 <nirik99> lets see...
16:27:12 <nirik99> looking pretty good.
16:27:30 <nirik99> still need to look at the fedmsg thing on proxies
16:27:51 <nirik99> and there is one new openqa message queue thing to look into
16:28:02 <zlopez> You mean the symlink on proxies?
16:28:28 <nirik99> yes....
16:28:58 <nirik99> I put the sym link there to fix all of them alerting
16:29:16 <nirik99> we need to fix the real issue
16:29:53 <nirik99> which I think was related to the changes we made to fix notifs-backend alerts
16:29:56 <zlopez> From what I looked into, it seems that psutil is returning the name of the process without `3` at the end, which causes the socket name to change
16:30:36 <nirik99> thats pretty weird
16:32:11 <zlopez> I tried this in python interpreter and confirmed that it is really an issue
16:32:13 <nirik99> python-psutil?
16:32:17 <zlopez> Yes
16:32:34 <zlopez> I can point you to the exact code where this happens
16:34:02 <nirik99> so, can we downgrade? or file a bug?
16:34:27 <zlopez> I tried to downgrade psutil and it seems there is no version to downgrade
16:35:37 <zlopez> This is the line that's causing the issue: `proc = [p for p in psutil.process_iter() if p.pid == pid][0]`. The proc.name() call returns the name of the process without `3` at the end
16:35:44 <nirik99> I can try and look at what changed when it started happening
16:36:03 <aheath1992> nirik99, I have free cycles to help with the notifs-backend when you have time
16:36:05 <nirik99> it seemed like it was after a noc playbook run, which made me suspect our changes
16:36:09 <zlopez> It's in /etc/fedmsg.d/fedmsg-gateway-slave.py
16:37:19 <nirik99> zlopez: but the socket is monitoring-fedmsg-gateway-.socket
16:37:33 <nirik99> and the link that makes the alerts stop is monitoring-fedmsg-gateway--3.socket
16:37:54 <nirik99> so nothing is right there. ;)
16:38:27 <eddiejenningsjr> heh
16:38:46 <zlopez> I'm not sure why there is `--` the socket should be just the name of the process from what I found
16:39:06 <zlopez> Maybe it didn't work as it should even before :-D
16:39:47 <nirik99> I think we need to look at the entire chain... from what is the nagios check looking for
16:40:17 <nirik99> it did work before tho, it just started alerting all at once after a noc run. (but perhaps that was coincidence?)
16:40:18 <zlopez> `'ipc:///var/run/fedmsg/monitoring-%s.socket' % name` This is how the socket should be named
16:40:35 <zlopez> The name is from what I shared earlier
16:40:51 <nirik99> Jun 07 17:18:32 <zodbot>        PROBLEM - proxy14.fedoraproject.org/Check fedmsg-gateway consumers backlog is UNKNOWN: UNKNOWN - /var/run/fedmsg/monitoring-fedmsg-gateway--3.socket does not exist (noc01)
16:41:19 <zlopez> I know, it seems like psutil changes its behavior in some cases
16:42:49 <nirik99> commit de5ab8f045f
16:42:59 <nirik99> -    fname = '/var/run/fedmsg/monitoring-%s.socket' % service
16:42:59 <nirik99> +    fname = '/var/run/fedmsg/monitoring-%s-3.socket' % service
16:43:16 <nirik99> but that doesn't explain the -- or whatever
16:43:27 <zlopez> It explains it
16:43:54 <zlopez> Currently the name is `monitoring-fedmsg-gateway-.socket` and if you add -3, you will get the `--`
16:44:03 <zlopez> Not sure why this change was made
16:44:15 <nirik99> it was made to fix some notifs-backend alerts.
16:44:27 <nirik99> but not sure why it fixes them and breaks this. ;)
16:45:12 <zlopez> This is done in ansible or fedmsg?
16:45:30 <nirik99> this is all in ansible... the nagios side
16:45:43 <zlopez> Ok, I didn't check that
16:45:56 <zlopez> I just looked what caused it
16:46:19 <nirik99> well, we are taking up the meeting with this. ;) But I can dig more if we want.
16:47:01 <eddiejenningsjr> Up to you two.  I can easily do my little talk next week.  15 minutes is probably not going to be enough time, especially for questions.
16:47:34 <smooge> what was the file being looked at?
16:47:41 <nirik99> so notifs-backend01 (still f36...) has monitoring-fedmsg-hub-3.socket
16:47:59 <nirik99> so that explains why the change was made, but not why it's different.
16:48:17 <zlopez> Ok, so there is inconsistency across machines
16:48:17 <nirik99> aheath1992: you remember any of this? :) it was a while ago...
16:48:44 <zlopez> eddiejenningsjr: let's move the talk to next week, this is interesting as well :-)
16:49:01 <eddiejenningsjr> +1
16:49:24 <aheath1992> If I remember correctly, some of the alerts were only pointing to monitoring-fedmsg-hub.socket, so I updated them to monitoring-fedmsg-hub-3.socket
16:49:36 <aheath1992> in the nagios check scripts
16:49:50 <zlopez> smooge: /etc/fedmsg.d/fedmsg-gateway-slave.py
16:51:05 <aheath1992> https://pagure.io/fedora-infra/ansible/pull-request/1475
16:51:13 <aheath1992> PR for that change
16:51:17 <nirik99> so, if we remove the -3 it would fix the proxies, but break notifs... so perhaps we could figure why notifs has a different socket name?
16:51:29 <nirik99> or why proxies do
16:52:02 <nirik99> I guess proxies do due to the psutil thing?
16:52:11 <zlopez> Yes, it seems so
16:52:31 <zlopez> The name should be same as process name, but it isn't
16:54:44 <zlopez> Couldn't we just change the nagios rules to match the name on both machines?
16:55:12 <nirik99> systemd seems to think the name is weird too:    Main PID: 690 (fedmsg-gateway-)
16:55:34 <nirik99> sure, thats an option. change the check to look for with -3 and without?
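The option floated here, having the check accept the socket name both with and without the `-3` suffix, could be sketched roughly as follows. This is not the actual Nagios check from the ansible repo: `candidate_sockets` and `find_socket` are hypothetical helper names, and only the two path patterns come from the diff quoted earlier in the meeting.

```python
import os


def candidate_sockets(service, run_dir="/var/run/fedmsg"):
    """Return both socket paths discussed in the meeting, suffixed form first."""
    return [
        os.path.join(run_dir, "monitoring-%s-3.socket" % service),
        os.path.join(run_dir, "monitoring-%s.socket" % service),
    ]


def find_socket(service, run_dir="/var/run/fedmsg"):
    """Return the first candidate path that exists on disk, or None."""
    for path in candidate_sockets(service, run_dir):
        if os.path.exists(path):
            return path
    return None
```

A check built this way would keep working on hosts where the suffixed name exists and fall back to the unsuffixed one elsewhere, at the cost of hiding the underlying naming inconsistency rather than fixing it.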
16:55:40 <zlopez> Ok, so it's not a psutil thing
16:56:12 <zlopez> But the `ps aux|grep fedmsg` returned correct name for me
16:56:24 <nirik99> yeah. odd.
16:56:45 <zlopez> `/usr/bin/fedmsg-gateway-3`
16:57:10 <zlopez> It's strange that it just cuts the `3` at the end
16:58:38 <zlopez> Maybe the systemd is using same way to retrieve process name as psutil does
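One possible explanation, not confirmed in the meeting: on Linux the kernel keeps a per-process command name (`comm`) of at most 15 characters (`TASK_COMM_LEN` is 16, including the terminating NUL), and `fedmsg-gateway-3` is 16 characters long. Tools that report `comm` would then show `fedmsg-gateway-`, while `ps aux` prints the full command line. A minimal sketch of the arithmetic follows; `kernel_comm` and `socket_uri` are hypothetical helpers that only imitate the truncation and the naming pattern quoted earlier, and the `/usr/bin/fedmsg-hub-3` path is an assumption inferred from the socket name on notifs-backend01.

```python
import os

TASK_COMM_LEN = 16  # Linux kernel constant; includes the trailing NUL byte


def kernel_comm(executable_path):
    """Imitate how Linux derives comm: the basename, cut to 15 characters."""
    return os.path.basename(executable_path)[:TASK_COMM_LEN - 1]


def socket_uri(name):
    """Socket naming pattern quoted in the meeting."""
    return 'ipc:///var/run/fedmsg/monitoring-%s.socket' % name


# The hub path is an assumption; only the gateway path appears in the log.
for exe in ('/usr/bin/fedmsg-gateway-3', '/usr/bin/fedmsg-hub-3'):
    print(socket_uri(kernel_comm(exe)))
```

Under this assumption the gateway name loses its final `3` while the shorter hub name survives intact, which would match the socket names seen on the proxies and on notifs-backend01.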
16:59:48 <nirik99> hummm...
16:59:59 <nirik99> The script has:
17:00:02 <nirik99> if __name__ == '__main__':
17:00:02 <nirik99>     sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
17:00:30 <nirik99> could that be messing it up somehow? but no idea why it would look different
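For what it's worth, the quoted `re.sub` line looks like the usual wrapper boilerplate generated for Python console scripts, and its pattern only strips a trailing `-script.py`, `-script.pyw`, or `.exe`. A quick check with the path mentioned earlier in the meeting suggests it leaves a trailing `-3` alone; `strip_wrapper_suffix` is a hypothetical name used for illustration.

```python
import re


def strip_wrapper_suffix(argv0):
    """Apply the same substitution as the quoted script line."""
    return re.sub(r'(-script\.pyw?|\.exe)?$', '', argv0)


print(strip_wrapper_suffix('/usr/bin/fedmsg-gateway-3'))
print(strip_wrapper_suffix('fedmsg-gateway-script.py'))
```

If that holds, this line is unlikely to be what drops the `3` from the process name.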
17:00:44 * nirik99 sees we are now out of time. ;)
17:01:18 <zlopez> I will end it here, but the discussion was interesting
17:01:36 <zlopez> Thanks everybody for coming
17:01:37 <zlopez> #endmeeting