16:05:14 <zlopez> #startmeeting Infrastructure (2023-06-22)
16:05:14 <zodbot> Meeting started Thu Jun 22 16:05:14 2023 UTC.
16:05:14 <zodbot> This meeting is logged and archived in a public location.
16:05:14 <zodbot> The chair is zlopez. Information about MeetBot at https://fedoraproject.org/wiki/Zodbot#Meeting_Functions.
16:05:14 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
16:05:14 <zodbot> The meeting name has been set to 'infrastructure_(2023-06-22)'
16:05:14 <zlopez> #meetingname infrastructure
16:05:14 <zlopez> #chair nirik zlopez nb bodanel dtometzki jnsamyak lenkaseg
16:05:14 <zodbot> The meeting name has been set to 'infrastructure'
16:05:14 <zodbot> Current chairs: bodanel dtometzki jnsamyak lenkaseg nb nirik zlopez
16:05:20 <zlopez> #info Agenda is at: https://board.net/p/fedora-infra
16:05:20 <zlopez> #info About our team: https://docs.fedoraproject.org/en-US/cpe/
16:05:20 <zlopez> #info Fedora Infra documentation: https://docs.fedoraproject.org/en-US/infra
16:05:20 <zlopez> #topic greetings!
16:05:36 <phsmoura> .hi
16:05:37 <zodbot> phsmoura: phsmoura 'Pedro Moura' <pmoura@redhat.com>
16:05:38 <zlopez> Hi everyone, it seems that the matrix bridge is not OK today
16:05:54 <zlopez> So I will run the meeting from libera.chat
16:06:02 <nirik99> morning
16:06:23 <zlopez> There is a slight change in plan today, as I will be the host instead of lenkaseg
16:06:32 <zlopez> .hello zlopez
16:06:33 <zodbot> zlopez: zlopez 'Michal Konecny' <michal.konecny@pacse.eu>
16:07:07 <eddiejenningsjr> .hello eddiejennings
16:07:08 <zodbot> eddiejenningsjr: eddiejennings 'Eddie Jennings' <eddie@eddiejennings.net>
16:09:07 <zlopez> Let's see if there is somebody new
16:09:13 <zlopez> #topic New folks introductions
16:09:13 <zlopez> #info This is a place where people who are interested in Fedora Infrastructure can introduce themselves
16:09:13 <zlopez> #info Getting Started Guide: https://fedoraproject.org/wiki/Infrastructure/GettingStarted
16:09:26 <eddiejenningsjr> I just hopped into IRC. Matrix bridge seems OK for me.
16:09:32 <darknao> .hi
16:09:33 <zodbot> darknao: darknao 'Francois Andrieu' <darknao@drkn.ninja>
16:09:51 <nirik99> It seems to be sporadic... or perhaps it's better now?
16:10:00 <jnsamyak> .hello jnsamyak
16:10:03 <zodbot> jnsamyak: jnsamyak 'Samyak Jain' <samyak.jn11@gmail.com>
16:10:52 <zlopez> I didn't see many messages from matrix arriving here, but it worked fine the other way around
16:11:22 <zlopez> So anybody new here today?
16:12:03 <zlopez> It doesn't seem so
16:12:16 <zlopez> So let's continue with the chair
16:12:24 <zlopez> #topic Next chair
16:12:24 <zlopez> #info magic eight ball says:
16:12:24 <zlopez> #info chair 2023-06-29 - phsmoura
16:12:24 <zlopez> #info chair 2023-07-06 - dtometzki
16:12:48 <zlopez> #info chair 2023-07-13 - ???
16:13:15 <zlopez> Does anybody want to take the chair for 2023-07-13?
16:13:33 <zlopez> You are obligated to return it after usage :-)
16:13:42 <lenkaseg> Me!
16:13:51 <eddiejenningsjr> For anyone new or on-the-fence, it's a fun, easy way to be involved with fedora-infra!
16:13:58 <lenkaseg> The chair?
16:14:25 <zlopez> lenkaseg: it's yours
16:14:34 <zlopez> #info chair 2023-07-13 - lenkaseg
16:14:35 <dtometzki> .hi
16:14:35 <zodbot> dtometzki: dtometzki 'Damian Tometzki' <damian@riscv.tometzki.de>
16:15:13 <zlopez> It's enough to have chairs for the next three weeks, so let's look at the oncall
16:15:25 <zlopez> Sorry, the news will go first :-)
16:15:31 <zlopez> #topic announcements and information
16:15:31 <zlopez> #info CPE Infra&Releng EU-hours team has a Monday through Thursday 30 minute meeting going through tickets at 0730 UTC in #centos-meeting
16:15:31 <zlopez> #info CPE Infra&Releng NA-hours team has a Monday through Thursday 30 minute meeting going through tickets at 1800 UTC in #fedora-meeting-3
16:15:31 <zlopez> #info we had a lovely DDoS of our dns servers yesterday. Should be in better shape the next time something like that happens.
16:15:33 <zlopez> #info flock call for papers/talks is open... https://cfp.fedoraproject.org/
16:15:52 <zlopez> Anything else to announce?
16:16:03 <nirik99> we can remove the ddos line now. :) I still have on my list to write up what happened...
16:17:18 <zlopez> I wasn't here last week, so I wasn't sure if this is an old announcement or it really happened yesterday
16:18:03 <eddiejenningsjr> It was the Canonical people wasn't it? ;)
16:19:36 <zlopez> I don't think so :-D
16:20:11 <nirik99> ha. I don't think so either.
16:20:26 <nirik99> it was before the last meeting... last tues?
16:20:32 <zlopez> Let's continue with oncall
16:20:45 <zlopez> #topic Oncall
16:20:45 <zlopez> #info https://fedoraproject.org/wiki/Infrastructure/Oncall
16:20:45 <zlopez> #info https://docs.fedoraproject.org/en-US/cpe/day_to_day_fedora/
16:20:52 <zlopez> #info eddiejennings is on call from 2023-06-16 to 2023-06-22
16:20:52 <zlopez> #info nirik is on call from 2023-06-23 to 2023-06-29
16:20:52 <zlopez> #info ??? is on call from 2023-06-30 to 2023-07-06
16:21:17 <nirik> .oncalltakeus
16:21:17 <zodbot> nirik: Kneel before zod!
16:21:37 <zlopez> Anybody interested in taking 2023-06-30 to 2023-07-06?
16:22:48 <darknao> I can take it
16:22:55 <zlopez> Sold!
16:23:13 <zlopez> #info darknao is on call from 2023-06-30 to 2023-07-06
16:23:32 <zlopez> #info Summary of last week: (from current oncall )
16:23:33 <eddiejenningsjr> You can put me down for the week after. I'm on-call for my job, so I'll be on-call for fedora-infra as well :)
16:24:04 <zlopez> eddiejenningsjr: Thanks for volunteering :-)
16:24:12 <zlopez> Did you have any pings this week?
16:24:47 <zlopez> #info eddiejenningsjr is on call from 2023-07-07 to 2023-07-13
16:25:06 <eddiejenningsjr> If I did, they were when I was asleep. This week was thankfully quiet :D
16:25:59 <zlopez> That's good to hear :-)
16:26:14 <zlopez> #topic Monitoring discussion [nirik]
16:26:14 <zlopez> #info https://nagios.fedoraproject.org/nagios
16:26:14 <zlopez> #info Go over existing out items and fix
16:26:21 <nirik99> let's see...
16:27:12 <nirik99> looking pretty good.
16:27:30 <nirik99> still need to look at the fedmsg thing on proxies
16:27:51 <nirik99> and there is one new openqa message queue thing to look into
16:28:02 <zlopez> You mean the symlink on proxies?
16:28:28 <nirik99> yes....
16:28:58 <nirik99> I put the symlink there to stop all of them alerting
16:29:16 <nirik99> we need to fix the real issue
16:29:53 <nirik99> which I think was related to the changes we made to fix notifs-backend alerts
16:29:56 <zlopez> From what I looked into, it seems that psutil is returning the name of the process without the `3` at the end, which causes the socket name to change
16:30:36 <nirik99> that's pretty weird
16:32:11 <zlopez> I tried this in the python interpreter and confirmed that it is really an issue
16:32:13 <nirik99> python-psutil?
16:32:17 <zlopez> Yes
16:32:34 <zlopez> I can point you to the exact code where this happens
16:34:02 <nirik99> so, can we downgrade? or file a bug?
16:34:27 <zlopez> I tried to downgrade psutil and it seems there is no version to downgrade to
16:35:37 <zlopez> This is the line that's causing the issue: `proc = [p for p in psutil.process_iter() if p.pid == pid][0]`, the proc.name() returns the name of the process without the `3` at the end
16:35:44 <nirik99> I can try and look at what changed when it started happening
16:36:03 <aheath1992> nirik99, I have free cycles to help with the notifs-backend when you have time
16:36:05 <nirik99> it seemed like it was after a noc playbook run, which made me suspect our changes
16:36:09 <zlopez> It's in /etc/fedmsg.d/fedmsg-gateway-slave.py
16:37:19 <nirik99> zlopez: but the socket is monitoring-fedmsg-gateway-.socket
16:37:33 <nirik99> and the link that makes the alerts stop is monitoring-fedmsg-gateway--3.socket
16:37:54 <nirik99> so nothing is right there. ;)
16:38:27 <eddiejenningsjr> heh
16:38:46 <zlopez> I'm not sure why there is `--`, the socket should be just the name of the process from what I found
16:39:06 <zlopez> Maybe it didn't work as it should even before :-D
16:39:47 <nirik99> I think we need to look at the entire chain... starting from what the nagios check is looking for
16:40:17 <nirik99> it did work before tho, it just started alerting at once after a noc run.
(but perhaps that was coincidence?)
16:40:18 <zlopez> `'ipc:///var/run/fedmsg/monitoring-%s.socket' % name` This is how the socket should be named
16:40:35 <zlopez> The name is from what I shared earlier
16:40:51 <nirik99> Jun 07 17:18:32 <zodbot> PROBLEM - proxy14.fedoraproject.org/Check fedmsg-gateway consumers backlog is UNKNOWN: UNKNOWN - /var/run/fedmsg/monitoring-fedmsg-gateway--3.socket does not exist (noc01)
16:41:19 <zlopez> I know, it seems like psutil changes its behavior in some cases
16:42:49 <nirik99> commit de5ab8f045f
16:42:59 <nirik99> - fname = '/var/run/fedmsg/monitoring-%s.socket' % service
16:42:59 <nirik99> + fname = '/var/run/fedmsg/monitoring-%s-3.socket' % service
16:43:16 <nirik99> but that doesn't explain the -- or whatever
16:43:27 <zlopez> It explains it
16:43:54 <zlopez> Currently the name is `monitoring-fedmsg-gateway-.socket` and if you add -3, you will get the `--`
16:44:03 <zlopez> Not sure why this change was made
16:44:15 <nirik99> it was made to fix some notifs-backend alerts.
16:44:27 <nirik99> but not sure why it fixes them and breaks this. ;)
16:45:12 <zlopez> Is this done in ansible or fedmsg?
16:45:30 <nirik99> this is all in ansible... the nagios side
16:45:43 <zlopez> Ok, I didn't check that
16:45:56 <zlopez> I just looked at what caused it
16:46:19 <nirik99> well, we are taking up the meeting with this. ;) But I can dig more if we want.
16:47:01 <eddiejenningsjr> Up to you two. I can easily do my little talk next week. 15 minutes is probably not going to be enough time, especially for questions.
16:47:34 <smooge> what was the file being looked at?
16:47:41 <nirik99> so notifs-backend01 (still f36...) has monitoring-fedmsg-hub-3.socket
16:47:59 <nirik99> so that explains why the change was made, but not why it's different.
16:48:17 <zlopez> Ok, so there is inconsistency across machines
16:48:17 <nirik99> aheath1992: you remember any of this? :) it was a while ago...
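
The mismatch described above can be reproduced with the two format strings quoted in the log. This is a minimal sketch, not the actual fedmsg or Nagios check code: the function names `fedmsg_socket_path` and `nagios_expected_path` are hypothetical, and the process and service names passed in are inferred from the socket paths quoted in the discussion, so treat them as assumptions.

```python
def fedmsg_socket_path(process_name: str) -> str:
    """Socket path built from the process name.

    Uses the format string zlopez quoted (the ipc:// prefix is dropped here).
    """
    return "/var/run/fedmsg/monitoring-%s.socket" % process_name


def nagios_expected_path(service: str) -> str:
    """Socket path the check looks for after commit de5ab8f045f (as quoted)."""
    return "/var/run/fedmsg/monitoring-%s-3.socket" % service


# Proxies: the process name is reported as "fedmsg-gateway-" (trailing 3 missing).
proxy_actual = fedmsg_socket_path("fedmsg-gateway-")
proxy_expected = nagios_expected_path("fedmsg-gateway-")  # assumed service string

# notifs-backend01: the process name is reported in full as "fedmsg-hub-3".
notifs_actual = fedmsg_socket_path("fedmsg-hub-3")
notifs_expected = nagios_expected_path("fedmsg-hub")  # assumed service string

print(proxy_actual, proxy_expected, proxy_actual == proxy_expected)
print(notifs_actual, notifs_expected, notifs_actual == notifs_expected)
```

With these inputs the notifs-backend paths agree while the proxy paths differ by the extra `-3`, which is the `--3` seen in the alert.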
16:48:44 <zlopez> eddiejenningsjr: let's move the talk to next week, this is interesting as well :-)
16:49:01 <eddiejenningsjr> +1
16:49:24 <aheath1992> if I remember correctly, some of the alerts were only pointing to monitoring-fedmsg-hub.socket so I updated them to monitoring-fedmsg-hub-3.socket
16:49:36 <aheath1992> in the nagios check scripts
16:49:50 <zlopez> smooge: /etc/fedmsg.d/fedmsg-gateway-slave.py
16:51:05 <aheath1992> https://pagure.io/fedora-infra/ansible/pull-request/1475
16:51:13 <aheath1992> PR for that change
16:51:17 <nirik99> so, if we remove the -3 it would fix the proxies, but break notifs... so perhaps we could figure out why notifs has a different socket name?
16:51:29 <nirik99> or why proxies do
16:52:02 <nirik99> I guess proxies do due to the psutils thing?
16:52:11 <zlopez> Yes, it seems so
16:52:31 <zlopez> The name should be the same as the process name, but it isn't
16:54:44 <zlopez> Couldn't we just change the nagios rules to match the name on both machines?
16:55:12 <nirik99> systemd seems to think the name is weird too: Main PID: 690 (fedmsg-gateway-)
16:55:34 <nirik99> sure, that's an option. change the check to look for both with -3 and without?
16:55:40 <zlopez> Ok, so it's not a psutil thing
16:56:12 <zlopez> But `ps aux|grep fedmsg` returned the correct name for me
16:56:24 <nirik99> yeah. odd.
16:56:45 <zlopez> `/usr/bin/fedmsg-gateway-3`
16:57:10 <zlopez> It's strange that it just cuts the `3` at the end
16:58:38 <zlopez> Maybe systemd is using the same way to retrieve the process name as psutil does
16:59:48 <nirik99> hummm...
16:59:59 <nirik99> The script has:
17:00:02 <nirik99> if __name__ == '__main__':
17:00:02 <nirik99> sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
17:00:30 <nirik99> could that be messing it up somehow? but no idea why it would look different
17:00:44 * nirik99 sees we are now out of time.
;)
17:01:18 <zlopez> I will end it here, but the discussion was interesting
17:01:36 <zlopez> Thanks everybody for coming
17:01:37 <zlopez> #endmeeting
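
One possible explanation for the missing `3`, which was not confirmed in the meeting: on Linux the kernel stores a short per-process name (`comm`) limited to 15 visible characters, and tools that read that field see the truncated value, while `ps aux` prints the full command line. `fedmsg-gateway-3` is 16 characters long, so it would lose its last character, whereas `fedmsg-hub-3` fits. The sketch below only simulates that limit; the helper `kernel_comm` is hypothetical and reads nothing from a live system.

```python
TASK_COMM_LEN = 16  # Linux kernel constant: 15 visible bytes plus a terminating NUL


def kernel_comm(executable_name: str) -> str:
    """Simulate how the kernel's short process name is truncated."""
    raw = executable_name.encode()[: TASK_COMM_LEN - 1]
    # Ignore a multi-byte character that was cut in half by the truncation.
    return raw.decode(errors="ignore")


for name in ("fedmsg-gateway-3", "fedmsg-hub-3"):
    print(len(name), name, "->", kernel_comm(name))
```

If this is what is happening on the proxies, the socket name derived from the process name would end in `fedmsg-gateway-` regardless of the psutil version, which would match the `Main PID: 690 (fedmsg-gateway-)` output from systemd.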