16:05:14 #startmeeting Infrastructure (2023-06-22)
16:05:14 Meeting started Thu Jun 22 16:05:14 2023 UTC.
16:05:14 This meeting is logged and archived in a public location.
16:05:14 The chair is zlopez. Information about MeetBot at https://fedoraproject.org/wiki/Zodbot#Meeting_Functions.
16:05:14 Useful Commands: #action #agreed #halp #info #idea #link #topic.
16:05:14 The meeting name has been set to 'infrastructure_(2023-06-22)'
16:05:14 #meetingname infrastructure
16:05:14 #chair nirik zlopez nb bodanel dtometzki jnsamyak lenkaseg
16:05:14 The meeting name has been set to 'infrastructure'
16:05:14 Current chairs: bodanel dtometzki jnsamyak lenkaseg nb nirik zlopez
16:05:20 #info Agenda is at: https://board.net/p/fedora-infra
16:05:20 #info About our team: https://docs.fedoraproject.org/en-US/cpe/
16:05:20 #info Fedora Infra documentation: https://docs.fedoraproject.org/en-US/infra
16:05:20 #topic greetings!
16:05:36 .hi
16:05:37 phsmoura: phsmoura 'Pedro Moura'
16:05:38 Hi everyone, it seems that the matrix bridge is not OK today
16:05:54 So I will run the meeting from libera.chat
16:06:02 morning
16:06:23 There is a slight change in plan today, as I will be the host instead of lenkaseg
16:06:32 .hello zlopez
16:06:33 zlopez: zlopez 'Michal Konecny'
16:07:07 .hello eddiejennings
16:07:08 eddiejenningsjr: eddiejennings 'Eddie Jennings'
16:09:07 Let's see if there is somebody new
16:09:13 #topic New folks introductions
16:09:13 #info This is a place where people who are interested in Fedora Infrastructure can introduce themselves
16:09:13 #info Getting Started Guide: https://fedoraproject.org/wiki/Infrastructure/GettingStarted
16:09:26 I just hopped into IRC. Matrix bridge seems OK for me.
16:09:32 .hi
16:09:33 darknao: darknao 'Francois Andrieu'
16:09:51 It seems to be sporadic... or perhaps it's better now?
16:10:00 .hello jnsamyak
16:10:03 jnsamyak: jnsamyak 'Samyak Jain'
16:10:52 I didn't see many messages from matrix arriving here, but it worked fine the other way around
16:11:22 So anybody new here today?
16:12:03 It doesn't seem so
16:12:16 So let's continue with the chair
16:12:24 #topic Next chair
16:12:24 #info magic eight ball says:
16:12:24 #info chair 2023-06-29 - phsmoura
16:12:24 #info chair 2023-07-06 - dtometzki
16:12:48 #info chair 2023-07-13 - ???
16:13:15 Does anybody want to take the chair for 2023-07-13?
16:13:33 You are obligated to return it after usage :-)
16:13:42 Me!
16:13:51 For anyone new or on-the-fence, it's a fun, easy way to be involved with fedora-infra!
16:13:58 The chair?
16:14:25 lenkaseg: it's yours
16:14:34 #info chair 2023-07-13 - lenkaseg
16:14:35 .hi
16:14:35 dtometzki: dtometzki 'Damian Tometzki'
16:15:13 It's enough to have chairs for the next three weeks, so let's look at the oncall
16:15:25 Sorry, the news will go first :-)
16:15:31 #topic announcements and information
16:15:31 #info CPE Infra&Releng EU-hours team has a Monday through Thursday 30 minute meeting going through tickets at 0730 UTC in #centos-meeting
16:15:31 #info CPE Infra&Releng NA-hours team has a Monday through Thursday 30 minute meeting going through tickets at 1800 UTC in #fedora-meeting-3
16:15:31 #info we had a lovely DDoS of our dns servers yesterday. Should be in better shape the next time something like that happens.
16:15:33 #info flock call for papers/talks is open... https://cfp.fedoraproject.org/
16:15:52 Anything else to announce?
16:16:03 we can remove the ddos line now. :) I still have on my list to write up what happened...
16:17:18 I wasn't here last week, so I wasn't sure if this is an old announcement or it really happened yesterday
16:18:03 It was the Canonical people wasn't it? ;)
16:19:36 I don't think so :-D
16:20:11 ha. I don't think so either.
16:20:26 it was before the last meeting... last tues?
16:20:32 Let's continue with oncall
16:20:45 #topic Oncall
16:20:45 #info https://fedoraproject.org/wiki/Infrastructure/Oncall
16:20:45 #info https://docs.fedoraproject.org/en-US/cpe/day_to_day_fedora/
16:20:52 #info eddiejennings is on call from 2023-06-16 to 2023-06-22
16:20:52 #info nirik is on call from 2023-06-23 to 2023-06-29
16:20:52 #info ??? is on call from 2023-06-30 to 2023-07-06
16:21:17 .oncalltakeus
16:21:17 nirik: Kneel before zod!
16:21:37 Anybody interested in taking 2023-06-30 to 2023-07-06?
16:22:48 I can take it
16:22:55 Sold!
16:23:13 #info darknao is on call from 2023-06-30 to 2023-07-06
16:23:32 #info Summary of last week: (from current oncall )
16:23:33 You can put me down for the week after. I'm on-call for my job, so I'll be on-call for fedora-infra as well :)
16:24:04 eddiejenningsjr: Thanks for volunteering :-)
16:24:12 Did you have any pings this week?
16:24:47 #info eddiejenningsjr is on call from 2023-07-07 to 2023-07-13
16:25:06 If I did, they were when I was asleep. This week was thankfully quiet :D
16:25:59 That's good to hear :-)
16:26:14 #topic Monitoring discussion [nirik]
16:26:14 #info https://nagios.fedoraproject.org/nagios
16:26:14 #info Go over existing out items and fix
16:26:21 lets see...
16:27:12 looking pretty good.
16:27:30 still need to look at the fedmsg thing on proxies
16:27:51 and there is one new openqa message queue thing to look into
16:28:02 You mean the symlink on proxies?
16:28:28 yes....
16:28:58 I put the symlink there to fix all of them alerting
16:29:16 we need to fix the real issue
16:29:53 which I think was related to the changes we made to fix notifs-backend alerts
16:29:56 From what I looked into, it seems that psutil is returning the name of the process without `3` at the end, which causes the socket name to change
16:30:36 thats pretty weird
16:32:11 I tried this in the python interpreter and confirmed that it is really an issue
16:32:13 python-psutil?
16:32:17 Yes
16:32:34 I can point you to the exact code where this happens
16:34:02 so, can we downgrade? or file a bug?
16:34:27 I tried to downgrade psutil and it seems there is no version to downgrade to
16:35:37 This is the line that's causing the issue: `proc = [p for p in psutil.process_iter() if p.pid == pid][0]`; proc.name() returns the name of the process without `3` at the end
16:35:44 I can try and look at what changed when it started happening
16:36:03 nirik99, i have free cycles to help with the notifs-backend when you have time
16:36:05 it seemed like it was after a noc playbook run, which made me suspect our changes
16:36:09 It's in /etc/fedmsg.d/fedmsg-gateway-slave.py
16:37:19 zlopez: but the socket is monitoring-fedmsg-gateway-.socket
16:37:33 and the link that makes the alerts stop is monitoring-fedmsg-gateway--3.socket
16:37:54 so nothing is right there. ;)
16:38:27 heh
16:38:46 I'm not sure why there is `--`; the socket should be just the name of the process from what I found
16:39:06 Maybe it didn't work as it should even before :-D
16:39:47 I think we need to look at the entire chain... from what the nagios check is looking for
16:40:17 it did work before tho, it just started alerting at once after a noc run. (but perhaps that was coincidence?)
16:40:18 `'ipc:///var/run/fedmsg/monitoring-%s.socket' % name` This is how the socket should be named
16:40:35 The name is from what I shared earlier
16:40:51 Jun 07 17:18:32 PROBLEM - proxy14.fedoraproject.org/Check fedmsg-gateway consumers backlog is UNKNOWN: UNKNOWN - /var/run/fedmsg/monitoring-fedmsg-gateway--3.socket does not exist (noc01)
16:41:19 I know, it seems like psutil changes its behavior in some cases
16:42:49 commit de5ab8f045f
16:42:59 - fname = '/var/run/fedmsg/monitoring-%s.socket' % service
16:42:59 + fname = '/var/run/fedmsg/monitoring-%s-3.socket' % service
16:43:16 but that doesn't explain the -- or whatever
16:43:27 It explains it
16:43:54 Currently the name is `monitoring-fedmsg-gateway-.socket` and if you add -3, you will get the `--`
16:44:03 Not sure why this change was made
16:44:15 it was made to fix some notifs-backend alerts.
16:44:27 but not sure why it fixes them and breaks this. ;)
16:45:12 This is done in ansible or fedmsg?
16:45:30 this is all in ansible... the nagios side
16:45:43 Ok, I didn't check that
16:45:56 I just looked at what caused it
16:46:19 well, we are taking up the meeting with this. ;) But I can dig more if we want.
16:47:01 Up to you two. I can easily do my little talk next week. 15 minutes is probably not going to be enough time, especially for questions.
16:47:34 what was the file being looked at?
16:47:41 so notifs-backend01 (still f36...) has monitoring-fedmsg-hub-3.socket
16:47:59 so that explains why the change was made, but not why it's different.
16:48:17 Ok, so there is inconsistency across machines
16:48:17 aheath1992: you remember any of this? :) it was a while ago...
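The string arithmetic behind the `--` in the alert can be reproduced in a few lines of Python. This is only a sketch: the two format strings are quoted from the messages above, the function names are hypothetical, and the assumption that the check is handed the same shortened name that the gateway reports is inferred from the alert text rather than confirmed in the meeting.

```python
# Sketch only. Format strings are quoted from the log; helper names are hypothetical.

def created_socket(process_name):
    """Socket path the service creates, per the format string quoted at 16:40:18
    (shown here without the ipc:// prefix)."""
    return '/var/run/fedmsg/monitoring-%s.socket' % process_name


def checked_socket(service):
    """Socket path the nagios check looks for after commit de5ab8f045f."""
    return '/var/run/fedmsg/monitoring-%s-3.socket' % service


# Assumption: both sides see the gateway name with the trailing `3` missing.
reported = 'fedmsg-gateway-'
print(created_socket(reported))   # what exists on the proxies
print(checked_socket(reported))   # what the alert says does not exist

# On notifs-backend01 the socket is monitoring-fedmsg-hub-3.socket, so the
# post-commit check lines up there when given the service name 'fedmsg-hub'.
print(checked_socket('fedmsg-hub'))
```

Under that assumption the commit fixes notifs-backend and breaks the proxies for the same reason: it appends `-3` to whatever name it is given.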
16:48:44 eddiejenningsjr: let's move the talk to next week, this is interesting as well :-)
16:49:01 +1
16:49:24 if I remember correctly, some of the alerts were only pointing to monitoring-fedmsg-hub.socket so I updated to monitoring-fedmsg-hub-3.socket
16:49:36 in the nagios check scripts
16:49:50 smooge: /etc/fedmsg.d/fedmsg-gateway-slave.py
16:51:05 https://pagure.io/fedora-infra/ansible/pull-request/1475
16:51:13 PR for that change
16:51:17 so, if we remove the -3 it would fix the proxies, but break notifs... so perhaps we could figure why notifs has a different socket name?
16:51:29 or why proxies do
16:52:02 I guess proxies do due to the psutils thing?
16:52:11 Yes, it seems so
16:52:31 The name should be the same as the process name, but it isn't
16:54:44 Couldn't we just change the nagios rules to match the name on both machines?
16:55:12 systemd seems to think the name is weird too: Main PID: 690 (fedmsg-gateway-)
16:55:34 sure, thats an option. change the check to look for with -3 and without?
16:55:40 Ok, so it's not a psutil thing
16:56:12 But the `ps aux|grep fedmsg` returned the correct name for me
16:56:24 yeah. odd.
16:56:45 `/usr/bin/fedmsg-gateway-3`
16:57:10 It's strange that it just cuts the `3` at the end
16:58:38 Maybe systemd is using the same way to retrieve the process name as psutil does
16:59:48 hummm...
16:59:59 The script has:
17:00:02 if __name__ == '__main__':
17:00:02 sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
17:00:30 could that be messing it up somehow? but no idea why it would look different
17:00:44 * nirik99 sees we are now out of time. ;)
17:01:18 I will end it here, but the discussion was interesting
17:01:36 Thanks everybody for coming
17:01:37 #endmeeting
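Two of the questions left open at the end can be checked offline. The following is a minimal Python sketch under stated assumptions, not a diagnosis agreed in the meeting. The regex is the one quoted from the script. The 15-character limit is the Linux kernel's cap on the short process name (comm), offered here as one possible explanation for the missing `3`. `candidate_sockets` is a hypothetical helper illustrating the "with -3 and without" option; it is not the actual nagios check.

```python
import re

# Linux keeps a short per-process name (comm) capped at 15 visible characters.
COMM_MAX = 15


def strip_wrapper_suffix(argv0):
    """Apply the re.sub quoted from the script to a sys.argv[0] value."""
    return re.sub(r'(-script\.pyw?|\.exe)?$', '', argv0)


def kernel_comm(name):
    """Approximate the shortened name the kernel would keep for `name`."""
    return name[:COMM_MAX]


def candidate_sockets(service):
    """Hypothetical helper: socket paths to try, with and without the -3 suffix."""
    base = '/var/run/fedmsg/monitoring-%s' % service
    return [base + '-3.socket', base + '.socket']


# The regex only removes a "-script.py(w)" or ".exe" suffix, so the 3 survives.
print(strip_wrapper_suffix('/usr/bin/fedmsg-gateway-3'))

# Compare name lengths against the cap.
for name in ('fedmsg-gateway-3', 'fedmsg-hub-3'):
    print(name, len(name), repr(kernel_comm(name)))

# A check that tries both forms would cover both kinds of host.
print(candidate_sockets('fedmsg-hub'))
```

`fedmsg-gateway-3` is 16 characters, one over the cap, while `fedmsg-hub-3` is 12. That would be consistent with the proxies and notifs-backend01 ending up with different socket names, and with `ps aux` (which prints the full command line) still showing the complete name.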