<@zlopez:fedora.im>
17:00:49
!startmeeting Infrastructure (2025-11-13)
<@meetbot:fedora.im>
17:00:51
Meeting started at 2025-11-13 17:00:49 UTC
<@meetbot:fedora.im>
17:00:51
The Meeting name is 'Infrastructure (2025-11-13)'
<@zlopez:fedora.im>
17:00:55
!meetingname infrastructure
<@zlopez:fedora.im>
17:00:55
!topic Hello and welcome
<@zlopez:fedora.im>
17:00:55
!info Fedora Infra documentation: https://docs.fedoraproject.org/en-US/infra
<@zlopez:fedora.im>
17:00:55
!info About our team: https://docs.fedoraproject.org/en-US/cle/
<@zlopez:fedora.im>
17:00:55
!info Agenda is at: https://board.net/p/fedora-infra
<@zlopez:fedora.im>
17:00:55
!chair @nirik:matrix.scrye.com @zlopez:fedora.im @jnsamyak:matrix.org @james:fedora.im @gwmngilfen:fedora.im
<@meetbot:fedora.im>
17:00:57
The Meeting Name is now infrastructure
<@nirik:matrix.scrye.com>
17:01:01
morning
<@seddik:fedora.im>
17:01:20
morning 👋
<@seddik:fedora.im>
17:02:41
yeah it's been a long time :(
<@zlopez:fedora.im>
17:02:56
!hi
<@zodbot:fedora.im>
17:02:58
Michal Konecny (zlopez)
<@seddik:fedora.im>
17:03:09
!hi
<@zlopez:fedora.im>
17:03:09
Welcome everyone to today's Fedora Infra weekly meeting
<@zodbot:fedora.im>
17:03:10
seddik alaouiismaili (seddik)
<@nirik:matrix.scrye.com>
17:03:44
welcome back saibug. Hope life is going good for you...
<@zlopez:fedora.im>
17:06:14
Let's see if there is anybody new around
<@zlopez:fedora.im>
17:06:21
!topic New folks introductions
<@zlopez:fedora.im>
17:06:21
!info Getting Started Guide: https://docs.fedoraproject.org/en-US/infra/gettingstarted/
<@zlopez:fedora.im>
17:06:21
!info This is a place where people who are interested in Fedora Infrastructure can introduce themselves
<@james:fedora.im>
17:06:38
Kind of here, as I'm mostly in my other meeting.
<@nirik:matrix.scrye.com>
17:06:49
hey James
<@zlopez:fedora.im>
17:07:09
I'm trying to finish some tasks in the background, but I'm mostly here 🙂
<@zlopez:fedora.im>
17:07:43
Nobody new here, so let's go to the next topic
<@zlopez:fedora.im>
17:07:52
!info chair 2025-11-20 - ???
<@zlopez:fedora.im>
17:07:52
!topic Next chair
<@zlopez:fedora.im>
17:07:52
!info magic eight ball says:
<@zlopez:fedora.im>
17:07:52
!info chair 2025-11-27 - ???
<@zlopez:fedora.im>
17:08:09
So we are looking for volunteers for the next two weeks
<@nirik:matrix.scrye.com>
17:08:26
I can do next week if desired.
<@nirik:matrix.scrye.com>
17:08:38
also, the 27th is a holiday in the us... and I am out that entire week. ;)
<@zlopez:fedora.im>
17:08:56
Do we want to cancel it?
<@nirik:matrix.scrye.com>
17:08:59
so, perhaps we cancel the 27th?
<@zlopez:fedora.im>
17:09:12
I'm +1 for cancelling it
<@james:fedora.im>
17:09:16
+1
<@seddik:fedora.im>
17:09:47
I can take the 20th
<@nirik:matrix.scrye.com>
17:09:59
fine with me. ;)
<@zlopez:fedora.im>
17:10:20
!info chair 2025-12-04 - ???
<@zlopez:fedora.im>
17:10:20
!info chair 2025-11-20 - saibug
<@zlopez:fedora.im>
17:10:20
!info chair 2025-11-27 - Holiday in US
<@zlopez:fedora.im>
17:10:41
Any volunteers for the first meeting in December?
<@zlopez:fedora.im>
17:11:34
No worries, we can decide it next week
<@zlopez:fedora.im>
17:11:43
!info CLE Infra&Releng NA-hours team has a Monday through Thursday 30 minute meeting going through tickets at 1900 UTC in https://matrix.to/#/#meeting-3:fedoraproject.org
<@zlopez:fedora.im>
17:11:43
!topic announcements and information
<@zlopez:fedora.im>
17:11:43
!info CLE Infra&Releng EU-hours team has a Monday through Thursday 30 minute meeting going through tickets at 0815 UTC in https://matrix.to/#/#meeting-3:fedoraproject.org
<@zlopez:fedora.im>
17:12:26
!info New version of release-monitoring.org is now deployed on stg.release-monitoring.org with dark mode, feel free to test it out
<@nirik:matrix.scrye.com>
17:12:40
!info Super annoying tcp timeout bug seems to be solved (although there's a possibly related less annoying bug)
<@zlopez:fedora.im>
17:14:37
!topic Oncall
<@zlopez:fedora.im>
17:14:37
!info https://docs.fedoraproject.org/en-US/infra/day_to_day_fedora/#_the_oncall_role_in_our_team
<@zlopez:fedora.im>
17:14:37
!info on call from 2025-11-21 to 2025-11-27 - ???
<@zlopez:fedora.im>
17:14:37
!info on call from 2025-11-14 to 2025-11-20 - ???
<@zlopez:fedora.im>
17:14:37
!info on call from 2025-11-07 to 2025-11-13 - zlopez
<@nirik:matrix.scrye.com>
17:15:09
I can take the next week
<@zlopez:fedora.im>
17:15:09
Any volunteers for oncall?
<@zlopez:fedora.im>
17:15:23
!info on call from 2025-11-14 to 2025-11-20 - nirik
<@zlopez:fedora.im>
17:15:55
!oncall
<@zodbot:fedora.im>
17:15:55
The following people are oncall:
<@zodbot:fedora.im>
17:15:55
● @nirik:matrix.scrye.com (kevin) Current Time for them: 09:15 (US/Pacific)
<@zodbot:fedora.im>
17:15:55
If they do not respond, please file a ticket (https://pagure.io/fedora-infrastructure/issues)
<@zlopez:fedora.im>
17:15:59
Set 🙂
<@nirik:matrix.scrye.com>
17:16:10
\o/
<@zlopez:fedora.im>
17:16:12
Anybody for the week after?
<@zlopez:fedora.im>
17:18:20
Let's keep that for next week then
<@zlopez:fedora.im>
17:18:29
!info Summary of last week: (from current oncall)
<@zlopez:fedora.im>
17:18:36
I got 4 pings this week
<@zlopez:fedora.im>
17:18:47
One was for the koji outage
<@nirik:matrix.scrye.com>
17:18:50
yeah, it's been... rough lately
<@zlopez:fedora.im>
17:19:01
Another one for mailing lists ownership
<@zlopez:fedora.im>
17:19:15
Then a stuck bodhi update
<@nirik:matrix.scrye.com>
17:19:35
so, I guess oncall is still being useful?
<@zlopez:fedora.im>
17:20:07
And the last one was for stuck build on koji, although I couldn't fix that one
<@zlopez:fedora.im>
17:20:18
But all was resolved in the end
<@nirik:matrix.scrye.com>
17:20:41
Thanks for looking into all those!
<@nirik:matrix.scrye.com>
17:21:01
I was really hoping we would be going back to a quiet/normal period... the kojipkgs thing this morning is annoying.
<@gwmngilfen:fedora.im>
17:21:51
!hi
<@zodbot:fedora.im>
17:21:52
Greg Sutcliffe (gwmngilfen) - he / him / his
<@zlopez:fedora.im>
17:21:59
Yeah, Gwmngilfen resolved it by restarting httpd on both kojipkgs machines, some hanging http connections it seems
<@gwmngilfen:fedora.im>
17:22:19
that one was odd, i don't think it was real load. ssh was quite responsive
<@zlopez:fedora.im>
17:22:30
And I restarted signing queue twice this week as well
<@gwmngilfen:fedora.im>
17:22:33
but yeah, an httpd restart brought it back to normal
<@zlopez:fedora.im>
17:23:16
nirik: We have the monitoring topic next, we can probably talk about your topics there
<@nirik:matrix.scrye.com>
17:23:28
sure, well, or after.
<@zlopez:fedora.im>
17:23:32
!info https://nagios.fedoraproject.org/nagios & https://zabbix.fedoraproject.org (top 100 triggers: https://zabbix.fedoraproject.org/zabbix.php?action=toptriggers.list)
<@zlopez:fedora.im>
17:23:32
!topic Monitoring discussion [nirik]
<@zlopez:fedora.im>
17:23:32
!info Go over existing items and fix them
<@nirik:matrix.scrye.com>
17:23:41
so... on the nagios side...
<@nirik:matrix.scrye.com>
17:24:15
there's some proxies which phsmoura is installing, so thats fine
<@zlopez:fedora.im>
17:24:16
I'm surprised how many things there are
<@gwmngilfen:fedora.im>
17:24:23
ditto for that in zabbix
<@nirik:matrix.scrye.com>
17:24:30
there's some certs needing renewal
<@gwmngilfen:fedora.im>
17:24:51
vpn certs?
<@nirik:matrix.scrye.com>
17:24:56
there's a disk space issue on vmhost-x86-02 (because it's running the power10 vHSM and thats taking up a bunch of space on /)
<@nirik:matrix.scrye.com>
17:25:15
no, real certs: SSL WARNING - Certificate '*.apps.ocp.fedoraproject.org' expires in 22 day(s) (2025-12-05 23:59 +0000/UTC).
<@gwmngilfen:fedora.im>
17:25:29
oh i thought the vpn might be why the copr hosts are out
<@nirik:matrix.scrye.com>
17:25:45
No, not sure on the copr hosts.
<@gwmngilfen:fedora.im>
17:25:48
oh, i'm an idiot i have the crit filter page open
<@nirik:matrix.scrye.com>
17:25:50
I think they were reinstalling them
<@nirik:matrix.scrye.com>
17:25:57
we should ask them about it.
<@nirik:matrix.scrye.com>
17:26:54
The apps cert is a digicert one. I think James was going to renew that one? or someone else could.
<@james:fedora.im>
17:27:10
Yeh, I had it on my TODO list for this week
<@nirik:matrix.scrye.com>
17:27:29
as long as it's in the next 22 days it's all good
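The "expires in 22 day(s)" figure in the Nagios warning above can be reproduced with a little date arithmetic. This is only a sketch: the function name `days_until` is hypothetical, and the "now" value is taken from the chat timestamp of the warning rather than from the actual check run.

```python
from datetime import datetime, timezone


def days_until(expiry: datetime, now: datetime) -> int:
    """Whole days remaining between `now` and `expiry` (partial days are dropped)."""
    return (expiry - now).days


# Expiry as quoted in the warning: 2025-12-05 23:59 UTC.
expiry = datetime(2025, 12, 5, 23, 59, tzinfo=timezone.utc)
# Assumption: the check ran around the time the message was posted (17:25 UTC).
now = datetime(2025, 11, 13, 17, 25, tzinfo=timezone.utc)

print(days_until(expiry, now))  # 22
```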
<@nirik:matrix.scrye.com>
17:27:37
I think thats mostly it for nagios...
<@james:fedora.im>
17:27:41
Also phsmoura had a problem with DNS for one of the proxies.
<@gwmngilfen:fedora.im>
17:28:06
for zabbix then
<@gwmngilfen:fedora.im>
17:28:18
similar things ofc, i also see the disk space issues
<@gwmngilfen:fedora.im>
17:28:26
and the proxies that are being fixed
<@gwmngilfen:fedora.im>
17:29:08
from the top 100 report, it's mostly load on pkgs01 (probably not an issue), the haproxy outages from the network storm, and then it goes into a longer tail of stuff
<@gwmngilfen:fedora.im>
17:29:39
i've also written up some thoughts on noise levels at https://discussion.fedoraproject.org/t/zabbix-noise-levels-in-chat-where-to-we-want-to-get-to/172426 which some have replied to - James I'd love your thoughts too 😉
<@gwmngilfen:fedora.im>
17:30:09
overall seems ok. i need to move on with more templates for rabbit, certs, and ssh but fires keep getting in the way
<@nirik:matrix.scrye.com>
17:30:36
pkgs01 load is ok if its under say... 6... if it's over that it might be an indication scrapers are hitting it
<@nirik:matrix.scrye.com>
17:31:15
spot the scrapers:
<@gwmngilfen:fedora.im>
17:31:37
yeah, mine looks the same
<@nirik:matrix.scrye.com>
17:31:43
those 25-30 times were scraper activity
<@gwmngilfen:fedora.im>
17:31:44
1m avg load for the last 7 days has averaged 12
<@gwmngilfen:fedora.im>
17:32:09
ok, so we want to know about that one, fair enough
<@nirik:matrix.scrye.com>
17:32:11
(hitting js and css files that get thru anubis over and over and over again)
<@gwmngilfen:fedora.im>
17:32:45
i'll check the variables for pkgs01 and check it works out to 6 for an alert
<@gwmngilfen:fedora.im>
17:32:59
it might be lower than that right now
<@gwmngilfen:fedora.im>
17:33:21
thats all I've got today, comment on that topic is welcome (or any other check we should adjust / remove)
<@gwmngilfen:fedora.im>
17:33:29
oh, i plan to try something out with postfix
<@gwmngilfen:fedora.im>
17:33:39
right now it just alerts if the queue is > X
<@gwmngilfen:fedora.im>
17:33:57
but really, stuck means they've been there a while, or the queue is increasing, so I'll adjust that check
<@nirik:matrix.scrye.com>
17:34:09
in general things in queue are notable... but some exceptions (bastion, pagure, mailman)
<@gwmngilfen:fedora.im>
17:34:30
right but by reporting the current value you could just get unlucky
<@gwmngilfen:fedora.im>
17:34:35
0,0,0,4,0,0
<@nirik:matrix.scrye.com>
17:34:42
indeed
<@gwmngilfen:fedora.im>
17:35:01
something like "has the value been > 0 for the last 3 checks" is probably better
<@gwmngilfen:fedora.im>
17:35:11
(0 can be adjusted ofc)
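The "value has been > 0 for the last 3 checks" idea above can be sketched as follows. This is a minimal illustration, not the actual Zabbix trigger: the function name `queue_is_stuck` and its `threshold` and `window` parameters are hypothetical, and it assumes the recent queue sizes are available as a list ordered oldest to newest.

```python
from typing import Sequence


def queue_is_stuck(samples: Sequence[int], threshold: int = 0, window: int = 3) -> bool:
    """Return True only if the last `window` samples all exceed `threshold`."""
    if len(samples) < window:
        # Not enough history yet to call the queue stuck.
        return False
    return all(value > threshold for value in samples[-window:])


# A single unlucky reading does not alert:
print(queue_is_stuck([0, 0, 0, 4, 0, 0]))  # False
# A queue that stays non-empty for three checks in a row does:
print(queue_is_stuck([0, 0, 2, 4, 5]))     # True
```

Requiring several consecutive readings trades a slightly later alert for fewer false positives from mail that is just passing through the queue.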
<@gwmngilfen:fedora.im>
17:35:20
ok, now i'm fin
<@nirik:matrix.scrye.com>
17:35:52
ok, shall I go?
<@zlopez:fedora.im>
17:36:04
nirik: The floor is yours
<@nirik:matrix.scrye.com>
17:36:07
!topic tcp timeouts issue / outages
<@nirik:matrix.scrye.com>
17:36:23
So, I am pretty convinced the tcp timeout thing is solved by the firewall cluster upgrade.
<@nirik:matrix.scrye.com>
17:36:39
There's one new issue however around koji and watching tasks and logs.
<@nirik:matrix.scrye.com>
17:36:50
But it's not nearly as big a deal. IMHO
<@nirik:matrix.scrye.com>
17:37:16
wanted to share this fun graph:
<@gwmngilfen:fedora.im>
17:37:43
wow
<@nirik:matrix.scrye.com>
17:37:50
That is when the cluster upgrade happened. We had > 1 million active connections, then it dropped to about 150k
<@gwmngilfen:fedora.im>
17:37:55
(also I suspect that was the cause of the zabbix timeouts in ansible plays)
<@nirik:matrix.scrye.com>
17:38:12
Anyhow, today I plan to:
<@nirik:matrix.scrye.com>
17:38:46
Move kojipkgs back to port 80 (so it uses varnish again), close the timeout issue/clear status, and open a new ticket to consolidate the koji 502 thing in one place.
<@nirik:matrix.scrye.com>
17:39:11
unless someone objects. ;)
<@nirik:matrix.scrye.com>
17:39:34
And on a more general note: lots of outages or big issues lately. ;(
<@nirik:matrix.scrye.com>
17:40:00
They have not all been any one thing... the tcp timeouts were kind of a low level pain, but the outages have been other things
<@nirik:matrix.scrye.com>
17:40:28
I am wondering if we shouldn't make an outage log type thing, or record them as they happen so we can run retros later and figure out how to make them not happen
<@nirik:matrix.scrye.com>
17:41:02
or at least publish some kind of RCA so people know why they happened and what we did to fix?
<@nirik:matrix.scrye.com>
17:41:21
thoughts?
<@zlopez:fedora.im>
17:42:34
That would be great
<@nirik:matrix.scrye.com>
17:43:15
I'll ponder on a concrete proposal... perhaps a discussion thread to hash out how we would like to implement it?
<@nirik:matrix.scrye.com>
17:43:44
Oh another thing that came up recently... we should try and make sure we update status for outages...
<@nirik:matrix.scrye.com>
17:44:20
It's a bit of a balancing act... you need at least some info to update status and perhaps it can just be fixed, but if it's more than a few minutes of investigation we should update.
<@zlopez:fedora.im>
17:45:07
Somebody was proposing to make it more automatic
<@nirik:matrix.scrye.com>
17:45:36
I think automatic is impossible/bad/undesirable. ;)
<@zlopez:fedora.im>
17:46:00
I just heard that, but I understand why we don't want that
<@james:fedora.im>
17:46:35
I think it might be nice to have something semi automatic that says "people are currently reporting some problems with X" ... where we don't need to write a new text file, but could click a button or something.
<@nirik:matrix.scrye.com>
17:46:37
The idea of status is that it's a place users and contributors can go to see what's known to be out. If it's there, it should be known. If it was automatic it may well not be. Also, there's no easy way to map automatic things to a human understandable status.
<@nirik:matrix.scrye.com>
17:47:19
"proxy34 haproxy backend for foo is down" is that an outage? to what?
<@nirik:matrix.scrye.com>
17:47:46
James: might be nice to have some kind of automation to allow us to set something like that yeah...
<@nirik:matrix.scrye.com>
17:48:26
although it's still a balancing act because someone might report a problem but it's not actually a problem and we don't know that until we investigate.
<@nirik:matrix.scrye.com>
17:49:20
so, time is running low... anymore on this? or shall I go to the next thing?
<@zlopez:fedora.im>
17:49:44
Let's go to next thing
<@nirik:matrix.scrye.com>
17:49:50
!topic forge migration for infra
<@nirik:matrix.scrye.com>
17:49:57
So, releng is moving stuff next week.
<@nirik:matrix.scrye.com>
17:50:09
We have a ticket on this, but I think we should see if we can hash out a plan.
<@nirik:matrix.scrye.com>
17:50:23
I'd like to ask that the org get created soon so we can start testing things out.
<@nirik:matrix.scrye.com>
17:50:49
I'm fine moving ansible repo most anytime, but we need to make sure we can get the sync to batcave working correctly.
<@nirik:matrix.scrye.com>
17:51:28
The main tickets we should announce in advance. And have a note about what to do for private stuff (since we can't have private tickets)
<@nirik:matrix.scrye.com>
17:51:53
At the same time we should perhaps ask for the apps and websites orgs so we can move apps to there as we need/want
<@james:fedora.im>
17:52:14
Yeh, thought this might be the last thing we did because of this. Although from what I remember looking at the code it was pretty generic and should work for any git repo.
<@zlopez:fedora.im>
17:52:15
I thought that was already solved 😕
<@nirik:matrix.scrye.com>
17:53:00
well, if we move now, we don't move any of our old private tickets. Moving forward we ask people to just email us I guess...
<@nirik:matrix.scrye.com>
17:53:12
and hope someday once it's implemented we can import all the old ones.
<@james:fedora.im>
17:53:37
Can we not have an infra-private or something?
<@zlopez:fedora.im>
17:53:39
I think e-mail to admin@fedoraproject.org is OK
<@nirik:matrix.scrye.com>
17:53:57
James: we can, but it's useless. People can't see / file anything there.
<@nirik:matrix.scrye.com>
17:54:35
There was talk about making an app to allow that, but... just actually implementing it is way better IMHO
<@nirik:matrix.scrye.com>
17:55:42
So, next week is pretty busy with updates/reboots and other things.
<@james:fedora.im>
17:55:43
Yeh, some kind of "not everyone should see this" feature is pretty common.
<@nirik:matrix.scrye.com>
17:55:49
The week after is a US holiday.
<@nirik:matrix.scrye.com>
17:55:57
So, perhaps we look at the first week of dec?
<@nirik:matrix.scrye.com>
17:57:04
anyhow, I can update the ticket/discussion and we can figure it out.
<@nirik:matrix.scrye.com>
17:57:08
Just wanted to bring it up
<@james:fedora.im>
17:57:10
Seems fair ... what is the current timeline for rdu-cc move?
<@nirik:matrix.scrye.com>
17:57:40
first week of december. ;) But I am pondering punting to 2026. There is another later 'wave' then...
<@nirik:matrix.scrye.com>
17:58:07
if we do move then there will be a ~1day outage of pagure.io then
<@james:fedora.im>
17:58:16
Okay, that's what I thought I remembered ... would def. rather not do both in the same week ;)
<@nirik:matrix.scrye.com>
17:58:20
(unless I can figure out how to move it sooner)
<@nirik:matrix.scrye.com>
17:58:34
yeah, so perhaps the second week of dec?
<@nirik:matrix.scrye.com>
17:58:41
but we can figure it out
<@nirik:matrix.scrye.com>
17:58:50
Almost out of time...
<@zlopez:fedora.im>
17:59:21
Yes, but we covered important topics today
<@nirik:matrix.scrye.com>
17:59:41
yeah, definitely!
<@zlopez:fedora.im>
18:00:17
We are out of time, thanks everybody for coming today
<@zlopez:fedora.im>
18:00:33
!endmeeting