16:00:04 <lenkaseg> #startmeeting Infrastructure (2022-02-03)
16:00:04 <zodbot> Meeting started Thu Feb  3 16:00:04 2022 UTC.
16:00:04 <zodbot> This meeting is logged and archived in a public location.
16:00:04 <zodbot> The chair is lenkaseg. Information about MeetBot at https://fedoraproject.org/wiki/Zodbot#Meeting_Functions.
16:00:04 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
16:00:04 <zodbot> The meeting name has been set to 'infrastructure_(2022-02-03)'
16:00:34 <Leo[m]> .hello Leo
16:00:35 <lenkaseg> #meetingname infrastructure
16:00:35 <zodbot> The meeting name has been set to 'infrastructure'
16:00:35 <zodbot> Leo[m]: leo 'Leo Puvilland' <leo@craftcat.dev>
16:00:36 <lenkaseg> #chair nirik siddharthvipul mobrien zlopez pingou bodanel dtometzki jnsamyak computerkid
16:00:36 <zodbot> Current chairs: bodanel computerkid dtometzki jnsamyak lenkaseg mobrien nirik pingou siddharthvipul zlopez
16:00:46 <lenkaseg> #info Agenda is at: https://board.net/p/fedora-infra
16:00:47 <lenkaseg> #info About our team: https://docs.fedoraproject.org/en-US/cpe/
16:00:53 <lenkaseg> #topic greetings!
16:00:55 <dtometzki> .hello dtometzki
16:00:56 <zodbot> dtometzki: dtometzki 'Damian Tometzki' <linux@tometzki.de>
16:00:57 <nirik> morning.
16:00:59 <lenkaseg> Hello everybody!
16:01:03 <mkonecny> .hello zlopez
16:01:04 <dtometzki> hello
16:01:04 <zodbot> mkonecny: zlopez 'Michal Konecny' <michal.konecny@psmail.xyz>
16:01:05 <copperi[m]> .hello copperi
16:01:07 <zodbot> copperi[m]: copperi 'Jan Kuparinen' <copper_fin@hotmail.com>
16:01:11 <petebuffon[m]> .hello petebuffon
16:01:14 <nirik> lenkaseg: lets see if the matrix/irc bridge is happy today. ;)
16:01:15 <zodbot> petebuffon[m]: petebuffon 'Peter Buffon' <pabuffon@gmail.com>
16:02:05 <lenkaseg> nirik: I have the irc client open as well :)
16:02:16 <austinpowered> .hi
16:02:17 <zodbot> austinpowered: austinpowered 'T.C. Williams' <fedoraproject@wootenwilliams.com>
16:02:18 <darknao> .hi
16:02:18 <mkonecny> The bot is a little behind, otherwise it seems good
16:02:21 <zodbot> darknao: darknao 'Francois Andrieu' <darknao@drkn.ninja>
16:02:24 <lenkaseg> #topic New folks introductions
16:02:31 <lenkaseg> #info This is a place where people who are interested in Fedora Infrastructure can introduce themselves
16:02:39 <lenkaseg> #info Getting Started Guide: https://fedoraproject.org/wiki/Infrastructure/GettingStarted
16:02:47 <lenkaseg> Any newcomers here?
16:03:28 <mobrien> .hi
16:03:29 <zodbot> mobrien: mobrien 'Mark O'Brien' <markobri@redhat.com>
16:03:33 <eddiejennings> .hi
16:03:33 <zodbot> eddiejennings: eddiejennings 'Eddie Jennings' <eddie@eddiejennings.net>
16:04:19 <lenkaseg> In case there is anybody new, please say hi!
16:05:00 <lenkaseg> Alright, next chair
16:05:02 <lenkaseg> #topic Next chair
16:05:11 <lenkaseg> #info magic eight ball says:
16:05:16 <lenkaseg> #info chair 2022-02-03 - lenkaseg
16:05:22 <lenkaseg> #info chair 2022-02-10 - dtometzki
16:05:27 <lenkaseg> #info chair 2022-02-17 - mkonecny
16:05:50 <lenkaseg> do we have volunteer for Feb 24th?
16:06:13 <petebuffon> I can do Feb 24
16:06:40 <phsmoura> hello
16:06:51 <lenkaseg> #info chair 2022-02-24 - petebuffon
16:06:55 <lenkaseg> thank you petebuffon
16:07:13 <lenkaseg> #topic announcements and information
16:07:24 <lenkaseg> #info CPE Infra&Releng EU-hours team has a Monday through Thursday 30 minute meeting going through tickets at 1030 Europe/Paris in #centos-meeting
16:07:27 <lenkaseg> #info CPE Infra&Releng NA-hours team has a Monday through Thursday 30 minute meeting going through tickets at 1800 UTC in #fedora-meeting-3
16:07:35 <lenkaseg> #info If your team wants support from the Fedora Program Management Team, file an issue: https://pagure.io/fedora-pgm/pgm_team/issues?template=support_request
16:07:42 <lenkaseg> #info matrix sig is forming, see https://discussion.fedoraproject.org/t/bi-weekly-meeting-for-matrix-sig-or-matrix-team/35947/3 if interested
16:08:04 <lenkaseg> #info Mass rebuild for F36 finished
16:09:17 <lenkaseg> Any other new info for today?
16:10:46 <lenkaseg> Ok, let's move on
16:11:11 <lenkaseg> #topic Oncall
16:11:16 <lenkaseg> #info https://fedoraproject.org/wiki/Infrastructure/Oncall
16:11:30 <lenkaseg> #info https://docs.fedoraproject.org/en-US/cpe/day_to_day_fedora/
16:11:34 <lenkaseg> ## .oncalltakeeu .oncalltakeus
16:11:41 <lenkaseg> #info nirik on call from 2022-01-27 to 2022-02-03
16:11:46 <lenkaseg> #info mobrien on call from 2022-02-04 to 2022-02-10
16:12:39 <lenkaseg> Somebody wants to take the 3rd week of February?
16:12:44 <dtometzki> i can do it
16:12:56 <lenkaseg> Ok, thanks dtometzki !
16:12:58 <mobrien> .oncalltakeeu
16:12:58 <zodbot> mobrien: Error: You don't have the alias.add capability. If you think that you should have this capability, be sure that you are identified before trying again. The 'whoami' command can tell you if you're identified.
16:13:27 <lenkaseg> #info dtometzki on call from 2022-02-11 to 2022-02-17
16:13:36 <mobrien> .oncalltakeeu
16:13:37 <zodbot> mobrien: Kneel before zod!
16:13:46 <lenkaseg> #info Summary of last week: (from current oncall )
16:14:04 <nirik> so, there were 4 oncall pings...
16:14:05 <lenkaseg> nirik, did something happen during your oncall?
16:14:11 <nirik> oddly all of them on tuesday. ;)
16:14:30 <mkonecny> Ping's Tuesday?
16:15:14 <nirik> apparently. anyhow, everything got handled... several filed as tickets and one issue (pagure down) handled by someone active.
16:16:22 <lenkaseg> Ok, let's move to ...
16:16:27 <lenkaseg> #topic Monitoring discussion [nirik]
16:16:39 <nirik> lets see.
16:16:45 <lenkaseg> floor is yours nirik!
16:17:51 <nirik> there's alerts from proxy34 being down...it's an aws instance and amazon sent us email on it recently. It was on a node that they were retiring, so we had to cycle it...
16:17:56 <nirik> it should be back up soon.
16:19:03 <nirik> I added some ssl cert checks and changed the way our cert check checks some of our certs. Before it was checking every proxy, now it's just checking proxy01... which isn't ideal, but trying to ack like 60 alerts is annoying.
16:19:17 <nirik> otherwise it's just the normal stuff being down.
16:19:36 <nirik> we can move on (and of course get more info on this in the learning topic in a few )
16:20:19 <lenkaseg> #topic Learning topic
16:20:30 <lenkaseg> (hope I'm not skipping anything)
16:20:42 <lenkaseg> #topic Upcoming learning topics
16:20:48 <lenkaseg> #info 2022-02-03 - Fedora infra server monitoring [nirik]
16:20:53 <lenkaseg> #info 2022-02-17 - Docs pipeline [darknao]
16:21:01 <mkonecny> So we have a learning topic for today
16:21:05 * nirik nods
16:21:26 <lenkaseg> Do we add some more ideas now or let's learn first?
16:21:59 <nirik> either way. What topics would folks like to hear about?
16:22:42 <lenkaseg> I was getting quite confused by koji lately, so a koji talk would be appreciated :)
16:23:00 <lenkaseg> is it within scope of infra?
16:23:25 <nirik> sure. I can give such a talk... well, I can't speak to the development side of things, but I can talk about deployment and high level...
16:23:49 <nirik> or perhaps mkonecny could... he worked on it with mmmmmmbox I think?
16:23:58 <petebuffon> I like that idea. Anything on rolling out packages would be great.
16:24:43 <mkonecny> nirik: I can, but I'm still a little confused by it :-D
16:25:14 <eddiejennings> high level is a good starting point for me :)
16:26:18 <nirik> I'm happy to help share what I know of it...
16:26:25 <lenkaseg> yep, for me too
16:27:34 <lenkaseg> 2022-03-03 - Koji deployment [nirik] - like this?
16:27:54 <nirik> sure, sounds good.
16:28:20 <eddiejennings> Looking forward to it.  I'll be back home that week :D
16:28:26 <lenkaseg> Ok, thanks nirik!
16:28:26 <Leo[m]> huh
16:28:27 <lenkaseg> #info 2022-03-03 - Koji deployment [nirik]
16:28:38 <lenkaseg> Floor is yours nirik!
16:28:44 <nirik> ok, thanks.
16:28:58 <nirik> #topic Fedora infra server monitoring
16:29:28 <nirik> So, before we dive into details, lets talk a bit about background... why would you want monitoring?
16:30:10 <nirik> To know when something is broken before people trying to use it know it's broken so you can fix it. :)
16:30:37 <nirik> But also, to help debugging problems... sometimes when you know X Y and Z are broken it helps you know what the real problem is...
16:31:18 <nirik> Ideally monitoring only notifies people when people are needed... (but sadly that's not always the case).
16:31:58 <nirik> So, for historical reasons, we are currently using a set of monitoring tools called nagios.
16:32:23 <nirik> nagios was first released 19 years ago (!)
16:32:47 <nirik> it's a set of configuration files and scripts.
16:33:28 <nirik> The scripts can be in most any language, they just need to return a specific set of items back to the main process.
16:33:52 <nirik> there's a bunch of premade scripts to monitor common items, but you can make your own to monitor whatever you want.
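In the standard Nagios plugin convention, the "specific set of items" a check script hands back is an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus one line of status text. A minimal Python sketch of a custom check follows. The disk-usage check, the thresholds, and the function names are illustrative assumptions, not one of the checks Fedora actually runs.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(percent_used, warn=80.0, crit=90.0):
    """Map a disk usage percentage to (exit_code, status_line).

    The warn/crit thresholds are hypothetical defaults.
    """
    if percent_used >= crit:
        code = CRITICAL
    elif percent_used >= warn:
        code = WARNING
    else:
        code = OK
    # Text after the '|' is optional performance data.
    line = f"DISK {LABELS[code]} - {percent_used:.1f}% used | used_pct={percent_used:.1f}"
    return code, line


def check_path(path="/"):
    """Run the check against a real filesystem path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, "DISK UNKNOWN - filesystem reports zero size"
    return evaluate(100.0 * usage.used / usage.total)


# Demonstrate the status line for a made-up reading of 93.5% used.
print(evaluate(93.5)[1])
# A real plugin would finish with sys.exit(code) so Nagios sees the result.
```

Because only the exit code and the status line matter, the same check could be written in shell, Perl, or any other language.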
16:34:24 <nirik> there's also (since probably 10 years ago) a templating system that lets you define common commands, etc.
16:35:19 <nirik> Initially when we started using nagios we just configured it like many of our other applications... we had it in ansible, but to add a new server or new service or new script, you had to commit to a bunch of complex templates
16:35:57 <nirik> this caused us issues because people would add new applications or servers and forget to add them to nagios. Or they couldn't get it working right because the templates were all complex.
16:36:46 <nirik> So, a number of years back we moved to a new setup: nagios config is still in ansible and you can modify it there like any other application, but now it leverages ansible's facts to automatically generate most of the config.
16:37:13 <nirik> so, now when you add a new machine to ansible inventory, it automatically will be added to nagios the next time the playbook runs.
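The idea of generating Nagios config from the inventory can be sketched in a few lines. This is only an illustration in Python: the real setup uses Ansible templates under roles/nagios_server/, and the host names, addresses, and the `generic-host` template name below are made-up assumptions. Only the `define host { ... }` object syntax is standard Nagios.

```python
def render_host(name, address, template="generic-host"):
    """Render one Nagios host object definition.

    The template name is a hypothetical placeholder.
    """
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        "}\n"
    )


def render_hosts(inventory):
    """Render a definition for every host, sorted for stable output."""
    return "\n".join(
        render_host(name, address) for name, address in sorted(inventory.items())
    )


# Hypothetical inventory: host name -> address (as a gathered fact might supply).
inventory = {
    "proxy01.example.org": "192.0.2.10",
    "noc01.example.org": "192.0.2.20",
}
print(render_hosts(inventory))
```

Adding a third entry to `inventory` yields a third host definition with no template edits, which is the property described above.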
16:37:53 <nirik> we have actually 2 nagios instances. one internal one runs on noc01.iad2.fedoraproject.org and is available at https://nagios.fedoraproject.org/
16:38:23 <nirik> another one runs on noc02.fedoraproject.org at the ibiblio datacenter and is available at https://nagios-external.fedoraproject.org/nagios/
16:38:48 <nirik> this is to allow one to work if our main datacenter is down, and to give a 'external' view of services.
16:39:25 <nirik> The ansible config is under roles/nagios_server/ you can see under there are a ton of templates, some of them quite complex looking. ;(
16:40:02 <nirik> Lets talk a bit about notifications and escalations.
16:40:32 <nirik> You can control nagios via the web interface... so for example you can acknowledge an outage or down service. Or disable checking it at all, etc.
16:41:04 <nirik> When we first started using nagios we had it alert for a problem every hour until solved or acked.
16:41:25 <nirik> this was a lot of alerts...
16:41:49 <nirik> and the distributed nature of our group meant someone was always getting alerts at the wrong time.
16:42:26 <nirik> so, now we moved to a scheme where nagios sends the first alert only to IRC and waits 10min. This allows someone active on irc to see it and fix it before it goes any further.
16:42:38 <nirik> after 10min it sends an alert via email.
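The IRC-first escalation described here can be modelled as a small function. This is a sketch of the policy only, not how Nagios escalations are actually configured. The function name is hypothetical, and suppressing alerts once a problem is acked is an assumption.

```python
IRC_ONLY_WINDOW_MIN = 10  # minutes during which alerts go to IRC only


def channel_for_alert(minutes_since_first_alert, acknowledged=False):
    """Return where the next alert for an ongoing problem should go.

    Returns None when no alert should be sent (assumed for acked problems).
    """
    if acknowledged:
        return None
    if minutes_since_first_alert < IRC_ONLY_WINDOW_MIN:
        return "irc"
    return "email"


for minutes in (0, 5, 10, 30):
    print(minutes, channel_for_alert(minutes))
```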
16:43:28 <nirik> even so these days sometimes it's too noisy.
16:43:55 <nirik> We have considered moving to something like zabbix, but haven't had the cycles to do so yet...
16:44:44 <nirik> Some of our ansible playbooks interact with nagios... it has a control socket ansible can send to... so for example, a playbook might set a host in downtime before an upgrade.
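Assuming the control socket mentioned here is Nagios's external command file (a named pipe), setting downtime comes down to writing one formatted line to it. The sketch below builds a standard `SCHEDULE_HOST_DOWNTIME` command. The pipe path, host name, author, and comment are assumptions for illustration, and the playbooks themselves may do this differently.

```python
import time

# Assumed location of the command pipe; it varies by installation.
COMMAND_FILE = "/var/spool/nagios/cmd/nagios.cmd"


def schedule_host_downtime(host, duration_s, author, comment, now=None):
    """Build a SCHEDULE_HOST_DOWNTIME external command line.

    Fields: host;start;end;fixed;trigger_id;duration;author;comment
    """
    now = int(time.time()) if now is None else int(now)
    start, end = now, now + duration_s
    fixed, trigger_id = 1, 0  # fixed window, not triggered by another downtime
    return (
        f"[{now}] SCHEDULE_HOST_DOWNTIME;{host};{start};{end};"
        f"{fixed};{trigger_id};{duration_s};{author};{comment}\n"
    )


def send_command(line, path=COMMAND_FILE):
    """Write the command to the pipe (only works on the Nagios server)."""
    with open(path, "w") as pipe:
        pipe.write(line)


# Hypothetical host and author; prints the command instead of sending it.
print(schedule_host_downtime("proxy01.example.org", 3600, "ansible", "upgrade"), end="")
```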
16:45:26 <nirik> I think that's most of it... any questions? happy to go into more detail
16:46:09 <nirik> Oh, one lower level thing I should cover:
16:46:32 <eddiejennings> Zabbix move is prompted by the fact that other areas like the CentOS infrastructure use it, right?
16:46:58 <nirik> nagios can just run a script for a check, or... it has a client thing called 'nrpe'. nrpe runs on a client host and nagios connects to that and tells _it_ to run something. This allows it to run checks that need local information and still get them back to the main server.
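On the server side, an nrpe check is usually a call to the `check_nrpe` plugin with the client host and the name of a command defined on that client. A rough Python sketch of that round trip follows. The plugin path, host name, and command name are assumptions, and `run_nrpe` is only defined here, not executed.

```python
import subprocess

STATUS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# Assumed install path of the plugin; it differs between distributions.
CHECK_NRPE = "/usr/lib64/nagios/plugins/check_nrpe"


def nrpe_argv(host, command, plugin=CHECK_NRPE):
    """Build the argument list asking `host` to run its local `command`."""
    return [plugin, "-H", host, "-c", command]


def run_nrpe(host, command, timeout=30):
    """Run the remote check and return (status_name, output_line)."""
    proc = subprocess.run(
        nrpe_argv(host, command), capture_output=True, text=True, timeout=timeout
    )
    # The client's exit code is passed back through check_nrpe.
    return STATUS.get(proc.returncode, "UNKNOWN"), proc.stdout.strip()


# Hypothetical host and command name; shows the command line only.
print(" ".join(nrpe_argv("proxy01.example.org", "check_disk_root")))
```

The check itself runs on the client, so it can read local state (disk, processes, logs) that the Nagios server cannot see directly.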
16:48:08 <nirik> eddiejennings: partly, but nagios has its issues. Its config is complex and breaks somewhat often. It doesn't have nice features like zabbix (like being able to set up a zabbix gateway at a site and do checks from there).
16:48:35 * eddiejennings nods.
16:49:08 <nirik> nagios also does have some simple maps and uptime stats... but zabbix has a ton more of that kind of information.
16:49:38 <nirik> nagios is also kinda slow...
16:49:55 <nirik> for checking services right now:
16:49:58 <nirik> <= 5 minutes:	1507 (57.1%)
16:49:58 <nirik> <= 15 minutes:	2213 (83.8%)
16:51:17 <nirik> also with nagios it's hard to set things up so some specific people get alerts about some specific service or host... right now it's basically just a flood of all alerts.
16:52:31 <nirik> ok, anything else? happy to answer any questions when/if folks have em...
16:52:46 <eddiejennings> A wealth of information.  Thank you :D
16:53:36 <lenkaseg> Thank you nirik! I'm formulating some questions, but dunno how to word it to make sense :)
16:54:08 <lenkaseg> I think I'll first check the ansible repo :)
16:54:08 <petebuffon> Thanks nirik, I think I'll have to digest that for a while. Maybe mess around with Zabbix at home.
16:54:12 <nirik> :) happy to also answer later if you need more time to ponder.
16:55:26 <lenkaseg> We have the last 5 mins so let's move to
16:55:27 <lenkaseg> #topic Open Floor
16:56:30 <lenkaseg> Does somebody want to share something with the rest of us?
16:57:20 <austinpowered> So if a newbie like me wants to deploy server monitoring in the home lab as a learning experience, would zabbix be the choice over nagios?
16:57:21 <eddiejennings> Nothing from me.
16:57:38 <eddiejennings> ^-- curious about that too
16:57:59 <mobrien> We have an app successfully running on our ocp4 staging cluster. I think it's the first
16:58:01 <nirik> I would think so... but of course it's up to you
16:58:38 <austinpowered> If zabbix is a 'future project', who is looking at working on the transition?
16:59:09 <nirik> it's an initiative on our backlog... so no one working on it now, but if someone wanted to start prelim work, great!
16:59:16 <lenkaseg> mobrien: which one is it?
16:59:48 <eddiejennings> I would like to help with it, but I'm tapped of most outside-of-9-5-work time until March.
16:59:56 <mobrien> lenkaseg: blockerbugs, it's a qa app
17:00:16 * nirik has to head to another meeting. Thanks everyone
17:00:32 <lenkaseg> time's up, let's end the meeting
17:00:38 <lenkaseg> #endmeeting