16:00:04 #startmeeting Infrastructure (2022-02-03)
16:00:04 Meeting started Thu Feb 3 16:00:04 2022 UTC.
16:00:04 This meeting is logged and archived in a public location.
16:00:04 The chair is lenkaseg. Information about MeetBot at https://fedoraproject.org/wiki/Zodbot#Meeting_Functions.
16:00:04 Useful Commands: #action #agreed #halp #info #idea #link #topic.
16:00:04 The meeting name has been set to 'infrastructure_(2022-02-03)'
16:00:34 .hello Leo
16:00:35 #meetingname infrastructure
16:00:35 The meeting name has been set to 'infrastructure'
16:00:35 Leo[m]: leo 'Leo Puvilland'
16:00:36 #chair nirik siddharthvipul mobrien zlopez pingou bodanel dtometzki jnsamyak computerkid
16:00:36 Current chairs: bodanel computerkid dtometzki jnsamyak lenkaseg mobrien nirik pingou siddharthvipul zlopez
16:00:46 #info Agenda is at: https://board.net/p/fedora-infra
16:00:47 #info About our team: https://docs.fedoraproject.org/en-US/cpe/
16:00:53 #topic greetings!
16:00:55 .hello dtometzki
16:00:56 dtometzki: dtometzki 'Damian Tometzki'
16:00:57 morning.
16:00:59 Hello everybody!
16:01:03 .hello zlopez
16:01:04 hello
16:01:04 mkonecny: zlopez 'Michal Konecny'
16:01:05 .hello copperi
16:01:07 copperi[m]: copperi 'Jan Kuparinen'
16:01:11 .hello petebuffon
16:01:14 lenkaseg: let's see if the matrix/irc bridge is happy today. ;)
16:01:15 petebuffon[m]: petebuffon 'Peter Buffon'
16:02:05 nirik: I have the irc client open as well :)
16:02:16 .hi
16:02:17 austinpowered: austinpowered 'T.C. Williams'
16:02:18 .hi
16:02:18 The bot is a little behind, otherwise it seems good
16:02:21 darknao: darknao 'Francois Andrieu'
16:02:24 #topic New folks introductions
16:02:31 #info This is a place where people who are interested in Fedora Infrastructure can introduce themselves
16:02:39 #info Getting Started Guide: https://fedoraproject.org/wiki/Infrastructure/GettingStarted
16:02:47 Any newcomers here?
16:03:28 .hi
16:03:29 mobrien: mobrien 'Mark O'Brien'
16:03:33 .hi
16:03:33 eddiejennings: eddiejennings 'Eddie Jennings'
16:04:19 In case there is anybody new, please say hi!
16:05:00 Alright, next chair
16:05:02 #topic Next chair
16:05:11 #info magic eight ball says:
16:05:16 #info chair 2022-02-03 - lenkaseg
16:05:22 #info chair 2022-02-10 - dtometzki
16:05:27 #info chair 2022-02-17 - mkonecny
16:05:50 do we have a volunteer for Feb 24th?
16:06:13 I can do Feb 24
16:06:40 hello
16:06:51 #info chair 2022-02-24 - petebuffon
16:06:55 thank you petebuffon
16:07:13 #topic announcements and information
16:07:24 #info CPE Infra&Releng EU-hours team has a Monday through Thursday 30 minute meeting going through tickets at 1030 Europe/Paris in #centos-meeting
16:07:27 #info CPE Infra&Releng NA-hours team has a Monday through Thursday 30 minute meeting going through tickets at 1800 UTC in #fedora-meeting-3
16:07:35 #info If your team wants support from the Fedora Program Management Team, file an issue: https://pagure.io/fedora-pgm/pgm_team/issues?template=support_request
16:07:42 #info matrix sig is forming, see https://discussion.fedoraproject.org/t/bi-weekly-meeting-for-matrix-sig-or-matrix-team/35947/3 if interested
16:08:04 #info Mass rebuild for F36 finished
16:09:17 Any other new info for today?
16:10:46 Ok, let's move on
16:11:11 #topic Oncall
16:11:16 #info https://fedoraproject.org/wiki/Infrastructure/Oncall
16:11:30 #info https://docs.fedoraproject.org/en-US/cpe/day_to_day_fedora/
16:11:34 ## .oncalltakeeu .oncalltakeus
16:11:41 #info nirik on call from 2022-01-27 to 2022-02-03
16:11:46 #info mobrien on call from 2022-02-04 to 2022-02-10
16:12:39 Does somebody want to take the 3rd week of February?
16:12:44 i can do it
16:12:56 Ok, thanks dtometzki!
16:12:58 .oncalltakeeu
16:12:58 mobrien: Error: You don't have the alias.add capability. If you think that you should have this capability, be sure that you are identified before trying again.
The 'whoami' command can tell you if you're identified.
16:13:27 #info dtometzki on call from 2022-02-11 to 2022-02-17
16:13:36 .oncalltakeeu
16:13:37 mobrien: Kneel before zod!
16:13:46 #info Summary of last week: (from current oncall)
16:14:04 so, there were 4 oncall pings...
16:14:05 nirik, did something happen during your oncall?
16:14:11 oddly all of them on tuesday. ;)
16:14:30 Ping's Tuesday?
16:15:14 apparently. anyhow, everything got handled... several filed as tickets and one issue (pagure down) handled by someone active.
16:16:22 Ok, let's move to ...
16:16:27 #topic Monitoring discussion [nirik]
16:16:39 let's see.
16:16:45 floor is yours nirik!
16:17:51 there are alerts from proxy34 being down... it's an aws instance and amazon sent us email on it recently. It was on a node that they were retiring, so we had to cycle it...
16:17:56 it should be back up soon.
16:19:03 I added some ssl cert checks and changed the way our cert check checks some of our certs. Before it was checking every proxy, now it's just checking proxy01... which isn't ideal, but trying to ack like 60 alerts is annoying.
16:19:17 otherwise it's just the normal stuff being down.
16:19:36 we can move on (and of course get more info on this in the learning topic in a few)
16:20:19 #topic Learning topic
16:20:30 (hope I'm not skipping anything)
16:20:42 #topic Upcoming learning topics
16:20:48 #info 2022-02-03 - Fedora infra server monitoring [nirik]
16:20:53 #info 2022-02-17 - Docs pipeline [darknao]
16:21:01 So we have a learning topic for today
16:21:05 * nirik nods
16:21:26 Do we add some more ideas now, or learn first?
16:21:59 either way. What topics would folks like to hear about?
16:22:42 I was getting quite confused by koji lately, so a koji talk would be appreciated :)
16:23:00 is it within scope of infra?
16:23:25 sure. I can give such a talk... well, I can't speak to the development side of things, but I can talk about deployment and high level...
16:23:49 or perhaps mkonecny could... he worked on it with mmmmmmbox I think?
16:23:58 I like that idea. Anything on rolling out packages would be great.
16:24:43 nirik: I can, but I'm still a little confused by it :-D
16:25:14 high level is a good starting point for me :)
16:26:18 I'm happy to help share what I know of it...
16:26:25 yep, for me too
16:27:34 2022-03-03 - Koji deployment [nirik] - like this?
16:27:54 sure, sounds good.
16:28:20 Looking forward to it. I'll be back home that week :D
16:28:26 Ok, thanks nirik!
16:28:26 huh
16:28:27 #info 2022-03-03 - Koji deployment [nirik]
16:28:38 Floor is yours, nirik!
16:28:44 ok, thanks.
16:28:45 s
16:28:58 #topic Fedora infra server monitoring
16:29:28 So, before we dive into details, let's talk a bit about background... why would you want monitoring?
16:30:10 To know when something is broken before people trying to use it know it's broken so you can fix it. :)
16:30:37 But also, to help debugging problems... sometimes when you know X Y and Z are broken it helps you know what the real problem is...
16:31:18 Ideally monitoring only notifies people when people are needed... (but sadly that's not always the case).
16:31:58 So, for historical reasons, we are currently using a set of monitoring tools called nagios.
16:32:23 nagios was first released 19 years ago (!)
16:32:47 it's a set of configuration files and scripts.
16:33:28 The scripts can be in most any language, they just need to return a specific set of items back to the main process.
16:33:52 there's a bunch of premade scripts to monitor common items, but you can make your own to monitor whatever you want.
16:34:24 there's also (since probably 10 years ago) a templating system that lets you define common commands, etc.
16:35:19 Initially when we started using nagios we just configured it like many of our other applications...
we had it in ansible, but to add a new server or new service or new script, you had to commit to a bunch of complex templates
16:35:57 this caused us issues because people would add new applications or servers and forget to add them to nagios. Or they couldn't get it working right because the templates were all complex.
16:36:46 So, a number of years back we moved to a new setup: nagios config is still in ansible and you can modify it there like any other application, but now it leverages ansible's facts to automatically generate most of the config.
16:37:13 so, now when you add a new machine to ansible inventory, it automatically will be added to nagios the next time the playbook runs.
16:37:53 we actually have 2 nagios instances. one internal one runs on noc01.iad2.fedoraproject.org and is available at https://nagios.fedoraproject.org/
16:38:23 another one runs on noc02.fedoraproject.org at the ibiblio datacenter and is available at https://nagios-external.fedoraproject.org/nagios/
16:38:48 this is to allow one to work if our main datacenter is down, and to give an 'external' view of services.
16:39:25 The ansible config is under roles/nagios_server/ you can see under there are a ton of templates, some of them quite complex looking. ;(
16:40:02 Let's talk a bit about notifications and escalations.
16:40:32 You can control nagios via the web interface... so for example you can acknowledge an outage or down service. Or disable checking it at all, etc.
16:41:04 When we first started using nagios we had it alert for a problem every hour until solved or acked.
16:41:25 this was a lot of alerts...
16:41:49 and due to the distributed nature of our group meant someone was always getting alerts at the wrong time.
16:42:26 so, now we moved to a scheme where nagios sends only the first alert to IRC and waits 10min. This allows someone active on irc to see it and fix it before it goes any further.
16:42:38 after 10min it sends an alert via email.
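A minimal sketch of the kind of check script described above, assuming the standard Nagios plugin convention: one status line on stdout plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The disk check, the thresholds and the function names are illustrative only and are not taken from the Fedora ansible repo.

```python
#!/usr/bin/env python3
"""Illustrative Nagios-style check: disk usage for one path."""
import shutil
import sys

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to an (exit_code, label) pair."""
    if percent_used >= crit:
        return CRITICAL, "CRITICAL"
    if percent_used >= warn:
        return WARNING, "WARNING"
    return OK, "OK"


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for the filesystem holding path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # The check itself failed, so report UNKNOWN rather than a state.
        return UNKNOWN, f"DISK UNKNOWN - cannot stat {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    percent = 100.0 * usage.used / usage.total
    code, label = classify(percent, warn, crit)
    return code, f"DISK {label} - {percent:.1f}% used on {path}"


if __name__ == "__main__":
    code, line = check_disk()
    print(line)     # the status text shown next to the service
    sys.exit(code)  # the exit code becomes the service state
```

Because the contract is only "print a line, exit with a code", the same check could be written in shell, Perl or anything else.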
16:43:28 even so these days sometimes it's too noisy.
16:43:55 We have considered moving to something like zabbix, but haven't had the cycles to do so yet...
16:44:44 Some of our ansible playbooks interact with nagios... it has a control socket ansible can send to... so for example, a playbook might set a host in downtime before an upgrade.
16:45:26 I think that's most of it... any questions? happy to go into more detail
16:46:09 Oh, one lower level thing I should cover:
16:46:32 Zabbix move is prompted by the fact that other areas like the CentOS infrastructure use it, right?
16:46:58 nagios can just run a script for a check, or... it has a client thing called 'nrpe'. nrpe runs on a client host and nagios connects to that and tells _it_ to run something. This allows it to run checks that need local information and still get them back to the main server.
16:48:08 eddiejennings: partly, but nagios has its issues. Its config is complex and breaks somewhat often. It doesn't have nice features like zabbix (like being able to set up a zabbix gateway at a site and do checks from there).
16:48:35 * eddiejennings nods.
16:49:08 nagios also does have some simple maps and uptime stats... but zabbix has a ton more of that kind of information.
16:49:38 nagios is also kinda slow...
16:49:55 for checking services right now:
16:49:58 <= 5 minutes: 1507 (57.1%)
16:49:58 <= 15 minutes: 2213 (83.8%)
16:51:17 also with nagios it's hard to set things up so some specific people get alerts about some specific service or host... right now it's basically just a flood of all alerts.
16:52:31 ok, anything else? happy to answer any questions when/if folks have em...
16:52:46 A wealth of information. Thank you :D
16:53:36 Thank you nirik! I'm formulating some questions, but dunno how to word them to make sense :)
16:54:08 I think I'll first check the ansible repo :)
16:54:08 Thanks nirik, I think I'll have to digest that for a while. Maybe mess around with Zabbix at home.
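A rough sketch of the "set a host in downtime before an upgrade" idea, assuming the control socket mentioned above is the Nagios external command file and using the documented SCHEDULE_HOST_DOWNTIME command format. The host name, author, comment and any command file path are hypothetical; the real playbooks may do this differently.

```python
import time


def downtime_command(host, minutes, author, comment, now=None):
    """Build one SCHEDULE_HOST_DOWNTIME line for the Nagios command file."""
    start = int(time.time()) if now is None else int(now)
    duration = minutes * 60
    end = start + duration
    # Fields: host;start;end;fixed;trigger_id;duration;author;comment
    # fixed=1 means the downtime runs exactly from start to end.
    return (f"[{start}] SCHEDULE_HOST_DOWNTIME;{host};{start};{end};"
            f"1;0;{duration};{author};{comment}\n")


def send_command(line, command_file):
    """Append a command line to the external command file (a named pipe)."""
    with open(command_file, "a") as pipe:
        pipe.write(line)
```

For example, `downtime_command("host01.example.org", 30, "ansible", "upgrade")` produces a line that silences host alerts for 30 minutes; `send_command` would then write it to whatever command file path the Nagios server is configured with.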
16:54:12 :) happy to also answer later if you need more time to ponder.
16:55:26 We have the last 5 mins so let's move to
16:55:27 #topic Open Floor
16:56:30 Does somebody want to share something with the rest of us?
16:57:20 So if a newbie like me wants to deploy server monitoring in the home lab as a learning experience, would zabbix be the choice over nagios?
16:57:21 Nothing from me.
16:57:38 ^-- curious about that too
16:57:59 We have an app successfully running on our ocp4 staging cluster. I think it's the first
16:58:01 I would think so... but of course it's up to you
16:58:38 If zabbix is a 'future project', who is looking at working on the transition?
16:59:09 it's an initiative on our backlog... so no one working on it now, but if someone wanted to start prelim work, great!
16:59:16 mobrien: which one is it?
16:59:48 I would like to help with it, but I'm tapped out of most outside-of-9-5-work time until March.
16:59:56 lenkaseg: blockerbugs, it's a qa app
17:00:16 * nirik has to head to another meeting. Thanks everyone
17:00:32 time's up, let's end the meeting
17:00:38 #endmeeting