14:59:42 #startmeeting Infrastructure (2019-02-07)
14:59:42 Meeting started Thu Feb 7 14:59:42 2019 UTC.
14:59:42 This meeting is logged and archived in a public location.
14:59:42 The chair is smooge. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:59:42 Useful Commands: #action #agreed #halp #info #idea #link #topic.
14:59:42 The meeting name has been set to 'infrastructure_(2019-02-07)'
14:59:42 #meetingname infrastructure
14:59:42 The meeting name has been set to 'infrastructure'
14:59:42 #topic aloha
14:59:42 #chair nirik pingou puiterwijk relrod smooge tflink threebean cverna mkonecny
14:59:42 Current chairs: cverna mkonecny nirik pingou puiterwijk relrod smooge tflink threebean
14:59:53 ó/
14:59:55 * cverna waves
15:00:01 hello
15:00:02 morning
15:00:56 morning
15:00:59 .hello2
15:01:01 creaked: creaked 'Will Chellman'
15:01:06 hello
15:02:03 .hello2
15:02:03 bowlofeggs: bowlofeggs 'Randy Barlow'
15:02:06 #topic New folks introductions
15:02:06 #info This is a place where people who are interested in Fedora Infrastructure can introduce themselves
15:02:06 #info Getting Started Guide: https://fedoraproject.org/wiki/Infrastructure/GettingStarted
15:02:55 .hello sayanchowdhury
15:02:56 sayan: sayanchowdhury 'Sayan Chowdhury'
15:03:18 .hello zlopez
15:03:19 mkonecny: zlopez 'Michal Konečný'
15:04:22 hello everyone
15:04:28 hello o/
15:04:37 Hi all
15:05:02 #topic announcements and information
15:05:02 #info nirik will have sparse hours due to house move
15:05:02 #info mass rebuild was messy
15:05:02 #info new Taskotron deployment almost complete - once complete and ready for production, will let us replace the remaining F27 machines
15:05:32 #info bodhi-3.13.0 beta in staging
15:05:39 any other announcements?
15:05:55 #info pagure 5.3 slightly delayed
15:06:12 #info signing is slowly slowly going along. Will investigate slowness...
15:06:16 we will need to work out a deployment schedule for that
15:06:43 koschei in staging was moved to openshift
15:07:12 reminds me... we moved some things to openshift stg a while back... we may want to look at moving those into prod...
15:07:22 (fedocal, nuancier, and some others)
15:07:30 not yet
15:07:44 both of these need to be ported to oidc for them to work in openshift
15:08:00 ah, ok. Just wanted to make sure we didn't forget them there.
15:08:21 yes we need the fedora-messaging certs also
15:08:25 (might be that putting a high limit on the flask version fixes it)
15:08:27 for some of them
15:08:29 that too
15:08:42 ok, fair enough
15:08:46 but yes we should not forgot about it :)
15:08:53 forget*
15:09:22 #info koschei in staging has moved to openshift
15:10:10 #info delay in moving services from staging openshift to production needs OIDC work
15:10:22 ok next
15:10:27 #info and fedora-messaging certs
15:10:40 #topic Oncall
15:10:40 #info smooge is on call from 2019-01-31 -> 2019-02-07
15:10:40 #info ?????? is on call from 2019-02-07 -> 2019-02-14
15:10:40 #info ?????? is on call from 2019-02-14 -> 2019-02-21
15:10:40 #info ?????? is on call from 2019-02-21 -> 2019-02-28
15:10:41 #info Summary of last week: (from smooge )
15:11:23 I can take it most anytime, but might have to hand it off if I can ever move.
15:11:51 yeah I was thinking you were stuck for a bit. I am going to take next week as relrod had 2 weeks
15:11:57 * nirik is facing the zeno's paradox of electrical work. They come, they do 50%, repeat.
15:11:58 i can do the 7th on
15:12:24 nirik: hahaha
15:13:03 nirik, I would definitely get another contractor to come in and check what they are doing .. the houses I have seen where that happens do not go well
15:13:20 bowlofeggs, so you are on call after this meeting?
15:13:42 the initial ones have been sacked. The new ones do seem a lot more competent and responsive, so perhaps it will finish someday.
;)
15:13:50 oh good
15:14:07 smooge: sure
15:15:48 zodbot: alias add oncall "echo Bowlofeggs (Bowl Of Eggs) is oncall. Please file a ticket if you don't hear from me ( https://pagure.io/fedora-infrastructure/issues ) My regular hours are 1400 UTC to 2300 UTC"
15:15:50 smooge: Kneel before zod!
15:16:24 haha i like the breakdown in parentheses
15:16:24 So review for the week. The majority of the problems have been related to the mass rebuild
15:17:03 The other items have been the standard fedmsg-hub eating a system's ram
15:17:17 There is an outage going on with qa11 and with autosign01
15:17:24 which leads to
15:17:26 the actual mass rebuild went fine... the signing... did not.
15:17:30 #topic Monitoring discussion
15:17:30 #info https://nagios.fedoraproject.org/nagios
15:17:30 #info Go over existing out items and fix
15:18:01 yeah.. most of the alerts were warnings about build events filling up various message queues
15:18:07 yeah, we have a bunch of random stuff we need to fix...
15:19:27 The batcave nagios warning looks to be the ansible mass run
15:19:46 I propose to remove the "Check OSBS build listing" check, it does not give much anyway, basically it just checks if the Openshift cluster is running
15:19:48 yeah, I thought I fixed that last night, but I can clean it up.
15:19:58 cverna: +1
15:20:02 atomic compose alert is legitimate - some kernel issue is blocking the compose
15:20:33 cverna, rename it to OpenShift Cluster is running?
15:20:38 or remove it completely?
15:21:01 it currently does not work because it does not have permission to query the api endpoints
15:21:22 ok then remove away
15:21:25 we could create a service account in openshift just for that, but I wonder if that's worth it
15:22:05 I lean more to removing the check completely
15:22:10 i found some packages for nagios-plugins for openshift but I am not sure if that is worth doing either
15:22:14 how are we monitoring openshift apps?
i need to add some koschei monitoring before moving prod to openshift
15:22:24 just an external route check with nagios is not enough
15:23:18 mizdebsk: There should be prometheus
15:23:36 is it connected with nagios somehow?
15:24:19 no
15:24:22 i've never seen any alert from prometheus
15:24:23 PROPOSAL remove the "Check OSBS build listing" check, if we have issues with the OSBS Openshift cluster later on we can find a way to monitor it
15:24:38 it currently monitors, but does not send or notify anything. you have to look.
15:24:44 cverna: +1
15:24:48 agreed
15:25:00 so, the plan was to look into sensu... which can talk to prometheus
15:25:02 nirik, this is... not ideal at the very least
15:25:04 #action cverna to nuke the Check OSBS build listing check
15:25:12 [o/
15:25:22 \o/ hehe
15:25:57 are there any plans to have these alerts sent via email/irc?
15:26:09 mizdebsk: the plan is to look at sensu, which will do that
15:26:22 and also replace nagios
15:26:51 ok; then personally i would consider this a blocker for moving apps to prod
15:27:04 prod in openshift
15:27:21 well, we can still monitor the end app via nagios...
15:27:38 https://github.com/google/alertmanager-irc-relay
15:27:47 which is not enough for many apps
15:27:51 we could look at configuring alertmanager...
15:28:04 but I am not sure how far out the sensu work is... smooge ?
15:28:07 https://alertmanager-main-openshift-monitoring.app.os.stg.fedoraproject.org/#/alerts
15:28:20 sensu is not being looked at til this summer
15:28:38 ok, then we should likely try and do something shorter term
15:29:25 https://prometheus-k8s-openshift-monitoring.app.os.stg.fedoraproject.org/alerts is the prometheus side
15:29:33 but that's just the cluster.
15:30:37 nirik: getting 403 on both link
15:30:40 s
15:30:42 i can't login to alert manager - "403 Permission Denied - Invalid Account"; is it available to appowners, or just to cluster admins?
15:30:58 huh, I can find out what perm that is and add everyone.
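[editor's note: the dedicated service account cverna floated earlier for the OSBS check could, on an OpenShift 3.x cluster of this era, be set up roughly as below; the account name "nagios-monitor" and the "osbs" namespace are illustrative examples, not the actual Fedora infra configuration]

```shell
# Hypothetical sketch: a read-only service account an external nagios
# check could use to query the API endpoints it currently gets 403s on.
# The names "nagios-monitor" and "osbs" are made up for illustration.
oc create serviceaccount nagios-monitor -n osbs

# Grant read-only access so the check can list builds/pods but change nothing.
oc adm policy add-role-to-user view -z nagios-monitor -n osbs

# Print the bearer token the external check would authenticate with.
oc sa get-token nagios-monitor -n osbs
```

Whether a dedicated account like this is worth maintaining for a single check is exactly the trade-off raised in the discussion.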
15:30:58 how do we add prometheus checks on the applications ?
15:31:09 I was able to login
15:31:47 ah, you need cluster-monitoring-view
15:31:50 like pingou, i can't login to either
15:32:28 try now?
15:32:43 anyhow, I think we should approach this from the other side.
15:33:09 what do we want to monitor on the apps? then we can take that list and try and figure out how best to do it... nagios or alertmanager or whatever
15:33:53 for koschei all i need is to check whether some pods are running or not, and how many of them; this can be easily done with nagios
15:34:25 I can see both links now, thanks nirik
15:34:29 It would be nice to have a 'default' set of things and a way to add more
15:34:54 pods themselves will be checked by openshift using liveness probes - it will restart/fail them if they crash
15:35:14 once a pod has failed, it's not running and nagios could see this by running "oc get pod" on os-master
15:36:07 (with the right namespace, but sure)
15:36:11 I would like to get a notification when a pod fails and why
15:36:40 * nirik is fine using nagios for now...
15:36:55 if we can autogen monitoring for all apps that would be even better
15:37:53 I will see what the nagios plugins for openshift can do and report back next week
15:37:55 we need to give nagios the permission to execute the oc command
15:38:07 smooge: +1
15:38:32 ok anything else on this?
15:39:06 #topic Tickets discussion
15:39:06 #info https://pagure.io/fedora-infrastructure/report/Meetings%20ticket
15:39:06 #info fedora-packages outage https://pagure.io/fedora-infrastructure/issue/7507 - cverna
15:39:16 cverna you have something here?
15:39:41 yes I would like to rebuild the fedora-packages boxes
15:39:51 they are still running on f27
15:39:57 mizdebsk: can you work on that? or I can when I get cycles (and if so we should put in a ticket to track)
15:40:16 so I need to coordinate the outage with someone from ops to be available to nuke the boxes
15:40:19 rebuilding packages?
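[editor's note: the pod check mizdebsk sketches above — nagios running "oc get pod" on os-master and counting running pods — might look like the minimal Nagios-style plugin below; the function names, namespace handling, and thresholds are illustrative, not an existing infra script]

```python
#!/usr/bin/env python3
# Hypothetical sketch of the "oc get pod" check discussed in the meeting.
# Exits with Nagios plugin semantics: 0 = OK, 2 = CRITICAL.
import json
import subprocess
import sys

def count_running(pod_list):
    """Count pods with status.phase == Running in 'oc get pod -o json' output."""
    return sum(1 for pod in pod_list.get("items", [])
               if pod.get("status", {}).get("phase") == "Running")

def check_pods(namespace, minimum, oc="oc"):
    """Return (exit_code, message) for a minimum-running-pods check."""
    out = subprocess.check_output([oc, "get", "pod", "-n", namespace, "-o", "json"])
    running = count_running(json.loads(out))
    if running >= minimum:
        return 0, "OK - %d pods running in %s" % (running, namespace)
    return 2, "CRITICAL - only %d of %d expected pods running in %s" % (
        running, minimum, namespace)

if __name__ == "__main__":
    # e.g. ./check_pods.py koschei 3
    code, message = check_pods(sys.argv[1], int(sys.argv[2]))
    print(message)
    sys.exit(code)
```

As noted in the discussion, running this from nagios would still require giving the nagios user permission to execute oc with a suitably scoped account.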
sure, i can do the ops side
15:40:48 what are the fedora-packages boxes?
15:41:04 bah, wireless hung
15:41:08 well, that was about the nagios thing, but whatever. ;)
15:41:10 apps.fedoraproject.org/packages
15:41:18 packages03.phx2.fedoraproject.org packages04.phx2.fedoraproject.org packages03.stg.phx2.fedoraproject.org
15:42:01 yes I can schedule the outage for next week and coordinate with mizdebsk
15:42:12 nirik, re nagios i think it's better to wait for smooge to investigate the openshift plugin first
15:42:31 if it turns out not feasible to install then i can write a custom plugin that would check running pods
15:43:19 ok thanks. I will put that down as my Friday work
15:45:14 #topic Priorities for next week?
15:45:14 #info please put tickets needing to be focused on here
15:45:14 #info https://pagure.io/fedora-infrastructure/issue/7547
15:45:14 #info autosign outage/slowdown
15:45:14 #info nagios monitoring openshift
15:46:21 Any other priorities for the team that need focus and teamwork?
15:47:35 all the things. ;)
15:47:50 :)
15:48:03 Hopefully we will have taiga soon and can move our high level planning stuff there.
15:48:14 That might help us prioritize and schedule
15:49:58 I think this part of the meeting will be us looking at said taiga in the future, and it will probably take a larger percentage of the time
15:50:03 Right now, everybody has their own work planned
15:50:20 yeah, hard to coordinate
15:50:42 are we going to have per-subproject taiga boards? or one global one
15:51:03 i don't know.. I have heard both so I am going to assume both
15:51:08 per subproject I think...
15:51:08 right now i don't want to open too many tickets to track things as it annoys some of us :)
15:51:30 well, like cpe will have one, marketing will have one, etc...
15:51:47 but I guess some of that is TBD
15:52:46 IMHO, taiga for 'large picture items' and pagure for 'tasks'
15:53:56 What about goals for specific applications
15:54:21 the application tracker I think
15:54:27 if it's a large item, taiga, but if it's just related to the app, the application tracker...
15:54:37 cverna, which is where exactly?
15:54:53 do you have some new tracker in mind?
15:55:00 cverna: So this will still leave us divided into silos
15:55:12 it depends on the app?
15:55:20 mizdebsk: for example fpdc goals in the fpdc tracker, then i can tag the ticket with something like taiga and it would be replicated to the infra board
15:55:26 well, I don't think it's practical to use one tracker for everything.
15:55:39 but perhaps I am not thinking out of the box enough. ;)
15:55:44 nirik: agreed we will have to keep the project tracker for bugs etc
15:56:02 we can't have everything in taiga, it would be unmanageable
15:56:37 honestly.. I think we are going to need some input from the people wanting us to use taiga to know what they want us to use it for
15:57:54 that too
15:58:00 smooge: From what I understand they want to have a high-level goals overview
15:58:50 My understanding is that they wanted one place where all (active) subprojects/sigs/teams have a presence...
15:59:02 so if someone wanted say marketing, they could find it.
16:00:09 * cverna is not sure how this will work but I guess we have to wait and see :)
16:00:13 but yeah, I guess we will see.
16:01:04 #topic open floor
16:01:07 ok we are over time
16:01:15 so do we have anything else for this meeting?
16:03:13 nothing on my side
16:03:25 thanks smooge for running the show
16:03:30 yeah, thanks smooge
16:03:39 thanks smooge
16:03:40 ok thank you all
16:03:43 #endmeeting