14:59:42 <smooge> #startmeeting Infrastructure (2019-02-07)
14:59:42 <zodbot> Meeting started Thu Feb  7 14:59:42 2019 UTC.
14:59:42 <zodbot> This meeting is logged and archived in a public location.
14:59:42 <zodbot> The chair is smooge. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:59:42 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
14:59:42 <zodbot> The meeting name has been set to 'infrastructure_(2019-02-07)'
14:59:42 <smooge> #meetingname infrastructure
14:59:42 <zodbot> The meeting name has been set to 'infrastructure'
14:59:42 <smooge> #topic aloha
14:59:42 <smooge> #chair nirik pingou puiterwijk relrod smooge tflink threebean cverna mkonecny
14:59:42 <zodbot> Current chairs: cverna mkonecny nirik pingou puiterwijk relrod smooge tflink threebean
14:59:53 <pingou> ó/
14:59:55 * cverna waves
15:00:01 <tflink> hello
15:00:02 <nirik> morning
15:00:56 <creaked> morning
15:00:59 <creaked> .hello2
15:01:01 <zodbot> creaked: creaked 'Will Chellman' <creaked@gmail.com>
15:01:06 <smooge> hello
15:02:03 <bowlofeggs> .hello2
15:02:03 <zodbot> bowlofeggs: bowlofeggs 'Randy Barlow' <rbarlow@redhat.com>
15:02:06 <smooge> #topic New folks introductions
15:02:06 <smooge> #info This is a place where people who are interested in Fedora Infrastructure can introduce themselves
15:02:06 <smooge> #info Getting Started Guide: https://fedoraproject.org/wiki/Infrastructure/GettingStarted
15:02:55 <sayan> .hello sayanchowdhury
15:02:56 <zodbot> sayan: sayanchowdhury 'Sayan Chowdhury' <sayan.chowdhury2012@gmail.com>
15:03:18 <mkonecny> .hello zlopez
15:03:19 <zodbot> mkonecny: zlopez 'Michal Konečný' <michal.konecny@packetseekers.eu>
15:04:22 <smooge> hello everyone
15:04:28 <chris787> hello o/
15:04:37 <armnhmr> Hi all
15:05:02 <smooge> #topic announcements and information
15:05:02 <smooge> #info nirik will have sparse hours due to house move
15:05:02 <smooge> #info mass rebuild was messy
15:05:02 <smooge> #info new Taskotron deployment almost complete - once complete and ready for production, it will let us replace the remaining F27 machines
15:05:32 <bowlofeggs> #info bodhi-3.13.0 beta in staging
15:05:39 <smooge> any other announcements?
15:05:55 <pingou> #info pagure 5.3 slightly delayed
15:06:12 <nirik> #info signing is slowly slowly going along. Will investigate slowness...
15:06:16 <smooge> we will need to work out a deployment schedule for that
15:06:43 <mizdebsk> koschei in staging was moved to openshift
15:07:12 <nirik> reminds me... we moved some things to openshift stg a while back... we may want to look at moving those into prod...
15:07:22 <nirik> (fedocal, nuancier, and some others)
15:07:30 <pingou> not yet
15:07:44 <pingou> both of these need to be ported to oidc for them to work in openshift
15:08:00 <nirik> ah, ok. Just wanted to make sure we didn't forget them there.
15:08:21 <cverna> yes we need the fedora-messaging certs also
15:08:25 <pingou> (might be that putting a high limit on the flask version fixes it)
15:08:27 <cverna> for some of them
15:08:29 <pingou> that too
15:08:42 <nirik> ok, fair enough
15:08:46 <cverna> but yes we should not forget about it :)
15:09:22 <smooge> #info koschei in staging has moved to openshift
15:10:10 <smooge> #info delay in moving services from staging openshift to production needs OIDC work
15:10:22 <smooge> ok next
15:10:27 <nirik> #info and fedora-messaging certs
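A rough sketch of what the OIDC port pingou mentions might look like, assuming the apps are Flask-based and would use flask-oidc against the Fedora Ipsilon provider; the secret, client_secrets.json path, route, and 'nickname' claim are illustrative assumptions, not actual fedocal/nuancier code:

    from flask import Flask
    from flask_oidc import OpenIDConnect

    app = Flask(__name__)
    app.config.update({
        'SECRET_KEY': 'change-me',                      # assumption: per-app secret
        'OIDC_CLIENT_SECRETS': 'client_secrets.json',   # assumption: client registered in Ipsilon
        'OIDC_SCOPES': ['openid', 'email', 'profile'],
    })
    oidc = OpenIDConnect(app)

    @app.route('/settings')
    @oidc.require_login                  # replaces the old FAS/OpenID login handling
    def settings():
        # user details now come from the OIDC userinfo endpoint instead of FAS
        return 'Logged in as %s' % oidc.user_getfield('nickname')

Presumably the same client would also need to be registered with the production identity provider before any move to the production OpenShift.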
15:10:40 <smooge> #topic Oncall
15:10:40 <smooge> #info smooge is on call from 2019-01-31 -> 2019-02-07
15:10:40 <smooge> #info ?????? is on call from 2019-02-07 -> 2019-02-14
15:10:40 <smooge> #info ?????? is on call from 2019-02-14 -> 2019-02-21
15:10:40 <smooge> #info ?????? is on call from 2019-02-21 -> 2019-02-28
15:10:41 <smooge> #info Summary of last week: (from smooge )
15:11:23 <nirik> I can take it most anytime, but might have to hand it off if I can ever move.
15:11:51 <smooge> yeah I was thinking you were stuck for a bit. I am going to take next week as relrod had 2 weeks
* nirik is facing Zeno's paradox of electrical work. They come, they do 50%, repeat.
15:11:58 <bowlofeggs> i can do the 7th on
15:12:24 <bowlofeggs> nirik: hahaha
15:13:03 <smooge> nirik, I would definitely get another contractor to come in and check what they are doing .. the houses I have seen where that happens do not turn out well
15:13:20 <smooge> bowlofeggs, so you are on call after this meeting?
15:13:42 <nirik> the initial ones have been sacked. The new ones do seem a lot more competent and responsive, so perhaps it will finish someday. ;)
15:13:50 <smooge> oh good
15:14:07 <bowlofeggs> smooge: sure
15:15:48 <smooge> zodbot: alias add oncall "echo Bowlofeggs (Bowl Of Eggs) is oncall. Please file a ticket if you don't hear from me ( https://pagure.io/fedora-infrastructure/issues ) My regular hours are 1400 UTC to 2300 UTC"
15:15:50 <zodbot> smooge: Kneel before zod!
15:16:24 <bowlofeggs> haha i like the breakdown in parentheses
15:16:24 <smooge> So review for the week. The majority of the problems have been related to the mass rebuild
15:17:03 <smooge> The other items have been the standard fedmsg-hub eating a system's RAM
15:17:17 <smooge> There is an outage going on with qa11 and with autosign01
15:17:24 <smooge> which leads to
15:17:26 <nirik> the actual mass rebuild went fine... the signing... did not.
15:17:30 <smooge> #topic Monitoring discussion
15:17:30 <smooge> #info https://nagios.fedoraproject.org/nagios
15:17:30 <smooge> #info Go over existing out items and fix
15:18:01 <smooge> yeah.. most of the alerts were warnings about build events filling up various message queues
15:18:07 <nirik> yeah, we have a bunch of random stuff we need to fix...
15:19:27 <smooge> The batcave nagios warning looks to be from the ansible mass run
15:19:46 <cverna> I propose removing the "Check OSBS build listing" check; it does not give much anyway, it basically just checks if the OpenShift cluster is running
15:19:48 <nirik> yeah, I thought I fixed that last night, but I can clean it up.
15:19:58 <nirik> cverna: +1
15:20:02 <mizdebsk> atomic compose alert is legitimate - some kernel issue is blocking the compose
15:20:33 <smooge> cverna, rename it to OpenShift Cluster is running?
15:20:38 <smooge> or remove it completely?
15:21:01 <cverna> it currently does not work because it does not have permission to query the api endpoints
15:21:22 <smooge> ok then remove away
15:21:25 <cverna> we could create a service account in openshift just for that, but I wonder if that's worth it
15:22:05 <cverna> I lean more towards removing the check completely
15:22:10 <smooge> i found some packages for nagios-plugins for openshift but I am not sure if that is worth doing either
15:22:14 <mizdebsk> how are we monitoring openshift apps? i need to add some koschei monitoring before moving prod to openshift
15:22:24 <mizdebsk> just external route with nagios is not enough
15:23:18 <mkonecny> mizdebsk: There should be prometheus
15:23:36 <mizdebsk> is it connected with nagios somehow?
15:24:19 <nirik> no
15:24:22 <mizdebsk> i've never seen any alert from prometheus
15:24:23 <cverna> PROPOSAL: remove the "Check OSBS build listing" check; if we have issues with the OSBS OpenShift cluster later on we can find a way to monitor it
15:24:38 <nirik> it currently monitors, but does not send or notify anything. you have to look.
15:24:44 <nirik> cverna: +1
15:24:48 <smooge> agreed
15:25:00 <nirik> so, the plan was to look into sensu... which can talk to prometheus
15:25:02 <mizdebsk> nirik, this is... not ideal, at the very least
15:25:04 <cverna> #action cverna to nuke the Check OSBS build listing check
15:25:12 <cverna> \o/
15:25:57 <mizdebsk> are there any plans to have these alerts sent via email/irc?
15:26:09 <nirik> mizdebsk: the plan is to look at sensu, which will do that
15:26:22 <nirik> and also replace nagios
15:26:51 <mizdebsk> ok; then personally i would consider this a blocker for moving apps to prod in openshift
15:27:21 <nirik> well, we can still monitor the end app via nagios...
15:27:38 <cverna> https://github.com/google/alertmanager-irc-relay
15:27:47 <mizdebsk> which is not enough for many apps
15:27:51 <nirik> we could look at configuring alertmanager...
15:28:04 <nirik> but I am not sure how far out the sensu work is... smooge ?
15:28:07 <nirik> https://alertmanager-main-openshift-monitoring.app.os.stg.fedoraproject.org/#/alerts
15:28:20 <smooge> sensu is not being looked at til this summer
15:28:38 <nirik> ok, then we should likely try and do something shorter term
15:29:25 <nirik> https://prometheus-k8s-openshift-monitoring.app.os.stg.fedoraproject.org/alerts is the prometheus side
15:29:33 <nirik> but that's just the cluster.
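As a possible shorter-term bridge of the kind nirik mentions below, a small nagios-style check could pull firing alerts from the Alertmanager API so they at least surface in the existing nagios/IRC flow. This is only a sketch: the token file path is made up, access would need something like the cluster-monitoring-view role discussed just below, and the API path (/api/v1/alerts here) varies between Alertmanager versions:

    #!/usr/bin/env python3
    # Hypothetical bridge check: report firing Alertmanager alerts via nagios exit codes.
    # The Alertmanager URL is the staging one from the meeting; TOKEN_FILE is an assumption.
    import json
    import sys
    import urllib.request

    ALERTMANAGER = "https://alertmanager-main-openshift-monitoring.app.os.stg.fedoraproject.org"
    TOKEN_FILE = "/etc/nagios/openshift-monitoring.token"

    def main():
        token = open(TOKEN_FILE).read().strip()
        req = urllib.request.Request(
            ALERTMANAGER + "/api/v1/alerts",
            headers={"Authorization": "Bearer " + token})
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                alerts = json.load(resp)["data"]
        except Exception as err:
            print("UNKNOWN: could not query alertmanager: %s" % err)
            sys.exit(3)

        # report each distinct firing alert name once
        firing = sorted({a["labels"].get("alertname", "?") for a in alerts
                         if a.get("status", {}).get("state") == "active"})
        if firing:
            print("CRITICAL: %d firing alert(s): %s" % (len(firing), ", ".join(firing)))
            sys.exit(2)
        print("OK: no firing alerts")
        sys.exit(0)

    main()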
15:30:37 <pingou> nirik: getting 403 on both links
15:30:42 <mizdebsk> i can't login to alert manager - "403 Permission Denied - Invalid Account"; is it available to appowners, or just to cluster admins?
15:30:58 <nirik> huh, I can find out what perm that is and add everyone.
15:30:58 <cverna> how do we add prometheus checks on the applications?
15:31:09 <cverna> I was able to login
15:31:47 <nirik> ah, you need cluster-monitoring-view
15:31:50 <mizdebsk> like pingou, i can't login to either
15:32:28 <nirik> try now?
15:32:43 <nirik> anyhow, I think we should approach this from the other side.
15:33:09 <nirik> what do we want to monitor on the apps? then we can take that list and try and figure out how best to do it... nagios or alertmanager or whatever
15:33:53 <mizdebsk> for koschei all i need is to check whether some pods are running or not, and how many of them; this can be easily done with nagios
15:34:25 <pingou> I can see both links now, thanks nirik
15:34:29 <nirik> It would be nice to have a 'default' set of things and a way to add more
15:34:54 <mizdebsk> pods themselves will be checked by openshift using liveness probes - it will restart/fail them if they crash
15:35:14 <mizdebsk> once a pod has failed, it's not running, and nagios could see this by running "oc get pod" on os-master
15:36:07 <nirik> (with the right namespace, but sure)
15:36:11 <mkonecny> I would like to get a notification when a pod fails and why
15:36:40 * nirik is fine using nagios for now...
15:36:55 <nirik> if we can autogen monitoring for all apps that would be even better
15:37:53 <smooge> I will see what the nagios plugins for openshift can do and report back next week
15:37:55 <cverna> we need to give nagios the permission to execute the oc command
15:38:07 <cverna> smooge: +1
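A sketch of the kind of pod check mizdebsk describes, shelling out to `oc get pod` on the master; the namespace, threshold, and script name are illustrative assumptions, and the packaged nagios plugins smooge found may well do this differently:

    #!/usr/bin/env python3
    # check_openshift_pods.py (hypothetical): nagios-style check that counts
    # running pods in a namespace via `oc get pod -o json` and uses the usual
    # OK/WARNING/CRITICAL/UNKNOWN exit codes.
    import json
    import subprocess
    import sys

    NAMESPACE = "koschei"   # assumption: the app's namespace
    MIN_RUNNING = 1         # assumption: minimum pods expected

    def main():
        try:
            out = subprocess.check_output(
                ["oc", "get", "pod", "-n", NAMESPACE, "-o", "json"], timeout=30)
        except Exception as err:
            print("UNKNOWN: oc get pod failed: %s" % err)
            sys.exit(3)

        pods = json.loads(out)["items"]
        phases = [p["status"]["phase"] for p in pods]
        running = phases.count("Running")
        failed = phases.count("Failed") + phases.count("Unknown")

        if failed:
            print("CRITICAL: %d failed pod(s) in %s" % (failed, NAMESPACE))
            sys.exit(2)
        if running < MIN_RUNNING:
            print("WARNING: only %d running pod(s) in %s (want >= %d)"
                  % (running, NAMESPACE, MIN_RUNNING))
            sys.exit(1)
        print("OK: %d running pod(s) in %s" % (running, NAMESPACE))
        sys.exit(0)

    main()

The nrpe/nagios side would simply run this as the nagios user, which is where the permission question cverna raises comes in (a kubeconfig or service-account token readable by nagios).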
15:38:32 <smooge> ok anything else on this?
15:39:06 <smooge> #topic Tickets discussion
15:39:06 <smooge> #info https://pagure.io/fedora-infrastructure/report/Meetings%20ticket
15:39:06 <smooge> #info fedora-packages outage https://pagure.io/fedora-infrastructure/issue/7507 - cverna
15:39:16 <smooge> cverna you have something here?
15:39:41 <cverna> yes I would like to rebuild fedora-package boxes
15:39:51 <cverna> they are still running on f27
15:39:57 <nirik> mizdebsk: can you work on that? or I can when I get cycles (and if so we should put in a ticket to track)
15:40:16 <cverna> so I need to coordinate the outage with someone from ops who is available to nuke the boxes
15:40:19 <mizdebsk> rebuilding packages? sure, i can do the ops side
15:40:48 <smooge> what are the fedora-packages boxes?
15:41:04 <nirik> bah, wireless hung
15:41:08 <nirik> well, that was about the nagios thing, but whatever. ;)
15:41:10 <nirik> apps.fedoraproject.org/packages
15:41:18 <mizdebsk> packages03.phx2.fedoraproject.org packages04.phx2.fedoraproject.org packages03.stg.phx2.fedoraproject.org
15:42:01 <cverna> yes I can schedule the outage for next week and coordinate with mizdebsk
15:42:12 <mizdebsk> nirik, re nagios i think it's better to wait for smooge to investigate the openshift plugin first
15:42:31 <mizdebsk> if it turns out not feasible to install then i can write a custom plugin that would check running pods
15:43:19 <smooge> ok thanks. I will put that as my Friday work
15:45:14 <smooge> #topic Priorities for next week?
15:45:14 <smooge> #info please put tickets needing to be focused on here
15:45:14 <smooge> #info https://pagure.io/fedora-infrastructure/issue/7547
15:45:14 <smooge> #info autosign outage/slowdown
15:45:14 <smooge> #info nagios monitoring openshift
15:46:21 <smooge> Any other priorities for the team that need focusing and teamwork on?
15:47:35 <nirik> all the things. ;)
15:47:50 <cverna> :)
15:48:03 <nirik> Hopefully we will have taiga soon and can move our high level planning stuff there.
15:48:14 <nirik> That might help us prioritize and schedule
15:49:58 <smooge> I think this part of the meeting will be us looking at said taiga in the future, and it will probably take a larger percentage of the meeting
15:50:03 <mkonecny> Right now, everybody has their own work planned
15:50:20 <nirik> yeah, hard to coordinate
15:50:42 <mizdebsk> are we going to have per-subproject taiga boards? or one global one
15:51:03 <smooge> i don't know.. I have heard both so I am going to assume both
15:51:08 <nirik> per subproject I think...
15:51:08 <mizdebsk> right now i don't want to open too many tickets to track things as it annoys some of us :)
15:51:30 <nirik> well, like cpe will have one, marketing will have one, etc...
15:51:47 <nirik> but I guess some of that is TBD
15:52:46 <nirik> IMHO, taiga for 'large picture items' and pagure for 'tasks'
15:53:56 <mkonecny> What about goals for specific applications?
15:54:21 <cverna> application tracker I think
15:54:27 <nirik> if it's a large item, taiga, but if it's just related to the app, in the application tracker...
15:54:37 <mizdebsk> cverna, which is where exactly?
15:54:53 <mizdebsk> do you have some new tracker in mind?
15:55:00 <mkonecny> cverna: So this will still leave us divided into silos
15:55:12 <nirik> it depends on the app?
15:55:20 <cverna> mizdebsk: for example fpdc goals in fpdc tracker, then i can tag the ticket with something like taiga and it would be replicated to the infra board
15:55:26 <nirik> well, I don't think it's practical to use one tracker for everything.
15:55:39 <nirik> but perhaps I am not thinking out of the box enough. ;)
15:55:44 <cverna> nirik: agreed we will have to keep the project tracker for bugs etc
15:56:02 <cverna> we can't have everything in taiga, it would be unmanageable
15:56:37 <smooge> honestly.. I think we are going to need some input from the people who want us to use taiga, to know what they want us to use it for
15:57:54 <nirik> that too
15:58:00 <mkonecny> smooge: From what I understand, they want to have a high-level overview of goals
15:58:50 <nirik> My understanding is that they wanted one place where all (active) subprojects/sigs/teams have a presence...
15:59:02 <nirik> so if someone wanted say marketing, they could find it.
16:00:09 * cverna is not sure how this will work but I guess we have to wait and see :)
16:00:13 <nirik> but yeah, I guess we will see.
16:01:04 <smooge> #topic open floor
16:01:07 <smooge> ok we are over time
16:01:15 <smooge> so do we have anything else for this meeting?
16:03:13 <cverna> nothing on my side
16:03:25 <cverna> thanks smooge for running the show
16:03:30 <nirik> yeah, thanks smooge
16:03:39 <mkonecny> thanks smooge
16:03:40 <smooge> ok thank you all
16:03:43 <smooge> #endmeeting