18:00:27 #startmeeting Infrastructure (2017-08-10)
18:00:27 Meeting started Thu Aug 10 18:00:27 2017 UTC. The chair is smooge. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:00:27 Useful Commands: #action #agreed #halp #info #idea #link #topic.
18:00:27 The meeting name has been set to 'infrastructure_(2017-08-10)'
18:00:27 #meetingname infrastructure
18:00:27 The meeting name has been set to 'infrastructure'
18:00:27 #topic aloha
18:00:27 #chair smooge relrod nirik abadger1999 dgilmore threebean pingou puiterwijk pbrobinson
18:00:28 Current chairs: abadger1999 dgilmore nirik pbrobinson pingou puiterwijk relrod smooge threebean
18:00:51 * relrod waves
18:00:53 morning everyone.
18:01:30 hi
18:01:49 Hello everybody.
18:01:55 hello
18:01:58 hello
18:02:02 hi all.
18:02:09 * threebean waves
18:02:12 Hola
18:02:20 #topic New folks introductions
18:02:35 Hi, are there any new people to Infrastructure?
18:02:39 New guy here. Hi everyone.
18:02:45 *waving*
18:02:50 I am a newbie.
18:03:23 hi netcronin and profnixer
18:03:27 welcome
18:04:05 welcome
18:04:22 hi netcronin and profnixer, what are you interested in?
18:04:29 Hello guys! I'm new on the team
18:04:35 sysadmin work.
18:05:37 also the apprentice team, if there are open spots.
18:05:43 smooge: Sysadmin work, too. At first I'd like to take a closer look at https://pagure.io/fedora-infrastructure/issue/5290
18:06:30 smooge: Getting to know everything should be a good first step.
18:06:36 yeah.. same here
18:06:46 hello, sorry I am late :(
18:07:19 well, welcome to everyone.
18:07:23 #topic announcements and information
18:07:23 #info PHX2 Colo Trip coming up, Dec 4th - 9th
18:07:23 #info FLOCK at Cape Cod Aug 29 -> Sep 01
18:07:23 #info Fedora F27 Rebuild (going on now)
18:07:23 #info Update of all servers has been messy kevin/smooge/patrick/relrod
18:07:24 #info Bodhi 2.9.0 released. Deployment planned for Monday - bowlofeggs
18:07:26 #info fedmsg 0.19.1 released
18:07:30 #info autocloud 0.7.3 released.
18:07:47 So the PHX2 Colo trip, which was supposed to be next week, has had to be rescheduled
18:07:48 the f27 rebuild is actually all done
18:08:17 that means that there will be no rolling outage next week
18:08:31 sorry I missed updating that line nirik
18:08:45 me too
18:08:46 I am guessing the bodhi item is also an old one?
18:09:00 any other announcements?
18:09:00 no, I think that one did land... bowlofeggs ^
18:09:20 Yes, 2.9.0 is now in prod
18:11:28 oh yeah, bodhi was deployed, but that announcement was from last week
18:11:44 so i guess we could just say #info - bodhi 2.9.0 deployed this week.
18:12:11 there is a common issue that people have been hitting: https://github.com/fedora-infra/bodhi/issues/1731
18:12:26 there's a workaround documented there
18:12:27 should be online about pagure deployed for https://src.fedoraproject.org/
18:12:47 s/online/one line/
18:13:00 indeed...
18:13:02 #info pagure has been deployed for http://src.fedoraproject.org/
18:13:06 * dustymabe late
18:13:07 sorry guys
18:13:13 guys/gals
18:13:20 * smooge gives dustymabe some tickets to work his guilt out on
18:13:38 smooge: i've got some of those already :)
18:14:15 ok, any announcements for infrastructure on the atomic side, dustymabe?
18:14:46 smooge: I do have something to bring up
18:14:52 I brought this up at the releng meeting on Monday
18:15:02 but it is worth talking about here too for those who missed that
18:15:35 #info support for bodhi-backend01 to run pungi composes
18:15:46 well, I don't know if that needed an info
18:15:48 #undo
18:15:53 dustymabe: this is announcements. New topics should be in the open floor probably
18:16:10 puiterwijk: ok -
18:16:16 yeah.. sorry I wasn't clear
18:16:20 only announcement is
18:16:32 i was able to get bodhi+pungi running together in stage
18:16:36 cool
18:16:42 so using pungi instead of masher to create repos
18:16:55 EOM
18:17:01 #topic Update cycles. Lessons learned?
18:17:01 #info the last couple of update cycles have been hard
18:17:01 #info different reasons each time, but how can we do better?
18:17:26 So I just wanted to bring this up in case there were any immediate lessons learned we should put out here.
18:17:30 smooge: I have a suggestion
18:17:36 If not we can move to the next topic
18:18:12 i consulted for a financial company for a little while. they ran everything in the cloud
18:18:26 well, if/when we move more things to openshift we can deploy much faster and roll back much easier... but outages are still gonna be needed for some things
18:18:46 one philosophy that they used on everything is that every server they had was completely rebuilt from the ground up automatically every two weeks
18:19:17 i know that's not a place we can easily get to, but maybe we could think about it for the future
18:20:06 yes, there has been thought about getting there... we likely never will completely, but we definitely want to explore cloud more in the coming years
18:20:32 yeah, to me, this seems like an unnecessarily hard measure
18:20:48 one little problem i noticed in the update cycle is that some signature messages were missed by bodhi when bodhi-backend02 was offline
18:21:15 nirik: my point was less about "cloud" and more about rebuilding the instances automatically every two weeks
18:21:21 one idea to help there is to use some kind of durable message queue so backend02 can pick up what it missed when it comes back?
18:21:25 i realize cloud can make that easier
18:21:34 or another idea: make sure signatures don't happen if 02 is offline?
18:21:36 well, we won't ever be 100% in the cloud I don't think... since it would need a koji re-write and... what provider gives us ppc64/armv7/aarch64/s390x nodes?
18:21:39 dustymabe, that works in a cloud. in our limited hardware environment it is a lot harder
18:21:58 because no one wants anything down
18:22:20 bowlofeggs: we have that durable message queue. But datagrepper crashed
18:22:21 bowlofeggs: I think other apps handle that by keeping a place and querying datagrepper when they start back up
18:22:29 but that, yeah
18:22:34 however it is something we can work towards, as nirik says, in some places
18:22:34 Bodhi does that too
18:22:59 * nirik notes he has a topic on future plans later in this very meeting
18:23:01 But when bodhi tried to, datagrepper was offline and it couldn't
18:23:18 puiterwijk: oh yeah, that's true - should i do something to make fedmsg-hub look at datagrepper when it starts, or does it already work that way? i.e., any action on my part, or just a problem unique to this week and nothing to worry about?
18:23:28 One thing I wanted to bring up is that I would have much rather had fedmsg-0.19.0 blacklisted for updates rather than rushing a 0.19.1. Is there anything we can do about that in the future?
18:23:30 bowlofeggs: it already works that way
18:23:43 jcline: yes, not getting broken packages into epel-stable
18:23:44 ok cool, so no problem to solve really. nice!
18:23:55 jcline: aka, test in staging before stuff goes to stable
18:23:56 puiterwijk, that's not... a real solution.
18:24:11 jcline: well, stuff that's in stable is supposed to be .... stable
18:24:19 There _will_ be software that makes it to stable repos that has bugs.
18:24:50 Sure. But also stuff that is so widespread in all our infra, and that just doesn't do any part of its job?
18:24:58 Then we do a fixed build. Look at iptables
18:25:21 https://koji.fedoraproject.org/koji/buildinfo?buildID=953583
18:26:29 So the solution is for me to just never make any mistakes ever?
18:26:52 You will make mistakes, I know that. But then there's this thing called staging where you can test updates-testing packages
18:27:29 we could do a build in infra tags with an epoch?
18:27:31 puiterwijk: we all make mistakes - i think his ask is valid
18:27:33 (and downgrades)
18:27:45 yeah, why not just downgrade to the older working version?
18:27:46 nirik: yep, that's what I suggested as well
18:27:56 That's not going to catch all the problems, and that's not even the issue I'm talking about. I'm talking about dealing with the situation we were in last night in a better way.
18:27:59 dustymabe: because that means we first do a yum update and then a yum downgrade on 500 boxes
18:28:10 puiterwijk: ansible?
18:28:22 ^
18:28:24 sure.
18:28:27 dustymabe: yes, and then all the things that Require fedmsg-0.19.0
18:28:30 but it's not just one package either
18:28:49 It's fedmsg-0.19.0, fedmsg-base-0.19.0, python2-fedmsg-0.19.0, and more
18:29:07 ok, yeah, that starts to get more complicated if there are a lot of other packages, but are they all built from the same srpm?
18:29:15 or were there other packages that also depended on it?
18:29:25 it's also very stressful on the ops side to try and work on that at the same time as all the planned stuff
18:29:39 nirik: indeed
18:29:57 * nirik notes we were around about 6 hours on the outage yesterday... it was a long day.
18:30:02 Do note that we're fine with downgrading some specific stuff, like datagrepper, but with something as core as fedmsg, that's going to cause a lot of pain everywhere
18:30:30 datagrepper is two packages on two servers (+2 packages on a single stg server).
18:30:37 The bigger issue is that we need to make sure that certain tools we are putting into epel are getting more testing somewhere
18:30:39 But fedmsg is literally on all of our systems
18:30:40 we could possibly look at using distro-sync better
18:31:03 nirik: would be cool if you had an 'infra' ostree
18:31:07 that might help us back out stuff quicker.
18:31:28 if we disable epel* and enable infra-tags* and have only the older one tagged in there.
18:31:32 dustymabe: hahaha. Last time we tried to use ostree in infra it was quickly disregarded
18:31:53 ok, anyway. let's focus here
18:32:08 so, more testing, better backout process...
18:32:24 would updating stg more often help out? would it have found any of the things we hit?
18:32:31 nirik: yes
18:32:34 this was mainly meant to go over any lessons learned. We learned that we clearly didn't test some apps as well as we should. we also learned we didn't have a good method for backing out/blacklisting packages
18:32:48 it would've hit the fedmsg issues, it would've hit the IPA issues, ...
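[Editor's note: a minimal Ansible sketch of the backout idea floated above (disable the epel repos, enable an infra-tags repo that carries only the older, known-good fedmsg build, then distro-sync down to it). The inventory group, repo name, baseurl, and playbook name are placeholders, not the real Fedora Infrastructure layout, and the actual infra playbooks may handle this differently.]

---
# backout-fedmsg.yml -- hypothetical sketch, not an actual infra playbook
- name: Roll hosts back to the fedmsg build tagged into the infra repo
  hosts: fedmsg_hosts            # placeholder group covering the ~500 affected boxes
  serial: 50                     # touch the fleet in batches rather than all at once
  tasks:
    - name: Point hosts at the infra-tags repo carrying the older, known-good build
      yum_repository:
        name: infrastructure-tags
        description: Fedora Infrastructure tagged builds
        baseurl: https://infrastructure.example/repo/el7/$basearch/   # placeholder URL
        gpgcheck: yes
        enabled: yes

    - name: Distro-sync every fedmsg subpackage down to the version in that repo
      command: >
        yum distro-sync -y --disablerepo=epel* --enablerepo=infrastructure-tags
        'fedmsg*' 'python2-fedmsg*'

[The same repo could also serve the epoch-bump idea from 18:27:29: tag a rebuilt package with a higher epoch into the infra repo and a plain yum update pulls the rollback in, with no distro-sync needed.]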
18:32:51 nirik++ for updating stg more often
18:33:11 * puiterwijk notes that application owners can also run a "yum update" on some boxes if they want to test their software
18:33:16 do we enable epel-testing there? I can't recall.
18:33:28 No, we do not by default, but some playbooks enable it
18:34:04 I would like to move that we try making epel-testing the default on staging
18:34:33 might help...
18:34:45 Yeah. I think having regular automated updates to staging would also be good to catch issues in the *-testing repos before they hit stable
18:34:46 along with perhaps a weekly update cycle
18:35:32 How about just a cron "yum update -y" on all systems and call it a day. Never update cycles, everything broken every other day
18:35:42 I would also like to move to having Monday be yum update day on all staging hosts
18:36:05 puiterwijk: if they were all dnf, dnf-automatic is actually nice (and already enabled somewhat on all fedora machines)
18:36:18 yum-cron is kinda poor sadly
18:37:54 Anyway. Weekly updates in stg, sure
18:37:57 I'm fine with trying a Monday update cycle on stg... we could/should even ansible playbook the thing
18:39:16 anything else on this? ;)
18:39:21 nirik, I would like to do it on a regular day that we can just say we are doing and skip when we know it won't work
18:39:29 I figure Monday is better than Friday
18:39:38 beyond that I have nothing else
18:39:53 #topic flock infrastructure of the future workshop - kevin
18:40:04 Do you use any kind of central repo management, like Spacewalk for example?
18:40:24 so, I just wanted to advertise my flock workshop and get people thinking about it.
18:41:03 I'd like to discuss and plan out, say a year and 5 years out, where and what we see our infrastructure doing/being.
18:41:54 I have some starter thoughts on it... get stuff in openshift, start an openshift in some clouds and move some things there, move things that make sense
18:42:28 try and get to where we do CI/constant deployment on apps that folks are ready to do that on
18:42:57 perhaps move to a setup where we can move things around from our cloud to a 3rd party cloud to another 3rd party cloud...
18:43:39 anyhow, it's all just handwavy... but I'd love to discuss and see how we can make things more distributed, eat more dogfood and make our lives easier.
18:43:56 for those of you not at flock, we can discuss on list or in admin sometime
18:44:14 * cyberpear showed up late, but has now read the scrollback
18:44:16 profnixer: we don't have spacewalk, but we have central repos we control.
18:45:25 nirik, did they give a tentative time yet for the talk?
18:45:42 nirik: i'll try really hard to be there
18:45:43 yeah, the flock schedule is up now
18:45:49 i have some opinions on this matter
18:46:00 https://flock2017.sched.com/
18:46:28 thursday at 9:30am it seems
18:46:41 pingou was going to try and get them to move the pagure one to another time
18:47:49 ah ok
18:47:58 that would hopefully be possible
18:48:48 I will be there (outside circumstances like a supernova notwithstanding)
18:48:53 if we move most things to openshift and get them to redeploy on commits or at releases we could at least not have app problems at outages
18:49:06 ready for the next subject?
18:49:09 we would have them anytime. ;)
18:49:11 sure
18:49:15 oh sorry
18:49:18 #topic Upstream and downstream tickets - kevin
18:50:42 so this came up in some tickets this week... and I thought it would be a good topic to discuss.
18:50:48 (although we don't have much time left)
18:51:06 sorry about that..
18:51:26 we often get someone filing an infra ticket about app X. We ask them to refile on app X's bug tracker and they do. We then close the ticket as UPSTREAM
18:51:40 but the fix isn't deployed or anything yet
18:52:05 so they have to follow the upstream bug. But then it might be committed and they have no idea when we deploy the new version.
18:52:27 it was suggested we keep all such bugs open and close them only when we have deployed the fix.
18:52:45 In an ideal world I agree with that.
18:52:50 That would mean we need to track all of the upstream bugs we file
18:52:58 But we are pretty lacking in cycles.
18:53:13 yep... and what release they got into and when we deploy it.
18:53:41 in some cases where we aren't related to the app, the upstream may not even mention our bug...
18:54:12 Could we put a notice about going to the correct upstream bug tracker on the main fedora-infrastructure pagure homepage?
18:54:35 I see we have a welcome message - we could add something about "if you have an app issue to report..."
18:54:42 netcronin: sure, but it doesn't solve this problem fully... it just means people will still not know when their bug is actually fixed.
18:54:42 ok, i have to run soon. basically all i wanted to say was: look at the following two issues I opened and weigh in
18:54:46 https://pagure.io/fedora-infrastructure/issue/6193
18:54:52 https://pagure.io/fedora-infrastructure/issue/6192
18:55:17 for the last one I could use some help from infra, but if we're all too busy I'll try to open pull requests for the infra repo myself
18:55:23 and by pull request I mean emails
18:56:07 dustymabe: ok, have fun
18:56:53 so, we are out of time, but I guess if someone has thoughts on this we can continue discussing in admin.
18:57:17 I think we are at the "we all need a breather and a walk around the block" point
18:57:18 I just don't think the sysadmin side of things has cycles to keep track of every bug and tie it to every deployment.
18:57:24 yeah.
18:58:38 I agree on this. I don't think we have the infrastructure of a complex ticketing system which would track all this, or the cycles to look at it etc
18:59:15 ok, I am going to need to close the meeting out today
18:59:38 thanks for running it smooge
18:59:40 discussions to continue on mailing list or #fedora-admin after a walk
18:59:52 #endmeeting
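[Editor's note: a sketch of what the Monday staging update run proposed under the "Update cycles" topic could look like once it is ansible-playbooked, assuming a hypothetical 'staging' inventory group, the stock epel-testing repo file, and a made-up playbook name; the real infra inventory and repo naming may differ.]

---
# stg-weekly-update.yml -- hypothetical sketch of the Monday staging update run
- name: Weekly package update of all staging hosts
  hosts: staging                 # placeholder group name
  serial: "25%"                  # update a quarter of stg at a time so breakage stays contained
  tasks:
    - name: Make sure epel-testing is enabled on staging by default
      ini_file:
        dest: /etc/yum.repos.d/epel-testing.repo
        section: epel-testing
        option: enabled
        value: "1"

    - name: Update everything, pulling from epel-testing so problems surface in stg first
      yum:
        name: '*'
        state: latest
        enablerepo: epel-testing

[Service restarts and reboots are left out of the sketch; a real run would also want to be skippable on weeks where a freeze or known-bad update is in flight, per the "skip when we know it won't work" comment at 18:39:21.]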