18:00:27 <smooge> #startmeeting Infrastructure (2017-08-10)
18:00:27 <zodbot> Meeting started Thu Aug 10 18:00:27 2017 UTC.  The chair is smooge. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:00:27 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
18:00:27 <zodbot> The meeting name has been set to 'infrastructure_(2017-08-10)'
18:00:27 <smooge> #meetingname infrastructure
18:00:27 <zodbot> The meeting name has been set to 'infrastructure'
18:00:27 <smooge> #topic aloha
18:00:27 <smooge> #chair smooge relrod nirik abadger1999 dgilmore threebean pingou puiterwijk pbrobinson
18:00:28 <zodbot> Current chairs: abadger1999 dgilmore nirik pbrobinson pingou puiterwijk relrod smooge threebean
18:00:51 * relrod waves
18:00:53 <nirik> morning everyone.
18:01:30 <puiterwijk> hi
18:01:49 <profnixer> Hello everybody.
18:01:55 <smooge> hello
18:01:58 <bgray> hello
18:02:02 <netcronin> hi all.
18:02:09 * threebean waves
18:02:12 <cverna[m]> Hola
18:02:20 <smooge> #topic New folks introductions
18:02:35 <smooge> Hi are there any new people to Infrastructure?
18:02:39 <netcronin> New guy here. Hi everyone.
18:02:45 <profnixer> *waving*
18:02:50 <profnixer> I am a newbie.
18:03:23 <cverna[m]> hi netcronin and profnixer
18:03:27 <cverna[m]> welcome
18:04:05 <nirik> welcome
18:04:22 <smooge> hi netcronin and profnixer what are you interested in?
18:04:29 <brunofurtado> Hello guys! I'm new in the team
18:04:35 <netcronin> sysadmin work.
18:05:37 <netcronin> also the apprentice team, if there are open spots.
18:05:43 <profnixer> smooge: Sysadmin work, too. At first I'd like to take a closer look at https://pagure.io/fedora-infrastructure/issue/5290
18:06:30 <profnixer> smooge: Getting to know everything should be a good first step.
18:06:36 <smooge> yeah.. same here
18:06:46 <clime> hello, sry I am late :(
18:07:19 <smooge> well welcome to everyone.
18:07:23 <smooge> #topic announcements and information
18:07:23 <smooge> #info PHX2 Colo Trip coming up, Dec 4th - 9th
18:07:23 <smooge> #info FLOCK at Cape Cod, Aug 29 -> Sep 01
18:07:23 <smooge> #info Fedora F27 Rebuild (going on now)
18:07:23 <smooge> #info Update of all servers has been messy - kevin/smooge/patrick/relrod
18:07:24 <smooge> #info Bodhi 2.9.0 released. Deployment planned for Monday - bowlofeggs
18:07:26 <smooge> #info fedmsg 0.19.1 released
18:07:30 <smooge> #info autocloud 0.7.3 released.
18:07:47 <smooge> So the PHX2 Colo trip which was supposed to be next week has had to be rescheduled
18:07:48 <nirik> the f27 rebuild is actually all done
18:08:17 <smooge> that means that there will be no rolling outage for next week
18:08:31 <smooge> sorry I missed updating that line nirik
18:08:45 <nirik> me too
18:08:46 <smooge> I am guessing the bodhi announcement is also an old one?
18:09:00 <smooge> any other announcements?
18:09:00 <nirik> no, I think that one did land... bowlofeggs ^
18:09:20 <puiterwijk> Yes, 2.9.0 is now in prod
18:11:28 <bowlofeggs> oh yeah, bodhi was deployed, but that announcement was from last week
18:11:44 <bowlofeggs> so i guess we could just say #info - bodhi 2.9.0 deployed this week.
18:12:11 <bowlofeggs> there is a common issue that people have been hitting: https://github.com/fedora-infra/bodhi/issues/1731
18:12:26 <bowlofeggs> there's a workaround documented there
18:12:27 <cverna[m]> should be online about pagure deployed for https://src.fedoraproject.org/
18:12:47 <cverna[m]> s/online/one line/
18:13:00 <nirik> indeed...
18:13:02 <smooge> #info pagure has been deployed for http://src.fedoraproject.org/
18:13:06 * dustymabe late
18:13:07 <dustymabe> sorry guys
18:13:13 <dustymabe> guys/gals
18:13:20 * smooge gives dustymabe some tickets to work his guilt out on
18:13:38 <dustymabe> smooge: i've got some of those already :)
18:14:15 <smooge> ok any announcements for infrastructure on the atomic side dustymabe ?
18:14:46 <dustymabe> smooge: I do have something to bring up
18:14:52 <dustymabe> I brought this up at the releng meeting on monday
18:15:02 <dustymabe> but it is worth talking about here too for those who missed that
18:15:35 <dustymabe> #info support for bodhi-backend01 to run pungi composes
18:15:46 <dustymabe> well I don't know if that needed an info
18:15:48 <dustymabe> #undo
18:15:53 <puiterwijk> dustymabe: this is announcements. New topics should be in the open floor probably
18:16:10 <dustymabe> puiterwijk: ok -
18:16:16 <smooge> yeah.. sorry I wasn't clear
18:16:20 <dustymabe> only announcement is
18:16:32 <dustymabe> i was able to get bodhi+pungi running together in stage
18:16:36 <smooge> cool
18:16:42 <dustymabe> so using pungi instead of masher to create repos
18:16:55 <dustymabe> EOM
18:17:01 <smooge> #topic Update cycles. Lessons learned?
18:17:01 <smooge> #info the last couple of update cycles have been hard
18:17:01 <smooge> #info different reasons each time but how can we do better?
18:17:26 <smooge> So I just wanted to bring this up in case there were any immediate lessons learned we should put out here.
18:17:30 <dustymabe> smooge: I have a suggestion
18:17:36 <smooge> If not we can move to the next topic
18:18:12 <dustymabe> i consulted for a financial company for a little while. they ran everything in the cloud
18:18:26 <nirik> well, if/when we move more things to openshift we can deploy much faster and roll back much easier... but outages are still gonna be needed for some things
18:18:46 <dustymabe> one philosophy they applied to everything is that every server they had was completely rebuilt from the ground up automatically every two weeks
18:19:17 <dustymabe> i know that's not a place we can easily get to, but maybe we could think about it for the future
18:20:06 <nirik> yes, there has been thought about getting there...we likely never will completely, but we definitely want to explore cloud more in the coming years
18:20:32 <clime> ye, to me, this seems like an unnecessarily hard measure
18:20:48 <bowlofeggs> one little problem i noticed in the update cycle is that some signature messages were missed by bodhi when bodhi-backend02 was offline
18:21:15 <dustymabe> nirik: my point was less about "cloud" and more about rebuilding the instances automatically every two weeks
18:21:21 <bowlofeggs> one idea to help there is to use some kind of durable message queue so backend02 can pick up what it missed when it comes back?
18:21:25 <dustymabe> i realize cloud can make that easier
18:21:34 <bowlofeggs> or another idea: make sure signatures don't happen if 02 is offline?
18:21:36 <nirik> well, we won't ever be 100% in the cloud I don't think... since it would need a koji re-write and... what provider gives us ppc64/armv7/aarch64/s390x nodes?
18:21:39 <smooge> dustymabe, that works in a cloud. in our limited hardware environment it is a lot harder
18:21:58 <smooge> because no one wants anything down
18:22:20 <puiterwijk> bowlofeggs: we have that durable message queue. But datagrepper crashed
18:22:21 <nirik> bowlofeggs: I think other apps handle that by keeping their place and querying datagrepper when they start back up
18:22:29 <nirik> but that yeah
18:22:34 <smooge> however it is something we can work towards in some places, as nirik says
18:22:34 <puiterwijk> Bodhi does that too
18:22:59 * nirik notes he has a topic on future plans later in this very meeting
18:23:01 <puiterwijk> But when bodhi tried to, datagrepper was offline and it couldn't
18:23:18 <bowlofeggs> puiterwijk: oh yeah, that's true - should i do something to make fedmsg-hub look at datagrepper when it starts, or does it already work that way? i.e., any action on my part, or just a problem unique to this week and nothing to worry about?
18:23:28 <jcline> One thing I wanted to bring up is that I would have much rather had fedmsg-0.19.0 blacklisted for update rather than rushing a 0.19.1. Is there anything we can do about that in the future?
18:23:30 <puiterwijk> bowlofeggs: it already works that way
18:23:43 <puiterwijk> jcline: yes, not get broken packages into epel-stable
18:23:44 <bowlofeggs> ok cool, so no problem to solve really. nice!
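For readers of the log: below is a minimal Python sketch of the catch-up pattern nirik and puiterwijk describe above — an app remembers the timestamp of the last message it processed and, on startup, replays anything newer from datagrepper. The state-file path, query parameters, and JSON field names are assumptions based on the public datagrepper /raw API; this is not the actual fedmsg-hub or Bodhi code.

    # Sketch only: replay fedmsg messages missed while an app was offline,
    # by paging through datagrepper starting from the last-seen timestamp.
    import os
    import requests

    DATAGREPPER = "https://apps.fedoraproject.org/datagrepper/raw"
    STATE_FILE = "/var/lib/myapp/last_seen_timestamp"  # illustrative path

    def replay_missed(handle_message):
        # Load the timestamp of the last message we processed (0 = replay all).
        last_seen = 0.0
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                last_seen = float(f.read().strip() or 0)

        page, pages = 1, 1
        while page <= pages:
            resp = requests.get(DATAGREPPER, params={
                "start": last_seen,      # seconds since the epoch
                "rows_per_page": 100,
                "page": page,
            })
            resp.raise_for_status()
            data = resp.json()
            pages = data["pages"]
            for msg in data["raw_messages"]:
                handle_message(msg)      # re-process each missed message
                last_seen = max(last_seen, msg["timestamp"])
            page += 1

        # Persist the new high-water mark for the next restart.
        with open(STATE_FILE, "w") as f:
            f.write(str(last_seen))

Per puiterwijk above, Bodhi already implements this pattern; the problem this week was that datagrepper itself was offline when Bodhi tried to catch up.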
18:23:55 <puiterwijk> jcline: aka, test in staging, before stuff goes to stable
18:23:56 <jcline> puiterwijk, that's not... a real solution.
18:24:11 <puiterwijk> jcline: well, stuff that's in stable is supposed to be .... stable
18:24:19 <jcline> There _will_ be software that makes it to stable repos that has bugs.
18:24:50 <puiterwijk> Sure. But also stuff that is so widespread in all our infra, and that just doesn't do any part of its job?
18:24:58 <puiterwijk> Then we do a fixed build. Look at iptables
18:25:21 <puiterwijk> https://koji.fedoraproject.org/koji/buildinfo?buildID=953583
18:26:29 <jcline> So the solution is for me to just never make any mistakes ever?
18:26:52 <puiterwijk> You will make mistakes, I know that. But then there's this thing called staging where you can test updates-testing packages
18:27:29 <nirik> we could do a build in infra tags with an epoch?
18:27:31 <dustymabe> puiterwijk: we all make mistakes - i think his ask is valid
18:27:33 <nirik> (and downgrades)
18:27:45 <dustymabe> yeah why not just downgrade to the older working version?
18:27:46 <puiterwijk> nirik: yep, that's what I suggested as well
18:27:56 <jcline> That's not going to catch all the problems, and that's not even the issue I'm talking about. I'm talking about dealing with the situation we were in last night in a better way.
18:27:59 <puiterwijk> dustymabe: because that means we first do a yum update and then a yum downgrade on 500 boxes
18:28:10 <dustymabe> puiterwijk: ansible?
18:28:22 <jcline> ^
18:28:24 <nirik> sure.
18:28:27 <puiterwijk> dustymabe: yes, and then all the things that Require fedmsg-0.19.0
18:28:30 <nirik> but it's not just one package either
18:28:49 <puiterwijk> It's fedmsg-0.19.0, fedmsg-base-0.19.0, python2-fedmsg-0.19.0, and more
18:29:07 <dustymabe> ok yeah that starts to get more complicated if there are a lot of other packages, but are they all built from the same srpm?
18:29:15 <dustymabe> or were there other packages that also depended on it?
18:29:25 <nirik> it's also very stressful on the ops side to try and work on that at the same time as all the planned stuff
18:29:39 <dustymabe> nirik: indeed
18:29:57 * nirik notes we were at it for about 6 hours on the outage yesterday... it was a long day.
18:30:02 <puiterwijk> Do note that we're fine with downgrading some specific stuff, like datagrepper, but with something as core as fedmsg, that's going to cause a lot of pain everywhere
18:30:30 <puiterwijk> datagrepper is two packages on two servers (+2 packages on a single stg server).
18:30:37 <smooge> The bigger issue is that we need to make sure that certain tools we are putting into epel are getting more testing somewhere
18:30:39 <puiterwijk> But fedmsg is literally on all of our systems
18:30:40 <nirik> we could possibly look at using distro-sync better
18:31:03 <dustymabe> nirik: would be cool if you had an 'infra' ostree
18:31:07 <nirik> that might help us backout stuff quicker.
18:31:28 <nirik> if we disable epel* and enable infra-tags* and have only the older one tagged in there.
18:31:32 <puiterwijk> dustymabe: hahaha. Last time we tried to use ostree in infra it was quickly disregarded
18:31:53 <smooge> ok anyway. let's focus here
18:32:08 <nirik> so, more testing, better backout process...
18:32:24 <nirik> would updating stg more often help out? would it have found any of the things we hit?
18:32:31 <puiterwijk> nirik: yes
18:32:34 <smooge> this was mainly meant to go over any lessons learned. We learned that we clearly didn't test some apps as well as we should have. we also learned we didn't have a good method for backing out/blacklisting packages
18:32:48 <puiterwijk> it would've hit the fedmsg issues, it would've hit the IPA issues, ...
18:32:51 <dustymabe> nirik++ for updating stg more often
18:33:11 * puiterwijk notes that application owners can also run a "yum update" on some boxes if they want to test their software
18:33:16 <nirik> do we enable epel-testing there ? I can't recall.
18:33:28 <puiterwijk> No, we do not by default, but some playbooks enable it
18:34:04 <smooge> I would like to move that we try making epel-testing default on staging
18:34:33 <nirik> might help...
18:34:45 <jcline> Yeah. I think having regular automated updates to staging would also be good to catch issues in the *-testing repos before they hit stable
18:34:46 <nirik> along with perhaps a weekly update cycle
18:35:32 <puiterwijk> How about just a cron "yum update -y" on all systems and call it a day. No more update cycles, everything broken every other day
18:35:42 <smooge> I would also like to move to having Monday be the yum update day on all staging hosts
18:36:05 <nirik> puiterwijk: if they were all dnf, dnf-automatic is actually nice (and already enabled somewhat on all fedora machines)
18:36:18 <nirik> yum cron is kinda poor sadly
18:37:54 <puiterwijk> Anyway. Weekly updates in stg sure
18:37:57 <nirik> I'm fine with trying a monday update cycle on stg... we could/should even ansible-playbook the thing
18:39:16 <nirik> anything else on this? ;)
18:39:21 <smooge> nirik, I would like to do it on a regular day that we can just say we are doing, and skip it when we know it won't work
18:39:29 <smooge> I figure monday is better than friday
18:39:38 <smooge> beyond that I have nothing else
18:39:53 <smooge> #topic flock infrastructure of the future workshop - kevin
18:40:04 <profnixer> Do you use any kind of central repo management like Spacewalk for example?
18:40:24 <nirik> so, I just wanted to advertise my flock workshop and get people thinking about it.
18:41:03 <nirik> I'd like to discuss and plan out, say a year and 5 years from now, where and what we see our infrastructure doing/being.
18:41:54 <nirik> I have some starter thoughts on it... get stuff in openshift, start an openshift in some clouds and move some things there, move things that make sense
18:42:28 <nirik> try and get to where we do CI/continuous deployment on apps that folks are ready to do that on
18:42:57 <nirik> perhaps move to a setup where we can move things around from our cloud to a 3rd party cloud to another 3rd party cloud...
18:43:39 <nirik> anyhow, it's all just handwavy... but I'd love to discuss and see how we can make things more distributed, eat more dogfood and make our lives easier.
18:43:56 <nirik> for those of you not at flock, we can discuss on list or in admin sometime
18:44:14 * cyberpear showed up late, but has now read the scrollback
18:44:16 <nirik> profnixer: we don't have spacewalk, but we have central repos we control.
18:45:25 <smooge> nirik, did they give a tentative time yet for the talk?
18:45:42 <dustymabe> nirik: i'll try really hard to be there
18:45:43 <nirik> yeah, the flock schedule is up now
18:45:49 <dustymabe> i have some opinions on this matter
18:46:00 <nirik> https://flock2017.sched.com/
18:46:28 <nirik> thursday at 9:30am it seems
18:46:41 <nirik> pingou was going to try and get them to move the pagure one to another time
18:47:49 <smooge> ah ok
18:47:58 <smooge> that would hopefully be possible
18:48:48 <smooge> I will be there (outside circumstances like a supernova notwithstanding)
18:48:53 <nirik> if we move most things to openshift and get them to redeploy on commits or at releases we could at least not have app problems at outages
18:49:06 <smooge> ready for the next subject?
18:49:09 <nirik> we would have them anytime. ;)
18:49:11 <nirik> sure
18:49:15 <smooge> oh sorry
18:49:18 <smooge> #topic Upstream and downstream tickets - kevin
18:50:42 <nirik> so this came up in some tickets this week... and I thought it would be a good topic to discuss.
18:50:48 <nirik> (although we don't have much time left)
18:51:06 <smooge> sorry about that..
18:51:26 <nirik> we often get someone filing an infra ticket about app X. We ask them to refile on app X's bug tracker and they do. We then close the ticket as UPSTREAM
18:51:40 <nirik> but the fix isn't deployed or anything yet
18:52:05 <nirik> so they have to follow the upstream bug. But then it might be committed, but they have no idea when we deploy the new version.
18:52:27 <nirik> it was suggested we keep all such bugs open and close them only when we have deployed the fix.
18:52:45 <nirik> In the ideal world I agree with that.
18:52:50 <puiterwijk> That would mean we need to track all of the upstream bugs we file
18:52:58 <nirik> But we are pretty lacking in cycles.
18:53:13 <nirik> yep... and what release they got into and when we deploy it.
18:53:41 <nirik> in some cases where we aren't related to the app the upstream may not even mention our bug...
18:54:12 <netcronin> Could we put a notice about going to the correct upstream bug tracker on the main fedora-infrastructure pagure homepage?
18:54:35 <netcronin> I see we have a welcome message - we could add something about "if you have an app issue to report..."
18:54:42 <nirik> netcronin: sure, but it doesn't solve this problem fully... it just means people will still not know when their bug is actually fixed.
18:54:42 <dustymabe> ok i have to run soon. basically all i wanted to say was: look at the following two issues I opened and weigh in
18:54:46 <dustymabe> https://pagure.io/fedora-infrastructure/issue/6193
18:54:52 <dustymabe> https://pagure.io/fedora-infrastructure/issue/6192
18:55:17 <dustymabe> for the last one I could use some help from infra, but if we're all too busy I'll try to open pull requests for the infra repo myself
18:55:23 <dustymabe> and by pull request I mean emails
18:56:07 <nirik> dustymabe: ok, have fun
18:56:53 <nirik> so, we are out of time, but I guess if someone has thoughts on this we can continue discussing in admin.
18:57:17 <smooge> I think we are all at the "need a breather and a walk around the block" point
18:57:18 <nirik> I just don't think the sysadmin side of things has cycles to keep track of every bug and tie it to every deployment.
18:57:24 <nirik> yeah.
18:58:38 <smooge> I agree on this. I don't think we have the infrastructure of a complex ticketing system which would track all this, or the cycles to look at it, etc
18:59:15 <smooge> ok I am going to need to close the meeting out today
18:59:38 <relrod> thanks for running it smooge
18:59:40 <smooge> discussions to continue on mailing list or #fedora-admin after a walk
18:59:52 <smooge> #endmeeting