15:00:24 <jpena> #startmeeting RDO meeting - 2019-03-27
15:00:24 <zodbot> Meeting started Wed Mar 27 15:00:24 2019 UTC.
15:00:24 <zodbot> This meeting is logged and archived in a public location.
15:00:24 <zodbot> The chair is jpena. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:24 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
15:00:24 <zodbot> The meeting name has been set to 'rdo_meeting_-_2019-03-27'
15:00:25 <openstack> Meeting started Wed Mar 27 15:00:24 2019 UTC and is due to finish in 60 minutes.  The chair is jpena. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:27 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:30 <openstack> The meeting name has been set to 'rdo_meeting___2019_03_27'
15:00:31 <jpena> #topic roll call
15:00:44 <fultonj> o/
15:00:48 <fmount> o/
15:00:50 <jpena> #chair fultonj
15:00:50 <zodbot> Current chairs: fultonj jpena
15:00:57 <jpena> #chair fmount
15:00:57 <zodbot> Current chairs: fmount fultonj jpena
15:01:06 <rdogerrit> Merged openstack/placement-distgit stein-rdo: openstack-placement-1.0.0-0.2.0rc2  https://review.rdoproject.org/r/19781
15:01:08 <jpena> remember you can add last-minute topics to the agenda at https://etherpad.openstack.org/p/RDO-Meeting
15:01:30 <Vorrtex> o/
15:01:38 <jpena> #chair Vorrtex
15:01:38 <zodbot> Current chairs: Vorrtex fmount fultonj jpena
15:01:39 <amoralej> o/
15:01:43 <PagliaccisCloud> finally somewhat on time!  ٩( ᐛ )و
15:01:47 <jpena> #chair amoralej PagliaccisCloud
15:01:47 <zodbot> Current chairs: PagliaccisCloud Vorrtex amoralej fmount fultonj jpena
15:01:49 <ykarel> o/
15:01:56 <jpena> #chair
15:01:56 <zodbot> Current chairs: PagliaccisCloud Vorrtex amoralej fmount fultonj jpena
15:02:15 <jpena> #chair ykarel
15:02:15 <zodbot> Current chairs: PagliaccisCloud Vorrtex amoralej fmount fultonj jpena ykarel
15:02:31 <baha> o/
15:02:39 <mjturek> o/
15:03:08 <jpena> #chair mjturek baha
15:03:08 <zodbot> Current chairs: PagliaccisCloud Vorrtex amoralej baha fmount fultonj jpena mjturek ykarel
15:04:05 <jpena> let's start with the topics
15:04:11 <jpena> #topic ppc64le containers build update
15:04:21 <jpena> mjturek: we can merge your topic and mine if you're ok
15:04:35 <mjturek> yeah sure
15:04:41 <mjturek> though baha is afk for a minute
15:04:51 <mjturek> could we maybe do the next topic first?
15:05:02 <jpena> ok
15:05:07 <mjturek> thank you!
15:05:10 <jpena> #topic  Ceph Nautilus update
15:05:43 <fmount> we're working to get Nautilus + Ansible 2.7 running
15:05:46 <fmount> as per  https://review.rdoproject.org/r/#/c/18721/
15:06:19 <fmount> fultonj: we still have some networking issues that are not strictly related to the container we're using
15:06:22 <fultonj> dsavineau ^
15:06:42 <fmount> tag: master-5d15bed-nautilus-centos-7-x86_64
15:06:48 <fultonj> do you have what you need to figure out what the network issues are?
15:06:58 <amoralej> so, what's the problem now? sorry, i was on pto for two days
15:07:11 <amoralej> network related?
15:07:28 <fultonj> do you think the reproducer env provided by ykarel is sufficient for you to figure out the root cause of the network issue?
15:07:29 <fmount> fultonj: I'm reproducing the CI on my own env to figure out how to fix this
15:07:54 <fmount> fultonj: nope, maybe we need a fresh recheck
15:08:26 <fmount> because we tweaked that env a lot, which is why I'm reproducing the issue locally
15:08:34 <fmount> (using CI conf)
15:08:38 <fultonj> fmount or dsavineau can you explain network issue to amoralej ?
15:08:53 <fmount> fultonj: sure
15:09:13 <fmount> amoralej: the issue is that the mons don't bootstrap the cluster
15:09:37 <fmount> amoralej: and continue to send probes to an empty list
15:09:55 <fmount> amoralej: then we restarted the mon container on the eth0 dev ip address
15:10:35 <fmount> amoralej: and everything started working properly; in addition, we found that with this trick v1 on 6789 starts correctly
15:10:54 <fultonj> fmount: and that restart happens after OVN is configured?
15:11:04 <amoralej> what does "we restarted the mon container on the eth0 dev ip address" mean?
15:11:17 <fultonj> is it possible the steps in tripleo need adjusting to ensure networks are up before?
15:11:41 * Duck o/
15:11:56 <fmount> amoralej: it means that we stopped the mon container (the systemd unit) and ran it manually on the eth0 ip addr instead of the ovs bridge
15:12:12 <amoralej> ok
15:12:18 <amoralej> got it now
15:12:40 <fultonj> amoralej: if you restart it on the ovs bridge it breaks?
15:12:53 <fmount> fultonj: good question, it could be an idea to retrigger the job
15:13:09 <amoralej> i'm not sure
15:13:12 <amoralej> but
15:13:12 <fultonj> on the same box if you flip it back it breaks
15:13:18 <fultonj> (in theory)
15:13:24 <fultonj> i mean IF on the same box you flip it back and it breaks
15:13:27 <ykarel> fultonj, are you saying networks are not up till step 2? those bridges are up earlier, in the NetworkDeployment step
15:13:29 <fultonj> then diff the network settings
15:13:33 <amoralej> i'd say with a previous container from 4.0, it worked with ovn
15:13:33 <fultonj> to figure out why
15:13:39 <amoralej> at least it passed this point
15:13:42 <fultonj> amoralej: true
15:13:57 <fultonj> it used ceph 14.0 but now we're trying to use 14.2
15:14:17 <amoralej> and 14.2 changed something related to this networking stuff?
15:14:19 <fultonj> 14.0 wasn't running v1 protocol on 6789 which cinder was using
15:14:28 <fultonj> 14.0 was running 3300
15:14:31 <fultonj> v2 protocol
15:14:33 <fultonj> we want both
15:14:44 <fultonj> v1:6789 and v2:3300
15:14:55 <fultonj> 14.2 should run both
15:15:01 <amoralej> what i don't understand well is how enabling port 6789 affects networking
15:15:10 <amoralej> but i don't know about ceph, so.. :)
15:15:18 <fultonj> fmount: but i think you said you found v1 and v2 not running on 14.2 anyway?
15:15:22 <fmount> fultonj: amoralej I also found that when starting the mon container manually, 6789 is up
15:15:33 <amoralej> if you can build a reproducer i can involve ovn team if needed
15:16:01 <fultonj> fmount: is it fair to say we already have a reproducer in the env ykarel provided
15:16:02 <fultonj> ?
15:16:18 <fmount> yes, with that trick I've also seen 6789 up, and yatin was able to create a volume with cinder
15:16:26 <fmount> fultonj: yes
15:16:28 <fultonj> or do you want to rekick it? and then invite the ovn team to show them how you can change the network and then it's "fixed"?
15:16:34 <amoralej> fmount, and we can reproduce the issue?
15:16:43 <amoralej> i mean getting it to fail again?
15:16:47 <ykarel> fmount, i was able to create a cinder volume without 6789
15:17:18 <fmount> amoralej: imho we need to rekick it and start looking with a clean env
15:17:33 <amoralej> ack, that was my understanding
15:17:36 <fultonj> fmount: sounds good to me
15:17:44 <fmount> ykarel: oh sorry yes
15:18:27 <fultonj> fmount: so what should plan be?
15:18:28 <fmount> fultonj: I propose a clean env because things are getting more confusing now
15:19:30 <fultonj> if the same env is going to be used to overcloud delete and overcloud deploy, then rm -rf the fetch dir in between
15:19:33 <fmount> fultonj: we can rekick the job, reproduce the env and investigate starting from the new understanding
15:19:47 <fultonj> ykarel: do you mind helping fmount with ^ ?
15:19:56 <fultonj> not fetch dir but rekick?
15:20:08 <fmount> fultonj: fetch_dir is enough
15:20:47 <fmount> that's all from me; fultonj, seems we have a plan
15:20:51 <ykarel> amoralej, should we recheck and hold the node? or reproduce locally?
15:20:55 <ykarel> wdyt?
15:21:06 <amoralej> ykarel, i can't hold nodes
15:21:19 <ykarel> jpena can help i think
15:21:19 <amoralej> if we can reproduce it locally in some server in rdo-cloud it'd be great
15:21:23 <amoralej> otherwise yeah
15:21:26 <amoralej> let's ask jpena
15:21:52 <jpena> I can help if we need to hold nodes in Zuul
15:21:57 <amoralej> jpena, could you hold a node in zuul so we can use it to troubleshoot the issue in the ceph update?
15:21:58 <amoralej> ok
15:22:04 <ykarel> okk good
15:22:05 <amoralej> we'll let you know then
15:22:10 <amoralej> i'm rechecking
15:22:21 <ykarel> rebase
15:22:29 <rdogerrit> Alfredo Moralejo proposed rdoinfo master: DNM - only testing - bump Ansible to 2.7 for Stein  https://review.rdoproject.org/r/18721
15:22:44 <fultonj> thanks ^ please let fmount know the IP when you have it
15:23:14 <ykarel> fultonj, fmount should i try the same thing on the new environment that i did in the earlier environment?
15:23:20 <ykarel> to see if ceph runs there too
15:24:01 <fultonj> ykarel: do you mean when you couldn't reproduce yesterday?
15:24:08 <ykarel> fultonj, yes
15:24:23 <ykarel> ceph + openstack both worked there
15:24:31 <ykarel> with new containers and client + ovs
15:24:44 <fultonj> ovn?
15:25:02 <ykarel> yes
15:25:09 <amoralej> ykarel, what combination worked fine?
15:25:14 <jpena> which is the job name that should be held?
15:25:35 <amoralej> jpena, rdoinfo-tripleo-master-centos-7-scenario001-standalone
15:25:46 <amoralej> for 18721,26
15:25:51 <ykarel> amoralej, so on the environment where we reproduced the issue, i tried reusing the same env to reproduce it
15:26:05 <ykarel> by updating repos, removing containers, and deploying again
15:26:21 <ykarel> and then ceph started, and the openstack services were able to contact the new ceph
15:26:25 <amoralej> ok
15:26:30 <amoralej> but it was not a new clean env
15:26:34 <fultonj> fmount: would that help you? ^ ykarel's other env?
15:26:40 <fultonj> ykarel: i think it's an interesting datapoint. ykarel if you do please rm {{local_ceph_ansible_fetch_directory_backup}}
15:26:43 <fultonj> b4
15:26:53 <fultonj> or at least make sure it doesn't exist from a previous deployment
15:26:58 <fmount> fultonj: yes, it's interesting
15:27:00 <fultonj> https://review.openstack.org/#/c/618320/1/docker/services/ceph-ansible/ceph-base.yaml
15:27:08 <fultonj> a fresh deployment should create a new one
15:27:30 <jpena> ok, done. I'll keep it monitored
15:28:16 <ykarel> fultonj, okk will take care
15:28:31 <ykarel> fultonj, fmount i did http://paste.openstack.org/show/748485/ before running standalone with the new containers
15:28:35 <fmount> ykarel: for that reason I was looking for a clean environment
15:29:02 <ykarel> fmount, yes clean environment will help
15:29:16 <ykarel> but why it worked that way is weird
15:29:43 <fmount> ykarel: it just adds to my confusion
15:30:35 <fultonj> so fmount gets new envs and continues debugging. we pull in the ovn team
15:30:41 <fultonj> ^ that's the plan right?
15:31:04 <fmount> fultonj: yes, it's the best solution to solve this issue
15:31:23 <amoralej> ok, let's see if we can get it working
15:32:45 <fultonj> #action fmount gets new envs and continues debugging the ceph issue. we pull in the ovn team if necessary
15:32:52 <fultonj> i guess we move to next topic?
15:33:08 <jpena> yep
15:33:17 <fmount> fultonj: amoralej ykarel thanks for the effort, we can move to the next topic
15:33:33 <jpena> #topic  Decisions on ppc64le arch enablement for containers
15:33:57 <jpena> During the week we've been discussing the topic on the mailing list: https://lists.rdoproject.org/pipermail/dev/2019-March/009042.html
15:34:24 <jpena> So far, we've increased the disk space on the RDO Registry VM to be able to handle any additional demand, but we need to make decisions on the other topics
15:34:41 <mjturek> awesome
15:35:18 <jpena> so... for me, the first thing would be to have a clear picture of the workflow
15:35:43 <mjturek> jpena: workflow of the job itself?
15:37:26 <amoralej> mjturek, workflow for image builds (i guess that's easy, pure kolla), push, tag and consume
15:37:59 <amoralej> so, iiuc containers for ppc64le are created in a periodic job from a consistent tag
15:38:02 <jpena> mjturek: the overall workflow. I saw some discussions on the tag format, and I think that's pretty much ok. However, after finding out about the issues with manifest lists and the OpenShift Registry, I wonder if we're going to push the manifests only to dockerhub, or if we still need to do it on the rdo registry
15:38:38 <mjturek> that's fair, Vorrtex I think you only needed the manifest lists on Dockerhub - is that correct?
15:38:52 <amoralej> jpena, so far the approach is to use a common namespace for x86_64 and ppc64le and use different tags?
15:39:19 <jpena> amoralej: yes, that's what I understood
15:39:51 <Vorrtex> I don't know if I would say we *only* need manifests on dockerhub, I think it's right to say we *definitely* need manifest lists on dockerhub.  Honestly, the more identical the workflow and end result per registry the better... Any "one-off" design ideas can be problematic for future updates.
15:39:55 <amoralej> will that not break x86_64?
15:40:17 <mjturek> fair enough Vorrtex
15:40:45 <amoralej> wouldn't it be easier to use different namespaces to avoid conflicts?
15:41:20 <egonzalez> hi, when are the stable stein repos expected to be created/promoted?
15:41:34 <Vorrtex> amoralej I don't think so, because the workflow will be identical no matter what architecture if you can use manifest list images, and I don't know if you can when the images are in separate namespaces, because separate namespaces are separate repositories.
15:43:31 <Vorrtex> So the workflow desired is as follows: 1) build and upload an x86_64 image, including that architecture in the tag. 2) build and upload a ppc64le image, including that architecture in the tag. 3) create a manifest list image pointing to those two images, but mark the tag as the current default without an architecture tag. 4) As a user, "pull" the manifest list image, which will give you the correct image based on the user's architecture.
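(A minimal, hypothetical sketch of the workflow described above, assuming the experimental "docker manifest" CLI; the image name, tag and registry are illustrative placeholders, not the actual RDO or Docker Hub namespaces.)

    # Hypothetical sketch, not the actual RDO tooling: push per-arch images
    # with the architecture appended to the tag, then publish a manifest list
    # under the plain tag so a consumer transparently gets the right image.
    import subprocess

    def run(*cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    image = "docker.io/example/centos-binary-nova-api"  # placeholder name
    tag = "current-tripleo"                             # placeholder tag

    # steps 1) and 2): per-arch images (already built, e.g. by kolla) are
    # pushed with the architecture included in the tag
    for arch in ("x86_64", "ppc64le"):
        run("docker", "push", f"{image}:{tag}-{arch}")

    # step 3): create a manifest list under the plain tag pointing at both
    run("docker", "manifest", "create", f"{image}:{tag}",
        f"{image}:{tag}-x86_64", f"{image}:{tag}-ppc64le")
    run("docker", "manifest", "annotate", "--arch", "ppc64le",
        f"{image}:{tag}", f"{image}:{tag}-ppc64le")
    run("docker", "manifest", "push", f"{image}:{tag}")

    # step 4): a consumer just pulls image:tag and the client resolves it to
    # the entry that matches the local architecture
    run("docker", "pull", f"{image}:{tag}")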
15:44:15 <amoralej> Vorrtex, then, when pulling a container, manifests are automatically used?
15:44:22 <amoralej> to find out the image to pull?
15:44:41 <jpena> Vorrtex: if I understood it correctly, 1 and 2 can be done first on registry.rdo, then dockerhub when the tests are successful. And then, 3 and 4 can be done in dockerhub, where users will fetch the images
15:44:44 <jpena> is this correct?
15:45:36 <mjturek> jpena can you elaborate on "when the tests are successful"?
15:45:40 <Vorrtex> amoralej if you do a pull on a manifest list image, you are returned the image that's appropriate for the requester's architecture, unless forcibly requesting a specific image.
15:45:53 <amoralej> does podman/buildah support working with manifests?
15:46:12 <Vorrtex> podman definitely does... I don't know anything about buildah as of yet.
15:46:35 <jpena> mjturek: sure. Container images are uploaded to registry.rdo to be used during the promotion pipeline, to avoid spamming dockerhub. Once the pipeline is successful, the same images are pushed to dockerhub and tagged as "current-tripleo"
15:46:50 <jpena> someone correct me if I'm wrong with ^^
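(A rough sketch of the promotion step described above, assuming a plain retag-and-push; registry hosts, namespaces and the hash tag are illustrative placeholders.)

    # Hypothetical sketch: an image already validated in the RDO registry is
    # retagged and pushed to Docker Hub as "current-tripleo". Placeholders only.
    import subprocess

    def run(*cmd):
        subprocess.run(cmd, check=True)

    src = "registry.rdo.example/tripleomaster/centos-binary-nova-api:abc123-x86_64"
    dst = "docker.io/example/centos-binary-nova-api:current-tripleo-x86_64"

    run("docker", "pull", src)
    run("docker", "tag", src, dst)
    run("docker", "push", dst)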
15:47:07 <Vorrtex> jpena I believe that to be a possibility, but honestly as I mentioned above, the inconsistency between registries can be problematic when updating in the future.
15:47:33 <mjturek> jpena: ahhh okay,  so as long as the images all build successfully  they can go to dockerhub
15:47:55 <amoralej> from tripleo PoV, will it use the long tags (with -<arch>) or will it somehow use the manifests?
15:48:16 <amoralej> jpena, yes, that's correct
15:48:22 <jpena> here's where I'd love to have someone from the tripleo ci team
15:48:34 <baha> Vorrtex: From what I understand, the issue is that the RDO registry just does not support manifest lists currently?
15:48:58 <amoralej> not only that, who will create the manifest in dockerhub?
15:49:02 <Vorrtex> baha yeah, I know, I was just mentioning the "problem" with that workflow.  If the community agrees on it being the design, then no issues from me, just making sure it's noted.
15:49:09 <baha> Thus the need to use separate namespaces on the RDO registry and combine into a single namespace w/ manifest lists for dockerhub
15:49:17 <amoralej> with current workflow, it should be done in the promoter script probably
15:49:50 <Vorrtex> amoralej there can be a new job created to create the manifest list image.  It can likely be called or started or whatever from the existing build+upload job with another argument or something.
15:50:06 <amoralej> baha, but, are the manifests needed at all if we work with different namespaces?
15:50:29 <jpena> amoralej: the idea is to use a single namespace (at least in dockerhub)
15:50:31 <baha> amoralej: the issue is, on dockerhub, we don't want separate namespaces, from what I understand
15:50:37 <baha> jpena: +1
15:50:45 <amoralej> ok
15:51:09 <amoralej> about pulling containers, what i asked before
15:51:22 <amoralej> will tripleo need to somehow deal with manifests?
15:51:44 <amoralej> or is it transparent?
15:52:23 <Vorrtex> amoralej it's transparent.  The 'consumer' of the image has the exact same workflow that they have today if we use manifest list images.
15:53:21 <Vorrtex> The only "in question" parts here are how we tag and upload those images.  I advocate for not using different namespaces, because otherwise the difference between the RDO registry (if that's the right name for it) and dockerhub will be much greater.
15:55:03 <Vorrtex> To be more clear, if we simply append an architecture to the image tag, and account for that when consuming from the RDO registry, then anyone looking for those same images in dockerhub will likely find them, since they'll be named/tagged the same.  If we have the RDO registry using a separate namespace, then the "identical" image in dockerhub doesn't have that namespace, and doesn't have the same tag.  Does that make sense?
15:55:19 <jpena> yes, I see that
15:55:49 <jpena> So we're missing input from the tripleo ci team, because that'd require changes to the jobs to include the arch part in the tag
15:56:12 <Vorrtex> jpena we reached out to them yesterday, and are currently discussing any required changes in a thread with those guys.
15:56:25 <jpena> aha, that's what I didn't know :)
15:56:30 <mjturek> Vorrtex: should we maybe add amoralej and jpena?
15:56:41 <Vorrtex> mjturek will do.  Though, I don't know your emails.
15:56:54 <jpena> Vorrtex: just our nicks at redhat.com
15:56:55 <mjturek> jpena and amoralej can you pm Vorrtex with that?
15:57:00 <mjturek> that works too :)
15:57:33 <mjturek> amoralej and jpena: could you both join the tripleo meeting next week?
15:57:35 <jpena> thanks, I just want to be sure that we're reaching a consensus, and then that we can support it from the RDO infra side :)
15:57:35 <mjturek> we should continue discussions there as well
15:57:58 <jpena> I'd be happy to attend, just let me know the date/time
15:58:18 <amoralej> i'm ok to attend, but tbh i think the main stakeholder is tripleo-ci
15:58:37 <Vorrtex> jpena the OOO-CI meeting happens on bluejeans right after the OOO community meeting yesterday
15:58:42 <mjturek> amoralej they have a tripleo-ci  community sync after the meeting, but it's variable time as it happens right after the meeting
15:58:51 <amoralej> ok
15:58:59 <mjturek> so going to the tripleo meeting on Monday is our best bet to get us all there, let me grab the wiki
15:59:26 <baha> Tuesday!
15:59:27 <Vorrtex> Yeah, they had some concerns but overall weren't against the changes we proposed.  A majority of those concerns are being addressed in the thread I added you guys on.
15:59:33 <mjturek> #info TripleO meeting is 14:00 UTC Tuesday in #tripleo
15:59:40 <jpena> #action mjturek, Vorrtex and the tripleo ci team to agree on implementation, jpena/amoralej to attend the tripleo-ci community sync next week
16:00:53 <jpena> anything else? We're running out of time
16:01:08 <Vorrtex> I think we can sync back up next week after more discussion.
16:01:11 <Duck> quack
16:01:13 <mjturek> jpena: plenty :) but we'll resume later
16:01:29 <jpena> great, thanks for the discussion :)
16:01:33 <mjturek> sorry for the long topic, thanks for the time
16:01:34 <Duck> if there's an open floor
16:01:39 <jpena> #topic chair for the next meeting
16:01:45 <jpena> Duck: yes, after this one
16:01:49 <jpena> any volunteer?
16:02:12 <ykarel> i can take
16:02:16 <jpena> thanks ykarel
16:02:21 <jpena> #action ykarel to chair next meeting
16:02:24 <jpena> #topic open floor
16:02:28 <Duck> yeah!
16:02:40 <Duck> as for the mail outage
16:03:09 <Duck> Misc discovered the dhcp client was probably the service which triggered a network restart
16:03:22 <Duck> and then it was bad timing for postfix to come back
16:03:40 <Duck> so normally this never happens, as needrestart does not restart network-related stuff
16:04:03 <Duck> but in this VM there is no networkmanager so it was not handled right
16:04:24 <Duck> I filed a bug upstream and I hope to have this fixed or backported soon
16:05:03 <Duck> anyway I've been using needrestart for years now and it's been a while since it caused any problem
16:05:20 <Duck> just to let you know
16:05:48 <jpena> it's not a huge deal. I hope we can improve that with the additional monitoring
16:06:00 <Duck> yep, it was missing indeed
16:06:20 <Duck> it's always possible to add our own blacklist for any problematic service too
16:06:40 <jpena> it's only happened once, I don't think it's worth it for now
16:07:01 <Duck> so this does not affect my proposal to have it on all machines :-)
16:07:39 <jpena> not for me, although it highlights that we need to be aware of things like this and maybe monitor some more services
16:07:44 <Duck> yes, from the start (at least when I came to RH) needrestart was installed, when it was managed by OSAS only
16:08:09 <Duck> I would say ALL services
16:08:45 <Duck> that's all for me :-)
16:08:53 <jpena> thanks!
16:08:57 <jpena> anything else before we close?
16:09:05 <jpena> 3,2,1...
16:09:17 <jpena> #endmeeting