15:00:24 #startmeeting RDO meeting - 2019-03-27
15:00:24 Meeting started Wed Mar 27 15:00:24 2019 UTC.
15:00:24 This meeting is logged and archived in a public location.
15:00:24 The chair is jpena. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:24 Useful Commands: #action #agreed #halp #info #idea #link #topic.
15:00:24 The meeting name has been set to 'rdo_meeting_-_2019-03-27'
15:00:31 #topic roll call
15:00:44 o/
15:00:48 o/
15:00:50 #chair fultonj
15:00:50 Current chairs: fultonj jpena
15:00:57 #chair fmount
15:00:57 Current chairs: fmount fultonj jpena
15:01:06 Merged openstack/placement-distgit stein-rdo: openstack-placement-1.0.0-0.2.0rc2 https://review.rdoproject.org/r/19781
15:01:08 remember you can add last-minute topics to the agenda at https://etherpad.openstack.org/p/RDO-Meeting
15:01:30 o/
15:01:38 #chair Vorrtex
15:01:38 Current chairs: Vorrtex fmount fultonj jpena
15:01:39 o/
15:01:43 finally somewhat on time! ٩( ᐛ )و
15:01:47 #chair amoralej PagliaccisCloud
15:01:47 Current chairs: PagliaccisCloud Vorrtex amoralej fmount fultonj jpena
15:01:49 o/
15:01:56 #chair
15:01:56 Current chairs: PagliaccisCloud Vorrtex amoralej fmount fultonj jpena
15:02:15 #chair ykarel
15:02:15 Current chairs: PagliaccisCloud Vorrtex amoralej fmount fultonj jpena ykarel
15:02:31 o/
15:02:39 o/
15:03:08 #chair mjturek baha
15:03:08 Current chairs: PagliaccisCloud Vorrtex amoralej baha fmount fultonj jpena mjturek ykarel
15:04:05 let's start with the topics
15:04:11 #topic ppc64le containers build update
15:04:21 mjturek: we can merge your topic and mine if you're ok
15:04:35 yeah sure
15:04:41 though baha is afk for a minute
15:04:51 could we maybe do the next topic first?
15:05:02 ok
15:05:07 thank you!
15:05:10 #topic Ceph Nautilus update
15:05:43 we're working to have Nautilus+Ansible2.7 working
15:05:46 as per https://review.rdoproject.org/r/#/c/18721/
15:06:19 fultonj: we still have some networking issues that are not strictly related to the container we're using
15:06:22 dsavineau ^
15:06:42 tag: master-5d15bed-nautilus-centos-7-x86_64
15:06:48 do you have what you need to figure out what the network issues are?
15:06:58 so, what's the problem now? sorry, i was two days on pto
15:07:11 network related?
15:07:28 do you think the reproducer env provided by ykarel is sufficient for you to figure out the root cause of the network issue?
15:07:29 fultonj: I'm reproducing the CI on my own env to figure out how to fix this
15:07:54 fultonj: nope, maybe we need a fresh recheck
15:08:26 because we tweaked a lot with that env, for this reason I'm reproducing the issue locally
15:08:34 (using CI conf)
15:08:38 fmount or dsavineau can you explain the network issue to amoralej ?
15:08:53 fultonj: sure
15:09:13 amoralej: the question is that the mons don't bootstrap the cluster
15:09:37 amoralej: and keep sending probes to an empty list
15:09:55 amoralej: then we restarted the mon container on the eth0 dev ip address
15:10:35 amoralej: and everything starts working properly; in addition, we found that with this trick v1 on 6789 starts correctly
15:10:54 fmount: and that restart happens after OVN is configured?
15:11:04 what does "we restarted the mon container on the eth0 dev ip address" mean?
15:11:17 is it possible the steps in tripleo need adjusting to ensure networks are up before?
15:11:41 * Duck o/
15:11:56 amoralej: it means that we stopped the mon container (the systemd unit) and ran it manually on the eth0 ip addr instead of the ovs bridge
15:12:12 ok
15:12:18 got it now
15:12:40 amoralej: if you restart it on the ovs bridge it breaks?
15:12:53 fultonj: good question, it could be an idea to retrigger the job
15:13:09 i'm not sure
15:13:12 but
15:13:12 on the same box if you flip it back it breaks
15:13:18 (in theory)
15:13:24 i mean IF, on the same box, you flip it back and it breaks
15:13:27 fultonj, you're saying networks are not up till step 2? those bridges are up before, in the NetworkDeployment step
15:13:29 then diff the network settings
15:13:33 i'd say with a previous container from 4.0, it worked with ovn
15:13:33 to figure out why
15:13:39 at least it passed this point
15:13:42 amoralej: true
15:13:57 it used ceph 14.0 but now we're trying to use 14.2
15:14:17 and 14.2 changed something related to this networking stuff?
15:14:19 14.0 wasn't running the v1 protocol on 6789, which cinder was using
15:14:28 14.0 was running 3300
15:14:31 v2 protocol
15:14:33 we want both
15:14:44 v1:6789 and v2:3300
15:14:55 14.2 should run both
15:15:01 what i don't understand well is how the fact of enabling the 6789 port affects networking
15:15:10 but i don't know about ceph, so.. :)
15:15:18 fmount: but i think you said you found v1 and v2 not running on 14.2 anyway?
15:15:22 fultonj: amoralej I also found that when starting the mon container manually, 6789 is up
15:15:33 if you can build a reproducer i can involve the ovn team if needed
15:16:01 fmount: is it fair to say we already have a reproducer in the env ykarel provided?
15:16:18 yes, with that trick I've also seen 6789, and yatin was able to create a volume with cinder
15:16:26 fultonj: yes
15:16:28 or do you want to rekick it? and then invite the ovn team to show them how you can change the network and then it's "fixed"?
15:16:34 fmount, and we can reproduce the issue?
15:16:43 i mean getting it to fail again?
15:16:47 fmount, i was able to create a cinder volume without 6789
15:17:18 amoralej: imho we need to rekick it and start looking with a clean env
15:17:33 ack, that was my understanding
15:17:36 fmount: sounds good to me
15:17:44 ykarel: oh sorry yes
15:18:27 fmount: so what should the plan be?
15:18:28 fultonj: I propose a clean env because now things are getting more confusing
15:19:30 if the same env is going to be used to overcloud delete and overcloud deploy, then rm -rf the fetch dir in between
15:19:33 fultonj: we can rekick the job, reproduce the env and investigate starting from the new understandings
15:19:47 ykarel: do you mind helping fmount with ^ ?
15:19:56 not the fetch dir but a rekick?
15:20:08 fultonj: fetch_dir is enough
15:20:47 that's all from me, fultonj, seems we have a plan
15:20:51 amoralej, should we recheck and hold the node? or reproduce locally?
15:20:55 wdyt?
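For reference, a minimal sketch of the kind of check fmount describes above: confirming which messenger ports the mon is actually listening on and what addresses the monmap advertises. The systemd unit and container names are assumptions (typical ceph-ansible naming on this kind of deployment), not the exact commands used during the debugging:

    # Is the containerized mon unit running? (unit name assumed)
    sudo systemctl status ceph-mon@$(hostname -s)
    # Which messenger ports are listening? v2 is 3300, v1 is 6789.
    sudo ss -tlnp | grep -E ':(3300|6789)'
    # What does the monmap advertise? A Nautilus mon should show an addrvec
    # like [v2:<ip>:3300/0,v1:<ip>:6789/0] for each mon. Container name assumed.
    sudo docker exec ceph-mon-$(hostname -s) ceph mon dump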
15:21:06 ykarel, i can't hold nodes
15:21:19 jpena can help i think
15:21:19 if we can reproduce it locally on some server in rdo-cloud it'd be great
15:21:23 otherwise yeah
15:21:26 let's ask jpena
15:21:52 I can help if we need to hold nodes in Zuul
15:21:57 jpena, could you hold a node in zuul to use it to troubleshoot the issue in the ceph update?
15:21:58 ok
15:22:04 okk good
15:22:05 we'll let you know then
15:22:10 i'm rechecking
15:22:21 rebase
15:22:29 Alfredo Moralejo proposed rdoinfo master: DNM - only testing - bump Ansible to 2.7 for Stein https://review.rdoproject.org/r/18721
15:22:44 thanks ^ please let fmount know the IP when you have it
15:23:14 fultonj, fmount should i try the same on the new environment that i did in the earlier environment?
15:23:20 to see if ceph runs there too
15:24:01 ykarel: do you mean when you couldn't reproduce yesterday?
15:24:08 fultonj, yes
15:24:23 ceph + openstack both worked there
15:24:31 with new containers and client + ovs
15:24:44 ovn?
15:25:02 yes
15:25:09 ykarel, what combination worked fine?
15:25:14 which is the job name that should be held?
15:25:35 jpena, rdoinfo-tripleo-master-centos-7-scenario001-standalone
15:25:46 for 18721,26
15:25:51 amoralej, so on the environment where we reproduced the issue, i tried reusing the same env to reproduce
15:26:05 by updating repos, removing containers, and deploying again
15:26:21 and then ceph started, and openstack services were able to contact the new ceph
15:26:25 ok
15:26:30 but it was not a new clean env
15:26:34 fmount: would that help you? ^ ykarel's other env?
15:26:40 ykarel: i think it's an interesting datapoint. ykarel if you do, please rm {{local_ceph_ansible_fetch_directory_backup}}
15:26:43 beforehand
15:26:53 or at least make sure it doesn't exist from a previous deployment
15:26:58 fultonj: yes, it's interesting
15:27:00 https://review.openstack.org/#/c/618320/1/docker/services/ceph-ansible/ceph-base.yaml
15:27:08 a fresh deployment should create a new one
15:27:30 ok, done. I'll keep it monitored
15:28:16 fultonj, okk will take care
15:28:31 fultonj, fmount i did http://paste.openstack.org/show/748485/ before running standalone with the new containers
15:28:35 ykarel: for that reason I was looking for a clean environment
15:29:02 fmount, yes a clean environment will help
15:29:16 but why it worked that way is weird
15:29:43 ykarel: it just adds to my confusion
15:30:35 so fmount gets new envs and continues debugging. we pull in the ovn team
15:30:41 ^ that's the plan right?
15:31:04 fultonj: yes, it's the best solution to solve this issue
15:31:23 ok, let's see if we can get it working
15:32:45 #action fmount gets new envs and continues debugging the ceph issue. we pull in the ovn team if necessary
15:32:52 i guess we move to the next topic?
15:33:08 yep
15:33:17 fultonj: amoralej ykarel thanks for the effort, we can move to the next topic
15:33:33 #topic Decisions on ppc64le arch enablement for containers
15:33:57 During the week we've been discussing the topic on the mailing list: https://lists.rdoproject.org/pipermail/dev/2019-March/009042.html
15:34:24 So far, we've increased the disk in the RDO Registry VM to be able to host any additional demand, but we need to make decisions on the other topics
15:34:41 awesome
15:35:18 so... for me, the first thing would be to have a clear picture of the workflow
15:35:43 jpena: workflow of the job itself?
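A minimal sketch of the cleanup fultonj asked ykarel for in the Ceph topic above, before reusing an environment for a fresh deployment: remove the old ceph containers and the ceph-ansible fetch directory backup so stale keys and monmaps cannot leak into the new run. The container filter and the FETCH_DIR_BACKUP placeholder are assumptions for illustration; the real location is whatever local_ceph_ansible_fetch_directory_backup resolves to (see the ceph-base.yaml review linked above), and the commands ykarel actually ran are in the paste, not here:

    # Assumed cleanup on a reused node; names and paths are illustrative only.
    sudo docker ps -a --format '{{.Names}}' | grep '^ceph-' | xargs -r sudo docker rm -f
    # Remove the fetch directory backup so keys/monmaps are regenerated;
    # FETCH_DIR_BACKUP stands for the deployment's
    # local_ceph_ansible_fetch_directory_backup value.
    sudo rm -rf "${FETCH_DIR_BACKUP}"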
15:37:26 mjturek, workflow for image builds (i guess that's easy, pure kolla thing), push, tag and consume
15:37:59 so, iiuc containers for ppc64le are created in a periodic job from a consistent tag
15:38:02 mjturek: the overall workflow. I saw some discussions on the tag format, and I think that's pretty much ok. However, after finding out about the issues with manifest lists and the OpenShift Registry, I wonder if we're going to push the manifests only to dockerhub, or if we still need to do it on the rdo registry
15:38:38 that's fair, Vorrtex I think you only needed the manifest lists on Dockerhub - is that correct?
15:38:52 jpena, so far the approach is to use a common namespace for x86_64 and ppc64le and use different tags?
15:39:19 amoralej: yes, that's what I understood
15:39:51 I don't know if I would say we *only* need manifests on dockerhub, I think it's right to say we *definitely* need manifest lists on dockerhub. Honestly, the more identical the workflow and end result per registry the better... Any "one-off" design ideas can be problematic in future updating.
15:39:55 will that not break x86_64?
15:40:17 fair enough Vorrtex
15:40:45 wouldn't it be easier to avoid conflicts by using different namespaces?
15:41:20 hi, when are the stable stein repos expected to be created/promoted?
15:41:34 amoralej I don't think so, because the workflow will be identical no matter what architecture if you can use manifest list images, and I don't know if you can when the images are in separate namespaces, because separate namespaces are separate repositories.
15:43:31 So the desired workflow is as follows: 1) build and upload an x86_64 image, including that architecture in the tag. 2) build and upload a ppc64le image, including that architecture in the tag. 3) create a manifest list image pointing to those two images, but mark the tag as the current default without an architecture tag. 4) As a user, "pull" the manifest list image, which will give you the correct image based on the user's architecture.
15:44:15 Vorrtex, then, when pulling a container, manifests are automatically used?
15:44:22 to find out the image to pull?
15:44:41 Vorrtex: if I understood it correctly, 1 and 2 can be done first on registry.rdo, then dockerhub when the tests are successful. And then, 3 and 4 can be done in dockerhub, where users will fetch the images
15:44:44 is this correct?
15:45:36 jpena can you elaborate on "when the tests are successful"?
15:45:40 amoralej if you do a pull on a manifest list image, you are returned the image that's appropriate for the requester's architecture, unless forcibly requesting a specific image.
15:45:53 does podman/buildah support working with manifests?
15:46:12 podman definitely does... I don't know anything about buildah as of yet.
15:46:35 mjturek: sure. Container images are uploaded to registry.rdo to be used during the promotion pipeline, to avoid spamming dockerhub. Once the pipeline is successful, the same images are pushed to dockerhub and tagged as "current-tripleo"
15:46:50 someone correct me if I'm wrong with ^^
15:47:07 jpena I believe that to be a possibility, but honestly as I mentioned above, the inconsistency between registries can be problematic when updating in the future.
15:47:33 jpena: ahhh okay, so as long as the images all build successfully they can go to dockerhub
15:47:55 from the tripleo PoV, will it use the long tags (using -) or somehow use the manifests?
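As an illustration of the 4-step workflow Vorrtex describes above, here is a sketch using Docker's (experimental) manifest CLI. The repository name and tag scheme are made up for the example; they are not the real promotion tags being discussed:

    # 1) and 2): push the per-architecture images, arch encoded in the tag (scheme assumed)
    docker push example/nova-api:current-tripleo_x86_64
    docker push example/nova-api:current-tripleo_ppc64le
    # 3): create and push a manifest list under the arch-less tag
    # (requires the experimental docker CLI, e.g. "experimental": "enabled" in ~/.docker/config.json)
    docker manifest create example/nova-api:current-tripleo \
        example/nova-api:current-tripleo_x86_64 \
        example/nova-api:current-tripleo_ppc64le
    docker manifest push example/nova-api:current-tripleo
    # 4): a consumer on either architecture just pulls the arch-less tag and the
    # registry resolves it to the matching per-arch image
    docker pull example/nova-api:current-tripleo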
15:48:16 jpena, yes, that's correct
15:48:22 here's where I'd love to have someone from the tripleo ci team
15:48:34 Vorrtex: From what I understand, the issue is that the RDO registry just does not support manifest lists currently?
15:48:58 not only that, who will create the manifest in dockerhub?
15:49:02 baha yeah, I know, I was just mentioning the "problem" with that workflow. If the community agrees on it being the design, then no issues from me, just making sure it's noted.
15:49:09 Thus the need to use separate namespaces on the RDO registry and combine to a single namespace w/ a manifest list for dockerhub
15:49:17 with the current workflow, it should probably be done in the promoter script
15:49:50 amoralej there can be a new job created to create the manifest list image. It can likely be called or started or whatever from the existing build+upload job with another argument or something.
15:50:06 baha, but are the manifests needed at all if we work with different namespaces?
15:50:29 amoralej: the idea is to use a single namespace (at least in dockerhub)
15:50:31 amoralej: the issue is, on dockerhub, we don't want separate namespaces, from what I understand
15:50:37 jpena: +1
15:50:45 ok
15:51:09 about container pulling, what i asked before
15:51:22 will tripleo need to somehow deal with manifests?
15:51:44 or is it transparent?
15:52:23 amoralej it's transparent. The 'consumer' of the image has the exact same workflow that they have today if we use manifest list images.
15:53:21 The only "in question" parts here are how we tag and upload those images. I advocate for no different namespaces, because the difference between the RDO registry (if that's the right name for it) and dockerhub would be much greater.
15:55:03 To be more clear, if we simply append an architecture to the image tag, and account for that when consuming from the RDO registry, then anyone looking for those same images in dockerhub will likely find them, since they'll be named/tagged the same. If we have the RDO registry using a separate namespace, then the "identical" image in dockerhub doesn't have that namespace, and doesn't have the same tag. Does that make sense?
15:55:19 yes, I see that
15:55:49 So we're missing input from the tripleo ci team, because that'd require changes to the jobs to include the arch part in the tag
15:56:12 jpena we reached out to them yesterday, and are currently discussing any required changes in a thread with those guys.
15:56:25 aha, that's what I didn't know :)
15:56:30 Vorrtex: should we maybe add amoralej and jpena?
15:56:41 mjturek will do. Though, I don't know your emails.
15:56:54 Vorrtex: just our nicks at redhat.com
15:56:55 jpena and amoralej can you pm Vorrtex with that?
15:57:00 that works too :)
15:57:33 amoralej and jpena: could you both join the tripleo meeting next week?
15:57:35 thanks, I just want to be sure that we're reaching a consensus, and then that we can support it from the RDO infra side :)
15:57:35 we should continue discussions there as well
15:57:58 I'd be happy to attend, just let me know the date/time
15:58:18 i'm ok to attend, but tbh i think the main stakeholder is tripleo-ci
15:58:37 jpena the OOO-CI meeting happens on bluejeans right after the OOO community meeting yesterday
15:58:42 amoralej they have a tripleo-ci community sync after the meeting, but it's at a variable time as it happens right after the meeting
15:58:51 ok
15:58:59 so going to the tripleo meeting on Monday is our best bet to get us all there, let me grab the wiki
15:59:26 Tuesday!
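On the point above that manifest lists are transparent to the consumer: inspecting one shows the per-architecture entries the registry uses to resolve a plain pull. A small hedged sketch, with the image name and tag scheme again purely illustrative:

    # A manifest list contains one entry per architecture; a plain "docker pull"
    # of this tag resolves to whichever entry matches the client's platform.
    docker manifest inspect example/nova-api:current-tripleo
    # Forcing a specific architecture remains possible by pulling the per-arch
    # tag (or the digest listed in the manifest) directly.
    docker pull example/nova-api:current-tripleo_ppc64le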
15:59:27 Yeah, they had some concerns but overall weren't against the changes we proposed. A majority of those concerns are being addressed in the thread I added you guys on.
15:59:33 #info TripleO meeting is 14:00 UTC Tuesday in #tripleo
15:59:40 #action mjturek, Vorrtex and the tripleo ci team to agree on implementation, jpena/amoralej to attend the tripleo-ci community sync next week
16:00:53 anything else? We're running out of time
16:01:08 I think we can sync back up next week after more discussion.
16:01:11 quack
16:01:13 jpena: plenty :) but we'll resume later
16:01:29 great, thanks for the discussion :)
16:01:33 sorry for the long topic, thanks for the time
16:01:34 if there's an open floor
16:01:39 #topic chair for the next meeting
16:01:45 Duck: yes, after this one
16:01:49 any volunteer?
16:02:12 i can take it
16:02:16 thanks ykarel
16:02:21 #action ykarel to chair next meeting
16:02:24 #topic open floor
16:02:28 yeah!
16:02:40 as for the mail outage
16:03:09 Misc discovered the dhcp client was probably the service which triggered a network restart
16:03:22 and then bad timing for postfix to come back
16:03:40 so normally this never happens, as needrestart does not restart network-related stuff
16:04:03 but on this VM there is no NetworkManager, so it was not handled right
16:04:24 I reported the issue upstream and I hope to have it fixed or backported soon
16:05:03 anyway, I've been using needrestart for years now and it's been a while since any problem
16:05:20 just to let you know
16:05:48 it's not a huge deal. I hope we can improve that with the additional monitoring
16:06:00 yep, it was missing indeed
16:06:20 it's always possible to add our own blacklist for any problematic service too
16:06:40 it's only happened once, I don't think it's worth it for now
16:07:01 so this does not affect my proposal to have it on all machines :-)
16:07:39 not for me, although it highlights that we need to be aware of things like this and maybe monitor some more services
16:07:44 yes, from the start (at least when I came to RH) needrestart was installed when it was managed by OSAS only
16:08:09 I would say ALL services
16:08:45 that's all for me :-)
16:08:53 thanks!
16:08:57 anything else before we close?
16:09:05 3,2,1...
16:09:17 #endmeeting
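On the suggestion above of adding a blacklist for problematic services: needrestart is configured through Perl snippets under /etc/needrestart/conf.d/, and, to the best of my recollection (worth verifying against the needrestart documentation), the override_rc hash controls whether a matching service may be restarted. A hedged sketch, with the file name and regexes chosen only for illustration:

    # Assumed config drop-in; the $nrconf{override_rc} key and syntax are from
    # memory and should be checked against the needrestart docs before use.
    sudo tee /etc/needrestart/conf.d/50-local-blacklist.conf <<'EOF'
    # never let needrestart bounce network-critical services on this VM
    $nrconf{override_rc}{qr(^network)} = 0;
    $nrconf{override_rc}{qr(^postfix)} = 0;
    EOF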