18:00:12 #startmeeting Infrastructure (2017-04-27)
18:00:12 Meeting started Thu Apr 27 18:00:12 2017 UTC. The chair is smooge. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:00:12 Useful Commands: #action #agreed #halp #info #idea #link #topic.
18:00:12 The meeting name has been set to 'infrastructure_(2017-04-27)'
18:00:12 #meetingname infrastructure
18:00:13 The meeting name has been set to 'infrastructure'
18:00:13 #topic aloha
18:00:13 #chair smooge relrod nirik abadger1999 dgilmore threebean pingou puiterwijk pbrobinson
18:00:13 Current chairs: abadger1999 dgilmore nirik pbrobinson pingou puiterwijk relrod smooge threebean
18:00:16 hello all
18:00:20 hi
18:00:38 hi
18:00:40 morning everyone.
18:01:55 #topic New folks introductions
18:02:06 Hello, are there any new people this week?
18:02:16 * cverna is around
18:03:34 gnu people? ;)
18:04:02 ok, looks like I should have sent that email out earlier :)
18:04:15 #topic announcements and information
18:04:16 #info beta freeze will start 2017-05-17
18:04:16 #info infra hackfest in RDU 2017-05-08 to 2017-05-12 - everyone
18:04:16 #info Fedora Infrastructure weathered a large outage last Friday. Good job everyone
18:04:16 #info mass update/reboot cycle next week (2017-05-02/03) - everyone
18:04:17 #info bodhi 2.6.0 released. A few issues, so look for 2.6.1 soon - bowlofeggs
18:04:18 #info Moved production resultsdb database to a separate machine to help with performance issues - tflink
18:04:32 Looks like this week was mostly recovery and next week will be mostly reboots
18:04:42 The week after that will be mostly meetings
18:04:51 Finally we will have a freeze
18:05:38 Any other items to put in the old stuff-done list?
18:05:48 If not, it will be time to hand it over to nirik
18:05:57 #topic Apprentice work day scheduling/topic - kevin
18:06:15 yeah, so I was thinking we should schedule an apprentice workday again...
18:06:20 and come up with a topic
18:06:32 The last one we did was docs... we could do that again, or something else.
18:06:53 Possibly we could look at all our apps and triage issues?
18:07:08 or clean up ansible playbooks from ansible-lint output?
18:07:14 or ...your idea here.
18:07:33 some ansible sounds nice
18:07:39 good to know... maybe that moving stage IP easyfix?
18:07:51 Any python-related things that might qualify in the easyfix realm?
18:08:05 As for time, perhaps the week of the 22nd? we would be in freeze then...
18:08:11 capitanocrunch: good thought, yeah...
18:08:23 Skeer: probably tons on various apps.
18:08:43 I can open a thread on it on the list to get more folks' input...
18:08:53 If someone has time to laser-focus a few I'd be willing to dive in
18:09:23 +1 for thread on the list
18:09:28 .hello bowlofeggs
18:09:29 +1
18:09:29 bowlofeggs: bowlofeggs 'Randy Barlow'
18:09:32 more like bowlofbrokebodhi
18:09:33 we could also pick some poor neglected app and try and fix it up... askbot leaps to mind. ;)
18:09:50 or packages could use love.
18:10:22 how does the week of the 22nd sound for everyone? too soon? ok? bad time for some other reason?
18:10:41 +1 for 22nd
18:10:46 +1
18:10:46 Skeer: bodhi has some easyfix bugs and it's python
18:10:55 22nd sounds good
18:11:10 bowlofeggs: I'll head that way and ping you if I get lost ;)
18:11:17 askbot, good old askbot
18:11:38 perhaps actually the 24th... (wed)
18:11:38 22nd sounds good
18:11:47 or the 24th
18:11:54 that way we are over the mondays...
18:11:56 +1 for 22nd
18:12:29 or actually, we last time did things over a week, didn't we?
18:12:37 not just a day?
will have to look
18:13:12 anyhow, that's all on this. I will post to the list and we can figure out a topic and such details. :)
18:13:28 sounds good
18:14:39 #topic BDR (bidirectional Data Replication) in postgres - kevin
18:14:45 back to you Kevin
18:14:58 so I wanted to talk about this a bit... but not sure we have all our apps folks around...
18:15:13 we could just wait and talk about it at the hackfest I suppose.
18:15:25 basically:
18:16:01 I want 2 things we don't currently have: replication (in case of disaster) and high availability (in cases where we reboot servers to apply updates, etc)
18:16:17 there are a number of ways to do this in the postgres world.
18:16:41 There's pgpool, which is a proxy... your apps talk to it, and it talks to postgres servers on the backend.
18:17:09 The problem with that is that it doesn't understand the full set of sql and can get confused. Also, it's another single point of failure.
18:17:27 There's postgres's native master/replica stuff.
18:17:55 But it requires you to do various things when you promote and when you demote/re-add an old spare
18:18:24 BDR does limit you somewhat on what you can do... but it makes things super easy otherwise.
18:18:35 You can reboot any of the nodes and they resync when they come back up
18:18:50 You don't have to do anything weird or arcane for that.
18:19:02 But I do understand the sql limits are annoying on the app side.
18:19:13 So, that's it in a nutshell. :)
18:19:25 sql limits are god's way of saying slow down
18:19:41 Well, I think the sql limits are reasonable, honestly. One is "have a primary key in all tables", which with most things already happens by default
18:19:44 I'll try and make sure we have everyone ok with it before I deploy things.
18:19:51 (for BDR)
18:20:14 it would help a lot if sqlalchemy could do the right thing on updates.
18:20:21 The other is no things like "CREATE TABLE ... AS SELECT ...", which you can just split into two statements
18:20:47 Well, that's an alembic thing. And the alembic "bug" is easy to fix, and only needs to be done once per application
18:20:55 both of which I think go with "slow down"
18:20:57 (but yes, we should get that added upstream)
18:21:13 smooge: I tend to disagree. They're both things that don't happen very often
18:21:51 puiterwijk, I think we are violently agreeing. If an app HAS to do those things for some reason, it needs to slow down
18:22:05 note also that they are trying to get it all merged into postgres, and it should hopefully be in there in 10.
18:22:13 Ah, right
18:22:22 door bell brb
18:22:29 fwiw, I'm not sure it isn't too early for BDR, and... if you just need "backup", you probably won't profit from BDR
18:23:04 as I hopefully said, we don't need just backup
18:23:18 we want HA without a bunch of manual steps
18:23:35 and we have been running a bunch of things in stg with it for a while. ;)
18:23:40 ok, that's still not BDR, BDR == master <-> master
18:24:24 yep.
18:24:48 it is master/master, but we aren't pointing traffic to both masters.
18:25:15 nirik: which we totally could do, and maybe should :)
18:25:44 we could... except then we would need to make sure our apps could reconnect cleanly...
18:26:09 if we have masterA and masterB and spread load, and reboot masterA, everything connected to it would need to reconnect to masterB
18:26:27 and most of our apps seem poor at reconnects
18:27:14 anyhow, from the sysadmin side this is a big win, IMHO, but I am happy to discuss further with anyone with reservations. ;)
18:27:17 that's all I had on this.
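The two SQL limits puiterwijk describes are straightforward to work around in application code. Below is a minimal sketch of the second one, splitting a CREATE TABLE ... AS SELECT into an explicit CREATE TABLE plus a separate INSERT ... SELECT. In practice this sort of change would usually live in an alembic migration; the connection string, table, and column names here are made up for illustration.

    from sqlalchemy import create_engine, text

    # Hypothetical connection string and table names, purely for illustration.
    engine = create_engine("postgresql://user:secret@db01.example.org/sampledb")

    with engine.begin() as conn:
        # Step 1: create the table explicitly, giving it a primary key
        # (BDR wants one on every table).
        conn.execute(text(
            "CREATE TABLE update_summary ("
            " id integer PRIMARY KEY,"
            " title text NOT NULL)"
        ))
        # Step 2: populate it with a separate INSERT ... SELECT instead of
        # a single CREATE TABLE ... AS SELECT.
        conn.execute(text(
            "INSERT INTO update_summary (id, title)"
            " SELECT id, title FROM updates"
        ))

The resulting table holds the same data; the only change is that the schema creation and the data copy are two statements, and the new table carries the primary key BDR expects.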
18:28:52 smooge: ... back to you?
18:28:53 ok thanks
18:29:16 On the I-95 freeway we have a major backup
18:29:18 #topic Apprentice Open office hours
18:29:27 Hello Apprentices
18:29:58 Hello
18:30:05 hi
18:30:36 any open issues needing help with?
18:31:02 .hello nb
18:31:04 nb: nb 'Nick Bebout'
18:31:23 Oh, I did some digging on an interesting old ticket... without really solving it.
18:31:38 if someone wants to continue digging on it, it might be a nice bit of fun.
18:31:41 what was the ticket
18:31:43 * nirik looks
18:31:54 I need some pointers on the jenkins cleanup ticket
18:32:06 https://pagure.io/fedora-infrastructure/issue/4211
18:32:18 https://pagure.io/fedora-infrastructure/issue/6003
18:32:40 Skeer: for that, it just needs a normal logrotate config file... which we could add in ansible, but it should really be added to the package
18:33:44 For the cleanup?
18:33:54 Skeer, I would write the logrotate config file
18:34:08 oh, sorry, thinking of the wrong ticket here.
18:34:24 lol
18:34:24 Then put in a patch for our ansible for the time being.
18:35:00 https://pagure.io/fedora-infrastructure/issue/6010
18:35:43 Skeer, I would say if they have not contacted you within a reasonable time they can be killed
18:36:04 They can get back when they show they are actually interested
18:36:20 but I have a sinus headache
18:36:27 * nirik nods. agree
18:36:30 some background about https://pagure.io/fedora-infrastructure/issue/5989
18:36:40 it looks fine to me
18:37:47 smooge: Gotcha.. ok, well I'll give it until tomorrow. Then update the ticket.
18:38:30 Skeer, cool
18:38:47 bt0, what do you need on that ticket?
18:38:54 bt0: ah yeah, I can add more background there...
18:39:02 * nirik sees that it's kinda vague.
18:39:06 bowlofeggs: You've likely guessed, but I'm totally lost on how to proceed on ticket https://pagure.io/fedora-infrastructure/issue/5932
18:39:56 nirik, yeah please
18:42:00 Skeer, for that one I would look at the two template files and see what really is different between them
18:42:32 if possible I would syntax it as {% if env == 'stg' ... or whatever the correct text is
18:42:41 I'm hung up on not knowing the correct formatting to google for help.
18:42:55 Like what language are those in?
18:42:58 jinja
18:43:19 I believe
18:43:23 That's what I thought.. I found almost nothing IIRC
18:43:39 "jinja2 ansible" is where I usually start my looking
18:43:51 then I go through the existing templates and pull out the logic I need :)
18:44:11 Gotcha.. I need to look for existing, working templates
18:44:29 Jinja2 by chance?
18:44:55 "friendly templating language for Python"
18:46:15 Skeer: the haproxy one might be good to look at as an example
18:46:23 I'm dragging the meeting out.. I'll go spelunking some more and see what I can find on that.
18:46:26 Thanks nirik
18:46:37 nirik: noted :)
18:46:38 ok thanks guys
18:46:42 #topic Open Floor
18:46:50 https://infrastructure.fedoraproject.org/cgit/ansible.git/tree/roles/haproxy/templates/haproxy.cfg
18:46:54 sorry, but i missed the discussion about postgres BDR, could we still come back to this topic for a few minutes? or should i just post to the mailing list?
18:47:00 bt0: added a comment, hope it made sense. ;)
18:47:29 mizdebsk: feel free to add or post to the list or whatever. I'm sure there's going to be more discussion. ;)
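On the staging/production template question from the apprentice hour above: one low-risk way to get comfortable with the {% if %} syntax is to render a tiny Jinja2 snippet from Python before editing the real Ansible templates. The variable name env, the value 'staging', and the hostnames below are assumptions for illustration only, so check what the playbooks actually define.

    from jinja2 import Template

    # "env", "staging", and the hostnames are assumed values; adjust to
    # whatever the playbooks really pass in.
    snippet = Template(
        "{% if env == 'staging' %}"
        "db_server = db-stg.example.org"
        "{% else %}"
        "db_server = db01.example.org"
        "{% endif %}"
    )

    print(snippet.render(env="staging"))     # -> db_server = db-stg.example.org
    print(snippet.render(env="production"))  # -> db_server = db01.example.org

Once the conditional renders as expected, the same {% if %}/{% else %}/{% endif %} block can be dropped into a single merged template, as in the haproxy example linked above.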
18:47:30 mizdebsk, it is open floor so if you want to do so for 4 minutes
18:47:46 first, i quite dislike the situation we currently have with postgresql servers - staging has a much different setup from production, which defeats the purpose of the staging environment...
18:47:52 something may work in stg but will fail after moving to prod (we already hit this with koschei)
18:48:10 second, i have a feeling that BDR is not mature and painful to use
18:48:19 well, yes, but I hope to fix that by rolling BDR to prod. ;)
18:48:25 i missed the BDR topic due to some side convos i got pulled into
18:48:26 most importantly from my pov, it does not support some features that i would like to use (such as partial unique index, or materialized view)
18:48:33 i do have other concerns though
18:48:49 some apps are not as critical as others - the world won't end if they are not available for a few hours, and they can even withstand data loss of a few days, e.g. after restoring the db from a slightly older backup
18:48:56 for example, BDR can get you into a deadlock situation that only human intervention can resolve
18:49:02 so i have an idea: what about having a different db server for less critical apps? with no BDR and lower HA expectations
18:49:09 sure. a dead postgres server can do
18:49:11 that also
18:49:13 it could also run on fedora instead of rhel, to allow use of newer postgres features
18:49:25 nirik: i mean a data deadlock
18:49:26 mizdebsk: that's a thought indeed...
18:49:46 like if a write is accepted by A, A goes down, B takes over and accepts a conflicting write
18:49:53 that can't happen without BDR and it can happen with BDR
18:49:59 many of our apps are very simple and don't really need vast features tho
18:50:06 bowlofeggs: it can happen with a master-slave HA postgres as well
18:50:15 the slave is read only
18:50:17 bowlofeggs: sure, but life is tradeoffs.
18:50:23 most of our apps are tested with sqlite :)
18:50:24 puiterwijk: i'm advocating to let bodhi be non-HA actually
18:50:25 bowlofeggs: not after it's promoted to master because the master is down
18:50:30 bowlofeggs: I'm not.
18:50:38 i don't think we should promote the slave
18:50:39 I'd say that bodhi is one of the mission critical apps actually
18:50:45 unless we never bring the dead master back
18:50:56 it's not user facing, it's developer facing
18:51:09 users test and add karma and comments?
18:51:17 and it rarely has an actual outage (it has severe bugs, like right now, but rarely an actual outage)
18:51:17 bowlofeggs: fedora-easy-karma?
18:51:51 even fedora-easy-karma i wouldn't consider mission critical
18:52:25 I'd call updates mashing/pushing mission critical.
18:52:31 I was dreaming of a world where we would no longer need scheduled outages for updates.
18:52:43 my opinion: i'd rather not trade data safety for bodhi to get HA, when HA isn't a frequent problem for bodhi in the first place
18:52:56 bowlofeggs, I think the problem is you are wanting it not to be mission critical and we are being told by outside forces it is mission critical
18:53:26 smooge: who is saying that bodhi is mission critical? obviously, it's not my call - this is just my opinion
18:53:33 but a deadlock will bring it down too
18:53:54 * nirik notes we have seen... none of those (without human error) in stg
18:53:58 so it's still not perfectly HA
18:54:04 nirik: bodhi doesn't work in stg at all
18:54:26 also, i saw another comment earlier that wasn't true about fk's
18:54:33 sure, but I am just stating a data point
18:54:35 bowlofeggs: except it does, as I said the other day, just not your account, but that's because you have tables without a primary key
18:54:47 you are making it sound like it hits deadlocks all the time
18:54:48 it's extremely common for applications to have tables without fk's because that's how m2m relationships are most commonly done
18:54:57 all of bodhi's non-fk tables are m2m tables
18:55:45 nirik: i'm not saying it happens all the time, but i am saying that bodhi doesn't really have true downtime today and i'd rather have data safety than HA, if it were up to me (not saying it *is* up to me)
18:56:06 koschei definitely doesn't need to be HA, so ideally i would like to move it (koschei) to fedora-based, non-bdr postgres; if the db server is specific to koschei then we (sysadmin-koschei) can take care of its maintenance
18:56:08 so i feel like i'm giving up a lot and not getting something i need in the trade
18:56:10 bowlofeggs: then you can (and probably should!) still add a primary key on the combination of the two tables - tada, a primary key that also gets you data safety
18:56:13 well, I don't want to force things on people. :) I like reaching a consensus. ;)
18:56:38 puiterwijk: BDR means no data safety
18:56:44 I think we aren't going to reach it right here
18:56:53 so let us move this to the list
18:56:59 the pk doesn't solve the data safety problem, it's just a requirement for bdr
18:57:04 smooge: ok
18:57:06 bowlofeggs: I disagree on that
18:57:10 * nirik does too
18:57:16 puiterwijk: their docs say this, not me
18:57:23 the deadlocks are documented
18:57:31 but yes, let's talk more at the fad
18:57:35 bowlofeggs: also: the problem is that *you* aren't the one doing the database server maintenance, and needing to suddenly tell everyone everything is down for 15 minutes because you need to increase a disk
18:57:51 again, i'm not saying it's my call, but my opinion is that bodhi is better off as is, that's all
18:57:51 I am closing this down.. you can argue more at the fad
18:58:01 puiterwijk: fair
18:58:14 #endmeeting
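On puiterwijk's point about adding a primary key to the many-to-many association tables: in SQLAlchemy that is one keyword argument per column. A minimal sketch, with hypothetical table and column names rather than bodhi's actual schema:

    from sqlalchemy import Column, ForeignKey, Integer, MetaData, Table

    metadata = MetaData()

    # Marking both foreign-key columns as primary_key=True gives the
    # association table a composite primary key; no surrogate id column
    # is needed.
    update_bug_table = Table(
        "update_bug_table",
        metadata,
        Column("update_id", Integer, ForeignKey("updates.id"), primary_key=True),
        Column("bug_id", Integer, ForeignKey("bugs.id"), primary_key=True),
    )

The two foreign-key columns together form the primary key, which satisfies BDR's per-table primary key requirement (and also rules out duplicate association rows) without changing the table's data.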