19:00:27 <bcoca> #startmeeting ansible core public irc meeting
19:00:27 <zodbot> Meeting started Tue Jul 16 19:00:27 2019 UTC.
19:00:27 <zodbot> This meeting is logged and archived in a public location.
19:00:27 <zodbot> The chair is bcoca. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:27 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
19:00:27 <zodbot> The meeting name has been set to 'ansible_core_public_irc_meeting'
19:00:33 <bcoca> #topic open floor
19:00:44 <nitzmahone> o/
19:00:53 <jillr> o/
19:02:46 * bcoca makes note to buy crickets
19:03:15 <bcoca> if nothing new, closing in 8 mins
19:04:23 <sdoran> \o
19:04:26 <cyberpear> #info ~6 weeks until beta freeze
19:04:38 <cyberpear> ^ still accurate?
19:07:18 <jhawkesworth> hey
19:07:21 <bcoca> afaik
19:08:41 <jhawkesworth> since its quiet. Anybody know of a PR in the works along the lines of a general purpose timeout for tasks, or whether its been tried and failed in the past.
19:08:59 <jhawkesworth> oops forgot the ? at the end of my question.
19:09:06 <bcoca> lots of timeout discussions, mainly they now work at connection level
19:09:18 <bcoca> we also have 'facts timeout' but still has issues with blocking operations on target
19:09:25 <bcoca> for 'task timeout' you really have async tasks
19:09:51 <jhawkesworth> that's good if you know its coming.
19:10:40 <jhawkesworth> I hit a case today where a 'pause' just hung. 2 minute timeout, I got bored waiting after 10 mins.
19:10:47 <bcoca> if we do it by default, tasks that take very long will timeout, if you dont know its comming you would not set timeout
19:11:04 <jhawkesworth> oh it would have to be opt in
19:11:15 <bcoca> i would say, async can deliver that as of now
19:11:34 <agaffney> not with action plugins like 'pause', though
19:12:06 <bcoca> no, but action plugins freezing is a problem on controller
19:12:17 <bcoca> and pause has it's own timeouts already as options
19:12:47 <agaffney> I can see the benefit of a global task timeout option that defaults to unset
19:12:50 <jhawkesworth> if I was getting fancy I'd want some kind of 'hey' your task hasn't progressed in x minutes' and then a global 'its been an _configurable period_ giving up
19:13:14 <jhawkesworth> oops missed the word 'warning' before the 'and' above.
19:13:38 <jhawkesworth> just curious if anyone had heard of anything similar in the works
19:14:01 <bcoca> i've heard 'intentions' of such a thing for a long time, but not even a hint of a PR
19:14:10 <jtanner> i've asked for it
19:14:21 <jtanner> https://github.com/ansible/ansible/pull/57818/files is the closest to it so far
19:15:17 <agaffney> WIP *and* janky :)
19:15:19 * jhawkesworth reads PR
19:15:22 <bcoca> ^ i had not seen that
19:16:08 <bcoca> but that still relies on teh connection
19:16:40 <bcoca> but better than existing that relies on the 'protocol' mostly
19:16:56 <bcoca> that still wont affect action plugins
19:17:43 <bcoca> i was thinking jimi-c's process plugins pr might be one way, having a 'timed forked' one would be good way to implement this for 'all tasks'
19:18:49 <jtanner> we'll forever be fighting weird things at the connection layer, the network layer, the module layer, etc
19:19:03 <jtanner> i've been trying to advocate for an optional worker level timeout
19:19:21 <bcoca> true, but at worker level makes more sense 'total task time', at llexec .. we do 4-12 of those depending on the action
19:19:47 <bcoca> jtanner: that is what im agreeing with, and very easy to swapin/test when we have 'process plugins'
19:20:07 <jtanner> only supplying my reasoning
19:20:14 <agaffney> what *isn't* there going to be a plugin type for when bcoca is finished?
19:20:19 <jhawkesworth> my first thought was something along closer to the task execution loop, but can't comment meaningfully on implementation.
19:20:26 <bcoca> agaffney: smart mouthes
19:20:48 <bcoca> one issue is that we already have a `timeout` keyword and it refers to the 'protocol timeout'
19:21:15 <bcoca> jhawkesworth: the 'process/worker' would be at that level
19:22:17 * jhawkesworth trying to avoid temptation to bikeshed name for thing that doesn't exist
19:22:27 <agaffney> heh
19:22:40 <bcoca> timelimit
19:22:46 <bcoca> endofworld
19:22:52 <bcoca> agmaggedonclock
19:22:56 <bcoca> ragnarok
19:23:23 <agaffney> and here all I had was "task_timeout"
19:23:45 <bcoca> thedoctor
19:24:07 <jhawkesworth> background is I'm getting asked to make certain playbooks more reliable.
19:24:08 <bcoca> i really dislike task_ for 'task keywords'
19:24:35 <agaffney> jhawkesworth: what problems are you running into where this feature would help reliability?
19:25:36 <jhawkesworth> in short, automated releases to QA.
19:25:56 <agaffney> a hang on 'pause' sounds like a bug in ansible that you can't do much about from your playbook, or maybe you did something silly like `minutes: 120` instead of `seconds: 120` :)
19:26:36 <jhawkesworth> reasons for playbooks to fail are many and varied - developer error; windows is busy doing something else. people leaving things logged on.
19:27:06 <jhawkesworth> yeah that pause one is totally weird. It must have run 100 times successfully, but something upset it this time.
19:27:28 <agaffney> we could solve the task-level attribute name issue by just making this a global config option. as bcoca said earlier, you can already achieve this on a per-task basis with 'async'
19:27:56 <jtanner> definitely needs to be optional and null by default
19:28:07 <jhawkesworth> +1 to that
19:28:17 <bcoca> agaffney: also a global named timeout ...
19:29:55 <bcoca> nag jimi|ansible about his process plugins, then we can easily implement that
19:30:28 <agaffney> are process plugins for things like threads vs. fork for workers?
19:30:44 <bcoca> yes, so 'timed' ones seems like a good plugin to use
19:32:24 <jhawkesworth> I guess as long as it doesn't slow down task exec loop that sounds like a nice way to get what I'm after.
19:33:21 <bcoca> or we can just hijack current timeout, change the meaning and add 'per connection plugin timeouts (they are already there)'
19:35:20 <jhawkesworth> hmm, timeout on its own is kinda vague. Could be connecting to host time out, response from host timeout.
19:35:55 <bcoca> he, each protocol can have N timeouts
19:36:11 <bcoca> auth/tcp/keepalive/total connection time/time to connect/time to socket/etc
19:36:21 <agaffney> jhawkesworth: all the more reason to hijack it with an overall task timeout
19:36:33 <bcoca> why i think 'current' timeout make smore sense as 'task timeout' .. its what most people assume anyways
19:37:06 <jhawkesworth> oh all right I'll bikeshed some names...
19:37:32 <jhawkesworth> `action_timeout` perhaps?
19:37:57 <jhawkesworth> or does that make it sound like it only applies to actions
19:38:52 <bcoca> to be fair, they are always an 'action' and we've mislead most to think of them as 'modules' but in this channel we know its actually a combination
19:39:13 <bcoca> action: /local_action: are the actual underlying keywords for a task action
19:39:28 <agaffney> and everything uses the 'normal' action by default
19:39:43 <bcoca> if no action plugin is matched
19:40:58 <jhawkesworth> hmm if the 'normal' action plugin were a configurable thing, perhaps the normal action could be a timed_normal action, if you see what i mean
19:41:44 <jhawkesworth> not sure it buys us anything over process plugins idea though
19:41:47 <bcoca> well, you can do that and override teh behaviour easily, then just create your own execute_module that has a timeout, but that wont cover other actions (pause, service, etc)
19:42:09 <bcoca> process plugins cover ALL actions that are not hardcoded (meta, add_host, group_by ..)
19:42:11 <jhawkesworth> ah yeah of course.. so .. higher level than actions really
19:43:46 <nitzmahone> Having action timeouts is great, but cancelling things isn't always possible, depending on connection and the operation that's blocking... Things get a lot more complex with threads, too, since they're not generally safe to abort.
19:44:12 <nitzmahone> So unless the timeout is fatal, recovering gracefully is often not possible
19:44:47 <jhawkesworth> I was thinking it would be fatal tbh
19:45:02 <nitzmahone> I mean controller fatal, not just task fatal
19:45:36 <jhawkesworth> I was thinking controller fatal.
19:46:48 <jhawkesworth> task fatal would be nice actually.. mail plugin makes for 'total perspective vortex' on failures.
19:48:00 <jhawkesworth> well good to chat it through. I can make more use of async and `wait_for..`
19:48:23 <nitzmahone> Yeah, without process isolation, it's nearly impossible to do reliably recoverable preemptive action timeouts.
19:48:51 <nitzmahone> Cooperative, sure, and maybe that's enough in many cases
19:49:41 <nitzmahone> That's a whole lot easier to do under py3
19:49:44 <bcoca> nitzmahone: why i suggested this as ONE process plugin, timed_forkes, which would allow user to choose which feature best suits them
19:50:13 <bcoca> you can do similar with threads but its not as easy to abandon them
19:50:29 <bcoca> you can still get stuck on 'cleanup'
19:51:43 <nitzmahone> Not just that either- a thread abort can leave hanging locks and other things that aren't obvious until you hang or deadlock at any arbitrary point later
19:52:04 <nitzmahone> So preemptive timeout in process isn't really possible
19:52:11 <bcoca> its possible, just not advisable
19:52:12 <nitzmahone> (only cooperative)
19:52:54 <agaffney> cooperative is reasonable when you own both sides
19:53:02 <bcoca> but .. plugins!
19:53:17 <bcoca> we can ensure the plugings we control.. but soo many we don't
19:54:06 <jhawkesworth> Hmm, well I'll think more about what I actually need. First thought was pretty basic 'do_within_timeout or die' but its clear there's more to it.
19:54:39 <nitzmahone> tis why Python doesn't have a managed external thread abort- most languages that *do* have them are like "uhh, yeah, if you use this, you basically need to tear down your process"
19:54:43 <bcoca> there are many interpretations of 'die'
19:55:11 <bcoca> nitzmahone: it does have an 'abandon' .. but as i said, not really
19:55:14 * jtanner starts to wonder if this could be hacked together with a bash script around the ansible cli
19:55:27 <bcoca> jtanner: timout ansible-playbook ....
19:55:34 <bcoca> timeout
19:55:36 <jtanner> that kills the master
19:55:49 <jtanner> something to look for child pids, and kill the child if too old
19:55:55 <bcoca> ^ actually . ansible_ssh_executable: timeout -n 10 ssh
19:56:20 <bcoca> ^ not that, but wrapper that does that
19:56:43 <nitzmahone> bcoca: yeah, I actually played with using timeout in the connection plugin on that janky LLEC timeout PR
19:57:07 <bcoca> nitzmahone: i saw, same issue its not 'task level' .. it should still work for some cases
19:57:27 <jhawkesworth> plenty of my playbooks don't use ssh but yeah shell script wrapper might get me out of this particular hole.
19:58:12 <bcoca> ^ playbook timeout, play timeout, role timeout, task timeout ... many timeouts .. and ive not gotten to connection nor 'module execution'
19:58:31 <jhawkesworth> since perl was my first language, my definition of die is pretty much https://perldoc.perl.org/functions/die.html
19:59:07 <bcoca> he, that is mine too ... but others differ on what death means an dhow to handle funeral expenses
19:59:21 <jhawkesworth> :-)
19:59:27 <bcoca> 5.30 .. wow
19:59:28 <jhawkesworth> thanks for chatting it through.
19:59:43 <jhawkesworth> time for windows working group though so .. cheers
19:59:48 <bcoca> np, anytime, you can start these in devel also, not really what meetings are normally used for
19:59:53 <bcoca> glwt
19:59:56 <bcoca> #endmeeting
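
Note on the 'timed forked' worker idea discussed above: a minimal Python sketch of a process-isolated, opt-in timeout. This is not Ansible's worker code and not the process plugins PR; `run_with_timeout` and `_child` are hypothetical names made up for illustration. It assumes a POSIX controller where the "fork" start method is available, and it only shows why process isolation makes a preemptive timeout recoverable: the parent can kill the child without worrying about locks left behind in its own process.

```python
import multiprocessing
import time


def _child(conn, func, args):
    # Runs inside the forked worker: send back the result or the error text.
    try:
        conn.send(("ok", func(*args)))
    except Exception as exc:  # report failures instead of dying silently
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_with_timeout(func, args=(), timeout=None):
    """Run func(*args) in a forked child; terminate it after `timeout` seconds.

    timeout=None means no limit, so the behaviour is opt-in and unset by
    default. Returns a (status, payload) tuple where status is "ok",
    "error" or "timeout".
    """
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX only
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child, args=(child_conn, func, args))
    proc.start()
    child_conn.close()  # parent keeps only the read end
    try:
        if parent_conn.poll(timeout):  # poll(None) waits indefinitely
            try:
                status, payload = parent_conn.recv()
            except EOFError:
                status, payload = "error", "worker exited without a result"
        else:
            status, payload = "timeout", None
            proc.terminate()  # SIGTERM; only the child is torn down
    finally:
        proc.join()
        parent_conn.close()
    return status, payload


if __name__ == "__main__":
    print(run_with_timeout(sum, ([1, 2, 3],)))
    print(run_with_timeout(time.sleep, (5,), timeout=0.5))
```

As noted in the discussion, this gives a 'total task time' limit at the worker level only; it does nothing for cleanup on the target host, and a thread-based worker could not be abandoned this safely.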