19:00:27 <bcoca> #startmeeting ansible core public irc meeting
19:00:27 <zodbot> Meeting started Tue Jul 16 19:00:27 2019 UTC.
19:00:27 <zodbot> This meeting is logged and archived in a public location.
19:00:27 <zodbot> The chair is bcoca. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:27 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
19:00:27 <zodbot> The meeting name has been set to 'ansible_core_public_irc_meeting'
19:00:33 <bcoca> #topic open floor
19:00:44 <nitzmahone> o/
19:00:53 <jillr> o/
19:02:46 * bcoca makes note to buy crickets
19:03:15 <bcoca> if nothing new, closing in 8 mins
19:04:23 <sdoran> \o
19:04:26 <cyberpear> #info ~6 weeks until beta freeze
19:04:38 <cyberpear> ^ still accurate?
19:07:18 <jhawkesworth> hey
19:07:21 <bcoca> afaik
19:08:41 <jhawkesworth> since it's quiet.  Anybody know of a PR in the works along the lines of a general purpose timeout for tasks, or whether it's been tried and failed in the past.
19:08:59 <jhawkesworth> oops forgot the ? at the end of my question.
19:09:06 <bcoca> lots of timeout discussions, mainly they now work at connection level
19:09:18 <bcoca> we also have 'facts timeout' but still has issues with blocking operations on target
19:09:25 <bcoca> for 'task timeout' you really have async tasks
19:09:51 <jhawkesworth> that's good if  you know its coming.
19:10:40 <jhawkesworth> I hit a case today where a 'pause' just hung.  2 minute timeout, I got bored waiting after 10 mins.
19:10:47 <bcoca> if we do it by default, tasks that take very long will timeout, if you don't know it's coming you would not set timeout
19:11:04 <jhawkesworth> oh it would have to be opt in
19:11:15 <bcoca> i would say, async can deliver that as of now
19:11:34 <agaffney> not with action plugins like 'pause', though
19:12:06 <bcoca> no, but action plugins freezing is a problem on controller
19:12:17 <bcoca> and pause has its own timeouts already as options
19:12:47 <agaffney> I can see the benefit of a global task timeout option that defaults to unset
19:12:50 <jhawkesworth> if I was getting fancy I'd want some kind of 'hey, your task hasn't progressed in x minutes' and then a global 'it's been a _configurable period_, giving up'
19:13:14 <jhawkesworth> oops missed the word 'warning' before the 'and' above.
19:13:38 <jhawkesworth> just curious if anyone had heard of anything similar in the works
19:14:01 <bcoca> i've heard 'intentions' of such a thing for a long time, but not even  a hint of a PR
19:14:10 <jtanner> i've asked for it
19:14:21 <jtanner> https://github.com/ansible/ansible/pull/57818/files is the closest to it so far
19:15:17 <agaffney> WIP *and* janky :)
19:15:19 * jhawkesworth reads PR
19:15:22 <bcoca> ^ i had not seen that
19:16:08 <bcoca> but that still relies on the connection
19:16:40 <bcoca> but better than existing that relies on the 'protocol' mostly
19:16:56 <bcoca> that still wont affect action plugins
19:17:43 <bcoca> i was thinking jimi-c's process plugins pr might be one way, having a 'timed forked' one would be good way to implement this for 'all tasks'
19:18:49 <jtanner> we'll forever be fighting weird things at the connection layer, the network layer, the module layer, etc
19:19:03 <jtanner> i've been trying to advocate for an optional worker level timeout
19:19:21 <bcoca> true, but at worker level makes more sense 'total task time', at llexec  .. we do 4-12 of those depending on the action
19:19:47 <bcoca> jtanner: that is what im agreeing with, and very easy to swapin/test when we have 'process plugins'
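A minimal sketch of the "timed forked" worker idea discussed above, not Ansible's actual worker code and not the process plugins PR. It assumes a Unix controller (it uses the "fork" start method), and the names run_with_time_limit, _child and slow_task are hypothetical. The limit is opt-in: time_limit=None means wait forever, matching the "optional and unset by default" suggestion.

```python
import multiprocessing as mp
import time

# Assumption: Unix only. "fork" mirrors the forked-worker model in the discussion.
_ctx = mp.get_context("fork")


def _child(func, args, conn):
    # Runs in the forked child: send the result back, then close the pipe.
    try:
        conn.send(func(*args))
    finally:
        conn.close()


def run_with_time_limit(func, args=(), time_limit=None):
    """Run func(*args) in a forked child process.

    time_limit=None disables the limit. Returns (timed_out, result).
    """
    parent_conn, child_conn = _ctx.Pipe(duplex=False)
    proc = _ctx.Process(target=_child, args=(func, args, child_conn))
    proc.start()
    child_conn.close()  # the parent keeps only the read end
    try:
        if parent_conn.poll(time_limit):
            try:
                return False, parent_conn.recv()
            except EOFError:
                raise RuntimeError("worker exited without returning a result")
        # Still no result at the deadline: kill the whole child process.
        proc.terminate()
        return True, None
    finally:
        proc.join()
        parent_conn.close()


def slow_task(seconds):
    # Hypothetical stand-in for a task that hangs.
    time.sleep(seconds)
    return "done"


if __name__ == "__main__":
    print(run_with_time_limit(slow_task, (0.1,), time_limit=2))
    print(run_with_time_limit(slow_task, (5,), time_limit=0.5))
```

Because the limit is enforced on the whole child process, it covers "total task time" regardless of what the task is blocked on, at the cost of discarding whatever state the child held.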
19:20:07 <jtanner> only supplying my reasoning
19:20:14 <agaffney> what *isn't* there going to be a plugin type for when bcoca is finished?
19:20:19 <jhawkesworth> my first thought was something along closer to the task execution loop, but can't comment meaningfully on implementation.
19:20:26 <bcoca> agaffney: smart mouths
19:20:48 <bcoca> one issue is that we already have a `timeout` keyword and it refers to the 'protocol timeout'
19:21:15 <bcoca> jhawkesworth: the 'process/worker' would be at that level
19:22:17 * jhawkesworth trying to avoid temptation to bikeshed name for thing that doesn't exist
19:22:27 <agaffney> heh
19:22:40 <bcoca> timelimit
19:22:46 <bcoca> endofworld
19:22:52 <bcoca> armageddonclock
19:22:56 <bcoca> ragnarok
19:23:23 <agaffney> and here all I had was "task_timeout"
19:23:45 <bcoca> thedoctor
19:24:07 <jhawkesworth> background is I'm getting asked to make certain playbooks more reliable.
19:24:08 <bcoca> i really dislike task_ for 'task keywords'
19:24:35 <agaffney> jhawkesworth: what problems are you running into where this feature would help reliability?
19:25:36 <jhawkesworth> in short, automated releases to QA.
19:25:56 <agaffney> a hang on 'pause' sounds like a bug in ansible that you can't do much about from your playbook, or maybe you did something silly like `minutes: 120` instead of `seconds: 120` :)
19:26:36 <jhawkesworth> reasons for playbooks to fail are many and varied - developer error; windows is busy doing something else. people leaving things logged on.
19:27:06 <jhawkesworth> yeah that pause one is totally  weird.  It must have run 100 times successfully, but something upset it this time.
19:27:28 <agaffney> we could solve the task-level attribute name issue by just making this a global config option. as bcoca said earlier, you can already achieve this on a per-task basis with 'async'
19:27:56 <jtanner> definitely needs to be optional and null by default
19:28:07 <jhawkesworth> +1 to that
19:28:17 <bcoca> agaffney: also a global named timeout ...
19:29:55 <bcoca> nag jimi|ansible about his process plugins, then we can easily implement that
19:30:28 <agaffney> are process plugins for things like threads vs. fork for workers?
19:30:44 <bcoca> yes, so 'timed' ones seems like a good plugin to use
19:32:24 <jhawkesworth> I guess as long as it doesn't slow down task exec loop that sounds like a nice way to get what I'm after.
19:33:21 <bcoca> or we can just hijack current timeout, change the meaning and add 'per connection plugin timeouts' (they are already there)
19:35:20 <jhawkesworth> hmm, timeout on its own is kinda vague.  Could be connecting to host time out, response from host timeout.
19:35:55 <bcoca> heh, each protocol can have N timeouts
19:36:11 <bcoca> auth/tcp/keepalive/total connection time/time to connect/time to socket/etc
19:36:21 <agaffney> jhawkesworth: all the more reason to hijack it with an overall task timeout
19:36:33 <bcoca> why i think 'current' timeout makes more sense as 'task timeout' .. it's what most people assume anyways
19:37:06 <jhawkesworth> oh all right I'll bikeshed some names...
19:37:32 <jhawkesworth> `action_timeout` perhaps?
19:37:57 <jhawkesworth> or does that make it sound like it only applies to actions
19:38:52 <bcoca> to be fair, they are always an 'action' and we've misled most to think of them as 'modules' but in this channel we know it's actually a combination
19:39:13 <bcoca> action: /local_action: are the actual underlying keywords for a task action
19:39:28 <agaffney> and everything uses the 'normal' action by default
19:39:43 <bcoca> if no action plugin is matched
19:40:58 <jhawkesworth> hmm if the 'normal' action plugin were a configurable thing, perhaps the normal action could be a timed_normal action, if you see what i mean
19:41:44 <jhawkesworth> not sure it buys us anything over process plugins idea though
19:41:47 <bcoca> well, you can do that and override the behaviour easily, then just create your own execute_module that has a timeout, but that won't cover other actions (pause, service, etc)
19:42:09 <bcoca> process plugins cover ALL actions that are not hardcoded (meta, add_host, group_by ..)
19:42:11 <jhawkesworth> ah yeah of course..  so .. higher level than actions really
19:43:46 <nitzmahone> Having action timeouts is great, but cancelling things isn't always possible, depending on connection and the operation that's blocking... Things get a lot more complex with threads, too, since they're not generally safe to abort.
19:44:12 <nitzmahone> So unless the timeout is fatal, recovering gracefully is often not possible
19:44:47 <jhawkesworth> I was thinking it would be fatal tbh
19:45:02 <nitzmahone> I mean controller fatal, not just task fatal
19:45:36 <jhawkesworth> I was thinking controller fatal.
19:46:48 <jhawkesworth> task fatal would be nice actually..  mail plugin makes for 'total perspective vortex' on failures.
19:48:00 <jhawkesworth> well good to chat it through.  I can make more use of async and `wait_for..`
19:48:23 <nitzmahone> Yeah, without process isolation, it's nearly impossible to do reliably recoverable preemptive action timeouts.
19:48:51 <nitzmahone> Cooperative, sure, and maybe that's enough in many cases
19:49:41 <nitzmahone> That's a whole lot easier to do under py3
19:49:44 <bcoca> nitzmahone: why i suggested this as ONE process plugin, timed_forks, which would allow user to choose which feature best suits them
19:50:13 <bcoca> you can do similar with threads but its not as easy to abandon them
19:50:29 <bcoca> you can still get stuck on 'cleanup'
19:51:43 <nitzmahone> Not just that either- a thread abort can leave hanging locks and other things that aren't obvious until you hang or deadlock at any arbitrary point later
19:52:04 <nitzmahone> So preemptive timeout in process isn't really possible
19:52:11 <bcoca> its possible, just not advisable
19:52:12 <nitzmahone> (only cooperative)
19:52:54 <agaffney> cooperative is reasonable when you own both sides
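A rough sketch of the cooperative timeout described above, assuming the work can be split into chunks that check a stop flag. The names run_cooperatively and chunked_work are hypothetical. The thread is never aborted; it is only asked to stop, so nothing is left half-locked, but code that never checks the flag cannot be interrupted this way.

```python
import threading
import time


def chunked_work(stop, chunks=100, chunk_seconds=0.05):
    # Hypothetical workload that checks the stop flag between chunks.
    completed = 0
    for _ in range(chunks):
        if stop.is_set():
            break  # stop cleanly at a safe point
        time.sleep(chunk_seconds)
        completed += 1
    return completed


def run_cooperatively(work, time_limit):
    """Run work(stop_event) in a thread and request a stop after time_limit.

    Returns (timed_out, value returned by work).
    """
    stop = threading.Event()
    result = {}

    def target():
        result["value"] = work(stop)

    thread = threading.Thread(target=target)
    thread.start()
    thread.join(time_limit)
    timed_out = thread.is_alive()
    if timed_out:
        stop.set()     # ask, do not force
        thread.join()  # blocks until the work notices the flag
    return timed_out, result.get("value")


if __name__ == "__main__":
    print(run_cooperatively(chunked_work, 0.2))
```

The final thread.join() is the weak spot raised in the discussion: if a plugin never checks the flag, the caller waits for as long as the plugin runs.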
19:53:02 <bcoca> but .. plugins!
19:53:17 <bcoca> we can ensure the plugins we control.. but soo many we don't
19:54:06 <jhawkesworth> Hmm, well I'll think more about what I actually need.  First thought was pretty basic 'do_within_timeout or die' but its clear there's more to it.
19:54:39 <nitzmahone> tis why Python doesn't have a managed external thread abort- most languages that *do* have them are like "uhh, yeah, if you use this, you basically need to tear down your process"
19:54:43 <bcoca> there are many interpretations of 'die'
19:55:11 <bcoca> nitzmahone: it does have an 'abandon' .. but as i said, not really
19:55:14 * jtanner starts to wonder if this could be hacked together with a bash script around the ansible cli
19:55:27 <bcoca> jtanner: timout ansible-playbook ....
19:55:34 <bcoca> timeout
19:55:36 <jtanner> that kills the master
19:55:49 <jtanner> something to look for child pids, and kill the child if too old
19:55:55 <bcoca> ^ actually . ansible_ssh_executable: timeout -n 10 ssh
19:56:20 <bcoca> ^ not that, but wrapper that does that
19:56:43 <nitzmahone> bcoca: yeah, I actually played with using timeout in the connection plugin on that janky LLEC timeout PR
19:57:07 <bcoca> nitzmahone: i saw, same issue its not 'task level' .. it should still work for some cases
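One way the "wrapper that does that" could look, as a hedged sketch only: a small script that runs ssh under a time limit and could be pointed at by ansible_ssh_executable. The function name run_limited and the 10 second limit are assumptions for illustration; exit status 124 follows the convention of the coreutils timeout command. As noted above, this bounds a single ssh invocation, not a whole task.

```python
import subprocess
import sys


def run_limited(cmd, limit_seconds):
    """Run cmd, killing it if it exceeds limit_seconds.

    Returns the command's exit status, or 124 if the limit was hit.
    """
    try:
        return subprocess.run(cmd, timeout=limit_seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child here.
        return 124


if __name__ == "__main__":
    # Pass every argument straight through to the real ssh binary.
    sys.exit(run_limited(["ssh", *sys.argv[1:]], limit_seconds=10))
```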
19:57:27 <jhawkesworth> plenty of my playbooks don't use ssh but yeah shell script wrapper might get me out of this particular hole.
19:58:12 <bcoca> ^ playbook timeout, play timeout, role timeout, task timeout ... many timeouts  .. and ive not gotten to connection nor 'module execution'
19:58:31 <jhawkesworth> since perl was my first language, my definition of die is pretty much https://perldoc.perl.org/functions/die.html
19:59:07 <bcoca> heh, that is mine too ... but others differ on what death means and how to handle funeral expenses
19:59:21 <jhawkesworth> :-)
19:59:27 <bcoca> 5.30 .. wow
19:59:28 <jhawkesworth> thanks for chatting it through.
19:59:43 <jhawkesworth> time for windows working group though so .. cheers
19:59:48 <bcoca> np, anytime, you can start these in devel also, not really what meetings are normally used for
19:59:53 <bcoca> glwt
19:59:56 <bcoca> #endmeeting