08:30:15 <hagarth> #startmeeting
08:30:15 <zodbot> Meeting started Fri Jun  6 08:30:15 2014 UTC.  The chair is hagarth. Information about MeetBot at http://wiki.debian.org/MeetBot.
08:30:15 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
08:30:30 <hagarth> xavih: let us wait for Dan to join in
08:30:38 <xavih> ok, no problem
08:30:40 <hagarth> who else do we have here today?
08:31:15 <krishnan_p> xavih, Hi. This is kp. I work primarily on glusterd
08:31:45 <xavih> krishnan_p: nice to meet you :)
08:32:49 <pranithk> xavih: Pranith here, work on afr :-)
08:33:01 <hagarth> xavih: Dan is having a bit of problems with his laptop. Should be in here soon.
08:33:04 <xavih> pranithk: Oh, really :D hehe
08:33:30 <pranithk> xavih: :-)
08:33:44 <raghu> xavih: Raghavendra here. Currently working on snapshots
08:34:31 <hagarth> there comes Dan
08:35:05 * ndevos is here, but currently working on a *cough* Xen *cough* kernel bug
08:35:19 <hagarth> ndevos: good luck :)
08:35:29 <ndevos> hagarth: hehe, thanks
08:35:35 <hagarth> shall we get started?
08:35:52 <pranithk> hagarth: +1
08:35:54 <xavih> yes
08:36:26 <dlambrig> Xavi, we were wondering if you could walk us through a write operation, and we could ask questions as we go
08:37:03 <xavih> ok, I can try...
08:37:24 <xavih> do we use the latest code review ? (pushed yesterday)
08:37:39 <dlambrig> sure
08:37:52 <dlambrig> we have it :)
08:38:02 <xavih> ok then
08:38:32 <xavih> the entry point is easy: ec_gf_writev() on ec.c
08:39:01 <xavih> here I only call the real write function with some additional parameters
08:39:19 <xavih> I'll only comment the interesting ones, the other should be obvious
08:39:40 <pranithk> xavih: sure
08:39:49 <xavih> the third parameter is a bitmask of subvolumes to which the request should be sent
08:39:59 <xavih> in this case -1 means all
08:40:25 <dlambrig> Ok
08:40:25 <xavih> each bit refers to a subvolume in the order defined in the volfile
08:40:37 <hagarth> ok
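The subvolume bitmask Xavi describes can be sketched like this (illustrative C with hypothetical helper names, not the actual ec.c code; only the bit-per-subvolume idea and the -1 = "all" convention come from the discussion above):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: bit i of the mask selects the i-th subvolume
 * in volfile order; -1 (all bits set) means "send to all". */
typedef uint32_t subvol_mask_t;

#define EC_ALL_SUBVOLS ((subvol_mask_t)-1)

/* is subvolume idx selected by this mask? */
static int ec_mask_has(subvol_mask_t mask, int idx)
{
    return (mask >> idx) & 1;
}

/* how many of the first nsubvols subvolumes are selected */
static int ec_mask_count(subvol_mask_t mask, int nsubvols)
{
    int i, count = 0;
    for (i = 0; i < nsubvols; i++)
        count += ec_mask_has(mask, i);
    return count;
}
```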
08:41:23 <xavih> fourth argument says how many answers are needed at minimum to consider the result valid
08:41:54 <xavih> answers are grouped looking at the ret code, errno, xdata and other things
08:42:23 <hagarth> xavih: what is the usual value for the fourth argument?
08:42:54 <xavih> a group of combined answers will only be considered a valid answer for this request if it's formed by, at least, the minimum number of individual answers specified in this argument
08:43:06 <xavih> hagarth: it depends on the request
08:43:44 <xavih> for example
08:44:20 <xavih> normal requests like readv, writev, truncate, unlink, ... all use EC_MINIMUM_MIN
08:44:50 <krishnan_p> xavih, can we think of a grouping of response as a tuple defined by (op_ret, op_errno, xdata)
08:44:52 <xavih> this means that at least N (bricks) - R (redundancy) subvolumes must agree on the answer
08:45:04 <xavih> this can be seen as a quorum enforcement
08:45:13 <hagarth> xavih: right
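The quorum rule just described (at least N - R subvolumes must agree) can be sketched as a one-line check; this is an illustrative helper, not the actual ec code:

```c
#include <assert.h>

/* Sketch of the quorum rule discussed above: a combined group of
 * answers is valid only if it holds at least N - R individual
 * answers, N = brick count, R = redundancy. Illustrative name. */
static int ec_group_is_valid(int group_size, int bricks, int redundancy)
{
    return group_size >= bricks - redundancy;
}
```

So with N = 6 bricks and R = 2 redundancy, 4 matching answers form a valid group and 3 do not.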
08:45:51 <xavih> krishnan_p: yes, but it also checks other things like iatt or other cbk arguments, depending on the request
08:46:16 <pranithk> xavih: what will happen when a quorum of bricks is up at the time of winding, but the operation succeeds on fewer bricks?
08:46:21 <hagarth> xavih: why is EC_MINIMUM_MIN -2 ? rather, what is the significance of -2?
08:46:46 <xavih> in some cases the minimum is a must, for example on read requests, because if less than N - R are available, it's impossible to generate an answer
08:47:37 <xavih> pranithk: the request will be sent, and when it's detected that there aren't enough combinable answers, an EIO will be reported to the caller
08:48:12 <pranithk> xavih: but the data is written on some of the bricks... self-heal handles it is it?
08:48:27 <xavih> hagarth: it's only because it's determined later when the request is initiated. This could have been taken from ec->fragments
08:49:11 <xavih> hagarth: EC_MINIMUM_ALL can only be determined when the operation begins (it depends on alive bricks, successful locks and successful preop)
08:49:27 <xavih> hagarth: I used constants for the other cases only to be consistent
08:49:44 <hagarth> xavih: ok
08:49:45 <xavih> hagarth: and avoid having to access ec in ec_gf_xxx() functions
08:50:12 <xavih> pranithk: if the data is written to enough bricks (i.e N - R at least), self heal will recover it
08:50:40 <pranithk> xavih: in the case where it is not, what will happen to the partial write?
08:50:50 <krishnan_p> xavih, what is the type of 'ec'? Is it ec_fop_data_t?
08:50:51 <xavih> pranithk: however if, for example, there are N - R bricks alive and one of them fails the write, currently the data is irrecoverable
08:51:05 <xavih> krishnan_p: ec_t
08:51:13 <pranithk> xavih: hmm...
08:51:25 <xavih> krishnan_p: it's the private data from this->private
08:51:36 <xavih> pranithk: I don't know how to solve this situation...
08:51:56 <pranithk> xavih: ok we shall see about it later... please continue writev from where we left off...
08:52:05 <xavih> ok
08:52:09 <krishnan_p> xavih, OK
08:52:37 <dlambrig> lets continue the flow
08:53:06 <xavih> the minimum argument becomes important in self-heal, where some requests are valid even with one valid answer
08:53:30 <xavih> fifth argument is the callback function to be called when the fop is finished. It can be NULL. For normal fops it's the default_<fop>_cbk() function
08:53:42 <hagarth> ok
08:54:28 <xavih> sixth argument is any data to be attached to the fop (used on self-heal)
08:54:51 <xavih> the remaining args are the normal writev arguments
08:55:02 <pranithk> dlambrig: You should talk to the guy from ceph about how they handle partial failures...
08:55:52 <xavih> ec_writev() in ec-inode-write.c prepares the request
08:56:26 <xavih> it first calls ec_fop_data_allocate(), which creates the fop_data_t structure that will be used throughout all the fop processing
08:57:10 <xavih> do you want me to detail the arguments of this function ?
08:57:16 <dlambrig> pranithk: The Ceph engineer is Loic Dachary and he is a very good resource for us, he is not yet a RH employee but will be soon.
08:57:27 <dlambrig> xavi- yes, please do
08:57:31 <xavih> ok
08:57:31 <hagarth> pranithk: Loic is actually in #gluster-dev atm
08:57:31 <dlambrig> that is a key function
08:58:06 <xavih> third argument is the fop type. Used basically for logging
08:58:11 <xavih> next one are flags
08:58:48 <xavih> they say whether the fop needs locking (inode or entry) and preop handling
08:59:04 <xavih> it also says to how many subvolumes the request must be sent
08:59:16 <dlambrig> is that the 2?
08:59:52 <dlambrig> what is the 2? :)
09:00:08 <xavih> no, flags also say what "things" must be merged in combined answers. It can be a dict, a loc, etc
09:00:36 <xavih> since there can be multiple iatt answers, that 2 says how many iatt must be combined from answers
09:00:52 <xavih> in this case, the write callback receives 2 iatt structures that must be merged
09:01:06 <krishnan_p> xavih, how do we determine we need to combine only 2 of them?
09:01:21 <xavih> krishnan_p: looking at the callback argument list :)
09:02:02 <xavih> krishnan_p: all iatt on an answer must agree to be combined
09:02:25 <xavih> otherwise it means that the brick has had some problem and it's not in sync with others
09:03:12 <xavih> is this clear ?
09:03:17 <raghu> xavih: combining 2 iatt structures in writev_cbk, do you mean prebuf and postbuf?
09:03:37 <xavih> raghu: yes, in this case it corresponds to prebuf and postbuf
09:03:43 <krishnan_p> xavih, what would be the behaviour if the iatt's didn't agree? i.e. not in sync
09:04:12 <xavih> krishnan_p: then the answers won't be combined. They will belong to two different groups
09:05:18 <krishnan_p> xavih, OK. So, does the combining operation take care of whether the responses (answers) are in sync?
09:06:02 <xavih> krishnan_p: yes. This is done to detect inconsistent bricks and initiate self-heal on them when necessary
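The grouping of answers discussed above (krishnan_p's tuple of op_ret/op_errno/xdata) can be sketched like this; the struct and function names are illustrative only, and the real ec code also compares xdata, iatt and other cbk arguments:

```c
#include <assert.h>

/* Sketch of answer grouping: two answers are combinable when their
 * (op_ret, op_errno) tuples match; the real code checks more fields. */
struct ec_answer {
    int op_ret;
    int op_errno;
};

static int ec_answers_combinable(const struct ec_answer *a,
                                 const struct ec_answer *b)
{
    return a->op_ret == b->op_ret && a->op_errno == b->op_errno;
}

/* size of the largest group of mutually combinable answers; answers
 * outside that group point at bricks that are out of sync */
static int ec_largest_group(const struct ec_answer *ans, int n)
{
    int i, j, best = 0;
    for (i = 0; i < n; i++) {
        int count = 0;
        for (j = 0; j < n; j++)
            if (ec_answers_combinable(&ans[i], &ans[j]))
                count++;
        if (count > best)
            best = count;
    }
    return best;
}
```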
09:06:49 <xavih> is it ok to continue with the next arg ?
09:07:09 <dlambrig> yes
09:07:16 <krishnan_p> xavih, if after the combine we don't receive N-R answers in any of the groups, then we fail the writev?
09:07:59 <xavih> krishnan_p: yes. This is what pranithk said. I don't know how to solve this situation
09:08:06 <xavih> in the current implementation I return EIO
09:08:16 <krishnan_p> xavih, oops. OK
09:08:34 <xavih> target and minimum are already explained
09:09:36 <xavih> next one says how many answers are expected to be received. Now that I've seen it I see that it's something old and probably I could remove this one...
09:09:57 <xavih> I think I always use the same value... I'll review later...
09:10:12 <xavih> next one is the function to be called to wind the request to each subvolume
09:10:38 <xavih> except for write, it's a straightforward STACK_WIND
09:10:56 <xavih> next one is the function that will control the life cycle of the fop
09:11:03 <xavih> it's basically a state machine
09:11:24 <xavih> callback and data come from ec_gf_writev()
09:11:31 <xavih> any question on these arguments ?
09:11:50 <dlambrig> not for me, we will get to the state machine internals shortly
09:11:56 <xavih> yes
09:11:57 <xavih> ok
09:12:32 <xavih> if ec_fop_data_allocate() fails, the callback function is called with an EIO
09:12:50 <xavih> otherwise, fop structure is populated with writev arguments
09:13:06 <xavih> this is what ec_fop_data_set_xxx() does
09:13:20 <krishnan_p> OK
09:13:30 <xavih> finally, ec_manager() is called to begin the processing of the request
09:13:37 <dlambrig> now the fun begins..
09:13:53 <xavih> it's important to note that the second argument of ec_manager() is an error code
09:14:16 <xavih> if some of the ec_fop_data_set_xxx() failed, the operation will be initiated with an EIO error
09:14:45 <xavih> let's go to ec_manager() on ec-common.c
09:14:52 <hagarth> xavih: ok
09:15:17 <xavih> EC_STATE_INVALID is 0 (the value an uninitialized fop will have)
09:16:03 <xavih> every state in the state machine can have two "flavors": when there is an error and when there is not
09:16:27 <xavih> any positive state means everything is ok. a negative state means some error happened
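The "two flavors" convention just described can be sketched with a couple of illustrative helpers (hypothetical names, not the actual ec-common.c functions): on error the state value is simply negated, so the sign carries the error flag while the magnitude still identifies the state.

```c
#include <assert.h>

/* Sketch: positive state = ok, negated state = same state reached
 * with an error. Names are illustrative. */
static int ec_apply_error(int state, int error)
{
    return (error != 0) ? -state : state;
}

static int ec_state_is_error(int state)
{
    return state < 0;
}

/* recover which state this is, regardless of error flavor */
static int ec_state_base(int state)
{
    return (state < 0) ? -state : state;
}
```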
09:17:11 <xavih> then, if the fop needs to lock an inode or entry, the owner of the stack frame is set to a different value for each request
09:17:33 <xavih> __ec_manager() is the core of the state machine
09:18:05 <xavih> first it handles the error code. If there is an error code, the state is negated (to indicate an error)
09:18:45 <xavih> then the handler specified in the call of ec_fop_data_allocate() is called to manage the states of the fop
09:19:24 <xavih> when it returns (will see that function later), if the state is EC_STATE_INVALID, it means that the state machine has finished and it is released
09:19:56 <xavih> if not, ec_wait() waits until any possible subrequests initiated by the fop->handler() are completed
09:20:10 <xavih> it returns the error code from that subrequests
09:20:19 <xavih> and the next state is executed
09:20:27 <xavih> any question here ?
09:20:39 <krishnan_p> xavih, does ec_wait block until the subrequest issued by the handler returns?
09:21:17 <krishnan_p> xavih, or did you mean that the fop 'wait' in the same state?
09:21:31 <xavih> krishnan_p: no, it does not block
09:22:13 <xavih> if there are pending requests, -1 is returned and it exits the loop (__ec_manager() will be called again when the request finishes)
09:22:23 <xavih> it's not a synchronous wait
09:22:36 <xavih> ec_wait() always returns immediately
09:22:47 <xavih> right ?
09:22:58 <xavih> do you want to look at ec_wait() now ?
09:23:01 <pranithk> xavih: Will it lead to a busy loop?
09:23:32 <xavih> pranithk: no, if there is pending work, it will return -1, and __ec_manager() will quit the loop
09:23:47 <krishnan_p> xavih, No. I got it. I wanted to understand what you meant when you said it will 'wait'. So I assume you meant that the fop waits in the same state until the subrequests return
09:24:02 <xavih> if there isn't pending work, it will take the error code from subrequests executed and return it
09:24:28 <xavih> krishnan_p: yes, I haven't used the right expression, sorry :P
09:24:31 <pranithk> xavih: I don't understand why this piece of code needs to be in a do - while loop
09:25:14 <xavih> because if the fop->handler() has not initiated any subrequest, ec_wait() will not have anything to wait for
09:25:30 <xavih> so __ec_manager() should go to the next state immediately
09:25:46 <xavih> no one will call __ec_manager() again for this fop because there is no pending work on it
09:26:15 <xavih> pranithk: do you understand the reason ?
09:26:22 <pranithk> xavih: nope :-(
09:26:26 <pranithk> xavih: thinking...
09:26:36 <xavih> for example
09:26:38 <pranithk> xavih: what is a subrequest?
09:26:44 <xavih> the first state is EC_STATE_INIT
09:26:45 <pranithk> xavih: go ahead with the example....
09:27:07 <xavih> many fops do nothing here, only modify or store some data inside fop_data_t
09:27:42 <pranithk> xavih: true
09:27:59 <xavih> then it returns the next state to which the machine should go
09:28:08 <xavih> this is returned by fop->handler()
09:28:18 <pranithk> xavih: yes
09:28:38 <xavih> in this case, ec_wait() will return immediately because EC_STATE_INIT didn't start any fop
09:28:53 <pranithk> xavih: yes
09:28:58 <xavih> so the loop will call fop->handler() again using the next state (the one just returned)
09:29:29 <pranithk> xavih: ah! got the loop. It is running the state machine :-)
09:29:39 <xavih> yes
09:29:54 <pranithk> xavih: but I wonder how it handles winds/unwinds...?
09:29:55 <xavih> the state now can be EC_STATE_DISPATCH. In this state STACK_WIND() will be called to send the request to subvolumes
09:30:30 <xavih> in this case, when ec_wait() is called it will detect that there are pending requests on the fop, and will return -1
09:30:55 <xavih> ec_wait() handles winds and subrequests (other fops)
09:31:27 <dlambrig> lets walk through the first fop - lock
09:31:38 <xavih> in this case, at some point the last called wind will unwind. When this happens, __ec_manager() will be called again to resume the state machine
09:31:51 <pranithk> xavih: understood :-)
09:31:59 <pranithk> xavih: thanks for the detailed explanation
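The loop pranithk just worked out can be condensed into a minimal sketch (all names hypothetical; the real __ec_manager() in ec-common.c is richer): the do-while advances through states until either the machine finishes or a wind leaves pending work, in which case the loop exits and the last wind's callback resumes it.

```c
#include <assert.h>

enum { STATE_INVALID = 0, STATE_INIT = 1, STATE_DISPATCH = 2, STATE_REPORT = 3 };

struct fop {
    int state;
    int jobs;   /* pending winds/subrequests */
    int steps;  /* instrumentation for this sketch only */
};

/* toy handler: INIT -> DISPATCH (issues one "wind") -> REPORT -> done */
static int handler(struct fop *fop)
{
    fop->steps++;
    switch (fop->state) {
    case STATE_INIT:
        return STATE_DISPATCH;
    case STATE_DISPATCH:
        fop->jobs = 1;          /* pretend a STACK_WIND was issued */
        return STATE_REPORT;
    default:
        return STATE_INVALID;
    }
}

/* returns -1 if there is pending work (loop must exit), else 0 */
static int ec_wait(struct fop *fop)
{
    return (fop->jobs > 0) ? -1 : 0;
}

static void ec_manager(struct fop *fop)
{
    int next;
    do {
        next = handler(fop);
        fop->state = next;
        if (next == STATE_INVALID)
            return;             /* state machine finished */
    } while (ec_wait(fop) == 0);
    /* pending winds: the last callback calls ec_manager() again */
}
```

This is why the loop is needed: states that issue no wind (like INIT here) must advance immediately, because nothing else will ever resume the fop.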
09:32:35 <xavih> ok, now we go to ec_manager_writev() in ec-inode-write.c, which is the handler for writev
09:33:00 <xavih> it receives the fop and the current state
09:33:43 <xavih> for EC_STATE_INIT, it basically prepares the write buffers and transforms offsets and sizes
09:33:56 <xavih> do you need any clarification here ?
09:34:18 <pranithk> xavih: none for me
09:34:44 <krishnan_p> xavih, No.
09:35:09 <xavih> all state machine handlers call ec_default_manager() to handle common state transitions
09:35:31 <xavih> we may go there later if needed, ok ?
09:35:48 <xavih> or maybe it would be better to look at it now ?
09:35:59 <dlambrig> what is the next state after INIT
09:36:09 <dlambrig> we will go there if it lives in default_manager
09:36:30 <xavih> it depends on the flags specified for the fop. It's defined in ec_default_manager()... :P
09:36:37 <xavih> ok, let's go there...
09:37:04 <xavih> ec_default_manager() on ec-common.c
09:37:27 <xavih> on INIT, if flag EC_FLAG_LOCK is set, it jumps to EC_STATE_LOCK, otherwise it jumps directly to EC_STATE_DISPATCH
09:37:46 <xavih> in the case of writev, EC_STATE_LOCK is set, so we go to EC_STATE_LOCK
09:38:16 <xavih> ec_manager_writev() does nothing in this state, so it simply executes the code from ec_default_manager()
09:38:34 <xavih> here it calls to ec_lock() (will see that in a moment)
09:38:58 <xavih> then, if the flag EC_STATE_PREOP is set, we jump to EC_STATE_PREOP, otherwise to EC_STATE_DISPATCH
09:39:06 <xavih> I think you see the logic, right ?
09:39:14 <xavih> now, ec_lock()
09:39:27 <krishnan_p> xavih, yes.
09:39:43 <xavih> one minute, please...
09:40:55 <xavih> sorry, I'm here
09:41:00 <xavih> ok
09:41:16 <xavih> in ec_lock() it looks at the flags to see what it should lock
09:42:00 <xavih> this only happens if the current fop is not initiated as a subrequest of another fop (in which case the first fop should have locked whatever necessary)
09:42:26 <xavih> this is tested by checking whether fop->parent is NULL or not
09:42:47 <xavih> EC_FLAG_LOC_xxx are somewhat complex
09:43:36 <krishnan_p> xavih, that is fine. This means its interesting too!
09:44:05 <xavih> they are used for two purposes: they indicate whether the inode or entry must be locked, and also whether the inode or entry should be marked when some subvolumes return mismatching answers
09:44:58 <xavih> when not all subvolumes agree on the same answers, the subvolumes that belong to answer groups with fewer members are marked as bad to avoid using them on future requests
09:45:15 <krishnan_p> xavih, is the behaviour same even for fd based operations?
09:45:21 <xavih> self-heal clears these marks when it's healed
09:45:23 <xavih> krishnan_p: yes
09:46:44 <xavih> depending on these flags, ec_lock_entry() or ec_lock_inode() are called
09:47:38 <dlambrig> when you do the "mark", is it persistent? (i.e. stored in an extended attribute)
09:47:55 <xavih> both check whether that lock is already acquired by this fop, and if not, initiate a subrequest and add an entry into fop->lock_list
09:48:11 <xavih> this list is later used to do unlocks
09:48:21 <xavih> dlambrig: no, it's stored only in memory
09:48:49 <xavih> if the client crashes, the first access to the same entry/inode will detect the discrepancy again and mark it again
09:48:57 <xavih> unless self-heal has already solved it
09:49:33 <xavih> any doubt on locking functions ?
09:50:08 <xavih> ok
09:50:12 <hagarth> xavih: if a different client heals, how does the mark get cleaned up?
09:51:07 <xavih> hagarth: self-heal is currently done on client side, so when self-heal detects that it's all ok, it clears the mark from that client
09:51:40 <hagarth> xavih: what happens if more than one client attempt to self-heal?
09:51:49 <hagarth> and all clients have marked?
09:52:45 <xavih> hagarth: the metadata is healed using locks, so only one client can heal at a time. The second client will see a healed inode and clear the mark
09:53:00 <hagarth> xavih: right, that is along expected lines.
09:53:04 <xavih> however I've just seen a possible problem with data healing... I'll look at it...
09:53:29 <hagarth> xavih: ok
09:53:32 <krishnan_p> xavih, so what happens to the second self-heal when the first self-heal is still in progress
09:53:38 <xavih> only one client will heal data, but another one can assume it's healed before it really is, I think...
09:54:21 <xavih> krishnan_p: it waits. But the locks are very short, only to heal the metadata. Then they are unlocked
09:54:32 <xavih> data self-heal is made locking the file in fragments
09:54:41 <krishnan_p> xavih, is this a synchronous wait?
09:54:42 <hagarth> xavih: ok
09:54:57 <dlambrig> after we finish the write flow, perhaps in another meeting, it would be good to discuss heal.
09:55:08 <dlambrig> we can continue with write for now tho
09:55:08 <xavih> krishnan_p: yes, but only for self-heal. normal fop execution and self-heal are asynchronous
09:55:19 <krishnan_p> xavih, ok
09:55:21 <xavih> ok
09:55:50 <xavih> so, when locks finish, ec_locked() is called
09:56:05 <hagarth> ok..
09:56:36 <xavih> here I only update parent fop (the current write) with valid subvolumes in case the lock failed on some of them
09:56:49 <xavih> this way the write won't be sent to bricks that failed to lock the inode
09:56:59 <xavih> the next state is EC_STATE_PREOP
09:57:03 <dlambrig> hang on
09:57:05 <dlambrig> :)
09:57:26 <dlambrig> lets go into how the second fop is create (the lock state machine), and the relationship of parent to child
09:57:49 <xavih> ah, ok
09:57:56 <dlambrig> this is a cool part
09:58:22 <xavih> the fop_data_t of each fop is attached to the frame created for that request (stored into fop->frame)
09:58:55 <xavih> when a new fop is created using ec_fop_data_allocate(), the first parameter is a frame
09:59:33 <xavih> ec_fop_data_allocate() looks at frame->local to see if this is a subrequest (a top level call will have frame->local == NULL)
09:59:53 <xavih> if it's not NULL, frame->local is assumed to be the parent of the new fop
10:00:45 <xavih> when a fop is a child of another one, it increases refs and jobs count of parent (this is used on ec_wait() later to know that there is pending work)
10:00:50 <dlambrig> so the target and minimum for a child fop are the same as the parent?
10:01:26 <xavih> not necessarily. It depends on the arguments specified in each ec_fop_data_allocate()
10:01:32 <dlambrig> Ok
10:02:00 <xavih> anything else on this parent/child binding ?
10:02:10 <dlambrig> so another state machine is invoked
10:02:16 <dlambrig> the lock state machine
10:02:20 <xavih> yes
10:02:31 <krishnan_p> xavih, how does the child state machine give control back to the parent state machine?
10:02:37 <xavih> it initiates a full new fop
10:04:05 <xavih> krishnan_p: when the child state machine finishes, it will call ec_fop_data_release() for the last time. In this case, ec_parent_resume() is called
10:04:34 <xavih> that basically restarts the state machine of the parent
10:04:41 <krishnan_p> xavih, OK
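The parent/child binding just described can be sketched like this (illustrative C, hypothetical names; the real code works through ec_fop_data_allocate()/ec_fop_data_release() and frame->local): binding a child bumps the parent's jobs count, and releasing the last child resumes the parent's state machine.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of fop parent/child binding. "resumed" stands in for the
 * call to ec_parent_resume() in the real code. */
struct fop {
    struct fop *parent;
    int jobs;       /* outstanding children/winds */
    int resumed;    /* parent state machine restarted? */
};

static void fop_bind_child(struct fop *parent, struct fop *child)
{
    child->parent = parent;
    if (parent != NULL)
        parent->jobs++;     /* ec_wait() sees this as pending work */
}

static void fop_release(struct fop *child)
{
    struct fop *parent = child->parent;
    if (parent != NULL && --parent->jobs == 0)
        parent->resumed = 1;    /* last child done: resume parent */
}
```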
10:04:59 <dlambrig> can we walk through the callback flow of the lock request
10:05:51 <xavih> ok, we can go there. There are some interesting points there
10:06:34 <xavih> at a high level, when a blocking lock (of any type) is requested, it is transformed to a non-blocking request and sent to all subvolumes in parallel
10:07:42 <xavih> if any of the subvolumes returns EAGAIN (meaning that the lock cannot be immediately acquired), all locked volumes are unlocked and the lock request is restarted in blocking mode, but sending the request one by one to the subvolumes
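The high-level strategy just described (non-blocking in parallel first, then blocking one by one on contention) can be sketched like this; the arrays stand in for the real inodelk/entrylk winds and all names are illustrative:

```c
#include <assert.h>
#include <string.h>

/* Sketch: phase 1 tries non-blocking locks on all n subvolumes in
 * parallel; if any would block (EAGAIN in the real code), everything
 * acquired is released and phase 2 takes blocking locks serially.
 * Returns 0 if phase 1 sufficed, 1 if it fell back to blocking mode. */
static int ec_acquire_locks(int n, const int would_block[], int acquired[])
{
    int i, need_blocking = 0;

    /* phase 1: non-blocking, all subvolumes in parallel */
    for (i = 0; i < n; i++) {
        acquired[i] = !would_block[i];
        if (would_block[i])
            need_blocking = 1;
    }
    if (!need_blocking)
        return 0;

    /* contention: unlock everything we got... */
    memset(acquired, 0, (size_t)n * sizeof(int));

    /* ...phase 2: blocking, one subvolume at a time; taken serially,
     * each blocking lock eventually succeeds */
    for (i = 0; i < n; i++)
        acquired[i] = 1;

    return 1;
}
```

The serial fallback avoids the deadlock two clients could create by each holding part of the locks and blocking on the rest.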
10:08:24 <xavih> do you want to see this in detail through the code ?
10:08:29 <dlambrig> are you in ec_lock_check ?
10:09:26 <xavih> yes, here is where the logic is processed
10:09:37 <dlambrig> ok,
10:09:47 <xavih> do you need more detail in some point ?
10:10:46 <krishnan_p> xavih, could you give an overview of the states a blocking lock fop which failed to acquire the lock on all servers when tried non-blocking?
10:10:54 <xavih> locking functions are special in that they use EC_MINIMUM_ALL to enforce that a good answer is only accepted if all alive subvolumes agree
10:11:39 <xavih> later, if only N - R are obtained, it's also accepted, but this is specific handling for locks
10:12:20 <krishnan_p> is it all_alive_subvols && (resp_count >= N-R)
10:12:36 <xavih> krishnan_p: if any or all subvolumes failed with EAGAIN, it will have notlocked != 0
10:13:31 <xavih> krishnan_p: sorry, I don't understand...
10:14:13 <krishnan_p> xavih, you said that N-R replies are enough for a 'good' answer, right after saying all alive subvolumes need to agree on the response for the lock request
10:14:29 <krishnan_p> so, I was wondering if both these conditions were required to be met
10:15:05 <xavih> when a failed non-blocking lock is processed in ec_lock_check(), it returns the mask of subvolumes that must be unlocked (i.e. where the lock succeeded) and returns -1 to indicate that the lock should be restarted in incremental mode
10:16:03 <xavih> krishnan_p: EC_MINIMUM_ALL means that the handling of answers combination will only accept as a good a group of answers formed by all living subvolumes
10:16:38 <krishnan_p> xavih, ok
10:16:38 <xavih> if there is more than one group, none of them will satisfy the condition. In this case, the callback function of ec_inodelk() will return an EIO error
10:17:11 <xavih> however the callback function handles this case in a special way: it looks at all groups, examines why they failed and decides what to do
10:18:04 <xavih> it's here when it can decide that even if there isn't a group that contains all alive bricks, one of the groups can be taken as the valid one
10:18:34 <xavih> this happens when notlocked == 0
10:18:49 <xavih> fop->answer contains the answer that it has been accepted
10:19:33 <xavih> it will be NULL if not all bricks agree, but if some of the answers are enough (i.e. form a group of at least N - R) it will be accepted as the good answer...
10:19:52 <xavih> I think I'm complicating this more than necessary, sorry...
10:20:05 <xavih> I'll have to think about a simpler way to explain it...
10:20:30 <xavih> anything else on locking ?
10:20:40 <dlambrig> lets discuss the callback function lock ues
10:20:53 <xavih> ec_lock_check() ?
10:21:12 <dlambrig> is that the callback?
10:21:22 <krishnan_p> from ec_inodelk_cbk?
10:21:37 <xavih> well, I think I didn't use the right word...
10:22:09 <xavih> the callback from ec_inodelk() will receive the final result of the lock (all this I've explained is internal fop management)
10:22:59 <dlambrig> For example , ec_entrylk_cbk()
10:22:59 <xavih> the callback I was referring to is the ec_lock_check(), that is called from ec_manager_inodelk() at state EC_STATE_REBUILD.
10:23:29 <krishnan_p> xavih, I think i understood what you explained with regards all alive bricks vs N-R to decide a good answer. But let me clarify offline.
10:23:39 <xavih> the REBUILD state is processed just before calling the callback function to regenerate or "correct" the answer
10:23:49 <xavih> krishnan_p: ok
10:24:36 <xavih> lock functions use the REBUILD state to decide if the answer that has been combined should be sent to the callback or not
10:25:02 <xavih> for example. The first execution of a blocking lock will be translated to a non-blocking lock
10:25:20 <xavih> the answer of this request will arrive to REBUILD state
10:25:26 <dlambrig> when is ec_entrylk_cbk() called,
10:26:11 <xavih> ah, ok, this is the callback of the WIND, not the callback of the fop
10:26:31 <xavih> sorry to mess all this up, I didn't understand your question :-/
10:26:31 <dlambrig> yes
10:26:54 <xavih> ok, ec_entrylk_cbk() will be called for every WIND call you made
10:26:56 <dlambrig> each individual subvolume gets a wind, and a callback.. this is a part I meant
10:27:40 <xavih> there, as with any other fop, it will construct a cbk_data_t structure with all arguments (similar to ec_fop_data_t)
10:28:18 <xavih> the only interesting thing here is that ec_lock_handler() will determine if the current answer is valid or not.
10:28:32 <dlambrig> what does ec_complete() do? :)
10:28:59 <dlambrig> your variable fop->winds is the number of subvolumes you have sent to
10:29:13 <dlambrig> if it reaches 0, you can "report" ?
10:29:40 <xavih> basically it checks whether the current operation is being made incrementally or not. If not, a normal ec_combine() is done; otherwise, any failure other than ENOTCONN is handled as a failure that does not allow the lock to complete successfully
10:29:56 <xavih> ec_complete() is basically used to inform that a wind operation finished
10:30:28 <dlambrig> I did not see the difference between ec_report() and ec_resume()
10:30:44 <xavih> when all wind operations have finished it resumes the state machine execution. The result will be reported when the state reaches EC_STATE_REPORT
10:30:58 <xavih> ec_resume is used to continue execution of the state machine of the fop
10:31:11 <xavih> ec_report is used to call the fop callback
10:31:17 <dlambrig> Ok
10:31:53 <xavih> I think I'm a bit lost... where do you want to continue ?
10:32:44 <xavih> is there any doubt on locking functions ?
10:33:05 <krishnan_p> xavih, I have a suggestion
10:33:10 <xavih> all lock functions use the same logic ((f)inodelk, (f)entrylk, lk)
10:34:08 <krishnan_p> Why don't we send you a mail which outlines the lifecycle of a FOP, in terms of individual winds/unwinds to/from the subvols and how the responses are aggregated and sent back to the upper layers?
10:34:39 <krishnan_p> So that you can fill that skeleton/template with functions that cluster/ec uses/employs at those checkpoints in the execution
10:34:46 <krishnan_p> Does that sound OK to you?
10:35:33 <krishnan_p> The skeleton would be translator agnostic. Something like, what function is called once a response from a client subvol reaches cluster/ec for a given FOP etc.
10:35:35 <xavih> I can do that, however this management is somewhat different for locking functions because they handle the normal answer in a special way to be able to restart the same request using blocking and incremental modes
10:35:47 <xavih> but I think it could be easier to follow
10:36:01 <xavih> first I can explain the "normal" flow and then the special case of locks
10:36:03 <krishnan_p> xavih, afr_nonblocking_inodelk and entrylk employs a similar strategy
10:36:24 <xavih> yes, I know, I've used the same idea
10:36:28 <krishnan_p> xavih, so the locking algorithm seems fine.
10:36:59 <krishnan_p> The part that is different (and new to me) is the way state machines are transferring control across winds to different subvols
10:37:22 <xavih> ok, I'll try to explain it better
10:37:33 <xavih> an email to gluster-devel would be ok ?
10:37:44 <krishnan_p> To understand this better, it would help if we started with something more familiar and isn't different in cluster/ec as well. The lifecycle of a FOP within a xlator.
10:38:11 <krishnan_p> xavih, I will send out that mail after this meeting which you could answer to. Yes, I will CC gluster-devel
10:38:16 <xavih> yes, in writev it's simpler but we have been jumping from one place to another :P
10:38:58 <krishnan_p> xavih, its hard to ignore a few questions from cropping up when something new is being explained I guess :)
10:39:24 <dlambrig> also, lock is part of write. I see them as the same transaction, personally.
10:39:37 <xavih> krishnan_p: yes, yes, I know it's difficult to understand so many things
10:39:38 <dlambrig> we can stop here
10:39:59 <dlambrig> this has been extremely helpful - I have learned a lot
10:40:03 <xavih> dlambrig: yes, yes, it's not your fault, but there are a lot of details and it's difficult for me...
10:40:15 <krishnan_p> xavih, But your explanations have made our understanding better than before. With a few more meetings we should be on the same page I hope.
10:40:26 <xavih> :)
10:40:47 <dlambrig> your explanations have been very good
10:41:00 <dlambrig> we will take a few days to digest this latest meal you have served us :)
10:41:24 <pranithk> dlambrig: they generally are. Even his explanations on gluster-devel share the same clarity :-)
10:41:29 <xavih> I have difficulties sorting things in the best way to be understood...
10:41:52 <dlambrig> can you meet Tuesday? That will be my last day in Bangalore where we are all together
10:41:55 <xavih> pranithk: thanks :)
10:42:08 <pranithk> xavih: :-)
10:42:43 <xavih> I can try to do some high level explanations of general working and start from there to see details of some specific fops
10:43:47 <dlambrig> I think the fop piece is better understood
10:44:04 <dlambrig> could we discuss healing on Tuesday?
10:44:45 <xavih> I think a generic overview of state machine states and generic meaning and purpose would be interesting to follow the details of specific and more complex fops
10:44:59 <krishnan_p> xavih, Yes. That will be of great help
10:44:59 <xavih> dlambrig: as you prefer
10:45:19 <krishnan_p> xavih, we could cover the details of state machine and fops in a mail over gluster-devel
10:45:19 <hagarth> ok, let us catch up on Tuesday around the same time
10:45:39 <xavih> krishnan_p: ok
10:45:45 <dlambrig> cool
10:45:49 <xavih> hagarth: ok, perfect
10:45:54 <krishnan_p> xavih, thanks a lot for patiently explaining
10:46:04 <xavih> krishnan_p: yw
10:46:18 <xavih> krishnan_p: I hope I haven't bored you too much :P
10:46:26 <hagarth> xavih: thanks for this!
10:46:31 <hagarth> #endmeeting