08:30:15 <hagarth> #startmeeting
08:30:15 <zodbot> Meeting started Fri Jun 6 08:30:15 2014 UTC. The chair is hagarth. Information about MeetBot at http://wiki.debian.org/MeetBot.
08:30:15 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
08:30:30 <hagarth> xavih: let us wait for Dan to join in
08:30:38 <xavih> ok, no problem
08:30:40 <hagarth> who else do we have here today?
08:31:15 <krishnan_p> xavih, Hi. This is kp. I work primarily on glusterd
08:31:45 <xavih> krishnan_p: nice to meet you :)
08:32:49 <pranithk> xavih: Pranith here, work on afr :-)
08:33:01 <hagarth> xavih: Dan is having a bit of problems with his laptop. Should be in here soon.
08:33:04 <xavih> pranithk: Oh, really :D hehe
08:33:30 <pranithk> xavih: :-)
08:33:44 <raghu> xavih: Raghavendra here. Currently working on snapshots
08:34:31 <hagarth> there comes Dan
08:35:05 * ndevos is here, but currently working on a *cough* Xen *cough* kernel bug
08:35:19 <hagarth> ndevos: good luck :)
08:35:29 <ndevos> hagarth: hehe, thanks
08:35:35 <hagarth> shall we get started?
08:35:52 <pranithk> hagarth: +1
08:35:54 <xavih> yes
08:36:26 <dlambrig> Xavi, we were wondering if you could walk us through a write operation, and we could ask questions as we go
08:37:03 <xavih> ok, I can try...
08:37:24 <xavih> do we use the latest code review ?
(pushed yesterday)
08:37:39 <dlambrig> sure
08:37:52 <dlambrig> we have it :)
08:38:02 <xavih> ok then
08:38:32 <xavih> the entry point is easy: ec_gf_writev() in ec.c
08:39:01 <xavih> here I only call the real write function with some additional parameters
08:39:19 <xavih> I'll only comment on the interesting ones, the others should be obvious
08:39:40 <pranithk> xavih: sure
08:39:49 <xavih> the third parameter is a bitmask of subvolumes to which the request should be sent
08:39:59 <xavih> in this case -1 means all
08:40:25 <dlambrig> Ok
08:40:25 <xavih> each bit refers to a subvolume in the order defined in the volfile
08:40:37 <hagarth> ok
08:41:23 <xavih> fourth argument says how many answers are needed at minimum to consider the result valid
08:41:54 <xavih> answers are grouped looking at the ret code, errno, xdata and other things
08:42:23 <hagarth> xavih: what is the usual value for the fourth argument?
08:42:54 <xavih> a group of combined answers will only be considered a valid answer for this request if it's formed by, at least, the minimum number of individual answers specified in this argument
08:43:06 <xavih> hagarth: it depends on the request
08:43:44 <xavih> for example
08:44:20 <xavih> normal requests like readv, writev, truncate, unlink, ... all use EC_MINIMUM_MIN
08:44:50 <krishnan_p> xavih, can we think of a grouping of responses as a tuple defined by (op_ret, op_errno, xdata)
08:44:52 <xavih> this means that at least N (bricks) - R (redundancy) subvolumes must agree on the answer
08:45:04 <xavih> this can be seen as a quorum enforcement
08:45:13 <hagarth> xavih: right
08:45:51 <xavih> krishnan_p: yes, but it also checks other things like iatt or other cbk arguments, depending on the request
08:46:16 <pranithk> xavih: what will happen when, at the time of winding, a quorum number of bricks are up but it succeeded on fewer bricks?
08:46:21 <hagarth> xavih: why is EC_MINIMUM_MIN -2 ? rather, what is the significance of -2?
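The target bitmask and the N - R minimum described above can be sketched roughly as follows. This is an illustrative Python model and not cluster/ec code: the function names `subvolumes_from_mask` and `quorum_minimum` are hypothetical, and the sketch only assumes what the discussion states, namely that bit i selects the i-th subvolume in volfile order, that -1 selects all of them, and that normal fops need N - R agreeing answers.

```python
def subvolumes_from_mask(mask, count):
    """Return the indexes of the subvolumes selected by a target bitmask.

    Bit i refers to the i-th subvolume in volfile order; -1 means "all".
    """
    if mask == -1:
        return list(range(count))
    return [i for i in range(count) if mask & (1 << i)]


def quorum_minimum(bricks, redundancy):
    """Minimum number of agreeing answers for a normal fop: N - R."""
    if not 0 <= redundancy < bricks:
        raise ValueError("redundancy must be in [0, bricks)")
    return bricks - redundancy


if __name__ == "__main__":
    # Hypothetical volume with 6 bricks and redundancy 2.
    print(subvolumes_from_mask(-1, 6))
    print(subvolumes_from_mask(0b000101, 6))
    print(quorum_minimum(6, 2))
```

With 6 bricks and redundancy 2, a write would need at least 4 subvolumes to agree before its answer is considered valid.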
08:46:46 <xavih> in some cases the minimum is a must, for example on read requests, because if fewer than N - R are available, it's impossible to generate an answer
08:47:37 <xavih> pranithk: the request will be sent, and when it's detected that there aren't enough combinable answers, an EIO will be reported to the caller
08:48:12 <pranithk> xavih: but the data is written on some of the bricks... self-heal handles it, is it?
08:48:27 <xavih> hagarth: it's only because it's determined later when the request is initiated. This could have been taken from ec->fragments
08:49:11 <xavih> hagarth: EC_MINIMUM_ALL can only be determined when the operation begins (it depends on alive bricks, successful locks and successful preop)
08:49:27 <xavih> hagarth: I used constants for the other cases only to be consistent
08:49:44 <hagarth> xavih: ok
08:49:45 <xavih> hagarth: and avoid having to access ec in ec_gf_xxx() functions
08:50:12 <xavih> pranithk: if the data is written to enough bricks (i.e. N - R at least), self heal will recover it
08:50:40 <pranithk> xavih: in the case where it is not, what will happen to the partial write?
08:50:50 <krishnan_p> xavih, what is the type of 'ec'? Is it ec_fop_data_t?
08:50:51 <xavih> pranithk: however if, for example, there are N - R bricks alive and one of them fails the write, currently the data is irrecoverable
08:51:05 <xavih> krishnan_p: ec_t
08:51:13 <pranithk> xavih: hmm...
08:51:25 <xavih> krishnan_p: it's the private data from this->private
08:51:36 <xavih> pranithk: I don't know how to solve this situation...
08:51:56 <pranithk> xavih: ok we shall see about it later... please continue writev from where we left off...
08:52:05 <xavih> ok
08:52:09 <krishnan_p> xavih, OK
08:52:37 <dlambrig> let's continue the flow
08:53:06 <xavih> the minimum argument takes importance on self heal, where some requests are valid even with one valid answer
08:53:30 <xavih> fifth argument is the callback function to be called when the fop is finished.
It can be NULL. For normal fops it's the default default_<fop>_cbk() function
08:53:42 <hagarth> ok
08:54:28 <xavih> sixth argument is any data to be attached to the fop (used on self-heal)
08:54:51 <xavih> the remaining args are the normal writev arguments
08:55:02 <pranithk> dlambrig: You should talk to the guy from ceph about how they handle partial failures...
08:55:52 <xavih> ec_writev() in ec-inode-write.c prepares the request
08:56:26 <xavih> it first calls ec_fop_data_allocate(), which creates the fop_data_t structure that will be used through all the fop processing
08:57:10 <xavih> do you want me to detail the arguments of this function ?
08:57:16 <dlambrig> pranithk: The Ceph engineer is Loic Dachary and he is a very good resource for us, he is not yet a RH employee but will be soon.
08:57:27 <dlambrig> xavi- yes, please do
08:57:31 <xavih> ok
08:57:31 <hagarth> pranithk: Loic is actually in #gluster-dev atm
08:57:31 <dlambrig> that is a key function
08:58:06 <xavih> third argument is the fop type. Used basically for logging
08:58:11 <xavih> the next one is the flags
08:58:48 <xavih> they say if the fop needs locking (inode or entry), preop handling
08:59:04 <xavih> it also says to how many subvolumes the request must be sent
08:59:16 <dlambrig> is that the 2?
08:59:52 <dlambrig> what is the 2? :)
09:00:08 <xavih> no, flags also say what "things" must be merged in combined answers. It can be a dict, a loc, etc
09:00:36 <xavih> since there can be multiple iatt answers, that 2 says how many iatt must be combined from answers
09:00:52 <xavih> in this case, the write callback receives 2 iatt structures that must be merged
09:01:06 <krishnan_p> xavih, how do we determine we need to combine only 2 of them?
09:01:21 <xavih> krishnan_p: looking at the callback argument list :)
09:02:02 <xavih> krishnan_p: all iatt on an answer must agree to be combined
09:02:25 <xavih> otherwise it means that the brick has had some problem and it's not in stnc with others
09:02:32 <xavih> s/stnc/sync/
09:03:12 <xavih> is this clear ?
09:03:17 <raghu> xavih: combining 2 iatt structures in writev_cbk, do you mean prebuf and postbuf?
09:03:37 <xavih> raghu: yes, in this case it corresponds to prebuf and postbuf
09:03:43 <krishnan_p> xavih, what would be the behaviour if the iatt's didn't agree? ie. not in sync
09:04:12 <xavih> krishnan_p: then the answers won't be combined. They will belong to two different groups
09:05:18 <krishnan_p> xavih, OK. So, does the combining operation take care of whether the responses (answers) are in sync?
09:06:02 <xavih> krishnan_p: yes. This is done to detect inconsistent bricks and initiate self-heal on them when necessary
09:06:49 <xavih> it's ok to continue with next arg ?
09:07:09 <dlambrig> yes
09:07:16 <krishnan_p> xavih, if after the combine we don't receive N-R answers in any of the groups, then we fail the writev?
09:07:59 <xavih> krishnan_p: yes. This is what pranithk said. I don't know how to solve this situation
09:08:06 <xavih> in the current implementation I return EIO
09:08:16 <krishnan_p> xavih, oops. OK
09:08:34 <xavih> target and minimum are already explained
09:09:36 <xavih> next one says how many answers are expected to be received. Now that I've seen it I see that it's something old and probably I could remove this one...
09:09:57 <xavih> I think I always use the same value... I'll review later...
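The grouping of answers discussed above can be sketched as follows. This is a simplified Python model, not the cluster/ec implementation: `combine_answers` is a hypothetical name, each answer is reduced to a hashable tuple such as (op_ret, op_errno, prebuf, postbuf), and two answers are assumed to be combinable only when every compared field matches. When no group reaches the minimum, the sketch returns EIO, as described for the current implementation.

```python
import errno
from collections import defaultdict


def combine_answers(answers, minimum):
    """Group per-brick answers and accept a group with >= `minimum` members.

    `answers` maps a brick index to a hashable tuple of callback values.
    Returns (answer, bricks) for the accepted group, or
    ((-1, errno.EIO), []) when no group is large enough.
    """
    groups = defaultdict(list)
    for brick, answer in sorted(answers.items()):
        groups[answer].append(brick)
    if not groups:
        return (-1, errno.EIO), []

    # Pick the largest group; bricks left in smaller groups are the ones
    # that look out of sync with the rest.
    best_answer, best_bricks = max(groups.items(), key=lambda g: len(g[1]))
    if len(best_bricks) < minimum:
        return (-1, errno.EIO), []
    return best_answer, best_bricks


if __name__ == "__main__":
    good = (4096, 0, "prebuf-v1", "postbuf-v2")
    stale = (4096, 0, "prebuf-v0", "postbuf-v1")
    replies = {0: good, 1: good, 2: good, 3: stale}
    print(combine_answers(replies, minimum=3))
    print(combine_answers(replies, minimum=4))
```

In the example, brick 3 reports different iatt values and ends up alone in its own group, which is the situation that would flag it as a candidate for self-heal.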
09:10:12 <xavih> next one is the function to be called to wind the request to each subvolume
09:10:38 <xavih> except for write, it's a straightforward STACK_WIND
09:10:56 <xavih> next one is the function that will control the life cycle of the fop
09:11:03 <xavih> it's basically a state machine
09:11:24 <xavih> callback and data come from ec_gf_writev()
09:11:31 <xavih> any question on these arguments ?
09:11:50 <dlambrig> not for me, we will get to the state machine internals shortly
09:11:56 <xavih> yes
09:11:57 <xavih> ok
09:12:32 <xavih> if ec_fop_data_allocate() fails, the callback function is called with an EIO
09:12:50 <xavih> otherwise, fop structure is populated with writev arguments
09:13:06 <xavih> this is what ec_fop_data_set_xxx() does
09:13:20 <krishnan_p> OK
09:13:30 <xavih> finally, ec_manager() is called to begin the processing of the request
09:13:37 <dlambrig> now the fun begins..
09:13:53 <xavih> it's important to note that the second argument of ec_manager() is an error code
09:14:16 <xavih> if some of the ec_fop_data_set_xxx() failed, the operation will be initiated with an EIO error
09:14:45 <xavih> let's go to ec_manager() in ec-common.c
09:14:52 <hagarth> xavih: ok
09:15:17 <xavih> EC_STATE_INVALID is 0 (the value that an uninitialized fop will have)
09:16:03 <xavih> every state in the state machine can have two "flavors": when there is an error and when there is not
09:16:27 <xavih> any positive state means everything is ok. a negative state means some error happened
09:17:11 <xavih> then, if the fop needs to lock an inode or entry, the owner of the stack frame is set to a different value for each request
09:17:33 <xavih> __ec_manager() is the core of the state machine
09:18:05 <xavih> first it handles the error code.
If there is an error code, the state is negated (to indicate an error)
09:18:45 <xavih> then the handler specified in the call of ec_fop_data_allocate() is called to manage the states of the fop
09:19:24 <xavih> when it returns (will see that function later), if the state is EC_STATE_INVALID, it means that the state machine has finished and it is released
09:19:56 <xavih> if not, ec_wait() waits until any possible subrequests initiated by the fop->handler() are completed
09:20:10 <xavih> it returns the error code from those subrequests
09:20:19 <xavih> and the next state is executed
09:20:27 <xavih> any question here ?
09:20:39 <krishnan_p> xavih, does ec_wait block until the subrequest issued by the handler returns?
09:21:17 <krishnan_p> xavih, or did you mean that the fop 'waits' in the same state?
09:21:31 <xavih> krishnan_p: no, it does not block
09:22:13 <xavih> if there are pending requests, -1 is returned and it exits the loop (it will call __ec_manager again when the request finishes)
09:22:23 <xavih> it's not a synchronous wait
09:22:36 <xavih> ec_wait() always returns immediately
09:22:47 <xavih> right ?
09:22:58 <xavih> do you want to look at ec_wait() now ?
09:23:01 <pranithk> xavih: Will it lead to a busy loop?
09:23:32 <xavih> pranithk: no, if there is pending work, it will return -1, and __ec_manager() will quit the loop
09:23:47 <krishnan_p> xavih, No. I got it. I wanted to understand what you meant when you said it will 'wait'.
So I assume you meant that the fop waits in the same state until the subrequests return
09:24:02 <xavih> if there isn't pending work, it will take the error code from subrequests executed and return it
09:24:28 <xavih> krishnan_p: yes, I haven't used the right expression, sorry :P
09:24:31 <pranithk> xavih: I don't understand why this piece of code needs to be in a do - while loop
09:25:14 <xavih> because if the fop->handler() has not initiated any subrequest, ec_wait() will not have anything to wait for
09:25:30 <xavih> so __ec_manager() should go to the next state immediately
09:25:46 <xavih> no one will call __ec_manager() again for this fop because there is no pending work on it
09:26:15 <xavih> pranithk: do you understand the reason ?
09:26:22 <pranithk> xavih: nope :-(
09:26:26 <pranithk> xavih: thinking...
09:26:36 <xavih> for example
09:26:38 <pranithk> xavih: what is a subrequest?
09:26:44 <xavih> the first state is EC_STATE_INIT
09:26:45 <pranithk> xavih: go ahead with the example....
09:27:07 <xavih> many fops do nothing here, only modify or store some data inside fop_data_t
09:27:42 <pranithk> xavih: true
09:27:59 <xavih> then it returns the next state to which the machine should go
09:28:08 <xavih> this is returned by fop->handler()
09:28:18 <pranithk> xavih: yes
09:28:38 <xavih> in this case, ec_wait() will return immediately because the EC_STATE_INIT didn't start any fop
09:28:53 <pranithk> xavih: yes
09:28:58 <xavih> so the loop will call fop->handler() again using the next state (the one just returned)
09:29:29 <pranithk> xavih: ah! got the loop. It is running the state machine :-)
09:29:39 <xavih> yes
09:29:54 <pranithk> xavih: but I wonder how it handles winds/unwinds...?
09:29:55 <xavih> the state now can be EC_STATE_DISPATCH.
In this state STACK_WIND() will be called to send the request to subvolumes
09:30:30 <xavih> in this case, when ec_wait() is called it will detect that there are pending requests on the fop, and will return -1
09:30:55 <xavih> ec_wait() handles winds and subrequests (other fops)
09:31:27 <dlambrig> let's walk through the first fop - lock
09:31:38 <xavih> in this case, at some point the last called wind will unwind. When this happens, __ec_manager() will be called again to resume the state machine
09:31:51 <pranithk> xavih: understood :-)
09:31:59 <pranithk> xavih: thanks for the detailed explanation
09:32:35 <xavih> ok, now we go to ec_manager_writev() in ec-inode-writev(), that is the handler for writev
09:32:46 <xavih> sorry, ec-inode-write.c
09:33:00 <xavih> it receives the fop and the current state
09:33:43 <xavih> for EC_STATE_INIT, it basically prepares the write buffers and transforms offsets and sizes
09:33:56 <xavih> do you need any clarification here ?
09:34:18 <pranithk> xavih: none for me
09:34:44 <krishnan_p> xavih, No.
09:35:09 <xavih> all state machine handlers call ec_default_manager() to handle common state transitions
09:35:31 <xavih> we may go there later if needed, ok ?
09:35:48 <xavih> or maybe it would be better to look at it now ?
09:35:59 <dlambrig> what is the next state after INIT
09:36:09 <dlambrig> we will go there if it lives in default_manager
09:36:30 <xavih> it depends on the flags specified for the fop. It's defined in ec_default_manager()... :P
09:36:37 <xavih> ok, let's go there...
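The loop explained above (run the handler, ask a non-blocking wait whether anything is pending, and either advance immediately or leave the loop until the last wind unwinds) can be modeled roughly like this. It is a toy Python sketch, not the real __ec_manager(): the class and function names are hypothetical, and the numeric state values other than INVALID = 0 are made up for illustration.

```python
STATE_INVALID = 0                                # finished (0, as in the discussion)
STATE_INIT, STATE_DISPATCH, STATE_REPORT = 1, 2, 3   # hypothetical values


class Fop:
    def __init__(self, handler):
        self.handler = handler   # per-fop state handler
        self.state = STATE_INIT
        self.pending = 0         # outstanding winds/subrequests
        self.error = 0           # error collected from subrequests
        self.trace = []          # states visited, for illustration only


def wait(fop):
    """Never blocks: -1 if work is pending, else the collected error code."""
    if fop.pending > 0:
        return -1
    error, fop.error = fop.error, 0
    return error


def manager(fop, error):
    """Run states until the machine finishes or must wait for callbacks."""
    while True:
        if error != 0 and fop.state > 0:
            fop.state = -fop.state       # negative state: the error flavor
        fop.trace.append(fop.state)
        fop.state = fop.handler(fop, fop.state)
        if fop.state == STATE_INVALID:
            return                       # state machine finished
        error = wait(fop)
        if error == -1:
            return                       # complete() will resume us later


def complete(fop, error=0):
    """Called when one wind unwinds; the last one resumes the machine."""
    if error:
        fop.error = error
    fop.pending -= 1
    if fop.pending == 0:
        manager(fop, wait(fop))


def write_handler(fop, state):
    if state == STATE_INIT:
        return STATE_DISPATCH            # nothing wound: loop continues at once
    if state == STATE_DISPATCH:
        fop.pending = 2                  # pretend two winds were issued
        return STATE_REPORT
    return STATE_INVALID                 # REPORT or any error state: finish


if __name__ == "__main__":
    fop = Fop(write_handler)
    manager(fop, 0)
    complete(fop)
    complete(fop)
    print(fop.trace)
```

The do-while shape matters for states like INIT: since nothing was wound there, nobody would ever call the manager again, so the loop itself has to move on to the next state.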
09:37:04 <xavih> ec_default_manager() in ec-common.c
09:37:27 <xavih> on INIT, if flag EC_FLAG_LOCK is set, it jumps to EC_STATE_LOCK, otherwise it jumps directly to EC_STATE_DISPATCH
09:37:46 <xavih> in the case of writev, EC_FLAG_LOCK is set, so we go to EC_STATE_LOCK
09:38:16 <xavih> ec_manager_writev() does nothing in this state, so it simply executes the code from ec_default_manager()
09:38:34 <xavih> here it calls ec_lock() (will see that in a moment)
09:38:58 <xavih> then, if the flag EC_STATE_PREOP is set, we jump to EC_STATE_PREOP, otherwise to EC_STATE_DISPATCH
09:39:06 <xavih> I think you see the logic, right ?
09:39:14 <xavih> now, ec_lock()
09:39:27 <krishnan_p> xavih, yes.
09:39:43 <xavih> one minute, please...
09:40:55 <xavih> sorry, I'm here
09:41:00 <xavih> ok
09:41:16 <xavih> in ec_lock() it looks at the flags to see what it should lock
09:42:00 <xavih> this only happens if the current fop is not initiated as a subrequest of another fop (in which case the first fop should have locked whatever necessary)
09:42:26 <xavih> this is tested looking if fop->parent is NULL or not
09:42:47 <xavih> EC_FLAG_LOC_xxx are somewhat complex
09:43:36 <krishnan_p> xavih, that is fine. This means it's interesting too!
09:44:05 <xavih> they are used for two purposes. They indicate if the inode or entry must be locked and also if the inode or entry should be marked when some subvolumes return mismatching answers
09:44:58 <xavih> when not all subvolumes agree on the same answers, the subvolumes that belong to answer groups with fewer members are marked as bad to avoid using them on future requests
09:45:15 <krishnan_p> xavih, is the behaviour same even for fd based operations?
09:45:21 <xavih> self-heal clears these marks when it's healed
09:45:23 <xavih> krishnan_p: yes
09:46:44 <xavih> depending on these flags, ec_lock_entry() or ec_lock_inode() are called
09:47:38 <dlambrig> when you do the "mark", is it persistent? (i.e.
stored in an extended attribute)
09:47:55 <xavih> both look to see if that lock is already acquired by this fop, and if not, it initiates a subrequest and adds an entry into fop->lock_list
09:48:11 <xavih> this list is later used to do unlocks
09:48:21 <xavih> dlambrig: no, it's stored only in memory
09:48:49 <xavih> if client crashes, the first access to the same entry/inode will detect again the discrepancy and mark it again
09:48:57 <xavih> unless self-heal has already solved it
09:49:33 <xavih> any doubt on locking functions ?
09:50:08 <xavih> ok
09:50:12 <hagarth> xavih: if a different client heals, how does the mark get cleaned up?
09:51:07 <xavih> hagarth: self-heal is currently done on client side, so when self-heal detects that it's all ok, it clears the mark from that client
09:51:40 <hagarth> xavih: what happens if more than one client attempts to self-heal?
09:51:49 <hagarth> and all clients have marked?
09:52:45 <xavih> hagarth: the metadata is healed using locks, so only one client can heal at a time. The second client will see a healed inode and clear the mark
09:53:00 <hagarth> xavih: right, that is along expected lines.
09:53:04 <xavih> however I've just seen a possible problem with data healing... I'll look at it...
09:53:29 <hagarth> xavih: ok
09:53:32 <krishnan_p> xavih, so what happens to the second self-heal when the first self-heal is still in progress
09:53:38 <xavih> only one client will heal data, but another one can assume it's healed before it really is, I think...
09:54:21 <xavih> krishnan_p: it waits. But the locks are very short, only to heal the metadata. Then they are unlocked
09:54:32 <xavih> data self-heal is made locking the file in fragments
09:54:41 <krishnan_p> xavih, is this a synchronous wait?
09:54:42 <hagarth> xavih: ok
09:54:57 <dlambrig> after we finish the write flow, perhaps in another meeting, it would be good to discuss heal.
09:55:08 <dlambrig> we can continue with write for now tho
09:55:08 <xavih> krishnan_p: yes, but only for self-heal. normal fop execution and self-heal are asynchronous
09:55:19 <krishnan_p> xavih, ok
09:55:21 <xavih> ok
09:55:50 <xavih> so, when locks finish, ec_locked() is called
09:56:05 <hagarth> ok..
09:56:36 <xavih> here I only update parent fop (the current write) with valid subvolumes in case the lock failed on some of them
09:56:49 <xavih> this way the write won't be sent to bricks that failed to lock the inode
09:56:59 <xavih> the next state is EC_STATE_PREOP
09:57:03 <dlambrig> hang on
09:57:05 <dlambrig> :)
09:57:26 <dlambrig> let's go into how the second fop is created (the lock state machine), and the relationship of parent to child
09:57:49 <xavih> ah, ok
09:57:56 <dlambrig> this is a cool part
09:58:22 <xavih> the fop_data_t of each fop is attached to the frame created for that request (stored into fop->frame)
09:58:55 <xavih> when a new fop is created using ec_fop_data_allocate(), the first parameter is a frame
09:59:33 <xavih> ec_fop_data_allocate() looks at frame->local to see if this is a subrequest (a top level call will have frame->local == NULL)
09:59:53 <xavih> if it's not free, frame->local is assumed to be the parent of the new fop
10:00:45 <xavih> when a fop is a child of another one, it increases refs and jobs count of parent (this is used on ec_wait() later to know that there is pending work)
10:00:50 <dlambrig> so the target and minimum for a child fop are the same as the parent?
10:01:26 <xavih> not necessarily. It depends on the arguments specified in each ec_fop_data_allocate()
10:01:32 <dlambrig> Ok
10:02:00 <xavih> anything else on this parent/child binding ?
10:02:10 <dlambrig> so another state machine is invoked
10:02:16 <dlambrig> the lock state machine
10:02:20 <xavih> yes
10:02:31 <krishnan_p> xavih, how does the child state machine give control back to the parent state machine?
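The parent/child binding described above can be sketched like this. It is a toy Python model with hypothetical names (`Frame`, `Fop`, `release`, `resumed`), not the ec_fop_data_allocate() code. It assumes only what the discussion states: a fop created from a frame whose local is already set becomes a child of that fop and bumps the parent's refs and jobs counters. As a further assumption of the sketch, the last release of a child is what lets the parent continue.

```python
class Frame:
    """Stand-in for a call frame; `local` is None for a top level call."""

    def __init__(self, local=None):
        self.local = local


class Fop:
    def __init__(self, frame, name):
        self.name = name
        self.parent = frame.local        # not None: this fop is a subrequest
        self.frame = Frame(local=self)   # the fop is attached to its own frame
        self.refs = 1
        self.jobs = 0                    # children still running
        self.resumed = False             # stands in for restarting the machine
        if self.parent is not None:
            self.parent.refs += 1
            self.parent.jobs += 1

    def release(self):
        """Drop one reference; the last release of a child resumes its parent."""
        self.refs -= 1
        if self.refs == 0 and self.parent is not None:
            parent = self.parent
            parent.jobs -= 1
            if parent.jobs == 0:
                parent.resumed = True
            parent.release()


if __name__ == "__main__":
    write = Fop(Frame(), "writev")        # top level: frame.local is None
    lock = Fop(write.frame, "inodelk")    # child: created from the write's frame
    print(write.refs, write.jobs)
    lock.release()
    print(write.resumed, write.refs, write.jobs)
```

Passing the parent's own frame when creating the lock request is what makes the relationship implicit: no extra "parent" argument is needed.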
10:02:37 <xavih> it initiates a full new fop
10:04:05 <xavih> krishnan_p: when the child state machine finishes, it will call ec_fop_data_release() for the last time. In this case, ec_parent_resume() is called
10:04:34 <xavih> that basically restarts the state machine of the parent
10:04:41 <krishnan_p> xavih, OK
10:04:59 <dlambrig> can we walk through the callback flow of the lock request
10:05:51 <xavih> ok, we can go there. There are some interesting points there
10:06:34 <xavih> at a high level, when a blocking lock (of any type) is requested, it is transformed to a non-blocking request and sent to all subvolumes in parallel
10:07:42 <xavih> if any of the subvolumes return EAGAIN (meaning that the lock cannot be immediately acquired), all locked volumes are unlocked and the lock request is restarted in blocking mode but sending the request one by one to the subvolumes
10:08:24 <xavih> do you want to see this in detail through the code ?
10:08:29 <dlambrig> are you in ec_lock_check ?
10:09:26 <xavih> yes, here is where the logic is processed
10:09:37 <dlambrig> ok,
10:09:47 <xavih> do you need more detail in some point ?
10:10:46 <krishnan_p> xavih, could you give an overview of the states that a blocking lock fop goes through when it failed to acquire the lock on all servers when tried non-blocking?
10:10:54 <xavih> locking functions have the special thing that they use EC_MINIMUM_ALL to enforce that a good answer is only accepted if all alive subvolumes agree
10:11:39 <xavih> later, if only N - R are obtained, it's also accepted, but this is a specific managing of locks
10:12:20 <krishnan_p> is it all_alive_subvols && (resp_count >= N_R)
10:12:31 <krishnan_p> s/N_R/N-R
10:12:36 <xavih> krishnan_p: if any or all subvolumes failed with EAGAIN, it will have notlocked != 0
10:13:31 <xavih> krishnan_p: sorry, I don't understand...
10:14:13 <krishnan_p> xavih, you said that N-R replies are enough for a 'good' answer, right after saying all alive volumes need to agree on the response for the lock request
10:14:29 <krishnan_p> so, I was wondering if both these conditions were required to be met
10:14:47 <krishnan_p> s/volumes/subvolumes
10:15:05 <xavih> when a failed non-blocking lock is processed in ec_lock_check(), it returns the mask of subvolumes that must be unlocked (i.e. where the lock succeeded) and returns -1 to indicate that the lock should be restarted in incremental mode
10:16:03 <xavih> krishnan_p: EC_MINIMUM_ALL means that the handling of answers combination will only accept as good a group of answers formed by all living subvolumes
10:16:38 <krishnan_p> xavih, ok
10:16:38 <xavih> if there is more than one group, none of them will satisfy the condition. In this case, the callback function of ec_inodelk() will return an EIO error
10:17:11 <xavih> however the callback function handles this case in a special way. It looks at all groups and looks why they failed and decides what to do
10:18:04 <xavih> it's here when it can decide that even if there isn't a group that contains all alive bricks, one of the groups can be taken as the valid one
10:18:34 <xavih> this happens when notlocked == 0
10:18:49 <xavih> fop->answer contains the answer that has been accepted
10:19:33 <xavih> it will be NULL if not all bricks agree, but if some of the answers are enough (i.e. form a group of more than N - R) it will be accepted as the good answer...
10:19:52 <xavih> I think I'm complicating this more than necessary, sorry...
10:20:05 <xavih> I'll have to think about a simpler way to explain it...
10:20:30 <xavih> anything else on locking ?
10:20:40 <dlambrig> let's discuss the callback function lock uses
10:20:53 <xavih> ec_lock_check() ?
10:21:12 <dlambrig> is that the callback?
10:21:22 <krishnan_p> from ec_inodelk_cbk?
10:21:37 <xavih> well, I think I didn't use the right word...
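The locking strategy discussed above (a non-blocking attempt sent to all subvolumes, then an unlock and a blocking retry one subvolume at a time if any of them returns EAGAIN) can be sketched as follows. This is a toy Python model, not ec_lock_check(): the `Brick` class and `lock_all` function are hypothetical, the parallel first pass is simulated sequentially, and a blocking request is assumed to succeed eventually.

```python
import errno


class Brick:
    """Toy subvolume; `busy` makes a non-blocking attempt fail with EAGAIN."""

    def __init__(self, busy=False):
        self.busy = busy
        self.locked = False

    def lock(self, blocking):
        if self.busy and not blocking:
            return errno.EAGAIN
        self.busy = False     # a blocking request waits until the lock is free
        self.locked = True
        return 0

    def unlock(self):
        self.locked = False


def lock_all(bricks):
    """Return the indexes of the bricks that ended up locked."""
    # First pass: non-blocking, conceptually sent to all bricks in parallel.
    results = [brick.lock(blocking=False) for brick in bricks]
    if errno.EAGAIN not in results:
        return [i for i, result in enumerate(results) if result == 0]

    # Some brick would have blocked: undo the locks that did succeed ...
    for brick, result in zip(bricks, results):
        if result == 0:
            brick.unlock()

    # ... and retry in blocking mode, one brick at a time, in order.
    locked = []
    for i, brick in enumerate(bricks):
        if brick.lock(blocking=True) == 0:
            locked.append(i)
    return locked


if __name__ == "__main__":
    bricks = [Brick(), Brick(busy=True), Brick()]
    print(lock_all(bricks))
```

Taking locks one by one in a fixed order is a common way to keep two competing clients from deadlocking each other; the discussion does not spell out that motivation, so treat it as an interpretation.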
10:22:09 <xavih> the callback from ec_inodelk() will receive the final result of the lock (all this I've explained is internal fop management)
10:22:59 <dlambrig> For example , ec_entrylk_cbk()
10:22:59 <xavih> the callback I was referring to is the ec_lock_check(), that is called from ec_manager_inodelk() at state EC_STATE_REBUILD.
10:23:29 <krishnan_p> xavih, I think I understood what you explained with regard to all alive bricks vs N-R to decide a good answer. But let me clarify offline.
10:23:39 <xavih> the REBUILD state is processed just before calling the callback function to regenerate or "correct" the answer
10:23:49 <xavih> krishnan_p: ok
10:24:36 <xavih> lock functions use the REBUILD state to decide if the answer that has been combined should be sent to the callback or not
10:25:02 <xavih> for example. The first execution of a blocking lock will be translated to a non-blocking lock
10:25:20 <xavih> the answer of this request will arrive at the REBUILD state
10:25:26 <dlambrig> when is ec_entrylk_cbk() called,
10:26:11 <xavih> ah, ok, this is the callback of the WIND, not the callback of the fop
10:26:31 <xavih> sorry to mess all this, I didn't understand your question :-/
10:26:31 <dlambrig> yes
10:26:54 <xavih> ok, ec_entrylk_cbk() will be called for every WIND call you made
10:26:56 <dlambrig> each individual subvolume gets a wind, and a callback.. this is a part I meant
10:27:40 <xavih> there, as any other fop, it will construct a cbk_data_t structure with all arguments (similar to ec_fop_data_t)
10:28:18 <xavih> the only interesting thing here is that ec_lock_handler() will determine if the current answer is valid or not.
10:28:32 <dlambrig> what does ec_complete() do? :)
10:28:59 <dlambrig> your variable fop->winds is the number of subvolumes you have sent to
10:29:13 <dlambrig> if it reaches 0, you can "report" ?
10:29:40 <xavih> basically it looks if the current operation is being made incrementally or not.
If not, normal ec_combine() is done; otherwise, any failure other than ENOTCONN is handled as a failure that does not allow the lock to complete successfully
10:29:56 <xavih> ec_complete() is basically used to inform that a wind operation finished
10:30:28 <dlambrig> I did not see the difference between ec_report() and ec_resume()
10:30:44 <xavih> when all wind operations have finished it resumes the state machine execution. The result will be reported when the state reaches EC_STATE_REPORT
10:30:58 <xavih> ec_resume is used to continue execution of the state machine of the fop
10:31:11 <xavih> ec_report is used to call the fop callback
10:31:17 <dlambrig> Ok
10:31:53 <xavih> I think I'm a bit lost... where do you want to continue ?
10:32:44 <xavih> is there any doubt on locking functions ?
10:33:05 <krishnan_p> xavih, I have a suggestion
10:33:10 <xavih> all lock functions use the same logic ((f)inodelk, (f)entrylk, lk)
10:34:08 <krishnan_p> Why don't we send you a mail which outlines the lifecycle of a FOP, in terms of individual winds/unwinds to/from the subvols and how the responses are aggregated and sent back to the upper layers?
10:34:39 <krishnan_p> So that you can fill that skeleton/template with functions that cluster/ec uses/employs at those checkpoints in the execution
10:34:46 <krishnan_p> Does that sound OK to you?
10:35:33 <krishnan_p> The skeleton would be translator agnostic. Something like, what function is called once a response from a client subvol reaches cluster/ec for a given FOP etc.
10:35:35 <xavih> I can do that, however this management is somewhat different for locking functions because they handle the normal answer in a special way to be able to restart the same request using blocking and incremental modes
10:35:47 <xavih> but I think it could be easier to follow
10:36:01 <xavih> first I can explain the "normal" flow and then the special case of locks
10:36:03 <krishnan_p> xavih, afr_nonblocking_inodelk and entrylk employ a similar strategy
10:36:24 <xavih> yes, I know, I've used the same idea
10:36:28 <krishnan_p> xavih, so the locking algorithm seems fine.
10:36:59 <krishnan_p> The part that is different (and new to me) is the way state machines are transferring control across winds to different subvols
10:37:22 <xavih> ok, I'll try to explain it better
10:37:33 <xavih> an email to gluster-devel would be ok ?
10:37:44 <krishnan_p> To understand this better, it would help if we started with something more familiar and isn't different in cluster/ec as well. The lifecycle of a FOP within a xlator.
10:38:11 <krishnan_p> xavih, I will send out that mail after this meeting which you could answer to. Yes, I will CC gluster-devel
10:38:16 <xavih> yes, in writev it's simpler but we have been jumping from one place to another :P
10:38:58 <krishnan_p> xavih, it's hard to keep a few questions from cropping up when something new is being explained I guess :)
10:39:24 <dlambrig> also, lock is part of write. I see them as the same transaction, personally.
10:39:37 <xavih> krishnan_p: yes, yes, I know it's difficult to understand so many things
10:39:38 <dlambrig> we can stop here
10:39:59 <dlambrig> this has been extremely helpful - I have learned a lot
10:40:03 <xavih> dlambrig: yes, yes, it's not your fault, but there are a lot of details and it's difficult for me...
10:40:15 <krishnan_p> xavih, But you are explanations have made our understanding better than before. With a few more meetings we should be on the same page I hope.
10:40:25 <krishnan_p> s/you are /your
10:40:26 <xavih> :)
10:40:47 <dlambrig> your explanations have been very good
10:41:00 <dlambrig> we will take a few days to digest this latest meal you have served us :)
10:41:24 <pranithk> dlambrig: they generally are. Even his explanations on gluster-devel share the same clarity :-)
10:41:29 <xavih> I have difficulties sorting things in the best way to be understood...
10:41:52 <dlambrig> can you meet Tuesday? That will be my last day in Bangalore where we are all together
10:41:55 <xavih> pranithk: thanks :)
10:42:08 <pranithk> xavih: :-)
10:42:43 <xavih> I can try to do some high level explanations of general working and start from there to see details of some specific fops
10:43:47 <dlambrig> I think the fop piece is better understood
10:44:04 <dlambrig> could we discuss healing on Tuesday?
10:44:45 <xavih> I think a generic overview of state machine states and generic meaning and purpose would be interesting to follow the details of specific and more complex fops
10:44:59 <krishnan_p> xavih, Yes. That will be of great help
10:44:59 <xavih> dlambrig: as you prefer
10:45:19 <krishnan_p> xavih, we could cover the details of state machine and fops in a mail over gluster-devel
10:45:19 <hagarth> ok, let us catch up on Tuesday around the same time
10:45:39 <xavih> krishnan_p: ok
10:45:45 <dlambrig> cool
10:45:49 <xavih> hagarth: ok, perfect
10:45:54 <krishnan_p> xavih, thanks a lot for patiently explaining
10:46:04 <xavih> krishnan_p: yw
10:46:18 <xavih> krishnan_p: I hope I haven't bored you too much :P
10:46:26 <hagarth> xavih: thanks for this!
10:46:31 <hagarth> #endmeeting