10:39:35 #startmeeting erasure coding review part2
10:39:35 Meeting started Tue Jun 10 10:39:35 2014 UTC. The chair is hagarth. Information about MeetBot at http://wiki.debian.org/MeetBot.
10:39:35 Useful Commands: #action #agreed #halp #info #idea #link #topic.
10:39:47 xavih: hi Xavi, can we talk about the healing part of your code?
10:39:59 and here is krishnan_p
10:40:11 xavih, Good afternoon.
10:40:15 hi all
10:40:24 dlambrig: whatever you prefer :)
10:40:46 xavih: let's go through what triggers a healing operation
10:40:50 where do we start?
10:40:54 ok
10:42:24 when any fop finishes, it knows which answers are "good" and which ones are "bad" (good answers are those that belong to the group with the most combinable answers)
10:42:56 this means that any answer not belonging to that group has something bad
10:43:04 this is handled in ec_update_bad()
10:43:58 it receives a bitmask with the subvolumes that are ok, meaning that the other ones may have something inconsistent
10:44:38 in ec_update_bad() I do two things: mark the inode or fd with the bitmask of bad subvolumes to be taken into account in future requests
10:45:12 and call ec_check_status()
10:45:47 here I check whether there are any alive bricks that have answered differently than the main "good" group
10:46:09 if there is such a brick, I start a self-heal by calling ec_heal() or ec_fheal()
10:46:29 xavih, this is probably a silly question, but if you don't have a ctx for the ec xlator on an inode, shouldn't you create one and then add the information to the ctx?
10:46:29 this is the only point where self-heal is initiated currently
10:47:24 krishnan_p: I always create a ctx for the inode if it's missing in ec_inode_get()
10:47:45 krishnan_p: however it shouldn't be missing at this point of the fop execution
10:48:10 xavih: ok
10:48:11 it's created during preparation of the fop
10:48:23 xavih, OK.
10:48:49 now, ec-heal.c contains all the logic for self-heal
10:49:21 it's treated as a regular fop, so it uses the same framework as any other fop (i.e. the state machine logic)
10:50:24 if we go to ec_manager_heal(), we see the actions taken in each state
10:50:46 it defines a lot of additional states because it needs to do more work to get its job done
10:51:21 first of all I set the owner for all future subrequests initiated by a self-heal (I'm at state EC_STATE_INIT)
10:52:03 ec_heal_init() does some basic checks and creates the structures needed to control the self-heal process
10:52:16 I think there's nothing special here
10:52:54 then, at EC_STATE_DISPATCH, the real work begins
10:53:25 I'll explain the steps without specifying each state. I think it's quite easy to see the states. I'll only explain the logic
10:53:33 first it locks the entry
10:53:53 xavih: just a few quick questions before you move on..
10:53:56 once it's locked, it makes a lookup to get information from all bricks
10:54:02 dlambrig: tell me
10:54:17 xavih, I had a couple of questions
10:54:38 You could answer them later if you were planning to cover them anyway.
10:54:43 here goes,
10:54:51 1) Different clients could have different bitmasks for the same inode, depending on which partition of the bricks/nodes they happen to be able to reach. This could lead to a split-brain, since we may not have a consensus on 'who' is right about the inconsistencies (namely who is bad and who is good)
10:55:07 Am I right about this, or does ec handle this differently?
10:55:48 2) Do heal and normal I/Os lock on the same domain in the locks xlator?
10:56:03 ec only considers an answer good if at least N - R bricks agree
10:56:05 i.e. do heals compete with normal I/Os, in terms of locking servers?
10:57:00 xavih, so if an I/O didn't have N - R in any of its response groupings, you would return EIO and NOT mark any of the subvolumes as good or bad?
10:57:49 ec does not allow N - R to be smaller than or equal to R
10:58:17 krishnan_p: yes, if there isn't enough healthy data, EIO is returned
10:58:47 in this case there isn't any brick marked as bad
10:59:00 xavih, thanks. That makes sense.
10:59:07 because there is not enough information to know if they are really bad
10:59:44 anyway, any future request will reach the same conclusion unless a dead brick revives and we get, at least, N - R answers
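
[Editor's note: the following is a standalone sketch, not part of the meeting and not the ec xlator source, illustrating the quorum rule xavih describes above: an answer group is only considered "good" if at least N - R bricks agree; otherwise the fop fails with EIO and no brick is marked bad. All names, constants and types below are hypothetical.]

/*
 * Minimal standalone sketch (not the actual ec xlator code) of the
 * quorum rule described above.  Subvolumes are tracked as bits in a
 * mask, in the spirit of ec_update_bad().
 */
#include <stdint.h>
#include <stdio.h>
#include <errno.h>

#define EC_BRICKS     6   /* N: total bricks in the disperse set */
#define EC_REDUNDANCY 2   /* R: bricks that may fail             */

/* Count how many bricks are present in a subvolume bitmask. */
static int ec_mask_count(uint64_t mask)
{
    int count = 0;
    while (mask != 0) {
        count += mask & 1;
        mask >>= 1;
    }
    return count;
}

/*
 * good_mask: bricks whose answers belong to the largest combinable group.
 * alive_mask: bricks that answered at all (dead bricks are not blamed).
 * Returns 0 and fills *bad_mask when quorum is met, -EIO otherwise.
 */
static int ec_check_quorum(uint64_t good_mask, uint64_t alive_mask,
                           uint64_t *bad_mask)
{
    if (ec_mask_count(good_mask) < EC_BRICKS - EC_REDUNDANCY) {
        /* Not enough healthy answers: nobody is marked bad, the fop
         * fails with EIO because we cannot tell who is right. */
        *bad_mask = 0;
        return -EIO;
    }

    /* Alive bricks outside the good group need self-heal. */
    *bad_mask = alive_mask & ~good_mask;
    return 0;
}

int main(void)
{
    uint64_t bad = 0;
    /* Example: bricks 0-3 agree, brick 4 answered differently,
     * brick 5 is down. */
    int ret = ec_check_quorum(0x0F, 0x1F, &bad);

    printf("ret=%d bad_mask=0x%llx\n", ret, (unsigned long long)bad);
    return 0;
}

Returning EIO without blaming anyone matches the explanation at 10:59:07: without N - R matching answers there is not enough information to know which bricks are really bad.]
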
11:00:34 for question 2, all fops use the name of the xlator as the lock domain. Was this the question?
11:01:17 the question was whether a heal operation would be blocked by an ongoing FOP on the same inode/entry?
11:01:17 and, yes, they compete with regular I/O. This is a must because the heal touches data that other operations may be trying to modify
11:01:35 yes, it can be blocked
11:01:55 xavih, OK.
11:02:01 however locks are kept to a minimum
11:02:10 to avoid big delays
11:02:22 is this clear?
11:02:49 so we continue with the heal operation...
11:02:57 xavih, Yes. It is clear so far
11:03:18 after locking the entry and doing the lookup, it can determine if the entry information is consistent
11:03:45 this means that it checks whether the file type is the same on all bricks and whether it exists or not
11:04:02 xavih: ok
11:04:19 since this is done as a regular ec_lookup(), it will return the "good" answer and the list of bad answers
11:04:24 xavih, same on all, or would same on N - R do?
11:05:15 krishnan_p: it will look in all bricks, but the "good" group will contain at least N - R answers
11:05:28 xavih, OK
11:05:33 all fops return all answers, but only one group is the "good" one
11:05:40 we will see this now
11:06:22 at ec_heal_prepare()
11:06:45 heal->lookup points to the fop representing the last lookup
11:06:58 (this has been saved at ec_heal_entry_lookup_cbk())
11:07:20 heal->lookup->answer points to the "good" group of answers
11:08:25 then I look at the return code. If it's < 0, it means that most of the bricks agree that the file is missing (most probably) or has some error
11:08:52 in this case, the way to heal the file is by deleting it on the other bricks by calling ec_heal_remove_others()
11:09:27 this function basically looks at which type of file each brick has and calls ec_rmdir() or ec_unlink() to remove it
11:09:49 in this function you can also see how the other groups are accessed
11:10:16 cbk_list contains a list of groups ordered by the number of answers in each group
11:10:27 the first one will always be the "good" one
11:10:47 so in ec_heal_remove_others() I handle all list items but the first
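
[Editor's note: below is an illustrative sketch, not the real cbk_list code, of the structure just described: answer groups kept in a list ordered by how many bricks gave that answer, with the first (largest) group taken as "good" and every other group walked by the healing code, as in ec_heal_remove_others(). The struct and function names are made up for the example.]

/*
 * Hypothetical stand-in for one group of combinable answers and the
 * "skip the first, heal the rest" walk described above.
 */
#include <stdint.h>
#include <stdio.h>

struct answer_group {
    struct answer_group *next;   /* list ordered by 'count', descending */
    uint64_t             mask;   /* bricks that returned this answer    */
    int                  count;  /* number of bricks in the group       */
    int                  op_ret; /* return code shared by the group     */
};

/* Walk all groups but the first and "heal" the bricks behind them. */
static void heal_bad_groups(struct answer_group *cbk_list)
{
    struct answer_group *grp;

    if (cbk_list == NULL)
        return;

    /* cbk_list itself is the largest (good) group: skip it. */
    for (grp = cbk_list->next; grp != NULL; grp = grp->next) {
        printf("healing bricks 0x%llx (op_ret=%d, %d answers)\n",
               (unsigned long long)grp->mask, grp->op_ret, grp->count);
        /* Real code would issue rmdir/unlink/create per brick here. */
    }
}

int main(void)
{
    struct answer_group bad  = { NULL, 0x10, 1, -1 }; /* this group failed */
    struct answer_group good = { &bad, 0x0F, 4, 0 };  /* largest: "good"   */

    heal_bad_groups(&good);
    return 0;
}
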
11:11:13 now we go back to ec_heal_prepare()
11:11:38 if the file did exist, we take different actions depending on what type of file it is
11:12:16 if the file is a regular file, we prepare an fd for read/write in case data self-heal is needed
11:12:34 then, if the file is a symlink, we need to read the link target before continuing
11:13:07 ec_readlink() will return the path to which the symlink points. This is handled in ec_heal_readlink_cbk()
11:13:41 When readlink finishes it will also call ec_heal_prepare_others()
11:14:19 in ec_heal_prepare_others() we have all the entry information needed to recreate or recover the entry on the bad bricks
11:14:50 we loop through all the bad groups of answers and decide what to do for each group
11:16:10 if the file doesn't exist, ec_heal_create() is called; otherwise, if the file type is incorrect or the gfid does not match, the file is deleted by calling ec_heal_remove()
11:17:11 ec_heal_remove() removes the entry using ec_rmdir() or ec_unlink(). When this finishes, it also calls ec_heal_create()
11:17:54 in ec_heal_create() some tests are done
11:18:18 first of all, it can try to recreate the file by making a hard link to an existing gfid
11:18:59 if this succeeds, the creation phase is done; otherwise, it will create the file by calling ec_create()
11:19:24 otherwise, ec_mkdir(), ec_symlink() or ec_create() is used to create the entry, depending on its type.
11:20:17 this will finish the entry healing phase. At this point, the entry should be ok on all bricks (bricks that have failed at some step will be removed from the self-heal process)
11:20:24 are you following?
11:20:37 xavih: yeah
11:20:37 xavih: we are following
11:20:48 ok, I thought you were sleeping :P
11:20:50 xavih, when would it happen that the gfid link is present and the entry is missing?
11:21:36 krishnan_p: it can happen if the file you are trying to heal has disappeared from a brick, but it had another hard link on the same brick
11:22:01 or it can also happen if for some reason the file has been removed but the entry inside .glusterfs has not yet been deleted
11:22:07 xavih, like an incomplete rename?
11:22:16 could be, yes
11:22:37 xavih, I was thinking how I could test this branch of the healing operation if I wanted to
11:23:19 you only need to remove the file from the brick, without removing the gfid
11:23:37 another thought is that, if we limited the set of possible on-disk states that we could be in, the complexity of the corresponding healing algorithm would be lower
11:23:47 once removed, if you access the file through the mount point, it will recover it by creating a hard link
11:24:10 xavih, yes. That would be an 'illegal' operation to simulate the situation you have designed the algorithm for
11:24:37 xavih, if we could define a canonical (legal) way of reaching that state, it would be useful.
11:24:52 krishnan_p: the other way would be to kill a brick just in the middle of an unlink/rename operation, but that would be much harder to do...
11:24:52 xavih, it doesn't have to be necessarily deterministic.
11:25:19 xavih, Yeah. Someday we have to be able to inject such faults into our fs and test how it responds
11:25:52 you can also do that in a more "legal" and deterministic way...
11:26:05 xavih, Now we at least have a test case. A rename operation terminated before it was 'committed' to the volume
11:26:06 you create a file and a hard link to it
11:26:25 xavih, yes. That's the case I was looking for :-)
11:26:40 xavih, thanks
11:26:58 then you kill a brick and remove one of the files
11:27:58 So crudely, touch f1; ln f1 l1; (kill brickN); unlink l1
11:28:19 oops, I think it won't work... quorum enforcement will remove the file from the other bricks instead of recreating it, because the majority of bricks have removed the file
11:29:06 xavih, aah. yes. I think so too.
11:29:20 I think it will be difficult to do without trying to kill the brick in the middle of the operation...
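
[Editor's note: the sketch below is a standalone illustration, under the assumption of a plain POSIX brick layout, of the recovery path discussed above: when a file's directory entry is missing on a brick but its .glusterfs/<aa>/<bb>/<gfid> hard link still exists, the entry can be restored with link(2); only if that fails does it have to be created from scratch. The helper names and paths are hypothetical and this is not the ec_heal_create() implementation.]

/*
 * Illustrative sketch only: recreate a missing directory entry on a
 * brick from its .glusterfs gfid hard link, falling back to creating
 * an empty file that a later data heal would fill in.
 */
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <unistd.h>

/* Build "<brick>/.glusterfs/aa/bb/<gfid>" from a gfid hex string. */
static void gfid_path(const char *brick, const char *gfid,
                      char *buf, size_t len)
{
    snprintf(buf, len, "%s/.glusterfs/%.2s/%.2s/%s",
             brick, gfid, gfid + 2, gfid);
}

static int heal_recreate_entry(const char *brick, const char *gfid,
                               const char *path)
{
    char gpath[PATH_MAX];

    gfid_path(brick, gfid, gpath, sizeof(gpath));

    if (link(gpath, path) == 0)
        return 0;            /* entry recovered from the gfid link */

    if (errno != ENOENT)
        return -errno;

    /* No gfid link either: create the entry and let data heal fill it. */
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0)
        return -errno;
    close(fd);
    return 0;
}

int main(void)
{
    int ret = heal_recreate_entry("/bricks/b0",
                                  "6cde9a2b3d1e4f5a8b7c9d0e1f2a3b4c",
                                  "/bricks/b0/dir/file");
    printf("heal_recreate_entry: %d\n", ret);
    return 0;
}
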
11:29:42 let's continue the walkthrough
11:29:47 ok
11:29:47 OK. Let me think about this and get back to you offline
11:30:21 ok, then we had the entry ready on all bricks
11:30:29 now we do an ec_inodelk()
11:30:37 and another ec_lookup()
11:31:20 xavih: ok
11:31:24 this time all metadata information returned by the lookup should be consistent
11:31:38 here we check whether xdata, file size, etc. are equal
11:32:47 once we have all the "good" data, I remove unneeded extended attributes from the bad bricks in ec_heal_removexattr_others()
11:33:24 xavih: ok
11:33:26 xavih: can you talk about heal->version vs cbk->version
11:33:27 then I synchronize the remaining xattrs in ec_heal_setxattr_others() and finally I synchronize owner, permissions and other things in ec_heal_attr()
11:34:05 heal->version is the EC_XATTR_VERSION attribute returned by the "good" group during lookup
11:34:30 cbk->version is the same attribute but only for the considered group (cbk in this case)
11:34:38 xavih: do you increment the version on every write
11:34:54 yes, every modification operation increments the version number
11:35:45 after ec_heal_attr(), all metadata on all bricks should be consistent
11:36:16 xavih: I assume the version # is a huge 128 bit number
11:36:17 ec_heal_open() is called to open the file on all bricks if data self-heal will be needed
11:37:02 dlambrig: it's only a 64-bit number. I think it's enough to keep modifying a file for some centuries before it overflows...
11:37:22 xavih: :)
11:38:26 after that, ec_heal_reopen_fd() is called to open the files associated with the inode that may not have been opened before, for example because the brick was down when the open happened
11:38:54 once all this is done, both locks (entry and inode) are released
11:39:54 here, if data self-heal is needed, a loop consisting of a partial inodelk, a read from the good bricks, a write to the bad bricks, and an unlock is done until all data is healed
11:41:09 Finally, a new inodelk/lookup is made to get the normal attributes from the file and update them using ec_setattr() (basically to reset the modification time)
11:42:03 this should finish the heal process. The only additional thing is that since I'm using a regular fop interface, it must report the result of the operation as if it were a normal fop
11:42:39 so I created a callback function that receives a bitmask of the healed bricks, indicating which ones were good and which ones were healed
11:42:58 I think this is all at a high level
11:43:08 xavih: quick question-
11:43:18 xavih, by partial inodelk are you implying that you are taking a lock on the region that is being healed?
11:43:27 krishnan_p: yes
11:44:12 xavih: ok
11:45:28 xavih: I think this description made a lot of sense :)
11:45:38 dlambrig: +1
11:46:03 there are more details, but it would take a lot of time to explain all of them...
11:46:07 xavih, I think you are seasoned in locks and afr now. Yay! we have more reviewers for glusterfs
11:46:40 krishnan_p: locking is a pain... :P
11:46:59 xavih: +1 ;)
11:47:46 of course I made use of nested fop execution to handle all this
11:48:02 xavih, I know. But I would love to see how we can abstract locks in general
11:48:10 I think I will end this meeting for now.
11:48:22 #endmeeting
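
[Editor's note: to round off the walkthrough, here is a minimal standalone sketch of the data self-heal loop described at 11:39:54: lock a small range with a partial inodelk, rebuild it from the good bricks, write it to the bad bricks, unlock, and repeat until the whole file is healed. The lock/read/write helpers are stubs standing in for ec_inodelk(), the reads from good bricks and the writes to bad bricks; nothing here is the actual ec API.]

/*
 * Illustrative sketch only: range-by-range data self-heal with short
 * locks, so regular I/O is never blocked for the whole file.
 */
#include <stdint.h>
#include <stdio.h>

#define HEAL_BLOCK (128 * 1024)   /* range healed (and locked) per pass */

/* --- stubs standing in for the real cluster operations ------------- */
static void range_lock(uint64_t off, uint64_t len)   { (void)off; (void)len; }
static void range_unlock(uint64_t off, uint64_t len) { (void)off; (void)len; }

/* Reads up to len decoded bytes from the good bricks; 0 means EOF. */
static int64_t read_good(uint64_t off, char *buf, uint64_t len)
{
    (void)buf;
    return (off < 300 * 1024) ? (int64_t)len : 0;   /* fake 300 KiB file */
}

/* Encodes and writes the fragments to the bricks being healed. */
static int write_bad(uint64_t off, const char *buf, uint64_t len)
{
    (void)off; (void)buf; (void)len;
    return 0;
}

/* ------------------------------------------------------------------- */
static int heal_data(void)
{
    char buf[HEAL_BLOCK];
    uint64_t off = 0;

    for (;;) {
        range_lock(off, HEAL_BLOCK);               /* partial inodelk   */
        int64_t got = read_good(off, buf, HEAL_BLOCK);
        int ret = (got > 0) ? write_bad(off, buf, (uint64_t)got) : 0;
        range_unlock(off, HEAL_BLOCK);             /* keep locks short  */

        if (got <= 0 || ret != 0)
            return (got < 0 || ret != 0) ? -1 : 0; /* EOF or error      */
        off += (uint64_t)got;
    }
}

int main(void)
{
    printf("heal_data: %d\n", heal_data());
    return 0;
}
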