10:39:35 <hagarth> #startmeeting erasure coding review part2
10:39:35 <zodbot> Meeting started Tue Jun 10 10:39:35 2014 UTC.  The chair is hagarth. Information about MeetBot at http://wiki.debian.org/MeetBot.
10:39:35 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
10:39:47 <dlambrig> xavih: hi Xavi, can we talk about the healing part of your code?
10:39:59 <hagarth> and here is krishnan_p
10:40:11 <krishnan_p> xavih, Good afternoon.
10:40:15 <xavih> hi all
10:40:24 <xavih> dlambrig: whatever you prefer :)
10:40:46 <dlambrig> xavih: let's go through what triggers a healing operation
10:40:50 <xavih> where do we start?
10:40:54 <xavih> ok
10:42:24 <xavih> when any fop finishes, it knows which answers are "good" and which ones are "bad" (good answers are those that belong to the group with the most combinable answers)
10:42:56 <xavih> this means that any answer not belonging to that group has something bad
10:43:04 <xavih> this is handled in ec_update_bad()
10:43:58 <xavih> it receives a bitmask with the subvolumes that are ok, meaning that the other ones may have something inconsistent
10:44:38 <xavih> at ec_update_bad() I do two things: mark the inode or fd with the bitmask of bad subvolumes to be taken into account in future requests
10:45:12 <xavih> and call ec_check_status()
10:45:47 <xavih> here I check whether there are some alive bricks that have answered differently from the main "good" group
10:46:09 <xavih> if there is such a brick, I start a self-heal by calling ec_heal() or ec_fheal()
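A minimal C sketch of the trigger path just described. All type, field, and helper names below are illustrative stand-ins, not the real ec xlator code:

    #include <stdint.h>

    /* Illustrative stand-ins for the real ec structures. */
    typedef struct {
        uintptr_t bad;            /* subvolumes known to be inconsistent */
    } ictx_sketch_t;

    typedef struct {
        uintptr_t      mask;      /* subvolumes the fop was sent to */
        uintptr_t      alive;     /* subvolumes currently up */
        ictx_sketch_t *ctx;       /* inode or fd context */
    } fop_sketch_t;

    static void start_heal(fop_sketch_t *fop)
    {
        (void)fop;                /* stands in for ec_heal()/ec_fheal() */
    }

    /* 'good' is the bitmask of subvolumes whose answers were combinable. */
    static void update_bad_sketch(fop_sketch_t *fop, uintptr_t good)
    {
        uintptr_t bad = fop->mask & ~good;   /* every other answer is suspect */

        /* 1) remember the bad subvolumes for future requests */
        fop->ctx->bad |= bad;

        /* 2) ec_check_status(): if any *alive* brick disagreed with the
         *    "good" group, trigger self-heal (the only heal trigger). */
        if (bad & fop->alive)
            start_heal(fop);
    }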
10:46:29 <krishnan_p> xavih, this is probably a silly question, but if you don't have a ctx for ec xlator on an inode, shouldn't you create one and then add the information to the ctx?
10:46:29 <xavih> this is the only point where self-heal is initiated currently
10:47:24 <xavih> krishnan_p: I always create a ctx for the inode if it's missing in ec_inode_get()
10:47:45 <xavih> krishnan_p: however it shouldn't be missing at this point of the fop execution
10:48:10 <hagarth> xavih: ok
10:48:11 <xavih> it's created during preparation of the fop
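A small sketch of the get-or-create pattern described for ec_inode_get(); the ctx slot is simplified to a plain pointer rather than gluster's real inode-ctx API:

    #include <stdint.h>
    #include <stdlib.h>

    /* Illustrative per-inode ec context. */
    typedef struct {
        uintptr_t bad;            /* bad-subvolume bitmask kept across fops */
    } inode_ctx_sketch_t;

    /* Return the inode's ctx, creating it on first use; 'slot' stands in
     * for the inode's per-xlator ctx storage. */
    static inode_ctx_sketch_t *inode_get_ctx_sketch(inode_ctx_sketch_t **slot)
    {
        if (*slot == NULL)
            *slot = calloc(1, sizeof(**slot));
        return *slot;
    }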
10:48:23 <krishnan_p> xavih, OK.
10:48:49 <xavih> now, ec-heal.c contains all the logic for self-heal
10:49:21 <xavih> it's treated as a regular fop, so it uses the same framework as any other fop (i.e. the state machine logic)
10:50:24 <xavih> if we go to ec_manager_heal(), we see the actions taken in each state
10:50:46 <xavih> it defines a lot of additional states because it has more work to do than a normal fop
10:51:21 <xavih> first of all I set the owner for all future subrequests initiated by a self-heal (I'm at state EC_STATE_INIT)
10:52:03 <xavih> ec_heal_init() does some basic checks and creates the structures needed to control the self-heal process
10:52:16 <xavih> I think there's nothing special here
10:52:54 <xavih> then, at EC_STATE_DISPATCH the real work begins
10:53:25 <xavih> I'll explain the steps, without specifying each state. I think it's quite easy to see the states. I'll only explain the logic
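As a rough roadmap of the walkthrough that follows, the heal state machine can be sketched like this. Only EC_STATE_INIT and EC_STATE_DISPATCH are named in the discussion; the remaining state names are illustrative summaries of the later steps:

    #include <stdio.h>

    /* Schematic states; the first two are named in the discussion, the
     * rest summarize the phases explained below. */
    enum heal_state {
        EC_STATE_INIT,       /* set lk-owner, ec_heal_init() */
        EC_STATE_DISPATCH,   /* lock entry, lookup on all bricks */
        STATE_HEAL_ENTRY,    /* recreate/remove the entry on bad bricks */
        STATE_HEAL_METADATA, /* sync xattrs, owner, permissions */
        STATE_HEAL_DATA,     /* lock/read/write/unlock loop */
        STATE_DONE
    };

    static enum heal_state heal_step(enum heal_state s)
    {
        switch (s) {
        case EC_STATE_INIT:       return EC_STATE_DISPATCH;
        case EC_STATE_DISPATCH:   return STATE_HEAL_ENTRY;
        case STATE_HEAL_ENTRY:    return STATE_HEAL_METADATA;
        case STATE_HEAL_METADATA: return STATE_HEAL_DATA;
        default:                  return STATE_DONE;
        }
    }

    int main(void)
    {
        for (enum heal_state s = EC_STATE_INIT; s != STATE_DONE; s = heal_step(s))
            printf("heal phase %d\n", (int)s);
        return 0;
    }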
10:53:33 <xavih> first it locks the entry
10:53:53 <dlambrig> xavih: just a few quick questions before you move on..
10:53:56 <xavih> once it's locked, it makes a lookup to get information from all bricks
10:54:02 <xavih> dlambrig: tell me
10:54:17 <krishnan_p> xavih, I had couple of questions
10:54:38 <krishnan_p> You could answer them later if you were anyway planning to cover them.
10:54:43 <krishnan_p> here goes,
10:54:51 <krishnan_p> Different clients could have different bitmasks for the same inode, depending on what partition of the bricks/nodes they
10:54:51 <krishnan_p> happen to be able to reach. This could lead to a split-brain, since we may not have a consensus on 'who' is right about
10:54:51 <krishnan_p> the inconsistencies (namely who is bad and who is good)
10:55:07 <krishnan_p> Am I right about this, or does ec handle this differently
10:55:48 <krishnan_p> 2) Does heal and normal I/Os lock on the domain in locks xlator?
10:56:03 <xavih> ec only considers an answer good if at least N - R bricks agree
10:56:05 <krishnan_p> i.e. do heals compete with normal I/Os, in terms of locking servers?
10:57:00 <krishnan_p> xavih, so  if an I/O didn't have N-R in any of its response groupings, you would return EIO and NOT mark any of the subvolumes as good or bad?
10:57:49 <xavih> ec does not allow N - R to be smaller than or equal to R
10:58:17 <xavih> krishnan_p: yes, if there isn't enough healthy data, EIO is returned
10:58:47 <xavih> in this case, no brick is marked as bad
10:59:00 <krishnan_p> xavih, thanks. That makes sense.
10:59:07 <xavih> because there is not enough information to know if they are really bad
10:59:44 <xavih> anyway, any future request will reach the same conclusion unless a dead brick revives and gives, at least, N - R answers
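A sketch of the quorum rule just described, where N is the number of bricks and R the redundancy; the function and parameter names are made up for illustration:

    #include <errno.h>

    /* Pick the biggest answer group; it is only "good" if at least
     * N - R bricks agree (ec also requires N - R > R).  Otherwise the
     * fop fails with EIO and no brick is marked bad, because there is
     * not enough information to know who is really wrong. */
    static int pick_good_group_sketch(const int *group_sizes, int ngroups,
                                      int n_bricks, int redundancy,
                                      int *good_idx)
    {
        int best = -1;

        for (int i = 0; i < ngroups; i++)
            if (best < 0 || group_sizes[i] > group_sizes[best])
                best = i;

        if (best < 0 || group_sizes[best] < n_bricks - redundancy)
            return -EIO;   /* no consensus: refuse, mark nothing bad */

        *good_idx = best;
        return 0;
    }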
11:00:34 <xavih> for question 2: all fops use the name of the xlator as the lock domain. Was this the question?
11:01:17 <krishnan_p> the question was whether a heal operation would be blocked by an ongoing FOP on the same inode/entry?
11:01:17 <xavih> and, yes, they compete with regular I/O. This is a must, because the heal works on data that other operations may be trying to modify
11:01:35 <xavih> yes, it can be blocked
11:01:55 <krishnan_p> xavih, OK.
11:02:01 <xavih> however locks are kept to a minimum
11:02:10 <xavih> to avoid big delays
11:02:22 <xavih> is this clear ?
11:02:49 <xavih> so we continue with the heal operation...
11:02:57 <krishnan_p> xavih, Yes. It is clear so far
11:03:18 <xavih> after locking the entry and doing the lookup, it can determine if entry information is consistent
11:03:45 <xavih> this means that it checks whether the file type is the same on all bricks and whether it exists or not
11:04:02 <hagarth> xavih: ok
11:04:19 <xavih> since this is done as a regular ec_lookup(), it will return the "good" answer and the list of bad answers
11:04:24 <krishnan_p> xavih, same on all or same on N-R would do?
11:05:15 <xavih> krishnan_p: it will look in all bricks, but the "good" group will contain N - R answers at least
11:05:28 <krishnan_p> xavih, OK
11:05:33 <xavih> all fops return all answers, but only one group is the "good" one
11:05:40 <xavih> we will see this now
11:06:22 <xavih> at ec_heal_prepare()
11:06:45 <xavih> heal->lookup points to the fop representing the last lookup
11:06:58 <xavih> (this has been saved at ec_heal_entry_lookup_cbk())
11:07:20 <xavih> heal->lookup->answer points to the "good" group of answers
11:08:25 <xavih> then I look at the return code. If it's < 0, it means that most of the bricks agree that the file is missing (most probably) or has some error
11:08:52 <xavih> in this case, the way to heal the file is by deleting it on other bricks by calling ec_heal_remove_others()
11:09:27 <xavih> this function basically looks at which type of file each brick has and calls ec_rmdir() or ec_unlink() to remove it
11:09:49 <xavih> in this function you can also see how other groups are accessed
11:10:16 <xavih> cbk_list contains a list of groups ordered by the number of answers in each group
11:10:27 <xavih> the first one will always be the "good" one
11:10:47 <xavih> so in ec_heal_remove_others() I handle all list items but the first
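In sketch form, the walk over cbk_list looks roughly like this; the struct and helper names are illustrative, with ec_rmdir()/ec_unlink() reduced to stubs:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative answer group; cbk_list is sorted by answer count,
     * so the first element is always the "good" group. */
    struct answer_group {
        struct answer_group *next;
        bool      is_dir;          /* file type seen by this group */
        uintptr_t mask;            /* bricks in this group */
    };

    static void rmdir_on(uintptr_t mask)  { (void)mask; /* ec_rmdir()  */ }
    static void unlink_on(uintptr_t mask) { (void)mask; /* ec_unlink() */ }

    static void remove_others_sketch(struct answer_group *cbk_list)
    {
        /* skip the first (good) group and remove the entry everywhere else */
        for (struct answer_group *g = cbk_list->next; g != NULL; g = g->next) {
            if (g->is_dir)
                rmdir_on(g->mask);
            else
                unlink_on(g->mask);
        }
    }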
11:11:13 <xavih> now we go back to ec_heal_prepare()
11:11:38 <xavih> if the file did exist, we do different actions depending on what type of file it is
11:12:16 <xavih> if the file is a regular file, we prepare an fd for read/write in case data self-heal is needed
11:12:34 <xavih> then if the file is a symlink, we need to read the link location before continuing
11:13:07 <xavih> ec_readlink() will return the path to which the symlink points. This is handled in ec_heal_readlink_cbk()
11:13:41 <xavih> When readlink finishes it will also call ec_heal_prepare_others()
11:14:19 <xavih> on ec_heal_prepare_others() we have all needed entry information to recreate or recover the entry on bad bricks
11:14:50 <xavih> we loop through all the bad groups of answers and decide what to do for each group
11:16:10 <xavih> if the file doesn't exist, ec_heal_create() is called; otherwise, if the file type is incorrect or the gfid does not match, the file is deleted by calling ec_heal_remove()
11:17:11 <xavih> ec_heal_remove() removes the entry using ec_rmdir() or ec_unlink(). When this finishes, it also calls ec_heal_create()
11:17:54 <xavih> on ec_heal_create() some tests are done
11:18:18 <xavih> first of all, it can try to recreate the file by doing a hard link to an existing gfid
11:18:59 <xavih> if this succeeds, the creation phase is done
11:19:24 <xavih> otherwise, ec_mkdir(), ec_symlink() or ec_create() is used to create the entry, depending on its type.
11:20:17 <xavih> this will finish the entry healing phase. At this point, the entry should be ok in all bricks (bricks that have failed at some step will be removed from the self-heal process)
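The per-group decision logic of ec_heal_prepare_others() and ec_heal_create() can be sketched as follows; all types and helpers are hypothetical stand-ins:

    #include <stdbool.h>
    #include <string.h>

    /* Illustrative per-group view of the entry. */
    struct entry_state {
        bool exists;
        int  file_type;
        unsigned char gfid[16];
    };

    static bool try_gfid_hardlink(struct entry_state *g) { (void)g; return false; }
    static void create_entry(struct entry_state *g)
    { (void)g; /* ec_mkdir()/ec_symlink()/ec_create() */ }
    static void remove_entry(struct entry_state *g)
    { (void)g; /* ec_rmdir()/ec_unlink() */ }

    static void heal_entry_group_sketch(struct entry_state *g,
                                        const struct entry_state *good)
    {
        if (g->exists) {
            if (g->file_type == good->file_type &&
                memcmp(g->gfid, good->gfid, sizeof(g->gfid)) == 0)
                return;        /* entry already consistent here */
            /* wrong type or gfid: delete first; the completion callback
             * then falls through to the creation path */
            remove_entry(g);
        }

        /* try recreating via a hard link to the existing .glusterfs gfid
         * entry first; if that fails, create the entry from scratch */
        if (!try_gfid_hardlink(g))
            create_entry(g);
    }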
11:20:24 <xavih> are you following ?
11:20:37 <hagarth> xavih: yeah
11:20:37 <dlambrig> xavih: we are following
11:20:48 <xavih> ok, I thought you were sleeping :P
11:20:50 <krishnan_p> xavih, when would it happen that the gfid link is present but the entry is missing?
11:21:36 <xavih> krishnan_p: it can happen if the file you are trying to heal has disappeared from a brick, but it had another hard link on the same brick
11:22:01 <xavih> or it can also happen if for some reason the file has been removed but the entry inside .glusterfs has not yet been deleted
11:22:07 <krishnan_p> xavih, like an incomplete rename?
11:22:16 <xavih> could be, yes
11:22:37 <krishnan_p> xavih, i was thinking how could I test this branch of healing operation if I wanted to
11:23:19 <xavih> you only need to remove the file from the brick, without removing the gfid
11:23:37 <krishnan_p> another thought is that, if we limited the set of possible on-disk states that we could be in, the complexity of the corresponding healing algorithm would be lower
11:23:47 <xavih> once removed, if you access the file through the mount point, it will be recovered by creating a hard link
11:24:10 <krishnan_p> xavih, yes. That would be an 'illegal' operation to simulate the situation you have designed the algorithm for
11:24:37 <krishnan_p> xavih, if we could define a canonical (legal) way of reaching that state, it would be useful.
11:24:52 <xavih> krishnan_p: the other way would be to kill a brick just in the middle of an unlink/rename operation, but that would be much harder to do...
11:24:52 <krishnan_p> xavih, it doesn't have to be necessarily deterministic.
11:25:19 <krishnan_p> xavih, Yeah. Someday we have to be able to inject such faults into our fs and test how it responds
11:25:52 <xavih> you can also do that in a more "legal" and deterministic way...
11:26:05 <krishnan_p> xavih, Now we at least have a test case. A rename operation terminated before it was 'committed' to the volume
11:26:06 <xavih> you create a file and a hard link to it
11:26:25 <krishnan_p> xavih, yes. Thats the case I was looking for :-)
11:26:40 <krishnan_p> xavih, thanks
11:26:58 <xavih> then you kill a brick and remove one of the files
11:27:58 <krishnan_p> So crudely, touch f1; ln f1 l1; (kill brickN); unlink l1
11:28:19 <xavih> oops, I think it won't work... quorum enforcement will remove the file from the other bricks instead of recreating it, because the majority of bricks have removed the file
11:29:06 <krishnan_p> xavih, aah. yes. I think so too.
11:29:20 <xavih> I think it will be difficult to do without trying to kill the brick in the middle of the operation...
11:29:42 <dlambrig> lets continue the walkthrough
11:29:47 <xavih> ok
11:29:47 <krishnan_p> OK. Let me think about this and get back to you offline
11:30:21 <xavih> ok, then we had the entry ready in all bricks
11:30:29 <xavih> now we do an ec_inodelk()
11:30:37 <xavih> and another ec_lookup()
11:31:20 <hagarth> xavih: ok
11:31:24 <xavih> this time all metadata information returned by the lookup should be consistent
11:31:38 <xavih> here we check whether xdata is equal, the file size, ...
11:32:47 <xavih> once we have all "good" data, I remove unneeded extended attributes from bad bricks in ec_heal_removexattr_others()
11:33:24 <hagarth> xavih: ok
11:33:26 <dlambrig> xavih: can you talk about heal->version vs cbk->version
11:33:27 <xavih> then I synchronize the remaining xattrs in ec_heal_setxattr_others() and finally I synchronize owner, permissions and other things in ec_heal_attr()
11:34:05 <xavih> heal->version is the EC_XATTR_VERSION attribute returned by the "good" group during lookup
11:34:30 <xavih> cbk->version is the same attribute but only for the considered group (cbk in this case)
11:34:38 <dlambrig> xavih: do you increment the version on every write
11:34:54 <xavih> yes, every modification operation increments the version number
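A tiny sketch of the version comparison; EC_XATTR_VERSION is the real attribute name mentioned above, everything else is illustrative:

    #include <stdint.h>

    /* heal->version is EC_XATTR_VERSION from the "good" lookup group;
     * cbk->version is the same xattr from the group being examined.
     * The struct below is an illustrative stand-in. */
    struct version_view {
        uint64_t good_version;   /* bumped by every modifying fop */
        uint64_t cbk_version;    /* what this group's bricks report */
    };

    static int version_mismatch_sketch(const struct version_view *v)
    {
        /* a lower version means the group missed some modifications */
        return v->cbk_version != v->good_version;
    }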
11:35:45 <xavih> after ec_heal_attr(), all metadata on all bricks should be consistent
11:36:16 <dlambrig> xavih: I assume the version # is a huge 128 bit number
11:36:17 <xavih> ec_heal_open() is called to open the file in all bricks if data self-heal will be needed
11:37:02 <xavih> dlambrig: it's only a 64-bit number. I think it's enough to keep modifying a file for some centuries before it overflows...
11:37:22 <dlambrig> xavih: :)
11:38:26 <xavih> after that, ec_heal_reopen_fd() is called to open the fds associated with the inode that may not have been opened before, for example because the brick was down when the open happened
11:38:54 <xavih> once all this is done, both locks (entry and inode) are released
11:39:54 <xavih> here, if data self-heal is needed, a loop consisting of partial inodelk, read from good bricks, write to bad bricks, and unlock is done until all data is healed
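The data-heal loop just described, as a sketch with all brick I/O reduced to stubs; the chunk size is an arbitrary example, not ec's actual block size:

    #include <stddef.h>
    #include <stdint.h>

    /* Stubs for the brick I/O and locking; illustrative only. */
    static void lock_range(uint64_t off, size_t len)   { (void)off; (void)len; }
    static void unlock_range(uint64_t off, size_t len) { (void)off; (void)len; }
    static size_t read_good(uint64_t off, char *buf, size_t len)
    { (void)off; (void)buf; return len; }
    static void write_bad(uint64_t off, const char *buf, size_t len)
    { (void)off; (void)buf; (void)len; }

    static void heal_data_sketch(uint64_t file_size)
    {
        enum { CHUNK = 128 * 1024 };   /* arbitrary example block size */
        static char buf[CHUNK];

        for (uint64_t off = 0; off < file_size; off += CHUNK) {
            size_t len = (file_size - off < CHUNK)
                         ? (size_t)(file_size - off) : (size_t)CHUNK;

            lock_range(off, len);                    /* partial inodelk */
            size_t got = read_good(off, buf, len);   /* read from good bricks */
            write_bad(off, buf, got);                /* write to bad bricks */
            unlock_range(off, len);                  /* keep lock hold time short */
        }
    }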
11:41:09 <xavih> Finally, a new inodelk/lookup is made to get normal attributes from the file and update them using ec_setattr() (basically to reset the modification time)
11:42:03 <xavih> this should finish the heal process. The only additional thing is that since I'm using a regular fop interface, it must report the result of the operation as if it were a normal fop
11:42:39 <xavih> so I created a callback function that receives bitmasks of the bricks involved, indicating which ones were good and which ones were healed
11:42:58 <xavih> I think this is all at a high level
11:43:08 <dlambrig> xavih: quick question-
11:43:18 <krishnan_p> xavih, by partial inodelk are you implying that you are taking a lock on the region that is being healed
11:43:27 <xavih> krishnan_p: yes
11:44:12 <hagarth> xavih: ok
11:45:28 <dlambrig> xavih: I think this description made a lot of sense :)
11:45:38 <hagarth> dlambrig: +1
11:46:03 <xavih> there are more details, but it would take a lot of time to explain all of them...
11:46:07 <krishnan_p> xavih, I think you are seasoned in locks and afr now. Yay! We have more reviewers for glusterfs
11:46:40 <xavih> krishnan_p: locking is a pain... :P
11:46:59 <hagarth> xavih: +1 ;)
11:47:46 <xavih> of course I made use of nested fop execution to handle all this
11:48:02 <krishnan_p> xavih, I know. But I would love to see how we can abstract locks in general
11:48:10 <hagarth> I think I will end this meeting for now.
11:48:22 <hagarth> #endmeeting