15:11:35 <hagarth> #startmeeting
15:11:35 <zodbot> Meeting started Thu Feb 20 15:11:35 2014 UTC.  The chair is hagarth. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:11:35 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
15:11:55 <hagarth> xavih: ira can help with questions on samba
15:12:02 * ira nods.
15:12:20 <xavih> windows (and I suppose also samba) can open files with FILE_SHARE_READ, FILE_SHARE_WRITE and FILE_SHARE_DELETE
15:12:46 <ira> Yes, and I believe there's ways to make them exclusive also.
15:12:48 <xavih> I think that FILE_SHARE_READ and FILE_SHARE_WRITE translate to deny read/write/all at lower level, right ?
15:13:18 <xavih> how is FILE_SHARE_DELETE translated at the low level ? I haven't seen anything
15:13:32 <ira> File share delete is handled at close
15:13:47 <xavih> ira: exclusive ? is this the range locking exclusive/shared mode ?
15:14:08 <ira> Share modes, are different than ranges.
15:14:24 <ira> They are attributes of the file in a sense.
15:14:32 <xavih> ira: sorry, I meant exclusive/shared access
15:14:48 <ira> It would block any other opener.
15:14:57 <ira> For the types of access involved.
15:15:16 <xavih> yes, but this is when 0 is specified as a share mode, right ?
15:15:45 <xavih> there isn't any other sharing mode or behavior at open time, is there ?
15:16:09 <xavih> then we can lock for exclusive or shared access at range level once the file is opened
15:16:28 <ira> There are also byte range locks.
15:17:01 <xavih> ok
15:17:07 <ira> The two are totally separate concepts, as are oplocks/leases.
15:17:22 <ira> (Well, oplocks/leases attach more to share modes, in a sense.)
15:17:31 <ira> They are "whole file".
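The whole-file share-mode check discussed above can be sketched as follows. This is a minimal illustration, assuming the usual Windows rule that a new open conflicts with an existing one when either side requests an access type the other side did not agree to share. The names `Access`, `OpenHandle`, `conflicts` and `try_open` are hypothetical and are not Samba or gluster APIs.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Access(Flag):
    """Access types; the same bits are reused to express share modes."""
    NONE = 0
    READ = auto()
    WRITE = auto()
    DELETE = auto()


@dataclass(frozen=True)
class OpenHandle:
    access: Access  # what this opener wants to do
    share: Access   # what this opener lets others do (0 = exclusive)


def conflicts(existing: OpenHandle, new: OpenHandle) -> bool:
    """True if the two opens cannot coexist on the same file."""
    # The new opener asks for something the existing opener does not share.
    new_denied = (new.access & existing.share) != new.access
    # The existing opener already holds something the new opener will not share.
    old_denied = (existing.access & new.share) != existing.access
    return new_denied or old_denied


def try_open(table: list, new: OpenHandle) -> bool:
    """Admit the new open only if it is compatible with every current open."""
    if any(conflicts(handle, new) for handle in table):
        return False
    table.append(new)
    return True
```

A share mode of `Access.NONE` corresponds to the "0 as a share mode" case mentioned above: it blocks every other opener for the access types involved. Byte range locks are a separate layer and are not modeled here.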
15:17:47 <xavih> I think I understand this :)
15:18:06 <ira> Cool.. it isn't THAT complex.  It just isn't posix ;)
15:18:46 <xavih> I've developed for windows many years, but at a higher level. Just wanted to know how is this translated to lower level :P
15:19:11 <ira> Heh... I'm sorry, I'm a unix systems programmer who has been dragged into the SMB world ;)
15:19:41 <xavih> to handle delete share mode, is there any support needed from gluster or is it handled by samba alone ?
15:20:20 <ira> Samba can do it, itself.  It might be nice to have support for it, in case a node crashes.
15:20:55 <ira> Though what the semantics there are, I'd want to check using a windows server.
15:21:39 <xavih> then I don't see very clearly how it should be handled. If it's checked at close time, how should unlink be handled ?
15:21:46 <xavih> does it return success ?
15:22:13 <ira> If it is at close time... you do the unlink, check for success, and return.
15:22:35 <ira> If you have it for delete, nobody else should be able to unlink it.
15:23:17 <xavih> then you have to check for delete share access at unlink, but only physically delete the file at close
15:23:28 <xavih> is it right ?
15:24:02 <ira> For posix, you'd have to check for the delete share access..
15:24:23 <ira> For windows, it would try to get that same permission and just fail.
15:24:49 <ira> It becomes a matter of how some of the calls get modeled in the locking system
15:25:19 <xavih> but the DeleteFile function in Windows only uses the file path. It's not required to open the file first
15:25:36 <ira> Well... so you think... in the actual protocol, you open it.
15:25:46 <ira> You create handle.
15:25:58 <ira> (of some form.)
15:26:10 <xavih> ah ok, then internally windows always opens the file before deleting it
15:26:11 <ira> In smb2 you go through SMB2_CREATE.
15:26:26 <xavih> ok, I was missing this part
15:26:28 <ira> It will issue a create call.. yes.
15:26:35 <ira> Of some form or another...
15:27:00 <ira> Posix I don't think we get that guarantee... so it's a different game.
15:27:16 <xavih> I think I've clarified my doubts
15:27:19 <ira> This is where meshing the models is hard ;)
15:27:30 <ira> ok.
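The "check for delete share access at unlink, but only physically delete the file at close" idea from the exchange above can be sketched like this. It is a simplified model under stated assumptions, not how Samba or gluster actually implement it: the class `TrackedFile` and its methods are hypothetical, each open handle carries only a single "shares delete" flag, and a POSIX-style unlink is treated as a request that must pass the same check.

```python
import itertools


class TrackedFile:
    """Toy model of one file with delete-on-close semantics."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.handles = {}            # handle id -> opener allows delete?
        self.delete_pending = False
        self.exists = True

    def open(self, share_delete: bool) -> int:
        # No new opens once the file is gone or marked for deletion.
        if not self.exists or self.delete_pending:
            raise FileNotFoundError("file is gone or being deleted")
        handle = next(self._ids)
        self.handles[handle] = share_delete
        return handle

    def unlink(self) -> bool:
        # Refuse if any current opener did not grant delete sharing.
        if not all(self.handles.values()):
            return False
        if self.handles:
            self.delete_pending = True   # defer removal to the last close
        else:
            self.exists = False          # nobody has it open: remove now
        return True

    def close(self, handle: int) -> None:
        del self.handles[handle]
        # The physical delete happens when the last handle goes away.
        if self.delete_pending and not self.handles:
            self.exists = False
            self.delete_pending = False
```

Here `unlink` reports success as soon as the share check passes, which is one possible answer to the "does it return success?" question; the semantics a real Windows server exposes would still need to be checked, as noted above.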
15:27:49 <xavih> well, posix does not work this way, but we could make gluster accept both of them...
15:28:05 <ira> Well, there's spots where they both conflict.
15:28:15 <ira> Like unlink.
15:28:42 <ira> So you need to know how you are going to pick "winning" semantics.
15:29:13 <xavih> I've been thinking on combinations of both, and I think they could work together, only with some minor issues
15:29:40 <xavih> I must see how to include this delete share mode
15:29:53 <ira> Yes, but those issues, are still real.  Most NAS vendors end up with a switch to enforce the windows side modes more strongly.
15:30:13 <ira> It isn't that bad... At least I don't think so.
15:30:42 <ira> The real issue is the "two layers".  There's the whole file layer, and then the byte range layer.
15:30:48 <xavih> maybe I've overlooked something... I'll send you more detailed information by mail to see if it could be correct
15:30:55 <ira> ok.
15:31:08 <xavih> but posix also has byte range locks, right ?
15:31:30 <xavih> the only difference is that it uses advisory locking and windows uses mandatory locking
15:31:39 <ira> That's a big difference.
15:31:53 <ira> Also, we'll want the ability to cache ranges natively for HPC type workloads.
15:32:00 <xavih> yes, but they may be combined in some way
15:32:26 <ira> Maybe?  Also do you enforce a windows BRL on a posix client?
15:32:28 <xavih> yes, I think that sharing modes, range locking and caching can be implemented as different layers in gluster
15:32:50 <ira> yeah, then we have to ask "are we adding too many translators?" ;)
15:33:10 <xavih> well, it can be combined if you want
15:33:42 <xavih> I always prefer to do more in fewer translators, but this is somewhat against gluster philosophy
15:33:46 <ira> I'm not sure... these are all things I haven't thought through.  TBH, for share modes, I was thinking of embedding a translator into samba to talk to CTDB.
15:34:24 <ira> But that's very "I'm thinking." not "We're doing." :)
15:34:38 <xavih> about enforcing BRL (Byte Range Lock ?) to posix client, yes, in some way
15:34:59 <xavih> hehe
15:35:01 <ira> yeah. and when you get into all of that... Samba already does most of it internally... ;)
15:35:24 <ira> so, it should be api calls... and a few other things... and we play along nicely, with what samba wants to enforce.
15:35:57 <xavih> I'm not sure that binding gluster and samba so internally would be good...
15:36:20 <ira> Well, for things that are truly samba based: Share modes.  I have less issues with it.
15:36:28 <ira> Leases/Oplocks gets a bit fuzzy.
15:37:13 <ira> Especially at the directory level.
15:37:14 <xavih> wouldn't it be more useful if gluster offered interfaces for sharing modes and byte range locking that samba could use, instead of making it depend on samba ?
15:37:38 <xavih> this way other clients using gfapi could take advantage of that without depending on samba
15:38:08 <xavih> even linux based applications could use mandatory locking and other features using gfapi ig they want
15:38:18 <xavih> s/ig/if/
15:38:31 <ira> Sure... if we want to spend the time to write all of it in translators, and get the DBs etc right...
15:39:28 <ira> I don't expect the samba side would be that bad.
15:39:35 <xavih> well, I'll send you some email to discuss this, but it seemed to me that the logic is not very complicated...
15:39:43 <xavih> no, I don't say that samba is bad
15:40:15 <ira> It isn't as much the core logic.  It is the databases, and the consistency...
15:40:29 <xavih> what I say is that mixing a translator with samba to handle sharing modes would add a big dependency for scenarios that may not need it
15:40:50 <xavih> well, the consistency is another problem
15:41:25 <xavih> I've also been thinking about how to manage consistency between bricks (including keeping share modes, locking and caching synchronized)
15:41:57 <xavih> anyway this mechanism is needed by gluster for many other tasks, so it's a needed requisite anyway
15:43:08 <ira> For some of this, yes.  For other parts, no.  Depends on the goals.  If the goal is to implement clean semantics for SMB, I'm thinking not.  If it is to do MESI range coherency... yep, we'll need some distributed database.
15:43:40 <ira> no matter how we couch it.
15:43:54 <xavih> well, the distributed database could be the bricks themselves
15:44:14 <ira> Ok...
15:44:21 <xavih> all bricks storing a copy or a fragment of a file will need to maintain its state
15:44:28 <ira> Sure.
15:45:10 <xavih> currently we use inodelk and entrylk to keep this "database" synchronized
15:45:42 <xavih> we can extend it to also track information about sharing, locking and caching
15:49:19 <ira> hagarth: I'm not as familiar with the code as I want to be here... thoughts?
15:49:58 <hagarth> reading through - have been attending two IRC meetings simultaneously
15:50:16 <hagarth> and have been realizing that it is not a great idea ;)
15:52:04 <ira> Ok, so dumping the DB under the mutexes on the objects in memory.
15:53:31 <hagarth> xavih: the proposal is to store this state information like changelogs in afr?
15:55:03 <xavih> I don't have much knowledge about changelog yet but, conceptually, having changelogs is very important to recover in some circumstances
15:55:44 <xavih> I think that all information about sharing, locking and caching can be stored in memory (changelogs are used to recover information in case of failures)
15:56:14 <xavih> if all nodes die, it's irrelevant to have this information after restart (I think)
15:56:35 <ira> Depends on the recovery time... but "likely".
15:56:54 <ira> If they can recover in 60s... it matters.
15:57:15 <ira> So keeping it across a restart of a glusterd, is important.
15:57:34 <ira> (Even if it is sent from the other node... we'll need it.)
15:57:39 <xavih> well, what can be done is something similar to what is done for file descriptors
15:57:57 <xavih> when a brick restarts and a client reconnects, the client tries to reopen all previously opened files
15:58:13 <ira> Sure....
15:58:16 <xavih> we can do the same with current locks
15:58:28 <ira> You need a bit more logic...
15:58:34 <ira> But the concept isn't invalid.
15:58:56 <ira> You'll need some key so you know the client involved...
15:59:09 <ira> Because I don't want to give A's lock to B...
15:59:32 <ira> At least if I understand you.
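The reclaim-after-restart idea above, including the point that a lock must be keyed by its owner so that A's lock is never handed to B, can be sketched as follows. This is only an illustration under assumptions: `BrickLockTable` and `Client` are hypothetical names, all locks are exclusive byte ranges with an exclusive end offset, and a brick restart is modeled as a fresh, empty in-memory table.

```python
class BrickLockTable:
    """In-memory lock state held by one brick; lost when the brick restarts."""

    def __init__(self):
        self.locks = []  # (client_id, start, end) with end exclusive

    def acquire(self, client_id: str, start: int, end: int) -> bool:
        # Reject ranges that overlap a lock owned by a different client.
        for owner, s, e in self.locks:
            if owner != client_id and start < e and s < end:
                return False
        self.locks.append((client_id, start, end))
        return True


class Client:
    """Remembers its own locks so it can replay them after a reconnect."""

    def __init__(self, client_id: str):
        self.client_id = client_id
        self.held = []  # (start, end) ranges this client believes it owns

    def lock(self, brick: BrickLockTable, start: int, end: int) -> bool:
        granted = brick.acquire(self.client_id, start, end)
        if granted:
            self.held.append((start, end))
        return granted

    def reclaim(self, brick: BrickLockTable) -> list:
        """Replay held locks on a restarted brick; return the ranges lost."""
        return [r for r in self.held
                if not brick.acquire(self.client_id, *r)]
```

The sketch deliberately leaves out the "bit more logic" mentioned above, such as a grace period during which only reclaims are accepted so that a new lock from B cannot slip in before A has reconnected.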
15:59:49 <hagarth> we have partial code for lock healing
16:00:16 <hagarth> to address cases where a node restarts after a client holds a lock
16:00:35 <hagarth> we can potentially clean that up to make it work better
16:00:36 <ira> Ok...
16:01:07 <ira> Likely we'll need something to keep the bricks involved with the knowledge of file "X", in sync.
16:01:20 <ira> And deal with all the side issues there in.
16:01:37 <ira> jdarcy's work in nsr may be useful here.
16:01:58 <hagarth> ira: potentially yes
16:02:10 <ira> It'll at least give us "who is in control."
16:02:14 <xavih> yes, of course, independently of the state of the file, the locking system in gluster must have a way to say if two replicas or fragments of a file are from the same version
16:02:39 <xavih> well, I think this could be a problem
16:02:47 <ira> xavih: Split brain?
16:03:29 <xavih> if we allow afr (or some other client side translator) to control this, we can reach a situation in which two clients will consider different leaders
16:03:44 <xavih> this would need a way to synchronize clients, which I think is quite complex
16:03:55 <ira> I always saw this server side ;)
16:04:03 <xavih> yes, me too
16:04:15 <ira> Which is why nsr comes to mind.
16:04:20 <xavih> clients need to cooperate in some way, but the real control should be in the server side
16:04:45 <ira> Right.  Ideally mapped to the bricks involved with the actual operations.
16:05:00 <xavih> I don't know very much how nsr works... will have to look at it
16:05:28 <xavih> yes
16:05:35 <ira> the things I want from there will be "Quorum" and a leader election.
16:05:50 <ira> Those are not pieces of code to duplicate ;)
16:06:14 * lala_afk will read the log later :)
16:06:17 <ira> And I don't know how much jdarcy has done there in full.
16:06:18 <xavih> I'm not very sure about the benefits of having a leader, but I don't know the details
16:06:47 <xavih> obviously all these things should be implemented in libraries to be shared by all translators
16:06:49 <ira> xavih: A leader can be a great help when trying to reconstruct state.
16:07:06 <xavih> but also have some complications
16:07:11 <ira> Sure...
16:07:28 <ira> Especially in an AP system... leaders are more a C thing ;)
16:08:47 <hagarth> there are complications both ways - leader election can get complex on this side, split brains on the other side
16:09:00 <ira> Yep.
16:09:01 <xavih> I'll have to see how the leader election is performed in nsr and which tasks that leader takes on before I can say anything about it...
16:09:36 <hagarth> xavih: jdarcy can probably explain that in the next meeting
16:09:41 <ira> Another issue will be performance.  This will be hit every open.
16:10:13 <ira> CTDB ducks that a bit.  And I may be able to help us duck it more... we'll see.
16:10:23 <hagarth> ira: right
16:11:00 <ira> But if we take a network round trip every call on a file you have open to a remote DB... that's lethal.
16:11:32 <xavih> I don't see much we can do with opens if we need to track sharing, locking and caching...
16:11:35 <ira> I think we'll be ok, if we tie it to the bricks, holding the file, though with tiering that may be fun.
16:11:48 <xavih> another thing is reads/writes on the opened file
16:12:22 <xavih> tiering can be an extension of the caching/locking implementation
16:12:49 <ira> tiering is data movement...
16:12:59 <ira> Things move within the dht...
16:13:18 <ira> Or are we going to have special parts of the dht for different service classes?
16:13:37 <ira> I haven't looked at those notes in a bit. :/
16:13:49 <xavih> I thought tiering was having several levels of storage, some slower but with more capacity for seldom used data and some faster for day to day use...
16:13:59 <hagarth> ira: things could move across dhts too
16:14:13 <ira> Right so this locking state would need to move...
16:14:17 <xavih> isn't this data-classification which is more than simply tiering ?
16:14:30 <ira> yes, I used the wrong word.
16:15:00 <xavih> of course, the state of the file must be moved with the file
16:15:32 <hagarth> xavih: right
16:15:39 * ira nods...
16:16:19 <xavih> well, I'll try to rebuild my ideas and I'll send an email to explain what is my point of view and see if it can be valid or have some problems
16:16:30 <hagarth> xavih: sounds like a good idea
16:16:35 * ira nods.
16:17:02 <xavih> there are some things I didn't think about
16:17:15 <xavih> I'll see if I can integrate all of them in a consistent way
16:17:38 <ira> Cool, I'm interested to see what you come up with.
16:17:54 <xavih> with a mess, sure ... :D
16:18:09 <hagarth> let us convene a meeting after we hash some discussions over email
16:18:17 <hagarth> #endmeeting