13:06:01 <hagarth_> #startmeeting
13:06:01 <zodbot> Meeting started Fri Dec 19 13:06:01 2014 UTC.  The chair is hagarth_. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:06:01 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
13:06:22 <jdarcy> Let's start with http://www.gluster.org/community/documentation/index.php/Features/stat-xattr-cache
13:06:46 <jdarcy> Seems straightforward.  Is it something we could consider for earlier, e.g. 3.7?
13:07:00 <bene> there seems to be some disagreement on this topic; I've heard some people suggest an md-cache enhancement, but ...
13:07:03 <hagarth_> Yes, I can take this up. We could even just load md-cache on the server.
13:07:35 <hagarth_> Anand S is also interested in this. Let us aim this enhancement for 3.7.
13:07:40 <bene> But would that really intercept all the xattr and stat calls?  Some of these are in the POSIX translator below md-cache, yes?
13:08:03 <jdarcy> Yes, I would say most of our xattr accesses are internal.
13:08:05 <ndevos> how about negative cache for xattrs, or missing files - that can be done safely server-side
13:08:17 <hagarth_> bene: yes, all xattr and stat calls can be handled by md-cache sitting above posix.
13:08:29 <jdarcy> Also, we can cache more aggressively on the server where we don't have a consistency/currency problem.
13:08:46 <hagarth_> yes, we could have infinite cache-timeout on the server.
13:08:59 <ndevos> and cache by handle/gfid too
13:09:00 <bene> Yes!  The key thing is to do llistxattr first; this will prevent a lot of xattr calls that aren't needed
13:09:21 <hagarth_> ndevos: -ve cache on the server side looks quite appealing.
13:10:00 <hagarth_> bene: as part of lookup, we typically do llistxattr.
13:10:13 <ndevos> hagarth_: yes, I would really like to see that - missing xattrs is pretty common
13:10:23 <jdarcy> Definitely appealing.  Should we add negative-lookup caching to the feature page?
13:10:31 <hagarth_> jdarcy: +1 to that
13:10:42 <ndevos> or, a stat() on a handle/file that just was removed *cough*
13:10:43 <bene> I didn't call it that but negative-lookup caching is what I meant
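A minimal illustration of the llistxattr-first pattern bene describes above (standard POSIX xattr calls, nothing gluster-specific): enumerate the xattrs that actually exist once, then fetch only those, instead of probing each well-known key and paying a failed ENODATA call per missing xattr. Listing first is itself a cheap negative result for absent xattrs.

    #include <stdio.h>
    #include <string.h>
    #include <sys/xattr.h>

    int main(int argc, char **argv)
    {
        char names[4096], value[256];
        if (argc < 2)
            return 1;
        /* One call enumerates every xattr present on the file... */
        ssize_t len = llistxattr(argv[1], names, sizeof(names));
        for (char *n = names; len > 0 && n < names + len; n += strlen(n) + 1) {
            /* ...so we only fetch keys that exist, never probing
             * well-known keys that would just return ENODATA. */
            ssize_t vlen = lgetxattr(argv[1], n, value, sizeof(value));
            if (vlen >= 0)
                printf("%s (%zd bytes)\n", n, vlen);
        }
        return 0;
    }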
13:11:17 <bene> server-side caching also means that self-heal and other daemons can take advantage of it, yes?
13:11:34 <jdarcy> #action jdarcy to add negative-lookup caching to stat/xattr cache feature page (unless Ben beats him to it)
13:12:34 <jdarcy> bene: Definitely.  They could even bypass other cruft to get at the info, though of course that has to be done carefully.
13:12:46 <bene> so rather than changing posix translator, we just move md-cache down to just above posix translator?
13:13:04 <hagarth_> bene: yes, md-cache will be present in both the client & server stacks
13:13:23 <hagarth_> md-cache on the client will avoid network hops; on the server it will prevent disk seeks.
13:13:54 <bene> does md-cache have a negative lookup feature?  I don't see how it could client-side, because some other client can change metadata
13:13:57 <jdarcy> There's probably a *little* more than that, at least tweaking options/actions to suit that use case, but basically yeah.
13:14:21 <jdarcy> #action jdarcy to look into whether md-cache has negative-lookup functionality
13:14:47 <hagarth_> bene: no -ve lookup capability in md-cache as yet.
13:14:56 <jdarcy> Well, that was a quick AI.
13:15:08 <hagarth_> :)
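For the record, a minimal sketch of what loading md-cache in the brick stack could look like in a volfile, sitting just above storage/posix as discussed. The translator and its md-cache-timeout option exist today on the client side; the brick-side placement and the long timeout are the proposal here, and the stock option has historically been capped at 60 seconds, so an effectively infinite server-side timeout would need an option change.

    volume brick1-md-cache
        type performance/md-cache
        # assumption: a relaxed cap -- safe server-side, where there is
        # no cross-client consistency problem
        option md-cache-timeout 600
        subvolumes brick1-posix
    end-volume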
13:15:34 <jdarcy> In the interests of time, let's move on to http://www.gluster.org/community/documentation/index.php/Features/composite-operations
13:16:46 <jdarcy> Three specific composite ops are mentioned there - readdirplus enhancements (covered by stat/xattr cache?), lockless create, and create-and-write
13:17:06 <bene> So Jeff, you mentioned dentry injection, but I proposed something a little different - allow READDIRPLUS to return xattr info, or at least existence of xattr info.  Make sense?
13:17:27 <jdarcy> So (1) can any of these reasonably be pulled forward, and (2) are there more?
13:17:42 <ndevos> yes, I think that makes sense - readdirplus replies including requested xattrs
13:17:56 <jdarcy> bene: I think both have their own separate value.
13:18:46 <jdarcy> IIRC, we already pre-fetch SELinux/ACL xattrs (in md-cache)?  Do we need to make that more general?
13:19:07 <bene> but how many round trips to do it?
13:19:15 <ndevos> I think Samba would benefit from other xattrs too
13:19:32 <hagarth_> I think we pick up all xattrs as part of readdirplus
13:20:02 * hagarth_ checks posix_readdirp_fill()
13:20:11 <bene> I re-read dentry injection and I get it now: you're talking about caching them in the kernel, whereas I was talking about how to get them across the network quickly
13:20:22 * ndevos thought it would only return the requested xattrs, but can be wrong
13:20:31 <jdarcy> We're already doing it as part of lookup/readdir/readdirp in md-cache (mdc_load_reqs)
13:20:44 <jdarcy> bene: Correct.
13:21:35 <jdarcy> Dentry injection is not going to be easy, because part of it has to be in kernel (FUSE at least), but I think it's still valuable.
13:22:12 <ndevos> sorry, whats the "dentry injection"?
13:22:28 <hagarth_> ndevos: you are right about requested xattrs
13:22:39 <jdarcy> ndevos: Pre-emptively pushing entries into the kernel from the glusterfs process, so we don't incur a context switch when they're requested.
13:22:48 <bene> http://www.gluster.org/community/documentation/index.php/Features/Feature_Smallfile_Perf#dentry_injection
13:23:19 <ndevos> ah, okay, yeah, that makes sense
13:23:45 <jdarcy> We already pre-fetch into client (user space) memory, but IIRC for some workloads the context switches are killing us.
13:24:39 <bene> IMHO the bigger problem of the two is avoiding network round trips; the path between glusterfs and the application is a lot shorter. I would like to have both features, but if I could only choose one, I would probably side with avoiding the round-trip-per-xattr-per-file
13:24:51 * jdarcy nods.
13:25:06 <hagarth_> bene: right
13:25:43 <jdarcy> So could/should readdirplus extensions be considered for 3.7 instead of 4.0?
13:26:24 <hagarth_> jdarcy: good to consider for 3.7 I feel.
13:26:57 <jdarcy> OK.
13:27:06 <ndevos> yes, 3.7 should be possible. readdirp() itself does not need changes, right? only the callers of readdirp(), or the filling of the structure
13:27:21 <bene> so where are the changes?  It appears the protocol sort of supports it, but libgfapi does not
13:27:47 <jdarcy> Right, we need to add options and/or gfapi support.
13:28:08 <ndevos> bene: what would you expect to see as function in gfapi?
13:28:13 <jdarcy> Options would be easier, but not as flexible.
13:28:40 <bene> If the translator API supported it, each layer could add in the xattrs that it needed on the way down, make sense?  I'm out of my league here ;-)
13:29:10 <hagarth_> bene: yes, that's how it is implemented today :).
13:29:56 <bene> oops, my bad.  So can we just extend the existing libgfapi call or do we need readdirplusplus ;-) ?
13:30:04 <jdarcy> There are issues to be worked out there, but AFAICT they're not large enough to affect which release(s) we should consider.
13:30:14 <ndevos> ah, right! I was thinking about adding a virtual-xattr configuration option for md-cache - instruct md-cache to fetch certain attributes and cache them by calling setxattr("gluster.md-cache.bla", ...)
13:31:05 <ndevos> bene: a readdirplusplus() would be an option, but I wonder what structure it should return...
13:31:46 <bene> tag-length-value list in addition to traditional stat structure?
13:31:48 <jdarcy> Reminder: this is not a design meeting.  ;)
13:31:52 <hagarth_> ndevos: looks like a good interface that can be considered for later.
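Purely to make the shape of this exchange concrete, a hypothetical interface sketch in the spirit of bene's tag-length-value suggestion; nothing like this exists in libgfapi, and every name below is invented for illustration:

    #include <dirent.h>
    #include <sys/stat.h>
    #include <glusterfs/api/glfs.h>

    /* Hypothetical -- sketch only.  Per-entry result: the usual
     * readdirplus payload plus a TLV list of whichever requested
     * xattrs exist; an absent key doubles as a cheap negative result. */
    struct glfs_xattr_tlv {
        const char *name;   /* tag    */
        size_t      len;    /* length */
        const void *value;  /* value  */
    };

    struct glfs_xdirent {
        struct dirent          d_ent;     /* name/type, as today */
        struct stat            d_stat;    /* as readdirplus fills now */
        struct glfs_xattr_tlv *d_xattrs;  /* requested xattrs found */
        int                    d_nxattrs;
    };

    /* Caller names the wanted xattrs up front; one RPC returns a
     * chunk of entries with stat data and xattrs together. */
    int glfs_readdirplusplus(glfs_fd_t *dirfd,
                             const char *const *xattr_names, int n_names,
                             struct glfs_xdirent *ents, int n_ents);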
13:32:15 <jdarcy> How about create-and-write?
13:32:26 <bene> never mind... I'm happy that we are reaching consensus on what must be done.
13:32:38 <bene> yes let's discuss create-and-write please
13:33:06 <jdarcy> This would primarily be for stuff that comes in through gfapi, right?  Swift, SMB, etc.?
13:33:44 <hagarth_> yeah, looks like a gfapi only feature.
13:34:10 <bene> at first, yes.  But I wonder if it might be possible to eventually do it in glusterfs FUSE, more on that later
13:34:45 <jdarcy> We used to have a feature that would let you PUT an entire file as a virtual xattr.  Can't remember the name.
13:35:21 <hagarth_> jdarcy: yeah, I remember the PUT/GET interface. I think we would be confined by the xattr value length .. is that 4k?
13:35:30 <jdarcy> Something like that.
13:35:45 <jdarcy> Do we see any benefit to an equivalent GET?
13:36:02 <bene> I was hoping we could specify an object up to 1 RPC's worth of data
13:36:15 <hagarth_> we could cut down on open + close over the network.
13:36:41 <jdarcy> bene: Through GFAPI we probably could.  Through FUSE etc. we'd be bound by xattr limits.
13:36:45 <bene> Much more than that.  If the API specifies xattrs, again each layer can add their own in on the way down the stack
13:36:56 <jdarcy> (though we could use multiple xattrs....)
13:37:41 <jdarcy> This one seems a bit trickier than the others we've discussed.
13:38:12 <hagarth_> value length max seems to be 64K
13:38:14 <jdarcy> It's not *super* hard, but we'd have to modify multiple layers.  Perhaps some coordination issues too?
13:39:16 <hagarth_> yeah, there can be some side effects which need to be thought through.
13:39:20 <jdarcy> Anything involving CREATE tends to get a bit complicated in DHT.  Now we'd have another call feeding into that complexity.
13:39:57 <jdarcy> But the impact might be huge.  Any thoughts on *how* huge, Ben?
13:40:33 <ndevos> would this not be similar to NFSv4 compound calls? why not add a COMPOUND fop that encapsulates other fops?
13:41:03 <bene> I had thought about that - this too would work.  SMB ANDX is another example
13:41:33 <jdarcy> ndevos: Doing that in a fully general way would require a whole new subsystem to manage concurrency and atomicity.
13:41:45 <bene> so the idea is you chain requests together in a single message, and each request is conditional on the preceding requests succeeding.  Error handling is more complicated.
13:42:19 <jdarcy> What happens if a request in the middle fails?  Are the NFSv4/SMB3 semantics even compatible?
13:42:20 <bene> That's why I proposed what I did - I thought a CREATE-AND-WRITE FOP would be simpler to implement than generalized compound operations
13:42:42 <bene> But what do you all think?
13:42:54 <hagarth_> might be a good idea to look into how NFS/SMB handle compound operations.
13:43:33 <hagarth_> bene: we can possibly give it a try, only through the API interface for now.
13:43:44 <bene> Are there other areas where we need to combine operations?  I'm not aware of any
13:43:54 <jdarcy> bene: I definitely think that addressing specific cases is better for now.  General compound-operation support might be biting off more than we can chew.
13:44:06 <jdarcy> People think 4.0 is overcrowded already, and they're probably right.
13:44:12 <ndevos> I think a COMPOUND(CREATE, WRITE) would be a nice start, followed by COMPOUND(LOOKUP, OPEN, READ) ?
13:44:25 <jdarcy> ndevos: +1
13:44:39 <hagarth_> +1 to that
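To put the proposed win in concrete terms: through today's gfapi, a small-file PUT costs at least three fops on the wire, which a compound CREATE+WRITE would collapse into one. The first half of this sketch uses real libgfapi calls; the compound call after it is hypothetical, invented here to illustrate bene's proposal:

    #include <fcntl.h>
    #include <glusterfs/api/glfs.h>

    /* Today: three separate network operations for one small file
     * (CREATE, WRITE, then FLUSH/RELEASE on close). */
    void put_small_file_today(glfs_t *fs, const char *path,
                              const void *buf, size_t len)
    {
        glfs_fd_t *fd = glfs_creat(fs, path, O_WRONLY, 0644);
        if (!fd)
            return;
        glfs_write(fd, buf, len, 0);
        glfs_close(fd);
    }

    /* Hypothetical compound (sketch only, not in libgfapi): one RPC
     * carrying mode and data, the WRITE implicitly conditional on the
     * CREATE succeeding -- the CREATE-AND-WRITE fop discussed above. */
    int glfs_creat_write(glfs_t *fs, const char *path, mode_t mode,
                         const void *buf, size_t len);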
13:44:48 <bene> how does compound OPEN-READ differ from quick-read?
13:44:52 <jdarcy> Actually I'm not sure about the "followed" part.  They might go concurrently.
13:45:14 <ndevos> ah, I don't know how quick-read works :)
13:46:30 <bene> I thought data was returned in the OPEN call, anyone familiar with that?  Btw, small-file read performance with quick-read is pretty good IMHO
13:46:41 <jdarcy> Maybe it's just implicit vs. explicit.
13:47:19 <hagarth_> quick-read caches content during lookups. I think we are limited to 128K with quick-reads.
13:47:36 <jdarcy> Right now, qr has to be turned off a lot because it's implemented as a sort of cache and that's not always correct.
13:48:12 <bene> same principles apply to OPEN+READ, but CREATE+WRITE is a much bigger win
13:48:29 <hagarth_> right, we need better invalidation support. Maybe the upcalls infra in 3.7 can be utilized for making quick-read more consistent.
13:48:35 <jdarcy> Partly because CREATE is so much worse than OPEN.
13:49:04 <jdarcy> Actually, quick-read as it exists today might *go away* once we have real caching support with invalidation etc.
13:49:24 <jdarcy> Ditto write-behind and io-cache.  Need to think about that.
13:49:28 <hagarth_> jdarcy: yeah
13:50:02 <hagarth_> in terms of timelines, shall we consider better caching as a theme beyond 3.7?
13:50:21 <jdarcy> Definitely.  Since this relates closely to that, I'd also say keep it further out in 4.0
13:50:35 <hagarth_> jdarcy: agree
13:50:48 <jdarcy> BTW, talking to Ira yesterday, I came up with the idea of a 4.-1 (four point minus-one) for stuff that *might* be usable before 4.0
13:51:04 <hagarth_> jdarcy: not 3.x but 4.-1?
13:51:24 <jdarcy> hagarth_: Right.  4.x development stream, but before 4.0
13:51:27 <bene> sadly, I agree.  This is not a trivial change.  But I still think it's the most important one in terms of round trips.
13:51:51 <bene> can we discuss lookup-unhashed=auto?  That could happen sooner than V4, right?
13:52:02 <jdarcy> bene: Yes, good idea.
13:52:24 <jdarcy> I think we need to get this (or something like it) unstuck for 3.7.  What do you think, hagarth_?
13:52:43 <hagarth_> jdarcy: +1 to that
13:53:03 <jdarcy> bene: Can you remember what the measured performance gains were?
13:53:14 <hagarth_> small file performance is not going to improve without lookup-unhashed=auto & readdir performance improvement.
13:53:28 <hagarth_> so, I am all for improvements in these areas for 3.7.
13:53:34 <bene> Without this change, we have negative scaling for small-file creates, because LOOKUP is done to every brick prior to CREATE
13:54:11 <bene> with this change, we don't have perfectly linear scaling but it's not bad, see 3rd graph in description at: https://s3.amazonaws.com/ben.england/small-file-perf-feature-page.pdf
13:54:43 <jdarcy> Right.  That's pretty huge.  Thanks!
13:55:00 <jdarcy> #action jdarcy to refresh lookup-unhashed=auto patch, get it moving again for 3.7
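For anyone wanting to experiment ahead of the refreshed patch: the on/off form of the option already exists, and "auto" is what the stalled patch adds. The volume name below is hypothetical, and the exact option name and value semantics are version-dependent, so treat this as a sketch:

    # existing: "on" broadcasts a LOOKUP to every brick before CREATE;
    # "off" skips the broadcast, but is unsafe while rebalance is
    # moving the layout around
    gluster volume set myvol cluster.lookup-unhashed off
    # proposed: "auto" skips the broadcast only when DHT can tell the
    # directory layout is stable
    gluster volume set myvol cluster.lookup-unhashed auto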
13:55:29 <jdarcy> Any other smallfile-related changes we should discuss?
13:56:04 <hagarth_> I have some thoughts on readdir, maybe we should discuss that in the next session.
13:56:23 <jdarcy> OK by me.
13:56:31 <hagarth_> readdir has a bunch of problems that need to be sorted out.
13:56:45 <bene> So I get the impression that migrating .glusterfs to SSD is not going to be considered, how are we going to speed up metadata writes?
13:57:13 <bene> does cache tiering handle this somehow?
13:57:31 <jdarcy> No, I don't think tiering/DC handle it.
13:57:53 <jdarcy> I don't think there's any resistance to the idea, just it hasn't been at the front of the list.
13:58:26 <hagarth_> bene: yeah, I am all for it if we can get some help implementing it. Need to check where we can get that assistance from.
13:58:28 <bene> there are some legitimate concerns about it - what if SSD fails?
13:58:31 <jdarcy> Partly that might be because it's something you can do today, though it'd be hacky.
13:59:05 <hagarth_> bene: we would also need to snap SSD along with the data bricks as part of a snapshot operation.
13:59:44 <bene> replication should protect against SSD failure.  I forgot about snapshotting, ouch.
13:59:51 <jdarcy> Are hybrid drives or dm-cache relevant here?
14:00:26 <bene> my problem with dm-cache so far is that it doesn't accelerate initial writes and that's where we need the help, but...
14:00:54 <bene> Mike Snitzer and others have suggested that dm-cache can be tuned to favor writes over reads (promotion threshold)
14:01:42 <bene> Maybe we just haven't tried hard enough with it.
14:02:07 <jdarcy> Is it worth it for us to add separate support for putting .glusterfs on a different FS (e.g. solving the snapshot problem)?
14:02:33 <jdarcy> BTW, this same issue comes up with databases and logs that have been proposed for various features, whether they live in .glusterfs or elsewhere.
14:02:38 <hagarth_> jdarcy: something like btrfs?
14:02:42 <mpillai> brick configuration will be a major pain with dm-thin+dm-cache. dm-thin will have to be on top, I think, for snapshots
14:03:25 <jdarcy> hagarth_: Well, that's another possibility.  I was just thinking about a different mountpoint/volume, not necessarily of a different type.
14:03:33 <hagarth_> jdarcy: ah ok
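A rough sketch of the hacky do-it-today approach jdarcy alludes to: mount a separate SSD-backed filesystem at the brick's .glusterfs path. Device names and paths here are hypothetical, and the snapshot problem raised above is explicitly left unsolved:

    mount /dev/sda1 /bricks/b1                 # brick data on the spinning disk
    mkfs.xfs /dev/ssd1                         # small SSD for the metadata tree
    mkdir -p /bricks/b1/.glusterfs
    mount /dev/ssd1 /bricks/b1/.glusterfs      # .glusterfs now lives on flash
    # caveats: the SSD needs the same replication/failure protection as
    # the brick, and brick snapshots no longer capture .glusterfs atomically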
14:03:42 * jdarcy writes a proposal to use ZFS for this.
14:04:10 <bene> is btrfs stable yet?
14:04:46 <jdarcy> AFAICT no, and not on a trajectory to become so.
14:04:49 <hagarth_> bene: not yet unfortunately
14:05:47 <jdarcy> Since we're over time, I think we'll have to defer brainstorming on this one to email.
14:06:11 <bene> our time is up, but thanks to everyone who participated, I think it's been productive, let's talk again about changes that are relevant to gluster 4.0 later
14:06:24 <jdarcy> Sounds good to me.  Thanks, everyone!
14:06:30 <hagarth_> thanks all!
14:06:51 * jdarcy puts a squeaky toy under hagarth_'s gavel.
14:07:02 <hagarth_> #endmeeting