13:06:01 <hagarth_> #startmeeting
13:06:01 <zodbot> Meeting started Fri Dec 19 13:06:01 2014 UTC. The chair is hagarth_. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:06:01 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
13:06:22 <jdarcy> Let's start with http://www.gluster.org/community/documentation/index.php/Features/stat-xattr-cache
13:06:46 <jdarcy> Seems straightforward. Is it something we could consider for earlier, e.g. 3.7?
13:07:00 <bene> there seems to be some disagreement on this topic, I've heard some people suggest md-cache enhancement, but ...
13:07:03 <hagarth_> Yes, I can take this up. We could even just load md-cache on the server.
13:07:35 <hagarth_> Anand S is also interested in this. Let us aim this enhancement for 3.7.
13:07:40 <bene> But would that really intercept all the xattr and stat calls? Some of these are in the POSIX translator below md-cache, yes?
13:08:03 <jdarcy> Yes, I would say most of our xattr accesses are internal.
13:08:05 <ndevos> how about a negative cache for xattrs, or missing files - that can be done safely server-side
13:08:17 <hagarth_> bene: yes, all xattr and stat calls can be handled by md-cache sitting above posix.
13:08:29 <jdarcy> Also, we can cache more aggressively on the server, where we don't have a consistency/currency problem.
13:08:46 <hagarth_> yes, we could have an infinite cache-timeout on the server.
13:08:59 <ndevos> and cache by handle/gfid too
13:09:00 <bene> Yes! The key thing is to do llistxattr first, this will prevent a lot of xattr calls that aren't needed
13:09:21 <hagarth_> ndevos: -ve cache on the server side looks quite appealing.
13:10:00 <hagarth_> bene: as part of lookup, we typically do llistxattr.
13:10:13 <ndevos> hagarth_: yes, I would really like to see that - missing xattrs are pretty common
13:10:23 <jdarcy> Definitely appealing. Should we add negative-lookup caching to the feature page?
13:10:31 <hagarth_> jdarcy: +1 to that
13:10:42 <ndevos> or, a stat() on a handle/file that was just removed *cough*
13:10:43 <bene> I didn't call it that, but negative-lookup caching is what I meant
13:11:17 <bene> server-side caching also means that self-heal and other daemons can take advantage of it, yes?
13:11:34 <jdarcy> #action jdarcy to add negative-lookup caching to stat/xattr cache feature page (unless Ben beats him to it)
13:12:34 <jdarcy> bene: Definitely. They could even bypass other cruft to get at the info, though of course that has to be done carefully.
13:12:46 <bene> so rather than changing the posix translator, we just move md-cache down to just above the posix translator?
13:13:04 <hagarth_> bene: yes, md-cache will be present in both the client & server stacks
13:13:23 <hagarth_> md-cache on the client will avoid network hops; on the server we will prevent disk seeks.
13:13:54 <bene> does md-cache have a negative lookup feature? I don't see how it could client-side, because some other client can change metadata
13:13:57 <jdarcy> There's probably a *little* more to it than that, at least tweaking options/actions to suit that use case, but basically yeah.
13:14:21 <jdarcy> #action jdarcy to look into whether md-cache has negative-lookup functionality
13:14:47 <hagarth_> bene: no -ve lookup capability in md-cache as yet.
13:14:56 <jdarcy> Well, that was a quick AI.
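For context on the negative-lookup caching idea above, here is a minimal standalone C sketch of the mechanism: remember which (gfid, xattr-name) pairs are known to be absent so a repeat getxattr can fail fast on the server without touching the brick. The names, sizes, and hash are illustrative only, not md-cache internals; a real translator would also invalidate an entry when a setxattr creates the attribute, and with the infinite server-side timeout mentioned above, that would be the only invalidation needed.

    /*
     * Standalone sketch of a server-side negative xattr cache.
     * Everything here is illustrative; this is not md-cache code.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NEG_BUCKETS 1024

    struct neg_entry {
        char key[256];               /* "<gfid>:<xattr-name>" */
        struct neg_entry *next;
    };

    static struct neg_entry *neg_table[NEG_BUCKETS];

    static unsigned neg_hash(const char *key)
    {
        unsigned h = 5381;
        while (*key)
            h = h * 33 + (unsigned char)*key++;
        return h % NEG_BUCKETS;
    }

    /* Record that an xattr is known to be missing on this inode. */
    static void neg_insert(const char *gfid, const char *xattr)
    {
        struct neg_entry *e = calloc(1, sizeof(*e));
        if (!e)
            return;
        snprintf(e->key, sizeof(e->key), "%s:%s", gfid, xattr);
        unsigned b = neg_hash(e->key);
        e->next = neg_table[b];
        neg_table[b] = e;
    }

    /* Return 1 if we already know this xattr does not exist. */
    static int neg_check(const char *gfid, const char *xattr)
    {
        char key[256];
        snprintf(key, sizeof(key), "%s:%s", gfid, xattr);
        struct neg_entry *e;
        for (e = neg_table[neg_hash(key)]; e; e = e->next)
            if (strcmp(e->key, key) == 0)
                return 1;
        return 0;
    }

    int main(void)
    {
        neg_insert("feedface-0000", "user.foo");
        printf("user.foo known-missing? %d\n", neg_check("feedface-0000", "user.foo"));
        printf("user.bar known-missing? %d\n", neg_check("feedface-0000", "user.bar"));
        return 0;
    }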
13:15:08 <hagarth_> :)
13:15:34 <jdarcy> In the interests of time, let's move on to http://www.gluster.org/community/documentation/index.php/Features/composite-operations
13:16:46 <jdarcy> Three specific composite ops are mentioned there - readdirplus enhancements (covered by stat/xattr cache?), lockless create, and create-and-write
13:17:06 <bene> So Jeff, you mentioned dentry injection, but I proposed something a little different - allow READDIRPLUS to return xattr info, or at least the existence of xattr info. Make sense?
13:17:27 <jdarcy> So (1) can any of these reasonably be pulled forward, and (2) are there more?
13:17:42 <ndevos> yes, I think that makes sense - readdirplus replies including requested xattrs
13:17:56 <jdarcy> bene: I think both have their own separate value.
13:18:46 <jdarcy> IIRC, we already pre-fetch SELinux/ACL xattrs (in md-cache)? Do we need to make that more general?
13:19:07 <bene> but how many round trips to do it?
13:19:15 <ndevos> I think Samba would benefit from other xattrs too
13:19:32 <hagarth_> I think we pick up all xattrs as part of readdirplus
13:20:02 * hagarth_ checks posix_readdirp_fill()
13:20:11 <bene> I re-read dentry injection and I get it now, you're talking about caching them in the kernel, whereas I was talking about how to get them across the network quickly
13:20:22 * ndevos thought it would only return the requested xattrs, but can be wrong
13:20:31 <jdarcy> We're already doing it as part of lookup/readdir/readdirp in md-cache (mdc_load_reqs)
13:20:44 <jdarcy> bene: Correct.
13:21:35 <jdarcy> Dentry injection is not going to be easy, because part of it has to be in the kernel (FUSE at least), but I think it's still valuable.
13:22:12 <ndevos> sorry, what's the "dentry injection"?
13:22:28 <hagarth_> ndevos: you are right about requested xattrs
13:22:39 <jdarcy> ndevos: Pre-emptively pushing entries into the kernel from the glusterfs process, so we don't incur a context switch when they're requested.
13:22:48 <bene> http://www.gluster.org/community/documentation/index.php/Features/Feature_Smallfile_Perf#dentry_injection
13:23:19 <ndevos> ah, okay, yeah, that makes sense
13:23:45 <jdarcy> We already pre-fetch into client (user space) memory, but IIRC for some workloads the context switches are killing us.
13:24:39 <bene> IMHO the bigger problem of the two is avoiding network round trips. The path between glusterfs and the application is a lot shorter. I would like to have both features, but if I could only choose one, I would probably side with avoiding a round trip per xattr per file
13:24:51 * jdarcy nods.
13:25:06 <hagarth_> bene: right
13:25:43 <jdarcy> So could/should readdirplus extensions be considered for 3.7 instead of 4.0?
13:26:24 <hagarth_> jdarcy: good to consider for 3.7, I feel.
13:26:57 <jdarcy> OK.
13:27:06 <ndevos> yes, 3.7 should be possible, readdirp() itself does not need changes, right? only the callers of readdirp(), or the filling of the structure
13:27:21 <bene> so where are the changes? It appears the protocol sort of supports it, but libgfapi does not
13:27:47 <jdarcy> Right, we need to add options and/or gfapi support.
13:28:08 <ndevos> bene: what would you expect to see as a function in gfapi?
13:28:13 <jdarcy> Options would be easier, but not as flexible.
13:28:40 <bene> If the translator API supported it, each layer could add in the xattrs that it needed on the way down, make sense? I'm out of my league here ;-)
13:29:10 <hagarth_> bene: yes, that's how it is implemented today :).
13:29:56 <bene> oops, my bad. So can we just extend the existing libgfapi call, or do we need readdirplusplus ;-) ?
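To make the libgfapi question concrete, this is roughly how a gfapi consumer walks a directory with readdirplus today: each entry arrives with its stat pre-filled in one batched round trip, but there is no way to request xattrs per entry, so every xattr still costs a separate getxattr round trip per file. That is the gap the proposed extension would close. The volume name and server below are placeholders, and this is a sketch rather than a recommended pattern.

    /* Build (roughly): gcc list.c -lgfapi */
    #include <stdio.h>
    #include <dirent.h>
    #include <sys/stat.h>
    #include <glusterfs/api/glfs.h>

    int main(void)
    {
        struct glfs *fs = glfs_new("testvol");     /* placeholder volume */
        if (!fs)
            return 1;
        glfs_set_volfile_server(fs, "tcp", "localhost", 24007);
        if (glfs_init(fs) != 0)
            return 1;

        struct glfs_fd *fd = glfs_opendir(fs, "/");
        if (fd) {
            struct dirent de, *result = NULL;
            struct stat st;

            /* stat comes back with each name; xattrs do not. */
            while (glfs_readdirplus_r(fd, &st, &de, &result) == 0 && result)
                printf("%-20s %lld bytes\n", de.d_name,
                       (long long)st.st_size);
            glfs_closedir(fd);
        }
        glfs_fini(fs);
        return 0;
    }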
13:30:04 <jdarcy> There are issues to be worked out there, but AFAICT they're not large enough to affect which release(s) we should consider.
13:30:14 <ndevos> ah, right! I was thinking about adding a virtual-xattr configuration option for md-cache - instruct md-cache to fetch certain attributes and cache them by calling setxattr("gluster.md-cache.bla", ...)
13:31:05 <ndevos> bene: a readdirplusplus() would be an option, but I wonder what structure it should return...
13:31:46 <bene> a tag-length-value list in addition to the traditional stat structure?
13:31:48 <jdarcy> Reminder: this is not a design meeting. ;)
13:31:52 <hagarth_> ndevos: looks like a good interface that can be considered for later.
13:32:15 <jdarcy> How about create-and-write?
13:32:26 <bene> never mind... I'm happy that we are reaching consensus on what must be done.
13:32:38 <bene> yes, let's discuss create-and-write please
13:33:06 <jdarcy> This would primarily be for stuff that comes in through gfapi, right? Swift, SMB, etc.?
13:33:44 <hagarth_> yeah, looks like a gfapi-only feature.
13:34:10 <bene> at first, yes. But I wonder if it might be possible to eventually do it in glusterfs FUSE, more on that later
13:34:45 <jdarcy> We used to have a feature that would let you PUT an entire file as a virtual xattr. Can't remember the name.
13:35:21 <hagarth_> jdarcy: yeah, I remember the PUT/GET interface. I think we would be confined by the xattr value length .. is that 4k?
13:35:30 <jdarcy> Something like that.
13:35:45 <jdarcy> Do we see any benefit to an equivalent GET?
13:36:02 <bene> I was hoping we could specify an object up to 1 RPC's worth of data
13:36:15 <hagarth_> we could cut down on open + close over the network.
13:36:41 <jdarcy> bene: Through GFAPI we probably could. Through FUSE etc. we'd be bound by xattr limits.
13:36:45 <bene> Much more than that. If the API specifies xattrs, again each layer can add its own in on the way down the stack
13:36:56 <jdarcy> (though we could use multiple xattrs....)
13:37:41 <jdarcy> This one seems a bit trickier than the others we've discussed.
13:38:12 <hagarth_> the max value length seems to be 64K
13:38:14 <jdarcy> It's not *super* hard, but we'd have to modify multiple layers. Perhaps some coordination issues too?
13:39:16 <hagarth_> yeah, there can be some side effects which need to be thought through.
13:39:20 <jdarcy> Anything involving CREATE tends to get a bit complicated in DHT. Now we'd have another call feeding into that complexity.
13:39:57 <jdarcy> But the impact might be huge. Any thoughts on *how* huge, Ben?
13:40:33 <ndevos> would this not be similar to NFSv4 compound calls? why not add a COMPOUND fop that encapsulates other fops?
13:41:03 <bene> I had thought about that - this too would work. SMB ANDX is another example
13:41:33 <jdarcy> ndevos: Doing that in a fully general way would require a whole new subsystem to manage concurrency and atomicity.
13:41:45 <bene> so the idea is you chain requests together in a single message, each request conditional on the preceding requests succeeding. Error handling is more complicated.
13:42:19 <jdarcy> What happens if a request in the middle fails? Are the NFSv4/SMB3 semantics even compatible?
13:42:20 <bene> That's why I proposed what I did - I thought a CREATE-AND-WRITE FOP would be simpler to implement than generalized compound operations
13:42:42 <bene> But what do you all think?
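A sketch of the round trips a CREATE-AND-WRITE fop would collapse, in today's gfapi terms. The first function is the current sequence: glfs_creat() and glfs_write() each become separate network operations, with LOOKUPs underneath. The commented-out glfs_creat_write() does not exist anywhere; it only illustrates the interface shape being discussed, with the payload capped at one RPC's worth of data as suggested above.

    #include <fcntl.h>
    #include <glusterfs/api/glfs.h>

    /* Today: create, then write, then close; several round trips. */
    int put_small_file_today(struct glfs *fs, const char *path,
                             const void *buf, size_t len)
    {
        struct glfs_fd *fd = glfs_creat(fs, path, O_WRONLY, 0644);
        if (!fd)
            return -1;
        ssize_t ret = glfs_write(fd, buf, len, 0);
        glfs_close(fd);
        return ret < 0 ? -1 : 0;
    }

    /* Hypothetical single-message variant (illustration only):
     *
     * int glfs_creat_write(struct glfs *fs, const char *path,
     *                      mode_t mode, const void *buf, size_t len);
     */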
13:42:54 <hagarth_> might be a good idea to look into how NFS/SMB handle compound operations.
13:43:33 <hagarth_> bene: we can possibly give it a try, only through the API interface for now.
13:43:44 <bene> Are there other areas where we need to combine operations? I'm not aware of any
13:43:54 <jdarcy> bene: I definitely think that addressing specific cases is better for now. General compound-operation support might be biting off more than we can chew.
13:44:06 <jdarcy> People think 4.0 is overcrowded already, and they're probably right.
13:44:12 <ndevos> I think a COMPOUND(CREATE, WRITE) would be a nice start, followed by COMPOUND(LOOKUP, OPEN, READ)?
13:44:25 <jdarcy> ndevos: +1
13:44:39 <hagarth_> +1 to that
13:44:48 <bene> how does a compound OPEN-READ differ from quick-read?
13:44:52 <jdarcy> Actually I'm not sure about the "followed" part. They might go concurrently.
13:45:14 <ndevos> ah, I don't know how quick-read works :)
13:46:30 <bene> I thought data was returned in the OPEN call, anyone familiar with that? Btw, small-file read performance with quick-read is pretty good IMHO
13:46:41 <jdarcy> Maybe it's just implicit vs. explicit.
13:47:19 <hagarth_> quick-read caches content during lookups. I think we are limited to 128K with quick-read.
13:47:36 <jdarcy> Right now, qr has to be turned off a lot because it's implemented as a sort of cache and that's not always correct.
13:48:12 <bene> the same principles apply to OPEN+READ, but CREATE+WRITE is a much bigger win
13:48:29 <hagarth_> right, we need better invalidation support. Maybe the upcall infra in 3.7 can be utilized for making quick-read more consistent.
13:48:35 <jdarcy> Partly because CREATE is so much worse than OPEN.
13:49:04 <jdarcy> Actually, quick-read as it exists today might *go away* once we have real caching support with invalidation etc.
13:49:24 <jdarcy> Ditto write-behind and io-cache. Need to think about that.
13:49:28 <hagarth_> jdarcy: yeah
13:50:02 <hagarth_> in terms of timelines, shall we consider better caching as a theme beyond 3.7?
13:50:21 <jdarcy> Definitely. Since this relates closely to that, I'd also say keep it further out in 4.0
13:50:35 <hagarth_> jdarcy: agree
13:50:48 <jdarcy> BTW, talking to Ira yesterday, I came up with the idea of a 4.-1 (four point minus-one) for stuff that *might* be usable before 4.0
13:51:04 <hagarth_> jdarcy: not 3.x but 4.-1?
13:51:24 <jdarcy> hagarth_: Right. The 4.x development stream, but before 4.0
13:51:27 <bene> sadly, I agree. This is not a trivial change. But I still think it's the most important one in terms of round trips.
13:51:51 <bene> can we discuss lookup-unhashed=auto? That could happen sooner than V4, right?
13:52:02 <jdarcy> bene: Yes, good idea.
13:52:24 <jdarcy> I think we need to get this (or something like it) unstuck for 3.7. What do you think, hagarth_?
13:52:43 <hagarth_> jdarcy: +1 to that
13:53:03 <jdarcy> bene: Can you remember what the measured performance gains were?
13:53:14 <hagarth_> small-file performance is not going to improve without lookup-unhashed=auto & readdir performance improvements.
13:53:28 <hagarth_> so, I am all for improvements in these areas for 3.7.
13:53:34 <bene> Without this change, we have negative scaling for small-file creates, because a LOOKUP is done on every brick prior to CREATE
13:54:11 <bene> with this change, we don't have perfectly linear scaling but it's not bad, see the 3rd graph in the description at: https://s3.amazonaws.com/ben.england/small-file-perf-feature-page.pdf
13:54:43 <jdarcy> Right. That's pretty huge. Thanks!
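To illustrate why lookup-unhashed matters for small-file creates, here is a standalone toy model of the DHT lookup decision. DHT hashes the file name to pick one "hashed" brick; the negative scaling described above comes from the fallback broadcast LOOKUP to every brick when the file is not where the hash says. "auto" would skip the broadcast unless the directory's layout might be stale. The hash function and the generation check below are simplified stand-ins, not DHT's actual dm_hash or layout logic.

    #include <stdio.h>

    #define NBRICKS 4

    static unsigned toy_hash(const char *name)
    {
        unsigned h = 0;
        while (*name)
            h = h * 31 + (unsigned char)*name++;
        return h;
    }

    /* Round trips for one LOOKUP, given whether the file was found on
     * the hashed brick and whether the directory layout is fresh. */
    static int lookups_needed(const char *name, int layout_gen,
                              int volume_gen, int found_on_hashed)
    {
        printf("hashed subvolume for %s: brick %u\n",
               name, toy_hash(name) % NBRICKS);
        if (found_on_hashed)
            return 1;               /* common case: one round trip */
        if (layout_gen == volume_gen)
            return 1;               /* layout fresh: trust the miss */
        return NBRICKS;             /* possibly stale: broadcast */
    }

    int main(void)
    {
        /* Fresh layout: a miss needs no broadcast before CREATE. */
        printf("%d lookups\n", lookups_needed("a.txt", 7, 7, 0));
        /* Stale layout (brick added, not yet rebalanced): fan out. */
        printf("%d lookups\n", lookups_needed("a.txt", 6, 7, 0));
        return 0;
    }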
13:55:00 <jdarcy> #action jdarcy to refresh lookup-unhashed=auto patch, get it moving again for 3.7
13:55:29 <jdarcy> Any other smallfile-related changes we should discuss?
13:56:04 <hagarth_> I have some thoughts on readdir, maybe we should discuss that in the next session.
13:56:23 <jdarcy> OK by me.
13:56:31 <hagarth_> readdir has a bunch of problems that need to be sorted out.
13:56:45 <bene> So I get the impression that migrating .glusterfs to SSD is not going to be considered, how are we going to speed up metadata writes?
13:57:13 <bene> does cache tiering handle this somehow?
13:57:31 <jdarcy> No, I don't think tiering/DC handles it.
13:57:53 <jdarcy> I don't think there's any resistance to the idea, it just hasn't been at the front of the list.
13:58:26 <hagarth_> bene: yeah, I am all for it if we can get some help in implementing that. need to check where we can get assistance for that.
13:58:28 <bene> there are some legitimate concerns about it - what if the SSD fails?
13:58:31 <jdarcy> Partly that might be because it's something you can do today, though it'd be hacky.
13:59:05 <hagarth_> bene: we would also need to snap the SSD along with the data bricks as part of a snapshot operation.
13:59:44 <bene> replication should protect against SSD failure. I forgot about snapshotting, ouch.
13:59:51 <jdarcy> Are hybrid drives or dm-cache relevant here?
14:00:26 <bene> my problem with dm-cache so far is that it doesn't accelerate initial writes, and that's where we need the help, but...
14:00:54 <bene> Mike Snitzer and others have suggested that dm-cache can be tuned to favor writes over reads (promotion threshold)
14:01:42 <bene> Maybe we just haven't tried hard enough with it.
14:02:07 <jdarcy> Is it worth it for us to add separate support for putting .glusterfs on a different FS (e.g. solving the snapshot problem)?
14:02:33 <jdarcy> BTW, this same issue comes up with databases and logs that have been proposed for various features, whether they live in .glusterfs or elsewhere.
14:02:38 <hagarth_> jdarcy: something like btrfs?
14:02:42 <mpillai> brick configuration will be a major pain with dm-thin+dm-cache. dm-thin will have to be on top, I think, for snapshots
14:03:25 <jdarcy> hagarth_: Well, that's another possibility. I was just thinking about a different mountpoint/volume, not necessarily of a different type.
14:03:33 <hagarth_> jdarcy: ah ok
14:03:42 * jdarcy writes a proposal to use ZFS for this.
14:04:10 <bene> is btrfs stable yet?
14:04:46 <jdarcy> AFAICT no, and not on a trajectory to become so.
14:04:49 <hagarth_> bene: not yet, unfortunately
14:05:47 <jdarcy> Since we're over time, I think we'll have to defer brainstorming on this one to email.
14:06:11 <bene> our time is up, but thanks to everyone who participated. I think it's been productive; let's talk again about changes that are relevant to Gluster 4.0 later
14:06:24 <jdarcy> Sounds good to me. Thanks, everyone!
14:06:30 <hagarth_> thanks all!
14:06:51 * jdarcy puts a squeaky toy under hagarth_'s gavel.
14:07:02 <hagarth_> #endmeeting