13:06:01 #startmeeting
13:06:01 Meeting started Fri Dec 19 13:06:01 2014 UTC. The chair is hagarth_. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:06:01 Useful Commands: #action #agreed #halp #info #idea #link #topic.
13:06:22 Let's start with http://www.gluster.org/community/documentation/index.php/Features/stat-xattr-cache
13:06:46 Seems straightforward. Is it something we could consider for earlier, e.g. 3.7?
13:07:00 there seems to be some disagreement on this topic, I've heard some people suggest md-cache enhancement, but ...
13:07:03 Yes, I can take this up. We could even just load md-cache on the server.
13:07:35 Anand S is also interested in this. Let us aim this enhancement for 3.7.
13:07:40 But would that really intercept all the xattr and stat calls? Some of these are in the POSIX translator below md-cache, yes?
13:08:03 Yes, I would say most of our xattr accesses are internal.
13:08:05 how about a negative cache for xattrs, or missing files - that can be done safely server-side
13:08:17 bene: yes, all xattr and stat calls can be handled by md-cache sitting above posix.
13:08:29 Also, we can cache more aggressively on the server, where we don't have a consistency/currency problem.
13:08:46 yes, we could have an infinite cache-timeout on the server.
13:08:59 and cache by handle/gfid too
13:09:00 Yes! The key thing is to do llistxattr first; this will prevent a lot of xattr calls that aren't needed.
13:09:21 ndevos: a negative cache on the server side looks quite appealing.
13:10:00 bene: as part of lookup, we typically do llistxattr.
13:10:13 hagarth_: yes, I would really like to see that - missing xattrs are pretty common
13:10:23 Definitely appealing. Should we add negative-lookup caching to the feature page?
13:10:31 jdarcy: +1 to that
13:10:42 or, a stat() on a handle/file that was just removed *cough*
13:10:43 I didn't call it that, but negative-lookup caching is what I meant.
13:11:17 server-side caching also means that self-heal and other daemons can take advantage of it, yes?
13:11:34 #action jdarcy to add negative-lookup caching to the stat/xattr cache feature page (unless Ben beats him to it)
13:12:34 bene: Definitely. They could even bypass other cruft to get at the info, though of course that has to be done carefully.
13:12:46 so rather than changing the posix translator, we just move md-cache down to just above the posix translator?
13:13:04 bene: yes, md-cache will be present in both the client & server stacks
13:13:23 md-cache on the client will avoid network hops; on the server we will prevent disk seeks.
13:13:54 does md-cache have a negative-lookup feature? I don't see how it could client-side, because some other client can change metadata
13:13:57 There's probably a *little* more to it than that, at least tweaking options/actions to suit that use case, but basically yeah.
13:14:21 #action jdarcy to look into whether md-cache has negative-lookup functionality
13:14:47 bene: no negative-lookup capability in md-cache as yet.
13:14:56 Well, that was a quick AI.
13:15:08 :)
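
As an aside on the negative-caching idea above: a minimal, self-contained sketch of remembering which (gfid, xattr name) pairs are known to be absent, so a repeated getxattr can return ENODATA without touching the brick. This is not md-cache code; all names and the data structure are hypothetical, and a real server-side cache with an infinite cache-timeout would also need invalidation on setxattr/removexattr.

    /* Toy sketch of a server-side negative xattr cache: remember
     * (gfid, xattr-name) pairs known to be absent so a repeated getxattr
     * fails fast with no disk access. Hypothetical names; not md-cache
     * code. Hash collisions simply overwrite the slot. */
    #include <stdio.h>
    #include <string.h>

    #define NEG_CACHE_SIZE 256

    struct neg_entry {
        char gfid[40];   /* UUID string form of the file's GFID */
        char name[256];  /* xattr name known to be missing */
        int  used;
    };

    static struct neg_entry neg_cache[NEG_CACHE_SIZE];

    static unsigned neg_hash(const char *gfid, const char *name)
    {
        unsigned h = 5381;
        while (*gfid) h = h * 33 + (unsigned char)*gfid++;
        while (*name) h = h * 33 + (unsigned char)*name++;
        return h % NEG_CACHE_SIZE;
    }

    /* Record that a getxattr on this gfid found no such attribute. */
    static void neg_insert(const char *gfid, const char *name)
    {
        struct neg_entry *e = &neg_cache[neg_hash(gfid, name)];
        snprintf(e->gfid, sizeof(e->gfid), "%s", gfid);
        snprintf(e->name, sizeof(e->name), "%s", name);
        e->used = 1;
    }

    /* Return 1 if this xattr is already known to be absent. */
    static int neg_hit(const char *gfid, const char *name)
    {
        struct neg_entry *e = &neg_cache[neg_hash(gfid, name)];
        return e->used && !strcmp(e->gfid, gfid) && !strcmp(e->name, name);
    }

    int main(void)
    {
        const char *gfid = "9b4c1d2e-0000-4000-8000-000000000001";

        neg_insert(gfid, "user.swift.metadata");
        if (neg_hit(gfid, "user.swift.metadata"))
            printf("fast ENODATA, no disk seek\n");
        if (!neg_hit(gfid, "security.selinux"))
            printf("miss: wind the call down to the posix translator\n");
        return 0;
    }
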
13:15:34 In the interests of time, let's move on to http://www.gluster.org/community/documentation/index.php/Features/composite-operations
13:16:46 Three specific composite ops are mentioned there - readdirplus enhancements (covered by stat/xattr cache?), lockless create, and create-and-write.
13:17:06 So Jeff, you mentioned dentry injection, but I proposed something a little different - allow READDIRPLUS to return xattr info, or at least the existence of xattr info. Make sense?
13:17:27 So (1) can any of these reasonably be pulled forward, and (2) are there more?
13:17:42 yes, I think that makes sense - readdirplus replies including requested xattrs
13:17:56 bene: I think both have their own separate value.
13:18:46 IIRC, we already pre-fetch SELinux/ACL xattrs (in md-cache)? Do we need to make that more general?
13:19:07 but how many round trips to do it?
13:19:15 I think Samba would benefit from other xattrs too
13:19:32 I think we pick up all xattrs as part of readdirplus
13:20:02 * hagarth_ checks posix_readdirp_fill()
13:20:11 I re-read dentry injection and I get it now; you're talking about caching them in the kernel, whereas I was talking about how to get them across the network quickly
13:20:22 * ndevos thought it would only return the requested xattrs, but can be wrong
13:20:31 We're already doing it as part of lookup/readdir/readdirp in md-cache (mdc_load_reqs)
13:20:44 bene: Correct.
13:21:35 Dentry injection is not going to be easy, because part of it has to be in the kernel (FUSE at least), but I think it's still valuable.
13:22:12 sorry, what's the "dentry injection"?
13:22:28 ndevos: you are right about requested xattrs
13:22:39 ndevos: Pre-emptively pushing entries into the kernel from the glusterfs process, so we don't incur a context switch when they're requested.
13:22:48 http://www.gluster.org/community/documentation/index.php/Features/Feature_Smallfile_Perf#dentry_injection
13:23:19 ah, okay, yeah, that makes sense
13:23:45 We already pre-fetch into client (user space) memory, but IIRC for some workloads the context switches are killing us.
13:24:39 IMHO the bigger problem of the two is avoiding network round trips. The path between glusterfs and the application is a lot shorter. I would like to have both features, but if I could only choose one, I would probably side with avoiding a round-trip-per-xattr-per-file.
13:24:51 * jdarcy nods.
13:25:06 bene: right
13:25:43 So could/should readdirplus extensions be considered for 3.7 instead of 4.0?
13:26:24 jdarcy: good to consider for 3.7, I feel.
13:26:57 OK.
13:27:06 yes, 3.7 should be possible; readdirp() itself does not need changes, right? only the callers of readdirp(), or the filling of the structure
13:27:21 so where are the changes? It appears the protocol sort of supports it, but libgfapi does not
13:27:47 Right, we need to add options and/or gfapi support.
13:28:08 bene: what would you expect to see as a function in gfapi?
13:28:13 Options would be easier, but not as flexible.
13:28:40 If the translator API supported it, each layer could add in the xattrs that it needed on the way down; make sense? I'm out of my league here ;-)
13:29:10 bene: yes, that's how it is implemented today :).
13:29:56 oops, my bad. So can we just extend the existing libgfapi call, or do we need readdirplusplus ;-) ?
13:30:04 There are issues to be worked out there, but AFAICT they're not large enough to affect which release(s) we should consider.
13:30:14 ah, right! I was thinking about adding a virtual-xattr configuration option for md-cache - instruct md-cache to fetch certain attributes and cache them by calling setxattr("gluster.md-cache.bla", ...)
13:31:05 bene: a readdirplusplus() would be an option, but I wonder what structure it should return...
13:31:46 a tag-length-value list in addition to the traditional stat structure?
13:31:48 Reminder: this is not a design meeting. ;)
13:31:52 ndevos: looks like a good interface that can be considered for later.
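
As a sketch of bene's tag-length-value suggestion: a toy encoder/decoder for a hypothetical readdirplusplus entry that carries requested xattrs alongside the usual stat data, so a client like Samba gets them in one round trip. The layout and the example xattr names are invented for illustration; this is not the GlusterFS RPC encoding.

    /* Toy TLV blob for a hypothetical readdirplusplus reply entry.
     * Invented format, not the GlusterFS wire encoding.
     * Layout per xattr: [u16 key-len][key][u32 val-len][value]. */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    static uint8_t *tlv_put(uint8_t *p, const char *key,
                            const void *val, uint32_t vlen)
    {
        uint16_t klen = (uint16_t)strlen(key);
        memcpy(p, &klen, 2); p += 2;
        memcpy(p, key, klen); p += klen;
        memcpy(p, &vlen, 4); p += 4;
        memcpy(p, val, vlen); p += vlen;
        return p;
    }

    static void tlv_dump(const uint8_t *p, const uint8_t *end)
    {
        while (p < end) {
            uint16_t klen; memcpy(&klen, p, 2); p += 2;
            printf("  xattr %.*s", (int)klen, (const char *)p); p += klen;
            uint32_t vlen; memcpy(&vlen, p, 4); p += 4;
            printf(" (%u bytes)\n", (unsigned)vlen); p += vlen;
        }
    }

    int main(void)
    {
        uint8_t blob[512], *p = blob;

        /* Server side: append only the xattrs the client asked for. */
        p = tlv_put(p, "security.selinux", "system_u:object_r:fusefs_t:s0", 29);
        p = tlv_put(p, "user.swift.metadata", "{}", 2);

        /* Client side: decode them with no extra getxattr round trips. */
        printf("one directory entry carried:\n");
        tlv_dump(blob, p);
        return 0;
    }
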
13:32:15 How about create-and-write?
13:32:26 never mind... I'm happy that we are reaching consensus on what must be done.
13:32:38 yes, let's discuss create-and-write please
13:33:06 This would primarily be for stuff that comes in through gfapi, right? Swift, SMB, etc.?
13:33:44 yeah, looks like a gfapi-only feature.
13:34:10 at first, yes. But I wonder if it might be possible to eventually do it in glusterfs FUSE; more on that later
13:34:45 We used to have a feature (forget the name) that would let you PUT an entire file as a virtual xattr. Can't remember the name.
13:35:21 jdarcy: yeah, I remember the PUT/GET interface. I think we would be confined by the xattr value length .. is that 4k?
13:35:30 Something like that.
13:35:45 Do we see any benefit to an equivalent GET?
13:36:02 I was hoping we could specify an object up to 1 RPC's worth of data
13:36:15 we could cut down on open + close over the network.
13:36:41 bene: Through GFAPI we probably could. Through FUSE etc. we'd be bound by xattr limits.
13:36:45 Much more than that. If the API specifies xattrs, again each layer can add their own in on the way down the stack
13:36:56 (though we could use multiple xattrs....)
13:37:41 This one seems a bit trickier than the others we've discussed.
13:38:12 the value length max seems to be 64K
13:38:14 It's not *super* hard, but we'd have to modify multiple layers. Perhaps some coordination issues too?
13:39:16 yeah, there can be some side effects which need to be thought through.
13:39:20 Anything involving CREATE tends to get a bit complicated in DHT. Now we'd have another call feeding into that complexity.
13:39:57 But the impact might be huge. Any thoughts on *how* huge, Ben?
13:40:33 would this not be similar to NFSv4 compound calls? why not add a COMPOUND fop that encapsulates other fops?
13:41:03 I had thought about that - this too would work. SMB ANDX is another example
13:41:33 ndevos: Doing that in a fully general way would require a whole new subsystem to manage concurrency and atomicity.
13:41:45 so the idea is you chain requests together in a single message, each request conditional on the preceding requests succeeding. Error handling is more complicated.
13:42:19 What happens if a request in the middle fails? Are the NFSv4/SMB3 semantics even compatible?
13:42:20 That's why I proposed what I did - I thought a CREATE-AND-WRITE FOP would be simpler to implement than generalized compound operations
13:42:42 But what do you all think?
13:42:54 might be a good idea to look into how NFS/SMB handle compound operations.
13:43:33 bene: we can possibly give it a try, only through the API interface for now.
13:43:44 Are there other areas where we need to combine operations? I'm not aware of any
13:43:54 bene: I definitely think that addressing specific cases is better for now. General compound-operation support might be biting off more than we can chew.
13:44:06 People think 4.0 is overcrowded already, and they're probably right.
13:44:12 I think a COMPOUND(CREATE, WRITE) would be a nice start, followed by COMPOUND(LOOKUP, OPEN, READ)?
13:44:25 ndevos: +1
13:44:39 +1 to that
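
For context on the round trips under discussion: this is roughly what a small-file PUT looks like through libgfapi today, using real gfapi calls (build with -lgfapi; volume name, server, and path are placeholders). Create, write, and close each cost at least one network round trip, which is what a COMPOUND(CREATE, WRITE) fop would collapse.

    /* A minimal libgfapi client showing the per-call round trips that the
     * proposed CREATE+WRITE compound would collapse. "myvol", "server1",
     * and "/obj1" are placeholders. Error handling kept short. */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <glusterfs/api/glfs.h>

    int main(void)
    {
        const char *data = "small file payload";
        glfs_t *fs = glfs_new("myvol");

        glfs_set_volfile_server(fs, "tcp", "server1", 24007);
        if (glfs_init(fs) != 0) {
            fprintf(stderr, "glfs_init failed\n");
            return 1;
        }

        /* Round trip(s): LOOKUP (possibly fanned out per brick) + CREATE. */
        glfs_fd_t *fd = glfs_creat(fs, "/obj1", O_WRONLY, 0644);
        if (!fd) {
            glfs_fini(fs);
            return 1;
        }

        glfs_write(fd, data, strlen(data), 0); /* round trip: WRITE */
        glfs_close(fd);                        /* round trip: FLUSH/RELEASE */

        glfs_fini(fs);
        return 0;
    }

The pre-create LOOKUP fan-out mentioned in the comment is the same cost the lookup-unhashed=auto discussion below is about, so the two changes compound.
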
13:44:48 how does a compound OPEN-READ differ from quick-read?
13:44:52 Actually I'm not sure about the "followed" part. They might go concurrently.
13:45:14 ah, I don't know how quick-read works :)
13:46:30 I thought data was returned in the OPEN call, anyone familiar with that? Btw, small-file read performance with quick-read is pretty good IMHO
13:46:41 Maybe it's just implicit vs. explicit.
13:47:19 quick-read caches content during lookups. I think we are limited to 128K with quick-read.
13:47:36 Right now, qr has to be turned off a lot because it's implemented as a sort of cache, and that's not always correct.
13:48:12 the same principles apply to OPEN+READ, but CREATE+WRITE is a much bigger win
13:48:29 right, we need better invalidation support. Maybe the upcall infra in 3.7 can be utilized for making quick-read more consistent.
13:48:35 Partly because CREATE is so much worse than OPEN.
13:49:04 Actually, quick-read as it exists today might *go away* once we have real caching support with invalidation etc.
13:49:24 Ditto write-behind and io-cache. Need to think about that.
13:49:28 jdarcy: yeah
13:50:02 in terms of timelines, shall we consider better caching as a theme beyond 3.7?
13:50:21 Definitely. Since this relates closely to that, I'd also say keep it further out in 4.0.
13:50:35 jdarcy: agree
13:50:48 BTW, talking to Ira yesterday, I came up with the idea of a 4.-1 (four point minus-one) for stuff that *might* be usable before 4.0
13:51:04 jdarcy: not 3.x but 4.-1?
13:51:24 hagarth_: Right. 4.x development stream, but before 4.0
13:51:27 sadly, I agree. This is not a trivial change. But I still think it's the most important one in terms of round trips.
13:51:51 can we discuss lookup-unhashed=auto? That could happen sooner than V4, right?
13:52:02 bene: Yes, good idea.
13:52:24 I think we need to get this (or something like it) unstuck for 3.7. What do you think, hagarth_?
13:52:43 jdarcy: +1 to that
13:53:03 bene: Can you remember what the measured performance gains were?
13:53:14 small-file performance is not going to improve without lookup-unhashed=auto & readdir performance improvements.
13:53:28 so, I am all for improvements in these areas for 3.7.
13:53:34 Without this change, we have negative scaling for small-file creates, because LOOKUP is done to every brick prior to CREATE.
13:54:11 with this change, we don't have perfectly linear scaling but it's not bad; see the 3rd graph in the description at: https://s3.amazonaws.com/ben.england/small-file-perf-feature-page.pdf
13:54:43 Right. That's pretty huge. Thanks!
13:55:00 #action jdarcy to refresh lookup-unhashed=auto patch, get it moving again for 3.7
13:55:29 Any other smallfile-related changes we should discuss?
13:56:04 I have some thoughts on readdir; maybe we should discuss that in the next session.
13:56:23 OK by me.
13:56:31 readdir has a bunch of problems that need to be sorted out.
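
To make the negative-scaling point concrete, a back-of-the-envelope sketch (plain C, no Gluster code): without the optimization a create is preceded by a LOOKUP on every DHT subvolume, while with lookup-unhashed=auto only the hashed subvolume is consulted. The one-lookup-per-subvolume count is an idealized assumption for illustration; Ben's PDF above has the measured numbers.

    /* Illustrative only: per-create LOOKUP fan-out with and without the
     * lookup-unhashed=auto optimization discussed above. Counts assume
     * one LOOKUP per DHT subvolume today vs. one total with =auto. */
    #include <stdio.h>

    int main(void)
    {
        for (int subvols = 2; subvols <= 64; subvols *= 2) {
            int lookups_today = subvols; /* LOOKUP sent to every subvolume */
            int lookups_auto  = 1;       /* only the hashed subvolume */
            printf("%2d DHT subvolumes: %2d lookups/create today, %d with =auto\n",
                   subvols, lookups_today, lookups_auto);
        }
        return 0;
    }
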
13:56:45 So I get the impression that migrating .glusterfs to SSD is not going to be considered; how are we going to speed up metadata writes?
13:57:13 does cache tiering handle this somehow?
13:57:31 No, I don't think tiering/DC handles it.
13:57:53 I don't think there's any resistance to the idea, it just hasn't been at the front of the list.
13:58:26 bene: yeah, I am all for it if we can get some help in implementing that. need to check where we can get assistance for that from.
13:58:28 there are some legitimate concerns about it - what if the SSD fails?
13:58:31 Partly that might be because it's something you can do today, though it'd be hacky.
13:59:05 bene: we would also need to snap the SSD along with the data bricks as part of a snapshot operation.
13:59:44 replication should protect against SSD failure. I forgot about snapshotting, ouch.
13:59:51 Are hybrid drives or dm-cache relevant here?
14:00:26 my problem with dm-cache so far is that it doesn't accelerate initial writes, and that's where we need the help, but...
14:00:54 Mike Snitzer and others have suggested that dm-cache can be tuned to favor writes over reads (promotion threshold)
14:01:42 Maybe we just haven't tried hard enough with it.
14:02:07 Is it worth it for us to add separate support for putting .glusterfs on a different FS (e.g. solving the snapshot problem)?
14:02:33 BTW, this same issue comes up with databases and logs that have been proposed for various features, whether they live in .glusterfs or elsewhere.
14:02:38 jdarcy: something like btrfs?
14:02:42 brick configuration will be a major pain with dm-thin+dm-cache. dm-thin will have to be on top, I think, for snapshots
14:03:25 hagarth_: Well, that's another possibility. I was just thinking about a different mountpoint/volume, not necessarily of a different type.
14:03:33 jdarcy: ah ok
14:03:42 * jdarcy writes a proposal to use ZFS for this.
14:04:10 is btrfs stable yet?
14:04:46 AFAICT no, and not on a trajectory to become so.
14:04:49 bene: not yet, unfortunately
14:05:47 Since we're over time, I think we'll have to defer brainstorming on this one to email.
14:06:11 our time is up, but thanks to everyone who participated. I think it's been productive; let's talk again about changes that are relevant to gluster 4.0 later
14:06:24 Sounds good to me. Thanks, everyone!
14:06:30 thanks all!
14:06:51 * jdarcy puts a squeaky toy under hagarth_'s gavel.
14:07:02 #endmeeting