16:59:44 #startmeeting Big Data SIG
16:59:44 Meeting started Thu Mar 7 16:59:44 2013 UTC. The chair is rbergeron. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:59:44 Useful Commands: #action #agreed #halp #info #idea #link #topic.
16:59:50 #meetingname Big Data SIG
16:59:50 The meeting name has been set to 'big_data_sig'
17:00:05 #topic Who's around for fun?
17:00:18 * tflink is preparing himself for the fun
17:01:00 awesome.
17:01:15 #info present: rbergeron, tflink
17:01:24 threebean: are you here for the party as well :)
17:01:30 * rbergeron guesses witlessb is
17:01:30 howdy
17:01:34 heya.
17:01:45 #info present: witlessb
17:01:47 rbergeron: yup :)
17:01:47 * witlessb has a party-hat on
17:01:47 * zoglesby is here
17:01:59 * rbergeron will hold another moment ... while she pulls up the magical agenda
17:02:00 * samkottler is here
17:02:03 * jsmith lurks
17:02:18 #info present: threebean, zoglesby, samkottler, jsmith
17:02:43 Would any of you lovely folks like to have a chair as well :)
17:03:02 #chair tflink witlessb threebean zodbot samkottler jsmith
17:03:02 Current chairs: jsmith rbergeron samkottler tflink threebean witlessb zodbot
17:03:08 Yes. that's a yes.
17:03:30 okay. sooooo:
17:03:49 #topic Agenda for today's first meeting :D
17:04:00 * ctyler joins in from Hong Kong
17:04:02 I posted a few things to the mailing list
17:04:07 #chair ctyler
17:04:07 Current chairs: ctyler jsmith rbergeron samkottler tflink threebean witlessb zodbot
17:04:18 hey, live from linaro-land, it's ctyler
17:04:32 #link http://lists.fedoraproject.org/pipermail/bigdata/2013-March/000003.html
17:04:32 yep
17:05:09 #info Agenda looks like: What this is all about, what do we have, what don't we have, what is anyone here interested in doing :)
17:05:18 (not necessarily having to be in that order, but...)
17:05:31 Feel free to poke along the way or yell or whatever if you want to add another topic and we'll figure it out.
17:05:34 First meetings are fun.
17:05:57 #topic What's the Big Data SIG all about?
17:06:35 So I don't have an amazing answer here. Other than: Heyyy, we should do something. Because not having anything is probably not the best answer.
17:07:13 It's sort of a broad field, so I figure we'll have to sort out what it is we want the group to be about, whether it's practical implementation of stuff, packaging of things, or $somethingelse.
17:07:20 Don't y'all talk at once now :)
17:07:25 Thoughts? Additions?
17:07:29 I like the Big Data definition from that O'Reilly report a couple years back, which to paraphrase was: If the size of your data is part of the problem, it's Big Data.
17:07:37 Anyone here just curious about wtf big data is?
17:07:39 Ahhh.
17:07:57 #info loosely quoting from o'reilly: "If the size of your data is part of the problem, it's Big Data."
17:08:56 I think people are struggling with the need to save a variety of things, knowing how to store it, knowing how to do things with it, whether it's analyze, or find things quickly, or hook it up to some amazing infrastructure set-up.
17:09:50 So i'm going to presume that we probably have a mix of people who want to use it or play with it here, along with perhaps some people who want to help fix that, perhaps it's a good blend of both.
17:09:53 yeah, one part of it is getting a decent setup
17:10:04 the other part is understanding the tools and approaches needed
17:10:24 Which is helpful, since it's hard to get people to magically do things if they're not actually interested in using them :)
17:10:54 tflink: agreed - and there is a lot - and i suspect people get a lot of pointy-haired-boss action saying "WE NEED ALL THE BIG DATA THINGS"
17:11:14 The two pieces you hear about the most in Big Data seem to be massive storage, parallel computing (Hadoop, column databases, etc).
17:11:20 which is .. hopefully slightly different than me saying we should do the big data things :)
17:11:42 #idea one part of it is getting a decent setup; other part is understanding tools and approaches needed
17:11:55 yeah, I've heard plenty of people talk about "big data" or "hadoop" like some people talk about "the cloud" - something vague and cool so we should be using it
17:12:16 #idea two pieces you hear most about in Big Data seem to be massive storage, & parallel computing (hadoop, column databases, etc)
17:12:33 There's a pretty distinct set of problems suited to hadoop and friends, though.
17:12:54 another component is online processing or online analysis. i.e. predicting what is trending before it's had time to hit disk.
17:12:55 yeah, and I think some of those folks aren't even to the point where they are thinking "maybe i should be saving this possibly useful in the future information"
17:13:29 #idea another component is online processing or online analysis - predicting what is trending before it's had time to hit disk
17:13:34 threebean: want to expound on that a bit?
17:14:46 e.g., they say google can predict flu outbreaks faster than public health agencies by watching search terms
17:14:56 * threebean nods
17:15:07 financial tools, too.
17:15:11 ah - that's a good example
17:15:49 #idea for ex. - idea that google can predict flu outbreaks faster than public health agencies by watching search terms; financial tools as well apply to concept
17:15:57 thanks, that makes it much clearer.
17:16:30 another example of stream processing is twitter analytics - watching for emerging topics in twitter streams
17:16:36 okay, shall we move onwards? I think we have the discussion of "what are the buckets of things" that kind of bridges this and the "what do we actually have right now... if anything" discussion
17:16:52 #idea another ex. of stream processing is twitter analytics - looking for emerging topics in twitter streams
17:17:33 re: google predicting flu: http://www.nature.com/news/when-google-got-flu-wrong-1.12413
17:17:59 witlessb: cool, thanks for the link - background fun always helps
17:18:21 #topic What are the buckets or categories, and what do we have?
17:19:03 sooooo: I think we can probably stick nosql in a category without subdividing it - if we think it belongs here at all. :)
17:20:02 orchestration, batch processing and stream processing are the other ones that come to mind
17:20:35 examples: orchestration (zookeeper), batch (hadoop, disco), stream (storm)
17:20:36 A full Hadoop stack seems to be considered one of the useful foundation layers for some types of work, but the hadoop filesystem is getting a lot of attention as a weak spot, with nosql dbs and gluster being used as alternatives
17:20:37 tflink: that seems reasonable -
17:20:43 oh, you're totally ahead of me. :)
17:21:23 #idea orchestration, batch processing, stream processing are categories that come to mind - orch (zookeeper), batch (hadoop, disco), stream (storm)
17:21:47 welcome newcomers, feel free to pipe in if you'd like - we're just talking about different categories of tools
17:21:48 rbergeron: storage is another one, I just left that out because you had already mentioned it
17:22:02 I think there are some open source column-parallel SQL databases too, haven't checked into their status
17:22:10 I think that ctyler got the examples (HDFS, gluster, nosql)
17:22:25 #idea storage is another category
17:23:00 #idea full hadoop stack seems to be thought of as useful foundation layer for some types of work, but HDFS is getting attention as weak spot, with nosql dbs and gluster being used as alternatives
17:23:33 I thought that they all had their strengths and weaknesses, though
17:23:37 so does "storage" seem like a reasonable label to apply to hadoop, hdfs, nosql, gluster as a bucket?
17:23:46 but I've spent more time reading about possible filesystems for a disco cluster than hadoop
17:24:12 tflink: yeah, i think - at least with gluster - some of it is just things like what the data is - small bits or larger chunks when written
17:24:18 In addition to plain storage, one thing we've been struggling with here is how to adequately back up terabyte-to-petabyte data sets.
17:24:35 rbergeron: hadoop doesn't seem like it belongs in the storage bucket to me - hdfs would, though
17:24:46 and i suspect that everything is probably really good at one thing, and not so great at others - no superstar jack of all trades
17:25:02 tflink: hdfs is arguably part of the hadoop family, no?
17:25:02 tflink: so hadoop would bucketize where? :)
17:25:03 oh, batch
17:25:08 never mind
17:25:33 ctyler: I think that at some point LOTS of stuff falls into "the hadoop family"
17:25:44 ctyler: true, you can't have hdfs without hadoop but I was thinking that hadoop would belong more with the "batch processing" bucket
17:26:13 I guess the easy thing here is that we don't have a lot of these things. Most of them, actually.
17:26:26 identifying doesn't take long. :)
17:26:34 rbergeron: you have a weird definition of 'easy'!
17:26:54 ah, identifying :-)
17:27:02 hdfs is part of the hadoop project
17:27:12 and one could install hdfs without using the mapreduce part
17:27:48 ctyler: yeah, identifying
17:28:05 #info hdfs is part of the hadoop project; one can install hdfs without using the mapreduce part
17:28:10 bmahe: are there many projects/people who do that? I thought that hdfs was mostly used to support MR jobs in hadoop
17:28:12 bmahe: thanks for piping in :)
17:28:39 tflink, true, but you can do it nonetheless :)
17:29:42 So what we do have is: Riak - mongo - .........
17:29:57 Gluster, I think we said we could add into here as well.
17:30:09 rbergeron, are you listing datastore?
17:30:12 cassandra, hbase (if we're separating out the hadoop components)
17:30:19 I'm not sure what state gluster is in right now as far as being up to date (or any of the others for that matter)
17:30:52 bmahe: I am listing out nosql things off the top of my head, and gluster for good measure since we mentioned it earlier. Just thinking of *what do we have that's loosely related* in general.
17:31:15 though i'm happy to pass the wheel to smarter folks who might be more organized about the what we have discussion :)
17:31:16 oh, what do we have? I'm not sure if either cassandra or hbase is packaged
17:31:27 tflink: last i looked cassandra was not
17:31:59 rbergeron, I just joined the room, so I am just trying to catch up
17:32:01 not sure on hbase but preliminary zodbot asking says no
17:32:23 tflink, you may want to look at Apache Bigtop, which already packages a bunch of the Apache Hadoop ecosystem
17:33:15 bmahe: I'll take a look, thanks
17:33:15 actually, I am a committer on Apache Bigtop and am interested in this sig so I can help out both at the same time
17:33:24 bmahe: so the quick catchup is that we basically talked about "what is this big data thing" - and then came up with a few buckets as ideas to just categorize things - storage, orchestration, batch processing, stream processing
17:33:40 not necessarily perfect or final but just ... brainstorming.
17:33:47 fair enough
17:33:49 thanks a lot!
17:33:50 bmahe: ah, so bigtop is like, the superpackage of hadoop-y things, is it not?
17:33:54 * ctyler dropped off due to dead battery, back now
17:33:55 sorry, apache bigtop :)
17:34:01 * rbergeron slaps herself a bit
17:34:14 rbergeron, yeah, plus integration test and deployment recipes (we even have a VM recipe for boxgrinder )
17:35:00 these are actually the upstream of CDH (the hadoop distribution from Cloudera)
17:35:08 bmahe: ah - iirc boxgrinder is sadly going the way of the dodo - but we can discuss that ... in a bit?
17:35:19 sure
17:35:50 we also have kickstart recipes to create live usb (based on the fedora ones)
17:36:02 but - I guess perhaps it would be good to hear what your thoughts are, indeed - there are a lot of moving parts there.
17:36:03 but sorry to distract the discussion
17:36:18 bmahe: wow, awesome - no, by all means. we loooove hearing that people use fedora -
17:36:48 a lot of the bigtop work is done on fedora :)
17:37:08 \o/
17:37:08 i think another useful thing the SIG can function as is sort of a - is Fedora continuing to be useful for you in however you use it with your big data aspirations
17:37:44 ie: are we doing good things to make it better once up and running - or are there things getting in your way, that kind of thing.
17:38:02 bmahe: but i am tickled pink to hear all of this. :D
17:38:03 wouldn't fedora be more useful for end users/developers rather than deploying a full size cluster (people would rather use centos/rh/debian in this case)?
17:38:28 bmahe: because of lts or .. ?
17:38:33 yes
17:38:40 bmahe: well, i think it's useful for people who want to play with things, just check it out, etc.
17:38:42 stability, lts and so forth
17:39:15 rbergeron, agreed, but then the focus is slightly different
17:39:17 even if fedora is used as a base for jeos (assuming cloud installation)?
17:39:20 Depends on the nature of your BigData. Some big data projects are short-lived (e.g., some research projects fit within a Fedora support lifespan)
17:39:28 bmahe: I think for some people it may be a "want better underlying technology faster" - i'm not sure how much things like KVM, systemd, etc. matter in this particular space, but I know that sometimes it's useful.
17:40:05 or perhaps - can fedora access it well even if it's running on something more -EL-ish
17:40:16 from what I have seen, production deployments would not consider fedora. But dev clusters and others would be fine.
17:40:27 though i will add - people are always happy to see things going into Fedora and EPEL
17:40:43 true, it does not hurt
17:41:13 but shouldn't we finish listing the datastores before going further?
17:41:16 i think from a project-level perspective - maybe more narrow than bigtop, but - it's always useful to sort out "does it work? aaaaagh" issues in Fedora, before it's an actual problem on -EL later on.
17:41:19 Yeah, there is that persistent rumour that Fedora helps inform the way that at least one significant enterprise linux shapes up. Can't remember the name offhand...
17:41:31 bmahe: sure, i just got sidetracked by the shiny, this happens :)
17:42:10 So, I think we were mostly saying "what do we even remotely have" - and where do those things get bucketized - just so we have a list of ... something that we can point people at, and perhaps come out of it with ....
17:42:16 what do we need/want to do next :)
17:42:40 anyone have any additional knowledge of hidden gems in Fedora? besides the handful I listed (if they apply)?
17:43:08 we have a lot of java libraries and servers like tomcat (used by solr, oozie)
17:43:27 bmahe: good to know
17:43:29 most of the big-data-ish stuff that I've been using isn't packaged
17:43:43 all the Apache Hadoop related projects have a lot of dependencies
17:44:01 #info lots of java libraries, servers like tomcat (used by solr, oozie) are already in
17:44:09 bmahe: yeah, and mostly java, correct?
17:44:14 also
17:44:36 we also have pandas. This is not really big data, but still very useful for data analysis
17:45:15 yeah, pandas is really useful
17:45:42 another issue fedora could help also is, reporting bugs with the openjdk. Most of these projects go straight to the oracle jdk and do not really test against openjdk. They are not against openjdk and would welcome help though
17:46:53 * rbergeron assumes this is not ... this type of panda: http://fedoraproject.org/static/images/panda-wave.png
17:47:14 May I propose we start a wiki page, populate it with a list of relevant packages, and note (a) what shape each is in in Fedora, then (b) vote on what we care about, to use that as the basis for some planning (i.e., for next meeting)?
17:47:26 rbergeron, http://pandas.pydata.org/
17:48:34 ctyler: of course :) - I started a big data sig wiki page - do we just want to stick it right there on that page?
17:48:57 I feel like i'm definitely in foreign waters a bit (much like when i sort of stumbled into the cloud sig)
17:49:00 :D
17:49:07 (but glad to see that everyone else knows what's up, yay)
17:49:17 Or maybe a subpage because we won't want this on the SIG front page in six months?
17:49:40 #idea we have pandas (not the animal, http://pandas.pydata.org) - useful for data analysis, not really big data
17:49:57 ctyler: you know the saying, it's a wiki, be bold? :)
17:50:18 #action rbergeron to add a sub-page of packages we have (unless someone beats me to it)
17:51:06 * ctyler thought that was: When on the road, let someone with a decent internet connection edit the wiki :-)
17:51:30 bmahe: i think your thought on the openjdk stuff might make for an interesting mail to the mailing list, if you wanted to do that.
17:51:51 Fits nicely with today's RH announcement about OpenJDK6
17:51:53 * rbergeron just notes we're coming up on the hour ... rapidly
17:52:01 rbergeron, sure
17:52:11 ctyler, which announcement?
17:52:14 ctyler: yeah, i saw daddy shadowman said something, but i haven't actually opened that up yet.
17:52:23 clicked on that twitter link.
17:52:56 bmahe: so wrt bigtop - have you guys had aspirations for actually getting it packaged proper in a distro?
17:52:59 rbergeron, I will send an email tonight when I come back from work
17:53:06 rbergeron, we do
17:53:20 it being "all the things" :D
17:53:21 rbergeron, actually, the ubuntu cloud is basing their packages on bigtop
17:53:35 it -> the email regarding openjdk
17:53:53 bmahe: gotcha
17:53:59 bmahe: http://www.redhat.com/about/news/press-archive/2013/3/red-hat-reinforces-java-commitment
17:54:35 rbergeron, so right now, in bigtop we are packaging a bunch of projects (jsvc tomcat bigtop-utils crunch datafu flume giraph hadoop hbase hive hue mahout oozie pig solr sqoop whirr zookeeper) for sles/fedora/ubuntu/debian/centos [,]
17:54:41 centos [5,6]*
17:55:03 bmahe: nod
17:55:09 so that's a lot to target and we had to take some shortcuts such as pulling the dependencies through maven and not packaging them independently
17:55:24 but it would be awesome if we could straighten up this part
17:56:08 and ideally, my goal (and some of the other people there) would be to become an upstream of distributions. Because there is no reason to duplicate the same efforts and we should all share a common base if possible so we can focus on higher level tasks
17:57:51 bmahe: nod - esp. with java - there's a lot of effort in packaging that much stuff - and maintaining
17:57:51 ctyler, thanks
17:57:54 So my angle on this, in addition to the fact that my college is looking at big data in the curriculum, is that ARM hyperscale looks like it will eventually be a good way to do some of this, yet the story is weak on a few fronts (e.g., OpenJDK on ARM).
17:58:22 ctyler, ubuntu was working on hadoop on arm. not sure where they are at
17:58:50 The OpenJDK piece is being worked on, it would be good to ensure that the rest of the pieces are in good shape on Fedora ARM.
17:59:28 ctyler: when you say "do some of this" - you mean "some of the big data things" or something more specific
18:00:05 rbergeron: I mean that ARM hyperscale is well suited to some big data tasks (but not others, yet).
18:00:15 bmahe: i (sadly) don't have major advice/thoughts on the whole maven/dependencies/becoming an upstream of distributions - my specialty is ... well, typing fast and cheerleading and not ... packaging
18:00:21 ctyler: ahhh, yes, totally
18:00:26 rbergeron, maybe we could try to dogfood the use cases by applying some tools to the fedora projects? Do we have access to any data from fedora? (webservers, package builds...)
18:00:45 bmahe: threebean is your guy.
18:00:48 (i.e., 10K cores in one rack is great, but 4GB per process max is a ceiling for some things)
18:01:04 rbergeron, no worry. I am pretty sure something interesting will come out either way
18:01:29 * threebean waves
18:01:39 threebean, hi!
18:01:43 bug data could be a decent candidate
18:01:47 bmahe: he's working on http://www.fedmsg.com/en/latest/ - maybe that seems like a good intersection
18:02:30 bmahe: hi.. we've been looking at throwing our infrastructure logs at logstash, but not much else as far as analysis goes yet.
18:02:32 but bug data isn't all that big on the scale of some projects - IIRC, F15 bugs are ~ 3-4G of text
18:02:39 * threebean nods
18:02:59 yeah, the fedmsg data isn't that big either. It's big, but not Big.
18:03:05 tflink, that would probably feed some ideas
18:03:14 bmahe: so I may be asking you to, well, write multiple emails - not an emergency or anything, we're not on fire here - but I think you've got a few cool things to discuss, esp. the maven/dependencies thing -
18:03:33 bmahe: I have code for grabbing data from bugzilla and all the F15 bugs if you're ever interested
18:03:38 unless fedora is willing to support this sig with a cluster and terabytes of HDD, I am not sure we want to go *that* big either
18:03:44 in both XML and extracted txt form
18:03:46 i know that the cloudstack folks have some of the same stuff going on, and even the folks who did jboss as had ... welll... fun.
18:04:15 rbergeron, let me write down the email subjects: openjdk support and packaging of dependencies?
18:05:25 yeah, i think that's it.
18:05:51 mind if i action you on that, so we don't say next week... who was that awesome person, what were we hoping to learn? :)
18:06:00 I'm not sure if anyone else is, but I'm interested in bigdata stuff outside the hadoop ecosystem
18:06:15 bmahe: I think just in general with packaging - anything regarding your thoughts/aspirations/potential problems
18:06:18 rbergeron, note also the tendency to use the latest and greatest only (servers in clojure, having strict dependencies on a beta version of yesterday build of a dependency...). So a lot of fun ahead
18:06:18 * ctyler notes that the pillow over there is pointing to the clock, which reads 2 am. I'm out, will read the minutes to see how the story ends.
18:06:23 tflink: I TOTALLY AM.
18:06:31 ctyler: have fun there :)
18:06:46 tflink, so do i :)
18:06:49 disco in particular since it seems to be more python-friendly than hadoop
18:07:19 tflink, not to come back to hadoop, but there are a few python wrappers for hadoop
18:07:19 yes, I know you can use python with hadoop - I've done it before
18:07:46 also most projects use thrift or avro or protocol buffers, which are language agnostic
18:08:03 #idea interest in disco - seems to be more python-friendly than hadoop (though we are aware that there are python wrappers for hadoop)
18:08:31 I was using mrjob with EMR
18:10:21 I see spring is packaged. we could also package spring-hadoop
18:10:25 there are talks on disco at pycon US and pydata 2013 this year
18:11:24 #info spring is packaged - could package spring-hadoop
18:11:41 tflink: i know we have folks at pycon - not sure on pydata
18:12:04 rbergeron: pydata is during the sprints following pycon - I'm not sure if anyone is going either
18:12:17 there are also statsmodels and patsy which could be nice for pandas
18:12:21 I've been thinking about it but haven't decided if it's worth the admission cost yet
18:12:51 #action bmahe to expound on openjdk/bug filing, as well as the wide world of bigtop packaging, as time permits :)
18:13:04 sure, will do
18:13:14 tflink: ticket cost or travel cost?
18:13:20 tflink: or some combo thereof :)
18:14:01 rbergeron: I'm already going to pycon and staying for the sprints. it's the ticket cost and whether it would be better to spend that time @ the sprints
18:14:25 ahhhh
18:15:03 tflink: you might talk to lh - I think she is helping to organize some of that, perhaps she can shed light on it
18:15:50 #topic Operation Agenda: Yeah...
18:15:53 Well...
18:15:53 * rbergeron throws a rock at the bot
18:15:54 This bodes well.
18:15:58 There we go.
18:16:19 I think we veered a bit but still came up with some interesting things.
18:16:52 I'm not sure if we're meeting'd out - tflink, did we cover your additional bases?
18:17:13 rbergeron: additional bases?
18:17:56 rbergeron, what about deployment, orchestration and the cloud?
18:18:18 you alluded to "other than hadoop" - and mentioned a few things - wasn't sure if you wanted to go on :)
18:18:48 no, I think we touched on what I had in mind
18:18:49 bmahe: if you're willing to go on, i'm willing to continue taking notes - I'm not sure if we've lost the others yet :)
18:19:10 rbergeron, I will add it to my email :)
18:19:53 bmahe: that would be delightful.
18:20:35 bmahe: we have a cloud sig - of course the orchestration stuf plays all over.
18:20:39 my, i can't type today.
18:20:52 Anyone else have anything they'd like to discuss?
18:21:17 rbergeron, am already subscribed to the cloud sig :)
18:21:26 I think we might have to sit on "what would you like to do" in a more organized fashion until next week :)
18:22:01 bmahe: excellent, apologies if i've been blind to any mails you've sent there :)
18:22:15 yeah, it looks like we've lost most people
18:22:32 and some of the "what would we like to do" might work better on the list anyways
18:22:35 yeah.
18:22:43 rbergeron, I did not send any
18:23:02 it's just nice to have a discussion and stuff to get a general feel for things. :)
18:23:54 #action rbergeron to prod in meeting notes to get people to talk re: what would we like to do (we==they)
18:24:24 * rbergeron thinks she's got most things accounted for - and thus i shall start the timer to countdown
18:24:33 unless anyone objects :)
18:24:46 87, 29, 14,....
18:24:48 10...
18:24:55 3, 2, 1.
18:25:01 Thanks for coming, everyone.
18:25:14 rbergeron: thanks for leading
18:25:17 This was highly informative and actually exciting :D which is awesome
18:25:31 thanks a lot!
18:25:31 tflink: surely :)
18:26:10 bmahe: thanks for joining today! looking forward to hearing from you.
18:26:13 #endmeeting