18:00:10 #startmeeting Infrastructure (2015-03-26)
18:00:10 Meeting started Thu Mar 26 18:00:10 2015 UTC. The chair is nirik. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:00:10 Useful Commands: #action #agreed #halp #info #idea #link #topic.
18:00:10 #meetingname infrastructure
18:00:10 #topic aloha
18:00:10 #chair smooge relrod nirik abadger1999 lmacken dgilmore mdomsch threebean pingou puiterwijk
18:00:10 The meeting name has been set to 'infrastructure'
18:00:10 Current chairs: abadger1999 dgilmore lmacken mdomsch nirik pingou puiterwijk relrod smooge threebean
18:00:10 #topic New folks introductions / Apprentice feedback
18:00:23 * oddshocks waves
18:00:27 * puiterwijk is here
18:00:36 * lmacken
18:00:43 * threebean
18:01:22 * relrod here
18:01:27 * pingou here
18:01:36 here
18:01:50 * adrianr here
18:02:14 any new folks like to introduce themselves or apprentices have questions?
18:02:45 ok, on to info dump...
18:02:56 #topic announcements and information
18:02:56 #info Got all our iscsi storage moved from old to new filer with no downtime! - kevin
18:02:56 #info Got new rbac-playbook installed and working - patrick and tim
18:02:56 #info New FMN release deployed to production. New features and bugfixes. - ralph
18:02:57 #info Hooked some email lists up to fmn (scm-commits, meetingminutes) - ralph
18:02:58 #info Turned off old emails from dist-git and pkgdb - ralph
18:02:59 #info FAS3.0 status https://fedoraproject.org/wiki/User:Laxathom/Drafts:FAS3.0
18:03:20 #info progit became pagure: http://blog.pingoured.fr/index.php?post/2015/03/25/Progit-is-dead%2C-long-live-pagure
18:03:25 hola
18:03:26 #info pagure progresses nicely
18:03:34 hello
18:03:35 threebean: is it worth a reminder about the pkgdb/git/koji no longer sending emails? ie, devel-announce or something?
18:04:03 yeah, might as well. I can send that to devel announce today.
18:04:07 ok, cool.
18:04:10 welcome amrzaki
18:04:28 i'm new to infra team
18:04:44 and i want to have a role in the team :)
18:05:02 amrzaki: great. Are you more interested in sysadmin? or application development related tasks?
18:05:12 .fas corey84
18:05:13 Corey84: corey84 'Corey84'
18:05:50 system admin
18:06:14 amrzaki: excellent. See me in #fedora-admin after the meeting and I will point you in the right direction to get started...
18:06:32 ok thanks
18:06:50 so, we don't have any discussion topics or 'learn about' sign up in the gobby this week. ;) Is there anything anyone would like to discuss and/or would someone like to teach about one of our applications or tools?
18:06:59 pingou: pagure.org and .io are available :)
18:07:11 I wanted to give an update about the MM2 status
18:07:21 nirik: I did say koji is not weeks ago when I turned it off
18:07:26 adrianr: sounds good.
18:07:33 #topic Mirrormanager2 status
18:07:42 lmacken: oh nice :)
18:07:44 dgilmore: yeah, people forget tho...and we did just turn off pkgdb and git.
18:08:02 I would say the crawler is now working
18:08:16 adrianr: it's still crashing from time to time tho right?
18:08:23 however, only with libcurl from F21
18:08:36 RHEL7.1 libcurl lets it crash after 1 minute
18:08:44 :/
18:08:48 total runtime right now seems to be at least 7 hours
18:09:16 thats not too bad.
18:09:16 there is a bug report for the libcurl problem
18:09:24 do you know what the current time for the current crawler is?
18:09:32 * pingou wanted to ask
18:09:43 no, but it takes some time, let me look
18:09:58 https://bugzilla.redhat.com/show_bug.cgi?id=1204825
18:10:34 I'll note that _all_ mirrorlist servers are mm2 now.
18:10:43 and have been more stable it seems like
18:10:56 that's great
18:10:56 unfortunately there are no real code changes which could explain why it works with F21 libcurl but not with RHEL7.1 libcurl
18:11:31 nirik: :awesome: :)
18:11:42 adrianr: yeah, puzzling.
18:11:49 Question: Is there some reference sheet for the meaning of all the keys found in the MM pickles?
18:11:49 post meeting I need to chat with someone about my login creds they seem to have disappeared
18:12:11 Corey84: for fas? or ?
18:12:14 There's all this stuff like this http://paste.fedoraproject.org/203355/73935231/
18:12:18 and I have no idea what that stuff means
18:12:22 no I don't know about the pickles
18:12:47 oddshocks: I doubt it. but creating such a doc would be really worthwhile.
18:12:53 i hit the mm page and fas says no dice on login ( invalid user or creds don't allow access to this page)
18:13:04 pingou: do you still have your visualization of the mm2 table schema around somewhere?
18:13:11 threebean: I do
18:13:22 Corey84: odd. Nothing should have changed...
18:13:22 I feel like I'm kind of floundering on this here, and whatever docs people have would be great
18:13:35 threebean: https://github.com/fedora-infra/mirrormanager2/blob/master/doc/mirrormanager.png
18:13:41 (as well as the dia file)
18:13:47 nirik, hence the mention but post mtg not trying to hijack the mtg
18:13:47 current crawler kills all crawls which do not finish within two hours I think
18:14:15 adrianr: but you added the time-out right?
18:14:35 I notice that mm says my mirror was crawled recently. it shouldnt be crawled as its not public and I use report mirror
18:14:52 oddshocks: maybe take a look at the mirrorlist daemon code? it's the thing that reads in the pickles and uses them to assign requests to mirrors. if a pickle is bad, it's going to fail when loaded by that code. so if you could read through it and try to extract some of its assumptions... that could serve as the basis for some tests.
18:14:57 ignore me
18:15:08 there is a time-out in the code but it seems not to work yet
18:15:12 /ignore dgilmore
18:15:45 but, with F21 libcurl, the MM2 crawler seems to work and is pretty stable
18:15:50 oddshocks: some of the cache will change as the data changes, for example host_netblock_cache
18:15:51 threebean: yesss thank you! that will help immensely
18:16:12 the newest code changes also re-enabled the xmlrpc interface
18:16:21 oddshocks: others will change as the DB changes, I guess asn_host_cache or host_country_allowed_cache (although I would expect this last one to be pretty stable)
18:17:23 looking at the code I found out that report_mirror marks a mirror as up to date as long as all directories are there
18:17:35 adrianr: so what big parts are left? backend (ie, scanning dirs and making pkl) and frontend (but thats very minor)
18:17:40 pingou: So what'd be an example of a change that could break the pickle?
18:17:43 as far as I understand the code it does not care about the content of the directory
18:17:57 adrianr: not really
18:17:58 adrianr: that explains a lot of weird behavior I have seen over the years. ;(
18:18:22 oddshocks: maybe we could check that we always get some (stable) values, for example things should not change for the FP mirrors (or the dell ones), maybe we could check these are always there
18:18:24 report_mirror is for private mirrors right? because we can't crawl them? but do public ones use it too?
18:18:32 adrianr: in the name of making things faster it validates only parts of the tree, and assumes the rest is right
18:18:40 nirik: its for all mirrors
18:18:52 but required for private mirrors
18:18:57 oddshocks: after the idea (imho) is to set-up a system where we can add tests as we find bad pickle to prevent the same error/problem from happening again
18:19:00 IMHO we should reject it for all except private
18:19:08 nirik: it lets us know when a mirror has synced without crawling them
18:19:15 oddshocks: so as long as we have the architecture in place, we can always expand on after :)
18:19:22 dgilmore: yes, but they can (and are) wrong
18:19:27 and then users get crappy mirrors.
18:19:27 report_mirror is helpful (kind of)
18:19:28 its required for private mirrors
18:19:48 nirik: mm also makes some other assumptions
18:19:51 people just stick it at the end of their rsync... but that does not mean it worked.
18:20:02 and doesnt drop mirrors out if it is unsure of content status
18:20:18 pingou: Awesome, that was going to be my next question. :) Do you mean some sort of alert system that a pickle has a problem we haven't seen before, and so someone needs to write a new test? Or some sort of automated system?
18:20:27 if we can crawl everything in 7 hours, why not move to a 'once the crawler says you have it you do'
18:20:28 MM1 and MM2 have preliminary code for a canary scan, maybe this would be good to activate
18:20:57 nirik: its kinda false, and causes mirrors to drop in and out
18:20:59 adrianr: +1 on that
18:21:18 dgilmore: do you think you could find 1h at one point to write down all this?
18:21:19 why? not sure I follow
18:21:32 oddshocks, pingou: start small. just a python function that reads in a pickle and raises an exception if it is a 'bad' one. later we can integrate it into unit tests, or nagios, or the mm2 code itself if it proves useful.
18:21:40 pingou: sure, but I do not think any of it is new
18:21:54 dgilmore: it is not, but it's also not documented anywhere
18:21:54 threebean: got it
18:22:02 maybe I just have a more intimate knowledge
18:22:12 perhaps this is all a discussion for the list.
18:22:19 I think the crawler could be distributed on multiple machines and we could crawl each category separately, this could decrease crawl time a lot
18:22:29 oddshocks: threebean my idea was that we could integrate that in MM2 itself so that we avoid pushing bad pickle to the mirrorlist servers
18:22:38 but it would need a lot of resources
18:22:42 adrianr: does it keep crawling over and over? or ?
18:22:48 adrianr: we need to be sure crawling does not overwhelm mirrors
18:22:57 pingou: sounds good. let's write the core of it first before worrying about using it to gate stuff.
18:23:02 +1
18:23:02 dgilmore: that is right, yes
18:23:07 could we be smart... crawl everything, stop. If new mirror or changes to mirror, recrawl it, if content changes, recrawl
18:23:37 dunno... I guess that fails.
18:23:48 the crawler could be much smarter, that is true, but that requires a lot of code changes
18:23:50 perhaps we can do full crawls weekly, but only crawl for known changed content
18:23:53 but I do think we can be more strict and make sure users don't get bad mirrors as much
18:24:16 right now the crawler starts every 12 hours (but crashes after the first minute)
18:24:19 I think there's a lot of cases where report_mirror is being run, but it's wrong and they aren't up to date.
18:24:22 +1 on a smarter crawler but -1 on waiting on it to push MM2 :)
18:24:24 nirik: is it really a big issue. I rarely get a bad mirror
18:24:47 pingou: I was not thinking of delaying MM2 for that
18:24:52 pingou: we need working crawling
18:24:56 dgilmore: I've had a number of complaints. But yeah, in the end you usually get what you want, but it just looks bad to have to hit several mirrors first
18:25:03 adrianr: I thought it was running fine with the libcurl from f21?
18:25:03 pingou: agreed.
18:25:05 if it crashes after a minute thats no good
18:25:20 but we shouldnt hold up moving to mm2 on it
18:25:23 pingou: yes, but only in my home directory and started manually
18:25:42 adrianr: should we try to recompile the f21 on epel7 and install it on the machine?
18:26:03 well, we have 2.5 weeks until the next window we could deploy it on
18:26:05 pingou: I don't know what the right solution is, that or LD_PRELOAD
18:26:12 I don't think we are going to get it out before tuesday.
18:26:16 so it will need to be after beta
18:26:27 so, hopefully curl can be fixed.
18:26:57 so, crawler works, I have not seen errors in umdl (yet) but also not looked very closely,
18:27:07 frontend seems to work and xmlrpc
18:27:33 the next steps would be a new release and then trying to move the installation from staging to the production systems
18:27:34 will people need to have a newer report_mirror?
18:27:36 umdl and making sure it makes good pkls is the last big part IMHO
18:27:38 to be ready to switch
18:27:53 no, I am using a very old report_mirror for my testing on my mirror
18:28:00 never updated it in years
18:28:08 dgilmore: we tried to be backward compat on this :)
18:28:14 adrianr: cool. I think I use an ancient version
18:28:16 adrianr: thank you for that :D
18:28:18 adrianr: BTW, thank you a lot for all your work on this. It's really appreciated.
18:28:31 adrianr++ for all the testing on MM2, have a cookie :)
18:28:37 thanks
18:28:54 adrianr++
18:28:55 :)
18:28:56 pingou: good, there will be mirrors not running rhel/fedora and likely not know/notice they need a new version
18:29:18 perhaps we could move report mirror to a fedmsg thing someday.
18:29:31 would require giving out certs or something tho
18:29:48 I am personally not sure if it is good to switch before F22, but we can try it and fix errors as they come
18:29:59 ...and whitelisting tons of ips in iptables..
18:30:07 threebean: yeah, ok, nevermind. ;)
18:30:11 a parallel bus?
18:30:15 adrianr: well, we can see.
18:30:20 -1. this is a good fit for 'http'
18:30:22 nirik: yeah, I would like to get to a point where we can have a mirroring script that uses fedmsg to know when to sync and report back
18:30:31 dgilmore: we have such a script already.
18:30:31 if we want it on the bus we could do the github2fedmsg style http->fedmsg bridge.
18:30:38 adrianr wrote one. ;)
18:30:40 threebean: I heard about this, it's this new trendy thing, right?
18:30:50 and would let us automate implementing mirroring tiers
18:30:57 https://fedoraproject.org/wiki/Infrastructure/Mirroring#Mirror_Frequency
18:31:01 nirik: not a complete one
18:31:15 nirik: what we have is only part of it
18:31:19 I am still interested in getting SSH keys into MM2 to get fedmsg triggered push mirroring
18:31:46 dgilmore: whats missing?
18:32:34 adrianr: wonder if that would be finally a use case for ssh certs.
18:32:37 nirik: part of it is us, in making messages for releases
18:32:48 nirik: but having mirrors report back
18:33:06 nirik: so when tier 1 mirrors are done, tier 2 know to kickoff
18:33:20 dgilmore: ah right... releases, yes
18:33:40 we could have report_mirror publish a fedmsg on our side.
18:33:59 adrianr: then mm generates a cert and signs it with a ca, and mirrors can trust certs with that ca, and we can also restrict what those certs could be used for.
18:34:02 i.e., tier1 mirror runs report_mirror, pings the mm2 server. mm2 server publishes a message which makes its way to the tier2s.
18:34:11 but there is again the problem that we do not know for sure what the mirror has
18:34:24 true.
18:34:35 threebean: we could. I really want to get the tier 1 mirrors all using a common setup to sync contents
18:34:38 but that would help for the case that everything worked
18:34:55 right. if we're going to do the work to allow tier1 mirrors to publish to our bus.. we could instead invest it in making report_mirror more detailed, or reliable, or however you want to call it.
18:35:00 I do not think we can solve it all today
18:35:10 threebean: sounds reasonable.
18:35:16 dgilmore: nope, but good ideas. ;)
18:35:33 threebean: assuming that the tier 1 mirrors use report_mirror
18:35:39 * threebean nods
18:35:43 dgilmore: there's your common setup. ;)
18:35:51 threebean: but we could make that part of the automated mirroring setup
18:36:30 I think that today a lot of the tier 2 and 3 mirrors pull directly from dl.fp.o and not the tier 1 mirrors
18:36:38 yep. I agree.
18:36:44 but I do not have cold hard facts to back up my feeling
18:36:55 I think it's the case, but yeah.
18:37:04 anyhow...
18:37:08 #topic Open Floor
18:37:12 any items for open floor?
18:37:39 just a status report thing.. i'm trying to get a bunch of new-hotness issues solved and ready to deploy before freeze.
18:37:49 mostly cleaning up error reporting and related things.
18:37:50 cool.
18:38:09 I'm going to see about applying non rebooting updates, so we are updated before freeze.
18:38:43 ok, thanks for coming everyone. See you in #fedora-admin, #fedora-apps, and #fedora-noc
18:38:46 #endmeeting
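
Sketch for the pickle check suggested at 18:21:32 ("a python function that reads in a pickle and raises an exception if it is a 'bad' one"). This is only a minimal illustration, not MirrorManager2 code. The three cache names are the ones mentioned in the discussion; BadPickleError, REQUIRED_KEYS and check_pickle are hypothetical names, and the idea that the pickle's top-level object is a dict holding those keys is an assumption the log does not confirm.

```python
import pickle

# Cache names mentioned in the meeting. Assumption: they are top-level keys
# of the mirrorlist pickle; the real pickle very likely contains more.
REQUIRED_KEYS = (
    "host_netblock_cache",
    "asn_host_cache",
    "host_country_allowed_cache",
)


class BadPickleError(Exception):
    """Raised when a pickle fails a sanity check (hypothetical name)."""


def check_pickle(path, required_keys=REQUIRED_KEYS):
    """Load the pickle at `path` and return its data, or raise BadPickleError.

    Only load files you produced yourself: unpickling untrusted data is unsafe.
    """
    with open(path, "rb") as handle:
        try:
            data = pickle.load(handle)
        except Exception as err:
            # Truncated or corrupt files end up here (EOFError, UnpicklingError, ...).
            raise BadPickleError("cannot unpickle %s: %s" % (path, err)) from err

    if not isinstance(data, dict):
        raise BadPickleError("%s: expected a dict, got %s" % (path, type(data).__name__))

    missing = [key for key in required_keys if key not in data]
    if missing:
        raise BadPickleError("%s: missing keys: %s" % (path, ", ".join(missing)))

    return data
```

Further checks (for example that well-known stable mirrors are always present, as suggested at 18:18:22) could be added as extra steps in the same function, so each bad pickle found later turns into one more test.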