18:00:03 #startmeeting Infrastructure (2014-10-09)
18:00:03 Meeting started Thu Oct 9 18:00:03 2014 UTC. The chair is nirik. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:00:03 Useful Commands: #action #agreed #halp #info #idea #link #topic.
18:00:04 #meetingname infrastructure
18:00:04 The meeting name has been set to 'infrastructure'
18:00:04 #topic aloha
18:00:04 #chair smooge relrod nirik abadger1999 lmacken dgilmore mdomsch threebean pingou puiterwijk
18:00:04 Current chairs: abadger1999 dgilmore lmacken mdomsch nirik pingou puiterwijk relrod smooge threebean
18:00:15 * lanica is here for the infra meeting.
18:00:21 * danielbruno here
18:00:22 * pingou
18:00:28 * tflink is here
18:00:30 * bwood09_ here
18:00:57 * lmacken
18:01:22 * mpduty is here
18:01:32 * threebean is here
18:01:35 * roshi lurks
18:01:49 welcome everyone.
18:02:03 hi
18:02:04 #topic New folks introductions and Apprentice tasks
18:02:14 any new folks like to introduce themselves?
18:02:23 Or apprentices with questions/comments/ideas?
18:02:35 * puiterwijk is here, but busy (as announced)
18:02:52 note: I'm going to be doing my monthly cleanup of the apprentice group later today... so if you haven't sent in your monthly status email, please do so asap. ;)
18:03:28 I am here, busy installing openstack
18:04:07 * oddshocks is here
18:04:13 I did have one question for the fedora cloud folks - what is your management/front end? is it all CLI virsh commands, or is there a web interface somewhere?
18:04:34 * danofsatx-dt checks his sent folder for nirik
18:04:52 danofsatx-dt: we have a dashboard (horizon?) for one off things, but we use ansible to manage the persistent instances.
18:05:05 ansible spins up the instance and configures it as needed.
18:05:06 ok, that's what I wanted to check
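A minimal sketch of the one-off instance spin-up discussed above, shown with the plain nova CLI rather than the ansible playbooks the team actually uses; the image, flavor, key, and instance names are placeholders, not real infra values:

    # Boot a one-off instance against the private cloud; ansible drives
    # the same OpenStack APIs for the persistent instances.
    nova boot \
        --image Fedora-20-x86_64 \
        --flavor m1.small \
        --key-name scratch-keypair \
        scratch-instance01

    # Watch for the instance to go ACTIVE and pick up its address.
    nova list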
18:05:14 nirik, I'm not sure if it's the right time to ask, but still: please point me to where to look/ask about backups. I've found out that copr has no backups in fedora-infra; mirek did them by hand. I'm going to add a new component to copr - a package signer - and it surely needs its gpg keys backed up (and in a secure way, btw)
18:05:27 http://infrastructure.fedoraproject.org/cgit/ansible.git/tree/
18:05:29 a lot of the stuff I'm deploying is one-off instances
18:06:16 vgologuz: yeah, we aren't doing backups of it now... but it would be good to do so. Can you email me (or perhaps make a ticket) with what things should be backed up and I can set it up.
18:06:37 we have a backup server; it would connect and use rdiff-backup to back up whatever directory trees/volumes we want.
18:07:00 I guess we should backup all the rpms/repos too?
18:07:17 and second question, does fedora-infra have an instance of cacti/zabbix to send custom monitoring stats to (e.g. length of the build queue)?
18:07:42 nirik, and the DB i think
18:08:07 vgologuz: yeah, db too. :) we should setup a cron to dump the db to a file and backup that and keys and rpms.
18:08:11 vgologuz: we have nagios...
18:08:18 and collectd
18:08:25 depending on if you want to monitor, or alert
18:08:32 i think nagios is only about critical states?
18:08:41 yeah.
18:09:40 vgologuz: sure. so, if you could file tickets we can work on it, or you can do so. ;)
18:10:01 ok i will review copr and file a ticket about backups
18:10:05 #info copr backups, monitoring and alerting work coming up.
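A minimal sketch of the cron-driven pattern nirik outlines above (dump the db to a file, then let the backup server pull it with rdiff-backup). The database name, paths, and hostnames are hypothetical, and it assumes the copr db is postgres:

    #!/bin/bash
    # Hypothetical /etc/cron.daily/copr-db-dump: write a dated database
    # dump where the nightly backup run can find it, alongside the keys.
    set -e
    pg_dump copr | gzip > /backups/db/copr-$(date +%F).sql.gz

    # The backup server side would then mirror the trees we care about
    # (db dumps, signing keys, rpms/repos), e.g.:
    #   rdiff-backup copr-be.example.org::/backups /srv/backups/copr-be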
18:10:07 haven't heard about collectd, where should i look in infra?
18:10:20 except collectd.org
18:10:27 http://admin.fedoraproject.org/collectd/
18:10:36 it's also in ansible (configuring and setup). it's a role.
18:11:08 it collects normal stats... load, cpu, etc... and we can make plugins for extra stuff we want.
18:11:27 thanks, i'll read up on plugins
18:11:37 .tiny https://admin.fedoraproject.org/collectd/bin/index.cgi?hostname=busgateway01.phx2.fedoraproject.org&plugin=fedmsg&timespan=86400&action=show_selection&ok_button=OK
18:11:38 nirik: http://tinyurl.com/o8srljr
18:11:50 for example a fedmsg plugin for our busgateway that shows fedmsgs
18:12:18 anyhow, yeah, do let us know if you have questions or need info.
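For the custom-stats question (e.g. build queue length), collectd's Exec plugin is one straightforward hook: it runs a script that emits PUTVAL lines on collectd's plain-text protocol. A minimal sketch - the copr-queue-length command is a made-up stand-in for however the queue would actually be counted:

    #!/bin/bash
    # Hypothetical Exec plugin script: report a queue length as a gauge.
    # collectd exports COLLECTD_HOSTNAME and COLLECTD_INTERVAL to the script.
    HOST="${COLLECTD_HOSTNAME:-$(hostname -f)}"
    INTERVAL="${COLLECTD_INTERVAL:-60}"
    while sleep "$INTERVAL"; do
        queue=$(copr-queue-length)
        echo "PUTVAL $HOST/exec-copr/gauge-queue interval=$INTERVAL N:$queue"
    done

    # Matching collectd.conf fragment (Exec refuses to run as root):
    #   LoadPlugin exec
    #   <Plugin exec>
    #     Exec "nobody" "/usr/local/bin/copr-queue-stats.sh"
    #   </Plugin>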
18:12:55 vgologuz: oh, you were fixing up the copr playbooks? last time I tried to run them they didn't finish... could you look into fixing that? I think it was a missing source file.
18:13:29 #topic Applications status / discussion
18:13:37 any applications news?
18:14:07 been roping anitya and koschei into fedmsg this week.. lots of fun. ;)
18:14:21 threebean: and tnh? :)
18:14:32 * threebean nods
18:14:45 started work on the new backend for anitya
18:14:47 * tflink is working on taskotron monitoring setup, most of the other issues have been fixed but waiting for new builds
18:14:48 https://github.com/fedora-infra/the-new-hotness
18:14:59 I've been working on anitya with threebean
18:15:10 at this point, we need to decide how wise it is to switch off autoqa right before beta freeze
18:15:21 and spent some time on progit yesterday to make it a little less fedora-centric (ie: allow local accounts instead of relying on FAS)
18:15:38 tflink: the goal was to switch it off a few days ago, no?
18:15:58 oh, and I got the new fedocal out the door (that benefited from quite a bit of help from trashy)
18:16:35 * mirek-hm is here
18:17:09 threebean: that was the original hope, yes
18:17:20 some bits took longer than I wanted them to
18:17:25 * threebean nods
18:17:26 #info new fedocal releases this week (see changelog on list)
18:17:34 * lmacken has been doing a lot of bodhi masher development lately. Almost ready to start testing pushes in stg.
18:17:50 #info taskotron monitoring has been added.
18:17:55 if the only pieces that are left are monitoring, I'd be +1 to moving forwards with taskotron for this portion of the release cycle -- and killing autoqa.
18:17:58 tflink: 2 questions: a/ how easy/hard is it to turn it back on if needed? b/ can both systems work in parallel?
18:18:05 #info anitya and new backend work moving along
18:18:24 nirik: and pkgdb2 also got a release pushed to prod, on Monday
18:18:33 tflink: there's still a bit more monitoring to add? and we wanted to backup some more stuff?
18:18:39 pingou: if we don't delete the autoqa01 vm, trivial to turn back on. they can't work 100% in parallel due to how we provide feedback in bodhi comments
18:19:03 nirik: yeah, I'm working on the website monitoring right now, I don't think that the buildbot plugin is going to be ready this week
18:19:09 then I'm +1 on moving forward and just keep the autoqa01 vm around for now
18:19:18 there are some files on taskotron01.qa that need to be backed up as well
18:19:27 * tflink doesn't remember if he filed a ticket for that
18:19:39 tflink: I'm happy to assist adding the website monitoring and backups...
18:19:49 no ticket yet, but if you file one I can get it going. ;)
18:19:50 I'm going to do a new libtaskotron build later today and reset the history on taskotron01.qa
18:20:09 tflink: any chance of a new resultsdb release before prime time?
18:20:12 nirik: if you have the time to do the nagios stuff, I can work on the new builds and cleanup
18:20:27 threebean: yeah, was planning on a new build/release for that today as well
18:20:32 rad, rad.
18:20:36 tflink: also, there's a report email going to admin I think about successes and failures for each of {prod|stg|dev}... should that better go to qa-devel? or the test list?
18:21:30 nirik: odd, I'm only seeing it go out to sysadmin-qa-members
18:21:39 oh, perhaps I misread.
18:21:41 * nirik looks
18:21:41 it's not supposed to go out to admin@
18:22:10 oh, you are right. I misread. ;)
18:22:24 but still, would those better go to qa-devel? or is sysadmin-qa good?
18:22:39 sysadmin-qa is good for now - the information in those emails is of limited utility
18:22:50 a very limited audience, rather
18:22:56 ok. yeah, I wasn't sure if there was anything to do with them. ;)
18:23:14 in practice, anyone interested in the emails is part of sysadmin-qa
18:23:29 fair enough.
18:23:35 that may change in the future, but we'll have better methods of reporting by then, I think
18:24:10 * nirik likes tracking down automated things that send email and finding out if they are needed/going to the right place. ;)
18:24:18 anyhow, any other applications news?
18:24:40 oh, we flooded fedmsg earlier this week :)
18:24:59 but people cannot complain anymore that the information stored in pkgdb2 about their packages is incorrect (for most of them)
18:25:11 pingou: is that a cron job now?
18:25:19 or a fedmsg trigger? ;)
18:25:23 ooo
18:25:32 basically, we now have a cron job that on a weekly basis will take the metadata from rawhide and update the package information in pkgdb with it
18:25:38 nirik: cron :)
18:25:50 cool. worth a blog or note to devel-announce?
18:26:33 maybe a blog; seems too small to be worth devel-announce (imho)
18:26:48 yeah, fair
18:27:05 #info pkgdb info on packages is now updated once a week from rawhide metadata.
18:27:23 which means it's missing some packages, those only present in the other branches
18:27:33 yeah, or epel only or whatever.
18:28:06 anything we want to try and land before freeze next week?
18:28:14 or are we in pretty good shape for the apps that freeze?
18:29:18 I have some big changes coming up in pkgdb land, but that's something to coordinate with rel-eng
18:29:29 and it'll most likely wait for after the freeze
18:29:35 sounds good
18:29:41 #topic Sysadmin status / discussion
18:29:50 let's see... on the sysadmin side of the world.
18:30:21 I've been seeing problems with our nightly ansible check/diff cron job not completing. :( Still investigating... it's just being really really slow when run from cron.
18:30:35 #info bastion02 reinstalled with rhel7 and ansible
18:30:54 I've reinstalled bastion02... and I would like to take a quiet time off hours to test it as vpn hub.
18:30:55 \ó/
18:31:08 it would be a short blip as everything reconnects (if all goes well)
18:31:38 there's more movement on the new qa boxes that were supposed to have been ordered in Q2, hopefully they'll be ordered in the next week or so
18:31:50 tflink: great. Just keep us posted.
18:31:55 will do
18:32:05 I am working on two things
18:32:12 1) getting a rack for the QA machines...
18:32:19 tflink: I was going to ask what you would think of moving all the instances off virthost-comm01.qa to 03? but I'm not sure we will have time before freeze...
18:32:29 2) starting to inventory what we have and what we will want for next fiscal year
18:32:58 * nirik nods.
18:33:15 smooge: might be good to check support status for everything too... make sure we didn't miss any renewals.
18:33:20 I learned that juno should be released next week https://wiki.openstack.org/wiki/Juno_Release_Schedule
18:33:28 nirik, will do so
18:33:37 oh BOY
18:33:37 so we may install juno for the next fedora cloud
18:33:52 mirek-hm: fun. ;) what changes will we need to make for that?
18:33:57 by that time i should have that EqualLogic attached
18:34:05 mirek-hm: was also going to ask... yeah, about that... ;)
18:34:14 start from scratch, burn the old to the ground, add kerosene and matches
18:34:18 nirik: I do not know, hopefully nothing :)
18:34:56 ok, yeah, we should go with newer if we at all can.
18:36:00 packstack is primarily developed in RH, and most development is backported immediately to RDO, so our installation should be identical or need just a few touches. everything else is over the api, which should be the same
18:36:57 ok, great.
18:37:06 thanks for working on it mirek-hm
18:37:09 thanks mirek-hm
18:37:16 smooge: on or off the matches?
18:37:19 my idea was that we can utilize this time to try upgrades. i.e. keep the current installation and, before we burn it down, try to upgrade it to juno. so we will be more sure of what we are doing when we will be upgrading juno to k-something
18:37:58 mirek-hm: well, I'd be ok with that, but I think we should migrate manually anything important off it to the new one first
18:38:01 so icehouse to juno?
18:38:07 folsom
18:38:08 or folsom to juno?
18:38:17 * nirik doesn't think it will work at all. ;)
18:38:27 oh god.. you are a braver man than I
18:38:34 icehouse to juno, then reprovision it, and install juno from scratch
18:39:02 but it will prepare us to upgrade from juno to k-something next year
18:39:07 mirek-hm: or we can save that for after. Install juno now, get migrated, then install an icehouse and play with upgrades on a single node?
18:39:36 I really want to get off this folsom one. ;)
18:39:45 me too :)
18:40:08 when we installed folsom they said upgrades were not supported at all.
18:40:17 glad to hear that it might work now. ;)
18:40:39 anyhow, anything we can do to accelerate moving to a new one is good with me. We can then take time after to test things or whatever.
18:41:14 #info openstack juno out next week, will try and move to that for our new cloud.
18:41:20 #info need to test openstack upgrades
18:41:38 ok, one other thing I wanted to bring up:
18:42:14 currently for rhel6 hosts, we use denyhosts. It's dead upstream and has no epel7 branch (nor will it), so we have to move to something else for rhel7 hosts.
18:42:31 I tried fail2ban and could not get it working at all. It crashed my test machine too.
18:42:45 I tried pam_abl (didn't work at all) and pam_shield
18:43:10 pam_shield works, but only if you allow one or both of password auth or token auth in sshd.
18:43:29 Anyone have any better ideas in the area of blocking brute force sshd junk?
18:44:03 #info ideas wanted for rhel7 denyhosts replacement
18:44:32 I guess we could just put up with the log noise, or do iptables hashlimit for now.
18:44:50 it's just annoying there's no working solution in this space. ;)
18:45:08 can't we disable password auth?
18:45:14 we have. long ago
18:45:46 the issue is external hosts get 10,000 ssh attempts... so logs get filled with 'failed login for admin' 'failed login for root'
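Since iptables hashlimit is the fallback mentioned above, a sketch of what that could look like - per-source rate limiting of new ssh connections so brute-force sources get dropped before sshd ever logs them. The thresholds are illustrative, not tuned values from the meeting:

    # Accept new ssh connections from a given source only under 5/minute
    # (burst of 10); drop the overflow so it never reaches sshd's logs.
    iptables -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW \
        -m hashlimit --hashlimit-name ssh --hashlimit-mode srcip \
        --hashlimit-upto 5/minute --hashlimit-burst 10 -j ACCEPT
    iptables -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW -j DROP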
18:47:04 anyhow, can take that out of the meeting if anyone has ideas. ;)
18:47:09 anything else sysadmin-wise?
18:47:11 Is there an open ticket and is someone assigned?
18:47:31 lanica: I think tickets for this kind of thing are bad, but I could open one I suppose.
18:47:52 I don't quite follow...
18:48:08 good ticket: "do x and y", bad ticket: "figure out the best way around this problem that will take a lot of discussion and it's not clear what we should do yet"
18:48:31 IMHO tickets are poor for open discussion on something, but great when there's a known thing to do or action.
18:48:55 The list is probably better for this...
18:49:00 Understood. But to get to specific steps someone needs to dig in and figure out what works, unless someone has hit this and has the answers.
18:49:02 Good point though.
18:49:18 I might try to work on it, so I'll talk on list if so ;)
18:49:38 I'll post to the list... there are a lot of things I have already looked at that don't work. ;)
18:49:52 #topic nagios/alerts recap
18:50:41 .tiny https://admin.fedoraproject.org/nagios/cgi-bin//summary.cgi?report=1&displaytype=3&timeperiod=last7days&smon=10&sday=1&syear=2014&shour=0&smin=0&ssec=0&emon=10&eday=9&eyear=2014&ehour=24&emin=0&esec=0&hostgroup=all&servicegroup=all&host=all&alerttypes=3&statetypes=3&hoststates=7&servicestates=120&limit=25
18:50:48 nirik: http://tinyurl.com/qevmn8u
18:51:11 so the top two there... need us to fix our monitoring. ;)
18:51:14 Rank Producer Type Host Service Total Alerts
18:51:14 #1 Service collab03 mail_queue 226
18:51:14 #2 Service lockbox01 Zombie Processes 182
18:51:33 collab03 mail_queue notices because sometimes it has more than a few emails in the queue, because it's sending to a large list.
18:51:50 Zombies... it's not quite Halloween yet....
18:52:07 lockbox01 zombie alerts happen because something like ansible runs over tons of machines and some of them show up as zombies until they are reaped.
18:52:12 (or puppet might also do it)
18:52:44 those are things I can (and will) file tickets on. ;)
18:53:14 #topic Upcoming Tasks/Items
18:53:14 https://apps.fedoraproject.org/calendar/list/infrastructure/
18:53:25 any upcoming items anyone wants to schedule or note?
18:53:35 when is freeze again?
18:53:51 * pingou wonders if we should add it to the calendar
18:54:22 tuesday.
18:54:26 sure! we should
18:54:42 2014-10-14 f21 beta freeze
18:54:42 2014-10-28 f21 beta release
18:54:59 #topic Open Floor
18:55:05 anyone have any items for open floor?
18:56:12 oh, in case anyone should need me, I'll be mostly afk friday morning through sunday.
18:56:22 threebean: enjoy :)
18:56:48 threebean: cool. ;)
18:56:57 alright, thanks for coming everyone. ;)
18:57:00 * threebean waves
18:57:01 #endmeeting