20:00:03 #startmeeting Infrastructure outage retrospective (2011-03-22)
20:00:03 Meeting started Tue Mar 22 20:00:03 2011 UTC. The chair is nirik. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:00:03 Useful Commands: #action #agreed #halp #info #idea #link #topic.
20:00:04 #meetingname infrastructure-retrospective
20:00:04 The meeting name has been set to 'infrastructure-retrospective'
20:00:27 Greetings everyone. We are going to be having a retrospective/lessons learned/brainstorming session here.
20:00:40 w00t!
20:00:57 There was a hardware failure Friday on our signing server, and we want to figure out better ways to mitigate any risks from such.
20:01:03 who all is around? :)
20:01:09 * dgilmore is
20:01:19 Hola
20:01:31 * skvidal is here
20:01:41 Oxf13: you around?
20:01:53 * Oxf13
20:02:04 * jsmith is here
20:02:16 funny enough, I wanted to move the meeting to avoid lunch, and I haven't grabbed lunch yet :)
20:02:24 smooge: you around?
20:02:51 Oxf13: That's OK... it's 4:00pm here and I still haven't had (a proper) lunch
20:03:24 * nirik had some crackers and cheese. Lunch of champions. ;)
20:03:31 anyhow, I guess let's get started?
20:03:42 #topic Timeline/Recap
20:03:53 here
20:03:59 hey smooge.
20:04:24 So, we have a small number of sensitive servers that don't follow our normal updates policy.
20:04:52 last Friday we determined it might be a good idea to update them and reboot them into new kernels, etc.
20:05:16 yes.
20:05:17 One of those is the sign-vault01 server. We applied updates to it and rebooted.
20:05:28 it had a hardware failure and didn't come back up.
20:05:36 eventually the plan for the sign-vault is to have it off of the network entirely
20:05:50 we then took 2 approaches:
20:06:00 a) move drives to spare hardware and bring it back up.
20:06:07 b) get a new instance set up as a virtual.
20:06:17 (a) managed to complete first and we were back up.
20:06:32 (I would like to thank everyone who worked on getting it back up and working)
20:06:53 So, I think that's the recap/timeline... anyone have anything I missed in there?
20:07:01 * dgilmore thinks b is bad
20:07:21 it could be. ;)
20:07:33 I have something
20:07:33 #topic setup of sensitive boxes
20:07:36 well at the time there wasn't any spare hardware
20:07:37 #undo
20:07:37 Removing item from minutes:
20:07:49 Oxf13: shoot
20:07:50 We did actually get B mostly up. A virt host was created, and I got a copy of the sensitive data onto it
20:07:59 the data isn't in a production directory though.
20:07:59 I would like to thank RHIT for finding the spare hardware that they got out of somewhere for us
20:08:18 smooge: good point. RHIT found us spare HW for that.
20:08:29 Oxf13: ok.
20:08:51 they also dropped another problem and moved that hardware for us.
20:09:15 Many kudos for their assistance.
20:09:39 anyone have anything more on background?
20:09:43 yes
20:09:52 smooge: go ahead
20:10:02 or is it a lesson learned as the data was not backed up anywhere
20:10:26 yeah... backups is on the agenda. ;)
20:10:35 it is good background data
20:10:43 at least we never got a yes/no on whether or not the backup ever completed
20:10:50 there is no backup on it
20:11:01 :( ok, good to know...
20:11:19 it was not put into backups because people were worried about mixing of sensitive data with regular backups
20:11:41 ok, shall we move on and discuss backups?
20:11:46 yah
20:11:53 #topic Backups of sensitive boxes
20:11:54 and the previous topic too
20:12:03 'setup'
20:12:34
20:12:47 why don't we save setup for later and hash out some of the things that might be easier/simpler? or do we want to look at setup first to know the answer to others?
20:13:10 maybe I could talk about the original plan for these boxes?
20:13:19 ok.
20:13:20 #undo
20:13:20 Removing item from minutes:
20:13:24 I think we need to look at the original assumptions
20:13:27 #topic setup of sensitive boxes
20:13:40 which are that these boxes are being treated in a way which ensures no one pays attention to them
20:13:40 I'd like to note: https://fedoraproject.org/wiki/User:Mitr
20:13:51 My original plan was that bridge would be connected to the network, and allow ssh and sigul connections
20:13:52 I'd like to note that is not in an obvious place :)
20:13:53 has background info on the signing servers.
20:14:00 I wanted to limit the attack vectors though
20:14:05 originally the box was intended to be connected to the bridge via a crossover cable
20:14:05 skvidal: agreed.
20:14:08 so I wanted puppet off, and backups off.
20:14:12 and have no network connection at all
20:14:23 the vault, which has the sensitive data, was to be only crossover connected to bridge
20:14:30 which never happened, right?
20:14:37 any admin work was going to require a serial connection
20:14:51 "Non-network connection (USB/serial) does not provide enough infrastructure (packeting, checksumming, retransmissions, debugging tools) that I'd rather not reimplement now; this can be replaced later if necessary."
20:14:56 and maybe even an on-site call to hook up the serial, so as to not have it sitting there waiting for somebody to bang on the serial port
20:15:20 skvidal: that never happened. The server remained connected to the network and allowed ssh in
20:15:35 We had some stability issues that required frequent restarts of the vault and bridge processes
20:15:52 and we were not comfortable severely limiting our access to the machine.
20:16:07 * nirik nods.
20:16:14 also, these were hand installed and set up systems that got turned into production systems
20:16:17 which was my fault.
20:16:50 I believe we were under time pressure to use it for whatever Fedora release needed to be signed
20:17:07 and did not take the time to use puppet to rebuild the boxes in an automated way
20:17:12 13 beta I think?
20:17:30 As for backups, the plan was mostly hand-wavy
20:17:56 "we" felt that too many people had access to the backup storage and could grab the nss dbs and brute force them at their leisure
20:18:04 so we did not hook it into the backup system
20:18:13 who is 'we'?
20:18:15 but we also failed to create and follow an alternative backup plan
20:18:34 skvidal: Mostly me, and I believe mmcgrath and notting were in the conversation
20:18:44 our goal was to limit the number of ways the nss dbs could be accessed
20:18:55 number of ways and number of people.
20:19:25 the nss dbs are still passphrase protected (but of course could be pounded on with brute force given access to them)
20:20:11 so, at the very least we need some backup plan.
20:21:00 do we wish to pursue the original plan for serial access only, etc? or something else?
20:21:02 nirik: yeah, and with Amazon cloud, the amount of time / $$ it takes to brute force got significantly smaller IIRC
20:21:02 the backups should only need updating when we add a new key
20:21:30 dgilmore: I'd like to verify with mitr where data about what user has access to which key and the passphrases for the users are stored.
20:21:49 dgilmore: while we can re-add users and such, it'd be a hassle, so that data should be backed up on change too
20:21:50 * dgilmore would nearly be ok with someone putting an encrypted usb key in and backing up to that
20:22:01 the backup01 box is pretty limited / also a sensitive box. Perhaps we could store a backup there?
20:22:12 Oxf13: true but that doesn't change often
20:22:16 just to be clear
20:22:26 the only confidential data are the keys, correct?
20:22:43 skvidal: user passphrases should be treated as confidential too IMHO
20:23:01 each user has their own unique passphrase to access the system, along with an ssl cert
20:23:03 skvidal: there are the keys and the users' passwords to access the keys
20:23:09 (well the ssl cert being the FAS cert)
20:23:12 but we don't need to back up those
20:23:16 they don't matter
20:23:20 the keys for signing matter
20:23:23 skvidal: no, they can be re-created.
20:23:29 the ssl certs and passphrases can be nuked from orbit
20:23:29 right
20:23:31 so for BACKUPS
20:23:32 the keys are the critical thing we can't lose.
20:23:36 right
20:23:36 so
20:23:44 why don't we just back those up to some place
20:23:47 double-encrypt
20:23:49 treble, if you'd like
20:23:53 it doesn't really matter
20:24:03 we don't need to automatically restore this info
20:24:17 Encrypt on a USB key, tape the USB key to the back of the server
20:24:19 but EVERYTHING else on the box needs to be automatically re-provisionable
20:24:22 jsmith: nah
20:24:28 jsmith: requires someone on sight
20:24:31 onsite, even
20:24:37 my point is just this
20:24:46 the only piece that we need to back up is trivially small
20:24:50 and hell if we PRINT THEM OUT
20:24:53 we can get away with it
20:25:04 yes, it's a small amount of data.
20:25:05 right, it is trivially small
20:25:10 personally a gpg encrypted file of the backup files should probably be enough.. make the passphrase 64+ characters and it will still take an IPv6 full amount of computers to crack it.
20:25:15 the question is how paranoid do we want to be.
20:25:31 well let's put it this way
20:25:32 mitr may say that the encryption on the nss db is enough
20:25:35 none of the attacks we've faced
20:25:44 and that we should be "safe" having it out there
20:25:44 have been as a result of someone brute-forcing anything
20:26:02 I've got no issue with keeping these outside of the normal backup routines
20:26:07 but they need to be backed up SOMEWHERE
20:26:10 and that location needs to be:
20:26:12 raptor proof
20:26:14 documented
20:26:16 duplicated
20:26:25 right.
20:26:25 and afaict it is NONE of those right now
20:26:27 right?
20:26:34 it's not in existence right now. ;)
20:26:39 exactly
20:26:49 what about this
20:27:09 what about a cron job on the vault that will check to see if the dbs have changed, and if so, scp them off to some other host
20:27:15 no
20:27:17 that other host /is/ part of the backup process
20:27:22 the keys change 2 times a year, right?
20:27:39 skvidal: and at odd times for EPEL
20:27:41 skvidal: roughly
20:27:45 we should be able to add a 'backup the nss dbs' step to the 'add a new key' SOP
20:27:49 nirik: +1
20:27:51 yes
20:27:53 and moreover
20:27:54 epel gets a new key with new rhel
20:27:55 nirik: ok, fine by me.
20:28:01 if we do not have a releng person who can do this
20:28:03 then we're SCREWED
20:28:33 so, proposal: gpg encrypt the needed files for another layer, and back them up on backup01 and/or another non phx2 host?
20:29:00 nirik: back them up to some place inside RHT if we want several more layers of obfuscation
20:29:08 I would say on a non phx2 box
20:29:15 maybe our dr box
20:29:17 I vote non phx2
20:29:42 * nirik is fine with that... passphrase in the private puppet repo?
20:29:52 I kinda like the idea of somewhere within RHT, but maybe that's too complicated or too politically sensitive.
20:29:52 nirik: no
20:30:12 dgilmore, ok 1 we currently do not have a dr box inside a place I would "trust" with them
20:30:12 I'd like to get mitr's opinion on whether or not we need a second layer of encryption
20:30:20 ok. but it does us no good if a raptor proof amount of people don't have it. ;)
20:30:27 nirik: sure you can
20:30:42 nirik: give the passphrase to mark cox
20:30:46 Oxf13: ok.
20:30:46 or red hat infosec
20:31:07 b/c if rh decides to screw fedora in some way the gpg keys will be the least of our problems :)
20:31:38 sure...
20:31:54 that problem goes away if we don't need to do another layer on them.
20:32:37 so: check with mitr on what level of paranoia we need for the nssdb. Either encrypt again or not, and back up to a non phx2 host.
20:32:55 anything else on backups? or is that good for those?
20:33:17 the bridge also has a nssdb, should that get the same treatment?
20:33:22 we need to also have a periodic test of whether we can deploy a new host via puppet and restore from backup
20:33:39 nirik: I'm not sure what all is in the bridge db, and how sensitive it is.
20:33:56 it may just have the mappings of users to rights and user passphrases.
20:34:00 * nirik gets ready to make with action items if no one objects.
20:34:01 I don't think it needs to be backed up
20:34:11 so a mitr question
20:34:19 ok
20:34:26 There is a problem with the setup assumptions.
20:34:47 #agreed backup nssdb to a non phx2 host. (may need another layer of encryption or not)
20:34:51 s/is a/are a couple/
20:35:02 #agreed will check with mitr on sensitivity and what needs to be backed up.
20:35:06 yeah, somewhat difficult to get the nss dbs off the box if it's off the network.
20:35:13 well, actually
20:35:24 it can't be totally off the net.
20:35:28 but it could not allow anything in.
20:35:30 no, still difficult, but not impossible.
20:35:50 nirik: right, it at least has to have a network connection to bridge
20:35:51 smooge: go ahead.
20:36:25 1) The sign-bridge is a virtual system. We can't do a crossover cable to it. So we need to get other hardware for it if we want that.
20:36:57 so, we could say: new setup is no incoming connections allowed to the box. Access is via serial for adding new keys or restarting things or applying updates or doing a backup of nssdb.
20:37:36 nirik: new key addition is done through the sigul client, no need to log into vault for it
20:37:37 2) New hardware is needed anyway as the box's current warranty is almost over.
20:37:56 3) The box also did not have a good warranty on it. It is next business day RMA only.
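A minimal sketch of what the 'backup the nss dbs to a non phx2 host' step agreed above might look like once it is written into the 'add a new key' SOP. The nss db path, the off-site hostname, and the choice to add the optional second gpg layer are all placeholders for illustration, not details settled in the meeting.

    #!/bin/bash
    # Sketch of the 'backup the nss dbs' step for the 'add a new key' SOP.
    # NSSDB_DIR and the destination host below are hypothetical, not the real layout.
    set -euo pipefail

    NSSDB_DIR=/var/lib/sigul                  # assumed location of the vault's nss db
    STAMP=$(date +%Y%m%d)
    ARCHIVE=/root/sigul-nssdb-${STAMP}.tar.gz

    # Bundle the nss db files; the dbs themselves remain passphrase protected.
    tar -czf "${ARCHIVE}" -C "$(dirname "${NSSDB_DIR}")" "$(basename "${NSSDB_DIR}")"

    # Optional second layer of encryption (pending mitr's verdict): symmetric gpg
    # with a long passphrase held outside the normal puppet/backup trust set.
    gpg --symmetric --cipher-algo AES256 --output "${ARCHIVE}.gpg" "${ARCHIVE}"
    shred -u "${ARCHIVE}"

    # Copy to a non phx2 host; backup-offsite.example.org is a placeholder name.
    scp "${ARCHIVE}.gpg" backup-offsite.example.org:/srv/sign-vault-backups/

Using symmetric gpg keeps the extra layer independent of any key material stored in the infrastructure itself; whether that layer is needed at all was left as a question for mitr.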
20:37:59 smooge: well, the serial/crossover is not implemented, so I don't think we need to worry about it now.
20:38:20 nirik, however the assumption has been that they could go to it anytime when it was implemented.
20:38:36 yeah, not sure if it's on the roadmap or off.
20:38:37 they can't.. without more resources.
20:39:19 nirik: I'd still like to do it when we are confident that the vault will just run
20:39:21 that's another mitr question I guess, but I don't think we can plan for it now without a roadmap.
20:39:39 smooge: is there replacement hardware available for the box?
20:40:06 I budgeted for one next quarter. But I didn't for second hardware :/
20:40:30 so currently if the box has issues it can be 24-72 hours before it is fixed.
20:40:46 Since it goes off contract in May I am not sure it's worth fixing it.
20:41:28 e.g. by the time I get it through the system...
20:41:34 yeah.
20:42:47 ok, so should we leave it as set up currently and revisit when new hardware arrives?
20:43:07 or should we move to the "no incoming connections allowed to the box. Access is via serial for adding new keys or restarting things or applying updates or doing a backup of nssdb." plan
20:43:18 well I think we don't have much choice beyond setting up some sort of gpg encrypted backup of the databases.
20:43:21 or does anyone have another plan to toss out there. ;)
20:43:24 I think I am in favor of leaving it as is
20:43:27 I'd vote for leaving it as is, except for adding the backup SOP
20:43:30 and NOT making those changes when hw arrives
20:43:37 meaning - this is how far it goes
20:43:40 and no further
20:43:50 and we all just take the pills which keep us from being this paranoid
20:44:09 ok, that goes to the next topic...
20:44:16 #topic backups for sensitive boxes
20:44:16 heh
20:44:21 #undo
20:44:21 Removing item from minutes:
20:44:26 #topic updates for sensitive boxes
20:44:29 updates. ;)
20:44:41 skvidal: well, I plan on washing my hands of the whole thing in about 9 months so....
20:44:42 I think it's a bad idea to have these out of our regular backup cycle.
20:44:56 * nirik sighs
20:44:59 Oxf13: it's like the hotel california
20:45:01 s/backup/updates/
20:45:03 Oxf13: :)
20:46:06 so, proposal: we apply updates to sensitive boxes at the same time as others (taking into account freezes, etc).
20:46:13 nirik "have these out of our".. what is these?
20:46:21 nirik: which means we enable funcd on those boxes?
20:46:52 sign-bridge01, sign-vault01, backup1 (I think that's all... are there others that don't run func or the like)
20:46:58 ugh, paranoia setting in again.
20:47:24 backup02 sort of fits into that.
20:47:48 though func runs on it.. and puppet so never mind
20:48:12 I don't think we should do func unless we also are doing puppet, etc... which I don't think we want.
20:48:22 don't we?
20:48:28 * dgilmore thinks we should only ever install security updates
20:48:57 well, I guess the idea is that these are 'off the grid' so a compromise in puppet/fas wouldn't also get them.
20:49:20 right, that was the paranoia
20:49:42 dgilmore: we could remove a ton of packages on them too I think... they could be much more minimal. (something to do with the new one)
20:49:46 no attack vectors from automation such as puppet or func (or backups)
20:50:00 nirik: right
20:50:10 we should only have installed what we need
20:50:46 so, I'd say no func, no fasClient (local users?), no puppet. However, we should apply updates as part of our other updates flow... not put it off.
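Since the proposal above keeps func, puppet, and fasClient off these boxes but still applies updates on the normal cadence, the update pass would have to be run by hand over ssh or the serial console. A rough sketch follows, assuming yum-plugin-security is installed if only security errata are wanted (per dgilmore's preference); nothing here is an agreed procedure.

    # Manual update pass on sign-bridge01 / sign-vault01 (no func, no puppet).
    # Assumes yum-plugin-security is available for the --security filter.
    yum --security check-update   # list pending security errata
    yum --security -y update      # apply only security updates
    # Drop --security to take the full update set during the regular updates window.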
20:51:09 nirik: backup01 has local users
20:51:30 I think we should have sign* do that too... it's using fas currently I am pretty sure.
20:51:42 nirik: right
20:51:52 what is sign* ?
20:51:52 does anyone object too strongly to that level of paranoia?
20:52:02 well if we are going that route.. we might as well make this a home brewed ARM PCI card. I am not sure where the last stop of the reasonable train is.
20:52:03 I agree with no fas, no func, no puppet
20:52:06 shorthand for sign-bridge01 and sign-vault01
20:52:08 nirik: I think it's a waste of time
20:52:17 oh ok.
20:52:23 nirik: I think we're chasing down paths that will ultimately get us right back to where we were on Friday
20:53:08 counterproposal? :) func and puppet and fas like any other machines?
20:53:10 I do. I think that at a certain point we might as well go back to hand signing it from a cdrom in Gafton's cubicle, because we are now assuming a lot of technical details that can be completely circumvented by physical access.
20:54:41 whatever we do, it has to be better than having the keys just sitting in somebody's homedir on say releng2....
20:54:48 I think we need to make sure we have the backups so, should anything happen, we can rebuild
20:55:00 "The Incident" is what started some of this paranoia.
20:55:08 Oxf13: right now the epel key is sitting on releng01
20:55:13 relepel01
20:55:21 the one for el4 and 5
20:55:22 1) I think the current level, with a set of backup hardware and a set of backups that are encrypted, is what we need to deal with.
20:55:29 I really should import that into sigul
20:56:05 after that we are assuming a lot of physical controls that don't exist.
20:56:09 the incident has nothing to do with this
20:56:20 skvidal: I beg to differ.
20:56:25 well, currently we are not doing func or puppet on them, and fasClient only when someone fixes it to run. ;)
20:56:40 * nirik sees we are getting near an hour now.
20:56:45 here is my input.
20:57:18 If we're going to turn on puppet/func/fasclient, then I really don't see the point in being extra paranoid about our backups, and we just turn on the backup stuff too and have those dbs sit with the rest of the backup data.
20:57:29 (provided they aren't unlocked on the filesystem while the daemon is running)
20:58:25 so, perhaps input from mitr would be helpful for us to decide A or B? (since they are kinda wildly different sides. ;)
20:58:54 yes.
20:59:16 ok, let's gather that and revisit in the regular infra/rel-eng channels for further decision?
20:59:34 data points for right now
20:59:49 sign-vault02 exists and has a backup of the vault data as of Friday
21:00:07 that should remain in place until we have an agreed upon backup solution going
21:00:13 * nirik nods.
21:00:28 is sign-vault02 fs encrypted?
21:00:47 I don't believe so
21:00:53 also, another datapoint: replacement hardware is available and sign-vault01 drives are going to be moved back to it later today.
21:00:56 abadger1999: nope.
21:01:00 I discussed fs encryption with mitr, and he had the opinion that it was a waste
21:01:01 k
21:01:07 given that the dbs were already encrypted
21:01:21 smooge: ping
21:01:25 A shut-down fs encrypted host would make a decently secure warm backup.
21:01:28 hi seanjon
21:01:30 abadger1999: +1
21:01:34 smooge: let's push the hardware swap to 5pm
21:01:41 5pm your time?
21:01:43 yea
21:01:47 abadger1999: it would mean getting through the fs encryption and then through the nss db encryption
21:01:49 abadger1999: yeah, could be.
21:01:54 I think skvidal suggested something similar in email.
21:01:57 yep.
21:02:13 seanjon, I think that will be 00:00 UTC
21:02:25 abadger1999: it'd make the backup process a bit more complicated, and require yet another passphrase to be shared or stored somewhere.
21:02:31 you are UTC-7 I believe
21:02:41 ok, I had 2 more small items I wanted to note, shall we do them real quick, then call it a meeting and try and ponder longer term decisions?
21:02:48 Oxf13: backup process MORE complicated?
21:02:49 seanjon, ok will move it to 00:00 UTC
21:02:51 you just snapshot the lvm
21:03:19 s/lvm/lv/
21:03:38 #topic Misc items
21:04:04 I'd like to suggest we not do updates on Fridays... especially on hardware that only has next business day response. ;)
21:04:42 skvidal: er, you'd have to boot the spare system, unlock it, then copy over the changed files, and shut the backup system down again
21:04:50 Oxf13: no
21:04:56 Also, on this outage we didn't have any outage announcement, but should we have? it didn't cause any disruption to end users or developers... just need to consider when we notify about outages and how. (revisit that process)
21:04:58 yes. sorry I wasn't watching the date when I ok'd doing updates
21:04:59 Oxf13: you run the primary on an lv
21:05:04 snapshot and dd to a file
21:05:16 run the warm-backup off of the dd
21:05:37 smooge: no worries. I think it was me that was saying we should just do them then...
21:06:09 I think we both did. I have been working so many weekends I didn't think.
21:06:18 yeah, easy to do.
21:06:40 to be fair
21:06:48 this had NOTHING to do with the updates
21:06:51 the updates were fine
21:06:55 skvidal: I guess I was confused as to how the backed up data would get onto the shut-down, encrypted other system.
21:06:56 the kernel was fine
21:07:15 the only issue here is that the hw pooped on itself
21:07:20 and was unrecoverable
21:07:28 let's not make this into something it is not
21:07:28 yeah
21:07:44 it has NOTHING to do with updates nor, even, with rebooting
21:07:56 if we had opted to NOT reboot these boxes
21:08:03 it may well have failed in an identical way
21:08:06
21:08:18 true.
21:08:27 or worse... failed in ways we didn't see
21:08:36 post hoc ergo propter hoc
21:08:37 * abadger1999 thinks we should also consider doing periodic reboots of the box.
21:08:43 abadger1999: +100000
21:08:47 abadger1999: rebooting ALL OF OUR BOXES
21:08:49 every 150 days
21:08:51
21:08:52 no matter way
21:08:54 just not on Friday.
21:08:56 err s/way/what/
21:09:09 surely. entropymonkey. :)
21:09:19 drifting off topic
21:09:33 anyhow, shall we close up and take our thoughts to the list/infra&rel-eng meetings?
21:09:44 sure
21:09:50
21:10:02 thanks for all the info and brainstorming everyone!
21:10:10 #endmeeting
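For reference after the meeting: a rough sketch of the snapshot-and-dd warm-backup idea skvidal described above (primary vault on a logical volume, snapshot it, dd the snapshot to an image, run the warm backup off that image). The volume group, LV names, snapshot size, and target path are placeholders; this was an idea floated in discussion, not an agreed procedure.

    #!/bin/bash
    # Sketch of the snapshot-based warm-backup approach; all names are hypothetical.
    set -euo pipefail

    VG=vg_guests
    LV=sign-vault01
    SNAP=${LV}-snap
    IMG=/srv/warm-backup/${LV}-$(date +%Y%m%d).img

    # Take a temporary copy-on-write snapshot of the running guest's LV.
    lvcreate --snapshot --name "${SNAP}" --size 4G "/dev/${VG}/${LV}"

    # Copy the frozen view of the disk to an image the warm-backup host can use.
    dd if="/dev/${VG}/${SNAP}" of="${IMG}" bs=4M conv=fsync

    # Drop the snapshot once the copy is done.
    lvremove -f "/dev/${VG}/${SNAP}"

The copy comes from the snapshot rather than the spare system itself, which is skvidal's answer to the objection that a shut-down warm backup would have to be booted and unlocked every time changed files are copied over.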