16:00:21 <lenkaseg> #startmeeting Infrastructure (2021-11-25)... (full message at https://libera.ems.host/_matrix/media/r0/download/libera.chat/f41e9cef1bbe06d4d48d7abf25e8bf514f0fb6b5) 16:00:21 <zodbot> Meeting started Thu Nov 25 16:00:21 2021 UTC. 16:00:21 <zodbot> This meeting is logged and archived in a public location. 16:00:21 <zodbot> The chair is lenkaseg. Information about MeetBot at https://fedoraproject.org/wiki/Zodbot#Meeting_Functions. 16:00:21 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic. 16:00:21 <zodbot> The meeting name has been set to 'infrastructure_(2021-11-25)..._(full_message_at_https://libera.ems.host/_matrix/media/r0/download/libera.chat/f41e9cef1bbe06d4d48d7abf25e8bf514f0fb6b5)' 16:00:36 <mobrien> .hi 16:00:37 <zodbot> mobrien: mobrien 'Mark O'Brien' <markobri@redhat.com> 16:01:04 <lenkaseg> Hello everybody, today unexpectedly with me as a chair! 16:01:18 <mobrien> lenkaseg: you will need to paste line by line on matrix 16:01:22 <austinpowered_> .hi 16:01:23 <zodbot> austinpowered_: Sorry, but user 'austinpowered_' does not exist 16:01:29 <lenkaseg> ah, true 16:01:36 <lenkaseg> #meetingname infrastructure 16:01:36 <zodbot> The meeting name has been set to 'infrastructure' 16:01:43 <lenkaseg> #chair nirik siddharthvipul mobrien zlopez pingou bodanel dtometzki jnsamyak computerkid 16:01:43 <zodbot> Current chairs: bodanel computerkid dtometzki jnsamyak lenkaseg mobrien nirik pingou siddharthvipul zlopez 16:01:50 <lenkaseg> #info Agenda is at: https://board.net/p/fedora-infra 16:02:02 <lenkaseg> #info About our team: https://docs.fedoraproject.org/en-US/cpe/ 16:02:08 <lenkaseg> #topic greetings! 16:03:03 <nils> \o 16:03:08 <nils> .hello nphilipp 16:03:09 <zodbot> nils: nphilipp 'Nils Philippsen' <nphilipp@redhat.com> 16:03:10 <austinpowered> .hi 16:03:12 <zodbot> austinpowered: austinpowered 'T.C. Williams' <fedoraproject@wootenwilliams.com> 16:04:34 <lenkaseg> #topic New folks introductions 16:04:43 <lenkaseg> #info This is a place where people who are interested in Fedora Infrastructure can introduce themselves 16:04:52 <lenkaseg> #info Getting Started Guide: https://fedoraproject.org/wiki/Infrastructure/GettingStarted 16:05:39 <lenkaseg> Anybody new here, who would like to introduce themselves? 16:07:30 <lenkaseg> Ok then, let's continue 16:07:37 <lenkaseg> #topic Next chair 16:07:42 <lenkaseg> #info magic eight ball says: 16:07:50 <lenkaseg> #info chair 2021-11-25 - lenkaseg 16:08:13 <lenkaseg> Anybody interested in chairing next week? 16:08:28 <mobrien> I will do it 16:08:35 <lenkaseg> mobrien: sold! 16:08:48 <lenkaseg> #info char 2021-12-02 - mobrien 16:09:35 <lenkaseg> I guess we can determine the next next chair next time 16:09:45 <lenkaseg> That is a lot of next in one sentence 16:09:57 <lenkaseg> Let's move on 16:09:59 <mobrien> We are probably down volunteers due to thanksgiving 16:10:28 <lenkaseg> #topic announcements and information 16:10:33 <lenkaseg> #info CPE Infra&Releng EU-hours team has a Monday through Thursday 30 minute meeting going through tickets at 1030 Europe/paris in #centos-meeting 16:10:40 <lenkaseg> #info CPE Infra&Releng NA-hours team has a Monday through Thursday 30 minute meeting going through tickets at 1800 UTC in #fedora-meeting-3 16:10:48 <lenkaseg> #info If your team wants support from the Fedora Program Management Team, file an isssue: https://pagure.io/fedora-pgm/pgm_team/issues?template=support_request 16:10:54 <lenkaseg> #info Happy Thanksgiving to those in US! 16:11:21 <lenkaseg> Is there some more info worth sharing? 16:13:01 <lenkaseg> Let's move on 16:13:04 <lenkaseg> #topic Oncall 16:13:11 <lenkaseg> #info https://fedoraproject.org/wiki/Infrastructure/Oncall 16:13:17 <lenkaseg> #info https://docs.fedoraproject.org/en-US/cpe/day_to_day_fedora/ 16:13:24 <lenkaseg> ## .oncalltakeeu .oncalltakeus 16:13:35 <lenkaseg> #info mobrien on call from 2021-11-18 to 2021-11-25 16:13:43 <lenkaseg> #info lenkaseg on call from 2021-11-25 to 2021-12-02 16:14:12 <lenkaseg> I guess I can take the oncall right now... 16:14:29 <lenkaseg> .oncalltakeeu 16:14:29 <zodbot> lenkaseg: Kneel before zod! 16:14:35 <mobrien> There was no oncall ping for the last week 16:14:49 <lenkaseg> Thanks for the update mobrien 16:16:03 <lenkaseg> Any volunteers for the week after December 2nd? 16:16:33 <lenkaseg> If not we can leave it for the next week 16:17:04 <mobrien> I think leave it until more people are here 16:17:34 <lenkaseg> ok, let's leave it for next time 16:17:48 <lenkaseg> ummm, now the topic... 16:18:15 <lenkaseg> I guess the Monitoring discussion with nirik was last week 16:18:29 <mobrien> I can do the monitoring topic this week 16:18:54 <lenkaseg> #topic Monitoring discussion 16:19:02 <lenkaseg> mobrien: ok, all yours! 16:19:09 <mobrien> So not a huge amount to report 16:19:26 <mobrien> copr are having some issues with their dev instances in AWS 16:19:38 <mobrien> They will work on it over the next week but no rush 16:20:07 <mobrien> proxy09 is flapping a bit which I should really look into 16:20:44 <mobrien> otherwise a few long running issues that we should clean up sometime but not urgent 16:21:31 <mobrien> That's about it so we can go to the learning topic for abompard now I would say 16:21:51 <lenkaseg> #topic Learning topic 16:22:01 <lenkaseg> #info 2021-11-25 - Migration of databases in Fedora Infra [abompard] 16:22:19 <lenkaseg> The floor is yours abompard! 16:22:24 <abompard> Thanks! 16:22:30 <abompard> How long do I have? 16:22:58 <abompard> OK if anybody has questions, feel free to ask them when they pop in your head, I'll try to answer right away 16:23:24 <abompard> First, let's talk about libraries, because it's the simplest 16:23:31 <lenkaseg> 15-20 mins, but really up to you 16:23:55 <abompard> There are currently two database libraries that we use in infra: sqlalchemy (most of the time) and Django's ORM (when using Django obviously) 16:24:13 <abompard> sqlalchemy has the "alembic" library dedicated to handling DB migrations 16:24:19 <abompard> so that's what we use most of the time 16:24:51 <abompard> Things to look out for: 16:25:36 <abompard> Make sure you stop all connections to the DB 16:25:48 <abompard> Set a downtime in Nagios before you migrate 16:26:01 <abompard> Only then can you upgrade your DB 16:26:13 <abompard> and then, obviously, restart the services and reactivate nagios 16:26:35 <abompard> Note: for some reason I cannot cut-n-paste into this window, which is a bit annoying 16:27:03 <abompard> There are mainly two situations one can be in in our infra : 16:27:13 <abompard> - the app is running in a VM 16:27:20 <abompard> - or the app is running in Openshift 16:27:41 <nils> How about dry-running migrations on a copy of the DB? 16:28:03 <nils> Is this anything we routinely do? 16:28:09 <abompard> Ah, good question nils, I don't think we do that currently 16:28:36 <abompard> but we could standardize on creating two databases for each app, one main and one for the migration tests 16:28:51 <abompard> I think it would be interesting 16:29:06 <abompard> although of course migrations should have been tested in staging 16:29:08 <copperi[m]> sure 16:29:16 <abompard> but staging is not always a perfect image of prod 16:29:16 <nils> It could quickly become infeasible, thinking of e.g. the datanommer DB 16:29:35 <abompard> yeah datanommer is a special case, I don't think any DB comes close 16:29:44 <abompard> maybe Koji's DB is big too 16:29:47 <nils> Yeah, if size allows it testing the migration on a recent copy would give me the most confidence 16:29:47 <abompard> but not as big 16:30:01 <mobrien> In the case of a live migration such as the datanommer one going on now, how does a migration guarantee the integrity of the database and ensure that what is copied over is up to date? 16:30:02 <abompard> That's a good idea, we should investigate that 16:30:17 <abompard> I'll go to the datanommer case next 16:30:49 <abompard> Let me get back, if your app is running in a VM, then you write a manual playbook 16:30:57 <abompard> it'll be in playbooks/manual/upgrade 16:31:02 <abompard> example : https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/manual/upgrade/pagure.yml 16:31:08 <abompard> (I can cut-n-paste, yeah!) 16:31:22 <abompard> If your app is running in Openshift, then you have two options 16:31:44 <abompard> You can run your upgrade during the deployment phase, using the "recreate" strategy 16:32:01 <abompard> openshift will take down a pod, run a script, and startup a pod with the new code 16:32:18 <abompard> The doc is here: https://docs.openshift.com/container-platform/4.9/applications/deployments/deployment-strategies.html#deployments-recreate-strategy_deployment-strategies 16:32:29 <abompard> it's good, but you have downtime while the DB is being migrated 16:33:06 <abompard> but at least you know that none of your code will be running while the DB is migrated, and that your older code will never connect to a DB that's been migrated 16:33:19 <abompard> That's the strategy that Bodhi uses : https://pagure.io/fedora-infra/ansible/blob/main/f/roles/openshift-apps/bodhi/templates/deploymentconfig.yml#_21 16:33:51 <abompard> There is another option: use a manual playbook there too 16:34:02 <abompard> That's the option that we settled for in Datanommer 16:34:11 <abompard> because multiple apps access the Datanommer DB directly 16:34:21 <abompard> so we need to shut them all down before upgrading 16:34:40 <abompard> For that we use a manual playbook : https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/manual/upgrade/datanommer.yml 16:34:56 <abompard> which calls an OpenShift "Job" : https://pagure.io/fedora-infra/ansible/blob/main/f/roles/openshift-apps/datanommer/templates/job.yml 16:35:10 <abompard> a Job is a pod that openshift starts with a custom command 16:35:23 <abompard> here it's the DB upgrade command obviously 16:35:50 <abompard> The trick with OS jobs is that you have to monitor them for completion and delete the job manually 16:36:02 <abompard> OS does not do that for you (or we didn't find the option) 16:36:18 <abompard> OK, that's what we currently do in infra 16:36:39 <abompard> If there's no questions on it, let's talk about long-running migrations 16:37:01 <abompard> there's one at the moment, the Datanommer migration. It's a huge DB 16:37:18 <abompard> the migration is expected to last about 60 days, and there's about 40 left 16:37:40 <abompard> when you have to undertake such a migration, I strongly recommend designing a process/script that you can restart 16:37:55 <abompard> because servers need to reboot sometimes, connections get lost, etc 16:38:18 <abompard> If you use SQLAlchemy, also make sure you use windowed queries: https://github.com/sqlalchemy/sqlalchemy/wiki/RangeQuery-and-WindowedRangeQuery 16:39:20 <abompard> The advantage of being able to recover on failure is also that you can handle the data that has arrived in between 16:39:28 <abompard> that should anser mobrien's question 16:39:33 <mobrien> So are windowed queries taking smaller chunks of a table and migrating them 1 at a time? 16:39:42 <abompard> yes 16:40:04 <abompard> and when the migration is done, we'll restart it, and it'll migrate the 60 days worth of data that has accumulated 16:40:22 <abompard> it'll take maybe an hour, then we restart it again, etc 16:40:33 <mobrien> May be an edge case but what happens if a part of that is deleted after its window has been migrated? 16:40:36 <abompard> when there's almost no data left, then we have the downtime and we migrate over 16:40:48 <abompard> ah, with datanommer we don't delete data 16:40:55 <abompard> so that's not a problem we have 16:41:07 <abompard> but if you have deletion then you need a smarter sync 16:41:17 <mobrien> ah ok, that's the part I was unsure of 16:41:28 <abompard> Something that would compare the unique ids on each side and decide what to remove 16:41:38 <abompard> (not the primary keys, those would change) 16:41:52 <abompard> (but for example the msg-ids in DN's case) 16:42:17 <mobrien> makes sense 16:42:26 <abompard> Alright, that's all I had, any other questions? 16:43:06 * lenkaseg still reading 16:43:38 <abompard> nils could you open a ticket in our tracker to investigate how we could have testing databases for apps? I think it's a very interesting idea. 16:44:28 <nils> abompard, do you mean pagure.io/fedora-infrastructure? 16:44:45 <abompard> yeah, I think it could be akin to a mini-initiative 16:45:27 <mobrien> postgres natively supports hot standbys which would be great but resource consuming 16:45:56 <abompard> Huh that's cool, never tried it before 16:46:27 <abompard> Oh, also I don't think we have the habit of writing reverse migrations in Alembic 16:46:30 <mobrien> Azure and AWS leverage it for their HA db solutions 16:46:36 <abompard> but it would be cool to do it 16:46:48 <mobrien> https://wiki.postgresql.org/wiki/Hot_Standby 16:47:04 <lenkaseg> `when you have to undertake such a migration, I strongly recommend designing a process/script that you can restart` 16:47:04 <lenkaseg> Is the script somewhere available? 16:47:16 <mobrien> I think it takes incremental snapshots and has a wal with a short time to make up the difference 16:47:59 <abompard> lenkaseg: well that depends on your migration, what I mean is that you should design your script to withstand disconnections 16:48:15 <abompard> the Datanommer migration script is available of course, but it only works for datanommer 16:48:51 <abompard> if you don't need to handle deletions, I'd say look at your last inserted unique ID, find it in the source DB and start there 16:49:17 <lenkaseg> Aurélien B: of course, i meant the datanommer one 16:49:28 <nils> I'm not sure how windowd migration meshes with alembic, though. At least the default mode is to work, convert, etc. in one live database. 16:49:36 <nils> (Sorry, was out for a moment) 16:49:58 <abompard> lenkaseg: it's there: https://github.com/fedora-infra/datanommer/blob/develop/tools/timescaledb/migrate-to-timescaledb.py 16:50:21 <nils> So I assume this is not about using Alembic, but hand-sewn migration scripts? 16:50:36 <abompard> nils: if you need to do a long running operation, I don't think you should use alembic 16:50:41 <abompard> yeah 16:50:44 <lenkaseg> Aurélien B: Thanks! I could not find it :) 16:50:47 <nils> ah gotcha 16:50:53 <abompard> it should be very rare 16:51:02 <abompard> Datanommer is one case, the mailman migration was another 16:51:42 <nils> one recipe to allow delete-with-sync is to have an additional "uuid" in the tables which are assigned randomly on creation but would stay stable between different "incarnations" of a database 16:52:15 <nils> which is probably not very efficient, but what is with huge DBs? 16:52:24 <abompard> Yes, in general I would avise against using the primary key only for unicity 16:53:25 <nils> you probably won't need it day to day though 16:53:45 <abompard> then it depends on your data 16:53:54 <nils> I guess you could tack it onto a database schema using an alembic migration before you use the real DB-to-another-DB migration script 16:54:22 <abompard> for anything that needs to leave the DB, such as messages or async tasks, a unique id is needed 16:54:37 <abompard> nils: yes indeed 16:54:40 <nils> yeah 16:55:02 <mobrien> some data may have it anyway right? like unique message ids? 16:55:11 <nils> I was thinking single system there, i.e. a user talks to one service and the ids only make sense in the context of short-term communication between both 16:55:21 <nils> mobrien, yeah 16:55:23 <abompard> mobrien: yes, most messaging systems have those anyway 16:56:38 <abompard> Any other question? 16:57:32 <mobrien> I have no other questions but I enjoyed the talk. Thanks Aurélien B 16:57:32 <abompard> Well if you do feel free to ping me in #apps:fedoraproject.org 16:57:43 <abompard> Sure, I hope it was useful 16:57:43 <mobrien> abompard++ 16:57:43 <zodbot> mobrien: Karma for abompard changed to 4 (for the current release cycle): https://badges.fedoraproject.org/tags/cookie/any 16:57:51 <pmoura> abompard++ 16:57:52 <lenkaseg> abompard++ 16:57:52 <zodbot> lenkaseg: Karma for abompard changed to 5 (for the current release cycle): https://badges.fedoraproject.org/tags/cookie/any 16:58:08 <abompard> yum ! 16:58:14 <nils> abompard++ 16:58:36 <lenkaseg> Aurélien B: Thanks a lot, I'm saving this discussion. 16:58:36 <lenkaseg> Where the logs are available by the way? 16:58:41 * nils eyes zodbot 16:58:52 <lenkaseg> 2 mins left, so left move to open floor 16:58:55 <lenkaseg> #topic Open Floor 16:59:01 <abompard> A link to the logs should be displayed at the end of the meeting 16:59:20 <abompard> I should add that it was lenkase 16:59:41 <abompard> I should add that it was lenkaseg who figured out the openshift job playbook thingie 16:59:47 <abompard> for datanommer 16:59:49 <abompard> so, well done 17:00:13 <mobrien> Good work lenkaseg 17:00:22 <mobrien> lenkaseg++ 17:00:22 <zodbot> mobrien: Karma for lenkaseg changed to 5 (for the current release cycle): https://badges.fedoraproject.org/tags/cookie/any 17:00:24 <nils> 👍 17:00:24 <lenkaseg> no no. the job was done by Aurélien B , I only added that thingy for checking if it's done I think 17:00:46 <pmoura> lenkaseg++ 17:00:53 <mobrien> lenkaseg: the logs are here if you ever want to look back on historical topics https://meetbot.fedoraproject.org/sresults/?group_id=infrastructure&type=team 17:00:55 <lenkaseg> Thanks all for attending the meeting! 17:00:57 <nils> lenkaseg++ 17:01:00 <lenkaseg> #endmeeting