how i learned to change production without breaking it
the simple ticket
"remove the aggregators filter and add those values to the methods filter instead."
that was the ticket. seemed straightforward. the ui had two separate filters for integration methods - one for "Method" and one for "Aggregators." we needed to merge them into one.
i opened the codebase. found the column: legacy_categories. found the enum: CategoryType.
my first instinct was to write a migration script that would:
- move the data from the old column to the new one
- drop the old column
- update the frontend
one pull request. one deployment. done by friday.
then i showed the plan to the senior engineer on my team.
the question that changed everything
"what happens if we need to rollback?"
i had not thought about that. rollback meant going back to the old code. the old code that expected the legacy column to exist. the column i was about to delete.
if we deployed and something broke, rolling back would just break it differently.
he showed me how they do database changes at scale. not in one step. in four separate, independent deployments.
i thought he was overengineering. he was teaching me how to deploy safely.
phase 1: add without removing
the first pr was boring. almost embarrassingly simple.
public enum CategoryType {
    STANDARD,
    PREMIUM,
    // new values - but not used anywhere yet
    TYPE_A,
    TYPE_B,
    TYPE_C,
    TYPE_D,
    TYPE_E,
    TYPE_F,
    TYPE_G,
    TYPE_H
}
that was it. just adding enum values. no data migration. no code changes. no frontend updates.
i asked "why deploy this separately? nothing uses these values yet."
he said "exactly. which means nothing can break."
this deployment was purely additive. the old code kept working. the new enum values existed but were not referenced anywhere. if we needed to roll back for any reason, we would just roll back to code that ignored these enum values.
zero risk.
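the "nothing can break" claim is easy to see in code. here is a minimal sketch (the class and method names are made up for illustration): old code written before the new values existed keeps working, because a handler with a default branch treats unknown enum values safely.

```java
public class Phase1Demo {
    enum CategoryType {
        STANDARD,
        PREMIUM,
        // new values - not used anywhere yet
        TYPE_A,
        TYPE_B
    }

    // hypothetical handler from the "old" code path,
    // written before TYPE_A and TYPE_B existed
    static String describe(CategoryType c) {
        switch (c) {
            case STANDARD: return "standard";
            case PREMIUM:  return "premium";
            default:       return "unknown"; // new values land here safely
        }
    }

    public static void main(String[] args) {
        // the old handler does not crash on a value it has never seen
        System.out.println(describe(CategoryType.TYPE_A));
    }
}
```

the inverse is why removals are dangerous: old code can ignore a value it has never seen, but it cannot survive a value or column it depends on disappearing.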
phase 2: migrate the data
the second pr was the actual data migration. move values from legacy_categories to primary_categories.
UPDATE user_profile
SET primary_categories = array_cat(
        primary_categories,
        legacy_categories
    )
WHERE legacy_categories && ARRAY[
    'TypeA', 'TypeB', 'TypeC',
    'TypeD', 'TypeE', 'TypeF',
    'TypeG', 'TypeH'
];
but here is the key: we did not delete anything. both columns still existed. the data was now in both places.
if something went wrong with the migration, we could roll back to phase 1. the old column was still there. the old code would still work.
the migration was tested in staging. then in production. no code changes, just data movement.
we waited 48 hours before moving to phase 3.
phase 3: stop writing to the old column
the third pr changed the application code. stop using legacy_categories. start using primary_categories.
// before
List<String> categories = userProfile.getLegacyCategories();
// after
List<CategoryType> categories = userProfile.getPrimaryCategories();
but we did not drop the database column. it was still there. unused, but there.
why? because if we needed to roll back this deployment, we would go back to code that reads from legacy_categories. and that column still had all the data.
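the before/after snippet glosses over one detail: the legacy column held strings while the new getter returns enum values, so phase 3 code somewhere has to convert. a minimal sketch of what that conversion might look like - the mapping rule and the helper name are assumptions, not the team's actual code:

```java
import java.util.Optional;

public class Phase3Mapping {
    enum CategoryType { STANDARD, PREMIUM, TYPE_A, TYPE_B }

    // hypothetical converter: "TypeA" -> TYPE_A; values the enum does not
    // recognize are skipped instead of crashing the request.
    static Optional<CategoryType> fromLegacy(String legacy) {
        // insert an underscore before each inner capital, then uppercase
        String name = legacy.replaceAll("([a-z])([A-Z])", "$1_$2").toUpperCase();
        try {
            return Optional.of(CategoryType.valueOf(name));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(fromLegacy("TypeA"));
        System.out.println(fromLegacy("SomethingElse"));
    }
}
```

returning an empty optional instead of throwing matters during a transition: stray legacy values in old rows should degrade gracefully, not take down the page.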
this was the riskiest deployment. new code, new logic, new filters on the frontend. we deployed it on tuesday morning. not friday afternoon. we monitored for three days. checked error rates. watched for support tickets. verified the filters worked correctly.
everything was stable.
phase 4: delete the old column
the fourth pr was simple again.
ALTER TABLE user_profile
DROP COLUMN legacy_categories;
just drop the column. that is it.
by this point, we had been running on the new code for three days. we knew it worked. we knew the migration was complete. we knew rollback was no longer necessary.
this was the only truly irreversible step. but it was safe because we had already validated everything else.
what i got wrong about risk
when i first saw this approach, i thought it was too cautious. four prs for one feature? that is not agile. that is not moving fast.
but then i added up the actual timeline:
- phase 1: deployed monday, took 1 hour to prepare
- phase 2: deployed wednesday, took 2 hours to write and test
- phase 3: deployed the following tuesday, took 3 hours to update all the code
- phase 4: deployed friday, took 30 minutes
total time: just under two weeks of calendar time, about 7 hours of actual work.
my original plan would have taken maybe 4 hours to write. but the deployment would have been high-risk. we would have had to:
- deploy during low-traffic hours
- have multiple engineers on standby
- probably do it on friday and watch it all weekend
- cross our fingers that nothing broke
and if something had broken? we would have had no safe way to roll back. we would have been debugging a production incident instead of sleeping.
the four-phase approach was not slower. it was actually faster because we never had to stop and fix an incident.
what this taught me about production
production is not like your laptop. you cannot just restart the server if something breaks. there are real users, real data, real consequences.
when you change production, multiple things happen at once:
- some servers restart before others
- some database queries are already in flight
- some cached data still references the old schema
- some background jobs are still running old code
you cannot think about the code as "before" and "after." you have to think about the transition period. the messy in-between state where both versions exist simultaneously.
that is what the four phases give you: a controlled transition where each step is independently safe.
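one concrete way to survive that in-between state is to make the read path tolerant of both worlds: prefer the new column, fall back to the old one. this is a sketch under assumed names, not the team's actual code:

```java
import java.util.List;

public class TransitionRead {
    // during the transition both columns exist. servers running either
    // code version get a sensible answer if reads prefer the new column
    // and fall back to the old one. (column/parameter names are hypothetical.)
    static List<String> effectiveCategories(List<String> primary, List<String> legacy) {
        if (primary != null && !primary.isEmpty()) {
            return primary; // row already migrated: new column wins
        }
        return legacy; // row not yet migrated: legacy column still authoritative
    }

    public static void main(String[] args) {
        System.out.println(effectiveCategories(List.of("TypeA"), List.of("TypeA")));
        System.out.println(effectiveCategories(List.of(), List.of("TypeB")));
    }
}
```

the fallback is temporary scaffolding. it exists only while both columns do, and it gets deleted in the cleanup phase along with the old column.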
the pattern i use now
every time i have to change a database schema or an api contract, i ask myself:
"can i split this into additive changes first, then removal changes later?"
almost always, the answer is yes.
the pattern looks like:
- expand: add the new thing (column, enum, field) but do not use it yet
- migrate: move data, dual-write if needed, backfill
- contract: switch code to use the new thing, stop using the old thing
- cleanup: remove the old thing from the database
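the "dual-write if needed" part of the migrate step deserves a sketch. while both columns exist, every write goes to both, so code reading either one - including old code after a rollback - sees complete data. names here are made up; the point is the shape:

```java
import java.util.ArrayList;
import java.util.List;

public class DualWrite {
    // hypothetical in-memory stand-in for the database row
    static class UserProfile {
        final List<String> legacyCategories = new ArrayList<>();
        final List<String> primaryCategories = new ArrayList<>();
    }

    // every write hits both columns during the migrate step, so neither
    // code version ever reads stale data
    static void addCategory(UserProfile p, String category) {
        p.legacyCategories.add(category);  // old code still reads this
        p.primaryCategories.add(category); // new code will read this
    }

    public static void main(String[] args) {
        UserProfile p = new UserProfile();
        addCategory(p, "TypeA");
        System.out.println(p.legacyCategories.equals(p.primaryCategories));
    }
}
```

dual-writing looks redundant, and it is - deliberately. the redundancy is what makes the contract step a pure code change instead of a risky code-plus-data change.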
each step is a separate pr. each step can be rolled back safely.
it feels slow when you are writing the prs. it feels fast when you are deploying them without stress.
the real lesson
the first version of code i write is not the version that should run in production. the first version optimizes for "does it work?"
production code needs to optimize for "can i deploy this safely?"
sometimes that means splitting one feature into four deployments. sometimes that means keeping old code around longer than feels necessary. sometimes that means making changes that feel redundant.
but the goal is not elegant code. the goal is zero-downtime deployments that can be rolled back at any step.
at my previous company, we deployed fast and broke things. at my current company, we deploy even faster but we never break anything.
the difference is not being more careful. the difference is designing deployments that cannot fail.
maybe that is what senior engineers actually do. they do not write better code. they write code that deploys better.