WEBVTT

00:00.000 --> 00:18.880
And so now for the first real talk of the day, which is Victor,

00:18.880 --> 00:26.120
who will be talking about monoliths, very old monoliths, almost as

00:26.120 --> 00:31.400
old as this dev room, which I'm very curious to hear about, because I'm also dealing

00:31.400 --> 00:35.600
with very old code. A round of applause.

00:35.600 --> 00:43.160
Welcome to my talk on modularizing a 10-year monolith.

00:43.160 --> 00:44.480
My name is Victor Lyuboslavsky.

00:44.480 --> 00:48.240
I've been in tech for over 25 years and I'm currently a principal software engineer at

00:48.240 --> 00:52.040
Fleet Device Management. At Fleet, we make endpoint telemetry for corporate security

00:52.040 --> 00:55.880
teams and device management for corporate IT teams.

00:55.880 --> 00:59.440
With my current job, it took me roughly a year to figure out the code base, get familiar

00:59.440 --> 01:02.760
with the programming language, and understand our business domain.

01:02.760 --> 01:06.760
So around that time, I started to notice a consistent issue whenever working on our service

01:06.760 --> 01:07.760
package.

01:07.760 --> 01:13.200
The service package was the biggest one in our code base, and whenever I was recompiling a test

01:13.200 --> 01:15.880
in this package, it took roughly 30 seconds.

01:15.880 --> 01:20.320
So if I modified one character in a string, it took roughly 30 seconds to recompile the

01:20.320 --> 01:23.080
package and its unit tests.

01:23.080 --> 01:27.320
So I did a little digging. The Go programming language is known for fast compile times.

01:27.320 --> 01:31.840
However, the linking of all the packages was dominating the compile.

01:31.840 --> 01:33.640
From this analysis, the issue was obvious.

01:33.640 --> 01:38.200
The service package was just too big and had too many dependencies.

01:38.200 --> 01:41.880
I asked myself a simple question, what am I actually trying to do here?

01:41.880 --> 01:47.320
I'm not trying to fight the build system or wrestle with a giant package.

01:47.320 --> 01:50.160
I'm just trying to write good software.

01:50.160 --> 01:55.440
And it shouldn't take 30 seconds of self therapy every time I change a string literal.

01:55.440 --> 01:58.680
So if I just want to write good software, what needs to change?

01:58.680 --> 02:02.200
I want a code base that feels light, understandable, and fast.

02:02.200 --> 02:05.080
One that helps me think instead of slowing me down.

02:05.080 --> 02:09.680
And I realized the only way to get there was to make the code base itself smaller.

02:09.680 --> 02:12.560
So we tried modularizing.

02:12.560 --> 02:17.000
Next, let's discuss our first attempt at modularization, where, as you might guess, things

02:17.040 --> 02:20.480
didn't go quite as smoothly as we hoped.

02:20.480 --> 02:22.280
Let's talk about our code base.

02:22.280 --> 02:26.360
Fleet Device Management is primarily MIT-licensed open source.

02:26.360 --> 02:30.080
We also maintain a small set of advanced features in a clearly separated directory

02:30.080 --> 02:31.720
under a different license.

02:31.720 --> 02:38.600
Our code base is roughly 10 years old with over 500,000 lines of Go backend code.

02:38.600 --> 02:42.480
This is a much simplified picture of the core of our server code.

02:42.480 --> 02:45.920
We have a single huge Go package for the API layer.

02:45.920 --> 02:47.120
It's our service package.

02:47.120 --> 02:50.560
It includes controllers and the service containing our business logic.

02:50.560 --> 02:52.800
The API layer calls the persistence layer.

02:52.800 --> 02:55.200
The persistence layer is another huge Go package.

02:55.200 --> 02:59.800
It constructs the SQL queries and executes them against our MySQL database.

02:59.800 --> 03:04.480
We also have an in-memory cache, a React front end, a CLI, and a bunch of other components

03:04.480 --> 03:06.520
which are not pictured here.

03:06.520 --> 03:09.880
So at one of our engineering all-hands meetings, I gave a presentation.

03:09.880 --> 03:15.720
I described how we could scale our code base with a more modular architecture.

03:15.720 --> 03:19.720
The core idea of my presentation was that when working on a new major feature, we could

03:19.720 --> 03:22.720
create a new module, a vertical slice.

03:22.720 --> 03:26.600
This new vertical slice would mirror the structure of the legacy code.

03:26.600 --> 03:30.480
We would actually organize it exactly like the legacy code. That way, engineers would still

03:30.480 --> 03:33.640
roughly know where things are.

03:33.640 --> 03:40.200
This actually looks exactly like a microservice but still within a single compiled binary.

03:40.200 --> 03:45.680
My presentation seemed to land and I felt confident enough to try it on a real feature.

03:45.680 --> 03:49.120
So shortly afterwards, we started working on a new feature and this new feature seemed

03:49.120 --> 03:52.640
like an ideal candidate to try the new modularization approach.

03:52.640 --> 03:56.200
The feature was Android MDM, mobile device management.

03:56.200 --> 03:58.240
I created the new structure for the Android support.

03:58.240 --> 04:01.120
I created new packages and a dedicated directory.

04:01.120 --> 04:04.920
Most of the work was actually untangling several common parts so they could be reused

04:04.920 --> 04:08.680
between the legacy code and the new Android feature.

04:08.680 --> 04:13.200
I put up the PR, the pull request, and it just sat there.

04:13.200 --> 04:17.720
Our guidance is to review and merge PRs within 24 hours but no one wanted to touch this

04:17.720 --> 04:18.720
one.

04:18.720 --> 04:22.600
In retrospect, I should have done a better job communicating to the team that I was actually

04:22.600 --> 04:26.160
going to do what I proposed in my earlier presentation.

04:26.160 --> 04:28.600
Some engineers felt surprised by the changes.

04:28.600 --> 04:32.480
They didn't feel like they had a voice in the approach. They didn't have an opportunity

04:32.480 --> 04:33.800
to speak up.

04:33.800 --> 04:36.800
So facing this resistance, I backtracked.

04:36.800 --> 04:41.960
I decided to take a step back and unsplit the MySQL layer so all requests would still

04:41.960 --> 04:44.920
go to the same MySQL schema.

04:44.920 --> 04:51.920
I also wrote two ADRs, architectural decision records, one to split the API layer and one

04:51.920 --> 04:56.680
to split the persistence layer for this Android feature.

04:56.680 --> 05:00.760
The ADR to split the API layer was approved but the ADR to split the persistence layer

05:00.760 --> 05:03.520
was rejected.

05:03.520 --> 05:06.600
Engineers voiced their concerns, their feedback was consistent.

05:06.600 --> 05:11.760
This felt risky, confusing, and too big to do all at once.

05:11.760 --> 05:16.920
So I reverted the persistence layer changes. The new API layer now talked to the same

05:16.920 --> 05:21.160
large shared persistence package as the rest of the code base.

05:21.160 --> 05:25.760
And at that point, I moved off the feature and delivery pressure took over. Without sustained

05:25.760 --> 05:31.480
architectural ownership, the Android code slowly seeped back into the legacy system.

05:31.480 --> 05:33.040
So what do we have now?

05:33.040 --> 05:38.880
The picture roughly shows what we have. Certainly, at a high level, the code base doesn't

05:38.880 --> 05:45.880
appear very modular.

05:45.880 --> 05:50.760
Arguably, it's even more complex and harder to maintain than it was before.

05:50.760 --> 05:56.560
In practice, there's no clear separation between the legacy code and the Android feature.

05:56.560 --> 06:00.560
So next, let's discuss what we learned from our first attempt.

06:00.560 --> 06:05.760
I keep rediscovering that talking to people takes time and there's no shortcut.

06:05.760 --> 06:10.760
Real time: meetings, follow-ups, side conversations. And the bigger the organization,

06:10.760 --> 06:12.960
the slower that loop gets.

06:12.960 --> 06:18.720
We also can't rely only on who speaks up first. Early voices are often smart and well-intentioned,

06:18.720 --> 06:20.880
but they don't represent everyone.

06:20.880 --> 06:25.480
To build real consensus, we have to actively solicit feedback, especially in

06:25.480 --> 06:28.000
one-on-ones and smaller forums.

06:28.000 --> 06:33.160
To succeed, we need strong commitment, both from engineers and managers, to the architecture

06:33.160 --> 06:36.240
and to the specific boundaries we're proposing.

06:36.240 --> 06:42.120
Without that commitment, engineers naturally revert to old patterns and architectural compromises

06:42.120 --> 06:47.480
creep back in, and without management support, short-term delivery pressure wins over a long-term

06:47.480 --> 06:48.480
structure.

06:48.480 --> 06:50.560
Here's a small example.

06:50.560 --> 06:56.480
When I said modules, I meant architectural modules, cohesive code with clear boundaries,

06:56.480 --> 07:00.160
but some people thought I meant Go modules.

07:00.160 --> 07:05.160
Nobody was wrong, we just weren't using the same language and it took a face-to-face conversation

07:05.160 --> 07:07.280
to realize that.

07:07.280 --> 07:12.320
We thought the hard part was the architecture. It turned out the hard part was the conversations.

07:12.320 --> 07:14.600
Let's move on to the second lesson.

07:14.600 --> 07:16.800
What is an architectural change anyway?

07:16.800 --> 07:20.400
That sounds like a straightforward question, but in practice, it turns out to be one of

07:20.400 --> 07:23.800
the most political questions in engineering.

07:23.800 --> 07:26.000
Here's what we saw in reality.

07:26.000 --> 07:30.080
Structural changes in our code base rarely showed up as architecture work.

07:30.080 --> 07:32.680
Instead, they went in as part of feature work.

07:32.680 --> 07:36.960
If you framed your change as part of feature work and included it in a PR with other feature

07:36.960 --> 07:42.000
changes, it moved, even if it broke existing architectural patterns.

07:42.000 --> 07:45.880
After all, we needed to get all the feature changes in quickly. Features are what make money

07:45.880 --> 07:47.320
for the company.

07:47.320 --> 07:52.080
But if you framed your change as a standalone architectural improvement, it stalled.

07:52.080 --> 07:55.080
It stalled because it was prioritized behind feature work.

07:55.080 --> 07:58.640
It stalled because engineers wanted to have bigger discussions around it, and it

07:58.640 --> 08:03.320
stalled because it wasn't providing immediate value to the company.

08:03.320 --> 08:06.040
So what's the main lesson here about architectural changes?

08:06.040 --> 08:11.040
If you don't define how architecture is allowed to change, it changes anyway and often it

08:11.040 --> 08:12.800
changes behind your back.

08:12.800 --> 08:15.520
Let's move on to the third lesson.

08:15.520 --> 08:18.280
One thing we were missing was clear ownership.

08:18.280 --> 08:23.880
In a modular system, not every API endpoint belongs cleanly to a single service.

08:23.880 --> 08:28.240
Some endpoints mostly orchestrate work across multiple services, and this was true for

08:28.240 --> 08:32.640
us because the new Android feature still needed to talk to legacy code.

08:32.640 --> 08:36.680
The mistake was leaving that orchestration unowned.

08:36.680 --> 08:41.320
Every API endpoint still needs a clear home, a specific module and a specific team that's

08:41.320 --> 08:43.080
responsible for it.

08:43.080 --> 08:48.720
Even when an endpoint is just glue, someone has to own that glue.

08:48.720 --> 08:50.200
So where are we at this point?

08:50.200 --> 08:52.240
The first attempt didn't quite succeed.

08:52.240 --> 08:56.760
We still have the pain points of a large, tightly-coupled code base, but hopefully we learned

08:56.760 --> 08:57.760
a few things.

08:57.760 --> 08:59.400
Now we have experience.

08:59.400 --> 09:01.680
So where to go from here?

09:01.680 --> 09:05.520
At this point, I realized the importance of the people part in architecture.

09:05.520 --> 09:09.200
And I felt like I couldn't convince the company on my own.

09:09.200 --> 09:11.960
I still didn't feel like an architecture expert.

09:11.960 --> 09:13.920
The first attempt was kind of a mess.

09:13.920 --> 09:18.960
So why would other engineers take my next proposal seriously and not see it as just another crazy

09:18.960 --> 09:19.960
idea?

09:19.960 --> 09:23.960
I needed someone else's help, I needed something like an appeal to authority, I needed

09:23.960 --> 09:27.680
to find someone else that had already done this transition.

09:27.680 --> 09:30.520
And I needed to analyze the way they approached it.

09:30.520 --> 09:35.760
Unfortunately, I did not find any large open-source Go projects making a transition to a

09:35.760 --> 09:41.400
modular monolith. So as far as I know, we're the first large open-source Go project

09:41.400 --> 09:42.400
doing this.

09:42.400 --> 09:44.160
Yay, us.

09:44.160 --> 09:48.080
The next best reference was GitLab.

09:48.080 --> 09:53.120
GitLab runs a Ruby on Rails monolith, which they decided to break up into a modular

09:53.120 --> 09:54.120
monolith.

09:54.120 --> 09:59.320
They documented the decision publicly, and it turned my proposal from a crazy idea into

09:59.320 --> 10:00.880
a proven pattern.

10:00.880 --> 10:02.720
We weren't trying to invent something new.

10:02.720 --> 10:06.880
We were trying to follow a trail that was already there.

10:06.880 --> 10:08.600
Next, how to start?

10:08.600 --> 10:13.680
This time there was no obvious new feature to anchor the modularization work.

10:13.680 --> 10:17.000
So we had several internal discussions with the engineering team.

10:17.000 --> 10:20.720
We aligned on domain-driven design and bounded contexts.

10:20.720 --> 10:25.160
But the hard question was where to draw that first boundary.

10:25.160 --> 10:31.240
I looked at our code organization, had a few conversations with my AI friend, and proposed

10:31.240 --> 10:33.960
several ways to slice the monolith.

10:33.960 --> 10:37.280
As you can imagine, this process is a lot to get your head around, trying to tease

10:37.280 --> 10:40.800
apart a 10-year-old system into self-contained modules.

10:40.800 --> 10:43.160
I felt like I was barely scratching the surface.

10:43.160 --> 10:47.360
I felt like I was truly missing a lot of the details and edge cases.

10:47.360 --> 10:51.560
There were a lot of unknowns. Starting a modularization effort with anything beyond a bare

10:51.560 --> 10:53.360
minimum seemed risky.

10:53.360 --> 10:58.480
Although all the engineers generally agreed that we wanted to do an incremental approach,

10:58.480 --> 11:01.960
we ultimately agreed on activity audit as our first bounded context.

11:01.960 --> 11:05.760
This context is about recording things that had already happened in our system.

11:05.760 --> 11:09.480
For example, recording when a new user was created, when a new host enrolled, when

11:09.480 --> 11:13.440
a key configuration was changed, and a ton of other activities.

11:13.440 --> 11:18.680
Next, I wrote up a detailed architectural decision record and put it up for review by the

11:18.680 --> 11:19.680
team.

11:19.680 --> 11:23.160
Here's the link in case someone can't wait to dive into the details.

11:23.160 --> 11:25.880
Again, this ADR was framed as a pilot.

11:25.880 --> 11:31.160
This is the first bounded context we were creating before a bigger rollout.

11:31.160 --> 11:33.800
Several times, I wondered whether this whole thing was going to go anywhere.

11:33.800 --> 11:38.200
Sometimes I felt like everyone was on board. Other times I felt like I kept running

11:38.200 --> 11:40.040
into a brick wall.

11:40.040 --> 11:44.520
I also felt I was deep in the weeds and details of this re-architecture.

11:44.520 --> 11:48.680
I didn't quite see the full picture myself, and I had a suspicion that other engineers

11:48.680 --> 11:49.680
didn't either.

11:49.680 --> 11:54.320
I needed to convince myself and others why we were doing this.

11:54.320 --> 12:00.280
I needed an argument, and I needed to use it not just once, but every chance I got.

12:00.280 --> 12:04.080
So when I talked to the team, I roughly framed it like this.

12:04.080 --> 12:07.160
Every successful engineering org eventually hits the same wall.

12:07.160 --> 12:11.720
The monolith grows beyond what humans can safely understand.

12:11.720 --> 12:16.280
At that point, teams start colliding with each other's changes, ownership gets fuzzy, and

12:16.280 --> 12:18.680
small changes become risky.

12:18.680 --> 12:22.280
We're already seeing that, and that's not a failure of the people, it's a failure of

12:22.280 --> 12:24.200
the structure.

12:24.200 --> 12:28.520
Modularization isn't about process or slowing anyone down, it's about creating clear

12:28.520 --> 12:33.600
boundaries and ownership, so teams can move fast with confidence as the system continues

12:33.600 --> 12:34.600
to grow.

12:34.600 --> 12:38.600
All right, now let's talk about the actual architecture.

12:38.600 --> 12:42.600
The high-level idea is almost the same as where we started with the first attempt: the

12:42.600 --> 12:48.080
new feature or new bounded context will be in its own module with its own controller, service,

12:48.080 --> 12:49.600
and persistence layers.

12:49.600 --> 12:54.280
The controllers and service will be able to call methods on other services in other bounded

12:54.280 --> 12:55.280
contexts.

12:55.280 --> 13:01.040
However, they will not be able to access the persistence layer of another bounded context.

13:01.040 --> 13:05.720
Here's the directory structure. Directly under the bounded context, the service and

13:05.720 --> 13:12.000
MySQL directories match the API layer and the persistence layer from the previous diagram.

13:12.000 --> 13:17.800
The integration test down here should be able to fully test this bounded context without

13:17.800 --> 13:21.200
requiring parts from the rest of the application.

13:21.200 --> 13:25.760
Testing this bounded context with other bounded contexts will be done in a higher-level integration

13:25.760 --> 13:26.760
test.

13:26.760 --> 13:31.040
I can talk about the decisions that went into this for a while, so catch me after the session

13:31.040 --> 13:33.160
if you'd like to dive deeper.

13:33.160 --> 13:37.640
Now let's talk about the decision not to split the database schema.

13:37.640 --> 13:42.720
When people hear modular monolith, they often assume that means multiple databases.

13:42.720 --> 13:46.400
We deliberately did not do that, we kept the single schema.

13:46.400 --> 13:50.520
Our problems weren't database scale or load. They were unclear ownership, broad service

13:50.520 --> 13:53.800
layers, and code that was hard to reason about.

13:53.800 --> 13:57.680
Splitting the database wouldn't have fixed that. It just would have added operational

13:57.680 --> 14:01.080
complexity and new failure modes.

14:01.080 --> 14:05.320
Splitting a database is one of the most expensive architectural decisions you can make, and

14:05.320 --> 14:06.920
it freezes boundaries.

14:06.920 --> 14:10.080
We were still learning where those boundaries should be.

14:10.080 --> 14:14.840
So we modularized the code first, preserved optionality, and decided we'd

14:14.840 --> 14:20.680
only split the database later if scaling actually required it.

14:20.680 --> 14:24.640
Now everywhere we turned, we heard recommendations to use ports and adapters.

14:24.640 --> 14:30.080
However, a lot of our business logic lives in SQL, so pretending the database is just

14:30.080 --> 14:32.880
an adapter doesn't make sense for us.

14:32.880 --> 14:37.120
Ports and adapters doesn't work for our system as a whole, but one idea from it fits

14:37.120 --> 14:39.080
extremely well.

14:39.080 --> 14:43.640
That idea: bounded contexts do not share domain models directly.

14:43.640 --> 14:49.320
Any cross-context communication must go through an explicit, owned contract. In practice,

14:49.320 --> 14:51.720
that means explicit module boundaries.

14:51.720 --> 14:55.960
Now, earlier I talked about how we ended up with large packages.

14:55.960 --> 15:00.840
When a shared common package is used for cross-context communication, it creates a specific

15:00.840 --> 15:01.840
problem.

15:01.840 --> 15:06.640
When two bounded contexts share the same common types, they are no longer independent,

15:06.640 --> 15:11.240
a change in one context silently changes the behavior of the other, at that point the

15:11.240 --> 15:14.880
boundary only exists in our heads, not in the code.

15:14.880 --> 15:19.600
So for cross-context communication, shared packages aren't reuse. Instead, they're hidden

15:19.600 --> 15:21.480
coupling.

15:21.480 --> 15:26.480
That's why we require explicit, owned contracts at context boundaries.

15:26.480 --> 15:31.200
When one bounded context needs something from another, it doesn't reach into its internals,

15:31.200 --> 15:36.680
instead the providing context exposes a small explicit interface along with the types

15:36.680 --> 15:39.360
that belong to that interface.

15:39.360 --> 15:45.040
Those interfaces and types are owned by the provider, they live with the bounded context

15:45.040 --> 15:47.520
and they evolve on its terms.

15:47.520 --> 15:51.440
Other contexts can depend on the contracts but not on the implementation.

15:51.440 --> 15:55.840
This is what makes the boundary real in code not just in our heads.
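
NOTE A minimal Go sketch of a provider-owned contract, assuming a hypothetical
activity context. The names Activity, ActivityRecorder, NewRecorder and
enrollHost are illustrative assumptions; the talk does not show this code.
```go
package main
import "fmt"
// Activity and ActivityRecorder form the contract owned by a hypothetical
// activity bounded context. Other contexts depend only on these names.
type Activity struct {
	Actor string
	Kind  string
}
type ActivityRecorder interface {
	Record(a Activity) error
}
// memoryRecorder is the provider's private implementation; consumers never name it.
type memoryRecorder struct {
	log []Activity
}
func (m *memoryRecorder) Record(a Activity) error {
	if a.Kind == "" {
		return fmt.Errorf("activity kind is required")
	}
	m.log = append(m.log, a)
	return nil
}
// NewRecorder is the only constructor the provider exposes.
func NewRecorder() ActivityRecorder { return &memoryRecorder{} }
// enrollHost stands in for code in another context; it sees only the interface.
func enrollHost(rec ActivityRecorder, actor string) error {
	return rec.Record(Activity{Actor: actor, Kind: "host_enrolled"})
}
func main() {
	rec := NewRecorder()
	fmt.Println(enrollHost(rec, "admin"))
}
```
Because the consumer only names the interface and its types, the provider
can change its storage without touching callers.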

15:55.840 --> 16:00.960
Now sometimes a new bounded context still needs data or behavior that lives in legacy code.

16:00.960 --> 16:05.680
In those cases, we don't let the new code talk to legacy directly, instead we introduce

16:05.680 --> 16:09.080
an anti-corruption layer or ACL.

16:09.080 --> 16:12.400
The ACL is the only place that understands both worlds.

16:12.400 --> 16:17.320
It translates legacy concepts, types and quirks into something that new contexts can work

16:17.320 --> 16:18.800
with safely.

16:18.800 --> 16:23.280
Notice that this ACL package is outside the new bounded context.

16:23.280 --> 16:27.240
It keeps legacy semantics from leaking into the new code and then gives us a clear

16:27.240 --> 16:31.280
seam we can replace or delete later.

16:31.280 --> 16:34.920
The goal of the ACL isn't elegance, it's containment.
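
NOTE A rough Go sketch of an anti-corruption layer under stated assumptions.
The types legacyHost and Device, the darwin-to-macos quirk, and the
legacyLookup function are all hypothetical and not taken from the talk.
```go
package main
import "fmt"
// legacyHost mimics a type from the legacy code, quirks included (hypothetical).
type legacyHost struct {
	ID       uint
	HostName string
	Platform string // assumed quirk: legacy code stores "darwin" for macOS
}
// Device is the shape the new bounded context wants to work with (hypothetical).
type Device struct {
	ID   string
	Name string
	OS   string
}
// legacyLookup stands in for a call into the legacy service.
type legacyLookup func(id uint) (legacyHost, error)
// DeviceACL is the only code that knows both legacyHost and Device.
type DeviceACL struct {
	lookup legacyLookup
}
// Device fetches a legacy host and translates it into the new context's terms.
func (a DeviceACL) Device(id uint) (Device, error) {
	h, err := a.lookup(id)
	if err != nil {
		return Device{}, fmt.Errorf("legacy lookup: %w", err)
	}
	os := h.Platform
	if os == "darwin" {
		os = "macos" // quirk translated here, not in the new context
	}
	return Device{ID: fmt.Sprint(h.ID), Name: h.HostName, OS: os}, nil
}
func main() {
	acl := DeviceACL{lookup: func(id uint) (legacyHost, error) {
		return legacyHost{ID: id, HostName: "laptop-1", Platform: "darwin"}, nil
	}}
	d, err := acl.Device(7)
	fmt.Println(d, err)
}
```
Deleting the legacy dependency later means replacing only the lookup
function and this one translation.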

16:34.920 --> 16:37.160
Now let's continue with the story.

16:37.160 --> 16:46.200
Before we changed any code, we had one big issue.

16:46.200 --> 16:52.680
Buy-in. If this was going to work, we didn't want this to feel like something Victor

16:52.680 --> 16:57.160
implemented and we just had to live with it.

16:57.160 --> 17:00.720
Our software engineers needed to feel like the end result was theirs.

17:00.720 --> 17:04.600
Not something imposed, not something slipped in quietly.

17:04.600 --> 17:07.720
Something they actually owned.

17:07.720 --> 17:11.320
We left the ADR open for discussion for about four weeks.

17:11.320 --> 17:14.120
This was longer than any ADR we'd done before.

17:14.120 --> 17:18.640
We wanted time for people to read it, question it, disagree with it and sit with it.

17:18.640 --> 17:23.480
The goal wasn't fast approval. The goal was shared understanding.

17:23.480 --> 17:26.840
Next we made a deliberate decision about reviews.

17:26.840 --> 17:30.600
Every pull request related to this ADR would be reviewed by the four tech leads from

17:30.600 --> 17:31.600
each product group.

17:31.600 --> 17:32.880
That did two things.

17:32.880 --> 17:37.280
First, it made sure they were genuinely on board. Second, it meant their teams were

17:37.280 --> 17:40.120
represented, not surprised later.

17:40.120 --> 17:44.280
Surprises are great for birthdays, not for architecture.

17:44.280 --> 17:49.040
Architecture stopped being something decided by one person and started being something carried

17:49.040 --> 17:51.080
by the org.

17:51.080 --> 17:53.720
We also didn't want the implementation to be rushed.

17:53.720 --> 17:58.640
The work followed our normal design process and we made progress visible to everyone.

17:58.680 --> 18:04.920
Visibility turned a risky architectural shift into a series of small understandable steps.

18:04.920 --> 18:08.720
Now at this point, it may seem like we were putting a lot of process in place.

18:08.720 --> 18:12.360
It probably sounds like we were saying, we need to have everything figured out before we

18:12.360 --> 18:13.360
start.

18:13.360 --> 18:16.040
And to be honest, that felt a little backwards.

18:16.040 --> 18:21.120
This first bounded context was supposed to be a pilot, not a well-engineered masterpiece.

18:21.120 --> 18:25.800
And a pilot is supposed to discover things, which means sometimes taking a step forward,

18:25.800 --> 18:28.560
only to backtrack when things aren't working.

18:28.560 --> 18:33.320
If we already had all the answers, if we were able to lay out all the steps

18:33.320 --> 18:38.160
for this refactoring top-to-bottom, then we wouldn't need a pilot, right?

18:38.160 --> 18:42.760
This means we needed to do a POC, a proof of concept, to discover things that we didn't

18:42.760 --> 18:43.760
know about.

18:43.760 --> 18:48.160
The goal of the POC was to expose hidden coupling and understand which boundaries were

18:48.160 --> 18:50.200
actually possible.

18:50.200 --> 18:55.000
So I started with a proof of concept, not a full implementation, just a scaffold, the idea

18:55.000 --> 18:56.000
was simple.

18:56.000 --> 18:58.600
I made the basic shape of a bounded context.

18:58.600 --> 19:03.840
That means using some shared utilities for HTTP, some middleware, basic MySQL access,

19:03.840 --> 19:08.480
nothing fancy, just enough structure for a dummy hello-world endpoint.

19:08.480 --> 19:12.720
Then I ran a dependency analysis and that's where things got interesting.

19:12.720 --> 19:19.200
That simple scaffold was pulling in half our monolith, not because the domain needed it,

19:19.200 --> 19:23.000
this hello-world domain of the scaffold didn't need anything yet, but because all of

19:23.000 --> 19:27.640
these utility packages did. The glue code was the monolith.

19:27.640 --> 19:32.840
The things we thought were harmless helpers were actually acting like dependency magnets.

19:32.840 --> 19:36.720
And there was another issue hiding in plain sight, a God package.

19:36.720 --> 19:43.160
Anytime someone needed a new type and didn't know where it belonged, it went into this fleet package

19:43.160 --> 19:45.000
at the bottom.

19:45.000 --> 19:49.160
Over time that large fleet package picked up its own dependencies, which were not shown

19:49.160 --> 19:54.040
on the previous slide, which meant everything that imported it, picked those up too,

19:54.040 --> 19:58.520
and that made circular dependencies incredibly easy to create and incredibly hard to

19:58.520 --> 19:59.920
reason about.

19:59.920 --> 20:04.960
That fleet package existed to make things easier, but it ended up doing the opposite.

20:04.960 --> 20:09.360
At that point it became clear we couldn't just extract a bounded context right away.

20:09.360 --> 20:14.880
We were missing a layer. Before we could pull anything out, we needed a platform layer,

20:14.880 --> 20:20.320
code that handled infrastructure concerns, HTTP helpers, database wiring, middleware,

20:20.320 --> 20:23.640
without dragging domain assumptions along with it.

20:23.640 --> 20:27.600
We needed to differentiate between two types of code.

20:27.600 --> 20:32.400
Domain code should express business rules, and platform code should make those rules possible

20:32.400 --> 20:33.800
to run.

20:33.800 --> 20:39.280
Before we could modularize the domain, we had to modularize the foundation.

20:39.280 --> 20:43.280
Another issue we had to solve from our first attempt was that even with good documentation,

20:43.280 --> 20:49.360
even with the ADRs, even with the buy-in, nothing stops coupling from sneaking back in, especially

20:49.360 --> 20:54.000
in a large code base, and especially nowadays when changes aren't only written by humans,

20:54.000 --> 21:00.400
but also by AI coding agents that optimize for "make it work" and not "preserve the architecture".

21:00.400 --> 21:05.840
Architecture that lives in people's heads doesn't scale, and it doesn't survive time.

21:05.840 --> 21:11.280
That's why we realized we needed architectural checks.

21:11.280 --> 21:16.880
Sometimes known as fitness functions.

21:16.880 --> 21:22.960
Automated checks that run continuously, checks that don't care who wrote the code, human or AI.

21:22.960 --> 21:28.080
Specifically, we needed checks to enforce boundaries, which packages can depend on which,

21:28.080 --> 21:33.120
what the platform layer is allowed to import, what a bounded context is not allowed to

21:33.120 --> 21:37.600
import, because once those rules are explicit, they stop being simply opinions.

21:37.600 --> 21:39.800
They become executable.
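
NOTE One way such a check could look in Go, as a sketch only. The import
paths under example.com/app and the rule table are hypothetical; the talk
does not say which tool or rules the team actually uses.
```go
package main
import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)
// forbidden maps a package path prefix to import prefixes it must not use.
// The paths are hypothetical, chosen only for illustration.
var forbidden = map[string][]string{
	"example.com/app/platform/": {"example.com/app/activity/", "example.com/app/legacy/"},
	"example.com/app/activity/": {"example.com/app/legacy/"},
}
// importsOf parses only the import clause of a Go source file.
func importsOf(src string) ([]string, error) {
	f, err := parser.ParseFile(token.NewFileSet(), "x.go", src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, spec := range f.Imports {
		path, err := strconv.Unquote(spec.Path.Value)
		if err != nil {
			return nil, err
		}
		out = append(out, path)
	}
	return out, nil
}
// violations lists the imports of pkg that break a rule in forbidden.
func violations(pkg string, imports []string) []string {
	var bad []string
	for prefix, banned := range forbidden {
		if !strings.HasPrefix(pkg, prefix) {
			continue
		}
		for _, imp := range imports {
			for _, b := range banned {
				if strings.HasPrefix(imp, b) {
					bad = append(bad, pkg+" must not import "+imp)
				}
			}
		}
	}
	return bad
}
func main() {
	src := "package http\nimport \"example.com/app/legacy/fleet\"\n"
	imps, err := importsOf(src)
	if err != nil {
		panic(err)
	}
	fmt.Println(violations("example.com/app/platform/http", imps))
}
```
Run in CI over every package, a non-empty result would fail the build, so
the rule holds no matter who or what wrote the change.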

21:39.800 --> 21:46.600
Now all of that tooling matters, but let's circle back to the core problem we're trying to solve.

21:46.600 --> 21:52.040
Looking back, the real constraint wasn't Go, it wasn't MySQL, and it wasn't even the monolith.

21:52.040 --> 21:54.200
The real constraint was human attention.

21:54.200 --> 21:59.560
How much of a system one person can hold in their head, and how confidently they can make a change.

21:59.560 --> 22:06.600
Once a system grows beyond that limit, progress slows, no matter how good the engineers are.

22:06.600 --> 22:10.680
We often talk about architecture as if it's for computers.

22:10.680 --> 22:14.920
But computers don't care how big your code base is, humans do.

22:14.920 --> 22:17.560
Architecture is really an interface for people.

22:17.560 --> 22:23.400
It defines what you need to understand, what you can safely ignore, and what you're allowed to change.

22:23.400 --> 22:27.080
When that interface is unclear, every change feels risky.

22:27.080 --> 22:30.680
When it's clear, people move faster, even in a large system.

22:31.640 --> 22:36.920
For a long time, I thought the hard part was architecture or code or tooling.

22:36.920 --> 22:42.440
It wasn't. The hardest part was alignment: shared language, shared understanding, and shared ownership.

22:43.640 --> 22:48.760
Modularization isn't really about modules, it's about respecting the limits of human attention,

22:48.760 --> 22:52.840
and designing systems people can understand, trust, and change without fear.

22:53.560 --> 22:59.000
Tools matter, patterns matter, but architecture lasts only when it's owned by more than one person.

23:01.000 --> 23:03.800
So before I wrap up, I want to pause on these.

23:03.800 --> 23:06.440
These are some of the teams using Fleet in production today.

23:06.440 --> 23:09.160
They're very different organizations with very different constraints.

23:09.720 --> 23:13.720
We learned a lot from working with customers like these, and many of the lessons I talk about

23:13.720 --> 23:17.640
come from operating at this kind of scale under real conditions.

23:19.080 --> 23:20.200
That's it for this talk.

23:20.200 --> 23:23.320
If this story resonated with you, I'd love to hear about your own.

23:24.040 --> 23:26.520
Problems are rarely neat, even when the code bases are.

23:27.400 --> 23:31.960
Here are a few links about Fleet. We are hiring Go developers right now,

23:33.320 --> 23:35.960
and you can find some links about me.

23:37.000 --> 23:38.040
And thanks for listening.

23:39.000 --> 23:39.800
Thank you very much.

