The monorepo vs polyrepo debate has been running for twenty years. Google published their seminal paper in 2016. Meta forked Mercurial. Microsoft built a virtual filesystem. Amazon went the other way entirely. The arguments are well-rehearsed.

What's changed is the audience. When an AI coding assistant indexes your repository, it's consuming the tree as a data structure. Every file, every directory, every binary blob competes for context window space. The repo's shape isn't just an organisational choice anymore. It's an input to the quality of AI-generated code.

This article looks at how the five largest technology companies manage their source code, what pattern they converged on, and what that means for the rest of us building with AI assistants in the loop.

How the big five do it

The approaches are surprisingly divergent in implementation but convergent in the problems they solve.

Google (monorepo): Piper (custom VCS), CitC cloud workspaces, Bazel builds. Two billion lines, 86TB, 25,000 engineers, trunk-based. Key innovation: cloud working copies that only materialise touched files.

Meta (monorepo): Sapling (ex-Mercurial), EdenFS virtual filesystem, Mononoke server (Rust). Key innovation: EdenFS makes all files appear present and fetches content on demand.

Microsoft (mixed): forked Git, GVFS / Scalar, sparse checkout plus filesystem monitor. Virtualised the 300GB Windows repo; clone time fell from twelve hours to minutes. Scalar extensions are now partially upstreamed into Git.

Amazon (polyrepo): Git, the Brazil build system, two-pizza team ownership. Service boundaries are enforced by repo boundaries; internal tooling manages cross-repo coordination.

Apple (hybrid): Perforce plus Git, mostly private. Large platform repos on Perforce, separate repos for individual apps, no public tooling contributions.
Five companies, four different version control systems, one shared insight: at scale, you can't clone everything.

Google's Piper holds over two billion lines of code in a single repository. It's not Git. It's a custom system distributed across ten data centres. Developers don't clone it. They use CitC (Clients in the Cloud), which gives each engineer a cloud-based workspace that only materialises the files they touch. The build system, Bazel (open-sourced from Google's internal Blaze), understands the full dependency graph and only rebuilds what changed. Twenty-five thousand engineers commit to a single trunk. No long-lived branches.

Meta took a different path to the same destination. They started on Git, hit scaling walls, asked the Git maintainers for help, were told to use multiple repos, refused, and migrated to Mercurial. Then they rewrote Mercurial's internals. The result evolved into Sapling, open-sourced in 2022. As at Google, the key innovation is a virtual filesystem: EdenFS makes all files appear present but fetches their content only on demand. The server side, Mononoke, is written in Rust for performance.

Microsoft's Windows codebase is roughly 3.5 million files and 300GB as a Git repo. When they migrated to Git around 2017, clone took twelve hours, checkout took two to three hours, and git status took ten minutes. Their solution was GVFS (Git Virtual File System), which virtualises the working directory. This evolved into Scalar, a lighter set of Git extensions (sparse checkout, filesystem monitor, commit-graph, multi-pack-index) now partially upstreamed into Git itself.
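Most of what Scalar pioneered is now available in stock Git. A sketch of a partial, sparse clone; the repository URL and directory names are placeholders:

```shell
# Partial clone: fetch commits and trees, defer blob contents until needed
git clone --filter=blob:none --no-checkout https://example.com/big-repo.git
cd big-repo

# Cone-mode sparse checkout: materialise only the directories you work in
git sparse-checkout init --cone
git sparse-checkout set services/payments docs

# Populates only the selected paths
git checkout main

# Builtin filesystem monitor (where supported): keeps git status fast
git config core.fsmonitor true
```

Blobs outside the sparse cone are fetched lazily the first time a command needs them, which is the same materialise-on-demand idea as GVFS, minus the filesystem virtualisation.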

Amazon is the outlier. Their two-pizza team philosophy maps directly to polyrepo: each service team owns its own repository, its own deployment pipeline, its own operational responsibility. The trade-off is coordination cost. Cross-service changes require synchronising across repos. Amazon built internal tooling (Brazil build system, Pipelines) to manage that cost. The architecture enforces service boundaries by making them repository boundaries.

Apple publishes almost nothing about their internal source control. What's known from job postings and former employee accounts suggests Perforce for large platform projects and Git for newer, smaller ones. No public tooling contributions in this space.

The virtual filesystem pattern

The convergence is striking. Google, Meta, and Microsoft, three companies that compete on almost everything, all independently arrived at the same architectural solution: a virtual filesystem layer between the repository and the developer's working directory.

The principle is identical in all three: the repository is too large to clone. So don't clone it. Present a view that looks like a full checkout but only fetches file contents when they're actually opened. Everything else is a placeholder, a stub that knows the file's metadata but not its content.

The data flow: a remote repository (millions of files, terabytes of history) sits behind a virtual filesystem layer (stubs for every file, metadata only until touched, content fetched on first open), which presents a working directory containing only materialised files, so status and checkout stay fast. Google's layer is CitC (Clients in the Cloud); Meta's is EdenFS (FUSE-based); Microsoft's was VFS for Git, now Scalar. Same pattern, three independent implementations, zero coordination.
The universal scaling solution: don't clone everything. Present stubs, materialise on demand.

The implication for smaller teams is not that you need a virtual filesystem. It's that the problem these systems solve (too much stuff in the tree) is the same problem at every scale. Google solves it with CitC. You solve it with .gitignore and discipline. The principle is identical: only materialise what matters.

AI changes the calculus

The traditional monorepo vs polyrepo trade-off was about humans. Monorepos give you atomic changes across services but create build complexity. Polyrepos give you team autonomy but create coordination overhead. The right choice depended on your team size, deployment model, and tolerance for tooling investment.

AI coding assistants shift this. When an AI coding tool indexes a repository, having the API contract, the Terraform module, and the frontend consumer in the same tree means the agent can see how changes propagate. In a polyrepo, the agent sees one service at a time. It can't reason about the contract between services because the other side of the contract is in a different repository.

This tips the scales toward monorepos for AI-assisted development. But there's a catch. AI agents are more sensitive to noise than humans. A developer can glance at a directory listing and ignore node_modules/, .hypothesis/, and a folder full of CV documents. An AI context gatherer will index them, spend tokens on them, and potentially incorporate irrelevant patterns into its suggestions.

The signal-to-noise ratio of your repository directly affects the quality of AI-generated code. This isn't theoretical. It's measurable. A repo with 4,000 files where 1,000 are binary artifacts, cache directories, and personal documents means the AI is working with 25% noise. That's context window space that could have been used for actual code understanding.
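That ratio is easy to approximate for your own checkout. A sketch using only standard Git commands: it treats ignored-but-present files (caches, node_modules/, build output) as the noise an unfiltered indexer would sweep up:

```shell
# Tracked source files
tracked=$(git ls-files | wc -l)

# Files sitting in the working tree that .gitignore excludes:
# present on disk, invisible to Git, noise to a context gatherer
ignored=$(git ls-files --others --ignored --exclude-standard | wc -l)

total=$((tracked + ignored))
echo "tracked=$tracked ignored-on-disk=$ignored noise=$((100 * ignored / total))%"
```

This undercounts noise that is actually committed (binaries, generated files in history), but it is a quick first read on how much of the tree is not source.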

The repo as data structure

Here's the reframe. Your repository isn't just a place to store code. It's a data structure that both humans and AI agents consume. Like any data structure, it has invariants, properties that should always hold true. When those invariants are violated, performance degrades.

The invariants for an AI-friendly repository:

  1. Nothing that doesn't diff meaningfully. Binary files, compiled artifacts, and generated outputs don't produce useful diffs. They consume storage and context without contributing to understanding.
  2. Shallow directory structure. Every level of nesting increases the number of tree objects in Git and makes path-based reasoning harder for both humans and AI. Eight levels deep is a code smell.
  3. Clear boundary between permanent and ephemeral. Source code is permanent. Build artifacts, scratch files, and personal documents are ephemeral. They should never share the same commit history.
  4. Aggressive exclusion of non-diffable content. The .gitignore isn't just about keeping the repo clean. It's about keeping the AI's context window clean.
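Invariants 1 and 4 can be checked mechanically. A sketch that flags tracked files whose extensions never diff meaningfully; extend the extension list for your stack:

```shell
# List tracked files that are opaque blobs by extension.
# A clean repo prints nothing; exceptions (a favicon, an og-card image)
# should be rare and deliberate.
git ls-files | grep -Ei '\.(zip|docx|pdf|mp4|png|jpg|jar|exe|so|dylib)$' || true
```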
Before, the noisy tree: site/, demos/, terraform/, tools/, docs/, plus a tmp/ directory of scratch files, a plans/ directory duplicated in docs/, eight .docx files (binary, non-diffable), five .zip build artifacts, six .mp4 files in Git history, a .hypothesis/ test cache, 18MB of probe/node_modules/ dependencies, and a 725MB Terraform provider in Git history. Total: 3,953 files, a 183MB .git, 101 commits, roughly 25% noise in the context window.

After, the clean tree: site/, demos/, terraform/, tools/, docs/, relay/, .agent/, .github/. tmp/ is gitignored; *.docx and *.zip are gitignored; *.mp4 moves to Git LFS or S3; .hypothesis/ and node_modules/ are gitignored; the 725MB provider is purged from history. Total: around 2,800 files, a .git under 10MB, clean history, under 5% noise in the context window.
Same codebase, same functionality. The difference is what the AI agent sees when it indexes the tree.

Practical hygiene

You don't need Piper or EdenFS. You need five rules applied consistently.

1. The binary boundary rule

Nothing that doesn't produce a meaningful diff belongs in Git history. PDFs, Word documents, zip archives, compiled binaries, media files. These are opaque blobs to Git. They inflate the object store, slow clones, and waste AI context. The .gitignore should catch them before they're ever committed. If they've already been committed, git-filter-repo can purge them from history.

# .gitignore: binary boundary
*.zip
*.docx
*.pdf
*.mp4
*.png
!site/favicon.png    # explicit exceptions for tracked assets
!site/og-card.png

For media that genuinely needs to be version-controlled (demo videos, design assets), Git LFS tracks them as pointers in the repo with the actual content stored externally. The repo stays small. The media is still accessible.
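LFS works by rewriting these files to small pointer stubs at commit time, driven by .gitattributes. Running `git lfs track "*.mp4"` (assuming git-lfs is installed) appends an entry like this, which you commit alongside the media:

```
# .gitattributes (written by `git lfs track "*.mp4"`)
*.mp4 filter=lfs diff=lfs merge=lfs -text
```

From then on, clones download only the pointers; the actual content comes from the LFS endpoint on checkout.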

2. History hygiene

A 183MB .git directory for a repo with 101 commits is a red flag. The usual cause: large files committed early, then gitignored later. The files are gone from the working tree but live forever in the object store. Every clone downloads them. Every AI tool that inspects history encounters them.

# Find the largest objects in your repo
git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
  | sort -rnk2 \
  | head -20

# Purge specific paths from history (destructive, requires force push)
pip install git-filter-repo
git filter-repo --path path/to/large/directory/ --invert-paths

3. Shallow directory structure

Deep nesting creates more tree objects in Git, makes path-based reasoning harder, and reduces the effectiveness of AI context gathering. A path like demo/pitch/terraform/.terraform/providers/registry.terraform.io/hashicorp/aws/5.100.0/darwin_amd64/ is ten levels deep. That's a data structure problem, not an organisation problem.

The rule of thumb: if a human can't type the path from memory, it's too deep. Three to four levels is the sweet spot for most projects. Beyond that, consider whether the nesting reflects real boundaries or just mirrors an external tool's directory convention.
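Depth is easy to audit mechanically. A sketch that lists the deepest tracked paths, counting directory levels with awk:

```shell
# Show the ten deepest tracked paths; the first field is the nesting depth
git ls-files \
  | awk -F/ '{ print NF - 1, $0 }' \
  | sort -rn \
  | head -10
```

Anything this surfaces at depth six or more is worth a second look: is the nesting a real boundary, or an external tool's convention leaking into the tree?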

4. Ephemeral content conventions

Every repo accumulates scratch files. Temporary scripts, experiment outputs, personal documents, draft plans. The question is whether they belong in the committed tree. Usually they don't.

A simple convention: tmp/ is gitignored entirely. Anything in it is local-only. If a scratch file graduates to something permanent, it moves to the appropriate directory and gets committed. Personal content (CVs, job hunt materials, outreach documents) lives outside the repo or in a separate, private repository.

5. AI-aware .gitignore

The traditional .gitignore is about keeping the working tree clean for humans. An AI-aware .gitignore adds a second concern: keeping the context window clean for agents. This means excluding things that humans might tolerate but AI agents shouldn't waste tokens on.

# AI-aware additions to .gitignore

# Test caches (useful locally, noise for AI)
.hypothesis/
.pytest_cache/
.mypy_cache/

# Tool output (regenerable, not source)
tools/probe/node_modules/
tools/probe/probe-output/

# Ephemeral workspace
tmp/

The monorepo that works

A well-structured monorepo for AI-assisted development looks like this: flat top-level directories with clear boundaries, no binary artifacts in history, aggressive exclusion of non-source content, and shallow nesting within each module.

project/
├── site/           # static frontend
├── demos/          # self-contained demo projects
│   ├── scanner/    #   each with own tests + docs
│   └── grants/     #   independent but co-located
├── terraform/      # shared infrastructure
├── tools/          # development utilities
├── docs/           # articles, plans, governance
├── .agent/         # AI steering + specs
├── .github/        # CI/CD workflows
├── .gitignore      # aggressive, AI-aware
└── .clinerules     # AI agent context

The structure communicates intent. A new developer, or a new AI agent, can look at the top level and understand what the project contains, where to find things, and what the boundaries are. No archaeology required.

The GitHub Well-Architected guidance calls this "modular organisation within the monorepo." Google's Piper paper calls it "directory-based ownership." The principle is the same: the directory tree is the API of your repository. Design it like one.

What the guidance misses

The GitHub Well-Architected article on repository architecture strategy is thorough on the human workflow side: branching strategies, CI/CD pipelines, access control, documentation. It's a solid checklist for teams making the monorepo vs polyrepo decision.

What it doesn't cover, and what matters increasingly, is the AI dimension. How does your repository structure affect the quality of AI-generated code? How does noise in the tree degrade context gathering? How should .gitignore patterns change when AI agents are part of the development workflow?

These aren't edge cases anymore. They're the primary interface through which a growing number of developers interact with their codebase. The repo's shape is an input to AI quality, and that makes it a first-class design concern.

Sources and further reading

All sources are linked inline.
