The problem is always the same. An enterprise has been running Microsoft 365 for years. SharePoint sites have proliferated — some owned, some orphaned, some shared with external partners for deals that closed three years ago. No one knows exactly what data is where, who can see it, or what classification it carries.
The compliance team asks: how many documents contain PII? The IT team asks: who are the 1,429 guest users and what can they access? The legal team asks: are any of our M&A documents shared with former counterparties? The answers are sitting in the Microsoft Graph API. They're just not assembled into anything useful.
The SH∀DW platform — SharePoint and OneDrive data governance — is the assembled answer.
The visibility gap
Most enterprise data governance starts from the wrong end. Tools are deployed to classify documents, alert on policy violations, or generate compliance reports — before anyone has a clear inventory of what they're governing.
The first thing SH∀DW does is generate that inventory. A crawl worker iterates every SharePoint site collection and every OneDrive in the tenant via the Microsoft Graph API. For each site, it records: the site ID, the display name, the item count, the total size, the crawl status. The result is a complete, queryable registry of everything that exists — including sites that have never been looked at.
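As a rough illustration of the inventory step, the sketch below builds one record per site from a paged Graph collection. It assumes the standard Graph paging convention (items under `value`, next page under `@odata.nextLink`); the `InventoryRecord` class, the `fetch_json` callable, and the `pending` status value are hypothetical names introduced here, not SH∀DW's actual schema.

```python
from dataclasses import dataclass


@dataclass
class InventoryRecord:
    # Fields mirror the ones listed above; names and defaults are illustrative.
    site_id: str
    display_name: str
    item_count: int = 0
    total_size_bytes: int = 0
    crawl_status: str = "pending"


def iter_graph_items(fetch_json, url):
    """Yield items from a paged Graph collection, following @odata.nextLink."""
    while url:
        page = fetch_json(url)  # fetch_json is injected: url -> parsed JSON dict
        yield from page.get("value", [])
        url = page.get("@odata.nextLink")


def build_inventory(fetch_json, sites_url):
    """Create one inventory record per site returned by the collection."""
    return [
        InventoryRecord(site_id=s["id"], display_name=s.get("displayName", ""))
        for s in iter_graph_items(fetch_json, sites_url)
    ]
```

Injecting `fetch_json` keeps authentication and HTTP handling out of the sketch, and lets the paging logic be exercised against canned pages.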
This sounds basic. In practice, it's transformative. When the crawl finishes and the inventory table appears in Athena, teams see their data estate clearly for the first time. Sites created for projects that ended years ago. Drives belonging to employees who left. Guest users who still have access. The visibility itself produces action.
PII classification before it reaches the lake
The crawl produces an inventory — but some of that inventory is sensitive. A SharePoint site containing HR documents or student records shouldn't be treated the same as a site containing marketing collateral.
SH∀DW classifies inventory entries at crawl time, using field-name pattern matching against a classification rules engine. An inventory entry containing fields matching PII patterns (names, emails, national IDs, student numbers) gets classified as `restricted` or `highly_restricted`.
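A minimal sketch of field-name classification might look like the following. The regular expressions, the `unrestricted` default label, and the rule that the most severe match wins are all assumptions made for illustration; the actual rules engine is not described in detail here.

```python
import re

# Hypothetical rules: (field-name pattern, classification label).
RULES = [
    (re.compile(r"national_?id|passport", re.I), "highly_restricted"),
    (re.compile(r"(first|last|full)_?name|email|student_?(number|id)", re.I),
     "restricted"),
]
SEVERITY = {"unrestricted": 0, "restricted": 1, "highly_restricted": 2}


def classify_entry(field_names):
    """Return (classification, pii_fields) for a list of field names."""
    level = "unrestricted"
    pii_fields = []
    for field in field_names:
        for pattern, label in RULES:
            if pattern.search(field):
                pii_fields.append(field)
                # Keep the most severe label seen across all fields.
                if SEVERITY[label] > SEVERITY[level]:
                    level = label
                break  # first matching rule decides this field
    return level, pii_fields


print(classify_entry(["Title", "Email", "NationalID"]))
# ('highly_restricted', ['Email', 'NationalID'])
```

Matching on field names rather than document contents is cheap enough to run during the crawl, at the cost of false positives and negatives when field names are misleading.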
The critical design decision: classification happens before the data reaches any analytics layer. The S3 export writes two tables: a redacted version where `pii_fields` and `schema_snapshot` are stripped for restricted entries, and a full version protected by AWS Lake Formation policies. Standard QuickSight users see the redacted table. Only operators with elevated permissions can query the full PII inventory.
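The two-table split can be sketched as a pure function over classified entries. This assumes each entry is a dict with a `classification` key and that both `restricted` and `highly_restricted` entries are redacted; the function name is hypothetical, and Lake Formation permissions are applied separately in AWS rather than in code like this.

```python
REDACTED_FIELDS = ("pii_fields", "schema_snapshot")
SENSITIVE = {"restricted", "highly_restricted"}  # assumption: both levels redacted


def split_for_export(entries):
    """Return (full_rows, redacted_rows) for the two export tables."""
    full_rows, redacted_rows = [], []
    for entry in entries:
        full_rows.append(dict(entry))
        if entry.get("classification") in SENSITIVE:
            # Drop the sensitive columns but keep the row, so counts still match.
            redacted_rows.append(
                {k: v for k, v in entry.items() if k not in REDACTED_FIELDS}
            )
        else:
            redacted_rows.append(dict(entry))
    return full_rows, redacted_rows
```

Keeping a row in the redacted table for every entry means dashboards can still count restricted sites without revealing which fields triggered the classification.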
The sharing exposure problem
Knowing what exists is half the problem. Knowing who can see it is the other half. In a multi-company M365 tenant — where several subsidiary companies share a single Microsoft tenant — sharing attribution is genuinely difficult.
The SH∀DW sharing exposure engine runs as a 4-phase state machine:
| Phase | What it does | Output |
|---|---|---|
| 1 — Tenant settings | Reads the SharePoint tenant's external sharing configuration | Is external sharing enabled? What policy? |
| 2 — Activity reports | Downloads SP/OD activity CSVs via Graph API | Which files have been accessed by which users? |
| 3 — Guest enumeration | Lists all Entra guest users, extracts email domain | 1,429 guests attributed to external companies |
| 4 — Site membership | Cross-references guests with site member records | Which sites does each guest company have access to? |
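One way to picture the four phases in the table is as a linear state machine in which each phase adds its output to shared state for later phases to use. The phase names, the handler signature, and the stub handlers below are hypothetical; real handlers would call the Graph API.

```python
from enum import Enum


class Phase(Enum):
    TENANT_SETTINGS = 1
    ACTIVITY_REPORTS = 2
    GUEST_ENUMERATION = 3
    SITE_MEMBERSHIP = 4
    DONE = 5


def run_exposure_scan(handlers, start=Phase.TENANT_SETTINGS):
    """Run phases in order; each handler maps current state to new keys."""
    state = {}
    phase = start
    while phase is not Phase.DONE:
        state.update(handlers[phase](state))
        state["last_completed"] = phase.name  # checkpoint for resuming later
        phase = Phase(phase.value + 1)
    return state


# Stub handlers standing in for the real Graph-backed phases.
handlers = {
    Phase.TENANT_SETTINGS: lambda s: {"external_sharing_enabled": True},
    Phase.ACTIVITY_REPORTS: lambda s: {"activity_rows": []},
    Phase.GUEST_ENUMERATION: lambda s: {"guests": {"partner.example": ["g1"]}},
    Phase.SITE_MEMBERSHIP: lambda s: {"sites_by_company": {d: [] for d in s["guests"]}},
}
result = run_exposure_scan(handlers)
```

Recording `last_completed` after each phase is one simple way to let a long-running scan restart from the phase that failed rather than from the beginning.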
The attribution problem is subtle. In M365, guest users are tenant-scoped — a guest invited to any site in the parent tenant appears in the parent's Entra directory. But subsidiary companies run their own SharePoint site collections within the same tenant. An employee from one subsidiary appearing as a guest on another subsidiary's site is expected — but a former M&A counterparty appearing as a guest six years after a deal closed is not.
The UPN parsing handles the B2B guest format, `user_domain.com#EXT#@tenant.onmicrosoft.com`, in which Entra replaces the `@` of the guest's original address with an underscore and appends `#EXT#` plus the tenant domain. Parsing recovers the real external domain from this obfuscated representation. This is the mechanism that makes guest attribution accurate across 1,429 external users spread across dozens of external organisations.
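The parsing step can be sketched as follows. It assumes the guest UPN follows the `#EXT#` format shown above; the function name is hypothetical. The split is on the last underscore because the local part of an email address may itself contain underscores, while DNS hostnames do not.

```python
from collections import Counter

EXT_MARKER = "#EXT#"


def external_domain(upn):
    """Return the guest's home domain, or None if upn is not a B2B guest UPN."""
    local, marker, _ = upn.partition(EXT_MARKER)
    if not marker:
        return None  # ordinary member account, no #EXT# marker
    # "jane_doe_partner.example" -> domain is everything after the last "_".
    _, sep, domain = local.rpartition("_")
    if not sep or "." not in domain:
        return None
    return domain.lower()


upns = [
    "jane_doe_partner.example#EXT#@tenant.onmicrosoft.com",
    "sam_Partner.Example#EXT#@tenant.onmicrosoft.com",
    "bob@tenant.onmicrosoft.com",
]
guests_per_domain = Counter(d for d in map(external_domain, upns) if d)
print(guests_per_domain)  # Counter({'partner.example': 2})
```

Grouping by the extracted domain is what turns a flat list of guest accounts into a per-company view of external access.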
The analytics layer
Once the crawl results are in S3, a Lambda export job writes 11 partitioned datasets to S3 in NDJSON format, registers Glue partitions, and makes the data queryable via Athena. QuickSight connects to Athena and surfaces 5 dashboard sheets:
| Sheet | Audience | Key questions answered |
|---|---|---|
| Executive summary | CISO, CDO | Total sites, total risk score, classification breakdown |
| Exposure detail | Security team | Risk matrix by site × classification, SP vs OD exposure bar |
| External access intelligence | IT + Legal | Guest domains, activity by external party, tenant sharing capability |
| Company site profile | IT per subsidiary | Sites per company, classification donut, source breakdown |
| Site inventory | IT operations | Full searchable flat inventory with crawl status filter |
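For the export step that feeds these sheets, a minimal sketch of NDJSON serialisation and partitioned object keys is shown below. The `crawl_date` partition column, the dataset name, and the file name are assumptions; the Hive-style `key=value` path segment is the layout Glue and Athena recognise for partitions.

```python
import json
from datetime import date


def to_ndjson(records):
    """Serialise records as newline-delimited JSON, one object per line."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records)


def partition_key(dataset, crawl_date, filename="part-0000.ndjson"):
    """Build an S3 object key with a Hive-style partition segment."""
    return f"{dataset}/crawl_date={crawl_date.isoformat()}/{filename}"


body = to_ndjson([{"site_id": "a", "classification": "restricted"}])
key = partition_key("site_inventory", date(2024, 1, 15))
print(key)  # site_inventory/crawl_date=2024-01-15/part-0000.ndjson
```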
The risk score formula, `guest_user_count + sp_exposure + od_exposure + (pii_count × 10)`, is deliberately simple. The 10× multiplier on PII exposure reflects that the consequence of a PII breach is categorically different from a document policy violation. Simple formulas are more defensible to auditors than complex ML models.
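Written out as code, the formula is a one-liner. The function name and the `pii_weight` parameter are illustrative, and the input values in the example are made up to show the arithmetic.

```python
def risk_score(guest_user_count, sp_exposure, od_exposure, pii_count, pii_weight=10):
    """Additive risk score with a fixed multiplier on PII exposure."""
    return guest_user_count + sp_exposure + od_exposure + pii_count * pii_weight


# 12 guests, 30 SharePoint exposures, 5 OneDrive exposures, 4 PII entries:
# 12 + 30 + 5 + (4 * 10) = 87
print(risk_score(12, 30, 5, 4))  # 87
```

Because every term is visible in the sum, an auditor can recompute any site's score by hand from the inventory tables.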
What this enables that wasn't possible before
Before SH∀DW, answering "which external party has access to which documents" required a multi-week manual exercise by an IT team. After SH∀DW, it's a QuickSight filter. A question that used to take weeks can be answered in under 30 seconds, and it can be answered continuously, not just at audit time.
The architecture generalises beyond M365. The same crawl-classify-expose pattern applies to any unstructured data estate: Salesforce document libraries, Google Drive, Box, legacy file shares. The pipeline topology is the same. The classification rules and the API adapters are the only things that change.
For any organisation that has grown through acquisition — and most large organisations have — this kind of visibility layer is not a nice-to-have. It's the prerequisite for everything else: GDPR compliance, M&A due diligence, zero-trust network policy, data residency enforcement. You cannot govern what you cannot see.