5 Proven Ways to Master Data Classification as Code

If your services handle PII, PHI, or cardholder data, you’ve probably felt this pain:

  • Auditors asking where “all PII lives” while engineers grep for email and ssn.
  • PCI DSS and SOC 2 controls talking about “cardholder data environments” and “logical access,” but your CI/CD only talks about builds and tests.
  • Risk assessments and remediation plans living in spreadsheets, while your team ships from Git.

Data Classification as Code is how you close that gap.

Instead of treating data classification as a one-off Excel artifact, you describe sensitive data, flows, and required controls in YAML/JSON, keep it in Git, and wire it directly into CI/CD. That way:

  • Risky changes to sensitive paths can trigger extra checks and approvals.
  • External scanners and pentests focus on what actually matters.
  • Compliance teams (and partners like Pentest Testing Corp) can turn CI output into audit-ready evidence for SOC 2, PCI DSS, ISO 27001, GDPR, and HIPAA.

Below is a practical, code-heavy guide aimed at developers, DevOps/SRE, and engineering leaders building or running services that touch PII, PHI, or card data—especially if you’re already using Cyber Rely and Pentest Testing Corp for security services and assessments.


What is Data Classification as Code?

Data Classification as Code means:

  • Your data classes (public, internal, confidential, restricted) and tags (PII, PHI, cardholder, secrets, logs) are machine-readable (YAML/JSON).
  • Each service owns a small config file that lists data assets, endpoints, queues, and storage—plus the controls they must have (TLS, logging, tests, approvals).
  • CI/CD jobs read those files and fail builds when sensitive pieces don’t meet the rules.

Instead of “security saying no” after the fact, classification becomes:

  • Version-controlled (PRs, code review, history).
  • Testable (CI jobs).
  • Mappable to controls across PCI DSS, SOC 2, ISO 27001, GDPR, HIPAA. (pentesttesting.com)

In other words: classification stops being a slide deck and becomes an input to your pipelines.

The rest of this article walks through five proven ways to implement Data Classification as Code inside Cyber Rely–style SDLCs, with real snippets you can adapt.


1) Define a data-classification.yml for Every Service

Start with one small, per-service YAML file in your repo—treat it like docker-compose.yml or openapi.yml:

Example: data-classification.yml (billing service)

service: billing-api
owners:
  team: payments
  slack: "#team-payments"
  jira_project: BILLING

data_levels:
  - name: public
    description: Public, marketing-safe information
  - name: internal
    description: Internal non-customer data
  - name: confidential
    description: Customer data with moderate impact
  - name: restricted
    description: Highly sensitive data (PII, PHI, card data, secrets)

assets:
  - id: customer_db
    type: postgres.table
    location: "rds://billing-prod/customers"
    classification:
      level: restricted
      tags: [pii, cardholder]
      regulations: [pci_dss, gdpr]
    retention: "7y"
    encryption:
      at_rest: true
      in_transit: true

  - id: invoice_pdf_s3
    type: s3.bucket
    location: "s3://billing-prod-invoices"
    classification:
      level: confidential
      tags: [pii]
      regulations: [gdpr]

routes:
  - path: /api/v1/cards
    method: POST
    description: Tokenize and store card details
    asset: customer_db
    classification:
      level: restricted
      tags: [cardholder, pii]
    requires:
      tls_min: "TLS1.2"
      mfa_for_admins: true
      logging_enabled: true
      no_pii_in_logs: true
      tests:
        - "tests/api/test_cards.py::test_pan_never_logged"

  - path: /api/v1/invoices/{id}
    method: GET
    description: Download invoice PDF
    asset: invoice_pdf_s3
    classification:
      level: confidential
      tags: [pii]
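    requires:
      tls_min: "TLS1.2"
      logging_enabled: true
      no_pii_in_logs: true
      tests:
        # Hypothetical test name; the CI gate in section 2 expects at least one
        # test on every confidential or restricted route.
        - "tests/api/test_invoices.py::test_invoice_requires_auth"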

internet_exposure:
  public_domain: "billing.example.com"
  contains_pii: true
  contains_cardholder: true
  scan_profile: "external-high"

A few tips:

  • Keep it small: assets + routes + exposure is enough to start.
  • Make classification.level and classification.tags match how your compliance team talks about card data (PCI DSS), personal data (GDPR), and PHI (HIPAA).
  • Store this file at the root (data-classification.yml) or in /compliance/.

Once this exists for each service, you have a developer-owned map of where sensitive data lives and how it should be protected.
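
Because the file is maintained by hand, a quick structural check can catch typos before any policy gate runs. The sketch below assumes the file has already been parsed into a dict (for example with yaml.safe_load) and that every route is expected to reference a declared asset. The function name validate_classification and that second rule are illustrative assumptions, not part of the format above.

```python
from typing import Any


def validate_classification(doc: dict[str, Any]) -> list[str]:
    """Return structural problems found in a parsed data-classification file."""
    problems: list[str] = []

    # Names declared under data_levels and ids declared under assets.
    levels = {lvl.get("name") for lvl in doc.get("data_levels") or []}
    asset_ids = {asset.get("id") for asset in doc.get("assets") or []}

    for asset in doc.get("assets") or []:
        level = (asset.get("classification") or {}).get("level")
        if level not in levels:
            problems.append(f"asset {asset.get('id')}: unknown level {level!r}")

    for route in doc.get("routes") or []:
        label = f"{route.get('method', 'GET')} {route.get('path')}"
        level = (route.get("classification") or {}).get("level")
        if level not in levels:
            problems.append(f"{label}: unknown level {level!r}")
        # Assumption: every route points at an asset defined in the same file.
        if route.get("asset") not in asset_ids:
            problems.append(
                f"{label}: references undefined asset {route.get('asset')!r}"
            )

    return problems
```

An empty list means the levels and asset references line up; anything else is worth failing the build on before the stricter checks run.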


2) Wire Data Classification as Code into CI/CD Gates

Next, let CI enforce some simple but powerful rules:

  • If classification.level is restricted or confidential, then:
    • TLS must be enabled.
    • Logging rules must be defined (logging_enabled, no_pii_in_logs).
    • At least one test is listed in requires.tests.

Python script: ci/check_classification.py

#!/usr/bin/env python3
import sys
import pathlib

import yaml  # pip install pyyaml

CONFIG_PATH = pathlib.Path("data-classification.yml")


def main() -> int:
    if not CONFIG_PATH.exists():
        print("No data-classification.yml found, skipping check.")
        return 0

    # safe_load returns None for an empty file, so fall back to an empty dict
    data = yaml.safe_load(CONFIG_PATH.read_text()) or {}
    errors: list[str] = []

    for route in data.get("routes", []):
        level = (route.get("classification") or {}).get("level", "internal")
        path = route.get("path")
        method = route.get("method", "GET")

        # Only enforce strict rules on confidential/restricted
        if level not in {"confidential", "restricted"}:
            continue

        requires = route.get("requires") or {}

        if not requires.get("tls_min"):
            errors.append(f"{method} {path}: missing tls_min for {level} data")

        if not requires.get("logging_enabled"):
            errors.append(f"{method} {path}: logging_enabled=false for {level} data")

        if not requires.get("no_pii_in_logs"):
            errors.append(f"{method} {path}: no_pii_in_logs not enforced")

        tests = requires.get("tests") or []
        if not tests:
            errors.append(f"{method} {path}: no tests listed for high-sensitivity route")

    if errors:
        print("Data Classification as Code check failed:")
        for err in errors:
            print(" -", err)
        return 1

    print("Data Classification as Code check passed.")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())

GitHub Actions: block merges when rules aren’t met

# .github/workflows/data-classification.yml
name: data-classification-gates

on:
  pull_request:
    paths:
      - "data-classification.yml"
      - "ci/**"   # re-run when the check script itself changes
      - "src/**"
      - "tests/**"

jobs:
  classification:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: pip install pyyaml

      - name: Run Data Classification as Code checks
        run: python ci/check_classification.py

This does a few important things for PCI DSS, SOC 2, ISO 27001, and GDPR/HIPAA audits:

  • Shows that changes to sensitive paths are gated by automated checks.
  • Produces pipeline logs and artifacts that map cleanly to secure SDLC and logging requirements (PCI DSS 6/10, SOC 2 CC6/CC7, ISO 27001 A.8/A.12).

If you already use policy-as-code / CI gates (for example, from Cyber Rely’s guides on CI/CD compliance and API security), this becomes one more input into those gates rather than a separate process.
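
The script above only inspects routes. The assets block also declares encryption.at_rest and encryption.in_transit, so a similar gate could cover storage. This is a minimal sketch under two assumptions: the file is already parsed into a dict, and both flags must be explicitly true for confidential and restricted assets. The function name check_asset_encryption is hypothetical.

```python
from typing import Any

SENSITIVE_LEVELS = {"confidential", "restricted"}


def check_asset_encryption(doc: dict[str, Any]) -> list[str]:
    """Flag sensitive assets that do not declare encryption at rest and in transit."""
    errors: list[str] = []
    for asset in doc.get("assets") or []:
        level = (asset.get("classification") or {}).get("level")
        if level not in SENSITIVE_LEVELS:
            continue
        encryption = asset.get("encryption") or {}
        for key in ("at_rest", "in_transit"):
            # A missing key is treated the same as an explicit false.
            if encryption.get(key) is not True:
                errors.append(
                    f"{asset.get('id')}: encryption.{key} must be true for {level} data"
                )
    return errors
```

Run against the example file from section 1, this would flag invoice_pdf_s3, which is confidential but declares no encryption block.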


3) Scan Tagged Internet-Facing Assets and Push Findings with Labels

Once the YAML tells you which domains and APIs are internet-facing and hold PII/PHI/card data, you can:

  1. Focus external scans on those assets.
  2. Tag every finding with the same classification labels used in your YAML.
  3. Send that into GitHub/Jira with the right labels so triage is fast and audit-ready.

Script: export high-risk URLs to scan

#!/usr/bin/env python3
import pathlib
import yaml

data = yaml.safe_load(pathlib.Path("data-classification.yml").read_text())

urls_to_scan: list[str] = []

internet = data.get("internet_exposure") or {}
if internet.get("contains_pii") or internet.get("contains_cardholder"):
    domain = internet.get("public_domain")
    if domain:
        urls_to_scan.append(f"https://{domain}")

for url in urls_to_scan:
    print(url)

You can wire this into your CI to produce a list of targets for your external scanner (including Cyber Rely’s preferred free Website Vulnerability Scanner at free.pentesttesting.com). (DEV Community)

Using the free Website Vulnerability Scanner in your workflow

At Cyber Rely and Pentest Testing Corp, we often recommend teams:

  1. Visit https://free.pentesttesting.com/.
  2. Paste the URLs exported from data-classification.yml.
  3. Run quick scans before major releases.
  4. Attach before/after reports to Jira/GitHub tickets as evidence artifacts.

In your blog or internal docs, you can embed two screenshots around this step:

[Screenshot: the free tools webpage, where you can access security assessment tools for different vulnerability detection.]

[Screenshot: a sample vulnerability assessment report generated with the free tool, showing the kind of insight it gives into potential vulnerabilities.]

Script: enrich scanner results with classification labels

Assume your scanner exports a CSV like:

target,severity,id,title
https://billing.example.com,High,XSS-001,Reflected XSS on /pay
https://billing.example.com,Medium,HEADERS-002,Missing HSTS

You can merge this with data-classification.yml and produce JSON ready for GitHub/Jira:

#!/usr/bin/env python3
import csv
import json
import pathlib

import yaml

root = pathlib.Path(".")
classification = yaml.safe_load((root / "data-classification.yml").read_text())
internet = classification.get("internet_exposure") or {}
labels = []

if internet.get("contains_cardholder"):
    labels.append("data:cardholder")
if internet.get("contains_pii"):
    labels.append("data:pii")

out = []

# newline="" is what the csv module expects when reading files
with (root / "scan-results.csv").open(newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        out.append(
            {
                "summary": row["title"],
                "target": row["target"],
                "severity": row["severity"],
                "scanner_id": row["id"],
                "labels": labels + [
                    f"service:{classification.get('service','unknown')}",
                    "source:external-scan",
                ],
            }
        )

print(json.dumps(out, indent=2))

You can then post this JSON to your issue tracker API so that every vulnerability on a high-sensitivity service arrives pre-labeled with:

  • The service name.
  • Data classification tags.
  • Whether it came from an external scan.

That makes it trivial for risk/compliance teams (or Pentest Testing’s Risk Assessment Services) to filter and prioritize.
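
As one possible way to do the posting step, the sketch below turns a single enriched finding into a request for GitHub's "create an issue" REST endpoint (POST /repos/{owner}/{repo}/issues). It only builds the request and does not send it. The repository name and token are placeholders, and the title/body layout is an assumption rather than anything the scanner or GitHub prescribes.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_issue_request(finding: dict, repo: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GitHub issue-creation request for one finding.

    `finding` is one element of the JSON list produced by the enrichment script;
    `repo` is "owner/name".
    """
    payload = {
        "title": f"[{finding['severity']}] {finding['summary']}",
        "body": f"Target: {finding['target']}\nScanner ID: {finding['scanner_id']}",
        "labels": finding["labels"],
    }
    return urllib.request.Request(
        url=f"{GITHUB_API}/repos/{repo}/issues",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to urllib.request.urlopen would send it; in CI the token would normally come from a secret rather than from source control.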


4) Drive Extra Approvals, Logging, and TLS from Classification

Now that your YAML can be trusted, you can use it to drive review rules and runtime behavior.

Generate CODEOWNERS entries from classification

If a route is restricted, you might want security + data owners automatically pulled in as reviewers.

#!/usr/bin/env python3
import pathlib
import yaml

data = yaml.safe_load(pathlib.Path("data-classification.yml").read_text())

owners = data.get("owners") or {}
team_slug = owners.get("team", "team-security")

restricted_paths: list[str] = []

for route in data.get("routes", []):
    level = (route.get("classification") or {}).get("level")
    path = route.get("path")
    # Skip routes without a path instead of crashing on None
    if level == "restricted" and path:
        # Map API path → folder pattern (adapt to your routing)
        restricted_paths.append(f"src{path.replace('/api', '')}/*")

restricted_paths = sorted(set(restricted_paths))

with open("CODEOWNERS.generated", "w") as f:
    for pattern in restricted_paths:
        f.write(f"{pattern} @{team_slug}-owners\n")

Then, in CI:

  • Generate CODEOWNERS.generated.
  • Validate that CODEOWNERS includes those lines (or import them automatically).
  • Combine with GitHub/GitLab’s branch protection rules to require approvals from @payments-owners (the handle the script generates for team: payments) whenever restricted code changes.
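
The validation step in the list above can be sketched as a plain set comparison. This assumes both files use the usual "pattern owner..." line format with # comments, and it only checks for exact entries: it does not model CODEOWNERS precedence, where the last matching pattern wins. The function name missing_codeowners_lines is hypothetical.

```python
def _entries(text: str) -> set[str]:
    """Normalize CODEOWNERS-style text into a set of 'pattern owner...' entries."""
    entries: set[str] = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if line:
            entries.add(" ".join(line.split()))  # collapse runs of whitespace
    return entries


def missing_codeowners_lines(generated: str, codeowners: str) -> list[str]:
    """Return generated entries that are absent from the committed CODEOWNERS."""
    committed = _entries(codeowners)
    return sorted(entry for entry in _entries(generated) if entry not in committed)
```

A CI step could read CODEOWNERS.generated and CODEOWNERS, call this function, and fail when the returned list is non-empty.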

Enforce logging and TLS at runtime

You can also have your services load classification at startup and assert that runtime config matches expectations:

Example (Node.js / Express middleware):

// config/dataClassification.js
const fs = require("fs");
const yaml = require("js-yaml");

function loadClassification() {
  const raw = fs.readFileSync("data-classification.yml", "utf8");
  return yaml.load(raw);
}

const classification = loadClassification();

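// The YAML uses OpenAPI-style "{id}" placeholders; Express uses ":id".
const toExpressPath = (p) => String(p).replace(/\{(\w+)\}/g, ":$1");

function requireSecureLogging(req, res, next) {
  // req.route is only populated once Express has matched a route, so attach
  // this middleware to individual routes, not through a global app.use().
  const path = (req.baseUrl || "") + (req.route?.path ?? "");
  const method = req.method;
  const sensitiveRoute = classification.routes?.find(
    (r) => toExpressPath(r.path) === path && r.method === method
  );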

  if (
    sensitiveRoute &&
    ["confidential", "restricted"].includes(
      sensitiveRoute.classification?.level
    )
  ) {
    // Example: attach a flag to logs so your logger avoids PII fields
    req.context = req.context || {};
    req.context.noPiiLogging = true;
  }

  return next();
}

module.exports = { requireSecureLogging };

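Attach this middleware to each route definition (Express only populates req.route after a route has matched, so a global app.use() placed ahead of routing never sees it), then update your logger to drop PII fields whenever req.context.noPiiLogging is set.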
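
The logger side is not shown here. As a language-neutral illustration, written in Python to match the CI scripts, the idea is a small scrub step that runs before a log event is emitted. The field names in PII_FIELDS and the function name scrub_event are assumptions; a real service would derive the field list from its own schema, and this sketch only looks at top-level keys.

```python
from typing import Any

# Hypothetical field names; replace with the PII fields your service actually logs.
PII_FIELDS = frozenset({"email", "ssn", "pan"})


def scrub_event(event: dict[str, Any], no_pii_logging: bool) -> dict[str, Any]:
    """Return a copy of a log event with PII fields redacted when the flag is set."""
    if not no_pii_logging:
        return dict(event)
    return {
        key: "[REDACTED]" if key.lower() in PII_FIELDS else value
        for key, value in event.items()
    }
```

The no_pii_logging argument plays the role of req.context.noPiiLogging in the middleware above.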

This ties directly into:

  • PCI DSS logging requirements (avoiding PAN in logs).
  • GDPR / HIPAA obligations to minimize unnecessary exposure of personal data.

Because the rule is derived from data-classification.yml, you now have a line from classification → code → runtime that auditors can follow.


5) Close the Loop with Risk Assessments, Remediation & Evidence

At some point, a compliance team, or a partner like Pentest Testing Corp, will need to assess risk, plan remediation, and collect evidence for the services that handle sensitive data.

Data Classification as Code makes that much easier because you can export a risk-ready view straight from your YAML.

Script: export risk register rows

#!/usr/bin/env python3
import csv
import pathlib
import yaml

data = yaml.safe_load(pathlib.Path("data-classification.yml").read_text())

rows = []

for asset in data.get("assets", []):
    cls = asset.get("classification") or {}
    for reg in cls.get("regulations", []):
        rows.append(
            {
                "service": data.get("service"),
                "asset_id": asset["id"],
                "asset_type": asset["type"],
                "classification_level": cls.get("level"),
                "regulation": reg,
                # "or {}" also covers an owners key that is present but empty
                "owner_team": (data.get("owners") or {}).get("team"),
            }
        )

with open("risk-register-export.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=[
            "service",
            "asset_id",
            "asset_type",
            "classification_level",
            "regulation",
            "owner_team",
        ],
    )
    writer.writeheader()
    writer.writerows(rows)

This CSV can go straight into:

  • Your internal GRC / risk register.
  • Pentest Testing Corp’s Risk Assessment Services intake form to accelerate scoping and gap analysis.

Pentest Testing Corp’s Remediation Services can also use your CI logs, scanner outputs, and YAML files as evidence when preparing you for certification or external audits.

On the Cyber Rely side, this approach complements the existing guides on CI/CD compliance and API security.

Data Classification as Code simply extends that pattern down to the data layer.


How This Helps with SOC 2, PCI DSS, GDPR, HIPAA & ISO 27001

Here’s how your Data Classification as Code setup maps to key frameworks:

  • PCI DSS
    • classification.tags: [cardholder] and classification.level: restricted show where the CDE is.
    • CI gates + TLS/logging checks provide evidence for secure SDLC and logging requirements.
  • GDPR
    • tags: [pii] and regulations: [gdpr] support records of processing, data minimization, and DPIA inputs.
    • External scans focused on PII endpoints help prove ongoing risk management.
  • HIPAA
    • tags: [phi] combined with strong logging/TLS rules support administrative and technical safeguards around PHI.
  • SOC 2 / ISO 27001
    • Version-controlled YAML + CI logs = evidence of logical access, change management, monitoring, and risk treatment controls in operation. (pentesttesting.com)
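
One way to keep this mapping honest is to check that each asset's regulations list agrees with its tags. The sketch below mirrors the tag-to-framework pairing in the list above (cardholder to pci_dss, pii to gdpr, phi to hipaa). Whether a given tag really implies a given regulation depends on your jurisdiction and scope, so treat TAG_TO_REGULATION as an assumption to review with your compliance team.

```python
from typing import Any

# Assumed pairing, taken from the framework list above; adjust to your scope.
TAG_TO_REGULATION = {"cardholder": "pci_dss", "pii": "gdpr", "phi": "hipaa"}


def missing_regulations(doc: dict[str, Any]) -> list[str]:
    """List assets whose tags imply a regulation that is not declared."""
    problems: list[str] = []
    for asset in doc.get("assets") or []:
        cls = asset.get("classification") or {}
        declared = set(cls.get("regulations") or [])
        for tag in cls.get("tags") or []:
            expected = TAG_TO_REGULATION.get(tag)
            if expected and expected not in declared:
                problems.append(
                    f"{asset.get('id')}: tag '{tag}' expects regulation '{expected}'"
                )
    return problems
```

Both assets in the example file from section 1 pass this check, since their regulations already cover their tags.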

The key point: engineers never need to memorize every clause. They just maintain:

  • A small YAML file per service.
  • A handful of CI gates.
  • A simple export for risk/compliance partners.

Cyber Rely and Pentest Testing Corp can then take those artifacts and package them into assessor-friendly evidence for your SOC 2, PCI DSS, ISO 27001, GDPR, and HIPAA journeys.


90-Day Rollout Plan for Data Classification as Code

If you want a concrete path, here’s a pragmatic 90-day approach:

Days 1–30 – Pilot on one critical service

  • Pick a service handling cardholder data, PHI, or high-volume PII.
  • Add data-classification.yml with assets, routes, and internet exposure.
  • Add the classification CI job and fix initial failures.
  • Run the free Website Vulnerability Scanner on its public URLs and attach sample reports to tickets.

Days 31–60 – Roll out to Tier-1 services

  • Copy the pattern to remaining Tier-1 services (payments, accounts, auth, key data hubs).
  • Generate CODEOWNERS entries from classification to tighten approvals.
  • Start exporting risk-register-export.csv and feed it into your risk/compliance workflow or Pentest Testing’s Risk Assessment Services.

Days 61–90 – Integrate with formal compliance

  • Map classification tags to specific risks and controls (with your compliance team or Pentest Testing’s Remediation Services).
  • Align your CI evidence with Cyber Rely’s broader CI/CD compliance patterns (API gates, feature-flag evidence, PQC gates).
  • Lock the process into your SDLC policy and make “update data-classification.yml” a standard acceptance criterion for new features.

By the end of 90 days, you should be able to answer an auditor’s “Where is cardholder data and how is it protected?” with Git-backed configs and CI logs, not guesswork.
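
The "update data-classification.yml" acceptance criterion can be nudged along by CI. A rough sketch, assuming route definitions live under a hypothetical src/routes/ folder and that the list of changed files comes from something like git diff --name-only against the target branch:

```python
from fnmatch import fnmatch
from typing import Iterable, Sequence


def classification_update_missing(
    changed_files: Iterable[str],
    code_globs: Sequence[str] = ("src/routes/*",),  # hypothetical location of route code
    config: str = "data-classification.yml",
) -> bool:
    """True when route code changed but the classification file did not."""
    changed = list(changed_files)
    # fnmatch's "*" also matches "/", so "src/routes/*" covers nested files.
    touches_routes = any(fnmatch(path, glob) for path in changed for glob in code_globs)
    return touches_routes and config not in changed
```

Since many route changes will not alter what data is handled, this probably works better as a PR warning or checklist reminder than as a hard failure.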

