Labeling Method

This section describes the method we developed and used to label data for training and testing the machine-learning model that serves as the engine for Privacy Screen.

Context

Inspired by the highlighted portion of CA Civ Code § 1798.105(b):

Notes, caveats, nuances, etc.

Collecting policy URLs

To gather a set of URLs for unique privacy policies:

  1. Ensure your browser has malware protection -- website certificates expire, and sites are sometimes compromised by other means.
  2. Randomly sample a batch of URLs from the MAPS dataset.
  3. Remove any duplicate URLs.
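
The sampling and de-duplication steps above can be sketched as follows. This is only an illustration: it assumes the URLs have already been loaded from the MAPS dataset into a Python list of strings (the loading step is not shown), and the function name, `batch_size`, and `seed` are hypothetical. Duplicates are detected by exact string match after trimming whitespace, so two different URLs that lead to the same policy would still need to be caught by hand.

```python
import random


def sample_unique_urls(urls, batch_size, seed=0):
    """Randomly sample a batch of URLs, then drop exact duplicates.

    `urls` is assumed to be a list of URL strings already read from the
    dataset; `seed` makes the sample reproducible.
    """
    rng = random.Random(seed)
    # Sample without replacement by position (the list itself may repeat URLs).
    batch = rng.sample(urls, min(batch_size, len(urls)))
    # dict.fromkeys keeps the first occurrence of each URL, in sampled order.
    return list(dict.fromkeys(u.strip() for u in batch))


if __name__ == "__main__":
    example = [
        "https://a.example/privacy",
        "https://b.example/privacy",
        "https://a.example/privacy",
    ]
    print(sample_unique_urls(example, batch_size=3))
```

Fixing the seed is a convenience so that the same batch can be regenerated later; any other reproducible sampling scheme would serve equally well.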

Collecting policies

For each URL:

  1. Assign the URL its own unique identification number.
  2. If the URL doesn't work at all (e.g., 404 error), try finding the Privacy Policy from the sub-domain's home/index page or via Google.
  3. Verify that the URL points to a privacy policy written in English. Sometimes privacy policies are linked from within other documents like Terms & Conditions.
  4. See if there's an updated policy elsewhere on the website. Updated policies are more likely to allow requests for deletion (improving balance).
  5. See if there's a more complete policy elsewhere on the website. More complete policies are more likely to allow requests for deletion (improving balance).
  6. Record the final URL.
  7. Ensure that any "See More" fields collapsed by default are expanded.
  8. Use the browser's "inspector" tool to isolate the lowest-level "div" element (or "body" if there is no div) in the HTML that encapsulates all relevant policy text (and ideally nothing else).
    1. Record its CSS selector (i.e., path/location in hierarchy).
    2. Copy-paste the corresponding policy text into a TXT file dedicated to that URL.
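
The bookkeeping in the steps above (unique ID, final URL, CSS selector, and one TXT file per URL) could be recorded along these lines. This is a minimal sketch under stated assumptions: the `index.csv` file, its column names, the zero-padded file naming, and the function name `record_policy` are all hypothetical, and the policy text is assumed to have been copied by hand as described.

```python
import csv
from pathlib import Path


def record_policy(out_dir, policy_id, final_url, css_selector, policy_text):
    """Save one policy's text to its own TXT file and log its metadata.

    The file layout (index.csv plus <id>.txt) is an assumption made for
    this example, not a prescribed format.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # One TXT file per URL, named after the URL's unique ID.
    txt_path = out_dir / f"{policy_id:04d}.txt"
    txt_path.write_text(policy_text, encoding="utf-8")

    # Append the ID, final URL, and CSS selector to a shared index file.
    index_path = out_dir / "index.csv"
    is_new = not index_path.exists()
    with index_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["id", "final_url", "css_selector"])
        writer.writerow([policy_id, final_url, css_selector])
    return txt_path
```

Keeping the selector next to the final URL makes it possible to re-extract the same element later if the copied text needs to be checked.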

Classifying policies

For each privacy policy:

  1. Search sequentially for the following partial synonyms to generate a set of good hits, and then retain the best one:
    1. "delet" (as in delete or deletion)
    2. "remov" (as in remove or removal)
    3. "eras" (as in erase or erasure)
    4. "forg" (as in forget or forgotten)
  2. Note that other synonyms (e.g., "destr" for destroy or destruction) are more commonly associated with other things like malicious deletion of data.
  3. Note that "change" and its synonyms (e.g., amend, modify, alter) are NOT considered sufficient.
  4. Verify that deletion isn't limited to:
    1. "child" / "kid" / "minors" (also look for under, years, age, etc.)
    2. "cookie"
  5. Verify that the user isn't required to delete their own info. For example, the user may have limited self-service access; it is NOT sufficient if they can only delete their contact info. Look for relevant keywords: request, send, ask, inquiry, query, have, obtain, contact, email, write, reach.
  6. If all of the above are satisfied:
    1. Mark the URL as 1 (otherwise 0).
    2. Copy-paste only the text relevant to classification into a TXT file dedicated to that URL, capturing all such text between periods (may include section heading, bullets, semicolons, etc.).
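
The keyword search above can be partly automated to surface candidate passages for a human to review. The sketch below is not the labeling procedure itself, which relies on manual judgment; it only flags sentences that contain one of the four deletion stems and notes whether limiting terms or request-related keywords appear in the same sentence. The function names, the naive period-based sentence split, and the use of "minor" and "cookie" as search terms are assumptions made for this example.

```python
import re

DELETION_STEMS = ["delet", "remov", "eras", "forg"]
LIMITING_TERMS = ["child", "kid", "minor", "cookie"]
REQUEST_TERMS = ["request", "send", "ask", "inquiry", "query", "have",
                 "obtain", "contact", "email", "write", "reach"]


def split_sentences(text):
    # Naive split: a period followed by whitespace ends a sentence.
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]


def find_candidates(policy_text):
    """Return sentences that mention a deletion stem, with review flags."""
    hits = []
    for sentence in split_sentences(policy_text):
        lower = sentence.lower()
        stems = [s for s in DELETION_STEMS if s in lower]
        if not stems:
            continue
        hits.append({
            "sentence": sentence,
            "stems": stems,
            # Terms suggesting deletion is limited to children or cookies.
            "limited": [t for t in LIMITING_TERMS if t in lower],
            # Keywords suggesting the user can ask the company to delete.
            "request_terms": [t for t in REQUEST_TERMS
                              if re.search(rf"\b{t}", lower)],
        })
    return hits


if __name__ == "__main__":
    sample = ("We collect your email. You may request that we delete your "
              "personal data. You can remove cookies in your browser.")
    for hit in find_candidates(sample):
        print(hit)
```

A hit with an empty "limited" list and a non-empty "request_terms" list is only a promising candidate; the final 1 or 0 label still depends on reading the passage in context.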