Personal Identity disambiguation and persistence

NIST’s Informative Reference Catalog, NIST’s Open Security Controls Assessment Language (OSCAL), TagVault.org’s Software Identification Tags (SWID Tags), the Unified Compliance Framework, and SIGLEX (the Special Interest Group on the Lexicon of the Association for Computational Linguistics) all identify various kinds of contributors to their content. These contributors range from primary authors, through editors, to “commentators” on content contributed to their various repositories of information. We will call this collective group contributors.

Contributors to these systems must be tracked for accountability of content, for accolades as contributors, and for any number of other reasons. A single system can assign each of its users a permanent and unique ID that remains constant as factoids about the user change (their name, their email, their address). Shared systems cannot reasonably assign a unique user ID between them, as doing so would create the potential for conflict with myriad international privacy regulations.

The problem, therefore, is how to

A. disambiguate a user so that we know Joe Smith is that Joe Smith and not some other Joe Smith, and

B. allow those disambiguating characteristics to be persisted between systems so that as factoids change (name, address, email, etc.) the disambiguated Joe Smith remains that Joe Smith instead of some other Joe Smith.

Persons versus agents

As Dan Brickley and Libby Miller, co-authors of the FOAF (Friend of a Friend) Project, point out, people as well as bots can have e-mail addresses, chat addresses, etc. Distinguishing Joe Smith from a Joe Smith bot can’t be done by just checking an email address. Therefore, personal identity disambiguation must begin with disambiguating people from things. Until one passes the Turing test, a computer, which manages bots and various software agents, will remain an antonym of person. Therefore, throughout the rest of this discussion, we will use the following terms and definitions (each term links to its ComplianceDictionary.com entry):

Term Definition
agent A program acting on behalf of a person or organization.
bot A robot; A piece of software designed to complete a minor but repetitive task automatically and on command.
natural person A living human being. Legal systems can attach rights and duties to natural persons without their express consent.

Because a bot is a type of agent, we will refer to all computerized processes that communicate as agents.

Disambiguating a person from an agent

The world is inundated with fake accounts on social media platforms run by bots of various types and for various reasons. Merely having an e-mail address or social media address isn’t enough to disambiguate a person from an agent. A shibboleth of sorts is necessary to distinguish the two.

In 2015 DARPA initiated a Twitter Bot challenge to determine whether an account belonged to a person or an agent. At that time, five criteria emerged (user profile, syntax, semantics, behavior, network features) for disambiguating person from agent. Since that time, Google’s Duplex and IBM’s Watson and Project Debater have made huge leaps in their capabilities to use syntax, semantics, and even linguistic behaviors. This leaves two characteristics still in play as a potential shibboleth:

  1. User Profile – links to other accounts, biography, etc.

  2. Network features – how distinct the person’s “network” is, i.e., where they work and others they connect to.

We will save the discussion of the various objects that should be tracked in a person’s profile for later. For now, we’ll just call this collection of objects a person’s information.

Since the DARPA challenge, a great deal of research and practical application has gone into disambiguating natural persons from agents, as well as into personal name disambiguation (once you know Joe Smith isn’t an agent, knowing that Joe Smith is that Joe Smith and not another one). So much so that there are now several methodologies for extracting information about both people and organizations from given information such as URLs and email addresses. We will cover each below, as the information that can be collected from each source is different.

Extracting personal information from web content

Many websites today use tags in the format of microdata, RDFa, and JSON-LD to add structured content for searchability, advertising, etc. Because of this, several APIs exist to extract structured (and to some extent unstructured) text from HTML documents.
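As a rough illustration of what such extraction involves, the sketch below pulls embedded JSON-LD blocks out of an HTML page using only Python’s standard library. It is not the method used by any of the APIs discussed here; the class name, function name, and sample page are hypothetical, and microdata and RDFa would need separate handling.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip a malformed block rather than fail the whole page


def extract_jsonld(html):
    """Return every JSON-LD value embedded in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page used only for illustration.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Person",
 "name": "Jane Doe", "email": "jdoe@snortblat.com"}
</script>
</head><body><p>Contributed by Jane Doe</p></body></html>
"""

people = [
    item for item in extract_jsonld(SAMPLE_HTML)
    if isinstance(item, dict) and item.get("@type") == "Person"
]
```

Pages without such structured blocks yield an empty list, which is where the harder problem of extracting from unstructured text begins.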

Alchemy API

Given a URL of an HTML document containing organizational or personal information, Alchemy’s GetNamedEntities APIs will return information about both persons and organizations. The information returned is as follows:

Object Definition
Name The name as it was found in either the HTML document or the structured data. This could be in the form of a full name, or last name followed by first name.
Contact Info If the contributor added contact information such as an email address, a Twitter account, or a LinkedIn account, that information will be presented here.
Citation If there is a DOI, ISBN, Zotero, or other Citation reference embedded in the structured data, that information will be presented.
Citation Text This will be the content in the article attributed to the contributor.

As you can see, unless information has been embedded into an HTML document through structured language with a well-formatted vocabulary, disambiguation on a mass scale is still relatively complex and, as our research has noted, is at best 60% accurate (as of this writing).

Extracting personal information from organizational email addresses

Beyond scraping websites for personal information, various organizations have developed full-blown APIs for providing in-depth information about people, usually for marketing purposes, using their email addresses and domain names.

First and foremost, personal email addresses, such as those from Hotmail, Apple, Google, Yahoo, etc., cannot be used for extracting personal information.

Any personal information must be extracted from organizational email addresses because the applications that provide it, such as ClearBit, FullContact, BigPicture, Powrbot, and Crunchbase, all focus on business-to-business marketing and, as such, on organizational e-mail. We will call these disambiguating APIs.
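A developer would therefore want to screen out personal addresses before calling a disambiguating API. The sketch below shows one simple way to do that by checking the domain against a list of consumer mail providers. The domain list is an illustrative, deliberately incomplete assumption, and the function names are hypothetical.

```python
# Illustrative, incomplete set of consumer mail domains (an assumption, not a
# list published by any of the providers or APIs named in this paper).
PERSONAL_EMAIL_DOMAINS = frozenset({
    "hotmail.com", "outlook.com", "icloud.com", "me.com", "gmail.com", "yahoo.com",
})


def email_domain(address):
    """Return the lower-cased domain part of an email address."""
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    return domain.lower()


def is_organizational_email(address):
    """True when the address does not belong to a known consumer mail domain."""
    return email_domain(address) not in PERSONAL_EMAIL_DOMAINS


candidates = ["jdoe@snortblat.com", "Jane.Doe@Gmail.com"]
usable = [a for a in candidates if is_organizational_email(a)]
```

In this sketch only jdoe@snortblat.com would be passed on to a disambiguating API.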

Suggested methodology

We postulate that amalgamating information from a couple of the disambiguating APIs yields enough information to assemble a personal profile that is both disambiguating and persistent, without the need for a personal UUID for coherency.

A natural person can be disambiguated with persisting information when described in JSON-LD format using an amalgam of information found using disambiguating APIs. The beginning schema for describing a disambiguated person will be presented in JSON-LD format herein, and will be maintained online at http://grcschema.org/Person.

Why JSON-LD?

JSON-LD (JavaScript Object Notation for Linked Data) is a method of encoding Linked Data using JSON. JSON-LD is designed around the concept of a context to provide additional mappings from JSON to an RDF model. The context links object properties in a JSON document to concepts in an ontology. In order to map the JSON-LD syntax to RDF, JSON-LD allows values to be coerced to a specified Type.

By combining various Types of information about a person (name, postal address, various e-mail addresses, personal URLs on social websites, etc.), each person’s record can be disambiguated from another, and even if individual factoids change, there will be enough unchanging information to persist the record.
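To make the role of the context and of type coercion concrete, here is a minimal sketch of a person record serialized as JSON-LD from Python. It assumes that terms resolve against http://grcschema.org/ (this paper names only http://grcschema.org/Person), and the two coercions and the date value are illustrative placeholders rather than part of the published schema.

```python
import json

# Assumption: property names resolve against the http://grcschema.org/ vocabulary.
# The coercions for "linkedin" and "date_created" are illustrative only.
person = {
    "@context": {
        "@vocab": "http://grcschema.org/",
        # Coerce the LinkedIn value to an IRI instead of a plain string.
        "linkedin": {"@id": "http://grcschema.org/linkedin", "@type": "@id"},
        # Coerce the creation date to a typed xsd:dateTime literal.
        "date_created": {
            "@id": "http://grcschema.org/date_created",
            "@type": "http://www.w3.org/2001/XMLSchema#dateTime",
        },
    },
    "@type": "Person",
    "PersonsName": {"first_name": "Jane", "last_name": "Doe"},
    "EmailAddresses": {"work_email": "jdoe@snortblat.com"},
    "InternetAddresses": {"linkedin": "https://www.linkedin.com/in/jdoe23"},
    # Placeholder metadata values for illustration.
    "CoreMetaData": {"id": 1, "date_created": "2019-01-15T09:30:00Z", "live_status": True},
}

document = json.dumps(person, indent=2)
```

Because the context travels with the record, a consuming system can interpret linkedin as a link and date_created as a date without any out-of-band agreement.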

Audit information

In order to determine the persistence of a person’s information, an audit methodology must first be decided upon. An AuditData JSON Type can be established for this purpose. Below is the initial proposed AuditData Type.

AuditData

Metadata about a JSON Thing’s audit information.

Property Expected Type Description
id Integer A unique and persistent identifier for the audit record.
date_modified Datetime The date the modification was recorded.
modified_by Integer The ID of the person or agent that last modified the record.
live_status Boolean A Boolean field of “live” (1) or “deprecated” (0).
modified_property String The JSON property that has been modified.
previous_value String The value of the modified_property prior to modification. For a new record, this value will be “null”.
current_value String The value of the JSON property after modification. For a record that was deleted, this value will be “null.”
audited_record_id Integer The record ID of the record that created the audit trail.

By tracking individual changes, we’ll be able to see when Joe Smith becomes Dr. Joe Smith, or when Joe changes his current work e-mail address.
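One way to produce such a trail is to compare a record before and after an edit and emit one audit entry per changed property. The sketch below does this with the AuditData properties listed above. The function name, the in-memory ID counter, and the sample values are assumptions for illustration; a real system would draw IDs from its data store.

```python
from datetime import datetime, timezone
from itertools import count

_audit_ids = count(1)  # stand-in for a database sequence (assumption)


def audit_changes(record_id, before, after, modified_by):
    """Return one AuditData-shaped dict for every property whose value changed."""
    entries = []
    for prop in sorted(set(before) | set(after)):
        old, new = before.get(prop), after.get(prop)
        if old == new:
            continue  # unchanged properties produce no audit entry
        entries.append({
            "id": next(_audit_ids),
            "date_modified": datetime.now(timezone.utc).isoformat(),
            "modified_by": modified_by,
            "live_status": True,
            "modified_property": prop,
            "previous_value": old,   # None (JSON null) for a new property
            "current_value": new,    # None (JSON null) for a deleted property
            "audited_record_id": record_id,
        })
    return entries


# Hypothetical edit: Joe gains a prefix and changes his work e-mail address.
before = {"first_name": "Joe", "last_name": "Smith", "work_email": "joe@old.example"}
after = {"first_name": "Joe", "last_name": "Smith", "name_prefix": "Dr.",
         "work_email": "joe@new.example"}
trail = audit_changes(record_id=42, before=before, after=after, modified_by=7)
```

Here the trail holds two entries, one for name_prefix and one for work_email, and leaves the unchanged name properties out.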

AuditData will be tracked live with all revisions at http://grcschema.org/AuditData.

Core personal information

Person

Core personal information is broken down into several Things, listed below.

Property Expected Type Description
PostalAddress Thing The address to which physical mail and packages are delivered.
InternetAddresses Thing The various Internet locations that help disambiguate a person, such as their Facebook, LinkedIn, and Twitter addresses.
PersonsName Thing All of the name properties associated with a real person’s name.
EmailAddresses Thing All of the various email addresses that could be associated with a person.
PastEmailAddresses Thing Previous email addresses.
CoreMetaData Thing Metadata documenting the ID and core information about a JSON Thing.
HierarchicalMetaData Thing Metadata about a JSON Type’s hierarchical information. If the record is in a non-hierarchical array, this will be null.

PostalAddress

The address to which physical mail and packages are delivered.

Property Expected Type Description
address1 String The first part of a postal address such as the building number and street name.
address2 String The second part of a postal address, usually denoting a suite.
city String The city the address is located in.
state String The state/province for the address.
postal_code String The postal/zip code for the address.
country String The country the address is located in.

InternetAddresses

The various Internet locations that help disambiguate a person, such as their Facebook, LinkedIn, and Twitter addresses.

Property Expected Type Description
facebook URL The Facebook URL for the person.
linkedin URL The LinkedIn URL for the person.

PersonsName

All the name properties associated with a real person’s name.

Property Expected Type Description
first_name Text The person’s first name.
last_name Text The person’s last name.
middle_name Text A person’s middle name.
name_prefix String The prefix before a person’s name, such as Dr., Mr., Mrs.
name_suffix String The suffix after a person’s name, such as Jr., III, etc.

EmailAddresses

All the various email addresses that could be associated with a person.

Property Expected Type Description
work_email Email An email address belonging to a person associated with their work account.
personal_email Email An email address belonging to a person not associated with their work account.

PastEmailAddresses

An array of previous email addresses.

Property Expected Type Description
past_work_email_addresses Array A collection of work email addresses formerly used by a person but no longer active. This is often used in personal disambiguation.
past_personal_email_addresses Array A collection of personal email addresses formerly used by a person but no longer active. This is often used in personal disambiguation.

CoreMetaData

Metadata documenting the ID and core information about a JSON Thing.

Property Expected Type Description
id Integer A unique and persistent identifier for the record.
date_created Datetime The date the record was created.
date_modified Datetime The date the record was last modified.
created_by Integer The ID of the person or agent that created the record.
live_status Boolean A Boolean field of “live” (1) or “deprecated” (0).

HierarchicalMetaData

Metadata about a JSON Type’s hierarchical information. If the record is in a non-hierarchical array, this will be null.

Property Expected Type Description
parent_id Integer ID of the associated parent for this record.
sort_value Integer An integer given to a record relative to its siblings used to sort at each sibling level.
ChildIDs Thing A collection of children identifiers.
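The relationship between parent_id, sort_value, and ChildIDs can be sketched as follows: group records by their parent and order each group of siblings by sort_value. This is only one plausible reading of the table above, and the function name and sample records are hypothetical.

```python
from collections import defaultdict


def children_by_parent(records):
    """Map each parent_id to its children's IDs, ordered by sort_value."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["parent_id"]].append(rec)
    return {
        parent: [r["id"] for r in sorted(kids, key=lambda r: r["sort_value"])]
        for parent, kids in grouped.items()
    }


# Hypothetical records; parent_id None marks a top-level record.
records = [
    {"id": 1, "parent_id": None, "sort_value": 1},
    {"id": 2, "parent_id": 1, "sort_value": 2},
    {"id": 3, "parent_id": 1, "sort_value": 1},
]
child_ids = children_by_parent(records)
```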

CRUD methodology

There is a simple proposed methodology to the API structure for getting a person’s data. The flowchart is in code2flow format below. Clicking the image will open the flow in Code2Flow.

Disambiguated User persistence

Given the JSON Object set for Person, there is enough information for each person to ensure that if any individual or set of personal factoids changes (name, postal address, email address, etc.), the non-changing information, together with a full audit trail of changes, will be sufficient to sustain record persistence.

For instance, within the Unified Compliance team, there was a contributor we will call Jane Doe. Jane began her contributions while at Snortblat.com, and was initially registered as:

Name Email Organization LinkedInProfile
Jane Doe jdoe@snortblat.com Snortblat https://www.linkedin.com/in/jdoe23

Then, Jane joined the UCF team and continued her contributions. Her profile changed accordingly. However, her LinkedInProfile ID didn’t change, and from that information the persistence of Jane’s record was maintained.

Name Email Organization LinkedInProfile
Jane Doe jdoe@theucf.info Unified Compliance https://www.linkedin.com/in/jdoe23
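The comparison in Jane’s case can be sketched as a check for identifiers that two versions of a profile still share. The key list, normalization rules, and function names below are assumptions for illustration, not a prescribed matching algorithm.

```python
# Assumed set of identifiers worth comparing; adjust to the properties a system tracks.
STABLE_KEYS = ("linkedin", "facebook", "work_email", "personal_email")


def normalize(value):
    """Lower-case and trim a string identifier; non-strings normalize to None."""
    return value.strip().rstrip("/").lower() if isinstance(value, str) else None


def shared_identifiers(known, candidate, keys=STABLE_KEYS):
    """Return the keys whose normalized values match in both profiles."""
    matches = []
    for key in keys:
        a, b = normalize(known.get(key)), normalize(candidate.get(key))
        if a and a == b:
            matches.append(key)
    return matches


jane_before = {"name": "Jane Doe", "work_email": "jdoe@snortblat.com",
               "organization": "Snortblat",
               "linkedin": "https://www.linkedin.com/in/jdoe23"}
jane_after = {"name": "Jane Doe", "work_email": "jdoe@theucf.info",
              "organization": "Unified Compliance",
              "linkedin": "https://www.linkedin.com/in/jdoe23"}
evidence = shared_identifiers(jane_before, jane_after)
```

Here the only shared identifier is the LinkedIn URL, which is the evidence that ties the two versions of the record together.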

While this is a very simplistic instance, other instances can be extreme. One instance of persistence had to rely on a semantic comparison of LinkedIn’s work history. Why? The person got married and changed their name, which then changed their email address as well as their LinkedIn profile address. Because of the number of changes, the company processing the user’s persona had created a hash based on the user’s work history (which remained the same) to track them.
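A work-history hash of that kind might look like the sketch below. The paper does not describe how that company built its hash, so the field names, the normalization, and the choice of SHA-256 are all assumptions. An exact hash is also a simplification of the semantic comparison mentioned above.

```python
import hashlib
import json


def work_history_fingerprint(history):
    """Hash a work history so that ordering, case, and stray spaces do not matter."""
    normalized = sorted(
        (e["organization"].strip().lower(), e["title"].strip().lower(), e["start_year"])
        for e in history
    )
    payload = json.dumps(normalized, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical histories for the same person, captured before and after a name change.
history_before = [
    {"organization": "Snortblat", "title": "Analyst", "start_year": 2012},
    {"organization": "Unified Compliance", "title": "Editor", "start_year": 2016},
]
history_after = [
    {"organization": "unified compliance ", "title": "Editor", "start_year": 2016},
    {"organization": "Snortblat", "title": "analyst", "start_year": 2012},
]
same_person = (
    work_history_fingerprint(history_before) == work_history_fingerprint(history_after)
)
```

The fingerprint changes as soon as a new position is added, so a system relying on it would need to store the hash of each known history or compare entries individually.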

Within our scope, because contributors are the ones who create and edit their own records, extreme persistence routines aren’t necessary. Each developer should employ a mechanism that allows contributors to edit their records and that tracks those edits within the developer’s system. When a user has changed their record, the developer should trigger a mechanism to write that change back to the API so that the contributor’s record is updated.

Endnotes

All citations from Authoritative Works can be found in the Personal Name Disambiguation subdirectory under GRCschema.org’s research tab.