Personal Identity disambiguation and persistence
NIST’s Informative Reference Catalog, NIST’s Open Security Controls Assessment Language (OSCAL), TagVault.org’s Software Identification Tags (SWID Tags), the Unified Compliance Framework, and SIGLEX (a Special Interest Group on the Lexicon of the Association for Computational Linguistics) all identify various forms of contributors to their content. These contributors are listed as anything from primary authors, through editors, to “commentators” on content contributed to their various repositories of information. We will call this collective group contributors.
Contributors to these systems must be tracked for accountability of content, for accolades as contributors, and for any number of other reasons. Within a single system, each user can be assigned a permanent and unique ID that remains constant as factoids about the user change (their name, their email, their address). Shared systems cannot reasonably assign a unique user ID between them, as doing so could run afoul of myriad international privacy regulations.
The problem, therefore, is how to
A. disambiguate a user so that we know Joe Smith is that Joe Smith and not some other Joe Smith, and
B. allow those disambiguating characteristics to be persisted between systems so that as factoids change (name, address, email, etc.) the disambiguated Joe Smith remains that Joe Smith instead of some other Joe Smith.
Persons versus agents
As Dan Brickley and Libby Miller, co-authors of the FOAF (Friend of a Friend) Project, point out, people as well as bots can have e-mail addresses, chat addresses, etc. Distinguishing Joe Smith from a Joe Smith bot can’t be done by just checking an email address. Therefore, personal identity disambiguation must begin with disambiguating people from things. Until one passes the Turing test, a computer, which manages bots and various software agents, will remain an antonym of person. Therefore, throughout the rest of this discussion, we will use the following terms and definitions (each term links to its ComplianceDictionary.com entry):
Term | Definition |
---|---|
agent | A program acting on behalf of a person or organization. |
bot | A robot; A piece of software designed to complete a minor but repetitive task automatically and on command. |
natural person | A living human being. Legal systems can attach rights and duties to natural persons without their express consent. |
Because a bot is a type of agent, we will refer to all computerized processes that communicate as agents.
Disambiguating a person from an agent
The world is inundated with fake accounts on social media platforms run by bots of various types and for various reasons. Merely having an e-mail address or social media address isn’t enough to disambiguate a person from an agent. A shibboleth of sorts is necessary to distinguish the two.
In 2015 DARPA initiated a Twitter Bot challenge to distinguish persons from agents. At that time, they came up with five criteria (user profile, syntax, semantics, behavior, network features) to disambiguate between person and agent. Since that time, Google’s Duplex and IBM’s Watson and Project Debater have made huge leaps in their ability to use syntax, semantics, and even linguistic behaviors. This leaves two characteristics still in play as a potential shibboleth:
- User Profile – links to other accounts, biography, etc.
- Network features – how distinct the person’s “network” is, i.e., where they work and others they connect to.
We will save the discussion of the various objects that should be tracked in a person’s profile for later. For now, we’ll just call this collection of objects a person’s information.
Since the DARPA challenge, a great deal of research and practical application has been put into place in order to disambiguate natural persons from agents as well as personal name disambiguation (once you know Joe Smith isn’t an agent, knowing Joe Smith is that Joe Smith versus another one). So much so, that there are several methodologies for extracting information about both people and organizations from given information such as URLs and email addresses. We will cover each below as the information that can be collected from each source is different.
Extracting personal information from web content
Many websites today use tags in the format of microdata, RDFa, and JSON-LD to add structured content for searchability, advertising, etc. Because of this, several APIs exist to extract structured (and to some extent unstructured) text from HTML documentation.
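As a rough illustration of what such extraction involves, the sketch below pulls JSON-LD blocks out of an HTML page using only the Python standard library. It is not the method of any API named in this document; the class name `JsonLdExtractor`, the helper `extract_jsonld`, and the sample page are hypothetical, and microdata and RDFa are not handled.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than fail the whole page


def extract_jsonld(html):
    """Return every JSON-LD value embedded in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page used only for illustration.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Person",
 "name": "Jane Doe", "sameAs": ["https://www.linkedin.com/in/jdoe23"]}
</script>
</head><body><p>Article text</p></body></html>
"""

people = [
    item for item in extract_jsonld(SAMPLE_HTML)
    if isinstance(item, dict) and item.get("@type") == "Person"
]
```

Pages without structured data yield an empty list here, which is the situation where extraction has to fall back on unstructured text.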
Alchemy API
Given a URL of an HTML document containing organizational or personal information, Alchemy’s GetNamedEntities APIs will return information about both persons and organizations. The information returned is as follows:
Object | Definition |
---|---|
Name | The name as it was found in either the HTML documentation or the structured data. This could be in the form of a full name, or last name followed by first name. |
Contact Info | If the contributor added contact information such as an email address, a twitter account, or a LinkedIn account, that information will be presented here. |
Citation | If there is a DOI, ISBN, Zotero, or other Citation reference embedded in the structured data, that information will be presented. |
Citation Text | This will be the content in the article attributed to the contributor. |
As you can see, unless information has been embedded into an HTML document through structured language with a well-formatted vocabulary, disambiguation on a mass scale is still relatively complex and, as our research has noted, is at best 60% accurate (as of this writing).
Extracting personal information from organizational email addresses
Beyond scraping websites for personal information, various organizations have developed full-blown APIs for providing in-depth information about people, usually for marketing purposes, using their email addresses and domain names.
First and foremost, personal email addresses, such as those from Hotmail, Apple, Google, Yahoo, etc. cannot be used for extracting personal information. They just can’t.
Any personal information extracted must be extracted from organizational email addresses, because applications such as ClearBit, FullContact, BigPicture, Powrbot, and Crunchbase all focus on business-to-business marketing and, as such, on organizational e-mail. We will call these disambiguating APIs.
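A minimal sketch of the first filtering step, deciding whether an address is worth sending to a disambiguating API at all, might look like the following. The domain list is a small illustrative assumption, not an exhaustive or authoritative list, and the function names are hypothetical.

```python
# Illustrative sample of consumer mail domains; a real deployment would need
# a far longer, maintained list. These entries are assumptions.
PERSONAL_EMAIL_DOMAINS = frozenset(
    {"hotmail.com", "icloud.com", "gmail.com", "yahoo.com"}
)


def email_domain(address):
    """Return the lower-cased domain part of an email address."""
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    return domain.lower()


def is_organizational_email(address):
    """True when the address does not belong to a known personal mail provider."""
    return email_domain(address) not in PERSONAL_EMAIL_DOMAINS


candidates = ["jdoe@snortblat.com", "Joe.Smith@GMAIL.com"]
usable = [addr for addr in candidates if is_organizational_email(addr)]
```

Only the addresses left in `usable` would be passed on for lookup.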
Suggested methodology
We postulate that amalgamating information from a couple of the disambiguating APIs yields enough information to assemble a personal profile that is both disambiguating and persistent, without the need for a personal UUID for coherency.
A natural person can be disambiguated with persisting information when described in JSON-LD format using an amalgam of information found using disambiguating APIs. The beginning schema for describing a disambiguated person will be presented in JSON-LD format herein, and will be maintained online at http://grcschema.org/Person.
Why JSON-LD?
JSON-LD (JavaScript Object Notation for Linked Data) is a method of encoding Linked Data using JSON. JSON-LD is designed around the concept of a context to provide additional mappings from JSON to an RDF model. The context links object properties in a JSON document to concepts in an ontology. In order to map the JSON-LD syntax to RDF, JSON-LD allows values to be coerced to a specified Type.
By combining various Types of information about a person (name, postal address, various e-mail addresses, social website personal URLs, etc.) each person’s record can be disambiguated from another, and even if individual factoids change, there will be enough unchanging information to persist the record.
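To make the role of the context concrete, here is a small sketch of a person record serialized as JSON-LD. The `@vocab` base, the `linkedin` term, and the sample values are assumptions made for illustration; only the Type and property names `PersonsName`, `EmailAddresses`, `InternetAddresses`, `first_name`, `last_name`, `name_prefix`, and `work_email` come from the tables in this document.

```python
import json

person = {
    "@context": {
        # Assumed vocabulary base: resolves plain keys against grcschema.org.
        "@vocab": "http://grcschema.org/",
        # Hypothetical term showing type coercion: the value is treated as
        # an IRI (a link) rather than as a plain string.
        "linkedin": {"@id": "http://grcschema.org/linkedin", "@type": "@id"},
    },
    "@type": "Person",
    "PersonsName": {
        "first_name": "Joe",
        "last_name": "Smith",
        "name_prefix": "Dr.",
    },
    "EmailAddresses": {"work_email": "jsmith@example.com"},
    "InternetAddresses": {"linkedin": "https://www.linkedin.com/in/example"},
}

document = json.dumps(person, indent=2)
```

The record stays ordinary JSON for consumers that ignore the context, while a JSON-LD processor can map each property onto the ontology.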
Audit information
In order to determine the persistence of a person’s information, an audit methodology must first be decided upon. An AuditData JSON Type can be established for this purpose. Below is the initial proposed AuditData Type.
AuditData
Metadata about a JSON Thing’s audit information.
Property | Expected Type | Description |
---|---|---|
id | Integer | A unique and persistent identifier for the audit record. |
date_modified | Datetime | The date the record was modified. |
modified_by | Integer | The ID of the person or agent that last modified the record. |
live_status | Boolean | A Boolean field of “live” (1) or “deprecated” (0). |
modified_property | String | The JSON property that has been modified. |
previous_value | String | The value of the modified_property prior to modification. For a new record, this value will be “null”. |
current_value | String | The value of the JSON property after modification. For a record that was deleted, this value will be “null.” |
audited_record_id | Integer | The record ID of the record that created the audit trail. |
By tracking individual changes, we’ll be able to see when Joe Smith becomes Dr. Joe Smith, or when Joe changes his current work e-mail address.
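One way to produce such records is to compare the before and after versions of a flat record and emit one AuditData entry per changed property. The sketch below assumes flat dictionaries, an in-memory ID counter, and Python's `None` standing in for “null”; the function name `audit_changes` is hypothetical.

```python
from datetime import datetime, timezone
from itertools import count

# In-memory ID source, for illustration only; a real system would use
# persistent identifiers.
_audit_ids = count(1)


def audit_changes(record_id, before, after, modified_by, now=None):
    """Return one AuditData-shaped dict per property that differs."""
    now = now or datetime.now(timezone.utc)
    entries = []
    for prop in sorted(set(before) | set(after)):
        old, new = before.get(prop), after.get(prop)
        if old == new:
            continue
        entries.append({
            "id": next(_audit_ids),
            "date_modified": now.isoformat(),
            "modified_by": modified_by,
            "live_status": True,
            "modified_property": prop,
            "previous_value": old,   # None when the property is new
            "current_value": new,    # None when the property was deleted
            "audited_record_id": record_id,
        })
    return entries


before = {"first_name": "Joe", "last_name": "Smith",
          "work_email": "joe@old.example"}
after = {"first_name": "Joe", "last_name": "Smith", "name_prefix": "Dr.",
         "work_email": "joe@new.example"}

trail = audit_changes(record_id=42, before=before, after=after, modified_by=7)
```

Here `trail` holds two entries, one for the new `name_prefix` and one for the changed `work_email`, while the unchanged name properties produce nothing.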
AuditData will be tracked live with all revisions at http://grcschema.org/AuditData.
Core personal information
Person
Core personal information is broken down into several Things, listed below.
Property | Expected Type | Description |
---|---|---|
PostalAddress | Thing | The address to which physical mail and packages are delivered. |
InternetAddresses | Thing | The various Internet locations that help disambiguate a person, such as their FaceBook, LinkedIn, and Twitter Address. |
PersonsName | Thing | All of the name properties associated with a real person’s name. |
EmailAddresses | Thing | All of the various email addresses that could be associated with a person. |
PastEmailAddresses | Thing | Previous email addresses. |
CoreMetaData | Thing | Metadata documenting the ID and core information about a JSON Thing. |
HierarchicalMetaData | Thing | MetaData about a JSON Type’s hierarchical information. If the record is in a non-hierarchical array this will be nulled. |
PostalAddress
The address to which physical mail and packages are delivered.
Property | Expected Type | Description |
---|---|---|
address1 | String | The first part of a postal address such as the building number and street name. |
address2 | String | The second part of a postal address, usually denoting a suite. |
city | String | The city the address is located in. |
state | String | The state/province for the address. |
postal_code | String | The postal/zip code for the address. |
country | String | The country the address is located in. |
InternetAddresses
The various Internet locations that help disambiguate a person, such as their FaceBook, LinkedIn, and Twitter Address.
Property | Expected Type | Description |
---|---|---|
 | URL | The Facebook URL for the person. |
 | URL | The LinkedIn URL for the person. |
PersonsName
All the name properties associated with a real person’s name.
Property | Expected Type | Description |
---|---|---|
first_name | Text | The person’s first name. |
last_name | Text | The person’s last name. |
middle_name | Text | A person’s middle name. |
name_prefix | String | The prefix before a person’s name, such as Dr., Mr., Mrs. |
name_suffix | String | The suffix after a person’s name, such as Jr., III, etc. |
EmailAddresses
All the various email addresses that could be associated with a person.
Property | Expected Type | Description |
---|---|---|
work_email | | An email address belonging to a person associated with their work account. |
personal_email | | An email address belonging to a person not associated with their work account. |
PastEmailAddresses
An array of previous email addresses.
Property | Expected Type | Description |
---|---|---|
past_work_email_addresses | Array | A collection of work email addresses formerly used by a person but no longer active. This is often used in personal disambiguation. |
past_personal_email_addresses | Array | A collection of personal email addresses formerly used by a person but no longer active. This is often used in personal disambiguation. |
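The relationship between EmailAddresses and PastEmailAddresses can be sketched as a small update routine: when the work address changes, the old one is archived rather than discarded. This is a sketch under the assumption that a person is held as nested dictionaries keyed by the Type names above; `change_work_email` is a hypothetical helper.

```python
import copy


def change_work_email(person, new_email):
    """Return a copy of person with the work email replaced and the old one archived."""
    updated = copy.deepcopy(person)
    emails = updated.setdefault("EmailAddresses", {})
    past = updated.setdefault("PastEmailAddresses", {}).setdefault(
        "past_work_email_addresses", []
    )
    old = emails.get("work_email")
    # Archive the previous address once, and only if it actually changed.
    if old and old != new_email and old not in past:
        past.append(old)
    emails["work_email"] = new_email
    return updated


jane = {"EmailAddresses": {"work_email": "jdoe@snortblat.com"}}
jane_after_move = change_work_email(jane, "jdoe@theucf.info")
```

Returning a copy leaves the original record untouched, which keeps the before and after states available for an audit comparison.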
CoreMetaData
Metadata documenting the ID and core information about a JSON Thing.
Property | Expected Type | Description |
---|---|---|
id | Integer | A unique and persistent identifier for the record. |
date_created | Datetime | The date the record was created. |
date_modified | Datetime | The date the record was last modified. |
created_by | Integer | The ID of the person or agent that created the record. |
live_status | Boolean | A Boolean field of “live” (1) or “deprecated” (0). |
HierarchicalMetaData
MetaData about a JSON Type’s hierarchical information. If the record is in a non-hierarchical array this will be nulled.
Property | Expected Type | Description |
---|---|---|
parent_id | Integer | ID of the associated parent for this record. |
sort_value | Integer | An integer given to a record relative to its siblings used to sort at each sibling level. |
ChildIDs | Thing | A collection of children identifiers. |
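As an illustration of how these three properties fit together, the sketch below derives each record's ChildIDs from `parent_id` and orders them by `sort_value`. Representing ChildIDs as a plain list of integers is an assumption, as is the helper name `attach_child_ids`.

```python
from collections import defaultdict


def attach_child_ids(records):
    """Return copies of the records with ChildIDs ordered by sort_value."""
    by_parent = defaultdict(list)
    for rec in records:
        by_parent[rec["parent_id"]].append(rec)

    result = []
    for rec in records:
        children = sorted(by_parent.get(rec["id"], []),
                          key=lambda child: child["sort_value"])
        result.append({**rec, "ChildIDs": [child["id"] for child in children]})
    return result


# Hypothetical records: one root with two children listed out of order.
records = [
    {"id": 1, "parent_id": None, "sort_value": 1},
    {"id": 2, "parent_id": 1, "sort_value": 2},
    {"id": 3, "parent_id": 1, "sort_value": 1},
]

tree = attach_child_ids(records)
```

Record 1 ends up with children in the order 3, 2, following their `sort_value`, and leaf records receive an empty list.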
CRUD methodology
There is a simple proposed methodology to the API structure for getting a person’s data. The flowchart is in code2flow format.
Disambiguated User persistence
Given the JSON Object set for Person, there is enough information for each person to ensure that if any individual or set of personal factoids changes (name, postal address, email address, etc.), non-changing information, as well as a full audit trail of changes, will be sufficient to sustain record persistence.
For instance, within the Unified Compliance team, there was a contributor we will call Jane Doe. Jane began her contributions while at Snortblat.com, and was initially registered as:
Name | Email | Organization | LinkedInProfile |
---|---|---|---|
Jane Doe | jdoe@snortblat.com | Snortblat | https://www.linkedin.com/in/jdoe23 |
Then, Jane joined the UCF team and continued her contributions. Her profile changed accordingly. However, her LinkedInProfile ID didn’t change, and from that unchanged information, persistence for Jane’s record was maintained.
Name | Email | Organization | LinkedInProfile |
---|---|---|---|
Jane Doe | jdoe@theucf.info | Unified Compliance | https://www.linkedin.com/in/jdoe23 |
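The comparison behind this example can be sketched as a check on properties that are expected to stay stable. The sketch assumes each registration is a flat dictionary; the `Email` key name, the choice of `LinkedInProfile` as the only stable key, and the function names are assumptions for illustration.

```python
def normalize_profile_url(url):
    """Lower-case and strip trailing slashes so trivial variants compare equal."""
    return url.strip().lower().rstrip("/")


def same_contributor(old, new, stable_keys=("LinkedInProfile",)):
    """True when any stable property is present in both records and matches."""
    for key in stable_keys:
        a, b = old.get(key), new.get(key)
        if a and b and normalize_profile_url(a) == normalize_profile_url(b):
            return True
    return False


jane_then = {"Name": "Jane Doe", "Email": "jdoe@snortblat.com",
             "Organization": "Snortblat",
             "LinkedInProfile": "https://www.linkedin.com/in/jdoe23"}
jane_now = {"Name": "Jane Doe", "Email": "jdoe@theucf.info",
            "Organization": "Unified Compliance",
            "LinkedInProfile": "https://www.linkedin.com/in/jdoe23"}

is_same_jane = same_contributor(jane_then, jane_now)
```

A match on name alone is deliberately not treated as sufficient, since two different people can share a name.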
While this is a very simplistic instance, there are other instances that can be extreme. One instance of persistence had to rely on a semantic comparison of LinkedIn’s work history. Why? The person got married and changed their name, which then changed the email address as well as their LinkedIn profile address. Because of this number of changes, the company processing the user’s persona had created a hash based on the user’s work history (which remained the same) to track them.
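The company's actual hashing scheme is not described, so the following is only a sketch of the general idea: derive a fingerprint from work history that survives changes to name, email, and profile URL. It uses exact matching on normalized entries with SHA-256, not a semantic comparison, and the tuple layout (employer, title, start year) is an assumption.

```python
import hashlib


def work_history_fingerprint(history):
    """Hash a work history so that entry order, case, and padding do not matter.

    history is an iterable of (employer, title, start_year) tuples.
    """
    normalized = sorted(
        (employer.strip().lower(), title.strip().lower(), str(start_year))
        for employer, title, start_year in history
    )
    payload = "\n".join("|".join(entry) for entry in normalized)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical histories for the same person, captured before and after a
# name change; the entries differ only in order and capitalization.
before_marriage = [("Snortblat", "Analyst", 2012),
                   ("Unified Compliance", "Editor", 2016)]
after_marriage = [("unified compliance", "Editor", 2016),
                  ("Snortblat ", "analyst", 2012)]

same_person = (work_history_fingerprint(before_marriage)
               == work_history_fingerprint(after_marriage))
```

An exact hash like this breaks as soon as a new job is added or an employer is renamed, which is one reason a semantic comparison may be needed in practice.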
Within our scope, because contributors are the ones who are creating and editing their records, extreme persistence routines aren’t necessary. Each developer should employ a mechanism to allow contributors to edit their records and track those edits within their system. When a user has changed their record, the developer should trigger a mechanism to write that change back to the API so that the contributor’s record is updated.
Endnotes
All Citations from Authoritative Works can be found within the Personal Name Disambiguation subdirectory within the GRCschema.org’s research tab.