Structured data classification

Data classification is the process of assigning data, either as a whole (e.g. a database schema) or in part (e.g. a column name or column values), to categories. Data can also be evaluated for its identifiability, sensitivity and/or confidentiality. In this work, we focus on structured and semi-structured data.

Our goal is to identify, classify and understand the data residing in structured repositories such as relational databases (tables), object storage (sets of related semi-structured files) and single semi-structured files (e.g. a patient release form in XML).

Data classification is a necessary first step in many use cases, such as: (1) identifying where confidential data is stored across the whole organization in order to protect it, (2) identifying which tables contain personal data in order to comply with regulations such as the right to be forgotten (GDPR), and (3) providing semantics for the data residing in columns to enable automation of AI algorithms.

In recent years, both industry and academia have worked on classification of structured data (e.g. relational DBs or JSON/XML files). Industry work has aimed at security issues such as data leakage prevention, or at privacy needs arising from regulation (e.g. GDPR). Much of this classification work applies regular expressions, dictionaries and specific validation code (e.g. the Luhn checksum) to the content (the data itself) as well as to the context (e.g. column name, file extension). Additional approaches include ML algorithms trained to detect specific content types. These methods support classification of selected predefined data types such as national identifiers, credit card numbers, phone numbers, email addresses etc. [1] [2] [3] [4] [5].
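As an illustration of this style of classification, the following minimal sketch combines a regular expression, a Luhn checksum and a small column-name dictionary. The patterns, the hint dictionary and the confidence values are assumptions made for this example and are not taken from any of the cited products.

```python
import re

# Illustrative patterns only; production rules are usually much stricter.
CARD_RE = re.compile(r"^\d{13,19}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
CARD_NAME_HINTS = {"card", "cc", "pan"}  # hypothetical context dictionary


def luhn_valid(digits: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def classify_value(value: str, column_name: str = ""):
    """Classify one value from its content, using the column name as context."""
    compact = re.sub(r"[ -]", "", value)  # drop common separators
    if CARD_RE.match(compact) and luhn_valid(compact):
        name_tokens = set(re.findall(r"[a-z]+", column_name.lower()))
        # A supporting column name raises confidence (values are arbitrary).
        confidence = 0.9 if name_tokens & CARD_NAME_HINTS else 0.6
        return ("credit_card", confidence)
    if EMAIL_RE.match(value):
        return ("email", 0.8)
    return ("unknown", 0.0)
```

The regular expression only narrows the candidates; the checksum then rejects most digit strings that merely look like card numbers, and the column name is used as supporting context rather than as a decision on its own.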

In contrast, the work performed by academia has focused on tasks known as “Table Annotation”, “Table Interpretation” or “Semantic Labeling” in the area of semantic tables. The goal of this work is to interpret tables by linking them to a reference knowledge base (KB) such as Wikidata or DBpedia. This goal has several subtasks [6]:

  • Column-Type Annotation – linking columns (considering both data and metadata) to ontology concepts/types.
  • Cell-Entity Annotation – linking column cells (data) to KB entities.
  • Columns-Property Annotation – linking pairs of columns through ontology properties.

Much of the work in this area has progressed through a challenge [7] and a benchmark (SemTab) [8] that provide common ground for evaluating different systems on the different tasks. Applied techniques include preprocessing to look up data in Wikidata, lexical matching, fuzzy search and supervised learning [9] [10] [11] [12].

In our work, we have progressed in several areas. First, we have developed a fast fuzzy matching algorithm able to match metadata such as column names to Wikidata concepts. It enables us to match column names, which are frequently concatenated, abbreviated words such as “empId”, to a set of predefined terms such as “employee identifier”. These terms are then mapped to a set of relevant Wikidata [13] concepts. The whole process can be performed in milliseconds.
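The idea can be sketched as follows. This is not our matching algorithm: the example uses Python's standard `difflib.SequenceMatcher` as a stand-in, and the abbreviation table, the term list and the threshold are hypothetical. The final mapping from terms to Wikidata concepts is not shown.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table and predefined term list.
ABBREVIATIONS = {
    "emp": "employee",
    "id": "identifier",
    "addr": "address",
    "dob": "date of birth",
    "num": "number",
}
TERMS = ["employee identifier", "employee address", "date of birth", "phone number"]


def split_name(name: str):
    """Split camelCase, snake_case and kebab-case names into lowercase tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def normalize(name: str) -> str:
    """Expand known abbreviations token by token."""
    return " ".join(ABBREVIATIONS.get(t, t) for t in split_name(name))


def best_term(name: str, terms=TERMS, threshold: float = 0.6):
    """Return (term, score) for the closest term, or (None, score) if too far."""
    candidate = normalize(name)
    scored = [(SequenceMatcher(None, candidate, t).ratio(), t) for t in terms]
    score, term = max(scored)
    return (term, score) if score >= threshold else (None, score)
```

With these assumptions, `best_term("empId")` first normalizes the name to "employee identifier" and then matches it against the term list. A misspelled name such as "employe_id" still scores close to 1.0 against the same term, which is the behavior fuzzy matching is meant to provide.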

Second, we have built on our earlier work on Masking, in particular the library (“format”) that classifies data values, and added the following layers:

  1. Enable faster classification of data values by identifying early on which data classes are possible candidates and ignoring all others. This is done by analyzing the regex state machine of each data class and building a classification tree that quickly identifies which data classes are relevant for each data value.
  2. Make use of multiple values (values in the same column) to distinguish between data classes with similar-looking regular expressions. For example, US SSN, Patient Id and Israel Id are all 9 digits long. However, they can be distinguished using statistical analysis when the data classes have different constraints (e.g. a Luhn checksum) [14].
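The second layer can be illustrated with a minimal sketch. It assumes, for the sake of the example, two hypothetical 9-digit data classes that share the same regular expression: one whose values must pass a Luhn checksum and one without any checksum. The class labels and the threshold are assumptions; the actual method is described in [14].

```python
import re

NINE_DIGITS = re.compile(r"^\d{9}$")


def luhn_valid(digits: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def luhn_pass_rate(values) -> float:
    """Fraction of the 9-digit values in a column that pass the Luhn check."""
    candidates = [v for v in values if NINE_DIGITS.match(v)]
    if not candidates:
        return 0.0
    return sum(luhn_valid(v) for v in candidates) / len(candidates)


def classify_column(values, threshold: float = 0.9) -> str:
    """Pick between two hypothetical 9-digit classes using the whole column."""
    if not any(NINE_DIGITS.match(v) for v in values):
        return "no_9_digit_values"
    # Exactly one check digit in ten satisfies Luhn, so an unconstrained
    # column passes about 10% of the time, a constrained one close to 100%.
    if luhn_pass_rate(values) >= threshold:
        return "luhn_checked_id"
    return "unchecked_id"
```

A single value cannot separate the two classes, since an unconstrained identifier passes the checksum by chance about one time in ten. Over a column of values the pass rate separates them clearly.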

Lastly, we have constructed a framework that takes a DB as input, analyzes its schema and values, and provides each classification method (e.g. fuzzy matching of metadata, classification of data values) with different views of the DB. Example views include the schema view, where the schema name and schema metadata (e.g. foreign keys) can be accessed, and the column view, where the column name and/or column values can be consumed.
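To make the notion of views concrete, here is a minimal sketch built on Python's standard `sqlite3` module. The class names `SchemaView` and `ColumnView`, the function `build_views` and the sampling limit are hypothetical and do not describe the actual framework; SQLite is used only because it needs no setup.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class ColumnView:
    """What a column-level method sees: the column name and sampled values."""
    table: str
    name: str
    values: list


@dataclass
class SchemaView:
    """What a schema-level method sees: schema name, tables and foreign keys."""
    name: str
    tables: dict        # table name -> list of column names
    foreign_keys: list  # (table, column, referenced table, referenced column)


def build_views(conn: sqlite3.Connection, sample_size: int = 100):
    """Analyze a SQLite database and return (SchemaView, list of ColumnView)."""
    tables, foreign_keys, columns = {}, [], []
    names = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
    for t in names:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = [r[1] for r in conn.execute(f'PRAGMA table_info("{t}")')]
        tables[t] = cols
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        for fk in conn.execute(f'PRAGMA foreign_key_list("{t}")'):
            foreign_keys.append((t, fk[3], fk[2], fk[4]))
        for c in cols:
            rows = conn.execute(f'SELECT "{c}" FROM "{t}" LIMIT ?', (sample_size,))
            columns.append(ColumnView(t, c, [r[0] for r in rows]))
    return SchemaView("main", tables, foreign_keys), columns
```

A metadata method would then consume only `ColumnView.name` or the `SchemaView`, while a value-based method would read `ColumnView.values`, so each method receives only the view it needs.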

Using this framework, we plan to develop new classification methods (e.g. for different views) and construct algorithms able to account for multiple results from different classification methods and to select the best result.

Micha Moffie, IBM.

References

[1] Microsoft, "Microsoft information protection solutions," [Online]. Available: https://www.microsoft.com/en-us/security/business/information-protection#office-SecondaryMessaging-fw5353f. [Accessed 2020].

[2] Microsoft, "Tutorial: Automatically apply Azure Information Protection classification labels," [Online]. Available: https://docs.microsoft.com/en-us/cloud-app-security/use-case-information-protection. [Accessed 2021].

[3] Amazon, "Amazon Macie," [Online]. Available: https://docs.aws.amazon.com/macie/index.html. [Accessed 2020].

[4] Google, "Cloud Data Loss Prevention," [Online]. Available: https://cloud.google.com/dlp/. [Accessed 2021].

[5] 1touch.io, "1touch.io data aware security," [Online]. Available: https://1touch.io/. [Accessed 2021].

[6] V. Cutrona, F. Bianchi, E. Jimenez-Ruiz and M. Palmonari, "Tough Tables: Carefully Evaluating Entity Linking for Tabular Data," in The 19th International Semantic Web Conference, Athens, 2020.

[7] E. Jimenez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas and V. Cutrona, "Results of SemTab 2020," [Online]. Available: CEUR-WS.org/Vol-2775/paper0.pdf. [Accessed 2021].

[8] E. Jimenez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen and K. Srinivas, "SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems," The Semantic Web. ESWC 2020. Lecture Notes in Computer Science, pp. 514-530, 2020.

[9] R. Azzi and G. Diallo, "AMALGAM: making tabular dataset explicit with knowledge graph," in Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab), 2020.

[10] S. Tyagi and E. Jimenez-Ruiz, "LexMa: Tabular Data to Knowledge Graph Matching using Lexical Techniques," in Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab), 2020.

[11] P. Nguyen, I. Yamada, N. Kertkeidkachorn, R. Ichise and H. Takeda, "MTab4Wikidata at SemTab 2020: Tabular Data Annotation with Wikidata," in Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab), 2020.

[12] C. S. Bhagavatula, T. Noraset and D. Downey, "TabEL: Entity Linking in Web Tables," in The Semantic Web - ISWC 2015, 2015.

[13] "Wikidata," [Online]. Available: https://www.wikidata.org. [Accessed 2022].

[14] S. Assaf, A. Farkash and M. Moffie, "Multi-value Classification of Ambiguous Personal Data," 2019.