Unstructured data, the data that doesn't fit any predefined data model, can be difficult for an enterprise to locate and digest. Emails, text files, photos, videos, call transcripts, and business chat apps all accumulate enormous amounts of data, which ends up floating around in the metaphorical ether. And what's floating around can cause IT security and business continuity nightmares for businesses that don't rein it all in.
Unstructured data currently makes up more than 80% of enterprise data and is growing at a rate of 55-65% per year, according to Apoorv Agarwal, co-founder and CEO at Text IQ, an artificial intelligence platform.
These are the most common IT and business security threats hidden in unstructured data that Agarwal says enterprises are often unaware of until it’s too late:
Personally Identifiable Information & Personal Health Information
Failing to redact all the PII and PHI in files might leave an enterprise in grave danger. Often, PII isn’t as obvious as a name or address.
Special category information like political alignment, religious belief, or sexual orientation might need to be redacted as well.
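Simple redaction often starts with pattern matching. The sketch below is a minimal, hypothetical illustration of regex-based redaction; the pattern names and formats are assumptions, and as the article notes, much PII (names, addresses, special-category data) cannot be caught by regexes alone.

```python
import re

# Hypothetical minimal patterns -- real PII/PHI detection needs far
# broader coverage than a handful of regexes.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each matched span with a [REDACTED:<type>] token."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text

print(redact("Reach me at jane.doe@example.com or 555-867-5309."))
```

Anything this misses, such as a misspelled name or an implied religious affiliation, still sits in the data, which is why pattern matching alone falls short at enterprise scale.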
According to Agarwal, part of the problem is sheer volume: there is simply too much unstructured data, especially since COVID. Now that people are working remotely, they are communicating more over email, text, and other channels that are now being recorded.
“But the other challenge is to be able to find this in an automated manner. There are probably not enough humans on Earth to actually go through all this data and find this PII and PHI.”
Fraud Hidden Behind Code Words

When committing insider trading or other fraudulent activity, a person is likely to disguise their activity behind code words. Hopefully this doesn't happen often, or at all, but it raises the question: do you know what your colleagues really mean when they say "Reuben sandwich"?
Data loss prevention and compliance tools, such as the ones used by financial institutions, use keywords and regular expressions to flag certain kinds of communications, especially those between people who are making an investment and people who are doing the research.
The problem is that the number of false positives is too high, Agarwal says.
“These systems don’t catch all the things they need to be catching. So it’s both like a false positive problem, but they’re also missing things that need to be caught.”
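The keyword-and-regex approach these tools rely on can be sketched in a few lines. This is a toy illustration, not a real DLP product; the watchlist terms are hypothetical, and it shows exactly the false-positive problem Agarwal describes.

```python
import re

# Hypothetical watchlist a compliance tool might scan for; real DLP
# systems combine many such rules with sender/recipient metadata.
WATCHLIST = [
    re.compile(r"\binsider\b", re.IGNORECASE),
    re.compile(r"\bnon-public\b", re.IGNORECASE),
    re.compile(r"\breuben sandwich\b", re.IGNORECASE),  # suspected code word
]

def flag(message: str) -> list[str]:
    """Return the watchlist patterns a message trips (keyword-matching only)."""
    return [p.pattern for p in WATCHLIST if p.search(message)]

# A harmless lunch plan trips the same rule as actual misconduct would.
print(flag("Grabbing a reuben sandwich at noon, want one?"))
print(flag("Let's discuss the non-public numbers before the call."))
```

Because the rule has no notion of who is talking to whom or why, the innocent message and the suspicious one are indistinguishable to it, which is where context-aware models come in.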
The Same Person Appearing Under Different Names
In files, people can be referred to by their first name, by their initials, by a misspelled version of their name, or even a different name entirely.
“In general, communication has become extremely informal, as opposed to 100 years ago, when employees would be very formal in their writing,” Agarwal says.
“We’re seeing communication become shorter, more frequent, and more informal. We can encourage employees not to misspell, but people are going to do it regardless, and what becomes problematic is identifying certain people or identifying certain things. Machines need to go in and do that normalization, and this is one of the problems we’ve been able to solve using machine learning.”
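A crude version of this normalization can be done with string similarity. The sketch below is a simplified assumption of how mentions might be resolved to canonical names; production systems like the one Agarwal describes use learned models, not just character-level matching, and the roster here is invented.

```python
from difflib import SequenceMatcher

# Toy roster of canonical identities -- hypothetical names.
CANONICAL = ["Apoorv Agarwal", "Jane Doe"]

def best_match(mention: str, threshold: float = 0.6):
    """Map a possibly misspelled or partial mention to a canonical name,
    or None if nothing scores above the threshold."""
    scored = [
        (SequenceMatcher(None, mention.lower(), name.lower()).ratio(), name)
        for name in CANONICAL
    ]
    score, name = max(scored)
    return name if score >= threshold else None

print(best_match("A. Agarwal"))      # initials
print(best_match("Apporv Agrawal"))  # misspelling
```

Initials, nicknames, and entirely different names (a maiden name, say) quickly defeat pure string similarity, which is why this is framed as a machine learning problem rather than a lookup table.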
Sexual Harassment

Inappropriate dialogue or NSFW photos or videos could be lurking in a company's Slack channels. A lot of this falls on company policy and employee education.
But from an IT perspective, it goes back to data loss prevention tech that has certain keywords and regular expressions to flag communications that may have instances of sexual harassment.
“I think every manager will tell you that there are too many false positives, too many things to look at, and the system is probably not catching all the things that need to be caught. So it’s a pretty hard problem to automatically identify and to be accurate.
“It’s not just about the kind of language people are using; it’s about who’s communicating with who and what their roles are.”
Unconscious Bias

An enterprise might not be aware of a department's potentially discriminatory hiring or performance review practices until there's a lawsuit.
By definition, this bias is unconscious. Humans have their own biases, so it's very hard for us to find unconscious bias; it requires going through a lot of data to start noticing, for example, that a manager cites personality traits when reviewing a woman but work-product traits when reviewing a man.
“This is a task for which machine learning has to be brought in, and not supervised machine learning,” Agarwal says.
“Supervised machine learning requires humans to come in and label data, and thereby the human bias gets injected into the machine. So the right solution here is using unsupervised machine learning methods to find instances of unconscious bias.”
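One simple unsupervised signal, sketched below with invented toy data, is to compare the vocabulary used in reviews of different groups without any human-labeled training examples. Real systems use far richer unsupervised models; this only illustrates the idea that diverging word distributions (personality words vs. work-product words) can surface a pattern worth investigating.

```python
from collections import Counter

# Toy review snippets grouped by reviewee gender -- entirely hypothetical.
reviews = {
    "women": ["she is warm and supportive", "a pleasant helpful presence"],
    "men":   ["he shipped the project early", "strong results on the migration"],
}

def top_terms(texts, n=5):
    """Most frequent terms in a group of texts; no labels, no training."""
    counts = Counter(word for t in texts for word in t.split())
    return [w for w, _ in counts.most_common(n)]

# Diverging vocabularies across groups are one unsupervised signal
# a reviewer could investigate further.
for group, texts in reviews.items():
    print(group, top_terms(texts))
```

Because no human labels examples as "biased," the method avoids injecting an annotator's own bias, which is the point Agarwal makes about preferring unsupervised approaches here.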
What to do about unstructured data
So what can IT departments do right now to prevent these problems from plaguing their organizations, and what specifically needs to be done with unstructured data?
Agarwal says IT needs to bring to bear more machine learning and AI that understands context. The solution needs to understand several different kinds of context: not just the linguistic context, but also the social context.