As companies store more and more consumer data, they’re increasingly at risk of embarrassing — and even deeply damaging — breaches by hackers. But what if they could glean useful insights without hoarding every scrap of data? Edge computing, in which data is processed locally on hardware instead of on the cloud, can help them do just that by implementing three critical design choices. The design choices begin with how to think about data collection and extend to the actual data processing. They are: 1) sufficiency, or a focus on only must-have data; 2) aggregation, or lumping data together to produce group insights; and 3) alteration, or making minor changes to the data to hide an individual’s identity while minimally impacting the accuracy of insights.
When collecting consumer data, there is almost always a risk to consumer privacy. Sensitive information could be leaked unintentionally or breached by bad actors. For example, the Equifax data breach of 2017 compromised the personal information of 143 million U.S. consumers. Smaller breaches, which you may or may not hear about, happen all the time. As companies collect more data — and rely more heavily on its insights — the potential for data to be compromised will likely only grow.
With the appropriate data architecture and processes, however, these risks can be substantially mitigated by ensuring that private data is touched at as few points as possible. Specifically, companies should consider the potential of what is known as edge computing. Under this paradigm, computations are performed not in the cloud, but on devices that are on the edge of the network, close to where the data are generated. For example, the computations that make Apple’s Face ID work happen right on your iPhone. As researchers who study privacy in the context of business, computer science, and statistics, we think this approach is sensible — and should be used more — because edge computing minimizes the transmission and retention of sensitive information to the cloud, lowering the risk that it could land in the wrong hands.
But how does this tech actually work, and how can companies who don’t have Apple-sized resources deploy it?
Consider a hypothetical wine store that wants to capture the faces of consumers sampling a new wine to measure how they like it. The store’s owners are picking between two competing video technologies: The first system captures hours of video, sends the data to third-party servers, saves the content to a database, processes the footage using facial analysis algorithms, and reports the insight that 80% of consumers looked happy upon tasting the new wine. The second system runs facial analysis algorithms on the camera itself, does not store or transmit any video footage, and reports the same 80% aggregated insight to the wine retailer.
The second system uses edge computing to restrict the number of points at which private data are touched by humans, servers, databases, or interfaces. Therefore, it reduces the chances of a data breach or future unauthorized use. It only gathers sufficient data to make a business decision: Should the wine retailer invest in advertising the new wine?
As companies work to protect their customers’ privacy, they will face similar situations as the one above. And in many cases, there will be an edge computing solution. Here’s what they need to know.
Privacy by Design
In 1980, the Organization for Economic Cooperation and Development, an international forum of 38 countries, established guidelines for the protection of privacy and trans-border flows of personal data for its member countries with the goal of harmonizing national privacy legislation. These guidelines, which were based on principles such as purpose limitation and data minimization, evolved into recent data-privacy legislation such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA), both introduced in 2018.
The rise of edge computing helps organizations meet the privacy guidelines above by implementing three critical design choices. The design choices begin with how to think about data collection and extend to the actual data processing. They are:
Sufficiency
A mindful data architecture should collect and retain only the must-have information. Data-collection approaches should be designed and implemented around the desired insights (in other words, its purpose should be limited), thus reducing the number of variables and people tracked, meaning the minimum amount of data is collected.
In some ways, this is an old idea: In 1922, the groundbreaking British statistician R.A. Fisher developed the statistical theory of a “sufficient statistic,” which provides all the information required on the desired insight. (E.g., 80% of consumers looked happy upon tasting the new wine.) Minimal sufficiency goes a step further by most efficiently capturing the sufficient information required for an insight. Translated loosely, the wine retailer may use an edge device to perform facial analysis on fewer consumers — a smaller sample — to reach the same 80% insight.
Aggregation
For many business decisions we don’t need insights on the individual level. Summarizing the information at a group level retains most of the necessary insights while minimizing the risk of compromising private data. Such non-personal data is often not subject to data protection legislation, such as the GDPR or the CCPA.
Alteration
When it is critical to obtain insights at a personal level, the data may be altered to hide the individual’s identity while minimally impacting the accuracy of insights. For instance, Apple uses a technique called local differential privacy to add statistical noise to any information that is shared by a user’s device, so Apple cannot reproduce the true data. In some situations, alteration of individual data is legally mandated, such as in clinical studies. Techniques may include pseudo-anonymization and go as far as generating synthetic data.
Knowing when to apply data-processing tools is as critical as using the right tools. Applying sufficiency, aggregation, and alteration during data collection maximizes protection while retaining the most useful information. This approach can also reduce costs for cyber insurance, compliance with data-protection regulations, and more scalable infrastructure.
The Information-Privacy Tradeoff
Restricting private data collection and processing to the edge is not without its downsides. Companies will not have all their consumer data available to go back and re-run new types of analyses when business objectives change. However, this is the exact situation we advocate against to protect consumer privacy.
Information and privacy operate in a tradeoff — that is, a unit increase in privacy requires some loss of information. By prioritizing data utility with purposeful insights, edge computing reduces the quantity of information from a “data lake” to the sufficient data necessary to make the same business decision. This emphasis on finding the most useful data over keeping heaps of raw information increases consumer privacy.
The design choices that support this approach — sufficiency, aggregation, and alteration — apply to structured data, such as names, emails or number of units sold, and unstructured data, such as images, videos, audio, and text. To illustrate, let us assume the retailer in our wine-tasting example receives consumer input via video, audio, and text.
Video
If the goal of the wine retailer is to understand consumer reactions broken down by demographic groups, there is no need to identify individual consumers via facial recognition or to maintain a biometric database. One might wonder — aren’t the pictures that contain people’s faces private data? Indeed, they are. And this is where edge computing allows the video feed to be analyzed locally (namely, on the camera) without ever being stored permanently or transmitted anywhere. AI models are trained to extract in real time the required information, such as positive sentiment and demographics, and discard everything else. That is an example of sufficiency and aggregation employed during data collection.
Audio
In our wine-tasting setting, an audio analysis may distinguish between when speech occurs versus silence or background music. It may also reveal the age of the person speaking, their emotions, and energy levels. Are people more excited after tasting the new wine? AI models can understand the overall energy of the speaker without knowing what was said. They analyze inflections and intonations in the voice to reveal an individual’s state of mind. Sufficiency is built into the classifications (i.e., the output) of the AI technology by default. Running these models on the edge and summarizing results by demographic group also achieves data aggregation.
Text
Our wine retailer can use consumer textual feedback about the new wine not only to understand whether consumers are satisfied but, equally importantly, learn the words consumers use to describe the taste and feel of the new wine. This information is invaluable input into the development of advertising. In this analysis, the data do not need to be tied to specific consumers. Instead, textual comments are aggregated across consumers, and the relative frequencies of taste and feeling keywords for each wine type are sent to the wine retailer. Alternatively, if insights are desired on the personal level, textual feedback can be altered synthetically using Natural Language Generation (NLG) models.
In the examples above, the Sufficiency-Aggregation-Alteration design choices enhance privacy. These ideas are also relevant to applications and data types as far ranging as unlocking your phone, evaluating your health with smart devices, and creating better experiences. Paradoxically, the mindful use of edge computing and AI, which often scares people, is critical for maximizing privacy protection. Privacy advocates also promote the idea of consumers owning and controlling their personal data via a Customer Data Platform (CDP). A data architecture that links the CDP to an edge device (think of voice-activated home assistants) can further increase consumer trust by providing consumers complete control and transparency over their data.
This framework is only a partial solution to concerns about privacy, however, to be deployed alongside other beneficial practices such as data encryption, minimizing access privileges, and data retention. Encryption is employed when data are stored permanently and in transit. That is an essential first step to minimize unauthorized access because it converts the dataset into a black box. Without a key, the black box has no value. Likewise, limiting data access to a need-to-know basis, having clear policies for data retention, and providing opt-out mechanisms, reduces the risk of data leaks. Even though the above steps are standard practice, not everyone employs them, creating many more touchpoints where private data breaches can occur. Be a good manager and check with your IT team and third-party vendors.
***
Privacy is a social choice, and leadership teams should prioritize data utility. Many companies have been collecting as much data as possible and deciding later what is useful versus not. They are implicitly trading off all consumer privacy with the most information. We advocate a more disciplined approach wherein the uses of the data are specified upfront to guide both the collection and retention of data. Furthermore, technology has offered us all the tools we need to safeguard privacy without impacting business intelligence. By leveraging edge computing and AI technologies, companies may apply the design choices of sufficiency, aggregation, and alteration at the data collection stage. With a carefully designed architecture, we may obtain the desired insights and secure the privacy of consumers’ data at the same time. Contrary to conventional wisdom, we can have our (privacy) cake and eat it too.