WPES’18: Proceedings of the 2018 Workshop on Privacy in the Electronic Society
SESSION: Session 1: Web Privacy
Tracking and Tricking a Profiler: Automated Measuring and Influencing of Bluekai’s Interest Profiling
Online advertising services infer interest profiles from users’ browsing behavior, but little is known about the extent of these profiles and how they can be influenced. In this paper, we describe and evaluate a system to analyze online profiling as a black box by simulating web browsing sessions based on links posted to Reddit. The study utilizes Oracle’s Bluekai Registry to gain insights into profiles created through online tracking and evaluates how they can be obfuscated. We report on the extent of Bluekai’s tracking network and taxonomy, analyze how profiles are shown to users, and observe how they develop over sessions of up to 3,000 website visits. Our results show that only a fraction of websites influence the interests assigned to a session’s profile, that the profiles themselves are very noisy, and that identical browsing behavior results in different profiles. We evaluate two simple obfuscation schemes that effectively alter interest profiles by selectively adding 5% targeted obfuscation traffic.
Recent works showed that websites can detect which browser extensions users install and which websites they are logged into. This poses significant privacy risks, since extensions and Web logins, which reflect users’ behavior, can be used to uniquely identify users on the Web. This paper reports on the first large-scale behavioral uniqueness study based on 16,393 users who visited our website. We test and detect the presence of 16,743 Chrome extensions, covering 28% of all free Chrome extensions. We also detect whether the user is connected to 60 different websites. We analyze how unique users are based on their behavior, and find that 54.86% of users who have installed at least one detectable extension are unique; 19.53% of users are unique among those who have logged into one or more detectable websites; and 89.23% are unique among users with at least one extension and one login. We use an advanced fingerprinting algorithm and show that it is possible to identify a user in less than 625 milliseconds by selecting the most unique combinations of extensions. Because privacy extensions contribute to the uniqueness of users, we study the trade-off between the number of trackers blocked by such extensions and how unique the users of these extensions are. We find that privacy extensions should be considered more useful than harmful. The paper concludes with possible countermeasures.
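As a toy illustration of the uniqueness metric (hypothetical data and a naive count, not the paper’s dataset or fingerprinting algorithm), a user is “unique” when no other user shares the same set of detected extensions and logins:

```python
from collections import Counter

def uniqueness_rate(fingerprints):
    """Fraction of users whose fingerprint (a frozenset of detected
    extensions/logins) is shared with no other user."""
    counts = Counter(fingerprints)
    unique = sum(1 for fp in fingerprints if counts[fp] == 1)
    return unique / len(fingerprints)

# Hypothetical detections for five users (extension names are made up).
users = [
    frozenset({"uBlock", "Grammarly"}),
    frozenset({"uBlock"}),
    frozenset({"uBlock"}),           # shares a fingerprint with the user above
    frozenset({"LastPass", "Dark Reader"}),
    frozenset({"Dark Reader"}),
]
print(uniqueness_rate(users))  # 3 of 5 fingerprints occur once -> 0.6
```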
Privacy has deteriorated in the world wide web ever since the 1990s. The tracking of browsing habits by different third parties has been at the center of this deterioration. Web cookies and so-called web beacons have been the classical ways to implement third-party tracking. Due to the introduction of more sophisticated technical tracking solutions and other fundamental transformations, classical image-based web beacons might be expected to have lost their appeal. Based on a sample of over thirty thousand images collected from popular websites, this paper shows that such an assumption is a fallacy: classical 1×1 images are still commonly used for third-party tracking in the contemporary world wide web. While it seems that ad-blockers are unable to fully block these classical image-based tracking beacons, the paper further demonstrates that even limited information can be used to accurately distinguish third-party 1×1 images from other images. An average classification accuracy of 0.956 is reached in the empirical experiment. With these results the paper contributes to the ongoing attempts to better understand the lack of privacy in the world wide web, and the means by which the situation might eventually be improved.
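A minimal sketch of the underlying idea, assuming image metadata has already been crawled. This is not the paper’s classifier (which uses richer features to reach the reported 0.956 accuracy); the suffix-based third-party check is a deliberately naive assumption:

```python
from urllib.parse import urlparse

def is_tracking_pixel(img, page_host):
    """Heuristic: a 1x1 image served from a third-party host is a
    likely tracking beacon. The endswith() check is a crude
    approximation of third-party detection."""
    host = urlparse(img["url"]).hostname or ""
    third_party = not host.endswith(page_host)
    return img["width"] == 1 and img["height"] == 1 and third_party

# Hypothetical crawl records.
page = "news.example"
imgs = [
    {"url": "https://tracker.example/p.gif", "width": 1, "height": 1},
    {"url": "https://news.example/logo.png", "width": 200, "height": 50},
]
print([is_tracking_pixel(i, page) for i in imgs])  # [True, False]
```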
Google’s Ad Settings shows the gender and age that Google has inferred about a web user. We compare the inferred values to the self-reported values of 501 survey participants. We find that Google often does not show an inference, but when it does, it is typically correct. We explore which usage characteristics, such as using privacy enhancing technologies, are associated with Google’s accuracy, but find no significant results.
SESSION: Session 2: Secure Computation
The significant advancements in the field of homomorphic encryption have led to growing interest in securely outsourcing data and computation for privacy-critical applications. In this paper, we focus on the problem of performing secure predictive analysis, such as principal component analysis (PCA) and linear regression, through exact arithmetic over encrypted data. We improve the plaintext structure of Lu et al.’s protocols (from NDSS 2017) by switching from a linear array arrangement to a two-dimensional hypercube. This enables us to utilize SIMD (Single Instruction Multiple Data) operations to a larger extent, which improves the space and time complexity by a factor of the matrix dimension. We implement both Lu et al.’s method and ours for PCA and linear regression over HElib, a software library that implements the Brakerski-Gentry-Vaikuntanathan (BGV) homomorphic encryption scheme. In particular, we show how to choose optimal parameters of the BGV scheme for both methods. For example, our experiments show that our method takes 45 seconds to train a linear regression model over a dataset with 32k records and 6 numerical attributes, while Lu et al.’s method takes 206 seconds.
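The benefit of a two-dimensional layout can be illustrated in plaintext with NumPy: when a matrix is packed across two slot dimensions, a matrix-vector product needs one slot-wise multiplication plus rotations along one axis, emulated here as an axis sum. This is a plaintext emulation under our own simplifications, not the HElib/BGV implementation:

```python
import numpy as np

def matvec_simd(A, x):
    """Emulate hypercube-packed SIMD evaluation of A @ x: broadcast x
    across the rows of A, do one slot-wise product over all slots,
    then collapse the column axis (a short sequence of slot rotations
    and additions in the real ciphertext computation)."""
    prod = A * x             # one SIMD multiplication over all slots
    return prod.sum(axis=1)  # rotate-and-add along the column dimension

A = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([5.0, 6.0])
print(matvec_simd(A, x))  # equals A @ x -> [17. 39.]
```

With a linear (one-dimensional) packing, the same product would require a separate multiplication per row, which is where the factor-of-dimension gap comes from.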
Private set intersection (PSI) allows a client to learn only the intersection between its set C and the set S of another party, while the latter party learns nothing. We aim to enhance PSI in several dimensions, motivated by the use cases of increasingly popular online matchmaking: meeting “the one” who possesses all desired qualities and is free from any undesirable attributes may be a bit idealistic. In this paper, we realize over- (resp. below-) threshold PSI, such that the client learns the intersection (or other auxiliary private data) only when $|C \cap S| > t$ (resp. $\leq t$). The threshold corresponds to tunable criteria for (mis)matching, without marking all possible attributes as desired or not. In other words, the matching criteria are in a succinct form and the matching computation does not exhaust the whole universe of attributes. To the best of our knowledge, our constructions are the first solutions to these two open problems posed by Bradley et al. (SCN ’16) and Zhao and Chow (PoPETS ’17), without resorting to the asymptotically less efficient generic approach based on garbled circuits. Moreover, we consider an “outsourced” setting in which a service provider coordinates the PSI execution, instead of requiring two strangers to be online simultaneously to run a highly interactive PSI directly with each other. Outsourcing our protocols is arguably optimal: the two users perform O(|C|) and O(1) decryptions, respectively, for unlocking the private set C and the outcome of the matching.
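The ideal functionality of over-threshold PSI, i.e., what the parties are allowed to learn rather than the cryptographic protocol that realizes it, can be sketched as:

```python
def over_threshold_psi(client_set, server_set, t):
    """Ideal functionality of over-threshold PSI: the client learns
    the intersection C ∩ S only when |C ∩ S| > t; otherwise it learns
    nothing (modelled here as None). The server learns nothing in
    either case."""
    intersection = client_set & server_set
    return intersection if len(intersection) > t else None

# Hypothetical matchmaking attributes.
C = {"hiking", "jazz", "cooking", "chess"}
S = {"hiking", "jazz", "cooking", "travel"}
print(over_threshold_psi(C, S, t=2))  # |C ∩ S| = 3 > 2 -> intersection revealed
print(over_threshold_psi(C, S, t=3))  # 3 is not > 3 -> None
```

Below-threshold PSI is the mirror image: the client learns the output only when the intersection size is at most t.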
Oblivious RAM is a cryptographic primitive that embodies one of the cornerstones of privacy-preserving technologies for database protection. While any Oblivious RAM (ORAM) construction offers access pattern hiding, there does not seem to be a construction that is safe against the potential leakage due to knowledge about the number of accesses performed by a client. Such leakage constitutes a privacy violation, as client data may be stored in a domain-specific fashion. In this work, we examine this leakage by considering an adversary that can probe the server that stores an ORAM database and takes regular snapshots of it. We show that even against such a weak adversary, no major ORAM architecture is resilient, except for the trivial case where the client scans the whole database in order to access a single element. In fact, we argue that constructing a non-trivial ORAM that is formally resilient seems impossible. Moreover, we quantify the leakage of different constructions to show which architecture offers the best privacy in practice.
Nowadays, genomic sequencing has become affordable for many people. As more people have their genomes analyzed, more genomic data is collected. On the positive side, this makes analyses of such data possible. However, it also raises privacy concerns, because genomic data uniquely identifies its owner and contains sensitive information about his/her risk of disease, and even about his/her family members. In this paper, we introduce a highly efficient privacy-preserving protocol for Similar Sequence Queries (SSQs), which can be used to find genetically similar individuals in an outsourced genomic database aggregated from the data of multiple institutions. Our SSQ protocol is based on the edit distance approximation by Asharov et al. (PETS ’18), which we extend to the outsourcing scenario. We also improve their protocol by using more efficient building blocks, achieving a 5-6× run-time improvement over their work in the two-party scenario. Recently, Cheng et al. (ASIACCS ’18) introduced protocols for outsourced SSQs that rely on homomorphic encryption. Our approach outperforms theirs by a factor of more than 20,000× in terms of run-time in the outsourcing scenario.
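For reference, the quantity being approximated under encryption is the edit (Levenshtein) distance between sequences. A standard plaintext dynamic-programming implementation, shown here only as a baseline and not as part of the SSQ protocol itself:

```python
def edit_distance(a, b):
    """Exact Levenshtein distance via the Wagner-Fischer recurrence,
    keeping only the previous DP row (O(len(b)) memory)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # (mis)match
        prev = cur
    return prev[-1]

print(edit_distance("GATTACA", "GACTATA"))  # two substitutions -> 2
```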
SESSION: Session 3: Secure Communication
We study the problem of load-balancing in path selection in anonymous networks such as Tor. We first find that the current Tor path selection strategy can create significant imbalances. We then develop a (locally) optimal algorithm for selecting paths and show, using flow-level simulation, that it results in much better balancing of load across the network. Our initial algorithm uses the complete state of the network, which is impractical in a distributed setting and can compromise users’ privacy. We therefore develop a revised algorithm that relies on a periodic, differentially private summary of the network state to approximate the optimal assignment. Our simulations show that the revised algorithm significantly outperforms the current strategy while maintaining provable privacy guarantees.
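A minimal sketch of the kind of differentially private summary involved, assuming per-relay load counts and the Laplace mechanism (hypothetical parameters; this is not the paper’s exact mechanism). Since the standard library has no Laplace sampler, the noise is drawn as a difference of two exponentials:

```python
import random

def dp_load_summary(relay_loads, epsilon, sensitivity=1.0):
    """Release per-relay load counts with Laplace(0, sensitivity/epsilon)
    noise, an epsilon-differentially-private view of the network state.
    Laplace noise is sampled as the difference of two i.i.d.
    exponentials with mean `scale`."""
    scale = sensitivity / epsilon
    return {relay: load
                   + random.expovariate(1 / scale)
                   - random.expovariate(1 / scale)
            for relay, load in relay_loads.items()}

random.seed(1)  # only to make the illustration reproducible
print(dp_load_summary({"relay_a": 120.0, "relay_b": 40.0}, epsilon=0.5))
```

Clients could then run path selection against the noisy summary instead of the exact network state.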
The social demand for email end-to-end encryption is barely supported by mainstream service providers. Autocrypt is a new community-driven open specification for e-mail encryption that attempts to respond to this demand. In Autocrypt the encryption keys are attached directly to messages, and thus encryption can be implemented by email clients without any collaboration from the providers. The decentralized nature of this in-band key distribution, however, makes it prone to man-in-the-middle attacks and can leak the social graph of users. To address this problem we introduce ClaimChain, a cryptographic construction for privacy-preserving authentication of public keys. Users store claims about their identities and keys, as well as their beliefs about others, in ClaimChains. These chains form authenticated decentralized repositories that enable users to prove the authenticity of both their keys and the keys of their contacts. ClaimChains are encrypted, and therefore protect the stored information, such as keys and contact identities, from prying eyes. At the same time, ClaimChain implements mechanisms that provide strong non-equivocation properties, discouraging malicious actors from distributing conflicting or inauthentic claims. We implemented ClaimChain and show that it offers reasonable performance, low overhead, and authenticity guarantees.
This paper introduces a new attack on recent messaging systems that protect communication metadata. The main observation is that if an adversary manages to compromise a user’s friend, it can use this compromised friend to learn information about the user’s other ongoing conversations. Specifically, the adversary learns whether a user is sending other messages or not, which opens the door to existing intersection and disclosure attacks. To formalize this compromised friend attack, we present an abstract scenario called the exclusive call center problem that captures the attack’s root cause, and demonstrates that it is independent of the particular design or implementation of existing metadata-private messaging systems. We then introduce a new primitive called a private answering machine that can prevent the attack. Unfortunately, building a secure and efficient instance of this primitive under only computational hardness assumptions does not appear possible. Instead, we give a construction under the assumption that users can place a bound on their maximum number of friends and are willing to leak this information.
Website fingerprinting attacks enable a local adversary to determine which website a Tor user visits. In recent years, several researchers have proposed defenses to counter these attacks. However, these defenses have shortcomings: many do not provide formal guarantees of security, incur high latency and bandwidth overheads, and require a frequently-updated database of website traffic patterns. In this work, we introduce a new countermeasure, DynaFlow, based on dynamically-adjusting flows to protect against website fingerprinting. DynaFlow provides a similar level of security as current state-of-the-art while being over 40% more efficient. At the same time, DynaFlow does not require a pre-established database and extends protection to dynamically-generated websites.
SESSION: Session 4: Data and Identity
Enhancing and Evaluating Identity Privacy and Authentication Strength by Utilizing the Identity Ecosystem
This paper presents a novel research model of identity and uses this model to answer several interesting research questions. Information travels through the cyber world, bringing us not only convenience and prosperity but also jeopardy. Protecting this information has been a commonly discussed issue in recent years. One type of such information is Personally Identifiable Information (PII), often used to perform personal authentication. People often give PIIs to organizations, e.g., when applying for a new job or filling out a new application on a website. While the use of such PII might be necessary for authentication, giving PII increases the risk of its exposure to criminals. We introduce two innovative approaches based on our model of identity to help evaluate and find an optimal set of PIIs that satisfies authentication purposes while minimizing the risk of exposure. Our model paves the way for more informed selection of PIIs by organizations that collect them as well as by users who offer PIIs to these organizations.
The promise of big data relies on the release and aggregation of data sets. When these data sets contain sensitive information about individuals, it has been scalable and convenient to protect the privacy of these individuals by de-identification. However, studies show that the combination of de-identified data sets with other data sets risks re-identification of some records. Some studies have shown how to measure this risk in specific contexts where certain types of public data sets (such as voter rolls) are assumed to be available to attackers. To the extent that it can be accomplished, such analyses enable the threat of compromise to be balanced against the benefits of sharing data. For example, a study that might save lives by enabling medical research may be enabled in light of a sufficiently low probability of compromise from sharing de-identified data. In this paper, we introduce a general probabilistic re-identification framework that can be instantiated in specific contexts to estimate the probability of compromise based on explicit assumptions. We further propose a baseline of such assumptions that enables a first-cut estimate of risk for practical case studies. We refer to the framework with these assumptions as the Naive Re-identification Framework (NRF). As a case study, we show how we can apply NRF to analyze and quantify the risk of re-identification arising from releasing de-identified medical data in the context of publicly-available social media data. The results of this case study show that NRF can be used to obtain meaningful quantification of the re-identification risk, compare the risk of different social media, and assess risks of combinations of various demographic attributes and medical conditions that individuals may voluntarily disclose on social media.
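As a toy baseline in the spirit of such frameworks (not the paper’s NRF, which incorporates social-media auxiliary data): under a uniform-guess attacker, a record whose quasi-identifier values are shared by k records is re-identified with probability 1/k, and the average of these probabilities gives an expected risk for the release:

```python
from collections import Counter

def expected_reid_risk(records, quasi_identifiers):
    """Average re-identification probability under a uniform-guess
    attacker: each record contributes 1/k, where k is the size of its
    equivalence class (records sharing the same quasi-identifier
    values)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    return sum(1 / sizes[k] for k in keys) / len(keys)

# Hypothetical de-identified medical records.
data = [
    {"age": 34, "zip": "021", "condition": "flu"},
    {"age": 34, "zip": "021", "condition": "asthma"},
    {"age": 58, "zip": "100", "condition": "flu"},
]
print(expected_reid_risk(data, ["age", "zip"]))  # (1/2 + 1/2 + 1) / 3 ≈ 0.667
```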
When differential privacy was created more than a decade ago, the motivating example was statistics published by an official statistics agency. In attempting to transition differential privacy from the academy to practice, the U.S. Census Bureau has encountered many challenges unanticipated by differential privacy’s creators. These challenges include obtaining qualified personnel and a suitable computing environment, the difficulty of accounting for all uses of the confidential data, the lack of release mechanisms that align with the needs of data users, the expectation on the part of data users that they will have access to micro-data, the difficulty of setting the value of the privacy-loss parameter ε (epsilon), and the lack of tools and trained individuals to verify the correctness of differential privacy implementations.
The results of recent experiments have suggested that code stylometry can successfully identify the author of short programs from among hundreds of candidates with up to 98% precision. This potential ability to discern the programmer of a code sample from a large group of possible authors could have concerning consequences for the open-source community at large, particularly those contributors who may wish to remain anonymous. Recent international events have suggested that the developers of certain anti-censorship and anti-surveillance tools are being targeted by their governments and forced to delete their repositories or face prosecution. In light of this threat to the freedom and privacy of individual programmers around the world, we devised a tool, Style Counsel, to aid programmers in obfuscating their inherent style and imitating another, overt, author’s style in order to protect their anonymity from this forensic technique. Our system utilizes the implicit rules encoded in the decision points of a random forest ensemble to derive a set of recommendations detailing how the user can achieve this obfuscation and mimicry attack.
SESSION: Session 5: Privacy Goals and Strategies
A wide array of Privacy-Enhancing Technologies (PETs) have been proposed as technical measures to provide various levels of privacy protection. Each technical measure is a building block that addresses specific privacy issues and is applicable to specific contexts. Existing approaches, however, do not provide step-by-step guidance to illustrate how these PETs can be appropriately adopted in a contextual and structured manner. From an engineering perspective, it is important to illustrate precisely how to design and implement privacy requirements and incorporate them into software architectures, as well as to choose between alternative PETs. We present an engineering approach to Privacy by Design (PbD) that uses the concept of architectural strategies to support the adoption of PETs in the early stages of the design process to achieve various levels of privacy protection. These strategies are collections of architectural tactics, which are described through design patterns and realised by PETs. We illustrate the approach’s use in the context of eToll pricing systems and argue that this contribution lays the foundation for developing appropriate privacy engineering methodologies.
Use-based privacy restricts how information may be used, making it well-suited for data collection and data analysis applications in networked information systems. This work investigates the feasibility of enforcing use-based privacy in distributed systems with adversarial service providers. Three architectures that use Intel SGX are explored: source-based monitoring, delegated monitoring, and inline monitoring. Trade-offs are explored between deployability, performance, and privacy. Source-based monitoring imposes no burden on application developers and supports legacy applications, but 35-62% latency overhead was observed for simple applications. Delegated monitoring offers the best performance against malicious adversaries, whereas inline monitoring provides performance improvements (0-14% latency overhead compared to a baseline application) in an attenuated threat model. These results provide evidence that use-based privacy might be feasible in distributed systems with active adversaries, but the appropriate architecture will depend on the type of application.
To protect users’ privacy, it is important to understand how they value personal information. Prior work identified how framing effects alter users’ valuations and highlighted the difficulty in eliciting real valuations through user studies under hypothetical circumstances. However, our understanding of users’ valuations remains limited to specific entities, information types, and levels of realism. We examined the effects of realism and purpose of use on users’ valuations of their personal information. Specifically, we conducted an online study in which participants (N=434) were asked to assign monetary value to their personal information in the context of an information marketplace involving different receiving parties, while we experimentally manipulated the level of realism of the scenario and the timing of eliciting valuations. Among our findings is a nuanced understanding of valuation biases, including when they may not apply. For example, we find that, contrary to common belief, participants’ valuations are not generally higher in hypothetical scenarios compared to realistic ones. Importantly, we find that while absolute valuations vary greatly between participants, the order in which users prioritize information types (i.e., users’ relative valuations of different attributes) remains stable across the levels of realism we study. We discuss how our findings inform system design and future studies.