Swedes Online: You Are More Tracked Than You Think

joelpurra.com

joelpurra.com/r/masters

Originally presented by Joel Purra as his master's thesis defense, 2015-02-19 at Linköping University in Sweden. A video recording is also available.

Change slide by swiping or using the ← → direction buttons on your keyboard. Press s to toggle showing short slide notes in a separate tab.

What is tracking?

When browsing online, information is recorded by the servers you communicate directly with. When visiting a website, resources from other services might be requested as well -- with or without being visible.

Assumption: all third-party resources have server logs and/or analytics software to record your online habits.

What does this thesis show?

By downloading websites from over 150,000 domains, the thesis shows how common third-party resource usage is.

Domain classes: Random, top and curated. Focus is on Sweden.

Randomly selected domains

  • 100,000 .se
  • 10,000 .dk
  • 10,000 .com
  • 10,000 .net

From Alexa's global top 1,000,000

  • 10,000 in the very top
  • 10,000 randomly selected
  • 3,400 .se
  • 2,600 .dk

Important in Sweden

  • Counties, municipalities, public authorities, and higher education
  • Financial services, government-owned corporations, media
  • Domain registrars, ISPs

Reach50's top list

  • 50 most popular domains in Sweden

What are third-party resources?

A resource belonging to the origin's primary domain is called internal. Otherwise it's an external resource.

Assumption: any external resource is a third-party resource.

Domain examples

  • example.se (primary domain)
  • www.example.se (subdomain)
  • example.org (third-party domain)
  • doubleclick.net (known tracker domain)
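
A minimal sketch of the internal/external split described above, in Python. It assumes the primary domain is simply the last two labels of the hostname, which holds for example.se but not for suffixes such as co.uk (a real implementation would need a public suffix list). The function names and the one-entry tracker set are illustrative assumptions, not the thesis's code.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical one-entry stand-in for a real blocking list.
KNOWN_TRACKER_DOMAINS = {"doubleclick.net"}


def primary_domain(hostname: str) -> str:
    # Naive assumption: the primary domain is the last two labels.
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def resource_primary_domain(origin_url: str, resource_url: str) -> str:
    # Resolve relative and protocol-relative URLs against the origin first.
    absolute = urljoin(origin_url, resource_url)
    return primary_domain(urlsplit(absolute).hostname or "")


def classify(origin_url: str, resource_url: str) -> str:
    origin = primary_domain(urlsplit(origin_url).hostname or "")
    if resource_primary_domain(origin_url, resource_url) == origin:
        return "internal"
    return "external"


def is_known_tracker(origin_url: str, resource_url: str) -> bool:
    domain = resource_primary_domain(origin_url, resource_url)
    return domain in KNOWN_TRACKER_DOMAINS
```

With http://example.se/ as the origin, a resource on www.example.se is classified as internal, one on example.org as external, and one on ad.doubleclick.net as external and a known tracker.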

Resource examples

  • Branded (videos, services, images)
  • Unbranded (fonts, useful scripts, images)
  • Ads (scripts, images, flash)
  • Web beacons (hidden images, analytics scripts)

What is passive tracking?

Collecting information that is required to retrieve the resource as part of the HTTP standard, or inferrable from observing network traffic.

Anyone can listen in anywhere along the network path, unless HTTPS is used. HTTPS prevents passive tracking of the following:

Browsing without HTTPS and TOR © EFF.org (CC-BY)
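
To make "information that is required to retrieve the resource" concrete, here is a rough Python sketch of what a server could log from a single plain HTTP request. The chosen fields, the function name, and the example request (including the tracker.example host) are assumptions for illustration, not the thesis's definition of passive properties.

```python
def passive_properties(client_ip: str, raw_request: str) -> dict:
    # Split the request head into the request line and the header lines.
    head = raw_request.split("\r\n\r\n", 1)[0]
    request_line, *header_lines = head.split("\r\n")
    method, path, _version = request_line.split(" ", 2)

    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    # Everything below arrives with the request; no script has to run.
    return {
        "ip": client_ip,
        "method": method,
        "resource": headers.get("host", "") + path,
        "referer": headers.get("referer"),
        "user_agent": headers.get("user-agent"),
        "language": headers.get("accept-language"),
        "cookie": headers.get("cookie"),
    }


# Hypothetical request for a hidden image embedded on example.se.
EXAMPLE_REQUEST = (
    "GET /pixel.gif?id=123 HTTP/1.1\r\n"
    "Host: tracker.example\r\n"
    "Referer: http://example.se/article\r\n"
    "User-Agent: ExampleBrowser/1.0\r\n"
    "Accept-Language: sv-SE\r\n"
    "Cookie: uid=abc\r\n"
    "\r\n"
)
```

Calling passive_properties("192.0.2.1", EXAMPLE_REQUEST) returns the visitor's IP address, the page the visitor came from (via the Referer header), the browser identification, the preferred language, and any cookie previously set by the third party.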

What is active tracking?

A script or plugin executed in the browser to extract and collect extended information. Such a script can also collect all of the passive properties, even when HTTPS is used.

Why is tracking used?

Information is collected and stored to gain knowledge about the visitors a website has. The purpose differs depending on the perspective.

On to results

Domains per organization

Distribution of domains per organization in the Disconnect.me blocking list.

521 out of 980 organizations have 1 domain, 331 have 2 domains.

Google has 271, Yahoo 71, AOL 40, Microsoft 32.
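
The distribution above is a count of organizations grouped by how many domains each one has. A small Python sketch, assuming the list has already been loaded into a mapping from organization name to its domains. The mapping shape and the toy entries are made up and do not reflect the actual Disconnect.me file format or data.

```python
from collections import Counter


def domains_per_organization(org_domains: dict) -> Counter:
    # Key: number of distinct domains, value: number of organizations.
    return Counter(len(set(domains)) for domains in org_domains.values())


# Made-up toy data, only to show the shape of the calculation.
TOY_LIST = {
    "OrgA": ["a.example"],
    "OrgB": ["b.example", "b-cdn.example"],
    "OrgC": ["c.example"],
}

distribution = domains_per_organization(TOY_LIST)
```

For the toy data, two organizations have 1 domain and one organization has 2 domains.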

Why is tracking bad?

What can tracking lead to?

Thank you!

Thesis supervision: Niklas Carlsson, Associate Professor, IDA. Patrik Wallström, Project Manager within R&D, .SE. Staffan Hagnell, Head of New Businesses, .SE. Anton Nilsson, opponent. Thank you!

Domains, data and software: .SE (Richard Isberg, Tobbe Carlsson, Anne-Marie Eklund-Löwinder, Erika Lund), DK Hostmaster A/S (Steen Vincentz Jensen, Lise Fuhr), Reach50/Webmie (Mika Wenell, Jyry Suvilehto), Alexa, Verisign. Disconnect.me, Mozilla. PhantomJS, jq, GNU Parallel, LyX. Thank you!

Tips, feedback, inspiration and help: Dwight Hunter, Peter Forsman, Linus Nordberg, Pamela Davidsson, Lennart Bonnevier, Isabelle Edlund, Amar Andersson, Per-Ola Mjömark, Elisabeth Nilsson, Mats Dufberg, Ana Rodriguez Garcia, Stanley Greenstein, Markus Bylund. Thank you!

And of course everyone I forgot to mention – sorry and thank you!

More!

Open source

Open datasets

Other findings

There are at least as many external resources, meaning as much tracking, on secure top domains as on insecure ones (Figure 4.3(a))

Swedish top/curated domain findings

Swedish media seems very social, with the highest Twitter and Facebook coverage (C.11.4)

Other findings

50% of top sites always redirect to the www subdomain, 13% always redirect to their primary domain (C.8)

Random .se domain findings

58% use content from known trackers (C.11.3)

Swedish top/curated domain findings

Over 90% of most categories' domains rely on external resources – external resources are considered trackers (C.4)

Random .se domain findings

Use more external resources than .dk, but fewer than .com and .net (C.4)

Other findings

A few global top domains load more than 75 known trackers on their front page alone (C.11.1)

Random .se domain findings

Disconnect only detects 3% of external primary domains as trackers (4.3.4)

Other findings

Disconnect's blocking list only detects 10% of external primary domains as trackers for top website datasets (4.3.4)

Other findings

Twitter has about half the coverage of Facebook (C.11.4)

Swedish top/curated domain findings

A single visit to each media site would leak information to at least 57 organizations (C.11.1)

Other findings

78% of 123,000 HTTP-www variation domains call external domains (C.5)
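
The variation names used here (HTTP-www, HTTPS-www) suggest that each domain was requested under several scheme and www-prefix combinations. A Python sketch assuming four combinations per domain; the function name and the exact URL form are illustrative assumptions.

```python
def domain_variations(domain: str) -> list:
    # Combine each scheme with and without the www prefix.
    return [
        f"{scheme}://{prefix}{domain}/"
        for scheme in ("http", "https")
        for prefix in ("", "www.")
    ]
```

For example.se this yields http://example.se/, http://www.example.se/, https://example.se/ and https://www.example.se/.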

Random .se domain findings

Many random domains use only external resources due to being parked (4.2) or redirecting away from the origin domain (C.8)

Swedish top/curated domain findings

25% of Swedish municipalities responding to secure requests load 90% of their resources securely – it's close, but still considered insecure (Figure 4.3(b))
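
Why 90% still counts as insecure: a page is only fully secure if every resource it requests is loaded over HTTPS. A Python sketch of that rule, assuming the list of requested resource URLs is already available; the function names are illustrative.

```python
from urllib.parse import urlsplit


def secure_share(resource_urls: list) -> float:
    # Fraction of requested resources that use the https scheme.
    if not resource_urls:
        raise ValueError("no resources to evaluate")
    secure = sum(1 for url in resource_urls if urlsplit(url).scheme == "https")
    return secure / len(resource_urls)


def is_fully_secure(resource_urls: list) -> bool:
    # A single insecure resource is enough to make the page insecure.
    return all(urlsplit(url).scheme == "https" for url in resource_urls)
```

A page loading nine resources over HTTPS and one over HTTP has a secure share of 0.9 but is not fully secure.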

Swedish top/curated domain findings

Financial institutions redirect from secure to insecure sites for 20% of responding domains (C.8)

Other findings

94% of 5,959 HTTPS-www variation domains call external domains (C.5)

Swedish top/curated domain findings

70% use content from known trackers (C.11.3)

Random .se domain findings

39% use only external resources (Figure 4.2(a))

Random .se domain findings

Over 40% use Google Analytics or Google API (C.11.2)

Swedish top/curated domain findings

Only 13 of 290 municipalities have fully secure websites; no Swedish media sites are completely secure (C.7)

Random .se domain findings

Only 0.3% respond to secure requests, in line with .dk and .net, while .com has 0.5-0.6% response rate (C.2)