What Really Caused Data Breaches in 2020?
First, a little background
Verizon’s 2021 Data Breach Investigations Report (DBIR) [1], an industry publication that analyzes cybersecurity incident and breach data from around the world, found that over 99% of all incident and breach events fall into one of only eight major categories. According to DBIR, social engineering and basic web application attacks account for over 50% of all incidents and breaches. Figure 5 from the report (shown below) shows the relative proportion of causes that fall within each classification pattern.
Source: DBIR [1]
DBIR incident and breach classification patterns come from clustering the data, which differs from how the industry tends to group incidents and breaches when we talk about cybersecurity. Here are some examples of the types of incidents and breaches that fall within each group:
- Social engineering: phishing emails 
- Basic web application attacks: SQL Injection 
- System intrusion: ransomware, malware, stolen credentials, hackers 
- Miscellaneous errors: misconfigurations 
- Privilege misuse: disgruntled employee data leak 
- Lost and stolen assets: stolen laptop or phone 
- Denial of service: DDoS attacks 
- Everything else: ATM card skimmers 
The DBIR results surprised some of us. When we thought about 2020, it felt like hackers and ransomware should have been at the top (these fall under system intrusion). But that isn’t what Verizon’s data showed. What influences our perception of cybersecurity breach causes? Do those perceptions reflect reality? If the news media and the internet are leading us astray, maybe we can pinpoint what those influences are and systematically reduce our bias on those topics.
Our way of addressing the question of what we perceive as cybersecurity incident and breach causes was inspired by Hannah Ritchie and Max Roser’s 2018 article [2], which compared causes of death data to what the New York Times, The Guardian, and Google Trends covered as causes of death. We adapted their approach to investigate the perceptions of cybersecurity incident and breach causes. We looked at five sources of data (New York Times, The Guardian, Google Trends, Google Search, and scite.ai) to compare how frequently their cybersecurity incident and breach coverage from 2020 fell into each one of DBIR’s classification patterns.
Our Analysis
Perceptions of Data Breach Causes
We were interested in comparing what DBIR, Google, news outlets, and academia reported as the causes of data breaches in 2020. To that end, we compiled keywords and search terms related to cybersecurity incidents and breaches for each of the eight DBIR categories. We searched for those terms across Google Trends, Google Search, New York Times, The Guardian, and scite.ai using each outlet’s Application Programming Interface (API), inspected the search results for semantic context (e.g., phish email, not Phish the band), and tallied the number of relevant hits in each category by outlet. Since we were interested in relative shares, we normalized the remaining counts by dividing each category's count by the outlet's total for the year. Here is what we found:
People and news media create and engage with cybersecurity content for many reasons. Since our approach here was exploratory, we assume that people tend to turn their attention, and subsequently dedicate their resources, to cybersecurity content that they feel will solve their most pressing cybersecurity problems. We also assume here that DBIR captures the “true” causes of breaches, as opposed to what people feel (and would Google) as the cause of breaches.
Here are some insights we can draw from our analysis:
What caused data breaches in 2020?
DBIR listed social engineering as the top cause of breaches in 2020, followed by basic web application attacks and system intrusion (which includes hackers and malware such as ransomware). Let's use that as our baseline.
What did “the internet” think was causing breaches in 2020?
People searching Google seemed largely preoccupied with system intrusion events due to malware, hackers, and the like. Journalists seemed to agree and published quite a number of articles following that trend.
What did the news report as having caused breaches in 2020?
New York Times and The Guardian coverage mostly agreed with what people were searching for. A curious discrepancy is that denial of service attacks made up a much smaller share of The Guardian's coverage (1%) than of the New York Times' coverage (12%).
What did academic research say about data breaches in 2020?
Academic publications aligned most closely with DBIR in terms of coverage proportion across topics. Miscellaneous errors like user error apparently aren’t the hottest topic among researchers. Instead, researchers spent most of their share on system intrusion events followed by denial of service events.
What content was available to us in 2020 if we did a Google search?
Nearly half of the Googleable internet’s share of these topics was content mentioning lost and stolen assets, particularly smartphones. System intrusion was the next most Googleable topic.
Did popular Google searches align with DBIR breach causes or did they miss the mark?
What we googled was system intrusion, social engineering, and denial of service information. What we got was a lot of “How to locate your stolen iPhone.”
Our Methodology
To perform the searches, we compiled a list of keywords and search terms (like “phishing” or “social engineering”) for each of the eight DBIR cybersecurity incident and breach cause groups. We collected the initial keywords and terms from the 2021 DBIR report, the National Institute of Standards and Technology (NIST) Glossary of Key Information Security Terms [3], and from cybersecurity professionals at Hive Systems. We entered the lists into Google Trends and added the resulting related queries suggested by Google to the respective lists. We repeated the process until the Google Trends related query results were no longer related to the respective DBIR cause groups. This was particularly helpful because it ensured we included terms like “wannacry” in our searches for “ransomware,” which are the terms people actually used in their search queries.
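The repeat-until-nothing-new process described above can be sketched as a simple loop. This is a minimal illustration, not the code we ran: `related_queries` and `is_relevant` are hypothetical stand-ins for Google Trends' related-query suggestions and for our manual relevance judgment, and the toy data is invented for the example.

```python
from typing import Callable, Iterable, List, Set


def expand_keywords(
    seeds: Iterable[str],
    related_queries: Callable[[str], List[str]],
    is_relevant: Callable[[str], bool],
) -> Set[str]:
    """Grow a keyword list until no new relevant related queries appear."""
    keywords = {seed.lower() for seed in seeds}
    frontier = set(keywords)  # terms whose related queries we have not checked yet
    while frontier:
        discovered = set()
        for term in frontier:
            for query in related_queries(term):
                query = query.lower()
                # Keep only unseen queries that a reviewer judged on-topic.
                if query not in keywords and is_relevant(query):
                    discovered.add(query)
        keywords |= discovered
        frontier = discovered  # next round only looks at the newly added terms
    return keywords


# Hypothetical related-query data standing in for Google Trends suggestions.
TOY_RELATED = {
    "ransomware": ["wannacry", "ransomware attack", "what is bitcoin"],
    "wannacry": ["ransomware", "wannacry patch"],
}


def toy_related(term: str) -> List[str]:
    return TOY_RELATED.get(term, [])


def toy_is_relevant(query: str) -> bool:
    # Stand-in for a human deciding whether a query belongs to the cause group.
    return "bitcoin" not in query


result = expand_keywords(["ransomware"], toy_related, toy_is_relevant)
print(sorted(result))
```

The loop stops on its own once a round of suggestions contributes no new on-topic terms, which mirrors the stopping rule we used by hand.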
Counting Google Trends Results
We first entered the keywords and terms into Google Trends, but got more relevant results by using Google Trends’ Topics feature. We went with Topics because the feature includes synonyms as well as multiple languages. We also tested all combinations of Google Trends Topics and Categories to find which settings yielded the most accurate results and the largest number of results. Trends Topics are useful because they include concepts based on your keywords. For example, if you use the topic “London,” Google silently includes results for searches of “Capital of the UK” and the words for London in other languages. Trends Categories are also useful because they narrow down concepts to your subject of interest. For example, searching for the term “jaguar” yields results for both the animal and the car manufacturer of the same name unless you specify a category. What yielded the most results and the fewest errors was using the category “Computers & Electronics” and the Trends Topics “Web application security,” “phishing,” “security hacker,” “insider threat,” and “denial-of-service attack.”
Counting Google Search Results
We sampled Google search results for relevance by reviewing the last pages in each result set, which are the pages that Google finds least relevant. We counted the results on pages whose results were related to the topic being queried. We based relevance on the Google preview of the content.
Counting New York Times and The Guardian Results
We used the New York Times and The Guardian APIs to see how many 2020 articles contained each keyword and search term. Starting with the search terms that yielded the highest number of results, we manually reviewed headlines for their relevance to each topic. If an article was not on topic, it was not counted. We reviewed each individual result for relevance, with the exception of results for system intrusion, which were too numerous. Instead, we randomly sampled system intrusion results until at least 10% of the total were reviewed. This resulted in a random sample and manual review of 214 New York Times articles out of 2136 total results, and 465 The Guardian articles out of 4648 total results.
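As a rough sketch of the sampling step, the snippet below draws a random sample covering at least 10% of a result set, rounding up so the sample never falls short of the target. The function name, the seed, and the placeholder article IDs are assumptions made for illustration; only the totals (2136 and 4648) and the resulting sample sizes come from our analysis.

```python
import random
from typing import List, Sequence


def sample_for_review(
    results: Sequence[str], min_percent: int = 10, seed: int = 0
) -> List[str]:
    """Randomly pick at least `min_percent` percent of results, without replacement."""
    # Integer ceiling division: round the sample size up, never down.
    size = -(-len(results) * min_percent // 100)
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(results, size)


# Placeholder IDs standing in for the system intrusion results from each API.
nyt_results = [f"nyt-{i}" for i in range(2136)]
guardian_results = [f"guardian-{i}" for i in range(4648)]

nyt_sample = sample_for_review(nyt_results)
guardian_sample = sample_for_review(guardian_results)

print(len(nyt_sample), len(guardian_sample))  # 214 465
```

Rounding up is what turns 213.6 into 214 and 464.8 into 465, matching the sample sizes reported above.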
Normalizing the Counts
Since we were interested in relative shares of cybersecurity incident and breach causes by cause group, we normalized each data source's values by dividing the number of reports in each category by the sum of all reports for the year 2020.
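The normalization itself is a single division per category. Here is a minimal sketch; the counts are made-up illustrative numbers, not our data, and `normalize_shares` is a name chosen for this example.

```python
from typing import Dict


def normalize_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert one data source's per-category counts into shares that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize: the data source has no counted results")
    return {category: count / total for category, count in counts.items()}


# Hypothetical 2020 counts for a single data source (illustrative only).
example_counts = {
    "Social engineering": 30,
    "System intrusion": 50,
    "Denial of service": 20,
}

shares = normalize_shares(example_counts)
for category, share in shares.items():
    print(f"{category}: {share:.0%}")
```

Because each source is normalized against its own total, a source with thousands of results and a source with a few dozen can be compared on the same 0 to 100% scale.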
Caveats and Limitations
- Our results are based on a convenience sample and we cannot claim that our findings are representative or generalizable. 
- We did our best to put together an extensive list of incident and breach cause synonyms, including colloquialisms, but don’t claim to have used an exhaustive list. 
- None of the data sources we used provide raw output (even with their APIs), so the algorithms and biases of the sources will have influenced these findings. 
- Google Trends allowed us to pool results from multiple languages, but other sources presumably did not. Google Trends results may be skewed toward results in languages that happen to talk more about one of the topics. 
- We evaluated Google Search results using the Google summary blurb, but it is possible the blurbs were misleading and that reviewing the whole article is necessary. We were unable to do so due to volume, but could use sampling in the future. 
- Cybersecurity professionals and people under cyber attack may be more discreet in their searches and opt for search engines like DuckDuckGo. 
- The search engines and news sites we chose are obviously not the only options. We chose them for their convenience (how many search engines have something like Google Trends?) and because of their prominence and coverage. 
- If you find a good informative result or article for your search, do you keep Googling it? Any topics people found quickly, they presumably searched for less. This means that good writers are throwing off our stats! 
What do you think? Tell us!
Join the discussion below and keep us honest.
If there’s enough interest, we may add more stacks to the graphic like Wikipedia, Reddit, LinkedIn, Google patent search, security conferences, WayBack Machine, other search engines etc.
References
[1] Widup, Suzanne & Pinto, Alex & Hylender, David & Bassett, Gabriel & Langlois, Philippe. (2021). 2021 Verizon Data Breach Investigations Report. Retrieved from: 'https://verizon.com/dbir'.
[2] Hannah Ritchie and Max Roser. (2018). "Causes of Death". Published online at OurWorldInData.org. Retrieved from: 'https://ourworldindata.org/causes-of-death'.
[3] National Institute of Standards and Technology. (2019). Glossary of Key Information Security Terms. Retrieved from: 'https://csrc.nist.gov/glossary'.
 
![Source: DBIR [1]](https://images.squarespace-cdn.com/content/v1/5ffe234606e5ec7bfc57a7a3/1631583013118-RALE4BIWCR08KZHTXAK4/DBIRfigure5.png)