Which countries are mentioned the most on Hacker News?

As I was walking around Osaka with Sacha, we started talking about Hacker News and how there seem to be a lot of popular stories about Japan.

Is the popularity of Japan just our confirmation bias, or would it really top the lists of most popular countries on HN? Since the data is public, I decided to find out.

Most mentioned countries

Here are the countries mentioned in at least 100 stories. Only submitted stories are counted, not comments.

Rank Country Story mentions
# 1 United States 18164
# 2 China 8610
# 3 India 7671
# 4 United Kingdom 6774
# 5 Japan 3029
# 6 Russia 2722
# 7 Canada 2248
# 8 Germany 2248
# 9 Australia 2188
# 10 France 2153
# 11 Israel 1096
# 12 Spain 891
# 13 Brazil 889
# 14 Jordan 878
# 15 Pakistan 850
# 16 Netherlands 812
# 17 Sweden 808
# 18 Greece 767
# 19 Italy 744
# 20 Ireland 688
# 21 Mexico 682
# 22 Switzerland 639
# 23 Singapore 608
# 24 Turkey 604
# 25 Ukraine 587
# 26 Egypt 536
# 27 Malaysia 443
# 28 Norway 434
# 29 Indonesia 428
# 30 Vietnam 386
# 31 Philippines 356
# 32 Chile 342
# 33 Thailand 337
# 34 Finland 334
# 35 Argentina 328
# 36 Afghanistan 327
# 37 Nigeria 326
# 38 Iraq 307
# 39 Saudi Arabia 293
# 40 Georgia 278
# 41 Poland 273
# 42 Iceland 257
# 43 Denmark 233
# 44 Kenya 219
# 45 Estonia 214
# 46 Nepal 209
# 47 Taiwan 195
# 48 Portugal 193
# 49 Haiti 183
# 50 Libya 165
# 51 Belgium 156
# 52 Romania 152
# 53 Venezuela 152
# 54 Ecuador 142
# 55 Antarctica 126
# 56 Bangladesh 123
# 57 Cyprus 120
# 58 Hungary 117
# 59 Austria 111

Average score of post vs. country

Now we know which countries are mentioned the most, but how about upvotes? Which countries have the highest average number of upvotes per post? For this I only included the countries from the previous list, since rarely mentioned countries would have added noise.

Rank Country Story mentions Average score
# 1 Ecuador 142 22.94
# 2 Norway 434 15.72
# 3 Germany 2248 14.86
# 4 Austria 111 14.32
# 5 Sweden 808 14.11
# 6 Iceland 257 14.04
# 7 Venezuela 152 13.33
# 8 United States 18164 13.01
# 9 Denmark 233 12.84
# 10 Finland 334 12.73
# 11 Netherlands 812 12.58
# 12 Switzerland 639 12.51
# 13 Japan 3029 11.91
# 14 Chile 342 11.42
# 15 Russia 2722 11.11
# 16 Libya 165 10.95
# 17 Poland 273 10.58
# 18 France 2153 10.57
# 19 Kenya 219 10.25
# 20 Afghanistan 327 10.14
# 21 Romania 152 10.02
# 22 Georgia 278 9.88
# 23 Haiti 183 9.73
# 24 Iraq 307 9.61
# 25 Estonia 214 9.57
# 26 Mexico 682 9.42
# 27 Argentina 328 9.39
# 28 Greece 767 9.33
# 29 Hungary 117 9.21
# 30 Turkey 604 9.14
# 31 Brazil 889 9.08
# 32 Egypt 536 8.96
# 33 Saudi Arabia 293 8.73
# 34 United Kingdom 6774 8.72
# 35 Malaysia 443 8.7
# 36 Antarctica 126 8.56
# 37 China 8610 8.55
# 38 Cyprus 120 8.38
# 39 Canada 2248 8.26
# 40 Nigeria 326 8.06
# 41 Nepal 209 7.72
# 42 Singapore 608 7.6
# 43 Italy 744 7.45
# 44 Belgium 156 7.37
# 45 Australia 2188 7.07
# 46 Portugal 193 7.07
# 47 Israel 1096 6.99
# 48 Ireland 688 6.79
# 49 Thailand 337 6.67
# 50 Ukraine 587 6.48
# 51 Bangladesh 123 6.48
# 52 Spain 891 6.46
# 53 India 7671 6.1
# 54 Taiwan 195 5.24
# 55 Vietnam 386 4.92
# 56 Pakistan 850 4.65
# 57 Philippines 356 4.38
# 58 Indonesia 428 2.67
# 59 Jordan 878 2.3
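
The filtering and averaging behind the table above can be sketched roughly like this. This is not the query I ran on BigQuery, just a minimal Python illustration; the function name rank_by_average_score and the (country, score) input shape are assumptions made up for the example.

```python
from collections import defaultdict


def rank_by_average_score(mentions, min_mentions=100):
    """Rank countries by average story score.

    `mentions` is an iterable of (country, score) pairs, one pair per
    story that mentions the country (hypothetical input shape).
    Countries with fewer than `min_mentions` stories are dropped,
    since their averages would mostly be noise.
    """
    totals = defaultdict(lambda: [0, 0])  # country -> [story count, score sum]
    for country, score in mentions:
        totals[country][0] += 1
        totals[country][1] += score

    rows = [
        (country, count, round(score_sum / count, 2))
        for country, (count, score_sum) in totals.items()
        if count >= min_mentions
    ]
    # Highest average score first
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Calling it with min_mentions=100 mirrors the cutoff used for both tables; a lower cutoff lets in countries whose average hinges on one or two stories.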

Conclusions

What is going on with Ecuador? Snowden. Here are the stories mentioning it.

Japan really is popular: it ranks fifth by number of submissions, and those posts do fairly well too, ranking 13th by average score. I did notice India and China appearing often in stories, but didn't expect them to beat Japan.

Notes on method

I thought this blog post would be a 2-hour project. The data is readily available: you can either download all HN stories in JSON format (1.1GB uncompressed) or use Google BigQuery, which already has the table available. I went with the latter route.

At first I thought I would just match stories against country strings like "Germany" and "Japan", but then I realized I should probably include "German" and "Japanese" in there as well. For multi-word country names I also wanted to count abbreviations ("US", "U.S.", "USA", "U.S.A."). The query already removes dots, so it was enough to have "US" and "USA" in the list. To prevent "US" from also matching "us", I decided to keep the matching case-sensitive.

In the end, to determine whether a story might be talking about a certain country, I made a list of strings which map to country codes. The list has both country names ("Japan") and demonyms ("Japanese"). Here is my whole mapping.

I decided not to include "English" as a word for UK, because it more commonly refers to the language. Since it was ambiguous what "Korea" appearing alone would refer to, I didn't count that as referring to anything.
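
To make the matching rules above concrete, here is a minimal Python sketch of the idea. It is not the BigQuery query itself; TERM_TO_COUNTRY is a tiny hypothetical excerpt rather than my whole mapping, and countries_in_title is a name made up for this example. It assumes whole-word matching, which is one reasonable way to keep "US" from matching inside longer words.

```python
import re

# Hypothetical excerpt of the term -> country code mapping.
# Both country names and demonyms map to the same code.
TERM_TO_COUNTRY = {
    "Japan": "JP",
    "Japanese": "JP",
    "Germany": "DE",
    "German": "DE",
    "United States": "US",
    "US": "US",
    "USA": "US",
}

# Whole-word, case-sensitive pattern; longer terms are listed first.
_TERMS = sorted(TERM_TO_COUNTRY, key=len, reverse=True)
_PATTERN = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in _TERMS) + r")\b")


def countries_in_title(title):
    """Return the set of country codes mentioned in a story title."""
    # Remove dots first so "U.S." and "U.S.A." collapse to "US" and "USA".
    cleaned = title.replace(".", "")
    return {TERM_TO_COUNTRY[m.group(0)] for m in _PATTERN.finditer(cleaned)}


if __name__ == "__main__":
    print(countries_in_title("Why U.S.A. startups hire Japanese engineers"))
    print(countries_in_title("Tell us about German cars"))
```

Because the pattern is case-sensitive, the lowercase "us" in the second title is ignored while "German" still counts. Terms that are left out of the mapping, such as "English" or a bare "Korea", simply never match.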

In the end I spent two evenings creating the country synonym list, reading up on how JOINs and subselects work on BigQuery (finally ending up with this query) and then composing the final post along with maps and formatting. I already had the country flags, as I was using them for the geoip part of Candy Japan.

Hope you liked it!


I also made one for Reddit.