Monday, April 29, 2013

How big does your city have to be in order to make Sentiment Analysis worthwhile?

I wrote earlier about a solution I helped develop which allows city leaders to monitor the sentiment being expressed online about their city. As we present this solution to the leaders of various cities, one question always comes up: is their city well known enough to generate enough mentions for the sentiment charts to be statistically significant?

The general rule of thumb we have been using is that a city must have a population of at least 0.25 million for the tool to be feasible. What really matters is the number of mentions of the city online (we would hope for at least 5k per week), and often (but not always) the population of a city is a rough guide to how many mentions it is likely to get. I therefore decided to run a quick test to see how many mentions I would find, in the first week of April this year, for a pseudo-random selection of cities with both large and small populations (some of the smaller places are not technically cities).

This table summarises the results:

City              Population   Mentions/week
Loughrea               5,057             102
Birr                   5,818             291
Nazareth              14,123           3,873
Clemmons              18,627             224
Bethlehem             25,266           3,661
Navan                 28,158             775
Dundalk               31,149             898
Lorient               58,135           1,971
Galway                75,529           4,758
Cergy-Pontoise       183,430             235
Bordeaux             235,891           9,555
Montpellier          255,080           6,539
Toulouse             449,328          11,933
Dublin               527,612          29,859
Boston               625,087         111,035
Jerusalem            801,000          20,767
Paris              2,234,105         178,406
Sydney             4,627,345          52,791
London             8,173,194         257,094
Bangalore          8,474,970          16,455
Moscow            11,503,501          49,534
Tokyo             13,185,502         216,606
New York          19,570,261         440,535
Beijing           20,693,000         274,062

This can be visualised by the following chart:


I think you can see that there is a correlation between city size and the number of mentions (correlation coefficient = 0.83). You can also see that Galway gets roughly enough mentions to make sentiment analysis useful despite having a population of only 75k, while Cergy-Pontoise has more than double the population but does not get enough internet mentions to make sentiment monitoring useful.
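For anyone who wants to repeat the calculation, here is a minimal sketch in Python. It assumes the coefficient is a plain Pearson correlation on the raw (not log-scaled) figures from the table above, and it also compares the 0.25 million population rule of thumb against the 5k mentions per week target. The names `CITIES`, `pearson`, `MIN_POPULATION` and `MIN_MENTIONS` are my own, chosen for illustration.

```python
from math import sqrt

# (city, population, mentions per week), copied from the table above
CITIES = [
    ("Loughrea", 5_057, 102), ("Birr", 5_818, 291),
    ("Nazareth", 14_123, 3_873), ("Clemmons", 18_627, 224),
    ("Bethlehem", 25_266, 3_661), ("Navan", 28_158, 775),
    ("Dundalk", 31_149, 898), ("Lorient", 58_135, 1_971),
    ("Galway", 75_529, 4_758), ("Cergy-Pontoise", 183_430, 235),
    ("Bordeaux", 235_891, 9_555), ("Montpellier", 255_080, 6_539),
    ("Toulouse", 449_328, 11_933), ("Dublin", 527_612, 29_859),
    ("Boston", 625_087, 111_035), ("Jerusalem", 801_000, 20_767),
    ("Paris", 2_234_105, 178_406), ("Sydney", 4_627_345, 52_791),
    ("London", 8_173_194, 257_094), ("Bangalore", 8_474_970, 16_455),
    ("Moscow", 11_503_501, 49_534), ("Tokyo", 13_185_502, 216_606),
    ("New York", 19_570_261, 440_535), ("Beijing", 20_693_000, 274_062),
]

# Thresholds taken from the rule of thumb in the post
MIN_POPULATION = 250_000
MIN_MENTIONS = 5_000


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


populations = [pop for _, pop, _ in CITIES]
mentions = [count for _, _, count in CITIES]
r = pearson(populations, mentions)

# Cities where the population rule and the mentions target give different answers
disagreements = [
    name for name, pop, count in CITIES
    if (pop >= MIN_POPULATION) != (count >= MIN_MENTIONS)
]

print(f"Pearson r = {r:.2f}")
print("Population rule disagrees with the mentions target for:", disagreements)
```

On these 24 data points the coefficient should land close to the 0.83 quoted above. With such a small sample dominated by a handful of very large cities, it is probably best treated as a rough indication only.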

A few examples of cities that get a number of mentions very different from what their population would predict:
  • Both Bethlehem and Nazareth get many more mentions than would be predicted by their population (e.g. they are both mentioned significantly more often than Navan and Dundalk, which have larger populations). This is probably due to the biblical significance of the towns - in fact this is why I chose them for inclusion in the test, and I don't know the names of any other towns in the Middle East with such small populations.
  • Where the name of the city in the local language was different from the name in English, I searched on both versions of the name. In general the local language version received more hits (e.g. there were 6.5 times as many mentions of 北京 as there were of Beijing). However, for the Israeli cities it was the other way around. For example, the word "Jerusalem" got 20,711 mentions while the Arabic and Hebrew versions of the city name had only 17 and 39 mentions respectively. Perhaps this is an indication that people in other parts of the world are talking about the city much more than the locals are.
  • Cergy-Pontoise only gets 235 mentions, while Lorient gets 1,971 mentions despite having a smaller population. I am not sure why this should be the case, but perhaps it is because Cergy-Pontoise is so close to Paris that local residents consider themselves to be Parisians. Lorient has no similar large city nearby to overshadow it.
  • The statistics will vary over time. For example, if I had run my test for the third week in April rather than the first, the number of mentions for Boston would have been 1,893,159 rather than 111,035 - probably due to coverage of the marathon bombing.
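To put rough numbers on two of the points above, here is a small sketch in Python that uses only figures quoted in this post. It assumes the Jerusalem figure in the table is the sum of the English, Arabic and Hebrew counts, which the arithmetic supports (20,711 + 17 + 39 = 20,767). The variable names are just for illustration.

```python
# Mentions of Jerusalem by language of the city name, as quoted above
jerusalem_by_name = {"English": 20_711, "Arabic": 17, "Hebrew": 39}
jerusalem_total = sum(jerusalem_by_name.values())
english_share = jerusalem_by_name["English"] / jerusalem_total

# Boston mentions in the first and third weeks of April, as quoted above
boston_first_week = 111_035
boston_third_week = 1_893_159
boston_spike = boston_third_week / boston_first_week

print(f"Jerusalem total: {jerusalem_total:,} ({english_share:.1%} in English)")
print(f"Boston, third week vs first week: {boston_spike:.1f}x")
```

So well over 99% of the Jerusalem mentions used the English name, and the Boston count in the third week was roughly 17 times the first-week count. A one-week sample can clearly be swamped by a single news event.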
Notes:
  • In the case of some of the cities I chose, there are multiple cities with the same name - for example, the Wikipedia disambiguation page for Boston lists several cities with this name, but I only counted the population of the capital of Massachusetts (the population of the other cities would probably not be very large).
  • Wikipedia sometimes has several different estimates of population because of ambiguity about how large an area to include. I only considered the first number listed (which is typically the smallest). In some cases the difference might be minor, but in others it would be very significant. For example, the estimates of the population of Boston vary by an order of magnitude, from 625 thousand to 7.6 million.
