Monday, April 29, 2013

How big does your city have to be in order to make Sentiment Analysis worthwhile?

I wrote earlier about a solution I helped develop which allows city leaders to monitor the sentiment being expressed online about their city. As we present this solution to the leaders of various cities, one of the questions that is always asked is whether their city is well known enough to generate enough mentions for the sentiment charts to be statistically significant.

The general rule of thumb we have been using is that a city must have a population of at least 0.25 million in order to make the tool feasible. What really matters is the number of mentions of the city online (we would hope for at least 5k per week), and often (but not always) the population of a city is a rough guide to how many mentions it is likely to get. Therefore I decided to run a quick test to see how many mentions I would find in the first week of April this year for a pseudo-random selection of cities with both large and small populations (some of the smaller places were not technically cities).

This table summarises the results:

City            Population   Mentions/week
Loughrea             5,057             102
Birr                 5,818             291
Nazareth            14,123           3,873
Clemmons            18,627             224
Bethlehem           25,266           3,661
Navan               28,158             775
Dundalk             31,149             898
Lorient             58,135           1,971
Galway              75,529           4,758
Cergy-Pontoise     183,430             235
Bordeaux           235,891           9,555
Montpellier        255,080           6,539
Toulouse           449,328          11,933
Dublin             527,612          29,859
Boston             625,087         111,035
Jerusalem          801,000          20,767
Paris            2,234,105         178,406
Sydney           4,627,345          52,791
London           8,173,194         257,094
Bangalore        8,474,970          16,455
Moscow          11,503,501          49,534
Tokyo           13,185,502         216,606
New York        19,570,261         440,535
Beijing         20,693,000         274,062

This can be visualised by the following chart:


I think you can see that there is a correlation between city size and the number of mentions (correlation coefficient = 0.83). You can also see that Galway is getting roughly enough mentions to make sentiment analysis useful despite only having a population of 75k, while Cergy-Pontoise has more than double the population but is not getting enough internet mentions to make sentiment monitoring useful.
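For anyone who wants to repeat the calculation, here is a minimal sketch in Python. It computes a plain Pearson correlation between population and weekly mentions using the figures from the table, and lists the cities that reach the 5k mentions per week threshold. The names `DATA`, `pearson` and `MIN_MENTIONS` are just illustrative, and I am assuming a Pearson coefficient on the raw (not log-scaled) numbers, so the result may not match the figure quoted above exactly.

```python
import math

# (city, population, mentions per week), copied from the table above
DATA = [
    ("Loughrea", 5057, 102), ("Birr", 5818, 291),
    ("Nazareth", 14123, 3873), ("Clemmons", 18627, 224),
    ("Bethlehem", 25266, 3661), ("Navan", 28158, 775),
    ("Dundalk", 31149, 898), ("Lorient", 58135, 1971),
    ("Galway", 75529, 4758), ("Cergy-Pontoise", 183430, 235),
    ("Bordeaux", 235891, 9555), ("Montpellier", 255080, 6539),
    ("Toulouse", 449328, 11933), ("Dublin", 527612, 29859),
    ("Boston", 625087, 111035), ("Jerusalem", 801000, 20767),
    ("Paris", 2234105, 178406), ("Sydney", 4627345, 52791),
    ("London", 8173194, 257094), ("Bangalore", 8474970, 16455),
    ("Moscow", 11503501, 49534), ("Tokyo", 13185502, 216606),
    ("New York", 19570261, 440535), ("Beijing", 20693000, 274062),
]

# Rule-of-thumb threshold from the post: at least 5k mentions per week
MIN_MENTIONS = 5000


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


populations = [pop for _, pop, _ in DATA]
mentions = [m for _, _, m in DATA]

r = pearson(populations, mentions)
print(f"Pearson r between population and mentions: {r:.2f}")

# Cities with enough weekly mentions for sentiment charts to be useful
enough = [city for city, _, m in DATA if m >= MIN_MENTIONS]
print("Cities with at least 5k mentions/week:", ", ".join(enough))
```

Because a handful of very large cities dominate the raw numbers, it might also be worth trying the same calculation on the logarithms of both columns.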

A few examples of cities whose number of mentions is very different from what their population would predict:
  • Both Bethlehem and Nazareth get many more mentions than would be predicted by their population (e.g. they are both mentioned significantly more than Navan and Dundalk, which have larger populations). This is probably due to the biblical significance of the towns. In fact this is why I chose them for inclusion in the test, and I don't know the names of any other towns in the Middle East with such a small population.
  • Where the name of the city in the local language was different from the name in English, I searched on both versions of the name. In general the local language version received more hits (e.g. there were 6.5 times as many mentions of 北京 as there were for Beijing). However, for the Israeli cities it was the other way around. For example, the word "Jerusalem" got 20,711 mentions while the Arabic and Hebrew translations of the city name only had 17 and 39 mentions respectively. Perhaps this is an indication that people in other parts of the world are talking about the city much more than the locals.
  • Cergy-Pontoise only gets 235 mentions, while Lorient gets 1,971 mentions despite having a smaller population. I am not sure why this should be the case, but perhaps it is because Cergy-Pontoise is so close to Paris that local residents consider themselves to be Parisians. Lorient has no similar large city nearby to overshadow it.
  • The statistics will vary over time. For example, if I had run my test for the third week in April rather than the first, the number of mentions for Boston would have been 1,893,159 rather than 111,035, probably due to coverage of the marathon bombing.
Notes:
  • In the case of some of the cities I chose, there are multiple cities with the same name. For example, the Wikipedia disambiguation page for Boston lists several cities with this name, but I only counted the population of the capital of Massachusetts (the population of the other cities would probably not be very large).
  • Wikipedia sometimes has several different estimates of population because of ambiguity about how large an area to include. I only considered the first number listed (which is typically the smallest). In some cases the difference might be minor, but in others it would be very significant. For example, the estimates of the population of Boston vary by an order of magnitude, from 625 thousand to 7.6 million.

Sunday, April 28, 2013

Cycling in the Wicklow Mountains

Earlier this year I was persuaded to sign up for the Wicklow 100/200 cycle event. This event takes place in June and offers a choice of two routes, one 100km long and another 200km long. A 200km cycle would be challenging enough, but this route has the added challenge of passing over several steep climbs. Luckily you don't need to commit to either distance when you enter, and you are allowed to change your mind at any stage until you come to the fork in the road where the two routes diverge.

I don't have much experience of cycling in the mountains, so I am unsure how I would get on. Yesterday I rode over the Sally Gap for the first time. I found that I was not a strong climber and was constantly being dropped from the group as we went uphill. Luckily I had no problem catching up again when we came to a flat section, but it is looking very much like I will be opting for the 100km route in June. I will also need several training cycles in the meantime to ensure I complete it in a decent time.



Thursday, April 25, 2013

[xpost] Smasher now works with Juniper firewalls

As many people know, I was the original developer of the smasher Sametime plugin for automatic BSO authentication. However, I have not been actively maintaining it in the last few years. The last update I did was in 2011 when I partially fixed a problem which stopped SUT and smasher working together. Whenever people ask for new features or bug fixes, I typically point them at the location of the source code and then politely suggest that if they really want their issue solved they should fix it themselves.
Recently the Böblingen lab announced that they were planning to replace all of their Cisco BSO devices with Juniper ones. This caused a flurry of emails from German employees, since neither smasher nor any of the alternative tools work with the Juniper firewalls. I was not in a position to help because I don't have access to any of the new firewalls to test. Luckily Thomas Immel was kind enough to help out, and he developed a new version 1.3.5 which apparently works with the new firewalls.
The new version of smasher is available from the same update site URL as before http://dubgsa.ibm.com/~bodonova/public/smasher/latest/ - I didn't get a chance to do any testing with this new version (I no longer use smasher myself), so just in case it causes problems for anyone the old version is still available at http://dubgsa.ibm.com/~bodonova/public/smasher/smasher-1.3.4/
I hope you enjoy (and send any praise or complaints to Thomas rather than me).

Sunday, April 21, 2013

To tri-bar or not to tri-bar? - that is the question

When I bought my racing bike through the bike to work scheme, I had 50 euro left over. The bike shop offered to give me a voucher for the unused money, but I was keen to spend it on some accessory. I asked the shop what I could get for 50 euro and I finally decided on getting tri-bars.

Tri-bars are extensions to the handlebars on a bike which allow the cyclist to take up a more aerodynamic position. They are called tri-bars because they are normally only used by participants in either a triathlon or an individual time trial.

The advantages of the tri-bar are:

  • The position of the cyclist is more aerodynamic so it is possible to cycle faster and expend less effort.
  • While using the tri-bars the cyclist will normally rest their elbows on soft pads, which eliminates all strain on the arms and back.
  • You look really cool when using your tri-bars (this was probably my main motivation for purchasing them).
However, the tri-bars also have some disadvantages:
  • When your hands are on the tri-bars they are quite some distance from the brakes, so sudden braking is not possible. Hence they cannot be used in traffic or when cycling in a group.
  • You have minimal steering control while using the tri-bars, so they can only be used on straight roads. In fact it is not even feasible to swerve to avoid potholes while using the bars, so they can't be used on poor road surfaces.
  • While it is more efficient to cycle with tri-bars, it takes some practice to get used to the different cycling position.
  • The tri-bars use up some space on the handlebars which reduces the space for attaching other accessories. 
When I initially started cycling on my new bike, I found that I hardly ever used the tri-bars and so I decided to remove them. However, when I started training for a triathlon last year I re-attached them and decided to make a concerted effort to learn how to use them. I still find that I don't use the bars very often, but I think that it is still worth having them because they don't get in the way very much when not being used.