The market is better calibrated than we thought, but not perfect. In our previous calibration post, each question counted once. In the chart below, each forecast counts once, which is the usual method.
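The difference between the two counting methods matters: weighting by forecast lets heavily-traded questions dominate the curve, while weighting by question treats each market equally. A minimal sketch of the per-forecast method used here (the binning scheme and function names are our own, for illustration):

```python
from collections import defaultdict

def calibration_curve(forecasts, n_bins=10):
    """Bin forecasts by stated probability and compare each bin's
    mean forecast to the observed outcome frequency.

    forecasts: list of (probability, outcome) pairs, outcome in {0, 1}.
    Each forecast counts once, so questions that attract many
    forecasts weigh more than rarely-forecast ones.
    """
    bins = defaultdict(list)
    for p, outcome in forecasts:
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the top bin
        bins[idx].append((p, outcome))
    curve = []
    for idx in sorted(bins):
        pts = bins[idx]
        mean_p = sum(p for p, _ in pts) / len(pts)   # average stated probability
        freq = sum(o for _, o in pts) / len(pts)     # observed base rate in bin
        curve.append((mean_p, freq, len(pts)))
    return curve

# Well-calibrated forecasts cluster near the diagonal (mean_p ~ freq).
data = [(0.9, 1), (0.9, 1), (0.9, 0), (0.1, 0), (0.1, 0), (0.1, 1)]
curve = calibration_curve(data)
```

To count each question once instead, you would first average the forecasts within a question and feed one point per question into the same binning.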
The main message is that the overall market is much better calibrated than raw estimates (safe mode). Comparing the top and bottom rows, the market is strongly driven by the super-forecasters (here the Top 70 by Brier score, but it doesn’t change much if we use market score or the Top 10).
On Scaled questions, both the raw estimates and the market as a whole overestimate. The supers, however, do exceptionally well. (The anomaly noted in the Scaled markets traces to a single forecaster who submitted a forecast at the right extreme and then quickly corrected it. Forecasters in Safe Mode, by contrast, were much more likely to assert extreme values, as the small error bars in the middle plot show.)
The market shows perhaps a small favorite-longshot bias on multiple-choice questions and a strong longshot bias on binary questions. The dip around forecasts of 0.75 on binary questions that ruins the supers’ beautiful S-shape is caused by a high number of forecasts in the vicinity of 0.75 on two questions: “Will Google announce development of a smartwatch at or before the Google I/O 2014 Conference?” and “Will Google acquire Twitch by the end of September 2014?” We suspect the better calibration on multiple-choice questions comes from counting forecasts on all options even though the user directly provides only one; the remaining options are renormalized to accommodate it.
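The renormalization step can be sketched as follows. The proportional-rescaling rule below is an assumption for illustration; the post says only that the other options are normalized to accommodate the user's single forecast:

```python
def normalize_options(market_probs, chosen, new_p):
    """Set the chosen option's probability to new_p and rescale the
    remaining options proportionally so the vector still sums to 1.

    NOTE: proportional rescaling is an assumed scheme, not necessarily
    the one the market actually uses.
    """
    rest_total = sum(p for i, p in enumerate(market_probs) if i != chosen)
    scale = (1.0 - new_p) / rest_total if rest_total > 0 else 0.0
    return [new_p if i == chosen else p * scale
            for i, p in enumerate(market_probs)]

# A user moves option 0 from 0.5 to 0.8; options 1 and 2 shrink
# proportionally toward roughly (0.12, 0.08) so the total stays 1.
probs = normalize_options([0.5, 0.3, 0.2], chosen=0, new_p=0.8)
```

Under this reading, one user action generates implied forecasts on every option, which mechanically adds many moderate forecasts to the multiple-choice calibration curve.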