A common problem faced by data analysts is to decide upon the required sample size.
This can be estimated by using various calculations and distribution functions. However, even if a statistician is readily available, the answer is not necessarily straightforward, and is certainly not easy to understand.
An appropriate sample size depends on various study design parameters: minimum expected difference; measurement variability; desired statistical power; significance criteria, etc.
I do not wish to over-complicate matters by explaining in detail what the above terminology means. A quick web search on any of these terms will let you dive deeper and understand more.
In your school mathematics lessons you may have learned that sample size matters. This is correct! But what was probably not taught at school, because it may have overloaded the lessons (and even some brains!), is that a larger sample size does not necessarily mean a higher confidence level (= level of certainty).
Hence, we are often asked by readers:
How many data sets are actually required to achieve a good level of accuracy?
In other words, how many matches do I need to analyse for accurate predictions?
(1) HDAFU Tables
The HDAFU tables capture five seasons’ data.
The smallest league analysis we offer for sale, the Swiss Super League, includes all 882 matches played in the last five seasons. Mathematically speaking, analysing this data set will produce estimations and predictions within a confidence interval of 3.3%.
The largest leagues, English League One and The Championship, both contain 2,760 matches. Statistical analysis of these data will produce results with a standard error below 1% and a confidence interval of 1.8%, which is more than acceptable.
Any longer period, for example, looking at more than five seasons’ data, is unnecessary in our opinion, at least for HDA simulations and strategies.
Piling on more data sets is not likely to make any significant improvement to the standard error or the confidence interval.
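The figures quoted above can be roughly reproduced with the textbook margin-of-error formula for a proportion. The sketch below is an assumption about how such numbers arise, not necessarily the calculation behind the HDAFU tables: it assumes a 95% confidence level (z = 1.96) and the worst-case proportion p = 0.5. The function names and the ten-season figure (twice 2,760 matches) are purely illustrative.

```python
import math

Z_95 = 1.96  # assumed: z-score for a 95% confidence level


def standard_error(n, p=0.5):
    """Standard error of a proportion p estimated from n matches."""
    return math.sqrt(p * (1 - p) / n)


def margin_of_error(n, p=0.5, z=Z_95):
    """Half-width of the confidence interval around p."""
    return z * standard_error(n, p)


# Match counts from the article, plus a hypothetical doubling to ten seasons.
samples = {
    "Swiss Super League, 5 seasons": 882,
    "English Championship, 5 seasons": 2760,
    "Hypothetical 10 seasons": 5520,
}

for label, n in samples.items():
    se = standard_error(n)
    moe = margin_of_error(n)
    print(f"{label}: n={n}, SE={se:.2%}, margin=±{moe:.2%}")
```

Under these assumptions, 882 matches give a margin of about ±3.3%, and 2,760 matches give a standard error just under 1% and a margin just under ±1.9%, close to the figures quoted. Doubling the data to 5,520 matches only narrows the margin to about ±1.3%, because the margin shrinks with the square root of the sample size.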
Five seasons’ data is absolutely enough; even three seasons do a pretty accurate job for many leagues, and our value calculators are even based on the last 25 matches only (= not even 2 seasons).
(2) Value Bet Calculator for League Games with Head-to-Head (H2H) History
Once odds calculation using five seasons of data is understood, opportunities to simplify emerge.
In order to predict distributions with enough accuracy for a particular subset of matches it is not always necessary to carry out a whole set of exhaustive calculations. The True Odds calculator & Value Bet detector for League Games with H2H history is just such an example.
Using the last 25 matches plus H2H’s for league games is a kind of “back-of-the-envelope calculation”, or a snapshot of comparable data. It does not address matches which are not so easy to calculate (e.g. cup games, international club fixtures, matches without H2H history, matches between teams without a history of playing in the same league, etc.).
These other types of fixture may require not only different sample sizes and data sets, but also different formulas and calculations.
Read more on odds calculation and strategy development: 1×2 Betting System: Analysis of HDA Data and Strategy Development – LAY THE DRAW
Can you please elaborate on percentage price difference between market odds and calculated odds?
You may find some answers here:
What is Value? What is Value Betting?
The Science of Prediction
I’d love to dig deeper into the formula you used. Could you point me in the right direction? Is there anything I can specifically Google? Or any additional articles that you have that elaborate more on this process?
I don’t know if this question has already been asked or addressed, and if it has, I apologize. My question is: why is it the last 25 games and at least 6 years of head-to-head matches? Is this simply a number you picked? Or is there a mathematical formula behind it? Thanks and Cheers!
You are talking about the Value Calculator. The 25 games are important, and the 6 H2H’s are a correction factor. If you use the VC at the beginning of a season, then data from the last season plus a few records of the previous season are included in the calculations. The further the season develops, the more the previous season becomes obsolete.
Yes, there is a lot of mathematical research behind it.
Hi Soccerwidow,
I understand that a large sample size is necessary to reduce error and improve the confidence level. However, considering that players and circumstances change frequently from season to season, would historical data be a good measure for future probabilities?
Thank you!
There is no other option than to use the little historical data available for calculating future probabilities. Just keep in mind that football clubs are professional entities, and even if players change, the management will certainly employ suitable replacements.
Have you tried to incorporate an exponential scaling function that gives disproportionately more weight to recent matches? The further back in time a match lies, the smaller its weight in the overall calculation. It works pretty well, and you can use a very large dataset.
Regards,
Dennis
Hi Dennis, thanks for your input 🙂
You are on the right track… results become more accurate if matches further back in time are given a smaller weight.
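A minimal sketch of the weighting idea discussed above: each match is weighted by a decay factor raised to the power of its age, so older matches count for less. The decay value, the function name and the sample goal counts are hypothetical, and this is not the formula used in any Soccerwidow calculator.

```python
def exp_weighted_mean(values, decay=0.9):
    """Weighted mean of values ordered from oldest to newest.

    The newest value gets weight 1, the one before it `decay`,
    the one before that `decay ** 2`, and so on.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < decay <= 1:
        raise ValueError("decay must be in (0, 1]")
    n = len(values)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Made-up goals scored by one team, oldest match first.
goals = [0, 1, 1, 2, 3]

plain = sum(goals) / len(goals)
weighted = exp_weighted_mean(goals, decay=0.8)
print(f"plain mean: {plain:.2f}, exponentially weighted mean: {weighted:.2f}")
```

With `decay=1.0` the result equals the plain mean; the smaller the decay, the faster old matches fade. In this made-up series the team scores more in recent matches, so the weighted mean comes out higher than the plain one.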
Hi,
This sounds very interesting. It reminds me of technical analysis in trading, where there is an EMA (Exponential Moving Average), which puts more weight on recent prices (in case you didn’t already know). Any tips on how to implement such a thing?
Thanks.
The EMA is a moving average that places a greater weight and significance on the most recent data points. This technical indicator is used to produce buy and sell signals based on crossovers and divergences from the historical average. You can use it for trading on, for instance, 5-minute, 15-minute, 30-minute, and 1-hour time frames.
However, it has nothing to do with Value Betting, which is what Soccerwidow is all about, and I’m not planning to extend the content into trading. Sorry, no time.
Hi, The True Odds & Value Detector spreadsheet uses the past 25 games for analysis, but this seems a small number for, say, Correct Score analysis. Would it not be better to use a larger set of data for this, say 100 perhaps?
Generally speaking, you would be right if betting decisions for correct scores were based only on the scores from the last 25 games. However, the value bet detector also calculates the goal expectation (please scroll down in the value bet detector spreadsheet – lines 88 to 98 in the ‘ValueCalc’ tab) based on home goals scored/conceded as well as away team goals & H2H’s. These numbers also need to be taken into consideration.