Calibration and evaluation are two important steps prior to the application of a crop simulation model. The objective of this paper was to review common statistical methods that are used for crop model calibration and evaluation. A group of deviation statistics was reviewed, including root mean squared error (RMSE), normalized RMSE (nRMSE), mean absolute error (MAE), mean error (E), paired t-test, index of agreement (d), modified index of agreement (d1), revised index of agreement (d1′), modeling efficiency (EF) and revised modeling efficiency (EF1). A case study of the statistical evaluation was conducted for the DSSAT Cropping System Model (CSM) using 10 experimental datasets for maize, peanut, soybean, wheat and potato from Brazil, China, Ghana, and the USA. The results indicated that R² was not a good statistic for model evaluation because it is insensitive to the regression coefficients (α and β) of the linear model y=α+βx+ε. However, linear regression can be used for model evaluation (test H0: α=0, β=1) if the error term (ε) is tested for auto-correlation, normality and heteroskedasticity, or if appropriate data transformations are made. The results also illustrated that statistical evaluation of the total dataset pooled across treatments might be insufficient. Hence, the evaluation of each treatment is necessary to reach the correct conclusion, especially when evaluating soil water content under different planting date treatments and soil mineral N under different N treatments. Co-variability analysis among the dimensionless statistics (d, d1, d1′, EF and EF1) indicated that d and EF are inflated by the sum of squares-based deviations, i.e., larger deviations contribute more weight to the statistic than smaller deviations because of the squared term. However, EF had a larger range and a clear physical meaning at EF=0, making it superior to d.
Values of d=0.75 were obtained from the regression with all positive values of EF (EF⩾0), indicating that d⩾0.75 and EF⩾0 should be the minimum values for plant growth evaluation. Values of d⩾0.60 and EF⩾−1.0, combined with a t-test, should be the minimum values for the evaluation of soil outputs, because the soil parameters in the DSSAT SOIL module are more difficult to calibrate than plant growth parameters owing to insufficient observed soil data. Owing to their statistical nature, no single statistic is more robust than the others, but some statistics are highly correlated. Therefore, several statistics, drawn from each of the following correlated groups (RMSE, MAE), (E, t-test), (d, d1, d1′) and (EF, EF1), may be used in one model evaluation so that a representative statistical conclusion can be obtained with respect to model performance.
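As an illustration of the deviation statistics named above, the following minimal sketch computes most of them for paired observed and simulated values. It assumes the commonly cited textbook forms (Willmott's index of agreement for d and d1, Nash-Sutcliffe efficiency for EF, and its absolute-value variant for EF1), nRMSE normalized by the observed mean and expressed in percent, and E defined as simulated minus observed; the exact definitions used in the paper may differ. The function name `deviation_statistics` and the dictionary keys are hypothetical, and d1′ and the paired t-test are omitted.

```python
import math


def deviation_statistics(obs, sim):
    """Return common deviation statistics for paired observed/simulated values.

    Assumed (textbook) definitions, not necessarily those of the paper.
    """
    if len(obs) != len(sim) or len(obs) == 0:
        raise ValueError("obs and sim must be non-empty and of equal length")

    n = len(obs)
    obs_mean = sum(obs) / n
    errors = [s - o for o, s in zip(obs, sim)]  # simulated minus observed

    sse = sum(e * e for e in errors)   # sum of squared errors
    sae = sum(abs(e) for e in errors)  # sum of absolute errors

    # Spread of the observations around their mean (denominators of EF, EF1).
    obs_ss = sum((o - obs_mean) ** 2 for o in obs)
    obs_sa = sum(abs(o - obs_mean) for o in obs)
    if obs_ss == 0:
        raise ValueError("observations are constant; d and EF are undefined")

    # "Potential error" terms (denominators of d and d1).
    pot = [abs(s - obs_mean) + abs(o - obs_mean) for o, s in zip(obs, sim)]
    pot_sq = sum(p * p for p in pot)
    pot_abs = sum(pot)

    rmse = math.sqrt(sse / n)
    return {
        "RMSE": rmse,
        # Percent of the observed mean; undefined when that mean is zero.
        "nRMSE": 100.0 * rmse / obs_mean if obs_mean != 0 else float("nan"),
        "MAE": sae / n,
        "E": sum(errors) / n,
        "d": 1.0 - sse / pot_sq,    # squared deviations
        "d1": 1.0 - sae / pot_abs,  # absolute deviations
        "EF": 1.0 - sse / obs_ss,   # squared deviations
        "EF1": 1.0 - sae / obs_sa,  # absolute deviations
    }


if __name__ == "__main__":
    # Made-up numbers purely for demonstration, not data from the case study.
    observed = [1.0, 2.0, 3.0, 4.0]
    simulated = [1.2, 1.9, 3.4, 3.6]
    for name, value in deviation_statistics(observed, simulated).items():
        print(f"{name}: {value:.3f}")
```

Under these definitions, a model that always predicts the observed mean yields EF=0, which is the physical reference point mentioned above: positive EF means the model outperforms the observed mean as a predictor. Because d and EF use squared deviations while d1 and EF1 use absolute deviations, a single large error lowers the former pair more strongly than the latter.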