¤ OpenAccess: Gold
This work has “Gold” OA status. This means it is published in an Open Access journal that is indexed by the DOAJ.
Predictive ability of machine learning methods for massive crop yield prediction
Alberto Gonzalez-Sanchez, Juan Frausto-Solís, Waldo Ojeda-Bustamante
An important issue for agricultural planning purposes is the accurate yield estimation for the numerous crops involved in the planning. Machine learning (ML) is an essential approach for achieving practical and effective solutions for this problem. Many comparisons of ML methods for yield prediction have been made, seeking the most accurate technique. Generally, the number of evaluated crops and techniques is too low and does not provide enough information for agricultural planning purposes. This paper compares the predictive accuracy of ML and linear regression techniques for crop yield prediction in ten crop datasets. Multiple linear regression, M5-Prime regression trees, multilayer perceptron neural networks, support vector regression and k-nearest neighbor methods were ranked. Four accuracy metrics were used to validate the models: the root mean square error (RMSE), root relative square error (RRSE), normalized mean absolute error (MAE), and correlation factor (R). Real data from an irrigation zone of Mexico were used for building the models. Models were tested with samples of two consecutive years. The results show that M5-Prime and k-nearest neighbor techniques obtain the lowest average RMSE errors (5.14 and 4.91), the lowest RRSE errors (79.46% and 79.78%), the lowest average MAE errors (18.12% and 19.42%), and the highest average correlation factors (0.41 and 0.42). Since M5-Prime achieves the largest number of crop yield models with the lowest errors, it is a very suitable tool for massive crop yield prediction in agricultural planning.
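The four accuracy metrics named in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say how the MAE is normalized, so dividing by the mean observed yield is a guess, and the function names and sample values are hypothetical rather than taken from the paper.

```python
import math

def rmse(actual, predicted):
    # Root mean square error, in the same units as the yield.
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)

def rrse(actual, predicted):
    # Root relative square error (%): model error relative to the error
    # of always predicting the mean of the observed values.
    mean_a = sum(actual) / len(actual)
    num = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    den = sum((a - mean_a) ** 2 for a in actual)
    return 100.0 * math.sqrt(num / den)

def normalized_mae(actual, predicted):
    # ASSUMPTION: MAE is normalized by the mean observed value and
    # expressed as a percentage; the source does not give the formula.
    mean_a = sum(actual) / len(actual)
    mae = sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)
    return 100.0 * mae / mean_a

def correlation(actual, predicted):
    # Pearson correlation between observed and predicted values.
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

# Hypothetical yields (t/ha), not data from the paper.
observed = [2.0, 4.0, 6.0]
estimated = [3.0, 4.0, 5.0]
print(rmse(observed, estimated), rrse(observed, estimated),
      normalized_mae(observed, estimated), correlation(observed, estimated))
```

An RRSE below 100% means the model beats the trivial predictor that always returns the mean observed yield, which is one way to read the roughly 79% figures reported in the abstract.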
- Referenced Papers
¤ Open Access
Cited 11 times
Feature Selection for Wheat Yield Prediction
Carrying out effective and sustainable agriculture has become an important issue in recent years. Agricultural production has to keep up with an ever-increasing population by taking advantage of a field’s heterogeneity. Nowadays, modern technology such as the global positioning system (GPS) and a multitude of developed sensors enable farmers to better measure their fields’ heterogeneities. For this small-scale, precise treatment the term precision agriculture has been coined. However, the large amounts of data that are (literally) harvested during the growing season have to be analysed. In particular, the farmer is interested in knowing whether a newly developed heterogeneity sensor is potentially advantageous or not. Since the sensor data are readily available, this issue should be seen from an artificial intelligence perspective. There it can be treated as a feature selection problem. The additional task of yield prediction can be treated as a multi-dimensional regression problem. This article aims to present an approach towards solving these two practically important problems using artificial intelligence and data mining ideas and methodologies.
¤ Open Access
Cited 8,034 times
A tutorial on support vector regression
In this tutorial we give an overview of the basic ideas underlying Support Vector (SV) machines for function estimation. Furthermore, we include a summary of currently used algorithms for training SV machines, covering both the quadratic (or convex) programming part and advanced methods for dealing with large datasets. Finally, we mention some modifications and extensions that have been applied to the standard SV algorithm, and discuss the aspect of regularization from a SV perspective.
Cited 17 times
Using spatial information systems to improve water management in Mexico
In the last two decades, Mexican irrigated agriculture has faced large changes as a result of recurrent droughts and the transfer of irrigation management from the federal government to Water User Associations (WUA). The associations face a great challenge in the efficient operation of water distribution networks and use of water. Under water scarcity conditions, new irrigation management strategies must be implemented to estimate irrigation requirements at field scale and integrate them into several operating levels of the irrigation network. A computer tool, Spriter, was developed and transferred to several WUA of Mexican irrigation districts. The system allows the dynamic generation of digital map reports, as mapping of physical features of fields is linked to the main database. Software applications are presented to illustrate the advantages of using spatial information technologies to improve water management in large irrigation districts.
Cited 175 times
AFRCWHEAT2: A model of the growth and development of wheat incorporating responses to water and nitrogen
AFRCWHEAT2, developed from the earlier model ARCWHEAT1, is a computer simulation model of the growth and development of wheat for conditions where nitrogen and water may be in sub-optimal supply. The model is described and tested for its ability to simulate crop growth in four field experiments with different sites, growing seasons, sowing dates, amount and timing of nitrogen applications and irrigation. The overall root mean square error for simulated grain yield as a percentage of observed was 13 per cent. A relative error [(observed − simulated)/observed] of less than 20 per cent occurred in more than 50 per cent of comparisons between observed and simulated values. Errors in simulation of green area index (GAI) and dry matter were small, except for one case. The model simulated adequately crop evapo-transpiration and the ratio between dry matter and amount of water transpired (the transpiration coefficient) but it consistently overestimated the rate of decline in shoot nitrogen concentration. Dry matter accumulation was modelled as being more sensitive to soil nitrogen than to soil water level, in agreement with observations. Generally, model crops continued to absorb nitrogen for a longer time than did observed crops. A future use of AFRCWHEAT2 will be to examine its ability to simulate crops grown under variable conditions of elevated CO2 and temperature.
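The relative-error criterion quoted in this abstract can be illustrated with a short Python sketch. It assumes the 20 per cent threshold applies to the absolute value of (observed − simulated)/observed, which the abstract does not state explicitly; the function names and sample yields are hypothetical.

```python
def relative_errors(observed, simulated):
    # Relative error as written in the abstract: (observed - simulated) / observed.
    return [(o - s) / o for o, s in zip(observed, simulated)]

def share_within(observed, simulated, threshold=0.20):
    # ASSUMPTION: "less than 20 per cent" is read as |relative error| < 0.20.
    errors = relative_errors(observed, simulated)
    return sum(1 for e in errors if abs(e) < threshold) / len(errors)

# Hypothetical grain yields (t/ha), not values from the paper.
observed = [10.0, 8.0, 5.0, 4.0]
simulated = [9.0, 10.0, 4.5, 2.0]
print(share_within(observed, simulated))  # fraction of comparisons inside the threshold
```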
¤ Open Access
Cited 78 times
Consistency of cross validation for comparing regression procedures
Theoretical developments on cross validation (CV) have mainly focused on selecting one among a list of finite-dimensional models (e.g., subset or order selection in linear regression) or selecting a smoothing parameter (e.g., bandwidth for kernel smoothing). However, little is known about consistency of cross validation when applied to compare between parametric and nonparametric methods or within nonparametric methods. We show that under some conditions, with an appropriate choice of data splitting ratio, cross validation is consistent in the sense of selecting the better procedure with probability approaching 1. Our results reveal interesting behavior of cross validation. When comparing two models (procedures) converging at the same nonparametric rate, in contrast to the parametric case, it turns out that the proportion of data used for evaluation in CV does not need to be dominating in size. Furthermore, it can even be of a smaller order than the proportion for estimation while not affecting the consistency property.
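The idea of using cross validation to choose between two regression procedures, with an explicit data splitting ratio, can be sketched as follows. This is a simplified illustration and not the procedure analysed in the paper: the two candidate procedures (a constant mean predictor and a one-variable least-squares line), the repeated random splits, and all names and parameters such as `eval_fraction` are assumptions made for the example.

```python
import random

def fit_mean(xs, ys):
    # Procedure 1: always predict the training mean.
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    # Procedure 2: ordinary least squares with a single predictor.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

def cv_compare(xs, ys, procedures, eval_fraction=0.5, repeats=20, seed=0):
    # Repeatedly split the data, fit each procedure on the estimation part,
    # and accumulate its mean squared error on the evaluation part.
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    n_eval = max(1, int(len(xs) * eval_fraction))
    totals = {name: 0.0 for name in procedures}
    for _ in range(repeats):
        rng.shuffle(idx)
        eval_idx, train_idx = idx[:n_eval], idx[n_eval:]
        for name, fit in procedures.items():
            model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
            totals[name] += sum((model(xs[i]) - ys[i]) ** 2 for i in eval_idx) / n_eval
    # The procedure with the smallest accumulated error is selected.
    return min(totals, key=totals.get), totals

# Hypothetical noise-free data generated from y = 2x + 1.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
best, totals = cv_compare(xs, ys, {"mean": fit_mean, "line": fit_line})
print(best)
```

The `eval_fraction` argument plays the role of the splitting ratio discussed in the abstract; varying it shows how much data is held out for evaluation versus estimation.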
Cited 28 times
Factors Underlying Yield Variability in Two California Rice Fields
Modern technologies associated with precision agriculture provide the opportunity to more precisely measure yield variability and the ecological processes underlying this variability. Effective analysis of data from these measurements requires statistical methods different from those traditionally employed on data from controlled agronomic experiments. Our objective was to develop and test multivariate statistical methods appropriate for use in analyzing precision agriculture data. We analyzed a data set taken from two commercial California rice fields and consisting of yield spatial trends together with soil core data from a grid of sample points. We used cluster analysis to discern spatiotemporal patterns in grain yield. We applied a Monte Carlo randomization process to the generation of clusters to analyze cluster stability. We then used classification and regression trees (CART) to determine the factors underlying cluster distribution. The clustering procedure successfully identified stable, physically meaningful clusters with recognizable spatial and temporal structure. Thus, the randomization procedure may present an attractive alternative to fuzzy clustering. The CART analysis identified some but not all of the factors underlying the cluster patterns. The number of available data values may have been too small to take advantage of the CART partitioning capabilities.
Cited 42 times
Artificial Neural Network Model as a Data Analysis Tool in Precision Farming
Spatial variation in landscape and soil properties combined with temporal variations in weather can result in yield patterns that change annually within a field. The complexity of interactions between a number of yield-limiting factors makes it difficult to accurately attribute yield losses to conditions that occur within a field. In this research, a back-propagation neural network (BPNN) model was developed to predict the spatial distribution of soybean yields and to understand the causes of yield variability. First, we developed a BPNN model by relating soybean yield to topography, soil, weather, and site factors and evaluated model predictions for the same field for independent years. We also explored the potential use of BPNN for predicting yields in independent fields. Finally, we evaluated the ability of the BPNN to attribute yield losses due to soybean cyst nematodes (SCN), soil pH, and weeds. A total of 14 input datasets with combinations of four controlling factors (topographic, soil fertility, weather, and site) were used. For each objective, data from fields in Iowa were used for training the BPNN, while a portion of the data was withheld to verify the accuracy of yield predictions. All BPNN models had fully connected feed-forward architecture with a back-propagation weight adjustment algorithm. When tested for a particular field, the BPNN captured the major patterns of yield variability in independent years; the root mean square error of prediction (RMSEP) was 14.2% of actual yield. When the BPNN was trained with inputs from five fields, the RMSEP at test sites was 11.2% of actual yield. When the BPNN was used to attribute yield losses to soil pH, SCN, and weed populations, standard errors were 92, 262, and 171 kg ha−1, respectively. The technique showed that the BPNN could predict spatial yield variability with an RMSEP of about 14%.
¤ Open Access
Cited 65 times
An overview of regression techniques for knowledge discovery
Predicting or learning numeric features is called regression in the statistical literature, and it is the subject of research in both machine learning and statistics. This paper reviews the important techniques and algorithms for regression developed by both communities. Regression is important for many applications, since lots of real life problems can be modeled as regression problems. The review includes Locally Weighted Regression (LWR), rule-based regression, Projection Pursuit Regression (PPR), instance-based regression, Multivariate Adaptive Regression Splines (MARS) and recursive partitioning regression methods that induce regression trees (CART, RETIS and M5).
Cited 198 times
Modeling Soybean Growth for Crop Management
A soybean (Glycine max (L.) Merr.) crop growth simulation model (SOYGRO) was developed to aid farm managers in making irrigation and pest management decisions. Non-linear first order differential equations describe dry matter rates of change, accumulation and depletion of protein pools, and changes in shell and seed numbers. Two data sets from defoliation and irrigation experiments were used for calibration and validation of the model. The model responds well to drought and defoliation stresses for two test cases. Sensitivity analyses of SOYGRO revealed that simulated yield was most sensitive to changes in gross photosynthesis and growth respiration. The sensitivity of simulated yield to changes in model parameters was increased by the occurrence of either water or defoliation stress.
Cited 295 times
Sirius: a mechanistic model of wheat response to environmental variation
Sirius is a wheat simulation model that calculates biomass production from intercepted photosynthetically active radiation (PAR) and grain growth from simple partitioning rules. Leaf area index (LAI) is developed from a thermal time sub-model. Phenological development is calculated from the mainstem leaf appearance rate and final leaf number, with the latter determined by responses to daylength and vernalisation. Effects of water and N deficits are calculated through their influences on LAI development and radiation-use efficiency. This paper describes the model and its validation using data from independent and near independent experiments at Lincoln, New Zealand, and Rothamsted, UK. Despite there being no calculation of tiller dynamics or grain number, the model accurately simulated the behaviour of crops exposed to a wide range of conditions. We conclude that the accurate prediction of phenological development and LAI is much more important for grain yield prediction than are the components of yield. Although grain population is not a necessary step in yield calculation in Sirius, the model proved useful in investigating the effects of stress in setting grain number. The analysis showed that the influence of stress on partitioning of biomass to the ear during pre-anthesis ear growth was much more important in determining grain number than was the effect on biomass accumulation during the same phase.
Cited 46 times
Site-specific early season potato yield forecast by neural network in Eastern Canada
¤ Open Access
Cited 2,525 times
Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author)
There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.
¤ Open Access
Cited 114 times
Applying machine learning to agricultural data
Many techniques have been developed for learning rules and relationships automatically from diverse data sets, to simplify the often tedious and error-prone process of acquiring knowledge from empirical data. While these techniques are plausible, theoretically well-founded, and perform well on more or less artificial test data sets, they depend on their ability to make sense of real-world data. This paper describes a project that is applying a range of machine learning strategies to problems in agriculture and horticulture. We briefly survey some of the techniques emerging from machine learning research, describe a software workbench for experimenting with a variety of techniques on real-world data sets, and describe a case study of dairy herd management in which culling rules were inferred from a medium-sized database of herd information.
¤ Open Access
Cited 616 times
STICS: a generic model for the simulation of crops and their water and nitrogen balances. I. Theory and parameterization applied to wheat and corn
STICS (Simulateur mulTIdisciplinaire pour les Cultures Standard) is a crop model constructed as a simulation tool capable of working under agricultural conditions. Outputs comprise the production (amount and quality) and the environment. Inputs take into account the climate, the soil and the cropping system. STICS is presented as a model exhibiting the following qualities: robustness, easy access to inputs, and an uncomplicated future evolution thanks to its modular and generic nature (easy adaptation to various types of plant). However, STICS is not an entirely new model since most parts use classic formalisms or stem from existing models. The main simulated processes are the growth, the development of the crop and the water and nitrogenous balance of the soil-crop system. The seven modules of STICS - development, shoot growth, yield components, root growth, water balance, thermal environment and nitrogen balance - are presented in turn with a discussion about the theoretical choices in comparison to other models. These choices should render the model capable of exhibiting the announced qualities in classic environmental contexts. However, because some processes (e.g. ammoniac volatilization, drought resistance, etc.) are not taken into account, the use of STICS is presently limited to several cropping systems. (© Inra/Elsevier, Paris.) crop modelling / wheat / corn / water balance / nitrogen balance
Cited 16 times
A note on the computer simulation of crop growth in agricultural land evaluation
¤ Open Access
Cited 705 times
Improvements to the SMO algorithm for SVM regression
This paper points out an important source of inefficiency in Smola and Schölkopf's (1998) sequential minimal optimization (SMO) algorithm for support vector machine regression that is caused by the use of a single threshold value. Using clues from the Karush-Kuhn-Tucker conditions for the dual problem, two threshold parameters are employed to derive modifications of SMO for regression. These modified algorithms perform significantly faster than the original SMO on the datasets tried.
Cited 242 times
A comparison of the models AFRCWHEAT2, CERES-Wheat, Sirius, SUCROS2 and SWHEAT with measurements from wheat grown under drought
The predictions of five simulation models were compared with data from a winter sown wheat experiment performed in a mobile automatic rainshelter at Lincoln, New Zealand in 1991/1992, where observed grain yields ranged from 3.6 to 9.9 t ha−1. Four of the five models predicted the yield of the fully irrigated treatment to within 10%, and SWHEAT underestimated by more than 20%. The same four models also predicted the grain yield response to varying water supply with reasonable accuracy, but SWHEAT again underestimated the yield reduction with increasing drought. However, the performance of all the models in predicting both the time course and final amount of aboveground biomass, of leaf area index (LAI) and evapotranspiration, varied substantially. These variations were associated with their diverging assumptions about the effects of root distribution and soil dryness on the ability of the crops to extract water, the value of the ratio of water supply to water demand at which stress begins to reduce leaf area development, and photosynthetic, or light-use efficiency (LUE). All the models predicted, to varying degrees, that reductions in photosynthetic efficiency or LUE was an important contributor to reductions in the rate of biomass accumulation. In contrast, analysis of the experimental data indicated that this factor was a minor contributor to the reduction, and variation in light interception, associated with changes in LAI, was the major cause.
¤ Open Access
Cited 35 times
Data Mining of Agricultural Yield Data: A Comparison of Regression Models
Nowadays, precision agriculture refers to the application of state-of-the-art GPS technology in connection with small-scale, sensor-based treatment of the crop. This introduces large amounts of data which are collected and stored for later usage. Making appropriate use of these data often leads to considerable gains in efficiency and therefore economic advantages. However, the amount of data poses a data mining problem – which should be solved using data mining techniques. One of the tasks that remains to be solved is yield prediction based on available data. From a data mining perspective, this can be formulated and treated as a multi-dimensional regression task. This paper deals with appropriate regression techniques and evaluates four different techniques on selected agriculture data. A recommendation for a certain technique is provided. Keywords: Precision Agriculture, Data Mining, Regression, Modeling
¤ Open Access
Cited 551 times
All of Statistics
Cited 433 times
Modelling Potential Crop Growth Processes
“Predictive ability of machine learning methods for massive crop yield prediction” is a paper by Alberto Gonzalez-Sanchez, Juan Frausto-Solís, and Waldo Ojeda-Bustamante, published in the Spanish Journal of Agricultural Research in 2014. It was published by Instituto Nacional de Investigacion y Tecnologia Agraria y Alimentaria. It has an Open Access status of “gold”.