# Statistical Models vs Machine Learning
## A Reflection with Two Case Studies
### Liming Wang
### October 13, 2017

---

# About me

- Currently an assistant professor of Urban Studies and Planning at Portland State University
- Did dissertation research on urban simulation models (with Paul Waddell)
- Worked as a developer for UrbanSim for a number of years
- Research interests in land use-transportation interaction (LUTI) models
- Never had formal computer science training/background

---

# Statistical Modeling vs Machine Learning

Two cultures of developing models (Breiman, 2001):

**Statistical models** ("data modeling"): assume a data generation model, then use data and a hypothesis-testing framework to recover the parameters of the data generation process.

**Machine learning** ("algorithmic modeling"): make no assumption about the data generation process; use computer algorithms for pattern recognition and data-driven prediction.

---

# Challenges to Statistical Models

Or the case for machine learning:

- The assumed model or theory of the data generation process may be wrong
- Competing data generation models may give different pictures of the relationship between the predictors and the response variable
- Changing landscape of data availability
   - Curse of dimensionality
   - With a large sample size, even very weak correlations are easily detected as statistically significant
   - Models increasingly involve data on the whole population rather than a sample
   - Missing data
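
The large-sample point can be illustrated with a short calculation. This is a minimal sketch, not part of the original analysis: the correlation `r = 0.01` and the sample sizes are hypothetical, and the 1.96 cutoff is the usual large-sample approximation for a two-sided 5% test.

```python
import math

def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for testing H0: rho = 0, given sample correlation r and sample size n."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

r = 0.01  # hypothetical correlation that is negligible in practical terms
for n in (100, 10_000, 1_000_000):
    t = correlation_t_statistic(r, n)
    # For large n the t distribution is close to normal, so |t| > 1.96
    # roughly corresponds to p < 0.05 (two-sided).
    print(f"n={n:>9,}  t={t:6.2f}  significant at 5%: {abs(t) > 1.96}")
```

The same tiny correlation is far from significant at n = 100 but clears the threshold easily at n = 1,000,000, so significance alone says little about practical importance.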

---

# Two Case Studies

- Imputation of missing data in travel surveys
- Modeling travel outcomes

---

# Case I: Imputation of Missing Data

Annual Vehicle Miles Traveled (VMT) information in the 2001 National Household Travel Survey (NHTS)

Only 12% (17,037 out of 139,382) of observations are complete.

---

# Multiple Imputation by Chained Equations

<img src="resources/mice1.png" width="1424" style="display: block; margin: auto;" />
Source: van Buuren, Stef and Karin Groothuis-Oudshoorn, 2011. mice: Multivariate Imputation by Chained Equations in R, Journal of Statistical Software, Vol 45 (3).
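
The chained-equations idea can be sketched for two variables as follows. This is a simplified illustration in Python, not the `mice` R package and not the procedure used for the NHTS data: the function names, the synthetic data, and the use of simple linear regression as the only conditional model are all assumptions. A full implementation would also draw the regression parameters from their posterior and produce several completed data sets.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(e * e for e in resid) / max(n - 2, 1)) ** 0.5
    return a, b, sd

def chained_imputation(rows, n_iter=10, seed=0):
    """rows: list of [x, y] with None marking missing values. Returns a completed copy."""
    rng = random.Random(seed)
    data = [list(r) for r in rows]
    missing = [[v is None for v in r] for r in rows]
    # Step 1: start from a simple fill (mean of the observed values in each column).
    for j in (0, 1):
        observed = [r[j] for r in rows if r[j] is not None]
        mean = sum(observed) / len(observed)
        for r in data:
            if r[j] is None:
                r[j] = mean
    # Step 2: cycle through the variables, re-imputing each one from the other.
    for _ in range(n_iter):
        for target, pred in ((0, 1), (1, 0)):
            train = [r for r, m in zip(data, missing) if not m[target]]
            a, b, sd = fit_line([r[pred] for r in train], [r[target] for r in train])
            for r, m in zip(data, missing):
                if m[target]:
                    # Add residual noise instead of plugging in the fitted value,
                    # so imputed values keep realistic variability.
                    r[target] = a + b * r[pred] + rng.gauss(0.0, sd)
    return data

# Synthetic example: y depends on x, and about 15% of each column is missing.
rng = random.Random(42)
rows = []
for _ in range(200):
    x = rng.uniform(0, 10)
    y = 2.0 * x + rng.gauss(0, 1)
    u = rng.random()
    if u < 0.15:
        rows.append([None, y])
    elif u < 0.30:
        rows.append([x, None])
    else:
        rows.append([x, y])

completed = chained_imputation(rows)
print(sum(1 for r in rows if None in r), "rows had a missing value; all are now filled")
```

Running `chained_imputation` with several different `seed` values yields several completed data sets, which is the "multiple" part of multiple imputation in this sketch.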

---

# Imputation Results (1)

**Validation**: randomly set 10% of values to missing, impute them and compare with actual values

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Variable </th>
   <th style="text-align:right;"> Normalized RMSE </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> ANNMILES (Self-reported annual VMT) </td>
   <td style="text-align:right;"> 31.8240 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ANNUALZD (VMT annualized from two odometer readings) </td>
   <td style="text-align:right;"> 22.2640 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> HHFAMINC (Family income) </td>
   <td style="text-align:right;"> 0.0475 </td>
  </tr>
</tbody>
</table>
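
The validation procedure can be sketched as below. This is an illustration under stated assumptions: the slide does not say how the RMSE is normalized, so dividing by the standard deviation of the actual values is an assumption (the table may use a different convention), the data are synthetic, and a mean fill stands in for the real imputation model.

```python
import random

def normalized_rmse(actual, imputed):
    """RMSE divided by the standard deviation of the actual values (assumed normalization)."""
    n = len(actual)
    rmse = (sum((a - p) ** 2 for a, p in zip(actual, imputed)) / n) ** 0.5
    mean = sum(actual) / n
    sd = (sum((a - mean) ** 2 for a in actual) / n) ** 0.5
    return rmse / sd

rng = random.Random(1)
# Synthetic stand-in for a fully observed numeric column.
values = [rng.gauss(10_000, 3_000) for _ in range(1_000)]

# Randomly set 10% of the known values to missing.
held_out = set(rng.sample(range(len(values)), k=len(values) // 10))
kept = [v for i, v in enumerate(values) if i not in held_out]

# Stand-in imputer: fill every hidden value with the mean of the kept values.
fill = sum(kept) / len(kept)
actual = [values[i] for i in sorted(held_out)]
imputed = [fill] * len(actual)

# Compare the imputed values with the actual values that were hidden.
score = normalized_rmse(actual, imputed)
print(f"normalized RMSE = {score:.3f}")
```

Under this assumed normalization a mean-only imputer scores close to 1, which gives a baseline for comparing other imputation methods on the same held-out values.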

---

# Imputation Results (2)

Comparing linear regression results (y = ANNUALZD) without and with multiple imputation

---

# Case II: Travel Behavior Modeling

`$$\text{VMT}_h \leftarrow (\text{SES}_h, \text{regional characteristics}, \text{built environment})$$`

Data sources:

- 2009 NHTS for household SES and travel outcome (VMT);
- EPA's Smart Location Database for block-group-level 5D built environment measures;
- Highway Performance Monitoring System for regionwide roadway information;
- National Transit Database for regionwide transit supply.

150,000 households with more than 180 independent variables (before considering non-linear transformations or interactions between variables)

---

# VMT models

- Statistical Models
   - linear regression
   - non-linear regression (transformed dependent variable)
   - tobit model
   - zero-inflated negative binomial model
- Machine learning algorithms
   - Random Forest
   - Gradient Tree Boosting
   - Deep neural network
   
---

# Cross Validation Results

- Dependent variable is household VMT on the day of survey
- Data are randomly partitioned into 5 parts for a 5-fold cross-validation
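
A 5-fold partition and the resulting out-of-sample error can be sketched as follows. This is a minimal illustration, not the models compared in this study: the helper names, the synthetic data, and the one-predictor linear regression are assumptions chosen to keep the example self-contained.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def cross_validated_rmse(xs, ys, k=5, seed=0):
    """Fit on k-1 folds, score on the held-out fold, and repeat for every fold."""
    scores = []
    for held in k_fold_indices(len(xs), k, seed):
        held_set = set(held)
        train = [i for i in range(len(xs)) if i not in held_set]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((ys[i] - (a + b * xs[i])) ** 2 for i in held) / len(held)
        scores.append(mse ** 0.5)
    return scores

# Synthetic data: one predictor and a noisy linear response.
rng = random.Random(7)
xs = [rng.uniform(0, 10) for _ in range(500)]
ys = [3.0 + 1.5 * x + rng.gauss(0, 2) for x in xs]

scores = cross_validated_rmse(xs, ys)
print("RMSE per fold:", [round(s, 2) for s in scores])
print("mean RMSE:", round(sum(scores) / len(scores), 2))
```

Every observation is held out exactly once, so each model is always scored on data it was not fitted to, and the same folds can be reused to compare different models fairly.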

---

# Conclusion and Discussion

**Conclusions**

- Some tasks, such as multivariate data imputation, are hard or impossible to do with statistical models but possible with machine learning
- Growing model complexity adds challenges for statistical models; machine learning has an advantage in complex models
- If you're developing models for prediction, there are few reasons not to look into machine learning algorithms

**Challenges**

- Combining machine learning skills with the domain knowledge of planning
- Training planning students in machine learning skills
- Computational intensity and access to computing resources

---

# Acknowledgements

- Oregon Department of Transportation (SPR 788) 
- National Institute for Transportation and Communities (NITC-881)
- Portland Institute for Computational Science and its resources acquired using NSF Grant #DMS 1624776 and ARO Grant #W911NF-16-1-0307

---

background-image: url("resources/Olson2018.png")
background-size: 80%
class: center, bottom

Benchmarking Machine Learning Algorithms

Source: Randal S. Olson and William La Cava et al., 2018.