Beer-in-Hand Data Science

Amanda Dobbyn

First things first

Who am I?

UChicago '15 uchi_ultimate ^ Angela not yet pictured

Now at Earlybird Software

First things first

Where's the code?

https://github.com/aedobbyn/beer-data-science

Markdown writeup in /compile
Shiny app in /clusterfun
Step-by-step scripts for getting and munging data in /run_it
This presentation in /present

$writeup$

How did this come about?

Typical Friday afternoon office conversation
- How do you architect the ideal beer flavor profile visualizer?
  - In particular, how do you represent “hoppy, for a Kölsch”?

Does the style Kölsch even describe a well-defined beer sub-group?
- Would it be reliably distinguishable from non-Kölschs in a blind taste test?

How did this come about?

Beer == water + malted barley + hops + yeast +

sometimes other stuff like fruit

We categorize it into different styles based on
- Type and ratio of ingredients
- How the beer is made (e.g., how long and at what temperature it's brewed)
- Regional and historical differences

How did this come about?

BUT

How well do styles actually demarcate the beer landscape?
- Is there actually more inter-style variation than intra-style variation?
- Is there a more empiricially clean way to categorize beers into super-groups?

In other words, we're asking: are beer styles just a social construct?

The Order of Things, theoretically

Implications

If styles do demarcate the beer landscape well, we should expect to see distinct clusters dominated mostly by beers classified into a single style
It should also be easy to predict style from the other variables

This all assumes the variables we have (more on those in a sec) can account for most of the variance between styles.

The stakes could not be higher.

Step 1: GET Beer

The age-old dilemma

From where? BreweryDB, an online database of beers with a public API.

get_beers

Step 1: GET one Beer

Once we've got a key, we can use a beer's unique BreweryDB ID and our key to get data on that beer.

http://api.brewerydb.com/v2/beer/<BEER_ID_HERE>/?key=/<YOUR_KEY_HERE>

In the browser, that looks like:

Step 1: GET one Beer

base_url <- "http://api.brewerydb.com/v2"
key_preface <- "/?key="

get_beer <- function(id) {
  fromJSON(paste0(base_url, "/beer/", id, "/", key_preface, key))
}

We can make a little function from this that'll take a single ID and use fromJSON to convert from, well, JSON into a nested list starting with our 200 success message.

get_beer("GZQpRX")

$message
[1] "READ ONLY MODE: Request Successful"

$data
$data$id
[1] "GZQpRX"

$data$name
[1] "Alpha King"

$data$nameDisplay
[1] "Alpha King"

$data$description
[1] "A bold yet balanced American Pale Ale with slight caramel sweetness and aggressive citrus hoppiness. This is our flagship beer."

$data$abv
[1] "6.7"

$data$ibu
[1] "68"

$data$glasswareId
[1] 5

$data$srmId
[1] 21

$data$availableId
[1] 1

$data$styleId
[1] 25

$data$isOrganic
[1] "N"

$data$labels
$data$labels$icon
[1] "https://s3.amazonaws.com/brewerydbapi/beer/GZQpRX/upload_t1NkG7-icon.png"

$data$labels$medium
[1] "https://s3.amazonaws.com/brewerydbapi/beer/GZQpRX/upload_t1NkG7-medium.png"

$data$labels$large
[1] "https://s3.amazonaws.com/brewerydbapi/beer/GZQpRX/upload_t1NkG7-large.png"


$data$status
[1] "verified"

$data$statusDisplay
[1] "Verified"

$data$createDate
[1] "2012-01-03 02:42:40"

$data$updateDate
[1] "2017-01-10 21:43:48"

$data$glass
$data$glass$id
[1] 5

$data$glass$name
[1] "Pint"

$data$glass$createDate
[1] "2012-01-03 02:41:33"


$data$srm
$data$srm$id
[1] 21

$data$srm$name
[1] "21"

$data$srm$hex
[1] "952D00"


$data$available
$data$available$id
[1] 1

$data$available$name
[1] "Year Round"

$data$available$description
[1] "Available year round as a staple beer."


$data$style
$data$style$id
[1] 25

$data$style$categoryId
[1] 3

$data$style$category
$data$style$category$id
[1] 3

$data$style$category$name
[1] "North American Origin Ales"

$data$style$category$createDate
[1] "2012-03-21 20:06:45"


$data$style$name
[1] "American-Style Pale Ale"

$data$style$shortName
[1] "American Pale"

$data$style$description
[1] "American pale ales range from deep golden to copper in color. The style is characterized by fruity, floral and citrus-like American-variety hop character producing medium to medium-high hop bitterness, flavor, and aroma. Note that the \"traditional\" style of this beer has its origins with certain floral, fruity, citrus-like, piney, resinous, or sulfur-like American hop varietals. One or more of these hop characters is the perceived end, but the perceived hop characters may be a result of the skillful use of hops of other national origins. American pale ales have medium body and low to medium maltiness. Low caramel character is allowable. Fruity-ester flavor and aroma should be moderate to strong. Diacetyl should be absent or present at very low levels. Chill haze is allowable at cold temperatures."

$data$style$ibuMin
[1] "30"

$data$style$ibuMax
[1] "42"

$data$style$abvMin
[1] "4.5"

$data$style$abvMax
[1] "5.6"

$data$style$srmMin
[1] "6"

$data$style$srmMax
[1] "14"

$data$style$ogMin
[1] "1.044"

$data$style$fgMin
[1] "1.008"

$data$style$fgMax
[1] "1.014"

$data$style$createDate
[1] "2012-03-21 20:06:45"

$data$style$updateDate
[1] "2015-04-07 15:25:18"



$status
[1] "success"

Quick funciton factory

Using purrr::walk() and assign() we can create functions to GET any beer, brewery, category, etc. if we know its ID.

endpoints <- c("beer", "brewery", "category", "event", "feature", "glass", "guild", "hop", "ingredient", "location", "socialsite", "style", "menu")

# Base function
get_ <- function(id, ep) {
  jsonlite::fromJSON(paste0(base_url, "/", ep, "/", id, "/", key_preface, key))
}

# Create new get_<ep> functions
endpoints %>% walk(~ assign(x = paste0("get_", .x),
                             value = partial(get_, ep = .x),
                             envir = .GlobalEnv))

Now we have the functions get_beer(), get_brewery(), get_category(), etc. in our global environment.

Testing testing

get_hop("3")

$message
[1] "READ ONLY MODE: Request Successful"

$data
$data$id
[1] 3

$data$name
[1] "Ahtanum"

$data$description
[1] "An open-pollinated aroma variety developed in Washington, Ahtanum is used for its distinctive, somewhat Cascade-like aroma and for moderate bittering."

$data$countryOfOrigin
[1] "US"

$data$alphaAcidMin
[1] 5.7

$data$betaAcidMin
[1] 5

$data$betaAcidMax
[1] 6.5

$data$humuleneMin
[1] 16

$data$humuleneMax
[1] 20

$data$caryophylleneMin
[1] 9

$data$caryophylleneMax
[1] 12

$data$cohumuloneMin
[1] 30

$data$cohumuloneMax
[1] 35

$data$myrceneMin
[1] 50

$data$myrceneMax
[1] 55

$data$farneseneMax
[1] 1

$data$category
[1] "hop"

$data$categoryDisplay
[1] "Hops"

$data$createDate
[1] "2013-06-24 16:07:26"

$data$updateDate
[1] "2013-06-24 16:10:37"

$data$country
$data$country$isoCode
[1] "US"

$data$country$name
[1] "UNITED STATES"

$data$country$displayName
[1] "United States"

$data$country$isoThree
[1] "USA"

$data$country$numberCode
[1] 840

$data$country$createDate
[1] "2012-01-03 02:41:33"



$status
[1] "success"

Digging In

What are the parts of the data that we want and where do they live?

get_beer("GZQpRX")

$message
[1] "READ ONLY MODE: Request Successful"

$data
$data$id
[1] "GZQpRX"

$data$name
[1] "Alpha King"

$data$nameDisplay
[1] "Alpha King"

$data$description
[1] "A bold yet balanced American Pale Ale with slight caramel sweetness and aggressive citrus hoppiness. This is our flagship beer."

$data$abv
[1] "6.7"

$data$ibu
[1] "68"

$data$glasswareId
[1] 5

$data$srmId
[1] 21

$data$availableId
[1] 1

$data$styleId
[1] 25

$data$isOrganic
[1] "N"

$data$labels
$data$labels$icon
[1] "https://s3.amazonaws.com/brewerydbapi/beer/GZQpRX/upload_t1NkG7-icon.png"

$data$labels$medium
[1] "https://s3.amazonaws.com/brewerydbapi/beer/GZQpRX/upload_t1NkG7-medium.png"

$data$labels$large
[1] "https://s3.amazonaws.com/brewerydbapi/beer/GZQpRX/upload_t1NkG7-large.png"


$data$status
[1] "verified"

$data$statusDisplay
[1] "Verified"

$data$createDate
[1] "2012-01-03 02:42:40"

$data$updateDate
[1] "2017-01-10 21:43:48"

$data$glass
$data$glass$id
[1] 5

$data$glass$name
[1] "Pint"

$data$glass$createDate
[1] "2012-01-03 02:41:33"


$data$srm
$data$srm$id
[1] 21

$data$srm$name
[1] "21"

$data$srm$hex
[1] "952D00"


$data$available
$data$available$id
[1] 1

$data$available$name
[1] "Year Round"

$data$available$description
[1] "Available year round as a staple beer."


$data$style
$data$style$id
[1] 25

$data$style$categoryId
[1] 3

$data$style$category
$data$style$category$id
[1] 3

$data$style$category$name
[1] "North American Origin Ales"

$data$style$category$createDate
[1] "2012-03-21 20:06:45"


$data$style$name
[1] "American-Style Pale Ale"

$data$style$shortName
[1] "American Pale"

$data$style$description
[1] "American pale ales range from deep golden to copper in color. The style is characterized by fruity, floral and citrus-like American-variety hop character producing medium to medium-high hop bitterness, flavor, and aroma. Note that the \"traditional\" style of this beer has its origins with certain floral, fruity, citrus-like, piney, resinous, or sulfur-like American hop varietals. One or more of these hop characters is the perceived end, but the perceived hop characters may be a result of the skillful use of hops of other national origins. American pale ales have medium body and low to medium maltiness. Low caramel character is allowable. Fruity-ester flavor and aroma should be moderate to strong. Diacetyl should be absent or present at very low levels. Chill haze is allowable at cold temperatures."

$data$style$ibuMin
[1] "30"

$data$style$ibuMax
[1] "42"

$data$style$abvMin
[1] "4.5"

$data$style$abvMax
[1] "5.6"

$data$style$srmMin
[1] "6"

$data$style$srmMax
[1] "14"

$data$style$ogMin
[1] "1.044"

$data$style$fgMin
[1] "1.008"

$data$style$fgMax
[1] "1.014"

$data$style$createDate
[1] "2012-03-21 20:06:45"

$data$style$updateDate
[1] "2015-04-07 15:25:18"



$status
[1] "success"

We can grab just the fourth element of the data part of the response, the beer's description in a couple ways:

get_beer("GZQpRX")$data$description

[1] "A bold yet balanced American Pale Ale with slight caramel sweetness and aggressive citrus hoppiness. This is our flagship beer."

Or in other words:

get_beer("GZQpRX")[["data"]][4][[1]]

[1] "A bold yet balanced American Pale Ale with slight caramel sweetness and aggressive citrus hoppiness. This is our flagship beer."

Digging In

Other things are more deeply nested.

get_beer("GZQpRX")$data$style

$id
[1] 25

$categoryId
[1] 3

$category
$category$id
[1] 3

$category$name
[1] "North American Origin Ales"

$category$createDate
[1] "2012-03-21 20:06:45"


$name
[1] "American-Style Pale Ale"

$shortName
[1] "American Pale"

$description
[1] "American pale ales range from deep golden to copper in color. The style is characterized by fruity, floral and citrus-like American-variety hop character producing medium to medium-high hop bitterness, flavor, and aroma. Note that the \"traditional\" style of this beer has its origins with certain floral, fruity, citrus-like, piney, resinous, or sulfur-like American hop varietals. One or more of these hop characters is the perceived end, but the perceived hop characters may be a result of the skillful use of hops of other national origins. American pale ales have medium body and low to medium maltiness. Low caramel character is allowable. Fruity-ester flavor and aroma should be moderate to strong. Diacetyl should be absent or present at very low levels. Chill haze is allowable at cold temperatures."

$ibuMin
[1] "30"

$ibuMax
[1] "42"

$abvMin
[1] "4.5"

$abvMax
[1] "5.6"

$srmMin
[1] "6"

$srmMax
[1] "14"

$ogMin
[1] "1.044"

$fgMin
[1] "1.008"

$fgMax
[1] "1.014"

$createDate
[1] "2012-03-21 20:06:45"

$updateDate
[1] "2015-04-07 15:25:18"

In these cases, we really only care about the name portion.

get_beer("GZQpRX")$data$style$name

[1] "American-Style Pale Ale"

Unnesting

So, we'll unravel the data part of the response, grab whatever we want there, and glue it together into a dataframe.

If the particular list item we're unnesting has a name portion (like $style$name), great, we'll grab that for the column. Otherwise, we'll take whatever's first.

unnest_it <- function(lst) {
  unnested <- lst
  for(col in seq_along(lst[["data"]])) {
    if(! is.null(ncol(lst[["data"]][[col]]))) {
      if(! is.null(lst[["data"]][[col]][["name"]])) {
        unnested[["data"]][[col]] <- lst[["data"]][[col]][["name"]]
      } else {
        unnested[["data"]][[col]] <- lst[["data"]][[col]][[1]]
      }
    }
  }
  return(unnested)
}

This approach is a bit slow because it uses a a for loop 😔. Optimizing it is on the docket.

Grabbing all Beers

Instead of /beer/<BEER_ID> we're now asking for /beers

Paginating the Request

We find out how many pages there are total and then send requests and unnest each until we hit number_of_pages.

paginated_request <- function(ep, addition, trace_progress = TRUE) {    
  full_request <- NULL
  first_page <- fromJSON(paste0(base_url, "/", ep, "/", key_preface, key
                                , "&p=1"))
  number_of_pages <- ifelse(!(is.null(first_page$numberOfPages)), 
                            first_page$numberOfPages, 1)      

    for (page in 1:number_of_pages) {                               
    this_request <- fromJSON(paste0(base_url, "/", ep, "/", key_preface, key
                                    , "&p=", page, addition),
                             flatten = TRUE) 
    this_req_unnested <- unnest_it(this_request)    #  <- request unnested here

    if(trace_progress == TRUE) {message(paste0("Page ", this_req_unnested$currentPage))} # if TRUE, print the page we're on

    full_request <- bind_rows(full_request, this_req_unnested[["data"]])
  }
  return(full_request)
}

Quick note on Ingredients

A few apporaches I used:

Concatenated into a single string during the unnesting process
- hop_name and malt_name using this function 👉
Split out into one hop per column and one malt per column
- hops_name_1, hops_name_2, etc.
Sparse dataframe with each type of hop (like Cascade, Citra, etc.) as its own column
- Value is either 1 or 0

(Another would have been a nested list-col)

unnest_ingredients <- function(df) {
  df$hops_name <- NA
  df$malt_name <- NA

  for (row in 1:nrow(df)) {
    if (!is.null(df[["ingredients.hops"]][[row]][["name"]]) |
        !is.null(df[["ingredients.malt"]][[row]][["name"]])) {

      df[["hops_name"]][[row]] <- paste(df[["ingredients.hops"]][[row]][["name"]], collapse = ", ")

      df[["malt_name"]][[row]] <- paste(df[["ingredients.malt"]][[row]][["name"]], collapse = ", ")
    }
  }
  return(df)
}

What have we got?

beer_necessities <- paginated_request(ep = "beers", addition = "&withIngredients=Y")

id	name	style	style_collapsed	glass	abv	ibu	srm	hops_name	malt_name
X4KcGF	(512) TWO	Imperial or Double India Pale Ale	Double India Pale Ale	Pint	9.0	99	9	Columbus, Glacier, Horizon, Nugget, Simcoe	Caramel/Crystal Malt, Two-Row Pale Malt - Organic, Wheat Malt
USaRyl	(512) Whiskey Barrel Double Pecan Porter	Wood- and Barrel-Aged Strong Beer	Barrel-Aged	Pint	9.5	30	NA	Glacier	Black Malt, Caramel/Crystal Malt, Chocolate Malt, Two-Row Pale Malt - Organic
bXwskR	(512) White IPA	American-Style India Pale Ale	India Pale Ale	Pint	5.3	55	4	NA	NA
XnPVIo	(512) Wild Bear	Specialty Beer	Specialty Beer	Tulip	8.5	9	NA	NA	NA
QLp4mV	(512) Wit	Belgian-Style White (or Wit) / Belgian-Style Wheat	Wheat	Pint	5.1	10	5	Golding (American)	Oats - Malted, Two-Row Pale Malt - Organic, Wheat Malt - White
tWuIyV	(714): Blond Ale	Golden or Blonde Ale	Blonde	NA	4.8	NA	NA	NA	NA

What have we got?

63,495 distinct beers
Info about the beer:
- ABV: alcohol by volume
- IBU: International Biterness Units (really)
- SRM: a measure of color
- Ingredients
  - Hops, malts

Where to put it?

I threw it into MySQL

This allows us to
- Easily update the data if anything changes
- Others easy access to the data if they want to build an app using it

Step 1 Complete ✔️

A look at our outcome variable

We have 170 styles of beer.

levels(beer_necessities$style)

  [1] "Adambier"                                                  
  [2] "Aged Beer (Ale or Lager)"                                  
  [3] "American-Style Amber (Low Calorie) Lager"                  
  [4] "American-Style Amber Lager"                                
  [5] "American-Style Amber/Red Ale"                              
  [6] "American-Style Barley Wine Ale"                            
  [7] "American-Style Black Ale"                                  
  [8] "American-Style Brown Ale"                                  
  [9] "American-Style Cream Ale or Lager"                         
 [10] "American-Style Dark Lager"                                 
 [11] "American-Style Ice Lager"                                  
 [12] "American-Style Imperial Porter"                            
 [13] "American-Style Imperial Stout"                             
 [14] "American-Style India Pale Ale"                             
 [15] "American-Style Lager"                                      
 [16] "American-Style Light (Low Calorie) Lager"                  
 [17] "American-Style Low-Carbohydrate Light Lager"               
 [18] "American-Style Malt Liquor"                                
 [19] "American-Style Märzen / Oktoberfest"                       
 [20] "American-Style Pale Ale"                                   
 [21] "American-Style Pilsener"                                   
 [22] "American-Style Premium Lager"                              
 [23] "American-Style Sour Ale"                                   
 [24] "American-Style Stout"                                      
 [25] "American-Style Strong Pale Ale"                            
 [26] "American-Style Wheat Wine Ale"                             
 [27] "Apple Wine"                                                
 [28] "Australasian, Latin American or Tropical-Style Light Lager"
 [29] "Australian-Style Pale Ale"                                 
 [30] "Baltic-Style Porter"                                       
 [31] "Bamberg-Style Bock Rauchbier"                              
 [32] "Bamberg-Style Helles Rauchbier"                            
 [33] "Bamberg-Style Märzen Rauchbier"                            
 [34] "Bamberg-Style Weiss (Smoke) Rauchbier (Dunkel or Helles)"  
 [35] "Belgian-Style Blonde Ale"                                  
 [36] "Belgian-Style Dark Strong Ale"                             
 [37] "Belgian-Style Dubbel"                                      
 [38] "Belgian-Style Flanders Oud Bruin or Oud Red Ales"          
 [39] "Belgian-style Fruit Beer"                                  
 [40] "Belgian-Style Fruit Lambic"                                
 [41] "Belgian-Style Gueuze Lambic"                               
 [42] "Belgian-Style Lambic"                                      
 [43] "Belgian-Style Pale Ale"                                    
 [44] "Belgian-Style Pale Strong Ale"                             
 [45] "Belgian-Style Quadrupel"                                   
 [46] "Belgian-Style Table Beer"                                  
 [47] "Belgian-Style Tripel"                                      
 [48] "Belgian-Style White (or Wit) / Belgian-Style Wheat"        
 [49] "Berliner-Style Weisse (Wheat)"                             
 [50] "Bohemian-Style Pilsener"                                   
 [51] "Braggot"                                                   
 [52] "Brett Beer"                                                
 [53] "British-Style Barley Wine Ale"                             
 [54] "British-Style Imperial Stout"                              
 [55] "Brown Porter"                                              
 [56] "California Common Beer"                                    
 [57] "Chili Pepper Beer"                                         
 [58] "Chocolate / Cocoa-Flavored Beer"                           
 [59] "Classic English-Style Pale Ale"                            
 [60] "Classic Irish-Style Dry Stout"                             
 [61] "Coffee-Flavored Beer"                                      
 [62] "Common Cider"                                              
 [63] "Common Perry"                                              
 [64] "Contemporary Gose"                                         
 [65] "Cyser (Apple Melomel)"                                     
 [66] "Dark American Wheat Ale or Lager with Yeast"               
 [67] "Dark American Wheat Ale or Lager without Yeast"            
 [68] "Dark American-Belgo-Style Ale"                             
 [69] "Dortmunder / European-Style Export"                        
 [70] "Double Red Ale"                                            
 [71] "Dry Lager"                                                 
 [72] "Dry Mead"                                                  
 [73] "Dutch-Style Kuit, Kuyt or Koyt"                            
 [74] "Energy Enhanced Malt Beverage"                             
 [75] "English Cider"                                             
 [76] "English-Style Brown Ale"                                   
 [77] "English-Style Dark Mild Ale"                               
 [78] "English-Style India Pale Ale"                              
 [79] "English-Style Pale Mild Ale"                               
 [80] "English-Style Summer Ale"                                  
 [81] "European-Style Dark / Münchner Dunkel"                     
 [82] "Experimental Beer (Lager or Ale)"                          
 [83] "Extra Special Bitter"                                      
 [84] "Field Beer"                                                
 [85] "Flavored Malt Beverage"                                    
 [86] "Foreign (Export)-Style Stout"                              
 [87] "French & Belgian-Style Saison"                             
 [88] "French Cider"                                              
 [89] "French-Style Bière de Garde"                               
 [90] "Fresh \"Wet\" Hop Ale"                                     
 [91] "Fruit Beer"                                                
 [92] "Fruit Cider"                                               
 [93] "Fruit Wheat Ale or Lager with or without Yeast"            
 [94] "German-Style Altbier"                                      
 [95] "German-Style Doppelbock"                                   
 [96] "German-Style Eisbock"                                      
 [97] "German-Style Heller Bock/Maibock"                          
 [98] "German-Style Kölsch / Köln-Style Kölsch"                   
 [99] "German-Style Leichtbier"                                   
[100] "German-Style Leichtes Weizen / Weissbier"                  
[101] "German-Style Märzen"                                       
[102] "German-Style Oktoberfest / Wiesen (Meadow)"                
[103] "German-Style Pilsener"                                     
[104] "German-Style Rye Ale (Roggenbier) with or without Yeast"   
[105] "German-Style Schwarzbier"                                  
[106] "Ginjo Beer or Sake-Yeast Beer"                             
[107] "Gluten-Free Beer"                                          
[108] "Golden or Blonde Ale"                                      
[109] "Grodziskie"                                                
[110] "Herb and Spice Beer"                                       
[111] "Historical Beer"                                           
[112] "Imperial or Double India Pale Ale"                         
[113] "Imperial Red Ale"                                          
[114] "Indigenous Beer (Lager or Ale)"                            
[115] "International-Style Pale Ale"                              
[116] "International-Style Pilsener"                              
[117] "Irish-Style Red Ale"                                       
[118] "Kellerbier (Cellar beer) or Zwickelbier - Ale"             
[119] "Kellerbier (Cellar beer) or Zwickelbier - Lager"           
[120] "Leipzig-Style Gose"                                        
[121] "Light American Wheat Ale or Lager with Yeast"              
[122] "Light American Wheat Ale or Lager without Yeast"           
[123] "Metheglin"                                                 
[124] "Mixed Culture Brett Beer"                                  
[125] "Münchner (Munich)-Style Helles"                            
[126] "New England Cider"                                         
[127] "Non-Alcoholic (Beer) Malt Beverages"                       
[128] "Oatmeal Stout"                                             
[129] "Old Ale"                                                   
[130] "Open Category Mead"                                        
[131] "Ordinary Bitter"                                           
[132] "Other Belgian-Style Ales"                                  
[133] "Other Fruit Melomel"                                       
[134] "Other Specialty Cider or Perry"                            
[135] "Other Strong Ale or Lager"                                 
[136] "Pale American-Belgo-Style Ale"                             
[137] "Pumpkin Beer"                                              
[138] "Pyment (Grape Melomel)"                                    
[139] "Robust Porter"                                             
[140] "Rye Ale or Lager with or without Yeast"                    
[141] "Scotch Ale"                                                
[142] "Scottish-Style Export Ale"                                 
[143] "Scottish-Style Heavy Ale"                                  
[144] "Scottish-Style Light Ale"                                  
[145] "Semi-Sweet Mead"                                           
[146] "Session Beer"                                              
[147] "Session India Pale Ale"                                    
[148] "Smoke Beer (Lager or Ale)"                                 
[149] "Smoke Porter"                                              
[150] "South German-Style Bernsteinfarbenes Weizen / Weissbier"   
[151] "South German-Style Dunkel Weizen / Dunkel Weissbier"       
[152] "South German-Style Hefeweizen / Hefeweissbier"             
[153] "South German-Style Kristall Weizen / Kristall Weissbier"   
[154] "South German-Style Weizenbock / Weissbock"                 
[155] "Special Bitter or Best Bitter"                             
[156] "Specialty Beer"                                            
[157] "Specialty Honey Lager or Ale"                              
[158] "Specialty Stouts"                                          
[159] "Strong Ale"                                                
[160] "Sweet Mead"                                                
[161] "Sweet or Cream Stout"                                      
[162] "Traditional German-Style Bock"                             
[163] "Traditional Perry"                                         
[164] "Vienna-Style Lager"                                        
[165] "Wild Beer"                                                 
[166] "Wood- and Barrel-Aged Beer"                                
[167] "Wood- and Barrel-Aged Dark Beer"                           
[168] "Wood- and Barrel-Aged Pale to Amber Beer"                  
[169] "Wood- and Barrel-Aged Sour Beer"                           
[170] "Wood- and Barrel-Aged Strong Beer"

What's the best way to condense these?

I want to e.g., lump American-Style India Pale Ale in with English-Style India Pale Ale.

Step 2: Breathe sigh of relief, Collapse

keywords <- c("Pale Ale", "India Pale Ale", "Double India Pale Ale", "Lager", "India Pale Lager", "Hefeweizen", "Barrel-Aged","Wheat", "Pilsner", "Pilsener", "Amber", "Golden", "Blonde", "Brown", "Black", "Stout", "Imperial Stout", "Fruit", "Porter", "Red", "Sour", "Kölsch", "Tripel", "Bitter", "Saison", "Strong Ale", "Barley Wine", "Dubbel")

collapse_styles <- function(df, trace_progress = TRUE) {

  df[["style_collapsed"]] <- vector(length = nrow(df))

  for (beer in 1:nrow(df)) {
    if (grepl(paste(keywords, collapse="|"), df$style[beer])) {    
      for (keyword in keywords) {         
        if(grepl(keyword, df$style[beer]) == TRUE) {
          df$style_collapsed[beer] <- keyword    
        }                         
      } 
    } else {
      df$style_collapsed[beer] <- as.character(df$style[beer])       
    }
    if(trace_progress == TRUE) {message(paste0("Collapsing this ", df$style[beer], " to: ", df$style_collapsed[beer]))}
  }
  return(df)
}

Collapsing in Action

Setting trace_progress = TRUE:

get_beers

keywords are ordered from most general to most specific
- If a beer's name matches multiple keywords, its style_collapsed is the last of those that appear in keywords

American-Style Pale Ale ➡ Pale Ale

American-Style India Pale Ale ➡ India Pale Ale

Popular Styles

Let's see where most of the data is concentrated.

Popular defined as
- Styles with above the mean number of beers per style (z-score > 0)
  - Of course, this isn't a measure of popular consumption; just a reflection of the number of different beers we get from BreweryDB that are classified into that style

And then get a sense of where those styles fall in relation to one another
- Style “centers” = mean ABV, IBU, and SRM of each style

Popular Styles

Collapsed Style	Mean ABV	Mean IBU	Mean SRM	Numer of Beers
India Pale Ale	6.58	66.04	9.99	6524
Pale Ale	5.70	40.87	8.89	4280
Stout	7.99	43.90	36.30	4238
Wheat	5.16	17.47	5.86	3349
Double India Pale Ale	8.93	93.48	11.01	2525
Red	5.74	33.81	16.18	2521
Lager	5.45	30.64	8.46	2230
Saison	6.40	27.25	7.05	2167
Blonde	5.60	22.39	5.62	2044
Porter	6.18	33.25	32.20	1973
Brown	6.16	32.22	23.59	1462
Pilsener	5.23	33.51	4.41	1268
Specialty Beer	6.45	33.78	15.52	1044
Bitter	5.32	38.28	12.46	939
Fruit Beer	5.20	19.24	8.67	905
Herb and Spice Beer	6.62	27.77	18.17	872
Sour	6.22	18.89	10.04	797
Strong Ale	8.83	36.74	22.55	767
Tripel	9.03	32.52	7.68	734
Black	6.96	65.51	31.08	622
Barley Wine	10.78	74.05	19.56	605
Kölsch	4.98	23.37	4.37	593
Barrel-Aged	9.00	39.16	18.13	540
Other Belgian-Style Ales	7.52	37.56	17.55	506
Pumpkin Beer	6.71	23.48	17.92	458
Dubbel	7.51	25.05	22.94	399
Scotch Ale	7.62	26.37	24.22	393
German-Style Doppelbock	8.05	28.89	25.70	376
Fruit Cider	6.21	25.60	12.00	370
German-Style Märzen	5.75	25.64	14.32	370

Style Centers

If beer styles are really self-contained, beer in each style should cluster tightly around these points.

plot of chunk unnamed-chunk-16

To the main question

Do styles truly define distinct pockets of beer?

If they do, we could expect styles to align with clusters generated using an unsupervised learning algorithm.

k-means
- Splits observations into k clusters (we choose k)
- Try to minimize the sum of squares between each datapoint and its assigned cluster center

Note that we'll have to throw away rows that contain NA values in any of the variables we're using.

What variables should we include?

What's in a Predictor?

A heuristic I used:

Outputs

Only measured after a beer been brewed
- ABV, IBU, SRM

Style-Defined

Dependent entirely on style
- Serving glass

Inputs

Only directly controlled by a brewer before a beer is brewed
- Hops, malts

👍 predictor

👎 predictor

🤷🏼‍♀️ predictor because chicken and egg problem
Do brewers assign style first and then choose which ingredients to add, or vice versa? 🐣

Clustering: Let's have a function

do_cluster <- function (df, vars, to_cluster_on, n_centers = 5) {
  df_for_clustering <- df %>% select(!!vars) %>% na.omit()

  # Scale the ones to be scaled and append _scaled to their names
  df_vars_scale <- df_for_clustering %>% select(!!to_cluster_on) %>%
    scale() %>% as_tibble()
  names(df_vars_scale) <- names(df_vars_scale) %>% stringr::str_c("_scaled")

  # Do the clustering on the scaled data
  clusters_out <- kmeans(x = df_vars_scale, centers = n_centers, trace = FALSE)

  # Combine cluster assignment, scaled data, and unscaled rest of data
  clustered_df <- bind_cols(
    cluster_assignment = factor(clusters_out$cluster),   # Cluster assignment
    df_vars_scale,
    df_for_clustering
  )

  return(clustered_df)
}

We don't need to specify an outcome variable because this is unsupervised.

Clustering: Run It

We'll cluster the beers on the scaled versions of ABV, IBU, and SRM

to_include <- c("id", "name", "style", "style_collapsed", "abv", "ibu", "srm")

to_cluster_on <- c("abv", "ibu", "srm")

and stitch together the cluster assignments, scaled columns, and the original data.

clustered_beer <- do_cluster(beer_dat, to_include, to_cluster_on)

Clustering: Output

Cluster Assignment	ABV Scaled	IBU Scaled	SRM Scaled	ID	Name	Style	Style Collapsed	ABV	IBU	SRM
4	0.2428039	-0.6803269	2.1840186	tmEthz	“Admiral” Stache	Baltic-Style Porter	Porter	7.0	23.0	37
4	-0.5689624	0.3608647	2.4657681	b7SfHG	“Ah Me Joy” Porter	Robust Porter	Porter	5.4	51.0	40
1	-0.3660208	0.3608647	-0.5395600	PBEXhV	“Bison Eye Rye” Pale Ale \| 2 of 4 Part Pale Ale Series	American-Style Pale Ale	Pale Ale	5.8	51.0	8
1	-0.5689624	0.4724210	-0.2578105	AXmvOd	“Dust Up” Cloudy Pale Ale \| 1 of 4 Part Pale Ale Series	American-Style Pale Ale	Pale Ale	5.4	54.0	11
3	-0.4674916	-0.4869627	-0.8213095	Hr5A0t	“God Country” Kolsch	German-Style Kölsch / Köln-Style Kölsch	Kölsch	5.6	28.2	5
3	-0.8226393	-0.7918831	-0.8213095	mrVjY4	“Jemez Field Notes” Golden Lager	Golden or Blonde Ale	Blonde	4.9	20.0	5
3	-0.7211685	-1.1265519	-0.9152260	xFM8w5	#10 Hefewiezen	South German-Style Hefeweizen / Hefeweissbier	Wheat	5.1	11.0	4
3	-0.7211685	-0.7918831	-0.4456435	hB0QeO	#9	American-Style Pale Ale	Pale Ale	5.1	20.0	9
3	-0.8733747	-0.5315852	-1.0091425	m8f62Y	#KoLSCH	German-Style Kölsch / Köln-Style Kölsch	Kölsch	4.8	27.0	3
3	-0.6196977	-0.8662540	-0.8213095	35lHUq	'Inappropriate' Cream Ale	American-Style Cream Ale or Lager	Lager	5.3	18.0	5
1	0.2428039	-0.4200290	-0.6334765	qbRV90	'tis the Saison	French & Belgian-Style Saison	Saison	7.0	30.0	7
3	-0.7719039	-0.7918831	-0.4456435	qhaIVA	(306) URBAN WHEAT BEER	Belgian-Style White (or Wit) / Belgian-Style Wheat	Wheat	5.0	20.0	9
4	0.5472162	-0.4200290	0.6813546	VwR7Xg	(512) Bruin (A.K.A. Brown Bear)	American-Style Brown Ale	Brown	7.6	30.0	21
1	0.4964808	-0.2341019	-0.5395600	oJFZwK	(512) FOUR	Strong Ale	Strong Ale	7.5	35.0	8
1	0.2428039	0.8814606	-0.5395600	ezGh5N	(512) IPA	American-Style India Pale Ale	India Pale Ale	7.0	65.0	8
1	0.7501578	-0.7175123	-0.5395600	s8rdpK	(512) ONE	Belgian-Style Pale Strong Ale	Strong Ale	8.0	22.0	8
3	-0.2645500	-0.4200290	-0.6334765	2fXsvw	(512) Pale	American-Style Pale Ale	Pale Ale	6.0	30.0	7
4	0.4964808	-0.6059561	1.3387701	9O3QPg	(512) SIX	Belgian-Style Dubbel	Dubbel	7.5	25.0	28
1	1.5111886	-0.7175123	-0.3517270	A78JSF	(512) THREE	Belgian-Style Tripel	Tripel	9.5	22.0	10
5	1.5111886	-0.7175123	2.4657681	WKSYBT	(512) THREE (Cabernet Barrel Aged)	Belgian-Style Tripel	Tripel	9.5	22.0	40
2	1.2575117	2.1457647	-0.4456435	X4KcGF	(512) TWO	Imperial or Double India Pale Ale	Double India Pale Ale	9.0	99.0	9
1	-0.6196977	0.5096064	-0.9152260	bXwskR	(512) White IPA	American-Style India Pale Ale	India Pale Ale	5.3	55.0	4
3	-0.7211685	-1.1637373	-0.8213095	QLp4mV	(512) Wit	Belgian-Style White (or Wit) / Belgian-Style Wheat	Wheat	5.1	10.0	5
3	-1.0763162	-0.8290686	-0.9152260	thTbY7	(904) Weissguy	South German-Style Hefeweizen / Hefeweissbier	Wheat	4.4	19.0	4
3	-0.6704331	-0.2341019	-0.5395600	EPnv3B	(916)	American-Style Pale Ale	Pale Ale	5.2	35.0	8
3	-0.1123438	-0.6431415	-0.0699775	QT9hB8	+1 Pumpkin	Pumpkin Beer	Pumpkin Beer	6.3	24.0	13
2	0.6486870	1.4392418	-0.8213095	btwcy1	077XX India Pale Ale	Imperial or Double India Pale Ale	Double India Pale Ale	7.8	80.0	5
1	1.2067763	0.1377522	-0.8213095	FWiYZi	08.08.08 Vertical Epic Ale	Belgian-Style Pale Ale	Pale Ale	8.9	45.0	5
1	-0.5182270	0.5467918	-0.4456435	M6vu9P	10 Blocks South	American-Style Pale Ale	Pale Ale	5.5	56.0	9
4	0.2428039	-0.4200290	1.9022691	y6LrZ3	10 Ton	Oatmeal Stout	Stout	7.0	30.0	34

Clustering: Plot

Color vs. bitterness with style centers overlaid:

plot of chunk cluster_srm_ibu

Here I've trimmed outliers using this function with a cutoff = 2.5.

trim_outliers <- function(df, cutoff = 1.96, exclude = NULL, keep_scaled = TRUE){

  to_scale <- names(df)[!names(df) %in% exclude]

  df_scaled <- df %>%
    select(!!to_scale) %>%
    transmute_if(
      is.numeric, scale
    )
  names(df_scaled) <- names(df_scaled) %>% stringr::str_c("_scaled")

  df_out <- bind_cols(df_scaled, df)

  df_out_trimmed <- df_out %>%
    filter_at(
      .vars = vars(contains("_scaled")),
      .vars_predicate = all_vars(. < abs(cutoff))
    )

  if(keep_scaled == FALSE) {
    df_out_trimmed <- df_out_trimmed %>% select(!!names(df))
  }

  return(df_out_trimmed)
}

(You can grab it from my dobtools package on GitHub if you like.)

Narrowing In

If we focus in on 5 distinct styles and cluster them into 5 clusters, will each style be siphoned off into its own cluster?

	1	2	3	4	5
Blonde	131	17	4	1	7
India Pale Ale	43	1	468	11	54
Stout	6	10	3	180	5
Tripel	1	57	1	3	3
Wheat	289	9	6	5	12

plot of chunk cluster_certain_styles

Clusterfun with Shiny: UI

Snapshot of what we're including on the frontend:

sidebarLayout(
    sidebarPanel(
      h4("Control Panel"),

      checkboxInput("show_all", "Show all styles", TRUE),      
      checkboxInput("show_centers", "Show style centers", FALSE),
      numericInput("num_clusters", "Number of Clusters:", starting_n_clusters),

      checkboxGroupInput("cluster_on", "Choose variables to cluster on: ",
                         c("ABV (alcohol)" = "abv", 
                           "IBU (bitterness)" = "ibu", 
                           "SRM (color)" ="srm", 
                           "Total number of hops" = "total_hops", 
                           "Total number of malts" = "total_malt"),
                         selected = c("abv", "ibu", "srm")),

      checkboxGroupInput("response_vars", "Choose response variable(s): ",
                         c("Collapsed style" = "style_collapsed",
                           "Specific style" = "style"
                         ),
                         selected = c("style_collapsed")),

      conditionalPanel(
        condition = "input.show_all == false",
        selectInput("style_collapsed", "Collapsed Style:",
                    style_names)
        )
    ),

Clusterfun with Shiny: Server

And how the backend is serving up that data, conditional on what the user asks it to display.

  output$cluster_plot <- renderPlot({

    # If our checkbox is checked saying we do want style centers, show them. Else, don't.
    if (input$show_all == FALSE & input$show_centers == TRUE) {

      this_style_center <- reactive({style_centers %>% filter(style_collapsed == input$style_collapsed)})

      ggplot() +
        geom_point(data = this_style_data(),
                   aes(x = abv, y = ibu, colour = cluster_assignment), alpha = 0.5) +
        geom_point(data = this_style_center(),
                   aes(mean_abv, mean_ibu), colour = "black") +
        geom_text_repel(data = this_style_center(),
                        aes(mean_abv, mean_ibu, label = input$style_collapsed),
                        box.padding = unit(0.45, "lines"),
                        family = "Calibri") +
        ggtitle("k-Means Clustered Beer") +
        labs(x = "ABV", y = "IBU") +
        labs(colour = "Cluster Assignment") +
        theme_minimal() +
        theme(legend.position="none")
    } else if  # ....... etc., etc.

Clusterfun with Shiny

https://amandadobbyn.shinyapps.io/clusterfun/

And now for something completely different

…a quick dive into hops

Hops

No, not those hops!

fresh_hops

These hops ☝️

Hops \häps\, n: 1. It's what makes beer bitter and flavorful.

One question: do more kinds of hops generally make a beer more bitter?

(This is different than the total quantity of hops used during brewing.)

Hops Munge

Let's munge a bit. We'll need to split out ingredients from one column into many.

split_ingredients <- function(df, ingredients_to_split) {
  ncol_df <- ncol(df)

  for (ingredient in ingredients_to_split) {
    ingredient_split <- str_split(df[[ingredient]], ", ")    
    num_new_cols <- max(lengths(ingredient_split))    

    for (num in 1:num_new_cols) {
      this_col <- ncol_df + 1         
      df[, this_col] <- NA
      names(df)[this_col] <- paste0(ingredient, "_", num)
      ncol_df <- ncol(df)             

      for (row in seq_along(ingredient_split)) {          
        if (!is.null(ingredient_split[[row]][num])) {        
          df[row, this_col] <- ingredient_split[[row]][num]
        }
      }
      df[[names(df)[this_col]]] <- factor(df[[names(df)[this_col]]])
    }
    ncol_df <- ncol(df)
  }
  return(df)
}

beer_necessities <- split_ingredients(beer_necessities, 
                              ingredients_to_split = c("hops", "malt"))

Hops Munge

# Gather up all the hops columns into one called `hop_name`
beer_necessities_hops_gathered <- beer_necessities %>%
  gather(
    hop_key, hop_name, hops_name_1:hops_name_13
  ) %>% as_tibble()

# Filter to just those beers that have at least one hop
beer_necessities_w_hops <- beer_necessities_hops_gathered %>% 
  filter(!is.na(hop_name)) %>% 
  filter(!hop_name == "")

beer_necessities_w_hops$hop_name <- factor(beer_necessities_w_hops$hop_name)

# For all hops, find the number of beers they're in as well as those beers' mean IBU and ABV
hops_beer_stats <- beer_necessities_w_hops %>% 
  ungroup() %>% 
  group_by(hop_name) %>% 
  summarise(
    mean_ibu = mean(ibu, na.rm = TRUE) %>% round(digits=2), 
    mean_abv = mean(abv, na.rm = TRUE) %>% round(digits=2),
    n = n()
  ) %>% 
  arrange(desc(n))

# Pare to hops that are used in at least 50 beers
pop_hops_beer_stats <- hops_beer_stats[hops_beer_stats$n > 50, ] 

pop_hops_display <- pop_hops_beer_stats %>% 
    rename(
    `Hop` = hop_name,
    `Mean IBU` = mean_ibu,
    `Mean ABV` = mean_abv,
    `N Beers` = n
  )

What's the hop landscape look like?

Hops contained in >50 beers:

Hop	Mean IBU	Mean ABV	N Beers
Cascade	51.92	6.51	445
Centennial	63.97	7.08	243
Chinook	60.87	7.04	194
Simcoe	64.07	6.88	191
Columbus	63.74	6.95	183
Amarillo	61.36	6.96	163
Citra	59.60	6.73	157
Willamette	39.61	7.01	133
Nugget	52.24	6.38	114
Magnum	48.72	6.93	109
East Kent Golding	38.52	6.35	89

Hop	Mean IBU	Mean ABV	N Beers
Perle (American)	32.04	6.25	88
Hallertauer (American)	23.92	5.66	83
Mosaic	56.82	6.98	71
Northern Brewer (American)	39.48	6.47	71
Mount Hood	37.84	6.55	68
Warrior	59.13	6.98	62
Saaz (American)	30.70	6.25	60
Fuggles	40.76	6.77	59
Tettnanger (American)	30.28	6.02	59

Hop Landscape

Per hop, number beers containing that hop and their average ABV and IBU.

Note that there is actually, irl, a strain of hops called Fuggles.

Q: Do more kinds of hops generally make a beer more bitter?

plot of chunk unnamed-chunk-23

How do hops affect bitterness?

plot of chunk unnamed-chunk-24

Is the relationship significant?

hops_ibu_lm <- lm(ibu ~ total_hops, 
  data = beer_dat %>% filter(total_hops > 0)) %>% 
  broom::tidy() %>% dobtools::style_lm()

    Variable Estimate Std Error P Value
1 Total Hops    8.635     0.488       0

We can expect an increase in around 9 IBU for every 1 extra hop.

Okay back on track!

onward

Prediction

The other side of the coin: supervised learning classification problem
- Random forest
- Neural network

We'll go through the neural net.

A low prediction error rate should indicate that styles are well-defined, at least by the variables that we have.

Prediction: Neural Net

Package: nnet
Outcome variable: style or style_collapsed

What we'll do

Feed it a dataframe, an outcome variable, and a set of predictor variables
It will train it 80%, test on 20%
- From this, we can get a measure of accuracy

Neural Net: the Function

run_neural_net <- function(df, multinom = TRUE, outcome, predictor_vars,
                           maxit = 500, size = 5, trace = FALSE, seed = NULL) {
  out <- list(outcome = outcome)

  # Create a new column outcome; it's style_collapsed if you set outcome to style_collapsed, and style otherwise
  if (outcome == "style_collapsed") {
    df[["outcome"]] <- df[["style_collapsed"]]
  } else {
    df[["outcome"]] <- df[["style"]]
  }

  cols_to_keep <- c("outcome", predictor_vars)

  df <- df %>%
    select_(.dots = cols_to_keep) %>%
    mutate(row = 1:nrow(df)) %>% 
    droplevels()

  # Select 80% of the data for training
  df_train <- sample_n(df, nrow(df)*(0.8)) %>% droplevels()

  # The rest is for testing
  df_test <- df %>%
    filter(! (row %in% df_train$row)) %>%
    select(-row)

  df_train <- df_train %>%
    select(-row)

  if(multinom==TRUE) {
    nn <- multinom(outcome ~ .,
                   data = df_train, maxit=maxit, trace=trace)

    # Which variables are the most important in the neural net?
    most_important_vars <- varImp(nn)

  } else if (multinom==FALSE) {
    nn <- nnet(outcome ~ ., size = size,
                   data = df_train, maxit=maxit, trace=trace)
    most_important_vars <- NULL

  }

  # How accurate is the model? Compare predictions to outcomes from test data
  nn_preds <- predict(nn, type="class", newdata = df_test) %>% factor()
  nn_accuracy <- postResample(df_test$outcome, nn_preds)

  out <- list(out, nn = nn, 
              most_important_vars = most_important_vars,
              df_test = df_test,
              nn_preds = nn_preds,
              nn_accuracy = nn_accuracy)

  return(out)
}

Neural Net: Run It

Our predictors will include the total number of hops and malts in the beer, plus our usual ABV, IBU, SRM.

p_vars <- c("total_hops", "total_malt", "abv", "ibu", "srm")

We'll use data filtered to just the most popular styles and run it.

nn_collapsed_out <- run_neural_net(df = beer_dat %>% drop_na(!!p_vars), outcome = "style_collapsed", predictor_vars = p_vars, size = 10, trace = TRUE, multinom = FALSE, seed = 9)

# weights:  379
initial  value 12671.710386 
iter  10 value 8852.454547
iter  20 value 8330.444290
iter  30 value 8022.367811
iter  40 value 7690.050305
iter  50 value 7426.537616
iter  60 value 7136.939031
iter  70 value 6941.070550
iter  80 value 6650.527339
iter  90 value 6484.024279
iter 100 value 6251.726635
iter 110 value 6151.316864
iter 120 value 6054.067797
iter 130 value 5928.756321
iter 140 value 5866.960200
iter 150 value 5830.132567
iter 160 value 5805.548164
iter 170 value 5792.047662
iter 180 value 5786.967344
iter 190 value 5783.937058
iter 200 value 5780.752235
iter 210 value 5771.709115
iter 220 value 5757.418148
iter 230 value 5745.412760
iter 240 value 5737.414828
iter 250 value 5733.175411
iter 260 value 5729.750039
iter 270 value 5727.505435
iter 280 value 5724.679021
iter 290 value 5722.687159
iter 300 value 5721.413677
iter 310 value 5720.333103
iter 320 value 5719.452310
iter 330 value 5718.813302
iter 340 value 5718.244388
iter 350 value 5717.786992
iter 360 value 5717.346364
iter 370 value 5716.925022
iter 380 value 5716.466009
iter 390 value 5716.072555
iter 400 value 5715.745053
iter 410 value 5715.259823
iter 420 value 5714.886930
iter 430 value 5714.408302
iter 440 value 5713.300458
iter 450 value 5712.727905
iter 460 value 5712.489987
final  value 5712.489501 
converged

Neural Net: Visual

We can check out the structure of the neural net using the NeuralNetTools package.

We can see we've got one hidden layer and two bias nodes.

par(mar=c(4.1, 0.1, 4.1, 7.1))
NeuralNetTools::plotnet(nn_collapsed_out$nn, line_stag = 0.01, max_sp = TRUE)

plot of chunk unnamed-chunk-28

Neural Net: Evaluate

How'd we do?

nn_collapsed_out$nn_accuracy[1]

 Accuracy 
0.5057803

50.6% isn't terrible given we've got 30 collapsed styles; chance would be 3.3%.

Which variables are most important?

caret::varImp will give us an importance measure using combinations of the absolute values of the weights

Variable	Importance Percent
total_hops	58.8%
total_malt	19.1%
abv	17.7%
ibu	2.5%
srm	1.8%

Neural Net: Glass

What happens if we add in glass, a style-dependent attribute, as a predictor?

p_vars_add_glass <- c("total_hops", "total_malt", "abv", "ibu", "srm", "glass")

nn_collapsed_out_add_glass <- run_neural_net(df = beer_dat %>% drop_na(!!p_vars_add_glass), outcome = "style_collapsed", predictor_vars = p_vars_add_glass, trace = FALSE, multinom = FALSE)

 Accuracy 
0.5436409

That's 3.79 percentage points more accurate than the model without glass included.

Here summing up the contributions from all glasses.

Variable	Importance Percent
glass	46.3%
total_hops	15.6%
total_malt	12.0%
abv	9.1%
ibu	8.6%
srm	8.5%

So what's the answer?

Are beer styles a useful construct to use as a proxy for natural clusters in beer?

I'd give it a fuzzy yes.

Fuzzy because:

We couldn't do better than ~50% accuracy
We had a lot of rows with missing predictors

Unknowns:

Was our style collapsing scheme successful in putting beers in the “right” buckets?
Would taste-related information have accounted for much of the remaining variance we couldn't explain?

So what's the answer?

plot of chunk unnamed-chunk-33

Plotting the landscape directly on our two main dimensions,

Syles are very overlapping, especially among lower-alcohol and -bitterness styles
But, no clear super-groups that are independent of style

Future Directions

In no particular order, some thoughts I've had plus suggestions from others:

More Shiny app features
- Beer searchability, tooltips over each point on hover
Join on other data
- e.g., Untappd or something scraped from the interwebs
Beer consumption: how is this trending over time, for each style?
- What drives the trend? Supply or demand?
  - Do brewers brew more sours causing people buy more of them or do people start liking sours and cause brewers to brew more?
Hierarchical clustering to make a more scientific beer family tree
Some funky algorithm to generate new beer names

Cheers, all!

R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2         NeuralNetTools_1.5.0 emo_0.0.0.9000      
 [4] caret_6.0-76         lattice_0.20-35      nnet_7.3-12         
 [7] feather_0.3.1        forcats_0.2.0        dplyr_0.7.4.9000    
[10] purrr_0.2.4          readr_1.1.1          tidyr_0.7.2         
[13] tibble_1.3.4         tidyverse_1.1.1      dobtools_0.1.0      
[16] ggrepel_0.6.5        ggplot2_2.2.1        jsonlite_1.5        
[19] broom_0.4.2          knitr_1.17          

loaded via a namespace (and not attached):
 [1] nlme_3.1-131        pbkrtest_0.4-7      lubridate_1.6.0    
 [4] devtools_1.13.3     RColorBrewer_1.1-2  httr_1.3.1         
 [7] tools_3.3.3         backports_1.1.0     R6_2.2.2           
[10] rpart_4.1-11        Hmisc_4.0-3         lazyeval_0.2.0     
[13] mgcv_1.8-17         colorspace_1.3-2    withr_2.0.0        
[16] tidyselect_0.2.2    gridExtra_2.2.1     mnormt_1.5-5       
[19] curl_2.8.1          git2r_0.18.0        rvest_0.3.2        
[22] quantreg_5.29       htmlTable_1.9       SparseM_1.74       
[25] xml2_1.1.1          labeling_0.3        scales_0.5.0       
[28] checkmate_1.8.3     psych_1.7.5         stringr_1.2.0      
[31] digest_0.6.12       foreign_0.8-69      minqa_1.2.4        
[34] base64enc_0.1-3     pkgconfig_2.0.1     htmltools_0.3.6    
[37] lme4_1.1-13         highr_0.6           htmlwidgets_0.9    
[40] rlang_0.1.2.9000    readxl_1.0.0        shiny_1.0.5.9000   
[43] bindr_0.1           acepack_1.4.1       ModelMetrics_1.1.0 
[46] car_2.1-5           magrittr_1.5        Formula_1.2-2      
[49] Matrix_1.2-8        Rcpp_0.12.13        munsell_0.4.3      
[52] stringi_1.1.5       MASS_7.3-47         plyr_1.8.4         
[55] grid_3.3.3          parallel_3.3.3      crayon_1.3.4       
[58] miniUI_0.1.1        haven_1.1.0         splines_3.3.3      
[61] hms_0.3             ranger_0.8.0        reshape2_1.4.2     
[64] codetools_0.2-15    stats4_3.3.3        glue_1.1.1         
[67] evaluate_0.10.1     latticeExtra_0.6-28 data.table_1.10.4  
[70] modelr_0.1.1        nloptr_1.0.4        httpuv_1.3.5.9000  
[73] foreach_1.4.3       MatrixModels_0.4-1  cellranger_1.1.0   
[76] gtable_0.2.0        assertthat_0.2.0    mime_0.5           
[79] xtable_1.8-2        e1071_1.6-8         class_7.3-14       
[82] survival_2.41-3     iterators_1.0.8     memoise_1.1.0      
[85] cluster_2.0.5