Beer-in-Hand Data Science

Amanda Dobbyn

First things first

Who am I?

UChicago '15 uchi_ultimate ^ Angela not yet pictured

First things first


Where's the code?

https://github.com/aedobbyn/beer-data-science

  • Markdown writeup in /compile
  • Shiny app in /clusterfun
  • Step-by-step scripts for getting and munging data in /run_it
  • This presentation in /present


writeup

How did this come about?

  • Typical Friday afternoon office conversation
    • How do you architect the ideal beer flavor profile visualizer?
      • In particular, how do you represent “hoppy, for a Kölsch”?


  • Does the style Kölsch even describe a well-defined beer sub-group?
    • Would it be reliably distinguishable from non-Kölschs in a blind taste test?

How did this come about?

Beer == water + malted barley + hops + yeast +

sometimes other stuff like fruit


  • We categorize it into different styles based on
    • Type and ratio of ingredients
    • How the beer is made (e.g., how long and at what temperature it's brewed)
    • Regional and historical differences

How did this come about?

BUT

  • How well do styles actually demarcate the beer landscape?
    • Is there actually more inter-style variation than intra-style variation?
    • Is there a more empiricially clean way to categorize beers into super-groups?

In other words, we're asking: are beer styles just a social construct?

The Order of Things, theoretically

order_of_things

Implications

  • If styles do demarcate the beer landscape well, we should expect to see distinct clusters dominated mostly by beers classified into a single style

  • It should also be easy to predict style from the other variables

This all assumes the variables we have (more on those in a sec) can account for most of the variance between styles.

The stakes could not be higher.

Step 1: GET Beer

The age-old dilemma

From where? BreweryDB, an online database of beers with a public API.

get_beers

get_beers

Step 1: GET one Beer

Once we've got a key, we can use a beer's unique BreweryDB ID and our key to get data on that beer.

http://api.brewerydb.com/v2/beer/<BEER_ID_HERE>/?key=/<YOUR_KEY_HERE>

In the browser, that looks like:

got_a_beer

Step 1: GET one Beer

base_url <- "http://api.brewerydb.com/v2"
key_preface <- "/?key="

get_beer <- function(id) {
  fromJSON(paste0(base_url, "/beer/", id, "/", key_preface, key))
}

We can make a little function from this that'll take a single ID and use fromJSON to convert from, well, JSON into a nested list starting with our 200 success message.

get_beer("GZQpRX")
$message
[1] "READ ONLY MODE: Request Successful"

$data
$data$id
[1] "GZQpRX"

$data$name
[1] "Alpha King"

$data$nameDisplay
[1] "Alpha King"

$data$description
[1] "A bold yet balanced American Pale Ale with slight caramel sweetness and aggressive citrus hoppiness. This is our flagship beer."

$data$abv
[1] "6.7"

$data$ibu
[1] "68"

$data$glasswareId
[1] 5

$data$srmId
[1] 21

$data$availableId
[1] 1

$data$styleId
[1] 25

$data$isOrganic
[1] "N"

$data$labels
$data$labels$icon
[1] "https://s3.amazonaws.com/brewerydbapi/beer/GZQpRX/upload_t1NkG7-icon.png"

$data$labels$medium
[1] "https://s3.amazonaws.com/brewerydbapi/beer/GZQpRX/upload_t1NkG7-medium.png"

$data$labels$large
[1] "https://s3.amazonaws.com/brewerydbapi/beer/GZQpRX/upload_t1NkG7-large.png"


$data$status
[1] "verified"

$data$statusDisplay
[1] "Verified"

$data$createDate
[1] "2012-01-03 02:42:40"

$data$updateDate
[1] "2017-01-10 21:43:48"

$data$glass
$data$glass$id
[1] 5

$data$glass$name
[1] "Pint"

$data$glass$createDate
[1] "2012-01-03 02:41:33"


$data$srm
$data$srm$id
[1] 21

$data$srm$name
[1] "21"

$data$srm$hex
[1] "952D00"


$data$available
$data$available$id
[1] 1

$data$available$name
[1] "Year Round"

$data$available$description
[1] "Available year round as a staple beer."


$data$style
$data$style$id
[1] 25

$data$style$categoryId
[1] 3

$data$style$category
$data$style$category$id
[1] 3

$data$style$category$name
[1] "North American Origin Ales"

$data$style$category$createDate
[1] "2012-03-21 20:06:45"


$data$style$name
[1] "American-Style Pale Ale"

$data$style$shortName
[1] "American Pale"

$data$style$description
[1] "American pale ales range from deep golden to copper in color. The style is characterized by fruity, floral and citrus-like American-variety hop character producing medium to medium-high hop bitterness, flavor, and aroma. Note that the \"traditional\" style of this beer has its origins with certain floral, fruity, citrus-like, piney, resinous, or sulfur-like American hop varietals. One or more of these hop characters is the perceived end, but the perceived hop characters may be a result of the skillful use of hops of other national origins. American pale ales have medium body and low to medium maltiness. Low caramel character is allowable. Fruity-ester flavor and aroma should be moderate to strong. Diacetyl should be absent or present at very low levels. Chill haze is allowable at cold temperatures."

$data$style$ibuMin
[1] "30"

$data$style$ibuMax
[1] "42"

$data$style$abvMin
[1] "4.5"

$data$style$abvMax
[1] "5.6"

$data$style$srmMin
[1] "6"

$data$style$srmMax
[1] "14"

$data$style$ogMin
[1] "1.044"

$data$style$fgMin
[1] "1.008"

$data$style$fgMax
[1] "1.014"

$data$style$createDate
[1] "2012-03-21 20:06:45"

$data$style$updateDate
[1] "2015-04-07 15:25:18"



$status
[1] "success"

Quick funciton factory

Using purrr::walk() and assign() we can create functions to GET any beer, brewery, category, etc. if we know its ID.

endpoints <- c("beer", "brewery", "category", "event", "feature", "glass", "guild", "hop", "ingredient", "location", "socialsite", "style", "menu")

# Base function
get_ <- function(id, ep) {
  jsonlite::fromJSON(paste0(base_url, "/", ep, "/", id, "/", key_preface, key))
}

# Create new get_<ep> functions
endpoints %>% walk(~ assign(x = paste0("get_", .x),
                             value = partial(get_, ep = .x),
                             envir = .GlobalEnv))

Now we have the functions get_beer(), get_brewery(), get_category(), etc. in our global environment.

Testing testing

get_hop("3")
$message
[1] "READ ONLY MODE: Request Successful"

$data
$data$id
[1] 3

$data$name
[1] "Ahtanum"

$data$description
[1] "An open-pollinated aroma variety developed in Washington, Ahtanum is used for its distinctive, somewhat Cascade-like aroma and for moderate bittering."

$data$countryOfOrigin
[1] "US"

$data$alphaAcidMin
[1] 5.7

$data$betaAcidMin
[1] 5

$data$betaAcidMax
[1] 6.5

$data$humuleneMin
[1] 16

$data$humuleneMax
[1] 20

$data$caryophylleneMin
[1] 9

$data$caryophylleneMax
[1] 12

$data$cohumuloneMin
[1] 30

$data$cohumuloneMax
[1] 35

$data$myrceneMin
[1] 50

$data$myrceneMax
[1] 55

$data$farneseneMax
[1] 1

$data$category
[1] "hop"

$data$categoryDisplay
[1] "Hops"

$data$createDate
[1] "2013-06-24 16:07:26"

$data$updateDate
[1] "2013-06-24 16:10:37"

$data$country
$data$country$isoCode
[1] "US"

$data$country$name
[1] "UNITED STATES"

$data$country$displayName
[1] "United States"

$data$country$isoThree
[1] "USA"

$data$country$numberCode
[1] 840

$data$country$createDate
[1] "2012-01-03 02:41:33"



$status
[1] "success"

Digging In

What are the parts of the data that we want and where do they live?

get_beer("GZQpRX")
$message
[1] "READ ONLY MODE: Request Successful"

$data
$data$id
[1] "GZQpRX"

$data$name
[1] "Alpha King"

$data$nameDisplay
[1] "Alpha King"

$data$description
[1] "A bold yet balanced American Pale Ale with slight caramel sweetness and aggressive citrus hoppiness. This is our flagship beer."

$data$abv
[1] "6.7"

$data$ibu
[1] "68"

$data$glasswareId
[1] 5

$data$srmId
[1] 21

$data$availableId
[1] 1

$data$styleId
[1] 25

$data$isOrganic
[1] "N"

$data$labels
$data$labels$icon
[1] "https://s3.amazonaws.com/brewerydbapi/beer/GZQpRX/upload_t1NkG7-icon.png"

$data$labels$medium
[1] "https://s3.amazonaws.com/brewerydbapi/beer/GZQpRX/upload_t1NkG7-medium.png"

$data$labels$large
[1] "https://s3.amazonaws.com/brewerydbapi/beer/GZQpRX/upload_t1NkG7-large.png"


$data$status
[1] "verified"

$data$statusDisplay
[1] "Verified"

$data$createDate
[1] "2012-01-03 02:42:40"

$data$updateDate
[1] "2017-01-10 21:43:48"

$data$glass
$data$glass$id
[1] 5

$data$glass$name
[1] "Pint"

$data$glass$createDate
[1] "2012-01-03 02:41:33"


$data$srm
$data$srm$id
[1] 21

$data$srm$name
[1] "21"

$data$srm$hex
[1] "952D00"


$data$available
$data$available$id
[1] 1

$data$available$name
[1] "Year Round"

$data$available$description
[1] "Available year round as a staple beer."


$data$style
$data$style$id
[1] 25

$data$style$categoryId
[1] 3

$data$style$category
$data$style$category$id
[1] 3

$data$style$category$name
[1] "North American Origin Ales"

$data$style$category$createDate
[1] "2012-03-21 20:06:45"


$data$style$name
[1] "American-Style Pale Ale"

$data$style$shortName
[1] "American Pale"

$data$style$description
[1] "American pale ales range from deep golden to copper in color. The style is characterized by fruity, floral and citrus-like American-variety hop character producing medium to medium-high hop bitterness, flavor, and aroma. Note that the \"traditional\" style of this beer has its origins with certain floral, fruity, citrus-like, piney, resinous, or sulfur-like American hop varietals. One or more of these hop characters is the perceived end, but the perceived hop characters may be a result of the skillful use of hops of other national origins. American pale ales have medium body and low to medium maltiness. Low caramel character is allowable. Fruity-ester flavor and aroma should be moderate to strong. Diacetyl should be absent or present at very low levels. Chill haze is allowable at cold temperatures."

$data$style$ibuMin
[1] "30"

$data$style$ibuMax
[1] "42"

$data$style$abvMin
[1] "4.5"

$data$style$abvMax
[1] "5.6"

$data$style$srmMin
[1] "6"

$data$style$srmMax
[1] "14"

$data$style$ogMin
[1] "1.044"

$data$style$fgMin
[1] "1.008"

$data$style$fgMax
[1] "1.014"

$data$style$createDate
[1] "2012-03-21 20:06:45"

$data$style$updateDate
[1] "2015-04-07 15:25:18"



$status
[1] "success"

We can grab just the fourth element of the data part of the response, the beer's description in a couple ways:

get_beer("GZQpRX")$data$description
[1] "A bold yet balanced American Pale Ale with slight caramel sweetness and aggressive citrus hoppiness. This is our flagship beer."

Or in other words:

get_beer("GZQpRX")[["data"]][4][[1]]
[1] "A bold yet balanced American Pale Ale with slight caramel sweetness and aggressive citrus hoppiness. This is our flagship beer."

Digging In

Other things are more deeply nested.

get_beer("GZQpRX")$data$style
$id
[1] 25

$categoryId
[1] 3

$category
$category$id
[1] 3

$category$name
[1] "North American Origin Ales"

$category$createDate
[1] "2012-03-21 20:06:45"


$name
[1] "American-Style Pale Ale"

$shortName
[1] "American Pale"

$description
[1] "American pale ales range from deep golden to copper in color. The style is characterized by fruity, floral and citrus-like American-variety hop character producing medium to medium-high hop bitterness, flavor, and aroma. Note that the \"traditional\" style of this beer has its origins with certain floral, fruity, citrus-like, piney, resinous, or sulfur-like American hop varietals. One or more of these hop characters is the perceived end, but the perceived hop characters may be a result of the skillful use of hops of other national origins. American pale ales have medium body and low to medium maltiness. Low caramel character is allowable. Fruity-ester flavor and aroma should be moderate to strong. Diacetyl should be absent or present at very low levels. Chill haze is allowable at cold temperatures."

$ibuMin
[1] "30"

$ibuMax
[1] "42"

$abvMin
[1] "4.5"

$abvMax
[1] "5.6"

$srmMin
[1] "6"

$srmMax
[1] "14"

$ogMin
[1] "1.044"

$fgMin
[1] "1.008"

$fgMax
[1] "1.014"

$createDate
[1] "2012-03-21 20:06:45"

$updateDate
[1] "2015-04-07 15:25:18"

In these cases, we really only care about the name portion.

get_beer("GZQpRX")$data$style$name
[1] "American-Style Pale Ale"

Unnesting

So, we'll unravel the data part of the response, grab whatever we want there, and glue it together into a dataframe.

If the particular list item we're unnesting has a name portion (like $style$name), great, we'll grab that for the column. Otherwise, we'll take whatever's first.

unnest_it <- function(lst) {
  unnested <- lst
  for(col in seq_along(lst[["data"]])) {
    if(! is.null(ncol(lst[["data"]][[col]]))) {
      if(! is.null(lst[["data"]][[col]][["name"]])) {
        unnested[["data"]][[col]] <- lst[["data"]][[col]][["name"]]
      } else {
        unnested[["data"]][[col]] <- lst[["data"]][[col]][[1]]
      }
    }
  }
  return(unnested)
}

This approach is a bit slow because it uses a a for loop 😔. Optimizing it is on the docket.

Grabbing all Beers

Instead of /beer/<BEER_ID> we're now asking for /beers

grab_all_beers

Paginating the Request

We find out how many pages there are total and then send requests and unnest each until we hit number_of_pages.

paginated_request <- function(ep, addition, trace_progress = TRUE) {    
  full_request <- NULL
  first_page <- fromJSON(paste0(base_url, "/", ep, "/", key_preface, key
                                , "&p=1"))
  number_of_pages <- ifelse(!(is.null(first_page$numberOfPages)), 
                            first_page$numberOfPages, 1)      

    for (page in 1:number_of_pages) {                               
    this_request <- fromJSON(paste0(base_url, "/", ep, "/", key_preface, key
                                    , "&p=", page, addition),
                             flatten = TRUE) 
    this_req_unnested <- unnest_it(this_request)    #  <- request unnested here

    if(trace_progress == TRUE) {message(paste0("Page ", this_req_unnested$currentPage))} # if TRUE, print the page we're on

    full_request <- bind_rows(full_request, this_req_unnested[["data"]])
  }
  return(full_request)
} 

Quick note on Ingredients

A few apporaches I used:

  • Concatenated into a single string during the unnesting process
    • hop_name and malt_name using this function 👉
  • Split out into one hop per column and one malt per column
    • hops_name_1, hops_name_2, etc.
  • Sparse dataframe with each type of hop (like Cascade, Citra, etc.) as its own column
    • Value is either 1 or 0

(Another would have been a nested list-col)


unnest_ingredients <- function(df) {
  df$hops_name <- NA
  df$malt_name <- NA

  for (row in 1:nrow(df)) {
    if (!is.null(df[["ingredients.hops"]][[row]][["name"]]) |
        !is.null(df[["ingredients.malt"]][[row]][["name"]])) {

      df[["hops_name"]][[row]] <- paste(df[["ingredients.hops"]][[row]][["name"]], collapse = ", ")

      df[["malt_name"]][[row]] <- paste(df[["ingredients.malt"]][[row]][["name"]], collapse = ", ")
    }
  }
  return(df)
}

What have we got?

beer_necessities <- paginated_request(ep = "beers", addition = "&withIngredients=Y")
id name style style_collapsed glass abv ibu srm hops_name malt_name
X4KcGF (512) TWO Imperial or Double India Pale Ale Double India Pale Ale Pint 9.0 99 9 Columbus, Glacier, Horizon, Nugget, Simcoe Caramel/Crystal Malt, Two-Row Pale Malt - Organic, Wheat Malt
USaRyl (512) Whiskey Barrel Double Pecan Porter Wood- and Barrel-Aged Strong Beer Barrel-Aged Pint 9.5 30 NA Glacier Black Malt, Caramel/Crystal Malt, Chocolate Malt, Two-Row Pale Malt - Organic
bXwskR (512) White IPA American-Style India Pale Ale India Pale Ale Pint 5.3 55 4 NA NA
XnPVIo (512) Wild Bear Specialty Beer Specialty Beer Tulip 8.5 9 NA NA NA
QLp4mV (512) Wit Belgian-Style White (or Wit) / Belgian-Style Wheat Wheat Pint 5.1 10 5 Golding (American) Oats - Malted, Two-Row Pale Malt - Organic, Wheat Malt - White
tWuIyV (714): Blond Ale Golden or Blonde Ale Blonde NA 4.8 NA NA NA NA

What have we got?


  • 63,495 distinct beers
  • Info about the beer:
    • ABV: alcohol by volume
    • IBU: International Biterness Units (really)
    • SRM: a measure of color
    • Ingredients
      • Hops, malts

plot of chunk unnamed-chunk-13

Where to put it?


local_db


I threw it into MySQL

  • This allows us to
    • Easily update the data if anything changes
    • Others easy access to the data if they want to build an app using it


Step 1 Complete ✔️

A look at our outcome variable

We have 170 styles of beer.

levels(beer_necessities$style)
  [1] "Adambier"                                                  
  [2] "Aged Beer (Ale or Lager)"                                  
  [3] "American-Style Amber (Low Calorie) Lager"                  
  [4] "American-Style Amber Lager"                                
  [5] "American-Style Amber/Red Ale"                              
  [6] "American-Style Barley Wine Ale"                            
  [7] "American-Style Black Ale"                                  
  [8] "American-Style Brown Ale"                                  
  [9] "American-Style Cream Ale or Lager"                         
 [10] "American-Style Dark Lager"                                 
 [11] "American-Style Ice Lager"                                  
 [12] "American-Style Imperial Porter"                            
 [13] "American-Style Imperial Stout"                             
 [14] "American-Style India Pale Ale"                             
 [15] "American-Style Lager"                                      
 [16] "American-Style Light (Low Calorie) Lager"                  
 [17] "American-Style Low-Carbohydrate Light Lager"               
 [18] "American-Style Malt Liquor"                                
 [19] "American-Style Märzen / Oktoberfest"                       
 [20] "American-Style Pale Ale"                                   
 [21] "American-Style Pilsener"                                   
 [22] "American-Style Premium Lager"                              
 [23] "American-Style Sour Ale"                                   
 [24] "American-Style Stout"                                      
 [25] "American-Style Strong Pale Ale"                            
 [26] "American-Style Wheat Wine Ale"                             
 [27] "Apple Wine"                                                
 [28] "Australasian, Latin American or Tropical-Style Light Lager"
 [29] "Australian-Style Pale Ale"                                 
 [30] "Baltic-Style Porter"                                       
 [31] "Bamberg-Style Bock Rauchbier"                              
 [32] "Bamberg-Style Helles Rauchbier"                            
 [33] "Bamberg-Style Märzen Rauchbier"                            
 [34] "Bamberg-Style Weiss (Smoke) Rauchbier (Dunkel or Helles)"  
 [35] "Belgian-Style Blonde Ale"                                  
 [36] "Belgian-Style Dark Strong Ale"                             
 [37] "Belgian-Style Dubbel"                                      
 [38] "Belgian-Style Flanders Oud Bruin or Oud Red Ales"          
 [39] "Belgian-style Fruit Beer"                                  
 [40] "Belgian-Style Fruit Lambic"                                
 [41] "Belgian-Style Gueuze Lambic"                               
 [42] "Belgian-Style Lambic"                                      
 [43] "Belgian-Style Pale Ale"                                    
 [44] "Belgian-Style Pale Strong Ale"                             
 [45] "Belgian-Style Quadrupel"                                   
 [46] "Belgian-Style Table Beer"                                  
 [47] "Belgian-Style Tripel"                                      
 [48] "Belgian-Style White (or Wit) / Belgian-Style Wheat"        
 [49] "Berliner-Style Weisse (Wheat)"                             
 [50] "Bohemian-Style Pilsener"                                   
 [51] "Braggot"                                                   
 [52] "Brett Beer"                                                
 [53] "British-Style Barley Wine Ale"                             
 [54] "British-Style Imperial Stout"                              
 [55] "Brown Porter"                                              
 [56] "California Common Beer"                                    
 [57] "Chili Pepper Beer"                                         
 [58] "Chocolate / Cocoa-Flavored Beer"                           
 [59] "Classic English-Style Pale Ale"                            
 [60] "Classic Irish-Style Dry Stout"                             
 [61] "Coffee-Flavored Beer"                                      
 [62] "Common Cider"                                              
 [63] "Common Perry"                                              
 [64] "Contemporary Gose"                                         
 [65] "Cyser (Apple Melomel)"                                     
 [66] "Dark American Wheat Ale or Lager with Yeast"               
 [67] "Dark American Wheat Ale or Lager without Yeast"            
 [68] "Dark American-Belgo-Style Ale"                             
 [69] "Dortmunder / European-Style Export"                        
 [70] "Double Red Ale"                                            
 [71] "Dry Lager"                                                 
 [72] "Dry Mead"                                                  
 [73] "Dutch-Style Kuit, Kuyt or Koyt"                            
 [74] "Energy Enhanced Malt Beverage"                             
 [75] "English Cider"                                             
 [76] "English-Style Brown Ale"                                   
 [77] "English-Style Dark Mild Ale"                               
 [78] "English-Style India Pale Ale"                              
 [79] "English-Style Pale Mild Ale"                               
 [80] "English-Style Summer Ale"                                  
 [81] "European-Style Dark / Münchner Dunkel"                     
 [82] "Experimental Beer (Lager or Ale)"                          
 [83] "Extra Special Bitter"                                      
 [84] "Field Beer"                                                
 [85] "Flavored Malt Beverage"                                    
 [86] "Foreign (Export)-Style Stout"                              
 [87] "French & Belgian-Style Saison"                             
 [88] "French Cider"                                              
 [89] "French-Style Bière de Garde"                               
 [90] "Fresh \"Wet\" Hop Ale"                                     
 [91] "Fruit Beer"                                                
 [92] "Fruit Cider"                                               
 [93] "Fruit Wheat Ale or Lager with or without Yeast"            
 [94] "German-Style Altbier"                                      
 [95] "German-Style Doppelbock"                                   
 [96] "German-Style Eisbock"                                      
 [97] "German-Style Heller Bock/Maibock"                          
 [98] "German-Style Kölsch / Köln-Style Kölsch"                   
 [99] "German-Style Leichtbier"                                   
[100] "German-Style Leichtes Weizen / Weissbier"                  
[101] "German-Style Märzen"                                       
[102] "German-Style Oktoberfest / Wiesen (Meadow)"                
[103] "German-Style Pilsener"                                     
[104] "German-Style Rye Ale (Roggenbier) with or without Yeast"   
[105] "German-Style Schwarzbier"                                  
[106] "Ginjo Beer or Sake-Yeast Beer"                             
[107] "Gluten-Free Beer"                                          
[108] "Golden or Blonde Ale"                                      
[109] "Grodziskie"                                                
[110] "Herb and Spice Beer"                                       
[111] "Historical Beer"                                           
[112] "Imperial or Double India Pale Ale"                         
[113] "Imperial Red Ale"                                          
[114] "Indigenous Beer (Lager or Ale)"                            
[115] "International-Style Pale Ale"                              
[116] "International-Style Pilsener"                              
[117] "Irish-Style Red Ale"                                       
[118] "Kellerbier (Cellar beer) or Zwickelbier - Ale"             
[119] "Kellerbier (Cellar beer) or Zwickelbier - Lager"           
[120] "Leipzig-Style Gose"                                        
[121] "Light American Wheat Ale or Lager with Yeast"              
[122] "Light American Wheat Ale or Lager without Yeast"           
[123] "Metheglin"                                                 
[124] "Mixed Culture Brett Beer"                                  
[125] "Münchner (Munich)-Style Helles"                            
[126] "New England Cider"                                         
[127] "Non-Alcoholic (Beer) Malt Beverages"                       
[128] "Oatmeal Stout"                                             
[129] "Old Ale"                                                   
[130] "Open Category Mead"                                        
[131] "Ordinary Bitter"                                           
[132] "Other Belgian-Style Ales"                                  
[133] "Other Fruit Melomel"                                       
[134] "Other Specialty Cider or Perry"                            
[135] "Other Strong Ale or Lager"                                 
[136] "Pale American-Belgo-Style Ale"                             
[137] "Pumpkin Beer"                                              
[138] "Pyment (Grape Melomel)"                                    
[139] "Robust Porter"                                             
[140] "Rye Ale or Lager with or without Yeast"                    
[141] "Scotch Ale"                                                
[142] "Scottish-Style Export Ale"                                 
[143] "Scottish-Style Heavy Ale"                                  
[144] "Scottish-Style Light Ale"                                  
[145] "Semi-Sweet Mead"                                           
[146] "Session Beer"                                              
[147] "Session India Pale Ale"                                    
[148] "Smoke Beer (Lager or Ale)"                                 
[149] "Smoke Porter"                                              
[150] "South German-Style Bernsteinfarbenes Weizen / Weissbier"   
[151] "South German-Style Dunkel Weizen / Dunkel Weissbier"       
[152] "South German-Style Hefeweizen / Hefeweissbier"             
[153] "South German-Style Kristall Weizen / Kristall Weissbier"   
[154] "South German-Style Weizenbock / Weissbock"                 
[155] "Special Bitter or Best Bitter"                             
[156] "Specialty Beer"                                            
[157] "Specialty Honey Lager or Ale"                              
[158] "Specialty Stouts"                                          
[159] "Strong Ale"                                                
[160] "Sweet Mead"                                                
[161] "Sweet or Cream Stout"                                      
[162] "Traditional German-Style Bock"                             
[163] "Traditional Perry"                                         
[164] "Vienna-Style Lager"                                        
[165] "Wild Beer"                                                 
[166] "Wood- and Barrel-Aged Beer"                                
[167] "Wood- and Barrel-Aged Dark Beer"                           
[168] "Wood- and Barrel-Aged Pale to Amber Beer"                  
[169] "Wood- and Barrel-Aged Sour Beer"                           
[170] "Wood- and Barrel-Aged Strong Beer"                         


What's the best way to condense these?

I want to e.g., lump American-Style India Pale Ale in with English-Style India Pale Ale.

Step 2: Breathe sigh of relief, Collapse

keywords <- c("Pale Ale", "India Pale Ale", "Double India Pale Ale", "Lager", "India Pale Lager", "Hefeweizen", "Barrel-Aged","Wheat", "Pilsner", "Pilsener", "Amber", "Golden", "Blonde", "Brown", "Black", "Stout", "Imperial Stout", "Fruit", "Porter", "Red", "Sour", "Kölsch", "Tripel", "Bitter", "Saison", "Strong Ale", "Barley Wine", "Dubbel")
collapse_styles <- function(df, trace_progress = TRUE) {

  df[["style_collapsed"]] <- vector(length = nrow(df))

  for (beer in 1:nrow(df)) {
    if (grepl(paste(keywords, collapse="|"), df$style[beer])) {    
      for (keyword in keywords) {         
        if(grepl(keyword, df$style[beer]) == TRUE) {
          df$style_collapsed[beer] <- keyword    
        }                         
      } 
    } else {
      df$style_collapsed[beer] <- as.character(df$style[beer])       
    }
    if(trace_progress == TRUE) {message(paste0("Collapsing this ", df$style[beer], " to: ", df$style_collapsed[beer]))}
  }
  return(df)
}

Collapsing in Action

Setting trace_progress = TRUE:

get_beers

  • keywords are ordered from most general to most specific
    • If a beer's name matches multiple keywords, its style_collapsed is the last of those that appear in keywords

American-Style Pale AlePale Ale

American-Style India Pale AleIndia Pale Ale

Popular Styles

Let's see where most of the data is concentrated.

  • Popular defined as
    • Styles with above the mean number of beers per style (z-score > 0)
      • Of course, this isn't a measure of popular consumption; just a reflection of the number of different beers we get from BreweryDB that are classified into that style


  • And then get a sense of where those styles fall in relation to one another
    • Style “centers” = mean ABV, IBU, and SRM of each style

Popular Styles

Collapsed Style Mean ABV Mean IBU Mean SRM Numer of Beers
India Pale Ale 6.58 66.04 9.99 6524
Pale Ale 5.70 40.87 8.89 4280
Stout 7.99 43.90 36.30 4238
Wheat 5.16 17.47 5.86 3349
Double India Pale Ale 8.93 93.48 11.01 2525
Red 5.74 33.81 16.18 2521
Lager 5.45 30.64 8.46 2230
Saison 6.40 27.25 7.05 2167
Blonde 5.60 22.39 5.62 2044
Porter 6.18 33.25 32.20 1973
Brown 6.16 32.22 23.59 1462
Pilsener 5.23 33.51 4.41 1268
Specialty Beer 6.45 33.78 15.52 1044
Bitter 5.32 38.28 12.46 939
Fruit Beer 5.20 19.24 8.67 905
Herb and Spice Beer 6.62 27.77 18.17 872
Sour 6.22 18.89 10.04 797
Strong Ale 8.83 36.74 22.55 767
Tripel 9.03 32.52 7.68 734
Black 6.96 65.51 31.08 622
Barley Wine 10.78 74.05 19.56 605
Kölsch 4.98 23.37 4.37 593
Barrel-Aged 9.00 39.16 18.13 540
Other Belgian-Style Ales 7.52 37.56 17.55 506
Pumpkin Beer 6.71 23.48 17.92 458
Dubbel 7.51 25.05 22.94 399
Scotch Ale 7.62 26.37 24.22 393
German-Style Doppelbock 8.05 28.89 25.70 376
Fruit Cider 6.21 25.60 12.00 370
German-Style Märzen 5.75 25.64 14.32 370

Style Centers

If beer styles are really self-contained, beer in each style should cluster tightly around these points.

plot of chunk unnamed-chunk-16

To the main question

Do styles truly define distinct pockets of beer?

If they do, we could expect styles to align with clusters generated using an unsupervised learning algorithm.

  • k-means
    • Splits observations into k clusters (we choose k)
    • Try to minimize the sum of squares between each datapoint and its assigned cluster center

Note that we'll have to throw away rows that contain NA values in any of the variables we're using.

What variables should we include?

What's in a Predictor?

A heuristic I used:

Outputs
  • Only measured after a beer been brewed
    • ABV, IBU, SRM
Style-Defined
  • Dependent entirely on style
    • Serving glass
Inputs
  • Only directly controlled by a brewer before a beer is brewed
    • Hops, malts



👍 predictor




👎 predictor


🤷🏼‍♀️ predictor because chicken and egg problem
Do brewers assign style first and then choose which ingredients to add, or vice versa? 🐣

Clustering: Let's have a function

do_cluster <- function (df, vars, to_cluster_on, n_centers = 5) {
  df_for_clustering <- df %>% select(!!vars) %>% na.omit()

  # Scale the ones to be scaled and append _scaled to their names
  df_vars_scale <- df_for_clustering %>% select(!!to_cluster_on) %>%
    scale() %>% as_tibble()
  names(df_vars_scale) <- names(df_vars_scale) %>% stringr::str_c("_scaled")

  # Do the clustering on the scaled data
  clusters_out <- kmeans(x = df_vars_scale, centers = n_centers, trace = FALSE)

  # Combine cluster assignment, scaled data, and unscaled rest of data
  clustered_df <- bind_cols(
    cluster_assignment = factor(clusters_out$cluster),   # Cluster assignment
    df_vars_scale,
    df_for_clustering
  )

  return(clustered_df)
}


We don't need to specify an outcome variable because this is unsupervised.

Clustering: Run It

We'll cluster the beers on the scaled versions of ABV, IBU, and SRM

to_include <- c("id", "name", "style", "style_collapsed", "abv", "ibu", "srm")

to_cluster_on <- c("abv", "ibu", "srm")


and stitch together the cluster assignments, scaled columns, and the original data.

clustered_beer <- do_cluster(beer_dat, to_include, to_cluster_on)

Clustering: Output

Cluster Assignment ABV Scaled IBU Scaled SRM Scaled ID Name Style Style Collapsed ABV IBU SRM
4 0.2428039 -0.6803269 2.1840186 tmEthz “Admiral” Stache Baltic-Style Porter Porter 7.0 23.0 37
4 -0.5689624 0.3608647 2.4657681 b7SfHG “Ah Me Joy” Porter Robust Porter Porter 5.4 51.0 40
1 -0.3660208 0.3608647 -0.5395600 PBEXhV “Bison Eye Rye” Pale Ale | 2 of 4 Part Pale Ale Series American-Style Pale Ale Pale Ale 5.8 51.0 8
1 -0.5689624 0.4724210 -0.2578105 AXmvOd “Dust Up” Cloudy Pale Ale | 1 of 4 Part Pale Ale Series American-Style Pale Ale Pale Ale 5.4 54.0 11
3 -0.4674916 -0.4869627 -0.8213095 Hr5A0t “God Country” Kolsch German-Style Kölsch / Köln-Style Kölsch Kölsch 5.6 28.2 5
3 -0.8226393 -0.7918831 -0.8213095 mrVjY4 “Jemez Field Notes” Golden Lager Golden or Blonde Ale Blonde 4.9 20.0 5
3 -0.7211685 -1.1265519 -0.9152260 xFM8w5 #10 Hefewiezen South German-Style Hefeweizen / Hefeweissbier Wheat 5.1 11.0 4
3 -0.7211685 -0.7918831 -0.4456435 hB0QeO #9 American-Style Pale Ale Pale Ale 5.1 20.0 9
3 -0.8733747 -0.5315852 -1.0091425 m8f62Y #KoLSCH German-Style Kölsch / Köln-Style Kölsch Kölsch 4.8 27.0 3
3 -0.6196977 -0.8662540 -0.8213095 35lHUq 'Inappropriate' Cream Ale American-Style Cream Ale or Lager Lager 5.3 18.0 5
1 0.2428039 -0.4200290 -0.6334765 qbRV90 'tis the Saison French & Belgian-Style Saison Saison 7.0 30.0 7
3 -0.7719039 -0.7918831 -0.4456435 qhaIVA (306) URBAN WHEAT BEER Belgian-Style White (or Wit) / Belgian-Style Wheat Wheat 5.0 20.0 9
4 0.5472162 -0.4200290 0.6813546 VwR7Xg (512) Bruin (A.K.A. Brown Bear) American-Style Brown Ale Brown 7.6 30.0 21
1 0.4964808 -0.2341019 -0.5395600 oJFZwK (512) FOUR Strong Ale Strong Ale 7.5 35.0 8
1 0.2428039 0.8814606 -0.5395600 ezGh5N (512) IPA American-Style India Pale Ale India Pale Ale 7.0 65.0 8
1 0.7501578 -0.7175123 -0.5395600 s8rdpK (512) ONE Belgian-Style Pale Strong Ale Strong Ale 8.0 22.0 8
3 -0.2645500 -0.4200290 -0.6334765 2fXsvw (512) Pale American-Style Pale Ale Pale Ale 6.0 30.0 7
4 0.4964808 -0.6059561 1.3387701 9O3QPg (512) SIX Belgian-Style Dubbel Dubbel 7.5 25.0 28
1 1.5111886 -0.7175123 -0.3517270 A78JSF (512) THREE Belgian-Style Tripel Tripel 9.5 22.0 10
5 1.5111886 -0.7175123 2.4657681 WKSYBT (512) THREE (Cabernet Barrel Aged) Belgian-Style Tripel Tripel 9.5 22.0 40
2 1.2575117 2.1457647 -0.4456435 X4KcGF (512) TWO Imperial or Double India Pale Ale Double India Pale Ale 9.0 99.0 9
1 -0.6196977 0.5096064 -0.9152260 bXwskR (512) White IPA American-Style India Pale Ale India Pale Ale 5.3 55.0 4
3 -0.7211685 -1.1637373 -0.8213095 QLp4mV (512) Wit Belgian-Style White (or Wit) / Belgian-Style Wheat Wheat 5.1 10.0 5
3 -1.0763162 -0.8290686 -0.9152260 thTbY7 (904) Weissguy South German-Style Hefeweizen / Hefeweissbier Wheat 4.4 19.0 4
3 -0.6704331 -0.2341019 -0.5395600 EPnv3B (916) American-Style Pale Ale Pale Ale 5.2 35.0 8
3 -0.1123438 -0.6431415 -0.0699775 QT9hB8 +1 Pumpkin Pumpkin Beer Pumpkin Beer 6.3 24.0 13
2 0.6486870 1.4392418 -0.8213095 btwcy1 077XX India Pale Ale Imperial or Double India Pale Ale Double India Pale Ale 7.8 80.0 5
1 1.2067763 0.1377522 -0.8213095 FWiYZi 08.08.08 Vertical Epic Ale Belgian-Style Pale Ale Pale Ale 8.9 45.0 5
1 -0.5182270 0.5467918 -0.4456435 M6vu9P 10 Blocks South American-Style Pale Ale Pale Ale 5.5 56.0 9
4 0.2428039 -0.4200290 1.9022691 y6LrZ3 10 Ton Oatmeal Stout Stout 7.0 30.0 34

Clustering: Plot

Color vs. bitterness with style centers overlaid:

plot of chunk cluster_srm_ibu


Here I've trimmed outliers using this function with a cutoff = 2.5.

trim_outliers <- function(df, cutoff = 1.96, exclude = NULL, keep_scaled = TRUE){

  to_scale <- names(df)[!names(df) %in% exclude]

  df_scaled <- df %>%
    select(!!to_scale) %>%
    transmute_if(
      is.numeric, scale
    )
  names(df_scaled) <- names(df_scaled) %>% stringr::str_c("_scaled")

  df_out <- bind_cols(df_scaled, df)

  df_out_trimmed <- df_out %>%
    filter_at(
      .vars = vars(contains("_scaled")),
      .vars_predicate = all_vars(. < abs(cutoff))
    )

  if(keep_scaled == FALSE) {
    df_out_trimmed <- df_out_trimmed %>% select(!!names(df))
  }

  return(df_out_trimmed)
}

(You can grab it from my dobtools package on GitHub if you like.)

Narrowing In


If we focus in on 5 distinct styles and cluster them into 5 clusters, will each style be siphoned off into its own cluster?

1 2 3 4 5
Blonde 131 17 4 1 7
India Pale Ale 43 1 468 11 54
Stout 6 10 3 180 5
Tripel 1 57 1 3 3
Wheat 289 9 6 5 12

plot of chunk cluster_certain_styles

Clusterfun with Shiny: UI

Snapshot of what we're including on the frontend:

sidebarLayout(
    sidebarPanel(
      h4("Control Panel"),

      checkboxInput("show_all", "Show all styles", TRUE),      
      checkboxInput("show_centers", "Show style centers", FALSE),
      numericInput("num_clusters", "Number of Clusters:", starting_n_clusters),

      checkboxGroupInput("cluster_on", "Choose variables to cluster on: ",
                         c("ABV (alcohol)" = "abv", 
                           "IBU (bitterness)" = "ibu", 
                           "SRM (color)" ="srm", 
                           "Total number of hops" = "total_hops", 
                           "Total number of malts" = "total_malt"),
                         selected = c("abv", "ibu", "srm")),

      checkboxGroupInput("response_vars", "Choose response variable(s): ",
                         c("Collapsed style" = "style_collapsed",
                           "Specific style" = "style"
                         ),
                         selected = c("style_collapsed")),

      conditionalPanel(
        condition = "input.show_all == false",
        selectInput("style_collapsed", "Collapsed Style:",
                    style_names)
        )
    ),

Clusterfun with Shiny: Server

And how the backend is serving up that data, conditional on what the user asks it to display.

  output$cluster_plot <- renderPlot({

    # If our checkbox is checked saying we do want style centers, show them. Else, don't.
    if (input$show_all == FALSE & input$show_centers == TRUE) {

      this_style_center <- reactive({style_centers %>% filter(style_collapsed == input$style_collapsed)})

      ggplot() +
        geom_point(data = this_style_data(),
                   aes(x = abv, y = ibu, colour = cluster_assignment), alpha = 0.5) +
        geom_point(data = this_style_center(),
                   aes(mean_abv, mean_ibu), colour = "black") +
        geom_text_repel(data = this_style_center(),
                        aes(mean_abv, mean_ibu, label = input$style_collapsed),
                        box.padding = unit(0.45, "lines"),
                        family = "Calibri") +
        ggtitle("k-Means Clustered Beer") +
        labs(x = "ABV", y = "IBU") +
        labs(colour = "Cluster Assignment") +
        theme_minimal() +
        theme(legend.position="none")
    } else if  # ....... etc., etc.

Clusterfun with Shiny

And now for something completely different

…a quick dive into hops

Hops

Hops

No, not those hops!

fresh_hops

These hops ☝️

Hops \häps\, n: 1. It's what makes beer bitter and flavorful.


One question: do more kinds of hops generally make a beer more bitter?

(This is different than the total quantity of hops used during brewing.)

Hops Munge

Let's munge a bit. We'll need to split out ingredients from one column into many.

split_ingredients <- function(df, ingredients_to_split) {
  ncol_df <- ncol(df)

  for (ingredient in ingredients_to_split) {
    ingredient_split <- str_split(df[[ingredient]], ", ")    
    num_new_cols <- max(lengths(ingredient_split))    

    for (num in 1:num_new_cols) {
      this_col <- ncol_df + 1         
      df[, this_col] <- NA
      names(df)[this_col] <- paste0(ingredient, "_", num)
      ncol_df <- ncol(df)             

      for (row in seq_along(ingredient_split)) {          
        if (!is.null(ingredient_split[[row]][num])) {        
          df[row, this_col] <- ingredient_split[[row]][num]
        }
      }
      df[[names(df)[this_col]]] <- factor(df[[names(df)[this_col]]])
    }
    ncol_df <- ncol(df)
  }
  return(df)
}
beer_necessities <- split_ingredients(beer_necessities, 
                              ingredients_to_split = c("hops", "malt"))

Hops Munge

# Gather up all the hops columns into one called `hop_name`
beer_necessities_hops_gathered <- beer_necessities %>%
  gather(
    hop_key, hop_name, hops_name_1:hops_name_13
  ) %>% as_tibble()

# Filter to just those beers that have at least one hop
beer_necessities_w_hops <- beer_necessities_hops_gathered %>% 
  filter(!is.na(hop_name)) %>% 
  filter(!hop_name == "")

beer_necessities_w_hops$hop_name <- factor(beer_necessities_w_hops$hop_name)

# For all hops, find the number of beers they're in as well as those beers' mean IBU and ABV
hops_beer_stats <- beer_necessities_w_hops %>% 
  ungroup() %>% 
  group_by(hop_name) %>% 
  summarise(
    mean_ibu = mean(ibu, na.rm = TRUE) %>% round(digits=2), 
    mean_abv = mean(abv, na.rm = TRUE) %>% round(digits=2),
    n = n()
  ) %>% 
  arrange(desc(n))

# Pare to hops that are used in at least 50 beers
pop_hops_beer_stats <- hops_beer_stats[hops_beer_stats$n > 50, ] 

pop_hops_display <- pop_hops_beer_stats %>% 
    rename(
    `Hop` = hop_name,
    `Mean IBU` = mean_ibu,
    `Mean ABV` = mean_abv,
    `N Beers` = n
  )

What's the hop landscape look like?

Hops contained in >50 beers:

Hop Mean IBU Mean ABV N Beers
Cascade 51.92 6.51 445
Centennial 63.97 7.08 243
Chinook 60.87 7.04 194
Simcoe 64.07 6.88 191
Columbus 63.74 6.95 183
Amarillo 61.36 6.96 163
Citra 59.60 6.73 157
Willamette 39.61 7.01 133
Nugget 52.24 6.38 114
Magnum 48.72 6.93 109
East Kent Golding 38.52 6.35 89


Hop Mean IBU Mean ABV N Beers
Perle (American) 32.04 6.25 88
Hallertauer (American) 23.92 5.66 83
Mosaic 56.82 6.98 71
Northern Brewer (American) 39.48 6.47 71
Mount Hood 37.84 6.55 68
Warrior 59.13 6.98 62
Saaz (American) 30.70 6.25 60
Fuggles 40.76 6.77 59
Tettnanger (American) 30.28 6.02 59

Hop Landscape


Per hop, number beers containing that hop and their average ABV and IBU.


Note that there is actually, irl, a strain of hops called Fuggles.


Q: Do more kinds of hops generally make a beer more bitter?


plot of chunk unnamed-chunk-23

How do hops affect bitterness?

plot of chunk unnamed-chunk-24

Is the relationship significant?

hops_ibu_lm <- lm(ibu ~ total_hops, 
  data = beer_dat %>% filter(total_hops > 0)) %>% 
  broom::tidy() %>% dobtools::style_lm() 
    Variable Estimate Std Error P Value
1 Total Hops    8.635     0.488       0

We can expect an increase in around 9 IBU for every 1 extra hop.

Okay back on track!

onward

Prediction

  • The other side of the coin: supervised learning classification problem
    • Random forest
    • Neural network

We'll go through the neural net.


A low prediction error rate should indicate that styles are well-defined, at least by the variables that we have.

Prediction: Neural Net

  • Package: nnet
  • Outcome variable: style or style_collapsed

What we'll do

  • Feed it a dataframe, an outcome variable, and a set of predictor variables
  • It will train it 80%, test on 20%
    • From this, we can get a measure of accuracy

Neural Net: the Function

run_neural_net <- function(df, multinom = TRUE, outcome, predictor_vars,
                           maxit = 500, size = 5, trace = FALSE, seed = NULL) {
  out <- list(outcome = outcome)

  # Create a new column outcome; it's style_collapsed if you set outcome to style_collapsed, and style otherwise
  if (outcome == "style_collapsed") {
    df[["outcome"]] <- df[["style_collapsed"]]
  } else {
    df[["outcome"]] <- df[["style"]]
  }

  cols_to_keep <- c("outcome", predictor_vars)

  df <- df %>%
    select_(.dots = cols_to_keep) %>%
    mutate(row = 1:nrow(df)) %>% 
    droplevels()

  # Select 80% of the data for training
  df_train <- sample_n(df, nrow(df)*(0.8)) %>% droplevels()

  # The rest is for testing
  df_test <- df %>%
    filter(! (row %in% df_train$row)) %>%
    select(-row)

  df_train <- df_train %>%
    select(-row)
  if(multinom==TRUE) {
    nn <- multinom(outcome ~ .,
                   data = df_train, maxit=maxit, trace=trace)

    # Which variables are the most important in the neural net?
    most_important_vars <- varImp(nn)

  } else if (multinom==FALSE) {
    nn <- nnet(outcome ~ ., size = size,
                   data = df_train, maxit=maxit, trace=trace)
    most_important_vars <- NULL

  }

  # How accurate is the model? Compare predictions to outcomes from test data
  nn_preds <- predict(nn, type="class", newdata = df_test) %>% factor()
  nn_accuracy <- postResample(df_test$outcome, nn_preds)

  out <- list(out, nn = nn, 
              most_important_vars = most_important_vars,
              df_test = df_test,
              nn_preds = nn_preds,
              nn_accuracy = nn_accuracy)

  return(out)
}

Neural Net: Run It

Our predictors will include the total number of hops and malts in the beer, plus our usual ABV, IBU, SRM.

p_vars <- c("total_hops", "total_malt", "abv", "ibu", "srm")

We'll use data filtered to just the most popular styles and run it.

nn_collapsed_out <- run_neural_net(df = beer_dat %>% drop_na(!!p_vars), outcome = "style_collapsed", predictor_vars = p_vars, size = 10, trace = TRUE, multinom = FALSE, seed = 9)
# weights:  379
initial  value 12671.710386 
iter  10 value 8852.454547
iter  20 value 8330.444290
iter  30 value 8022.367811
iter  40 value 7690.050305
iter  50 value 7426.537616
iter  60 value 7136.939031
iter  70 value 6941.070550
iter  80 value 6650.527339
iter  90 value 6484.024279
iter 100 value 6251.726635
iter 110 value 6151.316864
iter 120 value 6054.067797
iter 130 value 5928.756321
iter 140 value 5866.960200
iter 150 value 5830.132567
iter 160 value 5805.548164
iter 170 value 5792.047662
iter 180 value 5786.967344
iter 190 value 5783.937058
iter 200 value 5780.752235
iter 210 value 5771.709115
iter 220 value 5757.418148
iter 230 value 5745.412760
iter 240 value 5737.414828
iter 250 value 5733.175411
iter 260 value 5729.750039
iter 270 value 5727.505435
iter 280 value 5724.679021
iter 290 value 5722.687159
iter 300 value 5721.413677
iter 310 value 5720.333103
iter 320 value 5719.452310
iter 330 value 5718.813302
iter 340 value 5718.244388
iter 350 value 5717.786992
iter 360 value 5717.346364
iter 370 value 5716.925022
iter 380 value 5716.466009
iter 390 value 5716.072555
iter 400 value 5715.745053
iter 410 value 5715.259823
iter 420 value 5714.886930
iter 430 value 5714.408302
iter 440 value 5713.300458
iter 450 value 5712.727905
iter 460 value 5712.489987
final  value 5712.489501 
converged

Neural Net: Visual

We can check out the structure of the neural net using the NeuralNetTools package.

We can see we've got one hidden layer and two bias nodes.

par(mar=c(4.1, 0.1, 4.1, 7.1))
NeuralNetTools::plotnet(nn_collapsed_out$nn, line_stag = 0.01, max_sp = TRUE)

plot of chunk unnamed-chunk-28

Neural Net: Evaluate


How'd we do?

nn_collapsed_out$nn_accuracy[1]
 Accuracy 
0.5057803 


50.6% isn't terrible given we've got 30 collapsed styles; chance would be 3.3%.


Which variables are most important?

caret::varImp will give us an importance measure using combinations of the absolute values of the weights

Variable Importance Percent
total_hops 58.8%
total_malt 19.1%
abv 17.7%
ibu 2.5%
srm 1.8%

Neural Net: Glass

What happens if we add in glass, a style-dependent attribute, as a predictor?

p_vars_add_glass <- c("total_hops", "total_malt", "abv", "ibu", "srm", "glass")
nn_collapsed_out_add_glass <- run_neural_net(df = beer_dat %>% drop_na(!!p_vars_add_glass), outcome = "style_collapsed", predictor_vars = p_vars_add_glass, trace = FALSE, multinom = FALSE)
 Accuracy 
0.5436409 


That's 3.79 percentage points more accurate than the model without glass included.

Here summing up the contributions from all glasses.

Variable Importance Percent
glass 46.3%
total_hops 15.6%
total_malt 12.0%
abv 9.1%
ibu 8.6%
srm 8.5%

So what's the answer?

Are beer styles a useful construct to use as a proxy for natural clusters in beer?

I'd give it a fuzzy yes.

Fuzzy because:

  • We couldn't do better than ~50% accuracy
  • We had a lot of rows with missing predictors

Unknowns:

  • Was our style collapsing scheme successful in putting beers in the “right” buckets?
  • Would taste-related information have accounted for much of the remaining variance we couldn't explain?

So what's the answer?


plot of chunk unnamed-chunk-33


Plotting the landscape directly on our two main dimensions,

  • Syles are very overlapping, especially among lower-alcohol and -bitterness styles
  • But, no clear super-groups that are independent of style

Future Directions

In no particular order, some thoughts I've had plus suggestions from others:

  • More Shiny app features
    • Beer searchability, tooltips over each point on hover
  • Join on other data
    • e.g., Untappd or something scraped from the interwebs
  • Beer consumption: how is this trending over time, for each style?
    • What drives the trend? Supply or demand?
      • Do brewers brew more sours causing people buy more of them or do people start liking sours and cause brewers to brew more?
  • Hierarchical clustering to make a more scientific beer family tree
  • Some funky algorithm to generate new beer names

Cheers, all!

R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2         NeuralNetTools_1.5.0 emo_0.0.0.9000      
 [4] caret_6.0-76         lattice_0.20-35      nnet_7.3-12         
 [7] feather_0.3.1        forcats_0.2.0        dplyr_0.7.4.9000    
[10] purrr_0.2.4          readr_1.1.1          tidyr_0.7.2         
[13] tibble_1.3.4         tidyverse_1.1.1      dobtools_0.1.0      
[16] ggrepel_0.6.5        ggplot2_2.2.1        jsonlite_1.5        
[19] broom_0.4.2          knitr_1.17          

loaded via a namespace (and not attached):
 [1] nlme_3.1-131        pbkrtest_0.4-7      lubridate_1.6.0    
 [4] devtools_1.13.3     RColorBrewer_1.1-2  httr_1.3.1         
 [7] tools_3.3.3         backports_1.1.0     R6_2.2.2           
[10] rpart_4.1-11        Hmisc_4.0-3         lazyeval_0.2.0     
[13] mgcv_1.8-17         colorspace_1.3-2    withr_2.0.0        
[16] tidyselect_0.2.2    gridExtra_2.2.1     mnormt_1.5-5       
[19] curl_2.8.1          git2r_0.18.0        rvest_0.3.2        
[22] quantreg_5.29       htmlTable_1.9       SparseM_1.74       
[25] xml2_1.1.1          labeling_0.3        scales_0.5.0       
[28] checkmate_1.8.3     psych_1.7.5         stringr_1.2.0      
[31] digest_0.6.12       foreign_0.8-69      minqa_1.2.4        
[34] base64enc_0.1-3     pkgconfig_2.0.1     htmltools_0.3.6    
[37] lme4_1.1-13         highr_0.6           htmlwidgets_0.9    
[40] rlang_0.1.2.9000    readxl_1.0.0        shiny_1.0.5.9000   
[43] bindr_0.1           acepack_1.4.1       ModelMetrics_1.1.0 
[46] car_2.1-5           magrittr_1.5        Formula_1.2-2      
[49] Matrix_1.2-8        Rcpp_0.12.13        munsell_0.4.3      
[52] stringi_1.1.5       MASS_7.3-47         plyr_1.8.4         
[55] grid_3.3.3          parallel_3.3.3      crayon_1.3.4       
[58] miniUI_0.1.1        haven_1.1.0         splines_3.3.3      
[61] hms_0.3             ranger_0.8.0        reshape2_1.4.2     
[64] codetools_0.2-15    stats4_3.3.3        glue_1.1.1         
[67] evaluate_0.10.1     latticeExtra_0.6-28 data.table_1.10.4  
[70] modelr_0.1.1        nloptr_1.0.4        httpuv_1.3.5.9000  
[73] foreach_1.4.3       MatrixModels_0.4-1  cellranger_1.1.0   
[76] gtable_0.2.0        assertthat_0.2.0    mime_0.5           
[79] xtable_1.8-2        e1071_1.6-8         class_7.3-14       
[82] survival_2.41-3     iterators_1.0.8     memoise_1.1.0      
[85] cluster_2.0.5