Feature Columns

Overview

Feature columns are used to specify how Tensors received from the input function should be combined and transformed before entering the model. A feature column can be a plain mapping to some input column (e.g. column_numeric() for a column of numerical data), or a transformation of other feature columns (e.g. column_crossed() to define a new column as the cross of two other feature columns).

The following feature columns are available:

Feature Column                               Description
column_categorical_with_vocabulary_list()    Construct a Categorical Column with In-Memory Vocabulary.
column_categorical_with_vocabulary_file()    Construct a Categorical Column with a Vocabulary File.
column_categorical_with_identity()           Construct a Categorical Column that Returns Identity Values.
column_categorical_with_hash_bucket()        Represents Sparse Feature where IDs are set by Hashing.
column_categorical_weighted()                Construct a Weighted Categorical Column.
column_indicator()                           Represents Multi-Hot Representation of Given Categorical Column.
column_numeric()                             Construct a Real-Valued Column.
column_embedding()                           Construct a Dense Column.
column_crossed()                             Construct a Crossed Column.
column_bucketized()                          Construct a Bucketized Column.

Some typical mappings of R data types to feature columns are:

Data Type   Feature Column
Numeric     column_numeric()
Factor      column_categorical_with_identity()
Character   column_categorical_with_hash_bucket()
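
As a minimal sketch of these mappings, the block below declares one column of each kind. The variable names are borrowed from the nycflights13 flights data used in the rest of this section, and the bucket sizes are illustrative assumptions rather than recommended values. It also assumes that a factor-like variable has already been converted to integer codes, since an identity column expects integers in the range [0, num_buckets).

```r
library(tfestimators)

cols <- feature_columns(
  # Numeric data maps directly to a real-valued column.
  column_numeric("dep_delay"),

  # Integer-coded categories: month takes the values 1 to 12, so 13 buckets
  # are needed to cover the range [0, 13). (Assumed sizing, for illustration.)
  column_categorical_with_identity("month", num_buckets = 13L),

  # Character data: each string is hashed into one of 32 buckets.
  # (32 is an arbitrary illustrative size.)
  column_categorical_with_hash_bucket("carrier", hash_bucket_size = 32L)
)
```

A larger hash_bucket_size reduces the chance that two distinct strings collide in the same bucket, at the cost of a wider representation.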

We’ll use the flights dataset from the nycflights13 package to explore how feature columns can be constructed. The flights dataset records airline on-time data for all flights departing NYC in 2013.

library(nycflights13)
print(flights)
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin  dest air_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>   <chr>  <chr> <chr>    <dbl>
 1  2013     1     1      517            515         2      830            819        11      UA   1545  N14228    EWR   IAH      227
 2  2013     1     1      533            529         4      850            830        20      UA   1714  N24211    LGA   IAH      227
 3  2013     1     1      542            540         2      923            850        33      AA   1141  N619AA    JFK   MIA      160
# ... with 336,766 more rows, and 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

For example, we can define numeric columns based on the dep_time and dep_delay variables:

cols <- feature_columns(
  column_numeric("dep_time"),
  column_numeric("dep_delay")
)

You can also define multiple feature columns at once.

cols <- feature_columns(
  column_numeric("dep_time", "dep_delay")
)
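
Columns that transform other columns, such as the crossed column mentioned in the overview, can be declared in the same way. The following is a sketch only: the bucket boundaries, vocabulary and hash bucket size are illustrative assumptions and are not taken from the original text.

```r
library(tfestimators)

# Plain numeric column that the bucketized column is derived from.
dep_time <- column_numeric("dep_time")

# dep_time is stored as HHMM (e.g. 517), so these boundaries split the day
# into four ranges. The boundary values are illustrative.
dep_time_bucket <- column_bucketized(dep_time, boundaries = c(600, 1200, 1800))

# Categorical column with an in-memory vocabulary: the three NYC origin airports.
origin <- column_categorical_with_vocabulary_list(
  "origin",
  vocabulary_list = c("EWR", "JFK", "LGA")
)

# Cross of origin and destination, hashed into 1000 buckets (arbitrary size).
route <- column_crossed(c("origin", "dest"), hash_bucket_size = 1000L)

cols <- feature_columns(
  dep_time_bucket,
  column_indicator(origin),  # multi-hot representation of origin
  route
)
```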

Pattern Matching

Often, you will want to generate a number of feature column definitions based on a pattern in the variable names of your data set. tfestimators uses the tidyselect package to make it easy to define feature columns, similar to what you might be familiar with from the dplyr package. You can use the names = argument of the feature_columns() function to define a context from which variable names will be selected.

For example, we can use the ends_with() helper to assert that all columns ending with "time" are numeric columns as follows:

library(nycflights13)

cols <- feature_columns(names = flights,
  column_numeric(ends_with("time"))
)

The names parameter can either be a character vector with the names as-is, or any named R object.
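
For instance, a plain character vector can serve as the selection context. This is a sketch under the assumption that the vector below (a hypothetical subset of the flights variable names) lists the features your input function provides:

```r
library(tfestimators)

# Hypothetical subset of variable names, supplied as-is.
flight_names <- c(
  "dep_time", "sched_dep_time", "arr_time",
  "sched_arr_time", "air_time", "carrier"
)

# ends_with("time") is resolved against flight_names, so "carrier" is skipped.
cols <- feature_columns(names = flight_names,
  column_numeric(ends_with("time"))
)
```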

If the code you are using to compose columns is more complicated, or if you need to save references to columns for use in column embeddings, you can also establish a scope for a given set of column names using the with_columns() function:

cols <- with_columns(flights, {
  feature_columns(
    column_numeric(ends_with("time"))
  )
})
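
The case of saving a reference for an embedding might look roughly like the sketch below. The hash bucket size and embedding dimension are illustrative assumptions, not values from the original text.

```r
library(tfestimators)
library(nycflights13)

cols <- with_columns(flights, {
  # Keep a reference to the categorical column so it can be embedded below.
  # (32 buckets is an arbitrary illustrative size.)
  carrier <- column_categorical_with_hash_bucket("carrier", hash_bucket_size = 32L)

  feature_columns(
    column_numeric(ends_with("time")),
    # Dense representation of carrier; the dimension is an assumed value.
    column_embedding(carrier, dimension = 4L)
  )
})
```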

You can also use an alternate syntax of the form (pattern) ~ (column), which can add clarity when longer pattern rules are used, as it separates the matching rule from the column definition:

cols <- with_columns(flights, {
  feature_columns(
    ends_with("time") ~ column_numeric()
  )
})

Available pattern matching operators include:

Operator        Description
starts_with()   Starts with a prefix
ends_with()     Ends with a suffix
contains()      Contains a literal string
matches()       Matches a regular expression
one_of()        Included in character vector
everything()    All columns

See help("select_helpers", package = "tidyselect") for full information on the set of helpers made available by the tidyselect package.
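
To preview which variables a pattern would pick up before building columns, the matching can be approximated in base R. This is only a rough stand-in for the tidyselect helpers, not their implementation: the tidyselect versions ignore case by default, whereas the base R calls below are case-sensitive. The names are those of the flights data shown earlier.

```r
# Variable names of the flights data set, as printed above.
flight_names <- c(
  "year", "month", "day", "dep_time", "sched_dep_time", "dep_delay",
  "arr_time", "sched_arr_time", "arr_delay", "carrier", "flight", "tailnum",
  "origin", "dest", "air_time", "distance", "hour", "minute", "time_hour"
)

# Base R approximations of four of the helpers.
matched <- list(
  starts_with_sched = flight_names[startsWith(flight_names, "sched")],
  ends_with_time    = flight_names[endsWith(flight_names, "time")],
  contains_delay    = flight_names[grepl("delay", flight_names, fixed = TRUE)],
  matches_dep_arr   = flight_names[grepl("^(dep|arr)_", flight_names)]
)

print(matched)
```

Note that time_hour is not selected by the ends_with("time") pattern, because "time" is a prefix there rather than a suffix.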