This post was written by Willem Pienaar, Principal Engineer at Tecton and creator of Feast.

Feast is an open source feature store and a fast, convenient way to serve machine learning (ML) features for training and online inference. Feast lets you build point-in-time correct training datasets from feature data, deploy a production-grade feature serving stack to Amazon Web Services (AWS) in seconds, and track which features your models are using.

Why Feast?

Most ML teams today are well versed in shipping machine learning models into production, but deploying models into production is only a small part of the MLOps lifecycle. Most teams don’t have a declarative way to ship data into production for consumption by machine learning models. That’s where Feast helps.

  • Tracking and sharing features: Feast allows teams to define and track feature metadata (such as data sources, entities, and features) through declarative definitions that are version controlled in Git. This allows teams to maintain a versioned history of operationalized features, helping teams understand how features are performing in production, and enabling reuse and sharing of features across teams.
  • Managed serving infrastructure: Feast takes the work out of setting up data infrastructure. It configures the stores used for serving features, makes populating those stores with feature values easy, and provides an SDK for reading feature values from them at low latency.
  • A consistent view of data: Machine learning models need to see the same view of features in training as they will see in production. Feast ensures this consistency through time-travel-based training dataset generation and through a unified serving interface, so that your online models see the same view of features during inference as they did during training.

Feast on AWS

With the latest release of Feast, you can take advantage of AWS storage services to run an open source feature store:

  1. Amazon Redshift and Amazon Simple Storage Service (Amazon S3) can be used as an offline store, which supports feature serving for training and batch inference of large amounts of feature data.
  2. Amazon DynamoDB, a NoSQL key-value database, can be used as an online store. Amazon DynamoDB supports feature serving at low latency for real-time prediction.

Figure: Overview of the workflow described in this article

Use case: Real-time credit scoring

When individuals apply for loans from banks and other credit providers, the decision to approve a loan application is made through a statistical model. Often, this model uses information about a customer to determine the likelihood that they will repay or default on a loan. This process is called credit scoring.

For this use case, we will demonstrate how a real-time credit scoring system can be built using Feast and scikit-learn.

This real-time system is required to accept a loan request from a customer and respond within 100 ms with a decision on whether their loan has been approved or rejected.

A fully working demo repository for this use case is available on GitHub.

Data model

We have three datasets at our disposal to build this credit scoring system.

The first is a loan dataset. This dataset has features based on historic loans for current customers. Importantly, this dataset contains the target column, loan_status. This column denotes whether a customer has defaulted on a loan.

Column | Description | Sample
loan_id | Unique id for the loan | 12208
dob_ssn | Date of birth joined to SSN | 19790429_9552
zipcode | Zip code of the customer | 30721
person_age | Age of customer | 24
person_income | Yearly income of the customer | 30000
person_home_ownership | Home ownership class for customer | RENT
person_emp_length | How long the customer has been employed (months) | 2.0
loan_intent | Reason for taking out loan | EDUCATION
loan_amnt | Loan amount | 3000
loan_int_rate | Loan interest rate | 5.2
loan_status | Status of loan | 0
event_timestamp | When the loan was issued or updated | 2021-07-28 17:09:19
created_timestamp | When this record was written to storage | 2021-07-28 17:09:19

The second dataset we will use is a zip code dataset. This dataset is used to enrich the loan dataset with supplementary features about a specific geographic location.

Column | Description | Sample
zipcode | Zip code to which features relate | 94546
city | City to which features relate | CASTRO VALLEY
state | State to which features relate | CA
tax_returns_filed | Number of tax returns filed in this zip code | 20616
population | Total population of this zip code | 35351
wages | Combined yearly earnings for all individuals in this zip code | 987939047
event_timestamp | When the zipcode features were collected | 2017-01-01 12:00:00
created_timestamp | When this record was written to storage | 2017-01-01 12:00:00

The third and final dataset is a credit history dataset. It contains per-person credit history and is updated frequently by the credit institution: every time a credit check is done on an individual, this dataset is updated.

Column | Description | Sample
dob_ssn | Date of birth joined to SSN | 19530219_5179
credit_card_due | How much this person owes on their credit cards | 0
mortgage_due | How much this person owes on their mortgages | 91803
student_loan_due | How much this person owes on their student loans | 0
vehicle_loan_due | How much this person owes on their vehicle loans | 0
hard_pulls | How many hard credit checks this person has had | 1
missed_payments_2y | How many missed payments this person has had in the last 2 years | 1
missed_payments_1y | How many missed payments this person has had in the last year | 0
missed_payments_6m | How many missed payments this person has had in the last 6 months | 0
bankruptcies | How many bankruptcies this person has had | 0
event_timestamp | When the credit check was executed | 2017-01-01 12:00:00
created_timestamp | When this record was written to storage | 2017-01-01 12:00:00

The preceding loan, zip code, and credit history features will be combined into a single training dataset when building a credit-scoring model. However, historic loan data is not useful for making predictions based on new customers. Therefore, we will register and serve only the zip code and credit history features with Feast, and we will assume that the incoming request contains the loan application features.

An example of the loan application payload is as follows:

loan_request = {
    "zipcode": [76104],
    "dob_ssn": ["19530219_5179"],  # quoted: the key is a string, not a number
    "person_age": [133],
    "person_income": [59000],
    "person_home_ownership": ["RENT"],
    "person_emp_length": [123.0],
    "loan_intent": ["PERSONAL"],
    "loan_amnt": [35000],
    "loan_int_rate": [16.02],
}

Amazon S3 and Redshift as a data source and offline store

A Redshift data source allows you to fetch historical feature values from Redshift for building training datasets and materializing features into an online store.

Install Feast using pip:

pip install feast

Initialize a blank feature repository:

feast init -m credit_scoring

This command will create a feature repository for your project. Let’s edit our feature store configuration using the provided feature_store.yaml:

project: credit_scoring_aws
registry: registry.db # where we will store our feature metadata
provider: aws # the environment we are deploying to
online_store:
  type: dynamodb # the online feature store
  region: us-west-2
offline_store:
  type: redshift # the offline feature store
  cluster_id:
  region: us-west-2
  user: admin
  database: dev
  s3_staging_location:
  iam_role:

A data source is defined as part of the Feast Declarative API in the feature repo directory’s Python files. Now that we’ve configured our infrastructure, let’s register the zip code and credit history features we will use during training and serving.

Create a file called features.py within the credit_scoring/ directory. Then add the following feature definition to features.py:

from datetime import timedelta

from feast import Entity, Feature, FeatureView, RedshiftSource, ValueType

zipcode = Entity(name="zipcode", value_type=ValueType.INT64)

zipcode_features = FeatureView(
    name="zipcode_features",
    entities=["zipcode"],
    ttl=timedelta(days=3650),
    features=[
        Feature(name="city", dtype=ValueType.STRING),
        Feature(name="state", dtype=ValueType.STRING),
        Feature(name="location_type", dtype=ValueType.STRING),
        Feature(name="tax_returns_filed", dtype=ValueType.INT64),
        Feature(name="population", dtype=ValueType.INT64),
        Feature(name="total_wages", dtype=ValueType.INT64),
    ],
    batch_source=RedshiftSource(
        query="SELECT * FROM spectrum.zipcode_features",
        event_timestamp_column="event_timestamp",
        created_timestamp_column="created_timestamp",
    ),
)

dob_ssn = Entity(
    name="dob_ssn",
    value_type=ValueType.STRING,
)

credit_history_source = RedshiftSource(
    query="SELECT * FROM spectrum.credit_history",
    event_timestamp_column="event_timestamp",
    created_timestamp_column="created_timestamp",
)

credit_history = FeatureView(
    name="credit_history",
    entities=["dob_ssn"],
    ttl=timedelta(days=90),
    features=[
        Feature(name="credit_card_due", dtype=ValueType.INT64),
        Feature(name="mortgage_due", dtype=ValueType.INT64),
        Feature(name="student_loan_due", dtype=ValueType.INT64),
        Feature(name="vehicle_loan_due", dtype=ValueType.INT64),
        Feature(name="hard_pulls", dtype=ValueType.INT64),
        Feature(name="missed_payments_2y", dtype=ValueType.INT64),
        Feature(name="missed_payments_1y", dtype=ValueType.INT64),
        Feature(name="missed_payments_6m", dtype=ValueType.INT64),
        Feature(name="bankruptcies", dtype=ValueType.INT64),
    ],
    batch_source=credit_history_source,
)

Feature views allow users to register data sources in their organizations into Feast, and then use those data sources for both training and online inference. The preceding feature view definition tells Feast where to find zip code and credit history features.

Now that we have defined our feature views, we can apply the changes to create our feature registry and configure our infrastructure:

feast apply

Registered entity dob_ssn
Registered entity zipcode
Registered feature view credit_history
Registered feature view zipcode_features
Deploying infrastructure for credit_history
Deploying infrastructure for zipcode_features

The preceding apply command will:

  • Store all entity and feature view definitions in a local file called registry.db.
  • Create an empty DynamoDB table for serving zip code and credit history features.
  • Ensure that your data sources on Redshift are available.

Building a training dataset

Our loan dataset contains our target variable, so we will load that first:

import pandas as pd

loans_df = pd.read_parquet("loan_table.parquet")

But this dataset does not contain all the features we need in order to make an accurate scoring prediction. We also must join our zip code and credit history features, and we need to do so in a point-in-time correct way.

First, we create a feature store object from our feature repository:

import feast

fs = feast.FeatureStore(repo_path="credit_scoring/")

Then we identify the features we want to query from Feast:

feast_features = [
    "zipcode_features:city",
    "zipcode_features:state",
    "zipcode_features:location_type",
    "zipcode_features:tax_returns_filed",
    "zipcode_features:population",
    "zipcode_features:total_wages",
    "credit_history:credit_card_due",
    "credit_history:mortgage_due",
    "credit_history:student_loan_due",
    "credit_history:vehicle_loan_due",
    "credit_history:hard_pulls",
    "credit_history:missed_payments_2y",
    "credit_history:missed_payments_1y",
    "credit_history:missed_payments_6m",
    "credit_history:bankruptcies",
]

Then we make a query from Feast to enrich our loan dataset. Feast will automatically detect the zipcode and dob_ssn join columns and join the feature data in a point-in-time correct way. It does this by only joining features that were available at the time the loan was active.

training_df = fs.get_historical_features(
    entity_df=loans_df,  # the loan dataframe loaded earlier
    features=feast_features,
).to_df()
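
To make "point-in-time correct" concrete, here is a minimal, dependency-free sketch of the idea. It is not Feast's implementation (Feast runs this join inside the offline store); the function name point_in_time_join and the sample rows are hypothetical. For each loan row, it picks the most recent feature row whose event timestamp is at or before the loan's timestamp and no older than a time-to-live window.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def point_in_time_join(entity_rows, feature_rows, key, ttl):
    """Attach to each entity row the latest feature row that is not from the future.

    Illustrative only: a sketch of the concept, not Feast's own code.
    """
    # Group feature rows by join key, sorted by event timestamp.
    by_key = {}
    for row in sorted(feature_rows, key=lambda r: r["event_timestamp"]):
        by_key.setdefault(row[key], []).append(row)

    joined = []
    for entity in entity_rows:
        candidates = by_key.get(entity[key], [])
        timestamps = [c["event_timestamp"] for c in candidates]
        # Index of the last feature row with timestamp <= the entity timestamp.
        idx = bisect_right(timestamps, entity["event_timestamp"]) - 1
        match = candidates[idx] if idx >= 0 else None
        # Discard matches that are older than the time-to-live window.
        if match and entity["event_timestamp"] - match["event_timestamp"] > ttl:
            match = None
        features = {k: v for k, v in (match or {}).items()
                    if k not in (key, "event_timestamp")}
        joined.append({**entity, **features})
    return joined


# Hypothetical sample rows for illustration.
loans = [
    {"dob_ssn": "19530219_5179", "event_timestamp": datetime(2021, 7, 1), "loan_status": 0},
]
credit_history = [
    {"dob_ssn": "19530219_5179", "event_timestamp": datetime(2021, 6, 1), "hard_pulls": 1},
    {"dob_ssn": "19530219_5179", "event_timestamp": datetime(2021, 8, 1), "hard_pulls": 4},
]

result = point_in_time_join(loans, credit_history, key="dob_ssn", ttl=timedelta(days=90))
print(result[0]["hard_pulls"])  # the June value; the August row is in the future
```

The August credit check is ignored because it happened after the loan's timestamp. Using it would leak future information into the training data.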

Once we have retrieved the complete training dataset, we can:

  • Drop timestamp columns and the loan_id column.
  • Encode categorical features.
  • Split the training dataframe into a train, validation, and test set.
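
The three steps above can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the code from the demo repository: the helper names prepare_rows and split_rows, the split ratios, and the sample rows are all hypothetical, and in practice you would likely use pandas and scikit-learn utilities for the same work.

```python
import random


def prepare_rows(rows, drop_columns, categorical_columns):
    """Drop unwanted columns and ordinal-encode categorical ones."""
    # Build a stable category -> integer mapping per categorical column.
    encoders = {
        col: {value: i for i, value in enumerate(sorted({r[col] for r in rows}))}
        for col in categorical_columns
    }
    prepared = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in drop_columns}
        for col, mapping in encoders.items():
            kept[col] = mapping[kept[col]]
        prepared.append(kept)
    return prepared, encoders


def split_rows(rows, train=0.7, validation=0.15, seed=42):
    """Shuffle deterministically and split into train, validation, and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Hypothetical rows standing in for training_df.
rows = [
    {"loan_id": i, "event_timestamp": "2021-07-28", "loan_intent": intent,
     "loan_amnt": 1000 * (i + 1), "loan_status": i % 2}
    for i, intent in enumerate(["EDUCATION", "PERSONAL"] * 10)
]

prepared, encoders = prepare_rows(
    rows,
    drop_columns={"loan_id", "event_timestamp"},
    categorical_columns=["loan_intent"],
)
train_set, validation_set, test_set = split_rows(prepared)
print(len(train_set), len(validation_set), len(test_set))
```

Keeping the encoders around matters: the same category-to-integer mapping has to be applied to incoming loan requests at inference time.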

Finally, we can train our classifier:

from sklearn import tree

clf = tree.DecisionTreeClassifier()
# Sort the columns so that the feature order is deterministic
clf.fit(train_X[sorted(train_X)], train_Y)

The full model training code is on GitHub.

DynamoDB as an online store

Before we can make online loan predictions with our credit scoring model, we must populate our online store with feature values. To load features into the online store, we use the materialize-incremental command:

CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME

This command will load features from our zip code and credit history data sources up to the $CURRENT_TIME. The materialize command can be repeatedly called as more data becomes available in order to keep the online store fresh.
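
Conceptually, materialization copies the newest feature values per entity, up to the cutoff time, from the offline store into a key-value online store. The following sketch illustrates that idea only. It is not how Feast is implemented; the materialize function and the dictionary standing in for DynamoDB are hypothetical, and the 2020 and 2030 population values are made up for the example.

```python
from datetime import datetime


def materialize(offline_rows, online_store, key, end_time):
    """Copy the newest feature row per entity (up to end_time) into online_store."""
    for row in sorted(offline_rows, key=lambda r: r["event_timestamp"]):
        if row["event_timestamp"] > end_time:
            continue  # never serve values from after the cutoff
        # Later rows overwrite earlier ones, so only the freshest value remains.
        online_store[row[key]] = {k: v for k, v in row.items() if k != key}
    return online_store


offline_rows = [
    {"zipcode": 94546, "event_timestamp": datetime(2017, 1, 1), "population": 35351},
    {"zipcode": 94546, "event_timestamp": datetime(2020, 1, 1), "population": 36000},
    {"zipcode": 94546, "event_timestamp": datetime(2030, 1, 1), "population": 40000},
]

online_store = materialize(offline_rows, {}, key="zipcode", end_time=datetime(2021, 7, 28))
print(online_store[94546]["population"])  # value from the 2020 row
```

Because the online store holds a single row per entity key, a lookup at serving time is one key-value read, which is what makes low-latency retrieval possible.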

Fetching a feature vector at low latency

Now we have everything we need to make a loan prediction.

# The incoming loan request is shown in the following object
loan_request = {
    "zipcode": [76104],
    "dob_ssn": ["19632106_4278"],
    "person_age": [133],
    "person_income": [59000],
    "person_home_ownership": ["RENT"],
    "person_emp_length": [123.0],
    "loan_intent": ["PERSONAL"],
    "loan_amnt": [35000],
    "loan_int_rate": [16.02],
}

# Next we fetch our online features from DynamoDB using Feast
customer_zipcode = loan_request["zipcode"][0]
dob_ssn = loan_request["dob_ssn"][0]
feature_vector = fs.get_online_features(
    entity_rows=[{"zipcode": customer_zipcode, "dob_ssn": dob_ssn}],
    features=feast_features,
).to_dict()

# Then we join the Feast features to the loan request
features = loan_request.copy()
features.update(feature_vector)
features_df = pd.DataFrame.from_dict(features)

# Finally we make a prediction. Apply the same preprocessing used in
# training (encoding, dropped columns) and the same sorted column order.
prediction = clf.predict(features_df[sorted(features_df)])  # 1 = default, 0 = will repay
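
The use case calls for a response within 100 ms, so it is worth measuring the serving path. The following is a minimal sketch of how that could be checked. The timed_call helper and the fake_predict stand-in are hypothetical; in a real system you would wrap the online feature lookup and the model prediction instead.

```python
import time


def timed_call(func, *args, budget_ms=100.0, **kwargs):
    """Run func and report its latency against a budget in milliseconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms, elapsed_ms <= budget_ms


def fake_predict(request):
    """Hypothetical stand-in for the feature lookup plus clf.predict."""
    return [0 if request["loan_amnt"][0] < 50000 else 1]


result, elapsed_ms, within_budget = timed_call(fake_predict, {"loan_amnt": [35000]})
print(f"prediction={result[0]} latency={elapsed_ms:.3f} ms within_budget={within_budget}")
```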

Conclusion

That’s it! We have a functional real-time credit scoring system.

Check out the Feast GitHub repository for the latest features, such as on-demand transformations, Feast server deployment to AWS Lambda, and support for streaming sources.

The complete end-to-end real-time credit scoring system is available on GitHub. Feel free to deploy it and try it out.

If you want to participate in the Feast community, join us on Slack, or read the Feast documentation to get a better understanding of how to use Feast.

Willem Pienaar

Willem is a principal engineer at Tecton where he leads open source development for Feast, the open source feature store. Willem previously started and led the data science platform team at Gojek, where his team built and operated an ML platform that supported ML systems for pricing, recommendations, forecasting, fraud detection, and matchmaking, all processing hundreds of millions of orders every month. His main focus areas are building operational data and ML tools and systems. In a previous life, Willem founded and sold a networking startup and was a software engineer in industrial control systems.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.
