Charlene's thoughts on software, language models, and, well... mostly just those two. 😄
Introducing FoodBERT: Food Extraction with DistilBERT

Charlene Chambliss

2020 Nov 1

I built a token classification model using DistilBERT to provide a lightweight and fast method for extracting foods and ingredients from structured and unstructured text.

To my knowledge, prior to this model, there were no open-source neural models for the extraction of foods and food ingredients from text.

From an educational perspective, this codebase is also a good example of how to build a modern token classification model using the latest utilities from the HuggingFace transformers library.

The model has both research and commercial applications; for example, researchers can use it to algorithmically extract ingredients from cookbooks, then chart the popularity of different ingredients throughout history.

Those with a commercial interest can run it on news articles to understand which foods and ingredients are trending, and what kinds of new products are being invested in within the CPG space.

In terms of time spent, I worked on this on weekends for about a month to a month and a half, including labeling my own data.

So, end to end, I’d say building and documenting the entire repo took about 12 days, a far cry from the ~2 months it took to do my first BERT project!

This is in part due to the wonderful progress HuggingFace has made with transformers, and in part due to the fact that stuff like this is my full-time job now. 😉

Check out this repo to use the model yourself, or to read more about the technical details.

Here’s an excerpt from the README demonstrating how the model can be used:

Load the trained model from the transformers model zoo

Loading the trained model from HuggingFace can be done in a single line:

from food_extractor.food_model import FoodModel
model = FoodModel("chambliss/distilbert-for-food-extraction")

This downloads the model from HF’s S3 bucket and means you will always be using the best-performing/most up-to-date version of the model.

You can also load a model from a local directory using the same syntax.

Extract foods from some text

The model is especially good at extracting ingredients from lists of recipe ingredients, since there are many training examples of this format:

>>> examples = """3 tablespoons (21 grams) blanched almond flour
... ¾ teaspoon pumpkin spice blend
... ⅛ teaspoon baking soda
... ⅛ teaspoon Diamond Crystal kosher salt
... 1½ tablespoons maple syrup or 1 tablespoon honey
... 1 tablespoon (15 grams) canned pumpkin puree
... 1 teaspoon avocado oil or melted coconut oil
... ⅛ teaspoon vanilla extract
... 1 large egg""".split("\n")

>>> model.extract_foods(examples[0])
[{'Product': [], 'Ingredient': [{'text': 'almond flour', 'span': [34, 46], 'conf': 0.9803279439608256}]}]

>>> model.extract_foods(examples)
[{'Product': [], 'Ingredient': [{'text': 'almond flour', 'span': [34, 46], 'conf': 0.9803279439608256}]},
{'Product': [], 'Ingredient': [{'text': 'pumpkin spice blend', 'span': [11, 30], 'conf': 0.8877270460128784}]},
{'Product': [], 'Ingredient': [{'text': 'baking soda', 'span': [11, 22], 'conf': 0.89898481965065}]},
{'Product': [{'text': 'Diamond Crystal kosher salt', 'span': [11, 38], 'conf': 0.7700592577457428}], 'Ingredient': []},
... (further results omitted for brevity)
]
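
Because the batched output is a list of plain dicts, one per input string, it is easy to post-process. As a minimal sketch (the `names_above` helper and the 0.85 cutoff are hypothetical and not part of the repo), the four results shown above could be filtered and flattened like this:

```python
# Results copied from the batched example above (first four inputs only).
results = [
    {'Product': [], 'Ingredient': [{'text': 'almond flour', 'span': [34, 46], 'conf': 0.9803279439608256}]},
    {'Product': [], 'Ingredient': [{'text': 'pumpkin spice blend', 'span': [11, 30], 'conf': 0.8877270460128784}]},
    {'Product': [], 'Ingredient': [{'text': 'baking soda', 'span': [11, 22], 'conf': 0.89898481965065}]},
    {'Product': [{'text': 'Diamond Crystal kosher salt', 'span': [11, 38], 'conf': 0.7700592577457428}], 'Ingredient': []},
]


def names_above(results, min_conf=0.85):
    """Hypothetical helper: collect (label, text) pairs whose confidence meets min_conf."""
    kept = []
    for result in results:
        for label, entities in result.items():
            for entity in entities:
                if entity['conf'] >= min_conf:
                    kept.append((label, entity['text']))
    return kept


print(names_above(results))
```

With this cutoff the kosher salt product (confidence of about 0.77) is dropped; lowering `min_conf` keeps it. The right threshold depends on how much noise your use case can tolerate.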

It also works well on standard prose:

>>> text = """Swiss flavor company Firmenich used artificial intelligence (AI) in partnership with Microsoft to optimize flavor combinations and create a lightly grilled beef taste for plant-based meat alternatives, according to a release."""

>>> model.extract_foods(text)
[{'Product': [],
'Ingredient': [{'text': 'beef', 'span': [156, 160], 'conf': 0.9615312218666077},
{'text': 'plant', 'span': [171, 176], 'conf': 0.8789700269699097},
{'text': 'meat', 'span': [183, 187], 'conf': 0.9639666080474854}]}]
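
The `span` values in these results appear to be character offsets into the input string, with the start inclusive and the end exclusive. That is an inference from the numbers shown rather than something stated here, but it can be checked by slicing the text directly:

```python
text = "Swiss flavor company Firmenich used artificial intelligence (AI) in partnership with Microsoft to optimize flavor combinations and create a lightly grilled beef taste for plant-based meat alternatives, according to a release."

# Ingredient entries copied from the output above.
ingredients = [
    {'text': 'beef', 'span': [156, 160], 'conf': 0.9615312218666077},
    {'text': 'plant', 'span': [171, 176], 'conf': 0.8789700269699097},
    {'text': 'meat', 'span': [183, 187], 'conf': 0.9639666080474854},
]

for entity in ingredients:
    start, end = entity['span']
    # Slicing the original string with the span should reproduce the entity text.
    print(repr(text[start:end]), text[start:end] == entity['text'])
```

This makes it straightforward to highlight extracted foods in the original text or to pull in surrounding context for each match.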

To get raw predictions, you can also use model.predict directly.

However, extract_foods has a couple of heuristics added to remove low-quality predictions, so model.predict is likely to give slightly worse performance.

That said, it is useful for examining the raw labels/probabilities/etc. from the forward pass.

# Using the same text as the previous example
>>> predictions = model.predict(text)[0]

# All data available from the example
>>> predictions.keys()
dict_keys(['tokens', 'labels', 'offsets', 'probabilities', 'avg_probability', 'lowest_probability', 'entities'])

>>> for t, p in zip(predictions['tokens'], predictions['probabilities']):
...     print(t, round(p, 3))
Swiss 0.991
flavor 0.944
company 0.998
Fi 0.952
...

# Get the token the model was least confident in predicting
>>> least_confident = predictions['probabilities'].index(predictions['lowest_probability'])
>>> predictions['tokens'][least_confident]
'plant'

# Get the dict of ingredients and products
>>> predictions['entities']
{'Product': [],
 'Ingredient': [{'text': 'beef',
   'span': [156, 160],
   'conf': 0.9615312218666077},
   ...
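
One caveat with the least-confident lookup above: `list.index` returns the first position whose value equals `lowest_probability`, which relies on an exact float match. A slightly more defensive sketch finds the position of the minimum directly. The dict below is only a stand-in built from the four rounded token/probability pairs printed earlier, so the answer here is 'flavor' rather than 'plant', which comes from the full text.

```python
# Stand-in for model.predict(text)[0], truncated to the four tokens printed above.
predictions = {
    'tokens': ['Swiss', 'flavor', 'company', 'Fi'],
    'probabilities': [0.991, 0.944, 0.998, 0.952],
}

probs = predictions['probabilities']
# Index of the smallest probability, without comparing floats for equality.
least_confident = min(range(len(probs)), key=probs.__getitem__)
print(predictions['tokens'][least_confident], probs[least_confident])
```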

If you found it useful, I’d love to hear from you!
