Charlene Chambliss
2023 Aug 24
(NOTE: This is an interview I did for Aquarium's Tidepool blog while I was working there. Aquarium has since been acquired by Notion.)
Charlene Chambliss is a senior software engineer at Aquarium Learning, where she's working on tooling to help ML teams improve their model performance by improving their data. In addition to being an incredible engineer with an inspiring backstory, Charlene previously worked on NLP applications at Primer.AI. In this blog post, we interview Charlene about her experiences working with older models like BERT, and the perspective this gives her on the more recent wave of generative, RLHF-based LLMs (e.g. GPT-4 and LLaMA).
Could you talk a bit about what you worked on at Primer?
When I was at Primer, the main product, Analyze, analyzed raw news articles. We took in all the latest news, from news outlets both big and small, and we were trying to cluster them into discrete, structured "events."
For example, "Tropical Storm Hilary hits southern California" is an event. But identifying events is nontrivial, because every article will talk about this event differently, with different titles, vocabulary, focus, and so on. Also, what we humans think of as "an event" might span multiple days, or weeks, or months, so the event clustering problem is pretty challenging.
Much of my time at Primer was on the applied research side: training models, evaluating models, and setting them up to do model inference on the firehose of incoming documents. I worked on a bunch of different tasks: document-level classification, question-answering, relationship extraction, and named-entity recognition.
What does named entity recognition refer to?
Named-entity recognition (NER) is a classic NLP task that takes in a document and identifies the important "nouns" in the document text: for example, people, places, organizations, and miscellaneous entities (like the name of a product, or a proposed piece of legislation). So the goal was to be able to provide users with a list of entities associated with each event.
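As a concrete illustration (not Primer's production code), here's roughly what NER looks like today with an off-the-shelf fine-tuned BERT checkpoint from the Hugging Face Hub; the model name is just one public example:

```python
# Minimal NER sketch using the transformers library.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",    # example public BERT-based NER checkpoint
    aggregation_strategy="simple",  # merge word pieces into whole entities
)

text = "Tropical Storm Hilary hit Southern California on Sunday, officials in Los Angeles said."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
# Expected output is along the lines of: LOC "Southern California", LOC "Los Angeles"
```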
Who were your users?
The end users were mostly government analysts. They were responsible for monitoring some sort of key area, like "I need to know what's going on in Pakistan today", and could use the product as a source of "open-source intelligence" (OSINT). But you might imagine other sorts of users, like finance customers who need to monitor their portfolio companies, or anything tangentially related that might affect their stock prices. Anyone who might find it helpful to have the "pulse" on something.
What did a pre-GPT-era architecture look like?
I think the number one difference between an older model like BERT vs today's GPT-3.5 is that, generally speaking, the older models had to be trained to do one specific task and that task only. You could fine-tune a BERT model to do classification, or named entity recognition, or summarization, or what-have-you. But you couldn't, for example, take a classification model and then start using it to do question-answering; fine-tuned weights weren't transferable to other tasks.
If you wanted to do multiple tasks, you would have to train multiple models, each of which is its own distinct project and takes at least two weeks or so to be production-ready. The new models have been trained in a way that allows them to generalize to multiple different tasks, but at a slightly lower quality than if you had built a fine-tuned, dedicated model. (There are rumors that some of them ARE in fact multiple models, via the Mixture-of-Experts method, but those are still unverified!)
Effective real-world LLM applications actually are structured like this, drawing on multiple tasks. With Primer, for example, Analyze didn't just do named-entity recognition; it also did the initial clustering of the documents, and then within each cluster it performed extractive summarization, document classification by topic, quote extraction, and relationship extraction between entities, among other things.
So from an architectural perspective, in the pre-GPT days, your LLM application would often be made up of a collection of different, independently trained models. If GPT-3.5 had been available back then, it might have looked more like a series of API calls to a single catch-all model, where the tasks would be differentiated by prompts.
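To make that contrast concrete, here is a rough sketch of the "one catch-all model, many prompts" pattern, using the OpenAI Python SDK; the prompts and task names are illustrative, not Primer's actual pipeline:

```python
# Several tasks handled by one general model, differentiated only by the prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TASK_PROMPTS = {
    "classify": "Classify the topic of this news article in one word.",
    "ner": "List the people, places, and organizations mentioned in this article.",
    "summarize": "Summarize this article in two sentences.",
}

def run_task(task: str, article: str) -> str:
    """Same model, same endpoint; only the instructions change per task."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": TASK_PROMPTS[task]},
            {"role": "user", "content": article},
        ],
    )
    return response.choices[0].message.content
```

In the BERT era, each key in that dictionary would have been its own fine-tuned model, with its own training data, evaluation set, and deployment.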
Could you also talk a bit more about the differences in upfront cost between BERT and today's off-the-shelf models?
Another big difference is that you don't need your own inference infrastructure in order to use these new LLMs. When we were working with various fine-tuned BERT models, we'd have to figure out how to host them and run inference. At the time, there were maybe a few places starting to offer cloud hosting for ML inference workloads, but nothing like the ecosystem now.
Today you have HuggingFace, or Replicate, or tons of different places that allow you to just upload a model and ping their API for inference, without having to maintain your own infra for GPU-intensive workloads. Or, in reality, most people just use OpenAI's or Anthropic's general models. As a result, today's startups are able to build genuinely useful products with much smaller and less specialized teams.
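For example (a hedged sketch, not an endorsement of any particular vendor), calling a hosted model on the Hugging Face Hub can be as simple as this; the checkpoint name is an arbitrary public example:

```python
# Hosted inference: no GPU servers of your own to maintain.
from huggingface_hub import InferenceClient

client = InferenceClient(model="distilbert-base-uncased-finetuned-sst-2-english")
result = client.text_classification("The acquisition sent the stock soaring.")
print(result)  # top label(s) and confidence score(s) for the input text
```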
You'd mentioned two weeks to spin up a model for production; what was the bulk of this time spent on?
The biggest thing was data labeling. If you wanted a model to determine whether something is a financial document or not, you'd need to get a dataset where you have X documents that are finance-related and Y documents that aren't. Like the models themselves, datasets weren't reusable: you'd need an entirely new set of labeled documents for a new class, like whether a document is clickbait.
So you'd wait for a team to label some documents (enough for both training AND evaluation), your ML engineer would train a model on a small cloud VM or similar, and then they'd pull it back down and evaluate it. In my experience, the majority of those two weeks was waiting for turnaround on the data. Writing a script and training and evaluating the model often took as little as an hour or so. Sometimes I would label the data myself because I was impatient.
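For readers who haven't seen that workflow, here's a heavily condensed sketch of the "fine-tune a small classifier" step; the checkpoint, the placeholder documents, and the finance/not-finance labels are all illustrative:

```python
# Fine-tune a small BERT-style model for binary document classification.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Stand-in for the labeled data that comes back from the annotation team
# (1 = finance-related, 0 = not).
train_data = Dataset.from_dict({
    "text": ["Q3 earnings beat analyst expectations...", "The storm made landfall on Sunday..."],
    "label": [1, 0],
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

train_data = train_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finance-classifier", num_train_epochs=3),
    train_dataset=train_data,
)
trainer.train()
# In practice you'd also hold out a labeled evaluation set and check
# precision/recall before shipping the model to production.
```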
Do you think there's still a place today for these older models like BERT?
Absolutely. If you're dealing with a high-volume, narrowly-defined task, then there's potentially big cost savings to be had by using the older and smaller transformers. Or if your task is specialized, like if whatever domain you're working in isn't well-represented in the LLM training data.
- High volume: you have a bulk amount of documents that you just need to churn through every day.
- Narrowly-defined: all you want is named-entity recognition, or to classify the document as X/Y/Z, or to do some other single task reliably and at high accuracy.
For these kinds of tasks, at scale it's cheaper to fine-tune and deploy an older model than it would be to try to use GPT-3.5 or LLaMA. Older models are much smaller in terms of their memory footprint, and there have also been a lot of transformer-specific optimizations made in the last few years to make inference fast. It ends up costing a fraction of what it would cost to do that same task on the same number of documents using a newer and larger model.
That said, if your application really does need to handle unpredictable requests (users providing their own documents and expecting immediate answers, that kind of thing), it makes sense to use a larger, more general model, since you can't really encode that as a "do-at-scale" task for narrow models.
Wouldnât sourcing the training data still be a bottleneck, in the case of fine-tuning?
Actually, it's easier than ever to train the narrower models, because you can now use the generative models to label data for you. If your task is straightforward enough, the outputs from GPT-3.5 or LLaMA 2 might be correct on 90% of your examples, so instead of paying people to label 100% of the data, you pay just for inference and then for people to correct that other 10% of the data.
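Here's a hedged sketch of what that LLM-assisted labeling loop can look like; the prompt, the label names, and the load_unlabeled_documents helper are all hypothetical:

```python
# Have a general model propose labels; humans only correct the hard cases.
from openai import OpenAI

client = OpenAI()

def propose_label(document: str) -> str:
    """Ask the general model for a draft label; humans spot-check and correct later."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer with exactly one word: finance or other."},
            {"role": "user", "content": document},
        ],
    )
    return response.choices[0].message.content.strip().lower()

documents = load_unlabeled_documents()  # hypothetical helper for your own corpus
draft_labels = [propose_label(doc) for doc in documents]
# Route a sample (or the clearly suspect cases) to human annotators for correction,
# then fine-tune the small, dedicated model on the corrected dataset.
```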
What is your take on fine-tuning in general, as opposed to "waiting until GPT-5"? Is it reasonable to expect significant improvements in subsequent models?
I wouldn't feel comfortable saying yes or no at this point as to whether LLM quality will level off significantly. We haven't seen any quantum leaps since GPT-4, but that was released only earlier this year. There could very well be another quantum leap within the next 12 months.
From the product perspective, I do err on the side of waiting and seeing, because oftentimes the quality of the generation isn't the limiting factor in terms of whether people will use your product and get value out of it. The limiting factor is usually UX. A tool that's easy to use but occasionally produces bad outputs will have more users, every day of the week, than a tool with pristine outputs that's awkward to use.
You can always be working on better UX while waiting for LLM improvements. Just listen to what users say about using your product's interface, or integrations. Read their comments about the generation quality and see if it's actually a problem that can be solved by adding a simple heuristic on top, like good old string matching or regex. And don't forget about tools like Microsoft's guidance library, which can help you more easily apply well-researched prompt-engineering techniques to get more reliable outputs.
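As a small example of that "simple heuristic on top" idea (the failure mode shown here, a leaked preamble phrase, is purely illustrative):

```python
# Strip a known boilerplate preamble from model outputs with plain regex,
# rather than waiting for a better model.
import re

PREAMBLE = re.compile(
    r"^\s*(as an ai language model[,:]?\s*|sure[,!]?\s*here('s| is)[^:]*:\s*)",
    re.IGNORECASE,
)

def clean_generation(text: str) -> str:
    """Remove boilerplate lead-ins so users only see the actual answer."""
    return PREAMBLE.sub("", text, count=1).strip()

print(clean_generation("Sure, here's a summary: The storm made landfall on Sunday."))
# -> "The storm made landfall on Sunday."
```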
UX and generation quality will both be bottlenecks at different stages of your product's development. No one will notice generation quality if the UI is annoying to use, because they'll churn and go use something else, but once people are using it and liking it, they'll put up with some jank here and there if the outputs are good enough.
What are your top three favorite AI-powered interfaces right now?
If I had to pick a top three, my favorite is probably the perplexity.ai search engine. You can use it for open-ended questions, learning, debugging, and of course traditional search queries.
And then you have GitHub Copilot, of course. It's great for churning out boilerplate code. I think Copilot's autocomplete is currently still the best in the code-generation market. It's not perfect, but it often does know what you're trying to do, so it works well as long as you're willing to be attentive.
The third one is Sourcegraph Cody. Copilot is great for code generation, but Cody is king when it comes to understanding your entire codebase. For example, I've had it tell me about entire multipart request lifecycles, like "here's where the request is made in the frontend, how it's processed in the server, how it's processed by this other service, and the artifacts get uploaded to GCS here." Normally I would have to ask someone or just trace the code all the way through, which could take hours for something sufficiently complex.
Sometimes people dismiss AI coding tools because they don't want the tools to write code for them. So don't! They can still make you faster at other things, like writing tests and documentation. This is true in general: AI often isn't accurate enough to fully delegate tasks to, but it can still provide a lot of time savings as an assistant.
Thanks to Charlene Chambliss for this interview! You can find her on LinkedIn or check out her personal website.