Atlas is an AI-powered search engine that can find anything on Youtube.
Search “basketball” and it will take you to the exact point in the video where they talk about LeBron James and Michael Jordan, even though the word basketball is never actually said in that clip.
Search “city in california” and it will take you to the point in another video where they talk about Berkeley, a city in California. Again, note that the word California is not mentioned in that clip.
Ask Atlas “what shoes should I wear?” to find the videos that talk about fashion and the exact parts where they talk about shoes. The same goes for “best jeans to get”, “what are enzymes”, or “best ways to invest”.
You can even ask Atlas “how to find love” and it will give you specific advice on how to find love.
YouTube is the world’s largest source of information. 500 hours of video are uploaded to YouTube every minute, and about 6 million books’ worth of information is created on YouTube every year (calculations and source).
If YouTube were a library, it would hold about 99 million books. If websites were libraries, YouTube would be 16 times larger than Reddit, 50 times larger than Twitter, and 2,000 times larger than Wikipedia.
At 99 million books, it would be the 3rd largest library in the world, behind only the Library of Congress (173M) and the British Library (170M). It would be almost double the size of the next largest libraries: the Shanghai Library (56M), the New York Public Library (55M), Library and Archives Canada (54M), and even the Amazon Book store (48M) (calculations and source).
All of this to say, if there’s a piece of information you’re looking for, there’s a very good chance it’s on YouTube. However, unlike Reddit, Twitter, Wikipedia, books, and other text-based information sources, information on YouTube isn’t well indexed. Sure, you can do a keyword search, but you can’t find the precise timestamp of the information you want. Atlas fixes this.
By making a search engine for the world’s largest hub of knowledge, we unlock access to a huge untapped source of information.
These two diagrams show how data flows from a video and a query to a timestamped result. The first diagram is from Fixing YouTube Search with OpenAI's Whisper, and the second diagram was created by us and gives a more detailed breakdown of the steps involved.
Source: Fixing YouTube Search with OpenAI's Whisper
Full diagram (editable version)
At a high level, the project works in the following way: Whisper transcribes each video into timestamped segments, a sentence transformer converts each segment into a vector embedding, the search query is embedded the same way and compared against the segment embeddings, and the closest matches are returned with their timestamps (and can optionally be combined into a long-form answer).
For a more detailed code walkthrough, see the Atlas Search notebook and Atlas Long Form Question Answering Notebook.
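Here’s a minimal sketch of that flow in code. It’s a simplified version of what the notebooks walk through; the video file name and the use of PyTorch for the similarity math are placeholders for illustration.

import torch
import whisper
from sentence_transformers import SentenceTransformer, util

# 1. Transcribe the video into timestamped segments (the file name is a placeholder)
whisper_model = whisper.load_model("tiny.en")
transcript = whisper_model.transcribe("video.mp4")
segments = transcript["segments"]  # each segment has "start", "end" and "text"

# 2. Embed every segment with the sentence transformer
embedder = SentenceTransformer("multi-qa-mpnet-base-dot-v1")
segment_texts = [segment["text"] for segment in segments]
segment_embeddings = embedder.encode(segment_texts, convert_to_tensor=True)

# 3. Embed the query and find the most similar segment
query = "what shoes should I wear?"
query_embedding = embedder.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, segment_embeddings)[0]
best = int(torch.argmax(scores))

# 4. Return the matching text and the timestamp to jump to in the video
print(segments[best]["text"], "at", segments[best]["start"], "seconds")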
This tutorial was heavily inspired by the amazing Fixing Youtube Search with OpenAI’s Whisper and Making Youtube Search Better with NLP tutorials by James Briggs. Thanks James!
The list of models we used can be found on GitHub and the deployed version can be found on Hugging Face. All of the models we used were open because the AI revolution will be open.
tiny.en
"multi-qa-mpnet-base-dot-v1"
BART_LFQA
The part that we found most interesting is that, with minimal work, it’s able to connect “basketball” with LeBron James and Michael Jordan. Let’s take a closer look at how it does this.
The key innovation is a process called vector embeddings, which turns a set of words into an array of numbers. This array of numbers can be thought of as coordinates identifying a point in a virtual space.
It’s similar to how “Paris” can be represented as a latitude and longitude on a 2-dimensional map, and we can then use that latitude and longitude to find places near Paris.
We can do the same with our words as arrays of numbers; however, instead of a 2-dimensional coordinate, our words are represented as 768-dimensional coordinates. Since we’re working in a 768-dimensional space rather than a 2-dimensional one, we measure distance with a math function such as Euclidean distance or cosine similarity.
The cool thing about this is that 2 vectors being near each other mathematically also means that they’re near each other semantically: they have a similar meaning. See Using Semantic Search to Find GIFs.
So without any additional work, simply by converting a set of words to an array of numbers, we can use off-the-shelf math to compare it to other arrays of numbers, and if they’re mathematically close, they’re also semantically close (they have similar meaning).
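Here’s a small sketch of this idea using the same sentence transformer model we deploy later. The example sentences are made up for illustration, but they show how “basketball” ends up mathematically close to a sentence about LeBron James and Michael Jordan even though they share no words.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

query = "basketball"
sentences = [
    "LeBron James and Michael Jordan are often compared as the greatest of all time.",
    "Berkeley is known for its university and its food scene.",
]

# Each piece of text becomes a 768-dimensional vector (an array of numbers)
query_embedding = model.encode(query, convert_to_tensor=True)
sentence_embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity: vectors that are close in this space have similar meaning
scores = util.cos_sim(query_embedding, sentence_embeddings)[0]
print(scores)  # the LeBron/Jordan sentence should score noticeably higher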
This is why vectors are so powerful. In fact, the unreasonable effectiveness of vectors reminds me of Andrej Karpathy’s excellent post about the unreasonable effectiveness of recurrent neural networks.
Andrej’s article stuck in my mind because the name is so brilliant. The name might be inspired by The Unreasonable Effectiveness of Mathematics in the Natural Sciences.
Sentence embeddings are one of those technologies that really impressed me with how subtly powerful they are. I feel like I still only have a surface-level understanding of how they work and I plan on digging deeper. In the meantime, I would encourage others to read the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, or read one of the blog post summaries that are a bit simpler.
See: Atlas Long Form Question Answering Notebook
Once we’ve found the video segments that match our search term, the next step is to see if we can combine those individual segments to generate long-form answers. This uses a technique called sequence-to-sequence generation, which essentially takes a sequence of words and generates another sequence of words.
The model we used for this is Bart LFQA (Long Form Question Answering), which is trained on Reddit’s r/eli5, r/AskHistorians and r/askscience. Bart LFQA was based on the BART ELI5 model, and I would really recommend reading the blog post (accompanying paper), as it’s one of the most interesting and well-researched pieces of machine learning writing I have come across.
The workflow is as follows: the user enters a search term such as “what shoes should I wear?” and gets back a list of matches. Then we take those matches and ask our generator model to combine them into a coherent answer.
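A rough sketch of that generation step is below. The “question: … context: …” input format with passages joined by <P> tokens follows the model’s documentation; the matched segments and generation settings here are placeholders for illustration.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("vblagoje/bart_lfqa")
model = AutoModelForSeq2SeqLM.from_pretrained("vblagoje/bart_lfqa")

query = "what shoes should I wear?"
# Placeholder matches; in Atlas these are the transcript segments returned by the search step
matches = [
    "For everyday wear, a clean pair of white sneakers goes with almost anything.",
    "If you're walking a lot, prioritize comfort and support over style.",
]

# The model expects the question and the supporting passages in a single input string
context = "<P> " + " <P> ".join(matches)
query_and_docs = f"question: {query} context: {context}"

inputs = tokenizer(query_and_docs, truncation=True, return_tensors="pt")
outputs = model.generate(
    **inputs,
    min_length=32,
    max_length=128,
    num_beams=4,
    early_stopping=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))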
Running ML apps in a development environment was relatively easy. Google Colab has free GPU notebooks (update: AWS Sagemaker labs also seems to provide free GPU notebooks), many ML models are open source, and there are a lot of free ML tutorials available online. The hard part was deploying an ML app to a production environment so that other people can actually use it.
At first, this was a very frustrating experience because we would spend an entire day trying to use more “do it yourself” hosting options like Sagemaker only to realize that it was too complicated and switch to a different provider.
However, as the perpetual optimists we are, the benefit was that we learned about a lot of very interesting ML deployment products, such as:
The problem is that all of them were too confusing to use, especially because it wasn’t clear if we could customize the actual models beyond what each service exposed. For example, we wanted to be able to run 2 models in one deployment and it wasn’t clear how to do that on any of those options.
Another example: there’s a way to get Whisper to include timestamps when you transcribe, but you need to pass in the argument verbose=True. The options that claimed to be “easy to use” achieved that ease by removing a lot of customizability, so it wasn’t clear that any of the 3 options we tried would let you add timestamps.
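For reference, here’s roughly what that looks like when you run Whisper yourself; the audio file name is a placeholder, and the segment-level timestamps come back in the result dictionary.

import whisper

model = whisper.load_model("tiny.en")

# verbose=True prints each segment with its timestamps as it is transcribed;
# the returned result also contains the segments with "start" and "end" times
result = model.transcribe("audio.mp3", verbose=True)

for segment in result["segments"]:
    print(f'{segment["start"]:.1f}s - {segment["end"]:.1f}s: {segment["text"]}')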
There was another really cool service called vast.ai which I discovered through some very obscure Twitter searches. Tip: Twitter search is vastly underrated (see Appendix: Twitter Search). It basically allows you to rent GPUs from a decentralized market.
Unfortunately, that was also very hard to use. We found a video transcription project that used Vast.ai and it was the best tutorial we could find on using Vast but even that was too complicated.
If you work at Vast, we think you have a brilliant business model with the most upside potential in the ML deployment space, but you need to make it easier for ML developers to use your platform.
Note: ML and AI are being used interchangeably in this essay
Basically, all the ML deployment options we came across were unusable, until we found the glorious Hugging Face Inference Endpoints. They provided the perfect Goldilocks “just right” mix between Sagemaker’s customizable but complicated setup and the cottage industry of other providers where you couldn’t customize anything.
To make it even easier, the documentation was excellent. We followed along with the tutorial Custom Inference with Hugging Face Inference Endpoints by Phil Schmid, which explained everything brilliantly, and we were able to deploy our model easily.
If you want to create a Custom Handler for an existing model from the community, you can use the repo_duplicator to create a repository fork, which you can then use to add your handler.py.
Custom Inference with Hugging Face Inference Endpoints
We were worried that we would have to deploy 3 separate endpoints to support the Whisper, BERT and LFQA models, but impressively, we initialized the other 2 models exactly the same way we did the Whisper model and all 3 “just worked” in the same instance.
"""
See: https://www.philschmid.de/custom-inference-handler
"""
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import whisper
import torch
class EndpointHandler():
WHISPER_MODEL_NAME = "tiny.en"
SENTENCE_TRANSFORMER_MODEL_NAME = "multi-qa-mpnet-base-dot-v1"
QUESTION_ANSWER_MODEL_NAME = "vblagoje/bart_lfqa"
def __init__(self, path=""):
# load the models
device = "cuda" if torch.cuda.is_available() else "cpu"
self.whisper_model = whisper.load_model(WHISPER_MODEL_NAME).to(device)
self.sentence_transformer_model = SentenceTransformer(SENTENCE_TRANSFORMER_MODEL_NAME)
self.question_answer_tokenizer = AutoTokenizer.from_pretrained(self.QUESTION_ANSWER_MODEL_NAME)
self.question_answer_model = AutoModelForSeq2SeqLM.from_pretrained(self.QUESTION_ANSWER_MODEL_NAME).to(device)
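The handler also needs a __call__ method that the endpoint invokes for every request; ours isn’t shown above. Below is a rough sketch of how dispatching between the 3 models could look, written as it would sit inside the class. The request fields ("task", "audio_path", "sentences", "query", "documents") are illustrative names, not Atlas’s actual API.

    # Hedged sketch of the request entry point; the field names inside "inputs" are
    # illustrative, not Atlas's real API.
    def __call__(self, data):
        inputs = data.get("inputs", {})
        task = inputs.get("task", "transcribe")

        if task == "transcribe":
            # returns the full text plus timestamped segments
            return self.whisper_model.transcribe(inputs["audio_path"])

        if task == "embed":
            # returns one 768-dimensional vector per input sentence
            return self.sentence_transformer_model.encode(inputs["sentences"]).tolist()

        # otherwise, generate a long-form answer from the matched segments
        question = "question: {} context: {}".format(
            inputs["query"], "<P> " + " <P> ".join(inputs["documents"]))
        tokens = self.question_answer_tokenizer(question, truncation=True, return_tensors="pt")
        tokens = {key: value.to(self.question_answer_model.device) for key, value in tokens.items()}
        output = self.question_answer_model.generate(**tokens, max_length=256, num_beams=4)
        return self.question_answer_tokenizer.decode(output[0], skip_special_tokens=True)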
Huge shoutout to the Huggingface team. They’ve solved a very important problem. If you are an ML deployment startup in this space, literally just copy what Huggingface is doing.
When we look at the advancements in the AI space, model creation and implementation are very advanced; model deployment, however, still has a lot of room for improvement.
However, Hugging Face’s amazing service did not come cheap.
Hosting our model on Hugging Face costs $438.05 a month (roughly $0.60 an hour, running around the clock). We use a Small GPU instance with 1x Nvidia Tesla T4.
The following is the invoice of what we paid in the last 2 months.
We have a lot of thoughts regarding the cost of deploying a model using Huggingface and other options. See the Appendix for more on this.
Machine Learning applications are an increasingly important part of our society. Atlas is a great way to make these tools accessible for everyone so we can become participants and not just spectators in the glorious intelligence revolution.
While writing this post and updating Atlas, we logged on to our Hugging Face Deployed Endpoints tab and realized we had accidentally left 4 servers running at the same time, which caused us to waste about $100 in unused server costs.
Hugging Face doesn’t support updating existing endpoints to a newer commit (updates are pinned by revision hash), so each time we updated handler.py we had to delete the existing endpoint and deploy a new one. When we were updating the code frequently, we created newer endpoints, forgot to delete the older ones, and thus ended up with 4 endpoints running at the same time!
We accept responsibility that this was our fault, but we also think Hugging Face could do a better job of protecting users from unintended billing.
Hugging Face probably has a long backlog of things to fix, but in my opinion this is a very high-priority fix because there is a direct dollar value attached to fixing it for users.
Again, we’re very wary of the incentive alignment problem in fixing this: they make more money if they don’t fix it, so I can imagine them deprioritizing the issue. But if Hugging Face is taking a long-term view, people will use Hugging Face more, and they will make more money, if they remove the unexpected-cost anxiety and give people more control over their costs on Hugging Face.
Hosting an ML model on Hugging Face is simply too expensive, so we’re actively looking for a cheaper alternative.
Note that it’s not necessarily that paying $400 is the problem; it’s that our service is still a relatively small side project serving fewer than 100 requests a day. We understand the unit economics of hosting a GPU instance, so we’d be willing to pay even upwards of $400 if we were using that much compute. But $400 for such low usage is way too much.
If you’re at a different company or you know of a better solution please reach out.
You can message Tomiwa on Twitter (@tomiwa1a) or send an email to tomiwa with atila.ca as the domain name.
Before you recommend something, please note that we have very specific criteria.
I know your incentives may cause you to not want to show us how to spend less money on your platform, but consider that it’s better than us migrating completely and you getting $0. One idea we saw was hosting the Hugging Face models on Sagemaker using the Sagemaker Huggingface inference toolkit. Note that the tutorial talks about finetuning and training, which we’re not interested in doing; we just want deployment and inference.
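For context, here’s roughly what deploying a model with the Sagemaker Huggingface inference toolkit looks like. The framework versions, instance type, and single model ID are illustrative; we would still need to adapt this to our 3-model custom handler and compare the per-hour cost.

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

# IAM role with SageMaker permissions (works as-is inside SageMaker; otherwise pass a role ARN)
role = sagemaker.get_execution_role()

# Pull a model straight from the Hugging Face Hub; versions and model ID are illustrative
huggingface_model = HuggingFaceModel(
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    env={
        "HF_MODEL_ID": "vblagoje/bart_lfqa",
        "HF_TASK": "text2text-generation",
    },
    role=role,
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",  # 1x Nvidia T4, comparable to the endpoint we use today
)

print(predictor.predict({
    "inputs": "question: what are enzymes? context: <P> Enzymes are proteins that speed up chemical reactions."
}))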
We have an active issue on Github where we’re discussing the different options.
If you’re at Hugging Face and reading this, please show me how we can continue using your service or models in a cheaper way. I want to keep using Hugging Face; you’re one of my favorite AI companies at the moment. I love your mission of democratizing machine learning and you have great people working there. However, $400 a month to host an ML model will not democratize machine learning.
Atlas is an open platform. We chose this because we believe that open platforms last longer than closed platforms and that they’re better for society.
AI and machine learning are going to be among the most powerful technologies for humanity. Such a powerful tool should be built in the open, as a forcing function for transparency and for making decisions that are most aligned with humanity.
Atlas being an open platform means that it’s both open source and open state. Open source means that the frontend and backend code is open-source and licensed as copyleft, which means that anyone is free to fork it and use it but they must also open source it.
Open state means that we will make as much of our data as possible open, easily accessible, and easily exportable, starting with a JSON dump of the transcribed videos.
All the ML models used to build Atlas are also open source. The main benefit is that this reduces our API counterparty risk, such as having access unexpectedly cut off or prices raised, and it allows anyone to re-create our setup using their own infrastructure.
Twitter Search is a vastly underrated tool. It played a big role in helping me learn what we needed to build Atlas. Vast.ai, Pinecone, Whisper, and Banana.dev are just a few examples of tools that we discovered through casually browsing Twitter and Advanced Twitter Search.
Here are some tips for finding information on Twitter: