Atlas is an AI-powered search engine that can find anything on Youtube.

Search “basketball” and it will take you to the exact point in the video where they talk about LeBron James and Michael Jordan, even though the word “basketball” is never actually said in that clip.

Search “city in california” and it will take you to the point in another video where they talk about Berkeley, a city in California. Again, note that the word California is not mentioned in that clip.

Ask Atlas “what shoes should I wear?” to find the videos that talk about fashion and the exact part where they talk about shoes. Similarly for “best jeans to get”, “what are enzymes”, or “best ways to invest”.

You can even ask Atlas “how to find love” and it will give you specific advice on how to find love.

Why Youtube

If YouTube were a library, it’d be the 3rd largest library in the world

YouTube is the world’s largest source of information. 500 hours of video are uploaded to YouTube every minute, and 6 million books’ worth of information is created on YouTube every year (calculations and source).

If YouTube were a library, it would have about 99 million books. If websites were libraries, YouTube would be 16 times larger than Reddit, 50 times larger than Twitter, and 2,000 times larger than Wikipedia.

If YouTube were a library, it would be the 3rd largest library in the world with 99 million books, behind only the Library of Congress (173M) and the British Library (170M). It would be almost double the size of the next largest libraries: the Shanghai Library (56M), the New York Public Library (55M), and Library and Archives Canada (54M), and even the Amazon book store (48M) (calculations and source).

All of this to say, if there’s a piece of information you’re looking for, there’s a very good chance it’s on Youtube. However, unlike Reddit, Twitter, Wikipedia, books and other text-based information sources, information on Youtube isn’t well indexed. Sure, you can do a keyword search but you can’t find the precise timestamp of the information you want. Atlas fixes this.

By making a search engine for the world’s largest hub of knowledge, we unlock access to a huge untapped source of information.

How Atlas Works

These two diagrams show the data flow from a video and a query to a timestamped result. The first diagram is from Fixing YouTube Search with OpenAI's Whisper, and the second was created by us and gives a more detailed breakdown of the steps involved.

openai-whisper-2-1.png

Source: Fixing YouTube Search with OpenAI's Whisper

Screen Shot 2023-01-21 at 8.45.04 PM.png

Full diagram (editable version)

At a high level, the project works in the following way (a minimal code sketch follows the list):

  1. Get the transcript of a YouTube video from its URL using the YouTube Transcript API.
  2. If a transcript doesn’t exist, download the audio of the video as an mp3 file with Pytube and transcribe it with our first ML model, OpenAI Whisper.
  3. Break the transcript into shorter segments and convert each segment into a 768-dimensional vector, a process known as embedding, using our second ML model, UKP Lab’s Sentence-BERT sentence transformer.
  4. Save the vectors in a vector database; Atlas uses Pinecone.
  5. Take the search phrase and embed it into a 768-dimensional vector.
  6. Use the vector database to find which transcript segment vector is nearest to our search phrase vector (more information on what it means for a vector to be near).
  7. Combine the search results to create a long-form answer using our 3rd ML model, BART LFQA.
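
To make the steps concrete, here is a minimal sketch of the indexing and search flow in Python, using the youtube-transcript-api, sentence-transformers, and Pinecone client libraries. The index name, segment size, and metadata fields are illustrative assumptions rather than Atlas’s exact values; the notebooks linked below show the real implementation.

from youtube_transcript_api import YouTubeTranscriptApi
from sentence_transformers import SentenceTransformer
import pinecone

# 768-dimensional embeddings, same model listed under "Models Used"
embedder = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

# Assumes a Pinecone index of dimension 768 already exists
pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")  # placeholder credentials
index = pinecone.Index("atlas-demo")  # hypothetical index name

def index_video(video_id: str, lines_per_segment: int = 5):
    # Step 1: fetch the transcript (the real app falls back to Whisper if none exists)
    transcript = YouTubeTranscriptApi.get_transcript(video_id)

    # Step 3: group caption lines into larger segments and embed each one
    segments = [transcript[i:i + lines_per_segment]
                for i in range(0, len(transcript), lines_per_segment)]
    texts = [" ".join(line["text"] for line in seg) for seg in segments]
    starts = [seg[0]["start"] for seg in segments]
    vectors = embedder.encode(texts).tolist()

    # Step 4: save each segment vector, with its timestamp, in the vector database
    index.upsert([
        (f"{video_id}-{i}", vector, {"video_id": video_id, "start": start, "text": text})
        for i, (vector, start, text) in enumerate(zip(vectors, starts, texts))
    ])

def search(query: str, top_k: int = 3):
    # Steps 5-6: embed the search phrase and find the nearest transcript segments
    query_vector = embedder.encode(query).tolist()
    return index.query(vector=query_vector, top_k=top_k, include_metadata=True)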

For a more detailed code walkthrough, see the Atlas Search notebook and Atlas Long Form Question Answering Notebook.

This tutorial was heavily inspired by the amazing Fixing Youtube Search with OpenAI’s Whisper and Making Youtube Search Better with NLP tutorials by James Briggs. Thanks James!

Models Used

The list of models we used can be found on GitHub and the deployed version can be found on Hugging Face. All of the models we used are open because the AI revolution will be open.

  1. Transcription: OpenAI Whisper tiny.en
  2. Embeddings: Sentence Transformers "multi-qa-mpnet-base-dot-v1"
  3. Long form Question Answering: BART_LFQA

How Search Works

See: Atlas Search notebook

The part that we found most interesting is that, with minimal work required, it’s able to connect “basketball” with LeBron James and Michael Jordan. Let’s take a closer look at how it does this.

Vector Embeddings

Screen Shot 2023-01-22 at 12.24.52 PM.png

The key innovation is a process called vector embeddings, which turns a set of words into an array of numbers. This array of numbers can be thought of as coordinates identifying a point in a virtual space.

This is similar to how “Paris” can be represented as a latitude and longitude on a 2-dimensional map, and we can then use that latitude and longitude to find places near Paris.

Screen Shot 2023-01-22 at 12.05.03 PM.png

We can do the same with our words as arrays of numbers; however, instead of 2-dimensional coordinates, our words are represented as 768-dimensional coordinates. Since they live in a 768-dimensional space rather than a 2-dimensional one, we have to measure distance with a mathematical function such as Euclidean distance or cosine similarity.
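
To make this concrete, here is a small sketch using the same multi-qa-mpnet-base-dot-v1 model listed under “Models Used”. The example phrases are made up, but a query like “basketball” should land nearer to a segment about LeBron James and Michael Jordan than to one about Berkeley:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

# Hypothetical query and transcript-like segments
query = model.encode("basketball")
segment_a = model.encode("LeBron James and Michael Jordan are the greatest players of all time")
segment_b = model.encode("Berkeley is a lovely city to live in")

# Cosine similarity: closer to 1 means the vectors (and meanings) are nearer
print(util.cos_sim(query, segment_a))  # expected to be the higher score
print(util.cos_sim(query, segment_b))  # expected to be the lower score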

The cool thing about this is that two vectors being near each other mathematically also means that they’re near semantically: they have a similar meaning. See Using Semantic Search to Find GIFs.

https://i.imgur.com/Ct5p3SX.png

So, without any additional work, simply by converting a set of words into an array of numbers, we can use off-the-shelf math to compare it to other arrays of numbers, and if they’re mathematically close, they’re also semantically close (they have a similar meaning).

This is why vectors are so powerful. In fact, the unreasonable effectiveness of vectors reminds me of Andrej Karpathy’s excellent post about the unreasonable effectiveness of recurrent neural networks.

Andrej’s article stuck in my mind because the name is so brilliant. The name might have been inspired by The Unreasonable Effectiveness of Mathematics in the Natural Sciences.

Sentence embeddings are one of those technologies that really impressed me with how subtly powerful they are. I feel like I still only have a surface-level understanding of how they work, and I plan on digging deeper. In the meantime, I would encourage others to read the paper, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, or one of the blog post summaries that are a bit simpler.

Fb-mTl-WQAIUqlB.jpg

Source: @jamescalam

Generating Long Form Answers

See: Atlas Long Form Question Answering Notebook

Once we’ve found the video segments that match our search term, the next step is to combine the individual video segments to generate a long-form answer. This uses a technique called sequence-to-sequence generation, which essentially takes a sequence of words and generates another sequence of words.

The model we used for this is BART LFQA (Long Form Question Answering), which is trained on Reddit’s r/eli5, r/askhistorians, and r/askscience. BART LFQA is based on the BART ELI5 model; I would really recommend reading the blog post (and accompanying paper), as it’s one of the most interesting and well-researched pieces of machine learning writing I have come across.

The workflow is as follows: the user enters a search term such as “what shoes should I wear” and gets back a list of matches. We then take those matches and ask our generator model to combine them into a coherent answer.
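
Here is a rough sketch of that generation step with the vblagoje/bart_lfqa model. The example matches are invented, and the “question: … context: <P> …” input format follows the model card, so treat it as an assumption to verify against the notebook linked above rather than the exact code Atlas runs:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "vblagoje/bart_lfqa"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Hypothetical transcript segments returned by the vector search
query = "what shoes should I wear"
matches = [
    "sneakers are versatile and work with almost any casual outfit",
    "for formal events a pair of leather loafers is a safe choice",
]

# The question and its supporting passages go in as a single string,
# with passages separated by <P> markers
context = "<P> " + " <P> ".join(matches)
inputs = tokenizer(f"question: {query} context: {context}", return_tensors="pt")

output_ids = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    min_length=32,
    max_length=256,
    num_beams=4,
    early_stopping=True,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])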

Deploying the Model

Running ML apps in a development environment was relatively easy. Google Colab has free GPU notebooks (update: AWS SageMaker Studio Lab also seems to provide free GPU notebooks), many ML models are open source, and there are a lot of free ML tutorials available online. The hard part was deploying an ML app to a production environment so that other people can actually use it.

At first, this was a very frustrating experience: we would spend an entire day trying to use more “do it yourself” hosting options like SageMaker, only to realize it was too complicated, and switch to a different provider.

However, being the perpetual optimists we are, the benefit was that we learned about a lot of very interesting ML deployment products, such as:

  1. banana.dev
  2. cortex.dev
  3. Paperspace

The problem is that all of them were too confusing to use, especially because it wasn’t clear whether, or how much, we could customize the actual models. For example, we wanted to be able to run 2 models in one deployment, and it wasn’t clear how to do that on any of those options.

Another example: there’s a way to get Whisper to include timestamps when you transcribe, but you need to pass in the argument verbose=True. The providers that claimed to be “easy to use” achieved this ease by removing a lot of customizability, so it wasn’t clear that any of the 3 options we tried would let you add timestamps.
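
For reference, here is roughly what that looks like with the open-source whisper package (the mp3 filename is a placeholder): verbose=True prints each segment as it is decoded, and the per-segment start and end times come back in result["segments"]:

import whisper

model = whisper.load_model("tiny.en")

# Transcribe a locally downloaded audio file (placeholder filename)
result = model.transcribe("video_audio.mp3", verbose=True)

# Each segment carries the timestamps we need to link back to the video
for segment in result["segments"]:
    print(f'{segment["start"]:.1f}s - {segment["end"]:.1f}s: {segment["text"]}')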

There was another really cool service called vast.ai which I discovered through some very obscure Twitter searches. Tip: Twitter search is vastly underrated (see Appendix: Twitter Search). It basically allows you to rent GPUs from a decentralized market.

Unfortunately, that was also very hard to use. We found a video transcription project that used Vast.ai and it was the best tutorial we could find on using Vast but even that was too complicated.

If you work at Vast.ai, we think you have a brilliant business model with the most upside potential in the ML deployment space, but you need to make it easier for ML developers to use your platform.

Note: ML and AI are being used interchangeably in this essay

Hosting our ML Model with Hugging Face Inference Endpoint

Basically, all the ML deployment options we came across were unusable, until we came across the glorious Hugging Face Inference Endpoints. This provided the perfect Goldilocks “just right” mix between SageMaker’s customizable but complicated setup and the cottage industry of other providers where you couldn’t customize anything.

To make it even easier, the documentation was excellent. We followed the tutorial Custom Inference with Hugging Face Inference Endpoints by Phil Schmid, which explained everything brilliantly, and we were able to deploy our model easily.

If you want to create a Custom Handler for an existing model from the community, you can use the repo_duplicator to create a repository fork, which you can then use to add your handler.py.

Custom Inference with Hugging Face Inference Endpoints

We were worried that we would have to deploy 3 separate endpoints to support the Whisper, BERT, and LFQA models, but impressively, we initialized the other 2 models in the same instance exactly the same way we did the Whisper model and it “just worked”.

Full code snippet

"""
See: https://www.philschmid.de/custom-inference-handler
"""
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import whisper
import torch

class EndpointHandler():
    WHISPER_MODEL_NAME = "tiny.en"
    SENTENCE_TRANSFORMER_MODEL_NAME = "multi-qa-mpnet-base-dot-v1"
    QUESTION_ANSWER_MODEL_NAME = "vblagoje/bart_lfqa"
    def __init__(self, path=""):
        # load the models

        device = "cuda" if torch.cuda.is_available() else "cpu"

        self.whisper_model = whisper.load_model(self.WHISPER_MODEL_NAME).to(device)

        self.sentence_transformer_model = SentenceTransformer(self.SENTENCE_TRANSFORMER_MODEL_NAME)

        self.question_answer_tokenizer = AutoTokenizer.from_pretrained(self.QUESTION_ANSWER_MODEL_NAME)
        self.question_answer_model = AutoModelForSeq2SeqLM.from_pretrained(self.QUESTION_ANSWER_MODEL_NAME).to(device)
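
The snippet above only shows __init__; the full handler.py on GitHub also implements the __call__ method that Inference Endpoints invokes on each request. A simplified, hypothetical sketch of how the three models could be routed (the request fields here are illustrative, not necessarily the ones Atlas uses):

    def __call__(self, data):
        # Hypothetical routing based on a "task" field in the request body;
        # the exact schema in Atlas's handler.py may differ
        inputs = data.get("inputs")
        task = data.get("task", "transcribe")

        if task == "transcribe":
            # inputs is assumed to be a path to a downloaded audio file
            result = self.whisper_model.transcribe(inputs)
            return {"segments": result["segments"]}

        if task == "embed":
            # inputs is assumed to be a list of transcript segments
            return {"embeddings": self.sentence_transformer_model.encode(inputs).tolist()}

        # Default: long form question answering on a "question: ... context: ..." string
        tokenized = self.question_answer_tokenizer(inputs, return_tensors="pt").to(self.question_answer_model.device)
        output_ids = self.question_answer_model.generate(**tokenized, max_length=256, num_beams=4)
        return {"answer": self.question_answer_tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]}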

Huge shoutout to the Huggingface team. They’ve solved a very important problem. If you are an ML deployment startup in this space, literally just copy what Huggingface is doing.

When we look at the advancements in the AI space, model creation and implementation are very advanced; however, model deployment still has a lot of room for improvement.

However, Hugging Face’s amazing service did not come cheap.

Cost of Hosting an ML Model

Hosting our model on Hugging Face costs $438.05 a month. We use a Small GPU instance with 1x NVIDIA Tesla T4.

The following is what we paid over the last 2 months:

  • December 2022: $225.48
  • January 1-7: $102.76

Screen Shot 2023-01-07 at 7.40.08 PM.png

We have a lot of thoughts regarding the cost of deploying a model using Huggingface and other options. See the Appendix for more on this.

Conclusion

Machine Learning applications are an increasingly important part of our society. Atlas is a great way to make these tools accessible for everyone so we can become participants and not just spectators in the glorious intelligence revolution.

Appendix

Unexpected Billing

While writing this post and updating Atlas, we logged on to our Hugging Face Deployed Endpoints tab and realized we had accidentally left 4 servers running at the same time, which wasted about $100 in unused server costs.

Screen Shot 2023-01-21 at 9.59.17 AM.png

Screen Shot 2023-01-21 at 9.41.54 AM.png

Hugging Face doesn’t support updating existing endpoints to a newer commit (updating by revision hash), so each time we updated handler.py we had to delete the existing endpoint and deploy a new one. When we were updating the code frequently, we created newer endpoints, forgot to delete the older ones, and thus had 4 endpoints running at the same time!

We accept that this was our fault, but we also think Hugging Face could do a better job of preventing unintended billing.

  1. Allow updating by revision.
  2. Set up billing alerts. If users go over a user-defined amount, they should either get an email or have the option to have their service automatically paused.
  3. Add the ability to pause endpoints. Currently, the only way to stop an instance is to delete it. There should be a simple pause option available.

Hugging Face probably has a long backlog of things to fix, but in my opinion, this is 100% a very high-priority fix because there is a direct dollar value attached to fixing it for users.

Again, we’re very wary of the incentive alignment problem in fixing this: they make more money if they don’t fix it, so I can imagine them deprioritizing the issue. But if Hugging Face is taking a long-term view, people will use Hugging Face more, and they will make more money, if they remove the unexpected-cost anxiety and give people more control over their costs on Hugging Face.

Looking for other ML Hosting Solutions

Hosting an ML model on Hugging Face is simply too expensive, so we’re actively looking for a cheaper alternative.

Note that it’s not necessarily that paying $400 is the problem; it’s that our service is still a relatively small side project serving fewer than 100 requests a day. We understand the unit economics of hosting a GPU instance, so we would be willing to pay even upwards of $400 if we were using that much compute. But $400 for such low usage is way too much.

If you’re at a different company or you know of a better solution please reach out.

You can message Tomiwa on Twitter (@tomiwa1a) or send an email to tomiwa with atila.ca as the domain name.

Before you recommend something, please note that we have very specific criteria.

  1. Usage-based pricing: The game-changing feature that Hugging Face needs to implement is usage-based pricing. The auto-scaling feature is a step in that direction, but it’s not quite there.
  2. Show us exactly how deploying on your platform will be cheaper than, or at least as easy as, deploying on Hugging Face, preferably with a tutorial as easy to follow as the one Phil made.
    1. Note, all of our infrastructure is open-sourced so you can already see our ML hosting setup.
    2. Please provide specific examples on how we would use your service before you ask us to “schedule a demo”.
  3. Transparent pricing: Tell us exactly how much it’s going to cost, or give us a range or a calculator that we can use to budget.
    1. Another thing we like about Hugging Face is that they are very transparent: we know exactly what we’re going to be paying. We know it’s because they use a fixed-price dedicated instance rather than on-demand or usage-based pricing, but it would be good to see other companies adopt this practice.

I know your incentives may make you reluctant to show us how to spend less money on your platform, but consider that it’s better than us migrating away completely and you getting $0. One idea we saw was hosting Hugging Face models on SageMaker using the SageMaker Hugging Face Inference Toolkit. Note that the tutorial focuses on fine-tuning and training, which we’re not interested in; we just want deployment and inference.

We have an active issue on Github where we’re discussing the different options.

If you’re at Hugging Face and reading this, please show me how we can continue using your service or models in a cheaper way. I want to keep using Hugging Face; you’re one of my favorite AI companies at the moment. I love your mission of democratizing machine learning and you have great people working there. However, $400 a month to host an ML model will not democratize machine learning.

The AI Revolution will be Open

Atlas is an open platform. We chose this because we believe that open platforms last longer than closed platforms and that they’re better for society.

AI and machine learning are going to be among the most powerful technologies for humanity. Such a powerful tool should be built in the open as a forcing function for transparency and for making decisions that are most aligned with humanity.

Atlas being an open platform means that it’s both open source and open state. Open source means that the frontend and backend code is open source and licensed as copyleft, which means that anyone is free to fork and use it, but they must also open source their version.

Open state means that we will make as much of our data as possible open, easily accessible, and easily exportable, starting with a JSON dump of the transcribed videos.

All the ML models used to build Atlas are also open source. The main benefit is that this reduces our API counterparty risk, such as having access unexpectedly cut off or prices raised, and it also allows anyone to re-create our setup using their own infrastructure.

Twitter Search

Twitter Search is a vastly underrated tool. It played a big role in helping me learn what we needed to build Atlas. Vast.ai, Pinecone, Whisper, and Banana.dev are just a few examples of tools that we discovered through casually browsing Twitter and Advanced Twitter Search.

Here’s some tips for finding information on Twitter:

  • Search using “username + <keyword>”
    • If you know someone who does a lot of work in ML, you can search their name and a relevant keyword to see what they think about a tool or topic
    • For example: Pieter Levels works on a bunch of AI projects, so when I was deciding what tools to use, I would search for his name and the topic I was interested in
    • You can also learn what people in their community think about a topic
    • For example, levelsio has an audience of people who engage with his tweets, many of whom are also very knowledgeable.
    • So even if he doesn’t mention vast.ai, the search can surface a reply to his tweet from someone who does mention vast.ai, and it’s usually a highly relevant tweet
  • Search by URL: Useful if you come across an article or a Github repo and want to see what people are saying about it and their thoughts
  • Read the comments: Sounds obvious, but a lot of gems are buried in the replies to posts; most people just read the main tweet
  • Keyword search on Twitter
    • Similar to how you can append “reddit” to any Google search to avoid SEO spam
    • You can also take that same search query and put it into Twitter to get very interesting results crowdsourced from people
    • For example: searching Twitter for “openai whisper” revealed this cool ML deployment tool called Lightning which is really relevant to my current search for ML deployment tools
      • Bonus tip: Search by date is useful if you’re finding that results from a certain date are filling up the results too much