This Week in Mojo 2023-06-23
- Blog post: Modular natively supports dynamic shapes for AI workloads
- Notebook: Ray tracing in Mojo includes writing to a numpy ndarray representing an image from Mojo
- TWIML AI did a podcast interview with Chris Lattner, answers summarized below
- Added guides and new sections to mojodojo.dev/guides
Mojo Playground Update
Added support for overloading on parameter signature. For example, it is now possible to write the following:
fn foo[a: Int](x: Int): pass fn foo[a: Int, b: Int](x: Int): pass
For details on the overload resolution logic, see the programming manual.
cost_of() function has been added to
Autotune. This meta-function must be invoked at compile time, and it returns the number of MLIR operations in a function (at a certain stage in compilation), which can be used to build basic heuristics in higher-order generators.
from Autotune import cost_of fn generator[f: fn(Int) -> Int]() -> Int: @parameter if cost_of[fn(Int) -> Int, f]() < 10: return f() else: # Do something else for slower functions...
Added a new example notebook with a basic Ray Tracing algorithm.
assert_param_msg() in the
Assert module has been renamed to
Overloads marked with
@adaptive now correctly handle signatures that differ only in declared parameter names, e.g. the following now works correctly:
@adaptive fn foobar[w: Int, T: DType]() -> SIMD[T, w]: ... @adaptive fn foobar[w: Int, S: DType]() -> SIMD[S, w]: ...
- Issue #219 - Issue when redefining a function and a struct defined in the same cell.
- Issue #355 - The loop order in the Matmul notebook for Python and naive mojo have been reordered for consistency. The loop order now follows (M, K, N) ordering.
- Issue #309 - Use snake case naming within the testing package and move the asserts out of the TestSuite struct.
Mojo Team Answer
Logo and brand community usage
We definitely want the community to be able to use the Mojo logo and name. We should get a proper web page up that describes this, we're behind on this mostly getting the details sorted out. My current understanding: we want people to be free to use the word Mojo and Mojo🔥, and using the Mojo🔥 logo is fine. The things we need to protect are:
- Don't represent that you are speaking on behalf of modular
- Don't use the "Modular M" with the notch taken out without permission.
It is fine to use Mojo or Mojo🔥 with a normal M.
We've also seen a lot of the troubles of other communities, and want to ensure that the Mojo🔥 community has a clear understanding of our trademark rights, and the relevant community usage from the beginning.
The spirit of what we want to achieve is essentially to have a "Community Logo" and a "Official Logo" that enables a flexible use for the community, but also provides us with an ability to have "Official Use" when needed. There will be subtle differences (i.e. the Notch in the M, the style of the Fire icon etc) but enabling our incredible community to use the logo is definitely our goal.
BNF notation for Mojo
We currently use a hand written parser in C++ and don't have a formal grammar unfortunately; we do have a black grammar, but it is "quite a superset" of the language. It would be great to build this out and document it.
Three World Problem
Python has a dependence on C/C++ for performance and hardware-focused tasks, Mojo directly addresses the
three world problem of Python, C/C++, and accelerator languages required for CPUs, GPUs, TPUs etc.
Technology Mojo Builds on
Mojo builds on a lot of technologies where the research has been done and implemented such as MLIR, which is an evolution of LLVM that has enabled a new generation of compiler technologies. MLIR is now widely utilized across the entire industry for AI accelerators, we built it at Google and then open sourced it, and it's now part of LLVM. LLVM is an umbrella of technologies that includes MLIR, the Clang compiler for C/C++, and the fundamental building blocks like code generation for an x86 processor, so we build directly on top of that as well. This is the core of how we make the hardware go really fast.
Speeding up Python
Despite Python's popularity for research and training models, it presents challenges for deploying at scale, often leading teams to rewrite their models in C++. The Modular inference engine was constructed entirely on top of Mojo, and now is the fastest inference engine for TensorFlow and PyTorch models by maximizing the potential of hardware. In comparison to Python, Mojo is compiled, eliminates the GIL, introduces types and adds metaprogramming features which make it a substantially different thing.
There's a big difference though, Python already allows you to add types like TypeScript, but they're just for tools to identify bugs and obvious mistakes in your code, those types aren't used to improve runtime performance. Mojo takes it to the next step, we often see 10x-20x performance improvements just by adding a few type annotations.
In the traditional world of Python if you run into performance problems, or if you need access to low level features you have to build a hybrid package of half C/C++ and half Python. In Mojo you can continue writing dynamically typed code, and you can also use lower-level syntax and put more effort into performance, instead of having to switch to a completely different language where the debugger no longer works on both sides.
MLIR to unlock exotic hardware
AI isn't just about a GPU, even though so much thinking around AI technology is focused on that, the Modular team spent years working Google TPUs, they're highly specialized for AI workloads and scale to exaflops of compute, they're also internally really weird.
Mojo is built on top of MLIR which We built back at Google, now it's being used by basically the who's who of all the hardware industry. LLVM talks to CPUs and some GPUs, but it's never been successful at targeting AI accelerators and video optimization engines, and all the other weird hardware that exists in the world. That's the role that MLIR provides, Mojo fully exposes MLIR and all the nerdery that goes into compiler technology, and gives it to library developers.
It's important that you can talk to TPUs or other exotic hardware in their native language, which in the case of a TPU is a 128x128 tile, being able to expose that out in the language is really quite important. It's more than just CPUs and GPUs, we've built it to have really long legs so it can bring us into the future.
Modular's building what we called a unified AI engine, people are familiar with PyTorch and TensorFlow and these machine learning frameworks that provide APIs, underneath the covers there's a lot of technology for getting things onto a CPU and GPU through things like CUDA. And so our engine fits at that level of the stack, the cool thing about it particularly when you're deploying, is that it talks to a lot of different hardware.
It also talks to both TensorFlow and PyTorch, so when you're taking a model from research like a nice PyTorch model off Hugging Face, and you want to deploy this thing. We don't actually want all of PyTorch in a production Docker container, you want a low dependency efficient way to serve the model. The process of going from PyTorch and into deployment is what the modular technology stack can help with.
Other cloud providers and Mojo
SambaNova's chip is, from my understanding, what's called a Coarse-grain reconfigurable architecture (CGRA), which is a super parallel and has almost nothing to do with CPUs. Graphcores are apparently lots of things that look like CPUs, but their memories are weird and different, and the way they communicate is very structured. What our technology stack enables companies like SambaNova is a way to implement a compiler for their chip. They're the experts on their chip, they understand how it works, Modular can provide something to plug into so that they get all the benefits of TensorFlow and PyTorch.
One of the challenges with hardware accelerators is that the tools provided by non-dominant players often prove difficult to use, especially with regard to compatibility. For instance, Apple's Core ML which interacts with neural accelerators, isn't compatible with all models. This often results in complications when attempting to integrate models onto Apple devices.
These issues are recognized by numerous leaders at software companies integrating AI into their products, they see firsthand the long deployment times and the need for large, specialized, and expensive teams. This is largely due to the discrepancy between the tools used for hardware deployment and those used for AI model training. Companies have to build an entire technology stack from the bottom up, there's very little code reuse across hardware. And it's very difficult to track the speed of AI, PyTorch moves fast and you need a very dedicated and responsive team.
The compiler and technology problems to make the hardware work are really difficult, and so there are a lot of really smart people working on this, but if you're always focused on getting the next ship out the door, you can't take a step back and look at this whole technology stack. That's the leap that modular is driving forward.
The AI industry owes a huge debt of gratitude to CUDA, if you go back to the AlexNet moment it was a confluence of ImageNet, datasets and the fact that GPUs enabled an amount of compute. People forget that CUDA enabled researchers to get a machine learning model running on a GPU, which the hardware was definitely not designed for back in the day. Now AI has taken over and it's different, but the initial breakthrough was really in a large part thanks to CUDA. A lot of technology has been built on top of CUDA and it's very powerful, flexible, and hackable and that's great, but it's put us into a mode where one vendor has this dominant position and it's very difficult if you're a hardware vendor to be able to play in this ecosystem.
There's the XLA compiler that Modular staff worked on at Google, and there are new compilers every day being announced by different companies where they're making make ML go fast, for example on GPUs. The problem with that is that they've lost one of the things that made CUDA really powerful, which is the programmability.
As compiler enthusiasts, we transformed AI code generation into a compiler-centric problem, which excluded non-compiler experts. AI isn't just about matrix operations but involves comprehensive parallel compute tasks including data loading and preprocessing. The battle against CUDA lock-in has resulted in a different lock-in, pushing potential contributors out. Our solution at Modular aims to accommodate all current systems such as TensorFlow or PyTorch without requiring code rewriting, we're willing to grapple with the complex task of implementing thousands of operators to benefit everyone. Instead of creating a tool for one hardware and framework, we address the broader issue of deployment across various hardwares and frameworks. Mojo's mission is to simplify the stack, encourage collaboration among various experts, and invite more participation, ultimately to reduce the overwhelming complexity in the AI field.
The packaging problem is often due to incompatible systems that are lashed together, for example Python packages a lot of C/C++, C's never had a package manager that's any good. So you look at these old problems that we've been struggling with, get rid of the C code and suddenly packaging is way simpler. This is one of the things that Mojo is providing with a unified language to drive complexity down.
Debugging Complicated Problems
What happens when you need to deploy a model through Core ML or one of the many other hardware interfaces, and the results don't work. Well now you need to know not just PyTorch, not just your model, not just Core ML, but also the translator, compiler and all these other things. You keep digging and you find out it's handling the edge padding on a convolution slightly differently. All of these tools were supposed to be making it easy aren't reliable, it's a leaky abstraction where if something goes wrong you have to understand all of this complexity.
And so this is what causes it to take three months to deploy a model, leaders ask why it's taking so long but they don't realize that the tool set, this fundamental technology that all this stuff is built on top of, it's not up to high standards. No C programmer would tolerate AI tools of this quality, it's just crazy.
But again, this is just the maturity of the AI technology space, and by solving that problem we should see way more inclusion and the kinds of companies that are able to work with AI, and that'll have a big impact on the world.
The engine itself can stand alone and you can use the engine as a drop-in replacement for running TensorFlow and PyTorch models in production. And so TensorFlow is quite good at production, but we're showing 3-5x better performance on, for example Intel CPUs, AMD CPUs or an ARM-based Graviton server in AWS. That's a massive cost savings and it's also a massive latency improvement, so many of our customers love that because then they can turn around and make their models bigger, which is a huge deal for them.
One of the things also that our customers love is that Google and Meta don't actually support TensorFlow or PyTorch, people forget that these are not products, these are open source projects and more like hobbies for the megacorps. So what we're essentially offering is a supported and performance optimized version of TensorFlow and PyTorch, the enterprises we talked to that care about their costs often they want somebody that they can call. It's analogous to running your own mail server, very few companies do that, so why do we do it with AI infrastructure.
Currently it's because there's no choice, there's nobody to reach out to who can actually can do this. The technology platform at Meta and Google has diverged a lot from what the rest of the industry uses, they both have their own chips and specific use cases, so they're not focused on the traditional CPU, GPU and public cloud use case. Because it's a product for us we can actually support it, invest a huge amount of energy into it, and it's why we have such phenomenal results as well.
In the case of modular and why we built Mojo, our business objective is go make ML really awesome, we care about the matrix multiplications and the convolutions and the core operations that people spend all their time on in AI. And so we rewrote all of that stuff in Mojo, this isn't like rewriting Matplotlib, this is like rewriting Intel MKL, or the CUDA implementation of these CUDA kernels equivalent. That's where we've put our energy into because that's what enables unlocking of the hardware, performance, and usability.