
> PyTorch has dominated the AI scene since TF1 fumbled the ball at 10th yard line

can you explain why you think TensorFlow fumbled?



I see good answers already, but here's a concrete example:

At my university we had to decide between the two libraries, so as a test we wrote a language model from scratch. The first minor problem with TF was that (if memory serves me right) you were supposed to declare your network "backwards" - instead of saying "A -> B -> C" you had to declare "C(B(A))". The major problem, however, was that there was no way to add debug messages - either your network worked or it didn't. To make matters worse, the "official" TF tutorial on how to write a Seq2Seq model didn't compile because the library had changed, and the bug reports for that were met for years with "we are changing the API, so we'll fix the example once we're done".

PyTorch, by comparison, had the advantage of a Python-based interface - you simply defined classes like you always did (including debug statements!), connected them as variables, and that was that. So when my beginner colleagues and I had to decide which library to pick, "the one that's not a nightmare to debug" sounded much better than "the one that's more efficient if you have several billion training datapoints and a cluster". My colleagues and I then went on to become professionals, and we all brought PyTorch with us.
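For contrast, here is roughly what that PyTorch workflow looked like - a minimal toy sketch (not our actual model), with an ordinary print statement dropped into the forward pass:

    import torch
    import torch.nn as nn

    class TinyLM(nn.Module):
        """Toy language model: embed -> GRU -> project to vocabulary."""
        def __init__(self, vocab_size=100, hidden=32):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden)
            self.rnn = nn.GRU(hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, tokens):
            x = self.embed(tokens)
            x, _ = self.rnn(x)
            print("hidden states:", x.shape)  # ordinary print debugging just works
            return self.out(x)

    model = TinyLM()
    logits = model(torch.randint(0, 100, (2, 7)))  # batch of 2, sequence length 7
    print(logits.shape)                            # torch.Size([2, 7, 100])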


This was also my experience. TensorFlow's model of constructing a computation graph and then evaluating it felt at odds with Python's principles. It made debugging extremely difficult because you couldn't easily print tensors! It didn't feel like Python at all.
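A minimal sketch of that TF1-era pattern (assuming the 1.x API): printing a tensor shows a symbolic graph node, and values only appear once you run the graph through a session.

    import tensorflow as tf  # TensorFlow 1.x

    x = tf.placeholder(tf.float32, shape=[None, 3])
    y = tf.reduce_sum(x * 2.0)

    print(y)  # Tensor("Sum:0", shape=(), dtype=float32) - a graph node, not values

    with tf.Session() as sess:
        # Values exist only once the graph is executed with concrete inputs.
        print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))  # 12.0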

Also the API changed constantly so examples from docs or open source repos wouldn't work.

They also had that weird thing about all tensors having a unique global name. I remember I tried to evaluate a DQN network twice in the same script and it errored because of that.
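If I recall the failure mode correctly, it came from TF1's globally named variables and scoping rules; a rough sketch of the kind of code that would blow up (1.x API assumed):

    import tensorflow as tf  # TensorFlow 1.x

    def q_network(state):
        with tf.variable_scope("dqn"):  # reuse flag not set
            w = tf.get_variable("w", shape=[4, 2])
            return tf.matmul(state, w)

    s = tf.placeholder(tf.float32, shape=[None, 4])
    q_online = q_network(s)
    q_again = q_network(s)  # ValueError: Variable dqn/w already exists, disallowed.
                            # Did you mean to set reuse=True ...?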

It's somewhat vindicating to see many people in this thread shared my frustrations. Considering the impact of these technologies I think a documentary about why TensorFlow failed and PyTorch took off would be a great watch.


The inability to print-debug the dimensions of my hidden states was 100% why TF was hard for me to use as a greenhorn MSc student.

Another consequence of this was that PyTorch let you use regular old Python for control flow.
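For example (a toy sketch), plain loops and if statements in forward simply become part of the computation on every call:

    import torch
    import torch.nn as nn

    class Unrolled(nn.Module):
        def __init__(self, dim=8):
            super().__init__()
            self.step = nn.Linear(dim, dim)

        def forward(self, x, n_steps):
            # Ordinary Python control flow - no special graph ops needed.
            for _ in range(n_steps):
                x = torch.relu(self.step(x))
            if x.mean() < 0:  # data-dependent branching also just works
                x = -x
            return x

    m = Unrolled()
    print(m(torch.randn(2, 8), n_steps=3).shape)  # torch.Size([2, 8])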


In 2018, I co-wrote a blog post with the inflammatory title “Don’t use TensorFlow, try PyTorch instead” (https://news.ycombinator.com/item?id=17415321). As it gained traction here, it was changed to “Keras vs PyTorch” (some edgy things that work for a private blog are not good for a corporate one). Yet the initial title stuck, and you can see it resonated well with the crowd.

TensorFlow (while a huge step on top of Theano) had issues with a strange API, mixing needlessly complex parts (even for the simplest layers) with magic-box-like optimization.

There was Keras, which I liked and used before it was cool (when it still supported the Theano backend), and it was the right decision for TF to incorporate it as the default API. But it was 1–2 years too late.

At the same time, I initially looked at PyTorch as some intern’s summer project porting from Lua to Python. I expected an imitation of the original Torch. Yet the more it developed, the better it was, with (at least to my mind) the perfect level of abstraction. On the one hand, you can easily add two tensors, as if it were NumPy (and print its values in Python, which was impossible with TF at that time). On the other hand, you can wrap anything (from just a simple operation to a huge network) in an nn.Module. So it offered this natural hierarchical approach to deep learning. It offered building blocks that can be easily created, composed, debugged, and reused. It offered a natural way of picking the abstraction level you want to work with, so it worked well for industry and experimentation with novel architectures.
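A small sketch of that hierarchy (my own example, not from the post): the same nn.Module interface wraps anything from a one-line operation to a whole block, and the blocks nest.

    import torch
    import torch.nn as nn

    # Tensors behave like NumPy arrays: add them, print their values.
    a, b = torch.ones(3), torch.arange(3.0)
    print(a + b)  # tensor([1., 2., 3.])

    class Residual(nn.Module):
        """Wrap any submodule and add a skip connection around it."""
        def __init__(self, inner):
            super().__init__()
            self.inner = inner

        def forward(self, x):
            return x + self.inner(x)

    # The same abstraction composes from a single Linear up to a whole network.
    block = Residual(nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16)))
    net = nn.Sequential(block, Residual(nn.Linear(16, 16)))
    print(net(torch.randn(4, 16)).shape)  # torch.Size([4, 16])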

So, while in 2016–2017 I was using Keras as the go-to for deep learning (https://p.migdal.pl/blog/2017/04/teaching-deep-learning/), in 2018 I saw the light of PyTorch and didn’t feel a need to look back. In 2019, even for the intro, I used PyTorch (https://github.com/stared/thinking-in-tensors-writing-in-pyt...).


Actually, I opened “Teaching deep learning” and smiled as I saw how it evolved:

> There is a handful of popular deep learning libraries, including TensorFlow, Theano, Torch and Caffe. Each of them has Python interface (now also for Torch: PyTorch)

> [...]

> EDIT (July 2017): If you want a low-level framework, PyTorch may be the best way to start. It combines relatively brief and readable code (almost like Keras) but at the same time gives low-level access to all features (actually, more than TensorFlow).

> EDIT (June 2018): In Keras or PyTorch as your first deep learning framework I discuss pros and cons of starting learning deep learning with each of them.


The original TensorFlow had an API similar to the original Lua-based Torch (the predecessor to PyTorch) that required you to first build the network, node by node, then run it. PyTorch used a completely different, and much more convenient approach, where the network is built automatically for you just by running the forward pass code (and will then be used for the backward pass), using both provided node types and arbitrary NumPy compatible code. You're basically just writing differentiable code.
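Concretely (a minimal sketch), the graph is recorded while the forward code executes and then replayed for the gradients:

    import torch

    x = torch.randn(3, requires_grad=True)
    y = (x * 2).sin().sum()  # the graph is built implicitly as this line runs
    y.backward()             # backward pass over the recorded graph
    print(x.grad)            # d/dx sum(sin(2x)) = 2 * cos(2x)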

This new PyTorch approach was eventually supported by TensorFlow as well ("eager execution"), but the PyTorch approach was such a huge improvement that there was an immediate shift by many developers from TF to PyTorch, and TF never seemed able to regain the momentum.

TF also suffered from having a confusing array of alternate user libraries built on top of the core framework, none of which had great documentation, while PyTorch had a more focused approach and fantastic online support from the developer team.


LuaTorch is eager-execution. The problem with LuaTorch is the GC: you can't rely on a traditional garbage collector here, because each tensor was megabytes at the time (now gigabytes), so they need to be collected aggressively rather than at intervals. Python's reference-counting system solves this. And by "collecting" I don't mean freeing the memory itself - PyTorch has a simple slab allocator to manage CUDA memory.
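A rough illustration of why eager freeing matters (assuming a CUDA device is available), using PyTorch's allocator counters:

    import torch

    x = torch.empty(1024, 1024, device="cuda")  # ~4 MB tensor
    print(torch.cuda.memory_allocated())        # bytes held by live tensors
    del x                                       # refcount hits zero: freed right away,
                                                # not whenever a GC cycle happens to run
    print(torch.cuda.memory_allocated())        # back down immediately
    print(torch.cuda.memory_reserved())         # the block stays cached for reuse
    torch.cuda.empty_cache()                    # hand cached blocks back to the driver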


With Lua Torch the model execution was eager, but you still had to construct the model graph beforehand - it wasn't "define by run" like PyTorch.

Back in the day, having completed Andrew Ng's ML course, I then built my own C++ NN framework copying this graph-mode Lua Torch API. One of the nice things about explicitly building a graph was that my framework supported having the model generate a GraphViz DOT representation of itself so I could visualize it.


Ah, I get what you mean now. I am mixing up the nn module and the tensor execution bits. (To be fair, the PyTorch nn module carries over many of these quirks!)


I'm no machine learning engineer, but I dabbled professionally with both frameworks a few years ago and the developer experience didn't even compare. The main issue with TF was that you could only choose between a powerful but incomprehensible, poorly documented [1], ultra-verbose and ever-changing low-level API, and an abstraction layer (Keras) that was too high-level to be really useful.

Maybe TF has gotten better since but at the time it really felt like an internal tool that Google decided to just throw into the wild. By contrast PyTorch offered a more reasonable level of abstraction along with excellent API documentation and tutorials, so it's no wonder that machine learning engineers (who are generally more interested in the science of the model than the technical implementation) ended up favoring it.

[1] The worst part was that Google only hosted the docs for the latest version of TF, so if you were stuck on an older version (because, oh I don't know, you wanted a stable environment to serve models in production), well, tough luck. That certainly didn't do TF any favors.


For me it was about 8 years ago. Back then TF was already bloated and had two weaknesses: its bet on static compute graphs made writing code verbose, and it made debugging difficult.

The few people I knew back then used Keras instead. I switched to PyTorch for my next project, which was more "batteries included".


Imagine a total newbie trying to fine-tune an image classifier, reusing some open source example code, about a decade ago.

If their folder of 10,000 labelled images contains one image that's a different size to the others, the training job will fail with an error about unexpected dimensions while concatenating.

But it won't be able to say the file's name, or that the problem is an input image of the wrong size. It'll just say it can't concatenate tensors of different sizes.

An experienced user will recognise the error immediately, and will have run a data cleansing script beforehand anyway. But it's not experienced users who bounce from frameworks, it's newbies.
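The failure looks something like this (a PyTorch sketch; the original story was presumably a TF-era script, and the exact wording varies by framework and version). Note that the error names sizes and entry indices, never the offending file:

    import torch

    images = [torch.zeros(3, 224, 224) for _ in range(4)]
    images.append(torch.zeros(3, 200, 224))  # the one odd-sized image in 10,000

    batch = torch.stack(images)
    # RuntimeError: stack expects each tensor to be equal size,
    # but got [3, 224, 224] at entry 0 and [3, 200, 224] at entry 4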


> An experienced user will recognise the error immediately, and will have run a data cleansing script beforehand anyway. But it's not experienced users who bounce from frameworks, it's newbies.

Even seasoned developers will bounce away from frameworks or libraries - no matter whether they're old dogs or the next hot thing - if the documentation isn't up to speed or if simple, common tasks require wading through dozens of pages of documentation.

Writing good documentation is hard enough, writing relevant "common usage examples" is even harder... but keeping them up to date and working is a rarely seen art.

And the greatest art of all is logging. Soooo many libraries refuse to implement detailed structured logging in internal classes (despite Java and PHP in particular offering very powerful mechanisms), making it much more difficult to troubleshoot problems in the field.
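The comment points at Java and PHP, but Python's standard library supports the same pattern: the library logs through a named logger with a NullHandler attached, and the application decides whether and where those records go. A minimal sketch with a hypothetical mylib module:

    import logging

    # Inside the library: log through a module-level logger, never print().
    logger = logging.getLogger("mylib.loader")  # "mylib" is a hypothetical library
    logger.addHandler(logging.NullHandler())    # stays silent unless the app opts in

    def load_weights(path):
        logger.debug("loading weights", extra={"weights_path": path})
        # ... actual loading logic ...

    # In the application: turn the library's internals up when troubleshooting.
    logging.basicConfig(level=logging.INFO)
    logging.getLogger("mylib").setLevel(logging.DEBUG)
    load_weights("model.pt")  # now the debug record shows up on the console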


I just remember TF1 being super hard to use as a beginner, and Google repeatedly insisting it had to be that way. People talk about the layering API, but it's more than that: everything about it was covered in sharp edges.


I personally believe TF1 was serving the needs of its core users. It provided a compilable compute graph with autodiff, and you got very efficient training and inference from it. There was a steep learning curve, but if you got past it, things worked very, very well. Distributed TF never really took off; it was buggy, and I think they made some wrong early design bets for performance reasons that should have been sacrificed in favor of simplicity.

I believe that some years after the TF1 release, they realized the learning curve was too steep and they were losing users to PyTorch. I think the Cloud team was also attempting to sell customers on their amazing DL tech, which was falling flat. So they tried to keep the TF brand while totally changing the product under the hood by introducing imperative programming and gradient tapes. They killed TF1, upsetting those users, while not having a fully functioning TF2, all the while having plenty of documentation pointing to TF1 references that didn't work. Any new grad student made the simple choice of using a tool that was user-friendly and worked, which was PyTorch. And most old TF1 users hopped on the bandwagon.
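For reference, the TF2 replacement looks like this (a minimal sketch of eager execution with a gradient tape):

    import tensorflow as tf  # TensorFlow 2.x

    x = tf.Variable(3.0)
    with tf.GradientTape() as tape:
        y = x * x                 # runs eagerly; y is a concrete, printable value
    print(y)                      # tf.Tensor(9.0, shape=(), dtype=float32)
    print(tape.gradient(y, x))    # tf.Tensor(6.0, shape=(), dtype=float32)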


First, the migration to 2.0 in 2019 to add eager mode support was horribly painful. Then, starting around 2.7, backward compatibility kept being broken. Not being able to load previously trained models with a new version of the library is wildly painful.


I only remember 2015 TF and I was wondering: why would I use Python to assemble a computational graph when what I really want is to write code and then differentiate through it?


Greenfielding TF2.X and not maintaining 1.X compatibility



