This is a very one-sided article. Shouldn't there be a comparison with TP-Link and all the other brands available, in terms of security? Otherwise they're just targeting a company for political reasons.
The article is in response to a very one-sided government ban (well, reported ban) on TP-Link products. The company is being targeted for what appears to be political reasons; the article even says so in the first paragraph:
> Experts say while the proposed ban may have more to do with TP-Link’s ties to China than any specific technical threats
Direct use of Codex + GPT5 or Claude Code CLI gives better results than using the same models in Cursor; I've compared both. Cursor applies its own augmentation, which reduces the output size, probably to save on tokens.
What they don't mention is all the tooling, MCPs, and other stuff they've added to make this work. It's not 30 hours out of the box. It's probably heavily guard-railed, with a lot of validated plans, checklists, and verification points they can check. It's similar to 'lab conditions': you won't get that output in real-world situations.
Yeah, I thought about that after I looked at the SWE-bench results. It doesn't make sense that the SWE-bench results are barely an improvement, yet the model is somehow a much bigger improvement on long tasks. You'd expect a huge gain in one to translate to the other.
Unless the main area of improvement was tools and scaffolding rather than the model itself.
Create a CEO Logic Agent that helps the CEO to make better decisions on AI? /s
What the CEO is likely looking for is 'PR' points, often not a real strategy. If they can announce and pretend they're going all-in on AI, that's what's needed.
From your side, having AI mentioned in everything you do will help the conversation. If your code's docs are improved with an AI IDE, you're 'going hard on AI'. Ignore the time you spend fixing AI's errors.
Doing things for 'funding' and doing things that get the work done are not always the same. One is a marketing/PR act; the other is a product-development act.
If funding is a real concern, the CEO's approach might be valid, because without funding you won't have a job, and there won't be a product. So split your time and help the CEO get the right message out.
As you're saying, if the CEO has built a great team and great technology, we can't assume the CEO is completely ignorant about what's going on.
Your CTO/CIO (if any) will know more about what's realistically possible and what's not. If you have an 'AI Team', then there should be a CTO/CIO, so shouldn't you be talking to them about strategy rather than directly to the CEO?
Here are some notes I made to understand each of these models and when to use them.
# OpenAI Models
## Reasoning Models (o-series)
- All `oX` (o-series) models are reasoning models. (The `o` prefix here is not the same as the `o` suffix in `4o`, which stands for omni.)
- Use these for complex, multi-step reasoning tasks.
## Flagship/Core Models
- All `X.Y` and `Xo` models are the core models.
- Use these for one-shot results.
- Examples: 4o, 4.1
## Cost Optimized
- All `-mini` and `-nano` variants are cheaper, faster models.
- Use these for high-volume, low-effort tasks.
## Flagship vs Reasoning (o-series) Models
- Latest flagship model = 4.1
- Latest reasoning model = o3
- The flagship models are general purpose, typically with larger context windows. They answer in a single pass, relying mostly on learned pattern matching.
- The reasoning models are trained with extended chain-of-thought and reinforcement learning. They work best with tools, code, and other multi-step workflows; because intermediate steps can be checked with tools, accuracy tends to be higher. See the sketch below.
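A minimal sketch of the difference at the API level, assuming the official `openai` Python SDK and the public model IDs (`gpt-4.1`, `o3`); as far as I know, `reasoning_effort` is accepted only by the o-series models:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Flagship model: single-pass answer, good for one-shot tasks.
summary = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Summarize the changes in this diff."}],
)

# Reasoning model: extended chain-of-thought for multi-step problems.
plan = client.chat.completions.create(
    model="o3",
    reasoning_effort="high",  # o-series only; flagship models reject this parameter
    messages=[{"role": "user", "content": "Plan a zero-downtime Postgres major-version upgrade."}],
)

print(summary.choices[0].message.content)
print(plan.choices[0].message.content)
```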
# List of Models
## 4o (omni)
- 128K context window
- Use: Complex multimodal applications requiring the top level of reliability and nuance
## 4o-mini
- 128K context window
- Use: Multimodal reasoning for math, coding, and structured outputs
- Use: Cheaper than `4o`; use it when you can trade accuracy for speed/cost
- Don't Use: When high accuracy is needed
## 4.1
- 1M context window
- Use: For large context ingest, such as full codebases
- Use: For reliable instruction following, comprehension
- Don't Use: For high-volume tasks where speed matters
## 4.1-mini
- 1M context window
- Use: For large context ingest
- Use: When accuracy can be traded for speed
## 4.1-nano
- 1M context window
- Use: For high-volume, near-instant responses
- Don't Use: When accuracy is required
- Examples: classification, autocompletion, short answers (see the sketch below)
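For instance, a high-volume classification call where nano's speed and price matter more than marginal accuracy (a sketch; the label set and prompt are my own invention):

```python
from openai import OpenAI

client = OpenAI()

def classify_ticket(text: str) -> str:
    """Cheap, near-instant label for a support ticket."""
    resp = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "system",
             "content": "Reply with exactly one word: billing, bug, or feature."},
            {"role": "user", "content": text},
        ],
        max_tokens=3,   # we only want a label back
        temperature=0,  # keep labels stable across runs
    )
    return resp.choices[0].message.content.strip().lower()

print(classify_ticket("I was charged twice this month."))  # -> "billing"
```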
## o3
- 200K context window
- Use: For the most challenging reasoning tasks in coding, STEM, and vision that demand deep chain‑of‑thought and tool use
- Use: Agentic workflows leveraging web search, Python execution, and image analysis in one coherent loop
- Don't Use: For simple tasks, where a lighter model will be faster and cheaper
## o4-mini
- 200K context window
- Use: High-volume needs where reasoning and cost should be balanced
- Use: For high-throughput applications
- Don't Use: When accuracy is critical
## o4-mini-high
- 200K context window
- Use: When o4-mini results are not satisfactory, as a step before moving to o3
- Use: Complex tool-driven reasoning
- Don't Use: When accuracy is critical
## o1-pro-mode
- 200K context window
- Use: Highly specialized science, coding, or reasoning jobs that benefit from extra compute for consistency
- Don't Use: For simple tasks
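Condensing the notes above into one place, here's a hypothetical routing helper; the task categories and the mapping are my own, not anything OpenAI publishes:

```python
# Hypothetical task -> model routing, condensing the notes above.
# Model IDs are the public API names; the categories are my own.
ROUTES = {
    "bulk-classify":   "gpt-4.1-nano",  # high volume, accuracy is negotiable
    "cheap-reasoning": "o4-mini",       # balance reasoning quality and cost
    "one-shot":        "gpt-4.1",       # reliable instruction following
    "big-context":     "gpt-4.1",       # 1M-token window for full codebases
    "hard-reasoning":  "o3",            # deep chain-of-thought plus tool use
}

def pick_model(task: str) -> str:
    # Fall back to the flagship model for anything unrecognized.
    return ROUTES.get(task, "gpt-4.1")

assert pick_model("hard-reasoning") == "o3"
assert pick_model("unknown") == "gpt-4.1"
```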
## Models Sorted for Complex Coding Tasks (my opinion)
1. o3
2. Gemini 2.5 Pro
3. Claude 3.7
4. o1-pro-mode
5. o4-mini-high
6. 4.1
7. o4-mini
I suspect it’s not a smarter developer thing, but a stupider code thing.
Programming for a client means making their processes easier, ideally with as few clicks as possible. Programming for a programmer does the same for programming.
The thing with our “industry” is that it doesn’t automate programming at all, so smart people “build” random bs all day that should have been made a part of some generic library decades ago and made available off the shelf by all decent runtimes.
Making a form with validation and data objects, and a backend with an ORM/SQL connection, migrations, auth, etc. It has all been solved millions of times, and no one bats an eye at why tf they reimplement multiple klocs of it over and over again.
That’s where AI shines. It builds you this damn stupid form that takes two days of work otherwise.
Very nice.
But it’s not programming. If anything, it’s a shame. A spit in the face of programming that somehow got normalized by… not sure whom. We take a bare, raw runtime like node/python/go and a browser and call it “a platform”. What platform? It’s as much a platform as INT 13h is an RDBMS.
I think the divide in where AI is useful clearly shows us that right now, but most are blind to it out of inertia.
I asked about a 20% pay cut and they won't do it. There's a cookie-cutter full-time position, and nothing else. I suspect part of it could be the stupid laws around America's insane healthcare bullshit, but I don't know.