What We’re Reading (Week Ending 07 June 2026): The Good Investors

We regularly share a list of our recent reads in our weekly emails for The Good Investors.

Do subscribe for our weekly updates through the orange box in the blog (it’s on the side if you’re using a computer, and all the way at the bottom if you’re using mobile) – it’s free!

But since our readership for The Good Investors is wider than our subscriber base, we think sharing the reading list regularly on the blog itself can benefit even more people. The articles we share touch on a wide range of topics, including investing, business, and the world in general.

Here are the articles for the week ending 07 June 2026:

1. Most of the Economy Won’t Run on the Best Model – Rihard Jarc

When a company hires an accountant, it does not go out and hire a PhD in pure mathematics to reconcile the ledgers. Not because the PhD couldn’t do it — they obviously could, and probably faster — but because it makes no economic sense. The PhD is overqualified, which is just another way of saying they are too expensive for the value the task produces. The economic output of bookkeeping is capped. There is only so much upside in getting the books done. So you hire the cheapest person who clears the quality bar, and you pocket the difference…

…If you are running a drug-discovery program, you absolutely want the PhD — in fact you want five of them, plus a Nobel laureate consulting on the side. Why? Because the economic output of a single discovery is enormous, almost unbounded…

…This is, I think, exactly how the AI model market is going to bifurcate…

…Today, essentially everyone uses the state-of-the-art (SOTA) model for everything. You want to summarize an email? SOTA model. Classify a support ticket? SOTA model. Extract three fields from an invoice? SOTA model. We do this for one simple reason: the frontier models have only just crossed the threshold of being broadly truly impactful for knowledge work, and when something has only just started working, you reach for the best version of it you can find. You don’t optimize cost on a capability you weren’t sure you had last quarter.

But I believe this is a transitional behavior, not a stable equilibrium…

…We have a rapidly falling price for any given level of capability, a frontier that is already shrinking in terms of what is actually being deployed, and companies burning through their annual token budgets in a matter of months.

As such, I believe that for the overwhelming majority of economically valuable knowledge work, the correct model is not the SOTA model. It’s the cheapest model that clears the task’s quality bar. And as pilots move into full production (which is the stage we are in today) — where you’re suddenly paying for millions or billions of tokens a day instead of running a demo — intelligence-per-dollar becomes the only metric that survives contact with a CFO…
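The selection rule in that passage (pick the cheapest model that clears the task’s quality bar) can be sketched in a few lines. This is only an illustration: the `Model` class, the catalogue, and every score and price in it are hypothetical and do not come from the article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str                      # hypothetical model label
    quality: float                 # hypothetical eval score on the task, 0 to 1
    usd_per_million_tokens: float  # hypothetical price


def cheapest_model_clearing_bar(models: list[Model], quality_bar: float) -> Optional[Model]:
    """Return the lowest-priced model whose quality meets the bar, or None."""
    eligible = [m for m in models if m.quality >= quality_bar]
    if not eligible:
        return None
    return min(eligible, key=lambda m: m.usd_per_million_tokens)


# Illustrative catalogue only; none of these numbers are from the article.
CATALOGUE = [
    Model("frontier", quality=0.95, usd_per_million_tokens=30.0),
    Model("mid-tier", quality=0.88, usd_per_million_tokens=3.0),
    Model("small", quality=0.75, usd_per_million_tokens=0.3),
]

# A capped-value task tolerates a lower bar; a high-stakes task does not.
invoice_extraction = cheapest_model_clearing_bar(CATALOGUE, quality_bar=0.85)
drug_discovery = cheapest_model_clearing_bar(CATALOGUE, quality_bar=0.94)
print(invoice_extraction.name)  # mid-tier
print(drug_discovery.name)      # frontier
```

The bar, not the model, is the input that changes from task to task, which is the accountant-versus-PhD point in code form.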

…The sellers of new compute (semis) are only winners in a world of continued high-cadence spending on new compute. And my thesis specifically questions whether that cadence is necessary. So let me lay out the two states the world can be in, because the asymmetry between them is the whole argument.

Scenario 1: Capex falls or stabilizes. If you can squeeze an order of magnitude more useful tokens out of the hardware you already own — because models got smaller, cheaper, more efficient and verticalized — then you no longer need to spend $100bn+ every single year just to stay relevant. In this world, the owners of the installed base win and the sellers of new compute lose. Hyperscaler free cash flow inflects sharply upward, because capex was the one thing suppressing it. Multiples re-rate higher as the cloud business converts from a capex incinerator into a cash machine running largely paid-for, partly-depreciated hardware. And the semis de-rate, because the market finally realizes the upgrade treadmill has slowed.

Scenario 2: Capex stays high — and revenue explodes. This is the Jevons-paradox-on-steroids case. Demand is so strong that hyperscalers do both: they extract enormous output from cheap, long-lived existing hardware and keep buying new gear. Here everyone wins at once — but the hyperscalers win more, because their incremental revenue now lands on a cost base that is partly depreciated and dramatically more efficient per token. Operating leverage goes vertical.

2. Calling the Top – Dirtcheapstocks

SpaceX is set to go public next month.

I read the S1 and felt like I was watching “Whose Line is it Anyway?”. You know, the show where everything’s made up and the points don’t matter…

…SpaceX is eyeing a ~$1.8 trillion valuation, from the latest reports I’ve seen. SPCX did $18.7B of revenue and generated a net loss of $4.9B in 2025. Free cash flow was severely negative: -$15.8B (adjusting for stock-based comp).

So the business is valued at ~100x revenue, and revenue has been growing at a ~34% CAGR over the last two years. Q1 2026 revenue grew 15% yoy.

The business has never sustained profitability, as evidenced by a $41B accumulated deficit…

…If you pay $1.8T for a business, and want a 10% return, you need it to send you $180B in year one. If it sends you $0 in year 1, you need it to produce $198B in year 2, and every year after that until the end of days. If year 2 also produces $0, you need year 3, and every year beyond that, to produce $218B.
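The arithmetic behind those figures can be checked with a short sketch. It assumes the author is valuing a level perpetuity that starts after some number of zero-payout years, which is our reading of the passage; the function name is ours.

```python
def required_annual_payout(price: float, required_return: float, zero_years: int) -> float:
    """Level annual payout, forever, needed after `zero_years` years of $0.

    The unpaid return compounds, so the payout must equal the required
    return on the price grown at that return for `zero_years` years.
    """
    return price * (1 + required_return) ** zero_years * required_return


PRICE_BN = 1_800.0  # $1.8T valuation in $ billions (from the article)

for zero_years in (0, 1, 2):
    payout = required_annual_payout(PRICE_BN, 0.10, zero_years)
    print(f"{zero_years} year(s) of $0 -> ${payout:,.1f}B per year thereafter")
# 0 year(s) of $0 -> $180.0B per year thereafter
# 1 year(s) of $0 -> $198.0B per year thereafter
# 2 year(s) of $0 -> $217.8B per year thereafter
```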

I don’t think it’s likely that SPCX will reach profitability in the next couple years…

…According to ChatGPT, General Motors (in the 1950’s) was the largest company in American history when measured on GAAP revenue as a percent of GDP.

This isn’t a perfect metric, but I think it helps us get a rough feel for how large a company can become as compared to the ecosystem in which it exists.

GM’s revenue was equal to ~2.3% of American GDP. This shouldn’t be surprising as GM had ~50% market share in the second most expensive asset Americans owned…

…Now let’s take this metric and apply it to SPCX.

U.S. GDP is ~$32T today. Historically speaking, it would be difficult for a single business to earn more than $750B in annual revenue.

But SPCX will conquer the world (and Mars), so let’s assume it shatters the record. Maybe SPCX revenue can be 3% of GDP, beating out every business in history by 30%!

That would imply SPCX revenue of $960B. So what kind of profit margin can we expect for this business…

…But let’s say SPCX is a killer business at scale and it can achieve 20% operating margins, and 15% net margins. And let’s say it takes us 10 years to work our way there.

So, at a $1.8T valuation, we need $180B of cash in our pocket this year to generate a 10% return.

If we are unable to earn a cumulative profit above $0 for the next 10 years, then year 11 (and every year after that) needs to pay us $466B!

Alright, so we need $466B of profit in year 11. At 15% net margins, that means we need $3.1T of revenue.

If nominal GDP compounds at 7% for a decade, then GDP will have grown to ~$64T. So, SPCX in year 11, will need to have grown its revenue to ~4.8% of GDP ($3.1T / $64T) – a percentage more than double any company in history.

To get to $3.1T of revenue in 10 years, SPCX will need to grow its top line at 67% annually. The past couple years have shown revenue growth in the 30’s…

Hmm, this is getting difficult.
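For readers who want to retrace the chain from required profit to required growth rate, here is a rough sketch using only the inputs quoted above. The variable names are ours, and the unrounded results differ slightly from the article’s rounded ones (for example, 7% growth for ten years gives GDP of about $63T rather than ~$64T).

```python
# All inputs are taken from the excerpt; amounts are in $ billions.
PRICE_BN = 1_800.0       # $1.8T valuation
REQUIRED_RETURN = 0.10   # desired annual return
NET_MARGIN = 0.15        # assumed net margin at scale
REVENUE_2025_BN = 18.7   # 2025 revenue
GDP_TODAY_BN = 32_000.0  # ~$32T U.S. GDP
GDP_GROWTH = 0.07        # assumed nominal GDP growth
YEARS = 10               # years with no cumulative profit

# Profit needed every year from year 11 onward to earn 10% on the price paid.
profit_needed = PRICE_BN * (1 + REQUIRED_RETURN) ** YEARS * REQUIRED_RETURN
# Revenue that produces that profit at the assumed net margin.
revenue_needed = profit_needed / NET_MARGIN
# GDP after a decade of compounding, and the company's implied share of it.
gdp_future = GDP_TODAY_BN * (1 + GDP_GROWTH) ** YEARS
share_of_gdp = revenue_needed / gdp_future
# Annual revenue growth needed to get from 2025 revenue to that level.
revenue_cagr = (revenue_needed / REVENUE_2025_BN) ** (1 / YEARS) - 1

print(f"Profit needed:  ${profit_needed:,.0f}B")
print(f"Revenue needed: ${revenue_needed / 1000:,.2f}T")
print(f"Future GDP:     ${gdp_future / 1000:,.1f}T")
print(f"Share of GDP:   {share_of_gdp:.1%}")
print(f"Revenue CAGR:   {revenue_cagr:.1%}")
```

Changing `NET_MARGIN` or `YEARS` shows how sensitive the required growth rate is to each assumption.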

3. X thread on the difference between HBF (High Bandwidth Flash) and HBM (High Bandwidth Memory) – Eugene Ng

HBF is essentially HBM but with NAND flash dies instead of DRAM. It uses similar 3D stacking and TSV technology, delivering 8-16x higher capacity than HBM in a comparable footprint, while offering similar bandwidth, much lower cost per GB, lower power, and acceptable latency for read-heavy AI inference workloads (e.g., massive model weights, long context windows, and large KV caches)…

…HBF Shines in Inference: AI inference (LLM serving) is dominated by read-heavy, capacity-bound tasks: loading huge models, managing long contexts, and high-throughput batching. HBF handles these better than HBM…

…Limitations: HBF has significantly higher latency (~10 µs, roughly 100x slower than HBM), slower write performance, and limited endurance (~100k write/erase cycles), making it unsuitable for frequent updates during training…

…Training vs. Inference Shift: As inference grows faster than training in overall AI compute, hybrid HBM + HBF setups are superior to HBM alone. HBM dominates training, while HBF’s capacity and cost advantages break the “memory wall” for cheaper, higher-throughput inference at scale…

…Bottom line: HBF expands the total AI memory TAM without cannibalising HBM. It creates a new high-value inference tier, making the overall market more competitive, multi-layered, and resilient, which is great for innovation and supply diversity.

4. Project Glasswing: what Mythos showed us – Grant Bourzikas

Mythos Preview is a real step forward, and it’s worth saying that plainly before getting into anything else. We’ve been running models against our code for a while now, and the jump from what was possible with previous general-purpose frontier models to what Mythos Preview does today is not just a refinement of what came before.

It’s a different kind of tool doing a different kind of work, and that makes a clean apples-to-apples comparison to earlier models difficult. So rather than trying to benchmark Mythos Preview against general-purpose frontier models, it’s more useful to describe what it can actually do. Two features stood out across the work we did with Mythos Preview:

Exploit chain construction – A real attack rarely uses one bug. It chains several small attack primitives together into a working exploit. For instance, it might turn a use-after-free bug into an arbitrary read and write primitive, hijack the control flow, and use return-oriented programming (ROP) chains to take full control over a system. Mythos Preview can take several of these primitives and reason about how to combine them into a working proof. The reasoning it shows along the way looks like the work of a senior researcher rather than the output of an automated scanner.
Proof generation – Finding a bug and proving it’s exploitable are two different things, and Mythos Preview can do both. It writes code that would trigger the suspected bug, compiles that code in a scratch environment, and runs it. If the program does what the model expected, that’s the proof. If it doesn’t, the model reads the failure, adjusts its hypothesis, and tries again. The loop matters as much as the bugs it finds, because a suspected flaw without a working proof is speculation, and Mythos Preview closes that gap on its own.

Some of what we describe above is not entirely unique to Mythos Preview. When we ran other frontier models through the same harness, they found a fair number of the same underlying bugs, and in some cases they got further than we expected on the reasoning side too. Where they fell short was at the point of stitching the pieces together. A model would identify an interesting bug, write a thoughtful description of why it mattered, and then stop, leaving the actual chain unfinished and the question of exploitability open.

The Mythos Preview model provided by Anthropic, as part of Project Glasswing, did not have the additional safeguards that are present in generally available models (like Opus 4.7 or GPT-5.5).

Despite this, the model organically pushes back on certain requests – much like the cyber capabilities that made it useful for vulnerability hunting, the model has its own emergent guardrails that sometimes cause it to push back on legitimate security research requests. But as we found, these organic refusals aren’t consistent – the same task, framed differently or presented in a different context, could produce completely different outcomes…

…When we first started AI-assisted vulnerability research last year, our instinct was the obvious one: point a generic coding agent at an arbitrary repository and ask it to discover vulnerabilities. This approach works, in the sense that the model will produce findings, but it doesn’t work in producing meaningful coverage of a real codebase and identifying findings of value…

…Four lessons came out of running the work at scale, and each one pointed to the need for a harness that manages the overall execution:

Narrow scope produces better findings – Telling the model “Find vulnerabilities in this repository” makes it wander. Telling it “Look for command injection in this specific function, with this trust boundary above it, here’s the architecture document and here’s prior coverage of this area” makes it do something much closer to what a researcher would actually do.
Adversarial review reduces noise – Adding a second agent between the initial finding and the queue – one with a different prompt, a different model, and no ability to generate its own findings – catches a lot of the noise that the first agent would miss if it just checked its own work. It turns out that putting two agents in deliberate disagreement is way more effective than just telling one agent to be careful.
Splitting the chain across agents produces better reasoning – Asking “Is this code buggy?” and “Can an attacker actually reach this bug from outside the system?” are two different questions, and the model is better at each one when you ask them separately, because each question is narrower than the combined version.
Parallel narrow tasks beat one exhaustive agent – Coverage improves when many agents work on tightly scoped questions and we deduplicate the results afterward, rather than asking one agent to be exhaustive.
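To make the shape of such a harness concrete, here is a minimal orchestration sketch of three of the four lessons: narrow tasks run in parallel, results are deduplicated, and a separate reviewer filters them. It is not the authors’ implementation. The `Finding` type, the `Finder` and `Reviewer` signatures, and the stub agents are all hypothetical stand-ins for real model calls.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    location: str   # hypothetical identifier, e.g. "jobs.py:run_job"
    bug_class: str  # e.g. "command injection"


# Hypothetical agent signatures: a finder answers one narrowly scoped
# question; a reviewer only accepts or rejects an existing finding.
Finder = Callable[[str], list[Finding]]
Reviewer = Callable[[Finding], bool]


def run_harness(tasks: list[str], finder: Finder, reviewer: Reviewer) -> list[Finding]:
    # Many narrow tasks in parallel instead of one exhaustive agent.
    with ThreadPoolExecutor(max_workers=8) as pool:
        batches = list(pool.map(finder, tasks))
    # Deduplicate afterwards; frozen dataclasses hash by field values.
    unique = sorted(
        {f for batch in batches for f in batch},
        key=lambda f: (f.location, f.bug_class),
    )
    # Adversarial review by an agent that cannot add findings of its own.
    return [f for f in unique if reviewer(f)]


def stub_finder(task: str) -> list[Finding]:
    # Stand-in for a model call; returns canned results keyed on the task.
    canned = {
        "command injection in run_job()": [
            Finding("jobs.py:run_job", "command injection"),
        ],
        "shell calls reachable from run_job()": [
            Finding("jobs.py:run_job", "command injection"),   # duplicate
            Finding("jobs.py:log_line", "command injection"),  # noise
        ],
    }
    return canned.get(task, [])


def stub_reviewer(finding: Finding) -> bool:
    # Stand-in for a second model with a different prompt.
    return finding.location != "jobs.py:log_line"


TASKS = ["command injection in run_job()", "shell calls reachable from run_job()"]
results = run_harness(TASKS, stub_finder, stub_reviewer)
print(results)
```

Keeping the reviewer’s signature to a plain accept-or-reject decision is one simple way to enforce the “no ability to generate its own findings” constraint.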

Each of those observations is about model behavior, and put together they describe something that isn’t a chat interface anymore. It’s a harness that helps you achieve the final outcomes.

5. Open-source agents with frontier advisors: matching frontier performance through training and harness engineering – Fireworks AI

On LAB’s continuous mean-score metric, GLM 5.1 ranks highest among the open-source models we evaluated, with a mean score of 0.8921 that puts it directly alongside the frontier: Claude Opus 4.7 at 0.911, GPT-5.5 at 0.892. Kimi K2.6 (0.863) and DeepSeek V4 Pro (0.871) come in just below, both still clearly viable for production legal workloads.

On the LAB all-pass metric, the production-readiness measure, the closed frontier holds a small lead: Opus 4.7 at 14 / 100, GPT-5.5 at 11 / 100, GLM 5.1 at 12 / 100. That gap is where the rest of this post lives; the two interventions we describe below close most of it.

Cost is the headline. GLM 5.1 reaches its 0.8921 mean for $121 across the 100-task run. GPT-5.5’s nearly identical 0.892 costs $560. Claude Opus 4.7’s 0.911 mean and 14 / 100 all-pass runs $954, roughly 8× any open-source candidate.

“The customer ask is no longer ‘how do we get the smartest model on every query.’ It is ‘how do we get frontier-quality outputs on the queries that need them, and a model we control on the queries that don’t.’”…

…A single LLM call is the wrong unit of work for a legal task: reasoning chains run long, citation discipline is unforgiving, and under all-pass grading any missed criterion costs the entire task. To solve the problem, the team built a small, opinionated multi-agent harness with the open-source worker at its core. The configuration is straightforward: open weights at the core, orchestration the team can inspect and tune, and the frontier model invoked as a callable tool rather than a load-bearing dependency.

A frontier advisor as a callable tool. Treating Opus 4.7 as an advisor the worker can call on hard sub-tasks unlocked the cost savings on the harness. The GLM 5.1 worker does the bulk of the reasoning, drafting, and tool calls. There is no external router or orchestrator. The worker pulls the advisor in itself, wherever it needs a second opinion: retrieval, drafting, validation. Across the run, the advisor is invoked just 0.83 times per task on average — sparse-but-targeted use. That captures most of the quality lift of running the frontier end-to-end, at a small fraction of per-query cost, and it gives us a tunable cost/performance knob: dial advisor calls up on complex matters, down on routine ones.
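The advisor-as-tool pattern can be sketched as below. In the article the worker model decides for itself when to call the advisor; this toy version substitutes a numeric confidence score and a threshold for that judgment, which is purely our assumption. The `Advisor` signature, `run_worker`, and the example steps are hypothetical.

```python
from typing import Callable

# Hypothetical advisor signature: takes a question, returns guidance text.
Advisor = Callable[[str], str]


def run_worker(
    steps: list[tuple[str, float]],
    advisor: Advisor,
    confidence_floor: float = 0.6,
) -> tuple[list[str], int]:
    """Work through steps in order, calling the advisor only when unsure.

    Each step is (description, worker_confidence). The confidence score is
    a stand-in for the worker model's own judgment. Returns the transcript
    and the number of advisor calls made.
    """
    transcript: list[str] = []
    advisor_calls = 0
    for description, confidence in steps:
        if confidence < confidence_floor:
            guidance = advisor(description)
            advisor_calls += 1
            transcript.append(f"{description} [advised: {guidance}]")
        else:
            transcript.append(description)
    return transcript, advisor_calls


def stub_advisor(question: str) -> str:
    # Stand-in for a call to a frontier model.
    return f"double-check '{question}'"


STEPS = [("retrieve cases", 0.9), ("draft memo", 0.8), ("validate citations", 0.4)]

_, routine_calls = run_worker(STEPS, stub_advisor)                          # default floor
_, complex_calls = run_worker(STEPS, stub_advisor, confidence_floor=0.85)   # stricter floor
print(routine_calls, complex_calls)  # 1 2
```

Raising `confidence_floor` plays the role of the cost/performance knob described in the passage: more advisor calls on complex matters, fewer on routine ones.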

The harness traces show a recognizable pattern. The worker’s turn count rises meaningfully versus a GLM 5.1-only run: the model reaches an uncertain step (typically during validation, occasionally mid-draft), calls the advisor for guidance or review, then resumes the trajectory with additional turns informed by the response. The advisor is doing less of the writing and more of the steering; the worker is doing the rest of the work it would not have known to do on its own. Sparse advisor calls, denser worker activity downstream of them.

The harness moves GLM 5.1 from 12 / 100 all-pass to 18 / 100 — higher than Claude Opus 4.7’s 14 / 100 — at $368 across the 100 tasks, roughly 39% of Opus’s $954 standalone cost (Figure 1). Against Opus the comparison is clean on both axes: −$586, +4 tasks all-pass. Against the GLM-only baseline, the advisor adds +6 tasks all-pass for +$246 — the cost increase is real, but it is the cost of beating Opus while still running the open-source worker at the core.
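The cost comparison is easy to recompute from the figures quoted in the excerpt. The sketch below also derives a cost per all-pass task, a metric that is our own illustration and does not appear in the article.

```python
# (total cost in $ for the 100-task run, tasks passing all criteria),
# as quoted in the excerpt.
RUNS = {
    "GLM 5.1 alone": (121, 12),
    "GLM 5.1 + advisor harness": (368, 18),
    "GPT-5.5 alone": (560, 11),
    "Claude Opus 4.7 alone": (954, 14),
}


def cost_per_all_pass(cost: float, passes: int) -> float:
    """Dollars spent per task that passed every criterion (our own metric)."""
    return cost / passes


for name, (cost, passes) in RUNS.items():
    print(f"{name}: ${cost_per_all_pass(cost, passes):.2f} per all-pass task")

harness_cost, harness_passes = RUNS["GLM 5.1 + advisor harness"]
opus_cost, opus_passes = RUNS["Claude Opus 4.7 alone"]
print(f"Harness cost as share of Opus: {harness_cost / opus_cost:.0%}")
print(f"Savings vs Opus: ${opus_cost - harness_cost}, extra passes: {harness_passes - opus_passes}")
```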

Disclaimer: The Good Investors is the personal investing blog of two simple guys who are passionate about educating Singaporeans about stock market investing. By using this Site, you specifically agree that none of the information provided constitutes financial, investment, or other professional advice. It is only intended to provide education. Speak with a professional before making important decisions about your money, your professional life, or even your personal life. We currently have no vested interest in any company mentioned. Holdings are subject to change at any time.