I’m trying to follow my own advice and project my thinking into the perpetually near future rather than the current state of things. To that end, I want to make explicit some of my beliefs about the trajectory of software development and develop some concrete ideas about what a future state might look like. Similar to the post on interviewing software engineers, this is domain specific, but the general pattern of analysis can be applied to other areas. If you end up doing any similar near-future thinking for your field, please share it with me — I’d be very interested.
The emphasis on near future is particularly important. If we go too far out, then I think it’s just too easy to shrug and wave our hands and think, well, AI will have just figured it out. And it’s this nebulous and perhaps short-lived space in between here and there over which we might have the most agency.
High level premises:
The cost of writing code is rapidly approaching zero
As writing code becomes cheaper, the ability to read code goes up in value because there will be more code to review
This will only be a temporary equilibrium; there will be tremendous pressure to keep removing the next bottleneck
One way to remove that limitation is to have humans no longer write or read code itself — a complete step up the abstraction ladder
We should be thinking and designing approaches that specifically do not work with existing models and systems — if they work with current AI, they were not ambitious enough
If we’re going to have the first billion dollar company run by only a single human, we’ll need a lot of incremental steps along the way: the work of an entire software team, and then an entire software org, run by one or a few humans
To this end, I’m[1] taking a big swing and introducing a new[2] framework for software development: Specification Driven Development (SDD) — disclaimer, the repo is currently just a proof of concept (one small notch above vaporware), but I’ve been making progress on developing a real prototype and will update when it’s even plausibly useful.
Specification Driven Development (SDD)
The primary goal of SDD is to provide a declarative approach where users can focus on specifying the what without needing to be concerned about the how. This sort of thinking has a long and successful history: SQL and Infrastructure as Code (Terraform) being some canonical examples. By moving up the level of abstraction in this way, we can remove the bottlenecks of both reading and writing code as well as democratize access to software creation. Humans will now create[3] and review specifications of the system as their interface. If the system produces the right outputs from the right inputs with the appropriate operational constraints, then why would we care how it’s implemented?
It’s worth calling out again that software engineers will not write these specifications, but they will review them. The hope is that specifications like this, declarative and behaviorally focused, will be much easier for humans to interpret and review with high level use case context in mind. They are intended as intermediate representations, artifacts that serve as a source of intent. We’ll get into the creation of these below. But here’s a work in progress visual sketch:
Now, there are a lot of ways that this could go wrong. I’ve tried to think through and address some of them, but I’m sure there will be many more that have been missed. This is unlikely to be the best version of even this sort of idea, but my hope is that by working on something concrete in this area, it can contribute to a broader discussion of how the field can evolve.
Core Beliefs
1. Code is an Implementation Detail
Traditional software development focuses on code as the primary artifact. SDD treats code as a disposable implementation detail - what matters is behavior.
2. Humans Excel at 'What', Machines Excel at 'How'
Humans are good at understanding business needs and defining desired outcomes. Machines are better at implementing efficient solutions. SDD leverages the strengths of each.
3. Constraints Should Be Explicit and Enforced
Performance, security, and reliability requirements shouldn't be hopes - they should be explicit constraints that are automatically verified and maintained.
An initial proposal
Let’s start with a concrete example of what this might look like and then walk through some details (for subsequent examples you can find details about the intermediate representation specs in the appendix):
feature: Task Management System
description: A system for creating, updating, and tracking tasks

scenarios:
  - name: Create a simple task
    when: I create a task with title "Buy groceries"
    then:
      - A task should exist with title "Buy groceries"
      - The task should have status "pending"
      - The task should have a unique ID
      - The task should have a created_at timestamp

  - name: Complete a task
    given: A task exists with title "Buy groceries" and status "pending"
    when: I mark the task as complete
    then:
      - The task should have status "completed"
      - The task should have a completed_at timestamp
      - The completed_at should be after created_at

  - name: Cannot complete already completed task
    given: A task exists with status "completed"
    when: I try to mark the task as complete again
    then:
      - I should receive an error "Task is already completed"
      - The task status should remain "completed"
      - The original completed_at should not change

  - name: List tasks filtered by status
    given:
      - A task "Task 1" with status "pending"
      - A task "Task 2" with status "completed"
      - A task "Task 3" with status "pending"
    when: I list tasks with status "pending"
    then:
      - I should see exactly 2 tasks
      - The tasks should be "Task 1" and "Task 3"

constraints:
  performance:
    - name: API response time
      requirement: p95 latency < 100ms for all read operations
      measurement: Under load of 1000 requests/second
    - name: Write operation latency
      requirement: p99 latency < 200ms for create/update operations
      measurement: Under load of 100 requests/second
    - name: Concurrent user support
      requirement: Support 10,000 concurrent users
      measurement: Without degradation beyond stated latencies

  scalability:
    - name: Data volume
      requirement: Maintain performance with 10 million tasks
    - name: Horizontal scaling
      requirement: System must scale linearly up to 10 nodes

  security:
    - name: Authentication required
      requirement: All endpoints require valid JWT token
      exception: Health check endpoint
    - name: SQL injection prevention
      requirement: All user inputs must be parameterized
      verification: OWASP ZAP scan shows no SQL injection vulnerabilities
    - name: Rate limiting
      requirement: Max 100 requests per minute per user

  reliability:
    - name: Uptime
      requirement: 99.9% availability
    - name: Data durability
      requirement: No data loss on single node failure
    - name: Graceful degradation
      requirement: Read operations continue during write failures
A few thoughts:
Maybe we want to split the functional and non-functional requirements into separate files for better separation of concerns
Perhaps we can set up project level defaults for many of the constraints, so that we specify them explicitly when needed but rely on reasonable values otherwise to reduce boilerplate and unnecessary detail — plus it’s a way to inherit best practices or requirements and then modify them only when needed (a rough sketch follows this list)
We might also want to pass scenarios through an initial AI model to generate edge case scenarios that we can append — Example in appendix A1
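To make the defaults idea slightly more concrete, here is a rough, hypothetical sketch of what inheriting project-level constraints could look like in the same YAML style. The file names and the `extends`/`overrides` keys are invented for illustration and are not part of the current proof of concept.

# project_defaults.yaml (illustrative file name)
defaults:
  constraints:
    security:
      - name: Authentication required
        requirement: All endpoints require valid JWT token
    reliability:
      - name: Uptime
        requirement: 99.9% availability
    performance:
      - name: API response time
        requirement: p95 latency < 200ms for all read operations

# task_management.yaml (illustrative file name)
feature: Task Management System
extends: project_defaults
overrides:
  constraints:
    performance:
      - name: API response time
        requirement: p95 latency < 100ms for all read operations  # stricter than the inherited default

The point is less the exact syntax than the inheritance model: most features say nothing and simply get the project’s defaults, and only the exceptions carry extra detail.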
Key Concepts
Scenarios
The intent here is to be a sort of successor to Behavior Driven Development. The subcomponents here are given, when, then. `Given` allows you to specify the preconditions or state that exist in the system, which allows for conditional behaviors and the complexities of modern data flows. `When` is the trigger for an action, the input. And `then` is the desired output, including any side effects or updates to state.
It’s also worth considering a natural extension here — the ability to provide explicit example inputs and outputs, or to define properties (inspired by property-based testing, which remains, in my opinion, the most underrated testing methodology and one of the weirder best kept secrets in software development; even if you take nothing else from this post, if you work in software development, go learn about it).
I imagine this sort of thing as an optional inclusion when needed for increased control and formal verification. Exclude them until you need them, but when you need them you really need them. I’m unsure how to balance the increased formalism and less natural language feel of this extension. But formalisms are helpful for a reason, so if you can tolerate ambiguity and less rigorous validation then you can avoid them entirely. Example in appendix A2.
Constraints
This concept allows the inclusion of critical operational or non-functional concerns. A huge amount of work has gone into query optimizers for SQL. While we can’t take the same approach here, we can recognize that operational constraints are critical to such a system being effective. Sometimes 100ms latency is completely fine and sometimes it’s totally unacceptable. Context matters dramatically and this gives us a way to make sure the system operates as expected not just in terms of inputs and outputs but in terms of performance, scalability and security.
Including a latency constraint will force the system to write a performance test to verify that it satisfies the requirement[4]. Given how few performance tests are written in day to day software, this could be a dramatic improvement. And maybe we can get some tremendous performance gains just by tightening a latency constraint and letting the models run until they figure out how to improve things. Automatic performance wins could be a big upside. We need to be realistic about expectations, but the AlphaEvolve system found a 0.7% efficiency gain for Google that likely contributes to over a billion dollars of savings per year. So, this is not as unreasonable to expect as it might seem at first.
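To sketch what that could mean in practice, here is a purely hypothetical example of the kind of verification artifact the system might derive from the latency constraint above. The keys shown (derived_from, load_profile, assert) and the operation names are invented for illustration rather than taken from the proof of concept.

# Hypothetical verification artifact derived from the "API response time" constraint
verification:
  - name: Read latency under load
    derived_from: API response time
    load_profile:
      requests_per_second: 1000
      duration: 10m
      operations: [list_tasks, get_task]
    assert:
      - p95_latency_ms < 100
    on_failure: regenerate or optimize the implementation until the assertion passes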
We might also consider compliance, regulatory, privacy, and legal constraints, as well as those that need to apply across an entire system and not just a single feature, component, or domain. A quick sketch of what that might look like can be found in appendix A3 — in which I introduce a few more key concepts: systems, domains, and dependencies.
Here’s the even bigger vision. Rather than using this development framework to define features of an individual microservice, we can use it to build entire systems. Similar to how we shouldn’t really care how a single code path is implemented, why would we really care about how a system architecture is implemented as long as it satisfies the expected behavior in compliance with the appropriate constraints?
We currently make a large number of decisions based on very arbitrary and contingent factors. Why use DynamoDB? “Oh, well, I’ve used it before and it’s widely used at the company”. And in many cases, this is actually a pretty reasonable answer. Local and organizational consistency has incredibly high value for humans trying to cooperate. But imagine if there wasn’t a cost to learning a new technology, language, or framework. Then our decisions would look a lot different. Let the AI cook.
An example domain and a cross domain interaction specification can be found in appendix A4. And a cross domain data flow example can be found in appendix A5.
Objections
This seems completely impractical and unrealistic
Good, that means it’s roughly ambitious enough. If our ideas don’t strain deeply against our current understanding, they are not nearly future oriented enough.
There is no way that AI can write bug free software this complex
There is no way that humans can write bug free software this complex either. The right comparison is certainly not perfection, and in some near future state, I think it’s easy to imagine this approach producing fewer bugs because we will have invested so much more effort in the specification and expected behaviors. In many under-specified systems, the current code sort of is the underlying specification because nothing else exists.
And also, just wait 3-6 months.
Debugging Systems We Didn't Build
This is a daunting one — if we don’t have a solution here, I think the whole paradigm is in doubt.
When something goes wrong at 3am, how do you debug a system where you've never seen the code?
Here are a handful of attempts at addressing this. Ultimately, I don’t think we can know if they are sufficient until someone actually tries to run a non-trivial system like this in production. Maybe a plausible initial experimentation path is to use this approach to re-implement an existing, fully robust system and keep that parallel system running as a fallback for some period of time.
Behavioral Observability
Build monitoring and alerting into the system in a way that will be human understandable. Yes, have granular metrics around particular system calls, but roll these up to the behavioral, scenario-based level. Automatically create per-scenario dashboards nested within service and domain hierarchies. Have an AI system that describes the observed behavior in high level human terms and also provides hypotheses around potential root causes.
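As a purely hypothetical sketch, a scenario-level rollup might look something like this in the same YAML style; the metric, dashboard, and alert names are invented for illustration.

# Hypothetical scenario-level observability rollup
observability:
  scenario: Complete a task
  dashboards:
    - name: Task completion (behavioral view)
      panels:
        - success_rate     # fraction of completion attempts that satisfy the scenario's then clauses
        - p95_latency
        - error_breakdown  # expected errors ("Task is already completed") vs unexpected ones
  alerts:
    - name: Users can no longer complete tasks
      condition: success_rate < 99% over 10 minutes
      includes:
        - behavioral_summary     # plain-language description of what users are experiencing
        - root_cause_hypotheses  # ranked guesses with supporting metrics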
For some subset of cases, potentially allow the system to be self-healing. If the AI debugging agent can identify the cause and suggest a fix without breaking any other constraints, then just let it roll forward. Then have these changes flagged for special review by humans (along with appropriate AI analysis to aid in contextual understanding).
When this is not sufficient, humans can attempt to fix the system while still operating at the scenario or constraint specification level. There’s a good chance that a system will be under-specified. It’s almost impossible to really understand the full scope of a domain and how a complex distributed system operates with real customer data. But, when given additional data about how a system is specifically failing, it’s much easier to understand the new constraint or behavior that needs to be specified. Having a behavioral description with supporting metrics and hypotheses should, in most cases, be enough to enable an on-call engineer to write the additional specification that’s needed.
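As a hypothetical example, after an incident the on-call engineer might append a scenario like the following to capture behavior the original specification missed; the failover details here are invented for illustration.

# Hypothetical scenario added after an incident
- name: List tasks during a storage failover
  given: The primary task store is failing over to a replica
  when: I list tasks with status "pending"
  then:
    - I should receive results that are at most 30 seconds stale
    - I should not receive an error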
Humans I guess?
Until it’s just really clear that AI is better even at this side of things, we’ll still need to keep developers on staff who can dig into an unfamiliar code base and figure out what needs to be fixed. But I’ve been on-call for systems I didn’t write and didn’t really have a deep sense of ownership over. The issues that are way more common in the middle of the night look like: nodes 476, 512, and 733 all appear to be misbehaving outliers, so let’s restart them and see if the overall system success rate and latency get back into the normal range. General operational management skills will probably be enough in a lot of situations. And I think you can probably get a lot of mileage by ensuring that the AI bakes generic mitigation strategies into the service.
And maybe this is just a temporary problem.
The Developer Experience (DX) seems pretty awful
How are we going to generate these specifications? Is writing these yaml files actually any better than just writing the code? It also lacks an iterative quality: we rarely know the complete specification up front and will write poorly specified systems.
This is a deeply compelling objection, and I think a lot of insight can be gained by considering it carefully. One insight is that it surfaces the deep difficulty of writing these specifications (apart from the tediousness of writing configuration files): the irreducible complexity of fully specifying a system. It’s just too hard to know in advance. We cannot understand a system until we interact with it.
What we need is a process of discovering the underlying specification that we want. I’m reminded of the work of Bret Victor. If we can have tooling that helps iteratively uncover these specifications while giving us insight into the system as it evolves, then maybe there is still something here.
Part of why SQL has worked is that the domain is so constrained. With an unconstrained domain, we must be able to express so much more. And by being able to express so much more, it’s easy to get lost along the way. Maybe what we want is not specification driven development, but for the specifications to exist primarily as intermediate representations that are human reviewable, but distinctly not human created. Specifications are the artifacts and the source of reproducible truth of a system, but we do not create the system that way.
I don’t have the full vision here, but I think exploring this design space is valuable. One of the core ideas I started this thought experiment with was that writing and reading code were already being decoupled. So it feels like a natural extension that the specification a human reviews does not need to be a specification a human created. And maybe in a few years even having such an intermediate representation for human review will be unnecessary. One of the strange things about thinking along the current trajectory is having to embrace ideas and approaches with a great respect for how ephemeral they may be.
This was some of the most valuable feedback I received on this idea and a clearly critical oversight. But, software engineering is a team sport :)
A few more screen shots of some work in progress here:
Details in the specification are ambiguous
Take the example of 99.9% uptime as a constraint. Humans struggle to define this as well. If one of dozens of endpoints is down, is the service down? What if a handful of endpoints have success rates below 50%? How can an AI model infer our intent here, when we probably don’t even fully understand it ourselves?
One of the benefits of the behavioral observability approach is that we can have our uptime metrics defined strongly in terms of user impact. Are users able to accomplish the tasks they want to? If so, the system is up. Historically, part of what has made this difficult for engineers, managers, and product folks to agree on is that we compose smaller individual metrics to represent the system. We go bottom up. But if the metrics in place have already been holistically conceived, they will be top down and more easily serve this purpose.
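Here is a hypothetical sketch of a top down, scenario-based availability definition; the thresholds and the five-minute window are invented for illustration.

# Hypothetical scenario-based availability definition
availability:
  definition: A five-minute window counts as up if every critical scenario meets its success threshold
  critical_scenarios:
    - name: Create a simple task
      success_threshold: 99.5%
    - name: Complete a task
      success_threshold: 99.5%
    - name: List tasks filtered by status
      success_threshold: 99.0%
  target: 99.9% of windows per quarter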
Formal Specification is a Requirement of Communication
“Other technical professions use formal specs, even though computers aren't involved at all:
Architects and mechanical engineers need engineering drawings.
Doctors have formal taxonomies and terminology to communicate efficiently about disease and treatments.
These are much more ambiguous than computer programs, because human judgement can fill in the details. If blueprints were like programs, they'd describe how to hammer in every nail. But now AIs will allow programmers to be more ambiguous.”[5]
Maybe there is just too much irreducible complexity inherent in these problems to propose something that simplifies things that much — some level of formalism may just be essential. I find this quite compelling and it warrants additional consideration and thought. It’s exactly this sort of thing that I hope the broader conversation includes.
But code quality will go down
I’m deeply unconvinced that this is true. Maybe it is with current models. Sure, I could be wrong about the trajectory of AI capabilities. But if I’m not, then it’s a failure of imagination not to see how code quality will actually go up.
And further, code quality won’t matter that much. If the system does what it says, it’s relatively easy for a human to verify, and the system can be fixed when it’s broken — it’s just quite unclear to me what the value of code quality even is.
And in fact, won’t we just get better code by asking the models directly — rather than using an intermediate representation? Probably, at least that’s my guess. But the core problem we’re trying to solve isn’t code generation; it’s reducing the burden of code review, which I see as a rapidly approaching bottleneck. Maybe it won’t persist long enough to be a substantial problem.
This isn’t the job I signed up for
Yes, this makes sense. Maybe you like writing code. Perhaps it’s a core part of your identity. Likely you’re quite proud of hard earned skills that have provided for you and your family. It’s critical to not minimize the emotional impact of the change we’re all going through.
Many, maybe even most, jobs are changing. Some people will like them more (I’ve always been more attracted to the problem solving, customer impact, and collaboration aspects of software development) and some people will like them less. This seems almost completely unavoidable to me. Part of why I think projecting into the near future is so valuable is to try to help myself and others understand coming disruption just a bit better. I find a sense of agency and optionality comforting[6].
Part of why I’m thinking about this is that, very explicitly, I do not want my job to be nothing but code review. But if code review is what’s providing the most value, then that’s what a lot of software engineers will spend a lot of time doing. So, in some ways, this is me pragmatically trying to nudge against that future.
Some of these ideas are good, but X, Y or Z
Great, I’m sure I’ve missed many things, and maybe conceptually there’s just a much better way to accomplish the goals I’m hoping for. What seems absolutely critical to me is building the muscle and reflex to design for the near future instead of the recent past. And using the frameworks or design patterns we’ll want in 6 to 12 months can only happen if we start thinking about them now and working on them soon.
Conclusion
Ultimately, I think the thought exercise is more valuable than the framework itself. The objections are substantial and collectively carry a lot of weight. We may need more ideas that fail before we find the ones that work.
But what I remain convinced of is that verification is going up in value. And attempts to reduce the cost or cognitive burden of verification will be useful in many areas. This feels like a very natural consequence of the cost of generation and creation going down, but I don’t think I would have understood the connection quite as clearly without going through this exercise. Eventually, we may trust AI models to do the verification as well, but in the near future this simply won’t be true, and we need strategies for navigating that time.
Why do I care so much about solving this problem? Because verification and validation work doesn’t feel as good. It’s not nearly as satisfying as creating. I thrive on rough first drafts, incremental progress, understanding the broad context, and strategic thinking. Verification is an entirely different skillset and has a distinct psychological profile. I am not a good editor. Insofar as I’m a good code reviewer, it’s primarily due to being pragmatic and charitable. If we don’t solve this problem in a way that works, I think that work will get a lot worse for a lot of people very quickly.
Recently, I’ve been mostly energized and engaged with the benefits of improved creation. And I still feel that. Being able to sketch out a proof of concept for SDD so quickly including a reasonable frontend UI feels empowering. I can create more and different things than I could even a few months ago. But this is part of why I think the exercise of looking ahead is so vital — the natural conclusion of this trajectory makes me much more concerned.
This feels like a good time to call out a distinction that needs to be made more and more when discussing AI. When we think of being pro- or anti-AI along a single axis, we miss so much. We need to clarify whether we’re talking about capabilities, impact on ourselves, or influence on society and the world. I’m grateful to this piece by Valerie Ehrlich: It's Okay to be Uncertain About AI, for helping crystallize this idea for me. So for the record, I’m currently:
A strong optimist on the AI capabilities axis
A mild pessimist on how AI will impact me personally
And deeply uncertain about the value of AI to society (but I’m relatively confident in polarizing outcomes and substantial short term disruption that is net negative)
Appendix
A1
# AI-generated edge cases
- name: Handle concurrent task completion
  given:
    - A task exists with status "pending"
    - Two users try to complete it simultaneously
  when: Both users mark the task as complete at the same time
  then:
    - Only one completion should succeed
    - The other should receive an error
    - The task should have exactly one completed_at timestamp

- name: Prevent SQL injection in task title
  when: I create a task with title "'; DROP TABLE tasks; --"
  then:
    - A task should be created with that exact title
    - No database corruption should occur
    - The system should remain functional

- name: Handle extremely long task titles
  when: I create a task with a title of 10,000 characters
  then:
    - I should receive an error "Title too long (max 255)"
    - No task should be created
A2
scenarios:
  - name: Create task with validation
    inputs:
      - name: title
        type: string
        constraints:
          min_length: 1
          max_length: 255
        examples:
          - "Buy groceries"
          - "完成项目"  # Unicode support
          - "Task with emoji 🎯"
      - name: description
        type: string
        optional: true
        examples:
          - "Milk, eggs, bread"
          - null
      - name: due_date
        type: datetime
        optional: true
        constraints:
          must_be_future: true
        examples:
          - "2025-12-31T23:59:59Z"
          - null
    outputs:
      - name: task_id
        type: uuid
        properties:
          - unique: true
          - format: "uuid_v4"
      - name: created_at
        type: datetime
        properties:
          - less_than_or_equal: now()
      - name: status
        type: string
        value: "pending"  # Literal value
      - name: title
        type: string
        properties:
          - equals: input.title  # References input
    properties:
      - idempotency:
          description: "Creating the same task twice returns the same ID"
          key: ["title", "description", "due_date"]
      - state_transitions:
          valid:
            - { from: "pending", to: ["in_progress", "completed", "cancelled"] }
            - { from: "in_progress", to: ["completed", "cancelled", "pending"] }
          invalid:
            - { from: "completed", to: ["pending", "in_progress"] }
            - { from: "cancelled", to: ["pending", "in_progress", "completed"] }

# Property-based test definitions
property_tests:
  - name: "Task title transformations preserve meaning"
    property: |
      for_all(title: string):
        create_task(title.strip()).title == title.strip()
    generators:
      title:
        type: string
        strategies:
          - random_unicode: {min_length: 1, max_length: 255}
          - with_whitespace: {prefix: true, suffix: true}
          - with_special_chars: ["emoji", "rtl_text", "zalgo"]

  - name: "Concurrent operations maintain consistency"
    property: |
      for_all(operations: array[operation]):
        final_state(parallel_execute(operations)) == final_state(sequential_execute(operations))
    generators:
      operations:
        type: array
        size: {min: 10, max: 100}
        items:
          one_of:
            - create_task: {title: random_string()}
            - update_task: {id: random_existing_id(), updates: random_updates()}
            - delete_task: {id: random_existing_id()}

# Cross-scenario constraints
constraints:
  referential_integrity:
    - name: "Task references are valid"
      property: "for_all(reference: task_id): exists(task.id == reference)"
  data_consistency:
    - name: "No orphaned subtasks"
      property: "for_all(subtask): exists(parent_task.id == subtask.parent_id)"
A3
system: Multi-tenant E-commerce Platform
description: B2B SaaS platform where businesses can create online stores

domains:
  - name: tenant_management
    dependencies: [authentication, billing]
  - name: product_catalog
    dependencies: [tenant_management, search, media_storage]
  - name: shopping_cart
    dependencies: [product_catalog, pricing_engine]
  - name: order_processing
    dependencies: [shopping_cart, payment_gateway, inventory]
  - name: search
    dependencies: [product_catalog]
  - name: analytics
    dependencies: [order_processing, tenant_management]

global_constraints:
  multi_tenancy:
    - name: Complete data isolation
      requirement: No tenant can access another tenant's data
      verification: Automated penetration testing between tenants
    - name: Performance isolation
      requirement: One tenant's load cannot impact another's performance
      measurement: Load test with asymmetric tenant usage

  compliance:
    - name: PCI DSS Level 1
      requirement: Full compliance for payment processing
    - name: GDPR compliance
      requirement: Data deletion, portability, consent management
    - name: SOC 2 Type II
      requirement: Audit trail for all data access

  scale:
    - name: Tenant capacity
      requirement: Support 10,000 active tenants
    - name: Catalog size
      requirement: 1M products per tenant without degradation
    - name: Order volume
      requirement: 100k orders/hour peak across all tenants

# Individual domain specifications follow...
A4
domain: product_catalog
description: Manages products, variants, and inventory

data_model:
  consistency_requirements:
    - name: Inventory accuracy
      requirement: Inventory counts must be eventually consistent within 1 second
    - name: Price consistency
      requirement: Price changes must be immediately consistent

scenarios:
  - name: Create product with variants
    given: I am authenticated as a tenant admin
    when: I create a product "T-Shirt" with variants for size [S,M,L] and color [Red,Blue]
    then:
      - 6 variant SKUs should be created
      - Each variant should have independent inventory tracking
      - Product should be searchable within 5 seconds

  - name: Bulk import products
    given: I have a CSV with 50,000 products
    when: I upload the CSV for import
    then:
      - Import should be processed asynchronously
      - I should receive progress updates via websocket
      - 95% of products should be searchable within 2 minutes
      - Failed imports should generate detailed error report

  - name: Concurrent inventory updates
    given: Product "SKU-123" has inventory count of 10
    when: 15 simultaneous orders try to purchase 1 unit each
    then:
      - Exactly 10 orders should succeed
      - 5 orders should fail with "insufficient inventory"
      - No negative inventory should be possible
      - Inventory reserved but not purchased should release after 15 minutes

constraints:
  performance:
    - name: Product detail page
      requirement: p99 < 50ms including all variant data
      cache_strategy: Tenant-specific Redis cache with 5-minute TTL
    - name: Search latency
      requirement: p95 < 100ms for searches across 1M products
      implementation_hint: ElasticSearch with tenant-specific indices

  resilience:
    - name: Search degradation
      requirement: If search is unavailable, browse by category still works
    - name: Image CDN failure
      requirement: Products display with placeholder images if CDN fails
Note the extensibility of the framework: we can add new features like `cache_strategy` or `implementation_hint` as the framework evolves and becomes more feature complete. There should probably be a fully thought through set of design principles that helps adjudicate what types of concepts and functionality warrant top level exposure, along with prompt engineering development to ensure they improve rather than detract from the system.
A5
interaction: checkout_flow
description: Complete purchase flow across multiple domains

participants:
  - shopping_cart
  - pricing_engine
  - payment_gateway
  - inventory
  - order_processing
  - notification_service

flow:
  - name: Complete purchase
    steps:
      1: Cart requests final pricing from pricing_engine
      2: Payment_gateway reserves payment authorization
      3: Inventory reserves items
      4: Order_processing creates order
      5: Payment_gateway captures payment
      6: Inventory confirms reservation
      7: Notification_service sends confirmation
    failure_handling:
      - name: Payment fails after inventory reserved
        rollback: Release inventory reservation within 30 seconds
      - name: Inventory unavailable after payment authorized
        rollback: Void payment authorization
        compensation: Offer backorder or refund option

constraints:
  - name: Atomicity
    requirement: Either complete entire flow or rollback cleanly
  - name: Idempotency
    requirement: Submitting same order twice results in single order
  - name: Distributed timeout
    requirement: Entire flow completes or fails within 30 seconds
Thanks to Jason Held, Josh Montague, and Alex Beal for valuable feedback on earlier versions of this post. Their insights and feedback deeply reinforce my claims from earlier posts that software engineering is a team sport. Among their many collective contributions were insights into the UX (no one wants to write a bunch of yaml files), the nature of formal specification as a requirement of communication, and the importance of iterative real world data on complex projects (for example, the limitations of microbenchmarks). Bad ideas remain clearly mine.
Footnotes

1. I assume the major labs already have more sophisticated versions of something like this internally, but for proprietary reasons don’t discuss it openly. Or maybe they’re still struggling with the same problem of verification being a bottleneck?

3. For more on why we will create rather than write these specs directly, see the “The Developer Experience (DX) seems pretty awful” objection.

4. There are meaningful limitations to micro-benchmarks, so for this to work holistically, it would need to monitor and observe real system data as well.

5. Email from Alex Beal, 5/31/2025.

6. Even if these are illusory, I still think I’ll get placebo benefit from them and importantly I’m not so pessimistic as to assume they are illusory.