Parsix: Parse Don't Validate

didibus · on May 15, 2021

I have to be honest, I'm not seeing what problem this is trying to solve. Anyone can enlighten me?

Edit: Ok I think I understand...

It seems the problem would be that if you're implementing a function that takes a user email as a string, and that function is in a lower layer of the application, like inside the data access layer. It is difficult at this point to know if the email string you will be passed as input has already been validated or not. Thus you might be tempted to re-implement validation for it at your level inside this function as well and have an assertValidEmail check.

This can lead to a littering of validation throughout the code base, as each implemented function worries that the input isn't validated and re-validates it, possibly using slightly different rules each time.

Furthermore, if you decide to not validate it again, you might be left wondering, but am I sure it'll have been validated prior? How can I be sure? Someone in the future could easily start calling my function and forget to validate the email before calling it? This could eventually lead to a security issue or just a bug, by introducing a code path that doesn't ever validate the email string.

Thus if instead you'd re-write your function so it takes an email as a ValidEmail type (or object), and not as a string, you force the caller to for sure remember to validate the email first. And you also can safely assume if you're getting an email as a ValidEmail type that it has been validated. It could also technically allow you to localize the validation logic to the ValidEmail type constructor, avoiding possible duplicate attempts at validating email with different rules.

And it seems the latter "style" the author calls "Parsing" while the former they call "Validating", in the sense that since the function validating returns a modified structure it "parsed" it, because a string became a ValidEmail, thus parsing a string into a ValidEmail, as opposed to simply validating that the string is valid as an email.

And finally, this is a little library to help make use of this pattern in Kotlin.

jpeloquin · on May 15, 2021

Paraphrased from the repo's readme: Suppose you have a program that consumes user input. Users often give bad input so the program needs to validate user input before acting on it. One way to validate is to call a function (e.g., `check_input`) on the user input and if it doesn't raise an error the input is safe for consumption by the rest of the program. The repo author considers this approach to be risky because the programmer can inadvertently omit or bypass `check_input` and the program still compiles and runs without complaint.

The repo presents an alternative validation approach, which is to parse the user input into a data type (or, not quite equivalently, into a class). The parsing process serves as validation. Consumer functions are written such that they only accept the parsed data type. Therefore it is now impossible for the programmer to inadvertently omit or bypass validation of user input.

The library is a set of convenience functions for actually writing these parsing / validation functions.

atoav · on May 15, 2021

So in short: instead of representing user input (e.g. a Email address) as a string – which you can forget to validate – the idea here is to create a own data type for it, and use the validation step to create said data type.

The rest of your program then works with this data type instead of the string and this way you will get a type error whenever you accidentally use unvalidated data.

A nice idea that goes into a similar direction is to expand on this and create more types for different levels of trust. E.g. you could have the data types ValidatedEmail, VerifiedEmail and TrustedEmail and define precisely how one becomes the other. This way your typesystem will already tell you what is valid and what is not and you can't accidental mix them up.

TeMPOraL · on May 15, 2021

You can also further generalize this idea by noticing you can encode all kinds of life cycle information in your type system. As you transform some data in a sequence of steps, you can use types to document and enforce the steps are always executed in order.

In this example, the user input validation step is f(String) -> ValidatedEmail, then the process of verifying it is f(ValidatedEmail) -> VerifiedEmail. But the same principle can apply to e.g. append() operation being f(List[T], T) -> NonEmptyList[T], and you can write code accepting NonEmptyList to save yourself an emptiness check. Or, take a multi-step algorithm that gets a list of users, filters them by some criterion, sorts the list, and sends these users e-mails. Type-wise, it's a flow of Users -> EligibleUsers -> SortedEligibleUsers -> ContactedEligibleUsers.

And then, why should types be singular anyway? You should be able to tag data with properties, and then filter on or transform a subtag of them. This is the area of theory I'm not familiar with yet, but I imagine you should be able to do things like:

List[User] -> List[User, NonEmpty] -> List[User[Eligible], NonEmpty] -> List[User[Eligible], NonEmpty, Sorted[Asc]] -> List[User[Contacted], Sorted[Asc]].

Or,

Email -> Email[Validated] -> Email[Validated, Verified] -> Email[Validated, Verified, Trusted].

I'm sure there's a programming language that does that, and then there's probably lots of reasons that this doesn't work in practice. I'd love to know about them, as I haven't encountered anything like it in practice, except bits and pieces of compiler code that can sometimes propagate such information "in the background", for optimization and correctness checking.

rictic · on May 15, 2021

The closest I've seen is Idris, which lets you pass around data along with proofs about that data. The proofs can be generated at either runtime or compile time (the usual distinction between the kinds of things you can do at runtime vs compile time is blurred in all sorts of delightful ways in Idris).

The Idris Book (https://www.manning.com/books/type-driven-development-with-i...) is structured in a very practical, "learn some then build some" format that I found to be a joy to read.

akersten · on May 15, 2021

I really like this idea. It feels like weapons-grade TypeScript and could solve an entire class of logical error. If someone has a good method they're already using to encode strict typing like this I'd love to check it out.

greggman3 · on May 16, 2021

This site on F# used to show how to do the same in C# (Which would be more applicable to typescript?) but, the usually coding via types is best in a language designed to make that easy. AFAIK Typescript (nor C++, C#, Java) are designed for that purpose where as Haskell, F# are.

That doesn't mean you can't do it in those other languages, only that it's tedious compared to languages that are designed for it.

https://fsharpforfunandprofit.com/posts/designing-with-types...

https://fsharpforfunandprofit.com/posts/fsharp-decompiled/

eru · on May 16, 2021

This is pretty common practice in languages like Haskell.

Gregaros · on May 16, 2021

Reminds me of dependent types: https://en.wikipedia.org/wiki/Dependent_type

_greim_ · on May 15, 2021

To keep building on this, I think the word "parsing" is just the tip of the iceberg. Parsing is one way to port data across a type boundary, where the source and dest types are optimized for different use cases (e.g. serialization vs type-safe representation). Since the semantic Venn diagrams of any two types might have areas of non-overlap, parse-don't-validate means establishing clear boundaries in your program where those translations happen, then defining the types on either side of the boundary to rule out the possibility of nonsense states elsewhere throughout the program. The idea of nonsense states is closely related and discussed more here[0] and here[1].

[0] http://blog.jenkster.com/2016/06/how-elm-slays-a-ui-antipatt...

[1] https://kentcdodds.com/blog/make-impossible-states-impossibl...

_lqaf · on May 15, 2021

Reminds me a little of taint checking in Perl and Ruby, in reverse.

feurio · on May 16, 2021

Yes. Also I thought of when I have created a Domain in Postgres (https://www.postgresql.org/docs/9.5/sql-createdomain.html) with a constraint written in PL/Perl to ensure that the data going in is always valid (eg. check digits in IDs, a bit like in credit cards).

Smaug123 · on May 15, 2021

If you mean "why Parse, Don't Validate", you should read the original blog post, linked at the top of the article. It's… transformative, if you aren't already aware of the principle. https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...

If you mean "why this library", well, I guess parser combinators are nice! Some may say that a declarative statement of the parsing restrictions is better than a procedural implementation, on general principles.

cle · on May 15, 2021

There are lots of siblings explaining why “parse don’t validate”.

But also, it’s not always wise to take this to an extreme. I’ve seen over the years many scenarios where dev teams were over-enthusiastic about this and parsed themselves into a corner by making system components over-strict and enforcing invariants that weren’t necessary to enforce, making them much harder to change later.

The right answer is, of course, somewhere in the middle, and depends on your domain and situation.

Dylan16807 · on May 16, 2021

> The right answer is, of course, somewhere in the middle, and depends on your domain and situation.

I strongly disagree. The problem you describe, of being overly strict, is orthogonal to whether you validate or parse. There is no "middle" between parsing and being strict. Parse-validate and strict-nonstrict are separate axes.

Iazel · on May 15, 2021

hi, @cle! Curious to hear more about that, were they actually running validation/assertions in constructors?

cle · on May 15, 2021

That can be a case of that yeah. Using your example, a lot of devs might use that email parsing logic in various independent components of the same system. Eg if you have a reporting component that sends you business reports, that component really shouldn’t be validating the structure of email addresses…if you need to refine the parsing logic now you’ve got to do coordinated deployments, possibly backfills, etc., whereas if you just treated it as an opaque string in that system you’d be better off.

This isn’t really a criticism of the approach, it’s super useful, just that it needs to be applied judiciously. “Parse all the things” isn’t always the best advice.

Kranar · on May 15, 2021

Your response isn't clear.

>that component really shouldn’t be validating the structure of email addresses

That's right, as the article says you no longer validate data, instead you parse data. Components don't take strings and validate them, they take an EMail address.

>If you need to refine the parsing logic now you’ve got to do coordinated deployments

If you need to refine the parsing logic, you modify the constructor of the EMail class to do whatever refinements you need.

I have no doubt you may have seen codebases that did something wrong, and that wrong thing may have been related to parsing or validation, but nothing you've said indicates that there is something wrong with replacing validation with parsing, so that code always operates on semantically meaningful data types that are valid by construction, instead of opaque binary blobs that need to constantly be properly interpreted in ad-hoc ways.

cle · on May 16, 2021

I guess what I'm saying is to be judicious about what invariants a particular piece of code should care about, within the context of the larger system.

> That's right, as the article says you no longer validate data, instead you parse data. Components don't take strings and validate them, they take an EMail address.

That doesn't solve the problem, by parsing it my code is caring about its syntax and structure, even if my use case doesn't require that. That creates coupling, and in a distributed system that kind of unnecessary coupling can create major headaches when trying to change things.

An even simpler example that many people can empathize with is using strict, closed enums in a distributed system (very easy to do with e.g. Java). Safely adding a new enum value is basically impossible at that point, until you make them "open" enums that can safely parse unknown values.

Again I am not disagreeing in principle, I am only saying that this has ergonomic problems in practice and is easy to misuse.

Iazel · on May 16, 2021

I see, that would be problematic indeed. I strongly believe in Bounded Contexts, a concept from Domain Driven Design. As for the Ubiquitous Language (ie: product & engineers naming things the same way, code included), parsing logic should also be segregated and specific to your domain. I would argue that you should have different parsers for different purposes even inside the same system, for example one could be applied to inputs coming from an User, while another could be applied to an API and one more when coming from a DB.

I think I would stress this point more in the README, thanks for sharing :thumbsup:

GordonS · on May 15, 2021

If it helps, here's a related blog post but with a C# slant:

https://andrewlock.net/using-strongly-typed-entity-ids-to-av...

The author refers to using primitives everywhere as "primitive obsession", and proposed using types instead.

dmux · on May 15, 2021

Similar to the idea of "microtypes" (I've most often seen it used in Java circles):

https://www.markhneedham.com/blog/2009/03/10/oo-micro-types/

Iazel · on May 15, 2021

Cool to see you perfectly got the point in the end! I wonder though, were you confused by the README? What made it clear for you?

didibus · on May 15, 2021

Hum, it was the people here who replied to my question, and also reading the linked article.

I think my confusion was in trying to frame things as parsing VS validating. While I now appreciate that use of word, now that I understand, it also caused my biggest source of confusion.

That's because I think most people think of parsing as conversion, like I turn a String to an Int. Where as in your case, you're simply wanting to tag a type as having been validated, but you don't really convert the type itself, so you simply wrap it in another type in order to tag it as having been validated simply because the language offers no other way to tag the type with meta-information for the compiler to assert statically.

So because it seemed more like you're just wrapping the input, but still all code will be using the input value as it is, extracting it out of your wrapped type, the idea that you were "Parsing" and not "Validating" well just confused me.

sullyj3 · on May 16, 2021

It's true that here the data type was just a tag. There's always a decision to be made about how much structure you want to enforce by the construction of the data type. You'd still consider it to be "parsing" because you now statically know that an Email is valid according to the rules of the parser - in a loose sense, it's a "conversion" between a type you know nothing about, and one you know a lot about.

If you wanted to go further you could start going the "correct by construction" route - having a data type that enforces more invariants. For example, you might store the recipient name, domain name, and tld of your email in separate fields. Then your parser would more obviously be a parser. I think of the "mere tag" type as a kind of degenerate case of this, rather than something totally separate.

Iazel · on May 16, 2021

I see, thanks for sharing. I will improve the README by adding a clear definition for parsing, validation and deserialization, so that we can all be on the same page ;)

matheusmoreira · on May 15, 2021

This also has security implications. The input handling layer is critical. Bugs in parsing and validation code are responsible for a huge number of vulnerabilities.

More details: http://langsec.org/

> The Language-theoretic approach (LANGSEC) regards the Internet insecurity epidemic as a consequence of ad hoc programming of input handling at all layers of network stacks, and in other kinds of software stacks.

> LANGSEC posits that the only path to trustworthy software that takes untrusted inputs is treating all valid or expected inputs as a formal language, and the respective input-handling routines as a recognizer for that language.

mirekrusin · on May 15, 2021

Imagine you're writing typescript project. You type everything and have type safety. This type safety is an illusion on I/O boundaries – whenever ie. JSON.parse(...) from file/websocket/http happens. To preserve type safety, you want to use something like [0] to do runtime type assertions. Once i/o boundaries are parsing unknown types at runtime into what is defined as static types, your type safety is guaranteed.

[0] https://github.com/appliedblockchain/assert-combinators

TheAceOfHearts · on May 15, 2021

Refining types so they encode all desired constraints before use. This is explained in the linked article: Parse, don’t validate [0].

It helps reduce the risk of using invalid inputs by representing constraints over the value as part of the type.

For example: a common problem in web development security is that query parameters aren't properly validated which can lead to denial of service attacks. As a trivial example of this, consider a web server which paginates some data using "offset" and "limit" by passing those parameters directly to a database query; an attacker could set "limit" to some incredibly high value and cause the server to crash. If you're just doing validation on your inputs it's possible that some usage could end up being overlooked.

[0] https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...

gregors · on May 15, 2021

So real question - in the "offset" "limit" example what makes it any more safe if at first the programmer sets those types to be integers? The same problem persists does it not?

Does the explicit creation of a type add this introspection? I'm not convinced that it does. Now once you fix this bug, encoding it in a type prevents it from creeping into other parts of the the code. This seems more like DRY principles in action.

didibus · on May 15, 2021

Yeah, it seems to be more about guarantees as a code base grows larger and more people touch it.

If there's a Limit class whose constructor and setter all check that the range is between say 5 to 100, and all existing code that needs the limit uses the instance of Limit, it just becomes less likely a code change is made that uses the limit input as it was directly provided by the user (and thus possibly out of range).

But you'd still need to have had someone be smart enough to make sure the Limit class does prevent limits that could cause DB crashes.

In practice I'm thinking, ok, so someone must have thought... Hey we should validate this user input and put in some logic for it.

So I think what this says is, validation works by having all external input validated as they are received. But it can be easy to make a code change at the boundary where you forget to add proper validation. If all existing functions in the lower layers, like in the data access layer, are designed to take a Limit object, the person who took a limit as external input and was about to pass it to the query function will get a compile error and realize... Oh I need to first parse my integer limit into a Limit, and thus reminds them to use the thing that enforces the valid range.

If instead the code had a util function called assertValidLimit, and the query function took a limit as an integer, it be easy for that person to forget to add a call to assertValidLimit when getting the limit from the user and then pass that unvalidated to the query and possibly cause a vulnerability.

And lastly, it seems they argue, if you were to validate instead in the query function itself, thus it wouldn't matter if others forget to validate since where it matters would, but then it is hard to fail at that layer, since you might have already made other changes and that can leave your state corrupted.

So basically it seems the argument is:

"It is best to validate external input at the boundary as soon as it is received, but it can be easy to forget to do so and that's dangerous. So to help you not forget, have all implementing functions take a different type then the type of the external input, which will remind people... Oh right I need to parse this thing first and in doing so assert it's valid as well.

Iazel · on May 15, 2021

Well said! I would only like to add that I highly discourage adding validations/assertions in the actual data class, this often make them hard to work with and reuse. It is better to have this parsing logic as a simple function, perhaps at factory level if you prefer that kind of flavor :)

TheAceOfHearts · on May 15, 2021

Apologies if I did a poor job of explaining, what you wrote seems in agreement with what I was attempting to convey.

If one were only using integer types then the same problem would persist, that's correct. The problem would be solved by defining our limit type to only represent positive integers up to a specific safe value.

Type refinement is done on the input boundaries of the system during runtime to prevent errors from propagating.

mbildner · on May 15, 2021

This is not yet possible in Typescript, but imagine if you could define a numerical subtype that requires your input be below some threshold eg:

`type Limit = 0..100;`

See discussion here: https://github.com/Microsoft/TypeScript/issues/15480

steventhedev · on May 15, 2021

There's an entire class of vulnerabilities caused by having separate verification and parsing logic, typically with fields that usually only one is used, but the format supports multiple. The verifier checks the first one but the parser uses the last one.

StreamBright · on May 15, 2021

Interesting naming. Strongly typed languages (especially in the ML family) have best practices that include using types instead of strings as function parameters. Email type itself is enough to skip validation in each function accepting that particular type.

I think this is great first step using functional languages but you can go much much deeper than that.

https://www.slideshare.net/ScottWlaschin/the-power-of-compos...

finnh · on May 15, 2021

It encourages people to use strongly typed classes rather than primitives, even if the type simply wraps a primitive.

As a result you can't pass a invalid (say) accountID deep into your code, bc validity is guaranteed to be checked early when you "parse" an input string into the "AccountId" type.

So: internal interfaces defined using non-primitive types, so internal methods don't need to keep validating their input. Conversion to said types happens early and predictably, catching bad values before they (eg) hit the database.

jinwoo68 · on May 15, 2021

As they said in README, it’s inspired by Alex King’s Parse, don’t validate [1].

Basically, rather than write a validation function, write a parser that returns a result of a specific type and use that type everywhere else. Then you can make sure the raw inputs are always validated.

[1] https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...

anaerobicover · on May 15, 2021

Small correcting, the author is named Alexis

jinwoo68 · on May 15, 2021

Whoops. Thanks for correcting.

coderintherye · on May 15, 2021

The linked blog post explains it pretty well. Essentially, it seems to be solving for unexpected cases or incorrect validation by using of static typing and passing the expected type back in the return rather than a boolean. I'm not sure I've encountered enough issues with validation functions to use this pattern, but it does seem like a more robust way of writing them.

rdedev · on May 15, 2021

I find this approach combined with phantom data types really cool. Now you can easily introduce a semantic differentiation between two instances of the same data type but without much overhead

throwawayboise · on May 15, 2021

I do as much of this as I can with database constraints. Foreign key constraints, or check constraints, or even triggers if necessary (though I do try to avoid them).

Databases tend to outlive application code, or may be fronted by different applications (internal vs external for example). Keeping the constraints with the data is the best way to ensure that your data remains consistent within itself.

jhardy54 · on May 15, 2021

I do this too, but I’m always frustrated by the mismatch between database constraints and application constraints. For example, when using Django you can declare a field as varchar(32) but that constraint isn’t checked until you actually insert the row into the database. I suppose maybe that’s not a problem in languages with more mature type safety ecosystems?

Iazel · on May 15, 2021

Yeah, I've also worked with weak type systems in the past too (PHP, Ruby, JS), so I can definitely share the pain! I learned the hard way how much easier it is to build complex systems when you have a compiler helping you ;)

teddyh · on May 16, 2021

Please do not confuse strong/weak typing with dynamic/static typing. Many dynamically typed languages are also weakly typed, but some dynamic langugages, like Python, are strongly typed.

Iazel · on May 16, 2021

Fair point, thanks for calling it out. I should have said "statically checked, strongly typed languages"

jhardy54 · on May 15, 2021

What are you building with now? Rust/Go/something snazzy?

Iazel · on May 16, 2021

I'm using Kotlin and still PHP in some legacy part at work, but whenever I can in my free-time I use Haskell, Elm and read about FP concepts. What about you? :)

lolinder · on May 16, 2021

Judging from the library they wrote, I'd guess Kotlin.

Iazel · on May 15, 2021

I see, this is also an interesting approach and definitely have its usages. Thinking about it, though, it has its own limitations when it comes to scalability and business requirements naturally out from the database box, eg: how would you ensure an S3 file reference is actually valid and it does exist?

mixedCase · on May 16, 2021

You can't, since the file can be deleted from under you after verification anyway. But you can treat it as an "S3FileReference" or something arbitrarily generic like "RemoteFile" and have associated procedures/methods for verifying the properties you want at any point without confusing it with "just a string"

twic · on May 15, 2021

Great, but why do you need a library for this? I just write classes with a falliable static parse method and a private constructor.

It looks like this library was written by someone labouring under the mistaken belief that it's better to build and use a DSL to create the illusion of declarativity than to just write a line or two of normal code (eg the focusedParse stuff).

Also, i demur somewhat at calling this parsing. It's tracking validation using typestate.

mixedCase · on May 16, 2021

You'll want to read on "parser combinator" as a concept.

A parser combinator library can provide a good repertoire of primitives, standard interfaces for defining and using the parsers, and utilities for working with those interfaces.

User-friendly error reporting comes to mind.

slver · on May 16, 2021

This only matters when you parse complex syntax. Most input isn’t.

mixedCase · on May 16, 2021

Boring Line-of-Business Applications have their bread and butter by parsing JSON and treating it with more meaning than what `string | number | boolean | object` can express. Therefore, they can benefit from parsers.

If you think JSON primitives are "enough to express most input" then you missed the point; because of course you can, but no person who has learnt to use types to their advantage would want to write error-prone code and an unnecessary amount of unit tests that deal with ambiguous behavior for no good reason.

slver · on May 18, 2021

Parsing JSON with various slightly specialized string fields like "email" or "url" and so on is precisely what I had in mind by simple input.

There's absolutely nothing there for "parser combinators" to do.

JSON parsing is a standalone step, and the values retrieved from it can be passed to another EXISTING (mind you) validators to handle the rest.

I live and breath this every day as I deal with forms and APIs for a living. And I never was like "oh man this is hard, if I only had parser combinators".

> If you think JSON primitives are "enough to express most input" then you missed the point

Unclear why you're trying to put words in my mouth, but JSON is enough to describe the composition of input (in terms of lists, dictionaries and leaf scalar values). Which as simple as it is, turns out to be a very significant part of the task.

Of course you have additional constraints on top of that.

But no need for "parser combinators". To JSON, a string is a string. Any additional operations performed on that string happen AFTER it's decoded from a JSON string format to an in-memory native string literal. Those are entirely independent steps. A simple function composition would do i.e. json => string => email.

And that's more or less how I parse arbitrary data input. All of which is simple to parse. Relative to, you know, say parsing a full programming language.

Iazel · on May 16, 2021

Talking about "parsing", I see that different people have a different understanding of it. I will fix this point by writing in the README what we mean when we way "parsing" :)

Allow me to disagree with the "tracking validation using typestate" bit. This is just one way, but using Parsix you can also go from a String to a more complex type like Email(local: Local, domain: Domain), where Local and Domain can also be other complex types. The point is, you should parse as much or as little as you need in your domain. That's why we call it "general-purpose parsing" ;)

Iazel · on May 16, 2021

If that style works for you, please be my guest! As usual, this library is just a tool, it may or may not fit your use case, and that's fine ;)

By no means we want to pass the message that you have to use this library to benefit from the "Parse, don't validate" pattern. What we want is to make the pattern more mainstream and provide a nice set of functionalities out of the box :)

Also I would like to encourage everyone to get as much as you need from this library. If you don't like `focusedParse`, please don't use it! xD

billytetrud · on May 15, 2021

To me this just looks like they're arguing for using class types rather than raw strings. The parsing seems kind of orthogonal and a special case of the kinds of validation you might want to do.

It's also misleading in that the code is still doing validation, just in a different place.

didibus · on May 15, 2021

Yeah, but I think it's even more so, they're arguing that you should model the fact that something has been validated or not, and functions should indicate if they expect a validated form of input or not.

In that sense, using types is only one way to do this, but you could model that in other ways. For example:

    var foo = "123"
    foo = validFoo(foo)
    print(foo)
    > {"value" : "123",
       "valid?" : true}

And now you could have:

    function bar(validFoo) {
      if (!validFoo.get("valid?"))
        throw new InvalidInputException("Foo must be validated prior to calling bar.")
      ...

}

Now types are a convenient way to do this that also gives you static checking for it, but I believe the idea is more to model that things were validated and expects validated input or fail.

That allows you to push all validation at the boundary, and make sure that no one ever forgets to validate the input, because if they do, the inner functions will fail reminding the caller: Please remember to validate this!

billytetrud · on May 15, 2021

Makes sense. It just seems like parsing is kind of a separate issue and shouldn't be entangled with the concept of input validation.

Iazel · on May 16, 2021

Interesting, would you mind explaining why parsing and validation are two different concerns?

billytetrud · on May 16, 2021

When I think of "parsing" I'm thinking of string parsing. But for example, validation of a number being in range or an obect containing a correct set of properties is not in the realm of string parsing, but is in the realm of validation.

alserio · on May 15, 2021

I mean, yes, the point is that it is a better place to do the validation step. Also, parse is generic to mean from a representation to a more structured one.

Iazel · on May 15, 2021

Yes, it is basically a combination of proofing some data has been validated by encoding this proof in a specific type, like Email :) We want to popularize this idea and make it easier to work with it by offering some nice, type-safe abstraction.

alex_duf · on May 15, 2021

I think this can be summarised by "model your domain by using types, then let the compiler ensure you're not doing anything silly"

hinkley · on May 16, 2021

Also known as

Stop Using Stringly Typed Data

andrenarchy · on May 16, 2021

For TypeScript there is the pure functional https://github.com/paperhive/fefe

Less boilerplate than the other solutions I know... for example it allows you to define types easily just by writing a parser/validator ("transformation" in fefe terminology).

araknafobia · on May 15, 2021

What forces you to use validator function and not just do Email(input)? This is how you actually do it in Scala if anyone interested https://github.com/fthomas/refined

adamgordonbell · on May 16, 2021

Yeah, refined types are a great solution to this problem, but can lead to slow compile times. Just having private constructor for a case class and a public apply method that is an Option or Either or something can go a long way in my experience.

samatman · on May 15, 2021

As a minor point of order, the exact phrase "parse, don't validate" has been conventional wisdom in langsec circles since I got involved, so 2014 at the earliest.

I asked around on the work Matrix as to who actually coined it, but it's the weekend.

This is not to take anything away from @lexi_lambda, who cited her sources and documented an interesting type-theoretic approach to applying the principle. She did a great job!

If anyone wants to do a deeper dive, look into langsec, language-theoretic security. There's a lot of prior art to explore.

matt-noonan · on May 16, 2021

The principle was certainly known, but I think Alexis really does deserve the credit for the catchy "parse, don't validate" wording. A Google search for that phrase, restricted to October 2019 and earlier, has no results (or rather, the results that do show up all are more recent additions such as comments, appended to previously-existing content)

samatman · on May 16, 2021

I assure you this isn't the case, I have personally heard it said as early as 2014.

Meredith Patterson got back to me and attributes it to Sergey Bratus. We're a little vague on when, but it was quite some time ago.

It's a great blog post, and it popularized the slogan, which is the important part. She was quite clear in the original post to cite langsec, everyone here is on a collegial basis.

My point was not really about 'credit', it was about langsec. If the ideas in this library, and that post, are interesting, there's a lot more to discover in langsec. That's it.

Iazel · on May 16, 2021

Cool, didn't know that, thanks for sharing!

skybrian · on May 15, 2021

This library seems to be providing a framework and doesn’t include any interesting parsers. (There is no email address parser, despite the example.) It seems to allow for some composition of parsers, but the basic idea is a design pattern that’s simple enough that it doesn’t obviously require a framework.

So it seems like most of the value comes from standardizing on domain types like Username, Email, and so on. Using a framework doesn’t get you there, and it adds a dependency on the framework.

Iazel · on May 15, 2021

Hi skybrian, would you mind explaining why do you see this as a framework?

About missing interesting parsers, you are right, for now only the core part is done. Based on community interest, we will work on complementary packages, like more common parsers, easy integration with a web framework like ktor, effectful parsers based on coroutine, etc...

Lots of work ahead :D

skybrian · on May 16, 2021

Well, it's a minimal framework. Parsers are supposed to implement a particular function definition [1] and use the "Parsed" type for their return value, which if widely adopted, will result in references to the core library's types appearing in a lot of API's.

The advantage is that you can write code using generics that works with any parser, but I'm a little skeptical about the value of generic code.

[1] https://github.com/parsix/parsix#build-your-own-parse

Iazel · on May 16, 2021

I see, but isn't that true for any library? xD

In the end, `Parsed` should only be handled whereever you receive the initial input, so the rest of your business logic will be free of it :)

In the end, this is just a tool and it's fine if it isn't fitting your particular style or use case ;)

skybrian · on May 17, 2021

Yes, there's nothing particularly wrong with it, but core types should be standardized or you end up with multiple string libraries (as in C) or multiple dependency injection libraries (as in Java). Validation libraries already exist for Java.

So if this sort of thing takes off there is likely to be a standardization process at some point.

Go works better for this because common interfaces can be implemented by "coincidence." (Structural types.)

Iazel · on May 18, 2021

I see and I would love for this to become a standard :D But I guess that will require quite some time and a lot of luck

robertlagrant · on May 16, 2021

Interesting stuff. Having read Parse, Don't Validate, it would be good to know how to face potential practical pitfalls, such as:

1) how to serialise these types into different formats (e.g. "how do I save you in a SQL database?) - inside the type, which requires one size fits all, or as separate mapping functions, which could result in significant code proliferation?

2) how to cope with partial validation, e.g. do I need a wrapper class for (say) some form inputs that have these types in its fields, but override nullable on if it's off?

3) sort of a combo of (1) and (2), but how do I serialise groups of these typed fields in different ways? E.g. I want to save a model made up of these typed fields via SQLAlchemy and I also want to publish it to RabbitMQ. How do I manage the potentially varied formats in a type-safe way? When everything was Just Strings™ it appeared to Just Work™.

4) how to compose these things with monads - e.g. if I have an optional field, should Maybe treat the value as nullable, or should the type itself contain that option for validation purposes elsewhere?

Even as I'm writing these questions I'm coming up with a potential pattern for using this sort of thing, but I'm curious to hear thoughts.

Iazel · on May 16, 2021

Hey, thanks for starting this discussion, I think it will be an interesting one :)

1) Just to be clear, when I read "serialise" I understand that you want to parse some object into a format suitable for storage. I would honestly go with the common flow: use some library like kotlinx.serialization or whatever your database library support. If this isn't what you meant, please let me know.

2) I personally like well defined type. For example, if you have a case where Email must be like Email(local, domain, tld) and another where you just want a String out of your email, I would rather use two different types for the two different use cases and therefore different parsing logic (which will probably share most code, but still produce different results). On the other hand, if you want to model the fact that some piece of information could be provided or not by the external source, than yes, I would just go with nullable types, which in Kotlin is like `Something?`, but in other language are known as Maybe or Optional.

3) Just to be clear, I will strongly advice against having validation in object constructors. Once this is out of the picture, you are left with simple Value Objects: simple, immutable, bundle of data with a defined structure and semantic meaning. When you work with these simple objects, it's very easy to derive different views for whatever use case you need. I'm not familiar with SQLAlchemy and Python ecosystem in general, but I guess there will be some way of converting a specific type into one that is suitable for DB consumption.

4) Both Parse and Parsed are actually Monads ;) And we know something about Monads, is that composing different Monads together is usually not straightforward xD That's said, in your particular example of having an optional field, I would see the parser produce a Maybe<Something> (or Something? in Kotlin)

Let me know if I got you right and what you think about it :)

ledauphin · on May 15, 2021

I've been looking for a solid Typescript implementation of "parse don't validate" that performs runtime parsing using semantics attached to the defined Typescript types themselves. In other words, much like attrs for Python, I want to be able to define a low/no-boilerplate type, and then register parsers for those types that will work recursively to parse my data, resulting in the specified Typescript type.

Has anyone seen or written something like this?

bmuon · on May 15, 2021

I've been using this small library inspired by Elm/Swift decoders [1]. It works, but it's not low boilerplate.

I'm gravitating towards GraphQL now because strict parsing is built into it, so there is no need for all this boilerplate.

https://www.npmjs.com/package/@mojotech/json-type-validation

ledauphin · on May 15, 2021

we use GraphQL for this purpose as well, but I'd also like to be able to validate across other boundaries.

However, as I'm saying this, I wonder if I've been looking at this problem wrong. Since we already generate types from GraphQL schemas, maybe I should figure out how to use the same client side parser that's already in my GraphQL client, define a GraphQL schema for the types I'm interested in, and then just generate and use those types.

One thing that doesn't necessarily give me is the ability to define custom parsers corresponding to custom types. At least, I think most of that sort of thing is usually done server side with GraphQL.

So, thank you for the link and also the inspiration for considering an alternative.

renke1 · on May 15, 2021

Not exactly what you want, I think, but there is zod [0].

I really would like to see nominal typing support in TypeScript. Currently, it's hard to validate a piece of data (or parse for that matter) once and have other functions only operate on that validated data. There are (ugly?) workarounds though [1].

[0]: https://github.com/colinhacks/zod [1]: https://gist.github.com/dcolthorp/aa21cf87d847ae9942106435bf...

brundolf · on May 15, 2021

Use io-ts: https://github.com/gcanti/io-ts

You define a decoder schema, and then the resulting TS type gets automatically derived for you. You can then run data through the decoder, it will err if there's a mismatch, or return a value of the inferred type otherwise.

iddan · on May 15, 2021

It is definitely possible as Flowtype got it right. I hope one day it will come to TypeScript as well

jart · on May 16, 2021

Didn't really understand the linked blog posts, but "Parse Don't Validate" sounds like another way of stating Postel's Maxim, which is to be liberal in what you accept but conservative in what you send. I generally try to follow that principle with the way I've written the redbean web server. When it comes to things like URLs and HTTP messages, there's a whole bunch of characters in each field that the RFCs say aren't allowed, but the parser code is still able to work fine even if they're permitted. For example, the RFC says ` isn't allowed in a path, but why should the parser care? It's not like that's a syntactic element. There's no reason to consult a map of validate characters, which slows things down, until you want to re-encode the parsed URL. The re-encoded version will turn the parsed ` into %60 since that's what you might send back to the network, and that needs to be perfect, since you don't know if the other end has one of those strict as a matter of policy parsers, that ignore what's possible and enforces only what's legal.

Iazel · on May 16, 2021

Hey there, I wonder if this article I wrote about the same concept would make it more clear? It's on medium, but free :) https://pelligra-s.medium.com/parse-dont-validate-in-kotlin-...

I honestly think the two arguments are different though. We could say that Postel's Maxim is about "how much you should parse/validate", while the problem we are addressing is rather "how you should do it" :)

jart · on May 16, 2021

Oh I see. You define parsing more in the yacc sense. Carry on. I've been working with char arrays too long.

IshKebab · on May 16, 2021

I'm not sure if you're aware but Postel's Maxim is fairly universally viewed as a huge mistake. It leads to security issues and poor reliability.

jart · on May 16, 2021

Says you. There's an Internet draft where some guy from Mozilla basically said "Postel was wrong; the solution is to always be conservative" https://datatracker.ietf.org/doc/html/draft-iab-protocol-mai... However it's a draft so he only speaks for himself; it's not Internet law. He also doesn't get it. Deviancy being ignored by a parser shouldn't be interpreted as intent to support. Non-conforming use cases are accepted at the pleasure of the implementor.

Consider this. The Internet never would have become popular and inclusive like it became if it was designed to be highly strict, structured, and ordered like Signalling System 7. I like the fact that a 14 year kid can write an HTTP client or server, if they want to, and if they do it'll most likely be mostly correct. Stuff like that makes people fall in love with the Internet from the earliest ages. If we look at the history of companies like DEC vs. PC it's pretty clear technology can't survive >1 generation when there's no career track for hobbyists.

In some cases deviancy in protocols has helped standards become better. Consider HTML. If we had things Tim Berners Lee's way, we'd all be writing ugly verbose XHTML that validates. It wasn't until HTML5 that W3C embraced the chaos of the true standard and formalized the non-conformant use-cases (such as being able to omit <html>, <head>, </tr> </td>, </p>, etc.) and it made the HTML language much more pleasant.

So we should take the concerns of the "always be conservative" crowd with a grain of salt. Because they're the people who wanted XHTML. They're the people who ban numbers like 0xC0,0x80 because some version of some Oracle or Microsoft piece software had a bug once. I don't like how people who embrace policies like that invariably write code that destroys text data in some misguided effort to keep people safe from themselves. There should be better reasons for preventing things which are possible from happening.

nabla9 · on May 17, 2021

I agree that being relaxed enables growth.

The negative consequences come later. Protocols become cultural artifacts with history. You need to study the history of bugs and 'features' to learn the de-facto standard.

To write a good HTTP client that parses and renders HTTP/HTML/CSS/ in the wild from the scratch you need to learn how Chrome, Safari interpret and implement specs.

It's possible to write future proof strict protocols with detailed instructions on how to handle new versions, unknown extensions, etc. gracefully.

IshKebab · on May 16, 2021

> Says you.

Says lots of people. I actually wasn't even aware of that draft RFC you linked.

> we'd all be writing ugly verbose XHTML that validates

And you think that's a bad thing???!!

> It wasn't until HTML5 that W3C embraced the chaos of the true standard and formalized the non-conformant use-cases (such as being able to omit <html>, <head>, </tr> </td>, </p>, etc.) and it made the HTML language much more pleasant.

That is a crazy rewriting of history!

> I like the fact that a 14 year kid can write an HTTP client or server, if they want to, and if they do it'll most likely be mostly correct.

What I don't even... Have you seen HTTP?

jart · on May 16, 2021

I have seen HTTP. I even wrote an HTTP server once, for fun. It was pretty easy. https://news.ycombinator.com/item?id=26271117

IshKebab · on May 17, 2021

But is it correct? I would be surprised.

In any case easy for you is clearly not a reasonable measure!

Waterluvian · on May 15, 2021

I developed a pattern in typescript (I'm sure it's not original) where I have an interface describing an API entity and a class of the same name with only static methods, one of which is Foo.fromApi() that validates and parses.

I haven't seen any need to bring a library in to handle this. Though it would be nice to marry the worlds of TS, API, and Json Schema.

lhnz · on May 15, 2021

It doesn't use json schema but you might be interested in something like this: https://gcanti.github.io/io-ts/

(You can define runtime encoder/decoders which produce typed values.)

brundolf · on May 15, 2021

io-ts is fantastic (I linked it myself above). The killer feature is that it infers the static types of your runtime schemas for you, so you don't have to define them twice. You make a change to the schema, the rest of your code will typecheck against it.

slver · on May 16, 2021

Wrapping primitives comes with its own set of problems. Like having to unbox it all the time for things like basic display.

So you either validate it all the time or you unbox it all the time.

Also with dedicated primitives you need to validate the email when you read it from the dB to hydrate it. So you removed couple of validations here but added a couple there.

My solution is stick to basic DTO and validate in the end at the service receiving the data, and let the errors, if any, propagate up the chain in a way that preserves where the input came from. In most cases I validate once.

So this issue we’re talking about. Doesn’t happen.

Iazel · on May 16, 2021

I guess it depends on how you structure your code.

If you validate at constructor level, then you have this problem. However if you actually parse it, then this will have to be external to the actual class and you are left with a simple Value Object. This means that you can apply constraints only when it makes sense. Most of the time, there is little reason to validate something coming from the DB, so I would rather skip it.

About unboxing, that depends. I would expect for the overall business logic to always use the more structured types, probably this will only be needed once you need to serialize them :)

That's said, if you are happy with your current way, please keep doing it! I would still suggest to give this style a try though :)

devit · on May 16, 2021

"Parse don't validate is suboptimal"

What you ideally want is a first step that "validates", i.e. creates a representation from text that is easy to use but also succeeds for any input, and then a further "parsing" step that converts it to another representation such that the parsing only succeeds for valid input and is invertible for any value of the representation type, and then finally code that checks constraints that aren't captured by the type system.

This way you can support IDEs that need to edit partially incorrect code.

Iazel · on May 16, 2021

Hey, I fail to see how this is suboptimal. Some points I don't understand: * What's the goal of a validation that always succeed? * What's the value of checking constraints after you parsed it and not capture this information?

In the end, parsing is just a combination of what you described: ensure something fit a particular shape and constraints :)

smnrchrds · on May 15, 2021

Fun fact: Parsix was the name of a Linux distro optimized for Persian speakers.

brundolf · on May 15, 2021

Similar thing for TypeScript: https://github.com/gcanti/io-ts