I wish JAX worked on Windows natively (without using WSL). I teach a very high-level intro to NumPy and would _love_ to have my students try JAX. These students are relatively new to programming, and the idea of using a Linux shell or having to compile anything themselves just wouldn't work.
I'm excited about Deno, but I'm finding that the docs still need improvement. For example, I'm trying to build a TCP server, and I can't find information on how back-pressure is handled.
I can see that Deno.listen returns an object which implements reader and writer interfaces, but it isn't clear to me how to look for events, such as a disconnect or the arrival of new data.
I wish there were examples showing how to correctly parse frames or implement protocols.
I'm sure these things will be expanded over time, partly by programmers in the community, but from the outside, things are still a bit rough.
Their payment model is attractive in its fairness:
The micropayment an author gets is proportional to the time a paid reader spent reading the article, relative to the total time that reader spent on Medium.
And friend links open free access to those who aren't aware of, or don't like, this model.
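As a rough illustration of that proportionality, here's a tiny sketch. The function name, fee, and times are all made up; this just shows the arithmetic the model implies, not Medium's actual formula.

```python
# Hypothetical illustration of the proportional payout model described above:
# an author's micropayment from one reader is the reader's subscription fee
# scaled by the fraction of that reader's total reading time spent on the article.

def payout(subscription_fee, seconds_on_article, total_seconds_on_platform):
    """Author's share of one reader's fee, proportional to reading time."""
    return subscription_fee * seconds_on_article / total_seconds_on_platform

# A reader paying $5/month who spent 30 of their 300 total minutes on this
# article contributes $0.50 of their fee to the article's author.
print(payout(5.00, 30 * 60, 300 * 60))  # -> 0.5
```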
I am a data architect in my day job. Within the realm of data management, I'd say "metadata management" [1] is the general category this fits within.
I would say yes, this idea is known/very common, as data architecture is as much about the descriptive language we use as anything. I mean, "business glossaries", taxonomies, even just naming conventions [2] in coding - these are all related.
If you build enough databases/tables, or even code, yourself, you inevitably come across the "how to name things" problem [3]. If all you have to sort on for the known meaning of a thing (column, table, file, etc.) is a single string value, then encoding meaning into it is quite common. This way, a sort creates a kind of "grouping". Many database vendors follow standard naming conventions - Oracle, for example [4]. It is considered a best practice, when designing/building the metadata for a large system, to establish a naming convention. Among other things, it makes finding things easier, and it opens up all sorts of potential for automation.
You get all kinds of variations on this, such as whether the "ID_" should come as a prefix or a suffix (i.e. "_ID"). One's initial thought is to use it as a prefix so all the related types group together, but that makes it much more difficult to sort items by their functional area (e.g. DRIVER_ID, DRIVER_IND, etc.).
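The prefix-vs-suffix trade-off is easy to see with a plain lexical sort. The column names below are made up for illustration:

```python
# Made-up column names illustrating the prefix-vs-suffix trade-off.
prefixed = ["ID_DRIVER", "ID_VEHICLE", "IND_DRIVER", "DT_TRIP"]
suffixed = ["DRIVER_ID", "VEHICLE_ID", "DRIVER_IND", "TRIP_DT"]

# Prefix style: a plain sort groups by *type* (all the IDs together).
print(sorted(prefixed))  # ['DT_TRIP', 'ID_DRIVER', 'ID_VEHICLE', 'IND_DRIVER']

# Suffix style: a plain sort groups by *functional area* (all DRIVER_* together).
print(sorted(suffixed))  # ['DRIVER_ID', 'DRIVER_IND', 'TRIP_DT', 'VEHICLE_ID']
```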
One other place you see something similar is in "smart numbers" which is an eternal argument - should I use a "dumb identifier" (GUID, integer) or a "smart one" (one encoding additional meaning) [5].
I mean, basically, any time you can encode information in the meta-data of data, I think you can then operate on it by following "convention over configuration" (as mentioned elsewhere in the discussion comments).
The only problem I see is that such conventions can at times be limiting - depending on the length of your metadata columns and the variability you are trying to capture - which is why I believe metadata is generally better separated from, and linked to, the data it describes. This decoupling allows for much more descriptive metadata than one could encode in a single string value. Certainly, you can get a long way with an approach like this, but I suspect you would run into 80/20-rule limitations.
Using naming in this way is a form of tight coupling, which in some cases could be seen as an anti-pattern in terms of metadata flexibility.
In terms of database normalization, delimiting multiple fields within a column-name field violates the "atomic columns" requirement of the first through sixth normal forms (1NF - 6NF).
Are there standards for storing columnar metadata (that is, metadata about the columns; or column-level metadata)?
In terms of columns, SQL has (implicit ordinal, name, type) and then primary key, index, and [foreign key] constraints.
RDFS (RDF Schema) is an open W3C linked data standard.
An rdf:Property may have a rdfs:domain and a rdfs:range; where the possible datatypes are listed as instances of rdfs:range. Primitive datatypes are often drawn from XSD (XML Schema Definition), or https://schema.org/ . An rdfs:Class instance may be within the rdfs:domain and/or the rdfs:range of an rdf:Property.
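To make the domain/range idea concrete, here's a minimal sketch using plain Python tuples as stand-in triples. The `ex:` names are invented, prefixes are abbreviated, and a real system would use full IRIs and an RDF library rather than bare strings:

```python
# A minimal sketch of RDFS-style domain/range statements as plain triples.
# "ex:birthDate" and the prefix abbreviations are illustrative only.
schema = {
    ("ex:birthDate", "rdfs:domain"): "ex:Person",  # subjects must be Persons
    ("ex:birthDate", "rdfs:range"):  "xsd:date",   # values must be xsd:date
}

def expected_range(prop):
    """Look up the declared rdfs:range of a property, if any."""
    return schema.get((prop, "rdfs:range"))

print(expected_range("ex:birthDate"))  # -> 'xsd:date'
```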
RDFS is generally not sufficient for data validation; there are a number of standards which build upon RDFS: W3C SHACL (Shapes and Constraint Language), W3C CSVW (CSV on the Web).
There is some existing work on merging JSON Schema and SHACL.
CSVW builds upon the W3C "Model for Tabular Data and Metadata on the Web", which supports arbitrary "annotations" on columns. CSVW can be represented in any RDF serialization: Turtle/TriG/N3, RDF/XML, JSON-LD.
> an annotated tabular data model: a model for tables that are annotated with metadata. Annotations provide information about the cells, rows, columns, tables, and groups of tables […]
> A .meta protocol should implement the W3C Tabular Data Model: [...]
...
The various methods of doing CSV2RDF and R2RML (SQL / RDB to RDF Mapping) each have a way to specify additional metadata annotations. None of them stuffs data into a column name (which I'm also guilty of doing with e.g. "columnspecs" in a small line-parsing utility called pyline that can cast columns to Python types and output JSON lines).
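As a sketch of what stuffing type information into column names looks like in practice, here's a hypothetical "columnspec" scheme (this is not pyline's actual syntax; the `name:type` convention and caster table are invented for illustration):

```python
# A hypothetical "columnspec" scheme: each header cell encodes "name:type",
# and a caster function is looked up by the type token.
CASTERS = {"int": int, "float": float, "str": str}

def parse_header(cells):
    """Split 'name:type' header cells into (name, caster) pairs."""
    specs = []
    for cell in cells:
        name, _, typename = cell.partition(":")
        specs.append((name, CASTERS.get(typename or "str", str)))
    return specs

def parse_row(specs, cells):
    """Cast one row of string cells according to the header specs."""
    return {name: cast(cell) for (name, cast), cell in zip(specs, cells)}

specs = parse_header(["id:int", "score:float", "label"])
print(parse_row(specs, ["7", "3.5", "spam"]))
# -> {'id': 7, 'score': 3.5, 'label': 'spam'}
```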
...
Even JSON5 is insufficient when it comes to representing e.g. complex fractions: there must be a TBox (schema) in order to read the data out of the ABox (assertions; e.g. JSON). JSON-LD is sufficient for representation; and there are also specs like RDFS, SHACL, and CSVW.
I see the line of thinking you're going down. There are ISO standards for data types, so in a sense I can see why one would seek a standard language for defining the metadata/specification of a type as data. I'll have to think about that some more... In a way, a regex could be seen as a compact form of expressing the capability of a column in terms of value ranges or domains - but for defining the meaning of the data, not so much.
Your interpretation of the atomic-columns requirement is a little different from my understanding. That requirement of normalization only applies to the "cells" of columnar data; it says nothing about encoding meaning into column names, which are themselves simply descriptive metadata.
I mean, for sure you wouldn't want to encode many values/meanings into a column name (some systems have length restrictions that would make that impossible, I'm not sure it makes sense anyway), but just pointing out that technically the spec does not make that illegal. Certainly, adding minor annotations within the name of a column separated by a supported delimiter does not, in my opinion, violate normalization rules at all. I mean things like "ID_" or similar.
Have you looked at INFORMATION_SCHEMA in SQL databases? [1] You mentioned SQL metadata and constraints; INFORMATION_SCHEMA is as close to a standard feature for querying that information as there is, and some databases expose it in similar but non-standard ways (Oracle, for example).
Also, while not standard, many relational databases support extended properties or metadata for objects (tables, views, columns, etc.) - you can often come up with your own scheme, although I rarely see people utilize these features. [2] [3]
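As a runnable taste of column-metadata querying: SQLite (which ships with Python) doesn't implement INFORMATION_SCHEMA, but its PRAGMA table_info answers the same question. The table here is invented for the sketch:

```python
import sqlite3

# SQLite's analogue of INFORMATION_SCHEMA.COLUMNS: PRAGMA table_info
# returns one row per column with its name, type, and constraint flags.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE driver (driver_id INTEGER PRIMARY KEY, name TEXT)")

# Each row: (cid, name, type, notnull, dflt_value, pk)
for cid, name, coltype, notnull, default, pk in con.execute(
        "PRAGMA table_info(driver)"):
    print(name, coltype, "PK" if pk else "")
```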
At some point it feels like we are more talking about type definitions and annotations, applied to data columns.
Maybe like, BNF [4] for purely data table columns (which are essentially types)?
Can you describe a bit more about what is going on in the project? The file you linked is over 2.5k lines of C++ code, and that is just the “setup” file. Since you say this is supposed to be a statistical model, I expected it to be in R, Python, or one of the standard statistical packages.
Oh gosh yes, the amount of `just works` Fortran in science is one of those things akin to COBOL in business. I know some people are thinking 10 years - ha, there will be instances of 40 and possibly 50 years for some. Heck, the sad part is that many will have computer systems older than 10 years simply because they're linked to some bit of kit: the RS232 still works fine with the DOS software, and the updated version had issues the last time they tried it. That's a common theme with specialist kit attached to a computer for control - medical equipment has that as well.
I know two fresh PhDs from two different schools whose favorite language is Fortran. I think it's rather different from COBOL in that way -- yes, the old stuff still works, but newer code cuts down on the boilerplate and is much more readable. And yeah, the ability to link to 50-year-old battle-tested code is quite a feature.
It is essentially a detailed simulation of viral spread, not just a programmed distribution or anything. It's all in C++ because it's pretty performance-critical.
Because much of this code was written in the 80's, I suspect. In general, there's a bunch of really old scientific codebases in particular disciplines because people have been working on these problems for a looooonnngg time.
I love this. The code is simple and documented. However, whenever I’ve tried to understand autograd, I get stuck at dual numbers.
As a programmer, I understand building up a computation graph where each node is some sort of elementary function which knows how to take its own gradient. So a constant/scalar node has a derivative/gradient of zero, x^n has a derivative of nx^(n-1), etc. These gradients are passed from the end to the beginning according to the chain rule, and so on.
However, autograd is not supposed to be the symbolic differentiation we learned in high school.
This project doesn’t seem to have anything to do with duals...confused!
There are two ways to implement autograd, reverse-mode and forward-mode. Reverse-mode is what minigrad uses, and what most ML libraries these days use by default, since it computes gradients of all inputs (wrt one output) in a single pass. It's exactly what you describe in the 2nd paragraph.
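To make the reverse-mode description concrete, here's a toy sketch (this is not minigrad's actual code; the `Value` class and its methods are invented for illustration). Each node records how to push its output gradient back to its inputs, and backward() walks from the end applying the chain rule:

```python
# A toy reverse-mode autograd sketch: each node stores (parent, local_gradient)
# pairs, and backward() propagates the chain rule from output to inputs.
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents  # (node, local_gradient) pairs

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data,
                     [(self, other.data), (other, self.data)])

    def backward(self, grad=1.0):
        self.grad += grad
        for parent, local in self._parents:
            parent.backward(grad * local)  # chain rule, one path at a time

x, y = Value(3.0), Value(4.0)
z = x * y + x          # dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()
print(x.grad, y.grad)  # -> 5.0 3.0
```

(The naive recursion re-walks shared subgraphs once per path, which is correct but inefficient; real implementations use a topological ordering instead.)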
Forward-mode autograd is the technique that can use dual numbers. It computes all gradients of one input (wrt all outputs) in a single pass. Dual numbers are a pretty neat mathematical trick, but I'm not aware of anyone who actually uses them to compute gradients.
The most approachable explanation of dual numbers I've seen is in Aurelien Geron's book Hands-On Machine Learning (Appendix D). There are articles online but I found them more technical.
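For the curious, the dual-number trick fits in a few lines. This is just a sketch with made-up names: carry (value, derivative) together, and the arithmetic rules propagate the derivative automatically:

```python
# A small dual-number sketch for forward-mode differentiation: a Dual carries
# a value and a derivative coefficient of the nilpotent unit e (e^2 = 0).
class Dual:
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps

    def __add__(self, other):
        return Dual(self.val + other.val, self.eps + other.eps)

    def __mul__(self, other):
        # (a + a'e)(b + b'e) = ab + (a'b + ab')e  -- the product rule falls out
        return Dual(self.val * other.val,
                    self.eps * other.val + self.val * other.eps)

def f(x):
    return x * x * x + x   # f(x) = x^3 + x, so f'(x) = 3x^2 + 1

x = Dual(2.0, 1.0)         # seed derivative 1 for the input we differentiate by
y = f(x)
print(y.val, y.eps)        # -> 10.0 13.0  (f(2) = 10, f'(2) = 13)
```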
In terms of semantics, sure. However, a real part of the charm of APL (and J, and K) is in the syntax, and in how notating your program causes you to think about it differently. Something like the or/and outer product in the game-of-life one-liner is very straightforward APL, but is much clunkier to write in numpy.
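For comparison, here's roughly what the rotation-based neighbor count behind the APL one-liner looks like in numpy (one Life step on a toroidal grid; the function and grid are my own sketch, not a translation of any particular one-liner):

```python
import numpy as np

def life_step(board):
    """One Game-of-Life step: sum the 8 rolled copies, then apply the rule."""
    neighbors = sum(np.roll(np.roll(board, dy, 0), dx, 1)
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
    # alive with 2 or 3 neighbors survives; exactly 3 neighbors gives birth
    return ((neighbors == 3) | ((board == 1) & (neighbors == 2))).astype(int)

glider = np.zeros((6, 6), dtype=int)
glider[[0, 1, 2, 2, 2], [1, 2, 0, 1, 2]] = 1  # a standard glider
print(life_step(glider))
```

The APL version packs the same rotations and the or/and reduction into one line; in numpy the shifting, the comprehension, and the boolean rule each take a separate expression.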
I’ve seen this mentioned before, including a blog post by the fast.ai folks. Any idea where I can get details?
If my tabular data set is small, what kind of embedding can I get out of it? Or is the idea that a larger data set is used for embeddings of categorical data?
Pre-trained embeddings are only helpful if they are trained on a different (ideally larger) dataset or even a different task, but with the same kind of input data. So you would need to find out where else something similar to the data in your tables appears. If some of the data is text, word embeddings may be applicable. Or if you're trying to analyze user activity by time and location, you might try to transfer what can be learned about the influence of holidays from publicly observable activity e.g. on Twitter (just a random idea that popped into my head, no guarantee that it can actually work).
Of course if all you have are numbers without context, there isn't a lot you can do to improve the situation.
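A tiny sketch of the transfer idea for a categorical column: look each category up in a pretrained vector table, with a fallback for unseen values. The vectors and category names here are invented; in practice you'd load vectors trained on a much larger corpus (e.g. word2vec or GloVe):

```python
import numpy as np

# Hypothetical pretrained vectors; real ones would come from a large corpus.
pretrained = {
    "red":  np.array([0.9, 0.1]),
    "blue": np.array([0.1, 0.9]),
}
unknown = np.zeros(2)  # fallback for categories missing from the vocabulary

def embed_column(values):
    """Replace a categorical column with pretrained vectors; unknowns -> zeros."""
    return np.stack([pretrained.get(v, unknown) for v in values])

features = embed_column(["red", "blue", "mauve"])
print(features.shape)  # -> (3, 2)
```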
I think this is mainly a thing for perception (images and sounds). Tabular data would have to match up with the training dataset, and "most" interesting tabular models are the sorts of things guarded like piles of gold by the businesses that built them...