
If you're publishing a dataset in the terabytes, it does actually make sense to at least do a pass over it and make sure the data isn't skewed in some undesirable way that would cause problems down the road. For example, if you're releasing 5 TB of face photos for training facial recognition nets, it would certainly be a problem if all the faces were white women or Asian men - the resulting model would probably overfit and not perform as well for people in other categories. It would be correct to call that a diversity/inclusion issue.
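
A minimal sketch of what that pass could look like, assuming (hypothetically) that the release ships with a metadata CSV containing "gender" and "ethnicity" columns - the file name and column names here are made up for illustration:

    import csv
    from collections import Counter

    # Tally (gender, ethnicity) pairs across the whole metadata file.
    counts = Counter()
    with open("faces_metadata.csv", newline="") as f:   # hypothetical file name
        for row in csv.DictReader(f):
            counts[(row["gender"], row["ethnicity"])] += 1

    total = sum(counts.values())
    for group, n in counts.most_common():
        print(f"{group}: {n} photos ({100 * n / total:.1f}%)")
    # If one or two groups dominate this table, models trained on the release
    # will likely underperform on everyone else.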

Privacy and accessibility reviews serve a similar purpose there: you're reducing risk by checking for these various problems, and ideally they also spot ways to improve the quality of your outcomes.



It's common in fintech for data/ML models to go through a similar review. If you happen to disenfranchise a group of people because your model said not to lend to them, you risk legal jeopardy.

To clarify, I think it's good that this is a practice.


The whole point of the model is to find who not to lend to. You are always going to exclude people by definition.


There are so many ways you can accidentally systematize racism in software like automated lending.

In the past there were explicitly racist policies like redlining. That leaves behind a historical data set of loan denials to people in a specific racial group. If that group has other traits that correlate with race - e.g. the neighborhood they live in - then you could presumably end up with a model that doesn't explicitly have race as a feature, but that is trained on that historical data with some subset of racially correlated features and, as a result, disproportionately excludes people of that race.
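
As a toy illustration of that proxy effect (entirely synthetic data, made-up feature names, not any real lender's model), the following sketch trains a model that never sees race yet still denies one group at a much higher rate:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Synthetic illustration: race is never given to the model, but a correlated
    # proxy feature ("neighborhood") lets it reproduce the biased denial history.
    rng = np.random.default_rng(0)
    n = 10_000
    race = rng.integers(0, 2, n)                        # protected attribute (0/1)
    neighborhood = (race + (rng.random(n) < 0.1)) % 2   # proxy, ~90% correlated with race
    income = rng.normal(50 + 5 * (1 - race), 10, n)     # mild legitimate signal
    # Historical labels encode biased denials against race == 1.
    denied = (0.6 * race + 0.02 * (60 - income) + rng.normal(0, 0.3, n)) > 0.5

    X = np.column_stack([neighborhood, income])         # note: race itself is excluded
    pred = LogisticRegression(max_iter=1000).fit(X, denied).predict(X)

    for r in (0, 1):
        print(f"race={r}: predicted denial rate {pred[race == r].mean():.2f}")
    # The gap between the two rates is the proxy leaking the protected attribute.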


I am not sure how one would remove all ageism, sexism, racism, classism, title-ism, and so on from lending. The whole concept is about making a prediction about the future with suboptimal information: guessing who will default on a loan and who won't. The same goes for insurance.

I have been pretty tempted to lie about where I live in order to cut my insurance costs in half. It seems disproportionately harsh that I get lumped together with the people who simply happen to live around me.

Is it possible to make predictions illegal if they are based on historical data from other than the individual customer?


I should clarify, the point is to not discriminate against a protected class.


Tell that to the legislators and prosecutors who create and enforce the laws against you.


Yes, but we should exclude people for valid reasons, not for their race.


A review doesn't necessarily mean you need to resolve all diversity/inclusion issues. It can merely require that you identify the issues and understand the risks of not resolving them.


The 5 TB was performance data collected from servers.


Sounds like the reviewer would glance at it for 5 seconds and say 'ok'


What if some servers were excluded?



