Secret scanning is now available for free on public repositories (github.blog)
186 points by soheilpro on Dec 16, 2022 | 70 comments


Good on them. Secrets leaked on GitHub cause a lot of problems. The world will always build a better idiot, but this idiot trap is long past due.

I also can’t wait until people base64 their creds to get past this. Explaining to some people that base64 isn’t encryption tends to be hard, so I imagine they’ll feel safe just base64-encoding a secret and checking it in.


base64 is far too much work. A new dev turned '"AKIAIOSFODNN7EXAMPLE"' into '"AK" + "IAIOSFODNN7EXAMPLE"' to make the security alert go away.

Thankfully, the alert was sent to enough people it was caught by someone else, and the key was destroyed before someone outside could have fun with it.
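
To illustrate why that trick works: a scanner matches patterns in the source text, not in evaluated strings. A minimal sketch in Python (the regex here is a simplification of real AWS key patterns):

    import re

    # AWS access key IDs have a fixed, highly recognizable shape.
    AKID_RE = re.compile(r"AKIA[0-9A-Z]{16}")

    intact = 'key = "AKIAIOSFODNN7EXAMPLE"'
    split = 'key = "AK" + "IAIOSFODNN7EXAMPLE"'

    # The scanner sees source text, not evaluated strings: the split
    # literal no longer matches, but the program still builds a valid key.
    print(bool(AKID_RE.search(intact)))  # True
    print(bool(AKID_RE.search(split)))   # False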


I remember reading in jshint’s docs that they purposely did not chase this kind of evasion, since at that point the user is clearly fighting the tool.


Why was that change made? I would assume malice or incompetence?


I think we need a new phrase for this: "malicious incompetence"

A similar story from my work: "I didn't read the contents of the red warning screen because I knew I wanted to release my code".


In this case, it was an alarm that ran during build (I think, it's been a while), and used git blame to send a message like "@person: You checked in a secret! Fix this!".

So, they fixed it.


Given the fact that nobody really does that, I think it was a creative and low risk hack.

1. If you are worried about the people who have access to your codebase abusing a secret, you have a serious people problem that needs to be solved immediately and unambiguously. A motivated internal attacker can do almost anything. Organizations live or die on trust. One doesn’t need to scour for keys to break in when they have a badge (or their mate’s) and they built the lock.

2. If you are concerned the secret will be discovered by a generic threat, it won’t be, not with this string concatenation. It’s simply too rare. Should this become common practice, the game would be over, retroactively even. We all saw this unfold with "m y e m a i l at y dot com" obscurity, until the fine folks who worked on ScrapeBox turned up the right regex and started scraping those too.

3. Nothing else to add. You are right, and you responded the right way. Don’t put keys in code, kids. Nobody likes having to erase history in their codebase, especially because of a careless mistake or a deliberate workaround.

I’m just saying… of all the dumb shortcuts that can break stuff, this one is on the mostly harmless end of the spectrum.


I take it from your response that you don't lock your door at night, because a motivated external attacker could just bring a ladder and break a window?


It's actually quite rare to lock doors at night in substantial parts of the world.


Exactly. Pragmatism has a place in this world. Not everywhere, not everywhen, but more than you might think.

Most of the people in my area don’t lock their doors. A decade ago, in a very different neighborhood, we had locks, alarm systems, and hyper-vigilance, and still lost thousands to property theft and damage. That was just 5 miles away.

It would be maladaptive of me to bring the same level of vigilance to a different setting. It wastes resources and clouds one’s ability to trust. It slows you down every day.

I don’t know how I could have been more clear that I don’t endorse committing secrets to code; in fact, I’ve been a champion for code hygiene and security everywhere I work. I just recognize, and think others should as well, that there are diminishing returns on precautions where they aren’t warranted. The returns can diminish so far that they go negative.

Even in the safe neighborhood, one might lock up when they leave for a trip, or make other reasonable preparations to increase security and obscurity.

Yes, I said everywhen, and yes I am asserting that I just coined it and that it’s brimming with greatness :P



TIL and thank you for it. I officially retract my meaningless claim of being its originator.


"Everywhy" is available so you can originate that.


everywhom and their grandwhom has claimed to originate everywhy at least once.


Are you taking the same approach with secret (well, not secret anymore) data in public repositories on GitHub? Note the context of your comment: a thread about secret scanning for public repositories on GitHub.


I do most of my work on private codebases so I went that direction. I had intended to add a qualifier for that but I missed my window to edit.

Agreed, there are many ways that working with public repos makes everything dramatically more difficult. I’m pleased to see GHAS secret scanning become free. I’m not clear on whether that would include the pre-push secret scanning feature; if so, you have a really decent toolset for prevention and detection. Remediation should be as easy as key rotation. Except keys get reused and rotation affects all users... if the keys can’t be rotated, cleanup is a big chore for private repo owners, but de facto impossible with public codebases. There is no way of knowing who has cloned it (without paying for enterprise audit logs).


> low risk hack

That seems like quite a high-risk hack, given that it relies on security through obscurity.


Base64: if this format wasn't secure why are kubernetes 'secrets' stored in it huh?

;)


I legitimately recently had to argue with a PM and his developers, that a base64 encoded user ID isn’t considered security best practices for API authentication. Even when I showed them how I can produce the “secret” myself, they kept arguing that I was wrong.
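
For anyone unclear on why that scheme is worthless, a one-liner reproduces the "secret" (the user ID here is hypothetical):

    import base64

    # base64 is an encoding, not encryption: anyone who knows or guesses
    # a user ID can mint the "secret" themselves, and decode it back.
    user_id = "12345"
    token = base64.b64encode(user_id.encode()).decode()
    print(token)                             # MTIzNDU=
    print(base64.b64decode(token).decode())  # 12345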


OK, I've been around for a long time, and I don't think I've met anyone who would argue that since around 2009, although I can remember secrets and keys going into repos as late as 2018 at places I've seen.


In fairness, the PM was the kind of guy who has no technical experience, but was arrogant enough to pretend like he did, and most of the devs were pretty junior.

It was the first time for me having to even argue about this.


Or md5. I wouldn’t be surprised to see that from some in PHP land.


With md5 hashes, the actual password isn’t there, whereas base64 encoding is merely another way of representing the same bits. Yes, md5 is weak, but it’s Fort Knox compared to base64.


I’m very much aware of the differences in format and effort. The idea was only tangentially related :/

Also, at the risk of being pedantic, yes, some semblance of the password is definitely there. Someone can happily go off and try to brute force it.


As long as we’re being pedantic, no they can’t (well, I guess they can try with no hope of succeeding?). You can find a sequence of bytes which will have the same md5, but you have no way of knowing that it’s the same string of bytes which someone else used to arrive at that md5. As I alluded to in my post, that information is gone.


Hah, provide the hash and have the backend crack it whenever it needs to call the api.


You're not going to crack a hash for an API key. Not even with MD5. Long random strings are the worst case scenario for trying to reverse a hash.


I forget the exact details, but if I recall correctly you can crack md5 with a for loop in PHP, because you can just iterate through the full character set. Maybe it would take a while, but having seen it in action for shorter examples, I doubt that’s going to stop someone sufficiently motivated. Then again, at that point I guess they’d just opt for a tool like hashcat.


You can find a collision, but you can't (unless you are very lucky) reverse the hash. The bits are not there.


Of course not. I’m referring to brute force. The idea that you can’t reverse a hash seemed so obvious to me that I didn’t feel it necessary to disclaim. You can increment characters in PHP like numbers [0]. It has some funny quirks; it doesn’t loop back around right away. If you write a for loop with that and pass each guess into the built-in md5 function, you can just go until your hashes match. Of course this would take a long time for longer inputs, and there are tools that do this better if you’re motivated anyway, but you can make a fun hash cracker in a few lines if you’re feeling it. My whole point was just that MD5 is fairly weak. Lots of people don’t, or at least didn’t used to, consider this because it was also (too) convenient.

[0]: https://stackoverflow.com/a/3567245
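
The for-loop approach really does work for short, low-entropy inputs; here is a minimal Python equivalent (the point below about long random keys still stands):

    import hashlib
    from itertools import product
    from string import ascii_lowercase

    def crack_md5(target, max_len=4):
        # Exhaustively hash every short lowercase string until one matches.
        # Feasible for tiny inputs; hopeless for a 40-char random API key.
        for n in range(1, max_len + 1):
            for chars in product(ascii_lowercase, repeat=n):
                guess = "".join(chars)
                if hashlib.md5(guess.encode()).hexdigest() == target:
                    return guess
        return None

    print(crack_md5(hashlib.md5(b"cat").hexdigest()))  # cat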


> I’m referring to brute force. The idea that you can’t reverse a hash seemed so obvious to me that I didn’t feel it necessary to disclaim.

You are confusing hashing with encryption. There is no general way to reverse a hash, be it brute force or an algorithmic method. There are an infinite number of strings that will generate the same MD5 hash. My point is, your for loop may eventually find a string, but it won't be the original AWS secret, so it won't work.


I wasn’t particularly in touch with the concept of hash collisions, no. Now that I’ve learned something, I can revel in the fact that it only cost me a silly amount of imaginary internet points to do so. Thanks!


That’s a better deal than college :)


The full character set for an AWS key is super ridiculously huge, like heat death of the universe huge.


Hm, could you provide one as an example? I’m kidding. That’s fair. I was just thinking of ASCII. How many services live up to AWS’ standards?


Your access keys consist of an access key ID (for example, AKIAIOSFODNN7EXAMPLE) and a secret access key (for example, wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY).


One I can see now is 40 characters of... not sure - I see uppercase, lowercase, digits and special characters. Maybe it's printable ASCII?


AWS keys are ASCII (base64-encoded, IIRC), but they have so much entropy you could never guess one to reverse the hash.
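
Back-of-the-envelope, assuming 40 characters drawn from a ~64-symbol alphabet as described above:

    import math

    keyspace = 64 ** 40
    print(f"{keyspace:.2e} possibilities")         # ~1.77e+72
    print(math.log2(keyspace), "bits of entropy")  # 240.0
    # At a generous trillion guesses per second, expected time to hit it:
    print(keyspace / 2 / 1e12 / 3.15e7, "years")   # ~2.8e+52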


Does secret scanning also apply to public GitHub Action logs and Issues (or more generally, Checks logs)?

We found Actions logs to be a much bigger threat now that many folks have learned not to embed secrets directly into the code and to use secret managers instead. But even then, the secrets retrieved in a step can be printed in plaintext if someone, for example, runs that step in debug mode.

Issues can also accidentally leak secrets via, for example, third-party code builders that print their output in an issue.


GitHub PM here. Right now we scan code, commit metadata, issues, and issue comments. We're expanding to other content types over time, with support for pull request bodies and comments coming in early 2023. Actions logs are on our list too, but will take a little longer.

(It's worth noting that any secrets in your Actions secret store will already be redacted in any Actions logs, so those won't leak there.)


It feels like there could be a GitHub Actions step that just means "redact this particular string output in this task and for the rest of the Action"?
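
There is something close to this: GitHub Actions supports an "add-mask" workflow command, which any step can emit to stdout to have the runner redact that value in subsequent log output. A minimal sketch as a Python step (the environment variable name is hypothetical):

    import os

    # Tell the Actions runner to redact this value everywhere it later
    # appears in the job's logs.
    token = os.environ.get("VAULT_TOKEN", "")  # e.g. fetched from Vault
    if token:
        print(f"::add-mask::{token}")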


Thanks - and yes, this is meant for external secret management solutions like Vault, not GitHub Secrets, which are "safe" enough.


Searching for creds can be tricky if they can't be readily distinguished from other text.

Can anyone think of a problem with generating customer API keys that have a known prefix that makes them more detectable?

For example, a key like "FooSecret.ZTNiMGM0NDI5OGZjMWMxNDlhZmJmNGM4OTk2ZmI5". I wouldn't think that'd open up any new attacks, but I'm no expert on the matter.
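
Generating such a key is straightforward; the prefix carries no entropy and exists purely so scanners can match it with a cheap regex. A sketch using the hypothetical "FooSecret." prefix:

    import secrets

    def make_api_key():
        # All the entropy lives in the random suffix; the fixed prefix
        # just makes the key greppable for scanners (and, yes, attackers).
        return "FooSecret." + secrets.token_urlsafe(32)

    print(make_api_key())  # e.g. FooSecret.q2xYz...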


GitHub PM here. We switched our own token format to something similar to the above in April of last year and have been encouraging other service providers to do the same.

The big benefit of highly identifiable tokens is not just that we can alert on them, but that we can scan for them at pre-receive time and prevent them from leaking (by rejecting the push). We already have that functionality as part of GitHub Advanced Security, and are planning to make it available (for free) on public repos in 2023.

[1] https://github.blog/2021-04-05-behind-githubs-new-authentica...


One of the big issues with secret scanning now is that it’s opt-in for platforms, and from the list of supported platforms it seems like small ones may not be included.

The holy grail here would be to introduce a standardized token format that encodes a disclosure endpoint. Then platforms can issue tokens to this standard and receive notifications without needing to explicitly opt in.


For the secret scanning partner program we're happy to work with partners of any size - there are details of the program, including how to get in touch, at the link below.[1]

However, with secret scanning alerts we look for credentials from service providers we _don't_ have a partnership with, too. Our partnerships team are pretty good, so the delta isn't that big, but Asana, Notion, Intercom and Artifactory are a few of the service providers whose tokens we scan for where we don't (yet!) have a relationship to send detections. We also scan for tokens where a partnership isn't possible or would be much harder (like HashiCorp Vault service tokens).

On standardized formats, if one existed we would scan for it! However, as we've worked with dozens of service providers to update their formats we've found many have specific constraints and everyone has different preferences - as a result, for now, we're pursuing a broad church approach, rather than pushing a standard. If you haven't already read Thomas Ptacek's survey (for fly.io) I recommend it.[2]

[1] https://docs.github.com/en/developers/overview/secret-scanni...

[2] https://fly.io/blog/api-tokens-a-tedious-survey/


> standardized token format that encodes a disclosure endpoint

This should be relatively easy...

    secret:example.com:entropy-goes-here

    secret:subdomain.example.com:entropy-goes-here

    secret:example.com/path/optional:entropy-goes-here
and then a Well-Known URI (https://en.wikipedia.org/wiki/Well-known_URI) based on the embedded URL for the disclosure endpoint.
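
A sketch of how a scanner might derive the disclosure endpoint from that format (the "/.well-known/" suffix here is a made-up name, not a registered URI):

    def disclosure_url(token):
        # Parse "secret:<host[/path]>:<entropy>" and build the well-known
        # URL a scanner would notify on discovery.
        scheme, location, _entropy = token.split(":", 2)
        if scheme != "secret":
            raise ValueError("not a standardized secret token")
        return f"https://{location.rstrip('/')}/.well-known/secret-disclosure"

    print(disclosure_url("secret:example.com:entropy-goes-here"))
    # https://example.com/.well-known/secret-disclosure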


Is there a way (or a plan to have a way) to register custom prefixes to have them scanned on a specific repository?


I argued for something like that previously on HN, like adding a domain prefix 'myservice.com_secretkeyhere'. This would allow automatic discovery of the reporting/revocation endpoint from the key. Then someone pointed out that you could just use an actual URL as your secret key and have that be the URL you visit to revoke it, and I think that is genius.

Next service I make that has API keys, I will make them look like `https://secret.myservice.org/ZTNiMGM0NDI5OGZjMWM`. POSTing to that URL revokes the key, a GET shows a form explaining what it is and a button to revoke the key.

One issue is that some email services mangle URLs specifically, and that would be bad for keys.

(edit: sudhirj is the genius: https://news.ycombinator.com/item?id=28299624)


GitHub does this with the tokens they issue. They even have a checksum in the token, so they can check if the token is syntactically valid:

https://github.blog/2021-04-05-behind-githubs-new-authentica...
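
The linked post describes the checksum as a base62-encoded CRC32 of the token body, which lets a scanner reject look-alikes offline. A rough sketch of the idea (the exact alphabet, padding, and "ghp_" handling here are assumptions, not GitHub's published code):

    import string
    import zlib

    B62 = string.digits + string.ascii_uppercase + string.ascii_lowercase

    def b62encode(n, width=6):
        # Fixed-width base62 encoding of a non-negative integer.
        out = ""
        while n:
            n, r = divmod(n, 62)
            out = B62[r] + out
        return out.rjust(width, "0")

    def looks_valid(token, prefix="ghp_"):
        # Last 6 chars should be the base62 CRC32 of the random body, so
        # syntactically invalid strings are discarded without an API call.
        if not token.startswith(prefix) or len(token) <= len(prefix) + 6:
            return False
        body, checksum = token[len(prefix):-6], token[-6:]
        return b62encode(zlib.crc32(body.encode())) == checksum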

They have a list of supported secrets they can find via automated scans:

https://docs.github.com/en/code-security/secret-scanning/sec...


This is exactly what Stripe and I believe some other companies do, partially for this exact reason.


Another approach is to identify likely ones based on the entropy of the strings. I used a tool that did precisely this once and found some, but can't find it anymore.
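
The usual trick is Shannon entropy over string literals: random key material scores noticeably higher per character than identifiers or English words. A minimal sketch:

    import math
    from collections import Counter

    def shannon_entropy(s):
        # Per-character entropy in bits, computed from the string's own
        # character frequencies.
        counts = Counter(s)
        return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

    for s in ("wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", "application_config_value"):
        print(round(shannon_entropy(s), 2), s)  # the key scores much higher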


Not much of a downside, but it means that they are really easy to detect for attackers as well.

It's really easy to just grep through something looking for that prefix.


Yep - this was my thought. I've been bitten by this in the past: an endpoint was throwing an error and dumping out environment variables (which included API keys). The prefixed API keys were found by crawlers and abused, but the unprefixed keys were untouched (though they were cycled just in case).


Do you think they call this service their "Secret Scanta"?


We use this at our company. Wildly successful at finding tokens for most of the usual suspects. If they are including secret blocking - it will prevent someone from doing the dumb as well.

One question/behavior - if the secret scanner found something and folks resolved it -> secret blocking is enabled -> and a developer does the dumb again, should it block the PR with the new secret? Wondering if we might have something misconfigured as I have seen new secrets get added after we enabled blocking.


Hello! I am an engineer on the Secret Scanning team, thanks for the kind words!

- "push protection" (as we call it) isn't available for free, and isn't part of this rollout.

- For folks who do pay, the flow may be: a developer tries to push, they bypass the secret, are now able to push. From there, an alert is created which they can resolve (maybe it is "used in tests").

- If the _same_ secret is pushed again, we won't block that push. We also won't create a new alert; however, a new location may be recorded within the resolved alert (if you click into it).

If you're seeing push _not_ get blocked, what's most likely is that we just don't support that specific token as part of push protection (we have some much-needed improvements to do to the docs to make this more clear). Since push protection sits in front of the developer, we try not to annoy them with high-false positive tokens. There are a few other possibilities though, so hard to say.


I suspect I'm your biggest GHAS customer :) Rolled out GHAS to over 125k repos this year. Have Tayler (z...) connect you with my details, if you wanted to chat.


Don't let the perfect be the enemy of the good - this will start out in a limited detection of course, but can easily be improved with other hashes and scanning over time.


What's the workflow where people accidentally commit secrets to their git repos? I'm not sure I've ever done it; do we count the "base_secret" type of things web frameworks put in their default app templates? Certainly the more common mistake I make is forgetting to add new files, so it's mildly amusing that other people apparently have the opposite problem.


People keep adding whole tmp/ directories or output binaries to repositories; accidents like this just happen. It's not a workflow, but here's a scenario: people trying to run some test against a real service, to debug some weird issue, will temporarily put credentials in and forget to remove them before committing the fix. Sure, someone will probably notice it in code review, but it's too late if the repo was public.


Lots of ways this happens, either accidentally or intentionally. I think the most common accident is forgetting to add a file to .gitignore and then running "git add .". Intentionally, folks just embed secrets into code out of convenience while developing, and either never think twice, or forget to remove them before commit & push (which becomes kind of an accident).


Mostly accidental. You're working on a prototype, so to just get started you use a const at the top of your code with an API key. This then gets checked in, and you then realize 'oh shit', but by this point it's within git's tree. It can still be removed, but it's not a straightforward process.


If you are in the CircleCI CI/CD space: adding it to a config file to power some workflow step.


Mitigation is premium. Detection should be free.

https://www.arnica.io/blog/secret-detection-needs-to-be-free...


there goes my fontawesome pro license keys stolen from other people's public repos lol


Is scanning only based on regex? Or can it, say, parse a JWT and infer who it came from through those properties?


GitHub PM here. It scans using regexes and then applies post-processing. So yep, it can (and does) parse JWTs to understand their properties.
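
For the curious, inferring the issuer doesn't even require verifying the signature; the JWT payload is just base64url-encoded JSON. A sketch:

    import base64
    import json

    def jwt_claims(token):
        # A JWT is three base64url segments; the second is the JSON payload.
        payload = token.split(".")[1]
        payload += "=" * (-len(payload) % 4)  # restore stripped padding
        return json.loads(base64.urlsafe_b64decode(payload))

    # The "iss" (issuer) claim hints at which provider to notify:
    # jwt_claims(leaked_token).get("iss")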


Does this also check for private keys - SSH, OpenPGP, X.509 etc?

Also what about 2FA secrets like TOTP/WebAuthn?


Are a lot of "private-ish" repos (perhaps something that supports a real company) using GitHub and not self-hosting? I presume this is the case, but it seems dumb.

Why not just self-host?



