Show HN: TL;DRizer - an algorithmic summarizer webapp/api in java (weekend hack) (tldrzr.herokuapp.com)
56 points by mohaps on April 10, 2013 | hide | past | favorite | 67 comments


Lorem Ipsum:

From:

Etiam tincidunt dolor at est sagittis a rhoncus turpis egestas. Integer elementum erat nec nisi molestie eu tempus magna feugiat. Mauris eu ligula et ligula vulputate tempor. Etiam vel lectus et mi vulputate rutrum. Cras libero ipsum, rhoncus at accumsan id, adipiscing iaculis turpis. Cras vel metus nec enim consectetur aliquet vel nec nunc. Proin at mauris purus. Nullam nulla dui, interdum nec pharetra sit amet, vulputate a lectus. Nunc vulputate pellentesque purus at euismod. Nam in justo quis ante porttitor pellentesque. Quisque quis purus a magna scelerisque egestas quis id sapien. Ut non felis sit amet ipsum sodales placerat. Proin nibh massa, sollicitudin et posuere a, placerat convallis magna. Duis lacinia mauris sit amet ante pharetra sed bibendum lorem euismod.

To:

Mauris eu ligula et ligula vulputate tempor. Etiam vel lectus et mi vulputate rutrum. Cras libero ipsum, rhoncus at accumsan id, adipiscing iaculis turpis. Nullam nulla dui, interdum nec pharetra sit amet, vulputate a lectus. Duis lacinia mauris sit amet ante pharetra sed bibendum lorem euismod.

So much faster to read, I never had the time to read through all those design mockups!


well, it ain't called TL;DRizer for nothing! :P


Seriously though, props!


thanks


Hello. I recently finished my thesis for my MS CS degree. My thesis is about automatic summarization. It went through research and a defense, and I think its results are good enough. It uses a statistical approach and machine learning. My main issue with it is not the summarization part but the text extraction part: I can't seem to extract the article from a web page well enough. I'm using boilerpipe (https://code.google.com/p/boilerpipe/) for it. It can do most tricks, but it's not good enough for me. May I ask how you extract the main article from the page?

Here's a preview of mine (http://www.textteaser.com/ui/article?link=http%3A%2F%2Fwww.p...). Go to its home page to read more news. It caters to Philippine news and will soon enter its alpha stage. I'm planning to either open up the API or open source it. HN, which is better? The API is ready; registration is the only thing it lacks.

You can try the API here: http://api.textteaser.com/api/?url=http://www.theverge.com/2...

Just replace the url parameter with the URL of what you want to summarize. Some URLs are not tested yet, and may produce errors. :)


Hi. I study CS with an inclination towards ML, but I don't know anything about automatic summarization. I'm curious: since you've taken an ML approach, did you still need to rely on NLP, and if so, was that very problematic? Also, do you know of an article or paper that could serve as a good starting point/overview of the approaches to summarization and the current difficulties? Thanks.


Hi, I recommend the following papers: http://www.cs.cmu.edu/~nasmith/LS2/das-martins.07.pdf http://www.aclweb.org/anthology-new/W/W03/W03-1204.pdf

I still have other papers, but those can be a good starting point.

In my thesis, NLP is done via a statistical approach. It learns from its previous summaries, so it does have some learning built in. I don't see any problem combining NLP with machine learning. Can you elaborate on this?


Thanks, these look like exactly what I was looking for.

Re NLP, I meant NLP from a sentiment-analysis perspective (not sure if that's the right way to put it). So my question was whether you had to extract the meaning of one or a few related sentences, or only process the text in a statistical manner (which you answered).


Look up text rank. Essentially it amounts to representing text as a graph with sentences as nodes and some sort of similarity measure as edge weights. You then run page rank on that graph.
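The idea described above can be sketched in a few lines of Java (a hypothetical toy, not TL;DRizer's or any published implementation): sentences become graph nodes, word-overlap (Jaccard) similarity gives edge weights, and a damped PageRank-style iteration scores each sentence.

```java
import java.util.*;

// Toy TextRank-style sentence ranker: nodes = sentences,
// edge weights = word-overlap similarity, scores via PageRank iteration.
public class TextRankSketch {

    // Jaccard similarity between the word sets of two sentences.
    static double similarity(String a, String b) {
        Set<String> wa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\W+")));
        Set<String> wb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\W+")));
        Set<String> inter = new HashSet<>(wa);
        inter.retainAll(wb);
        if (wa.isEmpty() || wb.isEmpty()) return 0.0;
        return inter.size() / (double) (wa.size() + wb.size() - inter.size());
    }

    // Damped, weighted PageRank over the sentence-similarity graph.
    static double[] rank(String[] sentences, int iterations, double damping) {
        int n = sentences.length;
        double[][] w = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (i != j) w[i][j] = similarity(sentences[i], sentences[j]);

        double[] score = new double[n];
        Arrays.fill(score, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) {
                    if (j == i || w[j][i] == 0) continue;
                    double out = 0.0;                       // total outgoing weight of j
                    for (int k = 0; k < n; k++) out += w[j][k];
                    if (out > 0) sum += w[j][i] / out * score[j];
                }
                next[i] = (1 - damping) / n + damping * sum;
            }
            score = next;
        }
        return score;
    }

    public static void main(String[] args) {
        String[] sents = {
            "The cat sat on the mat.",
            "The cat chased the mouse.",
            "Stock prices fell sharply today."
        };
        // The two cat sentences reinforce each other; the outlier scores lowest.
        double[] s = rank(sents, 20, 0.85);
        for (int i = 0; i < sents.length; i++)
            System.out.printf("%.3f  %s%n", s[i], sents[i]);
    }
}
```

A summarizer would then keep the top-k sentences by score, emitted in their original document order.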


Interesting. Thanks! :)


You should try diffbot. They use a vision based method to extract text from webpages. The tool looks pretty polished and seems to work rather well.


Yeah, I tried it. It's good! The problem is it's not free. Although it's cheap, I still don't have the money/revenue to pay for it, and the free tier doesn't offer much. If I had the money, I'd definitely switch to Diffbot. :)


Shoot them an email, they seem nice enough about academic/nonprofit work.


Generated 5 Sentence Summary for http://www.businessinsider.com/why-marissa-mayer-bought-a-30...

Back in March, Yahoo bought a startup called Summly for $30 million. Before Yahoo shut it down, Summly was a news aggregation app for smartphones. According to Summly's own Web site, the technology behind the app was "built" by an organization called "SRI International," not by the startup's employees. And indeed, inside Yahoo, Summly is called "Yahoo's Siri." A source close to Yahoo says that CEO Marissa Mayer believes summarization technology is "going to be huge for Yahoo" as it builds "personalized news feeds" into mobile versions of its "core experiences," including Yahoo Finance and Yahoo Sports. The job of implementing this technology at Yahoo will not be given to anyone from Summly, including its young CEO.

--

Edit: Adding this...

Generated 3 Sentence Summary of Gettysburg Address

Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract.


http://TLDRstuff.com comes back with:

Why Marissa Mayer Bought A $30M Startup - Business Insider: The deal got a lot of attention because Summly's CEO is 17-year-old Nick D'Aloisio. Acquiring Summly seems to have been an almost incidental side effect of a deal Yahoo made with SRI for a piece of "summarization technology." Until Yahoo bought it, SRI International held equity in Summly. The job of implementing this technology at Yahoo will not be given to anyone from Summly, including its young CEO.

Notice that the version from TLDR Stuff actually has the answer in it: "incidental side effect of a deal Yahoo made with SRI".

It also tells you that they aren't interested in the young CEO.

This is possible because it is not a keyword-density algo; the core technology, called Liquid Helium, is a language heuristics engine that can weight sentences by whether they express causality or subject matter. This creates a version of the text that tells you who, what, why and, if there is still space, how. You can't do that with a keyword-density or position-in-the-article system.

Summly claimed to have that tech, and SRI has some of it, but what they really have is a nice concept tree and sentence parser.

A far cry from a system that knows which points are important, not just which points are most talked about. As you can see in this Business Insider article, the important part isn't "what is Summly" or "who is Nick" or "who is Mayer"; it is "why did Yahoo do this", and that is captured in the TLDRStuff/Stremor version but not in TLDRizer.


Agreed!


This misses the most important line in the article: Acquiring Summly seems to have been an almost incidental side effect of a deal Yahoo made with SRI for a piece of "summarization technology".


the algo is still kinda "dumb". It basically takes the top N keywords (the most frequent non-stopwords), stems them, and goes through the sentences looking for which of them (up to the max summary length) contain those keywords. I'll keep whittling at it over nights/weekends to see if I can make it more "semantically aware".

edit: some work is needed on the tokenization too; currently I don't preserve non-period punctuation.
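That description maps to a short Java sketch (my reconstruction, not the actual TL;DRizer source; the stop list and the suffix-chopping "stemmer" are crude stand-ins for a real stopword list and a Porter stemmer):

```java
import java.util.*;
import java.util.stream.*;

// Frequency-based extractive summarizer sketch: pick sentences that
// contain the most hits on the top-N most frequent non-stopword stems.
public class FreqSummarizer {

    static final Set<String> STOP = new HashSet<>(Arrays.asList(
        "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
        "that", "for", "on", "with"));

    // Very crude stand-in for a real stemmer.
    static String stem(String w) {
        return w.replaceAll("(ing|ed|es|s)$", "");
    }

    static String summarize(String text, int topN, int maxSentences) {
        String[] sentences = text.split("(?<=[.!?])\\s+");

        // Count stems of non-stopwords across the whole text.
        Map<String, Integer> freq = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+"))
            if (!w.isEmpty() && !STOP.contains(w))
                freq.merge(stem(w), 1, Integer::sum);

        // The topN most frequent stems are the "keywords".
        Set<String> keywords = freq.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(topN).map(Map.Entry::getKey).collect(Collectors.toSet());

        // Score each sentence by how many keyword hits it contains.
        Integer[] order = new Integer[sentences.length];
        int[] score = new int[sentences.length];
        for (int i = 0; i < sentences.length; i++) {
            order[i] = i;
            for (String w : sentences[i].toLowerCase().split("\\W+"))
                if (keywords.contains(stem(w))) score[i]++;
        }

        // Take the best maxSentences, then restore original order.
        List<Integer> picked = Arrays.stream(order)
            .sorted((a, b) -> score[b] - score[a])
            .limit(maxSentences).sorted().collect(Collectors.toList());

        return picked.stream().map(i -> sentences[i])
                     .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(summarize(
            "Cats like fish. Cats also like milk and cats purr. Dogs bark loudly.",
            2, 1));  // → Cats also like milk and cats purr.
    }
}
```

Emitting picked sentences in document order (the final `.sorted()`) is what keeps the summary readable rather than a ranked jumble.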


You might want to look into topic modeling to extract the most meaningful topics (sets of co-occurring words). It could greatly improve your results compared to the keyword approach. Interesting initiative, keep us posted!


There's tons of human-generated TLDRs out there; I wonder if anyone's tried machine-learning from them.


I have to say this summary seems to make the whole summly purchase a lot clearer for me, rightly or wrongly, intentional or not.

$30 million to have the entire tech sphere spread the word that Yahoo is evolving and going to make personalized, summarized, on-the-go news a core feature seems like a bargain. There is certainly a market for something like that, I would say; at the very least, if implemented well, it would be a nice feature to have, possibly a habit changer.

At the very least, they no longer seem to me like the old "who are they again?" company they once were.

It's a shame they won't be changing their name anytime soon, though. That on its own puts me off a little. Forgive me for the snark.


I really like this.

Here's an example of one of PG's essays run through the algorithm: [http://paulgraham.com/startupideas.html]

The most important thing to understand about paths out of the initial idea is the meta-fact that these are hard to see. Empirically, the way to have good startup ideas is to become the sort of person who has them. If you know a lot about programming and you start learning about some other field, you'll probably see problems that software could solve. Some of the most valuable new ideas take root first among people in their teens and early twenties. So if you're a young founder (under 23 say), are there things you and your friends would like to do that current technology won't let you. But there may still be money to be made from something like journalism. Similarly, since the most successful startups generally ride some wave bigger than themselves, it could be a good trick to look for waves and ask how one could benefit from them.

If you ran examples of PG's essays through this, people would see the immediate benefit.


I don't think any algorithm can perform such a summarization task. If you're looking for summaries of PG's essays, see http://tldr.io/discover/paulgraham.com.



haha! :) Yeah, this was kinda fueled by the news of the summly acquisition and too many red bulls drunk during the drive from LA to SFO after wondercon.


YES. As a college student, this is amazing for those long readings for classes one isn't interested in. I wanted to build sort of the reverse of this at one point (take a question/prompt as input, generate a response).

Are you planning on open sourcing this?


yeah, I plan on open sourcing this. waiting on some technicalities.


Great, please post about it or contact me when you do


Sorta cool. The Yahoo purchase of Summly is not much more than a PR play; the technology wasn't/isn't there. And while this "weekend hack" is neat, the quality of summaries isn't close to that of the TLDR plug-in (http://www.tldrstuff.com/#desktop). Not only does the Stremor plug-in get at what is important within the article, the plug-in is simply on all of my browsers and works fast. Fun discussion though, and props to little Nick.


I got an IOException while trying to summarize http://matt-welsh.blogspot.com/2013/04/running-software-team... and a different one for http://googleblog.blogspot.com. I guess you should put more effort into your HTML parser. Try Apache Tika, perhaps.



okay, added a fix to try and extract article text from non-feed urls. try http://tldrzr.herokuapp.com/tldr/?feed_url=http://matt-welsh... :)


ah, it won't work directly on web pages (html). the url is expected to be that of an RSS/Atom feed. for html web pages, copy-pasting the text into the textarea works.

Will try to add url content-type detection in the next cut; summarizing non-feed urls is next up.


Now the url can be a page, I try to extract the article text using boilerpipe. :) Also added a simple GET endpoint for linking. Try this summary of PG's "Writing and Speaking" essay: http://tldrzr.herokuapp.com/tldr/?feed_url=http://www.paulgr...


So, when is Yahoo! buying it? How much is the deal?


Yahoo will buy something that USES this tech, and outsources the actual building as well...


It seems to not properly handle embedded HTML; using my feed (http://www.dp.cx/blog/rss.xml), look at the story titled "The Difficulty of Parsing the Web" and notice the <select /> box that is rendered.


Not bad for a weekend, but http://www.tldrstuff.com does a much better job, especially where sentences shouldn't break on every ".", as where J. R. R. Tolkien is concerned.

And the TLDR plugin works with HTML and on all Western languages.


As a cofounder of http://tldr.io, this confirms our vision that for now (and for many years) only people can perform a task as hard as summarizing.


I agree with "for now" but not "for many years". Right now, most or all automatic summarizers do extraction, which is just lifting sentences from the original article itself. That is different from the human notion of a summary, which is abstraction: taking the most important parts of the article and paraphrasing them for easy reading.

Right now, abstraction (paraphrasing) is hard for a computer to do, but I think, and hope, it will be possible in a few years' time. There are various open source and academic tools that can do some pretty good NLP; I'm looking into Apache OpenNLP and WordNet. I'm hoping for two or three years.

BTW, I have an app similar to your tldr.io. Check my HN comment (https://news.ycombinator.com/item?id=5523770) for more info about it. ;)


Changing the sentences adds bias. Maintaining the author's intent is important.

Generating news highlights from lots of sources might be cool as computer-generated content. But rewriting an author's story in new words is not adding value; it is just ripping them off.


Thanks, I got some pretty good insights. Bias hadn't come to mind. So you're saying multi-document "summarization" may be the next step for consumer automatic summarization? There is a lot of research on multi-document summarization; I will look into it.


It seems like your comment was well intended, but you come off as a bit presumptuous.

The problem TODAY is not whether or not a computer can summarize, but rather to what extent we as humans are satisfied with the computer's summary.

In some cases a dumb summary is good enough (first 200 characters for example). Given this baseline, and a target (human summary), you have to admit it's really an incremental process.


well put, hayksaakian. Also, never underestimate the built-in auto-correct of the human mind :) There will always be a market for expert-curated approaches, but sometimes it's just cheaper to algorithmically "crunch" it. Sometimes Rain Man counting toothpicks is enough, but sometimes you need Ramanujan... :D

Also, do keep in mind... this is 2 hrs worth of coding time late on a Sunday night. I don't have a CS degree; I'm just a utilitarian/curious programmer who is sometimes stupid enough not to realize how hard a problem he's tackling. :) Someone better qualified could do a much better job. Sometimes "just good enough" is good enough! :)


People are inherently biased. It's impossible for any person to read news and not inject their own personal tendencies into a summary they write.


Awesome! Looks super similar to an old sideproject of mine, www.bookshrink.com. The algorithm's different -- yours is aimed more towards summaries, while mine was aimed at sentence importance.


yeah, I'm working on adding more summarizer algorithms. I've been thinking along the lines of weighting up rhetorical questions, weighting down exclamation marks (cheap sarcasm detection), etc.


I experimented with giving bonuses to proper nouns and verbs, as well as giving a slight advantage to shorter sentences.
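Those tweaks could look something like this hypothetical fragment (not bookshrink's actual code): capitalized mid-sentence words stand in for proper nouns, and dividing by a length factor gives shorter sentences a slight edge. A real verb bonus would need a POS tagger (e.g. OpenNLP, mentioned elsewhere in this thread), so it's omitted here.

```java
// Hypothetical scoring adjustments: bonus for capitalized mid-sentence
// words (a crude proper-noun proxy) plus a mild short-sentence advantage.
public class ScoreTweaks {

    static double adjust(String sentence, double baseScore) {
        String[] words = sentence.trim().split("\\s+");
        double bonus = 0.0;
        for (int i = 1; i < words.length; i++)   // skip word 0: always capitalized
            if (!words[i].isEmpty() && Character.isUpperCase(words[i].charAt(0)))
                bonus += 0.1;                    // likely proper noun
        return (baseScore + bonus) / Math.sqrt(words.length);  // shorter is better
    }

    public static void main(String[] args) {
        // Same words, but the capitalized "Alice" earns the first one a bonus.
        System.out.println(adjust("He met Alice today.", 1.0));
        System.out.println(adjust("he met alice today.", 1.0));
    }
}
```

The square-root divisor is a soft penalty: it favors shorter sentences without letting trivially short ones dominate.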


Updated the app with some goodies like links to summaries, ability to summarize all types of urls (not just feed urls) and a "spiffy" new logo :) Also did some css fixes etc.



Tried the rss feed from my blog and got an NPE:

http://blog.medusis.com/rss

Does it expect a specific format?


try now. the blog.medusis.com/rss link works. Thanks for the feedback. Since this grabs the page text (when no RSS text is found), a lot of junk like copyright notices shows up in the summary; will have to add some logic to scrub those. It also behaves horribly with code snippets.


Excellent, thanks, it does work now.

So what does it do exactly? It seems to extract some sentences more or less at random from the text...?


the algo is listed at the bottom of the home page. Will be open-sourcing the code soon.


no, I use ROME to parse RSS feeds, so it should be able to handle whatever ROME can handle. Let me check.


ah, got it. When no entry text was present, I was assuming each entry had a description field. Fixed it: if both entry text and description are missing, it now fetches the text of the linked URL and summarizes that. Pushing to heroku.


Now to sell it to Yahoo for $30 Million!


Also see http://tldr.it (a RailsRumble 2010 entry)


nice :) much better UI. As you can tell, I really suck at HTML/JS coding.


I like your UI. If you change anything, keep this page flow and don't put much else on the page. Check out medium.com for readable font inspiration.

Awesome work!


check out Twitter Bootstrap; it's pretty simple to use and won't make your projects look so bad. Also, nice app!


Your UI is better.


Very cool! I can't wait for it to be open sourced.


trying to figure out (short of creating a new repo from current code) how to mirror the heroku git repo for this on github


Can't you add another remote for it? One for GitHub and one for Heroku, then just push to both when you want to update.
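Concretely, that looks like the following (remote names and the GitHub URL are placeholders; substitute your own):

```shell
# add a second remote pointing at GitHub (URL is a placeholder)
git remote add github git@github.com:yourname/tldrzr.git

# push the same branch to both remotes
git push heroku master   # deploy
git push github master   # publish source
```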


yeah, that's what I ended up doing. :) first time using git; these old bones have been ground down badly by CVS/Subversion :P Here's the announcement: https://news.ycombinator.com/item?id=5535827


tl;dr



