Not sure what technique Wiby uses to filter out garbage, but I could imagine a search index that contains only HTML4 pages being quite useful.
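For what it's worth, here is a rough sketch of how an "HTML4 only" filter might look, assuming the crawler just checks the doctype declaration. This is a guess at one possible approach, not how Wiby actually works, and `looks_like_html4` is a made-up name. Plenty of old pages have no doctype at all, so this would miss those.

```python
import re

# Matches the public identifier used by HTML 4.0 / 4.01 doctypes
# (Strict, Transitional and Frameset all start with this prefix).
_HTML4_DOCTYPE = re.compile(
    r'<!DOCTYPE\s+HTML\s+PUBLIC\s+"-//W3C//DTD\s+HTML\s+4\.',
    re.IGNORECASE,
)


def looks_like_html4(html: str) -> bool:
    """Return True if the page declares an HTML 4.x doctype near its start.

    Hypothetical helper: only the first 1024 characters are inspected,
    which is an arbitrary cutoff chosen for this sketch.
    """
    return _HTML4_DOCTYPE.search(html[:1024]) is not None
```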
Nice idea. Another option, albeit one that is far more resource-intensive and reliant on a third party, is checking the Wayback Machine to see whether a page is very old and its content hasn't changed much. A lot of these alt search engines look at signals like the absence of JS and tracking, but that sort of just selects a cross-section of the more non-commercial stuff.
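A rough sketch of the "old and mostly unchanged" check, assuming you have already pulled a page's capture history as `(timestamp, digest)` pairs, with timestamps in the Wayback Machine's `YYYYMMDDhhmmss` form and the digest being a content hash per capture. The fetching is left out on purpose, and `old_and_stable` and both thresholds are invented for illustration, not anything an existing engine is known to use.

```python
from datetime import datetime


def old_and_stable(snapshots, now, min_age_years=15, max_versions=3):
    """Guess whether a page is old and has rarely changed.

    snapshots: iterable of (timestamp, digest) pairs, timestamp formatted
               as YYYYMMDDhhmmss (assumed input shape for this sketch).
    now:       datetime to measure the page's age against.
    The thresholds are arbitrary placeholders.
    """
    snapshots = list(snapshots)
    if not snapshots:
        return False  # never archived, so nothing to judge

    # Age is measured from the earliest capture.
    first = min(datetime.strptime(ts, "%Y%m%d%H%M%S") for ts, _ in snapshots)
    age_years = (now - first).days / 365.25

    # Each distinct digest counts as one version of the content.
    versions = len({digest for _, digest in snapshots})

    return age_years >= min_age_years and versions <= max_versions
```

The expensive part is the lookup per page, which is why this is so much heavier than just inspecting the HTML you already crawled.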