// Putting it all together
func Dedup(oURL *url.URL, least3Files []repo.FullArticle, lg loghttp.FuncBufUniv, fs fsi.FileSystem) *html.Node {

	opts := domclean2.CleaningOptions{Proxify: true, Beautify: true}
	// opts.FNamer = fNamer
	opts.AddOutline = true
	// opts.RemoteHost = fetch.HostFromStringUrl(least3Files[0].Url)
	opts.RemoteHost = oURL.Host

	//
	// domclean
	for i := 0; i < len(least3Files); i++ {

		fNamer := domclean2.FileNamer(logDir, i)
		fNamer() // first call yields key

		lg("cleaning %4.1fkB from %v", float64(len(least3Files[i].Body))/1024, stringspb.ToLenR(least3Files[i].Url, 60))

		doc, err := domclean2.DomClean(least3Files[i].Body, opts)
		lg(err)

		fileDump(lg, fs, doc, fNamer, ".html")
	}

	if false {
		//
		// Textify with brute force
		for i := 0; i < len(least3Files); i++ {

			fNamer := domclean2.FileNamer(logDir, i)
			fNamer() // first call yields key

			bts, err := fs.ReadFile(fNamer() + ".html")
			lg(err)
			doc, err := html.Parse(bytes.NewReader(bts))
			lg(err)

			textifyBruteForce(doc)

			var buf bytes.Buffer
			err = html.Render(&buf, doc)
			lg(err)

			b := buf.Bytes()
			b = bytes.Replace(b, []byte("[br]"), []byte("\n"), -1)

			fileDump(lg, fs, b, fNamer, "_raw.txt")
		}
	}

	//
	// Textify with more fine-tuning.
	// Save the results to memory.
	textsByArticOutl := map[string][]*TextifiedTree{}
	for i := 0; i < len(least3Files); i++ {

		fNamer := domclean2.FileNamer(logDir, i)
		fnKey := fNamer() // first call yields key

		bts, err := fs.ReadFile(fNamer() + ".html")
		lg(err) // log read errors too; was previously unchecked
		doc, err := html.Parse(bytes.NewReader(bts))
		lg(err)

		fNamer() // one more

		//
		mp, bts := BubbledUpTextExtraction(doc, fnKey)
		fileDump(lg, fs, bts, fNamer, ".txt")

		mpSorted, dump := orderByOutline(mp)
		fileDump(lg, fs, dump, fNamer, ".txt")

		textsByArticOutl[fnKey] = mpSorted

		// for k, v := range mpSorted {
		// 	if k%33 != 0 {
		// 		continue
		// 	}
		// 	log.Printf("%3v: %v %14v %v\n", k, v.SourceID, v.Outline, v.Lvl)
		// }
	}

	//
	//
	// We progress from level 1 downwards.
	// Lower levels skip outline prefixes already weeded out at higher levels,
	// to save expensive levenshtein comparisons.
	var skipPrefixes = map[string]bool{}
	for weedStage := 1; weedStage <= stageMax; weedStage++ {

		fNamer := domclean2.FileNamer(logDir, 0)
		fnKey := fNamer() // first call yields key

		levelsToProcess = map[int]bool{weedStage: true}
		frags := similarTextifiedTrees(textsByArticOutl, skipPrefixes, map[string]bool{fnKey: true})

		similaritiesToFile(fs, logDir, frags, weedStage)

		for _, frag := range frags {
			if len(frag.Similars) >= numTotal-1 && frag.SumRelLevenshtein/(numTotal-1) < 0.2 {
				skipPrefixes[frag.Outline+"."] = true
			}
		}

		b := new(bytes.Buffer)
		for k := range skipPrefixes {
			b.WriteString(k)
			b.WriteByte(32) // space separator
		}
		// log.Printf("%v\n", b.String())
	}

	//
	// Apply dedup
	fNamer := domclean2.FileNamer(logDir, 0)
	fNamer() // first call yields key

	bts, err := fs.ReadFile(fNamer() + ".html")
	lg(err)
	doc, err := html.Parse(bytes.NewReader(bts))
	lg(err)

	dedupApply(doc, skipPrefixes)

	// A special after-dedup cleaning:
	// remove ol and cfrm attributes.
	var fr func(*html.Node)
	fr = func(n *html.Node) {
		if n.Type == html.ElementNode {
			attr2 := make([]html.Attribute, 0, len(n.Attr))
			for _, attr := range n.Attr {
				if attr.Key != "ol" && attr.Key != "cfrm" {
					attr2 = append(attr2, attr)
				}
			}
			n.Attr = attr2
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			fr(c)
		}
	}
	fr(doc)

	if false { // does not add value
		var b7 bytes.Buffer
		err := html.Render(&b7, doc)
		lg(err)
		doc, err = domclean2.DomClean(b7.Bytes(), opts)
		lg(err)
	} else {
		domclean2.DomFormat(doc)
	}

	return doc
}
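// The weeding threshold above reads as: a fragment is considered boilerplate
// if it recurs in all other articles (numTotal-1 of them) with an average
// *relative* levenshtein distance below 0.2. A minimal sketch of such a
// measure, assuming "relative" means edit distance normalized by the longer
// string; this is a hypothetical helper, not the package's actual metric.
func relLevenshtein(a, b string) float64 {
	ra, rb := []rune(a), []rune(b)
	if len(ra) == 0 && len(rb) == 0 {
		return 0
	}
	// classic two-row dynamic programming edit distance
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = minInt(prev[j]+1, minInt(curr[j-1]+1, prev[j-1]+cost))
		}
		prev, curr = curr, prev
	}
	longer := len(ra)
	if len(rb) > longer {
		longer = len(rb)
	}
	return float64(prev[len(rb)]) / float64(longer)
}

func minInt(x, y int) int {
	if x < y {
		return x
	}
	return y
}

// dedupApply itself is not shown above. A minimal sketch of what it plausibly
// does, assuming each element still carries its outline (e.g. "3.1.2") in the
// "ol" attribute that DomClean's AddOutline option writes and that the cleanup
// pass above strips again; assumes the same imports as the surrounding file
// (strings, golang.org/x/net/html). The trailing "." stored in skipPrefixes
// makes "1.2." match outline "1.2" and its children "1.2.x", but not the
// sibling "1.20". Hypothetical helpers, not the package's implementation.
func dedupApplySketch(n *html.Node, skipPrefixes map[string]bool) {
	for c := n.FirstChild; c != nil; {
		next := c.NextSibling // save, since removal invalidates c
		dedupApplySketch(c, skipPrefixes)
		if c.Type == html.ElementNode && isWeededOut(attrVal(c, "ol"), skipPrefixes) {
			n.RemoveChild(c)
		}
		c = next
	}
}

func attrVal(n *html.Node, key string) string {
	for _, a := range n.Attr {
		if a.Key == key {
			return a.Val
		}
	}
	return ""
}

func isWeededOut(outline string, skipPrefixes map[string]bool) bool {
	if outline == "" {
		return false
	}
	ol := outline + "." // align with the trailing "." in skipPrefixes
	for p := range skipPrefixes {
		if strings.HasPrefix(ol, p) {
			return true
		}
	}
	return false
}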
// handleFetchURL either displays a form for requesting a URL,
// or it returns the URL's contents.
func handleFetchURL(w http.ResponseWriter, r *http.Request, m map[string]interface{}) {

	lg, b := loghttp.BuffLoggerUniversal(w, r)
	_ = b

	// on live server => always use https
	if r.URL.Scheme != "https" && !util_appengine.IsLocalEnviron() {
		r.URL.Scheme = "https"
		r.URL.Host = r.Host
		lg("lo - redirect %v", r.URL.String())
		http.Redirect(w, r, r.URL.String(), http.StatusFound)
		return // stop here; the client re-requests via https
	}

	/*
		To distinguish between posted and "getted" values,
		we check the post-only form values first.
		If nothing is there, but FormValue *has* a value,
		then it came via GET; otherwise via POST.
	*/
	rURL := ""
	urlAs := ""

	err := r.ParseForm()
	lg(err)

	if r.PostFormValue(routes.URLParamKey) != "" {
		urlAs += "url posted "
		rURL = r.PostFormValue(routes.URLParamKey)
	}

	if r.FormValue(routes.URLParamKey) != "" {
		if rURL == "" {
			urlAs += "url getted "
			rURL = r.FormValue(routes.URLParamKey)
		}
	}
	// lg("received %v: %q", urlAs, rURL)

	if len(rURL) == 0 {

		tplAdder, tplExec := tplx.FuncTplBuilder(w, r)
		tplAdder("n_html_title", "Fetch some http data", nil)

		m := map[string]string{
			"protocol": "https",
			"host":     r.Host, // not fetch.HostFromReq(r)
			"path":     routes.ProxifyURI,
			"name":     routes.URLParamKey,
			"val":      "google.com",
		}
		if util_appengine.IsLocalEnviron() {
			m["protocol"] = "http"
		}
		tplAdder("n_cont_0", c_formFetchUrl, m)
		tplExec(w, r)

	} else {

		r.Header.Set("X-Custom-Header-Counter", "nocounter")
		bts, inf, err := fetch.UrlGetter(r, fetch.Options{URL: rURL})
		lg(err)

		tp := mime.TypeByExtension(path.Ext(inf.URL.Path))
		if false {
			ext := path.Ext(rURL)
			ext = strings.ToLower(ext)
			tp = mime.TypeByExtension(ext)
		}
		w.Header().Set("Content-Type", tp)
		// w.Header().Set("Content-type", "text/html; charset=latin-1")

		if r.FormValue("dbg") != "" {
			w.Header().Set("Content-type", "text/html; charset=utf-8")
			fmt.Fprintf(w, "%s<br>\n %s<br>\n %v", inf.URL.Path, tp, inf.URL.String())
			return
		}

		opts := domclean2.CleaningOptions{Proxify: true}
		opts.Beautify = true // "<a> Linktext without trailing space"
		opts.RemoteHost = fetch.HostFromStringUrl(rURL)

		// opts.ProxyHost = routes.AppHost()
		opts.ProxyHost = fetch.HostFromReq(r)
		if !util_appengine.IsLocalEnviron() {
			opts.ProxyHost = fetch.HostFromReq(r)
		}

		doc, err := domclean2.DomClean(bts, opts)
		lg(err) // log cleaning errors too; was previously unchecked

		var bufRend bytes.Buffer
		err = html.Render(&bufRend, doc)
		lg(err)

		w.Write(bufRend.Bytes())
	}
}
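// The posted-vs-getted check above relies on a stdlib asymmetry:
// r.PostFormValue reads only the request body, while r.FormValue consults
// the URL query as well. A self-contained sketch demonstrating the same
// ordering; demo handler, key name, and port are hypothetical, not part
// of the package above.
package main

import (
	"fmt"
	"net/http"
)

func demo(w http.ResponseWriter, r *http.Request) {
	const key = "url-x" // hypothetical parameter name
	src, val := "", ""
	if v := r.PostFormValue(key); v != "" { // body only => "posted"
		src, val = "posted", v
	} else if v := r.FormValue(key); v != "" { // query (or body) => "getted"
		src, val = "getted", v
	}
	fmt.Fprintf(w, "%s %q\n", src, val)
}

func main() {
	http.HandleFunc("/fetch", demo)
	// try: curl 'localhost:8080/fetch?url-x=google.com'
	// vs.: curl -d 'url-x=google.com' localhost:8080/fetch
	http.ListenAndServe(":8080", nil)
}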