Commit Graph

55 Commits

Author SHA1 Message Date
Marcin Bachry 24c16fc349 Fix crash in url preview when html tag has no text
Signed-off-by: Marcin Bachry <hegel666@gmail.com>
2016-12-14 22:38:18 +01:00
Johannes Löthberg 32c8b5507c preview_url_resource: Ellipsis must be in unicode string
Signed-off-by: Johannes Löthberg <johannes@kyriasis.com>
2016-12-01 13:12:13 +01:00
Erik Johnston f90b3d83a3 Add None check to _iterate_over_text 2016-08-17 15:17:17 +01:00
Erik Johnston 109a560905 Flake8 2016-08-16 14:57:21 +01:00
Erik Johnston 48b5829aea Fix up preview URL API. Add tests.
This includes:

- Splitting out methods of a class into stand alone functions, to make
  them easier to test.
- Adding unit tests to split out functions, testing HTML -> preview.
- Handle the fact that elements in lxml may have tail text.
2016-08-16 14:53:24 +01:00
Erik Johnston 5bcccfde6c Don't include html comments in description 2016-08-05 14:45:11 +01:00
Erik Johnston b5525c76d1 Typo 2016-08-04 16:10:08 +01:00
Erik Johnston e97648c4e2 Test summarization 2016-08-04 16:09:09 +01:00
Erik Johnston 58c9653c6b Don't infer paragrahs from newlines 2016-08-02 18:50:24 +01:00
Erik Johnston 6b58ade2f0 Comment on why we clone 2016-08-02 18:41:22 +01:00
Erik Johnston 9e66c58ceb Spelling. 2016-08-02 18:37:31 +01:00
Erik Johnston f83f5fbce8 Make it actually compile 2016-08-02 18:32:42 +01:00
Erik Johnston aecaec3e10 Change the way we summarize URLs
Using XPath is slow on some machines (for unknown reasons), so use a
different approach to get a list of text nodes.

Try to generate a summary that respect paragraph and then word
boundaries, adding ellipses when appropriate.
2016-08-02 18:25:53 +01:00
Erik Johnston 09a17f965c Line lengths 2016-06-15 16:58:12 +01:00
Erik Johnston 1e9026e484 Handle floats as img widths 2016-06-15 16:58:05 +01:00
Erik Johnston a60169ea09 Handle og props with not content 2016-06-15 16:57:48 +01:00
Mark Haines eb79110beb Clean up the blacklist/whitelist handling.
Always set the config key with an empty list, even if a list isn't specified.
This means that the codepaths are the same for both the empty list and
for a missing key. Since the behaviour is the same for both cases this
makes the code somewhat easier to reason about.
2016-05-16 13:03:59 +01:00
Mark Haines 8d7ad44331 Report per request metrics for all of the things using request_handler 2016-04-28 10:57:49 +01:00
Erik Johnston e8884e5e9c Add self.media_repo to PreviewUrlResource 2016-04-19 14:51:34 +01:00
Erik Johnston a7001c311b _make_dirs was moved to MediaRepository 2016-04-19 14:49:31 +01:00
Erik Johnston 9181e2f4c7 Add store to PreviewUrlResource 2016-04-19 14:48:24 +01:00
Erik Johnston fb76a81ff7 Reorder imports 2016-04-19 14:45:05 +01:00
Erik Johnston 43f0941e8f Split out BaseMediaResource into MediaRepository
This is so that a single MediaRepository can be shared across all
resources, rather than having a "copy" per resource.

In particular this allows us to guard against both the thumbnail and
download resource triggering a download of remote content at the same
time.
2016-04-19 11:24:59 +01:00
Matthew Hodgson aaabbd3e9e explicitly pass in the charset from Content-Type to lxml to fix cyrillic woes better 2016-04-15 14:32:25 +01:00
Matthew Hodgson 84f9cac4d0 fix cyrillic URL previews by hardcoding all page decoding to UTF-8 for now, rather than relying on lxml's heuristics which seem to get it wrong 2016-04-15 13:20:08 +01:00
Matthew Hodgson f78b479118 fix urlparse import thinko breaking tiny URLs 2016-04-14 15:23:55 +01:00
Erik Johnston d0633e6dbe Sanitize the optional dependencies for spider API 2016-04-13 13:38:09 +01:00
Erik Johnston 17515bae14 PEP8 2016-04-11 11:02:50 +01:00
Matthew Hodgson 5ffacc5e84 fix typos and needless try/except from PR review 2016-04-11 10:39:16 +01:00
Matthew Hodgson 83b2f83da0 actually throw meaningful errors 2016-04-08 21:36:59 +01:00
Mark Haines b36270b5e1 Fix pep8 warning 2016-04-08 19:52:23 +01:00
Matthew Hodgson 1ccabe2965 more PR feedback 2016-04-08 18:58:08 +01:00
Matthew Hodgson dafef5a688 Add url_preview_enabled config option to turn on/off preview_url endpoint. defaults to off.
Add url_preview_ip_range_blacklist to let admins specify internal IP ranges that must not be spidered.
Add url_preview_url_blacklist to let admins specify URL patterns that must not be spidered.
Implement a custom SpiderEndpoint and associated support classes to implement url_preview_ip_range_blacklist
Add commentary and generally address PR feedback
2016-04-08 18:37:15 +01:00
Matthew Hodgson cf51c4120e report image size (bytewise) in OG meta 2016-04-03 23:57:05 +01:00
Matthew Hodgson 0834b152fb char encoding 2016-04-03 12:59:27 +01:00
Matthew Hodgson 8b98a7e8c3 pep8 2016-04-03 12:56:29 +01:00
Matthew Hodgson eab4d462f8 fix etag typing error. fix timestamp typing error 2016-04-03 02:02:46 +01:00
Matthew Hodgson c3916462f6 rebase all image URLs 2016-04-03 01:33:12 +01:00
Matthew Hodgson 110780b18b remove stale todo 2016-04-03 00:48:31 +01:00
Matthew Hodgson b09e29a03c Ensure only one download for a given URL is active at a time 2016-04-03 00:47:40 +01:00
Matthew Hodgson 7426c86eb8 add a persistent cache of URL lookups, and fix up the in-memory one to work 2016-04-03 00:31:57 +01:00
Matthew Hodgson d1b154a10f support gzip compression, and don't pass through error msgs 2016-04-02 03:06:39 +01:00
Matthew Hodgson 5037ee0d37 handle missing dimensions without crashing 2016-04-02 02:29:57 +01:00
Matthew Hodgson b26e8604f1 make meta comparisons case insensitive 2016-04-02 01:35:44 +01:00
Matthew Hodgson 5fd07da764 refactor calc_og; spider image URLs; fix xpath; add a (broken) expiringcache; loads of other fixes 2016-04-02 00:35:49 +01:00
Matthew Hodgson c60b751694 fix assorted redirect, unicode and screenscraping bugs 2016-04-01 02:17:48 +01:00
Matthew Hodgson 683e564815 handle spidered relative images correctly 2016-03-31 23:52:58 +01:00
Matthew Hodgson 72550c3803 prevent choking on invalid utf-8, and handle image thumbnailing smarter 2016-03-31 15:14:14 +01:00
Matthew Hodgson bb9a2ca87c synthesise basig OG metadata from pages lacking it 2016-03-31 14:15:09 +01:00
Matthew Hodgson a8a5dd3b44 handle requests with missing content-length headers (e.g. YouTube) 2016-03-31 01:55:21 +01:00