532 Commits

Author SHA1 Message Date
7ef0e8c870
hadapi: Increase back-off when connection is reset. 2018-07-01 14:55:35 +10:00
d750dabba3
hadapi: Handle no response 2018-07-01 14:51:44 +10:00
021975cb58
crawler: Dump config at start-up 2018-07-01 14:42:12 +10:00
e5ac25cbd8
server: Don't cast parameters
`argparse` should do that for us.
2018-07-01 14:42:04 +10:00
caf7fd18c0
crawler: Fix return when no new user data exists. 2018-07-01 14:27:04 +10:00
9e2f145cbc
server: Pass in crawler settings 2018-07-01 14:25:16 +10:00
16b994c389
crawler: Fetch only UIDs we haven't seen already. 2018-07-01 13:01:53 +10:00
ebb0394c60
hadapi: Add function to retrieve user IDs alone.
Without returning the user info, so we can quickly skip the users we
already have.
2018-07-01 12:49:42 +10:00
23d8b0506f
crawler: Don't consider top-level domains.
This unfairly biases the "international" domains.
2018-06-04 08:23:51 +10:00
06a755395f
crawler: Allow LinkedIn links 2018-06-03 20:10:35 +10:00
1d87af9896
crawler: Loosen up restrictions on Google+ links. 2018-06-03 20:05:29 +10:00
462422998d
crawler: Also count parent domains of hosts.
This will naturally keep splitting until we've got nothing left, so
given the link `hadsh.vk4msl.id.au`, will generate a count for
`hadsh.vk4msl.id.au`, `vk4msl.id.au`, `id.au` and `au`.
2018-06-03 18:42:18 +10:00
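The splitting described in this commit message can be sketched roughly as follows. This is only an illustration, not the crawler's actual code: the function names `host_and_parents` and `tally_hostnames` are hypothetical, and the use of `collections.Counter` for the tally is an assumption.

```python
from collections import Counter


def host_and_parents(hostname):
    """Return the hostname followed by each of its parent domains.

    Hypothetical helper: keeps stripping the left-most label until
    nothing is left, as the commit message describes.
    """
    labels = [label for label in hostname.lower().split(".") if label]
    return [".".join(labels[i:]) for i in range(len(labels))]


def tally_hostnames(hostnames):
    """Count every hostname and all of its parents (assumed tally shape)."""
    counts = Counter()
    for hostname in hostnames:
        counts.update(host_and_parents(hostname))
    return counts
```

With this sketch, `host_and_parents("hadsh.vk4msl.id.au")` yields the four names listed in the message, ending with the bare `au`.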
8052924283
main.js: Display hostnames.
This is done the same way as for words, and so green means they're
mostly referenced by legitimate users, red means they're mostly spammer
hosts, and yellow means it's either new or used by both.
2018-06-03 18:31:03 +10:00
d7147b2d8e
server: Expose user hostname scores. 2018-06-03 18:30:00 +10:00
0e28f6376d
crawler: Fix copy-paste error 2018-06-03 18:02:40 +10:00
a4d7a0ca7a
server: Tally up the hostnames linked to by users. 2018-06-03 18:00:56 +10:00
1785e0e6ce
db.db: Expose Hostname and UserHostname 2018-06-03 18:00:30 +10:00
d98761c1bb
db.model: Add hostnames field to User 2018-06-03 17:58:56 +10:00
d1e9ed1ef8
crawler: Tally up the hostnames linked-to by users.
We'll score these the same way as words.
2018-06-03 17:57:18 +10:00
3dfe434fc1
db.model: Add Hostname and UserHostname
These will track the host name used in users' links and score them, as
the spammers seem to like pushing the same addresses over and over.

(Long term, I want to de-reference redirected links like t.co too.)
2018-06-03 17:49:11 +10:00
999179ae85
db.model: Fix representation for UserWord. 2018-06-03 17:48:04 +10:00
35e36af073
hadapi: Don't show content of all responses.
Some are JPEGs and PNGs, which are not meaningful to display in logs.
2018-06-03 11:13:15 +10:00
88bd611e8d
hadapi: Handle removed per_page parameter
It's meaningless for the workaround code.
2018-06-02 21:01:28 +10:00
f6b6932ca8
hadapi: Actually return the response 2018-06-02 20:21:45 +10:00
f55795675d
server: API to use own HTTP client instance, don't pass to crawler. 2018-06-02 16:52:20 +10:00
a5337b6a80
crawler: Remove HTTP client
We use the one exposed by the API now.
2018-06-02 16:51:25 +10:00
e3ca7b7a19
hadapi: Expose retrieval function that respects limits.
This basically is a wrapper around the HTTP client, and tries to ensure
that *all* requests meet the request limits to avoid getting blocked.
2018-06-02 16:49:53 +10:00
841d89898e
hadapi: Use workaround code for users sorted by newest.
This has been broken for a while now, and doesn't look like it'll ever
get fixed, so just work around it here.
2018-06-02 16:38:07 +10:00
1710dfdbc0
hadapi: Treat connection reset as a sign to back-off.
Also reduce our query rate a little bit to avoid the problem.
2018-06-02 16:36:01 +10:00
23f34b45c4
crawler: Match URIs against whitelist.
Some users put a link to "Google+" that is to something other than
Google+, so the link type from the Hackaday.io is therefore not
reliable.  Better to do it ourselves.
2018-06-02 14:29:19 +10:00
3c8c562dfe
crawler: Back up on catch-up interval. 2018-06-02 13:37:12 +10:00
01bb64a43c
crawler: Use HAD creation time for user age. 2018-06-02 13:15:12 +10:00
07e79665a6
crawler: Set user created and had_created 2018-06-02 13:14:15 +10:00
c2b500a8a5
db.model: Add had_created.
I note that for newer users, you see the actual creation time, but for older
users, this is reported as 0.  So, store the value as-given, rather than
fudging it.  `created` will now reflect when we first saw the user.
2018-06-02 13:13:17 +10:00
40ad79dd3d
server: Report back error 400 on failed OAuth callback. 2018-06-02 12:47:02 +10:00
cf1c8bb536
crawler: Fix retrieval of historical users. 2018-06-02 11:19:16 +10:00
ae40f23d2d
hadapi: Report error responses 2018-06-02 11:09:53 +10:00
738794f0d5
server, crawler: Add ability to make non-members admins
This is done by specifying the user's UID on the command line.
2018-06-02 11:02:07 +10:00
af74cd8b46
crawler: Handle SQL error in fetching newer users 2018-06-02 09:44:25 +10:00
ba3dac2e24
crawler: Handle extra parameter in background_fetch_hist_users 2018-06-02 09:43:37 +10:00
db202c3c7c
crawler: Scan newest profiles page for new users.
We're far enough behind that doing a range fetch for
${HIGHEST_UID}-50,${HIGHEST_UID} doesn't work well anymore.  So revert
back to the older behaviour.
2018-06-02 09:26:13 +10:00
6c0e0455ca
crawler: Fix default definitions. 2018-05-04 20:08:02 +10:00
2ecf5d4f87
crawler: Increase historical scan rate if we see existing users.
That means we're behind and need to catch up.
2018-05-04 20:06:45 +10:00
2f1eca42de
crawler: Define some constants 2018-05-04 19:50:10 +10:00
fc2b53b0a4
main.js: Tweak loading message
Show user ID criteria so we can get an idea how far back it's looking.
2018-05-03 22:28:22 +10:00
f2685df564
crawler: Don't re-inspect users older than 4 weeks. 2018-05-03 21:44:02 +10:00
bdd888c2f8
main.js: Keep loading until we see users 2018-05-02 22:05:37 +10:00
abf0820a9a
server: Set cache control on newcomers feed.
Otherwise the browser caches the response.
2018-05-02 21:46:05 +10:00
ecf5bc2aa6
crawler: Defer if no response
If we get no response for a deferred user, defer them again.
2018-05-02 21:33:01 +10:00
4cdd8fcc25
hadapi: Drop range query in work-around 2018-05-02 21:26:07 +10:00
3d09727206
crawler: Add debug notes when inspections are done. 2018-05-02 21:25:14 +10:00
e61fdc2096
crawler: Debug user inspection 2018-05-02 21:14:46 +10:00
a16d728376
crawler: Tweak debugging output 2018-05-02 20:59:10 +10:00
bc57dc1825
hadapi: Add debugging 2018-05-02 20:50:07 +10:00
a4d5836016
server: Set User Agent on HTTP client. 2018-05-02 20:35:00 +10:00
40619d0712
crawler: Defer if account is less than an hour old. 2018-03-04 18:10:44 +10:00
93f9cd806d
crawler: Handle socket.gaierror checking link validity. 2018-03-03 17:46:39 +10:00
f8ad744d08
crawler: Update tokens, don't blind insert 2018-03-03 17:08:06 +10:00
e89dd8065a
crawler: Report traceback on failure. 2018-03-03 17:04:25 +10:00
3c9ae0afc5
crawler: Catch more general SQL errors. 2018-03-03 16:47:16 +10:00
55e9cd5427
main.js: Sort words; use title to display stats. 2018-03-03 15:24:41 +10:00
9a742db0e8
wordstat: Exclude punctuation 2018-03-03 14:55:29 +10:00
c1fee190bf
main.js: Highlight using colour derived from score. 2018-03-03 14:46:36 +10:00
25418a830c
main.js: Hide users with no data and still pending. 2018-03-03 14:00:02 +10:00
2aa1724e7a
crawler: Cancel further inspection if not needed. 2018-03-03 13:52:35 +10:00
6eaa8198c5
crawler: Defer if |score| < 0.5
If it's above, there's probably enough to go on.
2018-03-03 13:28:51 +10:00
7face2a12c
main.js: Display next inspection time. 2018-03-03 13:26:56 +10:00
6d554920fe
server: Expose inspection count and next inspection time. 2018-03-03 13:24:29 +10:00
dbb4993a7a
main.js: Add defer button; don't auto-mark deferred users. 2018-03-03 13:13:24 +10:00
55b9c71d88
server: Report whether inspection is pending 2018-03-03 13:04:17 +10:00
bf0060e2be
server: Remove user from DeferredUsers on classification. 2018-03-03 12:59:44 +10:00
8264ee667f
db.db: Expose DeferredUser 2018-03-03 12:57:27 +10:00
75c13a79ee
crawler: Defer if user has less than 10 tokens to score.
They might only have a small handful; maybe half a dozen.
2018-03-03 12:02:05 +10:00
ac0295f4cc
crawler: Don't defer historical users.
These are already old users; if they haven't shown spam traits by now,
they never will.
2018-03-03 11:52:00 +10:00
cc0eef4f93
crawler: Skip deferred user scanning if none due. 2018-03-03 11:47:55 +10:00
4e42937224
crawler: Kick off initial deferred inspections at start-up. 2018-03-03 11:46:39 +10:00
430ec91386
crawler: Fix deferral
*Actually* adding the object helps!
2018-03-03 11:45:32 +10:00
3d7644071c
crawler: Report when user inspection is deferred. 2018-03-03 11:42:18 +10:00
45f5c4dcfd
crawler: Fix deferral 2018-03-03 11:38:08 +10:00
f5e83670eb
db.model: Add missed __tablename__ 2018-03-03 11:36:14 +10:00
b0da8c6c27
crawler: Re-check deferred users every 15 minutes. 2018-03-03 11:35:20 +10:00
f87ef52b3b
crawler: Defer inspection if no score. 2018-03-03 11:26:48 +10:00
800fff3a49
db.model: Add deferred user to database.
This allows us to persist deferred users, so that we can re-visit new
accounts that perhaps had no content to inspect when first created.
2018-03-03 11:19:09 +10:00
5913853afb
crawler: Always scan for user pages.
Unlike projects, there's no top-level 'count' of users' pages in the
user's data.
2018-03-03 10:32:58 +10:00
27e6db6961
crawler: Ignore deleted users 2018-03-03 08:48:32 +10:00
76007706ae
crawler: Bump processing delay to 15 minutes. 2018-03-02 23:01:13 +10:00
6732087090
crawler: Drop background user update. 2018-03-02 22:11:31 +10:00
9f2402a2c8
server: Use dedicated database connection for session. 2018-03-02 22:05:13 +10:00
349c597394
server: Wrap database calls in try-finally
Close database in finally block.
2018-03-02 22:00:03 +10:00
0721911998
crawler: Fix user age calculations 2018-03-02 21:44:56 +10:00
af58e9006c
server: Connect to and close off database in worker.
Don't pass connection across threads, as it seems to leave the link
hanging.
2018-03-02 21:42:25 +10:00
721cf280a0
crawler: Report user age on inspection. 2018-03-02 21:30:17 +10:00
870e47500a
server: Use previously tokenised words.
Since we've pretty much dealt with all users that were not previously
tokenised, we can drop that bit of backward-compatible code that
tokenises on classification.

We should be able to just commit once too, which should speed things up.
2018-03-02 21:12:46 +10:00
3128b2f271
crawler: Have more patience retrieving data. 2018-03-02 21:01:06 +10:00
bc287c2f74
crawler: Throttle down retrieval rates.
So we don't run out of tokens before the first week!
2018-03-02 20:53:32 +10:00
b00f860a00
hadapi: Have some more patience 2018-03-02 20:47:44 +10:00
2c6320ee41
crawler: Report when inspection is delayed/takes place. 2018-03-02 20:38:17 +10:00
42dbdb56e0
crawler: Ignore empty links. 2018-03-02 20:08:09 +10:00
22d998bcb9
crawler: Delay inspection of users.
Sometimes, you catch a spambot user, but they haven't exhibited any
traits yet, so delay inspection by 5 minutes.
2018-03-02 20:05:33 +10:00
0de910395d
server: Handle user without detail. 2018-03-02 19:42:50 +10:00
8646c09ee1
server: Make number of users returned configurable.
Limit to 10 by default.
2018-03-02 19:41:48 +10:00
748192875a
server: Wrap ClassifyHandler in semaphore.
This has the potential to deadlock if multiple instances run together,
so wrap it in a semaphore to reduce the probability of this.
2018-03-02 19:11:36 +10:00
fa6635ba9d
server: Generate JSON response in worker thread.
Passing back the User object is not a good idea as it becomes detached
from the session.
2018-03-02 19:06:40 +10:00
d3779e0509
server: Re-locate log message output. 2018-03-02 18:47:55 +10:00
6b69f93861
server: Pass through user_id to classify handler. 2018-03-02 18:46:43 +10:00
aa0334abbc
server: Add missed thread_count argument 2018-03-02 18:40:22 +10:00
89c737687a
server: Defer classifying to worker thread.
Updating the database scores can take a while, let's not block the main
thread.
2018-03-02 18:39:18 +10:00
63dcb19bfd
resizer: Use a worker pool 2018-03-02 18:34:15 +10:00
374e441466
server: Add a worker pool 2018-03-02 18:34:07 +10:00
ad53662bde
pool: Add thread pool wrapper.
This implements a "thread pool" using Tornado's `Semaphore` class to
limit the number of worker threads active at any one time, and using the
`Queue` class to enqueue requests.
2018-03-02 18:29:42 +10:00
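The idea of limiting active worker threads with a semaphore might look something like the sketch below. It is not the project's `pool` module: it swaps in the standard library's `asyncio.Semaphore` for Tornado's `Semaphore`, relies on the semaphore's own waiter list rather than a separate `Queue`, and the `WorkerPool` class and its `apply` method are hypothetical names.

```python
import asyncio


class WorkerPool:
    """Run blocking callables in threads, at most `workers` at a time.

    Hypothetical stand-in using asyncio instead of Tornado.
    """

    def __init__(self, workers=2):
        self._workers = workers
        self._sem = None  # created lazily, inside the running event loop

    async def apply(self, func, *args):
        """Wait for a free slot, then run func(*args) in a worker thread."""
        if self._sem is None:
            self._sem = asyncio.Semaphore(self._workers)
        async with self._sem:
            loop = asyncio.get_running_loop()
            # None selects the loop's default ThreadPoolExecutor.
            return await loop.run_in_executor(None, func, *args)
```

Callers that arrive while all slots are busy simply wait on the semaphore, so the event loop itself is never blocked by the work.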
dda54cd23a
db.model: Add string representations 2018-03-01 22:38:50 +10:00
dd0fbd51e5
server: Fix user classify handler 2018-03-01 22:15:15 +10:00
d98ac5e85d
server: Use separate DB instance per request.
Otherwise, if a SQL error occurs, it trips up the whole server.
2018-03-01 22:07:50 +10:00
7d0d686418
server: Retry word adjacency addition. 2018-03-01 22:02:37 +10:00
47ac3dcd64
server: Commit word adjacencies on create. 2018-03-01 21:57:04 +10:00
04b997a8e8
server: Fix typo in ClassifyHandler 2018-03-01 21:51:47 +10:00
c00ef5195a
crawler: Don't assume UserWord/UserWordAdjacent are unique. 2018-03-01 21:40:17 +10:00
d9bfd5188d
crawler: Don't reclassify existing users. 2018-03-01 21:36:05 +10:00
f9bfe4e718
crawler: Commit after each user page/project 2018-03-01 21:32:48 +10:00
5f5084bcd3
crawler: Handle missing 'page' and 'last_page' 2018-03-01 21:19:15 +10:00
eee087843a
hadapi: Fix typo in _project_query_opts. 2018-03-01 21:11:06 +10:00
76fd0e2f47
crawler: Only scan pages/projects if user info says they exist.
If we get a 'projects' or 'pages' count, then inspect.  Otherwise we
burn up API requests for no good reason.
2018-02-25 20:47:14 +10:00
601e1be5a7
crawler: Crawl users' pages and projects.
Just tokenise the content, don't bother storing the pages or projects
themselves.
2018-02-25 20:42:08 +10:00
25c9375703
server: Use previously obtained tokens if possible. 2018-02-25 20:30:51 +10:00
3907800cbe
main.js: Tweak border on users. 2018-02-13 08:20:28 +10:00
2bcc8191bf
server: Drop user links and detail if legit.
If we mark a user as legitimate, drop the links and user detail
associated with the user account.
2018-02-13 08:19:49 +10:00
bae080462f
main.js: Display a score gauge 2018-02-03 23:15:00 +10:00
4a325fe248
crawler, main.js: Re-work user scoring.
Just blindly summing all the words' scores does not yield a useful
measurement.  Try summing the 10 worst scored words used by the user.
2018-02-03 22:41:00 +10:00
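A minimal sketch of the "10 worst words" idea follows. The sign convention is an assumption (lower scores are taken to mean more spam-like), and the function name `user_score` and the dictionary input are hypothetical; the log does not show how the crawler actually stores word scores.

```python
import heapq


def user_score(word_scores, worst_n=10):
    """Sum the `worst_n` lowest word scores used by a user.

    Assumption: a lower (more negative) score means more spam-like,
    so the "worst" words are the smallest values.
    """
    return sum(heapq.nsmallest(worst_n, word_scores.values()))
```

Compared with summing every word, this keeps a long profile full of neutral words from drowning out a handful of strongly scored ones.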
056e469e03
server: Handle null last_update in user detail. 2018-02-03 21:49:49 +10:00
547cc3565b
main.js: Re-jig order of fields for readability. 2018-02-03 21:49:31 +10:00
00678f99c8
crawler: Announce new users 2018-02-03 15:54:40 +10:00
509da7d47a
crawler: Update new user refresh rate. 2018-02-03 15:38:37 +10:00
6211b4a77f
crawler: Debug new user crawl. 2018-02-03 15:02:26 +10:00
75153d2d2d
server: Switch sort order of users.
Since we're no longer getting reliable user creation times, use the user
ID instead.
2018-02-03 14:43:33 +10:00
567ffada54
crawler: Handle created=0
Seems HAD is no longer reporting the creation date of the user, so just
substitute the current time.
2018-02-03 14:22:59 +10:00
aca8190196
crawler: Commit more frequently.
Try to prevent roll-backs due to integrity errors.
2018-02-03 12:32:34 +10:00
1c1dd25cca
server: Use dedicated database instance for crawler. 2018-02-03 12:23:49 +10:00
fd66d80b64
main.js: Fix some bugs, display more user detail 2018-02-03 12:23:08 +10:00
e8531a6731
server: Report user word usage and tokens. 2018-02-03 10:46:02 +10:00
4a6196bd29
crawler: Classify based on word content. 2018-02-03 10:24:11 +10:00
684640fbf3
db: Add UserToken
Sometimes the regular expression picks up on tokens, and so we want to
be able to display that so the classifier isn't left scratching their
head as to why someone is suspect.
2018-02-03 10:02:58 +10:00
b329f550c7
db: Add UserWord and UserWordAdjacent object
This tracks the number of times a user uses a word or pair of words.
2018-02-03 09:59:23 +10:00
9ced2a639d
main.js: Handle errors 2018-02-03 09:55:40 +10:00
c335a24e63
server: Try to create objects in bulk lots. 2018-02-03 00:46:31 +10:00
c00a236ae2
server: Fix retrieval of words 2018-02-03 00:20:44 +10:00
6a93a997d5
server: Fix creation of WordAdjacent objects. 2018-02-03 00:14:51 +10:00
5962f3c965
server: Lowercase query not Query 2018-02-03 00:12:54 +10:00
8cb41de01a
wordstat: Handle empty sequence 2018-02-03 00:11:09 +10:00
11d2c70a33
server: Tally up word stats on classify. 2018-02-02 23:32:50 +10:00
55baabaf6d
db.db: Expose Word and WordAdjacent 2018-02-02 23:32:35 +10:00
3ac39b5a00
wordstat: Add in word statistics parsing. 2018-02-02 23:24:27 +10:00
1d9c2a49c2
db.model: Add tables for recording word frequency/adjacency. 2018-02-02 22:06:32 +10:00
64ed885f54
htmlstrip: Add HTML stripper.
This will be used for grabbing the plain text of users' profiles for
tokenisation with `polyglot`.
2018-02-02 22:00:33 +10:00
2de969742b
crawler: Flag users based on projects per minute.
A real human can't publish many projects in a minute.  These spambots
seem to have hundreds after an hour.
2018-02-02 12:07:21 +10:00
efe05083b0
crawler: Fix stashing of user project count 2018-02-02 11:59:40 +10:00
a29bb7cefb
main.js: Display number of projects if greater than zero 2018-02-02 11:57:44 +10:00
8d7985d3ae
server: Expose number of projects 2018-02-02 11:56:29 +10:00
8e16f0126b
crawler: Stash number of projects.
If there's a lot, that's a clue.
2018-02-02 11:54:08 +10:00
493aa20e9e
db.model: Add number of projects to user data. 2018-02-02 11:49:32 +10:00
291d35ea95
crawler: Record all user information.
We'll need to tokenise the word input of both spambots and non-spambot
users, and figure out word frequencies for both groups.  Thus, we need
the sample data.
2018-02-02 11:40:07 +10:00
67df0a1bb6
server: Check session expiry, update if needed. 2018-01-19 19:40:35 +10:00
6dd86fd4b5
server: Set expiry date on session. 2018-01-17 19:35:51 +10:00
ded09d57ba
db.model: Add expiry date to session 2018-01-17 19:34:29 +10:00
5dc1363f23
crawler: Tweak patterns 2018-01-11 23:22:28 +10:00
fefb571106
crawler: Use search not match.
`match` only matches at the start of the string, `search` looks for a
match anywhere in it, which is what we want.
2018-01-11 23:20:44 +10:00
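The difference can be seen with Python's `re` module, where `match` anchors only at the start of the string and `fullmatch` is the call that requires the whole string to match. The pattern and text below are made up for illustration and are not the crawler's real patterns.

```python
import re

# Hypothetical pattern and text, for illustration only.
PATTERN = re.compile(r"cheap\s+pills", re.IGNORECASE)
TEXT = "Visit us for cheap pills today"

at_start = PATTERN.match(TEXT)   # None: the text does not begin with the pattern
anywhere = PATTERN.search(TEXT)  # match object: the pattern occurs mid-string
whole = PATTERN.fullmatch(TEXT)  # None: the pattern does not span the whole text
```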
5a33a120d5
server: Return users in auto_suspect or auto_legit groups. 2018-01-11 20:59:48 +10:00
2f4997f02b
server: Order users by creation date, then user ID. 2018-01-11 20:53:43 +10:00
50a3dffc16
crawler: Inhibit background tasks when blocked. 2018-01-11 20:42:41 +10:00
5cd0afb196
hadapi: Re-work forbidden handling.
Rather than a dumb wait of an hour, we simply set a flag that's exposed
elsewhere.  Our background tasks can check that flag instead and inhibit
their operations themselves.

If a user action that requires API access succeeds, the flag is
cleared and the background tasks resume.

The background tasks auto-retry after an hour.
2018-01-11 20:40:04 +10:00
6acf55a530
server: Tweak logging 2018-01-11 20:26:50 +10:00
30a8d2e219
server: Handle non-successful OAuth response. 2018-01-10 08:32:53 +10:00
8361834292
server: Fix proxying of error message. 2018-01-10 08:28:57 +10:00
72feab73e3
hadapi: Wait before acquiring semaphore 2018-01-10 08:28:00 +10:00
034e9026c8
hadapi: Don't wait after forbidden for authentication calls.
The user is *not* going to wait an hour.
2018-01-10 08:24:50 +10:00
8593196462
server: Handle 403 in authentication. 2018-01-10 08:21:07 +10:00
b4c54a624f
server: Wait up to a minute for the crawler. 2018-01-09 19:39:35 +10:00
f01fa59c98
hadapi: If we receive 403 Forbidden, back off for an hour. 2018-01-09 07:16:21 +10:00
620e349b77
main.js: Display what_i_would_like_to_do 2018-01-08 22:43:12 +10:00
fe23ad0a99
server: Expose what_i_would_like_to_do field 2018-01-08 22:43:01 +10:00
1cfb02b3ee
crawler: Also consider what_i_would_like_to_do field. 2018-01-08 22:37:05 +10:00
d5c603cf50
db.model: Add what_i_would_like_to_do column 2018-01-08 22:35:55 +10:00
6eeaeb0206
server: Don't store token. 2018-01-08 22:32:15 +10:00
a2445367d2
db.model: Drop user token.
We don't really need it beyond identifying the user during log-in.
2018-01-08 22:31:51 +10:00
f2d8ffe2d9
db.model: Delete children when user is deleted. 2018-01-08 22:15:34 +10:00
6578453db6
crawler: Handle database cock-up in user refresh. 2018-01-08 22:11:00 +10:00
cd6f4efe51
crawler: Handle SQLAlchemy exceptions in background user fetch. 2018-01-08 21:31:33 +10:00
6c20da9dfa
main.js: Display user ID
Sometimes, spammers use the same "screen name" for multiple accounts.
2018-01-08 21:23:17 +10:00
d563728efb
main.js: Scroll to top when mass-marking.
Otherwise the page jumps unpredictably.
2018-01-08 20:54:24 +10:00
632c8a762e
main.js: Navigate by UID not page.
More reliable, as the listing is changing due to us classifying users
and new users arriving all the time.
2018-01-08 20:40:44 +10:00
2d291bbe6c
main.js: Make "auto mark" feature a mass mark.
Rather than automatically marking users as clean when you get to the
bottom, since that makes the page jump around, mark them when the user
clicks a button.
2018-01-08 20:25:37 +10:00
31b903f4f9
crawler: Fix background user update. 2018-01-08 18:53:21 +10:00
8c203f9891
crawler: Log message when classifying.
I suspect there's a bug overriding classifications of existing,
classified users.
2018-01-08 18:09:31 +10:00
0e7bfe57e5
server: Implement page semantics 2018-01-08 08:34:42 +10:00
cf16fcb367
crawler: Don't re-classify classified users. 2018-01-08 08:23:17 +10:00
9282b4c24b
server: Fix user filter for new users. 2018-01-08 08:13:27 +10:00
1efe32b0d8
server: Fix wait for new users 2018-01-08 08:09:35 +10:00
12eefa7557
crawler: Handle no new users case, reduce poll rate. 2018-01-08 08:07:31 +10:00
1a0c91e13a
server: Retrieve new users from database.
No need to go out and fetch them ourselves anymore.
2018-01-08 08:05:15 +10:00
2b2a01852d
crawler: Periodically refresh up to 50 users. 2018-01-08 07:47:34 +10:00
9d19a3d44b
crawler: Set an event when new users are added. 2018-01-08 07:32:51 +10:00
4ef3d9a9d9
crawler: Keep reading older users in the background. 2018-01-08 07:29:43 +10:00
8f5bdf9299
crawler: Drop old user retrieval.
Realised that this won't work the way I had planned, because my own UID
is far lower than the block I've pulled in: it'll see that and start
there leaving a huge gap.

Plus, it just returned `users: 0`, not useful.
2018-01-07 22:30:44 +10:00
9687e3f637
crawler: Reverse start and end for old user retrieval. 2018-01-07 22:28:16 +10:00
4bf473713e
hadapi: Handle socket.gaierror(EAGAIN) 2018-01-07 22:26:34 +10:00
53686bf5d7
crawler: Quietly retrieve more users in the background. 2018-01-07 22:23:48 +10:00
f212ea3b04
crawler: Validate user URIs
Seems the users still show up in the API but have since been "deleted".  So do
a check to see if the link's valid.
2018-01-07 21:27:40 +10:00
daee438af7
hadapi: Fix retrieve users by range 2018-01-07 20:51:08 +10:00
e8d67d872c
hadapi: Use range query to retrieve users. 2018-01-07 20:48:10 +10:00
6ec6116cd9
main.js: Delay auto-marking "legit" users.
Just in case the user spots something at the last minute.
2018-01-07 19:12:46 +10:00
032650601b
crawler: Also ignore Google+ 2018-01-07 18:47:57 +10:00
183dc3eef5
crawler: Fix inverted logic 2018-01-07 18:43:48 +10:00
9ae48a4f7b
main.js: Automatically mark 'auto_legit'
The number of false negatives here has been tiny, so this will make life
a little more convenient.
2018-01-07 18:32:59 +10:00
e484385a8e
crawler: Ignore 'github' or 'twitter' links.
By far the most common on this site.  Not anomalous.
2018-01-07 18:29:16 +10:00
83e7785e19
main.js: Add legit/suspect buttons.
Note, they only work if you're an admin, otherwise they do nothing.
2018-01-07 17:33:08 +10:00
8a9a1608f5
server: Implement classification endpoint. 2018-01-07 17:30:51 +10:00
6eb99f2653
hadapi, util: Re-locate body parsing
We'll need it in the server for requesting POST/PUT bodies too.
2018-01-07 16:55:32 +10:00
6b5c4e8b6a
crawler: Periodically refresh admin group members. 2018-01-07 16:44:37 +10:00
9ad7b7dd86
crawler: Refresh admin group at start-up. 2018-01-07 16:34:31 +10:00
dcfe89fba5
crawler: Store project ID 2018-01-07 16:17:30 +10:00
db39ddbbcb
server: Pass through project ID from command line. 2018-01-07 16:17:15 +10:00
f16437cf38
crawler: Skip users that have been classified by a human 2018-01-07 16:05:07 +10:00
2083e66894
main.js: Show groups and tags of users. 2018-01-07 15:43:35 +10:00
906aaa97be
server: Expose user groups and tags 2018-01-07 15:34:08 +10:00
dcc61c644d
crawler: Automatically file users into groups. 2018-01-07 15:25:39 +10:00
d3a9599822
crawler: Drop redundant page_last_refresh 2018-01-07 15:20:05 +10:00
42432241d3
model: Fix reference on Tags 2018-01-07 15:16:50 +10:00
3de3d25b8d
server: Drop unused classes 2018-01-07 14:58:12 +10:00
03ef9f8a8a
crawler: Drop unused classes 2018-01-07 14:57:33 +10:00
2deb05dccc
db.db: Drop unused classes 2018-01-07 14:57:06 +10:00
2dbf81f6eb
db.model: Model many-to-many properly. 2018-01-07 14:56:56 +10:00
a462f39358
server: Convert date/time to ISO format 2018-01-07 14:39:47 +10:00
c7a6529f17
main.js: Show user creation date 2018-01-07 14:38:35 +10:00
1369a6089f
server: Show user creation date. 2018-01-07 14:38:23 +10:00
0bfd1ec6cc
crawler: Update creation date on existing users. 2018-01-07 14:35:19 +10:00
2accdee2a3
crawler: Tweak newest user page refresh
- Use the database to persist when we last checked a page, so we don't
  flog HAD's site unnecessarily.
- Bump the starting offset to 1 and the timeout to a day.
2018-01-07 14:22:17 +10:00
4c979ecc43
db.model: Add page refresh metadata. 2018-01-07 14:10:04 +10:00
b3c5c5e304
crawler: Add user creation date 2018-01-07 14:06:25 +10:00
db3c47dd89
db.model: Add creation date to user 2018-01-07 14:03:25 +10:00
4861579deb
crawler: Skip pages loaded in the last hour.
Beyond page 10, if we've loaded that page in the last hour, assume
nothing has changed.
2018-01-07 12:28:09 +10:00
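The skip rule in this commit could be expressed along these lines. The function name `should_skip_page` and its parameters are hypothetical; only the "beyond page 10" and "last hour" thresholds come from the message, and how the crawler really records load times is not shown in this log.

```python
from datetime import datetime, timedelta


def should_skip_page(page, last_loaded, now,
                     min_page=10, max_age=timedelta(hours=1)):
    """Return True if a listing page can be assumed unchanged.

    `last_loaded` is when the page was last fetched, or None if never.
    Pages up to and including `min_page` are always re-fetched.
    """
    if page <= min_page or last_loaded is None:
        return False
    return (now - last_loaded) < max_age
```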
650178fcf3
main.js: Use page number from response. 2018-01-07 11:39:50 +10:00
786cc1339a
crawler: Fix retrieval when existing users seen. 2018-01-07 11:39:31 +10:00
6be5e9a297
server: Retrieve page from fetch_new_users. 2018-01-07 11:36:00 +10:00
b09494b726
crawler: Report current page, move to next if no users. 2018-01-07 11:35:29 +10:00
83aa52a117
crawler: Don't inspect existing users.
When browsing through the "new user" list, skip accounts that have been
inspected already.
2018-01-07 11:30:48 +10:00
15c9970ccc
main.js: Show profile information if present. 2018-01-07 10:15:27 +10:00
96f2e1c63f
server: Fix reference to crawler in avatar retrieval. 2018-01-07 10:08:20 +10:00
e05e70cb88
crawler: Fix Return in synchronous function. 2018-01-07 10:06:35 +10:00
e98ea98228
crawler: Fix lazy retrieval of avatar. 2018-01-07 09:59:15 +10:00
041e1f921c
crawler: Don't fetch actual avatar image until required. 2018-01-07 09:56:25 +10:00
c02a5509cb
server: Fetch avatar if needed 2018-01-07 09:55:50 +10:00