Commit Graph

185 Commits

Author SHA1 Message Date
3ac39b5a00
wordstat: Add in word statistics parsing. 2018-02-02 23:24:27 +10:00
1d9c2a49c2
db.model: Add tables for recording word frequency/adjacency. 2018-02-02 22:06:32 +10:00
64ed885f54
htmlstrip: Add HTML stripper.
This will be used for grabbing the plain text of users' profiles for
tokenisation with `polyglot`.
2018-02-02 22:00:33 +10:00
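
A minimal sketch of what such a stripper might look like, assuming it is built on the standard library's `html.parser`. The names `TextExtractor` and `strip_html` are hypothetical and not taken from the repository.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and discard all tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain text.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html(markup):
    """Return the plain text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

The resulting plain text could then be handed to a tokeniser such as `polyglot`.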
aec6fee429
LICENSE: Switch to GPLv3.
We need a natural language tokeniser, and I found a pretty comprehensive
one in `polyglot`.  It is GPLv3 licensed, and so in respect of the
`polyglot` authors' wishes, we will switch this project's license to
GPLv3 for compatibility.
2018-02-02 20:37:38 +10:00
2de969742b
crawler: Flag users based on projects per minute.
A real human can't publish many projects in a minute.  These spambots
seem to have hundreds after an hour.
2018-02-02 12:07:21 +10:00
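
The heuristic above could be sketched roughly as follows. The function names, the one-minute floor on account age, and the threshold of one project per minute are all assumptions made for illustration, not the crawler's actual values.

```python
from datetime import datetime, timezone


def projects_per_minute(project_count, created, now):
    """Average publishing rate since the account was created."""
    age_minutes = (now - created).total_seconds() / 60.0
    # Assumed floor of one minute, so brand-new accounts do not divide by ~0.
    return project_count / max(age_minutes, 1.0)


def is_suspect(project_count, created, now, threshold=1.0):
    """Flag the user when the rate exceeds the (assumed) threshold."""
    return projects_per_minute(project_count, created, now) > threshold


created = datetime(2018, 2, 2, 11, 0, tzinfo=timezone.utc)
now = datetime(2018, 2, 2, 12, 0, tzinfo=timezone.utc)
spambot_like = is_suspect(200, created, now)  # hundreds of projects in an hour
human_like = is_suspect(3, created, now)      # a few projects in an hour
```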
efe05083b0
crawler: Fix stashing of user project count 2018-02-02 11:59:40 +10:00
a29bb7cefb
main.js: Display number of projects if greater than zero 2018-02-02 11:57:44 +10:00
8d7985d3ae
server: Expose number of projects 2018-02-02 11:56:29 +10:00
8e16f0126b
crawler: Stash number of projects.
If there's a lot, that's a clue.
2018-02-02 11:54:08 +10:00
493aa20e9e
db.model: Add number of projects to user data. 2018-02-02 11:49:32 +10:00
291d35ea95
crawler: Record all user information.
We'll need to tokenise the word input of both spambots and non-spambot
users, and figure out word frequencies for both groups.  Thus, we need
the sample data.
2018-02-02 11:40:07 +10:00
67df0a1bb6
server: Check session expiry, update if needed. 2018-01-19 19:40:35 +10:00
6dd86fd4b5
server: Set expiry date on session. 2018-01-17 19:35:51 +10:00
ded09d57ba
db.model: Add expiry date to session 2018-01-17 19:34:29 +10:00
5dc1363f23
crawler: Tweak patterns 2018-01-11 23:22:28 +10:00
fefb571106
crawler: Use search not match.
`match` only matches at the start of the string, `search` looks for the
pattern anywhere in it, which is what we want.
2018-01-11 23:20:44 +10:00
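
To illustrate the difference, assuming the crawler uses Python's `re` module: `match` anchors at the start of the string, `search` scans the whole string, and `fullmatch` is the call that requires the entire string to match. The pattern and profile text below are made up for the example.

```python
import re

# Hypothetical pattern and profile text, purely for illustration.
pattern = re.compile(r"cheap\s+watches", re.IGNORECASE)
profile = "Visit my site for Cheap Watches today"

at_start = pattern.match(profile)   # anchored at position 0
anywhere = pattern.search(profile)  # scans the whole string
whole = pattern.fullmatch(profile)  # must cover the entire string
```

Here only `search` finds the phrase, since it appears in the middle of the text.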
5a33a120d5
server: Return users in auto_suspect or auto_legit groups. 2018-01-11 20:59:48 +10:00
2f4997f02b
server: Order users by creation date, then user ID. 2018-01-11 20:53:43 +10:00
50a3dffc16
crawler: Inhibit background tasks when blocked. 2018-01-11 20:42:41 +10:00
5cd0afb196
hadapi: Re-work forbidden handling.
Rather than a dumb wait of an hour, we simply set a flag that's exposed
elsewhere.  Our background tasks can check that flag instead and inhibit
their operations themselves.

If a user action that requires API access succeeds, the flag is
cleared and the background tasks resume.

The background tasks auto-retry after an hour.
2018-01-11 20:40:04 +10:00
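
One way the flag described above might be modelled is sketched below. The class name `ForbiddenGate`, its method names, and the injectable clock are assumptions for illustration and do not reflect the actual `hadapi` code.

```python
import time


class ForbiddenGate:
    """Flag set on 403 Forbidden, cleared on success or after a retry delay."""

    def __init__(self, retry_after=3600.0, clock=time.monotonic):
        self._retry_after = retry_after  # assumed: one hour, per the message
        self._clock = clock              # injectable so the delay is testable
        self._blocked_at = None

    def mark_forbidden(self):
        # Called when the API answers 403 Forbidden.
        self._blocked_at = self._clock()

    def mark_success(self):
        # Called when a user-initiated API request succeeds.
        self._blocked_at = None

    @property
    def is_blocked(self):
        # Background tasks check this and skip their work while it is True.
        if self._blocked_at is None:
            return False
        # Once the retry delay has elapsed, let background tasks try again.
        return (self._clock() - self._blocked_at) < self._retry_after
```

Background tasks would poll `is_blocked` before each run instead of sleeping for a fixed hour, so a successful user request can unblock them early.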
6acf55a530
server: Tweak logging 2018-01-11 20:26:50 +10:00
30a8d2e219
server: Handle non-successful OAuth response. 2018-01-10 08:32:53 +10:00
8361834292
server: Fix proxying of error message. 2018-01-10 08:28:57 +10:00
72feab73e3
hadapi: Wait before acquiring semaphore 2018-01-10 08:28:00 +10:00
034e9026c8
hadapi: Don't wait after forbidden for authentication calls.
The user is *not* going to wait an hour.
2018-01-10 08:24:50 +10:00
8593196462
server: Handle 403 in authentication. 2018-01-10 08:21:07 +10:00
b4c54a624f
server: Wait up to a minute for the crawler. 2018-01-09 19:39:35 +10:00
f01fa59c98
hadapi: If we receive 403 Forbidden, back off for an hour. 2018-01-09 07:16:21 +10:00
620e349b77
main.js: Display what_i_would_like_to_do 2018-01-08 22:43:12 +10:00
fe23ad0a99
server: Expose what_i_would_like_to_do field 2018-01-08 22:43:01 +10:00
1cfb02b3ee
crawler: Also consider what_i_would_like_to_do field. 2018-01-08 22:37:05 +10:00
d5c603cf50
db.model: Add what_i_would_like_to_do column 2018-01-08 22:35:55 +10:00
6eeaeb0206
server: Don't store token. 2018-01-08 22:32:15 +10:00
a2445367d2
db.model: Drop user token.
We don't really need it beyond identifying the user during log-in.
2018-01-08 22:31:51 +10:00
f2d8ffe2d9
db.model: Delete children when user is deleted. 2018-01-08 22:15:34 +10:00
6578453db6
crawler: Handle database cock-up in user refresh. 2018-01-08 22:11:00 +10:00
cd6f4efe51
crawler: Handle SQLAlchemy exceptions in background user fetch. 2018-01-08 21:31:33 +10:00
6c20da9dfa
main.js: Display user ID
Sometimes, spammers use the same "screen name" for multiple accounts.
2018-01-08 21:23:17 +10:00
d563728efb
main.js: Scroll to top when mass-marking.
Otherwise the page jumps unpredictably.
2018-01-08 20:54:24 +10:00
632c8a762e
main.js: Navigate by UID not page.
More reliable, as the listing is changing due to us classifying users
and new users arriving all the time.
2018-01-08 20:40:44 +10:00
2d291bbe6c
main.js: Make "auto mark" feature a mass mark.
Rather than automatically marking users as clean when you get to the
bottom, since that makes the page jump around, mark them when the user
clicks a button.
2018-01-08 20:25:37 +10:00
31b903f4f9
crawler: Fix background user update. 2018-01-08 18:53:21 +10:00
8c203f9891
crawler: Log message when classifying.
I suspect there's a bug overriding classifications of existing,
classified users.
2018-01-08 18:09:31 +10:00
0e7bfe57e5
server: Implement page semantics 2018-01-08 08:34:42 +10:00
cf16fcb367
crawler: Don't re-classify classified users. 2018-01-08 08:23:17 +10:00
9282b4c24b
server: Fix user filter for new users. 2018-01-08 08:13:27 +10:00
1efe32b0d8
server: Fix wait for new users 2018-01-08 08:09:35 +10:00
12eefa7557
crawler: Handle no new users case, reduce poll rate. 2018-01-08 08:07:31 +10:00
1a0c91e13a
server: Retrieve new users from database.
No need to go out and fetch them ourselves anymore.
2018-01-08 08:05:15 +10:00
2b2a01852d
crawler: Periodically refresh up to 50 users. 2018-01-08 07:47:34 +10:00