I'm collecting data from pol. I can do lots of cool stuff with this data, but what do you guys want to know? Here's some basics that I've gotten so far.
If you're doing it from archives you're as dumb as the think tanks who have done it for years
Benjamin Brown
This is my first go at it, so yes, the archives are most accessible. Scraping those ensures that I don't have to re-scrape the same ones for updates. I can always scale it up with minimal work.
Henry Jackson
You're a retard. You think you're special making this data set? You're not. Kill yourself
Oliver Adams
Why are you guys being so hard on him? It sounds like a fun project.
how did you get only 280k posts? /cvg/ threads alone should have more than that already
Liam Lewis
You can chunk out 30% as USA [meme flag]
Justin Watson
I've just been taking stuff from the archives. For now, I've been saving Thread#, UserID, TimePosted, Flag, URL of image, and post text.
It's super basic, but I just started Saturday night.
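For anyone wanting to store the same fields, here is a minimal sketch of one way to do it with SQLite. It is not OP's actual setup. The table layout, the `post_no` primary key, the Unix-timestamp format for `time_posted`, and the function names are all assumptions made for illustration.

```python
import sqlite3
from dataclasses import dataclass, astuple
from typing import Optional


@dataclass
class Post:
    post_no: int              # assumed: a unique post number, used as the key
    thread_no: int            # Thread#
    user_id: str              # UserID
    time_posted: int          # TimePosted, assumed to be a Unix timestamp
    flag: str                 # Flag (national or meme flag)
    image_url: Optional[str]  # URL of image, None for text-only posts
    text: str                 # post text


SCHEMA = """
CREATE TABLE IF NOT EXISTS posts (
    post_no     INTEGER PRIMARY KEY,
    thread_no   INTEGER NOT NULL,
    user_id     TEXT NOT NULL,
    time_posted INTEGER NOT NULL,
    flag        TEXT NOT NULL,
    image_url   TEXT,
    text        TEXT NOT NULL
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_posts(conn, posts):
    """Insert posts; rows whose post_no is already stored are skipped."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO posts VALUES (?, ?, ?, ?, ?, ?, ?)",
            [astuple(p) for p in posts],
        )


def posts_per_flag(conn):
    """Return (flag, post count) pairs, most common flag first."""
    return conn.execute(
        "SELECT flag, COUNT(*) FROM posts "
        "GROUP BY flag ORDER BY COUNT(*) DESC, flag"
    ).fetchall()
```

Because of `INSERT OR IGNORE`, scraping the same archived thread twice does not create duplicate rows, which fits the "don't have to re-scrape" goal mentioned earlier in the thread.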
Nicholas Jenkins
I'm proud of you, bucko.
Elijah Davis
that sounds fun. how you gonna dive into the post text? seems like the most interesting one here.
Liam Reyes
I have an idea. Find this year's top 100 non-get posts with the most replies and post them.
Xavier Wright
Interesting shit
Jose Carter
>Canada has the highest per capita use Wtf is wrong with us
James Gonzalez
Keep it simple and post meme flags vs national id in one pic
Brayden Perry
bbc posts aren't going to write themselves
Leo Wilson
Tell us how many times we say nigger, nigger.
Nathan Jones
Like I said in OP, it'd be nice to see if certain meme flags say certain phrases more often. We might be able to spot some trends to better identify the shills.
I was originally hoping to see trends for certain words over time, but I underestimated just how much data is on Yas Forums. I'll have some trouble scraping a decent sample size regularly, but it can be done.
I don't expect to learn anything practical with this. It's just something I'll do while it interests me.
Jordan Martinez
also non-op posts of course
Bentley White
How many times was nothingburger mentioned in the past 3 months on a weekly basis? I think this would be very groundbreaking data.
I may eventually get some population data so we can see all the numbers per capita.
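The "mentions per week" question above could be answered with something like the following rough sketch. It assumes each saved post is available as a `(unix_timestamp, text)` pair and buckets by ISO week in UTC. The function name and the input format are made up for the example.

```python
from collections import Counter
from datetime import datetime, timezone


def weekly_mentions(posts, term):
    """Count posts containing `term`, grouped by (ISO year, ISO week).

    posts: iterable of (unix_timestamp, text) pairs (assumed format).
    The match is a case-insensitive substring test.
    """
    needle = term.lower()
    counts = Counter()
    for ts, text in posts:
        if needle in text.lower():
            iso = datetime.fromtimestamp(ts, tz=timezone.utc).isocalendar()
            counts[(iso[0], iso[1])] += 1  # (year, week number)
    return dict(counts)
```

Dividing each weekly count by the total number of posts saved for that week would give a rate instead of a raw count, which matters if the scrape does not cover every week equally.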
Owen Reyes
Holy shit dude you have serious problems
Nicholas Rogers
Cool stuff OP. Don't let faggots get you down. I remember a long time ago moot saying that most of the worst shitposts were from Australia and he considered banning all of them.
Sebastian Gutierrez
I just started my data collection. I was surprised at how much people post on here. It's unlikely I can store all that data, but I'll keep adding to the database.
Of the 280k posts I've saved from the past 2 days, only 334 mention the exact phrase "nothing burger". I'll need to use better logic to find all the variations.
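One possible version of that "better logic" is a regular expression instead of an exact string match. The set of variations covered here (optional space or hyphen, any capitalization, optional plural) is a guess, not a complete list.

```python
import re

# Assumed variations: "nothing burger", "nothingburger", "nothing-burger",
# any capitalization, optional trailing "s".
NOTHINGBURGER = re.compile(r"\bnothing[\s\-]*burgers?\b", re.IGNORECASE)


def count_posts_mentioning(texts, pattern=NOTHINGBURGER):
    """Count how many post texts contain at least one match of the pattern."""
    return sum(1 for t in texts if pattern.search(t))
```

This counts posts, not occurrences, so a post repeating the phrase five times still counts once.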
Landon Cox
ew almost 8% angloids
Hudson Price
That fade where your picture cuts off at the bottom is aesthetic as fuck.
Jeremiah Johnson
What's the rarest meme flag, and what's the most used word?
Colton Taylor
No, he's just a retard.
John Perry
I haven't parsed out the "reply" portion of each post. That's going to take some work.
Of my tiny dataset, the OP with the most replies is this one.
Jayden White
Can you pull out the most commonly used phrases? Looking for the most popular copypastas. I don't know how you'd program this but check the correlation of each post with every other post, and give me the top 10 most popular 3+ word phrases on Yas Forums.
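Comparing every post with every other post gets expensive fast. A cheaper stand-in, sketched below, is to count word n-grams across posts. This swaps the pairwise correlation idea for plain phrase counting, and it only handles a fixed phrase length `n` (3 by default) and not "3+" in one pass. The tokenizer and the function name are assumptions.

```python
import re
from collections import Counter


def top_phrases(texts, n=3, k=10):
    """Return the k most common n-word phrases as (phrase, post_count).

    Each phrase is counted at most once per post, so a single post that
    repeats itself cannot dominate the ranking.
    """
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z0-9']+", text.lower())
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(grams)
    return counts.most_common(k)
```

Running it again with larger `n` (say 5 or 8) tends to surface longer copypastas, since a long pasted block produces many overlapping n-grams with identical counts.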
Jeremiah Murphy
THIS IS AN EXCELLENT IDEA
Easton Torres
what is the distribution of posts ending in consecutive integers by countries
I am confident that Australia has a massive share of them relative to their overall posting share
Now everyone post a picture of your face so I can include that in the data.
Lincoln Ross
To those of us archiving and cataloguing the recent uptick in shill posts and COINTEL raids, do you have anything in your data set that might prove useful in tracking it? I know it's a difficult metric to tie down but copypastas, flag rarity to content type (looking at you chang), and image name could help. Any info in regards to that would be cool, but I feel you probably aren't one of us so meh.
Does post text include if someone links to another post? This should be joined. You want to know whether a comment is a response or a standalone comment because it changes the context.
I’d also recommend recording whether a line is green text or not. Green text combined with a link is often a quote or a summary, so compare the green text against the text of the linked post. This may help later in filtering duplicate text occurrences, or in avoiding attributing text to a user when it’s really just a quote that they then shit on. See where I’m going? You need to normalize the data before diving into the analysis so that it can be viewed in the proper context.
After that, I’d like to know the origin of certain topics: where they originate and when. It should then be possible to determine whether a topic evolves naturally or is pushed in a concerted effort. Then we can ask why.
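The reply-link and green-text normalization suggested above might look roughly like this. It is a sketch that assumes the saved post text is plain text in which replies appear as `>>` followed by a post number and quoted lines start with a single `>`. If the archive hands back HTML, that would need converting first. The returned field names are made up.

```python
import re

REPLY_LINK = re.compile(r">>(\d+)")


def parse_post(text):
    """Split a plain-text post into reply links, green text and own text."""
    replies_to = [int(num) for num in REPLY_LINK.findall(text)]
    greentext, body = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(">") and not stripped.startswith(">>"):
            # A single leading ">" marks a quoted (green text) line.
            greentext.append(stripped[1:].strip())
        elif stripped:
            # Remove reply markers so only the poster's own words remain.
            rest = REPLY_LINK.sub("", stripped).strip()
            if rest:
                body.append(rest)
    return {
        "replies_to": replies_to,     # post numbers this post links to
        "greentext": greentext,       # quoted lines, without the ">"
        "body": body,                 # everything else
        "is_reply": bool(replies_to),
    }
```

With `replies_to` stored per post, a quoted line can be checked against the text of the linked post, and phrase counts can be run on `body` alone so quotes are not attributed to the person quoting them.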
Isaiah Powell
DELET THIS
Julian Price
There's missing information if you do it from the archive, so it's not reliable. Also, several agencies are already doing the exact same thing, and they come to stupid conclusions all the time.
Jonathan Young
I haven't done much work parsing out the actual posts yet. Here are the most common posts, by exact match, with a count of at least 20. Hopefully the link works and people can't fuck with it; I'm not too familiar with Google Docs.
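A list like that can be produced with a simple counter. This is a sketch, not necessarily how the linked sheet was made. Collapsing runs of whitespace before comparing is an added assumption, so it is slightly looser than a strict exact match.

```python
from collections import Counter


def common_exact_posts(texts, min_count=20):
    """Return (post_text, count) for texts seen at least min_count times.

    Whitespace is collapsed before comparing (an assumption); empty
    posts, such as image-only posts, are ignored.
    """
    normalized = (" ".join(t.split()) for t in texts)
    counts = Counter(t for t in normalized if t)
    return [(t, c) for t, c in counts.most_common() if c >= min_count]
```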
This, at the end of the day with the way these forums function its difficult if not downright impossible to tie anything meaningful to posts and use it in any meaningful ways. Our big boy human brains are able to see things like typing style, stereotypes, inside jokes, etc. to be able to identify posters, shills, or to understand what prior knowledge/group the person has/belongs to. Artificial Intelligence is a far cry from being able to do that with any meaning or consistency but its grand to watch special interest groups spend millions just to say "They shitpost and are mean >:(". One special interest (israeli) organization used metadata to track images as they propagate around the web to help them shill but then Anons found a way to track it themselves and fucked it up, which was grand.
Roughly half the threads posted today have been chink slide threads. Who cares? This place is a shithole.
James Fisher
Do i have to call people a nigger to be welcomed here lol? i've been using Yas Forums for a few months now, and it's opened my eyes to things I never would have discovered otherwise since people are too afraid to talk about real issues in public.
Looking for shills would be great. I'm not sure if the archives will be good enough for that, and I don't know how machine learning works. I'd have to manually find trends which should be doable.
Shill detection is not data driven. It's agenda driven. Although i guess keywords would work on some level. But the keyword schizos are the worst anons because language adoption is not a real indication of anything.
Tyler Ross
>but what do you guys want to know? Accuracy of mutts law. How quickly it rings true in every thread
Juan Edwards
Easy to VPN from chyna
Leo Bailey
Nah, not that you aren't a raving wignat. Just that most far right wingers aren't the "Making a large data set as a hobby" type. The point of the ideology is to appeal to the real world and getting away from dopamine drips such as internet addiction. Makes it difficult to teach proper OpSec or to combat the left on their own territory. It's getting better as the Zoomers flood in but it's still a tad uncommon. Sorry m8. If you could get some meaningful data on the topic it'd be helpful to us lamp-lighters trying to help newfags understand the difference between us and the shills/cointelpro.
>1 post by this id Soulless chink bugman shill with small penis and smooth brain who thinks like a cat. His parents are ashamed.
Tyler Davis
>Why are you guys being so hard on him? Why whiteknight for OP? You know it's just a journo "exposing hate" or shareblue, or ADL or JIDF or QCHQ or Reddit or discord trannies. it can't be good. FUCK OP
Samuel Sullivan
Wtf. Just get a notepad and for every thread write 1 for true and 0 for false. It's pretty fucking observable.
Kevin Thomas
I'd expect more Kekistani really. I guess less people come here for the lulz these days.