* Initialize
- Run this part first; both of the following sections depend on it.
#+begin_src python
import json # to parse data
import requests # to get data
from datetime import date # to get the current date
import os # to build the output path
# Look up the numeric account id (named account_id to avoid shadowing Python's built-in id)
instance = "https://social.edu.nl"
username = "mishavelthuis"
account_id = json.loads(requests.get(f"{instance}/api/v1/accounts/lookup?acct={username}").text)['id']

# Build a dated filename for the CSV output
current_date = date.today()
download_dir = os.path.expanduser("~/Downloads")
file_name_save = f'{download_dir}/mydata_{current_date}_{username}.csv'
#+end_src
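- If the account lookup fails (a typo in the username, or an unreachable instance), the ['id'] access above raises an error. A minimal guarded variant of the same lookup, reusing the instance and username variables from above:
#+begin_src python
import requests

# Check the HTTP status before indexing into the response
r = requests.get(f"{instance}/api/v1/accounts/lookup?acct={username}", timeout=30)
if r.status_code == 200:
    account_id = r.json()['id']
    print(f"Found account id {account_id} for {username}")
else:
    raise SystemExit(f"Lookup failed with HTTP {r.status_code}: {r.text}")
#+end_src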
* Get/refresh data
- I used the setup from [[https://jrashford.com/2023/02/13/how-to-scrape-mastodon-timelines-using-python-and-pandas/][this blog post]].
- The results are saved to a CSV file, so you don't have to re-download all messages for every text search; you only need to refresh the data now and then (see the incremental-refresh sketch after the code block).
#+begin_src python
import json # to parse data
import requests # to get data
import pandas as pd # to work with data
import os # to remove the old file

# Start with a fresh file
if os.path.exists(file_name_save):
    os.remove(file_name_save)

url = f'{instance}/api/v1/accounts/{account_id}/statuses'
params = {
    'limit': 40 # maximum page size for this endpoint
}
num_done = 0

while True:
    print(f'{num_done} statuses downloaded')
    try:
        r = requests.get(url, params=params)
        toots = json.loads(r.text)
    except (requests.RequestException, ValueError) as error:
        print("Request didn't work:", error)
        break

    if len(toots) == 0:
        break

    # Page backwards: request statuses older than the last one received
    try:
        params['max_id'] = toots[-1]['id']
    except (IndexError, KeyError, TypeError) as error:
        print("An error occurred with max_id:", error)
        break

    try:
        df = pd.DataFrame(toots)
        # Append, writing the header only when the file doesn't exist yet
        df.to_csv(file_name_save, mode='a', index=False,
                  header=not os.path.exists(file_name_save))
        num_done += len(toots)
    except Exception as error:
        print("An error occurred with df:", error)
#+end_src
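- To refresh without re-downloading everything, the same statuses endpoint also accepts a since_id parameter that returns only posts newer than a given id. A minimal sketch, assuming the CSV from the block above already exists and its id column was parsed as integers; for more than one page of new posts you would loop as above:
#+begin_src python
import requests
import pandas as pd

# The newest status id we already have on disk
existing = pd.read_csv(file_name_save)
newest_id = existing['id'].max()

# Fetch a single page of up to 40 statuses posted after that id
r = requests.get(f'{instance}/api/v1/accounts/{account_id}/statuses',
                 params={'limit': 40, 'since_id': newest_id})
new_toots = r.json()
if new_toots:
    # Append without a header; assumes the column order matches the existing file
    pd.DataFrame(new_toots).to_csv(file_name_save, mode='a', index=False, header=False)
print(f'{len(new_toots)} new statuses appended')
#+end_src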
* Use/search data
- You can search posts using the CSV file saved in the previous section.
#+begin_src python
import pandas as pd # work with data
from bs4 import BeautifulSoup # to more easily read the html output

df = pd.read_csv(file_name_save)
query = "test"

# Search for the query in the raw HTML of each post
for index, row in df.iterrows():
    if isinstance(row['content'], str) and query in row['content']:
        soup = BeautifulSoup(row['content'], 'html.parser')
        readable_text = soup.get_text(separator=' ', strip=True)
        print(row['url'])
        print(row['created_at'])
        print(readable_text)
        print("----")
#+end_src
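- The loop above matches the query against the raw HTML and is case-sensitive, so a word split by markup or differing in case is missed. A sketch of a case-insensitive variant using pandas' standard str.contains filter, with the same df and query as above:
#+begin_src python
import pandas as pd # work with data
from bs4 import BeautifulSoup # to strip the html

# Case-insensitive plain-substring match; NaN contents count as no match
hits = df[df['content'].str.contains(query, case=False, na=False, regex=False)]
for index, row in hits.iterrows():
    readable_text = BeautifulSoup(row['content'], 'html.parser').get_text(separator=' ', strip=True)
    print(row['url'])
    print(row['created_at'])
    print(readable_text)
    print("----")
#+end_src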