@rjurney
Created October 1, 2023 03:39
Code that clusters the dirty journal name property of an arXiv citation graph to create clean journal names as labels for classification
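#
# Create a pd.DataFrame of the nodes for analysis in a notebook
#
import pandas as pd
from sentence_transformers import SentenceTransformer

# G is assumed to be a networkx graph built earlier, with arXiv metadata as node attributes.
# Extract nodes and their attributes into a list of dictionaries
node_data = [{"node": node, **attr} for node, attr in G.nodes(data=True)]

# Convert the list of dictionaries into a DataFrame
node_df = pd.DataFrame(node_data)

# Embed the dirty Journal-ref so it can be clustered to produce labels.
model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")
for column in [
    "Journal-ref",
]:  # "Title", "Abstract"]:
    # Nodes missing the attribute come through as NaN; encode() expects strings.
    texts = node_df[column].fillna("").astype(str).tolist()
    embeddings = model.encode(texts)
    # Store each embedding as a plain list, one per row
    node_df[f"{column}Embedding"] = embeddings.tolist()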
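The snippet above stops after the embedding step; the clustering mentioned in the description is not shown. As a rough illustration of how embeddings could be turned into clean labels, here is a minimal sketch of a greedy cosine-similarity grouping. This is not necessarily the method the author used. The helper names (`cosine`, `greedy_cluster`, `clean_names`), the `0.8` threshold, and the toy vectors are all assumptions made for this example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_cluster(embeddings, threshold=0.8):
    """Assign each vector to the first cluster whose seed is similar enough.

    The threshold is a hypothetical value and would need tuning on real data.
    """
    seeds = []   # one representative vector per cluster
    labels = []  # cluster id for each input row
    for vec in embeddings:
        for cluster_id, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                labels.append(cluster_id)
                break
        else:
            # No existing cluster matched, so this vector seeds a new one.
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


def clean_names(raw_names, labels):
    """Map each cluster id to the most frequent raw string in that cluster."""
    by_cluster = {}
    for name, label in zip(raw_names, labels):
        by_cluster.setdefault(label, Counter())[name] += 1
    return {label: counts.most_common(1)[0][0] for label, counts in by_cluster.items()}


# Toy 2-D vectors standing in for the real sentence embeddings.
toy_names = ["Phys. Rev. D", "Phys.Rev.D", "Phys. Rev. D", "Nucl. Phys. B"]
toy_vecs = [[1.0, 0.0], [0.99, 0.05], [1.0, 0.0], [0.0, 1.0]]
toy_labels = greedy_cluster(toy_vecs)
toy_clean = clean_names(toy_names, toy_labels)
```

With the DataFrame from the gist, the same helpers could be called as `greedy_cluster(node_df["Journal-refEmbedding"].tolist())`. A library clusterer such as scikit-learn's agglomerative clustering would likely scale and behave better on a full citation graph; the greedy version is used here only because it needs no extra dependencies.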