@mohdsanadzakirizvi
Created January 26, 2020 08:36
import re
import nltk

# download the stopwords list from nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))


def clean_text(text):
    # convert to lowercase
    newString = text.lower()
    # remove links
    newString = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', newString)
    # keep only alphabetic characters (everything else becomes a space)
    newString = re.sub("[^a-zA-Z]", " ", newString)
    # remove stop words
    tokens = [w for w in newString.split() if w not in stop_words]
    # remove short words (keep words of 4 or more characters)
    long_words = []
    for i in tokens:
        if len(i) >= 4:
            long_words.append(i)
    return (" ".join(long_words)).strip()
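
A quick way to see what each cleaning step does is to run the same pipeline on one sample sentence. The sketch below is not part of the original gist: `clean_text_demo`, `DEMO_STOP_WORDS`, and `min_len` are hypothetical names, and the small hard-coded stop word set is only a stand-in for nltk's English list so the example runs without a download.

```python
import re

# Hypothetical stand-in for nltk's English stop word list (assumption, offline use only).
DEMO_STOP_WORDS = {"the", "this", "that", "with", "from", "have", "is", "at", "and"}


def clean_text_demo(text, stop_words=DEMO_STOP_WORDS, min_len=4):
    # Same steps as clean_text: lowercase, strip links, keep letters,
    # drop stop words, drop words shorter than min_len characters.
    lowered = text.lower()
    no_links = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', lowered)
    letters_only = re.sub("[^a-zA-Z]", " ", no_links)
    tokens = [w for w in letters_only.split() if w not in stop_words]
    return " ".join(w for w in tokens if len(w) >= min_len)


sample = "Check this out: https://example.com/page?id=42 The Model scored 95% with BERT!"
result = clean_text_demo(sample)
print(result)  # → check model scored bert
```

Note that "out" is dropped by the length filter rather than the stop word filter, and that digits and punctuation disappear because every non-alphabetic character is replaced with a space before tokenizing.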