
Python Read Specific Data From Text File

I'm struggling to grasp this. I need to create a pandas DataFrame object with the following entries for each review: Product ID, Number of people who voted on this review

Solution 1:

If I understood your question correctly, you want to read from a file with the structure you posted. The following code builds a list in which every review is a dictionary:

# Open the file and read every line
with open('file.txt') as your_file:
    reviews = your_file.readlines()

reviews_array = []
dictionary = {}

# Go through every line; a blank line marks the end of a review
for review in reviews:
    # Split on the first ":" only, because the review text may contain colons
    this_line = review.split(":", 1)
    if len(this_line) > 1:
        # The part before ":" is the key, the part after it is the content
        dictionary[this_line[0]] = this_line[1].strip()
    elif dictionary:
        # Blank line found: save the review and reset for the next one
        reviews_array.append(dictionary)
        dictionary = {}

# Append the last review, which never reaches the branch above
if dictionary:
    reviews_array.append(dictionary)

print(reviews_array)

This code will print something like this:

[
{'review/text': 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.', 'review/profileName': 'delmartian', 'review/summary': 'Good Quality Dog Food', 'product/productId': 'B001E4KFG0', 'review/score': '5.0', 'review/time': '1303862400', 'review/helpfulness': '1/1', 'review/userId': 'A3SGXH7AUHU8GW'},
{'review/text': 'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".', 'review/profileName': 'dll pa', 'review/summary': 'Not as Advertised', 'product/productId': 'B00813GRG4', 'review/score': '1.0', 'review/time': '1346976000', 'review/helpfulness': '0/0', 'review/userId': 'A1D87F6ZCVE5NK'},
{'review/text': 'bla blas', 'review/profileName': 'Natalia Corres "Natalia Corres"', 'review/summary': '"Delight" says it all', 'product/productId': 'B000LQOCH0', 'review/score': '4.0', 'review/time': '1219017600', 'review/helpfulness': '1/1', 'review/userId': 'ABXLMWJIXXAIN'}
]

You can access every object like this:

for r in reviews_array:
    print(r['review/userId'])

And then you will have this result:

A3SGXH7AUHU8GW
A1D87F6ZCVE5NK
ABXLMWJIXXAIN
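
The question asks for the product ID and the number of people who voted on each review. Here is a minimal sketch of that step. It assumes that `review/helpfulness` has the form `helpful/total`, which the question does not confirm. The helper `to_row` and the column names are hypothetical, and the sample records are trimmed copies of the output above:

```python
def to_row(review):
    """Build one flat row from a parsed review dictionary (hypothetical helper)."""
    # Assumption: 'review/helpfulness' looks like "helpful/total", e.g. "1/1"
    helpful, total = review['review/helpfulness'].split('/')
    return {
        'Product ID': review['product/productId'],
        'Helpful Votes': int(helpful),
        'Total Votes': int(total),
    }

# Trimmed sample records taken from the output above
sample_reviews = [
    {'product/productId': 'B001E4KFG0', 'review/helpfulness': '1/1'},
    {'product/productId': 'B00813GRG4', 'review/helpfulness': '0/0'},
]

rows = [to_row(r) for r in sample_reviews]
print(rows)
```

If pandas is installed, passing this list of dictionaries to `pd.DataFrame(rows)` should give a DataFrame with one column per key.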

Solution 2:

Here is a start. I couldn't decipher a couple of your fields/columns, so it might need more logic and text massaging. As in the other answers, the idea is to parse the text into dictionary key:value pairs, here using a regular expression to find the pairs.

import collections
import re

import pandas as pd

# Map the keys found in the file to the desired column names
fields = {'productId':'Product ID', 'score':'Rating',
          'helpfulness':'Number Voting', 'text':'Review'}

# Capture the key after the "/" and the value after the ":"
pattern = r'/([^:]*):\s?(.*)'
kv = re.compile(pattern)

data = collections.defaultdict(list)
with open('file.txt') as f:
    reviews = f.read()

for match in kv.finditer(reviews):
    key, value = match.groups()
    if key in fields:
        data[fields[key]].append(value)

df = pd.DataFrame.from_dict(data)
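
To see what the pattern captures, here is a small self-contained sketch that runs the same loop on an in-memory sample instead of `file.txt`. The sample text is an assumption about the file layout, reconstructed from the output shown in Solution 1, and it leaves out the pandas step:

```python
import collections
import re

# Assumed file layout: one "prefix/key: value" pair per line,
# with a blank line between reviews
sample = (
    "product/productId: B001E4KFG0\n"
    "review/helpfulness: 1/1\n"
    "review/score: 5.0\n"
    "\n"
    "product/productId: B00813GRG4\n"
    "review/helpfulness: 0/0\n"
    "review/score: 1.0\n"
)

fields = {'productId': 'Product ID', 'score': 'Rating',
          'helpfulness': 'Number Voting'}

# Group 1 is the key after the "/", group 2 is the rest of the line
kv = re.compile(r'/([^:]*):\s?(.*)')

data = collections.defaultdict(list)
for match in kv.finditer(sample):
    key, value = match.groups()
    if key in fields:
        data[fields[key]].append(value)

print(dict(data))
```

Every list ends up the same length here because each sample review contains every field. If a review in the real file is missing a field, the lists will differ in length and building the DataFrame will fail, so that case would need extra handling.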
